Semantic Caching Demo on SparkSQL and HDFS

February 24th, 2020

Overview: A cache management system for SparkSQL (Apache Spark) and HDFS

Requirements: SparkSQL, Apache Spark, HDFS, data caching algorithms, and English skills.

Motivation: We currently use HDFS to store and manage our data, and we run data analytics on it with SparkSQL in Apache Spark.

Normally, a Spark application loads the distributed data from HDFS into memory, processes it, and writes the results back to HDFS. Sometimes the results of earlier queries can be reused, once or several times, by later queries. We therefore want to keep these results in memory long enough to be reused, using a caching mechanism called semantic caching.
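
To illustrate the idea of reusing earlier results, here is a minimal sketch in plain Java. It does not use any Spark or HDFS API. It assumes a query is just a closed integer range [lo, hi] over a single column, and that a new query can be answered from the cache whenever its range lies inside a previously cached range. The names SemanticRangeCache, Source, and scan are hypothetical and chosen only for this example; a real implementation would have to compare SparkSQL query plans or predicates instead of simple ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a semantic cache for range queries over one integer column.
final class SemanticRangeCache {

    // Stand-in for the expensive data source (for example, a scan over HDFS).
    interface Source {
        List<Integer> scan(int lo, int hi);
    }

    // One cached region: the closed range [lo, hi] and the rows that matched it.
    private static final class Region {
        final int lo;
        final int hi;
        final List<Integer> rows;

        Region(int lo, int hi, List<Integer> rows) {
            this.lo = lo;
            this.hi = hi;
            this.rows = rows;
        }
    }

    private final List<Region> regions = new ArrayList<>();
    private int hits = 0;
    private int misses = 0;

    // Answers the query [lo, hi], reusing a cached region when it contains the query.
    List<Integer> query(int lo, int hi, Source source) {
        for (Region r : regions) {
            if (r.lo <= lo && hi <= r.hi) {
                // Cache hit: filter the cached rows down to the requested range.
                hits++;
                List<Integer> out = new ArrayList<>();
                for (int v : r.rows) {
                    if (v >= lo && v <= hi) {
                        out.add(v);
                    }
                }
                return out;
            }
        }
        // Cache miss: go to the source and remember the result as a new region.
        misses++;
        List<Integer> rows = new ArrayList<>(source.scan(lo, hi));
        regions.add(new Region(lo, hi, rows));
        return new ArrayList<>(rows);
    }

    int hits() {
        return hits;
    }

    int misses() {
        return misses;
    }
}
```

In this sketch, a first query for the range [1, 15] goes to the source and is cached; a later query for [5, 10] is then answered from memory by filtering the cached rows. The sketch never evicts regions, so deciding how long results stay in memory is left open here.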

What we want: an implementation of semantic caching in Apache Spark. The program should be written in Scala or Java; Scala is preferred.