Apache Spark, Spark SQL, DataFrame - Coggle Diagram
Apache Spark
What
is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers.
Components
-
-
MLlib Machine learning
provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and similar tasks
-
-
-
-
-
When
-
Processing streaming, real-time data from sensors, IoT, or financial systems, especially in combination with static data
-
-
Why
-
-
Support
programming languages: Java, Python, R, and Scala
-
-
-
Comparison with Hadoop
Spark
a lightning-fast cluster computing technology that extends the MapReduce model to efficiently support more types of computation
-
-
-
processes real-time data from live event streams such as Twitter and Facebook
-
-
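The MapReduce model that Spark extends can be sketched in plain Python. This is only an illustrative, single-machine sketch, not how Hadoop or Spark implement it; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce steps run in parallel on many machines, and the shuffle moves data between them over the network.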
Spark SQL
Why
-
-
Hive cannot drop encrypted databases in cascade when the trash is enabled; the attempt leads to an execution error
-
MapReduce lags in performance when analyzing medium-sized datasets (10 to 200 GB)
What
-
is not a database but a Spark module for structured data processing, integrated with Spark’s functional programming API
-
DataFrame
What
-
equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood
-
-