Apache Spark
Using
Kubernetes or K8s
IBM cloud
what ?
open-source
highly scalable
automated deployments
provides flexible
is a popular framework for running containerized application on a cluster
how ?
to manage containers that run distributed systems in a more resilient and flexible way
network service discovry
cluster load balancing
automated scaleup and down
orchestrating storage
portable, so can be run in the same way whether in the cloud or on-premises
sparkSQL
can use dataframe function or an SQL query +table view for data aggregation
sparkSQL supports parquet files,json datasets and hive tables
spark modules for structured data processing can run SQL queries on spark dataFrames and are usable in java, scala ,python
supports both temporary views and global temporary view
dataFrames compare to DataSets
dataFrames
Not typesafe ,use APIs in java, scala,python and R
built on top of RDDs and added in earlier spark version
DataSets
Strongly-typed
use unified java and scala APIs
built on top of dataframes and the latest data abstraction added to spark