Apache Spark

Using

Kubernetes or K8s

IBM cloud

what ?

open-source

highly scalable

automated deployments

provides flexible

is a popular framework for running containerized application on a cluster

how ?

to manage containers that run distributed systems in a more resilient and flexible way

network service discovry

cluster load balancing

automated scaleup and down

orchestrating storage

portable, so can be run in the same way whether in the cloud or on-premises

sparkSQL

can use dataframe function or an SQL query +table view for data aggregation

sparkSQL supports parquet files,json datasets and hive tables

spark modules for structured data processing can run SQL queries on spark dataFrames and are usable in java, scala ,python

supports both temporary views and global temporary view

dataFrames compare to DataSets

dataFrames

Not typesafe ,use APIs in java, scala,python and R

built on top of RDDs and added in earlier spark version

DataSets

Strongly-typed

use unified java and scala APIs

built on top of dataframes and the latest data abstraction added to spark