RDDs in parallel computing and Spark, DataFrames & Datasets in Spark,…
RDDs in parallel computing and Spark
create a new RDD from an existing one
Transformation
filter(func)
distinct([numTasks])
flatMap(func)
map(func)
to evaluate a transformation, we use actions (see the sketch after this list)
Action
take(n)
collect()
takeOrdered(n,key=func)
reduce(func)
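A minimal sketch of the above, assuming a spark-shell session where sc is the predefined SparkContext; the numbers are made-up sample data:

    // Parallelize a local collection into an RDD (illustrative data)
    val numbers = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

    // Transformations are lazy: each one creates a new RDD from an existing one
    val doubled  = numbers.map(_ * 2)                 // map(func)
    val evens    = doubled.filter(_ % 2 == 0)         // filter(func)
    val unique   = evens.distinct()                   // distinct([numTasks])
    val expanded = unique.flatMap(n => Seq(n, n + 1)) // flatMap(func)

    // Actions trigger evaluation of the transformation chain
    expanded.take(3)        // first 3 elements
    expanded.collect()      // all elements, returned to the driver
    expanded.takeOrdered(3) // smallest 3 elements
    expanded.reduce(_ + _)  // sum of all elements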
How do transformations and actions happen ?
DAG (directed acyclic graph)
A graph data structure with
edges
and
vertices
vertices are the RDDs
edges are the operations, i.e. transformations or actions
if a node goes down, Spark replays the recorded DAG (the lineage) to rebuild that node's lost data
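One way to inspect that lineage DAG, using the expanded RDD from the sketch above (any RDD works):

    // toDebugString prints the RDD's lineage: the chain of parent RDDs
    // that Spark can replay to recompute lost partitions
    println(expanded.toDebugString)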
DataFrames & Datasets in Spark
Benefits
immutable
can convert JVM objects to a tabular representation
work with the Scala and Java APIs
how to create ?
from a sequence of a primitive datatype
from a text file
from a json file
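A minimal sketch of each creation path, assuming a spark-shell session where spark is the predefined SparkSession; the file paths are placeholders:

    import spark.implicits._

    // from a sequence of a primitive datatype
    val dsFromSeq = Seq(1, 2, 3, 4).toDS()
    val dfFromSeq = Seq(("a", 1), ("b", 2)).toDF("letter", "count")

    // from a text file (a single string column named "value")
    val dfFromText = spark.read.text("/path/to/file.txt")

    // from a JSON file (the schema is inferred from the records)
    val dfFromJson = spark.read.json("/path/to/file.json")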
datasets vs dataframes
Datasets
strongly typed
built on top of DataFrames; the latest data abstraction added to Spark
use a unified Java and Scala API
DataFrames
API available in Java, Scala, Python, and R
not type-safe
built on top of RDDs and added in an earlier Spark version
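A small sketch of the practical difference; the Person case class and the sample rows are made up for illustration:

    import spark.implicits._

    case class Person(name: String, age: Int)

    // DataFrame: untyped rows; column names are only checked at runtime
    val df = Seq(("Ana", 30), ("Bo", 25)).toDF("name", "age")
    df.filter($"age" > 26)  // a typo in "age" fails only when the job runs

    // Dataset: strongly typed JVM objects; fields are checked at compile time
    val ds = Seq(Person("Ana", 30), Person("Bo", 25)).toDS()
    ds.filter(_.age > 26)   // a typo in .age fails at compile time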
benefits of Datasets in Spark
provide
compile-time type safety
can detect syntax and semantic errors before deployment
compute faster than RDDs
enable improved memory usage and caching
use Dataset API functions for aggregate operations, including sum, average, join, and groupBy
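A sketch of such aggregate operations on the ds Dataset from the previous example; the cities DataFrame is illustrative:

    import spark.implicits._
    import org.apache.spark.sql.functions.{sum, avg}

    // groupBy plus built-in aggregate functions
    ds.groupBy($"name").agg(sum($"age"), avg($"age")).show()

    // join with another dataset on a shared column
    val cities = Seq(("Ana", "Lisbon"), ("Bo", "Oslo")).toDF("name", "city")
    ds.join(cities, "name").show()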
what is a Dataset ?
the new Spark data abstraction
collection of strongly typed JVM objects
provides the benefits of both
RDD
and
SparkSQL
The basic DataFrame operations
Read the Data
Create a DataFrame
create a DataFrame from an existing DataFrame
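A sketch of the read step; the input path and column names are placeholders:

    // Read the data into a DataFrame
    val raw = spark.read.json("/path/to/input.json")

    // Create a new DataFrame from an existing one, here by selecting columns
    val selected = raw.select("id", "name", "amount")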
Load the Data into a Database
the final step of an ETL pipeline
export to another database
export to disk as JSON files
save the data to a PostgreSQL database
use an API to export the data
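A sketch of the load step, writing the selected DataFrame from the read sketch above; the output path, JDBC URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath:

    // export to disk as JSON files
    selected.write.mode("overwrite").json("/path/to/output_json")

    // save the data to a PostgreSQL database over JDBC
    selected.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "dbuser")
      .option("password", "dbpassword")
      .mode("append")
      .save()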
Transform the Data
keep only the relevant Data
apply filters, joins, column operations on sources and tables, grouping and aggregations, and other functions
filter the data, sort it, and join it with other datasets
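A sketch of the transform step on the selected DataFrame; the lookup DataFrame, column names, and threshold are illustrative:

    import org.apache.spark.sql.functions.{col, sum}

    val lookup = spark.read.json("/path/to/lookup.json")

    // filter, sort, join, then group and aggregate
    val transformed = selected
      .filter(col("amount") > 0)
      .orderBy(col("amount").desc)
      .join(lookup, "id")
      .groupBy("name")
      .agg(sum("amount").alias("total"))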
Analyze the Data
View the schema
work with aggregate statistics and other analysis operations
examine the columns / data types...
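A sketch of the analyze step on the transformed DataFrame from above:

    // view the schema and the column names / data types
    transformed.printSchema()
    transformed.dtypes.foreach(println)

    // quick aggregated statistics for the numeric and string columns
    transformed.describe().show()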
Spark SQL and memory optimization using Catalyst & Tungsten
Rule-based query optimization
Catalyst
What ?
Based on functional programming constructs in Scala
support the addition of new optimization techniques and features
enable developers to add data source-specific rules and support new data types
Spark SQL's built-in rule-based query optimizer
How ?
1 : Analysis
analyzes the query, the DataFrame, the catalog, and the unresolved logical plan to create a
logical plan
2 : Logical optimization
the logical plan is refined into an optimized
logical plan
3 : Physical planning
Catalyst generates multiple physical plans (each a concrete way to compute the result) from the logical plan
the cost model then chooses the physical plan with the least cost (cost-based optimization)
4 : Code generation
Catalyst takes the selected physical plan and generates Java bytecode to run on each node
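One way to watch these phases on any DataFrame query, such as the transformed DataFrame above: explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan.

    // prints the Parsed, Analyzed, and Optimized Logical Plans,
    // plus the Physical Plan chosen by Catalyst
    transformed.explain(true)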
Cost-based Optimization
Tungsten
What ?
Spark's cost-based optimizer that maximizes CPU and memory performance
How ?
manages memory explicitly and does not rely on the JVM object model or garbage collection
builds cache-friendly data structures, arranged so algorithms can run efficiently and safely using stride-based memory access instead of random memory access
Goal
improve query runtime performance and reduce memory consumption, which saves organizations time and money
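Tungsten's explicit memory management is automatic, but as a hedged sketch, these standard Spark settings (the values are illustrative) reserve an off-heap region that Spark manages itself, outside the JVM garbage collector:

    import org.apache.spark.sql.SparkSession

    // Illustrative settings: give Spark a 1 GB off-heap region to manage
    // explicitly, outside the JVM object model and garbage collection
    val session = SparkSession.builder()
      .appName("tungsten-offheap-sketch")
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()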