Prerequisite
Spark MLlib
- Statistics, hypothesis testing
- Feature extraction and transformation
- Supervised learning: regression, classification
- Unsupervised learning: clustering, PCA, etc
- Machine learning pipeline
- Hyper-parameter tuning
- Model deployment
Spark Streaming
- Streaming sources: Kafka, S3, HDFS, etc
- Micro-batch processing, DStream
- Transformations
- Window functions
- Output operations
- Checkpointing
Spark SQL
Columns
- Sorting: asc_nulls_last, desc_nulls_first, etc
- Type casting
- Null and membership checks: isNull, isNotNull, isin
- Conditional expressions: when, otherwise (CASE WHEN)
DataFrame
- DataFrameReader, DataFrameWriter
- Schema, complex datatypes
- Common operations: select, drop, orderBy, union, withColumn, withColumnRenamed, etc
- Joining: broadcast join, crossJoin, inner join, etc
- Missing data: dropna, fillna
Aggregate / Window functions
- groupBy
- partitionBy, rangeBetween
SQL functions
- Datetime functions: add_months, date_add, current_date, datediff, months_between, etc
- Array functions: collect_list, collect_set, explode, array_contains
- Window functions: rank, dense_rank, lag, lead, first, last
- Aggregate functions: sum, mean, etc
Spark core concepts
Basic concepts
- Horizontal scaling, distributed computing in cluster
- Driver, worker nodes
- Spark app, Spark Session
- Spark job, stage, task
- Spark data types
RDD / Key-value RDD
- Partitions
- Lazy evaluation, DAG, fault tolerance
- Transformations: map, filter, distinct, union, intersection, etc
- Actions: reduce, collect, take, etc
- RDD caching
Data partitioning
- Data shuffling
- Co-locate partitions
- HashPartitioner / RangePartitioner
Spark IO
- Text files, CSV, JSON
- Parquet, ORC
- HDFS, AWS S3
- JDBC, ODBC, Cassandra, Elasticsearch
Advanced topics
- Broadcast variable / Accumulators
- Get & set Spark configurations
- Spark UI, monitoring
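The data partitioning topics listed above (shuffling, co-located partitions, HashPartitioner) can be illustrated without a cluster. The following is a minimal plain-Python sketch of the idea, not Spark's own implementation: `hash_partition` and `partition_pairs` are hypothetical helpers, and the sample data is made up for illustration.

```python
from collections import defaultdict
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (hypothetical helper, not a Spark API)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Group (key, value) pairs into partitions by hashed key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


# Made-up sample data: two key-value datasets sharing the same keys.
orders = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]
users = [("alice", "NL"), ("bob", "US"), ("carol", "VN")]

order_parts = partition_pairs(orders, num_partitions=4)
user_parts = partition_pairs(users, num_partitions=4)

# Both sides use the same partitioner, so records with the same key share a
# partition index and can be joined partition by partition without moving data.
joined = []
for idx, order_rows in order_parts.items():
    countries = dict(user_parts.get(idx, []))
    for name, amount in order_rows:
        joined.append((name, amount, countries[name]))
```

When two datasets are partitioned differently, matching keys can sit in different partitions, and a join first has to redistribute records by key. That redistribution is what the "data shuffling" item refers to.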
Python programming
- Data types
- Operators
- If/else, Loops
- List, Set, Tuple, Dictionary
- Class/Object, Function
- Exceptions (try/except)
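As a rough self-check for the Python items above, the short sketch below touches most of them in one place: a list of tuples, a set, a dictionary, a function, a class, a loop, a condition, and try/except. The `Reading` class, the `parse_reading` function, and the sample rows are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """A hypothetical record type used only for this illustration."""
    sensor: str
    value: float


def parse_reading(raw: tuple) -> Reading:
    """Convert a (sensor, text) tuple into a Reading; raises ValueError on bad input."""
    sensor, text = raw  # tuple unpacking
    return Reading(sensor=sensor, value=float(text))


raw_rows = [("a", "1.5"), ("b", "oops"), ("a", "2.5")]  # list of tuples

readings = []
rejected = []
for row in raw_rows:  # loop over the list
    try:
        readings.append(parse_reading(row))
    except ValueError:
        rejected.append(row)  # keep bad rows instead of crashing

sensors = {r.sensor for r in readings}  # set: unique sensor names
totals = {}  # dictionary: sensor name -> summed value
for r in readings:
    if r.value > 0:  # condition with a comparison operator
        totals[r.sensor] = totals.get(r.sensor, 0.0) + r.value
```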
SQL
- Data types and Operators
- SQL queries: SELECT, INSERT, UPDATE
- SQL tables & views: create, rename, drop, update, etc
- SQL clauses: WHERE, ORDER BY, GROUP BY, PARTITION BY, HAVING, CASE WHEN, BETWEEN, UNION, etc
- SQL Joins
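Several of the SQL items above can be practised with Python's built-in `sqlite3` module, with no database server required. This is only a sketch: the `customers` and `orders` tables and their rows are invented for illustration, and SQLite's dialect differs in places from Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO customers VALUES (1, 'Ann', 'NL'), (2, 'Bo', 'US'), (3, 'Cy', 'US');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0), (4, 3, 5.0);

    UPDATE orders SET amount = 12.0 WHERE id = 3;
    """
)

# Combines INNER JOIN, WHERE with BETWEEN, GROUP BY, HAVING, CASE WHEN, ORDER BY.
query = """
    SELECT c.name,
           CASE WHEN SUM(o.amount) >= 50 THEN 'high' ELSE 'low' END AS tier,
           SUM(o.amount) AS total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    WHERE o.amount BETWEEN 10 AND 100
    GROUP BY c.name
    HAVING SUM(o.amount) > 10
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Here `WHERE` filters individual order rows before grouping, while `HAVING` filters the grouped totals afterwards. The order of 5.0 is dropped by `WHERE`, so customer `Cy` never reaches the `GROUP BY` step.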