RDDs in parallel computing and Spark, DataFrames & Datasets in Spark,…
RDDs in parallel computing and Spark
create a new RDD from an existing one
Transformation
filter(func)
distinct([numTasks])
flatMap(func)
map(func)
to evaluate a transformation, we use actions (see the sketch after this list)
Action
take(n)
collect()
takeOrdered(n,key=func)
reduce(func)
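A minimal sketch of the above, assuming a spark-shell session where sc is the predefined SparkContext; the numbers are made-up sample data:

    // Parallelize a local collection into an RDD (illustrative data)
    val numbers = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

    // Transformations are lazy: each one creates a new RDD from an existing one
    val doubled  = numbers.map(_ * 2)                 // map(func)
    val evens    = doubled.filter(_ % 2 == 0)         // filter(func)
    val unique   = evens.distinct()                   // distinct([numTasks])
    val expanded = unique.flatMap(n => Seq(n, n + 1)) // flatMap(func)

    // Actions trigger evaluation of the transformation chain
    expanded.take(3)        // first 3 elements
    expanded.collect()      // all elements, returned to the driver
    expanded.takeOrdered(3) // smallest 3 elements
    expanded.reduce(_ + _)  // sum of all elements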
How do transformations and actions happen ?
DAG (directed acyclic graph)
A graph data structure with
edges
and
vertices
vertices are the RDDs
edges are the operations, i.e. transformations or actions
if a node goes down, Spark replays the recorded DAG (the lineage) to rebuild that node's lost data
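One way to inspect that lineage DAG, using the expanded RDD from the sketch above (any RDD works):

    // toDebugString prints the RDD's lineage: the chain of parent RDDs
    // that Spark can replay to recompute lost partitions
    println(expanded.toDebugString)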
DataFrames & Datasets in Spark
Benefits
immutable
can convert JVM objects to a tabular representation
work with the Scala and Java APIs
how to create ?
from a sequence of a primitive datatype
from a text file
from a json file
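A minimal sketch of each creation path, assuming a spark-shell session where spark is the predefined SparkSession; the file paths are placeholders:

    import spark.implicits._

    // from a sequence of a primitive datatype
    val dsFromSeq = Seq(1, 2, 3, 4).toDS()
    val dfFromSeq = Seq(("a", 1), ("b", 2)).toDF("letter", "count")

    // from a text file (a single string column named "value")
    val dfFromText = spark.read.text("/path/to/file.txt")

    // from a JSON file (the schema is inferred from the records)
    val dfFromJson = spark.read.json("/path/to/file.json")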
datasets vs dataframes
Datasets
strongly typed
built on top of DataFrames; the latest data abstraction added to Spark
use a unified Java and Scala API
DataFrames
API available in Java, Scala, Python, and R
not type-safe
built on top of RDDs and added in an earlier Spark version
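A small sketch of the practical difference; the Person case class and the sample rows are made up for illustration:

    import spark.implicits._

    case class Person(name: String, age: Int)

    // DataFrame: untyped rows; column names are only checked at runtime
    val df = Seq(("Ana", 30), ("Bo", 25)).toDF("name", "age")
    df.filter($"age" > 26)  // a typo in "age" fails only when the job runs

    // Dataset: strongly typed JVM objects; fields are checked at compile time
    val ds = Seq(Person("Ana", 30), Person("Bo", 25)).toDS()
    ds.filter(_.age > 26)   // a typo in .age fails at compile time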
benefits of Datasets in Spark
provide
compile-time type safety
can detect syntax and semantic errors before deployment
compute faster than RDDs
enable improved memory usage and caching
use Dataset API functions for aggregate operations, including sum, average, join, and groupBy
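A sketch of such aggregate operations on the ds Dataset from the previous example; the cities DataFrame is illustrative:

    import spark.implicits._
    import org.apache.spark.sql.functions.{sum, avg}

    // groupBy plus built-in aggregate functions
    ds.groupBy($"name").agg(sum($"age"), avg($"age")).show()

    // join with another dataset on a shared column
    val cities = Seq(("Ana", "Lisbon"), ("Bo", "Oslo")).toDF("name", "city")
    ds.join(cities, "name").show()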
what is a Dataset ?
the new Spark data abstraction
collection of strongly typed JVM objects
provides the benefits of both
RDD
and
SparkSQL
The basic DataFrame operations
Read the Data
Create a DataFrame
create a DataFrame from an existing DataFrame
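A sketch of the read step; the input path and column names are placeholders:

    // Read the data into a DataFrame
    val raw = spark.read.json("/path/to/input.json")

    // Create a new DataFrame from an existing one, here by selecting columns
    val selected = raw.select("id", "name", "amount")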
Load the Data into a Database
the final step of an ETL pipeline
export to another database
export to disk as JSON files
save the data to a PostgreSQL database
use an API to export the data
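A sketch of the load step, writing the selected DataFrame from the read sketch above; the output path, JDBC URL, table name, and credentials are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath:

    // export to disk as JSON files
    selected.write.mode("overwrite").json("/path/to/output_json")

    // save the data to a PostgreSQL database over JDBC
    selected.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "dbuser")
      .option("password", "dbpassword")
      .mode("append")
      .save()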
Transform the Data
keep only the relevant Data
apply filters, joins, column operations on sources and tables, grouping and aggregations, and other functions
filter the data, sort it, and join it with other datasets
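A sketch of the transform step on the selected DataFrame; the lookup DataFrame, column names, and threshold are illustrative:

    import org.apache.spark.sql.functions.{col, sum}

    val lookup = spark.read.json("/path/to/lookup.json")

    // filter, sort, join, then group and aggregate
    val transformed = selected
      .filter(col("amount") > 0)
      .orderBy(col("amount").desc)
      .join(lookup, "id")
      .groupBy("name")
      .agg(sum("amount").alias("total"))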
Analyze the Data
View the schema
work with aggregate statistics and other analysis operations
examine the columns / data types...
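A sketch of the analyze step on the transformed DataFrame from above:

    // view the schema and the column names / data types
    transformed.printSchema()
    transformed.dtypes.foreach(println)

    // quick aggregated statistics for the numeric and string columns
    transformed.describe().show()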
Spark SQL and memory optimization using Catalyst & Tungsten
Rule-based query optimization
Catalyst
What ?
Based on functional programming constructs in Scala
support the addition of new optimization techniques and features
enable developers to add data source-specific rules and support new data types
Spark SQL's built-in rule-based query optimizer
How ?
1 : Analysis
analyzes the query, the DataFrame, the catalog, and the unresolved logical plan to create a
logical plan
2 : Logical optimization
the logical plan is refined into an optimized
logical plan
3 : Physical planning
Catalyst generates multiple physical plans (each a concrete way to compute the result) from the logical plan
the cost model then chooses the physical plan with the least cost (cost-based optimization)
4 : Code generation
Catalyst takes the selected physical plan and generates Java bytecode to run on each node
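One way to watch these phases on any DataFrame query, such as the transformed DataFrame above: explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan.

    // prints the Parsed, Analyzed, and Optimized Logical Plans,
    // plus the Physical Plan chosen by Catalyst
    transformed.explain(true)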
Cost-based Optimization
Tungsten
What ?
Spark's cost-based optimizer that maximizes CPU and memory performance
How ?
manages memory explicitly and does not rely on the JVM object model or garbage collection
builds cache-friendly data structures, arranged so algorithms can run efficiently and safely using stride-based memory access instead of random memory access
Goal
improve query runtime performance and reduce memory consumption, which saves organizations time and money
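Tungsten's explicit memory management is automatic, but as a hedged sketch, these standard Spark settings (the values are illustrative) reserve an off-heap region that Spark manages itself, outside the JVM garbage collector:

    import org.apache.spark.sql.SparkSession

    // Illustrative settings: give Spark a 1 GB off-heap region to manage
    // explicitly, outside the JVM object model and garbage collection
    val session = SparkSession.builder()
      .appName("tungsten-offheap-sketch")
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()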