Apache Spark, Spark SQL, DataFrame - Coggle Diagram

- - - - supports querying data via either
        
        SQL
        
        HiveQL
    - - real time processing of streaming data
        
        production web server log files
        
        Social Media
        
        Messaging queues (e.g., Kafka)
    - - provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering etc.
    - - manipulating graphs and performing graph-parallel operations
      - provides a uniform tool for ETL
      - exploratory analysis and iterative graph computations
      - common graph algorithms such as PageRank
    - - base engine for large-scale parallel and distributed data processing
        
        memory management and fault recovery
        
        scheduling, distributing and monitoring jobs on a cluster
        
        interacting with storage systems
  - - - loading an external dataset
      - distributing a collection from the driver program
    - - Transformations
        
        operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result
      - Actions
        
        operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
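The split between transformations and actions can be sketched without a cluster. The block below is a minimal pure-Python model, not the Spark API: `MiniRDD`, `parallelize`, and the helper `square` are hypothetical names chosen for illustration. It assumes the usual lazy behaviour, where a transformation only records a new dataset description and an action triggers the actual computation.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy stand-in for an RDD: it stores a recipe, not computed data."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator,
        # so every action can re-run the whole pipeline from the start.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: each returns a new MiniRDD and computes nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    # Actions: each runs the recorded pipeline and returns a plain value.
    def count(self):
        return sum(1 for _ in self._source())

    def first(self):
        return next(self._source())

    def reduce(self, fn):
        return fold(fn, self._source())


calls = []  # records every element that `square` actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.parallelize([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)  # still empty: no action has run yet

sum_of_even_squares = even_squares.reduce(lambda a, b: a + b)  # 4 + 16
calls_after_action = list(calls)   # the action pulled all five elements
```

In this model, chaining `map` and `filter` costs nothing until `reduce` is called, which mirrors why a transformation yields a new dataset while an action returns a value.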
- - - - up to 100x faster than Hadoop MapReduce in memory
      - up to 10x faster on disk
  - - - HDFS
      - HBase
      - Apache Cassandra
      - Amazon S3
    - - run on clusters managed by Hadoop YARN or Apache Mesos
      - standalone
- - - - assigns tasks to workers, one task per partition
        
        A task applies its unit of work to the dataset in its partition and outputs a new partition dataset
        
        Results
        
        Sent back to the driver application
        
        Saved to the disk
        
        because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations
      - Spark Standalone (included)
      - Apache Mesos
        
        a general cluster manager that can also run Hadoop applications
      - Apache Hadoop YARN
      - Kubernetes
        
        automating deployment, scaling, and management of containerized applications
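The "one task per partition" idea above can be sketched with a thread pool standing in for the workers. This is a toy model under stated assumptions, not how Spark schedules work: `split_into_partitions`, `run_stage`, and `double_all` are hypothetical helpers, and the threads only imitate executors on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_stage(partitions, unit_of_work, max_workers=2):
    """Driver side: submit one task per partition, collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(unit_of_work, partitions))


def double_all(partition):
    """Unit of work: turns one input partition into one output partition."""
    return [x * 2 for x in partition]


partitions = split_into_partitions(list(range(10)), 3)

# Each task outputs a new partition of the transformed dataset.
doubled_partitions = run_stage(partitions, double_all)

# A final step sends small per-partition results back to the driver.
partial_sums = run_stage(doubled_partitions, sum)
grand_total = sum(partial_sums)
```

With three partitions and two workers, one worker handles two tasks in turn, which is the same reason partition count, rather than worker count, sets the number of tasks.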
- - - - faster processing speed
    - - process data interactively
    - - increasing the cluster size and hence the cost
  - - - slow down the processing speed
    - - No interactive mode
- - - - a universal API for loading and storing structured data
        
        built-in support for Hive, Avro, JSON, JDBC, Parquet
    - - distributed collection of data organized into named columns
      - conceptually equivalent to a table in a relational database
    - - is based on functional programming constructs in Scala
      - It provides a general framework for transforming trees, which is used to perform analysis/evaluation, optimization, planning, and runtime code generation
      - Optimization
        
        cost-based optimization (run time and resource utilization are treated as the cost)
        
        rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts
        
        Catalyst is a modular library built as a rule-based system. Each rule in the framework focuses on a distinct optimization
    - - the entry point for working with structured data in Spark
      - allows the creation of DataFrame objects as well as the execution of SQL queries
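The rule-based tree transformation described for Catalyst can be illustrated with a very small expression tree. The sketch below is loosely modeled on that idea and is not Catalyst's code: the node classes `Literal`, `Column`, and `Add`, the two rules, and the `optimize` loop are all hypothetical, and it assumes rules are applied bottom-up and repeated until the tree stops changing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Rebuild the tree bottom-up, applying `rule` to every node."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node


def drop_add_zero(node):
    """Rule: x + 0 and 0 + x both simplify to x."""
    if isinstance(node, Add):
        if node.right == Literal(0):
            return node.left
        if node.left == Literal(0):
            return node.right
    return node


def optimize(tree, rules):
    """Apply every rule in turn, repeating until the tree stops changing."""
    while True:
        new_tree = tree
        for rule in rules:
            new_tree = transform_up(new_tree, rule)
        if new_tree == tree:
            return tree
        tree = new_tree


# x + (1 + -1): folding gives x + 0, then the second rule drops the zero.
expr = Add(Column("x"), Add(Literal(1), Literal(-1)))
optimized = optimize(expr, [fold_constants, drop_add_zero])
```

Each rule here handles one narrow rewrite, and the second rule only becomes applicable after the first has run, which is the benefit of keeping rules small and composing them.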