Data Processing
AWS Glue
Features:
- Glue is a serverless, fully managed ETL service that crawls my data, builds a Data Catalog, and performs data preparation, transformation, and ingestion to make my data immediately queryable
- Serves as a central metadata repository for my data lake in S3
- Able to discover schemas or table definitions and publish them for use with analysis tools such as Athena, Redshift, or EMR
Glue Crawler and Data Catalog: (image)
- Glue Crawler:
- Glue Crawler (component of Glue) scans my data in S3
- Crawler also infers a schema automatically based on the structure of the data it finds in the S3 buckets
- Can schedule Crawler to run periodically
- Data Catalog:
- Crawler populates Glue Data Catalog (central metadata repository)
- Data Catalog is used by other AWS tools (Redshift, Athena, EMR) that might analyze that data
- Data itself remains where it was originally, in S3
- Only the table definition itself (column names, column data types) is stored in the Data Catalog; it tells other services how to interpret that data and how it is structured (a minimal boto3 crawler sketch follows below)
- Populating the AWS Glue Data Catalog (image)
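- A minimal boto3 sketch of creating and scheduling a crawler that populates the Data Catalog; the role ARN, database, bucket, and schedule below are placeholders, not values from the notes:
```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix daily and register the inferred tables in the Data Catalog
glue.create_crawler(
    Name="sensor-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="sensor_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sensor-data/"}]},
    Schedule="cron(0 2 * * ? *)",                            # every day at 02:00 UTC
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)

# Run it immediately instead of waiting for the schedule
glue.start_crawler(Name="sensor-data-crawler")
```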
- Glue and S3 Partitions: (image)
- Crawler will extract partitions based on how my S3 data is organized
- Think up front about how I will be querying my data lake in S3
- e.g. Devices send sensor data every hour
- organize my buckets as yyyy/mm/dd/device if I query primarily by time ranges
- organize my buckets as device/yyyy/mm/dd if I query primarily by device
Glue + Hive:
- Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
- Hive runs on Elastic MapReduce (EMR)
- Hive allows me to issue SQL-like queries (HiveQL) on data accessible to my EMR cluster
- Glue can integrate with Hive
- Can configure Glue Data Catalog as the metastore for Hive
- Can import a Hive metastore into Glue
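- A sketch of the EMR configuration classification that points Hive at the Glue Data Catalog; this "hive-site" setting is passed as the cluster's Configurations when launching EMR (for example in the run_job_flow sketch under EMR Cluster below):
```python
# Cluster configuration snippet: the "hive-site" classification tells Hive's
# metastore client to use the AWS Glue Data Catalog instead of a local metastore.
glue_metastore_config = [{
    "Classification": "hive-site",
    "Properties": {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    },
}]
```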
Glue ETL:
- Transform, Clean, Enrich Data before doing analysis
- Can automatically generate the code to perform that ETL after I define, in a graphical manner, the transformations I want to apply to my data
- Generates ETL code in Scala or Python; I can modify the code
- Can provide my own Spark or PySpark scripts (choose to start from scratch)
- Target (output) of ETL job can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
- Fully managed, cost effective, pay only for the resources consumed
- Jobs are run on a Serverless Spark platform
- Store transformed data in an encrypted manner:
- Use SSE to encrypt data at rest
- Use SSL for data in transit
- Custom ETL jobs on my data can be trigger-driven, scheduled, or on-demand
- Event-driven: as Glue finds new data in S3, it can trigger jobs to transform that data into a more structured format for later processing
- Glue Scheduler to schedule the jobs
- Can provision additional Data Processing Units (DPUs) to increase the performance of the underlying Spark jobs that run my ETL jobs
- ETL job errors can be reported to CloudWatch (further integrate with SNS to notify me of those errors automatically)
- DynamicFrame:
- Very much like a Spark DataFrame, but with more ETL-oriented features
- Is a collection of DynamicRecords
- DynamicRecords are self-describing and have a schema
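- A minimal Glue ETL script sketch using DynamicFrames; the database, table, field, and bucket names are assumptions for illustration only:
```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Boilerplate every Glue ETL script starts with
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a Data Catalog table (populated by the crawler) as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="sensor_db", table_name="sensor_data")

# Rename / cast fields; each mapping is (source, source type, target, target type)
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("device", "string", "device_id", "string"),
        ("ts", "long", "timestamp", "long"),
        ("temp", "double", "temperature", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
)

job.commit()
```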
Glue ETL Transformations:
- Bundled Transformations:
- DropFields, DropNullFields: remove (null) fields
- Filter: specify a function to filter records
- Join: to enrich data
- Map: add fields, delete fields, perform external lookups
- Machine Learning Transformations:
- FindMatches ML: identify duplicate or matching records in my dataset, even when the records do not have a common unique identifier and no fields match exactly
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformations (e.g. K-Means)
- Can convert between Spark DataFrame and Glue DynamicFrame
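- Continuing the sketch above (it reuses the hypothetical `mapped` frame and `glueContext`), a few bundled transformations plus a round-trip through a Spark DataFrame:
```python
from awsglue.transforms import Filter, DropNullFields
from awsglue.dynamicframe import DynamicFrame

# Keep only records with a usable temperature reading
filtered = Filter.apply(
    frame=mapped,
    f=lambda rec: rec["temperature"] is not None and rec["temperature"] > 0,
)

# Drop any fields that are entirely null
cleaned = DropNullFields.apply(frame=filtered)

# Round-trip through a Spark DataFrame for anything Glue doesn't bundle,
# then convert back to a DynamicFrame
df = cleaned.toDF()
df = df.dropDuplicates(["device_id", "timestamp"])
deduped = DynamicFrame.fromDF(df, glueContext, "deduped")
```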
AWS Glue Development Endpoints:
- Develop ETL scripts using a notebook
- Create an ETL job that runs my script (using Spark and Glue)
- Endpoint is in a VPC controlled by security groups, connect via:
- Apache Zeppelin on my local machine
- Zeppelin notebook server on EC2 (via Glue console)
- SageMaker notebook
- Terminal window
- PyCharm professional edition
- Use Elastic IPs to access a private endpoint address
Running Glue jobs:
- Glue Scheduler - Time-based schedules
- Can define a time-based schedule for my crawlers and jobs in Glue
- Definition of these schedules uses Unix-like cron syntax (a boto3 sketch follows at the end of this section)
- Minimum precision for a schedule is 5 minutes
- Job bookmarks:
- Glue tracks data that has been processed during a previous run of an ETL job by storing state information from the job run
- This persisted state information is called a Job Bookmark
- Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data
- Allows me to process new data only when re-running on a schedule
- Works with S3 sources in a variety of formats
- Works with relational databases via JDBC (if primary keys are in sequential order)
- Only handles new rows, not updated rows
- Can integrate with CloudWatch Events:
- Fire off a Lambda function or SNS notification when ETL succeeds or fails
- Invoke an EC2 Run Command, send the event to Kinesis, or activate a Step Function to move on to the next stage in the pipeline once the data has been transformed using Glue ETL
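- The boto3 sketch referenced above: a scheduled job with job bookmarks enabled; the job name, role ARN, script location, and cron expression are placeholders:
```python
import boto3

glue = boto3.client("glue")

# '--job-bookmark-option' enables job bookmarks so only new data is processed on each run
glue.create_job(
    Name="sensor-etl",
    Role="arn:aws:iam::123456789012:role/GlueETLRole",        # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/sensor_etl.py"},
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Time-based trigger using Unix-like cron syntax (UTC); runs every 30 minutes
glue.create_trigger(
    Name="sensor-etl-every-30-min",
    Type="SCHEDULED",
    Schedule="cron(0/30 * * * ? *)",
    Actions=[{"JobName": "sensor-etl"}],
    StartOnCreation=True,
)
```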
Glue cost model:
- Billed by the minute for crawler and ETL jobs
- First million objects stored and accessed are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute
Glue Anti-patterns:
- Multiple ETL engines:
- Glue ETL is based on Spark
- If you want to use other engines (Hive, Pig), Data Pipeline + EMR would be a better fit
AWS Glue Studio:
- Visual interface for ETL workflows
- Visual job editor
- Create directed acyclic graph for complex workflows
- Sources include S3, Kinesis, Kafka, JDBC
- Transform / sample / join data
- Target to S3 or Glue Data Catalog
- Support partitioning
- Visual job dashboard
- Overviews, status, run times
AWS Glue DataBrew:
- A visual data preparation tool
- UI for pre-processing large datasets
- Input from S3, data warehouse, or database
- Output to S3
- Over 250 ready-made transformations
- Create “recipes” of transformations that can be saved as jobs within a larger project
- Security:
- Can integrate with KMS (with customer master keys only)
- SSL in transit
- IAM can restrict who can do what
- CloudWatch & CloudTrail
AWS Glue Elastic Views:
- Builds materialized views from Aurora, RDS, DynamoDB
- Those views can be used by Redshift, Elasticsearch, S3, DynamoDB, Aurora, RDS
- SQL interface
- Handles any copying or combining / replicating data needed
- Monitors for changes and continuously updates
- Serverless
AWS Lake Formation: (image)
- Built on top of Glue
- Makes it easy to set up a secure data lake in days
- Loading data & monitoring data flows
- Setting up partitions
- Encryption & managing keys
- Defining transformation jobs & monitoring them
- Access control
- Auditing
- Pricing:
- No cost for Lake Formation itself but underlying services incur charges (Glue, S3, EMR, Athena, Redshift)
- AWS Lake Formation Building a Data Lake (image)
- Cross-account Lake Formation permissions:
- Recipient must be set up as a data lake administrator
- Can use AWS Resource Access Manager for accounts external to your organization
- IAM permissions for cross account access
- Lake Formation does not support manifests in Athena or Redshift queries
- IAM permissions on the KMS encryption key are needed for encrypted data catalogs in Lake Formation
- IAM permissions needed to create blueprints and workflows
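- A minimal Lake Formation permission grant via boto3; the principal ARN, database, and table names are placeholders:
```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a catalog table to an IAM principal
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```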
Elastic MapReduce (EMR):
- Managed Hadoop framework on EC2 instances
- Includes tools like Spark, HBase, Presto, Flink, Hive & more
- EMR Notebooks: query data on EMR cluster using Python from web browser
- Offers multiple integration points with other AWS services
EMR Cluster: (image)
- Master node / Leader node:
- Manages the cluster
- Tracks status of tasks, monitors cluster health
- Single EC2 instance
- Can be a single node cluster
- Core node:
- Hosts HDFS data and runs tasks
- Can be scaled up & down, but with some risk
- At least 1 core node in a multi-node cluster
- Task node:
- Runs tasks, does not host data
- Optional
- No risk of data loss when removing
- Good use of spot instances
- Add task nodes as needed as the traffic into my cluster increases
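- A boto3 sketch of launching a cluster with master, core, and Spot task instance groups; the names, counts, release label, and log bucket are placeholders:
```python
import boto3

emr = boto3.client("emr")

# Cluster layout mirrors the node roles above
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes on Spot: extra processing capacity, no HDFS, safe to lose
            {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,   # long-running cluster
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```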
EMR Usage:
- Transient Cluster:
- Transient clusters terminate once all steps are complete
- Loading data, processing, storing, then shut down
- Lower cost
- Long-Running Cluster:
- Long running clusters must be manually terminated
- Basically a data warehouse with periodic processing on large datasets
- Can spin up task nodes using Spot instances for temporary capacity
- Can use reserved instances on long running clusters to save $
- Termination protection on by default, auto termination off
- Frameworks and applications are specified at cluster launch
- Connect directly to master to run jobs directly OR
- Submit ordered steps via the console
- Process data in S3 or HDFS
- Output data to S3 or another data store
- Once defined, steps can be invoked via the console
EMR / AWS Integration:
- EC2 for the instances that comprise the nodes in the cluster
- VPC to configure the virtual network in which I launch my instances for security
- S3 to store input and output data
- IAM to configure permissions for clusters and integrate with other services
- CloudWatch to monitor cluster performance and configure alarms
- CloudTrail to audit requests made to the service
- Data Pipeline to schedule and start my clusters and shut them down
EMR Storage Options:
- HDFS:
- Hadoop Distributed File System
- Distributed scalable file system for Hadoop
- Multiple copies of data stored across cluster instances for redundancy
- Files stored as blocks (128MB default size)
- Ephemeral: HDFS data is lost when cluster is terminated
- Useful for caching intermediate results during MapReduce processing or workloads with significant random I/O
- Hadoop tries to process data where it is stored on HDFS
- EMRFS:
- Access S3 as if it were HDFS
- Allows persistent storage after cluster termination
- Use S3 for input/ output data
- Use HDFS to store intermediate results
- EMRFS Consistent View:
- Solves the consistency problem that can arise when many nodes in an EMR cluster try to write/read data in S3 at the same time
- Use DynamoDB to store object metadata and track consistency for S3
- May need to tinker with read/write capacity on DynamoDB
- Local file system:
- Suitable only for temporary data (buffers, caches)
- EBS for HDFS:
- Allows use of EMR on EBS-only instance types (M4, C4)
- Deleted when cluster is terminated
- EBS volumes can only be attached when launching a cluster
- If you manually detach an EBS volume, EMR treats that as a failure and replaces it
EMR promises:
- EMR charges by the hour + EC2 charges
- Provisions new nodes if a Core node fails
- Can resize a running cluster’s Core nodes (increases both processing and HDFS capacity)
- Core nodes can also be added or removed (but removing risks data loss)
- Can add and remove Tasks nodes on the fly (increase processing capacity but not HDFS capacity)
EMR Managed Scaling:
- EMR Automatic Scaling:
- Old way of doing it
- Custom scaling rules based on CloudWatch metrics
- Supports instance groups only
- EMR Managed Scaling:
- Supports instance groups and instance fleets
- Scales Spot, On-Demand, and Savings Plan instances within the same cluster
- Available for Spark, Hive, YARN workloads
- Scale-up Strategy:
- First adds core nodes, then task nodes, up to max units specified
- Scale-down Strategy:
- First removes task nodes, then core nodes, no further than minimum constraints
- Spot nodes always removed before On-Demand instances
EMR Tools
Apache Spark: (image)
- Distributed processing framework for big data
- In-memory caching, optimized query execution
- Supports Java, Scala, Python, and R
- Supports code reuse across
- Batch processing
- Interactive Queries (Spark SQL)
- Real time Analytics
- Machine Learning (MLLib)
- Graph Processing
- Spark Streaming:
- Integrates with Kinesis and Kafka on EMR
- Spark is NOT meant for OLTP or batch processing
- Can do streaming analytics in a fault-tolerant way and write the analytic results to HDFS or S3
- Can integrate Spark with Redshift (image)
- e.g. Airline flight data resides in an S3 data lake. Deploy Redshift Spectrum on top of that S3 data. Use the spark-redshift package to perform ETL on it; Redshift acts as a SQL data source for EMR Spark. Process the dataset in Spark and write it back to another Redshift table for further processing, such as machine learning
How Spark Works? (image)
- Spark apps are run as independent processes on a cluster
- SparkContext object (driver program) coordinates them
- SparkContext works through a Cluster Manager
- Executors run computations and store data
- SparkContext sends application code and tasks to executors
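- A minimal PySpark driver sketch: the driver program builds the SparkSession and the job's DAG, while executors on the cluster do the actual work; the S3 paths and column names are hypothetical:
```python
from pyspark.sql import SparkSession

# The driver creates the SparkSession/SparkContext; the cluster manager (YARN on EMR)
# hands work out to executors running on the core/task nodes.
spark = SparkSession.builder.appName("flight-delays").getOrCreate()

# Read a CSV dataset from S3; executors do the reading and parsing
flights = spark.read.option("header", "true").csv("s3://my-bucket/flights/")

# A simple aggregation; Spark builds a DAG and only executes it when an action (write) runs
delays = flights.groupBy("carrier").count()

delays.write.mode("overwrite").parquet("s3://my-bucket/output/delays/")
spark.stop()
```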
Spark Components:
- Spark Core:
- Memory management, fault recovery, scheduling, distribute & monitor jobs, interact with storage
- Supports APIs for Scala, Python, Java, and R at the lowest level
- Spark SQL:
- Distributed query engine that provides low-latency interactive queries up to 100x faster than MapReduce
- Includes a cost-based optimizer, columnar storage, and code generation for fast queries
- Supports various data sources (e.g. JDBC, ODBC, JSON, HDFS, ORC, Parquet) for importing data into Spark
- Supports querying Hive tables using HiveQL
- Dataset: a data structure in Spark SQL that is strongly typed and maps to a relational schema
- Spark Streaming:
- Integrates with Spark SQL to use Datasets
- Leverages Spark Core's fast scheduling capability to do real-time streaming analytics; it ingests data in mini-batches
- Structured streaming
- Supports data from a variety of streaming sources (e.g. Kafka, Flume, HDFS, ZeroMQ, Kinesis)
- Data received through Spark structured streaming will be added to the growing dataset (image)
- Integrate Kinesis Data Streams and Spark structured streaming (image)
- MLLib:
- A library of algorithms to do machine learning on data at large scale
- Algorithms provide the ability to do classification, regression, clustering, collaborative filtering, and pattern mining (a minimal MLlib sketch follows after this list)
- Can read data from HDFS, HBase, or any Hadoop data source, and from S3 on EMR
- Can write my MLlib applications in Scala, Java, Python, or R
- GraphX:
- Distributed Graph Processing framework
- A graph is the data structure (e.g. a graph of social network users with edges representing the relationships between them)
- Provides ETL capabilities, exploratory analysis, and iterative graph computation to enable users to interactively build and transform graph data structures at scale
- No longer widely used
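- The MLlib sketch referenced above: clustering a tiny, made-up dataset with KMeans (the feature names and values are invented for illustration):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-kmeans").getOrCreate()

# Hypothetical sensor readings with two numeric features
df = spark.createDataFrame(
    [(65.0, 0.20), (70.1, 0.25), (20.3, 0.90), (22.8, 0.85)],
    ["temperature", "vibration"],
)

# MLlib estimators expect a single vector column of features
features = VectorAssembler(
    inputCols=["temperature", "vibration"], outputCol="features"
).transform(df)

# Cluster the readings into two groups and show the assignments
model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).show()
```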
Apache Hive on EMR: (image)
- Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface
- SQL-like code (HiveQL) queries the underlying unstructured data that might live in Hadoop (HDFS/YARN) or S3
- EMR Hive sits on top of MapReduce to figure out how to distribute the processing of the SQL over the underlying data
- Tez allows for an in-memory directed acyclic graph (DAG) of tasks for processing data
- Why Hive?
- Uses familiar SQL syntax (HiveQL)
- Interactive UI to run HiveQL
- Scalable: works with big data on a cluster (suitable for data warehouse applications)
- Easy OLAP queries (way easier than writing MapReduce in Java)
- Highly optimized
- Highly extensible
- User defined functions (extend HiveQL)
- Thrift server (allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results)
- JDBC / ODBC driver
Hive Metastore:
- Hive maintains a Metastore that imparts a structure I define on the unstructured data that is stored on HDFS or EMRFS (image)
- Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database
- Metastore is stored in MySQL on the master node by default
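- A minimal PySpark sketch of querying a table whose schema lives in the Hive metastore (on EMR this can be the default MySQL metastore or the Glue Data Catalog); the table and column names are hypothetical:
```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL resolve table names through the Hive metastore
spark = SparkSession.builder.appName("hive-query").enableHiveSupport().getOrCreate()

# 'sensor_data' is assumed to already be defined in the metastore
spark.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM sensor_data
    GROUP BY device_id
""").show()
```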
External Hive Metastores: (image)
- External metastores offer better resiliency / integration
- AWS Glue Data Catalog:
- Shares schema across EMR and other AWS services
- Tie Glue to EMR using the console, CLI, or API
- Amazon RDS / Aurora:
- Need to override default Hive configuration values for external database location
Other Hive / AWS integration points: (image)
- Load table partitions from S3
- Store data in folders by day, month, year => translated into table partitions
- ALTER TABLE RECOVER PARTITIONS => import tables concurrently into many clusters without having to maintain a shared metadata store
- Write tables directly to S3 using Hive extensions on EMR
- Load scripts from S3
- DynamoDB as an external table
- Use Hive to analyze the DynamoDB data and either load the results back into DynamoDB or archive them into S3
- R/W access
- Copy to/ from HDFS or EMRFS
- Perform JOINs on DynamoDB
Apache Pig on EMR:
- Writing mappers and reducers by hand takes a long time (MapReduce)
- Pig Latin: a scripting language that lets me use SQL-like syntax to define my map and reduce steps
- Highly extensible with user defined functions
- How Pig Works? (image)
- Pig / AWS Integration:
- Ability to use multiple file systems
- HDFS
- Query S3 data through EMRFS
- Load JARs and scripts from S3
Apache HBase on EMR:
- Non-relational, petabyte scale database
- Based on Google’s BigTable, on top of HDFS
- In-memory
- Hive integration
- HBase / AWS integration:
- Can store data (StoreFiles and metadata) on S3 via EMRFS
- Can back up HBase data to S3 on EMR
- Apache HBase is a column-oriented, NoSQL database built on top of HDFS
HBase vs DynamoDB:
- Both are NoSQL databases intended for the same sorts of things
- But if you’re all in with AWS anyhow, DynamoDB has advantages:
- Fully managed (auto-scaling)
- More integration with other AWS services
- Glue integration
- HBase has some advantages though:
- Efficient storage of sparse data
- Appropriate for high frequency counters (consistent reads & writes)
- High write & update throughput
- More integration with Hadoop
Apache Hadoop: (image)
- MapReduce:
- Framework for distributed data processing
- Maps data to key/value pairs
- Reduces intermediate results to final output
- Largely supplanted by Spark these days
- Yet Another Resource Negotiator (YARN):
- Manages cluster resources for multiple data processing frameworks
- Hadoop Distributed File System (HDFS):
- Distributes data blocks across instances in a cluster in a redundant manner
- Ephemeral in EMR; data lost on termination
Presto on EMR:
- Presto (or PrestoDB) is an open source, distributed SQL query engine for fast analytic queries against data of any size. It supports both non-relational sources, such as the HDFS, S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Redshift, Microsoft SQL Server, and Teradata
- Can connect to many different Big data databases and data stores at once, and query across them
- Interactive queries at petabyte scale
- Familiar SQL syntax
- Optimized for OLAP analytical queries, data warehousing
- Developed, and still partially maintained by Facebook
- This is what Amazon Athena uses under the hood
- Exposes JDBC, Command Line, and Tableau interfaces
- Not suitable for OLTP or batch processing
Zeppelin:
- If you’re familiar with iPython (Jupyter) notebooks, it’s like that:
- Lets me interactively run Python scripts / code against my data
- Can interleave with nicely formatted notes
- Can share notebooks with others on your cluster
- Spark, Python, JDBC, HBase, Elasticsearch
- Zeppelin + Spark integration:
- Can run Spark code interactively (like I can in the Spark shell)
- Speeds up your development cycle
- Allows easy experimentation and exploration of your big data
- Can execute SQL queries directly against SparkSQL
- Query results may be visualized in charts and graphs
- Makes Spark feel more like a data science tool
- EMR Notebook:
- Similar concept to Zeppelin, with more AWS integration
- Notebooks backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed only via AWS console
Hue:
- Hadoop User Experience
- Graphical front-end for applications that run on my cluster, allowing me to interact with applications using an interface that may be more familiar or user-friendly
- IAM integration: Hue Super users inherit IAM roles
- S3: Can browse & move data between HDFS, EMRFS and S3
- A management tool with a web-based front-end dashboard for the entire cluster
Splunk:
- Splunk / Hunk makes machine data accessible, usable, and valuable to everyone
- An operational tool that can be used to visualize EMR and S3 data using your EMR Hadoop cluster
- Reserved instances on a 64-bit OS recommended (a public AMI of Splunk Enterprise is available)
Flume:
- Another way to stream log data into my cluster
- A web server acts as an external source that provides events to a Flume Source
- Events are stored in one or more Channels (a passive store that keeps an event until it is consumed by a Flume Sink)
- The Flume Sink removes the event from the Channel and places it into an external repository such as HDFS on the EMR cluster
- Made from the start with Hadoop in mind (built-in sinks for HDFS and HBase)
- Originally made to handle log aggregation
MXNet:
- Like TensorFlow, a library for building and accelerating neural networks
- Included on EMR
- A framework used to build deep learning applications; a library that makes it easy to write deep learning jobs distributed across the entire EMR cluster
S3DistCP:
- Tool for copying large amounts of data
- From S3 into HDFS
- From HDFS into S3
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects (Across buckets, across accounts)
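- A boto3 sketch of running S3DistCp as an EMR step to copy HDFS output into S3; the cluster ID, HDFS path, and bucket are placeholders:
```python
import boto3

emr = boto3.client("emr")

# Copy job output from HDFS to S3 in parallel using the s3-dist-cp command on the cluster
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # placeholder cluster ID
    Steps=[{
        "Name": "Copy output to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", "hdfs:///output", "--dest", "s3://my-bucket/output/"],
        },
    }],
)
```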
Other EMR/ Hadoop Tools:
- Ganglia (Monitoring tool pre-installed on EMR. Monitoring the cluster status)
- Mahout (Machine learning library on the EMR cluster)
- Accumulo (another NoSQL database)
- Sqoop (Relational database connector. Importing data from external database into my cluster in a scalable manner)
- HCatalog (table and storage management for Hive Metastore)
- Kinesis Connector (directly access Kinesis Streams from my scripts)
- Tachyon (accelerator for Spark)
- Derby (open source relational database in Java)
- Ranger (data security manager for Hadoop)
EMR Security:
- IAM policies:
- Grant or deny permissions
- Allow user actions
- Combine with tagging to control access per cluster
- IAM roles for EMRFS requests to S3 (control whether cluster users can access files from within EMR based on user, group, or the location of the EMRFS data within S3)
- Kerberos:
- Secure user authentication through secret key cryptography (network authentication protocol that ensures passwords or other credentials are not sent over the network in an unencrypted format)
- SSH:
- Secure connection to command line on cluster instances
- Tunneling for web interface (can view web interfaces that are hosted on my master node of my cluster from outside of the cluster itself)
- Can use Kerberos or EC2 key pairs
- Can encrypt data in transit
- IAM roles:
- Control access to EMRFS data based on user, group, location of data
- Each cluster must have a service role and a role for the EC2 instance profile (we use an instance profile to pass an IAM role to an EC2 instance)
- IAM policies attached to roles
- Auto scaling role
- Service linked roles
- Block public access:
- Easy way to prevent public access to data stored on my EMR cluster
- Can set at the account level before creating the cluster
Choosing EMR Instance Types:
- Master node:
- m5.xlarge if < 50 nodes, m4.xlarge if > 50 nodes
- Core & Task nodes:
- m5.xlarge is usually good
- If the cluster waits a lot on external dependencies (e.g. a web crawler), t2.medium
- Improved performance: m4.xlarge
- Computation-intensive applications: high-CPU instances
- Database, memory-caching applications: high-memory instances
- Network / CPU intensive (NLP, ML): cluster compute instances
- Spot instances:
- Good choice for task nodes
- Only use on core & master nodes if testing or very cost-sensitive; I am risking partial data loss
Lambda
Features:
- Serverless way to run code snippets in the cloud
- Often used to process data as it is moved around
- Use cases:
- Real time file processing
- Real time stream processing
- ETL
- CRON replacement
- Process AWS events
- Supported languages: Node.js, Python, Java, C#, Go, Powershell, Ruby
- High availability
- No scheduled downtime
- Retries failed code 3 times
- Continuous scaling
- Safety throttle of 1,000 concurrent executions per region
- High performance
- New functions callable in seconds
- Events processed in milliseconds
- Code is cached automatically
- Timeout max is 900 seconds (15 min)
- Stateless
Lambda triggers: (image)
- Examples:
- A change in a DynamoDB table can emit event data that invokes a Lambda function, allowing real-time, event-driven processing of data arriving in DynamoDB tables
- Devices send data to the IoT service, which can then invoke Lambda to process it; Lambda also integrates with Kinesis Data Firehose, where it can transform incoming data and Firehose delivers the transformed data to S3, Redshift, Elasticsearch, or Splunk
- More examples:
- Serverless website (image)
- Order history app (image)
- Transaction rate alarm (image)
- S3 + Lambda + Amazon Elasticsearch Service (image)
- S3 + Lambda + Data Pipeline (image)
- S3 + Lambda + Redshift + DynamoDB (use COPY command to load data into Redshift) (image)
- Lambda + Kinesis Data Stream:
- Lambda reads records (polling) from a Kinesis stream and processes them accordingly
- Lambda code receives an event with a batch of stream records
- Specify a batch size when setting up the trigger (up to 10,000 records)
- Too large a batch size can cause timeouts
- Batches may also be split beyond Lambda’s payload limit (6 MB)
- Lambda will retry the batch until it succeeds or the data expires
- This can stall the shard if you don’t handle errors properly
- Use more shards to ensure processing isn’t totally held up by errors
- Lambda processes shard data synchronously
- Need to set up IAM role for Lambda to access other AWS services
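- A minimal Lambda handler sketch for a Kinesis trigger; the payload handling is illustrative only:
```python
import base64
import json

def lambda_handler(event, context):
    # Lambda invokes the handler with a batch of stream records
    for record in event["Records"]:
        # Kinesis record data is base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        # ... process / transform the record here (e.g. write to S3 or DynamoDB) ...
        print(data)

    # Raising an exception instead makes Lambda retry the whole batch,
    # which can stall the shard if errors are not handled
    return {"batchSize": len(event["Records"])}
```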
Cost Model:
- Pay for what I use (the number of requests sent to Lambda and the memory × duration consumed by the function)
- Generous free tier (1 million requests / month, 400K GB-seconds of compute time)
- $0.20 / million requests
- $0.00001667 per GB-second
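- Worked example (hypothetical numbers): 5 million invocations / month at 1 GB memory for 1 second each => compute = 5M GB-seconds − 400K free ≈ 4.6M × $0.00001667 ≈ $76.68; requests = (5M − 1M free) × $0.20 / million = $0.80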
Anti-patterns:
- Long-running applications: use EC2 instead, or a chain of Lambda functions
- Dynamic websites: although Lambda can be used to develop serverless apps that rely on client-side AJAX
- Stateful applications: but I can use DynamoDB or S3 to keep track of state
AWS Data Pipeline:
- A web service that helps me reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals
- Destinations include S3, RDS, DynamoDB, Redshift, and EMR
- Manages task dependencies
- Retries and notifies on failures (retry 3 times)
- Cross region pipelines
- Precondition checks, for example:
- DynamoDBDataExists => A precondition to check that data exists in a DynamoDB table
- DynamoDBTableExists => A precondition to check that the DynamoDB table exists
- S3KeyExists => Checks whether a key exists in an Amazon S3 data node
- S3PrefixExists => Checks for at least one file existing within a specific path
- Data sources may be on premises (need to install Task Runner)
- Highly available
- Data Pipeline Activities (actions):
- EMR (spin up EMR instance, run sequence of steps and automatically terminate the cluster when it is done)
- Hive (runs a Hive query on an EMR cluster)
- Copy (copy data between S3 and JDBC data source or run SQL query and copy the output into S3)
- SQL
- Scripts (run Linux shell command or programs)
AWS Step Functions:
- A low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications
- Use to design workflows
- Easy visualizations
- Advanced Error Handling and Retry mechanism outside the code
- Audit of the history of workflows
- Ability to Wait for an arbitrary amount of time
- Max execution time of a State Machine is 1 year
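- A boto3 sketch of creating a tiny state machine with a Wait state and a retried Lambda task; the function ARN, role ARN, and state names are placeholders:
```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A minimal Amazon States Language definition: wait, then invoke a Lambda function with retries
definition = {
    "StartAt": "WaitForData",
    "States": {
        "WaitForData": {"Type": "Wait", "Seconds": 60, "Next": "ProcessData"},
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 10, "MaxAttempts": 3}],
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",   # placeholder role
)
```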