Leader node - query planning and results aggregation
Compute node - performs queries and sends results to the leader
Automated: takes incremental snapshots that track changes to the cluster since the previous automated snapshot; by default a snapshot is taken every 8 hours or after every 5 GB of data change per node, but you can modify this schedule; the retention period is 1 day by default and can be raised to a maximum of 35 days; to keep an automated snapshot for longer, create a copy of it as a manual snapshot (sketch below)
Manual: retained until you manually delete it or until the end of its retention period
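A minimal boto3 sketch of both operations above; the cluster name, snapshot identifiers, and retention values are placeholders, not values from these notes.

```python
import boto3

redshift = boto3.client("redshift")

# Raise the automated snapshot retention period (1 day by default, 35 days max)
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",
    AutomatedSnapshotRetentionPeriod=35,
)

# Keep an automated snapshot longer by copying it to a manual snapshot
redshift.copy_cluster_snapshot(
    SourceSnapshotIdentifier="rs:example-cluster-2024-01-01-00-00-00",
    TargetSnapshotIdentifier="example-cluster-manual-copy",
    ManualSnapshotRetentionPeriod=90,  # days; omit to retain until manually deleted
)
```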
Source Region: create an encrypted snapshot with KMS Key A
Snapshot copy grant: enables Redshift to encrypt the copied snapshots with a KMS key in the Destination Region
Source Region: copy to the Destination Region (sketch below)
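A minimal boto3 sketch of the grant plus cross-Region copy setup; the Regions, cluster name, grant name, and KMS key ARN are placeholders.

```python
import boto3

# Create the snapshot copy grant for a KMS key in the destination Region
destination = boto3.client("redshift", region_name="eu-west-1")
destination.create_snapshot_copy_grant(
    SnapshotCopyGrantName="example-copy-grant",
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/EXAMPLE-KEY-ID",
)

# Enable cross-Region snapshot copy on the source cluster, referencing the grant
source = boto3.client("redshift", region_name="us-east-1")
source.enable_snapshot_copy(
    ClusterIdentifier="example-cluster",
    DestinationRegion="eu-west-1",
    SnapshotCopyGrantName="example-copy-grant",
    RetentionPeriod=7,  # days copied automated snapshots are kept in the destination
)
```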
Columnar storage of data (instead of row-based)
Massively Parallel Query Execution (MPP): distributes queries across many nodes of the cluster
Workload Management (WLM): define multiple queues with different priorities and route queries to the appropriate queue
Automatic WLM - queues and resources managed by Redshift
Manual WLM - queues and resources managed by the user
Load data from S3, Kinesis Firehose, DynamoDB, DMS, ... (COPY sketch below)
Enhanced VPC Routing: COPY / UNLOAD traffic goes through the VPC
Redshift Spectrum: query data in S3 without loading it into the cluster (the query is submitted to thousands of Redshift Spectrum nodes)
Concurrency Scaling:
Automatically adds additional cluster capacity
Supports both read and write SQL statements
Ability to decide which queries are sent to the Concurrency Scaling cluster using WLM
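A minimal sketch of loading data from S3 with COPY, issued through the Redshift Data API; the cluster, database, user, table, bucket, and IAM role are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://example-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS PARQUET
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(resp["Id"])  # statement ID; poll describe_statement() to check completion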
Loading DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input to an Amazon EMR cluster
Querying live DynamoDB data using SQL-like statements (HiveQL) - see the mapping sketch below
Joining data stored in DynamoDB and exporting it, or querying against the joined data
Exporting data stored in DynamoDB to Amazon S3 | DynamoDB -> S3
Importing data stored in Amazon S3 into DynamoDB | S3 -> DynamoDB
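A hedged HiveQL sketch (kept as a Python string to match the other examples) of how a DynamoDB table is typically mapped into Hive on EMR; the table, columns, and attribute names are placeholders.

```python
# HiveQL to run on the EMR cluster; follows the documented
# DynamoDBStorageHandler pattern, with placeholder table/column names.
dynamodb_hive_mapping = """
CREATE EXTERNAL TABLE ddb_orders (order_id string, amount double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "Orders",
    "dynamodb.column.mapping" = "order_id:OrderId,amount:Amount"
);

-- Query live DynamoDB data (or export it to an S3-backed external table)
SELECT order_id, amount FROM ddb_orders WHERE amount > 100;
"""
```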
Add steps: up to 256 pending/running steps per cluster (sketch below)
Interactively submit Hadoop jobs on the primary node: even if you already have 256 pending/running steps, you can still submit jobs interactively on the primary node
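A minimal boto3 sketch of adding a step to a running cluster; the cluster ID, step name, and script location are placeholders.

```python
import boto3

emr = boto3.client("emr")

response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Run a PySpark script stored in S3
                "Args": ["spark-submit", "s3://example-bucket/scripts/job.py"],
            },
        }
    ],
)
print(response["StepIds"])
```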
HDFS - Hadoop Distributed File System, used for temporary storage while processing; provides high performance; backed by EBS volumes (local filesystem) or instance store
EMRFS - native S3 integration; S3-backed permanent storage with server-side encryption
DynamoDB - integrated with Apache Hive
Primary node: manages the cluster, coordinates, manages health – long running
Core node: runs tasks and stores data – long running
Task node (optional): just runs tasks – usually Spot
Purchasing options:
On-demand: reliable, predictable, won't be terminated
Reserved (min 1 year): cost savings (EMR will automatically use them if available)
Spot Instances: cheaper, can be terminated, less reliable
An EMR cluster is launched in a VPC and in a single AZ
Long-running vs. transient (temporary) clusters
When you choose an instance type using the AWS Management Console, the number of vCPUs shown for each instance type is the number of YARN vCores for that instance type, not the number of EC2 vCPUs
Instance fleets and instance groups cannot coexist in a cluster
Instance groups: a single instance type and purchasing option (On-Demand vs. Spot) per group (sketch below)
Has auto scaling
Up to 48 task instance groups
The configuration you choose applies to all nodes in the group and for the lifetime of the cluster
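A minimal boto3 sketch of launching a cluster with uniform instance groups (one instance type and market per group, Spot for the task group); names, types, counts, and roles are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="example-instance-groups-cluster",
    ReleaseLabel="emr-6.15.0",
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```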
Instance fleets: set a target capacity and mix instance types and purchasing options (# of On-Demand units vs. # of Spot units) - sketch below
No custom auto scaling
AWS Console: up to 5 instance types per node type
CLI / API: up to 30 instance types per node type
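A sketch of the equivalent Instances configuration using instance fleets; it replaces the InstanceGroups block in the previous example, since the two cannot coexist. Capacities and instance types are placeholders.

```python
# Passed as the Instances parameter of emr.run_job_flow(...)
instance_fleets_config = {
    "InstanceFleets": [
        {"Name": "primary", "InstanceFleetType": "MASTER",
         "TargetOnDemandCapacity": 1,
         "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
        {"Name": "core", "InstanceFleetType": "CORE",
         # Mix On-Demand and Spot units across several instance types
         "TargetOnDemandCapacity": 2, "TargetSpotCapacity": 2,
         "InstanceTypeConfigs": [
             {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
             {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
         ]},
    ],
    "KeepJobFlowAliveWhenNoSteps": True,
}
```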
RANDOM_CUT_FOREST
SQL function used for anomaly detection on numeric columns in a stream (sketch below)
Adapts over time: it only uses recent history to compute the model
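A hedged sketch (carried as a Python string, to match the other examples) of how RANDOM_CUT_FOREST is typically used in the legacy Kinesis Data Analytics SQL dialect; stream and column names are placeholders.

```python
# In-application SQL: each output row gets an ANOMALY_SCORE computed by the
# random cut forest model over the incoming numeric columns.
anomaly_detection_sql = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "metric_value"  DOUBLE,
    "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "metric_value", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
"""
```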
HOTSPOTS
Locates and returns information about relatively dense regions in your data
Default 24 hours; can be extended up to 7 days; long-term retention up to 365 days
In an exam scenario referring to a serverless SQL solution, go for Athena
You can use Athena to query logs stored in S3 buckets
Commonly used with QuickSight
Use Apache Parquet or ORC: columnar data for cost savings (less data scanned); use Glue to convert your data to Parquet or ORC
Compress data for smaller retrievals
Partition datasets in S3 for easy querying on virtual columns (query sketch below)
Example: s3://athena-examples/flight/parquet/year=1991/month=1/day=1/
Use larger files (> 128 MB): retrieving fewer, larger files is more efficient than retrieving many small files
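A minimal boto3 sketch of an Athena query that prunes partitions like those in the example path; the database, table, columns, and results bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT origin, dest, dep_delay
        FROM flights
        WHERE year = 1991 AND month = 1 AND day = 1  -- partition pruning limits the scan
    """,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() for status
```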
Allows you to run SQL queries across data stored in relational, non-relational, object storage, and custom data sources (in AWS and on-premises)
Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, Aurora, ElastiCache, DocumentDB, HBase on EMR, ...)
Results are stored in S3
Data cleaning steps in DataBrew are stored as a recipe
A recipe is connected to a project by default
An existing recipe with no associated project can also be applied to a project and its datasets
A visual data preparation tool for cleaning and normalizing data to prepare it for analytics and ML
Explore data in columns with 40+ quality statistics to find anomalies/patterns
250+ ready-made transformations (e.g., filtering anomalies, converting data, correcting invalid values, ...)
Data sources include S3, Redshift, Aurora, the Glue Data Catalog, ...
Visual ETL - generates ETL code in Python or Scala (script sketch below)
Notebook
Script editor
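A minimal sketch of the kind of PySpark script Glue generates and runs as a job; it relies on the awsglue libraries available inside a Glue job (not locally), and the job, database, table, and bucket names are placeholders.

```python
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init("example-etl-job")

# Read from the Glue Data Catalog, write back to S3 as Parquet
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events"
)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)
job.commit()
```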
A fully managed Business Intelligence (BI) data visualization service
It allows you to create dashboards and share them within your company
The first step of building a data lake is ingesting data from a variety of sources
Catalog the data
The data is then enriched, combined, and cleaned before analysis
This makes it easy to discover and analyse the data with direct queries, visualisation, and machine learning (ML)
AWS Glue, Amazon Athena, Amazon Redshift, Amazon QuickSight
Amazon EMR using Zeppelin notebooks with Apache Spark
Enables users to search, subscribe to, and use third-party data
Provides a central catalog where:
Providers publish their data products
Subscribers can search for and subscribe to data products
The Open Data on AWS program provides publicly available datasets that can be accessed with or without an AWS account
Data grant - the unit of exchange in AWS Data Exchange, created by a data sender (provider) to grant a data receiver (subscriber) access to a data set
The data sender creates a data grant; a grant request is sent to the data receiver, who can then accept it to gain access
Made up of: a data set, data grant details, and recipient access details
Data set types: Files, API, Redshift, S3, AWS Lake Formation
Product - the unit of exchange in AWS Marketplace, published by a provider and made available for use by subscribers
AWS Data Exchange API - use the API operations to create, view, update, and delete data sets and revisions (sketch below)
AWS Marketplace Catalog API - use the API operations to view and update data products published to AWS Marketplace
A product has details, offers, and data sets
Receiver (Subscriber):
All data products on AWS Data Exchange are subscription-based
If a data provider decides to unpublish a data product, you will still have access to the data sets as long as your subscription to that product is active
Data providers may require you to verify your subscription and provide additional information before you can access their products
If your AWS account is part of an organization, you can share your AWS Data Exchange product licenses with the other accounts in that organization
Sender (Provider):
Can give access to products that are not publicly released
An offer is created to make a data product available
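A minimal provider-side boto3 sketch of the AWS Data Exchange API, creating a data set and a revision to hold assets; the names, description, and the S3_SNAPSHOT asset type are placeholders/assumptions.

```python
import boto3

dx = boto3.client("dataexchange")

data_set = dx.create_data_set(
    AssetType="S3_SNAPSHOT",  # file-based data delivered via S3
    Name="example-weather-data",
    Description="Example data set; names are placeholders",
)

revision = dx.create_revision(
    DataSetId=data_set["Id"],
    Comment="Initial revision",
)
print(revision["Id"])
```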
S3 – AWS Data Exchange allows providers to import and store data files from their S3 buckets or to directly provide access to and use of S3 buckets
API Gateway – Data recipients can call the API programmatically, call the API from the AWS Data Exchange console, or download the OpenAPI specification
Redshift – Data recipients can get read-only access to query the data in Amazon Redshift without extracting, transforming, and loading it
AWS Marketplace – AWS Data Exchange allows data sets to be published as products in AWS Marketplace
AWS Lake Formation – Data recipients get access to data stored in an AWS Lake Formation data lake and can query, transform, and share access to this data from their own AWS Lake Formation
Open source: Flink, Beam, and Zeppelin notebooks
Real-time querying and analysis of streaming data
Java, Scala, Python, and SQL
Sources: Kinesis Data Streams, Amazon MSK
Sinks: Kinesis Data Streams, Kinesis Data Firehose, S3, DynamoDB, OpenSearch, CloudWatch, Glue Schema Registry, custom connectors
Provisions and configures your Flink clusters
Orchestrates Flink job management (deployment sketch at the end of this section)
Apache Flink APIs and Apache Flink Studio notebooks
Serverless
Low latency
High throughput
Exactly-once processing
Stateful processing - stores state (previous and in-progress computations) in running application storage
Durable application backups via checkpoints and snapshots
Streaming ETL
Continuous metric generation
Responsive real-time analytics - real-time alarms when metrics reach thresholds or when your application detects anomalies
Interactive analysis of data streams - stream data exploration in real time
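A minimal deployment sketch via boto3, assuming these notes describe Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink); the application name, runtime version, role ARN, bucket, and jar key are placeholders.

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

# The service provisions the Flink cluster and runs the packaged job from S3
kda.create_application(
    ApplicationName="example-flink-app",
    RuntimeEnvironment="FLINK-1_15",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/example-flink-role",
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::example-bucket",
                    "FileKey": "jars/example-flink-app.jar",
                }
            },
            "CodeContentType": "ZIPFILE",
        }
    },
)
```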