Data Processing
AWS Glue
Features:
- Glue is a serverless, fully managed ETL service that crawls my data, builds a Data Catalog, and performs data preparation, transformation, and ingestion to make my data immediately queryable
- Serves as a central metadata repository for my data lake in S3
- Able to discover schemas or table definitions and publish them for use with analysis tools such as Athena, Redshift, or EMR
Glue Crawler and Data Catalog: (image)
- Glue Crawler:
- Glue Crawler (component of Glue) scans my data in S3
- Crawler also infers a schema automatically based on the structure of the data it finds in the S3 buckets
- Can schedule Crawler to run periodically
- Data Catalog:
- Crawler populates Glue Data Catalog (central metadata repository)
- Data Catalog is used by other AWS tools (Redshift, Athena, EMR) that might analyze that data
- Data itself remains where it was originally, in S3
- Only the table definition itself (column names, column data types) is stored in the Data Catalog; it tells other services how to interpret that data and how it is structured (a minimal boto3 crawler sketch follows below)
- Populating the AWS Glue Data Catalog (image)
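- A minimal boto3 sketch of creating and scheduling a crawler that populates the Data Catalog; the role ARN, database, bucket, and schedule below are placeholders, not values from the notes:
```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix daily and register the inferred tables in the Data Catalog
glue.create_crawler(
    Name="sensor-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="sensor_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sensor-data/"}]},
    Schedule="cron(0 2 * * ? *)",                            # every day at 02:00 UTC
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)

# Run it immediately instead of waiting for the schedule
glue.start_crawler(Name="sensor-data-crawler")
```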
- Glue and S3 Partitions: (image)
- Crawler will extract partitions based on how my S3 data is organized
- Think up front about how I will be querying my data lake in S3
- e.g. Devices send sensor data every hour
- organize my buckets as yyyy/mm/dd/device if I query primarily by time ranges
- organize my buckets as device/yyyy/mm/dd if I query primarily by device
Glue + Hive:
- Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale
- Hive runs on Elastic MapReduce (EMR)
- Hive allows me to issue SQL-like queries (HiveQL) on data accessible to my EMR cluster
- Glue can integrate with Hive
- Can configure Glue Data Catalog as the metastore for Hive
- Can import a Hive metastore into Glue
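- A sketch of the EMR configuration classification that points Hive at the Glue Data Catalog; this "hive-site" setting is passed as the cluster's Configurations when launching EMR (for example in the run_job_flow sketch under EMR Cluster below):
```python
# Cluster configuration snippet: the "hive-site" classification tells Hive's
# metastore client to use the AWS Glue Data Catalog instead of a local metastore.
glue_metastore_config = [{
    "Classification": "hive-site",
    "Properties": {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    },
}]
```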
Glue ETL:
- Transform, Clean, Enrich Data before doing analysis
- Can automatically generate the code to perform that ETL after I define, in a graphical manner, the transformations I want to apply to my data
- Generates ETL code in Scala or Python; I can modify the code
- Can provide my own Spark or PySpark scripts (choose to start from scratch)
- Target (output) of ETL job can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
- Fully managed, cost effective, pay only for the resources consumed
- Jobs are run on a Serverless Spark platform
- Store transformed data in an encrypted manner:
- Use SSE to encrypt data at rest
- Use SSL for data in transit
- Custom ETL jobs on my data can be trigger-driven, scheduled, or on-demand
- Event-driven: as Glue finds new data in S3, it can trigger jobs to transform that data into a more structured format for later processing
- Glue Scheduler to schedule the jobs
- Can provision additional Data Processing Units (DPUs) to increase the performance of the underlying Spark jobs that run my ETL jobs
- ETL job errors can be reported to CloudWatch (further integrate with SNS to notify me of those errors automatically)
- DynamicFrame:
- Very much like a Spark DataFrame, but with more ETL-oriented features
- Is a collection of DynamicRecords
- DynamicRecords are self-describing and have a schema
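- A minimal Glue ETL script sketch using DynamicFrames; the database, table, field, and bucket names are assumptions for illustration only:
```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Boilerplate every Glue ETL script starts with
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a Data Catalog table (populated by the crawler) as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="sensor_db", table_name="sensor_data")

# Rename / cast fields; each mapping is (source, source type, target, target type)
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("device", "string", "device_id", "string"),
        ("ts", "long", "timestamp", "long"),
        ("temp", "double", "temperature", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
)

job.commit()
```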
Glue ETL Transformations:
- Bundled Transformations:
- DropFields, DropNullFields: remove (null) fields
- Filter: specify a function to filter records
- Join: to enrich data
- Map: add fields, delete fields, perform external lookups
- Machine Learning Transformations:
- FindMatches ML: identify duplicate or matching records in my dataset, even when the records do not have a common unique identifier and no fields match exactly
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformations (e.g. K-Means)
- Can convert between Spark DataFrame and Glue DynamicFrame
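- Continuing the sketch above (it reuses the hypothetical `mapped` frame and `glueContext`), a few bundled transformations plus a round-trip through a Spark DataFrame:
```python
from awsglue.transforms import Filter, DropNullFields
from awsglue.dynamicframe import DynamicFrame

# Keep only records with a usable temperature reading
filtered = Filter.apply(
    frame=mapped,
    f=lambda rec: rec["temperature"] is not None and rec["temperature"] > 0,
)

# Drop any fields that are entirely null
cleaned = DropNullFields.apply(frame=filtered)

# Round-trip through a Spark DataFrame for anything Glue doesn't bundle,
# then convert back to a DynamicFrame
df = cleaned.toDF()
df = df.dropDuplicates(["device_id", "timestamp"])
deduped = DynamicFrame.fromDF(df, glueContext, "deduped")
```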
AWS Glue Development Endpoints:
- Develop ETL scripts using a notebook
- Create an ETL job that runs my script (using Spark and Glue)
- Endpoint is in a VPC controlled by security groups, connect via:
- Apache Zeppelin on my local machine
- Zeppelin notebook server on EC2 (via Glue console)
- SageMaker notebook
- Terminal window
- PyCharm professional edition
- Use Elastic IPs to access a private endpoint address
Running Glue jobs:
- Glue Scheduler - Time-based schedules
- Can define a time-based schedule for my crawlers and jobs in Glue
- Definition of these schedules uses Unix-like cron syntax (a boto3 sketch follows at the end of this section)
- Minimum precision for a schedule is 5 minutes
- Job bookmarks:
- Glue tracks data that has been processed during a previous run of an ETL job by storing state information from the job run
- This persisted state information is called a Job Bookmark
- Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data
- Allows me to process new data only when re-running on a schedule
- Works with S3 sources in a variety of formats
- Works with relational databases via JDBC (if primary keys are in sequential order)
- Only handles new rows, not updated rows
- Can integrate with CloudWatch Events:
- Fire off a Lambda function or SNS notification when ETL succeeds or fails
- Invoke an EC2 Run Command, send the event to Kinesis, or activate a Step Function to move on to the next stage in the pipeline once the data has been transformed using Glue ETL
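- The boto3 sketch referenced above: a scheduled job with job bookmarks enabled; the job name, role ARN, script location, and cron expression are placeholders:
```python
import boto3

glue = boto3.client("glue")

# '--job-bookmark-option' enables job bookmarks so only new data is processed on each run
glue.create_job(
    Name="sensor-etl",
    Role="arn:aws:iam::123456789012:role/GlueETLRole",        # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/sensor_etl.py"},
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# Time-based trigger using Unix-like cron syntax (UTC); runs every 30 minutes
glue.create_trigger(
    Name="sensor-etl-every-30-min",
    Type="SCHEDULED",
    Schedule="cron(0/30 * * * ? *)",
    Actions=[{"JobName": "sensor-etl"}],
    StartOnCreation=True,
)
```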
Glue cost model:
- Billed by the minute for crawler and ETL jobs
- First million objects stored and accessed are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute
Glue Anti-patterns:
- Multiple ETL engines:
- Glue ETL is based on Spark
- If you want to use other engines (Hive, Pig), Data Pipeline + EMR would be a better fit
AWS Glue Studio:
- Visual interface for ETL workflows
- Visual job editor
- Create directed acyclic graph for complex workflows
- Sources include S3, Kinesis, Kafka, JDBC
- Transform / sample / join data
- Target to S3 or Glue Data Catalog
- Support partitioning
- Visual job dashboard
- Overviews, status, run times
AWS Glue DataBrew:
- A visual data preparation tool
- UI for pre-processing large datasets
- Input from S3, data warehouse, or database
- Output to S3
- Over 250 ready-made transformations
- Create “recipes” of transformations that can be saved as jobs within a larger project
- Security:
- Can integrate with KMS (with customer master keys only)
- SSL in transit
- IAM can restrict who can do what
- CloudWatch & CloudTrail
AWS Glue Elastic Views:
- Builds materialized views from Aurora, RDS, DynamoDB
- Those views can be used by Redshift, Elasticsearch, S3, DynamoDB, Aurora, RDS
- SQL interface
- Handles any copying or combining / replicating data needed
- Monitors for changes and continuously updates
- Serverless
AWS Lake Formation: (image)
- Built on top of Glue
- Makes it easy to set up a secure data lake in days
- Loading data & monitoring data flows
- Setting up partitions
- Encryption & managing keys
- Defining transformation jobs & monitoring them
- Access control
- Auditing
- Pricing:
- No cost for Lake Formation itself but underlying services incur charges (Glue, S3, EMR, Athena, Redshift)
- AWS Lake Formation Building a Data Lake (image)
- Cross-account Lake Formation permissions:
- Recipient must be set up as a data lake administrator
- Can use AWS Resource Access Manager for accounts external to your organization
- IAM permissions for cross account access
- Lake Formation does not support manifests in Athena or Redshift queries
- IAM permissions on the KMS encryption key are needed for encrypted data catalogs in Lake Formation
- IAM permissions needed to create blueprints and workflows
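- A minimal Lake Formation permission grant via boto3; the principal ARN, database, and table names are placeholders:
```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a catalog table to an IAM principal
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```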
Elastic MapReduce (EMR):
- Managed Hadoop framework on EC2 instances
- Includes tools like Spark, HBase, Presto, Flink, Hive & more
- EMR Notebooks: query data on EMR cluster using Python from web browser
- Offers multiple integration points with other AWS services
EMR Cluster: (image)
- Master node / Leader node:
- Manages the cluster
- Tracks status of tasks, monitors cluster health
- Single EC2 instance
- Can be a single node cluster
- Core node:
- Hosts HDFS data and runs tasks
- Can be scaled up & down, but with some risk
- At least 1 core node in a multi-node cluster
- Task node:
- Runs tasks, does not host data
- Optional
- No risk of data loss when removing
- Good use of spot instances
- Add task nodes as needed as the traffic into my cluster increases
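- A boto3 sketch of launching a cluster with master, core, and Spot task instance groups; the names, counts, release label, and log bucket are placeholders:
```python
import boto3

emr = boto3.client("emr")

# Cluster layout mirrors the node roles above
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes on Spot: extra processing capacity, no HDFS, safe to lose
            {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,   # long-running cluster
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```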
EMR Usage:
- Transient Cluster:
- Transient clusters terminate once all steps are complete
- Loading data, processing, storing, then shut down
- Lower cost
- Long-Running Cluster:
- Long running clusters must be manually terminated
- Basically a data warehouse with periodic processing on large datasets
- Can spin up task nodes using Spot instances for temporary capacity
- Can use reserved instances on long running clusters to save $
- Termination protection on by default, auto termination off
- Frameworks and applications are specified at cluster launch
- Connect directly to master to run jobs directly OR
- Submit ordered steps via the console
- Process data in S3 or HDFS
- Output data to S3 or another data store
- Once defined, steps can be invoked via the console
EMR / AWS Integration:
- EC2 for the instances that comprise the nodes in the cluster
- VPC to configure the virtual network in which I launch my instances for security
- S3 to store input and output data
- IAM to configure permissions for clusters and integrate with other services
- CloudWatch to monitor cluster performance and configure alarms
- CloudTrail to audit requests made to the service
- Data Pipeline to schedule and start my clusters and shut them down
EMR Storage Options:
- HDFS:
- Hadoop Distributed File System
- Distributed scalable file system for Hadoop
- Multiple copies of data stored across cluster instances for redundancy
- Files stored as blocks (128MB default size)
- Ephemeral: HDFS data is lost when cluster is terminated
- Useful for caching intermediate results during MapReduce processing or workloads with significant random I/O
- Hadoop tries to process data where it is stored on HDFS
- EMRFS:
- Access S3 as if it were HDFS
- Allows persistent storage after cluster termination
- Use S3 for input/ output data
- Use HDFS to store intermediate results
- EMRFS Consistent View:
- Solves the consistency problem that can arise when many nodes in an EMR cluster try to write/read data in S3 at the same time
- Use DynamoDB to store object metadata and track consistency for S3
- May need to tinker with read/write capacity on DynamoDB
- Local file system:
- Suitable only for temporary data (buffers, caches)
- EBS for HDFS:
- Allows use of EMR on EBS-only instance types (M4, C4)
- Deleted when cluster is terminated
- EBS volumes can only be attached when launching a cluster
- If you manually detach an EBS volume, EMR treats that as a failure and replaces it
EMR promises:
- EMR charges by the hour + EC2 charges
- Provisions new nodes if a Core node fails
- Can resize a running cluster’s Core nodes (increases both processing and HDFS capacity)
- Core nodes can also be added or removed (but removing risks data loss)
- Can add and remove Tasks nodes on the fly (increase processing capacity but not HDFS capacity)
EMR Managed Scaling:
- EMR Automatic Scaling:
- Old way of doing it
- Custom scaling rules based on CloudWatch metrics
- Supports instance groups only
- EMR Managed Scaling:
- Supports instance groups and instance fleets
- Scales Spot, On-Demand, and Savings Plan instances within the same cluster
- Available for Spark, Hive, YARN workloads
- Scale-up Strategy:
- First adds core nodes, then task nodes, up to max units specified
- Scale-down Strategy:
- First removes task nodes, then core nodes, no further than minimum constraints
- Spot nodes always removed before On-Demand instances
EMR Tools
Apache Spark: (image)
- Distributed processing framework for big data
- In-memory caching, optimized query execution
- Supports Java, Scala, Python, and R
- Supports code reuse across
- Batch processing
- Interactive Queries (Spark SQL)
- Real time Analytics
- Machine Learning (MLLib)
- Graph Processing
- Spark Streaming:
- Integrates with Kinesis and Kafka on EMR
- Spark is NOT meant for OLTP or batch processing
- Can do streaming analytics in a fault-tolerant way and write the analytic results to HDFS or S3
- Can integrate Spark with Redshift (image)
- e.g. Airline flight data resides in an S3 data lake. Deploy Redshift Spectrum on top of that S3 data. Use the spark-redshift package to perform ETL on it; Redshift acts as a SQL data source for EMR Spark. Process the dataset in Spark and write it back to another Redshift table for further processing, such as machine learning
How Spark Works? (image)
- Spark apps are run as independent processes on a cluster
- SparkContext object (driver program) coordinates them
- SparkContext works through a Cluster Manager
- Executors run computations and store data
- SparkContext sends application code and tasks to executors
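- A minimal PySpark driver sketch: the driver program builds the SparkSession and the job's DAG, while executors on the cluster do the actual work; the S3 paths and column names are hypothetical:
```python
from pyspark.sql import SparkSession

# The driver creates the SparkSession/SparkContext; the cluster manager (YARN on EMR)
# hands work out to executors running on the core/task nodes.
spark = SparkSession.builder.appName("flight-delays").getOrCreate()

# Read a CSV dataset from S3; executors do the reading and parsing
flights = spark.read.option("header", "true").csv("s3://my-bucket/flights/")

# A simple aggregation; Spark builds a DAG and only executes it when an action (write) runs
delays = flights.groupBy("carrier").count()

delays.write.mode("overwrite").parquet("s3://my-bucket/output/delays/")
spark.stop()
```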
Spark Components:
- Spark Core:
- Memory management, fault recovery, scheduling, distribute & monitor jobs, interact with storage
- Supports APIs for Scala, Python, Java, and R at the lowest level
- Spark SQL:
- Distributed query engine that provides low-latency interactive queries up to 100x faster than MapReduce
- Includes a cost-based optimizer, columnar storage, and code generation for fast queries
- Supports various data sources (e.g. JDBC, ODBC, JSON, HDFS, ORC, Parquet) for importing data into Spark
- Supports querying Hive tables using HiveQL
- Dataset: a data structure in Spark SQL that is strongly typed and maps to a relational schema
- Spark Streaming:
- Integrates with Spark SQL to use Datasets
- Leverages Spark Core's fast scheduling capability to do real-time streaming analytics; it ingests data in mini-batches
- Structured streaming
- Supports data from a variety of streaming sources (e.g. Kafka, Flume, HDFS, ZeroMQ, Kinesis)
- Data received through Spark structured streaming will be added to the growing dataset (image)
- Integrate Kinesis Data Streams and Spark structured streaming (image)
- MLLib:
- A library of algorithms to do machine learning on data at large scale
- Algorithms provide the ability to do classification, regression, clustering, collaborative filtering, and pattern mining (a minimal MLlib sketch follows after this list)
- Can read data from HDFS, HBase, or any Hadoop data source, and from S3 on EMR
- Can write my MLlib applications in Scala, Java, Python, or R
- GraphX:
- Distributed Graph Processing framework
- A graph is the data structure (e.g. a graph of social network users with edges representing the relationships between them)
- Provides ETL capabilities, exploratory analysis, and iterative graph computation to enable users to interactively build and transform graph data structures at scale
- No longer widely used
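- The MLlib sketch referenced above: clustering a tiny, made-up dataset with KMeans (the feature names and values are invented for illustration):
```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-kmeans").getOrCreate()

# Hypothetical sensor readings with two numeric features
df = spark.createDataFrame(
    [(65.0, 0.20), (70.1, 0.25), (20.3, 0.90), (22.8, 0.85)],
    ["temperature", "vibration"],
)

# MLlib estimators expect a single vector column of features
features = VectorAssembler(
    inputCols=["temperature", "vibration"], outputCol="features"
).transform(df)

# Cluster the readings into two groups and show the assignments
model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).show()
```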
Apache Hive on EMR: (image)
- Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface
- SQL-like code (HiveQL) queries the underlying unstructured data that might live in Hadoop (HDFS/YARN) or S3
- EMR Hive sits on top of MapReduce to figure out how to distribute the processing of the SQL over the underlying data
- Tez allows for an in-memory directed acyclic graph (DAG) of tasks for processing data
- Why Hive?
- Uses familiar SQL syntax (HiveQL)
- Interactive UI to run HiveQL
- Scalable: works with big data on a cluster (suitable for data warehouse applications)
- Easy OLAP queries (way easier than writing MapReduce in Java)
- Highly optimized
- Highly extensible
- User defined functions (extend HiveQL)
- Thrift server (allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results)
- JDBC / ODBC driver
Hive Metastore:
- Hive maintains a Metastore that imparts a structure I define on the unstructured data that is stored on HDFS or EMRFS (image)
- Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database
- Metastore is stored in MySQL on the master node by default
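- A minimal PySpark sketch of querying a table whose schema lives in the Hive metastore (on EMR this can be the default MySQL metastore or the Glue Data Catalog); the table and column names are hypothetical:
```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL resolve table names through the Hive metastore
spark = SparkSession.builder.appName("hive-query").enableHiveSupport().getOrCreate()

# 'sensor_data' is assumed to already be defined in the metastore
spark.sql("""
    SELECT device_id, avg(temperature) AS avg_temp
    FROM sensor_data
    GROUP BY device_id
""").show()
```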
External Hive Metastores: (image)
- External metastores offer better resiliency / integration
- AWS Glue Data Catalog:
- Shares schema across EMR and other AWS services
- Tie Glue to EMR using the console, CLI, or API
- Amazon RDS / Aurora:
- Need to override default Hive configuration values for external database location
Other Hive / AWS integration points: (image)
- Load table partitions from S3
- Store data in folders by day, month, year => translated into table partitions
- ALTER TABLE RECOVER PARTITIONS => import tables concurrently into many clusters without having to maintain a shared metadata store
- Write tables directly to S3 using Hive extensions on EMR
- Load scripts from S3
- DynamoDB as an external table
- Use Hive to analyze the DynamoDB data and either load the results back into DynamoDB or archive them into S3
- R/W access
- Copy to/ from HDFS or EMRFS
- Perform JOINs on DynamoDB
Apache Pig on EMR:
- Writing mappers and reducers by hand takes a long time (MapReduce)
- Pig Latin: a scripting language that lets me use SQL-like syntax to define my map and reduce steps
- Highly extensible with user defined functions
- How Pig Works? (image)
- Pig / AWS Integration:
- Ability to use multiple file systems
- HDFS
- Query S3 data through EMRFS
- Load JARs and scripts from S3
Apache HBase on EMR:
- Non-relational, petabyte scale database
- Based on Google’s BigTable, on top of HDFS
- In-memory
- Hive integration
- HBase / AWS integration:
- Can store data (StoreFiles and metadata) on S3 via EMRFS
- Can back up HBase data to S3 on EMR
- Apache HBase is a column-oriented, NoSQL database built on top of HDFS
HBase vs DynamoDB:
- Both are NoSQL databases intended for the same sorts of things
- But if you’re all in with AWS anyhow, DynamoDB has advantages:
- Fully managed (auto-scaling)
- More integration with other AWS services
- Glue integration
- HBase has some advantages though:
- Efficient storage of sparse data
- Appropriate for high frequency counters (consistent reads & writes)
- High write & update throughput
- More integration with Hadoop
Apache Hadoop: (image)
- MapReduce:
- Framework for distributed data processing
- Maps data to key/value pairs
- Reduces intermediate results to final output
- Largely supplanted by Spark these days
- Yet Another Resource Negotiator (YARN):
- Manages cluster resources for multiple data processing frameworks
- Hadoop Distributed File System (HDFS):
- Distributes data blocks across instances in a cluster in a redundant manner
- Ephemeral in EMR; data lost on termination
Presto on EMR:
- Presto (or PrestoDB) is an open source, distributed SQL query engine for fast analytic queries against data of any size. It supports both non-relational sources, such as the HDFS, S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Redshift, Microsoft SQL Server, and Teradata
- Can connect to many different Big data databases and data stores at once, and query across them
- Interactive queries at petabyte scale
- Familiar SQL syntax
- Optimized for OLAP analytical queries, data warehousing
- Developed, and still partially maintained by Facebook
- This is what Amazon Athena uses under the hood
- Exposes JDBC, Command Line, and Tableau interfaces
- Not suitable for OLTP or batch processing
Zeppelin:
- If you’re familiar with iPython (Jupyter) notebooks, it’s like that:
- Lets me interactively run Python scripts / code against my data
- Can interleave with nicely formatted notes
- Can share notebooks with others on your cluster
- Spark, Python, JDBC, HBase, Elasticsearch
- Zeppelin + Spark integration:
- Can run Spark code interactively (like I can in the Spark shell)
- Speeds up your development cycle
- Allows easy experimentation and exploration of your big data
- Can execute SQL queries directly against SparkSQL
- Query results may be visualized in charts and graphs
- Makes Spark feel more like a data science tool
- EMR Notebook:
- Similar concept to Zeppelin, with more AWS integration
- Notebooks backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed only via AWS console
Hue:
- Hadoop User Experience
- Graphical front-end for applications that run on my cluster, allowing me to interact with applications using an interface that may be more familiar or user-friendly
- IAM integration: Hue Super users inherit IAM roles
- S3: Can browse & move data between HDFS, EMRFS and S3
- A management tool with a web-based front-end dashboard for the entire cluster
Splunk:
- Splunk / Hunk makes machine data accessible, usable, and valuable to everyone
- An operational tool that can be used to visualize EMR and S3 data using your EMR Hadoop cluster
- Reserved instances on a 64-bit OS recommended (a public AMI of Splunk Enterprise is available)
Flume:
- Another way to stream log data into my cluster
- A web server acts as an external source that provides events to a Flume Source
- Events are stored in one or more Channels (a passive store that keeps an event until it is consumed by a Flume Sink)
- The Flume Sink removes the event from the Channel and places it into an external repository such as HDFS on the EMR cluster
- Made from the start with Hadoop in mind (built-in sinks for HDFS and HBase)
- Originally made to handle log aggregation
MXNet:
- Like TensorFlow, a library for building and accelerating neural networks
- Included on EMR
- A framework used to build deep learning applications; a library that makes it easy to write deep learning jobs distributed across the entire EMR cluster
S3DistCP:
- Tool for copying large amounts of data
- From S3 into HDFS
- From HDFS into S3
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects (Across buckets, across accounts)
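- A boto3 sketch of running S3DistCp as an EMR step to copy HDFS output into S3; the cluster ID, HDFS path, and bucket are placeholders:
```python
import boto3

emr = boto3.client("emr")

# Copy job output from HDFS to S3 in parallel using the s3-dist-cp command on the cluster
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # placeholder cluster ID
    Steps=[{
        "Name": "Copy output to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", "hdfs:///output", "--dest", "s3://my-bucket/output/"],
        },
    }],
)
```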
Other EMR/ Hadoop Tools:
- Ganglia (Monitoring tool pre-installed on EMR. Monitoring the cluster status)
- Mahout (Machine learning library on the EMR cluster)
- Accumulo (another NoSQL database)
- Sqoop (Relational database connector. Importing data from external database into my cluster in a scalable manner)
- HCatalog (table and storage management for Hive Metastore)
- Kinesis Connector (directly access Kinesis Streams from my scripts)
- Tachyon (accelerator for Spark)
- Derby (open source relational database in Java)
- Ranger (data security manager for Hadoop)
EMR Security:
- IAM policies:
- Grant or deny permissions
- Allow user actions
- Combine with tagging to control access per cluster
- IAM roles for EMRFS requests to S3 (control whether cluster users can access files from within EMR based on user, group, or the location of the EMRFS data within S3)
- Kerberos:
- Secure user authentication through secret key cryptography (network authentication protocol that ensures passwords or other credentials are not sent over the network in an unencrypted format)
- SSH:
- Secure connection to command line on cluster instances
- Tunneling for web interface (can view web interfaces that are hosted on my master node of my cluster from outside of the cluster itself)
- Can use Kerberos or EC2 key pairs
- Can encrypt data in transit
- IAM roles:
- Control access to EMRFS data based on user, group, location of data
- Each cluster must have a service role and a role for the EC2 instance profile (we use an instance profile to pass an IAM role to an EC2 instance)
- IAM policies attached to roles
- Auto scaling role
- Service linked roles
- Block public access:
- Easy way to prevent public access to data stored on my EMR cluster
- Can set at the account level before creating the cluster
Choosing EMR Instance Types:
- Master node:
- m5.xlarge if < 50 nodes, m4.xlarge if > 50 nodes
- Core & Task nodes:
- m5.xlarge is usually good
- If the cluster waits a lot on external dependencies (e.g. a web crawler), t2.medium
- Improved performance: m4.xlarge
- Computation-intensive applications: high-CPU instances
- Database, memory-caching applications: high-memory instances
- Network / CPU intensive (NLP, ML): cluster compute instances
- Spot instances:
- Good choice for task nodes
- Only use on core & master nodes if testing or very cost-sensitive; I am risking partial data loss
Lambda
Features:
- Serverless way to run code snippets in the cloud
- Often used to process data as it is moved around
- Use cases:
- Real time file processing
- Real time stream processing
- ETL
- CRON replacement
- Process AWS events
- Supported languages: Node.js, Python, Java, C#, Go, Powershell, Ruby
- High availability
- No scheduled downtime
- Retries failed code 3 times
- Continuous scaling
- Safety throttle of 1,000 concurrent executions per region
- High performance
- New functions callable in seconds
- Events processed in milliseconds
- Code is cached automatically
- Timeout max is 900 seconds (15 min)
- Stateless
Lambda triggers: (image)
- Examples:
- A change in a DynamoDB table can emit event data that invokes a Lambda function, allowing real-time, event-driven processing of data arriving in DynamoDB tables
- Devices send data to the IoT service, which can then invoke Lambda to process it; Lambda also integrates with Kinesis Data Firehose, where it can transform incoming data and Firehose delivers the transformed data to S3, Redshift, Elasticsearch, or Splunk
- More examples:
- Serverless website (image)
- Order history app (image)
- Transaction rate alarm (image)
- S3 + Lambda + Amazon Elasticsearch Service (image)
- S3 + Lambda + Data Pipeline (image)
- S3 + Lambda + Redshift + DynamoDB (use COPY command to load data into Redshift) (image)
- Lambda + Kinesis Data Stream:
- Lambda reads records (polling) from a Kinesis stream and processes them accordingly
- Lambda code receives an event with a batch of stream records
- Specify a batch size when setting up the trigger (up to 10,000 records)
- Too large a batch size can cause timeouts
- Batches may also be split beyond Lambda’s payload limit (6 MB)
- Lambda will retry the batch until it succeeds or the data expires
- This can stall the shard if you don’t handle errors properly
- Use more shards to ensure processing isn’t totally held up by errors
- Lambda processes shard data synchronously
- Need to set up IAM role for Lambda to access other AWS services
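- A minimal Lambda handler sketch for a Kinesis trigger; the payload handling is illustrative only:
```python
import base64
import json

def lambda_handler(event, context):
    # Lambda invokes the handler with a batch of stream records
    for record in event["Records"]:
        # Kinesis record data is base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        # ... process / transform the record here (e.g. write to S3 or DynamoDB) ...
        print(data)

    # Raising an exception instead makes Lambda retry the whole batch,
    # which can stall the shard if errors are not handled
    return {"batchSize": len(event["Records"])}
```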
Cost Model:
- Pay for what I use (the number of requests sent to Lambda and the memory × duration consumed by the function)
- Generous free tier (1 million requests / month, 400K GB-seconds of compute time)
- $0.20 / million requests
- $0.00001667 per GB-second
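- Worked example (hypothetical numbers): 5 million invocations / month at 1 GB memory for 1 second each => compute = 5M GB-seconds − 400K free ≈ 4.6M × $0.00001667 ≈ $76.68; requests = (5M − 1M free) × $0.20 / million = $0.80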
Anti-patterns:
- Long-running applications: use EC2 instead, or a chain of Lambda functions
- Dynamic websites: although Lambda can be used to develop serverless apps that rely on client-side AJAX
- Stateful applications: but I can use DynamoDB or S3 to keep track of state
AWS Data Pipeline:
- A web service that helps me reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals
- Destinations include S3, RDS, DynamoDB, Redshift, and EMR
- Manages task dependencies
- Retries and notifies on failures (retry 3 times)
- Cross region pipelines
- Precondition checks, for example:
- DynamoDBDataExists => A precondition to check that data exists in a DynamoDB table
- DynamoDBTableExists => A precondition to check that the DynamoDB table exists
- S3KeyExists => Checks whether a key exists in an Amazon S3 data node
- S3PrefixExists => Checks for at least one file existing within a specific path
- Data sources may be on premises (need to install Task Runner)
- Highly available
- Data Pipeline Activities (actions):
- EMR (spin up EMR instance, run sequence of steps and automatically terminate the cluster when it is done)
- Hive (runs a Hive query on an EMR cluster)
- Copy (copy data between S3 and JDBC data source or run SQL query and copy the output into S3)
- SQL
- Scripts (run Linux shell command or programs)
AWS Step Functions:
- A low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications
- Use to design workflows
- Easy visualizations
- Advanced Error Handling and Retry mechanism outside the code
- Audit of the history of workflows
- Ability to Wait for an arbitrary amount of time
- Max execution time of a State Machine is 1 year
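- A boto3 sketch of creating a tiny state machine with a Wait state and a retried Lambda task; the function ARN, role ARN, and state names are placeholders:
```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A minimal Amazon States Language definition: wait, then invoke a Lambda function with retries
definition = {
    "StartAt": "WaitForData",
    "States": {
        "WaitForData": {"Type": "Wait", "Seconds": 60, "Next": "ProcessData"},
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 10, "MaxAttempts": 3}],
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",   # placeholder role
)
```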