AWS Glue
Data Catalog
central repository and persistent metadata store
disparate systems can store and find metadata
to query and transform the data.
store
for a given dataset
its table definition
available for ETL
readily available for querying with
Athena
EMR
Redshift Spectrum
physical location
business relevant attributes
track how data has changed over time
Apache Hive Metastore
replacement for Big Data applications on EMR
integration with
Athena
EMR
Redshift Spectrum
provides
comprehensive audit
governance capabilities
with
schema change tracking
data access controls
Each AWS account has one AWS Glue Data Catalog per region
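A table definition in the Data Catalog can be pictured as a small metadata record holding the schema, the physical location, and descriptive attributes. The sketch below builds such a record in the shape of the `TableInput` argument of the Glue `CreateTable` API. The helper `build_table_input`, the table name, the bucket path, and the database name are all hypothetical, and the storage format fields (input/output format, SerDe) are left out for brevity.

```python
# Hypothetical helper: builds a dict shaped like the Glue CreateTable
# API's TableInput. Names and the S3 path are illustrative assumptions.

def build_table_input(name, location, columns, partition_keys=()):
    """Return a table definition: schema, physical location, attributes."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # Free-form key/value attributes attached to the table.
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,  # physical location of the data
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
        },
    }


table_input = build_table_input(
    name="clicks",
    location="s3://example-bucket/clicks/",
    columns=[("user_id", "string"), ("ts", "timestamp")],
    partition_keys=[("year", "string")],
)

# With boto3 installed and credentials configured, the definition could be
# registered roughly like this (not executed here):
#   boto3.client("glue").create_table(
#       DatabaseName="example_db", TableInput=table_input)
print(table_input["StorageDescriptor"]["Location"])
```

Once a definition like this is registered, Athena, EMR, and Redshift Spectrum can look the table up by name instead of each keeping its own copy of the schema.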
Crawlers
extract
schema of the data
automatically infer
schemas
partition structure
other statistics
populates the Glue Data Catalog with this metadata.
can be scheduled to run periodically
automatically add
new tables
new partitions to existing tables
new versions of table definitions
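To make the idea of schema and partition inference concrete, here is a toy sketch in plain Python. It is not how Glue crawlers are implemented. It only illustrates the kind of work involved: sampling records to infer column types, and reading Hive-style `key=value` path segments as partitions. All function names and sample data are hypothetical.

```python
def infer_type(value):
    """Map a Python value to a coarse, Hive-style type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_columns(records):
    """Union the fields of all sampled records; widen to string on conflict."""
    columns = {}
    for record in records:
        for name, value in record.items():
            seen = columns.get(name)
            new = infer_type(value)
            columns[name] = new if seen in (None, new) else "string"
    return columns


def infer_partitions(key):
    """Extract Hive-style key=value partition segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            k, v = segment.split("=", 1)
            parts[k] = v
    return parts


sample = [
    {"user_id": "u1", "clicks": 3},
    {"user_id": "u2", "clicks": 5, "score": 0.5},
]
columns = infer_columns(sample)
partitions = infer_partitions("clicks/year=2024/month=01/part-0000.json")
print(columns)
print(partitions)
```

A scheduled crawler repeats this kind of scan over time, which is how new tables, partitions, and table versions end up in the catalog without manual edits.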
Dynamic Frames
used in ETL scripts
is a distributed table
supports nested data
structures
arrays
similar to an Apache Spark dataframe
can be converted to and from each other
Each record is self-describing
contains
data
schema
no schema is required initially
provide
schema flexibility
advanced transformations
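The "self-describing record" idea can be illustrated without Spark. The toy sketch below is plain Python, not the `awsglue` library. It computes a schema on demand from the records themselves, keeps conflicting types for a field side by side, and then resolves the conflict with a cast. In an actual Glue job the comparable operations are `DynamicFrame.resolveChoice`, `toDF()`, and `DynamicFrame.fromDF(...)`. The function names here are hypothetical stand-ins.

```python
def field_type(value):
    """Name the type of one field value, including nested kinds."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return type(value).__name__


def merged_schema(records):
    """Compute a schema on demand; conflicting types are kept as choices."""
    merged = {}
    for record in records:
        for name, value in record.items():
            merged.setdefault(name, set()).add(field_type(value))
    return {name: sorted(kinds) for name, kinds in merged.items()}


def resolve_choice(records, name, cast):
    """Resolve an ambiguous field by casting it wherever it appears."""
    return [{**r, name: cast(r[name])} if name in r else r for r in records]


# No schema is declared up front; each record carries its own fields.
records = [
    {"id": 1, "price": 9.5, "tags": ["a", "b"]},
    {"id": "2", "price": 4.0, "address": {"city": "Oslo"}},
]

before = merged_schema(records)
after = merged_schema(resolve_choice(records, "id", str))
print(before["id"], "->", after["id"])
```

Here `id` starts out with two candidate types and ends with one, which is the kind of flexibility a fixed-schema DataFrame does not offer at load time.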
Streaming ETL
ETL operations on streaming data
using continuously-running jobs.
built on the
Apache Spark Structured Streaming
engine
ingests streams from
Kinesis Data Streams
Apache Kafka
using Amazon Managed Streaming for Apache Kafka (Amazon MSK)
process event data like
IoT streams
clickstreams
network logs
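Spark Structured Streaming processes a stream as a series of small batches by default, and a streaming ETL job applies its transform to each batch in turn. The sketch below imitates that pattern in plain Python over a finite list of events. It is an illustration only, with hypothetical function names, a made-up event shape, and an assumed 60-second window. A real job runs continuously against Kinesis or Kafka.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) events into fixed windows, oldest first."""
    windows = defaultdict(list)
    for ts, payload in events:
        window_start = ts // window_seconds * window_seconds
        windows[window_start].append(payload)
    return sorted(windows.items())


def process_batch(payloads):
    """Example per-batch transform: count events per device."""
    counts = defaultdict(int)
    for payload in payloads:
        counts[payload["device"]] += 1
    return dict(counts)


# Timestamps are seconds since an arbitrary start; devices are made up.
events = [
    (0, {"device": "a"}),
    (30, {"device": "b"}),
    (70, {"device": "a"}),
    (95, {"device": "a"}),
]

results = [(start, process_batch(batch))
           for start, batch in micro_batches(events, 60)]
print(results)
```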
Job Bookmark
tracks data that has already been processed
during a previous run of an ETL job
maintain state information
prevent the reprocessing of old data.
help process new data when rerunning on a scheduled interval
composed of the state of
sources
transformations
targets
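The bookmark mechanism can be sketched as persisted state that each run reads, uses to filter its input, and then advances. The toy below keeps a "last modified" high-water mark in a local JSON file. This is an assumption made for illustration: Glue stores bookmark state itself, and a real job enables it with the `--job-bookmark-option job-bookmark-enable` argument plus `transformation_ctx` and `job.commit()` in the script. The function name, file name, and object list are hypothetical.

```python
import json
import os
import tempfile


def run_job(objects, state_path):
    """Process only objects newer than the bookmark, then advance it."""
    bookmark = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            bookmark = json.load(f)["last_modified"]

    new = [o for o in objects if o["last_modified"] > bookmark]

    if new:
        # Commit the new high-water mark only after processing succeeds.
        newest = max(o["last_modified"] for o in new)
        with open(state_path, "w") as f:
            json.dump({"last_modified": newest}, f)
    return [o["key"] for o in new]


state_path = os.path.join(tempfile.mkdtemp(), "bookmark.json")
objects = [
    {"key": "a.json", "last_modified": 100},
    {"key": "b.json", "last_modified": 200},
]

first = run_job(objects, state_path)    # first run sees everything
objects.append({"key": "c.json", "last_modified": 300})
second = run_job(objects, state_path)   # rerun sees only the new object
print(first, second)
```

Rerunning with no new objects returns an empty list and leaves the bookmark unchanged, which is the behavior that makes scheduled reruns safe.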