Data Engineering with GCP
Preface
Section 1: Getting started with Data Engineering with GCP
Fundamentals of Data Engineering
Understanding the data life cycle
Understanding the need for a data warehouse
From operational systems to decision making
Problems
Data silos
Multiple operational systems storing the data to be processed
Typical warehouse stack
Storage
Compute
Single monolithic software
Data lakes
Schema
SQL Interface
:check: Answer these questions if you want to display information to your end user
Who will consume the data?
What data sources should I use?
Where should I store the data?
When should the data arrive?
Why does the data need to be stored in this place?
How should the data be processed?
Data lakes vs warehouses
Data lakes can store unstructured data
Data Lake
Schema is not mandatory
Compute with different technologies for the same underlying storage
First focus - store as much data as possible
Business relevancy and data model are defined later
Data Warehouse
Schema is mandatory
SQL only; developers have no control over the compute
First focus is business relevancy and data models
Only store data based on the business needs
Data life cycle flow
Application
Databases
Data lake
Data warehouse
Data Mart
2 more items...
ML modelling
ML pipeline
Knowing the roles of a data engineer before starting
Data engineer versus data scientist
The data scientist role, before the split
Handle big data infrastructure
Properly design and build ETL pipelines
Train ML models
Deeply understand the business
Now the role is split into multiple roles
Data analyst
ML Engineer
Business Analyst
Data Engineer
The focus of data engineers
Build data warehouse
Build Data Lake
Orchestrate ETL jobs
Data Engineer Responsibilities
Handle all big data infrastructures and software installation
Handle application databases
Design the data warehouse data model
Analyze big data to transform raw data into meaningful information
Create a data pipeline for ML
The person building the data mart should understand the business needs
Foundational concepts for data engineering
ETL concept in data engineering
ETL is the core of Data Engineering
The difference between ETL and ELT
In ELT, the transformation is done in the downstream system; for example, BigQuery is well suited for transformation
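A minimal Python sketch of the ELT pattern, assuming hypothetical bucket, project, and table names: extract-load the raw file into BigQuery as-is, then transform downstream with SQL.

```python
from google.cloud import bigquery

client = bigquery.Client()

# E + L: load the raw file into BigQuery without transforming it first
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders.csv",    # hypothetical source file
    "my-project.raw_dataset.orders",    # hypothetical raw table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()

# T: transform in the downstream system (BigQuery) using SQL
transform_sql = """
CREATE OR REPLACE TABLE `my-project.dwh_dataset.daily_orders` AS
SELECT order_date, COUNT(*) AS total_orders
FROM `my-project.raw_dataset.orders`
GROUP BY order_date
"""
client.query(transform_sql).result()
```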
What is NOT big data?
Two concepts
Data is big
Big data technology
The "how" questions
How to store 1 PB of data when common size HD is in TBs?
How to average a list of numbers when they are stored on multiple computers?
How to continuously extract and aggregate data as a streaming process?
A quick look at how big data technologies store data
Size of the data relative to the system
A large file is distributed across multiple machines
Optimize performance
Further questions?
How do I process the files?
How to aggregate?
How do the different parts know about the other parts?
A quick look at how to process multiple files using MapReduce
2 definitions
MapReduce as a technology
MapReduce as a concept
The concept is what matters
1. Map the file parts
Add a static value, transforming each record into a key-value pair
2. Shuffle
Move data between machines to form groups per key
3. Reduce
Produce the desired result
The result is stored on a single machine
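A minimal, single-machine Python sketch of these three steps using the classic word count; the in-memory lists are stand-ins for file parts that would live on different machines.

```python
from collections import defaultdict

# Input: file parts, as if each chunk lived on a different machine
file_parts = [["apple banana apple"], ["banana apple"], ["cherry banana"]]

# 1. Map: transform each record into (key, static value) pairs
mapped = []
for part in file_parts:
    for line in part:
        for word in line.split():
            mapped.append((word, 1))  # add a static value of 1

# 2. Shuffle: move pairs "between machines" so equal keys form groups
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 3. Reduce: aggregate each group to produce the desired result
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {'apple': 3, 'banana': 3, 'cherry': 1}
```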
:check: Questions?
Extract, Transform, Load (ETL) :star:
ETL vs ELT?
Big Data?
How do you handle large volumes of data?
End
Summary
Exercise
Answer:
https://trello.com/c/oEylBM7i
See also
Big Data Capabilities on GCP
Technical requirements
GCP Console
Cloud Shell
Cloud Editor
Understanding what the cloud is
The difference between the cloud and non-cloud era
The on-demand nature of the cloud
Hadoop pattern
Hadoop File System (HDFS)
Process thousands of jobs in Spark
In GCP, it's common to create a Spark cluster for each individual job
Ephemeral cluster
Getting started with Google Cloud Platform
Introduction to the GCP console
Practicing pinning services
Creating your first GCP Project
Using GCP Cloud Shell
A quick overview of GCP services for data engineering
Understanding the GCP serverless service
Three groups of services
VM based
Managed services
Serverless (fully managed services)
Service mapping and prioritization
Identity and Management Tools
IAM & Admin
Logging
Data Catalog
Monitoring
Storage & DB
Cloud Storage
Bigtable
Cloud SQL
Datastore
Big Data
BigQuery
Dataproc
Dataflow
Pub/Sub
ML & BI
Vertex AI
Data Studio
Looker
ETL Orchestrator
Cloud Composer
Data Fusion
Dataprep
The concept of quotas on GCP services
https://cloud.google.com/bigquery/quotas
User account versus service account
Service accounts are used to automate tasks
Summary
:check: Important decision factors
Choose services that are serverless
fully managed
SaaS
Easy to use
Understand the mapping between services and the data engineering areas
If there is more than one option in an area, choose the most popular service in the market
Long run support
Hire experts more easily
Section 2: Building Solutions with GCP Components
Building a Data Warehouse in BigQuery
Technical requirements
python3
SQL
Linux commands
Git
Introduction to Google Cloud Storage and BigQuery
BigQuery data location
Wow features
Streaming
Machine Learning
SQL Interface
Not a transactional database
GCS is fully managed object storage
No need to worry about infra
Large files: images, videos, large CSVs, database dumps, historical data exported from BigQuery, ML model files
BigQuery
Fully managed data warehouse
Store and analyze data immediately
4 parts
Storage
Processing
Metadata
SQL Interface
Google Colossus filesystem
Distributed SQL execution engine
Inspired by Dremel
Introduction to the BigQuery console
Creating a dataset in BigQuery using the console
Datasets are groups of tables
Should have meaningful names
Ex: Website Logs
Database: technical purposes vs. dataset: business purposes
Loading a local CSV file into the BigQuery table
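The console upload described here can also be done with the google-cloud-bigquery client; a minimal sketch, assuming hypothetical table and file names:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

with open("data.csv", "rb") as f:  # hypothetical local file
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
```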
Using public data in BigQuery
Data types in BigQuery compared to other databases
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
Timestamp data in BigQuery compared to other databases
The only allowed time zone format is UTC
Preparing the prerequisites before developing our data warehouse
Step 1: Access your Cloud shell
Step 2: Check the current setup using the command line
Step 3: The gcloud init command
Step 4: Download example data from Git
Step 5: Upload data to GCS from Git
Practicing developing a data warehouse
Data warehouse in BigQuery - Requirements for scenario 1
Steps and planning for handling scenario 1
Data pipeline
Create MySQL DB
Extract MySQL to GCS
Load GCS to BigQuery
Create BigQuery Data Mart
Extractions happen from the clone instance, so they don't interrupt the production database
Data warehouse in BigQuery - Requirements for scenario 2
Steps and planning for handling scenario 2
https://trello.com/c/BOFAGvl7
Design data modeling for BigQuery
Mistakes
Data is duplicated in many locations
Values not consistent
The cost of processing is highly inefficient
End user does not understand how to use data warehouse
The business does not trust data
You need to spend more time on data modeling in a warehouse than in a data lake
:check:Data modeling
The process of representing database objects from a business perspective
Objects in BigQuery
Datasets
Tables
Views
End users are humans
vs. application databases, where the end users are applications
Common end users
Business Analysts
Data Analysts
Data Scientists
BI Users
Any other user for business purposes
:check: Store information in multiple tables, so you can write JOIN queries to get necessary data
Other purposes of Data Model
Data consistency
Query performance
https://trello.com/c/ttqrELK5
Normalized table
1 more item...
Storage efficiency
Inmon versus the Kimball data model
Inmon - Data Driven
Build central data warehouse
1 more item...
The data model is highly normalized, down to the lowest level
Enterprise Data Warehouse - Single data source of all data marts
1 more item...
Top-down
Kimball - User Driven
Focuses on answering user questions
Goal: ease of use and a high level of performance improvement
Fact table
1 more item...
Dimension Table
1 more item...
Star schema
2 more items...
:check:The best data model for BigQuery
Data model needs
Real world scenario for the end user
High data consistency
Query speed
Store data efficiency
Needs to be highly accurate, with no pressure to develop fast
Inmon
Kimball - most of the time
Agile
User driven mindset
Good enough even if there are inconsistencies and inefficiencies here and there
Creating fact and dimensions
Since there are clear questions from business users
End
Summary
Exercise - Scenario 3
See also
Intro
:check:Starting point
requirements
business questions
:check:Thinking process
Intro
Challenges in making data driven decisions
Technology bottlenecks
Data consistency
Ability to serve multiple business purposes
3 services
BigQuery
GCS
Cloud SQL
Building Orchestration for Batch Data Loading Using Cloud Composer
Technical requirements
Cloud composer environment
Cloud SQL Instance
GCS bucket
BigQuery datasets
Cloud shell
Code editor
Introduction to Cloud Composer
Focus only on the development and deployment of our data pipeline, not on infrastructure, installation, and software management
Airflow is a workflow management tool
Uses Python scripts to manage workflows
3 main components
Handling task dependencies
tasks dependent on other tasks
Scheduler
Runs routines every day
System integration
Tools: GCS, BigQuery, Python and Bash scripts, an email API
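A minimal Airflow DAG sketch tying the three components together: dummy tasks with explicit dependencies, run on a daily schedule. Task and DAG names are hypothetical; Airflow 2.x import paths are assumed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # Airflow 2.x import path

with DAG(
    dag_id="level_1_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # the scheduler triggers this routine every day
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
    extract = DummyOperator(task_id="extract")
    load = DummyOperator(task_id="load")
    end = DummyOperator(task_id="end")

    # Task dependencies: load waits for extract, which waits for start
    start >> extract >> load >> end
```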
Understanding the working of Airflow
Provisioning Cloud Composer in a GCP project
Exercise: Build data pipeline orchestration using Cloud Composer
Level 1 DAG - Creating dummy workflows
Level 2 DAG - Scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
Level 3 DAG - Parameterized variables
Level 4 DAG - Guaranteeing task idempotency in Cloud Composer
Level 5 DAG - Handling late data using a sensor
Summary
Intro
Orchestration
Automate tasks, jobs and their dependencies
Problems to solve
pipeline issues
data duplication
task dependencies
managing connections
handling late data
Building a Data Lake Using Dataproc
Technical requirements
Introduction to Dataproc
A brief history of the data lake and Hadoop ecosystem
A deeper look into Hadoop components
How much Hadoop related knowledge do you need on GCP?
Introducing the Spark RDD and the DataFrame concept
Introducing the data lake concept
Hadoop and Dataproc positioning on GCP
Exercise - Building a data lake on a Dataproc cluster
Creating a Dataproc cluster on GCP
Using Cloud Storage as an underlying Dataproc file system
Exercise: Creating and running jobs on a Dataproc cluster
Preparing log data in GCS and HDFS
Developing Spark ETL from HDFS to HDFS
Developing Spark ETL from GCS to GCS
Developing Spark ETL from GCS to BigQuery
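A minimal PySpark sketch of a GCS-to-GCS ETL job of the kind these exercises build; the bucket paths and the status column are hypothetical.

```python
from pyspark.sql import SparkSession

# On Dataproc, the GCS connector is preinstalled, so gs:// paths work directly
spark = SparkSession.builder.appName("gcs-to-gcs-etl").getOrCreate()

# Extract: read raw log files from a GCS bucket (hypothetical path)
logs = spark.read.option("header", True).csv("gs://my-bucket/raw/logs/*.csv")

# Transform: drop duplicates and keep only rows with a status value
cleaned = logs.dropDuplicates().filter(logs["status"].isNotNull())

# Load: write the result back to GCS in a columnar format
cleaned.write.mode("overwrite").parquet("gs://my-bucket/clean/logs/")
```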
Understanding the concept of the ephemeral cluster
Practicing using a workflow template on Dataproc
Building an ephemeral cluster using Dataproc and Cloud Composer
Summary
Processing Streaming Data with Pub/Sub and Dataflow
Technical requirements
Dataflow
Pub/Sub
GCS
BigQuery
Processing streaming data
Streaming data for data engineers
Data flows from upstream to downstream as soon as the data is created
In contrast, batch data flows every hour or every day, depending on the use case
Pub/Sub controls the incoming and outgoing data streams as messages
Dataflow accepts the messages and processes the data as a streaming process
Introduction to Pub/Sub
Messaging system
4 main terms in Pub/Sub
Publisher
Control incoming messages
Code that publishes messages from applications
Pub/Sub stores the messages in topics
Topic
Analogy: topics are like tables; messages are like rows in tables
Subscription
Interested in receiving messages from topics
Each topic can have one or many subscriptions
Each subscription gets identical messages from the topic
Subscriber
Each subscription can have one or many subscribers
The idea of having multiple subscribers in a subscription is to split the load
For example, with one subscription that has two subscribers, the two subscribers will get partial messages from the subscription
Acknowledge (ack)
Pub/Sub stops resending a message after the ack is received
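A minimal sketch tying these terms together with the google-cloud-pubsub client; the project, topic, and subscription names are hypothetical.

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project and resource names

# Publisher: application code publishing messages to a topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
future = publisher.publish(topic_path, b"hello pub/sub")
print("published message id:", future.result())

# Subscriber: pulls messages from a subscription attached to the topic
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

def callback(message):
    print("received:", message.data)
    message.ack()  # once acked, Pub/Sub stops redelivering this message

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # process messages for 30 seconds
except TimeoutError:
    streaming_pull.cancel()
```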
Introduction to Dataflow
Can handle
Batch data
Streaming data
Write code using the Apache Beam SDK; the data pipeline then runs on Dataflow
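A minimal Apache Beam sketch of a streaming pipeline reading from Pub/Sub and writing to BigQuery; the subscription, table, and schema are hypothetical, and runner/project/region flags would be supplied on the command line.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True marks the pipeline as unbounded; runner=DataflowRunner,
# project, and region are normally passed via command-line flags
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "ToRow" >> beam.Map(lambda text: {"event": text})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="event:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```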
Exercise - Publishing event streams to Cloud Pub/Sub
Creating a Pub/Sub topic
Creating and running a Pub/Sub publisher using Python
Creating a Pub/Sub subscription
Exercise - Using Cloud Dataflow to stream data from Pub/Sub to GCS
Creating a HelloWorld application using Apache Beam
Creating a Dataflow streaming job without aggregation
Creating a streaming job with aggregation
Summary
Visualizing Data for Making Data-Driven Decisions with Data Studio
Technical requirements
Unlocking the power of your data with Data Studio
From data to metrics in minutes with an illustrative use case
Understanding what BigQuery INFORMATION_SCHEMA is
Exercise - Exploring the BigQuery INFORMATION_SCHEMA table using Data Studio
Exercise - Creating a Data Studio report using data from a bike-sharing data warehouse
Understanding how Data Studio can impact the cost of BigQuery
What kind of table could be 1 TB in size?
How can a table be accessed 10,000 times in a month?
Creating materialized views and understanding how BI Engine works
Understanding BI Engine
Summary
Building Machine Learning Solutions on Google Cloud Platform
Technical requirements
A quick look at machine learning
Exercise - Practicing ML code using Python
Preparing the ML dataset by using a table from the BigQuery public dataset
Training the ML model using Random Forest in Python
Creating batch prediction using the training dataset's output
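A minimal scikit-learn sketch of this exercise's flow (train a Random Forest, then batch-predict); the CSV file and label column are hypothetical stand-ins for the BigQuery public dataset export.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset: feature columns plus a binary "label" column
df = pd.read_csv("training_data.csv")
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Batch prediction on the held-out set
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```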
The MLOps landscape in GCP
Understanding the basic principles of MLOps
Introducing GCP services related to MLOps
Exercise - leveraging pre-built GCP models as a service
Uploading the image to a GCS bucket
Creating a detect text function in Python
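A minimal sketch of such a detect-text function using the pre-built Cloud Vision API; the bucket and file names are hypothetical.

```python
from google.cloud import vision

def detect_text(gcs_uri):
    """Return the text annotations found in an image stored in GCS."""
    client = vision.ImageAnnotatorClient()
    image = vision.Image(source=vision.ImageSource(image_uri=gcs_uri))
    response = client.text_detection(image=image)
    return [annotation.description for annotation in response.text_annotations]

# Hypothetical bucket and object name
print(detect_text("gs://my-bucket/sample-image.png"))
```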
Exercise - using AutoML in GCP to train an ML model
Exercise - deploying a dummy workflow with Vertex AI Pipeline
Creating a dedicated regional GCS bucket
Developing the pipeline in Python
Monitoring the pipeline on the Vertex AI Pipeline console
Exercise - deploying a scikit-learn model pipeline with Vertex AI
Creating the first pipeline, which will result in an ML model file in GCS
Running the first pipeline in Vertex AI Pipeline
Creating the second pipeline, which will use the model file and store the prediction results as a CSV file in GCS
Running the second pipeline in Vertex AI Pipeline
Summary
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
User and Project Management in GCP
Technical requirements
Understanding IAM in GCP
Planning a GCP project structure
Deciding how many projects we should have in a GCP organization
Understanding the GCP organization, folder, and project hierarchy
Controlling user access to our data warehouse
Use-case scenario - planning a BigQuery ACL on an e-commerce organization
Column-level security in BigQuery
Practicing the concept of IaC using Terraform
Exercise - creating and running basic Terraform scripts
Self-exercise - managing a GCP project and resources using Terraform
Summary
Cost Strategy in GCP
Technical requirements
Estimating the cost of your end-to-end solution in GCP
Comparing BigQuery on-demand and flat-rate
Example - estimating data engineering use case
Tips for optimizing BigQuery using partitioned and clustered tables
Clustered tables
Exercise - optimizing BigQuery on-demand cost
Partitioned tables
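A minimal sketch, assuming hypothetical table and column names, of recreating a table partitioned by date and clustered by a frequently filtered column, so on-demand queries scan fewer bytes.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table names; the DDL rebuilds the table partitioned by date
# and clustered on a commonly filtered column
ddl = """
CREATE OR REPLACE TABLE `my-project.my_dataset.trips_optimized`
PARTITION BY DATE(start_time)
CLUSTER BY station_id AS
SELECT * FROM `my-project.my_dataset.trips`
"""
client.query(ddl).result()

# With on-demand pricing, filtering on the partition column limits the scan:
# SELECT * FROM trips_optimized WHERE DATE(start_time) = "2021-01-01"
```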
Summary
CI/CD on Google Cloud Platform for Data Engineers
Technical requirements
Introduction to CI/CD
Understanding the data engineer's relationship with CI/CD practices
Understanding CI/CD components with GCP services
Exercise - implementing continuous integration using Cloud Build
Creating a GitHub repository using Cloud Source Repository
Developing the code and Cloud Build scripts
Creating the Cloud Build Trigger
Pushing the code to the GitHub repository
Exercise - deploying Cloud Composer jobs using Cloud Build
Preparing the CI/CD environment
Preparing the cloudbuild.yaml configuration file
Pushing the DAG to our GitHub repository
Checking the CI/CD result in the GCS bucket and Cloud Composer
Summary
Further reading
Boosting Your Confidence as a Data Engineer
Overviewing the Google Cloud Certification
Exam preparation tips
Extra GCP services material
Quiz - reviewing all the concepts you've learned about
Questions
Answers
The past, present, and future of Data Engineering
Boosting your confidence and final thoughts
Summary