Data Engineering with GCP
Preface
Section 1: Getting started with Data Engineering with GCP
Fundamentals of Data Engineering
Understanding the data life cycle
Understanding the need for a data warehouse
From operational systems to decision making
Problems
Data silos
Multiple operational systems storing the data to be processed
Typical warehouse stack
Storage
Compute
Single monolithic software
Data lakes
Schema
SQL Interface
:check: Answer these questions if you want to display information to your end user
Who will consume the data?
What data sources should I use?
Where should I store the data?
When should the data arrive?
Why does the data need to be stored in this place?
How should the data be processed?
Data lakes vs warehouses
Data lakes can store unstructured data
Data Lake
Schema is not mandatory
Compute with different technologies for the same underlying storage
First focus - store as much data as possible
Business relevancy and data model are defined later
Data Warehouse
Schema is mandatory
SQL only; developers have no control over the compute
First focus is business relevancy and data models
Only store data based on the business needs
Data life cycle flow
Application
Databases
Data lake
Data warehouse
Data Mart
2 more items...
ML modelling
ML pipeline
Knowing the roles of a data engineer before starting
Data engineer versus data scientist
The data scientist role, before the split
Handle big data infrastructure
Properly design and build ETL pipelines
Train ML models
Deeply understand the business
Now the role is split into multiple roles
Data analyst
ML Engineer
Business Analyst
Data Engineer
The focus of data engineers
Build data warehouse
Build Data Lake
Orchestrate ETL jobs
Data Engineer Responsibilities
Handle all big data infrastructures and software installation
Handle application databases
Design the data warehouse data model
Analyze big data to transform raw data into meaningful information
Create a data pipeline for ML
The person building the data mart should understand the business needs
Foundational concepts for data engineering
ETL concept in data engineering
ETL is the core of Data Engineering
The difference between ETL and ELT
In ELT, the transformation is done in the downstream system; for example, BigQuery is well suited for transformation
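A minimal Python sketch of the ELT pattern, assuming hypothetical bucket, project, and table names: extract-load the raw file into BigQuery as-is, then transform downstream with SQL.

```python
from google.cloud import bigquery

client = bigquery.Client()

# E + L: load the raw file into BigQuery without transforming it first
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders.csv",    # hypothetical source file
    "my-project.raw_dataset.orders",    # hypothetical raw table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()

# T: transform in the downstream system (BigQuery) using SQL
transform_sql = """
CREATE OR REPLACE TABLE `my-project.dwh_dataset.daily_orders` AS
SELECT order_date, COUNT(*) AS total_orders
FROM `my-project.raw_dataset.orders`
GROUP BY order_date
"""
client.query(transform_sql).result()
```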
What is NOT big data?
Two concepts
Data is big
Big data technology
The "how" questions
How to store 1 PB of data when common size HD is in TBs?
How to average a list of numbers when they are stored on multiple computers?
How to continuously extract and aggregate data as a streaming process?
A quick look at how big data technologies store data
Size of the data relative to the system
A large file is distributed across multiple machines
Optimize performance
Further questions?
How do I process the files?
How to aggregate?
How do the different parts know about the other parts?
A quick look at how to process multiple files using MapReduce
2 definitions
MapReduce as a technology
MapReduce as a concept
The concept is what matters
1. Map the file parts
Add a static value, transforming each record into a key-value pair
2. Shuffle
Move data between machines to form groups per key
3. Reduce
Produce the desired result
The result is stored on a single machine
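A minimal, single-machine Python sketch of these three steps using the classic word count; the in-memory lists are stand-ins for file parts that would live on different machines.

```python
from collections import defaultdict

# Input: file parts, as if each chunk lived on a different machine
file_parts = [["apple banana apple"], ["banana apple"], ["cherry banana"]]

# 1. Map: transform each record into (key, static value) pairs
mapped = []
for part in file_parts:
    for line in part:
        for word in line.split():
            mapped.append((word, 1))  # add a static value of 1

# 2. Shuffle: move pairs "between machines" so equal keys form groups
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 3. Reduce: aggregate each group to produce the desired result
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {'apple': 3, 'banana': 3, 'cherry': 1}
```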
:check: Questions?
Extract, Transform, Load (ETL) :star:
ETL vs ELT?
Big Data?
How do you handle large volumes of data?
End
Summary
Exercise
Answer:
https://trello.com/c/oEylBM7i
See also
Big Data Capabilities on GCP
Technical requirements
GCP Console
Cloud Shell
Cloud Editor
Understanding what the cloud is
The difference between the cloud and non-cloud era
The on-demand nature of the cloud
Hadoop pattern
Hadoop File System (HDFS)
Process thousands of jobs in Spark
In GCP, it's common to create a Spark cluster for each individual job
Ephemeral cluster
Getting started with Google Cloud Platform
Introduction to the GCP console
Practicing pinning services
Creating your first GCP Project
Using GCP Cloud Shell
A quick overview of GCP services for data engineering
Understanding the GCP serverless service
Three groups of services
VM based
Managed services
Serverless (fully managed services)
Service mapping and prioritization
Identity and Management Tools
IAM & Admin
Logging
Data Catalog
Monitoring
Storage & DB
Cloud Storage
Bigtable
Cloud SQL
Datastore
Big Data
BigQuery
Dataproc
Dataflow
Pub/Sub
ML & BI
Vertex AI
Data Studio
Looker
ETL Orchestrator
Cloud Composer
Data Fusion
Dataprep
The concept of quotas on GCP services
https://cloud.google.com/bigquery/quotas
User account versus service account
Service accounts are used to automate tasks
Summary
:check: Important decision factors
Choose services that are serverless
fully managed
SaaS
Easy to use
Understand the mapping between services and the data engineering areas
If there is more than one option in an area, choose the most popular service in the market
Long run support
Hire experts more easily
Section 2: Building Solutions with GCP Components
Building a Data Warehouse in BigQuery
Technical requirements
python3
SQL
Linux commands
Git
Introduction to Google Cloud Storage and BigQuery
BigQuery data location
Wow features
Streaming
Machine Learning
SQL Interface
Not a transactional database
GCS is fully managed object storage
No need to worry about infra
Large files: images, videos, large CSVs, database dumps, historical data exported from BigQuery, ML model files
BigQuery
Fully managed data warehouse
Store and analyze data immediately
4 parts
Storage
Processing
Metadata
SQL Interface
Google Colossus filesystem
Distributed SQL execution engine
Inspired by Dremel
Introduction to the BigQuery console
Creating a dataset in BigQuery using the console
Datasets are groups of tables
Should have meaningful names
Ex: Website Logs
Database: technical purposes vs. dataset: business purposes
Loading a local CSV file into the BigQuery table
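The console upload described here can also be done with the google-cloud-bigquery client; a minimal sketch, assuming hypothetical table and file names:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

with open("data.csv", "rb") as f:  # hypothetical local file
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
```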
Using public data in BigQuery
Data types in BigQuery compared to other databases
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
Timestamp data in BigQuery compared to other databases
The only allowed time zone format is UTC
Preparing the prerequisites before developing our data warehouse
Step 1: Access your Cloud shell
Step 2: Check the current setup using the command line
Step 3: The gcloud init command
Step 4: Download example data from Git
Step 5: Upload data to GCS from Git
Practicing developing a data warehouse
Data warehouse in BigQuery - Requirements for scenario 1
Steps and planning for handling scenario 1
Data pipeline
Create MySQL DB
Extract MySQL to GCS
Load GCS to BigQuery
Create BigQuery Data Mart
Extractions happen from the clone instance, so they don't interrupt the production database
Data warehouse in BigQuery - Requirements for scenario 2
Steps and planning for handling scenario 2
https://trello.com/c/BOFAGvl7
Design data modeling for BigQuery
Mistakes
Data is duplicated in many locations
Values not consistent
The cost of processing is highly inefficient
End user does not understand how to use data warehouse
The business does not trust data
You need to spend more time on data modeling in a warehouse than in a data lake
:check:Data modeling
The process of representing database objects from a business perspective
Objects in BigQuery
Datasets
Tables
Views
End users are humans
vs. application databases, where the end users are applications
Common end users
Business Analysts
Data Analysts
Data Scientists
BI Users
Any other user for business purposes
:check: Store information in multiple tables, so you can write JOIN queries to get necessary data
Other purposes of Data Model
Data consistency
Query performance
https://trello.com/c/ttqrELK5
Normalized table
1 more item...
Storage efficiency
Inmon versus the Kimball data model
Inmon - Data Driven
Build central data warehouse
1 more item...
The data model is highly normalized, down to the lowest level
Enterprise Data Warehouse - Single data source of all data marts
1 more item...
Top-down
Kimball - User Driven
Focuses on answering user questions
Goal: ease of use and a high level of performance improvement
Fact table
1 more item...
Dimension Table
1 more item...
Star schema
2 more items...
:check:The best data model for BigQuery
Data model needs
Real world scenario for the end user
High data consistency
Query speed
Store data efficiency
Needs to be highly accurate, with no pressure to develop fast
Inmon
Kimball - most of the time
Agile
User driven mindset
Good enough even if there are inconsistencies and inefficiencies here and there
Creating fact and dimensions
Since there are clear questions from business users
End
Summary
Exercise - Scenario 3
See also
Intro
:check:Starting point
requirements
business questions
:check:Thinking process
Intro
Challenges in making data driven decisions
Technology bottlenecks
Data consistency
Ability to serve multiple business purposes
3 services
BigQuery
GCS
Cloud SQL
Building Orchestration for Batch Data Loading Using Cloud Composer
Technical requirements
Cloud composer environment
Cloud SQL Instance
GCS bucket
BigQuery datasets
Cloud shell
Code editor
Introduction to Cloud Composer
Focus only on the development and deployment of our data pipeline, not on infrastructure, installation, and software management
Airflow is a workflow management tool
Uses Python scripts to manage workflows
3 main components
Handling task dependencies
tasks dependent on other tasks
Scheduler
Runs routines every day
System integration
Tools: GCS, BigQuery, Python and Bash scripts, an email API
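A minimal Airflow DAG sketch tying the three components together: dummy tasks with explicit dependencies, run on a daily schedule. Task and DAG names are hypothetical; Airflow 2.x import paths are assumed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # Airflow 2.x import path

with DAG(
    dag_id="level_1_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # the scheduler triggers this routine every day
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
    extract = DummyOperator(task_id="extract")
    load = DummyOperator(task_id="load")
    end = DummyOperator(task_id="end")

    # Task dependencies: load waits for extract, which waits for start
    start >> extract >> load >> end
```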
Understanding the working of Airflow
Provisioning Cloud Composer in a GCP project
Exercise: Build data pipeline orchestration using Cloud Composer
Level 1 DAG - Creating dummy workflows
Level 2 DAG - Scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
Level 3 DAG - Parameterized variables
Level 4 DAG - Guaranteeing task idempotency in Cloud Composer
Level 5 DAG - Handling late data using a sensor
Summary
Intro
Orchestration
Automate tasks, jobs and their dependencies
Problems to solve
pipeline issues
data duplication
task dependencies
managing connections
handling late data
Building a Data Lake Using Dataproc
Technical requirements
Introduction to Dataproc
A brief history of the data lake and Hadoop ecosystem
A deeper look into Hadoop components
How much Hadoop related knowledge do you need on GCP?
Introducing the Spark RDD and the DataFrame concept
Introducing the data lake concept
Hadoop and Dataproc positioning on GCP
Exercise - Building a data lake on a Dataproc cluster
Creating a Dataproc cluster on GCP
Using Cloud Storage as an underlying Dataproc file system
Exercise: Creating and running jobs on a Dataproc cluster
Preparing log data in GCS and HDFS
Developing Spark ETL from HDFS to HDFS
Developing Spark ETL from GCS to GCS
Developing Spark ETL from GCS to BigQuery
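A minimal PySpark sketch of a GCS-to-GCS ETL job of the kind these exercises build; the bucket paths and the status column are hypothetical.

```python
from pyspark.sql import SparkSession

# On Dataproc, the GCS connector is preinstalled, so gs:// paths work directly
spark = SparkSession.builder.appName("gcs-to-gcs-etl").getOrCreate()

# Extract: read raw log files from a GCS bucket (hypothetical path)
logs = spark.read.option("header", True).csv("gs://my-bucket/raw/logs/*.csv")

# Transform: drop duplicates and keep only rows with a status value
cleaned = logs.dropDuplicates().filter(logs["status"].isNotNull())

# Load: write the result back to GCS in a columnar format
cleaned.write.mode("overwrite").parquet("gs://my-bucket/clean/logs/")
```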
Understanding the concept of the ephemeral cluster
Practicing using a workflow template on Dataproc
Building an ephemeral cluster using Dataproc and Cloud Composer
Summary
Processing Streaming Data with Pub/Sub and Dataflow
Technical requirements
Dataflow
Pub/Sub
GCS
BigQuery
Processing streaming data
Streaming data for data engineers
Data flows from upstream to downstream as soon as the data is created
In contrast, batch data flows every hour or every day, depending on the use case
Pub/Sub controls the incoming and outgoing data streams as messages
Dataflow accepts the messages and processes the data as a streaming process
Introduction to Pub/Sub
Messaging system
4 main terms in Pub/Sub
Publisher
Control incoming messages
Code that publishes messages from applications
Pub/Sub stores the messages in topics
Topic
Analogy: topics are like tables; messages are like rows in tables
Subscription
Interested in receiving messages from topics
Each topic can have one or many subscriptions
Each subscription gets identical messages from the topic
Subscriber
Each subscription can have one or many subscribers
The idea of having multiple subscribers in a subscription is to split the load
For example, with one subscription that has two subscribers, the two subscribers will get partial messages from the subscription
Acknowledge (ack)
Pub/Sub stops resending a message after the ack is received
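A minimal sketch tying these terms together with the google-cloud-pubsub client; the project, topic, and subscription names are hypothetical.

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project and resource names

# Publisher: application code publishing messages to a topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
future = publisher.publish(topic_path, b"hello pub/sub")
print("published message id:", future.result())

# Subscriber: pulls messages from a subscription attached to the topic
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

def callback(message):
    print("received:", message.data)
    message.ack()  # once acked, Pub/Sub stops redelivering this message

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # process messages for 30 seconds
except TimeoutError:
    streaming_pull.cancel()
```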
Introduction to Dataflow
Can handle
Batch data
Streaming data
Write code using the Apache Beam SDK; the data pipeline then runs on Dataflow
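A minimal Apache Beam sketch of a streaming pipeline reading from Pub/Sub and writing to BigQuery; the subscription, table, and schema are hypothetical, and runner/project/region flags would be supplied on the command line.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True marks the pipeline as unbounded; runner=DataflowRunner,
# project, and region are normally passed via command-line flags
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "ToRow" >> beam.Map(lambda text: {"event": text})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="event:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```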
Exercise - Publishing event streams to Cloud Pub/Sub
Creating a Pub/Sub topic
Creating and running a Pub/Sub publisher using Python
Creating a Pub/Sub subscription
Exercise - Using Cloud Dataflow to stream data from Pub/Sub to GCS
Creating a HelloWorld application using Apache Beam
Creating a Dataflow streaming job without aggregation
Creating a streaming job with aggregation
Summary
Visualizing Data for Making Data-Driven Decisions with Data Studio
Technical requirements
Unlocking the power of your data with Data Studio
From data to metrics in minutes with an illustrative use case
Understanding what BigQuery INFORMATION_SCHEMA is
Exercise - Exploring the BigQuery INFORMATION_SCHEMA table using Data Studio
Exercise - Creating a Data Studio report using data from a bike-sharing data warehouse
Understanding how Data Studio can impact the cost of BigQuery
What kind of table could be 1 TB in size?
How can a table be accessed 10,000 times in a month?
Creating materialized views and understanding how BI Engine works
Understanding BI Engine
Summary
Building Machine Learning Solutions on Google Cloud Platform
Technical requirements
A quick look at machine learning
Exercise - Practicing ML code using Python
Preparing the ML dataset by using a table from the BigQuery public dataset
Training the ML model using Random Forest in Python
Creating batch prediction using the training dataset's output
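A minimal scikit-learn sketch of this exercise's flow (train a Random Forest, then batch-predict); the CSV file and label column are hypothetical stand-ins for the BigQuery public dataset export.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset: feature columns plus a binary "label" column
df = pd.read_csv("training_data.csv")
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Batch prediction on the held-out set
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```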
The MLOps landscape in GCP
Understanding the basic principles of MLOps
Introducing GCP services related to MLOps
Exercise - leveraging pre-built GCP models as a service
Uploading the image to a GCS bucket
Creating a detect text function in Python
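A minimal sketch of such a detect-text function using the pre-built Cloud Vision API; the bucket and file names are hypothetical.

```python
from google.cloud import vision

def detect_text(gcs_uri):
    """Return the text annotations found in an image stored in GCS."""
    client = vision.ImageAnnotatorClient()
    image = vision.Image(source=vision.ImageSource(image_uri=gcs_uri))
    response = client.text_detection(image=image)
    return [annotation.description for annotation in response.text_annotations]

# Hypothetical bucket and object name
print(detect_text("gs://my-bucket/sample-image.png"))
```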
Exercise - using AutoML in GCP to train an ML model
Exercise - deploying a dummy workflow with Vertex AI Pipeline
Creating a dedicated regional GCS bucket
Developing the pipeline in Python
Monitoring the pipeline on the Vertex AI Pipeline console
Exercise - deploying a scikit-learn model pipeline with Vertex AI
Creating the first pipeline, which will result in an ML model file in GCS
Running the first pipeline in Vertex AI Pipeline
Creating the second pipeline, which will use the model file and store the prediction results as a CSV file in GCS
Running the second pipeline in Vertex AI Pipeline
Summary
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
User and Project Management in GCP
Technical requirements
Understanding IAM in GCP
Planning a GCP project structure
Deciding how many projects we should have in a GCP organization
Understanding the GCP organization, folder, and project hierarchy
Controlling user access to our data warehouse
Use-case scenario - planning a BigQuery ACL on an e-commerce organization
Column-level security in BigQuery
Practicing the concept of IaC using Terraform
Exercise - creating and running basic Terraform scripts
Self-exercise - managing a GCP project and resources using Terraform
Summary
Cost Strategy in GCP
Technical requirements
Estimating the cost of your end-to-end solution in GCP
Comparing BigQuery on-demand and flat-rate
Example - estimating data engineering use case
Tips for optimizing BigQuery using partitioned and clustered tables
Clustered tables
Exercise - optimizing BigQuery on-demand cost
Partitioned tables
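A minimal sketch, assuming hypothetical table and column names, of recreating a table partitioned by date and clustered by a frequently filtered column, so on-demand queries scan fewer bytes.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table names; the DDL rebuilds the table partitioned by date
# and clustered on a commonly filtered column
ddl = """
CREATE OR REPLACE TABLE `my-project.my_dataset.trips_optimized`
PARTITION BY DATE(start_time)
CLUSTER BY station_id AS
SELECT * FROM `my-project.my_dataset.trips`
"""
client.query(ddl).result()

# With on-demand pricing, filtering on the partition column limits the scan:
# SELECT * FROM trips_optimized WHERE DATE(start_time) = "2021-01-01"
```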
Summary
CI/CD on Google Cloud Platform for Data Engineers
Technical requirements
Introduction to CI/CD
Understanding the data engineer's relationship with CI/CD practices
Understanding CI/CD components with GCP services
Exercise - implementing continuous integration using Cloud Build
Creating a GitHub repository using Cloud Source Repository
Developing the code and Cloud Build scripts
Creating the Cloud Build Trigger
Pushing the code to the GitHub repository
Exercise - deploying Cloud Composer jobs using Cloud Build
Preparing the CI/CD environment
Preparing the cloudbuild.yaml configuration file
Pushing the DAG to our GitHub repository
Checking the CI/CD result in the GCS bucket and Cloud Composer
Summary
Further reading
Boosting Your Confidence as a Data Engineer
Overviewing the Google Cloud Certification
Exam preparation tips
Extra GCP services material
Quiz - reviewing all the concepts you've learned about
Questions
Answers
The past, present, and future of Data Engineering
Boosting your confidence and final thoughts
Summary