Layers (3)
Big data and ML products
History
Challenges
Large datasets
Fast-changing data
Varied data
Need
Indexing the World Wide Web
Sol: inventing new data processing methods
2002:
GFS
: handles data sharing and petabyte storage at scale; served as the foundation of Cloud Storage and of what would become the "managed storage functionality" in BigQuery
2004: (prob: indexing the exploding volume of content on the web) - sol: introducing
MapReduce
: new style of data processing designed to manage large-scale data processing across big clusters of commodity servers.
2005: (prob: recording and retrieving millions of streaming user actions with high throughput) - sol: release of
Cloud Bigtable
2008: (prob: With MapReduce available, some developers were restricted by the need to write code to manage their infrastructure, which prevented them from focusing on application logic), sol: introducing
Dremel
: a new approach to big data processing that breaks the data into smaller chunks called shards and then compresses them. Dremel then uses a query optimizer to share tasks between the many shards of data and the Google data centers, which process queries and deliver results. (The big innovation was that Dremel autoscaled to meet query demands.)
2010:
Colossus
(cluster-level file system and successor to the Google File System) &
BigQuery
(see def under Storage; announced May 2010, generally available Nov 2011)
2012:
Spanner
(see def under Storage)
2015:
Pub/Sub
(def: a service used for streaming analytics and data integration pipelines to ingest and distribute data) &
TensorFlow
2018:
TPU
&
AutoML
(a suite of ML products)
2021:
Vertex AI
(a unified ML platform)
Big data and ML product line
Cloud Storage, Dataproc, Cloud Bigtable, BigQuery, Dataflow, Firestore, Pub/Sub, Looker, Cloud Spanner, AutoML, and Vertex AI
Compute & storage
Compute
Requirements
Prob
Since 2012, the computational power required for ML applications has no longer followed Moore's Law: it doubles roughly every 3.5 months, far outpacing the rate at which CPUs and GPUs improve
Sol
The use of TPUs
Specs
Introduced in 2016
ASICs (application-specific integrated circuits)
Domain-specific hardware (vs GPUs and CPUs: general-purpose hardware)
Accelerate ML workloads (tailoring the architecture to a domain's computation needs, such as matrix multiplication in ML)
Faster & more energy-efficient for AI and ML applications than CPUs and GPUs
Included in Google Cloud products and services
Storage
What
Reduce the time and effort needed to store data
How
By creating an elastic storage bucket directly in a web interface or through the command line (for example, in Cloud Storage); see the sketch below
Offers relational DBs, non-relational DBs, and worldwide object storage
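A minimal sketch of the programmatic path, assuming the google-cloud-storage Python client; the project and bucket names below are placeholders:

from google.cloud import storage  # pip install google-cloud-storage

# Placeholder project; credentials come from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS or gcloud auth).
client = storage.Client(project="my-project")

# Bucket names are globally unique; location and default storage
# class are set at creation time.
bucket = client.bucket("my-example-bucket")
bucket.storage_class = "STANDARD"
new_bucket = client.create_bucket(bucket, location="us-central1")
print(f"Created bucket {new_bucket.name} in {new_bucket.location}")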
Networking & security
Services
Compute & Storage
Compute
Compute engine
Specs
IaaS offering
Compute, storage, and networking delivered virtually, as in a physical data center
Max flexibility
Runs in individual VMs
Google Kubernetes engine (GKE)
Specs
Containerized apps (container = code packaged up with all its dependencies)
Run in a cloud environment
App Engine
Specs
Fully managed PaaS offering
Binds code to the libraries it needs
Allows resources to focus on the app logic
Cloud functions
Specs
FaaS offering
Execute code in response to events
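A minimal sketch of an HTTP-triggered function, assuming the Functions Framework for Python; the function and parameter names are illustrative:

import functions_framework  # pip install functions-framework

# Cloud Functions invokes this handler for each HTTP request;
# other trigger types include Pub/Sub messages and Cloud Storage events.
@functions_framework.http
def hello_http(request):
    # `request` is a Flask Request object.
    name = request.args.get("name", "world")
    return f"Hello, {name}!"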
Cloud run
Specs
Fully managed compute platform
Enables running request- or event-driven stateless workloads
Without having to worry about servers (abstracts away all infrastructure management): automatically scales up and down, etc.
Charges only for the resources used
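A minimal sketch of a Cloud Run-ready service; the only contract assumed is that the container listens on the port passed in the PORT environment variable (Flask is used purely for illustration):

import os
from flask import Flask  # pip install flask

app = Flask(__name__)

@app.route("/")
def index():
    # Stateless handler: Cloud Run adds or removes instances
    # (down to zero) based on incoming request load.
    return "Hello from Cloud Run!"

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))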
Storage
Cloud Bigtable
Specs
best for real-time, high-throughput applications that require only millisecond latency
Cloud SQL
Specs
def: fully managed relational database service (MySQL, PostgreSQL, and SQL Server)
Cloud storage
Specs
Managed service for storing unstructured data (object files)
object: an immutable piece of data consisting of a file of any format
Objects are stored in containers called "buckets"
Buckets are associated with a project
Projects can be grouped under an organization
Each project, bucket, and object in Google Cloud is a resource (as are things such as Compute Engine instances)
App examples
serving website content
storing data for archival
disaster recovery
distributing large data objects to end users via Direct Download
Storage classes
Standard storage (Hot data): 1- best for frequently accessed or "hot" data, 2- also great for data that is stored for only brief periods of time
Nearline storage (Once per month): e.g. data backups, long-tail multimedia content, data archiving
Coldline storage (Once every 90 days at most): another low-cost option for storing infrequently accessed data
Archive storage (Once a year or less): lowest-cost option (higher costs for data access and operations and 365-day minimum storage duration) - used ideally for data archiving, online backup and disaster recovery
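A minimal sketch of uploading an object and later moving it to a colder class, assuming the google-cloud-storage Python client; bucket and object names are placeholders:

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client(project="my-project")   # placeholder project
bucket = client.bucket("my-example-bucket")     # placeholder bucket

# An object ("blob") is immutable: re-uploading under the same name
# creates a new version rather than editing the object in place.
blob = bucket.blob("backups/2024/archive.zip")
blob.upload_from_filename("archive.zip")

# Demote rarely accessed data to a cheaper storage class.
blob.update_storage_class("COLDLINE")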
Cloud Spanner
Specs
def: a globally available and scalable relational database
Firestore
Specs
def: transactional NoSQL, document-oriented database
BigQuery
Specs
def: Google's data warehouse solution: a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data
is a PaaS that supports querying using ANSI SQL
has built-in ML capabilities
Dremel is the query engine behind BigQuery
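A minimal sketch of an analytical query through the BigQuery Python client, run here against a public dataset; the project name is a placeholder (the built-in ML capabilities are likewise exposed through SQL, e.g. CREATE MODEL):

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# A typical OLAP-style aggregation in ANSI SQL; Dremel shards and
# executes it across Google's data centers behind the scenes.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)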
Big data and ML products
Categories
(according to the Data-to-AI workflow)
Storage (5)
Cloud Storage, + relational: Cloud SQL, Cloud Spanner + non-relational: Cloud Bigtable, Firestore
(note: BigQuery, the warehouse, is categorized under Analytics below, although it also provides managed storage)
Analytics (3)
BigQuery
Looker
Looker Studio
ML
ML & AI solutions (4)
(built on the ML development platform)
Document AI
Contact Center AI
Retail Product Discovery
Healthcare Data Engine
ML development platform (4)
Vertex AI
(the primary product, which includes the other 3)
AutoML
Vertex AI Workbench
TensorFlow
Ingestion & process
(products for ingesting both real-time and batch data; see the Pub/Sub sketch after this list)
Dataproc
Dataflow
Cloud Data Fusion
Pub/Sub
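A minimal sketch of publishing an event to Pub/Sub with the Python client; project and topic names are placeholders, and the topic is assumed to already exist:

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# publish() is asynchronous: it returns a future that resolves to the
# server-assigned message ID once the message has been ingested.
future = publisher.publish(topic_path, data=b"page_view", origin="web")
print(future.result())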
Infrastructure
Geographical management
Distribution
Locations (5 major geographic areas)
Regions (34)
Zones
Characteristics
Compute & storage are decoupled (contrary to desktop computing), enabling proper scaling
Multiple compute & storage services are available to meet each app's specific compute & storage needs
Choosing the services to use
Criteria
Type of data
Structured
Def
Tables - rows - columns
Types
Transactional workloads:
stem from Online Transaction Processing (OLTP) systems (used when fast inserts and updates are required to build row-based records, usually to maintain a system snapshot)
Require relatively standardized queries that only impact a few records
Accessed using SQL
better for local/regional scalability
Cloud SQL
global scalability
Cloud Spanner
Accessed without SQL (NoSQL; see the sketch after this list)
Firestore
Analytical workloads:
stem from Online Analytical Processing (OLAP) systems (used when entire datasets need to be read)
Often require complex queries (e.g. aggregations)
Accessed using SQL
BigQuery
Accessed without SQL (NoSQL)
Cloud Bigtable
(best for real-time, high-throughput applications that require only millisecond latency)
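A minimal sketch contrasting the two NoSQL picks above, assuming the Firestore and Bigtable Python clients; all project, instance, and table names are placeholders (the Bigtable column family is assumed to exist):

from google.cloud import firestore  # pip install google-cloud-firestore
from google.cloud import bigtable   # pip install google-cloud-bigtable

# Transactional, document-oriented workload: Firestore.
db = firestore.Client(project="my-project")
db.collection("users").document("alice").set({"plan": "pro", "credits": 10})

# High-throughput, millisecond-latency workload: Bigtable.
bt = bigtable.Client(project="my-project")
table = bt.instance("my-instance").table("events")
row = table.direct_row(b"user#alice#2024-01-01")
row.set_cell("stats", b"clicks", b"42")  # family "stats" must exist
row.commit()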
Unstructured
Services to use
Usually suited to
Cloud Storage
BigQuery
now offers the possibility to store unstructured data as well
Def
Documents
Images
Audio files
Business need