AWS ML Certification
Modeling
Frame business problems as machine learning problems.
Determine when to use/when not to use ML
Generalization vs Memorization. Algorithms must be unambiguous and must guarantee a reproducible outcome
Know the difference between supervised and unsupervised learning
Selecting from among classification, regression, forecasting, clustering, recommendation, etc
Select the appropriate model(s) for a given machine learning problem.
XGBoost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning
Linear Learner can be used for both regression and classification. Use validation metric for parameter tuning
Factorization Machines can be used for binary classification or regression on high-dimensional sparse datasets. Only considers pair-wise features. Supports only Float32 tensors. Needs 10k+ samples (a lot of data). CPUs are preferable
K-Means needs tabular data; features can be selected. CPUs are preferable
KNN classification (lazy). The whole dataset lives in memory.
Image
Image Classification. Utilizes the open source ImageNet labeled dataset.
Object Detection. CNN. Use case: Metadata extraction
Semantic Segmentation. Identifies edges. Use case: Computer vision (damaged details for example)
AWS managed. Rekognition (has many features including facial expressions)
Anomaly Detection
Random Cut Forest. Finds deviation > 3 STDs from mean. Assigns anomaly score to samples. CPU is preferable. Use case: quality control of the speaker (noise frequency detection), fraud detection
IP Insights. Flags odd online behavior using NN. Returns anomaly score. GPU for training is recommended. CPU for inference is ok.
Text
Topic modeling. Unsupervised. Can be used for text comparison
Latent Dirichlet Allocation (LDA).
Neural Topic Model (NTM) requires bag of words (token counts)
Seq2seq (NN). Can be used for translation. Only GPU supported for training. Commonly used pretrained. Can be used for speech to text (Transcribe)
AWS BlazingText. Implementation of Word2vec (unsupervised). Determine semantic relationships between words. Text classification (supervised). Expects single file with 1 line / sentence. Scalable, 20x faster than FastText.
AWS Comprehend (managed) - sentiment analysis. AWS Macie (managed) - document classification.
Object2Vec. Maps similar things (customers, movies, reviews, music albums, etc). Expects pairs of things. Can be used for feature engineering. Training required. In File mode the samples are lines and labels should be prefixed with label_ . In Pipe mode it should be JSON per line with sentence in source tag and label in label tag.
Reinforcement
Agent + Environment. Goal: maximize reward via Markov decision process depending on state - coming up with successful policy.
DeepRacer. Use cases: self-driving cars, learning heating control in buildings
Forecasting
DeepAR. Use time-series (>3000 observations). Use case: action value, sales prediction.
Ensemble
XGBoost. Gradient boosted trees. 2 required, 35 optional hyperparameters. Accepts CSV. Only trains on CPU and needs to hold the whole dataset in memory (a lot of memory recommended). Spark integration. Can be used for regression, classification, ranking and anomaly detection
Express intuition behind models
Train machine learning models.
Train validation test split, cross-validation
For time series just slice off the last couple of months for testing
Avoid data bias - NEVER do sequential split
If k-Fold cross-validation errors are roughly equal the data is randomized well
BIAS
Data (wrong split or exclusion)
Feedback (heuristic assumption can skew possible outcomes)
Calculate: RMSE (residual is actual - predicted)
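The splitting advice above (shuffle before splitting, k-fold cross-validation, holding out the last months of a time series) can be sketched in plain Python. This is a minimal illustration, not a SageMaker feature; the function names k_fold_indices and time_series_split are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle first so that folds are not sequential slices of the data.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def time_series_split(series, n_test):
    """Hold out the most recent n_test observations for testing (no shuffling)."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]

for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
print(time_series_split(list(range(10)), 3))
```

If the per-fold errors computed on these splits are roughly equal, that matches the note above about well-randomized data.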
Optimizer, gradient descent, loss functions, local minima, convergence, batches, probability, etc
Stochastic Gradient Descent
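As an illustration of stochastic gradient descent with mini-batches, here is a minimal sketch that fits y = w*x + b by minimizing mean squared error. The function name, learning rate, batch size and toy data are assumptions chosen for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=500, batch_size=2, seed=0):
    """Fit y = w*x + b with mini-batch stochastic gradient descent on MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # a new random batch order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the mean squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
w, b = sgd_linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(round(w, 3), round(b, 3))
```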
Compute choice (GPU vs. CPU, distributed vs. non-distributed, platform [Spark vs. non-Spark])
Hardware
Performance: application-specific integrated circuits ASIC (most performant) > GPU > field-programmable gate arrays FPGA > CPU (least)
Flexibility: CPU, GPU (most flexible) > FPGA > ASIC (least)
Cost: FPGA, ASIC (most costly) > GPU > CPU
at least 2 EC2 instances for availability
Elastic inference adds GPU
Model updates and retraining
In transfer learning you keep the NN features, remove only last layer (classifier) and retrain it with your data set
Batch vs. real-time/online
! Your script MUST be on the notebook instance in the SageMaker
Can be done either with the high-level Python library or with the Python SDK. After you send the CreateTrainingJob API call, it spins up the algorithm container from the container registry (:1 for stable, :latest for latest). It makes sense to use a large instance for training. Alternatively, it can spin up your custom container from ECR (for GPU use Nvidia Docker). The container accesses data from S3 and saves the model artifact to S3. The CreateInferenceJob API call spins up the inference algorithm container (a small instance is OK), loads the model artifact there and exposes the inference API endpoint.
CloudWatch logs capture all important parameters and can be used for debugging
Training with Spark. First Spark DataFrame has to be converted to protobuf and loaded to S3. Then similar to conventional model training. The final inference model can also interact with Spark DataFrame
Perform hyperparameter optimization
Regularization
Drop out
L1/L2
Cross validation
Model initialization
Neural network architecture (layers/nodes), learning rate, activation functions
Tree-based models (# of trees, # of levels)
Linear models (learning rate)
Choose tunable hyperparameter -> choose range -> choose the objective metric. Use Bayesian Optimization for more productive tuning
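The tuning loop above (choose hyperparameter, choose range, choose objective metric) can be sketched with plain random search. Note that this is random search, a simpler stand-in for the Bayesian optimization mentioned above; random_search and toy_objective are hypothetical names, and the toy objective only imitates a validation metric.

```python
import random

def random_search(objective, ranges, n_trials=50, seed=0):
    """Sample hyperparameters uniformly from ranges and keep the best score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        score = objective(params)  # objective metric to maximize
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Stand-in for a validation metric; peaks at learning_rate = 0.1."""
    return -(params["learning_rate"] - 0.1) ** 2

best_params, best_score = random_search(toy_objective, {"learning_rate": (0.001, 0.5)})
print(best_params, best_score)
```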
Evaluate machine learning models
Avoid overfitting/underfitting (detect and handle bias and variance)
Underfitting
More data / train longer
Both train and test error high
Overfitting
Early stopping / drop out / more data / data augmentation / regularization / add noise / feature selection
Train error low test error high
Metrics (AUC-ROC, accuracy, precision, recall, RMSE, F1 score)
Type 1 error = False Positive
F1 = 2 * Pr * Rec / (Pr + Rec)
Macro average F1 score is class-averaged F1 score - for multiclass
For binary classification we want to maximize AUC-ROC
Confusion matrix
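The metrics listed above can be computed directly from confusion-matrix counts. The following is a minimal sketch with hypothetical function names; RMSE uses the residual definition from these notes (actual - predicted).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error; residual = actual - predicted."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Class-averaged F1 score for multiclass labels."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        scores.append(precision_recall_f1(tp, fp, fn)[2])
    return sum(scores) / len(scores)

print(precision_recall_f1(tp=8, fp=2, fn=4))
print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```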
Offline and online model evaluation, A/B testing
Offline is K-fold. Online is real world (canary or A/B deployment, where a small portion is sent to the new version and compare them on real world results)
Compare models using metrics (time to train a model, quality of model, engineering costs)
Custom metrics from a custom algorithm can also be passed to CloudWatch using the metric_definitions parameter
Cross validation
Debug
Poor inference accuracy (DATA problem)
Ensure the extraction methods used to generate the training datasets are the same as for the production inference data.
Ensure the target variable used as the predictor during training represents the actual outcome that the machine learning model is trying to predict.
Ensure the training datasets are large, representative samples of the populations that the model needs to make predictions.
Ensure that data counts, duration, and precision distributions are the same in the training, testing, validation and inference data
Model training takes too long
Use Pipe input (recordIO-Protobuf)
Unbalanced dataset
Never use accuracy as the metric, use F1 score. Fix the unbalanced dataset using resampling (preferably oversampling using SMOTE (Synthetic Minority Over-sampling Technique)).
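As a sketch of resampling, the snippet below balances classes by plain random oversampling (duplicating minority samples). This is deliberately simpler than SMOTE, which synthesizes new points by interpolating between neighbours; random_oversample is a hypothetical helper name.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples, labels = random_oversample(["a1", "a2", "a3", "a4", "b1"], ["a", "a", "a", "a", "b"])
print(Counter(labels))
```

Oversampling should be applied to the training split only, so that duplicated samples do not leak into the test data.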
Parameter tuning takes too long
Tune fewer parameters over narrower ranges (helps to find the minimum)
Machine Learning
Implementation and Operations
Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance
AWS environment logging and monitoring
CloudTrail and CloudWatch
Build error monitoring
Multiple regions, Multiple AZs
AMI/golden image
Docker containers
Auto Scaling groups
Rightsizing
Instances
Provisioned IOPS
Volumes
Load balancing
AWS best practices
Recommend and implement the appropriate machine learning services and features for a given problem
ML on AWS (application services)
Polly (text to speech, chat bot)
Lex (conversational AI, chat bot)
Transcribe (speech to text)
Comprehend (NLP, clustering, classification, sentiment analysis)
Forecast (time series)
Personalize (recommendation service)
Rekognition (image / object detection, facial recognition, door lock)
Textract (as LDA extract data / metadata from scans)
Translate (seq2seq)
AWS service limits
Build your own model vs. SageMaker built-in algorithms
You can use your own container (its Docker image should be in ECR)
Infrastructure: (spot, instance types), cost considerations
Using spot instances to train deep learning models using AWS Batch
! Use Elastic Inference to add GPU acceleration to EC2 and SageMaker instances
EC2 instance types
General Purpose (T, M, A)
Compute Optimized (C)
Memory Optimized (R, X)
Accelerated Computing (P, G, F, Trn, Inf, VT)
Storage Optimized (I)
Apply basic AWS security practices to machine learning solutions.
IAM
S3 bucket policies
Identity based policy
Resource based policy
infrastructure
VPC endpoint / NACL / Security Groups provides endpoint visibility
IAM provides authentication and access control
Encryption is provided by KMS
VPC
S3 VPC endpoints can reduce egress cost and improve security
Encryption/anonymization
A SageMaker notebook instance can be encrypted if supplied with a KMS key upon creation. The same can be done for training and inference jobs
Notebook instances are internet-facing by default. A notebook can run with direct internet access disabled, using a NAT gateway in the VPC for outbound traffic. !!! AWS SageMaker needs iam:PassRole in order to call CreateModel
If we want to create API for general public we can set up API gateway, that redirects InvokeEndpoint method to the model
Monitoring
CloudWatch! Metrics available for 15 months (2 weeks in console), near real-time (1 min delay), stdout + stderr in SageMaker logs
Possible to trigger event for model retraining
CloudTrail - keeps 90 days (but if you send to S3 then indefinitely)
Deploy and operationalize machine learning solutions.
Exposing endpoints and interacting with them
Create inference job -> model artifact on S3 -> CreateModel API (specify the artifact) -> Create Endpoint Config (list of type of instances + initial weights) -> CreateEndpoint -> InvokeEndpoint()
Models can be versioned
Batch Transform Job (for example for transformation of the entire dataset): Create Model -> CreateTransformJob same, but the result dropped in S3 (containers can be combined with Inference Pipelines to chain the containers)
SageMaker Neo enables optimization of an ML model per target system architecture
Elastic Inference speeds up throughput of real time CPU inferences (specify when you deploy the model)
Automatic Scaling (same as EC2 ASG): min, max, target, cool-down
High availability - at least 2 instances in diff AZ
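The endpoint flow above (CreateModel -> Create Endpoint Config -> CreateEndpoint -> InvokeEndpoint) could look roughly like this with boto3. This is a sketch under assumptions: sm and runtime are expected to be clients created by the caller with boto3.client("sagemaker") and boto3.client("sagemaker-runtime"), and the function name, variant name, instance type and content type are placeholders rather than values from these notes.

```python
def deploy_and_invoke(sm, runtime, name, image_uri, model_data_url, role_arn, payload):
    """Create a model, an endpoint config and an endpoint, then send one request."""
    sm.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,  # the caller needs iam:PassRole for this role
    )
    sm.create_endpoint_config(
        EndpointConfigName=name + "-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",      # placeholder variant name
            "ModelName": name,
            "InstanceType": "ml.t2.medium",   # placeholder instance type
            "InitialInstanceCount": 2,        # at least 2 instances for availability
            "InitialVariantWeight": 1.0,
        }],
    )
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name + "-config")
    # Endpoint creation is asynchronous; wait until it is in service before invoking.
    sm.get_waiter("endpoint_in_service").wait(EndpointName=name)
    response = runtime.invoke_endpoint(
        EndpointName=name, ContentType="text/csv", Body=payload
    )
    return response["Body"].read()
```

Passing the clients in as arguments keeps the function free of credentials and region handling, and makes it possible to exercise the call order with stub objects.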
ML model versioning
AWS allows to generate new model version if we retrain the model with new data
Deployment
Big Bang (high risk low time)
Phased Rollout (A/B testing, canary), optimal. For example if LB serves requests to X old servers and Y new
Parallel adoption (2 running systems, low risk, most time)
Self-deployment
EC2
ECS
Fargate
EMR
Locally
Retrain pipelines
ML debugging/troubleshooting
Detect and mitigate drop in performance
Monitor performance of the model
CI / CD
Continuous integration. Often merge with main branch.
Continuous delivery. Release by a single click
Continuous Deployment. Automatic release - all unit test, integration tests, acceptance tests, deployment and smoke testing are done automatically
Exploratory Data Analysis / Data Preparation
Sanitize and prepare data for modeling
Identify and handle missing data, corrupt data, stop words, etc
Types
Missing Completely at Random (MCAR)
Missing not at Random (MNAR)
Missing at random (MAR)
Remedy
Supervised learning (predict based on present features) (least amount of bias), known as multiple data imputations
Mean / median / mode (most common)
Drop (may bias the dataset)
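The simple remedies above (mean / median / mode, or dropping) can be sketched with the standard library. Missing values are represented as None here, which is an assumption of this example; impute and drop_missing are hypothetical names.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the present values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](present)
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Drop missing entries (may bias the dataset if values are not MCAR)."""
    return [v for v in values if v is not None]

print(impute([1, None, 3], "mean"))
print(impute([1, 1, 2, None], "mode"))
print(drop_missing([1, None, 3]))
```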
Formatting, normalizing, augmenting, and scaling data
numeric features
Standardization (x - mean) / sd = z-score
Normalization (0 ... 1) ! be aware of outliers
Binning (group by feature intervals of equal length)
Quantile binning (group in parts of equal size)
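The four numeric transformations above can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the values are not all identical (otherwise the divisions fail). The example data includes one outlier to show how it squeezes min-max normalization and equal-width bins.

```python
import statistics

def standardize(values):
    """z-score: (x - mean) / sd, using the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max_normalize(values):
    """Scale to the range 0 ... 1; a single outlier compresses all other values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Binning: group by feature intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Quantile binning: group into parts of (nearly) equal size by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

data = [1, 2, 3, 4, 5, 100]
print(min_max_normalize(data))
print(equal_width_bins(data, 3))
print(quantile_bins(data, 3))
```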
ETL
Glue.
Input data can be in S3, DynamoDB, RDS, Redshift, EC2 etc. Then a crawler creates the data catalog (finds the schema). Then you can run a Spark job (one-time / scheduled) using custom Scala / Python transformation code and save the results. Alternatively you can run a Python shell job using custom Python code (traditional). Ad hoc transformations can be done with Jupyter or Zeppelin notebooks.
! Crawler requires a role to access S3 (also needs correct credentials in case of a JDBC)
Encryption. Data store metadata can be encrypted as well as S3 and CloudWatch log entries made by Glue. Each ETL job has encryption options. It is possible to require SSL for JDBC. Supports only symmetric master keys (CMK)
SageMaker. Jupyter notebooks can do python transformations. Can be patched with pip
EMR. As in any Hadoop cluster
Athena. Can do SQL like transformation of the data catalog (Glue)
Data Pipeline can run custom scripts (Java) on EC2 by directing data there and storing the results
Labeled data (recognizing when you have enough labeled data and identifying mitigation strategies [Data labeling tools (Mechanical Turk, manual labor)])
SageMaker Ground Truth - for creation of labeled data. It is important to include the default / generic classification class for cases that can not be properly classified. Mechanical Turk - for data labeling by AWS workers.
Ad hoc rule: Number of samples should exceed 10x number of model parameters
Perform feature engineering.
Identify and extract features from data sets, including from data sources such as text, speech, image, public datasets, etc.
text
Bag-of-Words (each word is associated with its count)
N-Gram (text split into groups of N consecutive words, called representations), for example bigram = 2-gram
Orthogonal Sparse Bigram (OSB, takes the first word and then another word separated by 1...N, with the number of underscores = number of words skipped)
Term Frequency Inverse Document Frequency (Tf-idf, TF = how frequently a term comes up, DF = in how many documents). Vector (2,7) means that the term comes up in 2 documents in 7 n-grams. Helps to filter out common stop words such as "the" and "and"
Transform: REMOVE punctuation, REMOVE stop words and make LOWERCASE and TOKENIZE text
Transform: Cartesian Product (concatenates features of two text columns)
Transform: Standardization of dates, info about (weekday, month, etc.)
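A few of the text transformations above (lowercase, remove punctuation and stop words, tokenize, bag-of-words, n-grams, tf-idf) can be sketched in plain Python. The stop-word list is a tiny illustrative assumption, the function names are hypothetical, and the tf-idf weighting shown (term frequency times log of inverse document frequency) is one common variant, not necessarily the one used by any AWS service.

```python
import math
import string
from collections import Counter

STOP_WORDS = {"the", "and", "a", "is"}  # tiny illustrative list (assumption)

def tokenize(text, remove_stop_words=True):
    """Lowercase, strip punctuation, split on whitespace, optionally drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    return [t for t in tokens if not (remove_stop_words and t in STOP_WORDS)]

def bag_of_words(tokens):
    """Each word is associated with its count."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Groups of n consecutive words; n=2 gives bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

tokens = tokenize("The cat, and the dog!")
print(tokens, bag_of_words(tokens), ngrams(tokens, 2))
print(tf_idf([["cat", "dog"], ["cat", "bird"]]))
```

A term that appears in every document gets weight 0, which is how tf-idf filters out very common words.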
image / speech
Convolution (as in CNN) using filters
AWS Rekognition
Analyze/evaluate feature engineering concepts (binning, tokenization, outliers, synthetic features, 1 hot encoding, reducing dimensionality of data)
categorical encoding
nominal (colors, non-comparable red, green, etc). Use 0 and 1 for binary and one-hot encoding for multiclass. Other ways for multiclass - grouping (many cases) and mapping rare values to "other"
ordinal (L, M, S sizes) - replace by placeholder values (can be optimized)
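The two encodings above can be sketched as follows. The size-to-number mapping is a placeholder assumption (any order-preserving values would do), and the function names are hypothetical.

```python
def one_hot(values):
    """One-hot encode nominal values; returns (rows, categories)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

SIZE_ORDER = {"S": 0, "M": 1, "L": 2}  # placeholder values that preserve the order

def ordinal_encode(values, order=SIZE_ORDER):
    """Replace ordinal categories with order-preserving placeholder numbers."""
    return [order[v] for v in values]

rows, categories = one_hot(["red", "green", "red"])
print(categories, rows)
print(ordinal_encode(["L", "S", "M"]))
```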
Dimensionality reduction
Heuristic removal of features (using common sense)
PCA (too many features). Will change present features
use Random Cut Forest to detect outliers
Analyze and visualize data for machine learning
Graphing (scatter plot, time series, histogram, box plot)
Relationships
Scatter plot. Correlations can be identified
Bubble plot (3d)
QuickSight (managed)
Comparison
Bar chart
Line chart (over-time change)
Distributions
Histogram
Box plot (multiple histograms)
Scatter Plots / heat map
Compositions
Pie chart
Stacked area chart (time - series)
Stacked column chart
Interpreting descriptive statistics (correlation, summary statistics, p value)
Clustering (hierarchical, diagnosing, elbow plot, cluster size)
Data Engineering
Create data repository
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
S3, unlimited, max 5TB / file, bucket names unique, upload via Console, CLI or SDK
RDS (schema), DynamoDB (key-value), Redshift (data warehouse, can be queried by Redshift Spectrum), Timestream (Time-series DB), DocumentDB (instead of MongoDB)
Identify data sources (e.g., content and location, primary sources such as user data)
DATA TYPES: Structured (DB), unstructured, semi-structured (CSV, JSON, XML), also labeled / unlabeled, categorical / continuous, image / text / quantity, fixed / time-series
Database (schema, transactional), Data warehouse (processing important with user in mind, ready for BI), data lake (no processing, for historical, not clear data)
Migration
Data Pipeline (can transfer data between RDS, DynamoDB and Redshift, can also convert data types with SQL queries) ! not serverless
DMS (migrate data between relational DBs and can also store in S3). DBs can be on premise, EC2 or AWS RDS. Does not transform - only rename columns ! serverless
Glue (ETL service). Includes crawlers that use classifiers (such as regular expressions) to determine schema and data type.
Athena (serverless tool for complex SQL queries from any DBs) - no need for Redshift
Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Batch (offline in portions), stream (online)
DS type
File (S3, CSV, JSON, Parquet, image, etc will be copied on EC2 during training)
Pipe (will be streamed from S3, no copying). Faster! Use the recordIO-protobuf (tensor)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads)
Kinesis
Kinesis Data Streams
Pipeline: Data Producers (JSON), Streams (Shards), Consumers (EC2, Lambda, Kinesis Data Analytics, EMR, etc) -> Storage (optional)
! can not store directly to S3
Shard (Partition key, sequence, data (up to 1 MB)), max 500 (can be upgraded), max 1000 records / s, retention 24h (up to 7 days) ! DATA RETENTION
KPL (producer lib for retries, aggregation, delay, Java wrapper), KCL (client lib, process), API (hard, low level, PutRecords, GetRecords, real-time)
Use cases: real-time log or clickstream analytics
Encryption via KPL and KCL using KMS service
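A rough shard estimate for a Kinesis Data Stream can be derived from the per-shard write limits. The sketch below uses the 1000 records / s limit from the notes above plus a 1 MB / s per-shard write limit, which is an assumption taken from the Kinesis quotas rather than from these notes; shards_needed is a hypothetical helper name.

```python
import math

def shards_needed(records_per_second, avg_record_kb):
    """Estimate the shard count from per-shard write limits (1000 records/s, 1 MB/s)."""
    by_count = math.ceil(records_per_second / 1000)
    by_bytes = math.ceil(records_per_second * avg_record_kb / 1024)
    return max(1, by_count, by_bytes)

print(shards_needed(records_per_second=2500, avg_record_kb=0.5))
print(shards_needed(records_per_second=500, avg_record_kb=10))
```

Whichever limit is hit first (record count or bytes) determines the number of shards.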
Firehose
Pipeline: Data Producers, Processing (Lambda, optional), Storing (S3, Redshift). S3 event may be used to push in DynamoDB, data retention is not important
Data Analytics
Real-time complex SQL/ Apache queries and store in S3 or Redshift. Can create metrics, dashboards, monitoring, notification and alarms
Video Streams
Producers (Video, Image, Audio) - Consumers (EC2, EMR) - Storage (S3, DynamoDB, etc). For example web cam for door security. ! pairs well with AWS Rekognition
EMR
Look out for POWER, THROUGHPUT, as it is a cluster
Supports all major frameworks for distributed workloads
Glue
Look out for operational flexibility
Go-to tool for data transformation jobs
Redshift
Add data with either COPY (faster) or INSERT from S3, EMR, DynamoDB or SSH
Job scheduling
Can be done as a recurrent CloudWatch event with Lambda or with AWS Batch