AWS ML Certification - Coggle Diagram

- - - - Generalization vs Memorization. Algorithms must be unambiguous and must guarantee a reproducible outcome
  - - - Linear Learner can be used for both regression and classification. Use validation metric for parameter tuning
      - Factorization Machines can be used for binary classification or regression on high-dimensional sparse datasets. Only considers pair-wise features. Supports only Float32 tensors. Needs 10k+ samples (a lot of data). CPUs are preferable
      - K-Means needs tabular data; features can be selected. CPUs are preferable
      - KNN classification (lazy). The whole dataset lives in memory.
      - Image
        
        Image Classification. Utilizes the open source ImageNet labeled dataset.
        
        Object Detection. CNN. Use case: Metadata extraction
        
        Semantic Segmentation. Identifies edges. Use case: Computer vision (damaged details for example)
        
        AWS managed. Rekognition (has many features including facial expressions)
      - Anomaly Detection
        
        Random Cut Forest. Finds deviation > 3 STDs from mean. Assigns anomaly score to samples. CPU is preferable. Use case: quality control of the speaker (noise frequency detection), fraud detection
        
        IP Insights. Flags odd online behavior using NN. Returns anomaly score. GPU for training is recommended. CPU for inference is ok.
      - Text
        
        Topic modeling. Unsupervised. Can be used for text comparison
        
        Latent Dirichlet Allocation (LDA).
        
        Neural Topic Model (NTM) requires bag of words (token counts)
        
        Seq2seq (NN). Can be used for translation. Only GPU supported for training. Commonly used pretrained. Can be used for speech to text (Transcribe)
        
        AWS BlazingText. Implementation of Word2vec (unsupervised). Determine semantic relationships between words. Text classification (supervised). Expects single file with 1 line / sentence. Scalable, 20x faster than FastText.
        
        AWS Comprehend (managed) - sentiment analysis. AWS Macie (managed) - document classification.
        
        Object2Vec. Maps similar things (customers, movies, reviews, music albums, etc). Expects pairs of things. Can be used for feature engineering. Training required. In File mode the samples are lines and labels should be prefixed with label_ . In Pipe mode it should be JSON per line with sentence in source tag and label in label tag.
      - Reinforcement
        
        Agent + Environment. Goal: maximize reward via Markov decision process depending on state - coming up with successful policy.
        
        DeepRacer. Use case: self-driving cars. Learn heating in buildings
      - Forecasting
        
        DeepAR. Use time-series (>3000 observations). Use case: action value, sales prediction.
      - Ensemble
        
        XGBoost. Gradient boosted trees. 2 required, 35 optional hyperparameters. Accepts CSV. Only trains on CPU and needs to hold the whole dataset in memory (a lot of memory recommended). Spark integration. Can be used for regression, classification, ranking and anomaly detection
  - - - For time series just slice the last couple of months for testing
      - Avoid data bias - NEVER do sequential split
      - If k-Fold cross-validation errors are roughly equal the data is randomized well
      - BIAS
        
        Data (wrong split or exclusion)
        
        Feedback (heuristic assumption can skew possible outcomes)
        
        Calculate: RMSE (residual is actual - predicted)
    - - Stochastic Gradient Descent
    - - Hardware
        
        Application-specific integrated circuits (ASIC, most performant) > GPU > field-programmable gate arrays (FPGA) > CPU (least)
        
        CPU, GPU (most flexible) > FPGA > ASIC (least)
        
        FPGA, ASIC (most costly) > GPU > CPU
      - at least 2 EC2 instances for availability
      - Elastic inference adds GPU
    - - In transfer learning you keep the NN features, remove only last layer (classifier) and retrain it with your data set
  - - - Drop out
      - L1/L2
  - - - Underfitting
        
        More data / train longer
        
        both Train and Test data high
      - Overfitting
        
        Early stopping / drop out / more data / data augmentation / regularization / add noise / feature selection
        
        Train error low test error high
    - - Type 1 error = False Positive
      - F1 = 2 * Pr * Rec / (Pr + Rec)
      - Macro average F1 score is class-averaged F1 score - for multiclass
      - For binary classification we want to maximize AUC-ROC
    - - Offline is K-fold. Online is real world (canary or A/B deployment, where a small portion of traffic is sent to the new version and the versions are compared on real-world results)
    - - Custom metrics from the custom algorithm can also be passed to CloudWatch using the metric_definitions parameter
  - - - Ensure the extraction methods used to generate the training datasets are the same as for the production inference data.
      - Ensure the target variable used as the predictor during training represents the actual outcome that the machine learning model is trying to predict.
      - Ensure the training datasets are large, representative samples of the populations that the model needs to make predictions.
      - Ensure that data counts, duration, and precision distributions are the same in the training, testing, validation and inference data
    - - Use Pipe input (recordIO-Protobuf)
    - - Never use accuracy as the metric; use F1 score. Fix the unbalanced dataset using resampling (preferably oversampling using SMOTE, the Synthetic Minority Over-sampling Technique).
    - - Tune fewer parameters over a narrower range (helps to find the minimum)
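
The F1 formula and the advice against accuracy on unbalanced data noted in this branch can be illustrated with a minimal sketch. The toy labels and the helper names (`precision_recall_f1`, `macro_f1`, `accuracy`) are hypothetical and chosen only for illustration; they are not part of any AWS API.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # Type 1 errors
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Class-averaged F1: compute F1 per class, then take the plain mean."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy unbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class.
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("F1 (positive class):", precision_recall_f1(y_true, y_pred)[2])
print("macro F1:", macro_f1(y_true, y_pred))
```

The majority-class model looks strong on accuracy while its F1 on the minority class is zero, which is why F1 is the safer metric here.
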
- - - - CloudTrail and CloudWatch
      - Build error monitoring
    - - Instances
      - Provisioned IOPS
      - Volumes
  - - - Polly (text to speech, chat bot)
      - Lex (conversational AI, chat bot)
      - Transcribe (speech to text)
      - Comprehend (NLP, clustering, classification, sentiment analysis)
      - Forecast (time series)
      - Personalize (recommendation service)
      - Rekognition (image / object detection, facial recognition, door lock)
      - Textract (as LDA extract data / metadata from scans)
      - Translate (seq2seq)
    - - You can use your own container (the container image should be in Amazon ECR; model artifacts go on S3)
    - - Using spot instances to train deep learning models using AWS Batch
      - ! Use Elastic Inference to add GPU acceleration to EC2 / SageMaker instances
      - EC2 instance types
        
        General Purpose (T, M, A)
        
        Compute Optimized (C)
        
        Memory Optimized (R, X)
        
        Accelerated Computing (P, G, F, Trn, Inf, VT)
        
        Storage Optimized (I)
  - - - Identity based policy
      - Resource based policy
    - - VPC endpoint / NACL / Security Groups provides endpoint visibility
      - IAM provides authentication and access control
      - Encryption is provided by KMS
    - - S3 VPC endpoints can reduce egress cost and improve security
    - - SageMaker notebook instance can be encrypted if supplied with a KMS key upon creation. The same can be done for training and inference jobs
    - - CloudWatch! Metrics available for 15 months (2 weeks in console), near real-time (1 min delay)
      - stdout + stderr in sagemaker logs
      - Possible to trigger event for model retraining
      - CloudTrail - keeps 90 days (but if you send to S3 then indefinitely)
  - - - Create training job -> model artifact on S3 -> CreateModel API (specify the artifact) -> Create Endpoint Config (list of instance types + initial weights) -> CreateEndpoint -> InvokeEndpoint()
      - Models can be versioned
      - Batch Transform Job (for example for transformation of the entire dataset): Create Model -> CreateTransformJob same, but the result dropped in S3 (containers can be combined with Inference Pipelines to chain the containers)
      - SageMaker Neo enables optimization of an ML model per target system architecture
      - Elastic Inference speeds up throughput of real time CPU inferences (specify when you deploy the model)
      - Automatic Scaling (same as EC2 ASG): min, max, target, cool-down
      - High availability - at least 2 instances in diff AZ
    - - AWS allows generating a new model version when the model is retrained with new data
    - - Big Bang (high risk low time)
      - Phased Rollout (A/B testing, canary), optimal. For example if LB serves requests to X old servers and Y new
      - Parallel adoption (2 running systems, low risk, most time)
      - Self-deployment
        
        EC2
        
        ECS
        
        Fargate
        
        EMR
        
        Locally
    - - Detect and mitigate drop in performance
      - Monitor performance of the model
    - - Continuous integration. Often merge with main branch.
      - Continuous delivery. Release by a single click
      - Continuous Deployment. Automatic release - all unit tests, integration tests, acceptance tests, deployment and smoke testing are done automatically
- - - - Types
        
        Missing Completely at Random (MCAR)
        
        Missing not at Random (MNAR)
        
        Missing at random (MAR)
      - Remedy
        
        Supervised learning (predict based on present features) (least amount of bias), known as multiple data imputations
        
        Mean / median / mode (most common)
        
        Drop (may bias the dataset)
    - - numeric features
        
        Standardization: z = (x - mean) / sd (z-score)
        
        Normalization (0 ... 1) ! be aware of outliers
        
        Binning (group by feature intervals of equal length)
        
        Quantile binning (group in parts of equal size)
      - ETL
        
        Glue.
        
        Input data can be in S3, DynamoDB, RDS, Redshift, EC2 etc. Then a crawler creates the data catalog (finds schema). Then you can run a Spark job (one-time / scheduled) using custom Scala / Python transformation code and save the results. Alternatively you can run a Python shell job using custom Python code (traditional). Ad hoc transformations can be done with Jupyter or Zeppelin notebooks.
        
        ! Crawler requires a role to access S3 (also needs correct credentials in case of a JDBC)
        
        Encryption. Data store metadata can be encrypted as well as S3 and CloudWatch log entries made by Glue. Each ETL job has encryption options. It is possible to require SSL for JDBC. Supports only symmetric master keys (CMK)
        
        SageMaker. Jupyter notebooks can do python transformations. Can be patched with pip
        
        EMR. As in any Hadoop cluster
        
        Athena. Can do SQL like transformation of the data catalog (Glue)
        
        Data Pipeline can run custom scripts (Java) on EC2 by directing data there and storing the results
    - - SageMaker Ground Truth - for creation of labeled data. It is important to include the default / generic classification class for cases that can not be properly classified. Mechanical Turk - for data labeling by AWS workers.
      - Ad hoc rule: Number of samples should exceed 10x number of model parameters
  - - - text
        
        Bag-of-Words (each word is associated with its count)
        
        N-Gram (text split into groups of N consecutive words, called representations), for example bigram = 2-gram
        
        Orthogonal Sparse Bigram (OSB, takes the first word and then another word separated by 1...N, with the number of underscores equal to the number of skipped words)
        
        Term Frequency - Inverse Document Frequency (TF-IDF; TF = how frequently a term comes up, DF = in how many documents it appears). Vector (2,7) means that the term comes up in 2 documents in 7 n-grams. Helps to filter out common stop words such as "the" and "and"
        
        Transform: REMOVE punctuation, REMOVE stop words and make LOWERCASE and TOKENIZE text
        
        Transform: Cartesian Product (concatenates features of two text columns)
        
        Transform: Standardization of dates, info about (weekday, month, etc.)
      - image / speech
        
        Convolution (as in CNN) using filters
        
        AWS Rekognition
    - - categorical encoding
        
        nominal (colors: non-comparable red, green, etc). Use 0 and 1 for binary and one-hot encoding for multiclass. Other ways for multiclass - grouping (many cases) and mapping rare values to "other"
        
        ordinal (L, M, S sizes) - replace by placeholder values (can be optimized)
      - Dimensionality reduction
        
        Heuristic removal of features (using common sense)
        
        PCA (too many features). Will change present features
      - use Random Cut Forest to detect outliers
  - - - Relationships
        
        Scatter plot. Correlations can be identified
        
        Bubble plot (3d)
      - QuickSight (managed)
      - Comparison
        
        Bar chart
        
        Line chart (over-time change)
      - Distributions
        
        Histogram
        
        Box plot (multiple histograms)
        
        Scatter Plots / heat map
      - Compositions
        
        Pie chart
        
        Stacked area chart (time - series)
        
        Stacked column chart
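
The numeric feature transforms listed earlier in this branch (standardization, normalization, binning and quantile binning) can be sketched in plain Python. The function names and the sample values are hypothetical; the sample includes one outlier to show why the notes warn about outliers when normalizing.

```python
import statistics


def standardize(values):
    """z-score: (x - mean) / sd, using the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("standard deviation is zero; feature is constant")
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Min-max scaling to the range 0 ... 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Binning: split the value range into n_bins intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins):
    """Quantile binning: assign by rank so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


sample = [1, 2, 3, 4, 100]  # 100 is an outlier
print(standardize(sample))
print(normalize(sample))            # outlier squeezes the rest near 0
print(equal_width_bins(sample, 2))  # almost everything falls in bin 0
print(quantile_bins(sample, 2))     # counts stay balanced despite the outlier
```

With the outlier present, equal-width binning puts four of five values in the same bin, while quantile binning keeps the groups close to equal size.
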
- - - - S3, unlimited, max 5TB / file, bucket names unique, upload via Console, CLI or SDK
      - RDS (schema), DynamoDB (key-value), Redshift (data warehouse, can be queried by Redshift Spectrum), Timestream (time-series DB), DocumentDB (instead of MongoDB)
    - - DATA TYPES: Structured (DB), unstructured, semi-structured (CSV, JSON, XML), also labeled / unlabeled, categorical / continuous, image / text / quantity, fixed / time-series
      - Database (schema, transactional), Data warehouse (processing important with user in mind, ready for BI), data lake (no processing, for historical, not clear data)
    - - Data Pipeline (can transfer data between RDS, DynamoDB and Redshift, can also convert data types with SQL queries) ! not serverless
      - DMS (migrate data between relational DBs and can also store in S3). DBs can be on premise, EC2 or AWS RDS. Does not transform - only rename columns ! serverless
      - Glue (ETL service). Includes crawlers that use classifiers (such as regular expressions) to determine schema and data type.
      - Athena (serverless tool for complex SQL queries from any DBs) - no need for Redshift
  - - - Batch (offline in portions), stream (online)
      - DS type
        
        File (S3, CSV, JSON, Parquet, image, etc will be copied on EC2 during training)
        
        Pipe (will be streamed from S3, no copying). Faster! Use the recordIO-protobuf (tensor)
    - - Kinesis
        
        Kinesis Data Streams
        
        Pipeline: Data Producers (JSON), Streams (Shards), Consumers (EC2, Lambda, Kinesis Data Analytics, EMR, etc) -> Storage (optional)
        
        ! can not store directly to S3
        
        Shard (partition key, sequence, data (up to 1 MB)), max 500 (can be upgraded), max 1000 records / s, retention 24h (up to 7 days) ! DATA RETENTION
        
        KPL (producer lib for retrial, aggregation, delay, Java wrapper), KCL (client lib, process), API (hard, low level, PutRecords, GetRecords, real-time)
        
        Use cases: real-time log or clickstream analytics
        
        Encryption via KPL and KCL using KMS service
        
        Firehose
        
        Pipeline: Data Producers, Processing (Lambda, optional), Storing (S3, Redshift). S3 event may be used to push in DynamoDB, data retention is not important
        
        Data Analytics
        
        Real-time complex SQL/ Apache queries and store in S3 or Redshift. Can create metrics, dashboards, monitoring, notification and alarms
        
        Video Streams
        
        Producers (Video, Image, Audio) - Consumers (EC2, EMR) - Storage (S3, DynamoDB, etc). For example web cam for door security. ! pairs well with AWS Rekognition
      - EMR
        
        Look out for POWER, THROUGHPUT, as it is a cluster
        
        Supports all major frameworks for distributed workloads
      - Glue
        
        Look out for operational flexibility
        
        Go-to tool for data transformation jobs
      - Redshift
        
        Add data with either COPY (faster) or INSERT from S3, EMR, DynamoDB or SSH
    - - Can be done as a recurrent CloudWatch event with Lambda or with AWS Batch
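
The Kinesis Data Streams shard limits noted in this branch (max 1000 records / s per shard) suggest a rough sizing calculation. The sketch below is a back-of-the-envelope estimate only. The function name `shards_needed` is hypothetical, and the 1 MB / s per-shard write throughput is an assumption that does not appear in the notes above, so check it against current service quotas before relying on it.

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1000  # from the notes: max 1000 records / s
MB_PER_SHARD_PER_SEC = 1.0        # assumption: write throughput per shard


def shards_needed(records_per_sec, avg_record_kb):
    """Estimate shard count from whichever limit is hit first: count or bytes."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_sec / MB_PER_SHARD_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_count, by_bytes)


# Many small records: the record-count limit dominates.
print(shards_needed(records_per_sec=2500, avg_record_kb=0.5))
# Fewer but larger records: the byte-throughput limit dominates.
print(shards_needed(records_per_sec=500, avg_record_kb=10))
```

Sizing by the larger of the two limits avoids throttling when either record count or payload size grows.
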