AutoGluon-Tabular, CatBoost, Factorization Machines Algorithm, K-Nearest Neighbors (k-NN) Algorithm, LightGBM, Linear Learner Algorithm, TabTransformer, XGBoost Algorithm
DeepAR Forecasting Algorithm
Object2Vec Algorithm
Random Cut Forest (RCF) Algorithm
IP Insights
Principal Component Analysis (PCA) Algorithm
K-Means Algorithm
Latent Dirichlet Allocation (LDA) Algorithm, Neural Topic Model (NTM) Algorithm
Sequence-to-Sequence Algorithm
BlazingText Algorithm, Text Classification - TensorFlow
Image Classification - TensorFlow
Object Detection - MXNet, Object Detection - TensorFlow
Image Classification - MXNet
Semantic Segmentation Algorithm
Linear Learner Algorithm
In a nutshell: fits a line to your training data. Supervised.
Problem:
- Regression
- Classification (binary or multi-class), using a linear threshold function
Input: tabular
- recordIO-wrapped protobuf (Float32 only)
- CSV (first column assumed to be the label)
- Training data access mode: File or Pipe
How it works:
- Preprocessing
  - Training data should be normalized; Linear Learner can do this automatically
  - Shuffle the data
- Training
  - Uses SGD; choose an optimization algorithm: Adam, AdaGrad, SGD, ...
- Validation
  - Trains multiple models in parallel and chooses the optimal one during the validation step (needs a validation channel)
  - Continuous objectives: mean square error, cross entropy loss, absolute error
  - Discrete objectives (classification): F1 measure, precision, recall, or accuracy
Required hyperparameters:
- predictor_type (binary_classifier, multiclass_classifier, or regressor)
- num_classes (when predictor_type is multiclass_classifier)
Important hyperparameters:
- balance_multiclass_weights - true/false; uses class weights, which give each class equal importance in the loss function
- learning_rate, mini_batch_size
- l1 (L1 regularization), wd (L2 weight decay)
- target_precision, target_recall
Training instance types:
- Single or multi-instance, CPU or GPU
- A multi-GPU instance does not help
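A minimal sketch of a Linear Learner training job with the SageMaker Python SDK; the bucket, role ARN, and hyperparameter values below are placeholders, not part of the original notes:

```python
# Train the built-in Linear Learner from its ECR image.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

image = image_uris.retrieve("linear-learner", session.boto_region_name)
estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # CPU is fine; a multi-GPU instance does not help
    output_path="s3://my-bucket/linear-learner/output",  # placeholder
)
estimator.set_hyperparameters(
    predictor_type="binary_classifier",  # required
    mini_batch_size=200,
    learning_rate=0.01,
)
estimator.fit({
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/val.csv", content_type="text/csv"),
})
```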
XGBoost Algorithm
In a nutshell: a boosted ensemble of decision trees. Supervised.
- Predicts a target variable by combining an ensemble of estimates from a set of simpler, weaker models
- New trees are made to correct the errors of previous trees
- Uses gradient descent to minimize loss as new trees are added
Problem:
- Regression (uses regression trees)
- Classification
- Ranking
Example use cases:
- Credit scoring
- Fraud prevention
- Marketing campaign effectiveness
- Product categorisation
Input: tabular
- libsvm (default), csv, parquet, recordio-protobuf
How it is used:
- As a framework within your SageMaker notebook (sagemaker.xgboost)
- As a built-in SageMaker algorithm (XGBoost image in ECR)
Required hyperparameters:
- num_round - number of rounds to run the training
- num_class (if objective is multi:softmax or multi:softprob)
Important hyperparameters:
- subsample (prevents overfitting) - subsample ratio of the training instances used to grow trees
- eta (prevents overfitting) - step size shrinkage; shrinks the feature weights to make the boosting process more conservative
- gamma - minimum loss reduction required to make a further partition on a leaf node of the tree
- alpha (L1 regularization), lambda (L2 regularization)
- eval_metric - evaluation metric for validation data; defaults:
  - rmse for regression
  - error for classification (binary classification error rate = #(wrong cases)/#(all cases))
  - map for ranking (Mean Average Precision)
- scale_pos_weight - balances positive and negative weights, sum(negative cases)/sum(positive cases); useful for unbalanced classes (e.g. fraud)
- max_depth - maximum depth of a tree
- [objective - specifies the learning task and the related learning objective, e.g. reg:logistic, multi:softmax, reg:squarederror]
Instance types:
- Memory bound
- Supports CPU and GPU for training and inference; choose based on needs and version
- m5 if not using GPU; p2, p3, p4, g4dn, g5 for GPU
- XGBoost 1.5+ supports distributed GPU training with Dask, which allows using multiple GPUs per instance
  - use_dask_gpu_training = true, distribution = fully_replicated, csv or parquet input
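A minimal sketch of the built-in (ECR image) mode, assuming an unbalanced binary classification task; all values and names are illustrative:

```python
# Train the built-in XGBoost algorithm.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
xgb = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.2xlarge",  # memory bound: prefer m5 when not on GPU
    output_path="s3://my-bucket/xgboost/output",  # placeholder
)
xgb.set_hyperparameters(
    num_round=100,                # required
    objective="binary:logistic",
    max_depth=6,
    eta=0.2,
    subsample=0.8,
    scale_pos_weight=50,          # e.g. heavily unbalanced fraud data
)
xgb.fit({"train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv")})
```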
Sequence-to-Sequence Algorithm
In a nutshell: input is a sequence of tokens, output is a sequence of tokens. Supervised.
- Implemented with RNNs and CNNs with attention
Problem:
- Machine translation
- Text summarization
- Speech to text
Input: text
- RecordIO-protobuf
  - Start with tokenized text files
  - Tokens must be integers (unusual, as most algorithms want floating point)
  - Convert to protobuf (packs into integer tensors with vocabulary files)
  - Must provide training data, validation data, and vocabulary files
How it is used:
- Training for machine translation can take days
- Pre-trained models are available
- Public training datasets are available for specific translation tasks
No required hyperparameters.
Important hyperparameters:
- batch_size, learning_rate
- optimizer_type (adam, sgd, rmsprop)
- num_layers_encoder, num_layers_decoder
- optimized_metric - metric to optimize with early stopping (perplexity, accuracy, or bleu)
  - The BLEU score compares against multiple reference translations
  - Perplexity measures the uncertainty in the value of a sample from a discrete probability distribution
Instance types:
- Only GPU instance types
- Only a single machine for training, but it can use multiple GPUs on that machine
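A minimal sketch of a seq2seq translation job, assuming the tokenized protobuf data and vocabulary files are already in S3; all names and values are illustrative:

```python
# Train the built-in seq2seq algorithm on a single multi-GPU machine.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("seq2seq", session.boto_region_name)
s2s = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,                # single machine only
    instance_type="ml.p3.8xlarge",   # GPU only; multiple GPUs on one box are fine
    output_path="s3://my-bucket/seq2seq/output",
)
s2s.set_hyperparameters(
    optimizer_type="adam",
    optimized_metric="bleu",  # early-stop on BLEU for translation
    num_layers_encoder=2,
    num_layers_decoder=2,
)
s2s.fit({
    "train": "s3://my-bucket/seq2seq/train",
    "validation": "s3://my-bucket/seq2seq/validation",
    "vocab": "s3://my-bucket/seq2seq/vocab",
})
```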
DeepAR Forecasting Algorithm
In a nutshell: forecasting one-dimensional (scalar) time series data. Supervised.
- Uses an RNN
- Allows training the same model over several related time series
- Finds frequencies and seasonality
- A trained model can forecast new time series that are similar to the ones used for training
Input: time series
- JSON Lines (each line is a valid JSON value)
- Parquet (gzip or snappy compression)
- Record structure:
  - start (required): the starting timestamp
  - target (required): the time series values
  - dynamic_feat: dynamic features (e.g. was a promotion applied to a product in a time series of product purchases)
  - cat: categorical features (all time series must have the same number of categorical features), encoded as a 0-based sequence of integers (0, 1, 2, 3, ...)
How it is used:
- Accepts a training dataset and an optional test dataset
- A model trained on a training set can also produce forecasts for other time series
- Always include the entire time series for training, testing, and inference
- Use the entire dataset as the training set; remove the last time points (prediction_length) for testing and evaluate on the withheld values
- Don't use very large values for the prediction length (> 400)
- Train on many time series, not just one, when possible
Required hyperparameters:
- context_length - number of time points the model sees before making a prediction; can be smaller than seasonalities, as the model will lag one year anyhow (~ prediction_length)
- epochs - maximum number of passes over the training data
- prediction_length
- time_freq - granularity of the time series in the dataset (M, W, D, H, min)
Important hyperparameters:
- mini_batch_size, learning_rate
- num_cells - number of cells to use in each hidden layer of the RNN
Instance types:
- Training: CPU or GPU, single or multiple machines; start with CPU (c4.2xlarge, c4.4xlarge) and move up to GPU if necessary (it only helps with larger models or large mini-batch sizes > 512); may need larger instances for tuning
- Inference: CPU only
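A minimal sketch of writing DeepAR training records as JSON Lines; the field names (start/target/dynamic_feat/cat) follow the record structure above, while the series values themselves are made up:

```python
# Write two related daily time series in DeepAR's JSON Lines format.
import json

series = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [12.0, 15.0, 14.0, 20.0],  # the time series values
        "dynamic_feat": [[0, 0, 1, 0]],      # e.g. promotion applied on day 3
        "cat": [0],                          # 0-based category encoding
    },
    {
        "start": "2024-01-01 00:00:00",
        "target": [3.0, 4.0, 2.0, 5.0],
        "dynamic_feat": [[0, 1, 0, 0]],
        "cat": [1],
    },
]

with open("train.json", "w") as f:
    for record in series:
        f.write(json.dumps(record) + "\n")   # one JSON value per line
```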
BlazingText Algorithm
In a nutshell: two modes.
- Text classification (supervised)
  - Works on individual sentences, not entire documents
  - Predicts labels for a sentence
  - Useful in web searches and information retrieval
- Word2vec
  - Creates a vector representation of words (a word embedding)
  - Semantically similar words are represented by vectors close to each other
  - Useful for NLP tasks (sentiment analysis, entity recognition, machine translation), but is not an NLP algorithm itself
  - Works on individual words, not sentences or entire documents
Input: text
- Expects a single preprocessed text file with space-separated tokens
- Text classification
  - File mode: one sentence per line; the first "word" in the sentence is the string __label__ followed by the label
  - Augmented manifest text format
- Word2vec just wants a text file with one training sentence per line
How it is used:
- Word2vec has multiple modes:
  - cbow (Continuous Bag of Words)
  - skip-gram
  - batch skip-gram (distributed computation over many CPU nodes)
Required hyperparameters:
- Text classification: mode (supervised)
- Word2vec: mode (batch_skipgram, skipgram, cbow)
Important hyperparameters:
- Text classification
  - learning_rate, vector_dim, epochs
  - word_ngrams - number of word n-grams (sequences of n adjacent words in a particular order) to use
- Word2vec
  - learning_rate, vector_dim
  - window_size - number of words surrounding the target word used for training
  - negative_samples - number of negative samples for the negative sampling strategy (instances the model should learn to identify as not belonging to the target context)
Instance types:
- For cbow and skipgram, a single p3.2xlarge is recommended
- For batch_skipgram, single or multiple CPU instances
- For text classification, c5 is recommended if training data is < 2 GB; for larger datasets, a single GPU instance (p2.xlarge or p3.2xlarge)
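A minimal sketch of preparing a supervised-mode training file; the __label__ prefix convention is the one described above, while the sentences and labels are made up:

```python
# Write a BlazingText text-classification training file:
# one sentence per line, first token is __label__<label>.
samples = [
    ("positive", "the movie was a delight from start to finish"),
    ("negative", "the plot dragged and the acting felt flat"),
]

with open("train.txt", "w") as f:
    for label, sentence in samples:
        # remaining tokens are space-separated, preprocessed words
        f.write(f"__label__{label} {sentence}\n")
```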
Object2Vec Algorithm
In a nutshell: creates low-dimensional vectors from pairs of high-dimensional objects. Supervised.
- The vectors are low-dimensional dense embeddings
- The embeddings are learned so that they preserve the semantics of the relationship between the pairs of objects
Problem:
- Computing the nearest neighbors of objects
- Visualizing clusters
- Genre prediction
- Recommendations (similar items or users)
- Identifying duplicate support tickets
- Finding the correct ticket routing based on the similarity of the text in the tickets
Input: text
- JSON Lines, e.g. {"label":1, "in0": [5, 7, 12, 34], "in1": [12]}
- Data must be tokenized into integers
  - Discrete token - a list with a single integer id, e.g. [10]
  - Sequence of discrete tokens - a list of integer ids, e.g. [0,12,10,13]
- The objects in each pair can be asymmetric, e.g. (token, sequence), (token, token), or (sequence, sequence)
- Examples: sentence-sentence pairs, label-sequence pairs, customer-customer pairs, product-product pairs, user-item pairs from item reviews
- The input label for each pair can be:
  - A categorical label that expresses the relationship between the objects in the pair
  - A score that expresses the strength of the similarity between the two objects
How it is used:
- Process data into JSON Lines and shuffle it
- Train with two input channels, two encoders, and a comparator
- Encoder choices: average-pooled embeddings, CNN, bidirectional LSTM
- The encoders generate embeddings that are then compared by the comparator
- The comparator is followed by a feed-forward neural network, which receives a combination of the two vectors
- The output is the strength of the relationship (regression) or a label for the relationship (classification)
Required hyperparameters:
- [enc0_max_seq_len - maximum sequence length for the enc0 encoder]
- [enc0_vocab_size - vocabulary size of enc0 tokens]
Important hyperparameters:
- enc0_network - network model for the enc0 encoder (hcnn, bilstm, or pooled_embedding)
- enc1_network - set to enc0 to use the same model as enc0
- enc_dim - dimension of the output of the embedding layer (encoder)
- output_layer - softmax (classification) or rmse (regression)
- The usual deep learning knobs: dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay (L2); no support for L1
Instance types:
- Training: single machine, CPU or GPU (multi-GPU is supported); m5.2xlarge or p2.xlarge to start, or go up to m5.4xlarge or m5.12xlarge; GPU options: p2, p3, g4dn, g5
- Inference: p3.2xlarge (use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression)
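A minimal sketch of an Object2Vec training job for sentence-pair classification; the hyperparameter values, bucket, and role are illustrative placeholders:

```python
# Train Object2Vec with a shared BiLSTM encoder for both inputs.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("object2vec", session.boto_region_name)
o2v = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p2.xlarge",
    output_path="s3://my-bucket/object2vec/output",
)
o2v.set_hyperparameters(
    enc0_max_seq_len=50,     # required
    enc0_vocab_size=30000,   # required
    enc0_network="bilstm",   # encoder for in0
    enc1_network="enc0",     # reuse the enc0 model for in1
    enc_dim=256,
    output_layer="softmax",  # classification
)
o2v.fit({"train": "s3://my-bucket/object2vec/train.jsonl"})
```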
Object Detection - MXNet, Object Detection - TensorFlow
In a nutshell: identifies all objects in an image with bounding boxes. Supervised.
- Detects and classifies objects with a single deep neural network
- Classes are accompanied by confidence scores
- Can train from scratch, or use pre-trained models based on ImageNet
Input: vision
- MXNet
  - RecordIO or image files (.jpg or .png)
  - File mode and Pipe mode (RecordIO)
  - With image files, each image needs a .json annotation file with the same name as the corresponding image
- TensorFlow
  - Training and inference datasets of .jpg, .jpeg, or .png files
  - Any dataset with any number of image classes
  - In the input_directory, put an annotations.json file and an images directory with all your training data:
    input_directory
    |-- images
    |   |-- image1.png
    |   |-- image2.png
    |-- annotations.json
How it is used: takes an image as input and outputs all instances of objects in the image with categories and confidence scores.
- MXNet
  - Uses a CNN with the Single Shot multibox Detector (SSD) algorithm (the base CNN can be VGG-16 or ResNet-50)
  - Full training (randomly initialized weights)
  - Transfer learning mode / incremental training (uses a pre-trained model for the base network weights instead of random initial weights)
  - Uses flip, rescale, and jitter internally to avoid overfitting
- TensorFlow
  - Supports transfer learning using any of the compatible pretrained TensorFlow models
  - Uses ResNet, EfficientNet, and MobileNet models from the TensorFlow Model Garden
Required hyperparameters (MXNet):
- [num_classes - number of output classes]
- [num_training_samples - number of training examples in the input dataset]
Important hyperparameters:
- mini_batch_size, learning_rate
- optimizer (sgd, adam, rmsprop, adadelta)
- [MXNet: augmentation_type - data augmentation type]
Instance types:
- Training: GPU (multi-GPU and multi-machine); p2.xlarge, p2.16xlarge, p3.2xlarge, p3.16xlarge, g4dn, g5
- Inference: CPU or GPU; m5, p2, p3, g4dn
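A minimal sketch of the MXNet variant in transfer learning mode; the class counts, bucket, and role are illustrative placeholders:

```python
# Train the MXNet object detection built-in with a pre-trained ResNet-50 base.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("object-detection", session.boto_region_name)
od = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # training requires GPU
    output_path="s3://my-bucket/object-detection/output",
)
od.set_hyperparameters(
    num_classes=3,              # required
    num_training_samples=1000,  # required
    use_pretrained_model=1,     # transfer learning instead of random weights
    base_network="resnet-50",
    mini_batch_size=16,
    learning_rate=0.001,
    optimizer="sgd",
)
od.fit({
    "train": "s3://my-bucket/od/train",
    "validation": "s3://my-bucket/od/validation",
})
```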
Image Classification - MXNet, Image Classification - TensorFlow
In a nutshell: assigns one or more labels to an image. Supervised.
- Doesn't tell you where objects are, just what objects are in the image
How it is used:
- MXNet
  - Full training mode - network initialized with random weights
  - Transfer learning mode - initialized with pre-trained weights; the top fully-connected layer is initialized with random weights and the network is fine-tuned with new training data
- TensorFlow
  - Uses various TensorFlow Hub models (MobileNet, Inception, ResNet, EfficientNet)
  - Pretrained models; the top classification layer is available for fine-tuning or further training
Input: vision
- MXNet: RecordIO (recommended), or .png and .jpeg files
- TensorFlow: .jpg, .jpeg, and .png files; an input_directory with a subfolder per class, each containing that class's images
Required hyperparameters (MXNet):
- num_classes - defines the network output dimensions and is (typically) set to the dataset's number of classes
- num_training_samples
Important hyperparameters:
- batch size, learning rate, optimizer
Instance types:
- Training: GPU, multi-GPU and multi-instance (p2, p3, g4dn, g5)
- Inference: CPU or GPU (m5, p2, p3, g4dn, g5)
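A minimal sketch of the MXNet variant in transfer learning mode; values and names are illustrative placeholders:

```python
# Fine-tune the MXNet image classification built-in from pre-trained weights.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("image-classification", session.boto_region_name)
ic = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/image-classification/output",
)
ic.set_hyperparameters(
    num_classes=5,              # required
    num_training_samples=4000,  # required
    use_pretrained_model=1,     # transfer learning: only the top layer is random
    epochs=10,
    learning_rate=0.001,
    mini_batch_size=32,
)
ic.fit({
    "train": "s3://my-bucket/ic/train",
    "validation": "s3://my-bucket/ic/validation",
})
```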
Semantic Segmentation Algorithm
In a nutshell: pixel-level object classification (tags every pixel with a class label from a predefined set of classes). Supervised.
- Provides information about the shapes of the objects contained in the image
- Produces a segmentation mask (a grayscale image) that maps individual pixels to labels
- Different from image classification, which assigns labels to whole images
- Different from object detection, which assigns labels to bounding boxes
Problem:
- Self-driving vehicles
- Medical imaging diagnostics
- Robot sensing
Input: vision
- JPG images and PNG annotations
- Requires training and validation datasets
- Dataset on S3 with 4 directories: 2 for images and 2 for annotations
- Augmented manifest image format (JSON Lines) supported for Pipe mode
- Inference: accepts .jpg images
  - .png output - a .png file with the segmentation mask in the same format as the labels themselves
  - recordio-protobuf output - class probabilities encoded in recordio-protobuf format
How it is used:
- Built on the MXNet Gluon framework and the Gluon CV toolkit
- Choice of 3 algorithms:
  - Fully-Convolutional Network (FCN)
  - Pyramid Scene Parsing (PSP)
  - DeepLabV3
- Each of the three algorithms has two distinct components:
  - The backbone (or encoder) - a network that produces reliable activation maps of features
  - The decoder - a network that constructs the segmentation mask from the encoded activation maps
- Choice of backbones: ResNet50 or ResNet101, both trained on the ImageNet dataset
- Full training or incremental training
[Required hyperparameters]
- [num_classes]
- [num_training_samples]
Important hyperparameters:
- algorithm (fcn, psp, deeplab)
- backbone (resnet-50, resnet-101)
- use_pretrained_model (True, False)
Instance types:
- Training: GPU instances only (p2, p3, g4dn, g5)
- Inference: CPU instances (c5 and m5), GPU instances (p3 and g4dn), or both
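A minimal sketch of training FCN on a ResNet-50 backbone, assuming the four documented channels (train, validation, train_annotation, validation_annotation); all other values and names are illustrative:

```python
# Train the semantic segmentation built-in with incremental training.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("semantic-segmentation", session.boto_region_name)
ss = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # training is GPU-only
    output_path="s3://my-bucket/semseg/output",
)
ss.set_hyperparameters(
    algorithm="fcn",
    backbone="resnet-50",
    use_pretrained_model=True,  # start from ImageNet-trained backbone weights
    num_classes=21,
    num_training_samples=1464,
)
ss.fit({
    "train": "s3://my-bucket/semseg/train",                        # JPG images
    "validation": "s3://my-bucket/semseg/validation",
    "train_annotation": "s3://my-bucket/semseg/train_annotation",  # PNG masks
    "validation_annotation": "s3://my-bucket/semseg/validation_annotation",
})
```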
Random Cut Forest (RCF) Algorithm
In a nutshell: anomaly detection. Unsupervised.
- Designed to work with arbitrary-dimensional input
- Assigns an anomaly score to each data point
- What counts as a "low" or "high" score depends on the application; common practice is to treat scores beyond 3 standard deviations from the mean score as anomalous
Problem:
- Unexpected spikes in time series data (e.g. traffic volume analysis, sound volume spike detection)
- Breaks in periodicity
- Unclassifiable data points
Input:
- RecordIO-protobuf or CSV
- File or Pipe mode
- Optional test channel for computing accuracy, precision, recall, and F1 (anomaly or not); since the algorithm is unsupervised there is no labelled training, but you can still use a labelled test channel to measure accuracy based on your knowledge of which test data points are anomalies
How it is used:
- Creates a forest of trees where each tree is built from a partition of the training data, and looks at the expected change in the complexity of the tree as a result of adding a point to it
- Data is sampled randomly, then partitioned according to the number of trees in the forest; each tree is given one such partition
- The anomaly score assigned to a data point by a tree is defined as the expected change in the complexity of the tree as a result of adding that point to it
Required hyperparameters:
- feature_dim - number of features in the dataset (if you use the RCF estimator, this value is calculated for you)
Important hyperparameters:
- num_trees - number of trees in the forest (higher reduces noise)
- num_samples_per_tree - number of random samples given to each tree from the training dataset (should be chosen such that 1/num_samples_per_tree ≈ the ratio of anomalous to normal data, when this is known)
Instance types:
- Does not take advantage of GPUs
- Training: m4, c4, c5
- Inference: c5.xl
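A minimal sketch of the 3-standard-deviation cutoff described above, applied to a synthetic array of anomaly scores (in practice the scores would come from an RCF endpoint or batch transform):

```python
# Flag scores that fall beyond mean + 3 * std of all scores.
import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" scores near 1.0, plus one clear outlier.
scores = np.concatenate([rng.normal(1.0, 0.05, 200), [6.0]])

cutoff = scores.mean() + 3 * scores.std()
anomalies = np.where(scores > cutoff)[0]

print(f"cutoff={cutoff:.2f}, anomalous indices={anomalies}")
```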
Neural Topic Model (NTM) Algorithm
In a nutshell: organizes documents into topics that contain word groupings based on their statistical distribution. Unsupervised (Neural Variational Inference).
- For example, documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation"
- The topics are artificial: they are not necessarily human-readable or meaningful
Problem:
- Classifying documents based on the detected topics
- Retrieving information based on the detected topics
- Recommending content based on topic similarities
Input: text
- Supports 4 data channels: train, validation, test, and auxiliary; only train is mandatory
- Words must be tokenized into integers (thus a vocabulary is needed); the auxiliary channel is used to supply a text file that contains the vocabulary
- Training: RecordIO-protobuf or CSV
  - CSV: every document must contain a count for every word in the vocabulary (zero for words not in the document)
- Inference: RecordIO-protobuf, CSV, JSON, or JSON Lines
- File or Pipe mode
How it is used:
- You define how many topics you want
- The topics are a latent representation based on the top-ranking words
- One of two topic modeling algorithms in SageMaker (you can try them both)
Required hyperparameters:
- feature_dim - vocabulary size of the dataset
- num_topics - number of required topics
Important hyperparameters:
- mini_batch_size, learning_rate
Instance types:
- Training: GPU recommended
- Inference: CPU should be sufficient
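A minimal sketch of an NTM training job with an auxiliary vocabulary channel; the vocabulary size, topic count, bucket, and role are illustrative placeholders:

```python
# Train the built-in NTM algorithm.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("ntm", session.boto_region_name)
ntm = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU recommended for training
    output_path="s3://my-bucket/ntm/output",
)
ntm.set_hyperparameters(
    feature_dim=20000,  # required: vocabulary size
    num_topics=20,      # required
    mini_batch_size=128,
)
ntm.fit({
    "train": "s3://my-bucket/ntm/train",          # only mandatory channel
    "auxiliary": "s3://my-bucket/ntm/vocab.txt",  # vocabulary text file
})
```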
Latent Dirichlet Allocation (LDA) Algorithm
In a nutshell: another topic model (not deep learning). Unsupervised.
- The topics are unlabeled; they are just groupings of documents that share a subset of words
Problem:
- Discovering a user-specified number of topics shared by documents within a text corpus
- Can be used for things other than words:
  - Clustering customers based on purchases
  - Harmonic analysis in music
Input: text
- Train channel, optional test channel
- RecordIO-protobuf or CSV
- CSV requires each document to have counts for every word in the vocabulary
- Pipe mode only supported with RecordIO
How it is used:
- An observation is a document; the features are the presence (or occurrence count) of each word; the categories are the topics (not specified up front, as the algorithm is unsupervised)
- Topics are not guaranteed to align with how a human might naturally categorize documents
- Topics are learned as a probability distribution over the words that occur in each document
- Each document is described as a mixture of topics
- The optional test channel can be used for scoring results: per-word log-likelihood is the metric used to measure how well LDA works
Required hyperparameters:
- num_topics
- feature_dim - size of the vocabulary of the input document corpus
- mini_batch_size - total number of documents in the input document corpus
Important hyperparameters:
- alpha0 - initial guess for the concentration parameter; small values generate sparse topic mixtures, large values (greater than 1.0) produce uniform mixtures
Instance types:
- Single-instance CPU training
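A minimal sketch of an LDA training job; the corpus sizes, bucket, and role are illustrative placeholders:

```python
# Train the built-in LDA algorithm (single CPU instance only).
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("lda", session.boto_region_name)
lda = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,               # single instance only
    instance_type="ml.c5.2xlarge",  # CPU only
    output_path="s3://my-bucket/lda/output",
)
lda.set_hyperparameters(
    num_topics=10,          # required
    feature_dim=20000,      # required: vocabulary size
    mini_batch_size=50000,  # required: total document count
    alpha0=0.1,             # < 1.0 biases toward sparse topic mixtures
)
lda.fit({"train": "s3://my-bucket/lda/train"})
```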
K-Nearest Neighbors (k-NN) Algorithm
In a nutshell: a simple classification or regression algorithm. Supervised.
- The idea behind k-NN is that similar data points should have the same class, at least most of the time
Problem:
- Classification: find the K closest points to a sample point and return the most frequent label
  - E.g. image classification based on a suitable distance function between images; the class of an unlabeled image can be determined by the labels assigned to its nearest neighbors
- Regression: find the K closest points to a sample point and return the average value
Input: tabular
- The train channel contains your data; the test channel emits accuracy (classification) or MSE (regression)
- recordIO-protobuf or CSV training
- File or Pipe mode on either
How it is used:
- Data is first sampled
- SageMaker includes a dimensionality reduction stage to avoid sparse data (two methods: sign or fjlt)
- Builds an index for looking up neighbors
- Serializes the model
- Queries the model for a given K
Required hyperparameters:
- feature_dim - number of features in the input data
- k
- predictor_type (classifier or regressor)
- sample_size - number of data points to be sampled from the training dataset
- dimension_reduction_target (0 to feature_dim)
Important hyperparameters: k, sample_size
Instance types:
- Training: CPU or GPU
- Inference: CPU for lower latency, GPU for higher throughput on large batches
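A minimal sketch of a k-NN classifier training job with a test channel for accuracy reporting; all values and names are illustrative:

```python
# Train the built-in k-NN algorithm.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image = image_uris.retrieve("knn", session.boto_region_name)
knn = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/knn/output",
)
knn.set_hyperparameters(
    feature_dim=54,               # required
    k=10,                         # required
    predictor_type="classifier",  # required
    sample_size=200000,           # required
)
knn.fit({
    "train": TrainingInput("s3://my-bucket/knn/train.csv", content_type="text/csv"),
    "test": TrainingInput("s3://my-bucket/knn/test.csv", content_type="text/csv"),
})
```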
K-Means Algorithm
In a nutshell: divides data into k groups, where the members of a group are as similar as possible to each other. Unsupervised.
- You define what "similar" means; similarity is measured by Euclidean distance (distortion, or inertia)
- Use the elbow method to determine the optimal k
Problem: web-scale k-means clustering
- Classifying customers by purchase history or clickstream activity (e.g. high-, medium-, and low-spending customers from their transaction histories)
- Detecting patterns for diseases or successful healthcare treatment scenarios
- Grouping similar images for image detection
- Detecting fraud by detecting anomalies in the dataset, e.g. detecting credit card fraud via abnormal purchase patterns
Input:
- Train channel, optional test channel
- Train is ShardedByS3Key, test is FullyReplicated
- recordIO-protobuf or CSV
- File or Pipe mode on either
How it is used:
- Every observation is mapped to an n-dimensional space (n = number of features)
- Works to optimize the centers of k clusters
- "Extra cluster centers" may be specified to improve accuracy (these end up getting reduced to k): K = k*x
- Algorithm:
  - Determine the initial cluster centers, randomly or with the k-means++ approach (k-means++ tries to make the initial cluster centers far apart)
  - Iterate over the training data and calculate cluster centers
  - Reduce the clusters from K to k (using Lloyd's method with k-means++)
Required hyperparameters:
- k
- feature_dim
Important hyperparameters:
- mini_batch_size
- extra_center_factor
- init_method (random, kmeans++)
Instance types:
- Training: GPU recommended, but only one GPU per instance is used (g4dn.xlarge)
- Supports p2, p3, g4dn, and g5 instances for training and inference
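A minimal sketch of a k-means training job, including the channel distribution settings mentioned above; all values and names are illustrative:

```python
# Train the built-in k-means algorithm with extra centers.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image = image_uris.retrieve("kmeans", session.boto_region_name)
km = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # only one GPU per instance is used
    output_path="s3://my-bucket/kmeans/output",
)
km.set_hyperparameters(
    k=10,                   # required
    feature_dim=784,        # required
    extra_center_factor=4,  # train with K = k*x centers, reduced back to k
    init_method="kmeans++",
)
km.fit({
    "train": TrainingInput("s3://my-bucket/kmeans/train",
                           distribution="ShardedByS3Key"),
    "test": TrainingInput("s3://my-bucket/kmeans/test",
                          distribution="FullyReplicated"),
})
```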
In a nutshell
- Dimensionality reduction: projects higher-dimensional data into a lower-dimensional space while minimizing loss of information
- The reduced dimensions are called components
- Finds a new set of features (components) that are composites of the original features
- Components are uncorrelated with one another
- The first component captures the largest possible variability in the data, the second component the next largest, and so on
Input
- Unsupervised
- recordIO-protobuf or CSV
- File or Pipe mode on either
How it is used
- A covariance matrix is created, then singular value decomposition (SVD) is applied
- Two modes:
  - Regular: for sparse data and a moderate number of observations and features
  - Randomized: for a large number of observations and features (uses an approximation algorithm)
Required Hyperparameters
- feature_dim
- mini_batch_size
- num_components - the number of principal components to compute
Important Hyperparameters
- algorithm_mode (regular, randomized)
- subtract_mean (true, false) - whether the data should be unbiased both during training and at inference
Instance Type
- Supports CPU and GPU instances for training and inference
- The optimal instance type depends on the specifics of the input data
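A minimal sketch of the built-in PCA estimator with the SageMaker Python SDK (v2); the role ARN, bucket, and toy data are hypothetical placeholders.

```python
# Hedged sketch: built-in PCA via the SageMaker Python SDK (v2).
import numpy as np
from sagemaker import PCA

pca = PCA(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_components=3,             # required: number of principal components
    algorithm_mode="randomized",  # approximation mode for large data; "regular" otherwise
    subtract_mean=True,           # unbias the data during training and at inference
    output_path="s3://my-bucket/pca/",  # hypothetical bucket
)

train = np.random.rand(500, 20).astype("float32")  # toy data; feature_dim = 20
pca.fit(pca.record_set(train))
```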
In a nutshell
- Designed for sparse data (since an individual user doesn’t interact with most pages/products)
- Limited to pair-wise interactions (e.g. user -> item)
- Binary classification or regression
Problem
- Supervised: regression or binary classification
- Example use cases: click prediction, item recommendations, ranking
Input
- Tabular
- Training: recordIO-protobuf with Float32
- Inference: recordIO-protobuf or JSON
- Sparse data means CSV isn’t practical
- Both File and Pipe mode are supported for training
How it is used
- Finds factors that can be used to predict a class (click or not? purchase or not?) or a value (predicted rating?) given a matrix representing some pair of things (e.g. users and items)
- Usually used in the context of recommender systems
Required Hyperparameters
- feature_dim
- num_factors (2-1000) - dimensionality of the factorization; 64 is the suggested value
- predictor_type (binary_classifier, regressor)
Important Hyperparameters
- Initialization methods for bias, factors, and linear terms: uniform, normal, or constant
- The properties of each method can be tuned
Instance Types
- Supports CPU and GPU, single or multi-instance
- CPU recommended
- GPU only works with dense data
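A minimal sketch of Factorization Machines for binary click prediction with the SageMaker Python SDK (v2); the role ARN, bucket, and data are hypothetical placeholders.

```python
# Hedged sketch: built-in Factorization Machines via the SageMaker Python SDK (v2).
import numpy as np
from sagemaker import FactorizationMachines

fm = FactorizationMachines(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.c5.xlarge",   # CPU recommended; GPU only helps with dense data
    num_factors=64,                 # suggested dimensionality of the factorization
    predictor_type="binary_classifier",
    output_path="s3://my-bucket/fm/",  # hypothetical bucket
)

# Toy dense stand-in; real click/recommendation data would be sparse,
# which is why recordIO-protobuf (not CSV) is the expected format.
features = np.random.rand(1000, 2000).astype("float32")
labels = np.random.randint(0, 2, 1000).astype("float32")  # clicked or not
fm.fit(fm.record_set(features, labels=labels))
```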
In a nutshell
- Learns the IP usage patterns of each entity and identifies anomalies
- Captures associations between IPv4 addresses and various entities, e.g. user IDs or account numbers
Problem
- Unsupervised: identifies suspicious behavior from IP addresses
  - Identify logins from anomalous IPs
  - Identify accounts creating resources from anomalous IPs
- In advanced solutions, feed the IP Insights score into another ML model, e.g. combine it with other features to rank the findings of another security system (such as GuardDuty)
Input
- User names and account IDs can be fed in directly; no need to pre-process
- Training channel, optional validation channel (computes an AUC score)
- Training: CSV only (entity, IP)
- Inference: CSV, JSON, JSON Lines
How it is used
- Uses a neural network to learn latent vector representations of entities and IP addresses
- Entities are hashed and embedded (needs a sufficiently large hash size)
- Automatically generates negative samples during training by randomly pairing entities and IPs
Required Hyperparameters
- num_entity_vectors - the number of entity vector representations (entity embedding vectors) to train; this is the hash size, so set it to twice the number of unique entity identifiers
- vector_dim - the size of the embedding vectors that represent entities and IP addresses; model size scales linearly with this parameter, which limits how large it can be, and too large a value can cause the model to overfit, especially on small training datasets
Important Hyperparameters
- epochs
- learning_rate
- batch_size
Instance Type
- Both CPU and GPU are supported; multi-GPU instances are supported
- Training: GPU recommended
- Inference: CPU recommended; the CPU instance size depends on vector_dim and num_entity_vectors
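A minimal sketch of an IP Insights training job using the generic Estimator with the algorithm's ECR image; the role ARN and S3 paths are hypothetical placeholders.

```python
# Hedged sketch: built-in IP Insights via the SageMaker Python SDK (v2).
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

image = image_uris.retrieve("ipinsights", region="us-east-1")
estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU recommended for training
    output_path="s3://my-bucket/ipinsights/",  # hypothetical bucket
)
estimator.set_hyperparameters(
    num_entity_vectors=20000,  # ~2x the number of unique entity identifiers
    vector_dim=128,            # embedding size; too large can overfit
    epochs=10,
)
estimator.fit({
    # CSV rows of (entity, IPv4 address); validation channel enables the AUC score
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/valid.csv", content_type="text/csv"),
})
```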
Key Features of Amazon SageMaker RL
- Deep learning frameworks: TensorFlow and MXNet
- RL toolkits: Intel Coach and Ray RLlib (the toolkit manages the interaction between agent and environment and provides a selection of RL algorithms)
- RL environments: custom, open-source, or commercial environments are supported; simulators are useful when it is not safe to train an agent in the real world
  - Commercial: MATLAB, Simulink
  - Open-source: EnergyPlus, RoboSchool, PyBullet
  - AWS simulation: AWS RoboMaker (robot simulation)
  - Custom: bring your own
Distributed Training
- Can distribute training and/or environment rollout, multi-core and multi-instance:
  - Single training instance and single rollout instance - all the same instance type
  - Single training instance and multiple rollout instances - different instance types for training and rollouts
  - Single trainer instance that uses multiple cores for rollout
  - Multiple instances for training and rollouts
Key Terms
- Environment - layout of the board / maze / etc.
- State - where the player / pieces are
- Action - move in a given direction, etc.
- Reward - value associated with the action taken from that state
- Observation - surroundings in a maze, state of a chess board (can be the entire state or a subset of it)
Hyperparameter Tuning
- Run a hyperparameter tuning job to optimize hyperparameters for Amazon SageMaker RL
- Parameters of your choosing may be abstracted, and SageMaker hyperparameter tuning can then optimize them (see the sketch below)
Instance Types
- No specific guidance given in the developer guide
- It’s deep learning, so GPUs are helpful
- Multiple instances and cores are supported
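A minimal sketch of launching a SageMaker RL job with the SDK's RLEstimator. The entry point script and role are hypothetical, and the Coach toolkit version shown may differ from what the SDK currently supports.

```python
# Hedged sketch: a SageMaker RL training job via RLEstimator.
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

estimator = RLEstimator(
    entry_point="train-cartpole.py",   # hypothetical training script
    toolkit=RLToolkit.COACH,           # or RLToolkit.RAY for Ray RLlib
    toolkit_version="0.11.0",          # check the SDK for supported versions
    framework=RLFramework.TENSORFLOW,  # or RLFramework.MXNET
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",     # deep learning, so a GPU helps
    output_path="s3://my-bucket/rl/",  # hypothetical bucket
    hyperparameters={"discount_factor": 0.99},  # abstracted so a tuning job can optimize it
)
estimator.fit()
```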
In a nutshell
- Open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm
- Similar in spirit to XGBoost
Problem
- Supervised: classification (binary and multiclass), regression, ranking
Input
- txt/csv
- Training channel, with an optional validation channel
Hyperparameters
- num_boost_round - maximum number of boosting rounds
- early_stopping_rounds - stop training if the metric on one validation data point has not improved in the last early_stopping_rounds rounds
- num_leaves - maximum leaves per tree
- max_depth
- scale_pos_weight - weight of the labels with the positive class; used only for binary classification tasks
Instance Types
- Single or multi-instance CPU training (set instance_count > 1 on your Estimator for multi-instance)
- Memory-bound algorithm: choose a general-purpose instance (e.g. M5) over a compute-optimized one (e.g. C5)
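A minimal sketch exercising the same hyperparameters with the open-source lightgbm package locally (not the SageMaker container itself); the data here is a toy stand-in.

```python
# Hedged sketch: LightGBM binary classification with early stopping.
import lightgbm as lgb
import numpy as np

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)  # binary labels
train_set = lgb.Dataset(X[:800], label=y[:800])
valid_set = lgb.Dataset(X[800:], label=y[800:], reference=train_set)

params = {
    "objective": "binary",
    "num_leaves": 31,         # max leaves per tree
    "max_depth": -1,          # -1 = no depth limit
    "scale_pos_weight": 1.0,  # raise for an under-represented positive class
}

booster = lgb.train(
    params,
    train_set,
    num_boost_round=100,  # max number of boosting rounds
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],  # early_stopping_rounds
)
```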
Setup
- A set of environmental states s
- A set of possible actions in those states a
- A value for each state/action pair, Q
- Start off with Q values of 0
- Explore the space
- As bad things happen after a given state/action, reduce its Q
- As rewards happen after a given state/action, increase its Q
Q-Learning
- You can “look ahead” more than one step by using a discount factor when computing Q (here s is the previous state, s’ is the current state):
  Q(s,a) += learning_rate * (reward(s,a) + discount * max(Q(s’,a’)) - Q(s,a))
- The discount factor weighs the best value obtainable from the next state against the immediate reward
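A tiny self-contained sketch of the tabular Q-learning update above; the states, actions, and rewards are illustrative toys.

```python
# Hedged sketch: tabular Q-learning update rule.
from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor used to "look ahead"

Q = defaultdict(float)  # Q[(state, action)] starts at 0

def update_q(state, action, reward, next_state, actions):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```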
Exploration problem
- How do we efficiently explore all of the possible states?
- Simple approach: always choose the action for a given state with the highest Q; if there’s a tie, choose at random
  - Inefficient, and you might miss a lot of paths that way
- Better approach: introduce an epsilon term
  - If a random number is less than epsilon, don’t follow the highest Q; choose an action at random instead
  - That way, exploration never totally stops
  - Choosing epsilon can be tricky
- See Markov Decision Processes (MDPs): a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker; an MDP is a discrete-time stochastic control process
- See Dynamic Programming
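A minimal sketch of the epsilon-greedy strategy just described, reusing the Q table from the previous sketch.

```python
# Hedged sketch: epsilon-greedy action selection.
import random

EPSILON = 0.1  # fraction of the time we explore instead of exploit

def choose_action(state, actions):
    if random.random() < EPSILON:
        return random.choice(actions)            # explore: ignore Q entirely
    best = max(Q[(state, a)] for a in actions)   # exploit: highest Q wins
    ties = [a for a in actions if Q[(state, a)] == best]
    return random.choice(ties)                   # break ties at random
```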
- An agent “explores” some space
- As it goes, it learns the value of different state changes in different conditions
- Those values inform the subsequent behavior of the agent
- Q-Learning is a specific implementation of reinforcement learning
Supervised Learning - an external supervisor provides a training set of labeled examples. Each example contains information about a situation and a label identifying the category to which it belongs. The goal of supervised learning is to generalize in order to predict correctly in situations that are not present in the training data.
Reinforcement Learning - deals with interactive problems, making it infeasible to gather all possible examples of situations with correct labels that an agent might encounter. This type of learning is most promising when an agent is able to accurately learn from its own experience and adjust accordingly.
Unsupervised Learning - an agent learns by uncovering structure within unlabeled data, i.e. it learns from data without human supervision. While an RL agent might benefit from uncovering structure in its experiences, the sole purpose of RL is to maximize a reward signal.
- Make sentences lower case
- Remove stop words (a, an, the, ...) using an English stopword dictionary
- Remove punctuation/symbols/numbers (optional, depending on the task)
- Normalize the words: lemmatize and stem them
- Tokenize the sentence
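A minimal sketch of this pipeline using NLTK; it assumes the punkt, stopwords, and wordnet resources have been downloaded via nltk.download().

```python
# Hedged sketch: text preprocessing with NLTK.
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

sentence = "The cats were walking quickly through 3 old gardens!"

tokens = word_tokenize(sentence.lower())  # lower case, then tokenize
tokens = [t for t in tokens if t not in string.punctuation and not t.isdigit()]
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop words

lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # dictionary base forms
stems = [PorterStemmer().stem(t) for t in tokens]            # crude suffix stripping
print(lemmas)  # e.g. ['cat', 'walking', 'quickly', 'old', 'garden']
print(stems)   # e.g. ['cat', 'walk', 'quickli', 'old', 'garden']
```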
Lemmatization - goes beyond truncating words and analyzes the context of the sentence. After determining a word’s context, the lemmatization algorithm returns the word’s base form (lemma) from a dictionary reference. Gives a more accurate language representation.
Stemming - a simpler form of word reduction that focuses on removing word endings (suffixes) to obtain a base form, which can result in non-dictionary words. Less precise than lemmatization, but quick and efficient when processing large volumes of text. Example: “walk,” “walking,” and “walks” all become “walk.”
- The process of marking each word in a text with its corresponding part of speech
- Used in NLP to help identify the roles that words play in a sentence
- Aids in tasks such as syntactic parsing and named entity recognition
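A minimal sketch of part-of-speech tagging with NLTK; it assumes the averaged_perceptron_tagger resource has been downloaded.

```python
# Hedged sketch: POS tagging with NLTK.
import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```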
- N-gram: unigram, bigram, ...
- Orthogonal Sparse Bigram (OSB) transformation: an alternative to the plain bigram
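A minimal sketch of unigram and bigram feature extraction with scikit-learn; the OSB transformation itself is an AWS-specific variant not implemented here.

```python
# Hedged sketch: unigram + bigram extraction.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vectorizer.fit(["the quick brown fox"])
print(vectorizer.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']
```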
RCF, LDA
seq2seq
Linear Learner, XGBoost, DeepAR (GPU if batch > 512), Object2Vec, Factorization Machines, k-means, PCA, IP Insights, KNN
Image Classification, Object Detection, Semantic Segmentation
Neural Topic Model (NTM)