SageMaker
SageMaker Data Wrangler
- Data Flow – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline
- Edit data types
- Add transform
- Get data insights
- Join: Combine data to join two datasets and add the resulting dataset to the data flow
- Concatenate: Combine data to concatenate two datasets and add the resulting dataset to the data flow
- Transform – Clean and transform your dataset using standard transforms like string, vector, and numeric data formatting tools.
- Featurize your data using transforms like text and date/time embedding and categorical encoding
- Resize, enhance, corrupt images
- Balance data: oversampling, undersampling, SMOTE
- Generate Data Insights – Automatically verify data quality and detect abnormalities in your data with the Data Wrangler Data Quality and Insights Report
- Analyze – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation
- Create a Model – Create a SageMaker Canvas model directly from your prepared data
- Export the data – Export your data preparation workflow to a different location:
- S3
- If dataset size > 5 GB, then Canvas initiates a remote job: EMR Serverless (default) or SageMaker Processing Job
- SageMaker Canvas Dataset (Canvas only)
- Automate data preparation – Create machine learning workflows from your data flow:
- SageMaker Pipelines (via Jupyter Notebook) – Build workflows that manage your SageMaker data preparation, model training, and model deployment jobs
- SageMaker Serial inference pipeline (from Jupyter Notebook) – Create a serial inference pipeline from your data flow. Use it to make predictions on new data
- Python script (Jupyter notebook) – Store the data and their transformations in a Python script for your custom workflows
- SageMaker Feature Store
- SageMaker Processing Job (via Jupyter Notebook or from the UI)
- Schedule SageMaker Data Wrangler jobs to run periodically as SageMaker processing jobs directly from the UI (see the sketch below)
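A minimal Python sketch of running an exported Data Wrangler flow as a SageMaker Processing job with the generic Processor from the SageMaker Python SDK. The container image URI, S3 paths, and role below are placeholders; the notebook that Data Wrangler exports configures these (and the input/output names) for you.

```python
# Sketch: run an exported Data Wrangler .flow file as a SageMaker Processing job.
# Image URI, S3 paths, and role are placeholders -- substitute your own.
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"        # assumption: your execution role
dw_image_uri = "<data-wrangler-container-uri-for-your-region>"  # assumption: region-specific image

processor = Processor(
    role=role,
    image_uri=dw_image_uri,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    sagemaker_session=session,
)

# The flow file defines the data sources and transform steps built in the UI.
processor.run(
    inputs=[ProcessingInput(source="s3://my-bucket/flows/my.flow",
                            destination="/opt/ml/processing/flow")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/dw-output/")],
)
```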
Analyze and Visualize
Multicollinearity
Measures of multicollinearity in your data:
- Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other.
- A VIF score is a positive number that is greater than or equal to 1
- VIF <= 5: the variable is moderately correlated with the other variables
- VIF >= 5: the variable is highly correlated with the other variables (see the VIF sketch after this section)
- PCA measures the variance of the data along different directions in the feature space. Also referred to as Singular Value Decomposition (SVD)
- PCA generates an ordered list of variances (also known as singular values), each greater than or equal to 0
- When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity.
- Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.
Multicollinearity is a circumstance where two or more predictor variables (features) are related to each other. When you have multicollinearity, the predictor variables are not only predictive of the target variable but also predictive of each other.
Multicollinearity (also called collinearity) is a special case of correlation: a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
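A minimal VIF sketch with statsmodels (independent of Data Wrangler) on a toy DataFrame, showing how a nearly collinear pair of features produces large VIF scores. The data and column names are made up for illustration.

```python
# Minimal VIF sketch: a VIF >= ~5 for a feature suggests it is highly
# correlated with the other features.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame({                      # toy numeric feature matrix
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12.1],        # nearly collinear with x1
    "x3": [5, 3, 6, 2, 7, 1],
})

X = add_constant(df)                     # VIF is computed against an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))                 # x1 and x2 show very large VIFs
```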
- Data Quality And Insights Report
- Histogram (with Color By, Facet By)
- Scatter Plot - only for numerical data (with Color By, Facet By)
- Table Summary
- Quick Model - evaluate your data and produce importance scores for each feature
- Model score: Classification -> F1 score, Regression -> MSE score
- Uses the Gini importance method to calculate feature importance for each feature
- Target Leakage - Target leakage occurs when there is data in the training dataset that is strongly correlated with the target label but is not available in real-world data
- Specify Target - aka label
- Problem type: Classification or Regression
- Uses AUC-ROC (classification) or R2 (regression)
- Multicollinearity (Studio) --> Feature Correlation (Canvas); see the correlation sketch after this list
- Linear -> Pearson
- Numeric to categorical correlation is calculated by encoding the categorical features as floating-point numbers
- Linear categorical to categorical correlation is not supported
- Non-linear -> Spearman's rank correlation
- Numeric to categorical correlation is calculated by encoding the categorical features as floating-point numbers
- Categorical to categorical correlation is based on the normalized Cramer's V test
- Detect Anomalies in Time Series - see outliers in your time series data.
- Seasonal Trend Decomposition in Time Series - whether there's seasonality in your time series data. The seasonal component is a signal that recurs in a time period
- Bias Report (specify Label and Facet)
- Custom visualization
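A small pandas sketch of the two correlation flavours used for numeric features (Pearson for linear, Spearman for rank/non-linear); categorical handling such as Cramér's V is not shown. The toy DataFrame is made up for illustration.

```python
# Minimal sketch of the two correlation methods the report uses for numeric features.
import pandas as pd

df = pd.DataFrame({
    "price": [10, 20, 30, 40, 55],
    "size":  [1.0, 2.1, 2.9, 4.2, 5.0],
    "rank":  [5, 4, 3, 2, 1],
})

print(df.corr(method="pearson"))    # linear correlation
print(df.corr(method="spearman"))   # non-linear (rank) correlation
```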
Data Splitting
- Split data into train, validation, and test sets (see the sketch after this list)
- Split types:
- Random split – Splits data randomly into train, test, and optionally validation datasets using the percentage specified for each dataset. Use this option if you do not need to preserve the order of your data
- Ordered split – Splits data in order, using the percentage specified for each dataset. An ordered split ensures that the data in each split is non-overlapping while preserving the order of the data
- Stratified split – Splits the dataset so that each split is similar with respect to a column specifying different categories for your data, for example, size or country. This split ensures that the train, test, and validation datasets have the same proportions for each category as the input dataset
- Split by key – Takes one or more columns as input (the key) and ensures that no combination of values across the input columns occurs in more than one of the splits (split by key). This is useful to avoid data leakage for unordered data. Choose this option if your data for key columns needs to be in the same split. For example, consider customer transactions split by customer ID; the split ensures that customer IDs don’t overlap across split datasets.
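Rough scikit-learn analogues of the four split types, on a toy DataFrame (illustrative only; Data Wrangler performs these splits inside the flow).

```python
# Sketch: scikit-learn analogues of the Data Wrangler split types.
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "country":     ["US", "US", "DE", "DE", "US", "US", "DE", "DE"],
    "amount":      [10, 15, 20, 25, 30, 35, 40, 45],
})

# Random split (order not preserved)
train, test = train_test_split(df, test_size=0.25, random_state=42)

# Stratified split: keep the same proportion of each country in both splits
train_s, test_s = train_test_split(df, test_size=0.25, stratify=df["country"], random_state=42)

# Ordered split: simply slice, preserving row order
cut = int(len(df) * 0.75)
train_o, test_o = df.iloc[:cut], df.iloc[cut:]

# Split by key: all rows of a given customer_id end up in the same split
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["customer_id"]))
train_k, test_k = df.iloc[train_idx], df.iloc[test_idx]
```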
Transform Time Series
- Group by a Time Series
- Resample Time Series Data: establish regular intervals for the observations in your dataset (see the sketch after this list) using:
- Upsampling reduces the interval between observations, i.e. increases their frequency (for example, use interpolation to infer hourly observations from 2-hour samples)
- Downsampling increases the interval between observations, i.e. decreases their frequency (for example, aggregate hourly observations into daily values)
- Handle Missing Time Series Data: Constant value, Most common, Forward fill, Backward fill, Interpolate
- Validate the Timestamp of Your Time Series Data
- Standardize the Length of Your Time Series Data
- Extract Features from Your Time Series Data
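A pandas sketch of resampling and the missing-value strategies listed above, on a made-up 2-hour series (the Data Wrangler transforms do the equivalent inside the flow).

```python
# Sketch: resampling and missing-value handling for a time series with pandas.
import pandas as pd

ts = pd.Series(
    [1.0, None, 3.0, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="2h"),
)

upsampled   = ts.resample("1h").interpolate()   # upsample: 2-hour samples -> hourly via interpolation
downsampled = ts.resample("4h").mean()          # downsample: aggregate into 4-hour means

filled_ffill  = ts.ffill()                      # forward fill
filled_bfill  = ts.bfill()                      # backward fill
filled_const  = ts.fillna(0.0)                  # constant value
filled_interp = ts.interpolate()                # interpolate
```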
SageMaker Model Monitor
Types of monitoring
Data quality
- Monitor drift in data quality from a baseline based on data you provide (see the baseline sketch after this list)
- Prebuilt containers compute KLL sketch (compact quantiles sketch)
- Emits metrics for each feature/column in the dataset: Min, Max, Sum, SampleCount, Average, Completeness, BaselineDrift
- Violations: data_type_check, completeness_check, baseline_drift_check (distribution distance between the current and the baseline > threshold), missing_column_check, extra_column_check, categorical_values_check
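A minimal sketch of suggesting such a data-quality baseline with the SageMaker Python SDK's DefaultModelMonitor; the role and S3 paths are placeholders.

```python
# Sketch: suggest a data-quality baseline (statistics.json + constraints.json)
# from the training dataset. Paths and role are placeholders.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumption: your execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",      # dataset used for training
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/model-monitor/baseline",
    wait=True,
)
# The job writes statistics.json and constraints.json to output_s3_uri;
# the constraints are later compared against captured traffic.
```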
Model quality
- Compares model predictions with actual Ground Truth labels that you provide in an S3 bucket
- Need to periodically label data captured by endpoint or batch transform job and upload it to Amazon S3
- Monitor drift in model quality metrics, such as accuracy
- Quality metrics depend on the type of ML problem: regression (e.g. rmse, mse, ...), binary classification (e.g. recall, precision, accuracy, ...), and multi-class (e.g. accuracy, weighted recall, weighted precision, ...)
Bias Drift for Models in Production
- Bias can be introduced or exacerbated in deployed models if the training data differs from the live data; this can be temporary or permanent
- Monitors bias metrics in your model's predictions continuously and alerts you if the metrics exceed a threshold
- You specify the allowed range for each bias metric (e.g. DPPL; see the sketch after this list); to ensure statistically significant data are used, SageMaker Clarify uses confidence intervals
- Baseline defined by data inputs, sensitive groups, captured predictions, post-training bias metrics
- Drift violations: facet, facet_value, metric_name, constraint_check_type = bias_drift_check
- Bias metrics measure the level of equality in a distribution (a value close to 0 means the distribution is more balanced)
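As a concrete illustration of a bias metric that is close to 0 when the distribution is balanced, here is a hand computation of DPPL (Difference in Positive Proportions in Predicted Labels); the toy predictions and facet values are made up, and Clarify computes this (plus confidence intervals) for you.

```python
# Hand-computed DPPL for two facets; a value near 0 means predictions are
# balanced across facets. Data is made up for illustration.
import numpy as np

predicted = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 0])  # model predictions (1 = positive outcome)
facet     = np.array(["a", "a", "a", "a", "a",
                      "d", "d", "d", "d", "d"])        # sensitive attribute

q_a = predicted[facet == "a"].mean()   # positive rate for facet a -> 0.8
q_d = predicted[facet == "d"].mean()   # positive rate for facet d -> 0.4
dppl = q_a - q_d                       # DPPL = 0.4
print(q_a, q_d, dppl)
```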
Feature Attribution Drift for Models in Production
- Drift in the live data distribution can result in drift in the feature attributions, just like a drift in bias metrics
- Leverages Clarify for feature attribution by comparing individual features' rankings and raw attribution values between training data and live data
- Clarify uses Shapley values and Shapley Additive Explanations (SHAP) to compute explanations
- Clarify computes SHAP values per instance and aggregated across instances (global); you choose the aggregation method: mean_abs, median, mean_sq
- Uses the Normalized Discounted Cumulative Gain (NDCG) score to compare the feature-attribution (global SHAP) rankings of training and live data (see the sketch after this list)
- If NDCG < 0.90, SageMaker Clarify automatically raises an alert (this is not feature-specific, as NDCG combines all SHAP values)
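An illustrative sketch of the NDCG comparison described above, using scikit-learn's ndcg_score on made-up global SHAP values; Clarify's exact computation may differ.

```python
# Illustrative only: compare global SHAP attributions from training vs. live data
# with NDCG; a score below 0.90 would indicate feature-attribution drift.
import numpy as np
from sklearn.metrics import ndcg_score

features      = ["age", "income", "tenure", "clicks"]
shap_training = np.array([[0.40, 0.30, 0.20, 0.10]])   # global (mean_abs) SHAP on training data
shap_live     = np.array([[0.15, 0.35, 0.30, 0.20]])   # global SHAP on captured live traffic

score = ndcg_score(shap_training, shap_live)            # training ranking acts as "relevance"
print(f"NDCG = {score:.3f}",                            # ~0.885 here -> below the 0.90 threshold
      "drift" if score < 0.90 else "ok")
```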
How it works
- Real-time endpoint - enable the endpoint to capture data from incoming requests and the resulting model predictions
- Batch transform job - enable data capture of the batch transform inputs and outputs
- Create baseline from training dataset (computes metrics and suggests constraints for the metrics)
- Compares model predictions to the baseline constraints and reports violations
- Create a monitoring schedule: which data to collect, how often to collect it, how to analyze it, and which reports to produce (see the sketch after this list)
- Computes model metrics and statistics on tabular data only. For example, an image classification model that takes images as input and outputs a label can still be monitored: Model Monitor calculates metrics and statistics for the output, not the input
- Supports only endpoints that host a single model and does not support monitoring multi-model endpoints
- Supports monitoring inference pipelines, but it monitors the whole pipeline and not individual containers in the pipeline
- If SageMaker Studio runs in your VPC, you need VPC endpoints to let Model Monitor communicate with S3 and CloudWatch
- CloudWatch Logs collects log files of monitoring the model status and alerts when thresholds are breached
- CloudWatch stores the log files to an Amazon S3 bucket you specify
- If the emit_metrics option is Enabled, metrics are available in CloudWatch
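A sketch tying the endpoint pieces together: enable data capture at deploy time and attach an hourly data-quality schedule using the baseline monitor from the sketch above. Here `model` is assumed to be an existing SageMaker Model object, and all names and paths are placeholders.

```python
# Sketch: enable data capture on a real-time endpoint and attach an hourly
# data-quality monitoring schedule (names, paths and `model` are placeholders).
from sagemaker.model_monitor import DataCaptureConfig, CronExpressionGenerator

# 1) Capture requests/predictions when deploying the model
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/model-monitor/data-capture",
)
predictor = model.deploy(                       # `model` is an existing sagemaker.model.Model
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-endpoint",
    data_capture_config=data_capture_config,
)

# 2) Schedule hourly analysis against the suggested baseline (see the baseline sketch above)
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-data-quality",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/model-monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,             # emit metrics to CloudWatch
)
```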
Basics
- Monitors the quality of Amazon SageMaker machine learning models in production
- Continuous monitoring with a real-time endpoint (or a batch transform job that runs regularly)
- On-schedule monitoring for asynchronous batch transform jobs
- Alerts via CloudWatch when there are deviations in the model quality
- Visualize data drift in SageMaker Studio; integrates with TensorBoard, QuickSight, and Tableau
HyperPod
Basics
- Purpose-built infrastructure for distributed training at scale
- Reduces training time by up to 40%
- HyperPod is pre-configured with SageMaker’s distributed training libraries enabling split training across thousands of accelerators
- Ensures uninterrupted training/tuning by periodically saving checkpoints (see the sketch after this list):
- In case of a hardware failure during training, HyperPod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last checkpoint
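The checkpoint/resume pattern itself is framework-level; a plain-PyTorch sketch follows, where the checkpoint path, interval, and field names are assumptions. HyperPod only needs such checkpoints to exist so a resumed job can pick up where it left off.

```python
# Plain-PyTorch illustration of the checkpointing pattern HyperPod resumes from
# after it repairs or replaces a faulty node (path and fields are assumptions).
import os
import torch

CKPT_PATH = "/fsx/checkpoints/latest.pt"        # e.g. on a shared FSx for Lustre volume

def save_checkpoint(model, optimizer, epoch):
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Return the epoch to resume from (0 if no checkpoint exists yet)."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# In the training loop: start_epoch = load_checkpoint(model, optimizer),
# then call save_checkpoint(model, optimizer, epoch) every few epochs.
```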
Key Features
- Optimized distributed training libraries
- Preconfigured with SageMaker distributed libraries
- Automatically splits your models and training datasets across AWS GPU instances
- Also use other distributed frameworks/packages: PyTorch DistributedDataParallel (DDP), torchrun, MPI (mpirun), and parameter server.
- Automatic cluster health check and repair
- Regularly runs an array of health checks for GPU and network integrity
- Debug and improve model performance
- Purpose-built HyperPod ML tools to improve training performance
- Integrates with TensorBoard to visualize the model architecture and identify convergence issues, such as validation loss not converging or vanishing gradients
- Workload scheduling and orchestration
- Based on OSS Slurm
- Install any needed frameworks or tools
- Clusters provisioned with the instance type and count you choose
Spec
- No specific instance types
- Deploy in VPC and access FSx for Lustre
- Give different IAM roles to cluster instance groups
- Use SageMaker HyperPod DLAMI (Deep Learning AMI)
- Customize the DLAMI by providing lifecycle scripts
- Instance groups:
- Multiple instance groups per cluster
- Each instance group can be configured differently
- Cluster nodes:
- head or controller node
- login node
- worker node
- Steps:
- Set up your HyperPod cluster (see the sketch after this list)
- Schedule jobs
- Monitor with Prometheus and Grafana
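A hedged boto3 sketch of creating a HyperPod cluster with one controller and one worker instance group; every value below (names, instance types, roles, lifecycle-script location) is a placeholder, and the field set should be checked against the CreateCluster API reference.

```python
# Sketch: create a HyperPod cluster with boto3 (all values are placeholders).
import boto3

sm = boto3.client("sagemaker")

response = sm.create_cluster(
    ClusterName="my-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.m5.4xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",  # your lifecycle scripts
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodControllerRole",
            "ThreadsPerCore": 1,
        },
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p4d.24xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodWorkerRole",
            "ThreadsPerCore": 1,
        },
    ],
)
print(response["ClusterArn"])
```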
SageMaker Clarify
How it works
- A Clarify processing job uses the Clarify processing container to interact with an S3 bucket containing your input datasets and the analysis configuration (see the sketch after this list)
- Interaction depends on the specific type of analysis:
- The Clarify processing container accesses the input dataset and analysis configuration from the S3 bucket
- For certain analysis types, including feature analysis, the Clarify processing container sends requests to the model container
- It retrieves model predictions from the model container's responses
- The Clarify processing container computes and saves the analysis results to the S3 bucket
- Clarify needs model predictions to compute post-training bias metrics and feature attributions. Provide an inference endpoint, or Clarify creates an ephemeral shadow endpoint
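A minimal sketch of a Clarify explainability (SHAP) processing job with the SageMaker Python SDK; the S3 paths, headers, model name, and SHAP baseline are placeholders.

```python
# Sketch: run a Clarify explainability (SHAP) job against a trained model.
# S3 paths, headers, model name and SHAP baseline are placeholders.
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/validation/validation.csv",
    s3_output_path="s3://my-bucket/clarify/explainability",
    label="target",
    headers=["target", "age", "income", "tenure"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="my-model",              # Clarify spins up an ephemeral shadow endpoint for it
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

shap_config = clarify.SHAPConfig(
    baseline=[[35, 50000, 12]],         # baseline record(s) for the features
    num_samples=100,
    agg_method="mean_abs",              # global aggregation method
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```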
Basics
- Evaluate LLMs
- Explain models with feature attributions for tabular, natural language processing (NLP), and computer vision models
- Using Shapley values
- Partial dependence plots (PDPs) to understand how much the predicted target variable would change if you varied the value of one feature
- Detect bias (see the bias sketch at the end of this section)
- Identify types of bias in pre-training data
- Identify types of bias in post-training data that can emerge during training or when your model is in production
- Areas of applicability:
- Regulatory – Policymakers and other regulators can have concerns about discriminatory impacts of decisions that use output from ML models
- Business – Regulated domains may need reliable explanations for how ML models make predictions
- Data Science – Data scientists can debug and improve ML models by determining whether a model is relying on noisy or irrelevant features
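A companion sketch for bias detection, reusing clarify_processor, data_config, and model_config from the explainability sketch above; the facet column, facet values, and label threshold are placeholders.

```python
# Sketch: detect bias before and after training, reusing the configs from the sketch above.
# Facet name/values and label threshold are placeholders for your own dataset.
from sagemaker import clarify

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],      # which label value counts as the positive outcome
    facet_name="gender",                # sensitive attribute column
    facet_values_or_threshold=[0],      # the facet value to check for disadvantage
)

# Pre-training bias: looks only at the labels in the dataset
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)

# Post-training bias: also needs model predictions (DPPL, DI, ...)
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=clarify.ModelPredictedLabelConfig(probability_threshold=0.5),
    methods="all",
)
```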