4 - Modeling (SageMaker Training (Hardware (Performance: ASIC > GPU >…
4 - Modeling
SageMaker Training
Hardware
-
-
-
-
ASIC - specialized hardware with the model "burned in". Use it for inference when the same model is expected to be used for months or years
FPGA - specialized hardware with the model programmed into the chip (and reprogrammable later). Use it for inference when the same model will be used for hours or days
-
-
-
You can write your own Dockerfile, build the image, and push it to your own repository (e.g. Amazon ECR). You can then issue CreateTrainingJob and reference your custom image. Use an nvidia-docker compatible image when your algo is expected to run on a GPU
-
When using the SageMaker Spark library, the Spark DataFrame is converted to protobuf and uploaded to S3
SageMaker Modeling
4 basic areas in console
Ground Truth - set up and manage labeling jobs for training datasets using active learning and human labeling
-
-
-
-
Input Data
-
-
Most algos accept CSV. The target value should be in the first column with no header. Be sure to set metadata Content-Type to "text/csv" in S3
When we use an unsupervised algo, we need to specify the absence of a label. The Content-Type metadata of the file in S3 should include "label_size=0", e.g. "text/csv;label_size=0" (in other words, there is no label column)
For optimal performance, use the optimized protobuf recordIO format. With this format, we can take advantage of Pipe mode, which streams data from S3 and therefore requires less EBS space and starts up faster
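The CSV rules above can be sketched in a few lines of Python. This is only an illustration, assuming records arrive as dicts; the function name `to_sagemaker_csv` and the sample field names are hypothetical, not part of any SageMaker API.

```python
import csv
import io

def to_sagemaker_csv(rows, target_key, feature_keys):
    """Serialize dict records as headerless CSV with the target in the first column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Target value first, then the features; no header row is written.
        writer.writerow([row[target_key]] + [row[k] for k in feature_keys])
    return buf.getvalue()

# Hypothetical sample records for a fraud classifier.
records = [
    {"amount": 12.5, "hour": 9, "is_fraud": 0},
    {"amount": 980.0, "hour": 3, "is_fraud": 1},
]
body = to_sagemaker_csv(records, "is_fraud", ["amount", "hour"])
print(body)

# Content-Type values to set on the S3 object, as described in the notes.
SUPERVISED_CONTENT_TYPE = "text/csv"
UNSUPERVISED_CONTENT_TYPE = "text/csv;label_size=0"  # no label column
```

For an unsupervised algo, the same writer could be used without the target column, together with the second Content-Type value.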
CreateTrainingJob API
-
-
Steps in both cases
-
- Specify algo-specific hyperparameters
- Specify the input and output configuration
-
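A minimal sketch of what the hyperparameter and input/output steps might look like as a CreateTrainingJob request. It only builds the request dict, so it runs without AWS credentials. The helper name, image URI, role ARN, bucket paths, instance type, and hyperparameter values are all placeholder assumptions.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri, hyperparameters):
    """Assemble a CreateTrainingJob request body (hypothetical helper)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # built-in algo image or your custom image
            "TrainingInputMode": "File",     # "Pipe" streams data from S3 instead
        },
        "RoleArn": role_arn,
        # Algo-specific hyperparameters; the API expects string values.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
        # Input configuration: one channel pointing at the training data in S3.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": "text/csv",
        }],
        # Output configuration: where model artifacts are written.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# All identifiers below are placeholders, not real resources.
request = build_training_job_request(
    job_name="example-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest",
    role_arn="arn:aws:iam::<account>:role/<role-name>",
    train_s3_uri="s3://<bucket>/train/",
    output_s3_uri="s3://<bucket>/output/",
    hyperparameters={"predictor_type": "binary_classifier", "mini_batch_size": 100},
)

# With credentials configured, the request could be submitted via boto3:
# import boto3
# boto3.client("sagemaker").create_training_job(**request)
```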
Choose the right model
E.g. "Detect whether a financial transaction is fraud". You can use Binary Classification here since there are only 2 possible values: Fraud or Not Fraud
E.g. "Predict the rate of deceleration of a car when brakes are applied". We can use a Heuristic Approach here. No ML is needed, since well-known formulas involving speed, inertia, and friction can be used to predict this
E.g. "Determine the most efficient path of surface travel for a robotic lunar rover". Can use Simulation-based Reinforcement Learning because we must determine the optimal path via trial, error, and improvement
E.g. "Determine the breed of dog in a photograph". Can use Multi-class Classification since we need to determine which one of a number of breeds the dog in the photo most closely matches
Data Preparation
Steps
Randomize
Stratified Sampling - applies random sampling to each subgroup separately. It ensures that rare populations are not underrepresented in the training dataset
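As a rough illustration of stratified sampling, the sketch below splits each subgroup separately so that a rare class keeps its share in both the training and test sets. The function name `stratified_split`, the 20% test fraction, and the sample data are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, test_fraction=0.2, seed=0):
    """Randomly split each subgroup separately, then combine the pieces."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)  # randomize within the subgroup
        # At least one record per subgroup goes to test; note that a
        # single-member subgroup therefore ends up with no training record.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])

    rng.shuffle(train)  # avoid leaving the output ordered by subgroup
    rng.shuffle(test)
    return train, test

# Hypothetical imbalanced dataset: 10 "fraud" records vs 90 "ok" records.
data = [("fraud", i) for i in range(10)] + [("ok", i) for i in range(90)]
train, test = stratified_split(data, label_of=lambda r: r[0])
print(sum(1 for r in train if r[0] == "fraud"),
      sum(1 for r in test if r[0] == "fraud"))
```

A plain random split of the same data could, by chance, put very few "fraud" records in one of the two sets; splitting per subgroup removes that risk.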
-
-
-
-
Fold
-
Each round, you split the data using a different cross-section of the total dataset to ensure your model gets a variety of input
Error Rates - if the error rates of Rounds 1-4 are roughly the same, we can be fairly confident that our data was well randomized. If, on the other hand, one of the rounds has a significantly higher error rate, the dataset was not well randomized to begin with.
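The round-by-round comparison can be sketched as follows. This assumes four folds, to match Rounds 1-4, and uses a trivial majority-class predictor as a stand-in for a real model. The helper names and the synthetic labels are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k rounds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # randomize before folding
    folds = [indices[i::k] for i in range(k)]  # k disjoint cross-sections
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

def majority_error(labels, train_idx, val_idx):
    """Error rate of a stand-in model that always predicts the majority class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    wrong = sum(1 for i in val_idx if labels[i] != majority)
    return wrong / len(val_idx)

# Synthetic labels: every fifth record is positive.
labels = [1 if i % 5 == 0 else 0 for i in range(100)]
errors = [majority_error(labels, tr, va)
          for tr, va in k_fold_indices(len(labels), k=4)]
spread = max(errors) - min(errors)
print(errors, spread)
```

A small `spread` suggests the rounds saw similar data. A single round with a much higher error rate would point to poor randomization, as described above.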
Challenges with Sampling
Seasonality - time of the day, time of the year, holidays, etc. Stratified sampling would help here to minimize bias
Trends - patterns can shift over time and new patterns can emerge. To detect, compare models trained over different time periods
-
-
Cascading Algos
-
E.g. "What is the estimated basket size of shoppers who respond to our email promotion?" can be solved by chaining:
Remove Outliers (Random Cut Forest) -> Identify Relevant Attributes (PCA) -> Cluster into Groups (K-Means) -> Predict Basket Size (Linear Learner)