Pluralsight: Understanding Machine Learning in R - Machine Learning Workflow
Machine Learning Workflow
An orchestrated and repeatable pattern that systematically transforms and processes information to create prediction solutions
4) Select a subset of the data to train your model
Training Overview
1) Split Data
Training data ~ 70%
Testing data ~ 30%
Data Splitting Requirements
divide into two data frames by a user-specified percentage
make sure the proportion of the rare event in the entire data set is preserved in both the training and testing data
caret's createDataPartition function can ensure this
createDataPartition(data.frame$column.to.be.predicted, p = 0.70, list = FALSE)
the 1st argument is the column containing the value the model should predict; the function ensures this column keeps the same proportion of the rare event in both the training and testing data frames
the 2nd argument, p, specifies the proportion of the data that goes into the training set, usually 70%
the 3rd argument, list = FALSE, tells the function to return the row indices as a matrix rather than a list
the returned row indices identify the rows that go into the training partition
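A minimal sketch of the split (the data frame df and its label column outcome are hypothetical names):
library(caret)
set.seed(42)                                   # fix the random seed so the split is reproducible
train.rows <- createDataPartition(df$outcome, p = 0.70, list = FALSE)
training <- df[train.rows, ]                   # ~70% of rows, rare-event proportion preserved
testing  <- df[-train.rows, ]                  # remaining ~30%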
2) Train the model using the selected algorithm
3) Trained model is produced from the training data
Selecting the Features/Columns to be used for the training data
train with only the minimum number of features (columns) needed
decide which columns will be relevant to finding your predicted value: what factors might come into play?
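A sketch of trimming to a minimal feature set (all column names hypothetical):
keep.cols <- c("glucose", "age", "bmi", "outcome")   # chosen features plus the label column
df <- df[, keep.cols]                                # drop everything else before training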
Caret Package
stands for "
C
lassification
A
nd
Re
gression
T
raining
has a common interface across algorithms
a toolset for training and evaluation tasks
data-splitting
pre-processing
feature selection
model tuning
1) Asking the right question
2) Preparing data
check whether any columns in the data frame are correlated, meaning they change together. If two columns give the same information, one of them should be eliminated; otherwise the duplicated information gives that factor twice the importance it should have, which can make the downstream modelling inaccurate.
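A sketch of the correlation check using caret's findCorrelation() (the 0.9 cutoff is an assumption; column names hypothetical):
library(caret)
numeric.df <- df[, sapply(df, is.numeric)]             # cor() needs numeric columns
cor.matrix <- cor(numeric.df)
high.cor <- findCorrelation(cor.matrix, cutoff = 0.9)  # indices of highly correlated columns
names(numeric.df)[high.cor]                            # candidate columns to eliminate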
3) Select the proper algorithm
train()
this function invokes the training algorithm; the training data is passed in the function call
the data contains "examples" of the value I'm trying to predict, and "features" that enable predicting that value
the algorithm executes its logic and produces a "trained" model
this model is another program?
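A sketch of a train() call (the formula and data frame reuse the hypothetical names from the split above; method = "glm" is just one choice):
library(caret)
model <- train(outcome ~ .,        # predict outcome from all remaining features
               data = training,    # for classification, outcome should be a factor
               method = "glm")     # the method string selects the algorithm; the interface stays the same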
predict()
the prediction model produced by the training algorithm is used here
to call this function, the trained model and real (new) data are passed
which factors should be considered when selecting the correct algorithm?
Decision Factors
2) Result
1) Regression: continuous values
2) Classification: discrete values
3) Complexity
ensemble?
container algorithms that have multiple child algorithms
can be difficult to debug
often used to fine tune the model to increase performance
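Random forest is a common example of an ensemble: a container of many decision trees. In caret it is selected with just another method string (a sketch, reusing the hypothetical training data from above):
rf.model <- train(outcome ~ ., data = training, method = "rf")   # ensemble of decision trees; needs the randomForest package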
1) Learning Type
1) Supervised
2) Unsupervised
4) Basic or Enhanced
Basic
easier to understand
Candidate Algorithms (see the caret training sketch after this list)
Naive Bayes
Based on Bayes' Theorem
assumes that all features are independent of each other and that every feature has the same weight
based on likelihood and probability
Requires a smaller amount of data
Logistic Regression
returns a binary result
relationships between features are weighted
works fast
measures the relationship of each feature and weights them based on their impact on the result
stable to data changes
Decision Tree
based on a binary tree
each node in the tree contains a decision
requires enough data to determine nodes and splits
simpler
Enhanced
Variation of basic
used for performance improvements
additional functionality
more complex
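A sketch of training the three basic candidates through caret's common interface (names hypothetical; each method string pulls in its own underlying package):
nb.model   <- train(outcome ~ ., data = training, method = "nb")     # Naive Bayes (klaR)
glm.model  <- train(outcome ~ ., data = training, method = "glm")    # logistic regression
tree.model <- train(outcome ~ ., data = training, method = "rpart")  # decision tree (rpart)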
5) Test the model
1) Evaluate the model against test data
prediction.object <- predict(R.object.storing.trained.model, test.data.object)
2) Interpret the results
confusionMatrix(prediction.object, test.data.object[, "column.name.of.what.to.predict"])
improve results
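Putting the two test-step calls together (a sketch, assuming the model and split from the earlier examples and a label column named outcome):
predictions <- predict(model, testing)          # score the held-out testing data
confusionMatrix(predictions, testing$outcome)   # accuracy plus per-class statistics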