Chapters 13-15
Chapter 13. Startup Processes
The easiest way to bring a dataset into DataRobot (DR) is to read in a local file.
While learning, stick to a smaller, downsampled dataset.
DR accepts CSVs.
In CSVs, be cautious about commas within the data, as they can throw off how the columns are parsed.
Because of this, the cautious move is to use a TSV, where tabs instead of commas separate the columns.
Excel files are the safest bet if the data is small enough to be read into Excel.
You can also use a URL to link to data on the web. If the data is stored in a database, it can be read over ODBC.
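The comma problem can be seen in a small sketch. The rows and column names below are made up for illustration and are not from the hospital readmission dataset. It uses Python's standard csv module to show how a naive comma split breaks a value that contains a comma, while a tab-delimited file keeps the columns intact.

```python
import csv
import io

# Hypothetical rows; the diagnosis text contains a comma.
rows = [
    ["patient_id", "diagnosis", "readmitted"],
    ["1", "diabetes, type 2", "yes"],
]

# Naive comma joining and splitting turns three columns into four.
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")

# Writing the same rows with a tab delimiter keeps three fields per row.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerows(rows)

# Read the tab-separated text back to confirm the columns survived.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
```

A properly quoted CSV would also survive, but that depends on whatever tool exported the file, which is why TSV is the more cautious choice.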
Project on hospital readmission: start by selecting the local file.
DR rapidly uploads the CSV file to the cloud platform, reads it, and prepares data analysis results. Name the project by clicking "Untitled Project" in the corner.
After creating multiple projects, create tags to keep track of them. This brings up a new menu where projects can be created and managed. Tags are given a color and are applied by pressing Apply. Then exit by clicking Data. There's also a Share button so you can collaborate.
Chapter 14. Feature Understanding and Selection
Now that the project and data are set up in DR, the next step is to interpret the data contained in each feature. At this point, DR is also looking for a target to predict.
14.1 Descriptive Statistics
Unique is listed after Var Type and shows how many unique values exist for each feature.
DR expands the feature details when the feature name is clicked.
Alongside Unique, the mean, median, min, max, std dev, and other descriptive stats are listed for each feature.
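As a rough sketch of what these columns summarize, the same stats can be computed by hand with Python's standard library. The values are made up, and using the sample standard deviation is an assumption, since these notes don't say which variant DR reports.

```python
import statistics

# Hypothetical values for one numeric feature.
values = [3, 1, 4, 1, 5, 9, 2, 6]

summary = {
    "unique": len(set(values)),            # number of distinct values
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
    "std_dev": statistics.stdev(values),   # sample std dev (assumption)
}
```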
In the bin visualization, a bracket denotes an inclusive bound, whereas a parenthesis denotes an exclusive bound. For example, [10, 20) includes 10 but not 20.
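The bracket and parenthesis notation can be checked with a small helper. The function name in_bin and its parameters are hypothetical, made up for this sketch; the default matches a bin written as [low, high).

```python
def in_bin(value, low, high, low_inclusive=True, high_inclusive=False):
    """Return True if value falls in the bin; the default is [low, high)."""
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below

# [10, 20): 10 is inside, 20 is not.
left_edge = in_bin(10, 10, 20)
right_edge = in_bin(20, 10, 20)
```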
14.3 Evaluation of Feature Content and 14.4 Missing Values
Unique helps us find duplicates and weed them out of the data
When the target feature is selected, it is tagged as the target. The code "too many values" can also be encountered.
The "Missing" column tells how many values are missing from a specific feature. Filtering out this data will pay dividends.
Algos that struggle with missing values include regression, neural networks, and support vector machines. DR detects nulls and question marks (?) as missing values. This is especially relevant when joins are done.
If there's missing data and an algo can't handle missing values, DR will sometimes impute values before running it.
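A minimal sketch of counting and imputing missing values follows. The set of missing markers and the choice of median imputation are assumptions for illustration; these notes don't say which imputation DR applies.

```python
import statistics

# Markers treated as missing in this sketch (an assumption, not DR's full list).
MISSING_MARKERS = {"", "?", None}

def count_missing(column):
    """Count the entries that match one of the missing markers."""
    return sum(1 for v in column if v in MISSING_MARKERS)

def impute_median(column):
    """Replace missing entries with the median of the present values."""
    present = [float(v) for v in column if v not in MISSING_MARKERS]
    fill = statistics.median(present)
    return [fill if v in MISSING_MARKERS else float(v) for v in column]

# Hypothetical raw column as it might arrive after a join.
raw = ["3", "?", "5", "", "10"]
n_missing = count_missing(raw)
filled = impute_median(raw)
```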
14.2 Data Types
Categorical features are either binary or multi-class.
The Var Type indicates the nature of the data inside a feature.
Bar charts for categorical features are a lot less useful than those for numeric features.
DR can detect and tag currencies based on the presence of currency symbols. It can also convert measurements to the desired format.
Chapter 15. Build Candidate Models
Now, the data is ready to create a deep-learning neural network model
Most models built in this chapter will serve as candidate models, meaning they will improve our understanding of what data works well when constructing models. We have to know the best algos to work with.
Now we select the target feature. The desired target feature can be found directly in the feature list: find it, then click "Use as Target".
The target feature's distribution is then displayed. If the distribution is imbalanced, DR downsamples the majority class (randomly removes cases).
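Downsampling the majority class can be sketched as below. The function name, the row layout, and the choice to cut the majority class down to the size of the smallest class are all assumptions; the notes don't state what ratio DR aims for.

```python
import random
from collections import Counter

def downsample_majority(rows, target_key, seed=0):
    """Randomly drop majority-class rows until they match the smallest class."""
    counts = Counter(row[target_key] for row in rows)
    majority, _ = counts.most_common(1)[0]
    keep_size = min(counts.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    majority_rows = [r for r in rows if r[target_key] == majority]
    other_rows = [r for r in rows if r[target_key] != majority]
    return other_rows + rng.sample(majority_rows, keep_size)

# Hypothetical data: 8 patients not readmitted, 2 readmitted.
rows = [{"readmitted": False} for _ in range(8)]
rows += [{"readmitted": True} for _ in range(2)]
balanced = downsample_majority(rows, "readmitted")
```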
DR lets you choose a metric to optimize the produced models for
LogLoss means the model is evaluated on the probabilities it generates and their distance from the correct answer. Basically, predictions are weighted by confidence rather than just scored as true or false, so a confident wrong prediction costs more than a hesitant one.
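For a binary target, the standard LogLoss formula can be written out as a short sketch. The function below is a hand-rolled illustration, not DR's own code; the eps clipping is a common guard against taking log(0).

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Same wrong answer, different confidence: the confident miss costs more.
hesitant_miss = log_loss([1], [0.4])
confident_miss = log_loss([1], [0.1])
```

Lower is better: a model that assigns 0.9 to a case that turns out true scores better than one that assigns 0.6, even though both count as "correct" at a 0.5 threshold.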
15.3 Starting the Analytical Process
DR runs the analysis through one of three options: Autopilot, Quick, and Manual.
Quick is an abbreviated version of Autopilot that produces almost-as-good models by shortcutting the DR process.
Quick and Autopilot are identical except that Autopilot starts the analysis at 16% of the sample and then uses that info to determine which models to run at 32%. Quick starts at 32% with historically well-performing models.
15.4 Model Selection Process
The 16% and 32% samples are used to train the models, while the full set is used to evaluate them.
In DR, symbols mark where each model comes from, such as R-based models, Vowpal Wabbit, TensorFlow, XGBoost, and blender models.
From the 32% round, the models move on to 64%.
Cross validation is then run if the validation dataset is small, meaning fewer than 10k cases.
Then a blender averages the probability scores from each model's predictions.
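An average blender can be sketched in a few lines. The function name and the probabilities are made up, and this only shows a plain unweighted average; DR may offer other blending schemes that these notes don't cover.

```python
def average_blend(model_probs):
    """Average each case's predicted probability across all models."""
    n_models = len(model_probs)
    # zip(*model_probs) groups the predictions for the same case together.
    return [sum(case) / n_models for case in zip(*model_probs)]

# Hypothetical predictions from two models for the same two cases.
blended = average_blend([
    [0.25, 0.50],  # model A
    [0.75, 1.00],  # model B
])
```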
Autopilot leads to the best results. It runs through these steps:
Step 1, "Setting target feature", transfers the user's target feature decisions into the analysis system.
Step 2, "Creating CV and Holdout partitions", uses those decisions to randomly assign cases to the holdout and cross validation folds.
Step 3, "Characterizing target variable", is where DR saves the distribution of the target to the analysis system for later use in decisions about which models to run.
Step 4, "Loading dataset and preparing data", is relevant if the dataset is large.
Step 5, "Saving the target and partition info", is where the actual partitions are stored.
Step 6 is where importance scores are calculated.
Step 7, "Calculating list of models", is where info from steps 3-6 is used to determine which blueprints to run in the Autopilot process.
After these steps, an Importance column is added.
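The random partitioning in Step 2 can be sketched as follows. The 20% holdout and five folds are assumptions picked for the example, and the function name is hypothetical; this is not DR's actual partitioning code.

```python
import random

def assign_partitions(n_cases, holdout_fraction=0.2, n_folds=5, seed=0):
    """Randomly label each case index as holdout or one of the CV folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    n_holdout = int(n_cases * holdout_fraction)
    labels = {}
    for i in indices[:n_holdout]:
        labels[i] = "holdout"
    # Deal the remaining cases round-robin into the folds.
    for position, i in enumerate(indices[n_holdout:]):
        labels[i] = f"fold_{position % n_folds + 1}"
    return labels

partitions = assign_partitions(100)
```

Each case gets exactly one label, so the holdout stays untouched while the folds take turns as validation data.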