Data science learning map
🚩prerequisite
🚩ETL
Database #
Data lake #
Data warehouse
non-relational
relational
relational
can be anything
objects + schema
relational data for a specific use, e.g. transactional
datasets merged from multiple DBs
usually dynamic, online data
static, historical data
Calculus
Linear algebra
🚩Statistics
derivative
chain rule
integration
eigenvalue
Probability
Distribution
Statistical test
Bayesian
Z
T
Chi-sq
F
legend explanation:
🚩 knowledge domain
✅ tool, software, framework (API)
🔒supervised (with label) model
🔓unsupervised (without label) model
❤important concept
💰business application
🚩Statistical learning
✅ SQLs
Chain rule
3 components
prior
likelihood
posterior
P(Y=1)
P(X | Y=1)
P(Y=1 | X)
Conjugate prior
Time series
Exogenous / Endogenous
Gaussian / Normal
Student T
Bernoulli
Binomial
Categorical
Beta
prior: A distribution + likelihood: B distribution = posterior: A distribution
Poisson
Chi-squared
for counts of categorical data
Heteroscedasticity / Homoscedasticity
to test whether a set of Xs is related to Y
to test the difference in groups' means
frequency data
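The conjugate prior entry above (prior: A distribution + likelihood: B distribution = posterior: A distribution) can be illustrated with one well-known pair: a Beta prior with a Binomial likelihood gives a Beta posterior. This is a minimal sketch; the function names and the example numbers are assumptions chosen for illustration, not part of the map.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Return the Beta posterior parameters after observing Binomial data.

    Prior: Beta(alpha, beta). Likelihood: Binomial with the given counts.
    Posterior: Beta(alpha + successes, beta + failures), same family as the prior.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: a Beta(2, 2) prior, then 7 successes and 3 failures.
prior = (2.0, 2.0)
posterior = beta_binomial_update(*prior, successes=7, failures=3)
print(posterior, beta_mean(*posterior))
```

The posterior stays in the Beta family, so the update reduces to adding counts to the prior parameters.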
🚩Machine Learning
🚩Deep learning
Assumption
dataset represents population
population follows some distribution
Supervised Learning
Discrete Target
❤null hypothesis
🔒Naive Bayes
Unsupervised Learning
Continuous Target
🔒KNN
🔓K-means
Markov Chain
Monte Carlo Simulation
Markov Chain Monte Carlo
🚩Ensemble Learning
🔒Decision Tree #
Tree-based
❤Core concepts
🔒Extreme Gradient Boosting
🔒Logistic Regression
🔒 Linear Regression/SGD
Use gradient descent to reach the best weights (beta)
Gradient Descent
Loss function / Cost
🔒 Logistic Regression & Softmax
🔒Linear Discriminant Analysis
Xs follow Gaussian, same covariance
🔒Quadratic Discriminant Analysis
Xs follow Gaussian, different covariance
🔓 DBSCAN
Xs follow Gaussian
log(odds) = W^T X
🔒Linear Regression / Ordinary Least Squares Regression
🔒Survival Regression
use Bayes' theorem with the chain rule
Xs follow Gaussian
Use gradient descent to reach the best weights (beta)
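The logistic regression entries above (log(odds) is linear in X, weights found by gradient descent) can be sketched as follows. This is a minimal illustration with one feature plus an intercept; the function names, learning rate, epoch count, and toy data are all assumptions, and a real project would more likely use a library such as sklearn.

```python
import math


def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log loss for y ~ sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: larger x makes the label 1 more likely.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, sigmoid(w * 4.0 + b))
```

Here `w * x + b` plays the role of the log(odds), and the learning rate controls how far each update moves the weights.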
🚩Reinforcement learning
🔒Support Vector Machine
Kernel approach
RBF / Polynomial / Linear
Mapping data into a higher dimension
🔒Upper Confidence Bound
🔒Thompson Sampling
stochastic approach
Time Series
🔒Autoregression & Moving Average
🔒ARIMA(X)
🔒SARIMAX
🔒STL method
(Seasonal and Trend decomposition using Loess)
probability that the event has not happened by time T
🚩Core concepts
🔓Principal Component Analysis
Lags of target (Y)
Seasonality
White noise
Random Walk
Difference
🔒Bagging tree / Random Forest
✅ R
✅ SAS
✅ Python::statsmodels
✅ Python::sklearn
Sequential
Non sequential
🔒Artificial Neural Network
🔒Convolutional Neural Network
Good for data where dimension reduction and feature extraction are very important
🔒Recurrent Neural Network
Good for data where sequence matters
time series data & text data (NLP)
image data
❤Long Short-Term Memory
❤ Convolutional layer
❤ max pooling
❤ Flattening
🚩Association Rule Learning
🔓Apriori
🚩Natural Language Processing
❤bag of words
❤stemming
❤sparse matrix
❤ TF-IDF
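The bag of words and TF-IDF concepts above can be sketched in a few lines. This is a minimal illustration using one common definition (term frequency = count / document length, inverse document frequency = log(N / document frequency)); libraries use smoothed variants, so their scores will differ. The function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words for this document
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
print(scores[1])
```

A term that appears in every document (here "the") gets a score of zero, while a term unique to one document (here "dog") gets the highest score in that document.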
✅ Python::TensorFlow::Keras
✅ Python::PyTorch
🔓SOM (Self-Organizing-Map)
🔓 Boltzmann Machine
🔓Auto-encoder
🔓Deep Belief Networks
assumption
distribution is not important
each model's target can be either discrete or continuous
🚩Data visualization
Box plot
discrete X, continuous Y
Histogram
quantile
continuous X, continuous Y
Y is density
Bar plot
discrete X, continuous Y
Y is usually frequency; or some groups' volume
✅ Tableau
✅ PowerBI
✅ MS Excel
✅ R::ggplot2, Shiny (interactive)
✅ Python::matplotlib, seaborn, plotly
Heat map
discrete X, discrete Y; fill is continuous
filled color represents some volume, X&Y represent groups
Line chart
continuous X, continuous Y
usually used on time series: X represents time, Y represents some volume
Metrics
❤Mean
Variance/ Standard deviation/ Covariance/ Correlation coefficient
❤Median
Mode
skewness
❤confidence interval
❤ P-Value
✅ Python::NumPy
✅ R::base
a subset of Machine Learning
The machine extracts/selects features by itself
🔒Panel Regression
❤ statistical interpretation
💰recommender system
💰segmentation
💰customer lifetime value
💰financial forecast
💰market basket analysis
💰 discriminant analysis
💰sentiment analysis
💰 A/B testing
💰 A/B testing
✅ noSQLs
🚩Preprocessing
Learning rate
Optimizer
Feature Engineering / Data Cleaning
Encoding
Categorical
Improve the quality of data, let your model learn better
Missing Value imputation
one hot encoder / dummy variable
Time series down/up sampling
Smoothing
Continuous
Scaling
Normalized
Standardized
rescale by min and max to [0, 1]
rescale by μ and σ to (-inf, inf), centered at 0
ordinal
❤Data type
collection
list/ vector/ tuple/ matrix / array/ dataframe/ datatable
object
character/ string/ integer/ float/ Null/ Boolean/ datetime
Outlier handling
❤ Regularization
L2 method (squared)
🔒 Ridge Regression
L1 method (absolute)
🔒 Lasso Regression
reduce overfitting
Discretization
down scale variance
Mean/ Median/ Mode/ Random
Monotone revalue
Frequency
Probability ratio
Weight of evidence
Rare label handling
make your API recognize the data
✅ R::Tidyverse, data.table
✅ Python::NumPy, pandas, sklearn::preprocessing, feature-engine
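The two scaling entries above (Normalized: rescale by min and max; Standardized: rescale by μ and σ) can be sketched with the standard library alone. The function names and sample data are assumptions for illustration, and the sketch assumes the values are not all identical, since that would divide by zero.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data.
data = [2.0, 4.0, 6.0, 8.0]
print(normalize(data))
print(standardize(data))
```

Normalized values are bounded in [0, 1], while standardized values are unbounded but centered at 0 with unit standard deviation.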
Correlation Matrix
X, Y are variables (columns)
the intersection is a correlation coefficient
3 types
❤bagging
❤boosting
❤stacking
Resampling on data / Bootstrapping
learn from previous models' error
Use multiple weak models to train a meta model
🔒Hierarchical clustering
Scatter / Point plot
continuous X, continuous Y
shows how data points are scattered
❤ Join
left/ right/ inner/ outer
💰 understand your customers
💰 segmentation
multiple means, multiple time
Laplace Prior
Gaussian Prior
feature selection (sparsity), reduce collinearity
Central Limit Theorem
Robustness of Model
K-fold Cross Validation
feature importance
Grid search Cross Validation
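The K-fold cross validation entry above can be sketched as an index splitter: the data is cut into k folds, and each fold serves once as the test set while the rest is used for training. This is a minimal sketch with contiguous, unshuffled folds; the function name is an assumption, and in practice sklearn provides ready-made splitters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Every sample appears in exactly one test fold, so averaging a metric over the folds uses all of the data for evaluation.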
Batch
💰 stores revenue forecast
fixed VS random effect
🔒Deep Q-learning
💰self-driving car
💰 recommendation system