Data science learning map

🚩prerequisite

🚩ETL

Database #

Data lake #

Data warehouse

non-relational

relational

relational

can be anything

objects + schema

relational data for a specific use, e.g. transactional

datasets merged from multiple DBs

usually dynamic, online data

static, historical data

Calculus

Linear algebra

🚩Statistics

derivative

chain rule

integration

eigenvalue

Probability

Distribution

Statistical test

Bayesian

Z

T

Chi-sq

F

legend explanation:


🚩 knowledge domain
✅ tool, software, framework (API)
🔒supervised (with label) model
🔓unsupervised (without label) model
❤important concept
💰business application

🚩Statistical learning

✅ SQL

Chain rule

3 components

prior

likelihood

posterior

P(Y=1)

P(X | Y=1)

P(Y=1 | X)
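
A minimal sketch of how the three components combine through Bayes' rule, P(Y=1 | X) = P(X | Y=1) P(Y=1) / P(X). The function name `posterior` and the numbers are illustrative assumptions, not part of the map.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """Return P(Y=1 | X) for a binary target via Bayes' rule.

    prior          -- P(Y=1)
    likelihood_pos -- P(X | Y=1)
    likelihood_neg -- P(X | Y=0)
    """
    # Evidence P(X) from the law of total probability.
    evidence = likelihood_pos * prior + likelihood_neg * (1.0 - prior)
    return likelihood_pos * prior / evidence


# Hypothetical numbers: 1% base rate, X seen in 90% of positives
# and in 5% of negatives.
p = posterior(prior=0.01, likelihood_pos=0.9, likelihood_neg=0.05)
print(round(p, 4))
```

Even with a strong likelihood, a small prior keeps the posterior modest, which is why the prior is listed as its own component.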

Conjugate prior

Time series

Exogenous / Endogenous

Gaussian / Normal

Student T

Bernoulli

Binomial

Categorical

Beta

prior (family A) + likelihood (family B) → posterior (same family A)
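
One common instance of a conjugate pair, sketched under assumptions: a Beta prior with a Bernoulli/Binomial likelihood gives a Beta posterior. The helper names and the prior values below are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli/Binomial data."""
    # Conjugacy: the update is just adding the observed counts.
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


alpha0, beta0 = 2.0, 2.0  # hypothetical prior, centred on 0.5
alpha1, beta1 = beta_binomial_update(alpha0, beta0, successes=7, failures=3)
print(alpha1, beta1, beta_mean(alpha1, beta1))
```

The posterior stays in the Beta family, so it can serve as the prior for the next batch of data.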

Poisson

Chi-squared

for counts of categorical data

Heteroscedasticity / Homoscedasticity

to test whether a set of Xs is related to Y

to test the difference in group means

frequency data

🚩Machine Learning

🚩Deep learning

Assumption

dataset represents population

population follows some distribution

Supervised Learning

Discrete Target

❤null hypothesis

🔒Naive Bayes

Unsupervised Learning

Continuous Target

🔒KNN

🔓K-means

Markov Chain

Monte Carlo Simulation

Markov Chain Monte Carlo

🚩Ensemble Learning

🔒Decision Tree #

Tree-based

❤Core concepts

🔒Extreme Gradient Boosting

🔒Logistic Regression

🔒 Linear Regression/SGD

Use gradient descent to reach the best weights (beta)

Gradient Descent

Loss function / Cost
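
A minimal sketch of gradient descent on a loss function, assuming a one-feature linear model and mean squared error as the cost. The function name, learning rate, and step count are illustrative choices, not values from the map.

```python
def fit_linear_gd(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ b0 + b1 * x by batch gradient descent on mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Residuals of the current fit.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) w.r.t. b0 and b1.
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x
b0, b1 = fit_linear_gd(xs, ys)
print(round(b0, 3), round(b1, 3))
```

A learning rate that is too large makes the updates diverge; one that is too small needs many more steps.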

🔒 Logistic Regression & Softmax

🔒Linear Discriminant Analysis

Xs follow Gaussian, same covariance

🔒Quadratic Discriminant Analysis

Xs follow Gaussian, different covariance

🔓 DBSCAN

Xs follow Gaussian

log(odds) = W^T X

🔒Linear Regression / Ordinary Least Squares Regression

🔒Survival Regression

uses Bayes' rule with the chain rule of probability

Xs follow Gaussian

Use gradient descent to reach the best weights (beta)

🚩Reinforcement learning

🔒Support Vector Machine

Kernel approach

RBF / Polynomial / Linear

Mapping data into higher dimension

🔒Upper Confidence Bound

🔒Thompson Sampling

stochastic approach

Time Series

🔒Auto Regression & Moving Average

🔒ARIMA(X)

🔒SARIMAX

🔒STL method
(Seasonal and Trend decomposition using Loess)

probability that the event has not happened by time T

🚩Core concepts

🔓Principal Component Analysis

Lags of target (Y)

Seasonality

White noise

Random Walk

Difference
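
A small sketch tying three of these concepts together, under the assumption of Gaussian white noise: a random walk is the running sum of white noise, and first differencing recovers that noise. The seed and series length are arbitrary.

```python
import random


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]  # white noise

# Random walk: each value is the previous value plus a noise term.
walk = []
level = 0.0
for e in noise:
    level += e
    walk.append(level)

# First difference of the walk gives back the noise (minus the first term).
recovered = difference(walk)
print(len(recovered))
```

Passing `lag` equal to the season length gives a seasonal difference instead.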

🔒Bagging tree / Random Forest

✅ R
✅ SAS
✅ Python::statsmodels

✅ Python::sklearn

Sequential

Non sequential

🔒Artificial Neural Network

🔒Convolutional Neural Network

Good for data where dimension reduction and feature extraction are essential

🔒Recurrent Neural Network

Good for data where sequence matters

time series data & text data (NLP)

image data

❤Long Short-Term Memory

❤ Convolutional layer
❤ max pooling
❤ Flattening

🚩Association Rule Learning

🔓Apriori

🚩Natural Language Processing

❤bag of words
❤stemming
❤sparse matrix
❤ TF-IDF
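
A rough sketch of bag-of-words counts weighted by TF-IDF, assuming tf = count / document length and idf = log(N / document frequency). Libraries often use smoothed variants, so their values will differ. The function name and sample sentences are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words for this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every document ("the") gets weight 0, while a term unique to one document ("cat") gets the highest weight there.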

✅ Python::tensorflow::keras
✅ Python::PyTorch

🔓SOM (Self-Organizing-Map)

🔓 Boltzmann Machine

🔓Auto-encoder

🔓Deep Belief Networks

assumption

distribution is not important

each model's target can be either discrete or continuous

🚩Data visualization

Box plot

discrete X, continuous Y

Histogram

quantile

continuous X, continuous Y

Y is density

Bar plot

discrete X, continuous Y

Y is usually frequency; or some groups' volume

✅ Tableau
✅ PowerBI
✅ MS Excel
✅ R::ggplot2, Shiny (interactive)
✅ Python::matplotlib, seaborn, plotly

Heat map

discrete X, discrete Y; fill is continuous

fill color represents some volume; X and Y represent groups

Line chart

continuous X, continuous Y

usually used for time series:
X represents time, Y represents some volume

Metrics

❤Mean

Variance/ Standard deviation/ Covariance/ Correlation coefficient

❤Median

Mode

skewness

❤confidence interval

❤ P-Value

✅ Python::NumPy
✅ R::base

a subset of Machine Learning

Machine extracts/selects features by itself

🔒Panel Regression

❤ statistical interpretation

💰recommender system

💰segmentation

💰customer lifetime value

💰financial forecast

💰market basket analysis

💰 discriminant analysis

💰sentiment analysis

💰 A/B testing

✅ NoSQL

🚩Preprocessing

Learning rate

Optimizer

Feature Engineering / Data Cleaning

Encoding

Categorical

Improve data quality so your model learns better

Missing Value imputation

one hot encoder / dummy variable

Time Series Down/Up scaling

Smoothing

Continuous

Scaling

Normalized

Standardized

rescale by min and max to [0, 1]

rescale by μ and σ to (-inf, inf), centered at 0 with unit variance
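
The two rescalings can be sketched as follows, assuming a non-constant list of numbers and the population standard deviation. The function names are illustrative.

```python
import statistics


def normalize(values):
    """Min-max rescale to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [2.0, 4.0, 6.0, 8.0, 10.0]
normalized = normalize(data)
standardized = standardize(data)
print(normalized)
print(standardized)
```

Normalization is bounded but sensitive to a single extreme value; standardization is unbounded but keeps outliers from compressing the rest of the data as much.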

ordinal

❤Data type

collection

list/ vector/ tuple/ matrix / array/ dataframe/ datatable

object

character/ string/ integer/ float/ Null/ Boolean/ datetime

Outlier handling

❤ Regularization

L2 method (squared)

🔒 Ridge Regression

L1 method (absolute)

🔒 Lasso Regression

reduce overfitting
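
A minimal sketch of the two penalty terms, assuming the penalty is simply added to a data loss such as MSE. `lam` (the penalty strength) and the function names are hypothetical.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def penalized_loss(base_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to an already computed data loss (e.g. MSE)."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return base_loss + penalty


weights = [3.0, -4.0]
print(l1_penalty(weights, lam=0.5))  # 0.5 * (3 + 4)
print(l2_penalty(weights, lam=0.5))  # 0.5 * (9 + 16)
```

The squared penalty punishes large weights much harder than small ones, while the absolute penalty pushes small weights all the way to zero.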

discretization

down scale variance

Mean/ Median/ Mode/ Random

Monotone revalue

Frequency

Probability ratio

Weight of evidence

Rare label handling

convert the data into a form your API can accept

✅ R::Tidyverse, data.table
✅ Python::NumPy, pandas, sklearn::preprocessing, feature-engine

Correlation Matrix

X, Y are variables (columns)

the intersection is a correlation coefficient

3 types

❤bagging

❤boosting

❤stacking

Resampling on data / Bootstrapping

learn from previous models' error

Use multiple weak models to train a meta model

🔒Hierarchical clustering

Scatter / Point plot

continuous X, continuous Y

shows how the data points are distributed

❤ Join

left/ right/ inner/ outer
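
The four join types can be sketched on two tiny keyed tables. This is a simplification that assumes one row per key on each side; real SQL joins also handle repeated keys. The `join` helper and the sample tables are made up.

```python
def join(left, right, how="inner"):
    """Join two {key: value} tables on their keys.

    Returns {key: (left_value, right_value)}, with None where a side
    has no matching key.
    """
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present in both
    elif how == "left":
        keys = left.keys()                  # every left key
    elif how == "right":
        keys = right.keys()                 # every right key
    elif how == "outer":
        keys = left.keys() | right.keys()   # keys present in either
    else:
        raise ValueError(f"unknown join type: {how}")
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}


customers = {1: "Ann", 2: "Bob", 3: "Cid"}
orders = {2: 250, 3: 40, 4: 99}

for how in ("inner", "left", "right", "outer"):
    print(how, join(customers, orders, how))
```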

💰 understand your customers

💰 segmentation

multiple means, multiple time

Laplace Prior

Gaussian Prior

feature selection (sparsity), reduce collinearity

Central Limit Theorem

Robustness of Model

K-fold Cross-Validation
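
A sketch of how k-fold splitting partitions the row indices, assuming contiguous folds without shuffling (libraries usually offer shuffling and stratification as options). The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print(len(train), test)
```

Each row lands in exactly one test fold, so averaging the k scores uses every observation for validation once.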

feature importance

Grid search Cross Validation

Batch

💰 store revenue forecast

fixed vs. random effects

🔒Deep Q-learning

💰self-driving car

💰 recommendation system