Google Data Science Interview
A/B testing
A/B testing intro
When to use A/B
Statistical power and sample size tradeoff
Lower alpha means higher beta at a fixed sample size; sensitivity (power) = 1 - beta
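The power and sample size tradeoff can be made concrete with the usual normal-approximation formula for comparing two proportions. This is a minimal sketch rather than a definitive calculator: the function name `sample_size_per_group` and the 10% vs 12% conversion rates are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided two-proportion z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical baseline rate 10%, treatment rate 12%.
n_base = sample_size_per_group(0.10, 0.12)
n_strict_alpha = sample_size_per_group(0.10, 0.12, alpha=0.01)
n_high_power = sample_size_per_group(0.10, 0.12, power=0.9)
```

Tightening alpha or asking for more power both push the required sample size up, which is the tradeoff the node above refers to.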
Experiment & Design
Unit of diversion: user ID, cookie, event
How to collect data?
Duration vs Exposure
Choosing and Characterizing Metrics
Correlation and causation
retrospective analysis
sensitive and robust
Bootstrap
Analyzing result
Sanity check: Invariant
Simpson’s paradox
Multiple comparisons
Computation
Simulation
Data Preprocessing
NumPy
Random
pandas
DataFrame
groupby
Merge: concat, join, append
scipy.stats
Lambda
Stats knowledge + Data Intuition
Probability
Expected Value
Exponential mean
Binomial mean
Probability density function (PDF)
Cumulative distribution function (CDF)
Distribution
Uniform
Binomial
Poisson
Exponential
Normal
Mean, variance, moments and median
Discrete distributions
Resampling
Bootstrapping
Jackknife
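Both resampling methods can be sketched in a few lines of NumPy. The data here is simulated and purely hypothetical, and the percentile interval is only one of several bootstrap interval types.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # hypothetical observed data

# Bootstrap: resample with replacement and recompute the statistic each time.
n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Jackknife: recompute the mean leaving out one observation at a time.
n = sample.size
jack_means = (sample.sum() - sample) / (n - 1)
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))
```

For the sample mean, the jackknife standard error coincides with the textbook formula s / sqrt(n), which makes it a handy sanity check.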
Statistical inference
Point/interval estimation
hypothesis testing
Z test
t test
Chi-square test
F test
Multiple testing
Bonferroni
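The Bonferroni correction compares each p-value against alpha divided by the number of tests. A minimal sketch, with made-up p-values used only for illustration:

```python
p_values = [0.001, 0.012, 0.03, 0.2]  # hypothetical p-values from 4 tests
alpha = 0.05
m = len(p_values)

# Each test is judged against the stricter per-test threshold alpha / m.
threshold = alpha / m
rejected = [p <= threshold for p in p_values]

# Equivalent view: inflate each p-value by m (capped at 1) and compare to alpha.
adjusted = [min(p * m, 1.0) for p in p_values]
```

The correction controls the family-wise Type I error rate but becomes conservative as the number of tests grows.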
Type I and Type II error
Proportion
Power calculation
P value
Maximum likelihood
Margin of error
Sampling
Random sampling
Sampling bias
Bayesian
Key concepts
Sample & Population
Cross table & Scatter plot, histogram
Covariance & Correlation
Mean, median, mode
Skewness, std, variance
Standard Error
Central limit theorem
Machine Learning
Regression
Linear
Assumptions
Multiple
RMSE and MSE
R-squared and adjusted R-squared
Measure of significance
Explanatory power
Logistic
Regularized regression - Lasso/Ridge
Avoid overfitting
Feature selection
ANOVA
forward/backward
Dummy variable
Dimension reduction, PCA
Random forest (Bagging)
Decision Tree
Entropy and Information gain
Kernel density
Cross-validation
Time series data
K-fold
Overfitting
Classification
Confusion Matrix
F1 Score
Accuracy, Precision, Recall
ROC
SVM
Kernel function
Naive Bayes
Maximum likelihood
Feature Selection
Ensemble method
Bagging
Boosting
Gradient boosting
XGBoost
Stacking
Clustering
Time series prediction
SQL
Communication
STAR
Resume
Feature Engineering
Imputation
Missing values
drop columns or rows whose missing ratio, from isnull().mean(), exceeds a threshold
Numerical Imputation
fill with the column medians via fillna
Categorical Imputation
most frequent value (mode), or an "Other" category
What about mean of categorical data?
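The imputation steps listed above can be sketched with pandas as follows. The toy DataFrame, its column names, and the 70% cutoff are assumptions made for illustration, not values from the original map.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["NY", "SF", None, "NY"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.7  # assumed cutoff: drop columns with more than 70% missing

# isnull().mean() gives the fraction of missing values per column.
missing_ratio = df.isnull().mean()
kept = df.loc[:, missing_ratio <= threshold].copy()

# Numerical imputation: fill with the column median.
kept["age"] = kept["age"].fillna(kept["age"].median())

# Categorical imputation: fill with the most frequent value (mode).
kept["city"] = kept["city"].fillna(kept["city"].mode()[0])
```

A mean is not defined for categorical data, which is why the mode (or a catch-all category) is the usual substitute.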
Handling outliers
Visualization
Outlier Detection with Standard Deviation
Outlier Detection with Percentiles
An Outlier Dilemma: Drop or Cap
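Both detection rules, and the drop-or-cap choice, can be sketched on a small series. The data, the factor of 3, and the 5th/95th percentile bounds are assumptions; in practice they depend on the distribution at hand.

```python
import pandas as pd

# Hypothetical data: twenty ordinary values and one extreme value.
s = pd.Series([10, 11, 12, 13, 12, 11, 10, 12, 13, 11] * 2 + [200], dtype=float)

# Standard deviation rule: flag points more than `factor` std from the mean.
factor = 3
lower_sd = s.mean() - factor * s.std()
upper_sd = s.mean() + factor * s.std()
is_outlier_sd = (s < lower_sd) | (s > upper_sd)

# Percentile rule: flag points outside the chosen percentile bounds.
lower_pct, upper_pct = s.quantile(0.05), s.quantile(0.95)
is_outlier_pct = (s < lower_pct) | (s > upper_pct)

# Drop removes the rows; cap keeps them but clips values to the bounds.
dropped = s[~is_outlier_pct]
capped = s.clip(lower=lower_pct, upper=upper_pct)
```

Capping preserves the sample size at the cost of distorting the tail, while dropping does the reverse.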
Log transform
log(x+1)
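A minimal sketch of the log(x+1) transform using NumPy's `log1p`, with made-up non-negative values. Adding 1 keeps the transform defined at zero.

```python
import numpy as np

x = np.array([0.0, 9.0, 99.0, 999.0])  # hypothetical right-skewed values

logged = np.log1p(x)         # log(x + 1), defined at x = 0
restored = np.expm1(logged)  # inverse transform, exp(y) - 1
```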
One-hot encoding
get_dummies
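A short sketch of one-hot encoding with `pd.get_dummies`. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: one categorical column to encode.
df = pd.DataFrame({"user": [1, 2, 3], "city": ["NY", "SF", "NY"]})

# Each distinct city becomes its own indicator column, prefixed with "city_".
encoded = pd.get_dummies(df, columns=["city"])
```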
Grouping Operations
Categorical Column Grouping
Frequency
pivot table
Numerical Column Grouping
Average
SUM
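The grouping operations above map onto pandas `groupby` and `pivot_table`. This is a sketch on hypothetical data; the column names `user`, `city`, and `visits` are assumptions.

```python
import pandas as pd

# Hypothetical event-level data, several rows per user.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "city": ["NY", "NY", "SF", "SF", "SF"],
    "visits": [1, 2, 3, 4, 5],
})

# Categorical grouping by frequency: most common city per user.
top_city = df.groupby("user")["city"].agg(lambda x: x.value_counts().index[0])

# Categorical grouping as a pivot table: total visits per user and city.
pivot = df.pivot_table(index="user", columns="city", values="visits",
                       aggfunc="sum", fill_value=0)

# Numerical grouping: average and sum per user.
numeric = df.groupby("user")["visits"].agg(["mean", "sum"])
```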
Feature Split
Scaling
Normalization
Standardization
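The two scaling approaches differ only in what they subtract and divide by. A minimal NumPy sketch on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical feature values

# Normalization (min-max): rescale to the [0, 1] range.
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean and unit standard deviation.
standardized = (x - x.mean()) / x.std()
```

Min-max scaling is sensitive to outliers because the extremes define the range, so standardization is often preferred when outliers are present.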
Extracting Date
Binning
Product Sense
Metrics