Data Analysis

Variables

Quantitative
(numbers)

Categorical
(groups)

Discrete
(counts)

Continuous
(measurements)

Ordinal
(ordered)

Nominal
(unordered)

Binary
(2 categories)

Reference: Introduction to Variable Types | Codecademy

Analysis Path

1) Descriptive Analysis
(summary statistics)

2) Exploratory Analysis
(correlation)

3a) Inferential Analysis
(hypothesis testing)

4) Causal Analysis
(causation)

3b) Predictive Analysis
(modelling)

Reference: Analyzing Data | Codecademy

Side note:
Be mindful of biases

Data collection:

Selection bias
Historical bias

Building algorithms:

Algorithmic bias
Evaluation bias

Interpretation / Drawing conclusions:

Confirmation bias
Over-generalization
Reporting bias

Exploratory Data Analysis (EDA)

Data inspection techniques:

.head()—first five rows
.describe()—numerical summaries
.info() (column data types and non-null counts)
.isnull()—locate missing (null) values
.unique()—return unique values
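The inspection methods above can be tried together on a small DataFrame. This is a minimal sketch; the column names and values are hypothetical and made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "cat", "bird", "dog"],
    "weight_kg": [4.2, 12.5, np.nan, 3.9, 0.3, 20.1],
})

print(df.head())       # first five rows
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns
df.info()              # column dtypes and non-null counts (prints directly)

# .isnull() returns a True/False table; summing it counts nulls per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# .unique() is called on a single column and returns its distinct values
unique_species = df["species"].unique()
print(unique_species)
```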

Reference: Exploratory Data Analysis (EDA) | Codecademy

Summary statistics

Univariate statistics

Quantitative variables

Central location, e.g.:

Mean
Trimmed mean
Median
Mode

Spread, e.g.:

Range
Inter-quartile range (IQR)
Variance
Standard deviation
Mean absolute deviation (MAD)

Categorical variables

Frequency—.value_counts()
Proportion—.value_counts(normalize=True)
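As a rough illustration, the univariate statistics listed above could be computed with pandas and SciPy as follows. The data is hypothetical, and the 20% trimming proportion is an arbitrary choice for the example.

```python
import pandas as pd
from scipy import stats

# Hypothetical data, made up for illustration only
values = pd.Series([2, 4, 4, 5, 7, 9, 30])
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Central location
mean = values.mean()
trimmed = stats.trim_mean(values, proportiontocut=0.2)  # cut 20% from each end
median = values.median()
mode = values.mode()[0]          # .mode() returns a Series; take the first mode

# Spread
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)
variance = values.var()          # pandas uses the sample variance (ddof=1)
std_dev = values.std()
mad = (values - values.mean()).abs().mean()  # mean absolute deviation

# Categorical summaries
counts = colors.value_counts()
proportions = colors.value_counts(normalize=True)

print(mean, trimmed, median, mode)
print(value_range, iqr, variance, std_dev, mad)
print(counts)
print(proportions)
```

Note how the single large value (30) pulls the mean well above the median, while the trimmed mean stays close to the median.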

Bivariate statistics

Distribution:

Right-skewed—long tail on the right
Left-skewed—long tail on the left

We can use the pandas .skew() method to confirm the skewness. As a rule of thumb, if skewness is less than -1 or greater than 1, the distribution is highly skewed.
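A small sketch of the skewness check, using made-up numbers: one series has a long right tail and one is symmetric.

```python
import pandas as pd

# Hypothetical data: one large value creates a long tail on the right
right_skewed = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 20])
symmetric = pd.Series([1, 2, 3, 4, 5])

skewness = right_skewed.skew()
print(skewness)          # positive, and above 1 for this data
print(symmetric.skew())  # approximately 0 for symmetric data
```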

2 quantitative variables

Use Pearson correlation

1 quantitative variable & 1 categorical variable

Use mean / median difference

2 categorical variables

Use a contingency table & the Chi-square statistic
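The three bivariate cases can be sketched as below. The DataFrame is hypothetical, and using `pearsonr` and `chi2_contingency` from `scipy.stats` together with `pd.crosstab` is one reasonable choice rather than the only one.

```python
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

# Hypothetical data, made up for illustration only
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 60, 68, 72, 75, 80],
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "passed": ["no", "no", "yes", "no", "yes", "yes", "yes", "yes"],
})

# 2 quantitative variables: Pearson correlation
r, r_pvalue = pearsonr(df["hours_studied"], df["score"])

# 1 quantitative & 1 categorical variable: difference in group means
group_means = df.groupby("group")["score"].mean()
mean_diff = group_means["B"] - group_means["A"]

# 2 categorical variables: contingency table and Chi-square statistic
table = pd.crosstab(df["group"], df["passed"])
chi2, chi2_pvalue, dof, expected = chi2_contingency(table)

print(r, mean_diff)
print(table)
print(chi2, dof)
```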

Probability distributions

Reference: Statistics Fundamentals for Data Science

(see the reference to learn how to calculate the expected value, variance and more for the binomial / Poisson distributions)

Random variables

Discrete random variables

Continuous random variables

Use .random.choice() from the numpy library to simulate random variables
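For example, rolling a fair six-sided die ten times could be simulated like this. The die is a hypothetical example, and the seed is fixed only so that the run is repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the simulation is repeatable

die_faces = [1, 2, 3, 4, 5, 6]

# Draw 10 values with replacement; each face is equally likely by default
rolls = np.random.choice(die_faces, size=10, replace=True)
print(rolls)
```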

Probability mass functions: For determining the probability of a specific value

Use .binom.pmf() method from the scipy.stats library

Cumulative distribution functions: For determining the probability of a specific value or less

Use .binom.cdf() method from the scipy.stats library
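A short sketch for a binomial random variable, assuming a hypothetical experiment of 10 fair coin flips where X is the number of heads:

```python
from scipy import stats

n_flips, p_heads = 10, 0.5  # hypothetical experiment: 10 fair coin flips

# PMF: probability of exactly 6 heads, P(X = 6)
p_exactly_6 = stats.binom.pmf(6, n=n_flips, p=p_heads)

# CDF: probability of 6 heads or fewer, P(X <= 6)
p_at_most_6 = stats.binom.cdf(6, n=n_flips, p=p_heads)

# Probability of a range, P(4 <= X <= 6), as a difference of two CDF values
p_4_to_6 = stats.binom.cdf(6, n_flips, p_heads) - stats.binom.cdf(3, n_flips, p_heads)

print(p_exactly_6, p_at_most_6, p_4_to_6)
```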

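Probability density functions: For continuous random variables, where the probability of a range of values is the area under the curve over that range

Use .norm.cdf() method from the scipy.stats library to get the area (probability) at or below a given value of a normal distribution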
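As an illustration, assume heights are normally distributed with a mean of 170 cm and a standard deviation of 10 cm (hypothetical numbers). The probability of a range is then the difference between two cumulative values.

```python
from scipy import stats

mean_cm, sd_cm = 170, 10  # hypothetical normal distribution of heights

# Probability of a height of 180 cm or less
p_below_180 = stats.norm.cdf(180, loc=mean_cm, scale=sd_cm)

# Probability of a height between 160 cm and 180 cm (area between the two points)
p_160_to_180 = (
    stats.norm.cdf(180, loc=mean_cm, scale=sd_cm)
    - stats.norm.cdf(160, loc=mean_cm, scale=sd_cm)
)

print(p_below_180, p_160_to_180)
```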

Poisson distribution: Used to describe the number of times a certain event occurs within a fixed time or space interval

Use the .poisson.pmf() method in the scipy.stats library to evaluate the probability of observing a specific number

Use the .poisson.cdf() method in the scipy.stats library to evaluate the probability of observing a specific number or less
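A minimal Poisson sketch, assuming a hypothetical average rate of 4 events per interval:

```python
from scipy import stats

rate = 4  # hypothetical: on average 4 events per interval

# Probability of observing exactly 2 events
p_exactly_2 = stats.poisson.pmf(2, rate)

# Probability of observing 2 events or fewer
p_at_most_2 = stats.poisson.cdf(2, rate)

# Probability of observing more than 2 events is the complement
p_more_than_2 = 1 - p_at_most_2

print(p_exactly_2, p_at_most_2, p_more_than_2)
```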

Reference: Statistics and Variables | Codecademy (click the link to learn more about the relevant Python methods)

Data Transformation (Advanced EDA)

Data centering: Subtracts the mean from each data point, which tells us how far above or below the mean each data point is

Data scaling: Puts every feature on the same scale so that each feature contributes equally to the relationship

Min-max normalization: Rescales each feature to a fixed range (0 to 1 by default)

Use MinMaxScaler from the sklearn.preprocessing package

Standardization: Rescales each feature to have a mean of 0 and a standard deviation of 1

Use StandardScaler from the sklearn.preprocessing package
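Centering and the two scalers can be compared on a small array. This is a sketch with made-up numbers: the two columns are deliberately on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: rows are observations, columns are two features
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Centering: subtract each column's mean
centered = X - X.mean(axis=0)

# Min-max normalization: each column rescaled to the range 0 to 1
min_max = MinMaxScaler().fit_transform(X)

# Standardization: each column rescaled to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

print(centered)
print(min_max)
print(standardized)
```

After either scaler, both columns end up on the same scale even though the raw values differ by two orders of magnitude.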

Reference: Advanced EDA: Data Transformation | Codecademy

Binning data:

Use the .cut() method from pandas to create bins
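A short binning sketch with pandas. The ages, bin edges and labels are hypothetical choices for the example.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 65, 80])  # hypothetical ages

# Bins are right-inclusive by default: (0, 17], (17, 64], (64, 120]
age_groups = pd.cut(
    ages,
    bins=[0, 17, 64, 120],
    labels=["child", "adult", "senior"],
)

print(age_groups.value_counts())
```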

Log transformation:

Use .log() from NumPy
Use PowerTransformer from sklearn.preprocessing
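Both options can be sketched on a small right-skewed sample (made-up values). Note that `PowerTransformer` applies a power transform (Yeo-Johnson by default) and standardizes the result, so it is a related alternative to a plain log rather than the same operation.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data, made up for illustration only
values = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 50.0])

# Natural log: compresses large values; requires strictly positive input
log_values = np.log(values)

# PowerTransformer expects a 2D array of shape (n_samples, n_features)
transformer = PowerTransformer()  # Yeo-Johnson method by default
power_values = transformer.fit_transform(values.reshape(-1, 1))

print(log_values)
print(power_values.ravel())
```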

Hypothesis testing