Data Analysis

Variables

Quantitative (numbers):

  • Discrete (counts)
  • Continuous (measurements)

Categorical (groups):

  • Ordinal (ordered)
  • Nominal (unordered)
  • Binary (2 categories)

Analysis Path

1) Descriptive Analysis (summary statistics)

2) Exploratory Analysis (correlation)

3a) Inferential Analysis (hypothesis testing)

3b) Predictive Analysis (modelling)

4) Causal Analysis (causation)

Side note:
Be mindful of biases

Data collection:


  • Selection bias
  • Historical bias

Building algorithms:


  • Algorithmic bias
  • Evaluation bias

Interpretation / Drawing conclusions:


  • Confirmation bias
  • Over-generalization
  • Reporting bias

Exploratory Data Analysis (EDA)

Data inspection techniques:


  • .head(): first five rows
  • .describe(): numerical summaries
  • .info(): column data types and non-null counts
  • .isnull(): locate missing (null) values
  • .unique(): return unique values
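The inspection methods above can be tried on a small DataFrame. This is a minimal sketch: the dataset and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "bird", "cat", "dog"],
    "weight_kg": [4.2, 12.5, np.nan, 0.3, 3.9, 15.1],
    "age_years": [3, 5, 2, 1, 7, 4],
})

first_rows = df.head()                    # first five rows by default
summary = df.describe()                   # count, mean, std, min, quartiles, max (numeric columns)
df.info()                                 # prints column dtypes and non-null counts
missing_per_column = df.isnull().sum()    # number of missing values in each column
species_values = df["species"].unique()   # distinct values, in order of first appearance
```

Chaining .sum() onto .isnull() is a common way to turn the True/False mask into a per-column count of missing values.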

Summary statistics

Univariate statistics

Quantitative variables

Central location, e.g.:

  • Mean
  • Trimmed mean
  • Median
  • Mode

Spread, e.g.:

  • Range
  • Inter-quartile range (IQR)
  • Variance
  • Standard deviation
  • Mean absolute deviation (MAD)

Categorical variables

  • Frequency: .value_counts()
  • Proportion: .value_counts(normalize=True)
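As a rough sketch, the univariate statistics listed above could be computed with pandas and SciPy as follows. The two Series are hypothetical, and trim_mean from scipy.stats is one possible choice for the trimmed mean.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: one quantitative column (with an outlier) and one categorical column.
values = pd.Series([2, 3, 3, 4, 5, 6, 7, 8, 9, 53])
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Central location
mean = values.mean()
trimmed_mean = stats.trim_mean(values.to_numpy(), proportiontocut=0.1)  # drops lowest and highest 10%
median = values.median()
mode = values.mode()          # returns a Series because ties are possible

# Spread
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)
variance = values.var()       # sample variance (ddof=1)
std_dev = values.std()        # sample standard deviation (ddof=1)
mad = (values - values.mean()).abs().mean()   # mean absolute deviation

# Categorical summaries
frequency = colors.value_counts()
proportion = colors.value_counts(normalize=True)
```

The outlier (53) pulls the mean well above the median and the trimmed mean, which is the reason to report more than one measure of central location.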

Distribution (quantitative variables):

  • Right-skewed: long tail on the right
  • Left-skewed: long tail on the left

We can use .skew() from pandas to confirm the skewness. As a rule of thumb, a skewness below -1 or above 1 indicates a highly skewed distribution.

Bivariate statistics

  • 2 quantitative variables: use Pearson correlation
  • 1 quantitative variable & 1 categorical variable: use the difference in means or medians between groups
  • 2 categorical variables: use a contingency table & the Chi-square statistic
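The skewness check and the three bivariate pairings can be sketched as below. The data is simulated and every column name is hypothetical. pearsonr and chi2_contingency from scipy.stats are one way to obtain the correlation and the Chi-square statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

# Simulated, hypothetical data (fixed seed for repeatability).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hours": rng.uniform(0, 10, n),
    "group": rng.choice(["a", "b"], n),
    "passed": rng.choice(["yes", "no"], n),
})
df["score"] = 50 + 4 * df["hours"] + rng.normal(0, 5, n)
df["income"] = rng.exponential(scale=30_000, size=n)   # right-skewed by construction

# Skewness of a single quantitative variable
income_skew = df["income"].skew()

# 2 quantitative variables: Pearson correlation
r, r_p_value = pearsonr(df["hours"], df["score"])

# 1 quantitative & 1 categorical variable: difference in group means / medians
group_means = df.groupby("group")["score"].mean()
group_medians = df.groupby("group")["score"].median()
mean_diff = group_means["a"] - group_means["b"]
median_diff = group_medians["a"] - group_medians["b"]

# 2 categorical variables: contingency table and Chi-square statistic
table = pd.crosstab(df["group"], df["passed"])
chi2, chi2_p_value, dof, expected = chi2_contingency(table)
```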

Probability distributions

Reference: Statistics Fundamentals for Data Science


(see the reference to learn how to calculate the expected value, variance and more for the binomial and Poisson distributions)

Random variables

Discrete random variables

Continuous random variables

Use np.random.choice() from the NumPy library to simulate a random variable by sampling from a set of possible values
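For instance, a fair six-sided die (a discrete random variable) could be simulated like this. The die and the number of rolls are illustrative choices only.

```python
import numpy as np

np.random.seed(42)   # fix the seed so the simulation is repeatable

die_faces = [1, 2, 3, 4, 5, 6]

# Simulate 10,000 rolls by sampling with replacement; each face is equally likely by default.
rolls = np.random.choice(die_faces, size=10_000, replace=True)

# The sample mean should land close to the theoretical expected value of 3.5.
average_roll = rolls.mean()
```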

Probability mass functions: For determining the probability that a discrete random variable takes a specific value

Use the binom.pmf() method from the scipy.stats library (binomial distribution)
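A minimal sketch, assuming 10 flips of a fair coin (a made-up scenario): binom.pmf(k, n, p) gives the probability of exactly k successes in n trials.

```python
from scipy.stats import binom

# Probability of exactly 6 heads in 10 fair coin flips.
p_six_heads = binom.pmf(6, n=10, p=0.5)

# Probability of 4, 5 or 6 heads: add up the PMF values for each outcome in the range.
p_four_to_six = sum(binom.pmf(k, n=10, p=0.5) for k in (4, 5, 6))
```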

Cumulative distribution functions: For determining the probability of a specific value or less

Use the binom.cdf() method from the scipy.stats library (binomial distribution)
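Continuing with a hypothetical 10 fair coin flips, the CDF can answer "at most", "more than" and "between" questions:

```python
from scipy.stats import binom

# P(6 or fewer heads in 10 fair coin flips)
p_at_most_six = binom.cdf(6, n=10, p=0.5)

# P(more than 6 heads) is the complement of the CDF.
p_more_than_six = 1 - binom.cdf(6, n=10, p=0.5)

# P(4 to 6 heads) = P(X <= 6) - P(X <= 3)
p_four_to_six = binom.cdf(6, n=10, p=0.5) - binom.cdf(3, n=10, p=0.5)
```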

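Probability density functions: Describe continuous random variables. The probability of any single exact value is 0, so probabilities are computed for a range of values, as the area under the curve over that range

Use the norm.cdf() method from the scipy.stats library to get the probability of a value or less under a normal distribution; subtract two CDF values to get the probability of a range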
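As an illustration, assume heights are normally distributed with a mean of 167 cm and a standard deviation of 8 cm (hypothetical numbers). In norm.cdf(), loc is the mean and scale is the standard deviation.

```python
from scipy.stats import norm

mean_height = 167   # hypothetical mean, in cm
sd_height = 8       # hypothetical standard deviation, in cm

# P(height <= 175): area under the normal curve to the left of 175.
p_below_175 = norm.cdf(175, loc=mean_height, scale=sd_height)

# P(159 <= height <= 175): difference between two CDF values.
p_between = (norm.cdf(175, loc=mean_height, scale=sd_height)
             - norm.cdf(159, loc=mean_height, scale=sd_height))
```

Both bounds sit exactly one standard deviation from the mean, so p_between comes out at roughly 0.68.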

Poisson distribution: Used to describe the number of times a certain event occurs within a fixed time or space interval

Use the .poisson.pmf() method in the scipy.stats library to evaluate the probability of observing a specific number

Use the .poisson.cdf() method in the scipy.stats library to evaluate the probability of observing a specific number or less
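A short sketch, assuming a call centre that receives 10 calls per hour on average (an invented rate): the second argument of poisson.pmf() and poisson.cdf() is the expected number of events in the interval.

```python
from scipy.stats import poisson

expected_calls = 10   # hypothetical average number of calls per hour

# P(exactly 6 calls in an hour)
p_exactly_six = poisson.pmf(6, expected_calls)

# P(6 or fewer calls in an hour)
p_at_most_six = poisson.cdf(6, expected_calls)

# P(more than 6 calls in an hour) is the complement of the CDF.
p_more_than_six = 1 - poisson.cdf(6, expected_calls)
```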

Reference: Statistics and Variables | Codecademy (see the reference to learn more about the relevant Python methods)

Data Transformation (Advanced EDA)

Data centering: Subtracts the mean from each data point, which tells us how far above or below the mean each data point is
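Centering can be done directly in pandas by subtracting the column mean. The ages below are made-up values.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44])   # hypothetical values, mean = 32

# Negative results are below the mean, positive results are above it.
centered_ages = ages - ages.mean()
```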

Data scaling: Puts every feature on the same scale so that each feature contributes equally to the relationship

Min-max normalization: Rescales each feature to the range [0, 1] using (x - min) / (max - min)

Use MinMaxScaler from the sklearn.preprocessing package
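A minimal sketch with a made-up two-feature array. Scikit-learn scalers expect 2-D input (rows are samples, columns are features).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 1000.0],
])

scaler = MinMaxScaler()                 # default feature_range is (0, 1)
X_minmax = scaler.fit_transform(X)      # each column now runs from 0 to 1
```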

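Standardization: Rescales each feature to a mean of 0 and a standard deviation of 1 using (x - mean) / standard deviation

Use StandardScaler from the sklearn.preprocessing package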
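A comparable sketch for standardization, again with a made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 1000.0],
])

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)   # each column: mean 0, standard deviation 1
```

StandardScaler divides by the population standard deviation (ddof=0), so the result can differ slightly from a manual calculation that uses the pandas default (ddof=1).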

Binning data:


Use the .cut() method from pandas to create bins
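As an example, ages could be binned into labelled groups. The bin edges and labels here are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 65, 80])   # hypothetical values

bin_edges = [0, 17, 64, 120]                # intervals (0, 17], (17, 64], (64, 120]
bin_labels = ["child", "adult", "senior"]   # one label per interval

# By default each bin includes its right edge, so 17 falls in "child".
age_group = pd.cut(ages, bins=bin_edges, labels=bin_labels)
```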


Log transformation:

  • Use .log() from NumPy
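  • Use PowerTransformer from sklearn.preprocessing (Box-Cox or Yeo-Johnson power transforms, which include the log transform as a special case of Box-Cox)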
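Both options can be sketched on simulated right-skewed data. The data is hypothetical, and method="box-cox" is chosen here because the values are strictly positive (the PowerTransformer default is "yeo-johnson").

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Simulated right-skewed, strictly positive data (fixed seed for repeatability).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Option 1: natural log with NumPy (values must be greater than 0).
log_values = np.log(skewed)

# Option 2: PowerTransformer estimates the power parameter from the data.
# It expects 2-D input and, by default, standardizes the output (mean 0, standard deviation 1).
transformer = PowerTransformer(method="box-cox")
power_values = transformer.fit_transform(skewed.reshape(-1, 1))
```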

Hypothesis testing