Data Analysis
Variables
Quantitative (numbers)
- Discrete (counts)
- Continuous (measurements)
Categorical (groups)
- Ordinal (ordered)
- Nominal (unordered)
- Binary (2 categories)
Reference: Introduction to Variable Types | Codecademy
Analysis Path
1) Descriptive Analysis (summary statistics)
2) Exploratory Analysis (correlation)
3a) Inferential Analysis (hypothesis testing)
3b) Predictive Analysis (modelling)
4) Causal Analysis (causation)
Reference: Analyzing Data | Codecademy
Side note:
Be mindful of biases
Data collection:
- Selection bias
- Historical bias
Building algorithms:
- Algorithmic bias
- Evaluation bias
Interpretation / Drawing conclusions:
- Confirmation bias
- Over-generalization
- Reporting bias
Exploratory Data Analysis (EDA)
Data inspection techniques:
- .head(): first five rows
- .describe(): numerical summaries
- .info(): column data types and non-null counts
- .isnull(): locate missing (null) values
- .unique(): return the unique values of a column
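A minimal sketch of the inspection methods above, assuming pandas. The DataFrame and its column names (species, weight_kg) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for illustration only
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "bird", "cat", "dog"],
    "weight_kg": [4.2, 12.5, np.nan, 0.3, 3.9, 15.1],
})

print(df.head())      # first five rows
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
df.info()             # column dtypes and non-null counts (prints directly)

missing_per_column = df.isnull().sum()   # number of missing values per column
unique_species = df["species"].unique()  # distinct values, in order of appearance
print(missing_per_column)
print(unique_species)
```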
Summary statistics
Univariate statistics
Quantitative variables
Central location, e.g.:
- Mean
- Trimmed mean
- Median
- Mode
Spread, e.g.:
- Range
- Inter-quartile range (IQR)
- Variance
- Standard deviation
- Mean absolute deviation (MAD)
Categorical variables
- Frequency: .value_counts()
- Proportion: .value_counts(normalize=True)
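One possible way to compute the univariate statistics listed above, assuming pandas and scipy. Both Series are made-up data. The MAD is computed by hand here because recent pandas versions no longer ship a built-in method for it.

```python
import pandas as pd
from scipy import stats

# Made-up quantitative data with one large outlier
values = pd.Series([2, 3, 3, 4, 5, 6, 7, 8, 9, 50])

# Central location
mean = values.mean()
trimmed_mean = stats.trim_mean(values.to_numpy(), proportiontocut=0.1)  # drop lowest and highest 10%
median = values.median()
mode = values.mode()[0]

# Spread
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)
variance = values.var()             # sample variance (ddof=1) is the pandas default
std_dev = values.std()
mad = (values - mean).abs().mean()  # mean absolute deviation around the mean

# Made-up categorical data
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])
frequency = colors.value_counts()
proportion = colors.value_counts(normalize=True)

print(mean, trimmed_mean, median, mode)
print(value_range, iqr, variance, std_dev, mad)
print(frequency)
print(proportion)
```

Note how the outlier pulls the mean well above the median and the trimmed mean.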
Bivariate statistics
Distribution:
- Right-skewed—long tail on the right
- Left-skewed—long tail on the left
Use .skew() from pandas to confirm the skewness. If the skewness is less than -1 or greater than 1, the distribution is highly skewed.
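A short sketch of the skewness check, assuming pandas. The two Series are made up so that one has a long right tail and the other is its mirror image.

```python
import pandas as pd

right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 30])  # made-up data, long tail on the right
left_skewed = -right_skewed                           # mirror image, long tail on the left

right_skew = right_skewed.skew()  # positive for a right-skewed distribution
left_skew = left_skewed.skew()    # negative for a left-skewed distribution

print(right_skew, left_skew)
```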
2 quantitative variables: use Pearson correlation
1 quantitative variable & 1 categorical variable: use the difference in means / medians between groups
2 categorical variables: use a contingency table & the Chi-square statistic
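A hedged sketch of the three bivariate cases, assuming pandas and scipy.stats. The dataset and its column names are hypothetical, and it is far too small for the Chi-square result to be meaningful; it only shows the calls.

```python
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

# Hypothetical toy dataset, invented for illustration only
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 60, 68, 72, 75, 80],
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "passed": ["no", "no", "yes", "no", "yes", "yes", "yes", "yes"],
})

# 2 quantitative variables: Pearson correlation
corr, corr_p = pearsonr(df["hours_studied"], df["score"])

# 1 quantitative & 1 categorical variable: difference in group means / medians
group_means = df.groupby("group")["score"].mean()
group_medians = df.groupby("group")["score"].median()
mean_diff = group_means["B"] - group_means["A"]
median_diff = group_medians["B"] - group_medians["A"]

# 2 categorical variables: contingency table and Chi-square statistic
table = pd.crosstab(df["group"], df["passed"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(corr, mean_diff, median_diff)
print(table)
print(chi2, p_value, dof)
```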
Probability distributions
Reference: Statistics Fundamentals for Data Science
(see the reference to learn how to calculate the expected value, variance and more for the binomial / Poisson distributions)
Random variables
Discrete random variables
Continuous random variables
Use np.random.choice() from the NumPy library to simulate a discrete random variable
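A minimal simulation sketch, assuming NumPy. The fair six-sided die is an illustrative choice, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

die_faces = np.arange(1, 7)  # possible outcomes of a fair six-sided die

single_roll = np.random.choice(die_faces)                       # one draw
rolls = np.random.choice(die_faces, size=10_000, replace=True)  # many independent draws

sample_mean = rolls.mean()  # should land close to the expected value of 3.5
print(single_roll, sample_mean)
```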
Probability mass functions: For determining the probability of a specific value
Use .binom.pmf() method from the scipy.stats library
Cumulative distribution functions: For determining the probability of a specific value or less
Use .binom.cdf() method from the scipy.stats library
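A small sketch of the PMF and CDF calls for a binomial variable, assuming scipy.stats. The scenario (10 flips of a fair coin) is made up.

```python
from scipy.stats import binom

n_flips, p_heads = 10, 0.5  # hypothetical experiment: 10 flips of a fair coin

p_exactly_6 = binom.pmf(6, n_flips, p_heads)  # P(X = 6)
p_at_most_6 = binom.cdf(6, n_flips, p_heads)  # P(X <= 6)

# P(4 <= X <= 6) as a difference of two CDF values
p_4_to_6 = binom.cdf(6, n_flips, p_heads) - binom.cdf(3, n_flips, p_heads)

print(p_exactly_6, p_at_most_6, p_4_to_6)
```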
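Probability density functions: For continuous random variables; the probability of a range of values is the area under the curve over that range
Use .norm.cdf() method from the scipy.stats library; subtract two CDF values to get the probability of a range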
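A sketch of using the normal CDF to get probabilities for a continuous variable, assuming scipy.stats. The mean of 170 and standard deviation of 10 are hypothetical.

```python
from scipy.stats import norm

mean_height, sd_height = 170, 10  # hypothetical normal distribution (e.g. heights in cm)

# P(X <= 180): area under the density curve to the left of 180
p_below_180 = norm.cdf(180, loc=mean_height, scale=sd_height)

# P(160 <= X <= 180): difference of two CDF values
p_160_to_180 = (norm.cdf(180, loc=mean_height, scale=sd_height)
                - norm.cdf(160, loc=mean_height, scale=sd_height))

print(p_below_180, p_160_to_180)
```

Both bounds sit one standard deviation from the mean, so the second value is the familiar "about 68%" figure.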
Poisson distribution: Used to describe the number of times a certain event occurs within a fixed time or space interval
Use the .poisson.pmf() method in the scipy.stats library to evaluate the probability of observing a specific number
Use the .poisson.cdf() method in the scipy.stats library to evaluate the probability of observing a specific number or less
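A brief sketch of the two Poisson calls, assuming scipy.stats. The rate of 4 events per interval is made up.

```python
from scipy.stats import poisson

expected_calls = 4  # hypothetical average number of events per interval (lambda)

p_exactly_2 = poisson.pmf(2, expected_calls)  # P(X = 2)
p_at_most_2 = poisson.cdf(2, expected_calls)  # P(X <= 2)
p_more_than_2 = 1 - p_at_most_2               # P(X > 2), complement of the CDF

print(p_exactly_2, p_at_most_2, p_more_than_2)
```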
Reference: Statistics and Variables | Codecademy (click the link to learn more about the relevant Python methods)
Data Transformation (Advanced EDA)
Data centering: Subtracts the mean from each data point, which tells us how far above or below the mean each data point is
Data scaling: Puts every feature on the same scale so that each feature contributes equally to the relationship
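Min-max normalization: Rescales each feature to the range [0, 1]
Use MinMaxScaler from the sklearn.preprocessing package
Standardization: Rescales each feature to a mean of 0 and a standard deviation of 1
Use StandardScaler from the sklearn.preprocessing package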
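A minimal sketch of centering and the two scalers, assuming NumPy and scikit-learn. The array is made up, with two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: 3 rows, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

centered = X - X.mean(axis=0)                     # centering: subtract each column's mean
min_max = MinMaxScaler().fit_transform(X)         # each column rescaled to [0, 1]
standardized = StandardScaler().fit_transform(X)  # each column: mean 0, standard deviation 1

print(centered)
print(min_max)
print(standardized)
```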
Binning data: Groups the values of a continuous variable into intervals (bins)
Use the .cut() method from pandas to create bins
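A short binning sketch, assuming pandas. The ages, bin edges and labels are all hypothetical.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 65, 80])  # made-up ages

# Bin edges; by default each bin includes its right edge: (0, 17], (17, 64], (64, 120]
bins = [0, 17, 64, 120]
labels = ["child", "adult", "senior"]  # hypothetical labels, one per bin

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.value_counts())
```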
Log transformation:
- Use .log() from NumPy
- Use PowerTransformer from sklearn.preprocessing
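A hedged sketch of both options, assuming NumPy and scikit-learn. The prices are made up. Note that PowerTransformer is not a plain log: it fits a power transform (Yeo-Johnson by default) and standardizes the result, so it serves a similar purpose without being the same calculation.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

prices = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # made-up, strongly right-skewed

# Natural log; the input must be strictly positive
log_prices = np.log(prices)

# PowerTransformer expects a 2-D array (rows = samples, columns = features)
pt = PowerTransformer()  # default settings: Yeo-Johnson method, standardized output
transformed = pt.fit_transform(prices.reshape(-1, 1))

print(log_prices)
print(transformed.ravel())
```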
Hypothesis testing