Data Analysis
Exploratory Data Analysis (EDA)
Data inspection techniques:
.head(): first five rows
.describe(): numerical summaries
.info(): column data types and non-null counts
.isnull(): locate missing (null) values
.unique(): return unique values
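As a minimal sketch of these inspection calls, the snippet below builds a small pandas DataFrame and runs each one. The data and the column names (species, weight_kg) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", None, "bird", "cat"],
    "weight_kg": [4.2, 12.5, 9.8, 3.1, None, 5.0],
})

first_rows = df.head()                    # first five rows by default
summary = df.describe()                   # count, mean, std, min, quartiles, max (numeric columns)
df.info()                                 # prints column dtypes and non-null counts
missing_per_column = df.isnull().sum()    # number of missing values in each column
species_values = df["species"].unique()   # distinct values; the missing value is its own entry

print(first_rows)
print(summary)
print(missing_per_column)
print(species_values)
```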
Summary statistics
Univariate statistics
Quantitative variables
Central location, i.e.:
Mean
Trimmed mean
Median
Mode
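The four measures of central location can be compared on one sample. This is a sketch on made-up numbers with a single outlier; the 10% trimming proportion is an arbitrary choice for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up sample with one large outlier (40).
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 40])

mean = values.mean()            # pulled upward by the outlier
median = values.median()        # middle value, robust to the outlier
mode = values.mode()[0]         # most frequent value
# Trimmed mean: drop the lowest and highest 10% of values, then average the rest.
trimmed_mean = stats.trim_mean(values.to_numpy(), proportiontocut=0.1)

print(mean, median, mode, trimmed_mean)
```

The mean is noticeably larger than the median and trimmed mean here, which is the usual sign that outliers or skew are affecting it.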
Distribution:
Right-skewed—long tail on the right
Left-skewed—long tail on the left
We can use the pandas .skew() method in Python to confirm the skewness. If skewness is less than -1 or greater than 1, the distribution is highly skewed.
💡
Log transformation:
Use .log() from NumPy
Use PowerTransformer from sklearn.preprocessing
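A short sketch of checking skewness and applying a log transformation. The data are simulated log-normal values (an assumption chosen because they are right-skewed); only the NumPy route is shown, not PowerTransformer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed data (hypothetical "income" values).
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

skew_before = income.skew()     # well above 1 for this sample: highly right-skewed
log_income = np.log(income)     # log transform; requires strictly positive values
skew_after = log_income.skew()  # close to 0 after the transformation

print(skew_before, skew_after)
```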
Spread, i.e.:
Range
Inter-quartile range (IQR)
Variance
Standard deviation
Mean absolute deviation (MAD)
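The spread measures listed above can be computed with pandas as sketched below, on made-up numbers. The mean absolute deviation is computed by hand here because recent pandas versions no longer provide Series.mad().

```python
import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9, 11])  # made-up sample

value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)   # inter-quartile range
variance = values.var()                                # sample variance (ddof=1)
std_dev = values.std()                                 # sample standard deviation
mad = (values - values.mean()).abs().mean()            # mean absolute deviation

print(value_range, iqr, variance, std_dev, mad)
```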
Categorical variables
Frequency: .value_counts()
Proportion: .value_counts(normalize=True)
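For example, on a made-up categorical column, frequencies and proportions might be obtained like this:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])  # made-up data

counts = colors.value_counts()                     # frequency of each category
proportions = colors.value_counts(normalize=True)  # share of each category, sums to 1

print(counts)
print(proportions)
```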
Bivariate statistics
2 quantitative variables
Use Pearson correlation
1 quantitative variable & 1 categorical variable
Use mean / median difference
2 categorical variables
Use a contingency table & the Chi-square statistic
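The three pairings above can be sketched on one small table. The data and column names (hours, score, group, passed) are invented for illustration; with so few rows the Chi-square result is only a demonstration of the call, not a meaningful test.

```python
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical data set.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 72, 79, 85],
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "passed": ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"],
})

# Two quantitative variables: Pearson correlation.
r, r_pvalue = pearsonr(df["hours"], df["score"])

# One quantitative and one categorical variable: difference in group means.
group_means = df.groupby("group")["score"].mean()
mean_difference = group_means["b"] - group_means["a"]

# Two categorical variables: contingency table and Chi-square statistic.
table = pd.crosstab(df["group"], df["passed"])
chi2, chi2_pvalue, dof, expected = chi2_contingency(table)

print(r, mean_difference, chi2)
print(table)
```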
Probability distributions
Random variables
Discrete random variables
Use .random.choice() from the numpy library to simulate random variables
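A minimal simulation sketch, assuming a fair six-sided die as the discrete random variable (the die and the seed are illustrative choices):

```python
import numpy as np

np.random.seed(42)  # fixed seed so the simulation is repeatable

die_faces = [1, 2, 3, 4, 5, 6]
# Simulate 10,000 rolls; each face is equally likely by default.
rolls = np.random.choice(die_faces, size=10000, replace=True)

estimated_mean = rolls.mean()  # should be close to the expected value of 3.5
print(estimated_mean)
```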
Probability mass functions: For determining the probability of a specific value
Use the .binom.pmf() method from the scipy.stats library
Cumulative distribution functions: For determining the probability of a specific value or less
Use the .binom.cdf() method from the scipy.stats library
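As a sketch of both functions, assume 10 flips of a fair coin (an illustrative setup, not taken from the map):

```python
from scipy.stats import binom

n_flips, p_heads = 10, 0.5  # assumed example: 10 fair coin flips

# PMF: probability of exactly 6 heads (210/1024).
prob_exactly_6 = binom.pmf(6, n_flips, p_heads)

# CDF: probability of 6 heads or fewer (848/1024).
prob_at_most_6 = binom.cdf(6, n_flips, p_heads)

# Probability of a range (4 to 6 heads) as a difference of two CDF values.
prob_4_to_6 = binom.cdf(6, n_flips, p_heads) - binom.cdf(3, n_flips, p_heads)

print(prob_exactly_6, prob_at_most_6, prob_4_to_6)
```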
Continuous random variables
Probability density functions: The probability of a range of values is the area under the curve
Use the .norm.cdf() method from the scipy.stats library to compute that area
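A sketch for a continuous variable, assuming heights that follow a normal distribution with mean 170 and standard deviation 10 (hypothetical values). For a continuous variable the probability of any single exact value is zero, so probabilities are computed for ranges.

```python
from scipy.stats import norm

mean_height, sd_height = 170, 10  # hypothetical parameters

# Probability of a value at or below 180 (one standard deviation above the mean).
prob_below_180 = norm.cdf(180, loc=mean_height, scale=sd_height)

# Probability of a value between 160 and 180: difference of two CDF values.
prob_160_to_180 = (norm.cdf(180, loc=mean_height, scale=sd_height)
                   - norm.cdf(160, loc=mean_height, scale=sd_height))

# norm.pdf gives the height of the density curve at a point, not a probability.
density_at_mean = norm.pdf(170, loc=mean_height, scale=sd_height)

print(prob_below_180, prob_160_to_180, density_at_mean)
```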
Poisson distribution: Used to describe the number of times a certain event occurs within a fixed time or space interval
Use the .poisson.pmf() method in the scipy.stats library to evaluate the probability of observing a specific number
Use the .poisson.cdf() method in the scipy.stats library to evaluate the probability of observing a specific number or less
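For instance, assuming an average of 4 events per interval (an invented rate), the two Poisson calls could be used as follows:

```python
from scipy.stats import poisson

rate = 4  # assumed average number of events per interval

prob_exactly_2 = poisson.pmf(2, rate)   # probability of observing exactly 2 events
prob_at_most_2 = poisson.cdf(2, rate)   # probability of observing 2 events or fewer
prob_more_than_2 = 1 - prob_at_most_2   # complement: more than 2 events

print(prob_exactly_2, prob_at_most_2, prob_more_than_2)
```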
Data Transformation (Advanced EDA)
Data centering: Tells us how far above or below the mean each data point is
Data scaling: Puts every feature on the same scale so that each feature contributes equally to the analysis
Min-max normalization:
Use MinMaxScaler from the sklearn.preprocessing package
Standardization:
Use StandardScaler from the sklearn.preprocessing package
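A sketch of centering and both scalers on a tiny made-up feature matrix whose two columns sit on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Centering: subtract each column's mean, so values show distance from the mean.
centered = X - X.mean(axis=0)

# Min-max normalization: rescale each column to the range [0, 1].
min_max = MinMaxScaler().fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

print(centered)
print(min_max)
print(standardized)
```

After either scaler, the two columns become identical in this example, since they differ only by a constant factor.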
Binning data:
Use the .cut() method from pandas to create bins
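A binning sketch with invented ages, bin edges, and labels. By default pd.cut treats each bin as closed on the right, so 17 falls in the first bin.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 65, 80])   # made-up values
bin_edges = [0, 17, 64, 120]                # assumed cut points
labels = ["child", "adult", "senior"]       # assumed labels, one per bin

# Bins are (0, 17], (17, 64], (64, 120] because right=True is the default.
age_group = pd.cut(ages, bins=bin_edges, labels=labels)

print(age_group.value_counts())
```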
Variables
Quantitative
(numbers)
Discrete
(counts)
Continuous
(measurements)
Categorical
(groups)
Ordinal
(ordered)
Nominal
(unordered)
Binary
(2 categories)
Analysis Path
1) Descriptive Analysis
(summary statistics)
2) Exploratory Analysis
(correlation)
3a) Inferential Analysis
(hypothesis testing)
4) Causal Analysis
(causation)
3b) Predictive Analysis
(modelling)
Side note:
Be mindful of biases
Data collection:
Selection bias
Historical bias
Building algorithms:
Algorithmic bias
Evaluation bias
Interpretation / Drawing conclusions:
Confirmation bias
Over-generalization
Reporting bias
Hypothesis testing
Hypothesis testing | Codecademy
One-sample t-test in SciPy | Codecademy
Significance thresholds | Codecademy
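A one-sample t-test with a significance threshold might look like the sketch below. The sample values, the hypothesized mean of 1000, and the 0.05 threshold are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up sample of 10 observations.
prices = np.array([1010, 980, 1025, 1040, 995, 1030, 1015, 1050, 1005, 1020])

# Null hypothesis: the population mean is 1000 (assumed value).
tstat, pval = ttest_1samp(prices, popmean=1000)

alpha = 0.05                 # significance threshold, chosen before running the test
reject_null = pval < alpha   # True when the result is significant at this threshold

print(tstat, pval, reject_null)
```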
Reference:
Introduction to Variable Types | Codecademy
Reference:
Analyzing Data | Codecademy
Reference:
Exploratory Data Analysis (EDA) | Codecademy
References:
EDA: Summary Statistics
Statistical Thinking
Reference:
Statistics Fundamentals for Data Science
(see the reference to learn how to calculate the expected value, variance and more for binomial and Poisson distributions)
Reference:
Statistics and Variables | Codecademy
(see the reference to learn more about the relevant Python methods)
Reference:
Advanced EDA: Data Transformation | Codecademy