Data Exploration (ref. lectures of Marco Brambilla 2018)
Find out whether a relationship between variables is significant at a pre-defined significance level.
In correlation, the aim is to draw a line through the data such that the deviations of the points from the line are minimized.
Because deviations can be negative or positive, each is first squared, then the squared deviations are added together, and the square root is taken.
Linear correlation +1: perfect positive relationship
Linear correlation −1: perfect negative relationship
Linear correlation 0: no linear relationship
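These two ideas, the correlation coefficient and the least-squares line, can be sketched as follows (a minimal example assuming Python with NumPy, neither of which the lectures prescribe; the data values are hypothetical):

```python
import numpy as np

# Hypothetical data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Pearson linear correlation: +1 perfect positive, 0 no linear relationship
r = np.corrcoef(x, y)[0, 1]

# Least-squares line: minimizes the sum of squared deviations from the line
slope, intercept = np.polyfit(x, y, 1)

print(f"r = {r:.3f}, line: y = {slope:.2f}x + {intercept:.2f}")
```

Because the example data are nearly on a line, r comes out close to +1.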
Plot the same variable against different «facets» (categories or groups),
also with trend lines,
and confidence bands (with smoothing).
Bi-variate analysis can be performed for any combination of variable types:
Categorical & Continuous
Draw plots (e.g. box plots) of the continuous variable for each level of the categorical variable.
If the levels are small in number, this will not show the statistical significance.
To look at the statistical significance we can perform a Z-test, t-test, or ANOVA.
Continuous & Continuous
Possible patterns: no correlation, strong (positive/negative), moderate (positive/negative), or a curvilinear relationship.
SAMPLES AND POPULATIONS
How to compare two samples?
In essence, the t-test gives a measure of the difference between the sample means
in relation to the spread (variability) of the samples.
Use the t-test only for comparing 2 samples.
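A minimal sketch of a two-sample t-test (assuming Python with SciPy, which the lectures do not prescribe; the sample values are hypothetical):

```python
from scipy import stats

# Two hypothetical samples, e.g. the same measurement in two groups
sample_a = [10.1, 9.8, 10.5, 10.2, 9.9, 10.4, 10.0, 9.7]
sample_b = [12.0, 11.8, 12.3, 11.9, 12.1, 12.4, 11.7, 12.2]

# t = difference between the sample means relative to the samples' spread
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value leads to rejecting the null hypothesis of equal means.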
COMPARISONS BETWEEN THREE OR MORE SAMPLES
• F-test and ANOVA
ANOVA and Bartlett’s Test
Essentially, ANOVA involves dividing the variance in the results into:
• between-groups variance
• within-groups variance
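A minimal sketch of a one-way ANOVA together with Bartlett's test (assuming Python with SciPy; the group values are hypothetical):

```python
from scipy import stats

# Three hypothetical groups, e.g. a measurement under three conditions
g1 = [5.1, 4.9, 5.3, 5.0, 5.2]
g2 = [5.6, 5.8, 5.5, 5.9, 5.7]
g3 = [6.3, 6.1, 6.4, 6.2, 6.5]

# One-way ANOVA: F = between-groups variance / within-groups variance
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# Bartlett's test checks ANOVA's equal-variance assumption
b_stat, p_bartlett = stats.bartlett(g1, g2, g3)

print(f"ANOVA: F = {f_stat:.1f}, p = {p_anova:.2e}")
print(f"Bartlett: p = {p_bartlett:.3f}")
```

Here the group means differ while the within-group variances are similar, so ANOVA rejects equal means and Bartlett's test does not reject equal variances.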
When to use the t-distribution?
• when the sample size is small, and/or
• when the population variance is unknown
Degrees of freedom = n − 1, with n = size of the sample
Null hypothesis (to be tested): typically, that there is no difference between the means.
COMPARING TWO SAMPLES
Type I error
The α level represents the probability of finding a significant difference between the two means when none exists
Type II error
The β level represents the probability of not finding a significant difference between the two means when one exists
Categorical & Categorical
Two-way table: count and count%.
Stacked column chart: visual form of the two-way table.
Chi-square test: used to derive the statistical significance of the relationship between the variables.
A formal statistical test to determine
whether results are statistically significant
Also, it tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
It is based on the difference between the expected and observed frequencies in one or more categories of the two-way table.
It returns the probability for the computed chi-square statistic given the degrees of freedom.
The statistic is computed as χ² = Σ (Observed − Expected)² / Expected.
The larger the chi-square value, the greater the likelihood of a significant difference between the variables.
To know for sure, you need to look up the p-value in a chi-square table.
Analysis of Categorical Data
Probability of 0: indicates that both categorical variables are dependent.
Probability of 1: shows that both variables are independent.
Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence.
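A minimal sketch of a chi-square test on a two-way table (assuming Python with SciPy; the counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical two-way table of counts (e.g. gender × preference)
table = np.array([[30, 10],
                  [20, 40]])

# Compares observed counts with the expected counts under independence
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# p < 0.05 -> relationship significant at 95% confidence
```

The function also returns the expected frequencies, so the (Observed − Expected) differences driving the statistic can be inspected directly.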
4- Missing values treatment
Missing data in the training data set:
• can reduce the power of a model
• can lead to a biased model, because we have not analysed the behavior and relationship with other variables
• can lead to wrong predictions or classifications
Missing Values Treatment
Mean / mode / median imputation
Analyse each variable separately:
• measures of dispersion
• skewness and kurtosis
• outliers & missing values
• frequencies of categories
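Mean and mode imputation can be sketched as follows (assuming Python with pandas, which the lectures do not prescribe; the data set is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, np.nan],
    "gender": ["F", "M", None, "M", "M"],
})

# Continuous variable: impute with the mean (or median for skewed data)
df["height"] = df["height"].fillna(df["height"].mean())

# Categorical variable: impute with the mode (most frequent category)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```

Each variable is treated separately, matching the per-variable analysis above.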
5- Outlier treatment
Outlier = an observation that appears far away and diverges from an overall pattern in a sample.
Beware of univariate vs. multivariate
Detection: box plot, histogram, scatter plot.
Common rules flag values:
• outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR
• out of the range of the 5th and 95th percentiles
• three or more standard deviations away from the mean
Multivariate outliers are measured using an index of influence, leverage, or distance.
Mahalanobis’ distance and Cook’s D
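The IQR rule for univariate outlier detection can be sketched as follows (assuming Python with NumPy; the data values are hypothetical):

```python
import numpy as np

# Hypothetical sample; 42 diverges from the overall pattern
data = np.array([12, 13, 12, 14, 13, 15, 12, 14, 13, 42])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```

This is exactly the whisker rule a box plot draws, which is why box plots are listed above as a detection tool.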
Deleting observations: when they are due to data entry or data processing errors, or when the outlier observations are very small in number. We can also use trimming at both ends.
Transforming and binning values: log functions or discretization
Imputation & Separate treatment
Variable identification:
• Data type: e.g. character, numeric
• Dependent (output) variables: the thing that you want to measure
• Independent (input) variables: e.g. gender, height, weight
• Category of the variable: categorical (e.g. gender) or continuous
6- Feature Engineering
Change of Scale, Linearization, Normalization, Binning
Variable / Feature creation
Crucial for the quality of machine learning models
• Derived variables
• Dummy variables (binarization)
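Binning and dummy variables (binarization) can be sketched as follows (assuming Python with pandas; the data are hypothetical):

```python
import pandas as pd

# Hypothetical data set with a categorical and a continuous variable
df = pd.DataFrame({"gender": ["F", "M", "M", "F"],
                   "age": [22, 35, 58, 41]})

# Dummy variables (binarization): one 0/1 column per category
dummies = pd.get_dummies(df["gender"], prefix="gender")

# Binning: discretize a continuous variable into ranges
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])

print(pd.concat([df, dummies], axis=1))
```

The bin edges and labels here are illustrative; in practice they would be derived from the data distribution or domain knowledge.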