Data Exploration (ref. lectures of Marco Brambilla 2018)
steps
3- Bi-variate Analysis
find out the relationship between two variables
look for association and disassociation between variables at a pre-defined significance level
In correlation, the aim is to draw a line through the data such that the deviations of the points from the line (xn) are minimised
Because deviations can be negative or positive, each is first squared, then the squared deviations are added together, and the square root taken
-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
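As a minimal sketch of the correlation coefficient described above, the following computes Pearson's r from the sums of squared deviations, using only the standard library. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the lectures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (squaring removes the sign)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only
height = [150, 160, 170, 180, 190]
weight = [50, 58, 66, 74, 82]    # rises linearly with height
falling = [10, 8, 6, 4, 2]       # falls linearly with height

r_pos = pearson_r(height, weight)    # close to +1
r_neg = pearson_r(height, falling)   # close to -1
```

Values between the extremes indicate weaker linear association. A value near 0 only rules out a linear relationship, not a curvilinear one.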
can be visualised by:
Faceting
Plot the same variable against different «facets» (categories or groups)
Also with trend lines
And confidence bands (with smoothing)
Matrix Plot
bi-variate analysis can be performed for any combination of categorical and continuous variables.
Categorical & Continuous
Draw box plots for each level of the categorical variable.
If the levels are few in number, the box plots will not show statistical significance.
To look at the statistical significance we can perform Z-test, T-test or ANOVA.
Continuous & Continuous
Possible relationships: no correlation, strong (positive/negative), moderate (positive/negative), curvilinear.
SAMPLES AND POPULATIONS
Comparative tests
How to compare two samples?
In essence, the t-test gives a measure of the difference between the sample means in relation to the overall spread
t-test: use only when comparing 2 samples
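The idea of "difference between sample means in relation to the overall spread" can be sketched as a two-sample t statistic. This version assumes independent samples with roughly equal variances (the pooled-variance form), which the notes do not specify. The function name and the data are hypothetical.

```python
import math
import statistics

def two_sample_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Overall spread: variance pooled across both samples
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df

# Made-up measurements for illustration only
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.0, 4.8, 4.6]

t_stat, df = two_sample_t(group_a, group_b)
```

The sketch stops at the statistic. To decide significance, `t_stat` would be compared against the critical value of the t-distribution for `df` degrees of freedom at the chosen significance level.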
COMPARISONS BETWEEN THREE OR MORE SAMPLES
• F-test and ANOVA
ANOVA and Bartlett’s Test
Kruskal-Wallis Test
Essentially, ANOVA involves dividing the variance in the results into:
Between-groups variance
Within-groups variance
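The split into between-groups and within-groups variance can be sketched as a one-way ANOVA F statistic. This is a minimal illustration under the usual one-way layout. The function name and the three samples are assumptions made for the example.

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA, with its two degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups sum of squares: group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

# Made-up samples for illustration only
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(samples)
```

A large F means the group means differ by more than the scatter inside the groups would explain. The p-value would come from the F distribution with `df_b` and `df_w` degrees of freedom.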
IN SUMMARY
When to use t-distribution?
when the sample size is small and/or
when the population variance is unknown
Main parameter:
Degrees of freedom = n - 1, with n = size of the sample
null hypothesis (to be tested)
COMPARING TWO SAMPLES
Type I error
The α level represents the probability of finding a significant difference between the two means when none exists
Type II error
The β level represents the probability of not finding a significant difference between the two means when one exists
Categorical & Categorical
Two-way table: count and count%.
Stacked Column Chart: visual form
Chi-square
derive the statistical significance of the relationship between the variables.
A formal statistical test to determine whether results are statistically significant
Also, it tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
based on the difference between the expected and observed frequencies in one or more categories in the two-way table.
It returns the probability for the computed chi-square statistic, given the degrees of freedom.
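The statistic is computed as χ² = Σ (O - E)² / E, summed over all cells of the table, where O is the observed frequency and E the expected frequency.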
the higher the chi-square value, the greater the likelihood there is a statistically significant difference between the two groups
To know for sure, you need to look up the p-value in a chi-square table
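A minimal sketch of the chi-square computation on a two-way table of counts follows. Expected frequencies are derived from the row and column totals under the assumption of independence. The function name and the counts are hypothetical.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Made-up 2 x 2 table of counts for illustration only
observed = [[30, 10],
            [20, 40]]
chi2, df = chi_square_statistic(observed)

# A table whose rows are proportional shows no association at all
chi2_none, _ = chi_square_statistic([[10, 20], [20, 40]])
```

The statistic is then compared with the chi-square distribution for `df` degrees of freedom. For 1 degree of freedom at the 0.05 level, the tabulated critical value is 3.841.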
Types of
Chi-square
Statistical Tests
Analysis of Categorical Data
Measure of association (risk ratio or odds ratio)
Confidence interval
Confidence
Probability of 0: indicates that the two categorical variables are dependent.
Probability of 1: indicates that the two variables are independent.
Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence.
4- Missing values treatment
Missing data in the training data set:
can reduce the power of a model
can lead to a biased model
This is because we have not correctly analysed the behavior of the variable and its relationship with other variables.
It can lead to wrong prediction or classification.
Missing Values Treatment
Deletion
Mean / Mode / Median imputation
Prediction
Clustering-based
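The two simplest treatments above, deletion and mean / mode / median imputation, can be sketched as follows. Missing values are represented as `None` here, which is an assumption of this example, as are the function name and the data. Prediction-based and clustering-based imputation are not shown.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)   # also works for categorical values
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Made-up columns for illustration only
ages = [22, None, 30, 26, None]
genders = ["F", "M", None, "F", "F"]

ages_mean = impute(ages, "mean")         # continuous: fill with the mean
genders_mode = impute(genders, "mode")   # categorical: fill with the mode
ages_deleted = [v for v in ages if v is not None]   # deletion instead
```

Deletion shrinks the sample, while imputation keeps every row but reduces the variability of the filled column. Which trade-off is acceptable depends on how much data is missing.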
2- Univariate Analysis
Analyse each variable separately
Continuous variables:
measure of dispersion
quartile
IQR
range
Variance
standard deviation
Skewness and Kurtosis
visualisation methods
Histogram
Box Plot
central tendency
min
max
median
mode
mean
outliers & missing values
Categorical variables
Frequencies of categories
Barcharts
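The univariate summaries listed above can be sketched with Python's standard `statistics` and `collections` modules. The function name `describe`, the choice of the "inclusive" quartile method, and the data are assumptions of this example. Skewness and kurtosis are left out.

```python
import statistics
from collections import Counter

def describe(values):
    """Central tendency and dispersion measures for a continuous variable."""
    # Quartiles; other quartile conventions give slightly different cut points
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": statistics.variance(values),   # sample variance
        "stdev": statistics.stdev(values),         # sample standard deviation
    }

# Made-up continuous variable for illustration only
summary = describe([2, 4, 4, 5, 7, 9, 11])

# Categorical variable: frequencies of categories
colours = ["red", "blue", "red", "green", "red", "blue"]
frequencies = Counter(colours)
```

The `frequencies` counter holds exactly the counts a bar chart would display, and the quartiles in `summary` are the values a box plot draws.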
5- Outlier treatment
Outlier = an observation that appears far away and diverges from an overall pattern in a sample.
Beware of univariate vs. multivariate
Outlier Detection
Box-plot, Histogram, Scatter Plot
beyond the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
out of the range between the 5th and 95th percentiles
three or more standard deviations away from the mean
multivariate outliers are measured using an index of influence or leverage, or a distance.
Mahalanobis’ distance and Cook’s D
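Two of the univariate detection rules above, the IQR fences and the three-standard-deviations rule, can be sketched as follows. The function names, the default thresholds as parameters, and the data are assumptions of this example. Multivariate measures such as Mahalanobis' distance and Cook's D are not shown.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Values at least `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev >= threshold]

# Made-up data: twenty ordinary readings plus one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

flagged_iqr = iqr_outliers(data)
flagged_z = zscore_outliers(data)
```

The standard-deviation rule is itself affected by the outlier, because an extreme value inflates both the mean and the standard deviation. On small samples it may therefore miss values that the IQR rule flags.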
Outliers removal
Deleting observations: if the outlier is due to a data entry error or a data processing error, or if the outlier observations are very few in number. We can also use trimming at both ends.
Transforming and binning values: log functions or discretization
Imputation & Separate treatment
1- Variable Identification
identify the data type, e.g. character, numeric
identify Target (output) variables: the thing that you want to measure
identify Predictor (input) variables, e.g. gender, height, weight
identify the category of the variables, i.e. categorical (gender) or continuous (height, weight)
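Variable identification can be sketched as a small pass over the columns of a dataset. The rule used here (numeric values are continuous, everything else is categorical) is a rough heuristic assumed for this example. The dataset, the target column `plays_sport`, and the function name are hypothetical.

```python
def variable_kind(values):
    """Rough heuristic: numeric columns are 'continuous', others 'categorical'."""
    observed = [v for v in values if v is not None]
    is_numeric = bool(observed) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    return "continuous" if is_numeric else "categorical"

# Made-up dataset for illustration only, stored column by column
dataset = {
    "gender": ["F", "M", "F"],
    "height": [162.0, 180.5, 171.2],
    "weight": [55.0, 82.3, 64.8],
    "plays_sport": ["yes", "no", "yes"],
}

target = "plays_sport"   # the output variable we want to predict
predictors = [name for name in dataset if name != target]
kinds = {name: variable_kind(column) for name, column in dataset.items()}
```

A categorical variable stored as numeric codes (for example 0 and 1) would be misclassified by this heuristic, so the result should be checked against what each column actually means.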
6- Feature Engineering
Variable transformation
Change of Scale, Linearization, Normalization, Binning
Variable / Feature creation
Crucial for quality of machine learning
• Derived variables
• Dummy variables (binarization)
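The transformations and feature-creation steps above can be sketched in plain Python: a change of scale (min-max normalization), a log transformation, binning, a derived variable, and dummy variables. All function names, bin edges, labels, and data are assumptions of this example. The body-mass index is used only as a familiar derived variable built from height and weight.

```python
import bisect
import math

def min_max_scale(values):
    """Change of scale: map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant variable")
    return [(v - lo) / (hi - lo) for v in values]

def bin_values(values, edges, labels):
    """Binning: a value equal to an edge falls into the bin above it."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect.bisect_right(edges, v)] for v in values]

def dummy_variables(categories):
    """Binarization: one 0/1 indicator per level of a categorical variable."""
    levels = sorted(set(categories))
    return [{f"is_{lv}": int(c == lv) for lv in levels} for c in categories]

# Made-up data for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 74, 82]
genders = ["M", "F", "F"]

scaled = min_max_scale(heights_cm)
logged = [math.log(w) for w in weights_kg]   # log transformation
binned = bin_values(heights_cm, [160, 180], ["short", "medium", "tall"])
# Derived variable: body-mass index from weight and height
bmi = [w / (h / 100) ** 2 for w, h in zip(weights_kg, heights_cm)]
dummies = dummy_variables(genders)
```

In a modelling setting, one indicator per variable is often dropped, since the remaining ones already determine it.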