EDA 📈
WHEN ❓
Data Science Pipeline ‼
1. Ask a question ➡ 2. Find data ➡ 3. Get data ➡ 4. Clean data ➡ 5. Analyze data ➡ 6. Present data
DATA ANALYSIS TYPES
WHAT ❓
WHY ❓
Purpose
Discover patterns
Spot anomalies
Check assumptions
Check missing data/other mistakes
Gain max. insight of dataset & underlying structure
associated with model fitting/hypothesis test
Create a list of outliers/anomalies
Identify the most influential variables
find parameter estimates (using a sample to estimate the parameters of the population distribution) & their associated confidence intervals or margins of error
Explore a dataset 🔎
get a feel of the dataset
Understand & summarize dataset
(1) investigate a Q; or
(2) prepare data for more advanced modelling
TWO METHODS 🛠
Data Visualization
(mostly)
Uncover a parsimonious model
explains data with min. # of parameters/predictor variables
1⃣ EDA
2⃣ Confirmatory Analysis
(Inferential Statistics)
Settles questions
Test Hypothesis
(Descriptive Statistics)
Find a good description
raises new questions
use own judgement
determine important elements of data
❓
understand data characteristics & properties
Quantitative methods
describe data
Frame Hypothesis
ANALYTICAL DESIGN
6 principles
1⃣ Show comparisons, contrasts, differences 🍎🍏
2⃣ Show causality, mechanism, explanation, systematic structure ⛈☔
3⃣ Show multivariate data
compare results of control group vs. test group
H0 vs. H1
explain why something happened
help form hypothesis
help show relationship, if any e.g. A affects B
real world multivariate
uncover unexpected relationship
4⃣ Integration of Evidence
e.g. identify confounding variable
Analysis drives tool selection
Confounding Variable
unaccounted variable
causes 2 problems if not detected
Introduce Bias
increase variance
hidden effect on dependent variable
e.g. can suggest a correlation when in fact there is none
5⃣ Describe with appropriate labels, title, scales, sources
use many modes of data presentation to highlight/display evidence
Enrich with words, images, #, diagrams, other statistical methods
add credibility
6⃣ Content is ‼
Depends
Content
quality
relevance
integrity
e.g.
poor data or poor question ➡ irrelevant evidence, lack clarity, information not useful despite advanced analytical design/graph
HOW ❓
Bivariate
Qualitative
Joint frequency
Quantitative
2 variables
Predictor & outcome e.g. regression
Measures
covariance
correlation i.e. normalized covariance
a statistical measure that indicates the extent to which two or more variables fluctuate together.
note: correlation applies to both qualitative & quantitative variables
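The "correlation i.e. normalized covariance" note can be checked with a minimal R sketch. The numbers are made up for illustration, and cor() here is the default Pearson correlation; the variable names are not from the notes.

```r
# Hypothetical paired measurements (made-up numbers for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov_xy      <- cov(x, y)                   # sample covariance
cor_manual  <- cov_xy / (sd(x) * sd(y))    # normalize by both standard deviations
cor_builtin <- cor(x, y)                   # Pearson correlation (default method)

print(c(covariance = cov_xy, manual = cor_manual, builtin = cor_builtin))
```

Covariance depends on the units of x and y; dividing by both standard deviations gives a unit-free value between -1 and 1.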
Multivariate
2 or more variables
any correlation?
which variables affect the outcome ⁉
e.g. multiple linear regression
Textual e.g. word cloud
word size proportional to its frequency in text
Univariate
- single variable
Quantitative
i.e. continuous
Qualitative
i.e. nominal
ordinal
frequency
Central tendency
dispersion
range, sd, variance
explore spread of data
mean, median, mode
summarize data in a few numbers
Exploratory Graphs
Purpose
debug analysis
initial step in data investigation
made quickly
a large # produced
goal to gain personal understanding
axis, legends cleaned up later
communicate results
understand data properties
find patterns in dataset
type
1 dimension
2 dimension
density plot
scatterplot
5 number summary
boxplot
histogram
dot plot
multiple box plot
multiple histogram
PLOT R
Plot R 💻
package
{ggplot2}
{base}
❓
constructed piecemeal, each plot aspect handled separately
stage
👍 flexible, high degree of control
👎
can't reverse - plan in advance
difficult to translate to others once new plot created (no graphical language)
1⃣ create plot
📊
hist()
plot()
boxplot()
barplot()
2⃣ annotation of plot e.g. modify/add lines, text, points
lines()
text()
points()
2D graphics
?par
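The two {base} stages (create the plot, then annotate it) might look like the sketch below. The data are made up, and the fitted line and the "last obs" label are illustration only; pdf(NULL) is used so the sketch also runs without a display.

```r
# Made-up data for illustration
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)

pdf(NULL)  # null device: nothing is written to disk

# Stage 1: create the plot
plot(x, y, main = "y vs. x", xlab = "x", ylab = "y")

# Stage 2: annotate the existing plot
fit <- lm(y ~ x)
lines(x, fitted(fit), col = "blue")               # add the fitted line
points(x[10], y[10], pch = 19, col = "red")       # highlight one point
text(x[10], y[10], labels = "last obs", pos = 2)  # label it, to the left

invisible(dev.off())
```

Each call draws on top of what is already there, which is why the order has to be planned in advance.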
{lattice}
❓
for conditioning plot
how Y changes with X across levels of Z
👍
put many plots on one screen
margins & spaces set automatically
👎
awkward to specify entire plot in a single function
annotation not intuitive
cannot 'add' to plot once created
❓
combine base & lattice concepts
👍 easier & more intuitive & can still customize
Refer to STATISTICS notes
single function e.g. xyplot, bwplot
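A conditioning plot with a single {lattice} call could be sketched as follows, using the built-in mtcars dataset (an assumption for illustration, not from the notes): mpg is Y, wt is X, and cyl plays the role of Z.

```r
library(lattice)  # a recommended package shipped with standard R distributions

# One panel per level of cyl: how mpg changes with wt across cylinder counts
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

pdf(NULL)  # null device so the sketch runs without a display
print(p)   # lattice returns a "trellis" object; printing it draws the plot
invisible(dev.off())
```

The whole figure, including panels and labels, is specified in the one xyplot() call.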
measure
summary(_)
fivenum(_)
min, max, median, quartiles
five number summary
min
Q1
median
Q3
max
a boxplot visualizes the five number summary
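A small sketch of fivenum(), summary() and boxplot() on made-up data. Note that fivenum() uses Tukey's hinges, which can differ slightly from the quartiles reported by summary() on other data.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)  # made-up data

five <- fivenum(x)   # min, lower hinge, median, upper hinge, max
print(five)
print(summary(x))    # min, Q1, median, mean, Q3, max

pdf(NULL)            # null device so the sketch runs without a display
b <- boxplot(x, horizontal = TRUE, main = "Boxplot of x")
invisible(dev.off())

print(b$stats)       # the five values the box and whiskers are drawn from
```

With no outliers in the data, the values in b$stats match the output of fivenum().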
table
bar chart
box plot
scatter plot
frequency e.g. mode, distribution
Histogram
box plot
scatter plot
histogram
overall shape, skewness
variability
outliers
histogram
mode, concentration of points
univariate scatter plot
boxplot
histogram
density plot
25th, 50th, 75th percentiles
mean optional
min & max, outliers
check symmetry
box & whiskers
location of median,
i.e. line in the box
dot plot
same interpretation as bar chart
Qualitative & Quantitative
distribution
Barplots/ bar chart
two way table
stacked
nested i.e. side by side
dot plots
side by side box plots
scatter plot
relationship pattern, strength
outliers
Central tendency
i.e. summarize, organize & explain data in a few numbers
Dispersion
i.e. spread of data
/ deviation from the mean
mean
median
mode
range :max - min
IQR: Inter Quartile Range,
quartile 3 - quartile 1
Q3 - Q1
Variance, SD : deviation from the mean
Percentiles
Q1 = 25th percentile
Median = 50th Percentile
Q3 = 75th Percentile
nth percentile = n% of observations fall at or below it
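The dispersion measures and percentiles above, computed in R on made-up data. quantile() has several interpolation rules; this sketch assumes R's default (type 7).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data

rng <- max(x) - min(x)                         # range: max - min
q   <- quantile(x, probs = c(0.25, 0.5, 0.75)) # Q1, median, Q3
iqr <- IQR(x)                                  # Q3 - Q1
v   <- var(x)                                  # sample variance (divides by n - 1)
s   <- sd(x)                                   # standard deviation = sqrt(variance)

print(q)
print(c(range = rng, IQR = iqr, variance = v, sd = s))
```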
takes a broad look & tries to make sense of data
visualize what happened
assess & 🔎 patterns
null Hypothesis vs. Alternate Hypothesis
I am what is default, the status quo. I'm already accepted, can only be rejected. The burden of proof is on the alternative. I am H0
e.g. side-by-side boxplots
The second principle is to show causality or a mechanism of how your theory of the data works. This explanation or systematic structure shows your causal framework for thinking about the question you're trying to answer (swirl)
Dual Boxplot
change in symptom-free days for both groups (left) and the change in PM2.5 in both groups (right).
Detailed explanation + diagram
By showing the two sets of boxplots side by side you're explaining your theory of why the air cleaner increases the number of symptom-free days
The mechanism this graph implies is that the air cleaner reduces pollution
Showing that the air cleaner improves breathing for children with asthma (comparison) is not enough.
It’s fairly well-understood that inhaling fine particles can exacerbate asthma symptoms, so it stands to reason that reducing their presence in the air should improve asthma symptoms.
Therefore, we’d expect that the group receiving the air cleaners should on average see a decrease in airborne particles.
In this case we are tracking fine particulate matter, also called PM2.5, which stands for particulate matter less than or equal to 2.5 microns in aerodynamic diameter.
Simpson's Paradox
is a paradox in probability and statistics, in which a trend that appears in different groups of data disappears or reverses when these groups are combined.
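A minimal made-up illustration: within each group x and y rise together, but once the groups are pooled the overall trend points the other way. The group labels and numbers are hypothetical.

```r
# Two hypothetical groups, four observations each
g <- rep(c("A", "B"), each = 4)
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(6, 7, 8, 9, 1, 2, 3, 4)

cor_A   <- cor(x[g == "A"], y[g == "A"])  # trend within group A
cor_B   <- cor(x[g == "B"], y[g == "B"])  # trend within group B
cor_all <- cor(x, y)                      # trend after pooling both groups

print(c(A = cor_A, B = cor_B, pooled = cor_all))
```

Here the group itself acts as the unaccounted (confounding) variable, which is why checking for it before pooling matters.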
e.g. To demonstrate a causative mechanism underlying a correlation
integrate different kinds of evidence to enhance results, provide clarity
distribution
suggest modelling for next step
explore basic questions & hypothesis
summarize data & highlight broad features
start with a blank canvas, then build a plot, step by step
1st step in data analysis
Understand the data
understand events that generated the data
see what happened i.e. via
visualization
Edward Tufte
the principles are derived from principles of analytical thinking
the purpose of evidence presentation is to assist thinking