EDA 📈

WHEN ❓
Data Science Pipeline ‼

  1. Ask a question ➡ 2. Find data ➡ 3. Get data ➡ 4. Clean data ➡ 5. Analyze data ➡ 6. Present data

DATA ANALYSIS TYPES

WHAT ❓

WHY ❓
Purpose

Discover patterns

Spot anomalies

Check assumptions

Check missing data/other mistakes

Gain max. insight into the dataset & its underlying structure

associated with model fitting/hypothesis test

Create a list of outliers/anomalies

Identify the most influential variables

find parameter estimates (use a sample to estimate the parameters of the population distribution) & their associated confidence intervals or margins of error
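A minimal sketch of a parameter estimate with its margin of error, assuming the built-in mtcars data stands in for a sample and that a 95% t-based interval is appropriate (both are illustrative choices, not from the notes):

```r
# Treat mtcars$mpg as a sample from some larger population (assumption).
x <- mtcars$mpg
n <- length(x)

point_estimate <- mean(x)            # sample mean estimates the population mean
std_error <- sd(x) / sqrt(n)         # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # critical value for a 95% interval
margin_of_error <- t_crit * std_error

ci <- c(lower = point_estimate - margin_of_error,
        upper = point_estimate + margin_of_error)

# The same interval as reported by t.test()
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int
print(ci)
print(ci_builtin)
```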

Explore a dataset 🔎

get a feel of the dataset

Understand & summarize dataset

(1) investigate a Q; or

(2) prepare data for more advanced modelling

TWO METHODS 🛠

Data Visualization
(mostly)

Uncover a parsimonious model

explains data with min. # of parameters/predictor variables

1⃣ EDA

2⃣ Confirmatory Analysis

(Inferential Statistics)

Settles questions

Test Hypothesis

(Descriptive Statistics)

Find a good description

raises new questions

use own judgement

determine important elements of data

understand data characteristics & properties

Quantitative methods

describe data

Frame Hypothesis

ANALYTICAL DESIGN

6 principles

1⃣ Show comparisons, contrasts, differences 🍎🍏

2⃣ Show causality, mechanism, explanation, systematic structure ⛈☔

3⃣ Show multivariate data

compare results of control group vs. test group

H0 vs. H1
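A comparison of a control group against a test group could be sketched as follows. The numbers are made up for illustration, and the Welch two-sample t-test is only one possible way to weigh H0 against H1:

```r
# Hypothetical outcome measurements (illustrative values only)
control    <- c(8, 9, 10, 10, 11, 12)
test_group <- c(11, 12, 13, 13, 14, 15)

# Show the comparison: side-by-side boxplots
boxplot(list(control = control, test = test_group), ylab = "outcome")

# H0: both groups have the same mean; H1: the means differ
result <- t.test(test_group, control)
print(result$statistic)
print(result$p.value)
```

A small p-value is evidence against H0; it does not by itself explain why the groups differ.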

explain why something happened

help form hypothesis

help show relationship, if any e.g. A affects B

real world multivariate

uncover unexpected relationship

4⃣ Integration of Evidence

e.g. identify confounding variable

Analysis drives tool selection

Confounding Variable

unaccounted variable

causes 2 problems if not detected

Introduce Bias

increase variance

hidden effect on dependent variable

e.g. can suggest a correlation when in fact there is none
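The effect of a confounder can be sketched with simulated data. Here z is a hypothetical hidden variable that drives both a and b, which have no direct link to each other; all names and parameters are assumptions for illustration:

```r
set.seed(1)
n <- 1000
z <- rnorm(n)                  # confounder (unaccounted variable)
a <- z + rnorm(n, sd = 0.5)    # a depends only on z
b <- z + rnorm(n, sd = 0.5)    # b depends only on z

# Ignoring z, a and b look strongly correlated
raw_cor <- cor(a, b)

# After removing the part explained by z, the correlation nearly vanishes
adj_cor <- cor(resid(lm(a ~ z)), resid(lm(b ~ z)))

print(c(raw = raw_cor, adjusted = adj_cor))
```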

5⃣ Describe with appropriate labels, title, scales, sources

use many modes of data presentation to highlight/display evidence

Enrich with words, images, #, diagrams, other statistical methods

add credibility

6⃣ Content is ‼

Depends

Content

quality

relevance

integrity

e.g.

poor data or a poor question ➡ irrelevant evidence, lack of clarity, information that is not useful despite advanced analytical design/graphs

HOW ❓

Bivariate

Qualitative

Joint frequency

Quantitative

2 variables

Predictor & outcome e.g. regression

Measures

covariance

correlation i.e. normalized covariance

a statistical measure that indicates the extent to which two or more variables fluctuate together.

note: correlation applies to both qualitative & quantitative variables (e.g. rank correlation for ordinal data)
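A short sketch of "correlation is normalized covariance", using two columns of the built-in mtcars data as an arbitrary example pair:

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # fuel efficiency

covariance  <- cov(x, y)                       # depends on the units of x and y
correlation <- cor(x, y)                       # Pearson correlation, unit-free, in [-1, 1]
normalized  <- covariance / (sd(x) * sd(y))    # covariance divided by both standard deviations

print(c(covariance = covariance, correlation = correlation, normalized = normalized))
```

The last two values agree, which is the sense in which correlation is a rescaled covariance.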

Multivariate

2 or more variables

any correlation?

which variables affect the outcome ⁉

e.g. multiple linear regression
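A minimal multiple linear regression sketch, assuming mtcars as the dataset and wt and hp as illustrative predictors of mpg:

```r
# Outcome: mpg; predictors: weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept
print(coef(fit))

# summary() adds standard errors and p-values, which help judge
# which variables appear to affect the outcome
print(summary(fit)$coefficients)
```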

Textual e.g. word cloud

word size proportional to its frequency in text
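The idea behind a word cloud can be sketched in base R without a dedicated package: count the words, then draw each one at a size proportional to its count. The sentence and the scaling factor of 3 are arbitrary choices for illustration:

```r
txt <- "data tells a story and good data tells a clear story about data"

# Split into lowercase words and count occurrences
words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
words <- words[nchar(words) > 0]
freq  <- sort(table(words), decreasing = TRUE)

# Text size proportional to frequency (most frequent word gets cex = 3)
sizes <- 3 * as.numeric(freq) / max(freq)

# Draw the words on an empty canvas, most frequent at the top
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, length(freq) + 1))
text(0.5, rev(seq_along(freq)), labels = names(freq), cex = sizes)
```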

Univariate

  • single variable

Quantitative
e.g. continuous

Qualitative
i.e. nominal, ordinal

frequency

Central tendency

dispersion

range, sd, variance

explore spread of data

mean, median, mode

summarize data in a few numbers
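A small sketch of the univariate measures above on a made-up vector. Base R has no function for the statistical mode, so it is derived from table() here:

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # illustrative values

# Central tendency
mean_x   <- mean(x)
median_x <- median(x)
mode_x   <- as.numeric(names(which.max(table(x))))  # most frequent value

# Dispersion
range_x <- max(x) - min(x)
var_x   <- var(x)    # sample variance
sd_x    <- sd(x)     # sample standard deviation, sqrt of the variance

print(c(mean = mean_x, median = median_x, mode = mode_x,
        range = range_x, variance = var_x, sd = sd_x))
```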

Exploratory Graphs

Purpose

debug analysis

initial step in data investigation

made quickly

a large # produced

goal to gain personal understanding

axis, legends cleaned up later

communicate results

understand data properties

find patterns in dataset

type

1 dimension

2 dimension

density plot

scatterplot

5 number summary

boxplot

histogram

dot plot

multiple box plot

multiple histogram

PLOT R

package

{ggplot2}

{base}

constructed piecemeal, each plot aspect handled separately

stage

👍 flexible, high degree of control

👎

can't go back once a plot has been started - plan in advance

difficult to describe a finished plot to others (no graphical language)

1⃣ create plot

📊

hist()

plot()

boxplot()

barplot()

2⃣ annotation of plot e.g. modify/add lines, text, points

lines()

text()

points()

2D graphics

?par
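The two base-graphics stages can be sketched as follows: first create the plot, then annotate it piece by piece. The dataset (mtcars) and the specific annotations are illustrative choices:

```r
x <- mtcars$mpg

# Stage 1: create the plot
h <- hist(x, main = "Miles per gallon", xlab = "mpg", col = "grey90")

# Stage 2: annotate the existing plot
d <- density(x)
bin_width <- diff(h$breaks)[1]
# Rescale the density curve from proportions to counts so it matches the histogram axis
lines(d$x, d$y * length(x) * bin_width, lwd = 2)
points(mean(x), 0, pch = 19)                  # mark the mean on the x axis
text(mean(x), 0, labels = "mean", pos = 3)    # label it just above the point
```

Each call adds to what is already drawn, which is why the order has to be planned in advance.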

{lattice}

for conditioning plot

how Y changes with X across levels of Z

👍

put many plots on one screen

margins & spaces set automatically

👎

awkward to specify entire plot in a single function

annotation not intuitive

cannot 'add' to plot once created
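A conditioning plot in lattice could look like the sketch below, assuming the lattice package is installed (it normally ships with R). The choice of mtcars, with Y = mpg, X = wt and Z = cyl, is only an example:

```r
library(lattice)

# How mpg (Y) changes with wt (X) across levels of cyl (Z):
# the whole plot is specified in a single function call
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight", ylab = "Miles per gallon",
            main = "mpg vs. weight, by number of cylinders")

print(p)   # lattice plots are objects; printing draws them
```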

combine base & lattice concepts

👍 easier & more intuitive & can still customize
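For comparison, a rough ggplot2 version of a conditioned scatterplot, assuming the ggplot2 package is installed (the guard skips the example otherwise). The dataset and aesthetics are illustrative:

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Layers are added with "+", so the plot can still be customized after creation
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    facet_wrap(~ cyl) +                       # one panel per level of cyl
    labs(x = "Weight", y = "Miles per gallon",
         title = "mpg vs. weight, by number of cylinders")

  print(p)
}
```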

Refer to STATISTICS notes

single function e.g. xyplot, bwplot

measure

summary(_)

fivenum(_)

min, max, median, quartiles

five number summary

min

Q1

median

Q3

max

a boxplot visualizes the five number summary
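A quick sketch relating fivenum(), summary() and the numbers a boxplot draws, using a made-up vector with no outliers:

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # illustrative values, no outliers

five <- fivenum(x)   # min, lower hinge, median, upper hinge, max
print(five)

# summary() reports similar quantities plus the mean
# (its quartiles can differ slightly from the hinges)
print(summary(x))

# boxplot.stats() returns the values a boxplot draws: whisker ends,
# box edges and the median line. Without outliers the whiskers
# reach the min and max, so the values match fivenum().
bp <- boxplot.stats(x)
print(bp$stats)

boxplot(x, horizontal = TRUE)
```

When outliers are present, the whiskers stop short of the extremes and the outliers are drawn as separate points.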

table

bar chart

box plot

scatter plot

frequency e.g. mode, distribution

Histogram

box plot

scatter plot

histogram

overall shape, skewness

variability

outliers

histogram

mode, concentration of points

univariate scatter plot

boxplot

histogram

density plot

25th, 50th, 75th percentiles

mean optional

min & max, outliers

check symmetry

box & whiskers

location of median,
i.e. line in the box

dot plot

same interpretation as bar chart

Qualitative & Quantitative

distribution

Barplots/ bar chart

two way table

stacked

nested i.e. side by side
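A two-way table and its stacked and side-by-side bar charts can be sketched like this, treating cyl and gear in mtcars as two qualitative variables (an illustrative choice):

```r
# Joint frequency of two qualitative variables
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(tab)

# Stacked bar chart: one bar per gear value, split by cyl
barplot(tab, legend.text = TRUE, xlab = "gear", ylab = "count",
        main = "Stacked")

# Nested (side-by-side) bar chart: one bar per cyl within each gear value
barplot(tab, beside = TRUE, legend.text = TRUE, xlab = "gear", ylab = "count",
        main = "Side by side")
```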

dot plots

side by side box plots

scatter plot

relationship pattern, strength

outliers

Central tendency
i.e. summarize, organize & explain data in a few numbers

Dispersion
i.e. spread of data
/ deviation from the mean

mean

median

mode

range: max - min

IQR: Interquartile Range,
quartile 3 - quartile 1
Q3 - Q1

Variance, SD: deviation from the mean

Percentiles

Q1 = 25th percentile

Median = 50th Percentile

Q3 = 75th Percentile

nth percentile = n% of observations fall at or below it
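Percentiles and the IQR can be sketched with quantile() and IQR(). Note that R offers several quantile definitions; this example uses the default, and mtcars$mpg is only a stand-in dataset:

```r
x <- mtcars$mpg

# Q1, median and Q3 as the 25th, 50th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# IQR = Q3 - Q1
iqr <- IQR(x)
print(iqr)

# Share of observations at or below Q1 (roughly 25%)
share_below_q1 <- mean(x <= q[["25%"]])
print(share_below_q1)
```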

takes a broad look & tries to make sense of the data

visualize what happened

assess & 🔎 patterns

Null hypothesis (H0) vs. alternative hypothesis (H1)

null hypothesis


I am what is default, the status quo. I'm already accepted, can only be rejected. The burden of proof is on the alternative. I am H0

e.g. side-by-side boxplots

The second principle is to show causality or a mechanism of how your theory of the data works. This explanation or systematic structure shows your causal framework for thinking about the question you're trying to answer (swirl)

Figure (Rplot_causal): dual boxplot showing the change in symptom-free days for both groups (left) and the change in PM2.5 in both groups (right).

By showing the two sets of boxplots side by side you're explaining your theory of why the air cleaner increases the number of symptom-free days

The mechanism this graph implies is that the air cleaner reduces pollution

showing that the air cleaner improves breathing for children with asthma is not enough on its own (that is only the comparison)

It’s fairly well-understood that inhaling fine particles can exacerbate asthma symptoms, so it stands to reason that reducing their presence in the air should improve asthma symptoms.

Therefore, we’d expect that the group receiving the air cleaners should on average see a decrease in airborne particles.

In this case we are tracking fine particulate matter, also called PM2.5 which stands for particulate matter less than or equal to 2.5 microns in aerodynamic diameter.

Simpson's Paradox

is a paradox in probability and statistics, in which a trend that appears in different groups of data disappears or reverses when these groups are combined.
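A tiny constructed example of the paradox: within each group y rises with x, yet the pooled data show a falling trend. The numbers are invented purely to make the effect visible:

```r
# Two hypothetical groups, each with a slope of +1
group_a <- data.frame(group = "A", x = 1:5,  y = 10 + 1:5)
group_b <- data.frame(group = "B", x = 6:10, y = (6:10) - 4)
pooled  <- rbind(group_a, group_b)

# Helper (illustrative): slope of y on x from a simple linear fit
slope <- function(d) unname(coef(lm(y ~ x, data = d))["x"])

slope_a      <- slope(group_a)   # trend inside group A
slope_b      <- slope(group_b)   # trend inside group B
slope_pooled <- slope(pooled)    # trend when the groups are combined

print(c(A = slope_a, B = slope_b, pooled = slope_pooled))

# Plotting with the group shown makes the structure obvious
plot(pooled$x, pooled$y, pch = ifelse(pooled$group == "A", 1, 19),
     xlab = "x", ylab = "y")
```

Here the grouping variable plays the role of a confounder: ignoring it flips the sign of the apparent relationship.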

e.g. To demonstrate a causative mechanism underlying a correlation

integrate different kinds of evidence to enhance results & provide clarity

distribution

suggest modelling for next step

explore basic questions & hypotheses

summarize data & highlight broad features

start with a blank canvas, then build a plot, step by step

1st step in data analysis

Understand the data

understand events that generated the data

see what happened, i.e. via visualization

Edward Tufte

the principles are derived from the principles of analytical thinking

the purpose of evidence presentation is to assist thinking