EDA
:chart_with_upwards_trend:
WHEN :question:
Data Science Pipeline :!!:
1. Ask a question :arrow_right: 2. Find data :arrow_right: 3. Get data :arrow_right: 4. Clean data :arrow_right: 5. Analyze data :arrow_right: 6. Present data
DATA ANALYSIS TYPES
:one: EDA (Descriptive Statistics)
Find a good description
raises new questions
1st step in data analysis
:two: Confirmatory Analysis
(Inferential Statistics)
Settles questions
Test Hypothesis
WHAT :question:
Explore a dataset
:mag_right:
get a feel of the dataset
use own judgement
determine important elements of data
Understand & summarize the dataset
(1) investigate a Q; or
(2) prepare data for more advanced modelling
takes a broad look & tries to make sense of the data
Understand the data
understand events that generated the data
see what happened, e.g. via visualization
WHY :question:
Purpose
Discover patterns
Gain max. insight of dataset & underlying structure
Identify the most influential variables
Spot anomalies
Create a list of outliers/anomalies
Check assumptions
associated with model fitting/hypothesis test
Check missing data/other mistakes
Find parameter estimates (use the sample to estimate parameters of the population distribution) & their associated confidence intervals or margins of error
Uncover a parsimonious model
explains data with min. # of parameters/predictor variables
Frame Hypothesis
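The parameter-estimate item in the list above can be illustrated with a minimal R sketch. It assumes the built-in `mtcars` dataset as a stand-in sample and a 95% t-based interval for the mean; neither choice comes from these notes.

```r
# Sample: miles per gallon from the built-in mtcars dataset (illustrative choice)
x <- mtcars$mpg
n <- length(x)

# Point estimate of the population mean
est <- mean(x)

# Margin of error for a 95% confidence interval (t distribution, n - 1 df)
moe <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- c(est - moe, est + moe)

# The same interval as reported by t.test()
ci_ttest <- as.numeric(t.test(x)$conf.int)

print(est)
print(ci_manual)
```

The margin of error shrinks as the sample size grows, which is why the interval is reported together with the estimate.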
TWO METHODS
:hammer_and_wrench:
Data Visualization
(mostly)
:question:
understand data characteristics & properties
visualize what happened
assess & :mag_right: patterns
Exploratory Graphs
Purpose
debug analysis
communicate results
understand data properties
find patterns in dataset
suggest modelling for next step
explore basic questions & hypotheses
summarize data & highlight broad features
initial step in data investigation
made quickly
a large # produced
goal is to gain personal understanding
axes, legends cleaned up later
type
1 dimension
density plot
5 number summary
boxplot
histogram
dot plot
2 dimension
scatterplot
multiple box plot
multiple histogram
Plot R :computer:
package
{ggplot2}
:question:
combine base & lattice concepts
:+1: easier & more intuitive & can still customize
{base}
:question:
constructed piecemeal, each plot aspect handled separately
stage
2D graphics
start with a blank canvas, then build a plot, step by step
:+1: flexible, high degree of control
:-1:
can't reverse - plan in advance
difficult to translate to others once new plot created (no graphical language)
?par
{lattice}
:question:
for conditioning plot
how Y changes with X across levels of Z
:+1:
put many plots on one screen
margins & spaces set automatically
:-1:
awkward to specify entire plot in a single function
annotation not intuitive
cannot 'add' to plot once created
single function e.g. xyplot, bwplot
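A minimal sketch of the {base} "blank canvas, then build step by step" approach described above. The built-in `cars` dataset, the axis labels and the fitted line are illustrative assumptions, not part of these notes.

```r
pdf(NULL)  # null device, so the sketch also runs non-interactively

# 1. Start with a blank canvas: type = "n" sets up the axes but draws nothing
plot(cars$speed, cars$dist, type = "n",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# 2. Add each aspect of the plot separately
points(cars$speed, cars$dist, pch = 19)        # the data
fit <- lm(dist ~ speed, data = cars)           # simple linear fit
abline(fit, lty = 2)                           # fitted line
title(main = "Stopping distance vs. speed")    # annotation
legend("topleft", legend = "Linear fit", lty = 2)

invisible(dev.off())
```

For comparison, {lattice} would specify the whole plot in one call, e.g. `xyplot(dist ~ speed, data = cars)`, which is why nothing can be added afterwards in the same way.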
Quantitative methods
describe data
measure
Central tendency
i.e. summarize, organize & explain data in a few numbers
mean
median
mode
Dispersion
i.e. spread of data
/ deviation from the mean
range: max - min
IQR: interquartile range, quartile 3 - quartile 1 (Q3 - Q1)
Variance, SD: deviation from the mean
Percentiles
Q1 = 25th percentile
Median = 50th Percentile
Q3 = 75th Percentile
nth percentile = n% of observations fall at or below it
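The measures above map onto base R roughly as follows. This is a sketch on a made-up sample; the variable names are hypothetical.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # made-up sample

# Central tendency
m   <- mean(x)     # arithmetic mean
med <- median(x)   # middle value
# Base R has no function for the statistical mode; take the most frequent value
mode_x <- as.numeric(names(which.max(table(x))))

# Dispersion
rng <- max(x) - min(x)                   # range
q   <- quantile(x, c(0.25, 0.50, 0.75))  # Q1, median, Q3 (25th, 50th, 75th percentiles)
iqr <- unname(q[3] - q[1])               # Q3 - Q1, same value as IQR(x)
v   <- var(x)                            # sample variance
s   <- sd(x)                             # standard deviation = sqrt(variance)

print(c(mean = m, median = med, mode = mode_x, range = rng, IQR = iqr, sd = s))
```

Note that `quantile()` supports several interpolation rules (its `type` argument), so quartiles can differ slightly between tools.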
summary(_)
fivenum(_)
min, max, median, quartiles
five number summary
min
Q1
median
Q3
max
boxplot
visualizes the five number summary
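A short sketch of how `fivenum()`, `summary()` and the boxplot relate, using a made-up sample. `boxplot.stats()` is used here only to inspect the numbers a boxplot would draw.

```r
x <- c(2, 4, 4, 5, 7, 9, 11, 30)  # made-up sample with one large value

five <- fivenum(x)   # min, lower hinge, median, upper hinge, max
summ <- summary(x)   # min, Q1, median, mean, Q3, max

# boxplot.stats() returns the numbers a boxplot draws, without plotting
bs <- boxplot.stats(x)

print(five)
print(summ)
print(bs$stats)  # whisker ends, hinges and median
print(bs$out)    # points beyond 1.5 * IQR from the hinges, drawn as outliers
```

`summary()` additionally reports the mean, and its quartiles can differ slightly from the hinges returned by `fivenum()`.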
HOW :question:
Bivariate
Qualitative
Joint frequency
two way table
Barplots/ bar chart
stacked
nested i.e. side by side
dot plots
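A sketch of a two-way (joint frequency) table and the two bar chart layouts in base R. Treating `cyl` and `gear` from the built-in `mtcars` dataset as qualitative variables is an illustrative assumption.

```r
# Two qualitative variables (illustrative choice)
cyl  <- factor(mtcars$cyl)
gear <- factor(mtcars$gear)

# Joint frequency (two-way) table
tab <- table(cyl, gear)
print(tab)
print(addmargins(tab))               # with row and column totals
row_prop <- prop.table(tab, margin = 1)  # proportions within each row

pdf(NULL)  # null device, so the plots also run non-interactively
barplot(tab, legend.text = TRUE)                 # stacked bars
barplot(tab, beside = TRUE, legend.text = TRUE)  # side-by-side bars
invisible(dev.off())
```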
Quantitative
2 variables
Predictor & outcome e.g. regression
Measures
covariance
correlation i.e. normalized covariance
a statistical measure that indicates the extent to which two or more variables fluctuate together.
note: correlation applies to both qualitative & quantitative variables
scatter plot
relationship pattern, strength
outliers
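The statement that correlation is normalized covariance can be checked directly. This sketch assumes Pearson correlation and uses two columns of the built-in `mtcars` dataset as an illustrative pair of variables.

```r
x <- mtcars$wt    # car weight (illustrative choice)
y <- mtcars$mpg   # miles per gallon

cov_xy <- cov(x, y)

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual  <- cov_xy / (sd(x) * sd(y))
r_builtin <- cor(x, y)

print(c(covariance = cov_xy, r_manual = r_manual, r_builtin = r_builtin))
```

Covariance depends on the units of both variables, while the correlation always lies between -1 and 1, which makes its strength comparable across pairs of variables.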
Qualitative & Quantitative
distribution
side by side box plots
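A sketch of comparing the distribution of a quantitative variable across the levels of a qualitative one, again assuming `mtcars` (`mpg` by `cyl`) purely for illustration.

```r
# Quantitative outcome (mpg) split by a qualitative variable (cyl)
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(group_medians)

pdf(NULL)  # null device, so the plot also runs non-interactively
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")
invisible(dev.off())

# bp$stats holds one column of box statistics per group; row 3 is the median
print(bp$stats)
```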
Multivariate
2 or more variables
any correlation?
which variables affect the outcome :!?:
e.g. multiple linear regression
Textual e.g. word cloud
word size proportional to its frequency in text
Univariate
single variable
Quantitative
i.e. continuous
univariate scatter plot
boxplot
histogram
density plot
Central tendency
mean, median, mode
box plot
scatter plot
histogram
mode, concentration of points
summarize data in a few numbers
dispersion
range, sd, variance
box plot
25th, 50th, 75th percentiles
mean optional
min & max, outliers
check symmetry
box & whiskers
location of median, i.e. line in the box
scatter plot
histogram
overall shape, skewness
variability
outliers
distribution
explore spread of data
frequency e.g. mode, distribution
Histogram
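A sketch of what a histogram and a density plot compute for a single quantitative variable. The built-in `faithful$eruptions` data is an illustrative assumption.

```r
x <- faithful$eruptions   # built-in dataset (illustrative choice)

h <- hist(x, plot = FALSE)   # compute the bins without drawing
print(h$breaks)              # bin edges
print(h$counts)              # observations per bin

# Bin with the highest frequency, i.e. where the points concentrate
modal_bin <- h$mids[which.max(h$counts)]

d <- density(x)              # kernel density estimate behind a density plot

pdf(NULL)  # null device, so the plots also run non-interactively
hist(x, main = "Histogram")
plot(d, main = "Density plot")
invisible(dev.off())
```

The bin counts show the overall shape and skewness; the choice of breaks can change that impression, so it is worth trying more than one.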
Qualitative
i.e. nominal
ordinal
frequency
table
bar chart
dot plot
same interpretation as bar chart
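A sketch of a frequency table for a single qualitative variable, with the matching bar chart and dot plot. The `sizes` data is made up; declaring it an ordered factor is one way to represent an ordinal variable in R.

```r
sizes <- c("small", "large", "medium", "small", "medium", "small")  # made-up data
sizes <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

freq <- table(sizes)       # absolute frequencies, in level order
rel  <- prop.table(freq)   # relative frequencies
print(freq)
print(rel)

pdf(NULL)  # null device, so the plots also run non-interactively
barplot(freq)                                       # bar chart
dotchart(as.numeric(freq), labels = names(freq))    # dot plot of the same counts
invisible(dev.off())
```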
ANALYTICAL DESIGN
6 principles
:one: Show comparisons, contrasts, differences :apple::green_apple:
compare results of control group vs. test group
H0 vs. H1
null hypothesis vs. alternative hypothesis
I am what is default, the status quo. I'm already accepted, can only be rejected. The burden of proof is on the alternative. I am H0
e.g. side-by-side boxplots
:two: show causality, mechanism, explanation, systematic structure :thunder_cloud_and_rain::umbrella_with_rain_drops:
explain why something happened
help form hypothesis
help show relationship, if any e.g. A affects B
The second principle is to show causality or a mechanism of how your theory of the data works. This explanation or systematic structure shows your causal framework for thinking about the question you're trying to answer (source: swirl)
Dual Boxplot
change in symptom-free days for both groups (left) and the change in PM2.5 in both groups (right).
Detailed explanation + diagram
By showing the two sets of boxplots side by side you're explaining your theory of why the air cleaner increases the number of symptom-free days
The mechanism this graph implies is that the air cleaner reduces pollution
Showing that the air cleaner improves breathing for children with asthma is not enough (that is only a comparison)
It’s fairly well-understood that inhaling fine particles can exacerbate asthma symptoms, so it stands to reason that reducing their presence in the air should improve asthma symptoms.
Therefore, we’d expect that the group receiving the air cleaners should on average see a decrease in airborne particles.
In this case we are tracking fine particulate matter, also called PM2.5 which stands for particulate matter less than or equal to 2.5 microns in aerodynamic diameter.
e.g. To demonstrate a causative mechanism underlying a correlation
:three: Show multivariate data
real world multivariate
uncover unexpected relationship
e.g. identify confounding variable
Confounding Variable
unaccounted variable
hidden effect on dependent variable
e.g. can suggest correlation when in fact there isn't
causes 2 problems if not detected
introduces bias
increases variance
Simpson's Paradox
is a paradox in probability and statistics in which a trend that appears in different groups of data disappears or reverses when these groups are combined.
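A minimal sketch of Simpson's paradox on made-up data: each group shows an increasing trend, yet the pooled data shows a decreasing one. The data frames and the grouping variable are hypothetical.

```r
# Made-up data: two groups, each with a perfectly increasing trend
a <- data.frame(group = "A", x = 1:5,  y = (1:5) + 10)
b <- data.frame(group = "B", x = 6:10, y = (6:10) - 5)
d <- rbind(a, b)

# Within each group, x and y rise together
r_a <- cor(a$x, a$y)
r_b <- cor(b$x, b$y)

# Pooled over both groups, the direction of the trend reverses
r_all <- cor(d$x, d$y)

print(c(A = r_a, B = r_b, combined = r_all))
```

Here `group` plays the role of the confounding variable: ignoring it changes the sign of the apparent relationship, which is why multivariate views of the data matter.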
:four: Integration of Evidence
Analysis drives tool selection
use many modes of data presentation to highlight/display evidence
Enrich with words, images, numbers, diagrams, other statistical methods
integrate different kinds of evidence to enhance results, provide clarity
:five: Describe with appropriate labels, title, scales, sources
add credibility
:six: Content is :!!:
Depends on content
quality
relevance
integrity
e.g.
poor data or poor question :arrow_right: irrelevant evidence, lack clarity, information not useful despite advanced analytical design/graph
https://bookdown.org/rdpeng/exdata/principles-of-analytic-graphics.html
~similar concept
Edward Tufte
the principles are derived from principles of analytical thinking
the purpose of evidence presentation is to assist thinking
Refer to
STATISTICS notes