STATISTICS
FOR DATA SCIENCE
What :question:
Definition
Statistical Analysis is the process of generating statistics from stored data & analyzing the results to deduce and infer meaning about the underlying dataset, or the reality that it attempts to describe
one of the core analytical skills in DS
for scientific problem solving
Type
DESCRIPTIVE (EDA)
Organize
Summarize
Simplify
Describe & present data
example
which study strategies students use
no. of students who experienced stress
INFERENTIAL
Generalise sample to population
Hypothesis testing
make predictions
example
whether one group of students experiences more stress than another
assess relationship between variables
7 principles
Stats permeate DS
:one: Sample & Population
:two: Sample Statistics
:three: Bootstrap
:four: Outlier
:five: Statistical Model
:six: Confounding & other factors
:seven: p-values
:one: Sample & Population
population
Collection
of all subjects/objects of interest
real world issue
no access to population data
so, use sample
Sample
subset
of population
used to
infer about the characteristics
of population
challenge :!?:
is the result from the sample good enough :question:
:+1: 98% CI
if not good, how big should the sample size be :question:
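One way to judge whether a sample result is "good enough" is to attach a confidence interval to it. The sketch below assumes a normal approximation for the sample mean (a t-based interval would be somewhat wider for small samples). The function name `mean_confidence_interval` and the stress scores are hypothetical, made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.98):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / sqrt(n)               # estimated standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return m - z * se, m + z * se

# Hypothetical stress scores from 12 students (illustrative data only)
scores = [4, 7, 6, 5, 8, 6, 7, 5, 6, 9, 4, 6]
low, high = mean_confidence_interval(scores)
print(f"98% CI for the mean: ({low:.2f}, {high:.2f})")
```

A wider interval than is acceptable suggests the sample is too small: the standard error shrinks in proportion to 1/sqrt(n), so halving the width needs roughly four times as many cases.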
:two: Sample Statistics
Characteristics
of sample
e.g.
sample variance
sample proportion
sample mean
:dart:
provide
estimates
of
population parameters
e.g. population mean
H0 & H1 are statements of population parameters
assumption
population parameter estimation
the distribution of the variable of interest is adequately described by a distribution with one or more (unknown) parameters
Reliability
sample size,
n
no. of cases in a sample
sampling dist.
collection
of sample statistics
from all trials
large # of trials
What is sampling dist?
to know the frequency distribution of a sample statistic, e.g. the sample mean, across all of the random samples (each sample is of size n)
mean
the sampling distribution of the sample mean approaches a normal dist. (bell shaped) if n is large
closer to TRUTH
The Central Limit Theorem states that the sampling distribution of the sample means will approach a normal distribution as the sample size increases.
the CLT holds regardless of the shape of the population distribution
a normal dist. is symmetrical, with mean = median = mode
:+1: better if it's closer to the truth
larger n
sampling dist of sample mean, KHAN
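The Central Limit Theorem can be checked with a small simulation. This sketch assumes an exponential population (strongly right-skewed, mean 1) purely for illustration; `sample_means` is a hypothetical helper, not something from the notes. As n grows, the mean and median of the sample means move together, which is what a symmetric, bell-shaped distribution looks like.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, trials=2000):
    """Means of `trials` random samples of size n from a skewed population."""
    # Exponential population with mean 1: strongly right-skewed
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

for n in (2, 30, 200):
    means = sample_means(n)
    # In a symmetric distribution the mean and the median coincide
    print(n, round(mean(means), 3), round(median(means), 3))
```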
shape
of sampling dist
skewness
nothing abnormal about other distribution shapes
standard error
sd of the sampling distribution
:eye: width of sampling dist.
(hist/density plot)
sampling distribution of sample mean with different n sample sizes
A :black_large_square: larger n yields a :black_small_square: smaller standard error of the sampling dist.
:+1: the smaller the standard error, the more reliable the estimate
larger n
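The link between n and the standard error can be sketched by simulation. This assumes a normal population with mean 50 and standard deviation 10 (both made-up values) and compares the observed spread of the sample means with the textbook value sigma / sqrt(n). `empirical_standard_error` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
SIGMA = 10.0  # assumed population standard deviation

def empirical_standard_error(n, trials=2000):
    """Standard deviation of the sample mean across many samples of size n."""
    means = [mean(random.gauss(50.0, SIGMA) for _ in range(n)) for _ in range(trials)]
    return stdev(means)

for n in (10, 40, 160):
    # Simulated standard error next to the theoretical sigma / sqrt(n)
    print(n, round(empirical_standard_error(n), 2), round(SIGMA / sqrt(n), 2))
```

Each fourfold increase in n should roughly halve the standard error.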
:three: Bootstrap
Bootstrap
define
a type of resampling: a large # of samples of the same sample size n are repeatedly drawn from the original sample, instead of drawing random samples of size n from the population (i.e. normal sampling)
random sampling from the sample with replacement
based on
theorem: law of large numbers
repeated sampling yields data that approximate the true population data
sample mean, variance & SD converge to the respective population parameters
describes the results of performing the same experiment a large # of times
:+1:
approximate sampling dist. w/o access to population
think of it as sampling from the population, but with duplicates allowed
resampling allows DUPLICATES of data points/observations, but each bootstrap sample is the SAME SIZE
procedure
resample the data (with replacement) x times
compute the summary statistic for each resample
Estimate the standard error for the bootstrap statistic using the standard deviation of the bootstrap distribution.
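The procedure above can be sketched in a few lines. The function name `bootstrap_standard_error`, the default of 2000 resamples and the sample values are assumptions for illustration only.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

def bootstrap_standard_error(sample, statistic=mean, n_resamples=2000):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    n = len(sample)
    boot_stats = []
    for _ in range(n_resamples):
        resample = random.choices(sample, k=n)  # same size n, duplicates allowed
        boot_stats.append(statistic(resample))
    return stdev(boot_stats)  # SD of the bootstrap distribution

# Hypothetical original sample (illustrative data only)
original = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
print(round(bootstrap_standard_error(original), 2))
```

Passing a different `statistic`, for example `statistics.median`, gives a standard error for that statistic without needing a formula for it.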
:four: Outlier
An observation that lies an abnormal distance from other values in a random sample
treat
treatment depends on cause
remove
never drop unless there is a clear rationale
dropped outliers need to be reported
change or bring value into range
fix error or irregular outliers
detect
more data is helpful for detecting extreme outliers
fence distance = 1.5 x IQR
lower fence = Q1 - 1.5 x IQR
upper fence = Q3 + 1.5 x IQR
large outlier: data point > upper fence
small outlier: data point < lower fence
IQR: interquartile range = quartile 3 - quartile 1 (Q3 - Q1)
box plot, scatter plot
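A minimal sketch of the fence rule, assuming the "inclusive" quartile method from Python's `statistics` module (quartile conventions differ between tools, so the fences can shift slightly). `iqr_outliers` and the income figures are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return (small_outliers, large_outliers) using the 1.5 x IQR fences."""
    # Assumption: quartiles computed with the 'inclusive' method
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    small = [x for x in data if x < lower_fence]
    large = [x for x in data if x > upper_fence]
    return small, large

# Hypothetical household incomes (in thousands); one extremely rich family
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 400]
print(iqr_outliers(incomes))
```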
:+1:
often can tell interesting things
may yield important insights
causes
typographical error
problem with experimental protocol
special circumstances, e.g. extremely rich families in household income data
:five: Statistical Model
:question:
a special class of mathematical model
non-deterministic
compared with math model
some variables have no specific value
they are described by probability distributions instead
stochastic variables i.e. randomly determined
linear model lm(_), one of the most widely used statistical models
:+1:
embodies a set of assumptions concerning the generation of sample data and similar data from a larger population
a way to model relationship between variables
:dart:
predictions
extract info.
describe stochastic structures
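To make the idea of a statistical model concrete, here is a sketch of a simple linear model y = intercept + slope * x + error fitted by ordinary least squares. It is written in plain Python rather than R's `lm`; `fit_line` and the simulated data (true intercept 2.0, true slope 0.5) are assumptions for illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

def fit_line(xs, ys):
    """Ordinary least squares estimates for y = intercept + slope * x + error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: the gauss() error term is the stochastic part of the model
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print("prediction at x=4:", round(intercept + slope * 4, 2))
```

The fitted values recover the assumed parameters only approximately, because the random error term makes the model non-deterministic.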
:six: Confounding & other factors
:question:
unaccounted variable
hidden effect on dependent variable
e.g. can suggest a relationship between variables when in fact there isn't one
https://explorable.com/confounding-variables
extraneous variable in
statistical model
correlates directly or inversely with both the dependent & independent variables
estimates fail to take the confounding factor into account
:-1:
cause 2 problems,
if not detected
Introduce Bias
increase variance
:-1: distort observed relationship/association
correlation does not imply causation.
maybe sometimes
Strategies :hammer_and_wrench:
reduce confounding
Randomization
random distribution of confounders between study groups
Matching
equal distribution of confounders, of individuals or groups
Stratification
confounders are distributed evenly within each stratum
i.e. grouping - divide the variable into groups (see example lecture notes slide 46)
Adjustment
usually distorted by choice of standard
Multivariate Analysis
only works if confounders identified and measured
e.g. multiple regression
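A small simulation can show both the distortion and one of the strategies (stratification). In this sketch a hypothetical binary confounder z drives both x and y, while x has no direct effect on y. All names and effect sizes are made up for illustration.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: z drives both x and y; x has no direct effect on y
rows = []
for _ in range(4000):
    z = random.choice([0, 1])          # confounder
    x = 3 * z + random.gauss(0, 1)     # independent variable
    y = 3 * z + random.gauss(0, 1)     # dependent variable
    rows.append((z, x, y))

overall = pearson([r[1] for r in rows], [r[2] for r in rows])
print("ignoring z:", round(overall, 2))

# Stratification: look at the x-y association within each level of z
within = {}
for level in (0, 1):
    stratum = [r for r in rows if r[0] == level]
    within[level] = pearson([r[1] for r in stratum], [r[2] for r in stratum])
    print(f"within z={level}:", round(within[level], 2))
```

Ignoring z gives a strong correlation between x and y. Within each stratum of z the correlation is close to zero, which is the association once the confounder is held fixed.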
:seven: p-values
probability values
statistical significance testing
i.e. probability of obtaining a result at least as extreme as the one observed, assuming H0 is true
ways
Confidence level is 95%
Type 1 error rate is 0.05
the error of rejecting the true null Hypothesis, H0
false positive finding
Link
I am what is default, the status quo. I'm already accepted, can only be rejected. The burden of proof is on the alternative. I am H0
finding is significant at the 0.05 level
the alpha level is 0.05
the P-value threshold is 0.05
:+1: p lies between 0 and 1
P-value indicates how likely a result at least as extreme as the observed one would be if H0 were true
:+1::skin-tone-5: the smaller the P-value, the stronger the evidence against H0
a small P-value indicates the result would be unlikely to occur by chance alone if H0 were true.
result is
statistically significant
Result is statistically significant if p-value <= significance level
P-value is a measure of the strength of evidence
against H0
helps draw a conclusion about the hypothesis
P > 0.05
weak evidence against H0
can't reject H0
P <= 0.05
strong evidence against H0
can reject H0
P = 0.05
marginal value
possible either way
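The decision rule above needs a p-value to compare against 0.05. One way to see what a p-value measures is a permutation test: shuffle the group labels many times and count how often a difference at least as extreme as the observed one appears when H0 (no group difference) holds. This is a sketch under that assumption; `permutation_p_value` and the scores are hypothetical.

```python
import random
from statistics import mean

random.seed(11)  # fixed seed so the sketch is repeatable

def permutation_p_value(group_a, group_b, n_permutations=5000):
    """Two-sided p-value for a difference in means under H0: no group difference."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        random.shuffle(pooled)  # under H0 the group labels are exchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical stress scores for two groups of students
group_1 = [6, 7, 8, 7, 9, 8, 7, 8]
group_2 = [5, 6, 5, 7, 6, 5, 6, 6]
p = permutation_p_value(group_1, group_2)
alpha = 0.05
print(p, "reject H0" if p <= alpha else "cannot reject H0")
```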
to answer: how much confidence can be placed in the outcome of a hypothesis test :question:
6 principles
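p-values can indicate how incompatible the data are with a specified statistical model
p-values do not measure the probability that the hypothesis is true, or the probability that the data were produced by random chance alone
scientific conclusions and business or policy decisions should not be based solely on whether a p-value passes a threshold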
proper inference requires full reporting & transparency
American Statistical Association
p-value/statistical significance does not measure the size of an effect or the importance of a result
by itself, a p-value is not a good measure of evidence regarding a model or hypothesis
widely used
t-test
regression analysis
A/B Test
enables accurate quantification of the effect size & errors
calculate P(type I or II error)
Type I error
falsely concluding that the intervention is successful
Reject the true H0
false positive findings
wrongly classifying a non-event as an event
Type II error
falsely concluding that intervention wasn't successful
fail to reject H0 that is false
false negative result
wrongly classifying an event as a non-event
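Both error rates can be estimated by simulating many experiments. The sketch assumes a two-sided z-test with known sigma, samples of size 25, and a true effect of 0.3 when H0 is false. All of these values and function names are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(5)  # fixed seed so the sketch is repeatable
ALPHA = 0.05

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=25, trials=2000):
    """Fraction of simulated experiments in which H0 is rejected at ALPHA."""
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_p_value(sample, mu0, sigma) <= ALPHA:
            rejections += 1
    return rejections / trials

# H0 true: every rejection is a false positive (Type I error)
type_1_rate = rejection_rate(true_mean=0.0)
# H0 false: every non-rejection is a false negative (Type II error)
type_2_rate = 1 - rejection_rate(true_mean=0.3)
print(type_1_rate, type_2_rate)
```

The Type I rate should land near ALPHA by construction. The Type II rate depends on the effect size and on n, which is why sample size planning matters for A/B tests.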
2 sample hypothesis testing
randomized controlled experiments with two variants A & B
A control
B Variation
a statistical hypothesis test is then used to determine the significance of the result
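For an A/B test on conversion counts, one common choice is a two-proportion z-test. The sketch below assumes large samples (normal approximation) and uses made-up counts. `two_proportion_z_test` is a hypothetical helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for H0: both variants have the same conversion rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)  # rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Hypothetical counts: A is the control, B is the variation
effect, z, p_value = two_proportion_z_test(success_a=200, n_a=2000,
                                           success_b=250, n_b=2000)
print(f"effect={effect:.3f}, z={z:.2f}, p={p_value:.4f}")
```

The function returns the effect size (difference in rates) alongside the p-value, since the p-value alone says nothing about how large the difference is.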
Excluded statistical learning slide