Data Science
Hypothesis Testing
steps in DATA-DRIVEN decision making
Formulate a hypothesis
Find the right test
Execute the test
Make a decision
what is a HYPOTHESIS
"A hypothesis is an idea that can be tested"
Null and Alternative
Null Hypothesis (H0)
is the statement we are trying to reject
Alternative Hypothesis(H1)
is the claim we test against the null: the null is the present state of affairs, while the alternative is our personal opinion
terms
Significance Level
denoted as "α"
df: The probability of rejecting the null hypothesis, if it is true
Typical values
0.01
※0.05
0.1
Error type
Type I error
Reject a true null hypothesis
probability : α
which means that the responsibility of making this error lies solely on you, because the α is chosen by you
Type II error
Fail to reject (accept) a false null hypothesis
probability : β
decided by sample size and variance
Rejecting a false null hypothesis
probability : 1 - β
aka. The power of the test
example
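One possible example of α, β and power, sketched for a one-sided z-test with known σ. The effect size `delta`, `sigma` and `n` below are illustrative assumptions, not values from these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test of H0: mu = mu0 vs H1: mu = mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # reject H0 when the z statistic exceeds this
    shift = delta * sqrt(n) / sigma         # mean of the z statistic when H1 is true
    return 1 - std_normal.cdf(z_crit - shift)

alpha = 0.05                                # chosen by us: P(Type I error)
power = z_test_power(delta=0.5, sigma=1.0, n=30, alpha=alpha)
beta = 1 - power                            # P(Type II error)
print(f"alpha = {alpha}, beta = {beta:.3f}, power = {power:.3f}")
```

A larger sample size or a smaller variance raises the power, which matches the note that β is decided by sample size and variance.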
Probability
Definition : the likelihood of an event occurring
Category
Experimental Probabilities
a good approximation of the true probabilities
P(A) = successful trials / all trials
Theoretical(True) Probabilities
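A small sketch of experimental vs theoretical probability, assuming a fair six-sided die and a fixed random seed (both are illustrative choices):

```python
import random

def experimental_probability(event, trials, seed=0):
    """P(A) = successful trials / all trials, estimated by simulated die rolls."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return successes / trials

p_six = experimental_probability(lambda roll: roll == 6, trials=100_000)
theoretical = 1 / 6  # true probability for a fair die
print(f"experimental = {p_six:.4f}, theoretical = {theoretical:.4f}")
```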
Probability frequency distribution
Definition:a collection of the probabilities for each possible outcome
Probability distribution
Important nouns & notations
nouns
Distribution
The possible values a variable can take and how frequently they occur
notations
Y
The actual outcome of an event
P(Y=y) = p(y) (probability function)
y
One of the possible outcomes
x̄ (x with a bar)
Sample mean
s^2
Sample variance
σ
Standard deviation (square root of variance)
σ^2
Variance
how spread out the data is
μ
Mean
average value
Population vs Sample
Population data
"all the data"
Sample data
"just part of it"
ex: An entire department of a company plans a trip to Australia. The "entire department" is the 'Population of the department', and a 'Sample of the whole company'
Types of probability distribution
Discrete distribution
Uniform distribution
All outcomes are equally likely (such outcomes are called equiprobable)
ex: rolling a die
X~U(3,7)
to read the statement: variable X follows a discrete uniform distribution ranging from 3 to 7
drawbacks(because all outcomes have same prob.)
The expected value provides us no relevant info
Mean and Variance are hard to interpret: there is no real intuition behind what they mean
so --> no predictive power
Bernoulli Distribution
Events with only two possible outcomes (1 trial, 2 possible outcomes)
ex: True or False
X~Bern(p)
read the statement: variable X follows Bernoulli Distribution with a probability of success equal to p
Values
σ^2 = p(1-p)
E(X) = p
Binomial Distribution
Carrying out a similar experiment several times in a row
Two outcomes per iteration, with many iterations
ex: flipping a coin
X~B(10,0.6)
read the statement: variable X follows a binomial distribution with 10 trials and a likelihood of 0.6 to succeed
Bern(p) = B(1,p)
Bern vs B
E(Bernoulli event) = which outcome we expect for a single trial
E(Binomial event) = The number of times we expect to get a specific outcome
ex: for a True and False quiz, Guessing 1 question = Bernoulli event, Guessing the entire quiz = Binomial event
Values
p(y) = C(n,y) * p^y * (1-p)^(n-y)
E(X) = x0*p(x0) + x1*p(x1) + ... + xn*p(xn)
Y~B(n,p) ; E(Y)=n*p
σ^2 = E(y^2) - E(y)^2 = n x p x (1-p)
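The binomial values above can be checked numerically. A minimal sketch using X~B(10,0.6) from the notes; the helper name `binomial_pmf` is made up for illustration:

```python
from math import comb

def binomial_pmf(y, n, p):
    """p(y) = C(n,y) * p^y * (1-p)^(n-y)"""
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 10, 0.6
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

# E(X) = sum of each outcome times its probability; Var = E(y^2) - E(y)^2
mean = sum(y * prob for y, prob in enumerate(pmf))
variance = sum(y**2 * prob for y, prob in enumerate(pmf)) - mean**2

print(f"E(Y) = {mean:.2f} (n*p = {n * p:.2f})")
print(f"Var(Y) = {variance:.2f} (n*p*(1-p) = {n * p * (1 - p):.2f})")
```

With n = 1 the same function reduces to the Bernoulli case, in line with Bern(p) = B(1,p).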
Poisson Distribution
Test out how unusual an event frequency is for a given interval
ex: Calculate the chance of LeBron James getting 12 points in the first quarter of his next game
Po(λ) Y~Po(4)
read the statement: variable Y follows a Poisson Distribution with lambda equal to 4
Values
P(Y=y) = λ^y * e^(-λ) / y!
e≒2.72
E(Y) = λ
Var(Y) = λ
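A minimal sketch of the Poisson formula for Y~Po(4). The choice y = 12 only echoes the example above; λ = 4 is the value from the notation line, not a real basketball statistic.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """P(Y=y) = lam^y * e^(-lam) / y!"""
    return lam**y * exp(-lam) / factorial(y)

lam = 4
# Truncate the infinite support at 60; the remaining tail is negligible for lam = 4.
probs = [poisson_pmf(y, lam) for y in range(60)]
mean = sum(y * p for y, p in enumerate(probs))
variance = sum(y**2 * p for y, p in enumerate(probs)) - mean**2
p_12 = poisson_pmf(12, lam)  # how unusual is observing 12 when 4 is expected?

print(f"E(Y) = {mean:.3f}, Var(Y) = {variance:.3f}, P(Y=12) = {p_12:.6f}")
```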
Characteristics
Have a finite number of outcomes
Can add up individual values to determine probability of an interval.
Continuous distribution
Characteristics
※The probability distribution would be a curve
Have infinitely many consecutive possible values
Can not add up the individual values that make up an interval because there are infinitely many of them
The probability of any individual value is equal to 0,
so P(x>X) = P(x>=X), because P(x=X) = 0
Has Cumulative Distribution Function(CDF)
Graph
F(-∞) = 0 , F(∞) = 1
PDF -- Integral→CDF, so the CDF is the area under the PDF
Probability of Intervals
The area under the density curve over an interval is the probability of that interval
ex: P(a<x<b) = ∫ from a to b of p(x) dx
can use WolframAlpha to calculate
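The interval probability can also be approximated in code. A sketch assuming a standard normal density and a simple trapezoidal rule (the `integrate` helper is illustrative):

```python
from statistics import NormalDist

def integrate(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

dist = NormalDist(mu=0, sigma=1)
a, b = -1.0, 2.0
area = integrate(dist.pdf, a, b)        # integral of the PDF from a to b
via_cdf = dist.cdf(b) - dist.cdf(a)     # same probability from the CDF
print(f"integral = {area:.6f}, F(b) - F(a) = {via_cdf:.6f}")
```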
Normal Distribution
X~N(μ,σ^2)
read the statement: The variable X follows a normal distribution with mean μ and variance σ squared
characteristics
its graph is a bell-shaped curve, symmetric, and has tails
mean = median = mode
has no skew
Values
E(X) = μ
Var(X) = σ^2 (not usually given) = E(X^2) - E(X)^2
Standardization
E(X) = 0, Var(X) = 1
follows the 68,95,99.7 Rule
Z~N(0,1)
Z = (X - μ) / σ
this will drive the standard deviation of the new data set to 1
Y ~ N(μ, σ^2)
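Standardization can be sketched as follows; the dataset is made up for illustration, and the 68, 95, 99.7 rule is checked against the standard normal CDF:

```python
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Apply Z = (X - mu) / sigma to every value."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset
z_scores = standardize(data)
print(f"mean = {mean(z_scores):.3f}, std = {pstdev(z_scores):.3f}")

# Share of a standard normal within 1, 2 and 3 standard deviations of the mean
z = NormalDist()
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
print(within)
```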
drawbacks
Useful when we have a Normal Distribution
Not always the case
It requires a lot of data
Risk of outliers drastically affecting our analysis
Student's-T Distribution
Characteristics
accommodates extreme values significantly better than N.D.
Because any extreme value represents a much bigger part of the population
A small sample approximation of a Normal Distribution
Y~t(k)
read the statement: variable Y follows a Student's t distribution with k degrees of freedom
features and characteristics
Values
k : degrees of freedom
If k > 2
E(Y) = μ
Var(Y) = s^2 * k / (k - 2)
Application
Hypothesis testing with limited data
CDF table(T-table)
Independent Sample T test
timing
two sample t-test: comparing if there is a difference between the means of the two samples
ex: test whether dairy cows fed brand A feed and dairy cows fed brand B feed differ in their average quarterly milk production
Assumptions
1.Dependent variables should be continuous variables and random variables
2.The population of the dependent variables should follow normal distribution
3.The samples should be independent of each other
4.Variance: The variances of the two samples should be equal (homogeneity of variance).
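Under the equal-variance assumption, the two-sample t statistic can be sketched with a pooled variance. The milk yields below are invented numbers, and the resulting t value would still have to be compared against a T-table for the given degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal population variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_error, dof

feed_a = [21, 22, 23, 24, 25]  # hypothetical yields, brand A
feed_b = [22, 24, 26, 28, 30]  # hypothetical yields, brand B
t_stat, dof = pooled_t_statistic(feed_a, feed_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```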
Paired sample T test
timing
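Compare whether the means differ for 1. two paired samples or 2. repeated measurements of a single sample
1.Two paired samples
ex: among a group of married couples, test whether the annual incomes of husbands and wives differ
2.Repeated measurements
ex: for a group of people in a weight-loss trial, test whether their weight before the trial differs from their weight after 3 months of regular exercise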
Assumptions
1.Dependent variables should be continuous variables and random variables
2.The population of the dependent variables should follow normal distribution
3.The samples should be dependent, which means that the observations in the two samples are paired and affect each other
4.Variance: The paired differences should be approximately normally distributed with a constant variance
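For paired samples the test works on the per-pair differences. A minimal sketch with invented before/after weights:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t statistic of the mean paired difference against zero."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1  # statistic and degrees of freedom

before = [80, 75, 90, 85]  # hypothetical weights before the trial
after = [78, 72, 88, 80]   # hypothetical weights after 3 months
t_stat, dof = paired_t_statistic(before, after)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```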
Chi-Squared Distribution
Characteristics
Only consists of non-negative values
Asymmetric
A Chi-squared Distribution with k degrees of freedom is the sum of the squares of k independent standard normal variables
Y ~ χ^2(k)
read the statement: variable Y follows a chi square distribution with k degrees of freedom
Application
statistical analysis
Hypothesis testing
Computing confidence intervals
Few real-life events follow it directly
Goodness of fit of categorical values
Used to test the association between two categorical variables
Values
E(X) = k
Var(X) = 2k
Assumptions
1.All the variables are categorical variables
2.The samples should be independent, which means that sample A does not have an impact on sample B
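The association test between two categorical variables can be sketched as a chi-squared statistic on a contingency table. The counts are made up, and the p-value lookup is left to a chi-squared table.

```python
def chi_squared_statistic(table):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

observed = [[10, 20],   # hypothetical counts: group A, outcome yes / no
            [20, 10]]   # hypothetical counts: group B, outcome yes / no
chi2, dof = chi_squared_statistic(observed)
print(f"chi-squared = {chi2:.4f}, degrees of freedom = {dof}")
```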
Exponential Distribution
Characteristics
for Events that are rapidly changing early on
variables follow this E.D. with a probability that initially decreases before eventually plateauing
Models the time interval between independent random events
ex: The number of views for a YouTube blog
X ~ Exp(λ)
read the statement: variable X follows an exponential distribution with a rate of λ
Rate parameter : λ
How fast the CDF/PDF curve reaches the point of plateauing
How spread out the graph is
The number of times the event occurs per unit of time
graph of PDF/CDF
The PDF plateaus around 0, the CDF plateaus at 1
Values
E(Y) = 1 / λ
Var(Y) = 1 / λ^2
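A short sketch of the exponential PDF and CDF and their plateaus, assuming an arbitrary rate of λ = 2:

```python
from math import exp, log

def exp_pdf(x, lam):
    """PDF of Exp(lam): lam * e^(-lam * x) for x >= 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exp_cdf(x, lam):
    """CDF of Exp(lam): 1 - e^(-lam * x) for x >= 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

lam = 2.0                    # illustrative rate parameter
mean = 1 / lam               # E(Y) = 1 / lam
variance = 1 / lam**2        # Var(Y) = 1 / lam^2
median = log(2) / lam        # point where the CDF reaches 0.5

for x in (0, 0.5, 1, 2, 5):
    print(f"x = {x}: pdf = {exp_pdf(x, lam):.4f}, cdf = {exp_cdf(x, lam):.4f}")
```

A larger λ makes both curves reach their plateau sooner.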
drawbacks
No table of known values (ex: Z-table ...)
Transformation
take the natural logarithm of every element of an exponentially distributed dataset; the result is sometimes treated as approximately normal (it is not exactly normal)
Y~Exp(λ) & X = ln(Y)
X ≈ N(μ,σ^2) (approximation only)
Logistic Distribution
Useful in forecast analysis
Useful for determining a cut-off point for a successful outcome
ex: Often used in sports to anticipate how a player's or team's performance can determine the outcome of the match
Y~Logistic(μ,S)
read the statement: variable Y follows a logistic distribution with location μ and a scale of S
Values
E(Y) = μ
Var(Y) = s^2 * π^2 / 3
graph
Inferential Statistics
Sampling Distribution (of the mean)
Central Limit Theorem
When n > 30 (a common rule of thumb), the Sampling Distribution will approximate the Normal Distribution
why so important ?
Decisions based on Normal Distribution insights have a good track record
allows us to perform tests, solve problems and make inferences using the Normal Distribution, even if the population is not normally distributed.
df: a distribution formed by the means of many samples
characteristics
No matter the underlying distribution, the sampling distribution approximates a Normal Distribution
Values
N(μ, σ^2 / n)
Which matches Central Limit Theorem, the bigger the sample size you draw, the more accurate the results you will get
Standard error
df: is the standard deviation of the sampling distribution
= σ / √n (the square root of the variance σ^2 / n)
shows how well you approximated the true mean
Likewise, the bigger the sample size, the better the approximation
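The Central Limit Theorem and the standard error can be illustrated by simulation. A sketch assuming an exponential population with μ = 1 and σ = 1, a fixed seed, and arbitrary sample counts:

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(42)
n = 50              # size of each sample
num_samples = 2000  # how many samples we draw

# The population is Exp(1): clearly not normal, with mu = 1 and sigma = 1
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_se = pstdev(sample_means)  # spread of the sampling distribution
theoretical_se = 1.0 / sqrt(n)      # sigma / sqrt(n)
print(f"mean of sample means = {mean(sample_means):.3f}")
print(f"observed SE = {observed_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
```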
Estimates and Estimators
point estimates
is in the middle of the confidence interval
ex : x̄ for μ & s^2 for σ^2
the example of"Estimator, Parameter and Estimate"
confidence intervals estimates
provides much more information and are preferred when making inferences
Bias and Efficiency
Bias
Efficiency
less variability
Confidence Interval
values
Confidence Level
1 - α
0 ≦ α ≦ 1
ex: 95% C.I. , then α will be 5%
Confidence Interval
95% C.I.
means that in 95% of the cases (repeated samples), an interval built this way would contain the true population parameter.
margin of error
(z or t score) * standard error
df: is the range within which you expect the population parameter to be
two conditions
σ known
Normal Distribution
σ unknown
Student's T Distribution
C.I. : x̄ ± t(n-1, α/2) * s / √n, where n-1 is the degrees of freedom
is wider than in the σ known condition, which makes sense because there is more uncertainty when σ is unknown.
assume that the population is normally distributed, and we only have a small sample size.
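A minimal sketch of the σ unknown interval. The sample is invented, and the critical value 2.365 (df = 7, α/2 = 0.025) is taken from a standard T-table rather than computed, since the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(sample, t_critical):
    """Return (low, high) for the mean: x_bar +/- t * s / sqrt(n)."""
    n = len(sample)
    margin = t_critical * stdev(sample) / sqrt(n)  # margin of error
    center = mean(sample)                          # point estimate
    return center - margin, center + margin

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, n = 8
T_CRIT_95_DF7 = 2.365              # T-table value for df = 7, alpha/2 = 0.025
low, high = t_confidence_interval(sample, T_CRIT_95_DF7)
print(f"95% C.I. for the mean: ({low:.3f}, {high:.3f})")
```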
Samples (with two means)
Dependent Samples
before and after situation
cause and effect
Independent Samples
known population variances
unknown population variances but assumed to be equal
ex: the difference of the price of apples in New York and LA
what u should know
unknown population variances but assumed to be different
Combinatorics
Restrictions
repetition
order
other
Integral parts
Permutations (written as P in Taiwanese math courses)
Definition:The number of different possible ways we can arrange a set of elements
ex: (A,B,C) , (A,C,B) , (B,A,C) , (B,C,A) , (C,A,B) , (C,B,A)
※The order of the elements is crucial
Variations(V)
Variations with repetition
V^n_p = n^p
n = the total number of elements, we have available
p = the number of positions we need to fill
df: The total number of ways we can pick and arrange some elements of a given set
Variations without repetition
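※The order of the elements is crucial
Combinations (written as C in Taiwanese math courses)
The number of combinations equals the number of variations, over the number of permutations
C^n_p = [n! / (n-p)!] / p! = n! / (p! * (n-p)!), where n! / (n-p)! is the number of variations without repetition and p! is the number of permutations of the p chosen elements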
Factorials
n! = 1x2x3x...xn
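The counting formulas can be checked with Python's `math` module. The values n = 5 and p = 3 are arbitrary:

```python
from math import comb, factorial, perm

n, p = 5, 3  # 5 available elements, 3 positions to fill

permutations = factorial(n)            # ways to arrange all n elements
variations_with_rep = n ** p           # order matters, repetition allowed
variations_without_rep = perm(n, p)    # order matters, no repetition: n! / (n-p)!
combinations = comb(n, p)              # order does not matter: n! / (p! * (n-p)!)

# combinations = variations without repetition / permutations of the p picks
print(combinations, variations_without_rep // factorial(p))
```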
Descriptive Statistics
Population and Sample
Sample
is a subset of the population; its size is denoted as "n"
characteristics
Randomness
a random sample is collected when each member of the sample is chosen from the population strictly by chance
Representativeness
a representative sample is a subset of the population that accurately reflects the members of the entire population
Population
hard to find and observe in real life
is a collection of all items of interest; its size is denoted as "N"
data classification
types of data
categorical
df: describes categories or groups
ex:
Car brands, Yes and No question
Data Visualization(open S15_L17 excel)
Frequency distribution tables
Bar charts
Pie charts
Pareto diagrams
It shows how subtotals change with each additional category
numerical
discrete
can be counted in a finite number
ex:
number of children you want to have(it is impossible to have like 1.2 children), grades on SAT
continuous
is infinite and impossible to count exactly
ex: weight, because it can take an infinite number of values no matter how many digits there are after the decimal point; also area, distance and time. ex: the age of a building can be 35 years, and it can also be 35.9234 years
all of these can vary by infinitely small amounts
Data Visualization
Frequency Distribution
Histogram
the intervals (bins) can have different widths
Measurement levels
Qualitative
nominal
ex: car brands like Benz, BMW, Yaris...
are only used to classify the data.
ordinal
ex: from disgusting to delicious
consists of groups in categories which follow a strict order
Quantitative
interval
doesn't have a true 0
ratio
has a true 0
Central Tendency
Measures of central tendency
mean
drawbacks
is easily affected by outliers
mode
median
Measures of Asymmetry
skewness
df: indicates whether the data is concentrated on one side
tell us where the data is situated
positive and negative
positive
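the right tail is longer than the left tail, which means that most of the values are situated on the left side of the mean.
negative
the left tail is longer than the right tail, which means that most of the values are situated on the right side of the mean.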
Variability
Variance
for Population
sum((Xi - μ)^2) / N
for Sample
sum((Xi - x̄)^2) / (n - 1)
measures the dispersion of a set of data points around their mean
Coefficient of variation(CV)
for Population
σ / μ
for Sample
s / x̄
Comparing two or more datasets
Standard Deviation
is the most common measure of variability for a single dataset
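The population and sample versions of these measures map onto Python's `statistics` module. A sketch on a made-up dataset:

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

pop_var = pvariance(data)    # divides by N
samp_var = variance(data)    # divides by n - 1
cv_pop = pstdev(data) / mean(data)    # population coefficient of variation
cv_sample = stdev(data) / mean(data)  # sample coefficient of variation

print(f"population variance = {pop_var}, sample variance = {samp_var:.4f}")
print(f"CV (population) = {cv_pop:.3f}, CV (sample) = {cv_sample:.3f}")
```

Because the coefficient of variation has no unit, it is the measure suited to comparing two or more datasets.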
Covariance
the two variables are correlated and the main statistic to measure this correlation is called Covariance
formula
Correlation coefficient
Cov(x,y) / (Stdev(x) * Stdev(y))
Makes the covariance easier to judge intuitively, because the covariance value by itself does not carry much meaning
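A small sketch of why the correlation coefficient is easier to read than the covariance: rescaling one variable changes the covariance but not the correlation. The data and helper names are illustrative.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation coefficient: Cov(x,y) / (Stdev(x) * Stdev(y))."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
y_scaled = [v * 100 for v in y]  # same relationship, different unit

print(covariance(x, y), covariance(x, y_scaled))    # covariance depends on the unit
print(correlation(x, y), correlation(x, y_scaled))  # correlation stays the same
```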
Bayesian Inference
Sets and events
Events
each event has a set of outcomes (favorable outcomes)
ex: Event: even , set = 2,4,6,8...
Independence and Dependence
Independent events
The theoretical probability remains unaffected by other events
Dependent events
Probabilities of dependent events vary as conditions change
Conditional Probability
ex: Two events A&B
P(A|B) -- "P of A given B" = P(A∩B) / P(B) only if P(B)>0
The probability of getting A, if we are given that B has occurred
df: The likelihood of an event occurring assuming a different one has already happened
Bayes' Rule
P(A|B) = P(B|A) * P(A) / P(B)
it allows us to find a relationship between the different conditional probabilities of two events
it is very useful in real life, such as in medical research and hiring decisions
Multiplication Rule
P(A∩B) = P(B|A) x P(A)
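Bayes' Rule in a medical-test setting, as a sketch. All three input probabilities are invented for illustration; P(B) is expanded over "has the disease" and "does not have the disease".

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with A = disease and B = positive test."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

posterior = bayes_posterior(
    prior=0.01,                # hypothetical P(disease)
    likelihood=0.95,           # hypothetical P(positive | disease)
    false_positive_rate=0.05,  # hypothetical P(positive | no disease)
)
print(f"P(disease | positive) = {posterior:.4f}")
```

Even with an accurate test, the posterior stays low here because the prior is small.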
Sets
features
Sets : Upper-case / Elements : Lower-case. ex: set X is the even numbers, and the element 8 is one of the even numbers
ex: x(element) , A(set) | x ∈ A
can be expressed as "x is an element of A"
can be empty or have values
aka. null set, empty set
non-empty sets can be finite or infinite
don't have to be numerical
Subset
df: a set is fully contained in another set
※every non-empty set has at least 2 subsets
A⊆A
∅⊆A
Interaction ways
partially overlap(intersect)
Union
A combination of all outcomes preferred for either A or B
be careful of "DOUBLE COUNTING"
completely overlap
not touch
Mutually Exclusive Sets
Complement
pic:
https://revisionworld.com/sites/revisionworld.com/files/imce/complement.gif
Complement Set
All values that are part of the sample space, but not part of the set
features
Complements are always mutually exclusive
NOT all mutually exclusive sets are complements
Symbols
𝑥∈𝐴
“Element x is a part of set A.”
ex: 2 ∈ all even numbers
A∋𝑥
“Set A contains element x.”
ex: all even numbers ∋ 2
𝑥∉𝐴
"Element x is NOT a part of set A."
ex: 1 ∉ all even numbers
∀𝑥:
"For all /any x such that..."
ex: ∀x: x ∈ all even numbers
𝐴⊆𝐵
"A is a subset of B"
ex: even numbers ⊆ integers
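The set symbols above have direct counterparts in Python's built-in `set`. A sketch on a small, arbitrary sample space of the integers 1 to 10:

```python
sample_space = set(range(1, 11))
evens = {x for x in sample_space if x % 2 == 0}
multiples_of_3 = {3, 6, 9}

print(2 in evens)             # x ∈ A
print(1 not in evens)         # x ∉ A
print(evens <= sample_space)  # A ⊆ B
print(set() <= evens)         # ∅ ⊆ A

union = evens | multiples_of_3         # outcomes in either set
intersection = evens & multiples_of_3  # outcomes in both sets
complement = sample_space - evens      # in the sample space but not in the set

# Avoid double counting: |A ∪ B| = |A| + |B| - |A ∩ B|
print(len(union), len(evens) + len(multiples_of_3) - len(intersection))
```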
Analysis
Analysis of Variance(ANOVA)
timing
1.Compare whether the means of multiple (more than two) samples are equal.
2.One way: means that there's only one "x" (factor) in this ANOVA
Assumptions
1.The "x" should be a categorical variable, and the "y" should be a continuous variable.
2.The population should follow normal distribution
3.The samples should be independent of each other.
4.Homogeneity of variance: The variances of the samples should be equal.
other types
Repeated-measures ANOVA: When the samples are not independent.
Multi-factor ANOVA: When there are multiple "x" (independent variables) in the sample
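The one-way ANOVA F statistic can be sketched by comparing the variation between group means with the variation within groups. The three groups are invented, and the p-value lookup against an F-table is left out.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical "y" values per category of "x"
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.3f}, df = ({df_between}, {df_within})")
```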