Statistics - Coggle Diagram
Statistics
Analysis of Variance
1st case: Paired Sample
response: 1 interval; predictor: 2 nominal
Hypothesis
H0: (μ1 - μ2) = 0
Ha: (μ1 - μ2) > or < or != 0
t.test((data_column1 - data_column2), alternative = "two.sided")
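A minimal sketch of the paired case, assuming R's built-in `sleep` data stands in for `data_column1` and `data_column2` (the same 10 subjects measured twice). It suggests that a one-sample t-test on the differences gives the same result as `t.test` with `paired = TRUE`.

```r
# Paired measurements on the same 10 subjects (built-in `sleep` data, used here as an assumption).
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]

# One-sample t-test on the differences, H0: mean difference = 0.
res_diff <- t.test(x1 - x2, alternative = "two.sided")

# The same test written with paired = TRUE.
res_paired <- t.test(x1, x2, paired = TRUE, alternative = "two.sided")

print(res_diff)
print(c(diff = res_diff$p.value, paired = res_paired$p.value))
```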
2nd case:
independent groups, unequal variances
hypothesis:
H0: (μ1 - μ2) = (μ1 - μ2)0
Ha: (μ1 - μ2) [<, >, not equal] (μ1 - μ2)0
alternative hypothesis: true difference in means is not equal to 0
assumption:
1. random sampling
2. stability over time
3. if n1 < 30 or n2 < 30, X1 and X2 must each be normally distributed
t.test(response ~ groups, var.equal = FALSE, conf.level = 0.95)
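A short sketch of the unequal-variance (Welch) case, assuming the built-in `mtcars` data with `mpg` as the interval response and `am` as the two-level grouping variable. These column choices are illustrative only.

```r
# Welch two-sample t-test: independent groups, variances NOT assumed equal.
# mpg = response, am = group (0 = automatic, 1 = manual); both are assumptions for illustration.
res_welch <- t.test(mpg ~ am, data = mtcars,
                    alternative = "two.sided",
                    var.equal = FALSE,
                    conf.level = 0.95)

print(res_welch$method)     # names the test that was run
print(res_welch$conf.int)   # confidence interval for (mu1 - mu2)
print(res_welch$p.value)
```

With `var.equal = FALSE` the degrees of freedom are approximated from the two sample variances, so they are usually not a whole number.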
3rd case:
independent samples, same variance, special case of one way ANOVA
assumption:
random sampling
stability over time
if n1 < 30 or n2 < 30, X1 and X2 must each be normally distributed
var.equal = TRUE (conduct Levene's test to check)
parameter: relationship in process between variables
hypothesis:
H0: μ1 = μ2 = μ3 = ... = μk
Ha: at least two μi differ
if H0 is rejected and the normality check is also met: use Tukey's HSD test to find which pairs of means differ
TukeyHSD(fit, conf.level = ??) (conf.level is a proportion, e.g. 0.95 for 95%)
Levene's test (leveneTest() in the car package)
hypothesis:
H0: σ1 = σ2 = σ3 = ... = σk
Ha: at least one pair is not equal
Assume: 1. random sampling 2. stability over time (3. H0 is true)
sampling distribution- F distribution
p-value: when alpha is known and p-value <= alpha, reject H0
when alpha is unknown, report the p-value and its interpretation
statistical measure:
F-statistic = MSB (mean square between) / MSW (mean square within)
MSB (mean square between): SSB (sum of squares between) / dfB (degrees of freedom between)
MSW (mean square within): SSW (sum of squares within) / dfW (degrees of freedom within)
if the F-statistic is large and the p-value is small (below the pre-defined significance level) -> there is a significant difference between the group means
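The F-statistic above can be sketched by hand and cross-checked against `aov`. This assumes the built-in `PlantGrowth` data (`weight` by `group`, three groups) as a stand-in; `fit` is the fitted model later passed to `TukeyHSD`.

```r
# Built-in PlantGrowth data (an assumption for illustration):
# weight = interval response, group = nominal predictor with k = 3 levels.
y <- PlantGrowth$weight
g <- PlantGrowth$group

k <- nlevels(g)                 # number of groups
n <- length(y)                  # total sample size
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Sums of squares between and within groups.
ssb <- sum(group_sizes * (group_means - grand_mean)^2)
ssw <- sum((y - ave(y, g))^2)   # ave() returns each observation's group mean

msb <- ssb / (k - 1)            # dfB = k - 1
msw <- ssw / (n - k)            # dfW = n - k
f_stat  <- msb / msw
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against R's one-way ANOVA.
fit   <- aov(weight ~ group, data = PlantGrowth)
f_aov <- summary(fit)[[1]][["F value"]][1]
print(c(manual = f_stat, aov = f_aov, p = p_value))

# If H0 is rejected, Tukey's HSD indicates which pairs of means differ.
print(TukeyHSD(fit, conf.level = 0.95))
```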
two variables
response (dependent) - target variable of interest
predictor (independent)
Introduction
Sampling Methods
Simple random sampling
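Stratified sampling: divide the population into defined strata and sample from each stratum
Cluster sampling: divide the population into defined clusters, randomly select clusters, and sample the units within them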
Types of Statistics
descriptive
graphical display
stacked/ grouped bar plot
designed for 2 nominal (or ordinal) variables; a bar chart of two variables
histogram
designed for an interval variable; shows counts or percentages
bar plot
designed for a nominal or ordinal variable; if ordinal, be sure to order the values on the x-axis; a bar chart of one variable
scatter plot
shows the relationship between 2 interval variables (quiz: relationship between Day of the week and Predicted weather?)
box (& whiskers) plot
designed for ordinal/ interval data
numerical summary
percentiles
x is the Pth percentile of the data if the percentage of the data that are less than or equal to x is P. The number P is the percentile rank of x
quartiles
Five-number summary: {xmin,Q1,Q2,Q3,xmax}
interquartile range (IQR) = Q3 - Q1
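A small sketch of these summaries, using a made-up vector `x` (hypothetical data, not from the notes). `quantile` uses R's default interpolation rule, so other textbooks' quartile conventions may give slightly different values.

```r
# Hypothetical data for illustration only.
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12, 15)

# Five-number summary: {xmin, Q1, Q2, Q3, xmax}.
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Interquartile range = Q3 - Q1.
iqr_manual <- five_num[["75%"]] - five_num[["25%"]]

# Percentile rank of the value 8: percentage of data less than or equal to 8.
rank_of_8 <- 100 * sum(x <= 8) / length(x)

print(five_num)
print(iqr_manual)
print(rank_of_8)
```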
Z-scores
distance of a data value from the mean, in units of standard deviation: z = (x - mean) / SD
empirical rule
bell-shaped data distribution
about 68% / 95% / 99.7% of the data lie within 1 / 2 / 3 SD of the mean
Chebyshev's rule
at least 3/4 (75%) of the data lie within 2SD of the mean;
at least 8/9 (88.89%) of the data lie within 3 SD of the mean
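A sketch comparing the two rules, assuming a simulated right-skewed sample (`rexp`) as the non-bell-shaped data. The empirical-rule shares come from the normal distribution itself; Chebyshev's bounds should hold for the skewed sample as well.

```r
# Empirical rule: shares within 1, 2, 3 SD for a normal (bell-shaped) distribution.
k <- 1:3
empirical <- pnorm(k) - pnorm(-k)
print(round(empirical, 4))

# A right-skewed simulated sample (assumed data), where only Chebyshev's rule applies.
set.seed(1)
x <- rexp(1000, rate = 1)
z <- (x - mean(x)) / sd(x)        # z-score: distance from the mean in SD units

within_2sd <- mean(abs(z) <= 2)   # Chebyshev: at least 3/4
within_3sd <- mean(abs(z) <= 3)   # Chebyshev: at least 8/9
print(c(within_2sd = within_2sd, within_3sd = within_3sd))
```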
distribution shape
symmetric (mirror images of right and left distribution)
positively (right) skewed
negatively (left) skewed
numerical measure
Population Parameter
μ (mean) ; σ (standard deviation) ; p (proportion)
unknown, but assumed to be fixed
Sample Statistics
X̄ (mean); s (standard deviation); p̂ (proportion)
known values
Statistical Characteristic
Center of Location
mean
median
mode
Dispersion (degree of spread)
Range
variance
standard deviation
inferential
drawing conclusions (estimates, predictions)
say something about the process that generated the data
Types of data
Nominal
measurements in distinct categories, with no given order; categorical
CALCULATION: Frequencies, proportion, mode
Ordinal
measurement categories are ordered, so identifying higher and lower values now makes sense
CALCULATION: median, percentiles, range, interquartile range
If the distances between adjacent values are reasonably comparable, ordinal data are often analyzed as interval data
Interval
differences between scale values are comparable; quantitative
CALCULATION: mean, standard deviation, skewness; common arithmetic calculations are valid
Probability
S = Sample space = set of all possibilities; P(S) = 1
Complements/ intersection/ union
Conditional (A given B) V.S. Independent
Sampling Distribution - probability distribution for a statistic
assumption
random sampling (no non-sampling error)
stability over time - process feature
sampling/ non-sampling error
non-sampling error (bias): improper wording/ leading language, or a problem in the way the subset is selected
Example: sample 100 current CSOM students to estimate the mean number of hours worked
non-sampling error: sampling only MS Finance students/ respondents over-reporting
sampling error: a sample will not necessarily match what exists in the process (unavoidable). This sample-to-sample variability is the sampling error.
central limit theorem
allows a process distribution of any form, but assumes the standard deviation σ of X is known and the sample size is sufficiently large
also assumes μ is known (in practice, μ is always unknown)
with an unknown distribution, n >= 30 is sufficient; with a normal distribution, the theorem holds for any n
Population -> Sample distribution (a single sample of data points, which gives a standard deviation) -> Sampling distribution (a statistic computed from the data points; its spread is the standard error)
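The chain from population to sampling distribution can be sketched with a simulation. This assumes an exponential process with rate 1 (so μ = 1 and σ = 1), a choice made here only because it is clearly non-normal; `n` and `reps` are arbitrary illustration values.

```r
set.seed(42)
n    <- 30      # size of each sample (the n >= 30 guideline)
reps <- 5000    # number of repeated samples

# One sample: its spread is a standard deviation of data points.
one_sample <- rexp(n, rate = 1)
print(sd(one_sample))

# Sampling distribution of the mean: one X-bar per repeated sample.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Its center should be close to mu = 1, and its spread (the standard error)
# close to sigma / sqrt(n).
print(mean(sample_means))
print(sd(sample_means))
print(1 / sqrt(n))
```

A histogram of `sample_means` should look roughly bell-shaped even though the underlying process is strongly right-skewed.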
Probability Distribution
Discrete Distribution
a random variable whose set of possible values is finite or countably infinite
Binomial
X = the number of successes in n Bernoulli trials; each trial results in only success or failure
probability mass function
dbinom(x, size, prob) (x = number of successes, size = number of trials, prob = probability of success)
Question: get the probability of a certain number of successes
a person makes 70% of his throw attempts; if he shoots 20 throws, what is the probability that he makes exactly 12 of them?
cumulative distribution function
pbinom(q, size, prob)
Question: calculate the probability of getting heads more than 3 times when a fair coin is flipped 10 times
key word: more than/ less than
inverse cumulative distribution function (quantile function)
qbinom(p, size, prob) (p = cumulative probability)
Question: get the xx-th quantile of a binomial distribution with n trials and success probability p
key word: find cutoff
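The three binomial questions above can be worked as follows. The 90th percentile used for the cutoff is an assumed value, since the notes leave the quantile unspecified.

```r
# Exactly 12 makes out of 20 throws with success probability 0.7.
p_exactly_12 <- dbinom(12, size = 20, prob = 0.7)

# More than 3 heads in 10 fair flips: P(X > 3) = 1 - P(X <= 3).
p_more_than_3 <- pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)

# Cutoff: smallest x with P(X <= x) >= 0.90 (0.90 is an assumed quantile level).
cutoff_90 <- qbinom(0.90, size = 20, prob = 0.7)

print(p_exactly_12)     # about 0.1144
print(p_more_than_3)    # 0.828125
print(cutoff_90)
```

"More than 3" excludes 3 itself, which is why the upper tail starts from `pbinom(3, ...)`; "at least 3" would start from `pbinom(2, ...)` instead.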
Poisson
X = number of successes (events) in a given interval
general discrete distribution
mean = µ = E(X) = ∑[x P(x)]
variance σ² = ∑[(x − µ)² P(x)]
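As a worked instance of these two formulas, here is a fair six-sided die, chosen only as a familiar example of a general discrete distribution.

```r
# Fair die: possible values and their probabilities (illustrative example).
x <- 1:6
p <- rep(1 / 6, 6)

mu       <- sum(x * p)              # E(X) = sum of x * P(x)
sigma_sq <- sum((x - mu)^2 * p)     # variance = sum of (x - mu)^2 * P(x)
sigma    <- sqrt(sigma_sq)

print(mu)         # 3.5
print(sigma_sq)   # 35/12, about 2.9167
print(sigma)
```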
Bernoulli
X = “Success” or “Failure” ∈ {0, 1}
Standard deviation: sqrt(p(1 - p)); for a binomial with n trials: sqrt(np(1 - p))
negative binomial
X = the number of trials until the rth success occurs
Continuous Distribution
a random variable whose set of possible values is uncountably infinite; the possible values form an interval
Normal distribution
Continuous, symmetric, mound-shaped distribution
median = mode = mean
Standard normal distribution Z ~ N(0,1)
probability density function
dnorm(x, mean, sd)
dnorm (density function): dnorm(x, mean, sd)
pnorm (cumulative distribution function): pnorm(q, mean, sd)
qnorm (quantile function, the inverse of pnorm): qnorm(p, mean, sd)
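A brief sketch of the three normal functions. The mean of 100 and SD of 15 in the second part are made-up values for illustration.

```r
# Standard normal Z ~ N(0, 1).
height_at_0 <- dnorm(0, mean = 0, sd = 1)        # density at the center, about 0.3989
p_below     <- pnorm(1.96, mean = 0, sd = 1)     # P(Z <= 1.96), about 0.975
z_cutoff    <- qnorm(0.975, mean = 0, sd = 1)    # value with 97.5% below it, about 1.96

# Non-standard normal (hypothetical mean 100, sd 15): P(X <= 115) equals P(Z <= 1).
p_x <- pnorm(115, mean = 100, sd = 15)

print(c(height_at_0, p_below, z_cutoff, p_x))
```

`qnorm` undoes `pnorm`: feeding a cumulative probability back in returns the original cutoff.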
t distribution
Use the t distribution instead of the normal distribution when σ is unknown (not because n is small)
shape depends on degrees of freedom (df); as n approaches infinity, the t distribution gets closer to the normal distribution
symmetric, centered at 0; mean = mode = median = 0
standard deviation > 1 (for df > 2); a standard deviation below 1 would mean that, for the given degrees of freedom, the distribution varies less than the standard normal
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
F distribution
used for statistics that involve a ratio of variances
2 Inference Questions: Estimation/ Testing
Estimation: what is the parameter? Trying to make a statement about the value of the parameter
trying to infer the value of a population parameter from the observed sample data
Point estimation
Due to sampling error, X̄ is unlikely to equal μ exactly
Confidence interval
interval = best point estimate (X̄) ± (test statistic) × (estimated variability of sampling distribution)
test statistic
form of distribution (t-distribution)
acceptable error level of
estimated variability of sampling distribution
Calculate confidence interval for μ
STEP 1. find sample mean/ sample size/ standard deviation/ standard error (= standard deviation / sqrt(n))
STEP 2. calculate t-score = qt(p = alpha/2, df = n - 1, lower.tail = FALSE) or t.test(data$column_name, conf.level = ???)
STEP 3. calculate upper/ lower bound = sample mean ± t-score × standard error
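The three steps can be sketched by hand and checked against `t.test`. This assumes `mtcars$mpg` as the data column and alpha = 0.05; both are placeholders, not values from the notes.

```r
x     <- mtcars$mpg     # assumed data column for illustration
alpha <- 0.05           # assumed error level (95% confidence)

# STEP 1: sample mean, sample size, standard deviation, standard error.
n    <- length(x)
xbar <- mean(x)
s    <- sd(x)
se   <- s / sqrt(n)

# STEP 2: t-score for the chosen confidence level.
t_score <- qt(p = alpha / 2, df = n - 1, lower.tail = FALSE)

# STEP 3: lower and upper bound.
ci_manual <- c(lower = xbar - t_score * se, upper = xbar + t_score * se)

# Cross-check with the built-in interval.
ci_ttest <- t.test(x, conf.level = 1 - alpha)$conf.int

print(ci_manual)
print(ci_ttest)
```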
Calculate confidence interval for p from the binomial distribution
Wilson score interval method
prop.test(x, n, conf.level)
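A sketch of the Wilson score interval computed from its formula and compared with `prop.test`. The counts (45 successes in 100 trials) are hypothetical. Note that `prop.test` applies a continuity correction by default, so `correct = FALSE` is set here to match the plain Wilson formula.

```r
x    <- 45      # hypothetical number of successes
n    <- 100     # hypothetical number of trials
conf <- 0.95

z     <- qnorm(1 - (1 - conf) / 2)
p_hat <- x / n

# Wilson score interval: center and half-width.
denom  <- 1 + z^2 / n
center <- (p_hat + z^2 / (2 * n)) / denom
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
wilson_manual <- c(center - half, center + half)

# Same interval from prop.test without continuity correction.
wilson_prop <- prop.test(x, n, conf.level = conf, correct = FALSE)$conf.int

print(wilson_manual)
print(wilson_prop)
```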
interpretation of confidence interval
we have a (1 - alpha) × 100% degree of belief that we obtained an interval that contains p
e.g. at a 90% confidence level, we believe that there's a 90% chance that we obtained an interval that contains μ
Testing: how does the parameter compare to a hypothesized value?
Hypothesis testing: we test a parameter to see whether it equals a specific hypothesized value.
That is, we want to judge whether the true value of the parameter equals the specific value we assumed
general logic
step 1: set 2 competing, non-overlapping hypotheses (Ha, H0)
Ha: the claim that we want to establish: μ [>, <, or not equal] μ0
H0: the competing hypothesis
step 2: make assumption
random sampling
stability over time
if n < 30, X (the process) must be normally distributed
assume that H0 is true
step 3: get sample information
P-value = P(sample result at least as extreme as observed | H0 is true)
if p-value <= alpha, reject H0
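The testing steps can be sketched for a one-sample, one-sided test. The data (`mtcars$mpg`), the hypothesized value `mu0 = 18`, and alpha = 0.05 are all assumptions chosen for illustration.

```r
x     <- mtcars$mpg   # assumed sample
mu0   <- 18           # hypothetical value under H0
alpha <- 0.05         # assumed significance level

# Step 1: H0: mu = mu0 versus Ha: mu > mu0.
# Step 2: assume random sampling, stability over time, and that H0 is true.
# Step 3: sample information -> test statistic and p-value.
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # upper tail because Ha is ">"

reject_h0 <- p_value <= alpha

# Cross-check with t.test.
res <- t.test(x, mu = mu0, alternative = "greater")

print(c(t = t_stat, p = p_value))
print(reject_h0)
```

For a "less than" alternative the lower tail would be used instead, and for "not equal" the p-value would be twice the smaller tail.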
possible testing result
compare "actual but unknown" V.S. "Decision"
H0 is true, not reject H0: correct decision
H0 is true, but reject H0: type 1 error (alpha)
Ha is true, but not reject H0: type 2 error (beta)
Ha is true, reject H0: correct decision
minimizing errors
increasing n (sample size) can decrease standard error, which makes distribution narrower
reducing alpha by moving the cutoff means increasing beta
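The trade-off between alpha, beta, and n can be sketched with `power.t.test`, where beta = 1 - power. The effect size (`delta = 0.5`), `sd = 1`, and the sample sizes are arbitrary illustration values; `beta_for` is a hypothetical helper defined here.

```r
# Hypothetical helper: type 2 error rate (beta) of a one-sample, one-sided t-test.
beta_for <- function(n, alpha) {
  pw <- power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha,
                     type = "one.sample", alternative = "one.sided")$power
  1 - pw
}

beta_n20         <- beta_for(n = 20, alpha = 0.05)
beta_n80         <- beta_for(n = 80, alpha = 0.05)   # larger n, same alpha
beta_small_alpha <- beta_for(n = 20, alpha = 0.01)   # same n, smaller alpha

print(c(n20 = beta_n20, n80 = beta_n80, alpha_01 = beta_small_alpha))
```

Under these assumptions, raising n lowers beta at a fixed alpha, while lowering alpha at a fixed n raises beta.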
drawing conclusion
Critical Value: Cut-off
p-value: the probability of getting a sample result the same as or more extreme than the observed result, given that H0 is true
testing normality
hypothesis:
H0: X ~ N(., .), normally distributed
Ha: X is not ~ N(., .), NOT normally distributed
3 step evaluation
histogram (check distribution shape)
QQ plot (probability plot): qqnorm + qqline
fit test: Shapiro-Wilk test (shapiro.test)
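The three-step evaluation can be sketched on simulated data. Both samples below are assumptions: one drawn from a normal process, one from a clearly skewed (exponential) process for contrast.

```r
set.seed(7)
x_normal <- rnorm(50, mean = 10, sd = 2)   # simulated normal process (assumed)
x_skewed <- rexp(200, rate = 1)            # simulated right-skewed process (assumed)

# Step 1: histogram to check the distribution shape.
hist(x_normal, main = "Histogram", xlab = "x")

# Step 2: QQ plot; points close to the line suggest normality.
qqnorm(x_normal)
qqline(x_normal)

# Step 3: Shapiro-Wilk test, H0: X is normally distributed.
sw_normal <- shapiro.test(x_normal)
sw_skewed <- shapiro.test(x_skewed)

print(sw_normal$p.value)   # a large p-value gives no evidence against normality
print(sw_skewed$p.value)   # a small p-value leads to rejecting H0
```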