Data Science (Hypothesis Testing (steps in DATA-DRIVEN decision making…

- - - - is the statement we are trying to reject
    - - is the present state of affairs while the alternative is our personal opinion
  - - - denoted as "α"
      - df: The probability of rejecting the null hypothesis, if it is true
      - Typical values
        
        0.01
        
        ※0.05
        
        0.1
    - - Type I error
        
        Reject a true null hypothesis
        
        probability : α
        
        which means that the responsibility of making this error lies solely on you, because the α is chosen by you
      - Type II error
        
        Fail to reject (accept) a false null hypothesis
        
        probability : β
        
        decided by sample size and variance
      - Rejecting a false null hypothesis
        
        probability : 1 - β
        
        aka. The power of the test
      - example
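The relationship between α, β and the power of a test can be checked numerically. The following is a minimal sketch, assuming a one-sided (upper-tail) z-test with a known σ; the function name `z_test_power` and the numbers (μ0 = 100, true mean 105, σ = 15, n = 36) are hypothetical and not taken from the notes.

```python
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tail z-test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    se = sigma / n ** 0.5                    # standard error of the sample mean
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when z > z_crit
    shift = (mu1 - mu0) / se                 # where the test statistic is centered under mu1
    beta = std_normal.cdf(z_crit - shift)    # P(Type II error): failing to reject a false H0
    return 1 - beta                          # power = 1 - beta

# Hypothetical numbers for illustration only.
power = z_test_power(mu0=100, mu1=105, sigma=15, n=36, alpha=0.05)
beta = 1 - power
print(f"beta = {beta:.3f}, power = {power:.3f}")

# If H0 is actually true (mu1 == mu0), the rejection rate equals alpha (Type I error).
type_one_rate = z_test_power(mu0=100, mu1=100, sigma=15, n=36, alpha=0.05)
```

A larger sample size or a smaller variance shrinks the standard error, which lowers β and raises the power, in line with the note that β is decided by sample size and variance.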
- - - - a good approximation of the true probabilities
      - P(A) = successful trials / all trials
- - - - Distribution
        
        The possible values a variable can take and how frequently they occur
    - - Y
        
        The actual outcome of an event
        
        P(Y=y) = p(y) (probability function)
      - y
        
        One of the possible outcomes
      - x with a bar
        
        Sample mean
      - s^2
        
        Sample variance
      - σ
        
        Standard deviation (square root of variance)
      - σ^2
        
        Variance
        
        how spread the data is
      - μ
        
        Mean
        
        average value
  - - - "all the data"
    - - "just part of it"
  - - - Uniform distribution
        
        All outcomes are equally likely (such outcomes are called equiprobable)
        
        ex: rolling a die
        
        X~U(3,7)
        
        to read the statement: variable X follows a discrete uniform distribution ranging from 3 to 7
        
        drawbacks (because all outcomes have the same probability)
        
        The expected value provides us no relevant info
        
        The mean and variance are hard to interpret, and there is no real intuition behind what they mean
        
        so --> no predictive power
      - Bernoulli Distribution
        
        Events with only two possible outcomes (1 trial, 2 possible outcomes)
        
        ex: True or False
        
        X~Bern(p)
        
        read the statement: variable X follows Bernoulli Distribution with a probability of success equal to p
        
        Values
        
        σ^2 = p(1-p)
        
        E(X) = p
      - Binomial Distribution
        
        Carrying out a similar experiment several times in a row
        
        Two outcomes per iteration, with many iterations
        
        ex: flipping a coin
        
        X~B(10,0.6)
        
        read the statement: variable X follows a binomial distribution with 10 trials and a likelihood of 0.6 to succeed
        
        Bern(p) = B(1,p)
        
        Bern vs B
        
        E(Bernoulli event) = which outcomes we expect for a single event
        
        E(Binomial event) = The number of times we expect to get a specific outcome
        
        ex: for a True and False quiz, Guessing 1 question = Bernoulli event, Guessing the entire quiz = Binomial event
        
        Values
        
        p(y) = C(n,y) * p^y * (1-p)^(n-y)
        
        E(X) = x0*p(x0) + x1*p(x1) + ... + xn*p(xn)
        
        Y~B(n,p) ; E(Y) = n*p
        
        σ^2 = E(Y^2) - E(Y)^2 = n*p*(1-p)
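The binomial formulas can be verified by brute force. This is a small sketch using the X~B(10, 0.6) example from the notes; the helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ B(n, p): C(n, y) * p^y * (1 - p)^(n - y)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n, p = 10, 0.6
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

# Expected value as the probability-weighted sum of outcomes.
mean = sum(y * prob for y, prob in enumerate(pmf))
# Variance as E(Y^2) - E(Y)^2.
variance = sum(y**2 * prob for y, prob in enumerate(pmf)) - mean**2

print(mean, n * p)                # both should be close to 6
print(variance, n * p * (1 - p))  # both should be close to 2.4
```

Setting n = 1 reduces the same function to the Bernoulli case, matching Bern(p) = B(1,p).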
      - Poisson Distribution
        
        Test out how unusual an event frequency is for a given interval
        
        ex: Calculate the chance of Lebron James getting 12 points in the first quarter of his next game
        
        Y~Po(λ), ex: Y~Po(4)
        
        read the statement: variable Y follows a Poisson Distribution with lambda equal to 4
        
        Values
        
        P(Y=y) = λ^y * e^(-λ) / y!
        
        e≒2.72
        
        E(Y) = λ
        
        Var(Y) = λ
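The Poisson probability function can be evaluated directly. A minimal sketch, assuming λ = 4 as in the Y~Po(4) statement; the choice of y = 12 echoes the basketball example, but pairing it with λ = 4 is an assumption made only for illustration.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Po(lam): lam^y * e^(-lam) / y!"""
    return lam**y * exp(-lam) / factorial(y)

lam = 4  # assumed average number of events per interval

# How unusual is observing 12 events when 4 are expected?
p_12 = poisson_pmf(12, lam)

# Discrete distribution: interval probabilities are sums of individual values.
p_at_most_2 = sum(poisson_pmf(y, lam) for y in range(3))

print(f"P(Y=12) = {p_12:.6f}, P(Y<=2) = {p_at_most_2:.4f}")
```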
      - Characteristics
        
        Have a finite number of outcomes
        
        Can add up individual values to determine probability of an interval.
    - - Characteristics
        
        ※The probability distribution would be a curve
        
        Have infinitely many consecutive possible values
        
        Can not add up the individual values that make up an interval because there are infinitely many of them
        
        The probability of any individual value is 0,
        so P(x>X) = P(x>=X), because P(x=X) = 0
        
        Has a Cumulative Distribution Function (CDF)
        
        Graph
        
        F(-∞) = 0 , F(∞) = 1
        
        Integrating the PDF gives the CDF, so the area under the PDF up to a value x is the CDF at x
        
        Probability of Intervals
        
        The area under the density curve over an interval is the probability of that interval
        
        ex: P(a<x<b)
        
        ∫ from a to b of p(x) dx
        
        can use WolframAlpha to calculate
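As an alternative to an online calculator, the interval probability can be approximated numerically. This sketch assumes a standard normal density and uses the trapezoidal rule; `interval_probability` is a hypothetical helper, and the interval from -1 to 1 is chosen only as an example.

```python
from statistics import NormalDist

def interval_probability(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

dist = NormalDist(mu=0, sigma=1)

# Area under the density curve between a = -1 and b = 1.
area = interval_probability(dist.pdf, -1, 1)

# The same probability from the CDF: F(b) - F(a).
cdf_difference = dist.cdf(1) - dist.cdf(-1)

print(f"integral = {area:.6f}, F(1) - F(-1) = {cdf_difference:.6f}")
```

Both numbers agree because the CDF is the running integral of the PDF.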
      - Normal Distribution
        
        X~N(μ,σ^2)
        
        read the statement: The variable X follows a normal distribution with mean μ and variance σ squared
        
        characteristics
        
        its graph is a bell-shaped curve, symmetric, with tails on both sides
        
        mean = median = mode
        
        has no skew
        
        Values
        
        E(X) = μ
        
        Var(X) = σ^2 (usually not given) = E(X^2) - E(X)^2
        
        Standardization
        
        after standardization: E(Z) = 0, Var(Z) = 1
        
        follows the 68,95,99.7 Rule
        
        Z~N(0,1)
        
        Z = (X - μ) / σ
        
        this will drive the standard deviation of the new data set to 1
        
        Y ~ N(μ, σ^2)
        
        drawbacks
        
        Useful when we have a Normal Distribution
        
        Not always the case
        
        It requires a lot of data
        
        Risk of outliers drastically affecting our analysis
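Standardization can be illustrated on a small dataset. The values below are made up for the example; the population standard deviation (`pstdev`) is used so that the standardized data has a standard deviation of exactly 1.

```python
from statistics import fmean, pstdev

# Hypothetical data, treated as a whole population.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = fmean(data)      # population mean
sigma = pstdev(data)  # population standard deviation

# Z = (X - mu) / sigma for every value.
z_scores = [(x - mu) / sigma for x in data]

print(f"mean of z = {fmean(z_scores):.3f}, stdev of z = {pstdev(z_scores):.3f}")
```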
      - Student's t Distribution
        
        Characteristics
        
        accommodates extreme values significantly better than the Normal Distribution
        
        Because any extreme value represents a much bigger part of a small sample
        
        A small sample approximation of a Normal Distribution
        
        Y~t(k)
        
        read the statement: variable Y follows a Student's t distribution with k degrees of freedom
        
        features and characteristics
        
        Values
        
        k : degrees of freedom
        
        If k > 2
        
        E(Y) = μ
        
        Var(Y) = s^2 * k / (k - 2)
        
        Application
        
        Hypothesis testing with limited data
        
        CDF table(T-table)
        
        Independent Sample T test
        
        When to use
        
        two sample t-test: tests whether there is a difference between the means of the two samples
        
        ex: analyze whether dairy cows fed brand A feed and dairy cows fed brand B feed differ in average quarterly milk production
        
        Assumptions
        
        1.The dependent variable should be a continuous random variable
        
        2.The population of the dependent variable should follow a normal distribution
        
        3.The samples should be independent of each other
        
        4.Variance: The two samples should have equal variances (homogeneity of variance).
        
        Paired sample T test
        
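        When to use
        
        compares whether the means differ for 1. two paired samples or 2. repeated measurements of a single sample
        
        1. two paired samples, ex: analyze whether husbands' and wives' annual incomes differ within a group of married couples
        
        2. repeated measurements, ex: analyze whether the weight of participants in a weight-loss trial differs between the start of the trial and after 3 months of regular exercise
        
        Assumptions
        
        1.The dependent variable should be a continuous random variable
        
        2.The population of the dependent variable should follow a normal distribution
        
        3.The samples should be dependent (paired): each observation in one sample is matched with an observation in the other
        
        4.Variance: The paired differences should be approximately normally distributed with a constant variance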
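The two t-tests differ mainly in how the test statistic is built. The sketch below computes only the t statistic and degrees of freedom with the standard library; the data and the helper names `independent_t` and `paired_t` are hypothetical, and turning t into a p-value still requires a T-table or a statistics package.

```python
from math import sqrt
from statistics import fmean, stdev, variance

def independent_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic (assumes equal variances)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    t = (fmean(sample_a) - fmean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

def paired_t(before, after):
    """Paired t statistic computed on the per-subject differences."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical milk yields for two independent groups of cows.
feed_a = [30, 32, 31, 29, 33]
feed_b = [27, 28, 26, 29, 30]
t_ind, df_ind = independent_t(feed_a, feed_b)

# Hypothetical weights of the same people before and after a program.
before = [80, 75, 90, 68, 85]
after = [78, 72, 88, 65, 84]
t_pair, df_pair = paired_t(before, after)

print(f"independent: t = {t_ind:.3f}, df = {df_ind}")
print(f"paired:      t = {t_pair:.3f}, df = {df_pair}")
```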
      - Chi-Squared Distribution
        
        Characteristics
        
        Only consists of non-negative values
        
        Asymmetric
        
        A Chi-squared variable with k degrees of freedom is the sum of the squares of k independent standard normal variables
        
        Y ~ χ^2(k)
        
        read the statement: variable Y follows a chi square distribution with k degrees of freedom
        
        Application
        
        statistical analysis
        
        Hypothesis testing
        
        Computing confidence intervals
        
        Few real-life events follow it
        
        Goodness of fit of categorical values
        
        Used to test the association between two categorical variables
        
        Values
        
        E(X) = k
        
        Var(X) = 2k
        
        Assumptions
        
        1.All the variables are categorical variables
        
        2.The samples should be independent, which means that sample A does not have an impact on sample B
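For the association between two categorical variables, a common approach is Pearson's chi-squared statistic on a contingency table. The notes do not spell out the formula, so the sketch below is an assumption about the intended test: it sums (observed - expected)^2 / expected over all cells. The 2x2 table is invented for illustration.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: rows = group A / group B, columns = yes / no.
observed = [[30, 10],
            [20, 40]]
statistic, df = chi_squared_independence(observed)
print(f"chi-squared = {statistic:.3f}, df = {df}")
```

The statistic is then compared against a chi-squared distribution with `df` degrees of freedom.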
      - Exponential Distribution
        
        Characteristics
        
        for events that are rapidly changing early on
        
        variables follow this distribution with a probability that initially decreases before eventually plateauing
        
        models the time interval between independent random events
        
        ex: The number of views for a YouTube blog
        
        X ~ Exp(λ)
        
        read the statement: variable X follows an exponential distribution with a rate of λ
        
        Rate parameter : λ
        
        How fast the CDF/PDF curve reaches the point of plateauing
        
        How spread out the graph is
        
        the number of times the event occurs per unit of time
        
        graph of PDF/CDF
        
        The PDF plateaus near 0, while the CDF rises toward 1
        
        Values
        
        E(Y) = 1 / λ
        
        Var(Y) = 1 / λ^2
        
        drawbacks
        
        No table of known values (ex: Z-table ...)
        
        Transformation
        
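        take the natural logarithm of every value of exponentially distributed data to compress the long right tail; the result is often treated as roughly normal, although the logarithm of an exponential variable is not exactly normally distributed
        
        Y~Exp(λ) & X = ln(Y)
        
        X ≈ N(μ,σ^2) (approximation only)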
      - Logistic Distribution
        
        Useful in forecast analysis
        
        Useful for determining a cut-off point for a successful outcome
        
        ex: Often used in sports to anticipate how a player's or team's performance can determine the outcome of the match
        
        Y~Logistic(μ,s)
        
        read the statement: variable Y follows a logistic distribution with location μ and a scale of s
        
        Values
        
        E(Y) = μ
        
        Var(Y) = s^2 * π^2 / 3
        
        graph
- - - - When n > 30, the Sampling Distribution will approximate the Normal Distribution
      - why is it so important?
        
        Decisions based on Normal Distribution insights have a good track record
        
        allows us to perform tests, solve problems and make inferences using the Normal Distribution, even when the population is not normally distributed.
    - - No matter the underlying distribution, the sampling distribution approximates a Normal Distribution
    - - N(μ, σ^2 / n)
        
        This matches the Central Limit Theorem: the bigger the sample size you draw, the more accurate the results you will get
      - Standard error
        
        df: the standard deviation of the sampling distribution
        
        = σ / √n (the square root of the sampling distribution's variance, σ^2 / n)
        
        shows how well you approximated the true mean
        
        Likewise, the bigger the sample size, the better the approximation
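The Central Limit Theorem and the standard error can be checked with a simulation. This sketch assumes a uniform population on [0, 1] (clearly not normal), a sample size of n = 36 and 5000 repeated samples; all of these are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 36             # sample size (> 30)
repetitions = 5000

# Draw many samples from a uniform population and record each sample mean.
sample_means = [
    fmean(random.uniform(0, 1) for _ in range(n))
    for _ in range(repetitions)
]

population_mean = 0.5
population_sigma = sqrt(1 / 12)                 # standard deviation of U(0, 1)
theoretical_se = population_sigma / sqrt(n)     # sigma / sqrt(n)

observed_se = pstdev(sample_means)              # spread of the sampling distribution
print(f"mean of sample means = {fmean(sample_means):.4f}")
print(f"observed SE = {observed_se:.4f}, sigma / sqrt(n) = {theoretical_se:.4f}")
```

Increasing `n` shrinks both the observed and the theoretical standard error, which is the sense in which a bigger sample gives a better approximation of the true mean.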
  - - - is in the middle of the confidence interval
      - ex: x̄ as an estimator of μ, and s^2 as an estimator of σ^2
      - the example of "Estimator, Parameter and Estimate"
    - - provides much more information and are preferred when making inferences
    - - Bias
      - Efficiency
        
        less variability
  - - - Confidence Level
        
        1 - α
        
        0 ≦ α ≦ 1
        
        ex: 95% C.I. , then α will be 5%
      - Confidence Interval
        
        95% C.I.
        
        means that you are 95% confident that the true population parameter falls within the specified interval; in repeated sampling, about 95% of such intervals would contain it
      - margin of error
        
        (z or t score) * the appropriate standard error
    - - σ known
        
        Normal Distribution
      - σ unknown
        
        Student's T Distribution
        
        C.I. : x̄ ± t(n-1, α/2) * s / √n, where n-1 is the degrees of freedom
        
        is wider than in the σ known case, which makes sense, because there is more uncertainty when σ is unknown.
        
        assume that population is normally distributed, and we only have small sample size.
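Both interval types can be computed in a few lines. In this sketch the sample is invented, σ = 2 is assumed for the "σ known" case, and the t critical value 2.262 (9 degrees of freedom, α/2 = 0.025) is passed in by hand as read from a T-table, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(len(sample))
    center = fmean(sample)
    return center - margin, center + margin

def t_interval(sample, t_critical):
    """Confidence interval for the mean when sigma is unknown (uses sample s)."""
    margin = t_critical * stdev(sample) / sqrt(len(sample))
    center = fmean(sample)
    return center - margin, center + margin

# Hypothetical sample of 10 measurements.
sample = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48]

z_low, z_high = z_interval(sample, sigma=2, confidence=0.95)
t_low, t_high = t_interval(sample, t_critical=2.262)  # t-table value, df = 9

print(f"sigma known:   ({z_low:.3f}, {z_high:.3f})")
print(f"sigma unknown: ({t_low:.3f}, {t_high:.3f})")
```

With the same spread, the t-based interval comes out wider than the z-based one.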
    - - Dependent Samples
        
        before and after situation
        
        cause and effect
      - Independent Samples
        
        known population variances
        
        unknown population variances but assumed to be equal
        
        ex: the difference of the price of apples in New York and LA
        
        what u should know
        
        unknown population variances but assumed to be different
- - - - Definition：The number of different possible ways we can arrange a set of elements
      - ex: (A,B,C) , (A,C,B) , (B,A,C) , (B,C,A) , (C,A,B) , (C,B,A)
      - ※The order of the elements is crucial
    - - Variations with repetition
        
        V(n,p) = n^p
        
        n = the total number of elements we have available
        
        p = the number of positions we need to fill
      - df: The total number of ways we can pick and arrange some elements of a given set
      - Variations without repetition
      - ※The order of the elements is crucial
    - - The number of combinations equals the number of variations, over the number of permutations
      - C(n,p) = (variations without repetition) / (permutations of p elements) = [n! / (n-p)!] / p! = n! / (p! * (n-p)!)
    - - n! = 1 × 2 × 3 × ... × n
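The counting rules can be confirmed with the standard library. The sketch uses the (A, B, C) example for permutations and an assumed n = 5, p = 3 for variations and combinations.

```python
from itertools import permutations
from math import comb, factorial, perm

# Permutations: every ordering of a full set.
elements = ["A", "B", "C"]
arrangements = list(permutations(elements))
print(len(arrangements), factorial(len(elements)))

n, p = 5, 3  # assumed: 5 available elements, 3 positions to fill

variations_with_repetition = n**p            # n^p
variations_without_repetition = perm(n, p)   # n! / (n - p)!
combinations = comb(n, p)                    # n! / (p! * (n - p)!)

# Combinations = variations without repetition / permutations of the p chosen elements.
print(combinations, variations_without_repetition // factorial(p))
```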
- - - - is a subset of the population, denoted as "n"
      - characteristics
        
        Randomness
        
        a random sample is collected when each member of the sample is chosen from the population strictly by chance
        
        Representativeness
        
        a representative sample is a subset of the population that accurately reflects the members of the entire population
    - - hard to find and observe in real life
      - is a collection of all items of interest, denoted as "N"
    - - types of data
        
        categorical
        
        df: describes categories or groups
        
        ex: Car brands, Yes and No question
        
        Data Visualization(open S15_L17 excel)
        
        Frequency distribution tables
        
        Bar charts
        
        Pie charts
        
        Pareto diagrams
        
        It shows how subtotals change with each additional category
        
        numerical
        
        discrete
        
        can be counted in a finite number
        
        ex: number of children you want to have(it is impossible to have like 1.2 children), grades on SAT
        
        continuous
        
        is infinite and impossible to have an absolute count
        
        ex: weight, because it can take an infinite number of values no matter how many digits there are after the decimal point
        
        ex: area, distance and time; the age of a building can be 35 years, and it can also be 35.9234 years; all of these can vary by infinitely small amounts
        
        Data Visualization
        
        Frequency Distribution
        
        Histogram
        
        the intervals (bins) can have different widths
      - Measurement levels
        
        Qualitative
        
        nominal
        
        ex: car brands like Benz, BMW, Yaris...
        
        are only used to classify the data.
        
        ordinal
        
        ex: from disgusting to delicious
        
        consists of groups in categories which follow a strict order
        
        Quantitative
        
        interval
        
        doesn't have a true 0
        
        ratio
        
        has a true 0
  - - - mean
        
        drawbacks
        
        is easily affected by outliers
      - mode
      - median
    - - skewness
        
        df: indicates whether the data is concentrated on one side
        
        tell us where the data is situated
        
        positive and negative
        
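        positive
        
        the right tail is longer than the left tail, which means that most of the values are situated on the left side of the mean.
        
        negative
        
        the left tail is longer than the right tail, which means that most of the values are situated on the right side of the mean.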
    - - Variance
        
        for Population
        
        σ^2 = Σ(x_i - μ)^2 / N
        
        for Sample
        
        s^2 = Σ(x_i - x̄)^2 / (n - 1)
        
        measures the dispersion of a set of data points around their mean
      - Coefficient of variation(CV)
        
        for Population
        
        σ / μ
        
        for Sample
        
        s / x̄
        
        Comparing two or more datasets
      - Standard Deviation
        
        is the most common measure of variability for a single dataset
      - Covariance
        
        the two variables are correlated and the main statistic to measure this correlation is called Covariance
        
        formula
      - Correlation coefficient
        
        Cov(x,y) / (Stdev(x) * Stdev(y))
        
        makes Covariance easier to judge intuitively, because the Covariance value by itself does not carry much meaning
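The measures of variability and relationship can be computed side by side. The covariance formula is not written out in the notes, so this sketch assumes the usual sample covariance, Σ(x_i - x̄)(y_i - ȳ) / (n - 1); the two short data series are hypothetical.

```python
from statistics import fmean, stdev, variance

def covariance(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation coefficient: Cov(x, y) / (Stdev(x) * Stdev(y))."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sample_variance = variance(x)             # sum((x_i - mean)^2) / (n - 1)
coefficient_of_variation = stdev(x) / fmean(x)
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)

print(f"s^2 = {sample_variance}, CV = {coefficient_of_variation:.3f}")
print(f"Cov = {cov_xy}, Corr = {corr_xy:.4f}")
```

The correlation always lands between -1 and 1, which is why it is easier to read than the raw covariance.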
- - - - each event has set of outcomes(favorable outcomes)
      - ex: Event: even , set = 2,4,6,8...
      - Independence and Dependence
        
        Independent events
        
        The theoretical probability remains unaffected by other events
        
        Dependent events
        
        Probabilities of dependent events vary as conditions change
        
        Conditional Probability
        
        ex: Two events A&B
        
        P(A|B) -- "P of A given B" = P(A∩B) / P(B) only if P(B)>0
        
        The probability of getting A, if we are given that B has occurred
        
        df: The likelihood of an event occurring assuming a different one has already happened
        
        Bayes' Rule
        
        P(A|B) = P(B|A) * P(A) / P(B)
        
        it allows us to find a relationship between the different conditional probabilities of two events
        
        it is very useful in real life, for example in medical research and business hiring
        
        Multiplication Rule
        
        P(A∩B) = P(B|A) * P(A)
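A medical-testing example shows how the conditional probability rules fit together. All probabilities below are invented for illustration, and P(B) is obtained with the law of total probability, which the notes do not mention explicitly.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: A = has the disease, B = test is positive.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Multiplication rule: P(A and B) = P(B|A) * P(A).
p_disease_and_pos = p_pos_given_disease * p_disease

# Total probability of a positive test over both groups.
p_pos = p_disease_and_pos + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes_posterior(p_pos_given_disease, p_disease, p_pos)
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")
```

Even with an accurate test, the posterior stays low here because the prior P(A) is small.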
    - - features
        
        Sets : upper-case / Elements : lower-case, ex: set X is the even numbers, and element 8 is one of the even numbers
        
        ex: x(element) , A(set) | x ∈ A
        
        can be expressed as "x is an element of A"
        
        can be empty or have values
        
        aka. null set, empty set
        
        non-empty sets can be finite or infinite
        
        don't have to be numerical
      - Subset
        
        df: a set is fully contained in another set
        
        ※every set has at least 2 subsets
        
        𝐴⊆A
        
        ∅⊆A
      - Interaction ways
        
        partially overlap(intersect)
        
        Union
        
        A combination of all outcomes preferred for either A or B
        
        be careful of "DOUBLE COUNTING"
        
        completely overlap
        
        not touch
        
        Mutually Exclusive Sets
      - Complement
        
        pic: https://revisionworld.com/sites/revisionworld.com/files/imce/complement.gif
        
        Complement Set
        
        All values that are part of the sample space, but not part of the set
        
        features
        
        Complements are always mutually exclusive
        
        NOT all mutually exclusive sets are complements
    - - 𝑥∈𝐴
        
        “Element x is a part of set A.”
        
        ex: 2 ∈ all even numbers
      - A∋𝑥
        
        “Set A contains element x.”
        
        ex: all even numbers ∋ 2
      - 𝑥∉𝐴
        
        "Element x is NOT a part of set A."
        
        ex: 1 ∉ all even numbers
      - ∀𝑥:
        
        "For all /any x such that..."
        
        ex: ∀x: x ∈ all even numbers
      - 𝐴⊆𝐵
        
        "A is a subset of B"
        
        ex: even numbers ⊆ integers
- - - - Compares whether the means of multiple (more than two) samples are equal.
      - 2.One way: means that there's only one "x" (factor) in this ANOVA
    - - 1.The "x" should be categorical variable, and the "y" should be continuous variable.
      - 2.The population should follow normal distribution
      - The samples should be independent of each other.
      - Homogeneity of variance: The variances of the samples should be equal.
    - - Repeated measures ANOVA: When the samples are not independent.
      - Factorial (multi-factor) ANOVA: When there are multiple "x" (independent variables) in the sample
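The one-way ANOVA comparison of group means can be sketched as an F statistic: the variability between groups divided by the variability within groups. The three groups below are made up, `one_way_anova_f` is a hypothetical helper, and the resulting F still has to be compared against an F-table with the returned degrees of freedom.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    k, n_total = len(groups), len(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((value - fmean(g)) ** 2 for g in groups for value in g)

    df_between, df_within = k - 1, n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within

# Hypothetical measurements for three levels of one categorical "x".
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_statistic:.2f}, df = ({df_between}, {df_within})")
```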