Data Analysis and Programming
Data Analysis and Theory Methods:
Type of Variables
Nominal
A nominal scale describes a variable with a limited number of different values that cannot be ordered
E.g. A variable industry would be nominal if it had categorical values such as 'financial', 'engineering', 'retail'
Ordinal
An ordinal scale describes a variable whose values can be ordered or ranked
E.g. low < medium < high
Interval
An interval scale describes values where the interval between the values can be compared
Interval scales don't have a "true zero": there is no such thing as "no temperature", but there is a zero temperature (e.g. 0 °C)
With interval data, we can add and subtract, but cannot multiply or divide
Ratio
A ratio scale describes variables where both intervals between values and ratio of values can be compared.
An example of a ratio scale is a bank account balance whose possible values are $5, $10 and $15. The difference between each pair is $5, and $10 is twice as much as $5.
Dichotomous
A variable is referred to as dichotomous if it can contain only two values
E.g. 'Male', 'Female' or 0, 1
Central Tendency
Mode
The mode is the most commonly reported value for a particular variable
Median
The median is the middle value of a variable, once it has been sorted from low to high
Mean
The average. It is defined as the sum of all the values divided by the number of values
Quartiles
Quartiles divide a continuous variable into four even segments based on the number of observations
Q1: 25%, Q2: 50% (the median), Q3: 75%
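In R this is the quantile() function (a minimal sketch with illustrative data):
y <- c(2, 4, 4, 5, 7, 9, 11, 12, 13, 20)   # illustrative numeric vector
quantile(y, probs = c(0.25, 0.5, 0.75))    # Q1, Q2 (median), Q3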
Other
Variance
The variance describes the spread of the data and measures how much the values of a variable differ from the mean
Standard Deviation
The standard deviation is the square root of the variance
For a normal distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two
z-score
It is possible to calculate a normalized value, called a z-score, for each data element that represents the number of standard deviations that element value is from the mean
A z-score of 0 means the value equals the mean; a positive z-score is above the mean and a negative one below it, with its magnitude giving the number of standard deviations from the mean
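A rough R sketch (y is an illustrative numeric vector):
z <- (y - mean(y)) / sd(y)   # z-score for every element of y
scale(y)                     # built-in equivalent: centre and scale the vector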
Skewness
Skewness can be positive or negative; a skewness value of zero indicates a symmetric distribution
If the upper tail is longer than the lower tail the value is positive (right-skewed); if the lower tail is longer, the value is negative (left-skewed)
Kurtosis
The shape of the distribution's peak should also be considered; it is characterized by a measurement called kurtosis
Confidence Intervals
Confidence intervals are a measure of our uncertainty about the statistics we calculate from a single sample of observations
Alpha-values
The confidence level is 100 x (1 - alpha)%; e.g. alpha = 0.05 gives a 95% confidence interval
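For example, t.test() in R reports the confidence interval for a mean at level 1 - alpha (y is an illustrative sample):
t.test(y, conf.level = 0.95)$conf.int   # 95% confidence interval (alpha = 0.05)
t.test(y, conf.level = 0.99)$conf.int   # 99% confidence interval (alpha = 0.01)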
Visualizations
Barcharts
Histograms
Shapes
constant
Bell-shaped
U-shaped
constantly increasing
constantly decreasing
exponentially increasing
bi-modal
Boxplot
Box-plots provide a succinct summary of the overall frequency distribution of a variable
Values needed
The lowest value, minimum
The lowest Quartile, Q1
The median, Q2
The upper Quartile, Q3
The highest value, maximum
The mean (sometimes marked as an additional point)
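The summary above is exactly what boxplot() draws; a minimal R sketch (y illustrative):
boxplot(y, main = "Distribution of y")    # box spans Q1..Q3 with the median line inside
points(1, mean(y), col = "red", pch = 18) # optionally mark the mean as well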
Classical Statistical Tests
Invalidation of results
Non-normality
In cases of non-normality and/or outliers it is much better to use a non-parametric technique such as Wilcoxon's signed-rank test. If there is serial correlation in the data, then you need to use time-series analysis or mixed-effects models
Outliers
Serial Correlation
Testing for normality
quantile-quantile plot (qqplot)
If our sample is normally distributed then the points lie approximately on a straight line
Departures from normality show up as various sorts of non-linearity (S-shaped, banana shape)
Functions: qqnorm, qqline
par(mfrow = c(1, 1))
qqnorm(y)
qqline(y, lty = 2)
shapiro.test
Null hypothesis is that the sample data are normally distributed
y <- runif(100)
shapiro.test(y)
Shapiro-Wilk normality test
data: y
W = 0.94205, p-value = 0.0002579
Here, the null hypothesis is that the data are normally distributed. Running the Shapiro-Wilk test gives p-value = 0.0002579, i.e. if the data really were drawn from a normally distributed population, the chance of seeing a departure from normality at least this large is only 0.0002579. Since p < 0.05 we reject the null hypothesis (unsurprising here, as runif() generates uniform, not normal, data)
Wilcoxon's signed-rank test
wilcox.test(speed, mu=990)
Wilcoxon signed rank test with continuity correction
data: speed
V = 22.5, p-value = 0.00213
alternative hypothesis: true location is not equal to 990
We reject the null hypothesis and accept the alternative hypothesis because p = 0.00213 (p < 0.05): the speed of light is significantly less than 299,990 km/s
Bootstrap in hypothesis testing
You have a single sample of n measurements, but you can resample from it in very many ways, so long as you allow some values to appear more than once and others to be left out (i.e. sampling with replacement)
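A minimal bootstrap sketch in R (y is the single sample and mu0 the hypothesised value; both names are illustrative):
mu0 <- 990                                           # hypothesised value (illustrative)
nboot <- 10000
boot.means <- numeric(nboot)
for (i in 1:nboot) {
  boot.means[i] <- mean(sample(y, replace = TRUE))   # resample with replacement
}
quantile(boot.means, c(0.025, 0.975))                # reject H0 if mu0 falls outside this interval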
Outliers
A good rule of thumb is that an outlier is a value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile
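That rule of thumb translates directly into R (y illustrative):
q <- quantile(y, c(0.25, 0.75))
iqr <- q[2] - q[1]                               # interquartile range, same as IQR(y)
y[y < q[1] - 1.5 * iqr | y > q[2] + 1.5 * iqr]   # values flagged as outliers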
Two Samples
Fisher's F test, var.test
comparing two variances
Student's t-test, t.test
comparing two samples with normal errors
Wilcoxon's rank test, wilcox.test
comparing two means with non-normal errors
the binomial test, prop.test
comparing two proportions
Pearson's or Spearman's rank correlation, cor.test
correlating two variables
chi-squared, chisq.test, or Fisher's exact test, fisher.test
testing for independence of two variables in a contingency table
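For example (a and b are illustrative samples), the corresponding R calls are:
var.test(a, b)      # Fisher's F test: comparing two variances
t.test(a, b)        # Student's t-test: comparing two means (normal errors)
wilcox.test(a, b)   # Wilcoxon rank test: comparing two samples (non-normal errors)
cor.test(a, b)      # Pearson correlation between the two variables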
Hypothesis Testing
Hypothesis tests are used to support decision making, by helping to understand whether data collected from a sample of all possible observations support a particular hypothesis
The null hypothesis (H0) is stated in terms of what would be expected if there were nothing unusual about the measured values of the observations in the samples we collect
The alternative hypothesis (HA) is what would be expected if there were something unusual about the measured values of the observations in the samples we collect; generally the opposite of H0
Hypothesis Errors
Rejecting the null hypothesis when, in fact, the null hypothesis should stand (Type I error)
Accepting the null hypothesis when it should be rejected (Type II error)
Threshold/Significance
The level of significance (threshold) determines when to accept or reject the hypothesis
95% confidence level: alpha = 0.05
99% confidence level: alpha = 0.01
Test Statistic (T)
T = (xbar - mu) / (sd / sqrt(n)), where xbar is the sample mean, mu the population mean, sd the sample standard deviation and n the sample size
C1/C2 critical regions
We reject the null hypothesis if the value of T falls outside the interval bounded by C1 and C2
C1: lower critical value, with alpha/2 (e.g. 0.025) of the distribution below it
C2: upper critical value, with alpha/2 (e.g. 0.025) of the distribution above it
p-value
The probability of obtaining a test statistic value at least as extreme as the observed value (assuming the null hypothesis is true)
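A minimal R sketch of the calculation (y is an illustrative sample and mu0 a hypothesised mean; it should agree with t.test):
mu0 <- 100                                 # hypothesised population mean (illustrative)
n <- length(y)
T <- (mean(y) - mu0) / (sd(y) / sqrt(n))   # test statistic
2 * pt(-abs(T), df = n - 1)                # two-sided p-value
t.test(y, mu = mu0)                        # R's built-in equivalent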
Big Data in R
Pseudo code for big data sources with R
for each chunk of data:
read it into R memory
perform an analysis of the chunk
store the partial results
discard the chunk
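A minimal concrete sketch of that loop, assuming a large file 'big.csv' (illustrative name) with a numeric first column, processed 10,000 rows at a time:
con <- file("big.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # column names from the header line
total <- 0; n <- 0
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 10000, col.names = hdr),
                    error = function(e) NULL)      # NULL once no lines are left
  if (is.null(chunk)) break
  total <- total + sum(chunk[[1]])                 # partial result: running sum of the first column
  n <- n + nrow(chunk)                             # the chunk itself is then discarded
}
close(con)
total / n                                          # combine the partial results (overall mean)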
Chunk-wise processing of data
An analysis is suited to this approach if :
There are no dependencies between chunks
Chunks are encoded in the data set
Each chunk is of manageable size
Text classification example
library(RMySQL)   # provides MySQL() and the DBI functions used below
# connect to the local MySQL database 'tc'
con <- dbConnect(MySQL(), user = "root", password = "",
                 host = "localhost", dbname = "tc")
# read the whole table into memory in one go
tcData <- dbGetQuery(con, "select * from text_classification")
# or send the query and fetch the results in chunks of 10 rows
res <- dbSendQuery(con, "select * from text_classification")
fetch(res, n = 10); fetch(res, n = 10)
# process the results group by group, one classifier at a time
res <- dbSendQuery(con, "select * from text_classification order by classifier")
dbApply(res, INDEX = "classifier",
        FUN = function(x, grp) { computeAccuracy(x) })   # computeAccuracy() defined elsewhere
Understanding Relationships
Scatter plots can be used to identify whether a relationship exists between two continuous variables measured on the ratio or interval scales
Relationship states:
Positive
Negative
No relationship
Correlation Coefficient (r)
Measures the strength and direction of the linear relationship, ranging from -1.0 to 1.0
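In R (x and y illustrative vectors):
cor(x, y)                             # Pearson's r
cor.test(x, y)                        # r together with a significance test
cor.test(x, y, method = "spearman")   # Spearman's rank correlation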
Time Series Data
Data collected over a time period
A time series is a series of data points indexed in time order (e.g. CPU utilization over time)
Visualizations
Data cannot be randomly reordered
Joining points using lines makes trends clearer, but may not actually be correct
Code
# read the daily visitor counts and build a weekly (frequency = 7) time series
visitData <- read.csv('daily-numbers.csv', header = T)
visitTS <- ts(visitData$Average.Num.Visitors, frequency = 7)
plot(decompose(visitTS))   # trend, seasonal and random components
Linear Regressions
The essence of regression analysis is using sample data to estimate parameter values and their standard errors
First, however, we need to select a model which describes the relationship between the response variable and the explanatory variable(s)
Simplest is linear model
y = a+bx
y = response variable
x = single continuous explanatory variable
a = the intercept, i.e. the value of y when x = 0
b = the slope of the line/ regression coefficient
Manual calculations
b = change in y / change in x (units of y per unit of x)
then plug in the a and b values and those are the 'best' estimates
Maximum Likelihood Estimates (MLE)
Given the data and having selected a linear model, we want to find the values of the slope and intercept that make the data most likely
Under the assumption of normally distributed errors, the MLE is given by the method of least squares
The residuals are the vertical differences between the data and the fitted model
Least Squares
Distance d
d = y - y^
Each of the residuals is a distance, d, between a data point, y, and the value predicted by the model, y^, evaluated at the appropriate value of the explanatory variable, x
Fitting Least Squares
d = y - a - bx
code
3 more items...
Plots
4 more items...
Assumptions
The variance in y is constant
The explanatory variable x , is measured without error
The difference between a measured value of y and the value predicted by the model for the same value of x is called a residual
Residuals are measured on the scale of y and normally distributed
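A minimal end-to-end sketch in R (x and y illustrative), fitting the line by least squares and checking the residual assumptions:
model <- lm(y ~ x)       # least-squares estimates of a (intercept) and b (slope)
summary(model)           # parameter values and their standard errors
d <- resid(model)        # residuals, d = y - y^
qqnorm(d); qqline(d)     # are the residuals approximately normally distributed?
plot(fitted(model), d)   # is the variance in y roughly constant across fitted values?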