Data Analysis and Programming
Data Analysis and Theory Methods:
Type of Variables
Nominal
A nominal scale describes a variable with a limited number of different values that cannot be ordered
E.g. A variable industry would be nominal if it had categorical values such as 'financial', 'engineering', 'retail'
Ordinal
An ordinal scale describes a variable whose values can be ordered or ranked
E.g. low < medium < high
Interval
An interval scale describes values where the interval between the values can be compared
Interval scales don't have a "true zero": there is no such thing as "no temperature", but there is a zero temperature (e.g. 0 °C)
With interval data, we can add and subtract, but cannot multiply or divide
Ratio
A ratio scale describes variables where both intervals between values and ratio of values can be compared.
An example of a ratio scale is a bank account balance whose possible values are $5, $10 and $15. The difference between each pair is $5, and $10 is twice as much as $5.
Dichotomous
A variable is referred to as dichotomous if it can contain only two values
E.g. 'Male', 'Female' or 0, 1
Central Tendency
Mode
The mode is the most commonly reported value for a particular variable
Median
The median is the middle value of a variable, once it has been sorted from low to high
Mean
The average. It is defined as the sum of all the values divided by the number of values
Quartiles
Quartiles divide a continuous variable into four even segments based on the number of observations
Q1: 25%, Q2: 50% (the median), Q3: 75%
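In R this is the quantile() function (a minimal sketch with illustrative data):
y <- c(2, 4, 4, 5, 7, 9, 11, 12, 13, 20)   # illustrative numeric vector
quantile(y, probs = c(0.25, 0.5, 0.75))    # Q1, Q2 (median), Q3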
Other
Variance
The variance describes the spread of the data and measures how much the values of a variable differ from the mean
Standard Deviation
The standard deviation is the square root of the variance
For a normal distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two
z-score
It is possible to calculate a normalized value, called a z-score, for each data element that represents the number of standard deviations that element value is from the mean
A z-score of 0 means the value equals the mean; a positive z-score is above the mean and a negative one below it, with its magnitude giving the number of standard deviations from the mean
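A rough R sketch (y is an illustrative numeric vector):
z <- (y - mean(y)) / sd(y)   # z-score for every element of y
scale(y)                     # built-in equivalent: centre and scale the vector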
Skewness
Skewness can be positive or negative; a skewness value of zero indicates a symmetric distribution
If the upper tail is longer than the lower tail the value is positive (right-skewed); if the lower tail is longer, the value is negative (left-skewed)
Kurtosis
The shape of the distribution's peak should also be considered; it is characterized by a measurement called kurtosis
Confidence Intervals
Confidence intervals are a measure of our uncertainty about the statistics we calculate from a single sample of observations
Alpha-values
The confidence level is 100 x (1 - alpha)%; e.g. alpha = 0.05 gives a 95% confidence interval
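For example, t.test() in R reports the confidence interval for a mean at level 1 - alpha (y is an illustrative sample):
t.test(y, conf.level = 0.95)$conf.int   # 95% confidence interval (alpha = 0.05)
t.test(y, conf.level = 0.99)$conf.int   # 99% confidence interval (alpha = 0.01)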
Visualizations
Barcharts
Histograms
Shapes
constant
Bell-shaped
U-shaped
constantly increasing
constantly decreasing
exponentially increasing
bi-modal
Boxplot
Box-plots provide a succinct summary of the overall frequency distribution of a variable
Values needed
The lowest value, minimum
The lowest Quartile, Q1
The median, Q2
The upper Quartile, Q3
The highest value, maximum
The mean (sometimes marked as an additional point)
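The summary above is exactly what boxplot() draws; a minimal R sketch (y illustrative):
boxplot(y, main = "Distribution of y")    # box spans Q1..Q3 with the median line inside
points(1, mean(y), col = "red", pch = 18) # optionally mark the mean as well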
Classical Statistical Tests
Invalidation of results
Non-normality
In cases of non-normality and/or outliers it is much better to use a non-parametric technique such as Wilcoxon's signed-rank test. If there is serial correlation in the data, then you need to use time-series analysis or mixed-effects models
Outliers
Serial Correlation
Testing for normality
quantile-quantile plot (qqplot)
If our sample is normally distributed then the points lie approximately on a straight line
Departures from normality show up as various sorts of non-linearity (S-shaped, banana shape)
Functions: qqnorm, qqline
par(mfrow = c(1, 1))
qqnorm(y)
qqline(y, lty = 2)
shapiro.test
Null hypothesis is that the sample data are normally distributed
y <- runif(100)
shapiro.test(y)
Shapiro-Wilk normality test
data: y
W = 0.94205, p-value = 0.0002579
Here, the null hypothesis is that the data are normally distributed. Running the Shapiro-Wilk test gives p-value = 0.0002579, i.e. if the data really were drawn from a normally distributed population, the chance of seeing a departure from normality at least this large is only 0.0002579. Since p < 0.05 we reject the null hypothesis (unsurprising here, as runif() generates uniform, not normal, data)
Wilcoxon's signed-rank test
wilcox.test(speed, mu=990)
Wilcoxon signed rank test with continuity correction
data: speed
V = 22.5, p-value = 0.00213
alternative hypothesis: true location is not equal to 990
We reject the null hypothesis and accept the alternative hypothesis because p = 0.00213 (p < 0.05): the speed of light is significantly less than 299,990 km/s
Bootstrap in hypothesis testing
You have a single sample of n measurements, but you can resample from it in very many ways, so long as you allow some values to appear more than once and others to be left out (i.e. sampling with replacement)
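A minimal bootstrap sketch in R (y is the single sample and mu0 the hypothesised value; both names are illustrative):
mu0 <- 990                                           # hypothesised value (illustrative)
nboot <- 10000
boot.means <- numeric(nboot)
for (i in 1:nboot) {
  boot.means[i] <- mean(sample(y, replace = TRUE))   # resample with replacement
}
quantile(boot.means, c(0.025, 0.975))                # reject H0 if mu0 falls outside this interval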
Outliers
A good rule of thumb is that an outlier is a value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile
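That rule of thumb translates directly into R (y illustrative):
q <- quantile(y, c(0.25, 0.75))
iqr <- q[2] - q[1]                               # interquartile range, same as IQR(y)
y[y < q[1] - 1.5 * iqr | y > q[2] + 1.5 * iqr]   # values flagged as outliers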
Two Samples
Fisher's F test, var.test
comparing two variances
Student's t-test, t.test
comparing two samples with normal errors
Wilcoxon's rank test, wilcox.test
comparing two means with non-normal errors
the binomial test, prop.test
comparing two proportions
Pearson's or Spearman's rank correlation, cor.test
correlating two variables
chi-squared, chisq.test, or Fisher's exact test, fisher.test
testing for independence of two variables in a contingency table
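For example (a and b are illustrative samples), the corresponding R calls are:
var.test(a, b)      # Fisher's F test: comparing two variances
t.test(a, b)        # Student's t-test: comparing two means (normal errors)
wilcox.test(a, b)   # Wilcoxon rank test: comparing two samples (non-normal errors)
cor.test(a, b)      # Pearson correlation between the two variables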
Hypothesis Testing
Hypothesis tests are used to support decision making, by helping to understand whether data collected from a sample of all possible observations support a particular hypothesis
The null hypothesis (H0) is stated in terms of what would be expected if there were nothing unusual about the measured values of the observations in the samples we collect
The alternative hypothesis (HA) is what would be expected if there were something unusual about the measured values of the observations in the samples we collect; generally the opposite of H0
Hypothesis Errors
Rejecting the null hypothesis when, in fact, the null hypothesis should stand (Type I error)
Accepting the null hypothesis when it should be rejected (Type II error)
Threshold/Significance
The level of significance (threshold) determines when to accept or reject the hypothesis
95% confidence level: alpha = 0.05
99% confidence level: alpha = 0.01
Test Statistic (T)
T = (xbar - mu) / (sd / sqrt(n)), where xbar is the sample mean, mu the population mean, sd the sample standard deviation and n the sample size
C1/C2 critical regions
We reject the null hypothesis if the value of T falls outside the interval bounded by C1 and C2
C1: lower critical value, with alpha/2 (e.g. 0.025) of the distribution below it
C2: upper critical value, with alpha/2 (e.g. 0.025) of the distribution above it
p-value
The probability of obtaining a test statistic value at least as extreme as the observed value (assuming the null hypothesis is true)
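A minimal R sketch of the calculation (y is an illustrative sample and mu0 a hypothesised mean; it should agree with t.test):
mu0 <- 100                                 # hypothesised population mean (illustrative)
n <- length(y)
T <- (mean(y) - mu0) / (sd(y) / sqrt(n))   # test statistic
2 * pt(-abs(T), df = n - 1)                # two-sided p-value
t.test(y, mu = mu0)                        # R's built-in equivalent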
Big Data in R
Pseudo code for big data sources with R
for each chunk of data:
read it into R memory
perform an analysis of the chunk
store the partial results
discard the chunk
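A minimal concrete sketch of that loop, assuming a large file 'big.csv' (illustrative name) with a numeric first column, processed 10,000 rows at a time:
con <- file("big.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # column names from the header line
total <- 0; n <- 0
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 10000, col.names = hdr),
                    error = function(e) NULL)      # NULL once no lines are left
  if (is.null(chunk)) break
  total <- total + sum(chunk[[1]])                 # partial result: running sum of the first column
  n <- n + nrow(chunk)                             # the chunk itself is then discarded
}
close(con)
total / n                                          # combine the partial results (overall mean)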
Chunk-wise processing of data
An analysis is suited to this approach if :
There are no dependencies between chunks
Chunks are encoded in the data set
Each chunk is of manageable size
Text classification example
library(RMySQL)   # provides MySQL() and the DBI functions used below
# connect to the local MySQL database 'tc'
con <- dbConnect(MySQL(), user = "root", password = "",
                 host = "localhost", dbname = "tc")
# read the whole table into memory in one go
tcData <- dbGetQuery(con, "select * from text_classification")
# or send the query and fetch the results in chunks of 10 rows
res <- dbSendQuery(con, "select * from text_classification")
fetch(res, n = 10); fetch(res, n = 10)
# process the results group by group, one classifier at a time
res <- dbSendQuery(con, "select * from text_classification order by classifier")
dbApply(res, INDEX = "classifier",
        FUN = function(x, grp) { computeAccuracy(x) })   # computeAccuracy() defined elsewhere
Understanding Relationships
Scatter plots can be used to identify whether a relationship exists between two continuous variables measured on the ratio or interval scales
Relationship states:
Positive
Negative
No relationship
Correlation Coefficient (r)
Measures the strength and direction of the linear relationship, ranging from -1.0 to 1.0
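In R (x and y illustrative vectors):
cor(x, y)                             # Pearson's r
cor.test(x, y)                        # r together with a significance test
cor.test(x, y, method = "spearman")   # Spearman's rank correlation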
Time Series Data
Data collected over a time period
A time series is a series of data points indexed in time order (e.g. CPU utilization over time)
Visualizations
Data cannot be randomly reordered
Joining points using lines makes trends clearer, but may not actually be correct
Code
# read the daily visitor counts and build a weekly (frequency = 7) time series
visitData <- read.csv('daily-numbers.csv', header = T)
visitTS <- ts(visitData$Average.Num.Visitors, frequency = 7)
plot(decompose(visitTS))   # trend, seasonal and random components
Linear Regressions
The essence of regression analysis is using sample data to estimate parameter values and their standard errors
First, however, we need to select a model which describes the relationship between the response variable and the explanatory variable(s)
Simplest is linear model
y = a+bx
y = response variable
x = single continuous explanatory variable
a = the intercept, i.e. the value of y when x = 0
b = the slope of the line/ regression coefficient
Manual calculations
b = change in y / change in x (units of y per unit of x)
then plug in the a and b values and those are the 'best' estimates
Maximum Likelihood Estimates (MLE)
Given the data and having selected a linear model, we want to find the values of the slope and intercept that make the data most likely
Under the assumption of normally distributed errors, the MLE is given by the method of least squares
The residuals are the vertical differences between the data and the fitted model
Least Squares
Distance d
d = y - y^
Each of the residuals is a distance, d, between a data point, y, and the value predicted by the model, y^, evaluated at the appropriate value of the explanatory variable, x
Fitting Least Squares
d = y - a - bx
code
3 more items...
Plots
4 more items...
Assumptions
The variance in y is constant
The explanatory variable x , is measured without error
The difference between a measured value of y and the value predicted by the model for the same value of x is called a residual
Residuals are measured on the scale of y and normally distributed
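A minimal end-to-end sketch in R (x and y illustrative), fitting the line by least squares and checking the residual assumptions:
model <- lm(y ~ x)       # least-squares estimates of a (intercept) and b (slope)
summary(model)           # parameter values and their standard errors
d <- resid(model)        # residuals, d = y - y^
qqnorm(d); qqline(d)     # are the residuals approximately normally distributed?
plot(fitted(model), d)   # is the variance in y roughly constant across fitted values?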