Lecture, 11 Lesson 7 Part 2 Cement Data (second half) (Stepwise regression…
- Lesson 5.1 - Statistical Graphics
Errors in Graphic Construction:
- Misrepresentation of data;
- Redundant dimensions;
- Excessive Decoration;
- Multiple axes on the same plot;
- Gratuitous meddling with convention
Misrepresentation of Data
- Scale: the scale of a graphic has to be chosen so that it conveys the message accurately;
- Showing only part of the y-axis can make the differences between bars look many times bigger than they are;
- Graphics hidden behind the legend;
- Graphics that are meant to be compared work better when they are placed close together, not far apart
Redundant Dimensions:
- Drawing 3-dimensional graphs on a 2-d plot is not necessary, as it causes confusion.
- A 3-d rendering can distort or hide the areas that a 2-d plot shows accurately.
Excessive Decoration:
- Too much information on a plot can be excessive;
- It is hard to represent proportions with pictures, such as car parts.
Multiple axes on the same plot:
- One problem with overlaying plots that use different axes is that the point where the curves cross does not mean much.
Gratuitous meddling with convention:
- Departing from conventions can confuse people, e.g. making time run from right to left.
- 5.2 - Story of Sir Ronald Fisher
Regression:
- In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
Causation:
- Causal links are mostly established by scientists, not statisticians.
- Statistics does not cover causation.
- Causation is not correlation
- Bad policy decisions are made when people mistake correlation for causation.
Statistics describes tendencies, not certainties
p-values:
- Tells us how likely it is to get a result at least as extreme as this one if the Null Hypothesis is true.
- If the p-value (a tail area under the curve) is less than α (the significance level), then the Null Hypothesis is rejected.
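As a rough illustration of the decision rule above, the sketch below computes a two-sided p-value by hand as the tail area under a t density and compares it with R's t.test. The sample x, the null value mu0 and alpha = 0.05 are made-up assumptions for illustration, not values from the lecture.

```r
# Hypothetical sample; the values are made up for illustration only.
x     <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 6.1)
mu0   <- 5       # assumed Null Hypothesis: the true mean is 5
alpha <- 0.05    # assumed significance level

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value: area under the t density beyond |t_stat| in both tails.
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in one-sample t-test should report the same p-value.
p_builtin <- t.test(x, mu = mu0)$p.value

# Decision rule: reject the Null Hypothesis when the p-value is below alpha.
reject_null <- p_manual < alpha
print(c(p_manual = p_manual, p_builtin = p_builtin))
print(reject_null)
```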
Association:
- In Statistics, an association is any relationship between two measured quantities that renders them statistically dependent.
Patterns:
- Skew to left
- Skew to right
- Normal
- Uniform
- Multi-modal
Types of Plots
One-dimensional Scatter Plot:
- The one-dimensional scatterplot is a compact representation of the data.
- Allows us to see the maximum and minimum values
- Get a rough impression of the centre, spread, and density of the data
- Determine whether the distribution is symmetric or not and detect outliers.
However, the resolution of the points is a problem when there are many observations or when, as in this case, there are equal observations. This problem can be dealt with by introducing a second dimension by “jittering” or stacking the data.
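A minimal sketch of jittering and stacking with base R's stripchart. The data here are simulated and rounded to create ties; they are an assumption and do not come from the lecture.

```r
set.seed(1)
# Hypothetical data with many ties (values rounded to whole numbers).
x <- round(rnorm(60, mean = 10, sd = 2))

op <- par(mfrow = c(3, 1))
stripchart(x, method = "overplot", main = "Overplotted: ties hide each other")
stripchart(x, method = "jitter", jitter = 0.2, main = "Jittered")
stripchart(x, method = "stack", main = "Stacked")
par(op)

# Without jitter or stacking, only the distinct values are visible.
n_distinct <- length(unique(x))
n_points   <- length(x)
print(c(distinct = n_distinct, points = n_points))
```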
Histogram:
- Histograms use a single bin width for all bins;
- hist(inter,nclass=1);
- nclass changes the number and width of bins;
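The effect of nclass can be sketched as follows. The vector x is simulated and only stands in for the lecture's inter data, which is not available here. Note that R treats nclass as a suggestion and picks "pretty" break points, so the actual number of bins may differ from the value requested.

```r
set.seed(2)
x <- rnorm(200)   # hypothetical data standing in for `inter`

# Compute the histograms without drawing, to inspect the bins.
h_few  <- hist(x, nclass = 5,  plot = FALSE)
h_many <- hist(x, nclass = 30, plot = FALSE)

# Draw them side by side.
op <- par(mfrow = c(1, 2))
plot(h_few,  main = "nclass = 5")
plot(h_many, main = "nclass = 30")
par(op)

# More classes means more, narrower bins; within one histogram the width is constant.
print(c(bins_few = length(h_few$counts), bins_many = length(h_many$counts)))
print(unique(round(diff(h_many$breaks), 10)))
```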
Density Plot:
- Density plots are more like what people would draw with a finger;
- with the rectangular kernel, the density estimate is a local average of the values that fall inside a box, and the result does not look good;
- Density plots can be thought of as plots of smoothed histograms;
- plot(density(inter, window="r", width=10), type="l")
- lty selects the line type;
- window="g" selects the gaussian kernel, which is smoother than the box (rectangular) kernel;
- If the density plot looks bumpy, increase the width to make it smoother;
- type="l" is passed to plot(), not to density(), and means drawing a line.
- A rule of thumb of choosing width = (Max-Min)/Number of scale width
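The kernel and width effects described above can be sketched like this. The data are simulated, and the sketch uses density's standard kernel and bw arguments rather than the S-compatible window and width used in the notes; the bandwidth values and the roughness helper are assumptions chosen for illustration.

```r
set.seed(3)
x <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))   # hypothetical two-group data

# Same evaluation grid for all three estimates so they are comparable.
from <- min(x) - 3
to   <- max(x) + 3
d_rect   <- density(x, kernel = "rectangular", bw = 0.3, from = from, to = to)
d_gauss  <- density(x, kernel = "gaussian",    bw = 0.3, from = from, to = to)
d_smooth <- density(x, kernel = "gaussian",    bw = 1.0, from = from, to = to)

plot(d_rect, type = "l", lty = 3, main = "Kernel and bandwidth")
lines(d_gauss,  lty = 2)
lines(d_smooth, lty = 1)
legend("topright", lty = c(3, 2, 1),
       legend = c("rectangular, bw = 0.3", "gaussian, bw = 0.3", "gaussian, bw = 1"))

# Hypothetical helper: a crude bumpiness measure (sum of absolute second differences).
roughness <- function(d) sum(abs(diff(d$y, differences = 2)))
print(c(rect = roughness(d_rect), gauss = roughness(d_gauss), smooth = roughness(d_smooth)))
```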
Q-Q Plot:
- A Q-Q plot compares the data quantiles with the theoretical quantiles, from the smallest to the largest; it is more effective than comparing a histogram with the bell curve.
- A Q-Q plot that follows a straight line indicates that the data look normally distributed
- The inter data are skewed to the left, so the Q-Q plot curves down
- The halo data are skewed to the right, so the Q-Q plot curves up
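A small sketch of the "curves up / curves down" behaviour using qqnorm and qqline. The inter and halo data are not available here, so simulated exponential samples stand in for them; the curvature helper is a hypothetical check, not part of the lecture.

```r
set.seed(4)
right_skewed <- rexp(200)    # hypothetical right-skewed sample
left_skewed  <- -rexp(200)   # its mirror image is left-skewed

op <- par(mfrow = c(1, 2))
qqnorm(right_skewed, main = "Right-skewed: curves up")
qqline(right_skewed)
qqnorm(left_skewed, main = "Left-skewed: curves down")
qqline(left_skewed)
par(op)

# Hypothetical helper: sign of the quadratic term when the sample quantiles
# are regressed on the theoretical quantiles (positive = curving up).
curvature <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  unname(coef(lm(qq$y ~ qq$x + I(qq$x^2)))[3])
}
print(c(right = curvature(right_skewed), left = curvature(left_skewed)))
```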
Boxplot:
- Boxplots are good for non-symmetric data
Graphics
Plot aspect ratio
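- Change plotting ratio:
- par()$pin shows the current plot dimensions (width, height) in inches;
- par(pin=c(1,1)) sets the plot region to 1 inch by 1 inch, i.e. a square plot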
Multiple plots on one graph:
- par(mfrow=c(1,3));
- 1 row, 3 columns;
Plot with same x-axis
- Give the plots the same x-axis scale by using the same breaks:
- hist(disk, breaks=seq(-300, 250, 50))
- hist(inter, breaks=seq(-300, 250, 50))
- A good way to assess a graphic is to imagine that someone else made it and you are the reader: does it communicate what is needed?
- inset, in legend(): distance of the legend from the margins when it is placed by a keyword such as a corner or "center";
- pch, plotting character;
- cex, number indicating the amount by which plotting text and symbols should be scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50% smaller;
- pt.cex, in legend(): expansion factor for the points
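The layout and legend settings above can be combined in one sketch. The vectors a and b are simulated stand-ins for disk and inter, and the break limits are computed from the data rather than fixed at -300 and 250; both are assumptions for illustration.

```r
set.seed(5)
# Hypothetical stand-ins for `disk` and `inter`.
a <- rnorm(100, mean = 0,   sd = 60)
b <- rnorm(100, mean = -50, sd = 60)

# Common breaks, 50 units wide, covering both samples.
lo   <- floor(min(a, b) / 50) * 50
hi   <- ceiling(max(a, b) / 50) * 50
brks <- seq(lo, hi, by = 50)

op <- par(mfrow = c(1, 3))            # 1 row, 3 columns
ha <- hist(a, breaks = brks, main = "a")
hb <- hist(b, breaks = brks, main = "b")
plot(a, b, pch = 16, cex = 0.8)       # small filled circles
points(mean(a), mean(b), pch = 4, cex = 2)
legend("topright", inset = 0.02,      # keyword position, nudged off the margin
       legend = c("observations", "mean"),
       pch = c(16, 4), pt.cex = c(0.8, 2), cex = 0.9)
par(op)
```

Because both histograms share brks, their bars line up and the two distributions can be compared directly.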
Types of Distribution
t-distribution:
- Heavier tails than normal;
- Heavy tails mean more contamination (more extreme values);
- So its Q-Q plot looks like waste going down a toilet: the points bend away from the line at both ends
Chi-square Distribution:
- The density is skewed to the right;
- The Q-Q plot curves up if the density is skewed to the right, and down if it is skewed to the left;
Uniform Distribution:
- runif generates uniform random numbers; the distribution has light tails, giving a strong S-shape;
- Light tails go to zero faster than heavy tails.
- Light-tailed data have less contamination, so the Q-Q plot shows a strong "S" shape (think Superman)
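The three tail behaviours above can be compared side by side. The sample size, degrees of freedom and the excess_kurtosis helper are assumptions chosen for illustration.

```r
set.seed(6)
n <- 500
heavy <- rt(n, df = 3)       # t-distribution: heavier tails than normal
skew  <- rchisq(n, df = 3)   # chi-square: skewed to the right
light <- runif(n)            # uniform: light tails

op <- par(mfrow = c(1, 3))
qqnorm(heavy, main = "t (df = 3): heavy tails")
qqline(heavy)
qqnorm(skew, main = "Chi-square (df = 3): curves up")
qqline(skew)
qqnorm(light, main = "Uniform: S-shape")
qqline(light)
par(op)

# Hypothetical helper: sample excess kurtosis (0 for a normal distribution;
# positive suggests heavy tails, negative suggests light tails).
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}
print(c(heavy = excess_kurtosis(heavy), light = excess_kurtosis(light)))
```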
Rounded Data:
- Q-Q plot on the rounded data, e.g. 10.1 and 10.3 are rounded to 10.
- This will produce stair-case type of Q-Q plot as numbers are rounded.
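A minimal sketch of the staircase effect, assuming simulated normal data rounded to whole numbers (the lecture's own rounded data set is not available here):

```r
set.seed(7)
x         <- rnorm(200, mean = 10, sd = 1)   # hypothetical measurements
x_rounded <- round(x)                        # e.g. 10.1 and 10.3 both become 10

op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Original")
qqline(x)
qqnorm(x_rounded, main = "Rounded: staircase")
qqline(x_rounded)
par(op)

# Each distinct rounded value forms one horizontal step in the Q-Q plot.
n_steps <- length(unique(x_rounded))
print(n_steps)
```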
Bi-modal (multimodal) distribution:
- A bi-modal distribution produces a histogram with two peaks (a multimodal one, several);
- Its Q-Q plot is a light "S" shape;
- It looks like a stepped shape;
Data is Everywhere
- Data visualisation can help us uncover patterns hidden in the noise of data
- Businesses monetise data by uncovering new relationships
- Scientists uncover causal links by discovering new relationships
Course Goals
- Be an effective creator of presentation graphics so that we communicate facts efficiently and without bias
- Produce and consume analysis graphics so that we use them to make wise modelling choices that won't lead to flawed conclusions
Course has 2 Parts
- Part 1 Analysis graphics: statistical methods that use graphics to analyse data
- Part 2 Presentation Graphics: What makes a visualisation good or bad
- Numerical summaries should not replace graphical analysis
- Data visualisation communicates patterns more effectively than numbers alone
- The Anscombe dataset is used to show why data visualisation is important.
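The Anscombe point can be checked directly with the anscombe data frame that ships with R (columns x1..x4 and y1..y4). This is a sketch of one way to show it, not necessarily how the lecture presented it: the four sets share almost identical means and correlations yet look completely different when plotted.

```r
data(anscombe)   # built-in data set in R's `datasets` package
xs <- paste0("x", 1:4)
ys <- paste0("y", 1:4)

# Numerical summaries: nearly identical across the four sets.
summary_table <- data.frame(
  mean_x = unname(sapply(xs, function(v) mean(anscombe[[v]]))),
  mean_y = unname(sapply(ys, function(v) mean(anscombe[[v]]))),
  cor_xy = unname(mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]), xs, ys)),
  row.names = paste("set", 1:4)
)
print(round(summary_table, 2))

# Graphics: the four sets are clearly different.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[xs[i]]]
  y <- anscombe[[ys[i]]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x))   # fitted lines are almost the same in every panel
}
par(op)
```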
Final Project
Cement Video
- We have covered scatter plot matrices, which are the first step.
- Then use dimension-reduction techniques such as simple plots or coplots (don't use R coplots)
- PCA can be used if you want to analytically discover the dimension of the covariance space. This can provide some hints on which x are related to other x. This information feeds into further modelling
- Then move towards formal modelling with ANOVA-type tools such as F-tests and t-tests, or subset regression if the dataset is small enough. Use a stepwise procedure if the dataset is large
- Check added-variable plots or partial residual plots to see whether individual variables are useful in the presence of the other variables, and whether the form in which they enter the model is linear.
- Then a set of regular residual plots shows whether your model assumptions are reasonably met.
- Write down the model at the end!
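The workflow above can be sketched end to end in R. This assumes that the cement data frame in the MASS package (columns x1..x4 and y) is the data used in the video, which the notes do not confirm. It uses step for the stepwise stage and builds one added-variable plot by hand instead of relying on an extra package.

```r
library(MASS)   # assumption: MASS::cement is the lecture's cement data

# 1. Scatter plot matrix.
pairs(cement)

# 2. PCA on the predictors, to see which x are related to other x.
pca <- prcomp(cement[, c("x1", "x2", "x3", "x4")], scale. = TRUE)
print(summary(pca))

# 3. Formal modelling: full model, then a stepwise search by AIC.
full   <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
chosen <- step(full, direction = "both", trace = 0)

# 4. Added-variable plot for x1: residuals of y and of x1, each after
#    removing the other predictors. A linear trend supports a linear term.
res_y <- resid(lm(y  ~ x2 + x3 + x4, data = cement))
res_x <- resid(lm(x1 ~ x2 + x3 + x4, data = cement))
av_fit <- lm(res_y ~ res_x)
plot(res_x, res_y, main = "Added-variable plot for x1")
abline(av_fit)
av_slope <- unname(coef(av_fit)[2])   # equals the x1 coefficient in `full`

# 5. Regular residual plots for the chosen model.
op <- par(mfrow = c(2, 2))
plot(chosen)
par(op)

# 6. Write down the model at the end.
print(formula(chosen))
print(coef(chosen))
```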
Assignment 1&2 Video
- Group: group of otters
- Season: breeding or non-breeding
- Time: period of observation
- Groomer: otter that grooms
- Groomee: otter that gets groomed
- Frequency: number of grooms per observation period
- Plan your analysis first
- Time can be relevant to the number of grooms (00:27:00)
- Tools like bar plots can be useful for frequency
- There are 4 questions you will answer
- How to visualise it
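A bar plot of grooming frequency could look like the sketch below. The records are entirely made up and only reuse the variable names from the notes; the real assignment data will differ.

```r
# Made-up toy records (NOT the assignment data), using the notes' variable names.
otters <- data.frame(
  Season    = c("breeding", "breeding", "non-breeding",
                "non-breeding", "breeding", "non-breeding"),
  Groomer   = c("A", "B", "A", "B", "C", "C"),
  Frequency = c(4, 2, 1, 3, 5, 2)
)

# Total grooms per season (rows) and groomer (columns).
totals <- tapply(otters$Frequency, list(otters$Season, otters$Groomer), sum)
print(totals)

# Grouped bars: one group per groomer, one bar per season.
barplot(totals, beside = TRUE, legend.text = rownames(totals),
        xlab = "Groomer", ylab = "Total grooms")
```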
- install.packages("aplpack")
- install.packages("rgl")
- install.packages("Rcmdr")