TAE Week2 - Coggle Diagram
PCA steps
use the prcomp function; remember to set scale. = TRUE
then plot a scree plot to see how much variance each PC explains (the first PC explains the most); you can also add a reference line if you want
plot a cumulative variance plot to see how much variance the PCs explain cumulatively; you can also use sum() to see how much variance PCs 1-2 or PCs 1-3 explain together
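The first three steps can be sketched as follows. This is a minimal example, assuming the built-in iris measurements stand in for the course data; the names iris_data, pc_var and prop_var are illustrative only.

```r
# Assumption: the four numeric iris columns stand in for the course dataset.
iris_data <- iris[, -5]

# scale. = TRUE standardises each variable to unit variance before the PCA.
pca <- prcomp(iris_data, scale. = TRUE)

# Variance of each PC and the proportion of total variance it explains.
pc_var <- pca$sdev^2
prop_var <- pc_var / sum(pc_var)

# Scree plot, with an optional reference line at the average proportion.
plot(prop_var, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
abline(h = 1 / length(prop_var), lty = 2)

# Cumulative variance plot.
plot(cumsum(prop_var), type = "b",
     xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained")

# Variance explained by PCs 1-2 together, and by PCs 1-3 together.
sum(prop_var[1:2])
sum(prop_var[1:3])
```

summary(pca) reports the same proportions in tabular form, which is a quick way to cross-check the plots.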
rank the results if you need to
- rank the scores (pca$x) to see the weight of each country/species on a particular PC
- rank the loadings (pca$rotation) to see each attribute's weight on each PC
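One possible way to do the ranking, again assuming iris as stand-in data (pc1_rank and the choice of PC1 are illustrative, not part of the notes):

```r
# Assumption: iris stands in for the course dataset.
iris_data <- iris[, -5]
pca <- prcomp(iris_data, scale. = TRUE)

# Scores: one row per observation (country/species), one column per PC.
scores <- pca$x

# Rank observations by their PC1 score, largest first.
pc1_rank <- order(scores[, "PC1"], decreasing = TRUE)
head(data.frame(row = pc1_rank, PC1 = scores[pc1_rank, "PC1"]))

# Loadings: one row per attribute, one column per PC.
loadings <- pca$rotation

# Rank attributes by the absolute size of their PC1 loading.
sort(abs(loadings[, "PC1"]), decreasing = TRUE)
```

Sorting on the absolute value treats strongly negative loadings as just as influential as strongly positive ones; drop abs() if the sign matters for the ranking.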
library(factoextra), then draw the factor plots for individuals and for loadings
- the first plot places the data points on the two-dimensional graph of PC1 and PC2, so that similar points cluster together
- the second plot shows how strongly each attribute or characteristic influences a particular component
- you can use biplots to visually inspect which variables best characterise each cluster; the clusters can be separated out using habillage
apparently, if you set habillage, you cannot also set col.ind = "contrib"
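A hedged sketch of the three factoextra plots, assuming the package is installed and using iris with Species as the grouping variable (both are assumptions, not from the notes). The guard skips the plots when factoextra is unavailable.

```r
# Assumption: iris stands in for the course data; Species is the grouping.
pca <- prcomp(iris[, -5], scale. = TRUE)

if (requireNamespace("factoextra", quietly = TRUE)) {
  library(factoextra)

  # Individuals plot: points on PC1 vs PC2, coloured by group via habillage.
  p_ind <- fviz_pca_ind(pca, habillage = iris$Species, addEllipses = TRUE)

  # Variables plot: arrows coloured by contribution to the components.
  p_var <- fviz_pca_var(pca, col.var = "contrib")

  # Biplot: individuals and variable arrows together, labelling variables only.
  p_bi <- fviz_pca_biplot(pca, habillage = iris$Species, label = "var")

  print(p_ind)
  print(p_var)
  print(p_bi)
}
```

Here the grouping colour is set through habillage and col.ind is left out, in line with the note that the two do not combine.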
Homework Observations
- learn how to plot predicted vs actual values
- learn to use the predict function and the data.frame()
revise the R-squared formula and calculation (R-squared = 1 - SSE/SST)
- strictly speaking, if the test set has enough observations, we don't need the mean() of the training set to calculate SST; the test-set mean can serve as the baseline
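The homework points above (predict, data.frame, predicted vs actual, R-squared) can be tied together in one small sketch. The mtcars data, the mpg ~ wt model and the 22/10 split are assumptions chosen for illustration only.

```r
# Assumption: mtcars and the mpg ~ wt model are illustrative stand-ins.
set.seed(1)
train_idx <- sample(nrow(mtcars), 22)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt, data = train)

# predict() needs a data frame whose column names match the model terms.
pred <- predict(fit, newdata = test)

# Collect actual and predicted values side by side.
results <- data.frame(actual = test$mpg, predicted = pred)

# Predicted vs actual; points on the dashed 45-degree line are perfect.
plot(results$actual, results$predicted,
     xlab = "Actual mpg", ylab = "Predicted mpg")
abline(0, 1, lty = 2)

# Out-of-sample R-squared = 1 - SSE / SST.
sse <- sum((results$actual - results$predicted)^2)

# SST with the training-set mean as the baseline prediction.
sst_train_mean <- sum((results$actual - mean(train$mpg))^2)
r2_train_baseline <- 1 - sse / sst_train_mean

# SST with the test-set mean as the baseline prediction.
sst_test_mean <- sum((results$actual - mean(test$mpg))^2)
r2_test_baseline <- 1 - sse / sst_test_mean
```

The test-set mean minimises the sum of squares around the test values, so SST based on it is never larger than SST based on the training mean. The two R-squared values converge as the test set grows and the two means approach each other.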
learn how to subset
- iris_data <- iris[, -5] to remove the fifth column
- iris_sp <- iris[, 5] to take only the fifth (last) column
PCA
sum(pca$sdev^2) is the sum of all the PC variances; when the data are scaled, it equals the number of variables you have. In addition, as a rule of thumb, keep the PCs whose variance is more than 1, since each of those explains at least as much variance as one original (standardised) variable
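A quick check of both statements, assuming iris as the example data (eigenvalues and keep are illustrative names):

```r
# Assumption: iris stands in for the course dataset.
pca <- prcomp(iris[, -5], scale. = TRUE)

# Variances of the PCs (the eigenvalues of the correlation matrix).
eigenvalues <- pca$sdev^2

# With scaled data the total variance equals the number of variables (4 here).
sum(eigenvalues)
ncol(iris[, -5])

# Rule of thumb: keep the PCs whose variance exceeds 1.
keep <- which(eigenvalues > 1)
keep
```

Without scale. = TRUE the total equals the sum of the original variable variances instead, so the "more than 1" cut-off only makes sense on scaled data.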
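lm(meds ~ poly(stat, 2, raw = TRUE)): keep raw = TRUE so that the coefficients apply to stat and stat^2 directly rather than to orthogonal polynomials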
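To see what raw = TRUE changes, the sketch below fits the same quadratic three ways. The mtcars variables mpg and wt are stand-ins chosen for illustration, not the variables from the note.

```r
# Assumption: mpg and wt from mtcars stand in for the note's variables.
fit_raw <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)
fit_manual <- lm(mpg ~ wt + I(wt^2), data = mtcars)
fit_orth <- lm(mpg ~ poly(wt, 2), data = mtcars)

# raw = TRUE gives coefficients on wt and wt^2, matching the manual formula.
coef(fit_raw)
coef(fit_manual)

# Orthogonal polynomials give different coefficients but the same fit.
coef(fit_orth)
all.equal(fitted(fit_raw), fitted(fit_orth))
```

Both versions predict identically; raw = TRUE mainly makes the coefficients easier to read and to plug into a hand-written equation.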
R Code
can also subset the original dataset: to remove the rows where price is NA, use subset(wine, !is.na(price))
use pairs.panels (from the psych package) or cor(x, y) to see correlations
- if independent variables are highly correlated with each other, that is multicollinearity, and we should drop one of them
- high correlation between independent and dependent variable is good
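A small sketch of the NA filtering and correlation checks, assuming the built-in airquality data in place of the wine dataset; Ozone plays the role of the dependent variable here. pairs.panels is only called when the psych package is installed.

```r
# Assumption: airquality stands in for the wine data; Ozone is the response.
aq <- subset(airquality, !is.na(Ozone))

# Correlation between one predictor and the response.
cor(aq$Ozone, aq$Temp)

# Correlation matrix; complete.obs skips rows with NAs in other columns.
cor_mat <- cor(aq[, c("Ozone", "Solar.R", "Wind", "Temp")],
               use = "complete.obs")
round(cor_mat, 2)

# Scatterplot matrix with correlations, if psych is available.
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(aq[, c("Ozone", "Solar.R", "Wind", "Temp")])
}
```

In the matrix, look at the predictor-vs-predictor entries for signs of multicollinearity and at the Ozone row for how strongly each predictor relates to the response.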