Please enable JavaScript.
Coggle requires JavaScript to display documents.
Course 3: Getting and Cleaning Data (Reading from mySQL (Connecting to an…
Course 3: Getting and Cleaning Data
Components of Tidy Data
Raw data
A tidy data set
a code book describing each variable and its value in the tidy data set; aka the meta data
can include the units for a variable
an explicit and exact recipe you used to go from A - Z
What is Raw Data?
data in the rawest form that I had access to
a strange binary file my measurement machine spits out
unformatted excel file with 10 worksheets the company you contracted for sent you
the complicated JSON file I got from scraping Twitter's API
the hand entered numbers I entered while looking through a microscope
What is Tidy Data?
every variable I measure should be in 1 column
each different observation of that variable should be in a row
there should be 1 table for each "kind" of data
if multiple tables then they should include a column in the table that allows them to be linked
Downloading Files
file.exists("directory name")
will check to see if the directory named exists
dir.create("directoryName")
will create a directory if it doesn't exist
Getting Data from the Internet
download.file()
improves the reproducibility, instead of downloading the data by hand
parameters:
url
destfile
method
useful for download tab-delimited and csv files and others
fileUrl <- "http etc"
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
the method, curl, has to be specified on Mac when the website is a https
list.files("./data")
Keep track of the date you downloaded the data and put it in the R script
dateDownload <- date()
records the current date in the object dataDownloaded
Reading Excel Files
download.file(fileUrl, destfile = "./data/cameras.xlsx", method = "cur")
fileUrl <- "https etc"
use the xlsx package
read.xlsx()
read.xlsx("./data/cameras.xlsx", sheetIndex = 1, Header = TRUE)
read.xlsx2()
can also read in only certain columns and rows
specify as parameters in read.xlsx(), colIndex, and rowIndex
XL Connect package also has more options for reading and writing excel files
Reading JSON
java script object notation
common format for data from API's
similar to XML but different syntax/format
use json lite package
jSonData <- fromJSON("https etc")
returns a structured data frame
Writing data frames in R to JSON
myjson <- toJSON(iris, pretty = TRUE)
iris is a data set in R. this command writes the data frame, Iris, into a JSON file. good for exporting that would be used in an API
Reading XML
extensible markup language
frequently used to store structured data
extracting XML is the basis for most web scraping
Components
Content
the actual text of the document
Markup
labels that give the text structure
Tags, Elements & Attributes
Tags
correspond to general labels
start tags
<section>
end tags
</section>
empty tags
<line-break />
Elements
<Greeting> Hello, world </Greeting>
are specific examples of tags
Attributes
are components of the label
<img src="jeff.jpg" alt="instructor" />
< step number="3"> Connect A to B. </step>
Swirl: Manipulating Data w/ dplyr
same set of tools allows me to work with tabular data from a variety of sources, data frames, data tables, databases, multidimensional arrays
dim()
returns the dimensions of a data frame. the first dimension is the number of rows, and the 2nd dimension is the number of columns
step 1
put the data frame object into a tbl_df using the tbl_df() function on the data frame object
this compresses the original data frame making looking at it on the console more compact
5 Verbs
select()
keeps only the variables i mention. so I can easily select the columns i'm interested by name
select(R object, wanted.column, wanted column.name)
this command selects 2 columns by name, up to however many I want
select(R object, column.1:column.8)
this command will let me select the columns from 1 to 8 of the data frame
works the same way as specifying a sequence of numbers, only this is doing it with columns in my data frame
select(R object, -column.name)
this will discard the column name with the negative sign in front of it
filter()
filter(R object, column.name== "name.of.column")
pulls out the rows based on a value in a particular column.
I can also put in as many conditions as I want for how filter command will pull out the rows I want. so I can 1 condition AND another condition
can also have conditions where either OR are true
filter(cran, country=="US" | country=="IN")
arrange()
orders the rows of a data set according to the values of a particular variable
arrange(R.object, column.name)
will arrange the rows of the data set based on the column.name being in ascending order
arrange(R.object, desc(column.name))
arranges the rows so that the column.name is in descending order
arrange(R.object, column.name1, column.name2))
this will arrange the rows based on column.name1 ascending first and then by column.name2
mutate()
creates a new variable based on the values of 1 or 2 variables already in the data set
mutate(R.object, new.column.name = formula to convert old values to a new value that will then be saved as a new column in the data set)
summarize()
collapses the data set to 1 row
rename()
Swirl: dplyr
Grouping and Chaining (piping) w/ dplyr
group_by()
any operation I apply to the grouped data will take place on per the grouping specified
break up the data set into groups of rows based on the values of one or more variables
piping: %>%
subsetting & sorting
sort()
pass a variable through the function and it will sort the column into ascending order. or set decreasing = TRUE
sort(R.object$variable.1)
order()
will order all the rows in the data set based on setting the values of a specified column to ascending order. ascending order is the default
R.object[order(R.object$var1, R.object$var2),]
Summarizing Data
head(R.object, n = number of rows I want to see)
summary()
gives some information on every single variable
str()
quantile(R.object$variable1, na.rm = TRUE)
shows the cut off values for the 0%, 25%, 50% 75% and 100% percentiles
quantile(R.object$column.name, probs = 0.5, 0.75, 0.90)
this will show the values that are separating the bottom 50% from the top 50%, the bottom 75% from the top 25%, etc.
table()
make a table of a specific variable
check for missing values
is.na()
colSums
Reading from mySQL
mySQL is a free open source database software
How is data structured?
database: like a collection of data frames
table: a data frame inside of a database
field: like the columns of a data frame
each box corresponds to 1 data frame in R
each row is called a record
Connecting to an existing database server
dbConnect(MySQL)
dbConnect allows me to connect to a database server and the parameter inside specifies the type of database I want to connect to. I can use the function to connect to other types of databases
ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
the host address is the URL for the database I want off of the website
this command opens a connection to the sql server where the data is stored
result <- dbGetQuery(ucscDb, "showdatabases;"); dbDisconnect(uscsDB);
"show databases" is not an R command, it is a SQL command this is being run through the R function of dbGetQuery
be sure to disconnect from the server after getting the data
Connecting to a specific database on the server at a specified URL
hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome.mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
allows me to store all the tables that are within the specified database into an R object
all the tables can be thought of as an R data frame which represents a different data set
dbListFields(name.of.database, "name.of.data frame")
this command will list all the column names within the specified data frame
How to find out how many records (rows) are in a table?
dbGetQuery(name.of.database, "select count(
) from name.of.data.frame")*
Pulling a data frame from a database into R
R.object.name <- dbReadTable(name.of.database, "name.of.data frame")
Swirl: tidyr
gather()
use it when you notice columns that are not variables
gather(data frame, gives the column name I want to be in my tidy data set, gives names of all the columns to be gathered)
separate()
turns a single character column into multiple columns
spread()
key/value pair
a key explains what the information describes
a value contains the actual information
Editing Text Variables
if a letter in the name of a column is upper case, it's better to turn it into a lower case
tolower(names(data.set))
there is also a to upper command if I need to make all the letters uppercase
if column names have a period in them, I can split the text before and after the period into two components, removing the period in the process
strsplit(names(data.set), "\.")
if there are underscores in the column names and I want to remove them
sub("_", "", names(data.set), )
this command says that it will look through all column names of the data set and substitute the underscore for nothing whenever the underscore is found
note this only works when there is only 1 underscore in any of the column names. because sub will only remove the first underscore in a name
if multiple underscores
gsub()
Finding Values
grep("helix", dataframe$ column.name)
this command will go through the specifed column and look for the string that has been specified. it will return which rows in that column contain the string I'm looking for
grepl("helix", dataframe$column.name)
grepl will look for the string I've specified in the column I'm looking at and a return a TRUE for the row if it contains the string and a FALSE if it does not
grep and grepl can be useful way to subset data based on variables containing certain strings. I can either keep or get rid of what I'm singling out
Important notes about text
names of variables should be:
all lower case when possible
descriptive. be able to tell what the variable is by the name of it
not duplicated
no underscores, dots, or white spaces