Course 3: Getting and Cleaning Data (Reading from mySQL (Connecting to an…

- - - - a strange binary file my measurement machine spits out
      - an unformatted Excel file with 10 worksheets that the company you contracted with sent you
      - the complicated JSON file I got from scraping Twitter's API
      - the hand-entered numbers I collected while looking through a microscope
- - - - improves reproducibility compared with downloading the data by hand
      - parameters:
        
        url
        
        destfile
        
        method
      - useful for downloading tab-delimited files, CSV files, and others
    - - method = "curl" has to be specified on a Mac when the URL starts with https
    - - dateDownloaded <- date()
        
        records the current date in the object dateDownloaded
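A minimal sketch of the download step described above, using base R's download.file() with the url, destfile, and method parameters. The helper name fetch_data and the tiny CSV are hypothetical; to stay runnable offline the example downloads from a local file:// URL on a Unix-like path, which is an assumption and stands in for a real web address.

```r
# Hypothetical helper: download a file and return the download date.
fetch_data <- function(url, destfile, method = "auto") {
  # Create the destination folder if it does not exist yet.
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  download.file(url, destfile = destfile, method = method, quiet = TRUE)
  date()  # the current date and time as a character string
}

# Stand-in for a remote file: a small CSV written to a temporary location.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.7)), src, row.names = FALSE)

dest <- file.path(tempdir(), "data", "example.csv")
dateDownloaded <- fetch_data(paste0("file://", src), dest)

downloaded <- read.csv(dest)
```

For an https address on a Mac, the same helper would be called with method = "curl", as noted above.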
- - - - read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)
    - - to read only specific rows and columns, pass colIndex and rowIndex as parameters to read.xlsx()
- - - - iris is a data set built into R. This command writes the data frame, iris, out as JSON. Good for exporting data that will be used by an API
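A minimal sketch of exporting iris as JSON and reading it back. The notes do not name the package, so the use of jsonlite (toJSON and fromJSON) is an assumption, and the object names iris_json and iris_back are made up for illustration.

```r
library(jsonlite)  # assumed package; not named in the notes

# Convert the built-in iris data frame to a JSON string.
iris_json <- toJSON(iris, pretty = TRUE)

# Convert the JSON back into a data frame to check the round trip.
iris_back <- fromJSON(iris_json)
```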
- - - - the actual text of the document
    - - labels that give the text structure
  - - - correspond to general labels
      - start tags
        
        <section>
      - end tags
        
        </section>
      - empty tags
        
        <line-break />
    - - <Greeting> Hello, world </Greeting>
      - are specific examples of tags
    - - are components of the label
      - <img src="jeff.jpg" alt="instructor" />
      - <step number="3"> Connect A to B. </step>
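A minimal sketch that pulls apart the two attribute examples above to show the tag name, the attributes, and the text content. It assumes the xml2 package, which is a stand-in choice and not something the notes specify.

```r
library(xml2)  # assumed package; the notes do not name one

# An element with one attribute and some text content.
step <- read_xml('<step number="3"> Connect A to B. </step>')
step_name   <- xml_name(step)               # the tag name
step_number <- xml_attr(step, "number")     # the value of one attribute
step_text   <- xml_text(step, trim = TRUE)  # the text content

# An empty tag that carries only attributes.
img <- read_xml('<img src="jeff.jpg" alt="instructor" />')
img_attrs <- xml_attrs(img)  # named character vector of all attributes
```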
- - - - this wraps the original data frame so that it prints more compactly on the console
  - - - keeps only the variables I mention, so I can easily select the columns I'm interested in by name
      - select(R.object, wanted.column.1, wanted.column.2)
        
        this command selects 2 columns by name; I can list as many as I want
      - select(R.object, column.1:column.8)
        
        this command lets me select every column from column.1 through column.8 of the data frame
        
        works the same way as specifying a sequence of numbers, only this does it with columns in my data frame
      - select(R.object, -column.name)
        
        this discards the column whose name has the negative sign in front of it
    - - filter(R.object, column.name == "value")
      - pulls out the rows based on a value in a particular column
      - I can also put in as many conditions as I want, separated by commas, to control which rows filter pulls out; a row is kept only when 1 condition AND every other condition are true
      - can also have conditions where a row is kept when either one condition OR another is true
        
        filter(cran, country=="US" | country=="IN")
    - - orders the rows of a data set according to the values of a particular variable
      - arrange(R.object, column.name)
        
        will arrange the rows of the data set based on the column.name being in ascending order
      - arrange(R.object, desc(column.name))
        
        arranges the rows so that the column.name is in descending order
      - arrange(R.object, column.name1, column.name2)
        
        this will arrange the rows by column.name1 in ascending order first, then break ties by column.name2
    - - creates a new variable based on the values of one or more variables already in the data set
      - mutate(R.object, new.column.name = expression), where the expression computes a new value from the old values; the result is saved as a new column in the data set
    - - collapses the data set to 1 row
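A minimal sketch of the five verbs above on a small made-up data frame. It assumes these notes describe the dplyr package (the function names match); the columns package and size and all the values are hypothetical, and only cran and country echo the filter example above.

```r
suppressPackageStartupMessages(library(dplyr))  # assumed package

# Hypothetical data: one row per package download.
cran <- data.frame(
  package = c("a", "b", "c", "d"),
  country = c("US", "IN", "DE", "US"),
  size    = c(100, 250, 50, 300),
  stringsAsFactors = FALSE
)

selected  <- select(cran, package, size)                     # keep 2 columns by name
no_size   <- select(cran, -size)                             # drop 1 column
us_or_in  <- filter(cran, country == "US" | country == "IN") # OR condition
big_us    <- filter(cran, country == "US", size > 200)       # comma means AND
sorted    <- arrange(cran, desc(size))                       # descending order
with_half <- mutate(cran, half_size = size / 2)              # new column
overall   <- summarize(cran, avg_size = mean(size))          # collapses to 1 row
```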
- - - - any operation I apply to the grouped data will take place per group, according to the grouping specified
      - breaks up the data set into groups of rows based on the values of one or more variables
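A minimal sketch of grouping, assuming the notes refer to dplyr's group_by(). The data frame and its columns are hypothetical. Once the data is grouped, summarize() returns one row per group instead of a single row.

```r
suppressPackageStartupMessages(library(dplyr))  # assumed package

# Hypothetical data: one row per package download.
cran <- data.frame(
  package = c("a", "b", "c", "d"),
  country = c("US", "IN", "DE", "US"),
  size    = c(100, 250, 50, 300),
  stringsAsFactors = FALSE
)

# Break the rows into groups by country, then summarize each group.
by_country <- group_by(cran, country)
totals     <- summarize(by_country, total_size = sum(size), downloads = n())
```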
- - - - sort(R.object$variable.1)
  - - - R.object[order(R.object$var1, R.object$var2),]
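A minimal sketch of the two commands above in base R. The data frame df and its values are made up: sort() returns the sorted values of one variable, while order() gives row positions that can reorder the whole data frame.

```r
# Hypothetical data frame with a tie in var1.
df <- data.frame(var1 = c(3, 1, 2, 1), var2 = c(10, 40, 30, 20))

# Sort a single variable (returns a vector, not a data frame).
sorted_var1 <- sort(df$var1)

# Reorder all rows by var1, breaking ties with var2.
ordered_df <- df[order(df$var1, df$var2), ]
```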
- - - - this will show the values that are separating the bottom 50% from the top 50%, the bottom 75% from the top 25%, etc.
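A minimal sketch of those cut-off values, assuming the command in question is base R's quantile(). The vector x is made up; by default quantile() reports the 0%, 25%, 50%, 75%, and 100% points, and probs asks for specific ones.

```r
# Hypothetical measurements.
x <- c(1, 2, 3, 4, 5)

# Default: minimum, quartiles, and maximum.
q_default <- quantile(x)

# Only the points I ask for.
q_custom <- quantile(x, probs = c(0.5, 0.75))
```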
- - - - table: a data frame inside of a database
        
        field: like the columns of a data frame
  - - - dbConnect lets me connect to a database server; the driver parameter specifies the type of database I want to connect to, so I can use the same function to connect to other types of databases
    - - the host address is the address of the database server, taken from the website that hosts the data
      - this command opens a connection to the SQL server where the data is stored
    - - "show databases" is not an R command; it is a SQL command that is run through the R function dbGetQuery
  - - - allows me to store all the tables that are within the specified database in an R object
      - each table can be thought of as an R data frame that represents a different data set
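A minimal sketch of the connect, list tables, and query pattern. These notes are about a MySQL server, which needs a host and credentials, so this example swaps in an in-memory SQLite database through the DBI and RSQLite packages so that it runs without a server. That substitution, the table name, and the query are all assumptions for illustration.

```r
library(DBI)  # assumed packages: DBI and RSQLite (stand-ins for a MySQL driver)

# The driver passed to dbConnect decides which type of database is used.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Put a data frame into the database as a table.
dbWriteTable(con, "iris", iris)

# Store the names of all tables in the database in an R object.
tables <- dbListTables(con)

# Run a SQL command (not an R command) through dbGetQuery.
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM iris")

# Close the connection when done.
dbDisconnect(con)
```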
- - - - strsplit(names(data.set), "\\.")
    - - sub("_", "", names(data.set))
        
        this command looks through all column names of the data set and replaces the underscore with nothing wherever an underscore is found
        
        note this only works fully when there is at most 1 underscore in each column name, because sub only removes the first underscore in a name
      - if a name has multiple underscores, use
        
        gsub()
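A minimal sketch of the three commands above in base R on made-up column names. The period has to be escaped as "\\." because a bare period in a regular expression matches any character.

```r
# Hypothetical column names.
cols <- c("Location.1", "test_score_math", "zip_code")

# Split each name at the period.
parts <- strsplit(cols, "\\.")

# sub() removes only the first underscore in each name.
first_only <- sub("_", "", cols)

# gsub() removes every underscore in each name.
all_removed <- gsub("_", "", cols)
```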
  - - - this command goes through the specified column and looks for the string that has been specified. It returns the indices of the rows in that column that contain the string I'm looking for
    - - grepl looks for the string I've specified in the column I'm looking at and returns TRUE for a row if it contains the string and FALSE if it does not
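A minimal sketch of the difference between the two searches, assuming the first command above is base R's grep(). The street names are made up: grep() returns row positions, while grepl() returns one TRUE or FALSE per row, which is handy for subsetting.

```r
# Hypothetical column of street names.
streets <- c("Alameda Ave", "Main St", "Alameda Blvd", "Oak St")

# Positions of the rows that contain the string.
hit_positions <- grep("Alameda", streets)

# One logical value per row.
hit_flags <- grepl("Alameda", streets)

# Keep only the rows that do NOT contain the string.
others <- streets[!hit_flags]
```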
  - - - all lower case when possible
      - descriptive: I should be able to tell what the variable is from its name
      - not duplicated
      - no underscores, dots, or white spaces
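A minimal sketch of applying the naming rules above with base R. The raw names are made up, and the clean-up covers only the mechanical rules (lower case, no underscores, dots, or spaces, no duplicates); choosing descriptive names is still up to me.

```r
# Hypothetical raw column names.
raw_names <- c("Zip_Code", "Test.Score", "Street Name")

# Remove underscores, dots, and spaces, then convert to lower case.
clean_names <- tolower(gsub("[_. ]", "", raw_names))

# Check that no name is duplicated after cleaning.
has_duplicates <- anyDuplicated(clean_names) > 0
```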