Big Data - week 6
- Kent (1982)
- Focus: relational database theory
- Did NOT…
-
- This paper focusses on 'data tidying'.
- What 'defines' tidy data? Each variable is a column, each observation is a row, and each type of observational unit is a table
- It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data
- Despite the amount of time it takes, there has been surprisingly little research on how to clean data well.
- What do you 'do' when you clean data? Activities range from outlier checking, to date parsing (converting text into proper date values), to missing value imputation.
- What is data tidying? Structuring datasets to facilitate analysis.
- The principles of tidy data are closely tied to those of relational databases and Codd’s relational algebra. Tidy datasets provide a standardised way to link the structure of a dataset (its physical layout) with its semantics (its meaning). This makes it easy for an analyst or a computer to extract the variables they need.
- Database normalisation: each fact is expressed in only one place.
- Other problems with untidy data: a more complicated situation occurs when the dataset structure changes over time. For example, the datasets may contain different variables, the same variables with different names, different file formats, or different conventions for missing values.
- Tidy data is only worthwhile if it makes analysis easier.
- Four verbs associated with data manipulation: Filter (subsetting or removing observations based on some condition), Transform (adding or changing variables), Aggregate (collapsing multiple values into a single value, for example by summing or taking means) and Sort (changing the order of observations).
- The paper uses a case study to illustrate how tidy data and tidy tools make data analysis easier by easing the transitions between manipulation, visualisation and modelling.
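To make the tidy-data definition and the four verbs above concrete, here is a minimal sketch in plain Python. The table, its column names (`person`, `treatment_a`, `treatment_b`) and the helper `tidy` are made up for illustration and are not taken from the paper.

```python
# Hypothetical "untidy" table: one row per person, one column per treatment.
# The headers treatment_a / treatment_b are really values of a variable
# ("treatment"), so each row holds two observations.
untidy = [
    {"person": "P1", "treatment_a": 4, "treatment_b": 2},
    {"person": "P2", "treatment_a": 16, "treatment_b": 11},
    {"person": "P3", "treatment_a": 3, "treatment_b": 1},
]


def tidy(rows, id_col, value_cols, var_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    out = []
    for row in rows:
        for col in value_cols:
            out.append({id_col: row[id_col], var_name: col, value_name: row[col]})
    return out


# Tidy form: each variable is a column, each observation is a row.
tidy_rows = tidy(untidy, "person", ["treatment_a", "treatment_b"],
                 "treatment", "result")

# Filter: keep only the observations that meet a condition.
only_a = [r for r in tidy_rows if r["treatment"] == "treatment_a"]

# Transform: add a new variable derived from an existing one.
transformed = [{**r, "result_doubled": r["result"] * 2} for r in tidy_rows]

# Aggregate: collapse multiple values into a single value (a sum per treatment).
totals = {}
for r in tidy_rows:
    totals[r["treatment"]] = totals.get(r["treatment"], 0) + r["result"]

# Sort: change the order of the observations (largest result first).
sorted_rows = sorted(tidy_rows, key=lambda r: r["result"], reverse=True)
```

Once the data is in the tidy form, each verb is a one-step operation over the rows. On the untidy layout, the same questions would need code that knows which column headers are really treatment names.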
-
- SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.
- You can think of a relational database as a collection of tables.
- A 'column' is also referred to as a field.
- SQL can be used to create and modify databases
- A query is a request for data from a database table (or combination of tables).
- In SQL, you can: (1) Select data from a table using a SELECT statement: write SELECT, then the name of the column you want, then FROM, then the name of the table that column belongs to, and end with a semicolon (SELECT column_name FROM table_name;). (2) Select multiple columns by separating the column names with commas. (3) Select ALL the columns of a table with SELECT * FROM followed by the table name. (4) Limit the number of results by adding LIMIT at the end of the query, for example LIMIT 10; to get only ten results.
- Try to make sure to: (1) write SQL keywords in uppercase to distinguish them from other parts of your query, like column and table names. (2) include a semicolon at the end of your query. This tells SQL where the end of your query is.
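The query forms described above can be tried out with Python's built-in `sqlite3` module and an in-memory database. This is only a sketch: the `people` table, its columns (`name`, `birth_year`) and the three rows are hypothetical, chosen just to have something to query.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# SQL can create and modify databases: create a table, then insert rows.
conn.execute("CREATE TABLE people (name TEXT, birth_year INTEGER);")
conn.executemany(
    "INSERT INTO people (name, birth_year) VALUES (?, ?);",
    [("Ada", 1815), ("Grace", 1906), ("Edgar", 1923)],
)

# Select a single column (field) from a table.
one_column = conn.execute("SELECT name FROM people;").fetchall()

# Select multiple columns by separating them with a comma.
two_columns = conn.execute("SELECT name, birth_year FROM people;").fetchall()

# Select all columns with *.
all_columns = conn.execute("SELECT * FROM people;").fetchall()

# Limit the number of rows returned.
limited = conn.execute("SELECT name FROM people LIMIT 2;").fetchall()

print(one_column)
print(two_columns)
print(limited)

conn.close()
```

Each result is a list of tuples, one tuple per row. Without an ORDER BY clause the row order is not guaranteed, so the sketch does not rely on it.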