Please enable JavaScript.
Coggle requires JavaScript to display documents.
Chapter 9: Data Integration (Joins, there are two types (Inner join- each…
Chapter 9: Data Integration
Access to additional data will lead to better algorithm predictions until we reach the point where more observations are no longer helpful to detect the signal.
As we gain more data streams though, we must integrate and put them all in one place. We can do this through joins and unions
Joins- used to access more features
Unions- Used to access more observations
A join combines two datasets with a shared identity value, such as customer ID
Ex of join: you may wanna join the row containing customer address with a customers login information
After a join, one or more new rows are created, and in our ex: each contains customer adress and customer login information
A union is similar to join, except it's column based, not row based. Union assumes there's similar info in the columns of two tables
A union lines up each column containing similar information on top of each other (zip code existing in A and C, for example) to create a new table.
Joins, there are two types
Inner join- each row in the left table is combined horizontally with any row in the right table that has the same identity value
Then there's join, which has 3 subsets: left outer join, right outer join, full outer join
The inner join produces a header row and two data rows. Ex:
CustId, Name, CustID, Product
1, kai, 1, pants
1, kai, 1, bouncing castle
Left outer join- produces same result as an inner join, but also adds any rows from the left table that don't have corresponding rows in the right table. We'll also get rows containing customers who never placed an order, with accompanying NULLs.
Left outer join ex: produces a header row and three data rows:
CustId, Name, CustID, Product
1, kai, 1, pants
1, kai, 1, bouncing castle
2, Dan, Null, Null
Right outer join: produces the same result as an inner join, but also adds any rows from the right table that do not have corresponding rows in the left table. Would be used when you don't have a customer ID for a product (ex: someone grabs a product then leaves without buying it, abandoned shopping cart
Right outer join ex: produces a header row and three data rows:
CustId, Name, CustID, Product
1, kai, 1, pants
1, kai, 1, bouncing castle
2, NULL. NULL, 3, Sunglasses
Full outer join- produces the same result as an inner join + left outer join, + a right outer join
Full outer join example: produces a header row and four data rows:
CustId, Name, CustID, Product
1, kai, 1, pants
1, kai, 1, bouncing castle
2, Dan, NULL, NULL
2, NULL. NULL, 3, Sunglasses
We use inner join most often when dealing with carefully curated databases. Most common.
If we are in a project with the goal being to collect as much customer data as possible, then the left outer join is the most useful. We'll user inner join and left outer join almost every time
Unions allow us to combine two datasets
We perform unions when we have datasets that contain unique sets of cases sharing the same or very similar columns
For example, a university might be tracking a students grades, meal swipes, and tuition payments in different lists of data
The full outer join is great for connecting these different lists for the university IF there are overlaps between rows AND a shared identifier to allow for integration
If there's no row overlab, but simply additional data points, a union is a more appropriate tool for combing lists that consist of different people
Always wanna use a union- When looking at a dataset containing training cases and test cases for machine learning. training cases train model and test cases rate the model's performance. before you modify the data, you have to integrate the data be sure sure that all modifications are shared between training and test data
Imputation- replacing of missing data in a column with the assignment of reasonable values. often done by replacing empty columns with the average age.