Please enable JavaScript.
Coggle requires JavaScript to display documents.
Chapter 9: Data Integration (Join (2 types: (Outer Join (Left outer join…
Chapter 9: Data Integration
access to additional and relevant data leads to better predictions from algorithms until we reach the point where more obs (cases) are no longer helpful to detect the signal, the feature(s) or conditions that inform the target
can look for features of interest we don't have: will be necessary to integrate data from different sources
Join
ex: might join row containing customer record (CustomerID, visit time, products purchased, etc.) with login info on a website (CustomerID, visit time, products purchased, etc)
after a join, one or more new rows are created, each containing customer info from A on the left and websit behavior from B on the right
if only 1 customer record in A and 50 rows containing that customer's various behaviors in B, the join will result in 50 rows represented as a new table: D.
likely to be many rows in B containing the customer's Customer ID (customers generally visit website many times/ has noteworthy behaviors during a session)
2 types:
Inner Join
each row in left table combined horizontally with any row in the right table that has the same identity value
join based on CustomerID, both Customers and Orders tables contain, the customer info and order info for each customer is combined into a 3rd table, containing repeated info on the customer for each order they've made
produces a header row and 2 data rows:
CustId, Name, CustId, Product / 1, Kai, 1, pants / 1, Kai, 1, bouncing castle
Outer Join
Left outer join
produces same result as inner join but adds any rows from left table that don't have corresponding rows in right table
below the rows outlined in the above cell, we also get rows containing customers who never placed an order. for them, the order columns will contain "nulls"
left outer joins in this case may be used if goal is to find out why some customers place orders and some do not
1, Kai, 1, pants / 1, Kai, 1, bouncing castle / 2, Dan, NULL, NULL
Right outer join
produces same result as inner join but also adds any rows from right table that don't have corresponding rows in left table
in customer and order case, this join might be used to combine orders or near-orders, such as common problem of abandoned shopping carts.
these shopping carts wil be filled by an anonymous user and registered in our system as incomplete order when the potential customer leaves the sit without registering as a customer and paying for the products
header row and 3 data rows:
Cust ID, Name, CustID, Product / 1, Kai, 1, pants / 1, Kai, 1, bouncing castle / NULL, NULL, 3, sunglasses
Full outer join
produces same results as inner + left outer + right outer
any customer without orders is added to resulting table, and any order not connected to customer is also added to resulting table
header row and 4 data rows:
Cust Id, Name, Cust Id, Product / 1, Kai, 1, pants / 1, Kai, 1, bouncing castle / 2, Dan, NULL, NULL / NULL, NULL, 3, sunglasses
Union
lines up each column containing similar similar info on top of another (ex: zip code existing in A and C), creating a new table: #
ex: if a company has multiple customers, and records are stored in various databases, union will create one table containing all customers
allows us to combine 2 datasets
we perform unions when we have datasets that contain unique sets of cases sharing the same or very similar columns
ex: to create consistent system of tracking entirety of college's interactions, different data sources need to be integrated
full outer join is great for connecting different lists if there are overlaps between rows and a shared identifier to allow for integration
if no overlap but additional data points a union is more appropriate tool for combining lists that consist of different people
ex: union of lists of parishioners at 20 different churchers, synagogues, mosques. use of the union in this situation makes clear which individuals on combined list are the same people.
when union is always the correct tool:
ex: dataset containing training and test cases for machine learning: training cases used to train model, test set allows someone else to rate the performance of the model
before making modifications to data, integrate training and test data sets to be sure modifications are shared between training and evaluation data (don't develop separate processes)
dataset requires imputation (replacing of missing data in a column with assignment of reasonable values) of ages
assign average age to every column in which an age value is missing