Chapter 9: Data Integration
Join: combines 2 datasets (A and B) with a shared identity value (e.g. a customer identifier)
After the join, new rows are created (each containing info from set A and set B)
Union: combines 2 datasets (A and C) that have multiple columns in common
lines up each column that contains similar information on top of each other - creating a new table
example: company has many customers and records are stored in various databases - union creates one table for all databases
Joins
Inner Join - Each row in left table is combined horizontally with any row in right table that has same identity value
produces a header row and two data rows
Outer Join
Left- produces same result as an inner join but also adds any rows from the left table that do not have corresponding rows in the right table
collect as much customer data as possible
produces a header row and three data rows
Right- produces same result as an inner join, but also adds any rows from the right table that do not have corresponding rows in the left table
most useful data related to project unit of analysis
produces a header row and three data rows
Full- produces same result as an inner join, plus the unmatched rows that a left outer join and a right outer join would each add
most useful data related to project unit of analysis
produces a header row and four data rows
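The four join types above can be sketched in plain Python. This is a minimal illustration, not the chapter's own example: the tables, the column names, and the `join` helper are hypothetical, and it assumes each identity value appears at most once per table. The ids are chosen so that the row counts match the notes (two, three, three, and four data rows). In pandas, the comparable operation is `DataFrame.merge` with `how="inner"`, `"left"`, `"right"`, or `"outer"`.

```python
# Hypothetical tables keyed by a shared customer identifier.
left = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
right = {2: {"city": "Oslo"}, 3: {"city": "Rome"}, 4: {"city": "Lima"}}


def join(left_table, right_table, how="inner"):
    """Combine two id-keyed tables horizontally; missing values become None."""
    if how == "inner":
        ids = [i for i in left_table if i in right_table]
    elif how == "left":
        ids = list(left_table)
    elif how == "right":
        ids = list(right_table)
    elif how == "full":
        ids = list(left_table) + [i for i in right_table if i not in left_table]
    else:
        raise ValueError(f"unknown join type: {how}")

    # Collect the column names contributed by each side.
    left_cols = sorted({c for row in left_table.values() for c in row})
    right_cols = sorted({c for row in right_table.values() for c in row})

    rows = []
    for i in ids:
        row = {"id": i}
        for c in left_cols:
            row[c] = left_table.get(i, {}).get(c)
        for c in right_cols:
            row[c] = right_table.get(i, {}).get(c)
        rows.append(row)
    return rows


inner_rows = join(left, right, "inner")
left_rows = join(left, right, "left")
right_rows = join(left, right, "right")
full_rows = join(left, right, "full")
print(len(inner_rows), len(left_rows), len(right_rows), len(full_rows))  # 2 3 3 4
```

Rows that exist on only one side keep their own values and get None for the other side's columns, which is what distinguishes the outer joins from the inner join.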
Unions
combine 2 data sets
have datasets that contain unique sets of cases sharing similar columns
use a union when the datasets do not overlap (no shared cases) but add data points; if the cases overlap, an outer join is more useful
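A union can be sketched as stacking rows and lining up columns by name. The records and the `union` helper below are hypothetical, and filling a column that one source lacks with None is an assumption of this sketch rather than something the notes specify. In pandas, `pandas.concat` plays a similar role.

```python
# Hypothetical customer records from two separate databases (no shared cases).
db_a = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": 51},
]
db_c = [{"id": 7, "name": "Cy"}]  # this source has no age column


def union(*tables):
    """Stack tables vertically, lining up columns by name."""
    columns = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Columns a row does not have are filled with None.
    return [{col: row.get(col) for col in columns} for table in tables for row in table]


customers = union(db_a, db_c)
print(len(customers))  # 3
```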
Imputation: replacing of missing data in a column w/ assignment of reasonable values
e.g. assign the average age to every row in which the age value is missing
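The age example can be sketched as mean imputation. The records and the `impute_mean` helper are hypothetical, and using None to mark a missing value is an assumption of this sketch.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
rows = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
]


def impute_mean(records, column):
    """Return copies of the records with missing values in `column` set to the column mean."""
    known = [r[column] for r in records if r[column] is not None]
    fill = mean(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


imputed = impute_mean(rows, "age")
print(imputed[1]["age"])  # 40
```

The original records are left unchanged, so the imputed and raw versions can be compared.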
Unions are commonly used when you have datasets containing training and test cases for machine learning
Ensure that all modifications are shared between training and evaluation data
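One way to keep modifications shared is to union the training and test cases, modify the combined table once, and then split it again. The sketch below assumes hypothetical data, a hypothetical `partition` tag column, and a fill value computed over the combined table; the notes do not say how the fill value should be chosen.

```python
# Hypothetical training and test cases with the same columns.
train = [{"id": 1, "age": 20}, {"id": 2, "age": None}, {"id": 3, "age": 40}]
test = [{"id": 4, "age": None}, {"id": 5, "age": 60}]

# Union: stack both sets, tagging each row with its origin so it can be split later.
combined = [{**r, "partition": "train"} for r in train] + [
    {**r, "partition": "test"} for r in test
]

# Apply one modification to the whole table: fill missing ages with the overall mean.
known_ages = [r["age"] for r in combined if r["age"] is not None]
fill_value = sum(known_ages) / len(known_ages)
for r in combined:
    if r["age"] is None:
        r["age"] = fill_value

# Split back into training and test sets; both now carry the same imputed value.
train_clean = [r for r in combined if r["partition"] == "train"]
test_clean = [r for r in combined if r["partition"] == "test"]
print(fill_value)  # 40.0
```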