Please enable JavaScript.
Coggle requires JavaScript to display documents.
Data Integration - Chp. 9 ("Union" (can use imputation…
Data Integration - Chp. 9
"Joins"
combines two datasets with a shared identity value (ex: customer identifier)
ex: you might join the row containing your customer record with your login info on a website
after this, one or more new rows are created containing customer info from A on the left and website behavior from B on the right
if only one customer record in A, and 50 rows containing that customer's various behaviors in B, the JOIN will result in 50 rows represented as a new table, D
several types of "joins"
Inner Join and Outer Join are most relevant here
Inner Join
each row in left table combined horizontally w/any row in the right table that has the same identity value and the tables are combined into one new one w/repeated info
excludes "NULLs"
produces header row and two data rows
most commonly used for our data integration process
Outer Join
Left outer join
produces same result as an inner join but also adds any rows from the left table that don't have corresponding rows in the right table
some cells will be empty and will be shown as "NULL"
produces header row and three data rows
most useful when we need as much customer data as possible
right outer join
produces same result as inner join but adds any rows from the right table that don't have corresponding rows in the left table
produces header row and three data rows
full outer join
produces same results as an inner join + a left outer join + a right outer join
produces a header row and four data rows
great for connecting diff lists if there are overlaps between row and a shared identifier to allow for integration
"Union"
based on assumption that there are multiple columns in common between A and C
lines up each column with similar info on top of another (created table E)
allow us to combine two datasets; usually used when there are two datasets w/unique sets of cases sharing the same or similar columns
having the aggregate of a given person's info available w/i a single system is critical to performing high quality analytics and machine learning
can use imputation
imputation is the replacing of missing data in a column with the assignment of reasonable values
one common way to perform it is to assign for example, the avg age (of all rows in both test and training sets) to every column in which an age value is missing...calculating the avg for both datasets is often better than doing so for just one of them bc it is more representative of the total population(to generalize, more data points produces a more accurate avg)