Chapter 9: Data Integration
Join
A “join” combines two datasets with a shared identity value, such as a customer identifier.
After a join, one or more new rows are created, each containing customer information from dataset A on the left and website behavior from dataset B on the right.
Because a customer generally visits a website many times, or performs many noteworthy actions during each session (e.g., visiting multiple web pages), there are likely to be many rows in B containing this particular customer’s CustomerID.
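The one-to-many behavior described above can be sketched in plain Python. This is only an illustration: the sample rows, the `Name` and `Page` columns, and the helper `join_on` are hypothetical and not taken from the chapter.

```python
# Hypothetical sample data: dataset A (customers) and dataset B (website behavior).
customers = [
    {"CustomerID": 1, "Name": "Ana"},
    {"CustomerID": 2, "Name": "Ben"},
]
page_views = [
    {"CustomerID": 1, "Page": "/home"},
    {"CustomerID": 1, "Page": "/pricing"},
    {"CustomerID": 2, "Page": "/home"},
]


def join_on(left, right, key):
    """Pair every left row with every right row that has the same key value."""
    joined = []
    for l_row in left:
        for r_row in right:
            if l_row[key] == r_row[key]:
                # Columns from A come first, followed by the columns from B.
                joined.append({**l_row, **r_row})
    return joined


joined_rows = join_on(customers, page_views, "CustomerID")
for row in joined_rows:
    print(row)
```

Customer 1 appears twice in B, so the join produces two rows for that customer, each repeating the same customer information next to a different page view.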
Union
A union is based on the assumption that there are multiple columns in common between A and C.
A union stacks the datasets vertically, lining up the columns that contain similar information on top of one another.
For example, if a company has multiple customers, and their records are stored in various databases, a union will create one table containing all customers for that company.
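As a rough sketch of the customer example, a union can be written as a vertical stack of rows restricted to the shared columns. The two source lists, their column names, and the helper `union` are assumptions made for illustration.

```python
# Hypothetical customer records held in two separate databases (A and C).
customers_a = [
    {"CustomerID": 1, "Name": "Ana", "City": "Lima"},
    {"CustomerID": 2, "Name": "Ben", "City": "Oslo"},
]
customers_c = [
    {"CustomerID": 3, "Name": "Chen", "City": "Pune", "Phone": "555-0100"},
]


def union(first, second):
    """Stack rows vertically, keeping only the columns both tables share."""
    common = [col for col in first[0] if col in second[0]]
    return [{col: row[col] for col in common} for row in first + second]


all_customers = union(customers_a, customers_c)
for row in all_customers:
    print(row)
```

Unlike a join, the result grows downward: the number of columns stays the same and the rows of C are appended beneath the rows of A. The `Phone` column exists only in C, so this sketch drops it.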
Types of Joins
Inner Join
Each row in the left table is combined horizontally with any row in the right table that has the same identity value.
Left Outer Join
This join produces the same result as an inner join but also adds any rows from the left table that do not have corresponding rows in the right table.
Right Outer Join
This join produces the same result as an inner join but also adds any rows from the right table that do not have corresponding rows in the left table.
Full Outer Join
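This join produces the same result as an inner join but also adds the unmatched rows from both tables, i.e., the extra rows of a left outer join and of a right outer join. Matched rows appear only once.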
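The four join types can be compared side by side with a small plain-Python sketch. The function `join`, its `how` parameter, and the sample tables are hypothetical. Unmatched rows are padded with `None` for the columns of the other table.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is inner, left, right, or outer."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    rows = []
    matched_right = set()  # indexes of right rows that found a partner

    for l_row in left:
        partners = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        matched_right.update(partners)
        for i in partners:
            rows.append({**l_row, **right[i]})
        # Left and full outer joins keep left rows that have no partner.
        if not partners and how in ("left", "outer"):
            rows.append({**l_row, **{c: None for c in right_cols}})

    # Right and full outer joins keep right rows that have no partner.
    if how in ("right", "outer"):
        for i, r_row in enumerate(right):
            if i not in matched_right:
                rows.append({**{c: None for c in left_cols}, **r_row})
    return rows


# Hypothetical tables: ID 1 exists only on the left, ID 3 only on the right.
left_table = [{"ID": 1, "Name": "Ana"}, {"ID": 2, "Name": "Ben"}]
right_table = [{"ID": 2, "Page": "/home"}, {"ID": 3, "Page": "/cart"}]

for how in ("inner", "left", "right", "outer"):
    print(how, len(join(left_table, right_table, "ID", how)))
# prints: inner 1, left 2, right 2, outer 3 (one pair per line)
```

The full outer join returns three rows rather than five because the matched row for ID 2 is counted once, not once per join type.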
More About Joins
The full outer join is well suited to connecting different lists when some of their rows overlap and a shared identifier allows them to be integrated.
One common way to perform imputation is to assign the average age (of all rows in both test and training sets) to every row in which an age value is missing.
Calculating the average across both datasets is often better than doing so for just one of them, as it is more representative of the total population.
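A minimal sketch of this pooled-average imputation, assuming each row is a dict with an `Age` field and that `None` marks a missing value. The data and the helper `impute_mean_age` are hypothetical.

```python
from statistics import mean

# Hypothetical training and test rows; None marks a missing age.
train = [{"Age": 20}, {"Age": None}, {"Age": 40}]
test = [{"Age": 30}, {"Age": None}]


def impute_mean_age(*datasets):
    """Fill missing ages with the average of all known ages across every dataset."""
    known = [row["Age"] for data in datasets for row in data if row["Age"] is not None]
    average = mean(known)
    return [
        [{**row, "Age": average if row["Age"] is None else row["Age"]} for row in data]
        for data in datasets
    ]


train_filled, test_filled = impute_mean_age(train, test)
print(train_filled)
print(test_filled)
```

The known ages are 20, 40, and 30, so both missing values are filled with 30. The original lists are left unchanged because the function builds new rows.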