Data Integration

Joins continued

Processes of data integration

Joins

Unions

combines datasets with shared values

creates one or two rows containing information that was combined

some rows will have repeat information

Assumes there are multiple columns in common between "A-C"

combines columns into a new table

check example

think of tables as matrices

two types of joins

outer join

Inner join

Left outer join

Right outer join

Full outer join

rows in the left are combined with any row in the right with the same identity

makes a 3rd table with the repeated information

same as inner join but adds rows from the right table that aren't corresponding in the right table

could find why some customers have orders and some do not

produces the same as inner, outer left, and outer right joins

produces the same result as inner joint but adds rows that aren't corresponding in the left table

could help find cart abandonment issues

inner joins are most commonly used

when trying to find the most information possible use left outer joins

inner and left outer joins will take care of most needs

math is the best way to learn joins

see examples

Unions

use unions when datasets have unique cases of similar values

use full outer join to connect lists that overlap between rows

if there is no overlap then use a union

Always use unions with training and test cases

avoid developing separate processes