Chapter 9: Data Integration

Important Information

Access to additional, relevant data leads to better predictions from algorithms, up to the point where more observations no longer help detect the signal, that is, the features or conditions that inform the target.

To obtain additional features of interest that we do not currently have, it will invariably be necessary to integrate data from different sources.

Two Methods of Data Integration

Joins: access more features. A join combines two datasets that share an identity value.

Unions: access more observations. A union is based on the assumption that the datasets have multiple columns in common; it lines up each column containing similar information on top of another.
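The contrast between the two methods can be sketched with a few lines of plain Python. The table and column names (`customers`, `spending`, `new_customers`, `customer_id`) are invented for illustration, and rows are represented as dictionaries rather than as tables in a real database.

```python
# Hypothetical toy data: each row is a dict keyed by column name.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
spending = [
    {"customer_id": 1, "total_spent": 120.0},
    {"customer_id": 2, "total_spent": 45.5},
]
new_customers = [
    {"customer_id": 3, "name": "Cy"},
]

# Join: match rows on the shared identity value and merge their columns.
spending_by_id = {row["customer_id"]: row for row in spending}
joined = [
    {**row, **spending_by_id[row["customer_id"]]}
    for row in customers
    if row["customer_id"] in spending_by_id
]

# Union: stack rows that have the same columns on top of each other.
unioned = customers + new_customers

print(len(joined), sorted(joined[0]))    # same number of rows, one more column
print(len(unioned), sorted(unioned[0]))  # one more row, same columns
```

The join makes the data wider (more features per case), while the union makes it longer (more cases).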

Joins

Two Types

Inner Join

Each row in the left table is combined horizontally with any row in the right table that has the same identity value. Rows without a match in the other table are left out.

Outer Join

Left outer join: This join produces the same result as an inner join but also adds any rows from the left table that do not have corresponding rows in the right table.

Right outer join: This join produces the same result as an inner join but also adds any rows from the right table that do not have corresponding rows in the left table.

Full outer join: This join produces the same result as an inner join, plus the unmatched rows that a left outer join and a right outer join would each add.

Inner joins are most commonly used when dealing with a carefully curated database.

Outer joins are most useful when the goal of a project is to collect as much data as possible.
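The four join types can be compared side by side in a minimal sketch. The `join` function, its `how` parameter, and the `customers` and `orders` tables below are hypothetical names chosen for this example; the code uses plain Python dictionaries and is not how any particular database implements joins. It also assumes the two tables share only the identity column.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is inner, left, right, or full."""
    # Column order: left table's columns first, then any new ones from the right.
    columns = list(dict.fromkeys(col for row in left + right for col in row))

    def pad(row):
        # Columns the row lacks are filled with None (a missing value).
        return {col: row.get(col) for col in columns}

    # Index the right table by identity value for quick lookup.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    result = []
    matched_keys = set()
    for left_row in left:
        matches = right_by_key.get(left_row[key], [])
        if matches:
            matched_keys.add(left_row[key])
            for right_row in matches:
                result.append(pad({**left_row, **right_row}))
        elif how in ("left", "full"):
            # Keep left rows that have no partner in the right table.
            result.append(pad(left_row))
    if how in ("right", "full"):
        # Keep right rows that have no partner in the left table.
        for right_row in right:
            if right_row[key] not in matched_keys:
                result.append(pad(right_row))
    return result


customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"customer_id": 2, "amount": 30},
    {"customer_id": 3, "amount": 15},
    {"customer_id": 4, "amount": 99},
]

for how in ("inner", "left", "right", "full"):
    ids = [row["customer_id"] for row in join(customers, orders, "customer_id", how)]
    print(how, ids)
```

With this toy data, only customers 2 and 3 appear in both tables, so the inner join returns two rows. The outer joins return more rows, with `None` standing in for the values that the other table could not supply.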

Unions

We perform unions when we have datasets that contain unique sets of cases sharing the same or very similar columns.

Great for connecting lists where there are overlaps between rows and shared identifiers.
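A union of two lists with very similar columns might look like the sketch below. The `union` function, its `rename` and `distinct` parameters, and the `store_a` and `store_b` lists are assumptions made for illustration. Removing rows that appear in both lists is one possible way to handle overlaps, not something the notes above prescribe.

```python
def union(first, second, rename=None, distinct=True):
    """Stack two lists of dict rows, aligning similar columns via `rename`."""
    rename = rename or {}
    # Rename columns in the second list so they line up with the first.
    aligned = [
        {rename.get(col, col): value for col, value in row.items()}
        for row in second
    ]
    columns = list(dict.fromkeys(col for row in first + aligned for col in row))
    stacked = [{col: row.get(col) for col in columns} for row in first + aligned]
    if not distinct:
        return stacked
    # Drop rows that are exact duplicates, keeping the first occurrence.
    seen = set()
    unique_rows = []
    for row in stacked:
        fingerprint = tuple(row[col] for col in columns)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)
    return unique_rows


store_a = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": "ben@example.com"},
]
store_b = [
    {"cust_id": 2, "email": "ben@example.com"},  # also present in store_a
    {"cust_id": 3, "email": "cy@example.com"},
]

combined = union(store_a, store_b, rename={"cust_id": "customer_id"})
print([row["customer_id"] for row in combined])
```

Passing `distinct=False` keeps every row from both lists, including the customer who appears in both.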