Larsen Chapter 8: Accessing and Storing Data
8.1: Track Down Relevant Data
Review decisions made when defining project objectives
What is the business problem we are examining?
Scan through available data sources & inspect the ones that may be potentially useful for the project's context
How do we access this data?
Internal Data (often found in a database table)
Ex. Info on customers (addresses & payment information) or account history (length of customer relationship, transactional data, and web logs)
What is the unit of analysis and prediction target?
Goal of tracking relevant data
Access one or more data sources and turn each source into a matrix (create tables for the data)
Detect and remove columns of data within your research that are not helpful to the analysis
Each type of data comes with an "identity" field enabling the eventual integration of the data with other tables
The "identity" field is a column that uniquely identifies each case (row); in databases it is referred to as a primary key
Identity columns must exist in 2 or more tables to be used for integration of data (in or out of the organization!)
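A minimal sketch of the identity-field idea, assuming two hypothetical tables (customers and transactions) that share a made-up customer_id column. All table and column names are illustrative only. It checks which columns the two tables have in common and whether a column uniquely identifies each row, which is what makes it usable as a primary key.

```python
# Hypothetical tables: each row is a dict, each key is a column name.
customers = [
    {"customer_id": 1, "city": "Boulder"},
    {"customer_id": 2, "city": "Denver"},
]
transactions = [
    {"transaction_id": 10, "customer_id": 1, "amount": 25.0},
    {"transaction_id": 11, "customer_id": 1, "amount": 40.0},
    {"transaction_id": 12, "customer_id": 2, "amount": 15.5},
]


def is_primary_key(rows, column):
    """Return True if `column` has a distinct, non-missing value in every row."""
    values = [row.get(column) for row in rows]
    return None not in values and len(set(values)) == len(values)


def shared_columns(rows_a, rows_b):
    """Return the column names that appear in both tables."""
    cols_a = set().union(*(row.keys() for row in rows_a))
    cols_b = set().union(*(row.keys() for row in rows_b))
    return cols_a & cols_b


# customer_id appears in both tables, so it can link them.
print(shared_columns(customers, transactions))
# It is unique per row in customers (primary key there) ...
print(is_primary_key(customers, "customer_id"))
# ... but repeats in transactions, where it only points back to a customer.
print(is_primary_key(transactions, "customer_id"))
```

In this sketch customer_id is the primary key of the customers table, while transaction_id plays that role for transactions.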
APIs (Application Programming Interfaces)
API definition: communication protocols that allow you to connect to an external provider of data and submit parameters for the kinds of data you want to download
An analyst can pay for access to external data or internal data (APIs)
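As a sketch of what "parameters" means in an API request, the snippet below builds a request URL for a made-up data provider. The host api.example.com, the /v1/weather path, and every parameter name are assumptions for illustration, not a real service, and no network request is made.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real provider documents its own URL.
BASE_URL = "https://api.example.com/v1/weather"


def build_request_url(base_url, params):
    """Attach the query parameters describing the data we want to download."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical parameters: which location and date range to download.
params = {
    "zip": "80302",
    "start": "2020-01-01",
    "end": "2020-01-31",
    "api_key": "YOUR_KEY",  # placeholder for a paid or free access key
}

url = build_request_url(BASE_URL, params)
print(url)
```

Sending the request and parsing the response depend on the provider, so those steps are left out here.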
8.2 Examine Data and Remove Columns
Transform your data into one table of relevant data for Machine Learning Analysis
Locate a tool that allows you to interact with the data effectively (Alteryx is preferable to Excel if your data exceed 200K items)
Initial Data Exploration
Understand the content of your data in relation to the project objective
Remove any column that is not relevant to either combining the files or predicting the target
If in doubt about whether a column is relevant, keep the column and allow AutoML to guide you
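The column-removal rule above can be sketched as follows, assuming a hypothetical table in which customer_id is needed for joining, churned is the prediction target, and fax_number has been judged irrelevant. All column names are made up for illustration.

```python
def drop_columns(rows, columns_to_drop):
    """Return a copy of the table without the listed columns."""
    unwanted = set(columns_to_drop)
    return [
        {col: val for col, val in row.items() if col not in unwanted}
        for row in rows
    ]


# Hypothetical table: each row is a dict keyed by column name.
table = [
    {"customer_id": 1, "fax_number": "n/a", "tenure_months": 14, "churned": 0},
    {"customer_id": 2, "fax_number": "n/a", "tenure_months": 3, "churned": 1},
]

# fax_number helps neither with combining files nor with predicting the target.
# tenure_months is kept: when in doubt, leave the column for AutoML to assess.
cleaned = drop_columns(table, ["fax_number"])
print(cleaned[0])
```

The original table is left unchanged, so a dropped column can still be restored if it turns out to matter.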
8.3 Example Dataset
Key Learning Point: regardless of the source of your data, you must be able to connect the data together from the different sources (relationships & connections)
Recognize the connections one at a time, then combine them into a single table one at a time
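A minimal sketch of combining tables one at a time, assuming three hypothetical tables (orders, customers, products) with made-up column names. Each step is a left join on a single shared identity column, and each right-hand table is assumed to have at most one row per key value.

```python
def left_join(left_rows, right_rows, key):
    """Left-join two tables on `key`; the right table must have unique keys.

    Left rows without a match are kept as they are (no right-hand columns added).
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        joined.append({**match, **row})  # left values win on name clashes
    return joined


# Hypothetical source tables.
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": "A"},
    {"order_id": 101, "customer_id": 2, "product_id": "B"},
]
customers = [
    {"customer_id": 1, "city": "Boulder"},
    {"customer_id": 2, "city": "Denver"},
]
products = [
    {"product_id": "A", "category": "books"},
    {"product_id": "B", "category": "games"},
]

# Connection 1: orders <-> customers via customer_id.
step_one = left_join(orders, customers, "customer_id")
# Connection 2: the result <-> products via product_id.
combined = left_join(step_one, products, "product_id")
print(combined[0])
```

Joining one connection at a time makes it easier to confirm after each step that the row count is still what you expect.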