Chapter 8. Accessing and Storing Data
8.1 Track Down Relevant Data
When tracking down relevant data, it is important to review decisions made when defining project objectives.
These decisions will inform further decisions about the data needed.
Most data sources require additional research on how to access the data, possibly with help from an information technology staff member and support from a subject matter expert.
Many analytics problems start with internal data, often found in a database table containing customer info, patient records, financial market information, marketing spend, or Internet of Things records.
Each type of data generally comes with an "identity" field enabling the eventual integration of the data with other tables.
An identity field is a column that uniquely identifies each case (row); in databases it is referred to as a primary key.
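A quick way to test whether a column can serve as an identity field is to check that every row has a value for it and that no value repeats. The following is a minimal sketch using only the Python standard library; the `customers` records and the helper name `is_identity_field` are hypothetical and serve only as an illustration.

```python
def is_identity_field(rows, column):
    """Return True if `column` has a non-empty, unique value in every row."""
    values = [row.get(column) for row in rows]
    # A primary key may not be missing or blank in any row.
    if any(v is None or v == "" for v in values):
        return False
    # A primary key may not repeat: distinct values must equal total rows.
    return len(set(values)) == len(values)


# Hypothetical customer records (illustration only, not from the chapter).
customers = [
    {"customer_id": 1, "name": "Ana", "region_code": "N"},
    {"customer_id": 2, "name": "Ben", "region_code": "S"},
    {"customer_id": 3, "name": "Ana", "region_code": "N"},
]

id_ok = is_identity_field(customers, "customer_id")
name_ok = is_identity_field(customers, "name")
```

Here `customer_id` passes the check, while `name` does not because two customers share the same name.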
8.2 Examine Data and Remove Columns
For each of your tables or files, carefully examine their column names, data types, and number of rows.
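As one possible way to carry out this inspection, the sketch below lists column names, declared data types, and the row count for a table. It assumes the data sits in a SQLite database and uses Python's built-in `sqlite3` module; the `customers` table and the helper `describe_table` are hypothetical.

```python
import sqlite3


def describe_table(conn, table):
    """Return ([(column_name, declared_type), ...], row_count) for a table.

    The table name is interpolated into the SQL text, so it must come
    from trusted code, never from user input.
    """
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    columns = [(row[1], row[2]) for row in info]
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return columns, row_count


# Hypothetical in-memory table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, region_code TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "N"), (2, "Ben", "S")],
)

columns, row_count = describe_table(conn, "customers")
```

The same three facts (names, types, row count) are worth recording for every table or file before any columns are removed.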
For each file you have, now is the time to remove any column that is clearly not relevant to either combining the files or predicting the target.
For example, if you are predicting sales, the geographic region is important, but if you already have the identifier (code) for a region, adding the name of that region is not likely to add further predictive ability (though information on longitude and latitude might).
While understanding what data may help us predict a target takes time and experience, there may be data in your tables or files that are clearly irrelevant to the project goal.
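Removing a clearly redundant column can be sketched as follows. The example assumes rows are held as Python dictionaries, and the `sales` records, including the `region_name` column that duplicates `region_code`, are hypothetical.

```python
def drop_columns(rows, columns_to_drop):
    """Return new rows without the named columns; the input is left unchanged."""
    drop = set(columns_to_drop)
    return [
        {name: value for name, value in row.items() if name not in drop}
        for row in rows
    ]


# Hypothetical sales records (illustration only).
sales = [
    {"sale_id": 1, "region_code": "N", "region_name": "North", "amount": 120.0},
    {"sale_id": 2, "region_code": "S", "region_name": "South", "amount": 80.0},
]

# region_name carries the same information as region_code, so it is dropped.
trimmed = drop_columns(sales, ["region_name"])
```

Returning new rows instead of editing in place keeps the original data available in case a dropped column later turns out to be useful.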
Regardless of the source of your data, you must be able to connect the data from the different sources.
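Connecting two sources usually comes down to matching rows on a shared identity field. The sketch below shows a simple inner join in plain Python, assuming the key is unique in the second source; the `sales` and `customers` records and the helper `join_on_key` are hypothetical.

```python
def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on `key`.

    Assumes `key` is unique within `right_rows` (it acts as a primary key).
    Left rows without a match are skipped. On a column-name clash the
    value from the left row is kept.
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**match, **row})
    return joined


# Hypothetical records from two sources (illustration only).
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
sales = [
    {"sale_id": 10, "customer_id": 1, "amount": 25.0},
    {"sale_id": 11, "customer_id": 2, "amount": 40.0},
    {"sale_id": 12, "customer_id": 99, "amount": 15.0},  # no matching customer
]

combined = join_on_key(sales, customers, "customer_id")
```

The sale with `customer_id` 99 has no matching customer and is left out of `combined`; counting such unmatched rows is a useful check on how well the sources line up.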
8.3 Example Dataset
The example dataset is a small-scale company database containing data ranging from customer and employee information to products and sales.
This database will reappear throughout the section as a way of gaining experience with the process of preparing data for machine learning.
For each table, there is a set of attributes that we will simply call column names for the time being.
The tables in the database diagram are tied together by relationships, each anchored in this case by a key symbol on one side and an infinity sign on the other.
The infinity sign denotes the "many" side, which, in this case, means that for each record on the "one" side, there may be many records related to it in the connected table.
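A one-to-many relationship of this kind can be sketched with Python's built-in `sqlite3` module. The `customers` and `orders` tables below are hypothetical stand-ins, not the tables of the example database: `customers` is the "one" (key) side and `orders` is the "many" (infinity) side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# "One" side: each customer appears exactly once, identified by a primary key.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")

# "Many" side: each order points back to one customer through a foreign key.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(customer_id), "
    "total REAL)"
)

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Count the related "many" rows for each row on the "one" side.
orders_per_customer = dict(
    conn.execute(
        "SELECT c.name, COUNT(o.order_id) "
        "FROM customers c "
        "LEFT JOIN orders o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id "
        "ORDER BY c.customer_id"
    )
)
```

In this sketch one customer has two orders and the other has one, which is exactly what the key-and-infinity notation allows: a single record on the "one" side with any number of related records on the "many" side.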