The Data Science Handbook
Chapter 2: The Data Science Road Map
If the client is a machine, then you probably need to zero in on a single, canonical model to build into the product.
If the client is human, then tuning and demonstrating more models will help understanding.
The target variable
You also need to measure (extract) what you want to predict, which we call the predicted result. The dataset normally won't contain a field called "output", so you need to define this output field yourself.
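As a minimal sketch of defining the output field yourself, the target can be derived from existing fields. The customer table, its column names, and the 90-day rule below are all assumptions invented for illustration, not something from the book.

```python
import pandas as pd

# Hypothetical customer records; the column names are illustrative assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "days_since_last_order": [5, 120, 45, 200],
})

# Assumed business rule: a customer counts as "churned" after 90 idle days.
CHURN_THRESHOLD_DAYS = 90

# The dataset has no ready-made "output" column, so we define one.
customers["churned"] = customers["days_since_last_order"] > CHURN_THRESHOLD_DAYS

print(customers)
```

The point is only that the target is a decision you make and write down explicitly, so it can be reviewed and changed later.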
These data are the most important input to the ML algorithms, and they directly determine the final quality of the model.
It is crucial to make a few simple visualizations to understand how the data are related and distributed.
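A rough numeric counterpart to those first plots, assuming pandas and a small made-up table (the column names and values are hypothetical), might look like this:

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 77, 90],
})

# Distribution of each column: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pairwise Pearson correlation: how the columns relate to each other.
correlation = df.corr()

print(summary)
print(correlation)

# With matplotlib installed, the visual equivalents are
# df.hist() for distributions and
# df.plot.scatter(x="height_cm", y="weight_kg") for relationships.
```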
The key in data science is to define the question(s).
SOW: Statement of Work
This document protects you from being sued for wasting time solving the wrong problem.
But this can be flexible too.
Questions can point toward human clients or machines.
For people, the definition of the questions can be more open-ended.
For machines, it is more quantitative.
Have a battery of standard questions
How large and how complete is the data?
How should the NA (blank) cases be handled?
How are two tables joined (especially when they hold totally different data)?
How common are blank entries?
Where does the data come from, and when?
Any artificial entries?
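Several of the standard questions above can be answered in a few lines. This is a sketch only, assuming pandas and two small hypothetical tables (`orders` and `customers`, both invented here):

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 12, 99],
    "amount": [25.0, None, 40.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["north", None, "south"],
})

# Size of the data: (rows, columns).
print(orders.shape)

# How common are blank entries? Fraction of missing values per column.
blank_fraction = orders.isna().mean()
print(blank_fraction)

# How do the two tables join? indicator=True adds a "_merge" column
# recording whether each row matched in both tables or only in one.
joined = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["order_id", "customer_id"]])
```

Rows that fail to match in the join are often the first hint of artificial or mistyped entries.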
of a question is also very crucial.
This is the process of cleaning data (a skill that most statisticians don't have).
Questions to ask
If the data is raw text, read the text first and identify all the junk.
If you are going to use any tool (such as a CSV reader), make sure the data is read in correctly (Are all rows in? Are the data types correct, such as datetime?).
Make some simple plots
Ask some simple questions about the data that you already know the answers to, and see if they match.
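The checks above can be sketched as follows, assuming pandas and a tiny hypothetical CSV held in memory (the columns, values, and expected row count are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical raw CSV content; in practice this would be a file on disk.
raw_csv = io.StringIO(
    "date,store,sales\n"
    "2016-01-01,A,100\n"
    "2016-01-02,A,150\n"
    "2016-01-03,B,90\n"
)
expected_rows = 3  # assumed known in advance, e.g. from counting lines

# parse_dates asks the reader to convert the column to a datetime type.
df = pd.read_csv(raw_csv, parse_dates=["date"])

# Are all rows in?
assert len(df) == expected_rows

# Are the data types correct?
print(df.dtypes)

# A question with a known answer: store A sold 100 + 150 = 250.
store_a_total = df.loc[df["store"] == "A", "sales"].sum()
print(store_a_total)
```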
Presentation and deploying code
You are the best translator among them, and you need to be able to talk to everyone in the audience (give them what they want to know).
You are the core
It's always more practical for you to be a full-fledged software developer, since this guarantees the quality of the code.
Get preliminary results as soon as possible after you understand the data (scatter plot or histogram >> crude model >> simple analysis and report).
The key is to encounter problems and bugs as soon as possible.
Make your code automated from the very beginning (read in, formatting, cleaning, sorting, etc.)
Separate the steps into their own functions (OOP), and keep your code clear so it is easy to add to or change later.
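One possible shape for an automated pipeline with separate steps is sketched below. The function names, the columns `name` and `score`, and the input data are all hypothetical. Plain functions are used here for brevity, but the same split works as methods on a class.

```python
import io
import pandas as pd


def read_data(source):
    """Read raw CSV data into a DataFrame."""
    return pd.read_csv(source)


def clean_data(df):
    """Drop rows with missing values and strip whitespace from names."""
    df = df.dropna().copy()
    df["name"] = df["name"].str.strip()
    return df


def sort_data(df):
    """Sort by score, highest first."""
    return df.sort_values("score", ascending=False).reset_index(drop=True)


def run_pipeline(source):
    """Run every step in order so the whole process is one call."""
    return sort_data(clean_data(read_data(source)))


# Hypothetical input, invented for illustration.
raw = io.StringIO("name,score\n alice ,3\nbob,\ncarol,7\n")
result = run_pipeline(raw)
print(result)
```

Because each step is its own function, a new cleaning rule or a different sort order only touches one place.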
Chapter 3: Programming Languages
: you can access the doc string when you import a file as a library.
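A small sketch of reading docstrings, using a hypothetical function `mean` and the standard-library `json` module as a stand-in for "a file imported as a library":

```python
import json  # any importable module works; json is from the standard library


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# A function's docstring is stored on its __doc__ attribute.
print(mean.__doc__)

# An imported module exposes its top-of-file docstring the same way.
module_doc = json.__doc__
print(module_doc.splitlines()[0])

# help(json) or help(mean) shows the same text in an interactive session.
```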
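import pandas as pd
# df is assumed to be an existing DataFrame; set_index returns a new DataFrame, so reassign it
df = df.set_index('the column you want to use as index')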
When axis=0, the function works down the rows (along the index), so each column is collapsed into a single unit (pressed down). When axis=1, the function works across the columns, so each row is swept from left to right and collapsed into a single unit.
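A quick way to see the difference, assuming pandas and a small made-up table (the values are hypothetical), is to sum along each axis:

```python
import pandas as pd

# Hypothetical 2x3 table, invented for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=0: move down the rows, producing one result per column.
column_sums = df.sum(axis=0)

# axis=1: move across the columns, producing one result per row.
row_sums = df.sum(axis=1)

print(column_sums)
print(row_sums)
```

Here `column_sums` has one entry per column (a, b, c) and `row_sums` has one entry per row.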
This is the main resource for documentation of Python version 2’s syntax.
The official documentation for the pandas library.
The documentation for scikit-learn. This is some of the best documentation I’ve ever seen for software. Most of it is example scripts that show off all the various things you can do.
Besides general browsing, I can recommend several specific resources that are great for coming up to speed:
1 Pilgrim, M, 2004, Dive into Python: Python from Novice to Pro, viewed 7 August
2 Pandas: Python Data Analysis Library, viewed 7 August 2016,
, viewed 7 August 2016, The Python Software
4 Scott, M, Programming Language Pragmatics, 4th edn, 2015, Morgan Kaufmann, Burlington, MA.