The Data Science Handbook (Chapter 2: The Data Science Road Map (The key…

- - - - You also need to measure (extract) what you want to predict, which we call the predicted result. Datasets normally don't contain a field called "output", so you need to define this output field yourself.
    - - This data is the most important input to the ML algorithms, and it directly determines the final quality of the model.
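
As a rough illustration of defining the output field yourself, here is a minimal pandas sketch. The table, the column names, and the 90-day churn rule are all hypothetical; in practice the definition comes from the problem you agreed to solve.

```python
import pandas as pd

# Hypothetical customer table; names and values are made up for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "days_since_last_order": [5, 40, 95, 200],
})

# Assumed rule: a customer counts as "churned" after 90 days without an order.
CHURN_THRESHOLD_DAYS = 90

# The raw data has no "output" column, so we derive one from the rule above.
df["output"] = (df["days_since_last_order"] > CHURN_THRESHOLD_DAYS).astype(int)

target = df["output"]
print(target.tolist())  # [0, 0, 1, 1]
```

Changing the threshold changes the labels, which is one reason the definition of the output deserves as much care as the model.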
  - - - This document guarantees that you won't be sued for wasting time solving the wrong problem.
      - But this can be flexible too.
    - - For people, the definition of the question can be more open-ended.
      - For machines, it is more quantitative.
    - - How large and how complete is the data?
      - How should NA (blank) values be handled?
      - How are two tables joined (especially when they come from totally different data)?
      - How common are blank entries?
      - Where does the data come from, and when?
      - Any artificial entries?
      - Any outliers?
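
Several of the questions above can be answered with a few lines of pandas. The sketch below is only one possible approach, using two made-up tables (`orders`, `customers`) and the common 1.5 × IQR rule as an assumed definition of "outlier"; none of these names or thresholds come from the handbook.

```python
import numpy as np
import pandas as pd

# Hypothetical tables for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [10, 11, 10, 12, 99],
    "amount": [20.0, np.nan, 25.0, 22.0, 5000.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["north", None, "south"],
})

# Size and completeness: row/column counts and the fraction of blanks per column.
n_rows, n_cols = orders.shape
missing_fraction = orders.isna().mean()

# Join check: indicator=True adds a "_merge" column showing which rows found a match.
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# Outliers: flag values outside 1.5 * IQR of the quartiles (an assumed rule of thumb).
q1, q3 = orders["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (orders["amount"] < q1 - 1.5 * iqr) | (orders["amount"] > q3 + 1.5 * iqr)
outliers = orders[is_outlier]

print(n_rows, n_cols)
print(missing_fraction)
print(unmatched[["order_id", "customer_id"]])
print(outliers)
```

Here the order from customer 99 has no matching customer row, and the 5000.0 amount is flagged as an outlier; whether either one is an error or a real event is a question for whoever produced the data.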
  - - - If the data is raw text, read the text first and identify all the junk.
      - If you are going to use a tool (such as a CSV reader), make sure the data is read in correctly (are all rows in? are the data types, such as datetime, correct?).
      - Make some simple plots
      - Ask some simple questions about the data that you already know the answer to, and see if the results match.
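
The read-in checks above might look like the following sketch. It assumes pandas as the CSV reader and uses a small in-memory file with hypothetical columns (`order_id`, `created_at`, `amount`); a real file would be passed by path instead.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical columns).
raw = io.StringIO(
    "order_id,created_at,amount\n"
    "1,2021-01-05,20.0\n"
    "2,2021-01-06,\n"
    "3,2021-02-01,25.0\n"
)
# Number of data rows we expect: total lines minus the header line.
expected_rows = raw.getvalue().count("\n") - 1

# Ask the reader to parse the date column instead of leaving it as text.
df = pd.read_csv(raw, parse_dates=["created_at"])

# Are all rows in?
rows_ok = len(df) == expected_rows
# Are the data types correct?
dates_ok = pd.api.types.is_datetime64_any_dtype(df["created_at"])
amounts_ok = pd.api.types.is_float_dtype(df["amount"])

# A question with a known answer: this file contains two January orders.
january_orders = int((df["created_at"].dt.month == 1).sum())

print(rows_ok, dates_ok, amounts_ok, january_orders)
```

If any of these checks fails, it is usually cheaper to fix the read-in step now than to discover the problem after modeling.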
  - - - You are the core
  - - - The key is to encounter problems and bugs as soon as possible.
- - - - __doc__: you can access a file's docstring through this attribute when you import the file as a library.
    - - df.set_index('the column you want to use as index')
      - df.join(another_df)
      - groupby()
      - apply()
      - When axis=0, the function is applied to each column as a unit, working down the rows. When axis=1, the function is applied to each row as a unit, working left to right across the columns.
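
The pandas calls listed above can be tried together on a small example. This is a minimal sketch with made-up tables and column names (`scores`, `groups`, `student`, `team`), meant only to show what each call does and how `axis` changes the result of `apply()`.

```python
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.DataFrame({
    "student": ["ann", "bob", "cat"],
    "math": [90, 70, 80],
    "art": [60, 80, 100],
}).set_index("student")  # use the "student" column as the index

groups = pd.DataFrame({
    "student": ["ann", "bob", "cat"],
    "team": ["red", "blue", "red"],
}).set_index("student")

# join() matches rows on the index by default.
joined = scores.join(groups)

# groupby(): mean math score per team.
team_mean_math = joined.groupby("team")["math"].mean()

# axis=0: the function receives one column at a time (one result per column).
col_max = scores.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives one row at a time (one result per row).
row_sum = scores.apply(lambda row: row.sum(), axis=1)

print(team_mean_math.to_dict())  # {'blue': 70.0, 'red': 85.0}
print(col_max.to_dict())         # {'math': 90, 'art': 100}
print(row_sum.to_dict())         # {'ann': 150, 'bob': 150, 'cat': 180}
```

A quick way to remember `axis`: the result of `axis=0` is indexed by column names, and the result of `axis=1` is indexed by row labels.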