The Data Analysis / Science Process

Ask a question

  • Who is your audience?
  • What is the goal? (use SMART: specific, measurable, achievable, relevant, time-bound)
  • What do you want to predict / estimate?

Determine the necessary data

  • What sample size is needed?
  • What data do we need to support / refute the hypothesis?
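
The sample-size question can be made concrete with a small calculation. The sketch below assumes the goal is to estimate a proportion (e.g., a survey percentage) and uses the usual normal-approximation formula n = z² · p(1 − p) / e². The function name and its parameters are illustrative, not part of any particular library, and other study designs need other formulas.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, expected_proportion=0.5):
    """Smallest n that estimates a proportion within +/- margin_of_error.

    Assumes simple random sampling and a large population (hypothetical helper).
    """
    # Two-sided critical value of the standard normal, roughly 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_proportion  # 0.5 is the most conservative (largest n) choice
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


n_5_percent = sample_size_for_proportion(0.05)
n_3_percent = sample_size_for_proportion(0.03)
print(n_5_percent, n_3_percent)
```

A tighter margin of error raises the required sample size quickly, because n grows with 1 / e².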

Get the data

  • How were the data sampled?
  • Who is represented in the data? Who is left out?
  • Which data are relevant?
  • Are there privacy issues?
  • Active data collection — e.g., running experiments, surveys
  • Passive data collection — e.g., locating datasets, web scraping

Clean the data

  • Are the data readable / organised?
  • Are there unnecessary values?
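
As a rough illustration of these cleaning questions, the sketch below tidies a few made-up records: it trims stray whitespace, drops repeated ids, and turns unreadable ages into explicit missing values. The rows, the column names, and the rule that "id" identifies a record are all assumptions for the example.

```python
# Invented raw records, as they might arrive from a CSV export.
raw_rows = [
    {"id": "1", "age": " 34 ", "city": "Leeds"},
    {"id": "2", "age": "", "city": "york "},
    {"id": "1", "age": " 34 ", "city": "Leeds"},  # duplicate of the first row
    {"id": "3", "age": "n/a", "city": "Hull"},
]


def clean(rows):
    """Return tidy rows: trimmed text, one row per id, ages as int or None."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["id"] in seen_ids:
            continue  # unnecessary value: this id was already kept
        seen_ids.add(row["id"])
        # Anything that is not a plain number becomes an explicit missing value.
        age = int(row["age"]) if row["age"].isdigit() else None
        cleaned.append({"id": int(row["id"]), "age": age, "city": row["city"].title()})
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```

Marking unreadable values as None, rather than silently dropping the row, keeps the decision about missing data visible to the later analysis steps.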

Model the data

  • Build a model
  • Fit the model
  • Validate the model
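
The three steps above can be sketched with a straight-line model. Everything here is an assumption for illustration: the data are synthetic (generated from y = 2x + 1 plus noise), the model is ordinary least squares, and validation is a single hold-out split scored by root-mean-square error.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run can be repeated

# Synthetic data: a known line plus Gaussian noise.
xs = [float(x) for x in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Hold out a quarter of the points for validation.
pairs = list(zip(xs, ys))
rng.shuffle(pairs)
train, test = pairs[:30], pairs[30:]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Root-mean-square prediction error on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


slope, intercept = fit_line(train)          # build and fit
test_rmse = rmse(test, slope, intercept)    # validate on unseen points
print(slope, intercept, test_rmse)
```

Scoring on points the model never saw is what separates validation from fitting: an error measured on the training points alone would be too optimistic.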

Communicate your findings

  • Visualising
  • Storytelling

Reproducibility: if no one can reproduce the results of your study, those results may well be invalid, the product of bias or error.


Automation: if you create reports, you will most likely process the same kind of data at regular intervals. Rather than writing a new program each time, write one program that automates the process.
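
A minimal sketch of that idea: one function builds the same summary for any file with the same layout, so next month's report is a new function call, not a new program. The function name, the "sales" column, and the in-memory sample data are hypothetical; in practice the file would come from open() on a real path.

```python
import csv
import io
from statistics import mean


def build_report(csv_file, value_column):
    """Summarise one numeric column of a CSV file (hypothetical report format)."""
    values = [float(row[value_column]) for row in csv.DictReader(csv_file)]
    return {
        "rows": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Stand-in for a monthly export; io.StringIO behaves like an opened file.
january = io.StringIO("day,sales\n1,100\n2,150\n3,125\n")
report = build_report(january, "sales")
print(report)
```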

💡 Remember: Garbage in, garbage out

Explore / Analyse the data

  • Transform the data
  • Summary statistics
  • Hypothesis testing
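
The three bullets above can be sketched on a small made-up experiment. The measurements are invented, the log transform is just one common choice, and the hypothesis test shown is an exact permutation test on the difference in means, chosen here because it needs only the standard library.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Invented measurements for two groups.
control = [12.1, 11.8, 12.5, 12.0, 11.6]
treated = [10.2, 10.9, 10.5, 11.0, 10.4]

# Transform: work on the log scale.
log_control = [math.log(v) for v in control]
log_treated = [math.log(v) for v in treated]

# Summary statistics: (mean, sample standard deviation) per group.
summary = {
    "control": (mean(log_control), stdev(log_control)),
    "treated": (mean(log_treated), stdev(log_treated)),
}


def permutation_p_value(a, b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    # Try every way of relabelling the pooled values into two groups.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


p_value = permutation_p_value(log_control, log_treated)
print(summary, p_value)
```

Because every treated value is below every control value, only 2 of the 252 possible relabellings are as extreme as the observed split, so the p-value is 2/252. Enumerating all splits is only practical for small groups; larger samples usually rely on random relabelling or a parametric test instead.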

Test / Fine-tune