The Data Analysis / Science Process
Ask a question
- Who is your audience?
- What is the goal? (use SMART: specific, measurable, achievable, relevant, time-bound)
- What do you want to predict / estimate?
Determine the necessary data
- What sample size is needed?
- What data do we need to support / refute the hypothesis?
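As a rough illustration of the sample-size question, the sketch below estimates how many observations per group are needed to compare two group means. It assumes a two-sided test, a normal approximation, and a standardised effect size (Cohen's d). The function name and the default values for alpha and power are illustrative choices, not part of these notes.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardised effect size (an assumption of this sketch).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A medium effect (d = 0.5) needs far fewer observations than a small one (d = 0.2).
    print(sample_size_per_group(0.5))
    print(sample_size_per_group(0.2))
```

Smaller expected effects require much larger samples, so it helps to settle on a plausible effect size before collecting data.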
Getting the data
- How were the data sampled?
- Who is represented in the data? Who is left out?
- Which data are relevant?
- Are there privacy issues?
- Active data collection — e.g., running experiments, surveys
- Passive data collection — e.g., locating datasets, web scraping
Cleaning the data
- Are the data readable / organised?
- Are there unnecessary values?
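The cleaning questions above could be approached along these lines. This is a minimal sketch using only the standard library. It assumes rows arrive as dictionaries of strings, and the set of missing-value markers and the function name are hypothetical choices for illustration.

```python
# Strings treated as "missing" in this sketch (an assumption; adjust per dataset).
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "-"}


def clean_rows(rows, numeric_fields):
    """Trim whitespace, normalise missing values, parse numbers, drop duplicates."""
    cleaned = []
    seen = set()
    for row in rows:
        tidy = {}
        for key, value in row.items():
            text = str(value).strip()
            if text.lower() in MISSING_MARKERS:
                tidy[key] = None              # one consistent marker for missing data
            elif key in numeric_fields:
                try:
                    tidy[key] = float(text)
                except ValueError:
                    tidy[key] = None          # unparseable numbers count as missing
            else:
                tidy[key] = text
        # Rows that are identical after tidying are treated as duplicates.
        fingerprint = tuple(sorted(tidy.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


if __name__ == "__main__":
    raw = [
        {"name": " Ada ", "age": "36"},
        {"name": "Ada", "age": "36 "},
        {"name": "Bob", "age": "N/A"},
    ]
    print(clean_rows(raw, numeric_fields={"age"}))
```

Whether an unparseable number should become missing or raise an error depends on the dataset. The sketch chooses the more forgiving option.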
Model the data
- Build a model
- Fit the model
- Validate the model
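The build / fit / validate steps can be sketched with a deliberately simple model. The example assumes a straight-line relationship y = intercept + slope * x, fits it by ordinary least squares on a training portion, and validates it with mean squared error on held-out data. The function names and the 25% hold-out fraction are illustrative assumptions.

```python
import random


def split_data(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed and hold out a test portion."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared prediction error on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    xs = [float(i) for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]                    # synthetic, noise-free data
    x_train, y_train, x_test, y_test = split_data(xs, ys)
    slope, intercept = fit_line(x_train, y_train)       # fit on training data only
    print(mean_squared_error(x_test, y_test, slope, intercept))  # validate on held-out data
```

Validating on data the model has not seen gives a fairer estimate of how it will perform than checking the error on the training data.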
Communicate your findings
Reference: Data Visualisation | Coggle
- Visualising
- Storytelling
Reproducibility — If your study produces results that no one can reproduce, it is likely that your results are invalid and the product of bias or error.
Automation — If you’re creating reports, it’s most likely that you’ll be processing the same data at regular intervals. Rather than writing a new program each time, you can write a program that automates these processes.
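One way to automate a recurring report is to wrap the processing in a function that can be re-run on each new data file. The sketch below assumes the data arrive as CSV with a numeric column. The column name "sales", the report format, and the function name are hypothetical.

```python
import csv
import io
from statistics import mean


def build_report(csv_file, value_column):
    """Summarise one numeric column of a CSV file object as a one-line report."""
    reader = csv.DictReader(csv_file)
    values = []
    for row in reader:
        cell = (row.get(value_column) or "").strip()
        if cell:                                  # skip rows with no value
            values.append(float(cell))
    if not values:
        return "No data"
    return (f"rows={len(values)} mean={mean(values):.2f} "
            f"min={min(values):.2f} max={max(values):.2f}")


if __name__ == "__main__":
    # In practice the file object would come from open(path, newline="").
    sample = io.StringIO("day,sales\nMon,10\nTue,14\nWed,\n")
    print(build_report(sample, "sales"))
```

Because the same function produces the same report from the same input, running it on a schedule also supports reproducibility.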
💡 Remember: Garbage in, garbage out
Reference: Data Acquisition | Coggle
Reference: Data Wrangling | Coggle
Explore / Analyse the data
- Transform the data
- Summary statistics
- Hypothesis testing
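The three steps above might look like this in practice. The sketch applies a log transform, computes summary statistics, and runs a two-sided permutation test for a difference in group means. The permutation test is one choice among many hypothesis tests. The function names, the number of permutations, and the fixed seed are assumptions made for illustration.

```python
import math
import random
from statistics import mean, median, stdev


def log_transform(values):
    """Natural-log transform; assumes all values are strictly positive."""
    return [math.log(v) for v in values]


def summarise(values):
    """Basic summary statistics for a list of at least two numbers."""
    return {"n": len(values), "mean": mean(values),
            "median": median(values), "stdev": stdev(values)}


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)            # fixed seed keeps the result reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)              # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5, 6, 7, 8]
    treated = [11, 12, 13, 14, 15, 16, 17, 18]
    print(summarise(log_transform(control)))
    print(permutation_test(control, treated))
```

A small p-value suggests the observed difference would be unusual if group labels were unrelated to the outcome. It does not measure the size or practical importance of the effect.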
Reference: Data Analysis | Coggle
Test / Fine-tune