Data Science Project Guidelines
1. Scoping
1.1. Business Analysis
1.1.1. Problem Definition
Tasks affected by the problem
Tasks' purpose
Current Solution
KPIs - definition and current metrics
Tasks' properties - who does them, and how frequently?
Is the problem too dynamic (in the real world) to model?
In which different ways can we solve this problem?
1.1.2. Past Efforts
Summarize all past efforts: who, what, how, when
Failure Analysis - can we tune any of the past solutions to perform better?
Did the clients exploit all their options?
Has anyone solved similar problems?
1.1.3. Solution Definition
What's the project's goal in terms of the KPIs?
What can be considered an MVP?
Product Interface requirements
What's the required model frequency?
Wrapping it up - is this applicable?
What sort of interpretation is required?
Is there any constraint on the model stability?
1.2. Data Analysis
1.2.1. Labels & Data Sources Collection
Which labels/data sources were used in past solutions?
Who's the expert for each labels/data source? Catch up with them
Learn the best practices for using each labels/data source (in terms of correctness and efficiency)
1.2.2. Labels & Data Analysis
How were the labels/data generated?
Is there any bias? How do we handle it?
How dynamic is this source? How often does it change? What part of it changes? Could it change in the future? Will we know about it?
Is the labels/data source consistent internally and with other labels/data sources?
Is there any processing (filtering, missing data handling, conversions, etc.) needed? What exactly? Document it.
What's the number of labels we have from each source after processing?
How are the labels/data distributed (use histograms, percentiles, timelines, etc.)?
What's each labels source's reliability? How will we take that into account?
Is the desired solution applicable considering this analysis?
How do we combine all labels sources? How do we handle discrepancies? (see the sketch at the end of this subsection)
Coverage - for how much of the labeled data is each data source applicable?
Do we need more work from the clients/data engineers?
1.2.3. Dataset Preparation
Compile all the features into a single dataset, containing both labeled and unlabeled data.
Transform - normalize, filter, alias, convert formats, use binning and one-hot encoding so that the dataset is uniform and ready to go (a sketch follows this subsection).
Handle missing data in a reasonable way.
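A sketch of the transform and missing-data steps with scikit-learn, assuming hypothetical numeric ("age", "income") and categorical ("country") columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]
categorical = ["country"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # handle missing data
        ("scale", StandardScaler()),                    # normalize
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # one-hot encoding
    ]), categorical),
])

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "income": [50_000, 62_000, np.nan],
                   "country": ["IL", "US", np.nan]})
X = preprocess.fit_transform(df)
print(X.shape)
```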
1.2.4. Basic Dataset Exploration
Examine some samples to get a sense of what we're dealing with. Can you tell which samples are impossible? Prove they don't exist in the data.
Run statistics on your dataset. Find out if there are exceptional samples and scrutinize them with domain experts.
If the data is temporal, draw it as a function of time. Are there anomalies in time?
How is the dataset distributed (check with different feature sets)? From how many different distributions was it sampled? Do they correlate with different class labels?
Are the labeled data and the unlabeled data distributed differently? Can we handle it?
Draw the features correlation matrix. Which feature correlates with the target label the most? The least? (see the sketch below)
Repeat your experiments for different splits of the data (labeled/unlabeled, different classes, different time slices, etc.). Is there a difference? Can you explain it?
Summarize your findings and update the dataset preparation code accordingly.
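A sketch of two of the checks above (correlation matrix, labeled vs. unlabeled distribution), assuming a DataFrame with numeric features and a "label" column that is NaN for unlabeled rows; all names and data are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
df["label"] = np.where(rng.random(200) < 0.3, rng.integers(0, 2, 200), np.nan)

# Feature correlation matrix (including correlation with the label)
print(df.corr(numeric_only=True).round(2))

# Are labeled and unlabeled data distributed differently? Per-feature KS test.
labeled, unlabeled = df[df["label"].notna()], df[df["label"].isna()]
for col in ["f1", "f2"]:
    stat, pval = ks_2samp(labeled[col], unlabeled[col])
    print(f"{col}: KS={stat:.3f}, p={pval:.3f}")
```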
1.3. Project Design
1.3.1. Business Plan
Assign roles: product manager, R&D manager, researchers, engineers, clients, POCs
Revise the Problem Definition and the Solution Definition
Is the project's goal feasible considering the business analysis and the data analysis conclusions?
What side products can we produce during the project?
Is the suggested solution reusable for other projects?
1.3.2. R&D Plan
What's the project's data architecture? That is, what is the data flow from the data sources to the final product? Are there alternatives? Why did we choose this option?
What's the project's software architecture? That is, what are the different modules and how will they interact? Are there alternatives? Why did we choose this option?
What are the metrics we'll optimize? How do those metrics match the KPIs?
Which loss functions will we use? How do they relate to the metrics we defined? (see the sketch below)
What are the technical resources needed? Are they available? If not, can we handle it?
1.3.3. Project Plan
Are there any constraints in terms of deadlines, working environments, production environments, etc.?
What are the high-level tasks in the project? What are the tasks for the first round of the research?
What risks are entailed with this project? How severe are they? How do we manage them?
Write a schedule, and assign tasks.
What are the human resources (labor, time) needed? Are they available? If not, can we handle it?
1.3.4. Productization Plan
How will the data architecture be deployed to production?
What resources do we need to deploy this project?
How are we going to monitor our product's performance?
How are we going to monitor the labels, data sources and all the assumptions upon which the product is based?
How do we measure the product's KPIs over time?
2. Research
2.1. Research Cycle Planning
2.1.1. Feature Engineering
Modify/add/reduce features.
Given the last iteration's results, what features do we need?
Brainstorm with the product manager and the clients.
Are there other ways to process some of the features? Consider them.
Try feature selection and validate the results with the product manager.
Is there feature computation based on the data itself? Make sure you compute those features only on the training data.
Do we normalize the features based on the training data only? (see the sketch below)
Make sure the scaling method isn't sensitive to anomalies.
2.1.2. Dataset Preparation
Apply all steps from the data analysis phase (compile, transform and missing data handling).
Apply data augmentation if possible and necessary.
2.1.3. Models Preparation
Are there any heuristics that apply to part of the data? If so, add them to the data architecture. Don't leave it to the model.
Do the current models suffer from overfitting/underfitting? Decide which models should be tried next accordingly.
Review literature & existing solutions to address current challenges with your model.
2.2. Data Exploration
2.2.1. Basics
Train a simple model and look at the feature importance. Make sure there isn't a suspiciously important feature – it might indicate leakage.
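A sketch of this check with a random forest; the "leaky" column is a synthetic copy of the target added for illustration, and its dominant importance is exactly the warning sign to look for:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["leaky"] = y  # simulated leakage

model = RandomForestClassifier(random_state=0).fit(df, y)
importances = pd.Series(model.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))
```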
2.2.2. Advanced
Run dimensionality reduction methods to find the features' a-priori importance, and visualize your data.
Use regression methods to find hidden connections between different numeric features.
Run clustering. Consult with the product manager in order to give each cluster a meaning. (see the sketch below)
Depending on the data: SNA, Association Rules.
Repeat your experiments for different splits of the data (labeled/unlabeled, different classes, different time slices, etc.). Is there a difference? Can you explain it?
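A minimal sketch of the dimensionality reduction and clustering steps; the dataset, number of components and number of clusters are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                                # 2-D view for visualization
print("explained variance:", pca.explained_variance_ratio_)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))         # discuss each cluster's meaning
```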
2.2.3. Conclusions
Draw conclusions. Think what evidence would prove those conclusions wrong. Double-check.
Consult with the product manager about your conclusions. Do they make sense?
Update the dataset preparation according to the conclusions (anomalies, biases, etc.).
Refine the project plan.
Apply all steps from the exploration in the data analysis phase.
2.3. Research Cycle Evaluation
2.3.2. Running
Train models
Predict all samples (including unlabeled)
If relevant, did the loss function really stop improving?
2.3.3. Evaluation & Debugging
Was the performance significantly changed? Does it make sense? If not, look for bugs.
Draw a confusion matrix (see the sketch at the end of this subsection).
Sample data for which the model failed and verify it was properly represented in the training set.
Plot learning curves: performance as a function of model complexity and as a function of time, ROC curves (per class).
Diagnose - overfitting/underfitting?
Does the feature importance make sense to the client?
Which samples were most significant for the model training?
Can we estimate how adding/changing data will improve performance?
Can we obtain the same results with a simpler model (fewer parameters)?
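A sketch of the confusion matrix and a learning curve (performance vs. training set size), which feeds the overfitting/underfitting diagnosis; the model and scoring here are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import learning_curve, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1",
    train_sizes=np.linspace(0.1, 1.0, 5))
print("train:", train_scores.mean(axis=1).round(3))   # high train + low valid -> overfitting
print("valid:", val_scores.mean(axis=1).round(3))     # both low -> underfitting
```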
2.3.4. Conclusions
Did we make any progress since the last iteration? What caused it?
Does the model (or models) meet the project's goal?
What measures will we take in the next iteration, and why?
Make sure you document this iteration as thoroughly as possible.
2.3.1. Preparation
Does the model meet the interpretation requirement?
Check there's ZERO intersection between the different sets/folds.
Verify all sets/folds are similarly distributed.
Create train, test and validation sets, or use cross validation (see the sketch below).
Find out the models' null error rate.
Does the number of parameters match the amount of data?
What hyper-parameters are tunable?
Define error significance.
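A minimal sketch of the split checks above: zero intersection between sets, similar class distribution via stratification, the null error rate baseline, and non-overlapping cross-validation folds (the dataset and split sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
idx = np.arange(len(y))

train_idx, test_idx = train_test_split(idx, stratify=y, random_state=0)
assert len(set(train_idx) & set(test_idx)) == 0           # ZERO intersection

# Similarly distributed: the class balance should match across sets
print("train positives:", y[train_idx].mean(), "test positives:", y[test_idx].mean())

# Null error rate: error of always predicting the majority class
majority = np.bincount(y[train_idx]).argmax()
print("null error rate:", (y[test_idx] != majority).mean())

# Or cross validation with stratified, non-overlapping folds
for fold_train, fold_val in StratifiedKFold(n_splits=5).split(X, y):
    assert len(set(fold_train) & set(fold_val)) == 0
```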
3. Productization
3.1. Deployment
3.1.1. Research Summary
Check the model's stability. Does it meet the requirement? Investigate the instabilities and try to explain them.
What assumptions were made in the project? Write code modules that verify every assumption and decide what the production code flow is in case an assumption is violated.
How do we expect performance to be affected as a result of violation of these assumptions?
In case we didn't exhaust all directions we had in mind, write down directions and suggestions for future research.
Which feature takes the longest to compute? Is it worth it?
Is there a discrepancy in the data flow between labeled data and unlabeled data?
3.1.4. Business
What's the final feature set? What's the feature importance?
How do we receive feedback from users? How do we process feedback into usable data?
Which interactive parameters will be in the product?
3.1.5. Monitoring
What's the rollback policy in case performance becomes significantly worse? Do we still update the model?
Build dashboards that continuously measure: performance, assumptions, data/labels distributions, known data patterns, feedback.
3.1.2. Code
Write logs.
Unittest everything possible. Make sure your tests pass. (see the sketch below)
Verify your code matches the code conventions.
Make sure the project is properly stored in some source control repository. Clone it from a different machine and make sure it runs as expected.
Only then – ask someone to Code Review you.
3.1.3. Performance
Do we assume specific hardware specs? Check how reducing resources affects runtime.
What's the total runtime? Does it meet the requirements? Can we do better?
Is the data pipeline scalable in case we'll have significantly more data samples (to train/predict on) in the future?
Can we reuse components in future projects? Are they designed for this purpose?
Deploy everything and schedule processes. Make sure that there is absolutely no scheduled test running.
3.2. Delivery
3.2.1. KPIs Verification
Does the product manager approve that all the requirements are satisfied? If not, will we deliver it anyway and complete the gaps later?
Has anything changed while working on the project? Does the current product still work for the clients?
Are there “quick wins” that we can supply before delivery?
3.2.2. Growing Strategy
Which users are going to use the product eventually? Who's going to use it at the beginning? What phases are in between? Draw a plan.
If there's a current solution deployed that we replace, how is that going to happen? A/B testing? Immediate substitution?
What number of users & uses do we expect in the first week? First month? Eventually? Monitor it.
3.2.3. SLA
What is the performance threshold below which the research should be renewed?
How often does the data (labels or data sources) change? Does it need a special research treatment?
4. Maintenance
4.1. Periodic Health Checks
Does the performance seen in the feedback correlate with the empirical performance (on the test set)?
Was there a drop in the amount of feedback?
Was the overall performance significantly changed recently?
Were the data sources significantly changed recently?
Were the labels significantly changed recently?
Was there a significant drop in the number of users/uses?
Has the labels/data distribution changed? (see the sketch below)
Has the gap between the labeled data distribution and the unlabeled data distribution changed?
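One possible periodic check for a distribution change: compare a recent window of a numeric feature against a reference window with a two-sample KS test (window sizes and the 0.01 threshold are illustrative; this complements the PSI sketch in the Monitoring section):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_window = rng.normal(0, 1, 10_000)   # e.g. values seen at training time
recent_window = rng.normal(0.3, 1, 2_000)     # e.g. last week's values

stat, pval = ks_2samp(reference_window, recent_window)
if pval < 0.01:
    print(f"distribution changed (KS={stat:.3f}, p={pval:.1e}) -- investigate")
```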
4.2. Intentional Label/Data Changes
4.2.1. Re-run Labels & Data Analysis
4.2.3. Re-run Dataset Preparation
4.2.2. Re-run Basic Dataset Exploration
4.2.4. Re-run models evaluation
4.2.5. Update code if necessary and make sure tests pass
4.2.6. Does the product still satisfy the requirements?
4.2.7. If performance was impaired, what do we do? Can we work on past data? Stop scheduling and work with the last version for a couple of months? Initiate new research?
4.3. Problem Re-definition
4.3.1. What changed? Should several classes be added/removed? Or should we categorize all samples differently? In the latter case, we need to re-run everything.