Data Science Project Guidelines
1. Scoping
1.1. Business Analysis
1.1.1. Problem Definition
Tasks affected by the problem
Tasks' purpose
Current solution
KPIs: definition and current metrics
Tasks' properties: who performs them, and how often
Is the problem too dynamic (in the real world) to model?
In which different ways can we solve this problem?
1.1.2. Past Efforts
Summarize all past efforts: who, what, how, when
Failure analysis: can we tune any of the past solutions to perform better?
Did the clients exploit all their options?
Has anyone solved similar problems?
1.1.3. Solution Definition
What's the project's goal in terms of the KPIs?
What can be considered an MVP?
Product interface requirements
What's the required model frequency?
Wrapping it up: is this applicable?
What sort of interpretation is required?
Is there any constraint on the model stability?
1.2. Data Analysis
1.2.1. Labels & Data sources collection
Which labels/data sources were used in past solutions?
Who's the expert for each label/data source? Catch up with her.
Learn the best practices for using each label/data source (in terms of correctness and efficiency)
1.2.2. Labels & Data Analysis
How were the labels/data generated?
Is there any bias? How do we handle it?
How dynamic is this source? How often does it change? What part of it changes? Could it change in the future? Will we know about it?
Is the labels/data source consistent internally and with other labels/data sources?
Is there any processing (filtering, missing data handling, conversions, etc.) needed? What exactly? Document it.
What's the number of labels we have from each source after processing?
How are the labels/data distributed (use histograms, percentiles, timelines, etc.)?
What's each label source's reliability? How will we take that into account?
Is the desired solution applicable considering this analysis?
How do we combine all label sources? How do we handle discrepancies?
Coverage: for how much of the labeled data is each data source applicable?
Do we need more work from the clients/data engineers?
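The coverage and discrepancy questions above can be turned into numbers with a small sketch like the one below. It assumes each source is a dict mapping a sample ID to a label. The function names, the source names, and the toy data are hypothetical illustrations, not part of the original guidelines.

```python
from itertools import combinations


def coverage(source, sample_ids):
    """Fraction of sample_ids for which this source provides a value."""
    if not sample_ids:
        return 0.0
    return sum(1 for s in sample_ids if s in source) / len(sample_ids)


def disagreement(source_a, source_b):
    """Fraction of samples present in both sources whose labels differ."""
    shared = source_a.keys() & source_b.keys()
    if not shared:
        return 0.0
    return sum(1 for s in shared if source_a[s] != source_b[s]) / len(shared)


# Hypothetical label sources: sample ID -> label.
sources = {
    "manual_review": {1: "fraud", 2: "ok", 3: "ok"},
    "user_reports": {2: "ok", 3: "fraud", 4: "fraud"},
}
all_ids = [1, 2, 3, 4, 5]

coverage_by_source = {name: coverage(src, all_ids) for name, src in sources.items()}
disagreement_by_pair = {
    (a, b): disagreement(sources[a], sources[b])
    for a, b in combinations(sources, 2)
}
```

A high disagreement rate between two sources is a prompt to go back to each source's expert before deciding how to combine them.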
1.2.3. Dataset Preparation
Compile all the features into a single dataset, containing both labeled and unlabeled data.
Transform: normalize, filter, alias, convert formats, and use binning and one-hot encoding so that the dataset is uniform and ready to go.
Handle missing data in a reasonable way.
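As a minimal sketch of the transform and missing-data steps, the snippet below shows median imputation, binning, and one-hot encoding in plain Python. Median imputation is only one possible "reasonable way" and is an assumption here, as are the helper names, the bin edges, and the sample values.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def bin_index(value, edges):
    """Index of the bin a value falls into, given sorted bin edges."""
    return sum(1 for e in edges if value >= e)


def one_hot(values):
    """Encode categorical values as 0/1 rows, one column per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


ages = [23, None, 41, 35]                      # hypothetical numeric feature
ages_filled = impute_median(ages)
age_bins = [bin_index(a, [30, 40]) for a in ages_filled]
color_rows, color_names = one_hot(["red", "blue", "red", "green"])
```

Whatever choices are made here should be documented, since the same steps are re-applied in every research cycle.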
1.2.4. Basic Dataset Exploration
Examine some samples to get a sense of what we're dealing with. Can you tell which samples are impossible? Prove they don't exist in the data.
Run statistics on your dataset. Find out if there are exceptional samples and scrutinize them with domain experts.
If the data is temporal, draw it as a function of time. Are there anomalies in time?
How is the dataset distributed (check with different feature sets)? From how many different distributions was it sampled? Do they correlate with different class labels?
Are the labeled data and the unlabeled data distributed differently? Can we handle it?
Draw the features correlation matrix. Which feature correlates with the target label the most? The least?
Repeat your experiments for different splits of the data (labeled/unlabeled, different classes, different time slices, etc.). Is there a difference? Can you explain it?
Summarize your findings and update the dataset preparation code accordingly.
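The question of which feature correlates most and least with the target can be answered with a sketch like this one. It uses the Pearson coefficient, which is an assumption on our part (the guidelines do not name a correlation measure) and which only captures linear relationships. The feature names and values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical numeric features and target.
features = {
    "f_linear": [1.0, 2.0, 3.0, 4.0],
    "f_inverse": [4.0, 3.0, 2.0, 1.0],
    "f_constant": [5.0, 5.0, 5.0, 5.0],
}
target = [2.0, 4.0, 6.0, 8.0]

# Rank features by the absolute strength of their correlation with the target.
by_strength = sorted(
    ((name, pearson(col, target)) for name, col in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Running the same ranking on each data split (labeled vs. unlabeled, per class, per time slice) is one way to check whether the splits behave differently.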
1.3. Project Design
1.3.1. Business Plan
Assign roles: product manager, R&D manager, researchers, engineers, clients, POCs
Revise Problem Definition and Solution Definition
Is the project's goal feasible considering the business analysis and the data analysis conclusions?
What side products can we produce during the project?
Is the suggested solution reusable for other projects?
1.3.2. R&D Plan
What's the project's data architecture? That is, what is the data flow from the data sources through to the final product? Are there alternatives? Why do we choose this option?
What's the project's software architecture? That is, what are the different modules and how will they interact? Are there alternatives? Why do we choose this option?
What are the metrics we'll optimize? How do those metrics match the KPIs?
Which loss functions will we use? How do they relate to the metrics we defined?
What are the technical resources needed? Are they available? If not, can we handle it?
1.3.3. Project Plan
Are there any constraints in terms of deadlines, working environments, production environments, etc.?
What are the high-level tasks in the project? What are the tasks for the first round of the research?
What risks does this project entail? How severe are they? How do we manage them?
Write a schedule and assign tasks.
What are the human resources (labor, time) needed? Are they available? If not, can we handle it?
1.3.4. Productization Plan
How will the data architecture be deployed to production?
What resources do we need to deploy this project?
How are we going to monitor our product's performance?
How are we going to monitor the labels, data sources and all the assumptions upon which the product is based?
How do we measure the product's KPIs over time?
2. Research
2.1. Research Cycle Planning
2.1.1. Feature Engineering
Modify/add/reduce features
Given the last iteration's results, what features do we need?
Brainstorm with the product manager and the clients.
Are there other ways to process some of the features? Consider them.
Try feature selection and validate the results with the product manager.
Is any feature computed from the data itself? Make sure you compute those features on the training data only.
Do we normalize the features based on the training data only?
Make sure the scaling method isn't sensitive to anomalies.
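The last two points can be combined in one sketch: a scaler whose statistics are learned from the training set only, and which uses the median and interquartile range so that a single anomaly barely moves it. The class `MedianIqrScaler` is a hand-rolled illustration, not a library class, and the data is made up.

```python
from statistics import median, quantiles


class MedianIqrScaler:
    """Scale by median and interquartile range, both learned from training data."""

    def fit(self, train_values):
        q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
        self.center = median(train_values)
        self.scale = (q3 - q1) or 1.0  # fall back to 1.0 for constant features
        return self

    def transform(self, values):
        return [(v - self.center) / self.scale for v in values]


train = [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]  # 500.0 plays the anomaly
test = [11.0, 14.0]

scaler = MedianIqrScaler().fit(train)   # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # the test set reuses the training statistics
```

A mean and standard deviation scaler fitted on the same `train` list would be dominated by the 500.0 value, which is the sensitivity the checklist warns about.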
2.1.2. Dataset Preparation
Apply all steps from the data analysis phase (compile, transform and missing data handling)
Apply data augmentation if possible and necessary.
2.1.3. Models Preparation
Are there any heuristics that apply to part of the data? If so, add them to the data architecture. Don't leave it to the model.
Do the current models suffer from overfitting/underfitting? Decide which models should be tried next accordingly.
Review literature & existing solutions to address current challenges with your model.
2.2. Data Exploration
2.2.1. Basics
Train a simple model and look at the feature importance. Make sure there isn't a suspiciously important feature: it might indicate leakage.
2.2.2. Advanced
Run dimensionality reduction methods to find the features' a priori importance, and visualize your data.
Use regression methods to find hidden connections between different numeric features.
Run clustering. Consult with the product manager in order to give each cluster a meaning.
Depending on the data: SNA (social network analysis), Association Rules.
Repeat your experiments for different splits of the data (labeled/unlabeled, different classes, different time slices, etc.). Is there a difference? Can you explain it?
2.2.3. Conclusions
Draw conclusions. Think about what evidence would prove those conclusions wrong. Double check.
Consult with the product manager about your conclusions. Do they make sense?
Update the dataset preparation according to the conclusions (anomalies, biases, etc.).
Refine the project plan.
Apply all steps from the exploration in the data analysis phase
2.3. Research Cycle Evaluation
2.3.2. Running
Train models
Predict all samples (including unlabeled)
If relevant, did the loss function really stop improving?
2.3.3. Evaluation & Debugging
Did the performance change significantly? Does it make sense? If not, look for bugs.
Draw a confusion matrix
Sample data for which the model failed and verify it was properly represented in the training set.
Plot learning curves: performance as a function of model complexity and as a function of time, ROC curves (per class)
Diagnose: overfitting/underfitting?
Does the feature importance make sense to the client?
Which samples were most significant for the model training?
Can we estimate how adding/changing data will improve performance?
Can we obtain the same results with a simpler model (fewer parameters)?
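For the confusion matrix item in this checklist, a minimal sketch in plain Python could look like the following. The class names and predictions are invented, and per-class recall is added only as one example of what can be read off the matrix.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix); matrix[i][j] counts true labels[i] predicted as labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def per_class_recall(labels, matrix):
    """Recall per class: diagonal count divided by the row total."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


# Hypothetical ground truth and model predictions.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

labels, matrix = confusion_matrix(y_true, y_pred)
recall = per_class_recall(labels, matrix)
```

The off-diagonal cells point at the class pairs worth sampling when checking whether failed samples were properly represented in the training set.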
2.3.4. Conclusions
Did we make any progress since the last iteration? What caused it?
Does the model(s) meet the project's goal?
What measures will we take in the next iteration, and why?
Make sure you document this iteration as thoroughly as possible
2.3.1. Preparation
Does the model meet the interpretation requirement?
Check there's ZERO intersection between the different sets/folds.
Verify all sets/folds are similarly distributed.
Create train, test and validation sets or use cross validation.
Find out the models' null error rate
Does the number of parameters match the amount of data?
What hyper-parameters are tunable?
Define error significance
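Two of the preparation checks above are easy to automate. The sketch below checks that the sets share no sample IDs and computes the null error rate, taken here to mean the error of always predicting the most frequent class. The fold names, IDs, and labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def assert_disjoint(folds):
    """Raise ValueError if any sample ID appears in more than one fold."""
    for (name_a, ids_a), (name_b, ids_b) in combinations(folds.items(), 2):
        overlap = set(ids_a) & set(ids_b)
        if overlap:
            raise ValueError(f"{name_a} and {name_b} share samples: {sorted(overlap)}")


def null_error_rate(labels):
    """Error rate of always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1 - most_common_count / len(labels)


folds = {"train": [1, 2, 3, 4], "validation": [5, 6], "test": [7, 8]}
assert_disjoint(folds)  # passes silently: no shared IDs

labels = ["ok"] * 9 + ["fraud"]
baseline = null_error_rate(labels)  # a useful model must beat this error rate
```

Comparing by ID will not catch duplicated samples stored under different IDs, so a content-based check may also be needed depending on how the data was collected.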
3. Productization
3.1. Deployment
3.1.1. Research Summary
Check the model's stability. Does it meet the requirement? Investigate the instabilities and try to explain them.
What assumptions were made in the project? Write code modules that verify each assumption and decide what the production code flow is in case an assumption is violated.
How do we expect performance to be affected when these assumptions are violated?
In case we didn't exhaust all the directions we had in mind, write down directions and suggestions for future research.
Which feature takes the longest to compute? Is it worth it?
Is there a discrepancy in the data flow between labeled data and unlabeled data?
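One possible shape for the assumption-verifying modules mentioned above is sketched below. Each assumption carries a check and a policy, and the strictest policy among the violated assumptions selects the production code path. The policies ("fallback", "abort"), the statistics keys, and the thresholds are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("assumptions")


@dataclass
class Assumption:
    name: str
    check: Callable[[dict], bool]  # returns True when the assumption holds
    on_violation: str              # "fallback" or "abort" (hypothetical policies)


def violated(assumptions, batch_stats):
    """Return the assumptions that do not hold for this batch, logging each one."""
    failed = [a for a in assumptions if not a.check(batch_stats)]
    for a in failed:
        logger.warning("assumption violated: %s (policy: %s)", a.name, a.on_violation)
    return failed


def choose_flow(failed):
    """Pick the production code path: the strictest policy among violations wins."""
    policies = {a.on_violation for a in failed}
    if "abort" in policies:
        return "abort"
    if "fallback" in policies:
        return "fallback"
    return "predict"


ASSUMPTIONS = [
    Assumption("missing rate below 5%", lambda s: s["missing_rate"] < 0.05, "fallback"),
    Assumption("batch is not empty", lambda s: s["n_rows"] > 0, "abort"),
]

flow = choose_flow(violated(ASSUMPTIONS, {"missing_rate": 0.20, "n_rows": 1000}))
```

Keeping the assumptions in one list also gives monitoring a single place to read them from.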
3.1.4. Business
What's the final feature set? What's the feature importance?
How do we receive feedback from users? How do we process feedback into usable data?
Which interactive parameters will be in the product?
3.1.5. Monitoring
What's the rollback policy in case performance becomes significantly worse? Do we still update the model?
Build dashboards that continuously measure: performance, assumptions, data/label distributions, known data patterns, feedback
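For the data/label distribution part of such a dashboard, one commonly used drift measure is the Population Stability Index (PSI). The guidelines do not prescribe a specific measure, so PSI is an assumption here, as are the bin edges, the smoothing constant `eps`, and the sample scores.

```python
from math import log


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values in each bin defined by sorted edges, floored at eps."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, current, edges):
    """Population Stability Index between a reference sample and a current sample."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * log(c / r) for r, c in zip(ref, cur))


edges = [0.25, 0.5, 0.75]  # hypothetical bin edges for a score in [0, 1]
reference = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 0.2, 0.8]
same = list(reference)
shifted = [0.8, 0.9, 0.85, 0.95, 0.7, 0.9, 0.8, 0.6]

psi_same = psi(reference, same, edges)        # identical distributions give 0
psi_shifted = psi(reference, shifted, edges)  # grows as the distributions diverge
```

The value at which the dashboard should raise an alert is a project decision and belongs with the rollback policy.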
3.1.2. Code
Write logs
Unit-test everything possible. Make sure your tests pass.
Verify your code matches the code conventions
Make sure the project is properly stored in a source control repository. Clone it from a different machine and make sure it runs as expected.
Only then, ask someone to code review you.
3.1.3. Performance
Do we assume specific hardware specs? Check how reducing resources affects runtime.
What's the total runtime? Does it meet the requirements? Can we do better?
Is the data pipeline scalable in case we'll have significantly more data samples (to train/predict on) in the future?
Can we reuse components in future projects? Are they designed for this purpose?
Deploy everything and schedule processes. Make sure that there is absolutely no scheduled test running.
3.2. Delivery
3.2.1. KPIs Verification
Does the product manager approve that all the requirements are satisfied? If not, will we deliver it anyway and close the gaps later?
Has anything changed while working on the project? Does the current product still work for the clients?
Are there "quick wins" that we can supply before delivery?
3.2.2. Growth Strategy
Which users are going to use the product eventually? Who's going to use it at the beginning? What phases are in between? Draw a plan.
If we are replacing a currently deployed solution, how will the transition happen? A/B testing? Immediate substitution?
How many users & uses do we expect in the first week? First month? Eventually? Monitor it.
3.2.3. SLA
What is the performance threshold below which the research should be renewed?
How often does the data (labels or data sources) change? Does it need a special research treatment?
4. Maintenance
4.1. Periodic Health Checks
Does the performance seen in feedback correlate with the empirical performance (on the test set)?
Was there a drop in the amount of feedback?
Has the overall performance changed significantly recently?
Have the data sources changed significantly recently?
Have the labels changed significantly recently?
Was there a significant drop in the number of users/uses?
Has the labels/data distribution changed?
Has the gap between the labeled data distribution and the unlabeled data distribution changed?
4.2. Intentional Label/Data Changes
4.2.1. Re-run Labels & Data Analysis
4.2.3. Re-run Dataset Preparation
4.2.2. Re-run Basic Dataset Exploration
4.2.4. Re-run models evaluation
4.2.5. Update code if necessary and make sure tests pass
4.2.6. Does the product still satisfy the requirements?
4.2.7. If performance was impaired, what do we do? Can we work on past data? Stop scheduling and work with the last version for a couple of months? Initiate new research?
4.3. Problem Re-definition
4.3.1. What changed? Should several classes be added/removed? Or should we categorize all samples differently? In the latter case, we need to re-run every