Data Stewardship plans
Chapter 1: Administrative data
Research Data Life Cycle
Chapter 2:
Data cycle step 1: Reusing data
Is there pre-existing data?
Will you use Pre-existing data (including Other People’s Data)?
Reference data can be:
1. Core resources like UniProt or PDB
2. Things like a "human reference genome" that you use to define how your data differs
What existing (non-reference) data sets will you use?
Do you need to harmonize different sources of existing data?
Do you know what data already exists?
Will you use any data that needs to be made computer readable first?
(1.11) What/how/who will integrate existing data
(1.11.1) Will you need to add data from literature?
(1.11.2) Do you need to integrate or link to a different type of data?
Will you allow others to contribute data (open contribution)?
Chapter 3:
Data cycle step 2: Creating data
Data interoperability
For each data format you will be using:
Is this a standard format?
Does this format enable sharing and long term archiving?
Will you be converting to a format more suitable for archiving later? (G F6; a conversion sketch follows this list)
What volume of data of this format do you expect?
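A minimal sketch of such a conversion in Python, assuming a tabular working file; the file names are placeholders, and pandas.read_excel additionally needs the optional openpyxl dependency for .xlsx files:

```python
# Minimal sketch: convert a proprietary working format to an open, archivable
# one. File names are placeholders.
import pandas as pd

df = pd.read_excel("measurements.xlsx")     # proprietary working format
df.to_csv("measurements.csv", index=False)  # plain-text copy for archiving
```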
Will you be using new types of data?
Do you need to create vocabularies or ontologies for any of your data items?
How will you design the format for your data?
Will you describe your data format for others?
How will you be storing metadata? (A metadata-capture sketch follows below.)
Do suitable “Minimal Metadata About” standards exist for you?
Do you know how and when you will be collecting the necessary metadata?
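A minimal sketch in Python of capturing a structured metadata record at collection time rather than reconstructing it later; all field names and values are illustrative, not a prescribed schema, and a real plan would follow a community "Minimum Information" checklist for the domain:

```python
# Minimal sketch: write one structured metadata record per data set.
import json
from datetime import date

record = {
    "study_id": "EXAMPLE-001",             # hypothetical local identifier
    "title": "Example study",
    "collected_on": date.today().isoformat(),
    "organism": "Homo sapiens",
    "assay_type": "RNA-seq",               # ideally an ontology term (see OLS)
    "contact": "data.steward@example.org",
}

with open("EXAMPLE-001.metadata.json", "w") as f:
    json.dump(record, f, indent=2)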
Did you consider re-usability of your data beyond your original purpose?
Do you need to exchange your data with others?
Did you consider how to monitor data integrity?
How will you make sure data are what they should be?
Will you keep checksums of certified/verified/correct/canonical data? (G F2)
Will you define ways to detect file/sample swaps, e.g. by measuring something independently? (G F5)
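A minimal sketch of the checksum idea above in Python; the directory layout and the two-space manifest format are arbitrary choices for illustration:

```python
# Minimal sketch: build and later verify a SHA-256 manifest for canonical
# data files.
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Stream the file through SHA-256 so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir: Path, manifest: Path) -> None:
    with manifest.open("w") as out:
        for p in sorted(data_dir.rglob("*")):
            if p.is_file():
                out.write(f"{sha256sum(p)}  {p.relative_to(data_dir)}\n")

def verify_manifest(data_dir: Path, manifest: Path) -> list[str]:
    """Return relative paths whose current checksum no longer matches."""
    mismatches = []
    for line in manifest.read_text().splitlines():
        digest, rel = line.split("  ", 1)
        if sha256sum(data_dir / rel) != digest:
            mismatches.append(rel)
    return mismatches
```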
Does all data have a license?
Will you store licenses with the data?
How will you keep provenance?
How will you do file naming and file organization?
Agree on an SOP for naming files (G F4; a naming-check sketch follows this block)
How will you handle file versioning?
How will you ensure consistent usage of the file naming?
Is all metadata that is in the file names also available in the proper metadata?
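One way to enforce a naming SOP automatically; a minimal Python sketch in which the pattern (project_sample_assay_date_version.ext) is a made-up example of such a convention, to be replaced by whatever your project agrees on:

```python
# Minimal sketch: report files that violate an agreed naming SOP.
import re
from pathlib import Path

NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<sample>[A-Za-z0-9-]+)_"
    r"(?P<assay>[a-z]+)_"
    r"(?P<date>\d{8})_"
    r"v(?P<version>\d+)\.\w+$"
)

def check_names(data_dir: Path) -> list[str]:
    """Return the names of files that do not match the naming SOP."""
    return [p.name for p in data_dir.rglob("*")
            if p.is_file() and not NAME_PATTERN.match(p.name)]
```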
It is not self-evident that you always need to collect new data for a study; more and more, re-use of existing data can accomplish the same thing.
Which experimental data will you collect?
How many subjects do you need to be able to get statistically meaningful results?
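A worked example of the sample-size question, as an a-priori power analysis with statsmodels; the effect size, alpha, and power values are placeholder assumptions and your study design determines the real ones:

```python
# Worked example: a priori sample size for a two-group comparison.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # assumed standardized effect size (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8,        # desired statistical power
)
print(f"Subjects needed per group: {n_per_group:.0f}")  # ~64 with these inputs
```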
Selection of analysis technique
Which database will you use to store the data?
Are there any data format considerations?
(1.15.1) What is the volume of each anticipated data set
(1.15.2) What data formats do the machines yield
(1.15.3) What preprocessing is needed
Are there potential issues regarding data ownership and access control?
(1.16.1) Who needs access?
(1.16.2) Where will servers be placed?
(1.16.3) What level of data protection is needed
(1.16.4) What will the IP situation be?
(1.18.4.1) Who will decide about opening up data, e.g. after the project finishes?
How do you take care of quality control of data capture?
Are you logging what happens exactly to samples?
Will different collection sites be using comparable protocols, formats and identifiers?
Harmonize?
Will your data be able to answer your scientific question?
Give a list of data sets you will acquire using equipment.
Is the data capture equipment and protocol completely standardized?
Is special care needed to get the raw data ready for processing?
Do you have non-equipment data capture?
Questionnaires?
Case report forms?
Electronic patient records?
Specify a list of data sets
Is there a proper data integration tool that can handle and combine all the data types you are dealing with in your project?
Will you be storing samples?
Chapter 4:
Data cycle step 3: Processing data
Data Processing Setup
Are you using a Virtual Research Environment for compute and data sharing?
Will you need a shared working space to work with your data?
How will you work with your data?
What kind of data will be in your workspace?
Do you need the storage close to compute capacity?
Will you keep data in a work format that is different from the archival format?
What is the capacity profile? Will you need the same storage quantity during the whole project?
Will you need to temporarily archive data sets (e.g. to tape)?
If you will be starting with a high volume of data, how will that initial data come in?
How will project partners access the work space?
How available must the workspace be?
What is the acceptable risk for “total loss”?
Can all files in the workspace be recomputed quickly?
Is there software in the workspace?
What percentage of time should the data be available? During work hours? Nights? On weekends?
How will you do backups and other copy data management? (A backup sketch follows below.)
Do you need to backup any data stored elsewhere related to your project in your workspace?
If not: Are all data from all project members adequately backed up and traceable?
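A minimal backup sketch, assuming rsync is installed and that the source and destination paths are placeholders; the --checksum flag makes rsync compare file content rather than timestamps:

```python
# Minimal sketch: mirror the workspace to a backup location with rsync.
import subprocess

subprocess.run(
    ["rsync", "-a", "--checksum", "--delete",
     "/data/workspace/", "/backup/workspace/"],
    check=True,
)
```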
Is access control to the files in the working area well arranged?
Make sure to give write access only to people who need it
Make sure to give read access only to people that are explicitly allowed, especially when privacy sensitive data are involved
Is there a process in place for offboarding leaving project members?
Removing access
Is there a process in place for onboarding new project members?
Giving access
Instructing about responsibilities and accountability
Developing Workflows: Has this been arranged or is more guidance desired?
Will you be running a bulk/routine workflow, or develop a research analysis?
What data will workflow developers use? Can workflow developers work with a subset of new data? Is there pre-existing data available for this?
List existing software you will use in the analysis workflow.
List new software components you will develop for the analysis workflow.
Did you choose the workflow engine?
What features do you need?
How is the integrity of the tools in the workflow guaranteed? (G H3)
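One possible way to guard tool integrity is to pin expected versions and fail fast on drift before each run; a minimal Python sketch in which the tool names, version flags, and pinned versions are hypothetical:

```python
# Minimal sketch: check workflow tools against pinned versions.
import subprocess

PINNED = {
    "samtools": "1.19",  # illustrative pins, not recommendations
    "bcftools": "1.19",
}

def check_tool_versions() -> None:
    for tool, expected in PINNED.items():
        out = subprocess.run([tool, "--version"],
                             capture_output=True, text=True, check=True)
        first_line = out.stdout.splitlines()[0]
        if expected not in first_line:
            raise RuntimeError(f"{tool}: pinned {expected}, found: {first_line}")
```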
Running Workflows
How will you make sure to know what exactly has been run?
How do you validate the integrity of the results?
Will you run part of the data set repeatedly to catch unexpected changes in results?
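A minimal sketch of recording exactly what has been run: write a small manifest next to each workflow invocation. The field names here are illustrative, not a fixed schema:

```python
# Minimal sketch: record a run manifest for later provenance checks.
import json
import sys
from datetime import datetime, timezone

def write_run_manifest(inputs: list[str], outputs: list[str],
                       path: str = "run_manifest.json") -> None:
    manifest = {
        "command": " ".join(sys.argv),
        "started": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,    # ideally paths plus their checksums
        "outputs": outputs,
        "python": sys.version,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```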
Compute Capacity Planning
Determine needs in Memory/CPU/IO ratios
Suitable system
Grid/Cluster/Cloud?
Data Transport needed?
Do you have in house experience with the used compute architectures?
Do your people need training for Grid/Cloud/Hadoop?
Use shared infrastructure?
Purchase special needs?
Is all Compute capacity needed available close to the working storage?
If not, you need to plan the necessary network capacity
If not, can the data legally be transported to the compute capacity?
Will different groups work on different parts of the workflow, and will parts of the computing be done on local infrastructure?
Is there sufficient network capacity?
Do groups have local infrastructure that can be used?
Is the Risk of information loss / leaks / vandalism acceptable?
Is any of the data privacy sensitive?
Do project members store data or software on computers in the lab or external hard drives connected to those computers?
Do people carry data with them?
Are researchers using cloud accounts?
Are data or reports sent over e-mail or other messaging services?
Do the data centers where data is stored have Certifications?
Are all project web services used via https?
Have project members been instructed?
Did you do an impact analysis?
Information loss?
Information leak?
Information vandalization?
What will you do if the compute facility is down?
Chapter 5:
Data cycle step 4: Interpreting data
How will you be doing the integration of different data sources?
For each data type you use, answer the following:
How is the data structured?
Answer: a simple table (data records) for each data set
Answer: complex data, like a graph
Answer: a domain-specific format for this data (e.g. VCF)
Answer: differently per data set
Will you use a workflow e.g. with tools for database access or conversion?
Will you use a linked data approach? (A linked-data sketch follows this block.)
Will you use linked data sources?
Will you make your results available as semantically interoperable data?
Will you be using common or exchangeable units?
Will you be using common ontologies?
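A minimal linked-data sketch with rdflib, typing one record with an ontology term; the subject URI and the EDAM term are examples only:

```python
# Minimal sketch: expose one result as linked data (RDF/Turtle).
from rdflib import RDF, Graph, Literal, Namespace

EX = Namespace("https://example.org/dataset/")
EDAM = Namespace("http://edamontology.org/")

g = Graph()
sample = EX["sample-001"]
g.add((sample, RDF.type, EDAM["data_0006"]))       # EDAM 'Data' (example term)
g.add((sample, EX["measuredValue"], Literal(42)))

print(g.serialize(format="turtle"))
```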
Will there be potential issues with statistical normalization?
Will you be integrating different data sources in order to get more samples or more data points?
Will you be integrating different data sources in order to get more information about the same samples / subjects / data points?
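The two integration modes above map onto two different table operations; a minimal pandas sketch with illustrative table and column names:

```python
# Minimal sketch: row-wise vs column-wise integration of data sources.
import pandas as pd

site_a = pd.DataFrame({"subject": ["s1", "s2"], "value": [1.0, 2.0]})
site_b = pd.DataFrame({"subject": ["s3", "s4"], "value": [3.0, 4.0]})
clinical = pd.DataFrame({"subject": ["s1", "s2"], "age": [34, 57]})

# More samples / data points: stack compatible tables row-wise.
more_samples = pd.concat([site_a, site_b], ignore_index=True)

# More information about the same subjects: join on the shared identifier.
more_info = site_a.merge(clinical, on="subject", how="left")
```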
Do you have all tools to couple the necessary data types?
(LS) Will you be doing structure modeling?
(LS) Will you do systems biology modeling?
Will you be doing (automated) knowledge discovery?
Chapter 6:
Data cycle step 5: Archiving/Publishing data
During the project, will you be archiving data for long term preservation?
Give a list of data sets. For each of these:
What kind of repository?
Answer: self-hosting
Answer: in a domain-specific repository
Answer: in a repository provided by your institute
Answer: in a national repository
Will you be adding this data set to a catalogue?
How will you make sure that blocks of data deposited in different repositories and catalogues can be recognized as belonging to the same study?
Did you work out the financial aspects of making the data available?
Who will pay for open access publishing?
Who keeps data access running? Recurring fees?
Will you be archiving data after the project in “cold storage”?
Will data formats be upgraded if they grow obsolete?
Will storage media be upgraded if they grow obsolete?
Will you also publish if the results are negative?
List of software packages?
Is there an open source license?
Where will it be available?
Will it be listed in a catalogue?
How will you be making sure there is good provenance of the Data Analysis?
Will reference data be created?
What will the IP situation be?
How will you maintain it?
What will the release schedule be?
xref: reuse of existing reference data
Chapter 7:
Data cycle step 6: Giving access to data
Will you be working with the philosophy “as open as possible” for your data?
Can all of your data become completely open immediately?
Are there legal reasons why (some of) your data cannot be completely open?
Privacy reasons?
IP reasons?
Will you be using authenticated access?
Are there business reasons why some of your data can not be completely open? Patents?
Are there other reasons?
Will you use a limited embargo period?
Do you know how your data could be re-used?
Data access committee
Consent
Will there be translational returns / valorization that you can participate in?
Tools referenced in this map:
DMHub
DMPonline
RightField
EBI Data Submission wizard
UseGalaxy
R
WorkflowHub
NeLS
choosealicense
SEEK
Ontology Lookup Service
JWS
BioStudies
NeLS/SBI
transMART
openBIS
various repositories
FAIRsharing
Re3data
jupyter
gitlab/github/...
ISAtools
COPO
Amnesia
EGA, ...
Open Data License Guide