Sequence data formats
Processed data - Whole genomes and annotations present
Assembled genomes -
.fasta
Draft (multiple contigs) or complete
Annotated genome (e.g., labelled with genes) -
.gbk,.gff
Aligned reads -
.sam, .bam
Variants -
.vcf
VCF
.vcf.gz
- compressed version of .vcf
Lower storage requirements
.tbi
- can accompany .vcf.gz - a tabix index file for the compressed VCF
Provides index-based access to the compressed data without decompressing the whole file - lookups are faster and less storage is required
ID column
- contains variant identifiers from online sequence repositories; a missing value (written as '.') means no identifier from a repository is recorded for the variant
prefix 'esv' - belongs to DGVa database
prefix 'rs' - belongs to dbSNP
In the header, the possible values for columns such as INFO, FILTER and FORMAT are declared on meta-information lines (each starting with ##), with a brief description of every value
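As a rough illustration of how the header declares the values used in each column, the sketch below collects the ID and Description of every structured meta-information line. The function name header_definitions, the regular expression, and the example lines are assumptions made for this illustration; a real parser would handle more cases (for example, quoted commas or escaped quotes).

```python
import re

# Matches structured lines such as:
#   ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
META_LINE = re.compile(
    r'^##(?P<column>\w+)=<ID=(?P<id>[^,>]+).*?Description="(?P<description>[^"]*)"'
)


def header_definitions(lines):
    """Return {column: {value_id: description}} from VCF header lines."""
    definitions = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM column-header line
        match = META_LINE.match(line)
        if match:  # unstructured lines like ##fileformat=... are skipped
            column = definitions.setdefault(match["column"], {})
            column[match["id"]] = match["description"]
    return definitions


# Made-up header lines in the usual VCF layout, for illustration only.
example_header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

for column, values in header_definitions(example_header).items():
    print(column, values)
```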
Protein sequences - translated from predicted genes or from assembled transcripts
Raw data - directly from sequencer - contains low quality sequences and contaminants
Amplicons (PCR products) -
.fasta
Readsets from WGS -
.fastq/fastq.gz
Locus - overlapping amplicons assembled (portion of the genome sequenced) -
.fasta
Prepared libraries - E.g., RNAseq, etc. -
.fastq.gz
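Reading raw reads from .fastq or .fastq.gz can be sketched with the Python standard library alone. This is a minimal example under stated assumptions: each record occupies exactly four lines (no wrapped sequences), quality characters use the Phred+33 encoding, and the names open_reads, read_fastq and mean_quality are hypothetical helpers, not part of any library.

```python
import gzip
import io


def open_reads(path):
    """Open a FASTQ file as text, whether plain or gzip-compressed."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_quality(quality):
    """Average Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(char) - 33 for char in quality) / len(quality)


# In-memory example; a real file would go through open_reads(path).
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
for name, sequence, quality in read_fastq(demo):
    print(name, sequence, mean_quality(quality))
```

A mean-quality value like this is one simple way to flag the low-quality sequences that raw data contains; the threshold to apply is left to the reader.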
.fasta/.fastq
Lecture 6-1, slide 6-7
.gff
Lecture 6-1, slide 9-10
.gb/.gbk (genbank file format)
Lecture 6-1, slide 12
Curated data - has a defined purpose, and quality control has been performed on the data
Lecture 6 - present and past year