BIOINFORMATICS AND PROTEINS
FROM SEQUENCE TO STRUCTURE
INTRODUCTION
Concepts
Speciation: evolution of a new species that is genetically independent of the ancestral species.
Homolog: gene related to a second gene by descent from a common ancestral gene (through speciation or duplication).
Ortholog: genes in different species that derive from a common ancestor by speciation and usually retain the same function.
Paralog: genes related by duplication of a common ancestral gene; the copies often evolve new functions.
Convergent evolution: similar properties in genes of different genetic lineages.
The general folding pattern is usually preserved. Structural distortions increase locally as the aa sequence divergence between 2 proteins increases, but they are not uniformly distributed: the core preserves the folding pattern.
Neutral evolution vs selection
Neutral theory (Kimura): most evolutionary changes at the molecular level are the result of random genetic drift of mutant alleles that are selectively neutral. Neutral = no effect on survival or reproduction. Acts at the molecular level.
Natural selection (Darwin): traits that enhance survival and reproduction become more common in successive generations of a population. It drives adaptive evolution. Acts at the phenotypic level.
Both coexist; they are complementary theories.
Structure is more conserved than sequence, but function is more conserved than structure. Significant sequence similarity allows function to be assigned to unknown proteins based on the properties of known proteins. This works because significant similarity reflects homology.
DIVERGENT AND CONVERGENT EVOLUTION
Divergent: homologous proteins that derive from a common ancestor and keep structural similarities despite sequence divergence. Examples: serine proteases, hemoglobin and myoglobin.
Convergent: proteins that evolve independently to have similar structures due to similar functional pressures, despite different sequences. Examples: subtilisin and chymotrypsin (the same catalytic triad in unrelated folds), and echolocation proteins.
HOMOLOGY MODELING
It is a knowledge-based approach to predict the 3D structure of a protein (target) using the known structure of a similar protein (template).
Basis: proteins with similar sequences adopt the same fold. It works for proteins with high sequence similarity, not for remote homologs (<30% pairwise identity).
Steps
Find a template: BLAST to search for homologous protein sequences in the PDB
Make an alignment: align the target with the template using MSA tools to identify structurally conserved regions.
Protein sequence alignments are more complicated than nucleotide alignments but more informative, because each position can hold any of 20 amino acids.
BLOSUM62: substitution matrix that helps identify homologous sequences by scoring alignments based on the likelihood of one aa being substituted for another. It is built from sequence blocks clustered at 62% identity, a threshold that strikes a balance between sensitivity and specificity.
Multiple Sequence Alignments
Local alignment: finds regions of similarity within larger sequences. Useful for finding conserved domains, motifs or functional sites.
Global alignment: aligns sequences of similar length end to end. Used to compare entire sequences and understand their overall similarity and evolutionary relationships.
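The difference between the two alignment modes can be illustrated with a minimal dynamic-programming sketch. It assumes a simple match/mismatch/linear-gap scheme (the values +1, -1, -2 are illustrative, not from these notes); a real protein alignment would score residue pairs with a matrix such as BLOSUM62 and usually use affine gap penalties. The function name `align_score` is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s,   # align the two residues
                        H[i - 1][j] + gap,     # gap in b
                        H[i][j - 1] + gap)     # gap in a
            if local:
                score = max(score, 0)          # a local alignment may restart
                best = max(best, score)
            H[i][j] = score
    # Local: best cell anywhere; global: bottom-right cell.
    return best if local else H[-1][-1]


long_seq, motif = "XXXACDEYYY", "ACDE"
print(align_score(long_seq, motif, local=True))   # conserved motif found
print(align_score(long_seq, motif, local=False))  # flanking regions penalized
```

With sequences of very different length, the local score reflects only the shared motif, while the global score is dragged down by the unaligned flanks, which is why global alignment suits sequences of similar length.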
Clustal Algorithm
Pairwise comparison
Guide tree creation: UPGMA or NJ
Final alignment: sequences are aligned progressively following the guide tree, giving an MSA that reflects the evolutionary relationships and similarities between the sequences.
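The first two Clustal steps (pairwise comparison and guide tree creation) can be sketched as follows. This is a simplified illustration, not Clustal's actual implementation: it assumes the sequences already have equal length and uses the fraction of differing positions as the pairwise distance, whereas Clustal derives distances from pairwise alignments. The function names `p_distance` and `upgma` are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, seqs):
    """Build a guide tree (nested tuples) by UPGMA clustering."""
    trees = list(names)          # current clusters, as nested tuples
    sizes = [1] * len(names)     # number of sequences in each cluster
    d = [[p_distance(a, b) for b in seqs] for a in seqs]
    while len(trees) > 1:
        # Pick the closest pair of clusters (i < j).
        i, j = min(combinations(range(len(trees)), 2),
                   key=lambda p: d[p[0]][p[1]])
        # Size-weighted average distance from the merged cluster to the rest.
        merged = [(d[i][k] * sizes[i] + d[j][k] * sizes[j])
                  / (sizes[i] + sizes[j]) for k in range(len(trees))]
        trees[i] = (trees[i], trees[j])
        sizes[i] += sizes[j]
        for k in range(len(trees)):
            d[i][k] = d[k][i] = merged[k]
        d[i][i] = 0.0
        # Drop cluster j from every structure.
        del trees[j], sizes[j], d[j]
        for row in d:
            del row[j]
    return trees[0]


names = ["A", "B", "C", "D"]
seqs = ["AAAAAAAA", "AAAAAAAT", "CCCCAAAA", "CCCCAAAT"]
print(upgma(names, seqs))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the most similar sequences first.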
Create a homology model
Backbone modeling: backbone based on structurally conserved regions
Loop modeling: based on the template and fragment libraries, knowledge-based potentials and constraints from the aligned structure.
Side chain modeling: rotamer optimization to ensure the side chains are in the most favorable conformations.
Energy minimization: refine the model to minimize energy and improve accuracy. Optimize the constraints using molecular dynamics with simulated annealing.
Validate your structure: assess the quality of the model, compare models generated by different prediction methods, estimate reliability with MQAPs (model quality assessment programs), check the amount of secondary structure, and ensure the model resembles true protein structures.
Applications: proteins with no experimentally determined structures, understanding protein function, interactions and designing experiments.
Limitations: limited accuracy; less effective for proteins with low sequence similarity.
The accuracy is proportional to the similarity in primary sequences (% identity):
<25%: not enough homology
25-50%: accuracy is the limiting factor
50-75%: the problem is the quality of the model
>75%: the limiting factor is the speed of modeling
The higher the homology, the higher the accuracy.
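As a minimal sketch, the identity ranges above can be applied to a target-template alignment like this. It assumes identity is computed over aligned, non-gap columns (other conventions exist) and that a boundary value falls into the higher range; both choices are assumptions, and the function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def modeling_regime(identity):
    """Map a % identity to the ranges listed in the notes.

    Boundary handling (25, 50, 75 go to the higher range) is an assumption.
    """
    if identity < 25:
        return "not enough homology"
    if identity < 50:
        return "accuracy is the limiting factor"
    if identity < 75:
        return "the problem is the quality of the model"
    return "the limiting factor is the speed of modeling"


identity = percent_identity("AC-DE", "ACGDF")
print(identity, "->", modeling_regime(identity))
```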
PROFILE-BASED THREADING
Structure prediction approach for sequences with no detectable sequence homology to proteins of known structure. It is a knowledge-based approach.
A score is needed to evaluate how well a sequence fits onto a structure (potential).
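A toy version of such a sequence-to-structure fit score is sketched below. It is only an illustration under strong assumptions: each template position is described by a single environment label ("B" buried, "E" exposed), the potential rewards hydrophobic residues in buried positions and polar residues in exposed ones, and no gaps are allowed. Real threading potentials are derived statistically from known structures and are far richer. All names here are hypothetical.

```python
# Assumed hydrophobic residue set, for illustration only.
HYDROPHOBIC = set("AVILMFWC")


def fit_score(seq, environments):
    """Score how well seq fits a template described by per-position environments.

    +1 when a hydrophobic residue sits in a buried ('B') position or a polar
    residue sits in an exposed ('E') position, -1 otherwise.
    """
    if len(seq) != len(environments):
        raise ValueError("sequence and template must have the same length")
    score = 0
    for aa, env in zip(seq, environments):
        buried = env == "B"
        score += 1 if (aa in HYDROPHOBIC) == buried else -1
    return score


def best_template(seq, templates):
    """Return the name of the template with the highest fit score."""
    return max(templates, key=lambda name: fit_score(seq, templates[name]))


templates = {"fold_1": "BBEE", "fold_2": "EEBB"}  # hypothetical fold library
print(best_template("VLKE", templates))
```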
AB-INITIO
Also called de novo, physics-based or free modeling. It is preferred when there is no similarity to known structures. It is the most difficult approach. It is based on the thermodynamic hypothesis, in which the native structure corresponds to the global minimum of free energy.
Two main approaches
Molecular dynamics: simulates the protein in water so that it naturally folds into the native structure. Problems: large number of atoms, huge number of time steps.
Minimal energy: the folded form is the minimal energy conformation of a protein.
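A quick back-of-the-envelope calculation shows why the number of time steps is a problem for molecular dynamics. The values used are illustrative assumptions: an integration time step of about 2 femtoseconds and a folding event of about 1 millisecond.

```python
def md_steps(folding_time_s, timestep_s=2e-15):
    """Number of integration steps needed to simulate folding_time_s seconds."""
    return folding_time_s / timestep_s


# Assumed values: 1 ms folding event, 2 fs time step.
steps = md_steps(1e-3)
print(f"{steps:.1e} steps")
```

Each of those steps requires evaluating the forces on every atom of the protein and the surrounding water, which is why minimal-energy searches over simplified representations are often used instead.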
CONSTRAINTS
Incorporating experimental data (NMR, Cryo-EM, chemical cross-linking or small angle scattering) or co-evolutionary information to improve predictions.
Experimental data: improved accuracy, validation and refinement.
Co-evolutionary information: MSA + direct coupling analysis. Provides contact prediction and functional insights.
Combining both can improve the accuracy, especially for regions where experimental data are sparse.
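The idea behind co-evolutionary contact prediction can be sketched with mutual information between two MSA columns. Note that this is a simpler stand-in, not direct coupling analysis: DCA fits a global statistical model to separate direct couplings from indirect correlations, which plain mutual information cannot do. The function name and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa is a list of equal-length aligned sequences. High values suggest
    that the two positions co-vary, a hint of possible structural contact.
    """
    n = len(msa)
    fi = Counter(s[i] for s in msa)            # residue counts in column i
    fj = Counter(s[j] for s in msa)            # residue counts in column j
    fij = Counter((s[i], s[j]) for s in msa)   # joint residue-pair counts
    return sum((c / n) * log2((c / n) / ((fi[a] / n) * (fj[b] / n)))
               for (a, b), c in fij.items())


# Toy alignment: columns 0 and 1 co-vary, column 2 is fully conserved.
msa = ["AKL", "AKL", "DEL", "DEL"]
print(mutual_information(msa, 0, 1))  # co-varying pair
print(mutual_information(msa, 0, 2))  # conserved column carries no signal
```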
FROM STRUCTURE TO FUNCTION
Involves determining the roles and activities of proteins based on their sequences and structures.
Levels of prediction
Residue-level (mutations)
Mutation impact analysis: PolyPhen-2, SIFT and PROVEAN
Active site prediction: residues involved in ligand binding
Structural annotations: secondary structure, disorder prediction, surface accessibility.
Protein-level (homology)
Homology-based function inference
Gene Ontology
Critical assessment of function annotation (CAFA)
Complex-level (interactions)
Protein-Protein interactions (PPI)
Docking
Sequence-based predictions
Interface prediction
Critical assessment of predicted interactions (CAPRI)
Function prediction
Sequence-based approaches: homology and motifs (BLAST)
Structure-based approaches: 3D structures and docking (MODELLER)
Motif-based approaches
Guilt-by-association approaches: leveraging AI and large datasets
Integrative approaches: combining multiple data sources