Biological Modeling
1
Random Walks and Turing Patterns
zebra stripes?
Random Walk Theorem
After n steps of unit length in a random walk, a particle will on average find itself a distance from its origin that is proportional to √n.
consider more than one type of particle are randomly walking
Reaction-Diffusion System
parameters
kill rate ( \( k \) )
predator-prey rate ( \( r \) )
\( A + 2B \rightarrow 3B\)
reactants \( \rightarrow \) products
the emergent patterns are called Turing patterns
feed rate ( \( f \) )
\( \frac{d[B]}{dt} = k [B] \)
diffusion rate ( \( d_A \) )
if we raise the diffusion rate \( \rightarrow \) more of the reaction A + 2B → 3B.
if f = 2k we can see zebra stripes
we want to simulate system
cellular automaton model
concentration of a particle
\( [ A ] \)
simulation
diffusion
convolution
feed
add \(f.(1-[A])\) to each cell
kill
delete \(k[B]\) from each cell
reaction
A + 2B → 3B
subtract \(r.[A][B]^2\) from concentration of A and add it to concentration of B
in our sample model
\([A]_{\text{new}} = [A] + \triangle A + f.(1 - [A]) - r.[A][B]^2\) \([B]_{\text{new}} = [B] + \triangle B - k.[B] + r.[A][B]^2\)
Applying these reaction-diffusion computations
a cellular automaton called the Gray-Scott model
fine-tuned system
very slight changes in parameter values can lead to significant changes in the system.
robust system
perturbations such as parameter variations do not lead to substantive changes in the ultimate behavior of the system
2
Network of Proteins
model protein interactions
network
biological networks has
network motifs
frequently recurring structures hidden within biological networks
central dogma
regulation
level
Protein
transcription factor
DNA
transcription factors
extracellular stimuli can activate a transcription factor via a system of signaling molecules that convey a signal through relay molecules to the transcription factor
week binding affinity to DNA in general
but strong for sequence motif
find the set of genes to which a transcription factor binds
ChIP-seq
chromatin immunoprecipitation sequencing
transcription factors and genes make a network
transcription factor network
nodes
organism’s proteins
edges
x to y if x is TF for y
loops
autoregulation
in E coli
most are negative
some parts are more common -> Motifs
Negative Autoregulation Motif
simulation
image activation of TF Y
reaction
kill rate
proteases enzymes degrade proteins
diffusion rate
same for both paricles
we ignore X kill rate because the system produce it in a way that its concentration become steady
Y repress itself
X -> X + Y
2Y -> Y
Repressilator
three components are transcription factors, and they repress themself in a loop
reaction
X + Y -> X
Y + Z -> Y
Z + X -> Z
all three particles are produced as the result of an activation process by some other transcription factor
hidden particle
I -> I + X
I -> I + Y
I -> I + Z
kill rate
same for all
diffusion rate
same for all
Start: X
because the biological systems are random
Noise is a feature of biological systems
The Feedforward Loop Motif
type-1 incoherent feedforward loop
simulation
image activation of protein Z, X and Y should be transcription factors
reaction
X -> X + Y
Y + Z -> Z
kill rate
we ignore X kill rate because the system produce it in a way that its concentration become steady
Y and Z kill rates are same
diffusion rate
same for both paricles
damped oscillation
have we non damping oscillator?
Repressilator
diagram
diagram
negative autoregulation makes system response time shorter
diagram
X -> X + Z
we have different type of that
result
result
result
The repressilator is robust to disturbance
3
chemotaxis
How does a simple organism like E. coli sense an attractant or repellent in its environment?
The movement of organisms like E. coli in response to a chemical stimulus is called chemotaxis.
E. coli and other bacteria
move toward
attractants like glucose and electron acceptors
move away from
repellents like \(\text{Ni}^{2+}\) and \(\text{Co}^{2+}\)
Every E. coli cell has between five and twelve flagella distributed on its surface that can rotate both clockwise and counter-clockwise.
all flagella are rotating counter-clockwise
propel the cell forward at about 20 μm per second.
it is about ten times the length of the cell per second
a single flagellum rotates clockwise
bacterium stops and rotates.
alternate between periods of “running” in a straight line and then “tumbling”
random walk through its environment
Bacteria are amazingly diverse.
but
variations in bacterial tumbling frequencies are relatively small.
In the absence of an attractant or repellent, E. coli stops to tumble once every 1 to 1.5 seconds, which is similar to most other bacteria.
we wonder why evolution might hold tumbling frequency constant across species.
Chemotaxis
perceive a change in its environment and react accordingly
signal transduction
cell identifies a stimulus outside the cell and then transmits this stimulus into the cell.
certain molecule’s extracellular concentration increases
receptor proteins
have more frequent binding with these molecules
This signal is then “transduced” via a series of internal chemical processes.
For example
cascade begins that eventually changes a transcription factor into an active state, so that it is ready to activate or repress the genes that it regulates.
E. coli
receptor proteins that detect attractants such as glucose by binding to and forming a complex with these attractant ligands.
To model ligand-receptor dynamics, we will use a reversible reaction that proceeds continuously in both directions at possibly different rates.
type
receptor-ligand complex will dissociate
L + T <--> LT
if we have \(l_o\) from L and \(t_0\) from T the system should become equilibrium in some point
ligand collides with a receptor
LT -> L + T
L + T -> LT
reverse reaction
means
the rates of the forward
and reverse reactions are equal.
1
\(k_{bind}\).[L].[T] = \(k_{dissociate}\).[LT]
conservation of mass
2
[L] + [LT] = \(l_0\)
[T] + [LT] = \(t_0\)
\(k_{bind}\).[L].[T] = \(k_{dissociate}\).[LT]
[L] + [LT] = \(l_0\)
[T] + [LT] = \(t_0\)
we can find [L], [T] and [LT]
4
Analyzing the Coronavirus Spike Protein
2003
severe acute respiratory syndrome, or SARS
This mysterious new disease had crossed the Pacific Ocean within a week of entering Hong Kong.
SARS was deadly, killing close to 10% of those who became sick.
2017
researchers published the results of sampling horseshoe bats for five years in a cave in Yunnan province. They found that these bats harbored coronaviruses with remarkable genetic similarity to the virus causing SARS.
but no body cares 😭
coronavirus disease 2019 (COVID-19)
Coronaviruses
SARS coronavirus 2 (SARS-CoV-2)
their outer membranes are covered in a layer of spike proteins
ACE2 enzyme on a human cell’s membrane
both SARS-CoV and SARS-CoV2 has spike protein
but
SARS-CoV-2 is more infectious
SARS-CoV-2
genome : nearly 30,000 nucleotides
Upon sequence comparison
related to several coronaviruses isolated from bats and distantly related to SARS-CoV
spike protein : 1,273 amino acid sequence
what is the spike protein 3D shape?
Nature’s magic protein folding algorithm
This folding process occurs spontaneously for all proteins and without any outside influence
the same polypeptide chain will almost always fold into the same 3-D structure in a manner of microseconds.
how does SARS-CoV-2 spike protein structure and function differ from the same protein in SARS-CoV?
Determine the spike protein’s shape from its sequence of amino acids?
how
experimentally
cryo-electron microscopy (cryo-EM)
X-ray crystallography
crystallize many copies of a protein
shine an intense beam of X-rays at the crystal
The light hitting the protein is diffracted
newer method
preserve thousands of copies of a protein in non-crystalline ice
examine these copies with an electron microscope
expensive
An X-ray crystallography experiment for a single protein costs upward of $2,000, and building an electron microscope can cost millions.
Protein structures that have been determined experimentally are typically stored in the PDB, which contains over 160,000 protein structures.
This number may seem large, but a recent study estimated that the 20,000 human genes translate into between 620,000 and 6.13 million protein isoforms (i.e., protein variants with slightly different structures).
computationally
hard
fine-tuned with respect to some mutations but robust with respect to others
As a result, two very different amino acid sequences can fold into proteins with similar structure and comparable function.
example
hemoglobin subunit alpha taken from three species: humans, short-fin mako sharks, and emus.
Polypeptide
Amino Acid
peptide bond
the bond is between C and N
the angle of this bond is fix but it's more flexible for C_alpha
So we can have different shape for a chain of amino acids
secondary strucrure
tertiary structure
quaternary structure
spike protein
three chain -> three identical protein structures
each chain
two subunit, S1 and S2
each subunit
protein domains
the SARS-CoV-2 spike protein has a receptor binding domain (RBD) located on the S1 subunit of the spike protein that is responsible for interacting with the human ACE2 enzyme
side chain
twenty types
nonpolar = hydrophobic
nine -> polar
tend to find these amino acids tucked away on the interior of the protein.
A central theme of Chapter 3 was that a system of chemical reactions moves toward equilibrium. The same principle is true of protein folding; when a protein folds into its final structure, it reaches a conformation that is as chemically stable as possible.
potential energy = free energy = bonded energy + non-bonded energy
bonded energy
protein’s covalent bonds
non-bonded energy
electrostatic interactions
bond angles between adjacent amino acids
torsion angles
van der Waals interactions
attraction and repulsion forces from the electric charge between pairs of charged amino acids.
Two nearby amino acids of opposite charge may interact to form a salt bridge.
Conformations that contain salt bridges and keep apart similarly charged amino acids will therefore have a lower free energy component contributed by electrostatic interactions.
due to random chance, the negatively charged electrons in an atom could momentarily be unevenly distributed on one side of the nucleus.
negative charge on the side
induced dipole
one side of the atom may attract only the oppositely charged components of another atom
Van der Waals forces refer to the attraction and repulsion between atoms because of induced dipoles.
As the protein folds, it seeks a conformation of lowest total potential energy
creating patterns from which the position of every atom in the protein can be inferred.
\( \min_{structures~for~P} potential~energy~(strucrure)\)
ab initio
Predicting a protein’s structure using only its amino acid sequence
What we have?
force fields
local search
Slow!
.pdb
How to compare to 3D structures?
for shapes S and T
translate S -> same center of mass
sample n, \(s_i\) and \(t_i\) points from two shapes
RMSD = \( \sqrt{\frac{1}{n} \sum (s_i - t_i)^2}\)
rotate and flip S so minimize RMSD
Kabsch algorithm
\(d(S, T) = \min RMSD\)
the potential energy of a candidate protein shape
Homology Modeling
(comparative modeling)
the information contained in known structures to help us predict the structure of a protein with unknown structure.
SARS-CoV spike protein -> SARS-CoV2 spike protein
fragment assembly
input
primary structure
find template with MSA
use template for structures of conserved parts
use fragment library for non-conserved parts
SWISS-MODEL
Robetta
GalaxyWEB
ProDy
Critical Assessment of protein Structure Prediction (CASP)
In 2020, the 14th iteration of this contest (CASP14) was won in a landslide.
The second version of AlphaFold, a DeepMind project, vastly outperformed the world’s foremost structure prediction approaches
based on deep learning
Local Difference in Proteins
SARS-CoV
receptor binding motif (RBM)
has mutated so much
Do the mutations that SARS-CoV-2 has accumulated somehow make it easier for the virus to infect human cells?
What about Structure?
In addition to verifying the structure of the spike protein in both SARS-CoV and SARS-CoV-2, researchers also determined the structure of the RBD complexed with ACE2 in both SARS-CoV (PDB entry: 2ajf) and SARS-CoV-2 (PDB entry: 6vw1).
RMSD -> for global structure alignment
Not good for find local differences
M(i, j) = 1 if amino acid i is more near to j than a threshold
contact map
SARS-CoV-2
SARS-CoV
similarity of i-th alpha carbon position
k = 2 or 3 (2 for first and last)
\(\sigma_{i, j}^2 = |i-j|^2\)
VMD
left : SARS-CoV right : SARS-CoV2
difference1
silver part
hydrophobic pocket
Met82, Leu79, Tyr83
hidden away from the outside of the ACE2 enzyme
yellow part
SARS-CoV-2, Phe486
inserts itself into the pocket
main-chain hydrogen bond
Asn487
Ala475
we can compute potential energy for ACE2 and Spike protein Complex
this complex has fewer energy in ACE2 - SARS-CoV2 Spike binding domain
What about dynamic movements of protein structure?
we use static structure
elastic network model (ENM)
Gaussian network model (GNM) is a kind of ENM
nodes: alpha carbons
springs: two C nearer than threshold
The fluctuation of carbons are Gaussian, meaning that the alpha carbon deviates randomly from its equilibrium position according to a normal (bell-shaped) distribution.
Although atomic fluctuations are powered by randomness, the movements of protein atoms are heavily correlated.
\(R_{ij} = \) distance between i and j
\(R_i^0 = \) the equilibrium position of node i
\(\Delta R_i = \) displacement from equilibrium
what is relation between \(\Delta R_i\) and \(\Delta R_j\) ?
cross-correlation of alpha carbons i and j
\(C_{ij} = \frac{<\Delta R_i, \Delta R_j>}{\sqrt{<\Delta R_i, \Delta R_i> . <\Delta R_j, \Delta R_j>}}\)
from -1 to 1
1 means completely correlated
\(n \times n\) cross-correlation matrix C
which parts are more flexible?
B-factor = \(\frac{8 \pi^2}{3} <\Delta R_i, \Delta R_i>\)
Normal mode analysis (NMA)
deconvolving a protein’s movement into individual normal modes
left is for SARS-CoV-2
its NTD is more flexible
what about direction?
anisotropic network model (ANM)
5
Classifying White Blood Cells
what are blood cells?
red blood cell (RBC)
transport oxygen via the hemoglobin protein
white blood cell (WBC)
How Are Blood Cells Counted?
classic method
device
hemocytometer
a technician filters a small amount of blood onto a gridded slide
counts the number of cells of each type in the squares on the grid
estimate the number of each type of cell per volume
identify and attack foreign cells as part of your immune system
divide into families based on their structure and function
monocytes
lymphocytes
granulocytes
computational methods
identify WBC by color threshold
a binarized image
not bad!
How to classify them?
(granulocytes, lymphocytes, and monocytes)
KNN
What is our features?
Binary Image ->
Find Boundary ->
take n samples ->
\(\textbf{s} = (s_1, s_2, \dots, s_n)\)
When two shapes are near?
\(RMSD(s, t) = \sqrt{\frac{1}{n} [d(s_1, t_1)^2 + \dots + d(s_n, t_n)^2]}\)
but we should rotate and move!
Kabsch algorithm
for every pair! so expensive 😢
rotate all images concurrently
major axis
the longest line segment that connects two points on the outside of the image and crosses through the image’s center of mass.
our desired shape space 😃
\(\textbf{s} = (s_1, s_2, \dots, s_n)\)
curse of dimensionality
principal components analysis (PCA)
have a multilobular nucleus
subtipe
eosinophils
neutrophils
d < n
\(\textbf{s} = (s_1, s_2, \dots, s_d)\)
KNN
F1-Score
The best known classification algorithms for WBC image
deep learning
basophils
forward reaction
\( \frac{d[A]}{dt} = f \)
activate or repress
SARS coronavirus (SARS-CoV)
\(k_{dissociate}\)
\(k_{bind}\)
if we add more particles, solving equations become hard
Gillespie’s stochastic simulation algorithm
input
\(l_0~~ t_0~~ k_{bind}~~ k_{dissociate}\)
output
[L], [T] and [LT]
how we find \(k_{bind}\) and \(k_{dissociate}\)?
experimentally
what is units?
concentration
reaction rate
\(k_{bind}[L][T] \rightarrow \frac{mol}{\mu m^3 . s}\)
\(k \rightarrow\) different
\(k_{dissociate}[LT] \rightarrow \frac{mol}{\mu m^3 . s}\)
\([A] \rightarrow \frac{mol}{\mu m^3}~~~~~\)
detour
a well-mixed environment of particles
on average, \(\lambda\) customers enter a store in a single hour
\( X \sim \) number of customers enter the store in next hour
\(X \sim Poisson\)
\(Pr(X = n) = \frac{\lambda^n e^{-\lambda}}{n!}~~~~~~\)
\( X \sim \) number of customers enter the store in next \(t\) hours
\(X \sim Poisson\)
\(Pr(X = n) = \frac{(\lambda t)^n e^{-\lambda t}}{n!}~~~~~~\)
how long we will typically have to wait for the next customer to arrive?
\(T \sim \) exponential distribution
\(Pr(T > t) = Pr(X = 0, T = t) = e^{- \lambda t}\)
if t increases the probability of Pr(T > t) decrease exponentially
a reaction involving those particles taking place at some average rate
how long should we expect to wait before this reaction occurs somewhere in the environment?
customers : instances of a chemical reaction
\(\lambda\) : the rate r at which the reaction occurs
\(E(T) = \frac{1}{\lambda}\)
we have n reactions with rates \(r_1, r_2, \dots, r_n\)
average waiting time
\(\frac{1}{r_1, r_2, \dots, r_n}\)
we can generate random time from exponential distribution
now we should do a reaction but witch
select one with probability \(P_i = \frac{r_i}{r_1 + r_2 + \dots + r_n} \)
example
\(P(L + T \rightarrow LT) = \frac{r_{bind}}{r_{bind} + r_{dissociate}}\)
\(P(LT \rightarrow L + T) = \frac{r_{dissociate}}{r_{bind} + r_{dissociate}}\)
process of signal transduction
phosphorylation
phosphoryl group usually comes from
dephosphorylation
\(ATP \rightarrow ADP + PO_{3}^{-}\)
a molecule loses its phosphoryl group
a chemical reaction that attaches a phosphoryl group (\(PO_{3}^{-}\)) to an organic molecule.
Phosphoryl modifications serve as an information exchange
activate or deactivate certain enzymes
phosphorylation cascade
a sequence of phosphorylation and dephosphorylation events
in bacterial chemotaxis serves to
transmit information within the cell about the amount of ligand binding being detected on the cell’s exterior.
material
receptor
methyl-accepting chemotaxis proteins (MCPs)
in the cell exterior
bind to ligand
in the cell interior
bind to proteins
The pathway also includes a number of additional proteins, which all start with the prefix Che (short for “chemotaxis”).
MSP bind to CheW and CheA
in absence of MCP-ligand, this complex is more stable
CheA molecule autophosphorylates
phosphorylated CheA pass on its phosphoryl group to CheY
phosphorylated CheY interact with flagellar motor switch
flagellar motor switch
change of flagellar rotation from counter-clockwise to clockwise
responsible for controlling the direction of flagellar rotation.
tumble
which in the absence of an increase in attractant occurs every 1 to 1.5 seconds
in presence of MCP-ligand, this complex is unstable
CheA is less able to autophosphorylate
less likely to tumble
run for a longer period of time
We Can add this reactions to Gillespie algorithm
increase in Ligand
decrease in phosphorylated CheY
less tumbling
adaptation
attractant remains constant for a period of time
regardless of the absolute value of the concentration
cell returns to the same background tumbling frequency
Bacteria remember past concentrations using methylation
methylation
a methyl group (\(−CH_3\)) is added to an organic molecule
Every MCP receptor contains four methylation sites
demethylation
On the plasma membrane
MCPs, CheW, and CheA molecules form an array structure
Methylation reduces the negative charge on the receptors
stabilizing the array
facilitating CheA autophosphorylation
how increase tumbling frequency?
ligand
methylation
CheR -> Methylation of MCPs
When bound to MCPs, CheR methylates ligand-bound MCPs faster
the rate of MCP methylation by CheR is higher if the MCP is bound to a ligand
when ligand increase
autophosphorylated CheY \(\downarrow\)
run
after a time
CheR methylate MCP
autophosphorylated CheY \(\uparrow\)
tumble
CheB -> Demethylation of MCPs
CheA -> Phosphorylate CheB
autophosphorylated CheA \(\uparrow\)
autophosphorylated CheB \(\uparrow\)
demethylation of MCP \(\uparrow\)
autophosphorylated CheA \(\downarrow\)
autophosphorylated CheY \(\downarrow\)
combinatorial explosion : means that building realistic models of biochemical systems at scale can be daunting.
rule-based modeling : a paradigm in which a potentially enormous number of reactions are specified by a much smaller collection of “rules” from which all reactions can be inferred.
Bacterial tumbling is robust to large sudden changes in attractant
\(l_0 = 100,000\)
\(l_0 = 1,000,000\)
\(l_0 = 10,000,000\)
in each state it returns to equilibrium quick and in same time
orange is CheY+P concentration
x side is time
Traveling up an attractant gradient
Imagine a glucose cube in an aqueous solution. As the cube dissolves, a gradient will form, with a glucose concentration that decreases outward from the cube.
to simulate
we increase \([L]\) exponentially
\([L] = l_0 e^{k.t}\)
k represents the steepness of the gradient
\(l_0 = 1000\) and \(k = 0.1\)
The Beauty of E. coli’s Random Exploration Algorithm
we simulate two possible scenario
average waiting time is fix
average waiting time depend on environment