================================
# Sequencing DNA With Neural Networks
- Michael P.S. Brown, Pacific Biosciences
- 23 August 2019
- Bay Area Bioinformatics Meetup
-
- __As much DNA sequence as possible, as accurately as possible.__
- __Perfection by eliminating noise.__
================================
# Characterization of a Single Molecule, Real-Time DNA Sequencing Machine
[Figure: PCA (left) and t-SNE (right) projections of the learned 5-mer embedding]
- http://projector.tensorflow.org/?config=http://openboundlabs.com/embed/projector.config
- Neural network
- to predict correct DNA sequence
- from raw noisy sequencer reads
- using embedding of 5-mer sequences
- Information in our raw sequencing data has structure across the
different base contexts.
- What causes that structure, and how can we best use it to improve accuracy? (An embedding-export sketch for the projector follows below.)
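- A minimal export sketch for the projector link above (assumptions: a trained Keras model like the one defined later in this talk, with its Embedding layer named "embed", and a lexicographic ACGT index-to-5-mer mapping):
```
import itertools
import numpy as np

# (1024, 10) matrix: one 10-dimensional vector per 5-mer.
weights = model.get_layer("embed").get_weights()[0]

# Hypothetical index -> 5-mer mapping; the real order depends on how the
# training pipeline encoded contexts.
kmers = ["".join(k) for k in itertools.product("ACGT", repeat=5)]

# projector.tensorflow.org accepts these two TSVs (vectors + labels).
np.savetxt("vectors.tsv", weights, delimiter="\t")
with open("metadata.tsv", "w") as fh:
    fh.write("\n".join(kmers))
```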
================================
# DNA Sequencing From First Principles
- DNA is a linear sequence of discrete symbols... Just like a computer
program. Coincidence?
- DNA bases are small: about 50 atoms each, with 0.34 nm between adjacent bases.
- Current disks use about 1 million atoms to store a bit (a back-of-envelope comparison appears at the end of this slide).
- Life keeps fresh copies everywhere!
- You can bury it in the cold ground for 700,000 years and still read it off (with degradations)
- Instructions for all life on earth.
- Origin of Life:
- structured molecules storing information that can _replicate_ in the physical universe.
- Evolution works from there.
- Entities that can replicate their information the best win!
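- A back-of-envelope version of the density comparison above (assuming the 4-symbol alphabet's raw capacity of 2 bits per base):
```
atoms_per_base = 50        # ~50 atoms per DNA base (from this slide)
bits_per_base = 2          # A/C/G/T: 4 symbols = 2 bits of raw capacity
disk_atoms_per_bit = 1e6   # ~1 million atoms per bit on current disks

dna_atoms_per_bit = atoms_per_base / bits_per_base  # 25 atoms per bit
print(f"DNA: {dna_atoms_per_bit:.0f} atoms/bit, disk: {disk_atoms_per_bit:.0f} atoms/bit")
print(f"DNA is ~{disk_atoms_per_bit / dna_atoms_per_bit:,.0f}x denser")
```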
================================
# DNA Replication
- So the key to evolution is replication.
- Nature built a little copy machine that replicates DNA very well:
the DNA polymerase
- a template DNA strand comes in
- polymerase senses what base should be incorporated
- when that base comes floating around, it pulls it in and incorporates it
[Figure: polymerase crystal structure (left) and molecular dynamics (right)]
================================
# How to sequence DNA
- Polymerase replicates DNA very well and is the basis of life on
earth.
- __Just read off what bases are being incorporated.__
- Did I mention DNA bases are really small...
- How? Use lasers and tiny holes!
- Bases being incorporated have a dye that reacts to laser light.
- Put each _SINGLE DNA molecule_ in a tiny ~90 nm hole (a zero-mode waveguide, ZMW) to limit interference.
[Figure: a ZMW (left) and its output trace (right)]
- Do this for 8 million holes fabricated on a silicon chip.
- Throw water and DNA on top of the chip and record the data!
================================
# Philosophical Sidenote
- The universe started with
- a linear sequence of discrete symbols
- replicating while obeying the rules of the physical universe (quantum, atomic, thermodynamic, ...)
- A competitive game of evolution where replication encourages greater
complexity if it improves fitness.
- Eventually you get to the complexity of brains.
- More fit: efficiency increases by understanding what the universe
can do to them.
- Brains reach a level of complexity such that they have
- language (communicate information across brains)
- theories of computation (information storing Turing machines)
- science (the laws of the universe)
- technology (lasers, nanoscale fabrication)
- Such that they can read off their own DNA code
- start reasoning about how DNA can be improved.
- What does this mean about the nature of the physical universe as
manifested in our ability to read off the code that created us?
- Ok let's stop there before we all become philosophers...
- __Back to work!__
================================
# Noise and Statistics
- We are observing a SINGLE molecule of DNA
- reading off 50 atoms at each step
- using a "spongy" biological molecule as our sequence machine.
- There's plenty of room at the bottom but also pretty noisy down here.
- We observe the molecular dynamics of this tiny biological machine in
real time!
- Our raw error rate is about 10-15%.
- We need some extra power: __independent observations__!
- Circularize the DNA using SMRTbells. 10,000-20,000 DNA bases on each strand.
- Sequence such that we can see each strand ~10 times ≈ 300,000 bases
observed from a single molecule.
- You can do statistics on sample sizes like this and drive the error
down: consensus sequencing (a simulation sketch follows).
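- A minimal simulation sketch of the idea (assumptions: independent, identically distributed per-pass errors, and a pessimistic binary right/wrong vote rather than true 4-letter voting):
```
import numpy as np

rng = np.random.default_rng(0)
p_err = 0.12        # per-pass raw error rate, in the 10-15% range above
n_bases = 100_000   # simulated template positions

for n_passes in (1, 3, 5, 11, 21):   # odd counts avoid voting ties
    # Each pass independently corrupts each position with probability p_err.
    wrong = rng.random((n_passes, n_bases)) < p_err
    # Consensus is wrong when more than half of the passes are wrong at a
    # position (real 4-letter voting does better: wrong calls disagree).
    p_consensus = (wrong.sum(axis=0) > n_passes / 2).mean()
    qv = -10 * np.log10(max(p_consensus, 1 / n_bases))  # Phred scale
    print(f"passes={n_passes:2d}  error={p_consensus:.2e}  ~QV{qv:.0f}")
```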
================================
# Life Is A Bit More Complicated
- Theoretically you expect an exponential decrease in error with every
additional observation: $\log(P_{\mathrm{err}}^{\mathrm{consensus}}) \propto N \log\left(\frac{P_{\mathrm{err}}}{1-P_{\mathrm{err}}}\right)$
(a numeric evaluation follows at the end of this slide)
- This scaling law holds at first but plateaus at about QV30-QV40
- (one error in $10^3$-$10^4$ bases)
- Pareto Principle: the last bit of accuracy is the hardest.
- Accuracy is KEY! A single base error can cause proteins to
become non-functional and cause disease.
- Model-based approaches (Hidden Markov Models) have done very well
- average QV30 (Wenger et al. 2019)
- However, models
- might be missing information (an unobserved variable), or
- have mismatched assumptions (Markov independence, unmodeled interactions)
- There is a Bayes optimal limit to performance given the noise.
- Are we theoretically optimal?
- Does there _exist_ any other method that can get better performance?
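- For intuition, a quick numeric evaluation of the scaling relation above (treating the proportionality constant as 1, an assumption). Note how fast the predicted QV passes the observed QV30-QV40 plateau:
```
import math

p_err = 0.12                                      # raw per-pass error rate
for n in (1, 2, 4, 6, 8):                         # number of observations
    log_perr = n * math.log(p_err / (1 - p_err))  # scaling law, constant = 1
    qv = -10 * log_perr / math.log(10)            # convert to Phred QV
    print(f"N={n}: predicted ~QV{qv:.0f}")
```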
================================
# A Universal Approximator
- Rather than building a model from assumptions,
- throw a lot of data with known answers at a universal
approximator: a neural network.
- Will it pick up on signals that we might not have thought of and
increase performance?
- INPUT: an MSA of multiple raw read passes from the same hole against a "proposed" noisy reference
- OUTPUT: a consensus reference that has fewer errors
```
proposed T....T....T....G....C....T....T....G....A....A....C....A....T....C....T....T....T....G....G....G....G.
reference T....T....G....G....C....T....T....G....A....Aac..C....A....G....C....T....T....T....G....G....G....G.
subreads
T....T....T....G....C....T....T....G....A....A....C....A....T....C....T....T....T....G....G....G....G.
T....C....Tcc..T....C....T....T....G....A....Aac..C....A....G....C....T....T....Tgtg+G....G....G....G.
T....T....G....G....Ct...T....T....G....A....Aact+C....A....T....C....T....T....Tgc..G....G....G....Gt
-....T....G....G....C....-....T....G....A....Ac...C....A....G....C....T....T....T....G....G....G....Gt
T....C....G....G....C....T....G....A....A....Ac...C....A....G....C....T....T....Tta..G....G....G....G.
-....T....G....G....C....T....T....G....A....Ac...C....A....G....C....T....T....T....G....G....G....G.
T....T....G....G....C....T....T....Ga...A....A....C....A....G....C....T....T....T....G....G....G....G.
T....Ta...G....G....C....T....T....G....A....Aac..C....A....G....C....T....T....T....G....G....G....G.
```
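- A minimal sketch (not PacBio's actual featurizer) of one plausible way to one-hot encode a row of this display into per-column channels; the alphabet, channel layout, and `subread_rows` are illustrative assumptions:
```
import numpy as np

# Illustrative channel layout: A, C, G, T, plus one channel for
# everything else ('.', '-', '+').
CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_row(row: str) -> np.ndarray:
    """One-hot encode a single MSA row into shape (len(row), 5)."""
    out = np.zeros((len(row), 5), dtype=np.float32)
    for i, ch in enumerate(row.upper()):      # lowercase = inserted bases
        out[i, CHANNELS.get(ch, 4)] = 1.0     # unknown chars -> channel 4
    return out

# Stack subreads into the (rows, cols, channels) tensor the model consumes.
msa = np.stack([encode_row(r) for r in subread_rows])
```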
- We have been working with great people at __Nvidia__ to do this modeling
on GPUs (4x Tesla V100).
================================
# Experiment: Read Sequence Context To Help Predict Consensus
- Sequence a _known_, trusted sample (Human HG002) to generate training data.
- Neural network takes noisy DNA bases and predicts the known correct base.
- Input: tensors of the MSA against the "proposed" noisy reference, one raw read per row.
- Input: 5-mer contexts of the "proposed" reference as an additional input.
- Embed each 5-mer into a 10-dimensional space (4^5 = 1024 possible 5-mers).
```
import tensorflow as tf
from tensorflow import keras as KK

# args holds hyperparameters: rows/cols = MSA rows and alignment columns,
# baseinfo = per-base feature channels.
inputmsa = KK.layers.Input(shape=(args.rows, args.cols, args.baseinfo), name="inputmsa")
inputctx = KK.layers.Input(shape=(args.cols,), name="inputctx")
# Collapse all 16 MSA rows at each column into 256 learned features.
baseadj = KK.layers.Conv2D(256, kernel_size=(16, 21), strides=(16, 1), activation='relu', padding="same", name="baseadj")(inputmsa)
baseadj2 = KK.layers.Lambda(lambda xx: tf.squeeze(xx, 1), name="baseadj2")(baseadj)
# Embed each of the 4^5 = 1024 possible 5-mers into 10 dimensions.
embed = KK.layers.Embedding(input_dim=1024, output_dim=10, name="embed")(inputctx)
merged = KK.layers.Concatenate(axis=2, name="merged")([baseadj2, embed])
# Per-column softmax over 5 classes (presumably A/C/G/T plus gap).
predBase = KK.layers.TimeDistributed(KK.layers.Dense(5, activation='softmax', name="predBase"))(merged)
model = KK.models.Model(inputs=[inputmsa, inputctx], outputs=predBase)
```
- Train on the data by following gradients (a minimal sketch follows).
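- A hypothetical compile-and-fit setup under standard Keras conventions; the optimizer, loss, epochs, and the `train_*` arrays are illustrative assumptions, not the talk's actual recipe:
```
# One-hot per-column targets -> categorical cross-entropy (assumption).
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(
    x={"inputmsa": train_msa, "inputctx": train_ctx},  # tensors as above
    y=train_truth,   # known correct base per column (one-hot, from HG002)
    validation_split=0.1,
    epochs=10,
    batch_size=64,
)
```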
================================
# Result: Read Sequence Context To Help Predict Consensus
- Plot of the 10-dimensional 5-mer embedding space (a projection sketch appears at the end of this slide).
[Figure: PCA (left) and t-SNE (right) projections of the 5-mer embedding]
- This is not uniform! It has structure!
- As expected, the first read base appears to be the biggest discriminator.
- There is interesting trailing structure.
- "context effects" where the polymerase when physically touching
these bases
- Crystal structure shows that 14-15 bases physically touch the polymerase
during incorporation
- This informative structure is exploited by the network to increase
consensus performance.
- Push this information back into our engineered polymerase and measurement
system to improve it scientifically!
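- A minimal projection sketch from the trained model above (assumes scikit-learn and matplotlib; plot styling omitted):
```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# (1024, 10): one learned vector per 5-mer.
weights = model.get_layer("embed").get_weights()[0]

pca_xy = PCA(n_components=2).fit_transform(weights)
tsne_xy = TSNE(n_components=2).fit_transform(weights)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_xy[:, 0], pca_xy[:, 1], s=4); ax1.set_title("pca")
ax2.scatter(tsne_xy[:, 0], tsne_xy[:, 1], s=4); ax2.set_title("tsne")
plt.show()
```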
================================
# Forward Opportunities
- Current NN models are improving in performance but are not yet
better than our HMM model-based approaches.
- Challenges:
- Are we already Bayes optimal?
- Different data representation?
- Better NN model architectures?
- More data needed?
- Better training?
- Our human data is publicly available:
- https://github.com/PacificBiosciences/DevNet/wiki/Genome-In-A-Bottle-(GIAB)-Human-Genomes-sequenced-on-PacBio-Platforms
- __THANKS!__ to the great people at __Nvidia__ (George Vacek, Mike
Vella, Avantika Lal, Johnny Israeli, David Nola, Brian Welker) and
__PacBio__ !
================================