================================
# Sequencing DNA With Neural Networks
- Michael P.S. Brown, Pacific Biosciences
- 23 August 2019
- Bay Area Bioinformatics Meetup
-
- __As much DNA sequence as possible, as accurately as possible.__
- __Perfection by eliminating noise.__
================================
# Characterization of a Single Molecule, Real-Time DNA Sequencing Machine
[Figure: PCA (left) and t-SNE (right) projections of the learned 5-mer embedding]
- http://projector.tensorflow.org/?config=http://openboundlabs.com/embed/projector.config
- Neural network
- to predict correct DNA sequence
- from raw noisy sequencer reads
- using embedding of 5-mer sequences
- Information in our raw sequencing data has structure across the
different base contexts.
- What causes that structure, and how can we best use it to improve accuracy? (An embedding-export sketch for the projector follows below.)
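- A minimal export sketch for the projector link above (assumptions: a trained Keras model like the one defined later in this talk, with its Embedding layer named "embed", and a lexicographic ACGT index-to-5-mer mapping):
```
import itertools
import numpy as np

# (1024, 10) matrix: one 10-dimensional vector per 5-mer.
weights = model.get_layer("embed").get_weights()[0]

# Hypothetical index -> 5-mer mapping; the real order depends on how the
# training pipeline encoded contexts.
kmers = ["".join(k) for k in itertools.product("ACGT", repeat=5)]

# projector.tensorflow.org accepts these two TSVs (vectors + labels).
np.savetxt("vectors.tsv", weights, delimiter="\t")
with open("metadata.tsv", "w") as fh:
    fh.write("\n".join(kmers))
```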
================================
# DNA Sequencing From First Principles
- DNA is a linear sequence of discrete symbols... Just like a computer
program. Coincidence?
- DNA bases are small: about 50 atoms each, with 0.34 nm between adjacent bases.
- Current disks use about 1 million atoms to store a bit (a back-of-envelope comparison appears at the end of this slide).
- Life keeps fresh copies everywhere!
- You can bury it in the cold ground for 700,000 years and still read it off (with degradations)
- Instructions for all life on earth.
- Origin of Life:
- structured molecules storing information that can _replicate_ in the physical universe.
- Evolution works from there.
- Entities that can replicate their information the best win!
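- A back-of-envelope version of the density comparison above (assuming the 4-symbol alphabet's raw capacity of 2 bits per base):
```
atoms_per_base = 50        # ~50 atoms per DNA base (from this slide)
bits_per_base = 2          # A/C/G/T: 4 symbols = 2 bits of raw capacity
disk_atoms_per_bit = 1e6   # ~1 million atoms per bit on current disks

dna_atoms_per_bit = atoms_per_base / bits_per_base  # 25 atoms per bit
print(f"DNA: {dna_atoms_per_bit:.0f} atoms/bit, disk: {disk_atoms_per_bit:.0f} atoms/bit")
print(f"DNA is ~{disk_atoms_per_bit / dna_atoms_per_bit:,.0f}x denser")
```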
================================
# DNA Replication
- So the key to evolution is replication.
- Nature built a little copy machine that replicates DNA very well:
the DNA polymerase
- a template DNA strand comes in
- polymerase senses what base should be incorporated
- when that base comes floating around, it pulls it in and incorporates it
[Figure: polymerase crystal structure (left) and molecular dynamics (right)]
================================
# How to sequence DNA
- Polymerase replicates DNA very well and is the basis of life on
earth.
- __Just read off what bases are being incorporated.__
- Did I mention DNA bases are really small...
- How? Use lasers and tiny holes!
- Bases being incorporated have a dye that reacts to laser light.
- Put each _SINGLE DNA molecule_ in a tiny ~90 nm hole (a zero-mode waveguide, ZMW) to limit interference.
[Figure: a ZMW (left) and its output trace (right)]
- Do this for 8 million holes fabricated on a silicon chip.
- Throw water and DNA on top of the chip and record the data!
================================
# Philosophical Sidenote
- The universe started with
- a linear sequence of discrete symbols
- replicating while obeying the rules of the physical universe (quantum, atomic, thermodynamic, ...)
- A competitive game of evolution where replication encourages greater
complexity if it improves fitness.
- Eventually you get to the complexity of brains.
- More fit: efficiency increases by understanding what the universe
can do to them.
- Brains reach a level of complexity such that they have
- language (communicate information across brains)
- theories of computation (information storing Turing machines)
- science (the laws of the universe)
- technology (lasers, nanoscale fabrication)
- Such that they can read off their own DNA code
- start reasoning about how DNA can be improved.
- What does this mean about the nature of the physical universe as
manifested in our ability to read off the code that created us?
- Ok let's stop there before we all become philosophers...
- __Back to work!__
================================
# Noise and Statistics
- We are observing a SINGLE molecule of DNA
- reading off 50 atoms at each step
- using a "spongy" biological molecule as our sequence machine.
- There's plenty of room at the bottom but also pretty noisy down here.
- We observe the molecular dynamics of this tiny biological machine in
real time!
- Our raw error rate is about 10-15%.
- We need some extra power: __independent observations__!
- Circularize the DNA using SMRTbells. 10,000-20,000 DNA bases on each strand.
- Sequence such that we can see each strand ~10 times ≈ 300,000 bases
observed from a single molecule.
- You can do statistics on sample sizes like this and drive the error
down: consensus sequencing (a simulation sketch follows).
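- A minimal simulation sketch of the idea (assumptions: independent, identically distributed per-pass errors, and a pessimistic binary right/wrong vote rather than true 4-letter voting):
```
import numpy as np

rng = np.random.default_rng(0)
p_err = 0.12        # per-pass raw error rate, in the 10-15% range above
n_bases = 100_000   # simulated template positions

for n_passes in (1, 3, 5, 11, 21):   # odd counts avoid voting ties
    # Each pass independently corrupts each position with probability p_err.
    wrong = rng.random((n_passes, n_bases)) < p_err
    # Consensus is wrong when more than half of the passes are wrong at a
    # position (real 4-letter voting does better: wrong calls disagree).
    p_consensus = (wrong.sum(axis=0) > n_passes / 2).mean()
    qv = -10 * np.log10(max(p_consensus, 1 / n_bases))  # Phred scale
    print(f"passes={n_passes:2d}  error={p_consensus:.2e}  ~QV{qv:.0f}")
```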
================================
# Life Is A Bit More Complicated
- Theoretically you expect an exponential decrease in error with every
additional observation: $\log(P_{\mathrm{err}}^{\mathrm{consensus}}) \propto N \log\left(\frac{P_{\mathrm{err}}}{1-P_{\mathrm{err}}}\right)$
(a numeric evaluation follows at the end of this slide)
- This scaling law holds at first but plateaus at about QV30-QV40
- (one error in $10^3$-$10^4$ bases)
- Pareto Principle: the last bit of accuracy is the hardest.
- Accuracy is KEY! A single base error can cause proteins to
become non-functional and cause disease.
- Model-based approaches (Hidden Markov Models) have done very well
- average QV30 (Wenger et al. 2019)
- However, models
- might be missing information (an unobserved variable), or
- have mismatched assumptions (Markov independence, unmodeled interactions)
- There is a Bayes optimal limit to performance given the noise.
- Are we theoretically optimal?
- Does there _exist_ any other method that can get better performance?
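- For intuition, a quick numeric evaluation of the scaling relation above (treating the proportionality constant as 1, an assumption). Note how fast the predicted QV passes the observed QV30-QV40 plateau:
```
import math

p_err = 0.12                                      # raw per-pass error rate
for n in (1, 2, 4, 6, 8):                         # number of observations
    log_perr = n * math.log(p_err / (1 - p_err))  # scaling law, constant = 1
    qv = -10 * log_perr / math.log(10)            # convert to Phred QV
    print(f"N={n}: predicted ~QV{qv:.0f}")
```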
================================
# A Universal Approximator
- Rather than building a model from assumptions,
- throw a lot of data with known answers at a universal
approximator: a neural network.
- Will it pick up on signals that we might not have thought of and
increase performance?
- INPUT: an MSA of multiple raw read passes from the same hole against a "proposed" noisy reference
- OUTPUT: a consensus reference that has fewer errors
```
proposed T....T....T....G....C....T....T....G....A....A....C....A....T....C....T....T....T....G....G....G....G.
reference T....T....G....G....C....T....T....G....A....Aac..C....A....G....C....T....T....T....G....G....G....G.
subreads
T....T....T....G....C....T....T....G....A....A....C....A....T....C....T....T....T....G....G....G....G.
T....C....Tcc..T....C....T....T....G....A....Aac..C....A....G....C....T....T....Tgtg+G....G....G....G.
T....T....G....G....Ct...T....T....G....A....Aact+C....A....T....C....T....T....Tgc..G....G....G....Gt
-....T....G....G....C....-....T....G....A....Ac...C....A....G....C....T....T....T....G....G....G....Gt
T....C....G....G....C....T....G....A....A....Ac...C....A....G....C....T....T....Tta..G....G....G....G.
-....T....G....G....C....T....T....G....A....Ac...C....A....G....C....T....T....T....G....G....G....G.
T....T....G....G....C....T....T....Ga...A....A....C....A....G....C....T....T....T....G....G....G....G.
T....Ta...G....G....C....T....T....G....A....Aac..C....A....G....C....T....T....T....G....G....G....G.
```
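- A minimal sketch (not PacBio's actual featurizer) of one plausible way to one-hot encode a row of this display into per-column channels; the alphabet, channel layout, and `subread_rows` are illustrative assumptions:
```
import numpy as np

# Illustrative channel layout: A, C, G, T, plus one channel for
# everything else ('.', '-', '+').
CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_row(row: str) -> np.ndarray:
    """One-hot encode a single MSA row into shape (len(row), 5)."""
    out = np.zeros((len(row), 5), dtype=np.float32)
    for i, ch in enumerate(row.upper()):      # lowercase = inserted bases
        out[i, CHANNELS.get(ch, 4)] = 1.0     # unknown chars -> channel 4
    return out

# Stack subreads into the (rows, cols, channels) tensor the model consumes.
msa = np.stack([encode_row(r) for r in subread_rows])
```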
- We have been working with great people at __Nvidia__ to do this modeling
on GPUs (4x Tesla V100).
================================
# Experiment: Read Sequence Context To Help Predict Consensus
- Sequence a _known_, trusted sample (Human HG002) to generate training data.
- Neural network takes noisy DNA bases and predicts the known correct base.
- Input: tensors of the MSA against the "proposed" noisy reference, one raw read per row.
- Input: 5-mer contexts of the "proposed" reference as an additional input.
- Embed each 5-mer into a 10-dimensional space (4^5 = 1024 possible 5-mers).
```
import tensorflow as tf
from tensorflow import keras as KK

# args holds hyperparameters: rows/cols = MSA rows and alignment columns,
# baseinfo = per-base feature channels.
inputmsa = KK.layers.Input(shape=(args.rows, args.cols, args.baseinfo), name="inputmsa")
inputctx = KK.layers.Input(shape=(args.cols,), name="inputctx")
# Collapse all 16 MSA rows at each column into 256 learned features.
baseadj = KK.layers.Conv2D(256, kernel_size=(16, 21), strides=(16, 1), activation='relu', padding="same", name="baseadj")(inputmsa)
baseadj2 = KK.layers.Lambda(lambda xx: tf.squeeze(xx, 1), name="baseadj2")(baseadj)
# Embed each of the 4^5 = 1024 possible 5-mers into 10 dimensions.
embed = KK.layers.Embedding(input_dim=1024, output_dim=10, name="embed")(inputctx)
merged = KK.layers.Concatenate(axis=2, name="merged")([baseadj2, embed])
# Per-column softmax over 5 classes (presumably A/C/G/T plus gap).
predBase = KK.layers.TimeDistributed(KK.layers.Dense(5, activation='softmax', name="predBase"))(merged)
model = KK.models.Model(inputs=[inputmsa, inputctx], outputs=predBase)
```
- Train on the data by following gradients (a minimal sketch follows).
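- A hypothetical compile-and-fit setup under standard Keras conventions; the optimizer, loss, epochs, and the `train_*` arrays are illustrative assumptions, not the talk's actual recipe:
```
# One-hot per-column targets -> categorical cross-entropy (assumption).
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(
    x={"inputmsa": train_msa, "inputctx": train_ctx},  # tensors as above
    y=train_truth,   # known correct base per column (one-hot, from HG002)
    validation_split=0.1,
    epochs=10,
    batch_size=64,
)
```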
================================
# Result: Read Sequence Context To Help Predict Consensus
- Plot of the 10-dimensional 5-mer embedding space (a projection sketch appears at the end of this slide).
[Figure: PCA (left) and t-SNE (right) projections of the 5-mer embedding]
- This is not uniform! It has structure!
- As expected, the first read base appears to be the biggest discriminator.
- There is interesting trailing structure.
- "context effects" where the polymerase when physically touching
these bases
- Crystal structure shows that 14-15 bases physically touch the polymerase
during incorporation
- This informative structure is exploited by the network to increase
consensus performance.
- Push this information back into our engineered polymerase and measurement
system to improve it scientifically!
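- A minimal projection sketch from the trained model above (assumes scikit-learn and matplotlib; plot styling omitted):
```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# (1024, 10): one learned vector per 5-mer.
weights = model.get_layer("embed").get_weights()[0]

pca_xy = PCA(n_components=2).fit_transform(weights)
tsne_xy = TSNE(n_components=2).fit_transform(weights)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_xy[:, 0], pca_xy[:, 1], s=4); ax1.set_title("pca")
ax2.scatter(tsne_xy[:, 0], tsne_xy[:, 1], s=4); ax2.set_title("tsne")
plt.show()
```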
================================
# Forward Opportunities
- Current NN models are improving in performance but are not yet
better than our HMM model-based approaches.
- Challenges:
- Are we already Bayes optimal?
- Different data representation?
- Better NN model architectures?
- More data needed?
- Better training?
- Our human data is publicly available:
- https://github.com/PacificBiosciences/DevNet/wiki/Genome-In-A-Bottle-(GIAB)-Human-Genomes-sequenced-on-PacBio-Platforms
- __THANKS!__ to the great people at __Nvidia__ (George Vacek, Mike
Vella, Avantika Lal, Johnny Israeli, David Nola, Brian Welker) and
__PacBio__ !
================================