Deep Learning for Biochemistry

Introduction to Deep Learning for Precision Medicine, Genomics, Protein Folding, Computational Chemistry, Biomedicine, and Virus Identification.


Deep Learning for Precision Medicine

  • Historical milestones related to precision medicine and artificial intelligence.


  • Complex unresolved problems in neurodevelopmental disorders on which artificial intelligence algorithms can have an impact


Deep Learning for Genomics

  • Gene Editing
  • Genome Sequencing
  • Clinical workflows
  • Consumer genomics products
  • Pharmacy genomics
  • Genetic screening of newborns
  • Agriculture

Artificial intelligence in clinical and genomic diagnostics


Deep Learning for GWAS


Deep Learning in Biomedicine

Course: Deep Learning in Genomics and Biomedicine


Biopython

pip3 install biopython


Genome Basics

Differences Between DNA and RNA

DNA vs. RNA – 5 Key Differences and Comparison


Genome, Transcriptome, Proteome, Metabolome

  • Genome (基因組)
  • Transcriptome (轉錄組)
  • Proteome (蛋白質組)
  • Metabolome (代謝組)


RNA-Seq (核糖核酸測序)

RNA-seq, also known as Whole Transcriptome Shotgun Sequencing, is a transcriptomics research method based on Next Generation Sequencing (second-generation sequencing technology).


Deep DNA sequence analysis

Basset

Train deep convolutional neural networks to learn highly accurate models of DNA sequence activity such as accessibility (via DNaseI-seq or ATAC-seq), protein binding (via ChIP-seq), and chromatin state.
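
Convolutional models of DNA sequence usually take a numeric encoding of the bases as input. The following is a minimal sketch of one-hot encoding, assuming the base order A, C, G, T and an all-zero row for unknown bases such as N. The function name `one_hot_encode` is illustrative and is not taken from Basset's own preprocessing code.

```python
# Toy one-hot encoder for DNA; an illustration, not Basset's preprocessing.
BASES = "ACGT"  # assumed column order


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix; unknown bases (e.g. N) become all zeros."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Each row has at most one nonzero entry, so a convolution filter of width k over this matrix acts as a learned motif detector of length k.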


ENCODE Project Common Cell Types

The Encyclopedia of DNA Elements (ENCODE) Project seeks to identify functional elements in the human genome.

  • Tier 1:
    • GM12878 is a lymphoblastoid cell line (淋巴母細胞系)
    • K562 is an immortalized cell line (永生細胞系). It is a widely used model for cell biology, biochemistry, and erythropoiesis (紅血球細胞生成)
    • H1 human embryonic stem cells
  • Tier 2:
    • HeLa-S3 is an immortalized cell line that was derived from a cervical cancer (宮頸癌) patient.
    • HepG2 is a cell line derived from a male patient with liver carcinoma (肝癌).
    • HUVEC (human umbilical vein endothelial cells) (人臍靜脈內皮細胞)
  • Tier 2.5:
    • SK-N-SH, IMR90 (ATCC CCL-186), A549 (ATCC CCL-185), MCF7 (ATCC HTB-22), HMEC or LHCM, CD14+, CD20+, Primary heart or liver cells, Differentiated H1 cells

DeepCTCFLoop

Code: https://github.com/BioDataLearning/DeepCTCFLoop
DeepCTCFLoop is a deep learning model that predicts whether a chromatin loop can form between a pair of convergent or tandem CTCF motifs.
DeepCTCFLoop was evaluated on three cell types: GM12878, HeLa, and K562.

  • Training
    • python3 train.py -f Data/GM12878_pos_seq.fasta -n Data/GM12878_neg_seq.fasta -o GM12878.output
  • Motif Visualization
    • python3 get_motifs.py -f Data/GM12878_pos_seq.fasta -n Data/GM12878_neg_seq.fasta
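
The commands above take one FASTA file of positive sequences and one of negative sequences. As a rough sketch of how such a pair of files could be read and labeled, assuming standard FASTA formatting: the helper names `read_fasta` and `load_labeled` and the in-memory sequences are made up for illustration, and this is not the repository's loader.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_labeled(pos_handle, neg_handle):
    """Label sequences from the positive file 1 and from the negative file 0."""
    data = [(seq, 1) for _, seq in read_fasta(pos_handle)]
    data += [(seq, 0) for _, seq in read_fasta(neg_handle)]
    return data


# In-memory stand-ins for the two FASTA files (made-up sequences).
pos = io.StringIO(">loop_1\nACGT\nAC\n>loop_2\nGGCC\n")
neg = io.StringIO(">no_loop_1\nTTTT\n")
dataset = load_labeled(pos, neg)
print(dataset)
```

With real data, the `io.StringIO` objects would be replaced by `open(...)` calls on the two files.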

DARTS

Blog: Yi Xing's team uses deep learning to improve the accuracy of RNA alternative splicing analysis
Paper: Deep-learning Augmented RNA-seq analysis of Transcript Splicing
Code: https://github.com/Xinglab/DARTS


Coda


SNP (Single Nucleotide Polymorphism) 單核苷酸多型性

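SNP (single nucleotide polymorphism): a variation at a single base pair in the DNA sequence, generally referring to single-nucleotide variants with a frequency greater than 1% in a population.

  • Of all possible DNA sequence differences, the SNP is the most common type of genetic variation. In humans, SNPs occur at a rate of roughly 0.1%, or about one SNP in every 1,200 to 1,500 base pairs.
  • About 4 million SNPs have been identified so far. On average there is one SNP per 1 kb of DNA; in other words, each person's DNA sequence carries at least one single-base variation in every 1 kb. Because SNPs occur so frequently, they are often used as genetic markers in research.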
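
The relationship between a per-base rate and the spacing between variants can be checked with a short sketch. It assumes two already aligned sequences of equal length and ignores insertions and deletions; the function names and the example sequences are made up. A rate of exactly 0.1% corresponds to one variant per 1,000 bases, so a spacing of 1,200 to 1,500 bases implies a slightly lower rate. A difference between two individual sequences is only a candidate SNP, since the 1% threshold refers to frequency in a population.

```python
def snp_positions(ref, alt):
    """Return 0-based positions where two equal-length aligned sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


def bases_per_variant(rate):
    """Convert a per-base variant rate into the average spacing in bases."""
    return 1.0 / rate


ref = "ACGTACGTAC"
alt = "ACGTTCGTAC"
print(snp_positions(ref, alt))              # [4]
print(round(bases_per_variant(0.001)))      # 1000
print(round(1.0 / 1200, 5), round(1.0 / 1500, 5))  # rates implied by the quoted spacing
```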

DeepCpG

Paper: DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning
Code: https://github.com/cangermueller/deepcpg


DeepTSS (Transcription Start Site)

Paper: Genome Functional Annotation across Species using Deep CNN
Code: https://github.com/StudyTSS/DeepTSS
Dataset: TSS positions are collected from the human (hg38) and mouse (mm10) reference genomes: http://hgdownload.soe.ucsc.edu/
Genome-wide TSS position data for human and mouse are available at http://egg.wustl.edu/; the gene annotation is taken from RefGene.


DeepFunNet

Paper: DeepFunNet: Deep Learning for Gene Functional Similarity Network Construction

  • http://geneontology.org/docs/ontology-documentation/

Population Genetic Inference

Paper: The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference
Code: https://github.com/flag0010/pop_gen_cnn


GANs for Biological Image Synthesis

Paper: GANs for Biological Image Synthesis
Code: https://github.com/aosokin/biogans
Code: https://github.com/VladSkripniuk/gans
Dataset: LIN dataset
LIN dataset contains photographs of 41 proteins in fission yeast cells.


DeepGP

Genomic selection is a breeding strategy that predicts complex traits from genome-wide genetic markers; it is standard in many animal and plant breeding schemes.
Paper: A Guide on Deep Learning for Complex Trait Genomic Prediction
Code: DLpipeline
Code: DeepGP
The DeepGP package implements multilayer perceptron (MLP) networks, convolutional neural networks (CNN), ridge regression, and lasso regression for genomic prediction.
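
Of the methods listed, ridge regression is the simplest to write down. Below is a minimal pure-Python sketch of closed-form ridge regression on marker genotypes, assuming markers coded 0/1/2, no intercept, and made-up noise-free data. The function names are illustrative, and this is not how DeepGP itself is implemented.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(X, beta):
    """Predicted trait value for each individual."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Made-up genotypes: rows are individuals, columns are markers coded 0/1/2.
X = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [1, 2, 1], [0, 0, 1]]
true_effects = [0.5, -1.0, 2.0]
y = predict(X, true_effects)  # noise-free phenotypes, for illustration only

beta = ridge_fit(X, y, lam=1.0)
print([round(b, 3) for b in beta])
```

The penalty `lam` shrinks the estimated marker effects toward zero, which is what keeps the problem well posed when markers far outnumber individuals, the usual situation in genomic prediction.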


Biochemistry Tools

PubChem Sketcher V2.4


Molview


Protein Folding

Attention Based Protein Structure Prediction

Kaggle: https://www.kaggle.com/code/basu369victor/attention-based-protein-structure-prediction


AlphaFold 2

Blog: AlphaFold reveals the structure of the protein universe

Paper: Highly accurate protein structure prediction with AlphaFold

Blog: DeepMind’s AlphaFold 2 reveal: Convolutions are out, attention is in

Code: https://github.com/deepmind/alphafold


AlphaFold Protein Structure Database

AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.

  • Q8W3K0: A potential plant disease resistance protein. Mean pLDDT 82.24.
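
pLDDT is AlphaFold's per-residue confidence score on a 0 to 100 scale, and AlphaFold DB displays it in four bands: above 90, 70 to 90, 50 to 70, and below 50. A small sketch of that banding follows; the function name, the band labels, and the handling of scores that fall exactly on a boundary are assumptions made for illustration.

```python
def plddt_band(score):
    """Map a pLDDT score (0-100) to a confidence band label."""
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must be between 0 and 100")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


print(plddt_band(82.24))  # confident
```

Applied to the mean pLDDT of 82.24 quoted above, the sketch returns the second-highest band. The bands are defined per residue, so a mean value is only a rough summary of the whole model.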

Deep Learning for Computational Chemistry

OpenChem

OpenChem is a deep learning toolkit for Computational Chemistry with PyTorch backend.
Code: https://github.com/Mariewelt/OpenChem


Organic Chemistry Reaction Prediction

Paper: Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models
Code: Organic Chemistry Reaction Prediction using NMT with Attention
The model in version 2 is loosely based on the model discussed in Asynchronous Bidirectional Decoding for Neural Machine Translation.

Retrosynthesis Planner

Paper: Planning chemical syntheses with deep neural networks and symbolic AI
Slides: CSC2547_learning_to_plan_chemical_synthesis.pdf
Code: https://github.com/frnsys/retrosynthesis_planner


Step-wise Chemical Synthesis prediction

Code: A GGNN-GWM based step-wise framework for Chemical Synthesis Prediction


Retrosynthesis

Paper: Decomposing Retrosynthesis into Reactive Center Prediction and Molecule Generation


RetroXpert

Paper: RetroXpert: Decompose Retrosynthesis Prediction like a Chemist
Code: https://github.com/uta-smile/RetroXpert


Neural Message Passing for Quantum Chemistry

Paper: Neural Message Passing for Quantum Chemistry
A Message Passing Neural Network predicts quantum properties of an organic molecule by modeling a computationally expensive DFT calculation

Code: https://github.com/priba/nmp_qc
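
The core idea of message passing can be sketched in a few lines: each atom collects the states of its bonded neighbors, updates its own state, and a readout pools all atoms into one molecule-level vector. The toy version below uses fixed sums where the paper uses learned message, update, and readout functions, so it only illustrates the data flow. The function names and the two-feature atom encoding are made up.

```python
def message_passing_step(node_states, edges):
    """One round: each node sums its neighbors' states and adds them to its own."""
    n = len(node_states)
    dim = len(node_states[0])
    messages = [[0.0] * dim for _ in range(n)]
    for u, v in edges:  # each pair is an undirected bond u-v
        for k in range(dim):
            messages[u][k] += node_states[v][k]
            messages[v][k] += node_states[u][k]
    return [[h + m for h, m in zip(node_states[i], messages[i])]
            for i in range(n)]


def readout(node_states):
    """Sum node states into one graph-level vector (independent of node order)."""
    dim = len(node_states[0])
    return [sum(state[k] for state in node_states) for k in range(dim)]


# Heavy atoms of ethanol, C-C-O; features are [is_carbon, is_oxygen].
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]

states = message_passing_step(states, bonds)
print(states)           # [[2.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
print(readout(states))  # [5.0, 2.0]
```

After one step the middle carbon already "knows" it is bonded to an oxygen. Repeating the step lets information travel further along the molecular graph.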


Biomedicine

DeepChem

Paper: Low Data Drug Discovery with One-Shot Learning
Code: https://github.com/deepchem/deepchem


druGAN

Paper: druGAN
Code: Gananath/DrugAI
Code: kumar1202/Drug-Discovery-using-GANs


MoleculeNet

Paper: MoleculeNet: A Benchmark for Molecular Machine Learning
Datasets: In most datasets, SMILES strings are used to represent input molecules

  • QM7/QM7b datasets are subsets of the GDB-13 database, a database of nearly 1 billion stable and synthetically accessible organic molecules
  • QM8 dataset comes from a recent study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules
  • QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database
  • ESOL is a small dataset consisting of water solubility data for 1128 compounds
  • FreeSolv provides experimental and calculated hydration free energy of small molecules in water
  • Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. This dataset, curated from the ChEMBL database, provides experimental results of the octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds
  • PCBA is a database consisting of biological activities of small molecules generated by high-throughput screening
  • The MUV group is another benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis; it contains 17 challenging tasks for around 90 thousand compounds
  • HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds
  • Tox21 contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways
  • SIDER is a database of marketed drugs and adverse drug reactions (ADR)
  • ClinTox compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons
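
Because most of these datasets give molecules as SMILES strings, sequence models typically split each string into tokens first. The following is a minimal regex-based sketch that covers bracket atoms, the common organic-subset atoms, ring closures, bonds, and branches. It is a simplification of the full SMILES grammar and is not MoleculeNet's own featurization; the names `SMILES_TOKEN` and `tokenize` are illustrative.

```python
import re

# Simplified SMILES token pattern; alternatives are tried left to right.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"          # two-letter atoms, tried before one-letter atoms
    r"|[BCNOSPFI]"     # one-letter organic-subset atoms
    r"|[bcnosp]"       # aromatic atoms
    r"|%\d{2}"         # two-digit ring-closure labels
    r"|\d"             # one-digit ring-closure labels
    r"|[-=#$:/\\.()]"  # bonds, disconnection dot, and branch parentheses
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Trying `Br` and `Cl` before the single-letter atoms keeps chlorine from being read as a carbon followed by a stray `l`.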

TDC Datasets

  • To install PyTDC:
    pip3 install PyTDC
    
  • To obtain a dataset, where Z is the problem, Y is the task, and 'X' is the dataset name (all three are placeholders):
    from tdc.Z import Y
    data = Y(name = 'X')
    splits = data.get_split()
    
  • To obtain the Caco2 dataset from the ADME therapeutic task in the single-instance prediction problem:
    from tdc.single_pred import ADME
    data = ADME(name = 'Caco2_Wang')
    df = data.get_data()         # full dataset as a DataFrame
    splits = data.get_split()    # train/validation/test splits
    

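Novel Antibiotic Development

Blog: Novel antibiotic development: machine learning plays a major role

  • Message passing neural network
    (Image source: M. Abdughani et al., 2019.)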

References:

  1. C. Ross, “Aided by machine learning, scientists find a novel antibiotic able to kill superbugs in mice”, STAT, 2020
  2. J. Gilmer et al., “Neural Message Passing for Quantum Chemistry”, arXiv.org, 2017
  3. G. Dahl et al., “Predicting Properties of Molecules with Machine Learning”, Google AI blog, 2017
  4. M. Abdughani et al., “Probing stop pair production at the LHC with graph neural networks”, Springer, 2019

Deep Learning in Proteomics

Paper: Deep Learning in Proteomics


Peptide MS/MS spectrum prediction

  1. pDeep3
  2. Prosit
  3. DeepMass
  4. Predfull
  5. Guan et al.
  6. MS2CNN
  7. DeepDIA
  8. pDeepXL
  9. Alpha-Frag
  10. Prosit Transformer
  11. PrAI-frag

Peptide retention time prediction

  1. AutoRT
  2. Prosit
  3. DeepMass
  4. Guan et al.
  5. DeepDIA
  6. DeepRT
  7. DeepLC
  8. xiRT

Peptide CCS prediction

  1. DeepCollisionalCrossSection

Peptide detectability prediction

  1. CapsNet_CBAM

MS/MS spectrum quality prediction

  1. SPEQ

Peptide identification

  1. DeepNovo: De novo peptide sequencing
  2. DeepNovo-DIA: De novo peptide sequencing
  3. SMSNet: De novo peptide sequencing
  4. DeepRescore: Leveraging deep learning to improve peptide identification
  5. PointNovo: De novo peptide sequencing
  6. pValid 2: Leveraging deep learning to improve peptide identification
  7. Casanovo: De novo peptide sequencing
  8. PepNet: De novo peptide sequencing
  9. DePS: De novo peptide sequencing
  10. DeepSCP: Utilizing deep learning to boost single-cell proteome coverage

Data-independent acquisition mass spectrometry

  1. Alpha-XIC
  2. DeepDIA
  3. DeepPhospho: improves spectral library generation for DIA phosphoproteomics

Protein post-translational modification site prediction

  1. DeepACE: a tool for predicting lysine acetylation sites, a type of PTM site prediction problem
  2. Deep-PLA: for prediction of HAT/HDAC-specific acetylation
  3. DeepAcet: to predict the lysine acetylation sites in protein
  4. DNNAce
  5. DeepKcr
  6. DeepGly
  7. Longetal2018
  8. MUscADEL
  9. LEMP
  10. DeepNitro
  11. MusiteDeep
  12. NetPhosPan: Prediction of phosphorylation using CNNs
  13. DeepPhos
  14. EMBER
  15. DeepKinZero
  16. CapsNet_PTM: CapsNet for Protein Post-translational Modification site prediction.
  17. GPS-Palm
  18. CNN-SuccSite
  19. DeepUbiquitylation
  20. DeepUbi
  21. Sohoko-Kcr

MHC-peptide binding prediction

  1. ConvMHC
  2. HLA-CNN
  3. DeepMHC
  4. DeepSeqPan: Prediction of peptide-MHC bindings
  5. AI-MHC
  6. DeepSeqPanII
  7. MHCSeqNet
  8. MARIA
  9. MHCflurry
  10. DeepHLApan
  11. ACME
  12. EDGE
  13. MHC-I
  14. MHCnuggets
  15. DeepNeo
  16. DeepLigand
  17. PUFFIN
  18. NeonMHC2
  19. USMPep
  20. MHCherryPan
  21. DeepAttentionPan

Benchmarking

  1. Xu R, Sheng J, Bai M, et al. “A comprehensive evaluation of MS/MS spectrum prediction tools for shotgun proteomics”. Proteomics, 2020, 20(21-22): 1900345.
  2. Wenrong Chen, Elijah N. McCool, Liangliang Sun, Yong Zang, Xia Ning, Xiaowen Liu. “Evaluation of Machine Learning Models for Proteoform Retention and Migration Time Prediction in Top-Down Mass Spectrometry”. J. Proteome Res. (2022).
  3. Emily Franklin, Hannes L. Röst, “Comparing Machine Learning Architectures for the Prediction of Peptide Collisional Cross Section”. bioRxiv (2022).

Reviews about deep learning in proteomics

  1. Wen, B., Zeng, W.-F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., Zhang, B., “Deep Learning in Proteomics”. Proteomics 2020, 20, 1900335.
  2. Meyer, Jesse G. “Deep learning neural network tools for proteomics”. Cell Reports Methods (2021): 100003.
  3. Matthias Mann, Chanchal Kumar, Wen-Feng Zeng, Maximilian T. Strauss, Artificial intelligence for proteomics and biomarker discovery. Cell Systems 12, August 18, 2021.
  4. Yang, Y., Lin L., Qiao L., “Deep learning approaches for data-independent acquisition proteomics”. Expert Review of Proteomics 17 Dec 2021.

Virus Identification

ViraMiner

Paper: ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples
Code: https://github.com/NeuroCSUT/ViraMiner


SARS-CoV-2

Database: SARS-CoV-2, taxid:2697049 (Nucleotide)

  • SARS-CoV-2 related compounds, substances, pathways, bioassays, and more in PubChem
    • Compounds used in SARS-CoV-2 clinical trials
    • Compounds found in COVID-19-related PDB structures

SARS-CoV-2 accurate identification

Paper: Accurate Identification of SARS-CoV-2 from Viral Genome Sequences using Deep Learning
Code: https://github.com/albertotonda/deep-learning-coronavirus-genome
Kaggle: rkuo2000/coronavirus-genome-identification


SARS-CoV-2 primers

Paper: Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning
Code: https://github.com/steppenwolf0/primers-sars-cov-2


Coronavirus Typing Tool


OpenVaccine COVID-19 mRNA Vaccine Degradation Prediction

Kaggle: OpenVaccine: GCN (GraphSAGE)+GRU+KFold



This site was last updated June 29, 2024.