A guide to machine learning for biologists

The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.

Acknowledgements

The authors thank members of the UCL Bioinformatics Group for valuable discussions and comments. This work was supported by the European Research Council Advanced Grant ProCovar (project ID 695558).