Project (Johannes Söding)

Sequence searching using deep neural networks

Sequence similarity searches are a mainstay of bioinformatics and essential tools in protein functional annotation, protein structure prediction and evolutionary studies. Our group has developed two of the most widely used software packages for this task: HH-suite for highly sensitive sequence searches and MMseqs2 for very fast searches. In this project, you will develop a method for boosting the sensitivity of HH-suite and MMseqs2 using deep learning architectures such as transformers and temporal convolutional networks. Applications demonstrating the usefulness of the method will be part of the project. The project requires an interest in machine learning, an affinity for programming, and prior programming experience beyond lecture exercises (ideally in C++).

Statistical tools for finding the pathways underlying complex diseases

To better treat and prevent noninfectious, complex diseases such as coronary artery disease, diabetes, or Alzheimer’s disease, we need to find out how they originate. Millions of healthy and diseased individuals have been genotyped in the last 10 years in genome-wide association studies (GWAS). For each disease, dozens of loci have been found at which single nucleotide polymorphisms (SNPs) predict disease risk. But the results are difficult to interpret because the identified SNPs affect the regulation of unknown target genes that are often far away. To identify the target genes, eQTL studies such as GTEx measure the gene expression of hundreds of genotyped patients and associate SNPs with gene expression changes. The most promising type of approach trains a machine learning method on such eQTL data to predict gene expression levels given the patient genotype [1-3]. It then predicts gene expression levels for thousands of GWAS patients and finds genes whose overexpression is associated with higher disease risk. Inhibiting the corresponding proteins might slow disease development, making these proteins good targets for drug development. In this project, you will work on developing Bayesian statistical methods (more precisely, stochastic variational inference methods) to improve the detection of causal SNPs and the prediction of gene expression from genotype, with the goal of improving the prediction of genes and pathways involved in disease origination.
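The two-step approach described above (train an expression predictor on eQTL data, then test predicted expression for association with disease in a GWAS cohort) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are simulated, the function names `fit_ridge` and `association_t` are hypothetical, and plain ridge regression stands in for the predictor. It is not the Bayesian variational method the project aims to develop, nor the method of the cited studies.

```python
import numpy as np


def fit_ridge(genotypes, expression, penalty=1.0):
    """Ridge regression: SNP weights w minimizing ||y - Xw||^2 + penalty * ||w||^2."""
    X = genotypes - genotypes.mean(axis=0)   # center each SNP
    y = expression - expression.mean()       # center the expression values
    n_snps = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_snps), X.T @ y)


def association_t(predicted_expression, disease_status):
    """t-statistic of the Pearson correlation between predicted expression and case status."""
    r = np.corrcoef(predicted_expression, disease_status)[0, 1]
    n = len(disease_status)
    return r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)


# Simulated setting (hypothetical): 20 SNPs near one gene, two of them regulatory.
rng = np.random.default_rng(0)
n_eqtl, n_gwas, n_snps = 300, 5000, 20
true_weights = np.zeros(n_snps)
true_weights[[2, 7]] = [0.8, -0.5]

# Step 1: small eQTL panel with genotypes AND measured expression.
G_eqtl = rng.binomial(2, 0.3, size=(n_eqtl, n_snps)).astype(float)
expr = G_eqtl @ true_weights + rng.normal(0.0, 1.0, n_eqtl)
weights = fit_ridge(G_eqtl, expr, penalty=5.0)

# Step 2: large GWAS cohort with genotypes and disease status only.
G_gwas = rng.binomial(2, 0.3, size=(n_gwas, n_snps)).astype(float)
latent_expr = G_gwas @ true_weights
risk = 1.0 / (1.0 + np.exp(-(latent_expr - latent_expr.mean())))
disease = rng.binomial(1, risk)  # higher expression raises disease risk

# Step 3: impute expression in the GWAS cohort and test it against disease.
predicted = (G_gwas - G_gwas.mean(axis=0)) @ weights
t_stat = association_t(predicted, disease)
print(f"association t-statistic: {t_stat:.1f}")
```

In a real analysis this test would be repeated for every gene, with correction for multiple testing; a large positive statistic marks a gene whose predicted overexpression tracks disease risk.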

These are representative project suggestions. Other topics can be discussed.

Homepage Research Group

Google Scholar profile

For more information see for instance:

  • Steinegger, M., Mirdita, M., and Söding, J. (2019) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods 16, 603–606.
  • Banerjee, S., Zeng, L., Schunkert, H., and Söding, J. (2018) Bayesian multiple logistic regression for GWAS analysis. PLoS Genetics 14, e1007856.
  • Banerjee, S., Simonetti, F., Detrois, K., Kaphle, A., Mitra, R., Nagial, R., and Söding, J. (2020) Reverse regression increases power for detecting trans-eQTLs. bioRxiv.
  • Vorberg, S., Seemayer, S., and Söding, J. (2018) Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLoS Comput. Biol. 14, e1006526.
  • Steinegger, M., and Söding, J. (2018) Clustering huge protein sequence sets in linear time. Nature Commun.
  • Steinegger, M., and Söding, J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnol. 35, 1026–1028.
  • Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods 9, 173–175.