Project (Johannes Söding)

Making the petabytes of deposited metagenomics sequences accessible through an ultrafast search method

You will develop and apply a method that, according to our estimates, should be able to search for similar sequences in huge databases several orders of magnitude faster than existing algorithms. The method will make the petabytes of metagenomic sequences stored in public databases accessible, for example to find new enzymes for biotechnology (e.g. new CRISPR-Cas proteins) or new biosynthetic gene clusters that produce antimicrobial compounds. Applications demonstrating the usefulness of the method will be part of the project. The project requires an affinity for programming and prior programming experience beyond lecture exercises (ideally in C++).

Statistical and machine learning methods for residue-residue protein contact prediction and protein structure prediction

The statistical coupling between columns in a multiple protein sequence alignment (MSA) with sufficiently many sequences can be used to predict direct physical contact between the corresponding amino acid residues. From the reliably predicted contacts, the protein structure can be predicted. The currently best methods train undirected graphical models (Markov Random Fields) and predict the coupled residues from the strongest edges (indicating statistical coupling) in the undirected model. In this project, you will develop general machine learning algorithms to efficiently train the models using the true likelihood instead of the commonly used pseudolikelihood. You will then develop a Bayesian statistical model that should be able to predict residue contacts with fewer sequences in the MSA than currently possible. These statistical advances should allow us to predict protein structures for the many protein families that contain too few sequences to make reliable contact and structure predictions with current methods.
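
To make the difference between the true likelihood and the pseudolikelihood concrete, here is a minimal Python sketch on a toy Markov Random Field (a Potts model with single-column terms h and pairwise couplings J). It is only an illustration under stated assumptions, not the group's method: the alphabet size Q, the alignment length L, the parameter layout, and all function names are hypothetical, and the partition function is computed by brute-force enumeration, which is feasible only for such tiny models.

```python
import itertools
import math

Q = 2  # hypothetical alphabet size (real alignments: 20 amino acids plus gap)
L = 3  # hypothetical number of alignment columns


def make_zero_params():
    """Return (h, J) with all fields and couplings set to zero."""
    h = [[0.0] * Q for _ in range(L)]
    J = [[[[0.0] * Q for _ in range(Q)] for _ in range(L)] for _ in range(L)]
    return h, J


def log_potential(seq, h, J):
    """Unnormalized log-probability: column terms plus couplings for i < j."""
    value = sum(h[i][seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            value += J[i][j][seq[i]][seq[j]]
    return value


def log_likelihood(msa, h, J):
    """True log-likelihood; the partition function sums over all Q**L sequences."""
    all_seqs = itertools.product(range(Q), repeat=L)
    log_z = math.log(sum(math.exp(log_potential(s, h, J)) for s in all_seqs))
    return sum(log_potential(s, h, J) - log_z for s in msa)


def log_pseudolikelihood(msa, h, J):
    """Sum of log p(x_i | all other columns); each term normalizes over Q states only."""
    total = 0.0
    for s in msa:
        for i in range(L):
            # Terms not involving column i cancel, so whole-sequence potentials suffice.
            variants = (s[:i] + (a,) + s[i + 1:] for a in range(Q))
            log_zi = math.log(sum(math.exp(log_potential(v, h, J)) for v in variants))
            total += log_potential(s, h, J) - log_zi
    return total


# Toy alignment in which columns 0 and 1 always agree (a "coupled" pair).
msa = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]

h0, J0 = make_zero_params()
h1, J1 = make_zero_params()
for a in range(Q):
    J1[0][1][a][a] = 1.0  # reward equal states in columns 0 and 1

for name, (h, J) in [("no coupling", (h0, J0)), ("coupling 0-1", (h1, J1))]:
    print(name, log_likelihood(msa, h, J), log_pseudolikelihood(msa, h, J))
```

In this sketch the true likelihood needs a sum over Q**L sequences, whereas each pseudolikelihood term needs only Q evaluations, which is why the pseudolikelihood is the common choice for alignments of realistic length.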

Other project topics within our research interests are possible and can be discussed.

Homepage Research Group

For more information see for instance:
  • Söding, J., Zwicker, D., Sohrabi-Jahromi, S., Boehning, M., and Kirschbaum, J. (2019) Mechanisms of active regulation of biomolecular condensates. Trends Cell Biol., accepted. bioRxiv: doi:

  • Steinegger, M., Mirdita, M., and Söding, J. (2019) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods 16, 603–606. bioRxiv:

  • Banerjee, S., Zeng, L., Schunkert, H., and Söding, J. (2018) Bayesian multiple logistic regression for GWAS analysis. PLoS Genetics 14, e1007856.

  • Vorberg, S., Seemayer, S., and Söding, J. (2018) Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLoS Comput. Biol. 14, e1006526. bioRxiv:

  • Steinegger, M., and Söding, J. (2018) Clustering huge protein sequence sets in linear time. Nature Commun.

  • Steinegger, M., and Söding, J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnol. 35, 1026–1028.