Theses

Topic areas

We offer thesis topics in the fields of mixed and joint modelling, quantile regression and mixture density networks. All of these models are implemented in Bayesian frameworks as well as with statistical learning approaches such as gradient boosting. Please feel free to contact us if you are interested in a topic in one of these areas.

Further topics in the field of statistics are available at the chair of statistics.

Former and current thesis topics

Bachelor Theses

Predicting match results in the German Bundesliga using model-based gradient boosting
Modelling the number of antenatal care visits in selected West African countries via a GAMLSS: A comparison of two gradient-based boosting approaches
Productivity in the German Gaming Industry: The Influence of the subsidy directive "Computerspieleförderung des Bundes"

Master Theses

Correcting biased Random Effects Estimation for Generalised Additive Mixed Models based on Gradient Boosting
Prediction-based Variable Selection for Componentwise Gradient Boosting
Before the Whistle: Bundesliga predictions based on pre-match player data
Silent Partys: A Cluster Analysis of Voting Behavior in the European Parliament
Scalar-on-Function Regression with Spatial Dependence: A Gradient Boosting Framework for Neurophysiological Data
Gradient Boosting for Dirichlet Regression: On the Impact of Protests on Election Results in the Great Recession
Distributional Regression for Lungfunction of Cystic Fibrosis Patients with a Special Focus on Spatial and Random Effects

Topics for Bachelor Theses

Sentiment analysis on attitudes towards migration

Immigration remains a key factor in shaping U.S. political attitudes, especially when distinguishing between long-established immigrant populations (“stocks”) and recent arrivals (“flows”). To better understand these dynamics, a nationwide online survey was conducted in December 2024 with 1,600 participants, including an embedded information experiment. Open-ended responses on immigration are to be analyzed using sentiment analysis to quantify attitudes. These sentiment scores should then be linked to demographic characteristics, voting behavior, and media use through regression analysis. Given the complexity of sentiment distributions, generalized additive models (GAMs) are explored to allow for flexible modeling of non-linear relationships.

This thesis is in cooperation with Prof. Sarah Langlotz.

Contact: lars.knieper@uni-goettingen.de

Dirichlet Regression (Measuring the influence of protests on election results)

After the financial crisis in 2008, worldwide protests against the political and economic measures began and a shift in voting behavior occurred. We want to analyse the influence of the protests and of further economic and sociological variables on election outcomes using Dirichlet regression.

Contact: elisabeth.bergherr@uni-goettingen.de

Quantifying the impact of post-match variables on soccer predictions via generalized mixed models

The allure of professional soccer stems from the element of surprise, as numerous unpredictable events occur within the game. Consequently, both bookmakers and passionate fans endeavor to predict game outcomes, driven by financial incentives or sheer fandom.

This thesis is based on an extensive data set comprising pre-match and post-match variables from Europe's top 5 leagues over recent years. The primary objective of this thesis is to quantify the contribution of post-match variables to a model's predictive ability. The prediction targets include the number of goals scored and the match outcome (win, draw, or loss). The proposed models are generalized linear mixed models.

Given the high correlation among the variables, the research question allows for at least two separate theses using the following methods:

(I) The first approach involves model-based gradient boosting for mixed models, which simultaneously performs variable selection and estimates the effects.

(II) The second approach utilizes data engineering techniques to incorporate dimensionality reduction (e.g. PCA, Canonical Correlation) and variable selection before estimating the mixed models through the maximization of their penalized likelihood.

In addition to the aforementioned methodologies, this research will encompass a brief review of prior studies in the field of soccer match predictions, an explorative data analysis, and the application of the mentioned methods to gain deeper insights into soccer predictions.

Contact: lars.knieper@uni-goettingen.de

Categorical Regression Models in Social Sciences

Categorical Regression Models such as multinomial, ordered logit and sequential models deal with non-binary categorical outcomes. Those outcomes can often be found in social science surveys and research questions.
This bachelor thesis investigates the use of these types of models in a specific field of the social sciences (e.g. election results, agreement on social issues, (life) satisfaction). After a description of the methods and a short literature overview, at least one model type is applied to a real-world data set (e.g. ALLBUS, SOEP, Eurobarometer, ...) and the results are interpreted.

The content focus is oriented towards the interests of the applicant.
The thesis can be written in German or English.

Contact: sophie.potts@uni-goettingen.de

Topics for Master Theses

Model-based boosting modifications - an extensive comparison

Model-based component-wise gradient boosting is a popular tool for data-driven variable selection in regression models. In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed (e.g. probing, stability selection, deselection). This thesis gives an overview of the modifications and compares their performance for Generalized Additive Models with regard to variable selection and prediction accuracy, based on an extensive simulation study and an application to (common R or real-world) data sets. A special focus can be placed on the different types of base-learners (linear, splines, tree-based, spatial).
Contact: sophie.potts@uni-goettingen.de

Gradient-based boosting for GAMLSS – Performance of an adaptive step length approach in case of correlated covariates

Component-wise gradient boosting methods are known for their good performance in case of correlated covariates, since in each iteration the model is updated only using a small fixed step length. When boosting GAMLSS, using adaptive step lengths can result in more balanced submodels for the different distributional parameters. The aim of this thesis is to investigate the performance of an adaptive step length approach compared to an approach using fixed step lengths in a setting with correlated covariates. This is done with an extensive simulation study.
Contact: alexandra.daub@uni-goettingen.de
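
To illustrate the fixed step length update described above, the following is a minimal sketch in Python. It assumes a squared-error loss and simple linear base-learners instead of a full GAMLSS; the function name, the step length of 0.1 and the simulated data are hypothetical choices made only for this illustration.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_iter=200, step=0.1):
    """Component-wise gradient boosting for the L2 loss with linear base-learners."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centre covariates; the intercept is kept separately
    intercept = y.mean()
    coef = np.zeros(p)
    fit = np.full(n, intercept)
    col_ss = (Xc ** 2).sum(axis=0)     # sum of squares of each centred covariate
    for _ in range(n_iter):
        resid = y - fit                # negative gradient of the squared-error loss
        beta = Xc.T @ resid / col_ss   # least-squares slope of each single-covariate base-learner
        rss = ((resid[:, None] - Xc * beta) ** 2).sum(axis=0)
        j = int(np.argmin(rss))        # base-learner that fits the gradient best
        coef[j] += step * beta[j]      # update only this component by a small fixed step
        fit += step * beta[j] * Xc[:, j]
    return intercept, coef

# Simulated data (assumption): columns 0 and 1 are strongly correlated,
# but only columns 0 and 2 have a true effect on the response.
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.3 * rng.normal(size=n),
    z + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.5 * rng.normal(size=n)

intercept, coef = componentwise_l2_boost(X, y)
print(np.round(coef, 2))
```

In this sketch only one coefficient changes per iteration, so covariates that never provide the best fit to the current residuals keep a coefficient of exactly zero. An adaptive variant would replace the constant `step` by a value chosen anew in each iteration.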

Gradient-based boosting for GAMLSS – Comparison of different selection criteria for base-learner updates

When estimating GAMLSS, variable selection is often an issue. Thanks to their intrinsic variable selection, component-wise boosting approaches can provide a remedy, and the order of the updates naturally plays an important role. This order can depend on the base-learner selection criterion. There are two natural criteria for selecting the base-learner update in GAMLSS: the inner and the outer loss (see for example Thomas et al. (2018), Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates), which can, however, yield different base-learner updates in some cases. The aim of this thesis is to investigate how the selection criterion affects the variable selection as well as the estimated coefficients in the overall model, considering, among other things, different types of covariates. This is done with an extensive simulation study.
Contact: alexandra.daub@uni-goettingen.de

Robust Distributional Regression via ANNs

The application of regression models such as GLMs or GAMs is commonly based on distributional assumptions. As a consequence, outliers that violate these assumptions have the potential to heavily dominate the results of a model. The aim of this thesis is to implement different penalization strategies for distributional regression models within the framework of artificial neural networks.
Contact: tbs.hepp@fau.de

Boosting Mixtures of Regression Models

Mixture regression models are used in scenarios with unobserved heterogeneity, i.e. the (assumed) presence of different latent classes in the data. Depending on the number of latent classes and the corresponding (distributional) regression models, this quickly results in many unknown parameters to be modeled via separate prediction functions. Given their identifiability, component-wise boosting algorithms are generally able to estimate the unknown coefficients of this model framework, but the optimal specification of the algorithm remains unclear. Therefore, this thesis will investigate and evaluate different initialization and updating strategies inside the boosting algorithm.
Contact: tbs.hepp@fau.de

Compare the Saddlepoint and Laplace approximation for spatial models

Numerical approximation methods allow us to find solutions to complex problems that generally cannot be solved analytically. Such problems commonly arise in spatial models, where covariates are introduced to explain a response variable and we also need to consider how they vary in space. The structure of the model thus makes an analytical solution difficult. For this reason, the Laplace approximation has gained popularity as a fast and easy way to express an approximate solution, especially for spatial models. However, it is known that this method can fail for small numbers of observations and in other specific circumstances. We could therefore consider another approximation method, the “saddlepoint approximation”, and evaluate its behavior in spatial models.
Contact: j.cavieres.g@gmail.com
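
As a toy illustration of the Laplace approximation mentioned above (a one-dimensional integral, not a spatial model), the following Python sketch approximates n! = ∫ exp(h(x)) dx over x > 0 with h(x) = n log(x) - x. The integrand is replaced by a Gaussian centred at the mode of h, which gives exp(h(x̂)) · sqrt(2π / (-h''(x̂))). The function names are hypothetical, and n only loosely plays the role of the amount of information in the data.

```python
import math

def log_factorial_laplace(n):
    """Laplace approximation to log n! = log of the integral of exp(n*log(x) - x) over x > 0."""
    x_hat = n                            # mode of h: h'(x) = n/x - 1 = 0
    h_hat = n * math.log(x_hat) - x_hat  # h evaluated at the mode
    neg_hess = n / x_hat ** 2            # -h''(x) = n / x^2, evaluated at the mode
    return h_hat + 0.5 * math.log(2 * math.pi / neg_hess)

def rel_error(n):
    """Relative error of the approximation against the exact value from lgamma."""
    exact = math.lgamma(n + 1)           # log n!
    return abs(math.exp(log_factorial_laplace(n) - exact) - 1)

for n in (1, 5, 50):
    print(n, round(rel_error(n), 4))
```

The relative error shrinks as n grows, because the integrand becomes increasingly close to a Gaussian shape; for small n the approximation is noticeably worse, which is the kind of situation in which an alternative such as the saddlepoint approximation may be of interest.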

Study the behavior of the posterior distribution via the Hamiltonian Monte Carlo method for a spatial random effect (GRF approximated by a GMRF) integrated out with the Saddlepoint approximation

Template Model Builder (TMB) is a frequentist statistical software that enables the estimation of parameters for non-linear models through Laplace approximation and automatic differentiation. This software offers computational efficiency, as the Laplace approximation method provides a close approximation to the true solution, while automatic differentiation automatically computes first and second derivatives. In contrast, Stan is a probabilistic software specifically designed for Bayesian inference. It employs the Hamiltonian Monte Carlo algorithm as the estimation method to derive posterior distributions for model parameters. By using the “tmbstan” library and applying prior distributions to the parameters, we can easily transform a model developed in TMB into a Bayesian framework, so that we can work directly with all the features of a Stan object.
Contact: j.cavieres.g@gmail.com

How to deal with non-continuous spatially confounded covariates?

In geo-additive models, spatial confounding might be observed when a covariate is correlated with the spatial field. This phenomenon occurs when the spatial field estimates mask the estimate of the correlated fixed effect, thereby inducing bias. Recently, Dupont et al.'s "Spatial+" gained attention by proposing a straightforward approach, which can be used for generalized models as well. In accordance with previous research (restricted spatial regression, geo-additive structural equation model), the focus has been on continuous confounded covariates. The thesis aims to concentrate on binary and categorical confounded covariates and to compare different approaches to spatial confounding. This includes giving an overview of existing methods, suggesting approaches in the manner of Spatial+ and accompanying these with an extensive simulation study.
Contact: lars.knieper@uni-goettingen.de

An application of joint models for longitudinal and time-to-event data

There is a large body of literature on the relationship between marital satisfaction (satisfaction with the relationship) and divorce. Some studies focus on the mediating effect of marital satisfaction, e.g. for the relationship between (perceived) inequity in the division of household labor and divorce. However, many studies do not take advantage of the richness of panel data, i.e. they use only two time points or focus on only one of the two outcomes (marital satisfaction, risk of divorce). The class of joint models for longitudinal and time-to-event data overcomes this problem and is also able to deal with the endogeneity of marital satisfaction. Endogeneity arises from the fact that the trajectory of marital satisfaction over the course of the marriage is unlikely to be independent of the event outcome (divorce/no divorce). This thesis implements a joint model to disentangle the relationships between marital happiness, divorce and (perceived) inequity in the division of household labor. The pairfam dataset is used to exploit the richness of longitudinal data. As the structure of the data set allows a couple data approach, this may be included as well. Some prior knowledge of time-to-event analysis may be helpful but is not mandatory.
Contact: sophie.potts@uni-goettingen.de

An application of the joint latent class mixed model for longitudinal and time-to-event data

Are there distinct groups of newlyweds' satisfaction trajectories? How are they characterized? Lavner & Bradbury (2010) found latent groups of satisfaction trajectories among newlyweds, but how they differ with respect to divorce was examined using rates rather than time-to-event analysis models. The class of joint latent class models for longitudinal and time-to-event data overcomes this problem and is also able to deal with the endogeneity of marital satisfaction. The endogeneity arises from the fact that the trajectory of marital satisfaction throughout the marriage is unlikely to be independent of the event outcome (divorce/no divorce). This thesis implements a joint latent class mixed model for longitudinal and time-to-event data to enrich the findings of Lavner & Bradbury (2010). For this purpose, the pairfam dataset can be used. Some prior knowledge of time-to-event analysis is recommended.
Lavner, J. A., & Bradbury, T. N. (2010). Patterns of change in marital satisfaction over the newlywed years. Journal of Marriage and Family, 72(5), 1171-1187.
Contact: sophie.potts@uni-goettingen.de