Areas of topics
We offer thesis topics from the fields of mixed and joint modelling, quantile regression and mixture density networks. All these models are implemented in Bayesian frameworks as well as in statistical learning approaches such as gradient boosting. Please feel free to contact us if you are interested in a topic in one of these areas.
Further topics in the field of statistics are available at the chair of statistics.
Topics for Bachelor Theses
- Dirichlet Regression (Measuring the influence of protests on election results): After the financial crisis in 2008, worldwide protests against the political and economic measures started, and a shift in voting behavior occurred. We want to analyse the influence of the protests, as well as of further economic and sociological variables, on election outcomes using Dirichlet regression.
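As an illustrative sketch (using simulated vote shares, not the actual protest data), the core of Dirichlet regression is maximising a Dirichlet likelihood over positive concentration parameters. A minimal intercept-only version, in which a full Dirichlet regression would let the log-concentrations depend on covariates, can be written in a few lines:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import dirichlet

# Hypothetical example: vote shares of three parties across 50 districts,
# drawn from a Dirichlet distribution with a common concentration vector.
rng = np.random.default_rng(0)
true_alpha = np.array([4.0, 2.5, 1.5])
shares = rng.dirichlet(true_alpha, size=50)

# Negative log-likelihood of a common alpha, log-parametrised so that
# alpha stays positive; a regression model would replace the constant
# log(alpha) by a linear predictor in the covariates.
def neg_loglik(log_alpha):
    alpha = np.exp(log_alpha)
    return -sum(dirichlet.logpdf(row, alpha) for row in shares)

fit = minimize(neg_loglik, x0=np.zeros(3), method="Nelder-Mead")
alpha_hat = np.exp(fit.x)
print(np.round(alpha_hat, 2))
```

The estimated concentrations should lie near the simulating values; their relative sizes correspond to the expected vote shares.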
- Quantifying the impact of post-match variables on soccer predictions via generalized mixed models:
The allure of professional soccer stems from the element of surprise, as numerous unpredictable events occur within the game. Consequently, both bookmakers and passionate fans endeavor to predict game outcomes, driven by financial incentives or sheer fandom.
This thesis is based on an extensive data set comprising pre-match and post-match variables from Europe's top 5 leagues over recent years. The primary objective of this thesis is to quantify the contribution of post-match variables to a model's predictive ability. The prediction targets include the number of goals scored and the match outcome (win, draw, or loss). The proposed models are generalized linear mixed models.
Given the high correlation among variables, the research question allows for at least two separate theses using the following methods:
(I) The first approach involves model-based gradient boosting for mixed models, which simultaneously performs variable selection and estimates the effects.
(II) The second approach utilizes data engineering techniques to incorporate dimensionality reduction (e.g. PCA, Canonical Correlation) and variable selection before estimating the mixed models through the maximization of their penalized likelihood.
In addition to the aforementioned methodologies, this research will encompass a brief review of prior studies in the field of soccer match prediction, an exploratory data analysis, and the application of the mentioned methods to gain deeper insights into soccer predictions.
You are also encouraged to propose your own research question utilizing this dataset.
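The second approach can be sketched in miniature as follows, assuming simulated stand-in data rather than the actual match data set, and a plain Poisson GLM in place of the mixed model (which would additionally carry team or season random effects):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 matches, 30 highly correlated
# post-match variables, goals scored as a Poisson response.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))                 # a few driving factors
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))
rate = np.exp(0.3 + 0.4 * latent[:, 0])
goals = rng.poisson(rate)

# Dimensionality reduction (here PCA) before estimating the count model;
# this tackles the correlation among the post-match variables.
model = make_pipeline(StandardScaler(), PCA(n_components=3),
                      PoissonRegressor(alpha=0.0))
model.fit(X, goals)
d2 = model.score(X, goals)   # D^2, fraction of deviance explained
print(round(d2, 3))
```

With only a handful of principal components the correlated block of covariates collapses into a low-dimensional predictor, which is the idea behind approach (II).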
- Categorical Regression Models in Social Sciences
Categorical Regression Models such as multinomial, ordered logit and sequential models deal with non-binary categorical outcomes.
Those outcomes can often be found in social science surveys and research questions.
This bachelor thesis investigates the usage of these types of models in a specific field of social sciences (e.g. election results, agreement on social issues, (life) satisfaction).
After describing the methods and giving a short literature overview, at least one model type is applied to a real-world data set (e.g. ALLBUS, SOEP, Eurobarometer, ...) and the results are interpreted.
The content focus is oriented towards the interests of the applicant.
The thesis can be written in German or English.
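A small illustration of the multinomial case, assuming simulated survey-style data (the thesis itself would use ALLBUS, SOEP or similar, and typically a package that reports the usual coefficient tables, such as statsmodels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical survey-style data: age and income predicting a party
# choice with three unordered categories (0, 1, 2).
rng = np.random.default_rng(2)
n = 500
age = rng.uniform(18, 80, size=n)
income = rng.normal(3000, 800, size=n)

# Simulate choices from a multinomial logit with category 0 as reference
eta = np.column_stack([np.zeros(n),
                       0.03 * (age - 40),
                       0.0005 * (income - 3000)])
probs = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=pr) for pr in probs])

# Standardise covariates and fit a multinomial (softmax) logit
X = np.column_stack([age, income])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
clf = LogisticRegression(max_iter=1000).fit(Xs, y)
print(clf.coef_.shape)   # one coefficient row per category
```

Ordered and sequential models replace the softmax link by cumulative or conditional category probabilities, but the data layout stays the same.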
- Model-based boosting modifications - an extensive comparison:
Model-based component-wise gradient boosting is a popular tool for data-driven variable selection in regression models.
In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed (e.g. probing, stability selection, deselection).
This thesis gives an overview of the modifications and compares their performance for Generalized Additive Models with respect to variable selection and prediction accuracy, based on an extensive simulation study and applications to (common R or real-world) data sets.
A special focus can be set on the different types of base-learners (linear, splines, tree-based, spatial).
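The unmodified baseline algorithm is short enough to sketch directly. The following is a minimal component-wise L2 boosting loop with linear base-learners on simulated data (R's mboost implements the full version, including the modifications listed above):

```python
import numpy as np

# Minimal component-wise L2 boosting: in each iteration, fit every
# single-covariate least-squares base-learner to the current residuals
# (the negative gradient of the L2 loss) and update only the
# best-fitting one by a small step length nu.
rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

beta = np.zeros(p)
nu = 0.1                                  # fixed step length
for _ in range(300):
    resid = y - X @ beta                  # negative gradient of L2 loss
    # least-squares fit of each base-learner to the residuals
    coefs = (X * resid[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    rss = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
    j = np.argmin(rss)                    # best-fitting component
    beta[j] += nu * coefs[j]

selected = np.nonzero(np.abs(beta) > 0.05)[0]
print(selected, np.round(beta[selected], 2))
```

Variables never selected keep a coefficient of exactly zero, which is the intrinsic variable selection the modifications (probing, stability selection, deselection) aim to sharpen; in practice the number of iterations is chosen by early stopping rather than fixed.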
- Gradient-based boosting for GAMLSS – Performance of an adaptive step length approach in case of correlated covariates:
Component-wise gradient boosting methods are known for their good performance in the case of correlated covariates, since in each iteration the model is updated using only a small fixed step length. When boosting GAMLSS, using adaptive step lengths can result in more balanced submodels for the different distributional parameters. The aim of this thesis is to investigate the performance of an adaptive step length approach compared to an approach using fixed step lengths in a setting with correlated covariates. This is done with an extensive simulation study.
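The core of the adaptive idea can be sketched for a single boosting update: instead of a fixed nu, a line search picks the step that minimises the empirical loss along the selected base-learner fit, and the chosen step is then shrunken. The toy below (simulated data, intercept-only base-learner, Gaussian scale parameter with the mean fixed at zero) is only meant to show the mechanics:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# One boosting update for the log-scale predictor of a Gaussian model
# with mu fixed at 0; the adaptive step is found by a line search over
# the empirical negative log-likelihood.
rng = np.random.default_rng(4)
y = rng.normal(loc=0.0, scale=2.0, size=300)

def loss(log_sigma):
    # negative Gaussian log-likelihood as a function of log(sigma)
    sigma = np.exp(log_sigma)
    return np.sum(np.log(sigma) + y ** 2 / (2 * sigma ** 2))

f = 0.0          # current additive predictor for log(sigma)
h = 1.0          # base-learner fit (here simply the intercept direction)
step = minimize_scalar(lambda nu: loss(f + nu * h),
                       bounds=(0, 10), method="bounded")
f_new = f + 0.1 * step.x   # shrunken optimal step length
print(round(step.x, 3))
```

In boosted GAMLSS each distributional parameter gets such a step per iteration, and letting them adapt separately is what keeps the submodels balanced.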
- Gradient-based boosting for GAMLSS – Comparison of different selection criteria for base-learner updates:
When estimating GAMLSS, variable selection is often an issue. Due to their intrinsic variable selection, component-wise boosting approaches can provide a remedy with respect to that, where naturally the order of the updates plays an important role. This order of updates can depend on the base-learner selection criteria. There are two natural criteria for the selection of the base-learner update in GAMLSS: the inner and the outer loss (see for example Thomas et al. (2018), Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates), which can however in some cases yield different base-learner updates.
The aim of this thesis is to investigate how this selection criterion affects the variable selection as well as the estimated coefficients in the overall model, considering, among other things, different types of covariates. This is done with an extensive simulation study.
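The two criteria can be made concrete for a single boosting step. In the sketch below (simulated data, L1 loss chosen so that the gradient fit and the loss reduction need not measure the same thing), the inner criterion picks the base-learner that best fits the negative gradient, while the outer criterion picks the one whose update most reduces the empirical loss:

```python
import numpy as np

# One boosting step under the absolute-error (L1) loss: u is the
# negative gradient (the sign of the residuals); each candidate
# base-learner h_j is the least-squares fit of covariate j to u.
rng = np.random.default_rng(6)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.standard_t(df=2, size=n)     # heavy-tailed noise
f = np.zeros(n)                                # current additive predictor
nu = 0.1

u = np.sign(y - f)                             # negative gradient of L1 loss
coefs = (X * u[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
H = X * coefs                                  # fitted base-learners h_j

inner = np.argmin(((u[:, None] - H) ** 2).sum(axis=0))        # gradient fit
outer = np.argmin([np.abs(y - (f + nu * H[:, j])).sum() for j in range(p)])
print("inner loss picks:", inner, "| outer loss picks:", outer)
```

In this toy example the two criteria will often coincide; the point of the thesis is precisely to study settings in which they systematically differ.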
- Robust Distributional Regression via ANNs
The application of regression models such as GLMs or GAMs is commonly based on distributional assumptions.
As a consequence, outliers that violate these assumptions have the potential to heavily dominate the results of a model.
The aim of this thesis is to implement different penalization strategies for distributional regression models via the framework of artificial neural networks.
- Boosting Mixtures of Regression Models
Mixture regression models are used in scenarios with unobserved heterogeneity, i.e. the (assumed) presence of different latent classes in the data.
Depending on the number of latent classes and the corresponding (distributional) regression models, this quickly results in many unknown parameters to be modeled via separate prediction functions.
Given their identifiability, component-wise boosting algorithms are generally able to estimate the unknown coefficients of this model framework, but the optimal specification of the algorithm remains unclear.
Therefore, this thesis will investigate and evaluate different initialization and updating strategies inside the boosting algorithm.
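As a point of reference, the classical EM algorithm for this model class fits the component regressions by weighted least squares in the M-step; the thesis would instead perform component-wise boosting updates there. A minimal sketch on simulated data, assuming two components and a known common error variance for brevity:

```python
import numpy as np

# EM for a two-component mixture of linear regressions on simulated data.
rng = np.random.default_rng(5)
n = 400
x = rng.uniform(-2, 2, size=n)
z = rng.random(n) < 0.5                        # latent class
y = np.where(z, 1.0 + 2.0 * x, -1.0 - 1.5 * x) + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

def normal_pdf(r, sigma=0.3):
    return np.exp(-r ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

pi = 0.5                                       # mixing proportion, initial
betas = [np.array([0.5, 1.0]), np.array([-0.5, -1.0])]
for _ in range(50):
    # E-step: posterior class probabilities
    d1 = pi * normal_pdf(y - X @ betas[0])
    d2 = (1 - pi) * normal_pdf(y - X @ betas[1])
    w = d1 / (d1 + d2)
    # M-step: weighted least squares per component
    for k, wk in enumerate([w, 1 - w]):
        betas[k] = np.linalg.solve(X.T @ (wk[:, None] * X), X.T @ (wk * y))
    pi = w.mean()

print(np.round(betas[0], 2), np.round(betas[1], 2), round(pi, 2))
```

The initialization question mentioned above shows up immediately: starting values decide which latent class each prediction function attaches to, and poor ones can trap the algorithm.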
- Comparing the Saddlepoint and Laplace approximations for spatial models
Numerical approximation methods allow us to find solutions for complex problems that generally cannot be solved analytically. Such problems commonly arise in spatial models, where the covariates introduced to explain a response variable may themselves vary in space. The resulting model structure makes an analytical solution difficult. For this reason, the Laplace approximation has gained popularity as a fast and simple way to approximate the solution, especially for spatial models. However, this method is known to fail for small sample sizes and in other specific circumstances. As an alternative, we could consider the saddlepoint approximation and evaluate its behavior in spatial models.
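The Laplace approximation and its sample-size behavior can be illustrated on a one-dimensional toy integral: expand the exponent around its mode and replace the integrand by the resulting Gaussian. The approximation sharpens as the scaling factor n (think: sample size) grows, and conversely can be poor for small n, which is what motivates alternatives such as the saddlepoint approximation:

```python
import numpy as np
from scipy.integrate import quad

# Laplace approximation of I(n) = integral of exp(n*g(x)) dx with
# g(x) = -x^4/4 - x^2/2: expand g around its mode x0 = 0 and use the
# Gaussian integral  I(n) ~ exp(n*g(x0)) * sqrt(2*pi / (n*|g''(x0)|)).
g = lambda x: -x ** 4 / 4 - x ** 2 / 2
x0, g2 = 0.0, -1.0                       # mode and g''(x0) = -3*x0^2 - 1

ratios = []
for n in (1, 10, 100):
    laplace = np.exp(n * g(x0)) * np.sqrt(2 * np.pi / (n * abs(g2)))
    exact, _ = quad(lambda x: np.exp(n * g(x)), -5, 5, points=[0])
    ratios.append(laplace / exact)
    print(n, round(laplace / exact, 3))
```

At n = 1 the Laplace value overshoots the true integral by roughly 30 percent, while for large n the ratio approaches 1; the small-n regime is exactly where a saddlepoint correction would be evaluated.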
- Study the behavior of the posterior distribution via the Hamiltonian Monte Carlo method for a spatial random effect (a GRF approximated by a GMRF) integrated out with the Saddlepoint approximation.
Template Model Builder (TMB) is a frequentist statistical software package that enables the estimation of parameters of non-linear models through the Laplace approximation and automatic differentiation. This software offers computational efficiency, as the Laplace approximation provides a close approximation to the true solution, while automatic differentiation computes first and second derivatives automatically. In contrast, Stan is a probabilistic programming software specifically designed for Bayesian inference. It employs the Hamiltonian Monte Carlo algorithm to derive posterior distributions for model parameters. By using the “tmbstan” library and applying prior distributions to the parameters, we can easily transform a model developed in TMB into a Bayesian framework and work directly with all the features of a Stan object.