Topics for Theses in Statistics

Bachelor Theses

  • Title: Apply Generalized Linear Mixed Model to an Experiment on Nudging Meal Choices
    Short description: In behavioural economics, the use of small interventions (nudges) to influence people's behaviour, often in a socially desirable way, is an area of considerable interest. This thesis analyses data from an experiment that investigated a way of nudging Mensa customers towards choosing meat-free meals. Since the outcome is categorical, an ordinary linear model is not viable. Further, it would be desirable to take inter-individual differences into account. This thesis will therefore employ a generalised linear mixed model to analyse the data.
    Contact: Johannes Brachem (brachem@uni-goettingen.de)
  • Title: Wage and Staffing Structure Analysis in the Healthcare Sector of Lower Saxony
    Short description: As part of a (bachelor's) thesis, data on the wage and staffing structure in hospitals and comparable healthcare facilities in Lower Saxony are to be collected and analysed.
    Contact: Alexander Silbersdorff (asilbersdorff@uni-goettingen.de)
  • Title: What catches the learning eye
    Short description: Using PyGaze, eye-movement data of students watching introductory mathematics and statistics lectures are to be recorded and analysed with respect to the students' learning success.
    Contact: Alexander Silbersdorff (asilbersdorff@uni-goettingen.de)
  • Title: Care Penalty - Defining income inequality between the care and other industries
    Short description: Paid and unpaid care keeps society going during the COVID-19 crisis. However, we face a shortage of nursing staff and care. This may be due to low remuneration in the care industry. Feminist literature finds a systematic income inequality between the care industry and other industries. For an original analysis of the care penalty, the German SOEP or the Mexican ENOE dataset are possible data sources.
    Recommended literature: Folbre, Nancy, Leila Gautham, and Kristin Smith. "Essential Workers and Care Penalties in the United States." Feminist Economics (2020): 1-15.
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: Measuring time use and its implications for gender equality
    Short description: Poverty is nowadays measured with multidimensional approaches that go beyond income alone. Time availability is an important component in assessing poverty, as it enables and restricts any activity. In particular, when analyzing gender differences, time use can shed light on persisting inequalities due to the double work burden of women. However, time measures are rarely included in multidimensional poverty assessments. Possible topics related to this issue are: discussing time measures, data shortcomings and the inclusion of time in multidimensional poverty measures. The German SOEP or the Mexican ENUT dataset are possible sources for case studies.
    Recommended literature: http://www.levyinstitute.org/research/the-levy-institute-measure-of-time-and-income-poverty
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: The interplay of social sustainability and environmental sustainability over time
    Short description: Understanding the relationship between social and environmental sustainability is at the core of a transition to a safe and just future. However, the current literature lacks two aspects in particular: first, a definition of social sustainability, and second, a focus on the composition of the service sector, which seems highly relevant for understanding the interplay between social and environmental sustainability because energy use differs strongly between services. To evaluate this question, a data set that provides the information for empirical analysis needs to be compiled. Interesting sources are the Oxfam Care Policy Score Card, the Oxfam Better Life Index, Ecological Footprint data and the World Bank Development Indicators, as well as public health approaches interpreted in terms of sustainability. This topic provides different opportunities for bachelor and master theses. From an economic perspective, minimum levels of well-being and their operationalization can be discussed and analyzed. From a statistical angle, one can extend copula models to time use data. Such theses can provide statistical support for the theory on the interlinkages of social and environmental sustainability and inform the direction of social and environmental policy.
    Literature: Dorn, F., Maxand, S., Kneib, T. (2021): The Dependence between Income Inequality and Carbon Emissions: A Distributional Copula Analysis, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3800302
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: Modeling extreme rain events in Germany
    Short description: Extreme rain events have grown more frequent since the beginning of the twentieth century. Thus, it is increasingly important to understand these phenomena. In this thesis, we use regression models and meteorological data from Germany to study these events. Since these events are likely to be strongly spatially correlated, a reflection on the spatial patterns behind them is recommended.
    Contact: Isa Marques (imarques@uni-goettingen.de)
  • Title: Using tree-based methods for environmental policy evaluation
    Short description: The thesis will look in depth at recent applications of tree-based methods in economic research and available data sources. Additionally, the work will identify an appropriate environmental policy and a setting in which the impact on a firm-level outcome of choice can be studied. Finally, the method should be applied to the data on a basic level.
    Contact: Isea Cieply (isea.cieply@uni-goettingen.de)
  • Title: Variable selection for causal inference in observational data: state-of-the-art practices and propositions for GAMLSS
    Short description: In randomized controlled trials, "the gold standard" for establishing causality, covariates can be included if randomization is not perfect or to increase power. These variables are often based on some "theory of change" and are determined in the pre-analysis plan. When using observational data, statisticians usually base their variable selection on some kind of information criterion (e.g. AIC, BIC) or apply methods such as lasso or boosting. However, a good statistical model may not be a good causal model. This thesis aims at scrutinizing which approaches to variable selection can serve in a causal context when randomization is not possible. The student is expected to present and explore different existing variable selection procedures, apply them to either a linear model or GAMLSS, and think about if and how they can be applied in a causal context.
    Contact: Maike Hohberg (mhohberg@uni-goettingen.de)
  • Title: Development of nitrate levels in drinking water reservoirs in the Harz
    Short description: Modelling nitrate levels in the drinking water reservoirs of the Harz mountains.
    Contact: Jens Lichter (jens.lichter@uni-goettingen.de)

Master Theses

    • Title: Distributed regression
      Short description: Due to restrictions on data sharing, statistical models sometimes have to be estimated in a distributed setting, i.e. without having access to one composed data set. Rather, the data set is stored in parts at different locations without access to the raw data. This master thesis should investigate how one joint regression model can be estimated in such a setting based on only minimal summary statistics available for the individual data sets.
      Contact: Thomas Kneib (tkneib@uni-goettingen.de)
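
For the distributed regression topic above, the simplest case can be sketched in a few lines. This is only an illustration for ordinary least squares, assuming every site can share the aggregates X'X and X'y of its own data; the function names and the simulated data are hypothetical and not part of the thesis description.

```python
import numpy as np


def local_summaries(X, y):
    """Computed at each site; only these aggregates leave the site."""
    return X.T @ X, X.T @ y


def pooled_ols(summaries):
    """Combine the per-site aggregates into one least-squares estimate."""
    xtx = sum(s[0] for s in summaries)
    xty = sum(s[1] for s in summaries)
    return np.linalg.solve(xtx, xty)


# Simulated data, split into three "sites" of unequal size (illustrative only).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=300)
sites = [(X[:100], y[:100]), (X[100:180], y[100:180]), (X[180:], y[180:])]

beta_distributed = pooled_ols([local_summaries(Xs, ys) for Xs, ys in sites])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]  # reference fit on the pooled raw data
```

Because the normal equations only depend on sums over observations, the combined estimate agrees with the fit on the pooled raw data up to numerical error. More general models do not decompose this neatly, which is where the actual research question begins.
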
    • Title: Confidence intervals for distributional regression models
      Short description: Distributional regression models, as implemented in the R package gamlss, report observed Wald standard errors for the estimated model coefficients. These lead naturally to Wald-type confidence intervals for the coefficients, which typically have lower than nominal coverage, particularly when the response distribution is highly skewed. In this project the student will use simulation to compare the coverage of various alternatives to the Wald confidence intervals, including profile likelihood, bootstrap and Bayesian intervals. The simulation will be performed for a variety of response distributions. The master project will be worked on in collaboration with Mikis Stasinopoulos and Gillian Heller.
      Contact: Thomas Kneib (tkneib@uni-goettingen.de)
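
A coverage simulation of the kind described above could be organised roughly as follows. The sketch is a deliberately simplified stand-in: it uses an intercept-only exponential model instead of a gamlss fit, and compares the Wald interval for the mean only with a percentile bootstrap interval. The function name, sample size and number of replications are assumptions chosen for illustration.

```python
import numpy as np


def coverage(n=10, true_mean=2.0, reps=1000, n_boot=300, seed=0):
    """Estimate the coverage of nominal 95% Wald and percentile bootstrap intervals."""
    rng = np.random.default_rng(seed)
    z = 1.959963984540054  # 97.5% quantile of the standard normal distribution
    hits_wald = 0
    hits_boot = 0
    for _ in range(reps):
        y = rng.exponential(scale=true_mean, size=n)
        est = y.mean()  # maximum likelihood estimate of the mean
        se = est / np.sqrt(n)  # Wald standard error from the Fisher information
        hits_wald += int(est - z * se <= true_mean <= est + z * se)
        # Percentile bootstrap: resample the data with replacement.
        idx = rng.integers(0, n, size=(n_boot, n))
        boot_means = y[idx].mean(axis=1)
        lower, upper = np.percentile(boot_means, [2.5, 97.5])
        hits_boot += int(lower <= true_mean <= upper)
    return hits_wald / reps, hits_boot / reps


wald_coverage, boot_coverage = coverage()
```

The two returned proportions can be compared with the nominal level of 0.95. Profile likelihood and Bayesian intervals would be added as further columns in the same loop.
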
    • Title: The C4 copula for flexibly modelling multivariate dependence
      Short description: Copulas provide a flexible, versatile approach for modelling the dependence between random quantities, where the specification of a joint density is decomposed into the specification of marginal distributions and a copula function determining the dependence structure. While most commonly applied copula specifications involve one single dependence structure, more flexible extensions such as the C4 copula can be utilized to model more general types of dependence. This thesis will explore the C4 copula and its extensions.
      Contact: Thomas Kneib (tkneib@uni-goettingen.de)
    • Title: Multivariate conditional transformation models and their implied dependence
      Short description: Multivariate conditional transformation models allow for the construction and analysis of complex multivariate regression models. Basic forms of such models are equivalent to Gaussian copula specifications, but additional flexibility can be achieved when relaxing the specification of the multivariate transformation function. This thesis will implement such relaxations and will study their implied dependence structure.
      Contact: Thomas Kneib (tkneib@uni-goettingen.de)
    • Title: LASSO regularization and group fixed effects
      Short description: Fixed effects specifications in panel data make it possible to control for various types of unobserved heterogeneity, but considerably inflate the number of parameters to be estimated. To overcome this problem, group fixed effects approaches aim at identifying sub-groups in the data that share the same fixed effects structure. In this thesis, regularization approaches such as the fused LASSO will be investigated with respect to their ability to identify group fixed effects in panel data.
      Contact: Thomas Kneib (tkneib@uni-goettingen.de)
    • Title: Apply Bayesian Discrete Conditional Transformation Modeling to an Experiment on Nudging Meal Choices
      Short description: In behavioral economics, the use of small interventions (nudges) to influence people's behavior, often in a socially desirable way, is an area of considerable interest. This thesis applies a novel, highly flexible regression approach, i.e. conditional transformation models (CTMs), to data from an experiment that investigated a way of nudging Mensa customers towards choosing meat-free meals. In CTMs, the conditional distribution of the response given a set of covariates is estimated directly from the data with the help of a reference distribution.
      Contact: Johannes Brachem (brachem@uni-goettingen.de)
    • Title: Investigate the Performance of Bayesian Conditional Transformation Models with Different Reference Distributions
      Short description: Conditional Transformation Models (CTMs) are a highly flexible regression modeling approach. In CTMs, the conditional distribution of the response given a set of covariates is estimated directly from the data with the help of a reference distribution. This thesis explores the influence of the choice of reference distribution on the performance of CTMs in various settings.
      Contact: Johannes Brachem (brachem@uni-goettingen.de)
    • Title: Implement Bayesian Discrete Choice Models in Liesel
      Short description: Liesel is a Python framework for efficient probabilistic programming that consists of a model-building library and a library for Markov-Chain-Monte-Carlo (MCMC) algorithms. This thesis implements functionality for setting up and sampling discrete choice models with hierarchical priors and mixtures-of-normals-priors with the Liesel framework and validates their behavior through simulations and comparisons to existing implementations in the R package `bayesm`. Since this thesis has a strong focus on programming in Python, prior programming experience in Python is an essential prerequisite.
      Contact: Johannes Brachem (brachem@uni-goettingen.de)
    • Title: Modelling ordinal explanatory variables
      Short description: In (generalized) linear models, ordinal explanatory variables are, depending on the number of levels, often included as either categorical or metric variables. This thesis should investigate how ordinal variables can be modeled without wrongly assuming them to be continuous, while still taking the neighborhood structure of the levels into account by using a Markov random field or a one-dimensional P-spline.
      Contact: Lea Dammann (leamaria.dammann@uni-goettingen.de), Benjamin Säfken (benjamin.saefken@uni-goettingen.de)
    • Title: Care Penalty - Defining income inequality between the care and other industries
      Short description: Paid and unpaid care keeps society going during the COVID-19 crisis. However, we face a shortage of nursing staff and care. This may be due to low remuneration in the care industry. Feminist literature finds a systematic income inequality between the care industry and other industries. For an original analysis of the care penalty, the German SOEP or the Mexican ENOE dataset are possible data sources.
      Recommended literature: Folbre, Nancy, Leila Gautham, and Kristin Smith. "Essential Workers and Care Penalties in the United States." Feminist Economics (2020): 1-15.
      Contact: Franziska Dorn (fdorn@uni-goettingen.de)
    • Title: Measuring time use and its implications for gender equality
      Short description: Poverty is nowadays measured with multidimensional approaches that go beyond income alone. Time availability is an important component in assessing poverty, as it enables and restricts any activity. In particular, when analyzing gender differences, time use can shed light on persisting inequalities due to the double work burden of women. However, time measures are rarely included in multidimensional poverty assessments. Possible topics related to this issue are: discussing time measures, data shortcomings and the inclusion of time in multidimensional poverty measures. The German SOEP or the Mexican ENUT dataset are possible sources for case studies.
      Recommended literature: http://www.levyinstitute.org/research/the-levy-institute-measure-of-time-and-income-poverty
      Contact: Franziska Dorn (fdorn@uni-goettingen.de)
    • Title: The interplay of social sustainability and environmental sustainability over time
      Short description: Understanding the relationship between social and environmental sustainability is at the core of a transition to a safe and just future. However, the current literature lacks two aspects in particular: first, a definition of social sustainability, and second, a focus on the composition of the service sector, which seems highly relevant for understanding the interplay between social and environmental sustainability because energy use differs strongly between services. To evaluate this question, a data set that provides the information for empirical analysis needs to be compiled. Interesting sources are the Oxfam Care Policy Score Card, the Oxfam Better Life Index, Ecological Footprint data and the World Bank Development Indicators, as well as public health approaches interpreted in terms of sustainability. This topic provides different opportunities for bachelor and master theses. From an economic perspective, minimum levels of well-being and their operationalization can be discussed and analyzed. From a statistical angle, one can extend copula models to time use data. Such theses can provide statistical support for the theory on the interlinkages of social and environmental sustainability and inform the direction of social and environmental policy.
      Literature: Dorn, F., Maxand, S., Kneib, T. (2021): The Dependence between Income Inequality and Carbon Emissions: A Distributional Copula Analysis, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3800302
      Contact: Franziska Dorn (fdorn@uni-goettingen.de)
    • Title: Implement a package for Kriging in R (or Python)
      Short description: Kriging models (Cressie, 1993), also known as Gaussian process models, are perhaps the most popular tool for spatial regression modeling. The method is widely used in a variety of research areas, such as environmental sciences, ecology, physics and epidemiology, amongst many others. Despite this, there are very few R packages that perform Kriging and allow the estimation of all relevant parameters (spatial range and marginal variance), and the few existing examples are perhaps not up to date. This represents both a large gap in the existing R packages and a promising topic of research. In this thesis, an R package will be implemented to perform Kriging. Upon request, the package can alternatively be implemented in Python.
      Contact: Isa Marques (imarques@uni-goettingen.de), Paul Wiemann (pwiemann@uni-goettingen.de)
    • Title: Determinants of tree growth in Northern Germany: a study on spatial confounding
      Short description: When studying tree growth, one often aims at understanding factors that affect it, such as soil type or precipitation. When the dataset is spatially indexed, one must also account for spatial correlation between observations. However, when spatially varying covariates (such as soil type or precipitation) are included in a model alongside a spatial effect, spatial confounding between these effects can occur. This can hinder the analysis of the determinants of tree growth. In this thesis, you will analyze a dataset on tree growth in Northern Germany, while controlling for spatial confounding. Some references include Dupont, Wood, Augustin (2021).
      Contact: Isa Marques (imarques@uni-goettingen.de)
    • Title: Species Distribution Modeling using Spatial Point Processes
      Short description: Species distribution models are widely used in ecology to predict and understand spatial patterns, assess the influence of climatic and environmental factors on species occurrence, and identify rare and endangered species. In this thesis, you will use spatial point processes to understand this phenomenon. The analysis will use data from EMODNET (https://emodnet.ec.europa.eu/en) and the software R-INLA.
      Contact: Isa Marques (imarques@uni-goettingen.de)
    • Title: Analysing the effect of China's Emissions Trading Scheme (ETS) on firm innovations
      Short description: Following the work of Athey, Tibshirani, and Wager (2019) on Generalized Random Forests, this method will be used to estimate the impact of the ETS on firm innovation. The theoretical basis will be the Porter Hypothesis or the Stackelberg Model. Data can be accessed from the Chinese Statistical Yearbook.
      Contact: Isea Cieply (isea.cieply@uni-goettingen.de)
    • Title: Spatial random forests and their applications in environmental and energy economics
      Short description: This thesis will focus on possibilities of applying spatial random forests to environmental policy analysis. In addition to a detailed literature review of the relevant economic literature, the aim will be to integrate a spatial dimension into the generalized random forests framework (Athey et al. 2019). Data can be accessed from the Chinese Statistical Yearbook and additional sources.
      Contact: Isea Cieply (isea.cieply@uni-goettingen.de)
    • Title: GAMLSS in regression discontinuity designs: methods and applications in impact evaluation
      Short description: The regression discontinuity design (RDD) is a popular tool in economics and epidemiology to establish causality with observational data when the treatment is assigned based on a cutoff rule. Individuals just below and just above the cutoff are assumed to be similar and to differ only in terms of receiving the treatment. Therefore, their outcomes can be compared and no further covariates are needed. However, in some situations, covariates are introduced into the RDD, for example when the groups clearly differ in an additional dimension. This thesis investigates how GAMLSS can be introduced in this context. GAMLSS have the advantage that they estimate not only the mean but the whole conditional distribution. In addition to testing the benefits of such a "conditional distributional RDD", making the approach "unconditional" is another aim of this thesis.
      Contact: Maike Hohberg (mhohberg@uni-goettingen.de)
    • Title: Sampling under linear constraints in Liesel
      Short description: Liesel is a Python framework for efficient probabilistic programming that consists of a model-building library and a library for Markov-Chain-Monte-Carlo (MCMC) algorithms. Linear constraints, for example sum-to-zero-constraints, can be useful tools to ensure parameter identifiability in generalized additive models. This thesis implements and evaluates different samplers that allow researchers to specify linear constraints on the sampled parameters. Since this thesis has a strong focus on programming in Python, prior programming experience in Python is an essential prerequisite.
      Contact: Johannes Brachem (brachem@uni-goettingen.de), Hannes Riebl (hriebl@uni-goettingen.de)
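
One common way to respect a linear constraint such as the sum-to-zero constraint mentioned above is to reparameterise: draw an unconstrained vector and map it into the null space of the constraint matrix. The sketch below uses plain NumPy and does not use or represent the Liesel API; the function name, the dimension and the standard normal draws are assumptions made for illustration.

```python
import numpy as np


def null_space_basis(A, tol=1e-10):
    """Orthonormal basis Z of {beta : A @ beta = 0}, obtained from the SVD of A."""
    A = np.atleast_2d(A)
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T


k = 5
A = np.ones((1, k))  # one constraint: the coefficients sum to zero
Z = null_space_basis(A)  # shape (k, k - 1)

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, Z.shape[1]))  # unconstrained draws (placeholder for a sampler)
beta = z @ Z.T  # every row satisfies A @ beta = 0
```

A sampler can then operate on the unconstrained vector `z`, and the constrained parameter is recovered as `Z @ z`. Other strategies, such as conditioning a Gaussian proposal on the constraint, are alternatives a thesis could compare.
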
    • Title: Bayesian variable selection models in Liesel
      Short description: Liesel is a Python framework for efficient probabilistic programming that consists of a model-building library and a library for Markov-Chain-Monte-Carlo (MCMC) algorithms. This thesis implements functionality for regularisation priors like Lasso, Horseshoe, and Spike & Slab for comfortable use with the Liesel framework and validates their behaviour through simulations. Since this thesis has a strong focus on programming in Python, prior programming experience in Python is an essential prerequisite.
      Contact: Johannes Brachem (brachem@uni-goettingen.de)
    • Title: Random scaling effect models in Liesel
      Short description: Liesel is a Python framework for efficient probabilistic programming that consists of a model-building library and a library for Markov chain Monte Carlo (MCMC) algorithms. Random scaling effect models can be used to scale non-linear covariate effects (for example the effect of the price on the sales) with random effects (for example for the stores). This thesis implements random scaling effect models in Liesel, evaluates them in a simulation study, and applies them to an empirical problem. Since this thesis has a strong focus on programming in Python, prior programming experience in Python is an essential prerequisite.
      Contact: Hannes Riebl (hriebl@uni-goettingen.de), Paul Wiemann (pwiemann@uni-goettingen.de)
    • Title: Measurement error correction models in Liesel
      Short description: Liesel is a Python framework for efficient probabilistic programming that consists of a model-building library and a library for Markov chain Monte Carlo (MCMC) algorithms. Standard regression models assume non-random covariates, which is especially problematic in applications that rely on multiple data sources. One possible remedy is the use of measurement error correction models, which are explored in a Bayesian context in this thesis. The models are implemented in Liesel, including aspects like dependent or non-Gaussian measurement errors in a concrete application case. Since this thesis has a strong focus on programming in Python, prior programming experience in Python is an essential prerequisite.
      Contact: Hannes Riebl (hriebl@uni-goettingen.de), Paul Wiemann (pwiemann@uni-goettingen.de)
    • Title: A Python package for conducting large-scale simulation studies in parallel
      Short description: Methodological research in statistics and machine learning often requires proof-of-concept implementations of new concepts. In empirical studies, including simulation studies, the novel methods are often compared to existing approaches. A simulation study typically consists of input variables (e.g., sample sizes, estimation methods, or the numbers of covariates), a set of values for each of them, and a program or function that, given one combination of inputs, estimates a model and calculates some summary statistics. These computations are required for all combinations of the input values or a subset thereof. The goal of the thesis is to design and implement a framework in Python to facilitate this kind of study and its evaluation. An existing approach is implemented in R (see the simsalapar package and the associated paper).
      Contact: Hannes Riebl (hriebl@uni-goettingen.de), Paul Wiemann (pwiemann@uni-goettingen.de)
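
The structure described above (input variables, a set of values for each, and a function evaluated on every combination) can be sketched with the Python standard library alone. The names `expand_grid` and `run_one` are hypothetical, and the toy study function merely returns the mean of a simulated sample; a real framework would add result storage, error handling and process- or cluster-level parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random
import statistics


def expand_grid(**factors):
    """Return one dict per combination of the supplied factor values."""
    names = list(factors)
    return [dict(zip(names, values)) for values in product(*factors.values())]


def run_one(config):
    """Hypothetical study function: simulate a sample and return a summary."""
    rng = random.Random(config["seed"])
    sample = [rng.gauss(0.0, config["sd"]) for _ in range(config["n"])]
    return {**config, "mean": statistics.fmean(sample)}


grid = expand_grid(n=[20, 200], sd=[1.0, 3.0], seed=[1, 2, 3])

# Threads keep the sketch self-contained; CPU-bound studies would rather use processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, grid))
```

Each element of `results` carries its own configuration, so the output can be turned into a table and aggregated over the seeds.
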
    • Title: Investigate the performance of MCMC pre-sampling strategies
      Short description: In Bayesian statistics, Markov chain Monte Carlo (MCMC) methods are often used to estimate model parameters in complex statistical models. However, a long warmup period can be required until the MCMC procedure generates valid samples from the equilibrium distribution. Initializing the MCMC procedure with a draw from the posterior distribution can help to shorten or avoid the warmup period. Unfortunately, drawing from the posterior is usually not possible. Drawing from a distribution that is very similar to the posterior, however, might be possible and can still significantly reduce the warmup time. This thesis should study different pre-sampling approaches (e.g., the recently published pathfinder method) in generalized additive models.
      Contact: Hannes Riebl (hriebl@uni-goettingen.de), Paul Wiemann (pwiemann@uni-goettingen.de)
    • Title: Implement Black Box Variational Inference with JAX
      Short description: Variational inference (see "Variational Inference: A Review for Statisticians") is a method for approximating the posterior distribution through optimization. The authors of the paper "Black Box Variational Inference" present a "black-box" variational inference algorithm, which requires no model-specific derivations, proposing two ways of computing a low-variance gradient of the objective function. The goal of this thesis is to write an efficient implementation of the algorithm in Python using JAX, a library for just-in-time compilation and automatic differentiation.
      Contact: Gianmarco Callegher (gianmarco.callegher@uni-goettingen.de)
    • Title: Topics of Bayesian statistics in economics
      Short description: We want to explore the rise of Bayesianism and its topics in the last 20 years in the field of economics. We want to distinguish topics in which Bayesian methods were used as opposed to non-Bayesian methods by looking at a large data set of articles in economic science. Therefore, we need to develop appropriate metrics that can be used in the context of machine learning algorithms.
      Contact: Jens Lichter (jens.lichter@uni-goettingen.de)