Topics for Theses in Statistics

Bachelor Theses

  • Title: A Stochastic Frontier Analysis R Package
    Short description: In stochastic frontier analysis, much of the research is done in Stata, so only a few R packages exist for this kind of analysis. These packages are quite limited in their applicability and unwieldy to use. The task is therefore to write an R package for stochastic frontier analysis that is flexible and easy to use.
    Useful Sources: Kumbhakar, Subal C.; Wang, Hung-Jen; Horncastle, Alan P. A practitioner's guide to stochastic frontier analysis using Stata. Cambridge University Press, 2015.
    Contact: Rouven Schmidt (rouven.schmidt@uni-goettingen.de)
  • Title: Social Inequality in Education
    Short description: Social inequality in education is due to several mechanisms, e.g. effects of social origin on educational outcomes or educational decisions. Possible topics for applied theses in this regard are: effects of social origin on Mathematics/German/foreign language competencies or effects of social origin on educational decisions (at the end of primary school or upper secondary school). Methods to be applied are: linear models and generalized linear models.
    Contact: Jennifer Lorenz (jennifer.lorenz@uni-goettingen.de)
  • Title: Pro-Environmental Workplace Intention Behavior at the University
    Short description: The aim of this thesis is to identify and quantitatively assess the importance of psychosocial and organizational factors that influence employees' intentions to engage in pro-environmental behaviors at the university. Based on the theory of planned behavior, a survey is to be conducted and evaluated in order to predict intentions and to identify barriers to using alternative transportation for business trips or for travelling to work.
    Contact: Anne Berner (anne.berner@uni-goettingen.de)
  • Title: Time Series Analysis of the Electricity Data of the University of Göttingen
    Short description: The University of Göttingen collects electricity and energy consumption data in order to reduce its overall consumption. Accounting for 11% of the total electricity consumption of the city of Göttingen, the institution is one of the main consumers. The thesis comprises both an economic analysis of the university's energy consumption and an empirical analysis of the consumption data and their drivers over time. (Data are expected to be available from March.)
    Contact: Anne Berner (anne.berner@uni-goettingen.de)
  • Title: Utility Optimization over Multiple Goods under Constraints Using the Method of Lagrange Multipliers
    Short description: The method of Lagrange multipliers can be used to solve optimization problems under constraints. In the bachelor thesis, utility optimization over multiple goods is first to be set out mathematically. It is then to be illustrated with a concrete example.
    Contact: Sina Ike (sike@uni-goettingen.de)
  • Title: Overfitting and underfitting in statistics and how to avoid them
    Short description: Overfitting and underfitting are common problems in statistics (and especially in machine learning). Whereas an overfitted model performs well only on the training data set, an underfitted model performs poorly on any data set, including the training data. This thesis will examine how overfitting and underfitting arise and explore methods to avoid them.
    Contact: Sina Ike (sike@uni-goettingen.de)
  • Title: Predicting the 2021 Euro Football Champions
    Short description: This thesis aims to project the results of the European Football Championship in 2021 and to predict the winner. The modelling approaches can range from classical statistics to statistical/machine learning.
    Contact: René-M. Kruse (rene-marcel.kruse@uni-goettingen.de)
  • Title: Is rain intensity influenced by the weekend effect? A linear regression model of rain in Germany
    Short description: It is a popular belief that the weather is "bad" more frequently on weekends than on other days of the week (Umlauf, Mayr, Messner and Zeileis, 2011). The meteorological literature does report some evidence for such human-induced weekly cycles. In this thesis, we use linear regression models and rain data from Germany, ranging from 1936 to 2016, to evaluate this claim. The final model could adjust for potential seasonal and/or spatial patterns.
    Contact: Isa Marques (imarques@uni-goettingen.de)
  • Title: Modeling spatial dependence structures in spatial and spatiotemporal regressions
    Short description: The selection of a spatial weighting matrix that captures the interrelations between the cross-sectional units of a sample is a key issue in the spatial econometrics literature. Classical approaches rely on deterministic, prespecified weighting matrices that are based on fixed and observable measures of social, economic or geographical proximity, which has led to considerable criticism and increased interest in alternative approaches. During the past decade, a variety of data-driven approaches have been proposed to estimate these matrices, such as selection from candidate schemes, linear combinations or machine learning techniques. This thesis will investigate the consequences of misspecified spatial weighting matrices or contribute to developing alternative concepts.
    Contact: Miryam Merk (miryamsarah.merk@uni-goettingen.de)
  • Title: Regression and Eugenics: Were Galton, Pearson and Fisher Racists?
    Short description: The founding fathers of statistics were in many cases proponents and drivers of eugenics. This thesis is to examine the questionable statements of these three men on the subject of race and, using example data sets and quotations from their historical works, shed some light on this dark chapter of statistics. The thesis can be written in German or in English.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Care Penalty - Defining income inequality between the care and other industries
    Short description: Paid and unpaid care keep society going during the COVID-19 crisis. However, we face a shortage of nursing staff and care. This may be due to low remuneration in the care industry. Feminist literature finds a systematic income inequality between care and other industries. For an original analysis of the care penalty, the German SOEP or the Mexican ENOE dataset are possible data sources.
    Recommended literature: Folbre, Nancy, Leila Gautham, and Kristin Smith. "Essential Workers and Care Penalties in the United States." Feminist Economics (2020): 1-15.
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: Measuring time use and its implications for gender equality
    Short description: Poverty is now often measured with multidimensional approaches that go beyond income alone. Time availability is an important component in assessing poverty, as it enables and restricts any activity. Especially when analyzing gender differences, time use can shed light on persisting inequalities arising from the double work burden of women. However, time measures are rarely included in multidimensional poverty assessments. Possible topics related to this issue are: a discussion of time measures, data shortcomings, and the inclusion of time in multidimensional poverty measures. The German SOEP or the Mexican ENUT dataset are possible sources for case studies.
    Recommended literature: http://www.levyinstitute.org/research/the-levy-institute-measure-of-time-and-income-poverty
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: Variable selection for causal inference in observational data: state-of-the-art practices and propositions for GAMLSS
    Short description: In randomized controlled trials, "the gold standard" for establishing causality, covariates can be included if randomization is not perfect or to increase power. These variables are often based on some "theory of change" and are determined in the pre-analysis plan. When using observational data, statisticians usually base their variable selection on some kind of information criterion (e.g. AIC, BIC) or apply methods such as lasso or boosting. However, a good statistical model may not be a good causal model. This thesis aims at scrutinizing which approaches to variable selection can serve in a causal context when randomization is not possible. The student is expected to present and explore different existing variable selection procedures, apply them to either a linear model or GAMLSS, and think about whether and how they can be applied in a causal context.
    Contact: Maike Hohberg (mhohberg@uni-goettingen.de)
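
To give a flavour of the Lagrange-multiplier topic listed above, the following is a minimal sketch in Python. It assumes a Cobb-Douglas utility function and a single linear budget constraint; the function names, the utility specification and the numbers are illustrative assumptions and are not prescribed by the topic description.

```python
from math import isclose, prod


def cobb_douglas_demand(alphas, prices, income):
    """Maximize U(x) = prod(x_i ** alpha_i) subject to sum(p_i * x_i) = income.

    Because log is monotone, the same x maximizes sum(alpha_i * log(x_i)).
    The Lagrangian of that problem is
        L(x, lam) = sum(alpha_i * log(x_i)) + lam * (income - sum(p_i * x_i)).
    Setting dL/dx_i = 0 gives alpha_i / x_i = lam * p_i, and inserting
    x_i = alpha_i / (lam * p_i) into the budget constraint gives
    lam = sum(alpha) / income.
    """
    lam = sum(alphas) / income  # multiplier of the log-utility problem
    demand = [a / (lam * p) for a, p in zip(alphas, prices)]
    return demand, lam


def utility(x, alphas):
    """Cobb-Douglas utility of the bundle x."""
    return prod(xi ** a for xi, a in zip(x, alphas))


if __name__ == "__main__":
    # Hypothetical example: three goods, illustrative preferences and prices.
    alphas = [0.5, 0.3, 0.2]
    prices = [2.0, 1.0, 4.0]
    demand, lam = cobb_douglas_demand(alphas, prices, income=100.0)
    spent = sum(p * x for p, x in zip(prices, demand))
    print("optimal bundle:", demand)
    print("multiplier:", lam, "budget used:", spent)
    # The first-order conditions hold: alpha_i / (p_i * x_i) equals lam for all i.
    assert all(isclose(a / (p * x), lam) for a, p, x in zip(alphas, prices, demand))
```

With Cobb-Douglas preferences the solution has a closed form, which makes it easy to check the first-order conditions by hand; other utility functions would generally require a numerical solver.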

Master Theses

  • Title: Mixture density networks for paediatric reference intervals
    Short description: Reference intervals play an important role in clinical practice in deciding whether the value of a particular analyte measured on a patient can be considered normal or pathologic. As the recruitment of children for medical studies is subject to strict regulations, clean prospective data is not readily available for this age group. The aim of this master thesis is to evaluate the use of artificial neural networks (specifically, mixture density networks) to estimate age-dependent intervals from available but unlabeled laboratory databases. This will include an investigation of their performance on synthetic data in simulation studies and an application to real laboratory data in comparison with previously established methods. The thesis is addressed to all students with an interest in biostatistics/bioinformatics and is supervised in collaboration with the Department of Medical Informatics, Biometry and Epidemiology at the FAU Erlangen-Nuremberg.
    Contact: Tobias Hepp (tbs.hepp@fau.de)
  • Title: Confidence intervals for distributional regression models
    Short description: Distributional regression models, as implemented in the R package gamlss, report observed Wald standard errors for the estimated model coefficients. These lead naturally to Wald-type confidence intervals for the coefficients, which typically have lower-than-nominal coverage, particularly when the response distribution is highly skewed. In this project, the student will use simulation to compare the coverage of various alternatives to the Wald confidence intervals, including profile likelihood, bootstrap and Bayesian intervals. The simulation will be performed for a variety of response distributions. The master project will be carried out in collaboration with Mikis Stasinopoulos and Gillian Heller.
    Contact: Thomas Kneib (tkneib@uni-goettingen.de)
  • Title: The C4 copula for flexibly modelling multivariate dependence
    Short description: Copulas provide a flexible, versatile approach for modelling the dependence between random quantities, where the specification of a joint density is decomposed into the specification of marginal distributions and a copula function determining the dependence structure. While most commonly applied copula specifications involve one single dependence structure, more flexible extensions such as the C4 copula can be utilized to model more general types of dependence. This thesis will explore the C4 copula and its extensions.
    Contact: Thomas Kneib (tkneib@uni-goettingen.de)
  • Title: Multivariate conditional transformation models and their implied dependence
    Short description: Multivariate conditional transformation models allow for the construction and analysis of complex multivariate regression models. Basic forms of such models are equivalent to Gaussian copula specifications, but additional flexibility can be achieved when relaxing the specification of the multivariate transformation function. This thesis will implement such relaxations and will study their implied dependence structure.
    Contact: Thomas Kneib (tkneib@uni-goettingen.de)
  • Title: LASSO regularization and group fixed effects
    Short description: Fixed effects specifications in panel data make it possible to control for various types of unobserved heterogeneity, but considerably inflate the number of parameters to be estimated. To overcome this problem, group fixed effects approaches aim at identifying sub-groups in the data that share the same fixed effects structure. In this thesis, regularization approaches such as the fused LASSO will be investigated with respect to their ability to identify group fixed effects in panel data.
    Contact: Thomas Kneib (tkneib@uni-goettingen.de)
  • Title: A Stochastic Frontier Analysis R Package
    Short description: In stochastic frontier analysis, much of the research is done in Stata, so only a few R packages exist for this kind of analysis. These packages are quite limited in their applicability and unwieldy to use. The task is therefore to write an R package for stochastic frontier analysis that is flexible and easy to use.
    Useful Sources: Kumbhakar, Subal C.; Wang, Hung-Jen; Horncastle, Alan P. A practitioner's guide to stochastic frontier analysis using Stata. Cambridge University Press, 2015.
    Contact: Rouven Schmidt (rouven.schmidt@uni-goettingen.de)
  • Title: Non-Identifiability of the Closed Skew Normal Distribution
    Short description: The Closed Skew Normal distribution introduced by González-Farías et al. is quite popular in the stochastic frontier analysis literature for modelling the composed error term. However, Arellano-Valle and Azzalini have shown that this distribution is not identifiable due to the lack of scale constraints on the covariance matrix. The task is to evaluate how severe this problem is for applied research.
    Useful Sources: Gupta, Arjun K.; González-Farías, Graciela; Domínguez-Molina, J. Armando. A multivariate skew normal distribution. Journal of Multivariate Analysis, 2004, 89, 181-190; Arellano-Valle, Reinaldo B.; Azzalini, Adelchi. On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 2006, 33, 561-574.
    Contact: Rouven Schmidt (rouven.schmidt@uni-goettingen.de)
  • Title: Primary effects of social origin in the COVID-19 pandemic
    Short description: This applied thesis will focus on the development of students' competencies during the COVID-19 pandemic and examine whether primary effects of social origin intensified due to school closings and digital instruction. This will be evaluated using mixed models with model averaging via augmented Lagrangian.
    Contact: Jennifer Lorenz (jennifer.lorenz@uni-goettingen.de), Benjamin Säfken (benjamin.saefken@uni-goettingen.de)
  • Title: Pro-Environmental Workplace Intention Behavior at the University
    Short description: The aim of this thesis is to identify and quantitatively assess the importance of psychosocial and organizational factors that influence employees' intentions to engage in pro-environmental behaviors at the university. Based on the theory of planned behavior, a survey is to be conducted and evaluated in order to predict intentions and to identify barriers to using alternative transportation for business trips or for travelling to work.
    Contact: Anne Berner (anne.berner@uni-goettingen.de)
  • Title: Overfitting and underfitting in statistics and how to avoid them
    Short description: Overfitting and underfitting are common problems in statistics (and especially in machine learning). Whereas an overfitted model performs well only on the training data set, an underfitted model performs poorly on any data set, including the training data. This thesis will examine how overfitting and underfitting arise and explore methods to avoid them.
    Contact: Sina Ike (sike@uni-goettingen.de)
  • Title: Predicting the 2021 Euro Football Champions
    Short description: This thesis aims to project the results of the European Football Championship in 2021 and to predict the winner. The modelling approaches can range from classical statistics to statistical/machine learning.
    Contact: René-M. Kruse (rene-marcel.kruse@uni-goettingen.de)
  • Title: Deep Space Image Classification
    Short description: One of the great strengths of Deep Learning approaches is their ability to analyse and classify complex image data reliably. Among the largest open-source resources for high-quality image datasets is the Galaxy Zoo program, a crowdsourced science project where anyone can participate in classifying galaxies based on their morphological properties. This thesis aims to build a neural network, based on various deep learning concepts, that delivers a reliable categorisation framework for this complex task.
    Contact: René-M. Kruse (rene-marcel.kruse@uni-goettingen.de)
  • Title: Statistical and Deep Learning Methods to Predict Sport Outcomes based on Weather Data
    Short description: Predicting sports results has been a popular topic ever since sports betting has been around. However, various difficulties arise; in particular, sports played outdoors show larger variance due to external influences. One of the biggest influences of this kind is the weather. The goal of this thesis is to create a machine/statistical learning model that incorporates weather conditions as one of the variables in order to predict results reliably.
    Contact: René-M. Kruse (rene-marcel.kruse@uni-goettingen.de)
  • Title: Functional shrinkage with the Lasso estimator
    Short description: Recently, a new method for the shrinkage of functional effects was proposed (Shin et al. 2020, doi: 10.1080/01621459.2019.1654875). The particular innovation of this paper is the construction of a Bayesian shrinkage prior for spline-based functions such that they are shrunk towards a function from a user-specified vector space. The vector space is spanned by the observed values of the covariate or transformations of these. The task of this master thesis is to investigate how the concept of subspace shrinkage can be integrated into the framework of penalized likelihood, e.g., with a LASSO-type estimator.
    Contact: Paul Wiemann (pwiemann@uni-goettingen.de)
    Requirements: a good understanding of regression models and shrinkage
  • Title: Collecting Bikesharing data in Germany
    Short description: Bike sharing services have increased in popularity over the last decade. In many cities, the data, e.g. where bicycles are located and how they are moved, is in principle available. In Berlin, however, no historical data is publicly available. These data are needed for further research and analysis. The objective of this project is to set up a system to collect live data over a longer period of time. Initially, the interfaces of Deezer (NextBike) and Lidlbike (Call-A-Bike) will be addressed.
    Contact: Paul Wiemann (pwiemann@uni-goettingen.de)
    Requirements: advanced programming skills
  • Title: A kriging model for circular data based on independent projection of angular quantities
    Short description: Due to the circular geometry of the sample space, environmental and geophysical processes such as surface winds or waves require the reassessment of typical spatial models for non-periodic/non-circular data. Wang and Gelfand (2013) highlighted the advantages of projected Gaussian distributions for modeling circular data. However, this distribution is obtained by radial projection of bivariate distributions on the plane, and the corresponding bivariate model is hard to estimate. Alternatively, we can model independent radial projections on the plane (Wikle et al., 2001). In this thesis, we fit a kriging model for each of the independent radial projections and evaluate when independence is a reasonable assumption, preferably using Bayesian inference.
    Contact: Isa Marques (imarques@uni-goettingen.de)
  • Title: A slice sampler as an alternative to Gibbs sampling in kriging models
    Short description: The Gaussian process (GP) is a popular way to specify dependencies between random variables in spatial models. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Posterior estimation of the GP is typically performed with Gibbs sampling. Murray and Adams (2010) investigate a slice sampler for covariance hyperparameters of latent Gaussian models. In this thesis, we implement and test the performance of a slice sampler, assuming a linear-Gaussian observation model. Other distributions of the response might additionally be considered.
    Contact: Isa Marques (imarques@uni-goettingen.de)
  • Title: Modeling spatial dependence structures in spatial and spatiotemporal regressions
    Short description: The selection of a spatial weighting matrix that captures the interrelations between the cross-sectional units of a sample is a key issue in the spatial econometrics literature. Classical approaches rely on deterministic, prespecified weighting matrices that are based on fixed and observable measures of social, economic or geographical proximity, which has led to considerable criticism and increased interest in alternative approaches. During the past decade, a variety of data-driven approaches have been proposed to estimate these matrices, such as selection from candidate schemes, linear combinations or machine learning techniques. This thesis will investigate the consequences of misspecified spatial weighting matrices or contribute to developing alternative concepts.
    Contact: Miryam Merk (miryamsarah.merk@uni-goettingen.de)
  • Title: Distributional Regression with Gaussian Mixture Responses
    Short description: Distributional regression is a modern regression approach with flexible response structures, where multiple aspects of the response distributions (e.g. location, scale or shape parameters) are linked to covariates. This thesis explores distributional regression models with Gaussian mixtures as responses. It studies a model specification where both the parameters of the mixture components and the mixture weights are linked to covariates, as well as possible inference algorithms and an application to income data.
    Contact: Hannes Riebl (hriebl@uni-goettingen.de)
  • Title: Model Diagnostics for Distributional Regression with Gaussian Process Responses
    Short description: Distributional regression is a modern regression approach with flexible response structures, where multiple aspects of the response distributions (e.g. location, scale or shape parameters) are linked to covariates. This thesis explores model diagnostics (e.g. based on the idea of posterior predictive checks) for distributional regression models with Gaussian processes as responses. One particular challenge for the development of graphical or otherwise interpretable model diagnostics will be the fact that Gaussian processes are stochastic processes with a continuous domain, complicating the application of traditional approaches for low-dimensional data.
    Contact: Hannes Riebl (hriebl@uni-goettingen.de)
  • Title: Deep Learning based additive models - Structured Predictors
    Short description: Frameworks for estimating neural networks such as TensorFlow or PyTorch can be employed to estimate structured additive regression models. This allows for an interpretable representation of structured effects. Based on an existing implementation of B-splines, this approach can be extended to many different types of covariates, for instance thin plate splines, Gauss-Markov random fields, Gaussian fields, functional predictors and many more. The focus of this thesis is on implementing one or more of these in Python. Hence, programming experience in Python is essential.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Deep Learning based additive models - ADAM
    Short description: In this project the well-known ADAM algorithm is to be implemented in order to determine smoothing parameters in deep learning based additive models. Conventional (Newton) methods are usually used to estimate hyperparameters on the basis of criteria such as the generalized cross-validation criterion. In order to accelerate such estimation methods, especially in the case of large amounts of data, a (stochastic) gradient method is employed. In order to increase the robustness of these optimization methods, an ADAM algorithm is to be implemented in Python, which has proven itself in more complex optimizations. Hence programming experience in Python is necessary.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Deep Learning based additive models - Loss
    Short description: The standard loss function that is commonly used in deep learning applications is the quadratic loss. For statisticians this is an implicit distributional assumption. In many if not most cases this is not appropriate. In this thesis the impact of the naïve choice of loss function will be investigated. The bias that is induced in rather simple settings can be assessed analytically and more complex settings can be simulated.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Estimation of additive terms in structured topic models
    Short description: Topic models can be supplemented with structured components. Metadata can also be integrated into the recognition of abstract "topics". So far the possibilities of integrating terms have been limited to simple (univariate) splines. In this project, further structured terms are to be integrated into the estimation of topic models, among other things to map geographical information. These can be Gauss-Markov random fields, random effects, multi-dimensional splines, etc. The existing implementation in the stm R package should be used for the estimation.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Stepwise fixed effects model selection with cAIC
    Short description: The R package cAIC4 allows for automated stepwise selection of random effects based on the conditional Akaike information. This project is about expanding the automated stepwise selection to the fixed effects terms of a mixed model. The interaction with the random effects selection is not straightforward and different paths forward are possible. Implementation of the developed methods into the existing R package is at the core of this master thesis.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Fractal smoothing parameter selection
    Short description: The two most common smoothing parameter selection criteria (REML & GCV) can be formulated in a common framework (see Reiss & Ogden 2009). This allows for the formulation of a general criterion with a certain order k. In this formulation, REML is of order 1 while GCV is of order 2. But what would be the result for an order of, for instance, 1.5? This shall be investigated in this thesis. Furthermore, could the order also be estimated from the data?
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Degrees of freedom of the LASSO
    Short description: Numerical approximations of the degrees of freedom are straightforward to implement. In this thesis, the degrees of freedom of LASSO estimates are investigated. Extensive simulations will be conducted, especially for the case where the LASSO parameter approaches the boundary of the parameter space. With this, the model choice behavior of the LASSO can be analyzed. In addition to implementations and simulations in R, theoretical findings could be included in the thesis.
    Contact: Benjamin Säfken (bsaefke@uni-goettingen.de)
  • Title: Care Penalty - Defining income inequality between the care and other industries
    Short description: Paid and unpaid care keep society going during the COVID-19 crisis. However, we face a shortage of nursing staff and care. This may be due to low remuneration in the care industry. Feminist literature finds a systematic income inequality between care and other industries. For an original analysis of the care penalty, the German SOEP or the Mexican ENOE dataset are possible data sources.
    Recommended literature: Folbre, Nancy, Leila Gautham, and Kristin Smith. "Essential Workers and Care Penalties in the United States." Feminist Economics (2020): 1-15.
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: Measuring time use and its implications for gender equality
    Short description: Poverty is now often measured with multidimensional approaches that go beyond income alone. Time availability is an important component in assessing poverty, as it enables and restricts any activity. Especially when analyzing gender differences, time use can shed light on persisting inequalities arising from the double work burden of women. However, time measures are rarely included in multidimensional poverty assessments. Possible topics related to this issue are: a discussion of time measures, data shortcomings, and the inclusion of time in multidimensional poverty measures. The German SOEP or the Mexican ENUT dataset are possible sources for case studies.
    Recommended literature: http://www.levyinstitute.org/research/the-levy-institute-measure-of-time-and-income-poverty
    Contact: Franziska Dorn (fdorn@uni-goettingen.de)
  • Title: GAMLSS in regression discontinuity designs: methods and applications in impact evaluation
    Short description: The regression discontinuity design (RDD) is a popular tool in economics and epidemiology to establish causality with observational data when the treatment is assigned based on a cutoff rule. Individuals just below and just above the cutoff are assumed to be similar, differing only in whether they receive the treatment. Therefore, their outcomes can be compared and no further covariates are needed. However, in some situations, covariates are introduced into the RDD, for example when the groups clearly differ in an additional dimension. This thesis investigates how GAMLSS can be introduced in this context. GAMLSS have the advantage that they estimate not only the mean but the whole conditional distribution. In addition to testing the benefits of such a "conditional distributional RDD", making the approach "unconditional" is another aim of this thesis.
    Contact: Maike Hohberg (mhohberg@uni-goettingen.de)
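
The kind of coverage simulation described in the topic on confidence intervals for distributional regression models above can be sketched in a few lines. The sketch below is a toy example in Python rather than the gamlss setting of the thesis: it assumes an intercept-only exponential model (a skewed response), and the function name, sample sizes and number of replications are illustrative assumptions.

```python
import random
from math import sqrt


def wald_coverage(n, true_mean=1.0, reps=5000, z=1.96, seed=1):
    """Monte Carlo estimate of the coverage of the nominal 95% Wald interval
    for the mean of an exponential distribution, based on n observations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # expovariate takes the rate, i.e. 1 / mean.
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
        mean_hat = sum(sample) / n  # maximum likelihood estimate of the mean
        # Observed information at the MLE is n / mean_hat**2,
        # so the Wald standard error is mean_hat / sqrt(n).
        se = mean_hat / sqrt(n)
        if mean_hat - z * se <= true_mean <= mean_hat + z * se:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (10, 100):
        print(f"n = {n:3d}: estimated coverage = {wald_coverage(n):.3f}")
```

In this toy setting the estimated coverage for small samples falls below the nominal 95% and moves towards it as n grows. Alternatives such as profile likelihood or bootstrap intervals could be plugged into the same loop and compared on the same simulated samples.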