Workshop 2: Text as Data: Computational Methods of Data Collection and Text Analysis

Target group: Beginners
Language: English
Available seats: 12

Workshop description:
In today's data-driven world, text is an invaluable source of information, and understanding how to extract insights from it is a crucial skill. This workshop provides an applied introduction to computational text analysis. You will learn how to use computational tools and techniques to analyze large volumes of text, from social media posts to news articles and parliamentary debates. We will discuss the theoretical foundations of analyzing text as data, but the main goal is to gain hands-on experience in using the statistical programming language R to collect, prepare, analyze, and visualize text data. We will cover topics such as using APIs, running NLP tools, identifying topics, and classifying text.

Some familiarity with R is expected, although no advanced knowledge is required. In case you are not yet familiar with R, I will share a brief introduction to the R infrastructure and programming language before the workshop. Please make sure to have R and RStudio Desktop installed on your laptop before the workshop. The hands-on scripts are mainly based on the R tidyverse syntax; the text-as-data applications make use of the quanteda R package (and some other packages). You are more than welcome to bring your own text data to the course.

Relevant literature:

  • Kenneth Benoit. “Text as Data: An Overview.” In: Luigi Curini and Robert Franzese (eds.), The SAGE Handbook of Research Methods in Political Science and International Relations, chapter 26. SAGE Publications, 2020.
  • Christian Baden, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures, 16(1):1–18, 2022.
  • Kohei Watanabe and Stefan Müller. “Quanteda Tutorials.” 2023.

Certificate:
To receive a certificate of active participation, you need to fulfil the following requirements:

  • Before the workshop: Read the assigned research paper and prepare a short presentation of the study.
  • After the workshop: Submit a brief application, documentation, and/or critical reflection of one of the discussed procedures, ideally applied to your own text data (approx. 5 pages).

Lecturer: Dr. Valentin Gold

Valentin Gold is a postdoctoral researcher at the Institute of Methods and Methodological Principles in the Social Sciences at the University of Göttingen. He is currently coordinating the Deliberation Laboratory, an interdisciplinary project funded by the Volkswagen Foundation that brings together social science, computational linguistics, and argument and ethos mining. His research focuses on developing and applying text-as-data procedures to various types of data.

Selected publications of Dr. Gold:

  • Annette Hautli-Janisz, Katarzyna Budzynska, Conor McKillop, Brian Plüss, Valentin Gold, and Chris Reed. Questions in argumentative dialogue. Journal of Pragmatics, 188:56–79, 2022. ISSN 0378-2166.
  • Brian Plüss, Fabian Sperrle, Valentin Gold, Mennatallah El-Assady, Annette Hautli, Katarzyna Budzynska, and Chris Reed. Augmenting Public Deliberations through Stream Argument Analytics and Visualisations. In Stefan Jänicke, Ingrid Hotz, and Shixia Liu (editors), LEVIA’18: Leipzig Symposium on Visualization in Applications, 2018.
  • Valentin Gold, Mennatallah El-Assady, Annette Hautli-Janisz, Tina Bögel, Christian Rohrdantz, Miriam Butt, Katharina Holzinger, and Daniel Keim. Visual linguistic analysis of political discussions: Measuring deliberative quality. Digital Scholarship in the Humanities, 32(1):141–158, 2017. doi: 10.1093/llc/fqv033.
  • Mennatallah El-Assady, Valentin Gold, Carmela Acevedo, Christopher Collins, and Daniel Keim. ConToVi: Multi-Party Conversation Exploration using Topic-Space Views. Computer Graphics Forum, 35(3):431–440, 2016. ISSN 1467-8659. doi: 10.1111/cgf.12919.