Linguistics in Göttingen - A platform for empirical and theoretical linguistics

Annelen Brunner (Mannheim): Automatic recognition of speech, thought and writing representation in German narrative texts

This talk presents a project that explored ways to automatically recognize and classify a narrative feature, namely speech, thought and writing representation (ST&WR), using surface information and methods of computational linguistics. The task was to detect and distinguish four types: direct ST&WR (She said: "I am hungry."), free indirect ST&WR (Well, where on earth could she get something to eat now?), indirect ST&WR (She said she was hungry.) and reported ST&WR (She talked about lunch.).
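
To illustrate what recognition from surface information can look like, here is a minimal rule-based sketch in Python. It is not the project's implementation: the verb list, the patterns and the function name classify are assumptions made for this example, and it uses the English example sentences above rather than German data.

```python
import re

# Hypothetical, deliberately tiny list of reporting verbs (an assumption for
# this sketch); a real system would need a much larger lexicon.
REPORTING_VERBS = r"(?:said|thought|wrote|asked|replied)"

# Direct ST&WR: any stretch of text enclosed in double quotation marks.
DIRECT = re.compile(r'"[^"]+"')

# Indirect ST&WR: a reporting verb followed by an optional "that" and a pronoun.
INDIRECT = re.compile(
    rf"\b{REPORTING_VERBS}\b\s+(?:that\s+)?(?:he|she|they|it|I|we|you)\b",
    re.IGNORECASE,
)


def classify(sentence):
    """Return a coarse ST&WR label based on surface patterns only."""
    if DIRECT.search(sentence):
        return "direct"
    if INDIRECT.search(sentence):
        return "indirect"
    return "other"


examples = [
    'She said: "I am hungry."',
    "She said she was hungry.",
    "She talked about lunch.",
    "Well, where on earth could she get something to eat now?",
]
labels = [classify(s) for s in examples]
```

In this sketch the reported and free indirect examples both fall through to "other", because neither carries a surface marker that the two patterns can latch onto.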

The research was based on a corpus of manually annotated German narrative texts (about 57 000 tokens). Rule-based as well as machine learning methods were tested and compared. The results were best for recognizing direct ST&WR (best F1 score: 0.87), followed by indirect (0.71), reported (0.58) and finally free indirect ST&WR (0.40). The rule-based approach worked best for ST&WR types with clear patterns, like indirect and marked direct ST&WR, and often gave the most accurate results. Machine learning was most successful for types without clear indicators, like free indirect ST&WR, and proved more stable. When looking at the percentage of ST&WR in a text, the results of machine learning methods always correlated best with the results of manual annotation.
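
For readers unfamiliar with the measures mentioned above, the following Python sketch shows how an F1 score for one ST&WR type and the percentage of that type in a text can be computed from per-token labels. The token-level setup, the toy label sequences and the function names are assumptions made for this example; the abstract does not state the unit on which the project's scores were calculated.

```python
def f1_score(gold, predicted, label):
    """F1 for one label, given two equally long sequences of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def share(labels, label):
    """Fraction of units in a text that carry the given label."""
    return labels.count(label) / len(labels)


# Toy data, invented for illustration only (not taken from the corpus).
gold = ["direct", "direct", "none", "indirect", "none", "direct"]
pred = ["direct", "none", "none", "indirect", "direct", "direct"]

f1_direct = f1_score(gold, pred, "direct")
gold_share = share(gold, "direct")
pred_share = share(pred, "direct")
```

The example also shows why the two views can diverge: here the predicted share of direct ST&WR equals the gold share even though the F1 score is well below 1, because one missed token and one spurious token cancel out in the percentage.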

The talk gives some detail about the methods used and addresses difficulties and ideas for further developments.