LadderWeb: A pragmatically annotated web-based corpus query interface for requests and cancellations in Italian L1 and L2

Project lead: Nicola Brocca

Institution: Department of subject-specific Education/Unit of Language-Education, University of Innsbruck

Project duration: 01.04.2023 – 31.03.2024

Corpus-based research in pragmatics has been limited by the lack of adequate corpora. This is particularly true for languages other than English and for openly available data. However, the production of pragmatically annotated corpora could greatly facilitate research on the interaction of different language levels and enable their use in more applied areas such as language teaching, textbook production or the training of language professionals.

Pragmatic annotation poses many challenges, such as encoding the meaning of utterances that depend on extratextual context and are dominated by implicit language. Another bottleneck is the annotation criteria, which can be ambiguous, especially when considering socio-pragmatic aspects in an ecological setting. The ideal scenario is to have robust pragmatic annotation tools.

With the Ladder and DisDir corpora, new strategies have been developed to overcome these obstacles. The data are restricted to experimental settings in which extratextual variables are controlled by standardised Discourse Completion Tasks (DCTs). Because only specific speech acts were selected, the applied taxonomy allows utterances to be coded without having to detect implicit meaning. The annotation is automatically enriched by a machine learning application, which will form the basis for semi-automatic annotation.

The final product will provide an online platform that allows linguists to search the aforementioned corpora for speech acts of requests and cancellations in different languages and varieties.

Keywords: TEI, foreign language education, corpus pragmatics, corpus query processor, IMS Open Corpus Workbench

(Interim) Outcomes

The data sets were cleaned and the existing annotation was checked in order to prepare the data for AI training and automatic annotation. In addition, the verbal data were transcribed and linked to metadata.
A workflow for the training of language models and the subsequent automatic annotation of example sentences has been developed. We are currently testing various tagging algorithms and checking the quality of their results.
With training data on the order of hundreds of data sets, the results are not yet satisfactory, so manual post-processing is still required. We are confident that the quality of the automatic annotation will steadily improve through semi-automated annotation and the addition of further data sets to the training corpus.
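
The interplay of automatic annotation and manual post-processing can be illustrated with a minimal sketch. It is not the project's actual pipeline: the `Prediction` class, the `triage` function, the confidence scores, the threshold of 0.9 and the example utterances are all assumptions made for illustration. The idea shown is only that automatically assigned tags with a high model score are accepted, while the rest are queued for a human annotator.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One automatically annotated utterance (hypothetical structure)."""
    utterance: str
    label: str         # assumed tag set, e.g. "request" or "cancellation"
    confidence: float  # assumed model score in [0, 1]


def triage(predictions, threshold=0.9):
    """Split predictions into auto-accepted tags and a manual-review queue.

    The threshold is an illustrative value, not one used by the project.
    """
    accepted, review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


# Invented example utterances with invented scores.
sample = [
    Prediction("Potresti prestarmi gli appunti?", "request", 0.97),
    Prediction("Purtroppo domani non riesco a venire.", "cancellation", 0.62),
]

accepted, review = triage(sample)
print(f"accepted: {len(accepted)}, manual review: {len(review)}")
```

Manually corrected items from the review queue could then be added back to the training corpus, which is one way the gradual improvement described above might be realised.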

These data will be made publicly available to students, teachers and researchers via LadderWeb, a web application. Other data sets, which were excluded from training because too few files exist for the languages concerned, are currently being cleaned and will be archived without annotation alongside the annotated data.

Links

LADDER (predecessor project) description: https://ladder.hypotheses.org/1