Machine Learning for Digital Scholarly Editions
In recent years, digital methods have opened up new horizons for research. Against this background, the CLARIAH-AT Summer School 2025 will take place in Graz from 8 to 12 September 2025.
The Summer School 2025 is jointly organised by the Department of Digital Humanities (University of Graz), Know Center Graz, TU Graz and CLARIAH-AT.
Machine learning is increasingly shaping research in the Digital Humanities, offering powerful tools for analyzing and enriching textual data. Using the Python library BERTopic, participants will explore various steps of topic modeling. Building upon BERTopic’s modular architecture, students will be introduced to several essential machine learning methods, such as embedding, dimensionality reduction, and clustering. Through practical sessions, students will learn to apply these techniques to historical texts. The aim is to give non-experts a high-level practical overview of how to use the BERTopic library and the essential theory behind its modules.
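The three stages named above (embedding, dimensionality reduction, clustering) can be illustrated with a deliberately tiny sketch. It does not use BERTopic or any of its components. It relies only on the Python standard library, and every function name, the stopword list and the four example sentences are made up for illustration. Word counts stand in for a neural embedding, keeping the highest-variance columns stands in for a real reduction method, and a small k-means stands in for the clustering step.

```python
from collections import Counter

# Hypothetical mini-corpus; a real session would use historical texts.
docs = [
    "the king granted land to the abbey",
    "the abbey received land from the king",
    "the ship sailed from the harbour with grain",
    "grain was loaded on the ship in the harbour",
]
STOPWORDS = {"the", "to", "from", "with", "was", "on", "in"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


# Step 1, embedding (toy stand-in): one word-count vector per document.
vocab = sorted({w for d in docs for w in tokenize(d)})


def embed(text):
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]


# Step 2, dimensionality reduction (toy stand-in): keep the k columns
# with the highest variance across documents.
def reduce_dims(vectors, k):
    n = len(vectors)

    def variance(j):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n

    keep = sorted(range(len(vectors[0])), key=lambda j: (-variance(j), j))[:k]
    keep.sort()
    return [[v[j] for j in keep] for v in vectors]


# Step 3, clustering (toy stand-in): k-means with a deterministic
# farthest-point initialisation.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
        ]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vectors = [embed(d) for d in docs]
reduced = reduce_dims(vectors, k=4)
labels = kmeans(reduced, k=2)

# Describe each cluster by its most frequent words (a crude topic label).
topic_words = {}
for label in sorted(set(labels)):
    counts = Counter(
        w for d, lab in zip(docs, labels) if lab == label for w in tokenize(d)
    )
    topic_words[label] = [w for w, _ in counts.most_common(3)]

print(labels)
print(topic_words)
```

Each stage is a separate function so that one step can be swapped without touching the others, which mirrors the modular idea described above on a very small scale.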
The school is intended for students and researchers with an interest in the intersection of digital scholarly editing and machine learning. After attending the school, participants will have a basic understanding of machine learning algorithms and be able to assess their possible applications, strengths and limitations. They will also be able to use BERTopic on their own data.
Keynotes
Keynote 1: Tuesday, September 9, 6pm (CEST), Elisabethstraße 50b (SR 19.02)
Clemens Neudecker:
Context matters. Opportunities and challenges when working with artificial intelligence and cultural heritage data
The advances made in the field of machine learning/artificial intelligence (ML) offer a range of opportunities for libraries and digital scholarship. In projects such as Mensch.Maschine.Kultur, the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz (SBB) is developing ML technologies for a wide range of applications: from text and layout recognition and image analysis to information extraction, machine-assisted subject indexing and, last but not least, the provision of collections as data and their digital curation.
On the other hand, the historical and cultural contexts must always be taken into account when using ML technologies in combination with historical sources and cultural heritage materials. Collections digitized by libraries are heterogeneous in terms of the period covered, the perspectives, places or regions they contain and the cultural contexts in which they must be placed.
Historical documents often contain distortions that no longer correspond to today’s ethical values. While historians are trained to classify sources and apply source criticism as a methodological tool, AI systems developed by industry are primarily trained on modern texts from the Internet and cannot do this.
Using the example of SBB’s experience with machine learning and AI, this talk aims to provide insights into practical applications while at the same time raising awareness for a conscious and responsible approach to ML and cultural heritage data.
Keynote 2: Friday, September 12, 1:30pm (CEST) (online)
Ulrike Henny-Krahmer:
Machine learning and scholarly editing - a contradiction or an exciting partnership?
Traditionally, scholarly editions aim to produce a reliable text based on historical documents that can be used as a basis for further research in the respective subject area(s). Depending on the type of source, this methodologically requires a precise text comparison and a detailed examination of the nature of the underlying documents and their textual contents.
How does this fit in with machine learning methods that recognise patterns based on large amounts of data so that we can obtain models with which we can make probability-based predictions for further data? Are these approaches even compatible with each other and how can we resolve the methodological contradictions or seek connections between the methods?
The lecture will discuss these questions using the concrete example of letters from the edition of the works of the German writer Uwe Johnson (1934–1984), for which topic models were created. It will also consider how far humanities scholars, digital humanists, and computer scientists can delve into each other's domains in order to understand the respective methods. This understanding not only provides exciting opportunities, it is also a prerequisite for the successful application of machine learning methods in the humanities.
Schedule
| Time | Monday (8 Sept.) | Tuesday (9 Sept.) | Wednesday (10 Sept.) | Thursday (11 Sept.) | Friday (12 Sept.) |
|---|---|---|---|---|---|
| 8:30 - 9:00 | Registration | | | | |
| 9:00 - 10:30 | Welcome and setup (Georg Vogeler, Walter Scholger) (Roman Bleier, Martina Scholger) | Embeddings (Michael Jantscher) | Clustering (Max Toller) | Tokenization and weighting (Klara Venglarova) | Experiments |
| 10:30 - 11:00 | Coffee break | Coffee break | Coffee break | Coffee break | Coffee break |
| 11:00 - 12:30 | BERTopic: overview and example (Selina Galka) | Embeddings (Michael Jantscher) | Clustering (Max Toller) | Topic finetuning (Lucija Brozić) | Machine learning and DSE wrap up (Sarah Lang) |
| 12:30 - 13:30 | Lunch | Lunch | Poster session | Lunch | Lunch |
| 13:30 - 15:00 | Introduction to Python | Dimensionality reduction (Bernhard Geiger) | Excursion: | Build your BERTopic pipeline (Roman Bleier, Martina Scholger) | Keynote Ulrike Henny-Krahmer (online) |
| 15:00 - 15:30 | Coffee break | Coffee break | "Buschenschank" | Coffee break | Goodbye coffee |
| 15:30 - 17:00 | Prepare a dataset (Roman Bleier, Martina Scholger) | Dimensionality reduction (Bernhard Geiger) | | Experiments (Michael Otto) | |
| 18:00 | | Keynote | Return to Graz at approx. 21:30 | | |
More detailed information, including registration and details about the tutors and keynote speakers, is available on the dedicated Summer School website:
Summer School: ML for DSE
Related links:
- Call for Participation
- The Summer School was funded by CLARIAH-AT within the Funding Call 2024 - visit the project entry here.