
Machine Learning for Digital Scholarly Editions

In recent years, digital methods have opened up new horizons for research. Against this background, the CLARIAH-AT Summer School 2025 will take place in Graz from 8 to 12 September 2025.

The Summer School 2025 is jointly organised by the Department of Digital Humanities (University of Graz), Know Center Graz, TU Graz and CLARIAH-AT.


Machine learning is increasingly shaping research in the Digital Humanities, offering powerful tools for analyzing and enriching textual data. Using the Python library BERTopic, participants will explore various steps of topic modeling. Building upon BERTopic’s modular architecture, students will be introduced to several essential machine learning methods, such as embedding, dimensionality reduction, and clustering. Through practical sessions, students will learn to apply these techniques to historical texts. The aim is to give non-experts a high-level practical overview of how to use the BERTopic library and the essential theory behind its modules.
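To give a feel for the stages named above, here is a minimal standard-library sketch of a topic-modeling pipeline. It is not BERTopic and not part of the course material. The corpus, the stopword list, and all function names are invented for illustration. Word-count vectors stand in for sentence embeddings, a tiny k-means stands in for the clustering step, and dimensionality reduction is left out because the toy vectors are already small. The final step ranks words per cluster with a class-based TF-IDF weighting of the form tf * log(1 + A / f), which is our reading of the weighting BERTopic uses.

```python
import math
from collections import Counter

# Invented toy corpus with two obvious themes (harvest, shipping).
DOCS = [
    "the harvest of wheat and rye was poor this year",
    "wheat prices rose after the poor harvest",
    "rye and wheat fields were flooded before harvest",
    "the ship left the harbour with a cargo of salt",
    "a storm delayed the ship and its cargo in the harbour",
    "salt cargo arrived by ship at the harbour",
]
# Hypothetical, hand-picked stopword list for this corpus only.
STOPWORDS = {"the", "of", "and", "was", "this", "after", "were",
             "before", "with", "a", "its", "in", "by", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def embed(token_lists):
    """Stand-in for sentence embeddings: L2-normalised word-count vectors."""
    vocab = sorted({w for tokens in token_lists for w in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, rounds=10):
    """Tiny k-means; `seeds` are indices of documents used as initial centroids."""
    centroids = [vectors[i][:] for i in seeds]
    labels = []
    for _ in range(rounds):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sqdist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def class_tfidf(token_lists, labels, top_n=3):
    """Rank words per cluster by tf * log(1 + A / f).

    tf = count of the word in the cluster, f = count in the whole corpus,
    A = average number of tokens per cluster.
    """
    per_class = {}
    for tokens, lab in zip(token_lists, labels):
        per_class.setdefault(lab, Counter()).update(tokens)
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    avg_len = sum(total.values()) / len(per_class)
    topics = {}
    for lab, counts in per_class.items():
        scores = {w: tf * math.log(1 + avg_len / total[w])
                  for w, tf in counts.items()}
        # Highest score first; ties broken alphabetically for determinism.
        topics[lab] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return topics


token_lists = [tokenize(d) for d in DOCS]
labels = kmeans(embed(token_lists), seeds=[0, 3])
topics = class_tfidf(token_lists, labels)
for lab, words in sorted(topics.items()):
    print(lab, words)
```

In BERTopic itself, each of these stages is a replaceable module. By default it uses sentence-transformers embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering, so the number of topics does not have to be fixed in advance as it is with the hand-picked seeds here.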

The school is intended for both students and researchers with an interest in the intersection of digital scholarly editing and machine learning. After attending the school, participants will have a basic understanding of machine learning algorithms and be able to assess their possible applications as well as their strengths and limitations. They will also be able to use BERTopic in practice on their own data.


Keynotes

Keynote 1: Tuesday, September 9, 6pm (CEST), Elisabethstraße 50b (SR 19.02)

Clemens Neudecker:
Context matters. Opportunities and challenges when working with artificial intelligence and cultural heritage data

The advances made in the field of machine learning/artificial intelligence (ML) offer a range of opportunities for libraries and digital scholarship. In projects such as Mensch.Maschine.Kultur, the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz (SBB) is developing ML technologies for a wide range of applications: from text and layout recognition and image analysis to information extraction, machine-assisted subject indexing and, last but not least, the provision of collections as data and their digital curation.

On the other hand, the historical and cultural contexts must always be taken into account when using ML technologies in combination with historical sources and cultural heritage materials. Collections digitized by libraries are heterogeneous in terms of the period covered, the perspectives, places or regions they contain and the cultural contexts in which they must be placed.

Historical documents often contain distortions that no longer correspond to today’s ethical values. While historians are trained to classify sources and apply source criticism as a methodological tool, AI systems developed by industry are primarily trained on modern texts from the Internet and cannot do this.

Drawing on SBB’s experience with machine learning and AI, this talk aims to provide insights into practical applications while raising awareness of a conscious and responsible approach to ML and cultural heritage data.

Keynote 2: Friday, September 12, 1:30pm (CEST) (online)

Ulrike Henny-Krahmer:
Machine learning and scholarly editing - a contradiction or an exciting partnership?

Traditionally, scholarly editions aim to produce a reliable text based on historical documents that can be used as a basis for further research in the respective subject area(s). Depending on the type of source, this methodologically requires a precise text comparison and a detailed examination of the nature of the underlying documents and their textual contents.

How does this fit in with machine learning methods that recognise patterns based on large amounts of data so that we can obtain models with which we can make probability-based predictions for further data? Are these approaches even compatible with each other and how can we resolve the methodological contradictions or seek connections between the methods?

The lecture will discuss these questions using the concrete example of letters from the edition of the works of the German writer Uwe Johnson (1934–1984), for which topic models were created. It will also consider how far humanities scholars, digital humanists, and computer scientists can delve into each other's domains in order to understand the respective methods. This understanding not only opens up exciting opportunities, it is also a prerequisite for the successful application of machine learning methods in the humanities.

Schedule

| Time | Monday (8 Sept.) | Tuesday (9 Sept.) | Wednesday (10 Sept.) | Thursday (11 Sept.) | Friday (12 Sept.) |
| --- | --- | --- | --- | --- | --- |
| 8:30 - 9:00 | Registration | | | | |
| 9:00 - 10:30 | Welcome and setup (Georg Vogeler, Walter Scholger) (Roman Bleier, Martina Scholger) | Embeddings (Michael Jantscher) | Clustering (Max Toller) | Tokenization and weighting (Klara Venglarova) | Experiments |
| 10:30 - 11:00 | Coffee break | Coffee break | Coffee break | Coffee break | Coffee break |
| 11:00 - 12:30 | BERTopic: overview and example (Selina Galka) | Embeddings (Michael Jantscher) | Clustering (Max Toller) | Topic finetuning (Lucija Brozić) | Machine learning and DSE wrap up (Sarah Lang) |
| 12:30 - 13:30 | Lunch | Lunch | Poster Session | Lunch | Lunch |
| 13:30 - 15:00 | Introduction to Python | Dimensionality reduction (Bernhard Geiger) | Excursion: | Build your BERTopic pipeline (Roman Bleier, Martina Scholger) | Keynote Ulrike Henny-Krahmer (online) |
| 15:00 - 15:30 | Coffee break | Coffee break | “Buschenschank” | Coffee break | Goodbye coffee |
| 15:30 - 17:00 | Prepare a dataset (Roman Bleier, Martina Scholger) | Dimensionality reduction (Bernhard Geiger) | | Experiments (Michael Otto) | |
| 18:00 | | Keynote | Back in Graz at approx. 21:30 | | |

More detailed information, including registration and details about the tutors and keynote speakers, is available on the dedicated Summer School website:

Summer School: ML for DSE
