
A Look Back at the ÖNB Labs Symposium 2024 on "Newspapers as Datasets"

On 25 and 26 November 2024, a new edition of the ÖNB Labs Symposium took place in the Oratorium of the Austrian National Library!

The event was held under the motto “Newspapers as Datasets” and comprised four panels spread over two half-days, covering a range of topics such as collections as data, in particular newspapers as datasets, artificial intelligence, and its applications in libraries. The panels featured presentations by researchers involved in international projects, current work reports from researchers, and presentations by staff of national libraries and AI labs. A summary of the presentations as well as links to the speakers’ slides can be found on the ÖNB Labs website.

Audience at the ÖNB Labs Symposium 2024

© Österreichische Nationalbibliothek


Program Summary

Christoph Steindl, head of the ONB Labs Team, welcomed all speakers and guests to the Austrian National Library and presented the general topic and the individual panels of the Symposium, before moderating the first panel.

Panel 1

Clemens Neudecker (Berlin State Library): Under the title “Newspapers as Data: What’s the News for AI and DH?,” Clemens first looked back ten years to when the Info Day of Europeana Newspapers was held in the Oratorium of the ONB. He then presented the most recent projects carried out at the Berlin State Library, such as OCR-D, Qurator, and, most recently and still ongoing, Mensch.Maschine.Kultur. Emphasizing the persistent challenge of layout recognition in the semantic analysis of historical newspapers, he noted that the library is leveraging multiple GPUs to advance progress in this area using machine learning techniques.

Sally Chambers (The British Library & DARIAH-EU): In her presentation “Towards Sustainable Workflows for Newspapers as Datasets: an Infrastructural Perspective,” Sally gave an overview of the digital humanities infrastructures in Europe (e.g. SSH Open Marketplace, Common European Data Space for Cultural Heritage, DARIAH-Campus), especially with regard to their respective roles in research on, and education through, newspapers. Using the British Library Research Repository as an example, she distinguished two types of collections as data: one provided by institutions and meant to address as many potential user groups as possible, the other built from the perspective of the individual researcher, who should be able to create tailored subsets of the data.

Sébastien Cretin (Bibliothèque Nationale de France): Continuing the topic and challenge of layout recognition in historical newspapers, Sébastien spoke about “The FINLAM Project: Outlining the State of the Art in Newspaper Segmentation.” The acronym stands for “Foundation INtegrated models for Libraries, Archives and Museums,” and the project’s goal is to create a model that can exhaustively segment and understand historical documents. A cooperation between the BnF, the LITIS Lab, and TEKLIA, the project also produces synthetic newspapers in order to test and optimize model performance. Sébastien mentioned that the team has already tried many different models; the results, however, are not yet optimal, and a quantitative analysis of model performance is planned for the near future.

Andy Stauder (Transkribus / READ co-operative): In his talk “Before the LLM Magic Happens: Clean, Controllable, Reliable Data Extraction with End-to-End ATR Models,” Andy first demonstrated the error-proneness of common large language models (LLMs) before going on to argue for a unification of layout analysis and text recognition (OCR), as realized in end-to-end ATR (Automatic Text Recognition) models. He showed that these models can handle the complex reading orders of historical newspapers, that they are trainable on midrange hardware, and that they are more predictable than LLMs. One ongoing challenge is the large page size of newspapers, which complicates the implementation of these models.

Panel 2

Antoine Doucet (Université de La Rochelle): Antoine was the project lead of the NewsEye project (2018–2022), which enabled insights into historical newspapers by semantically analyzing the data and by providing a search, analysis, and export interface built on the semantic annotations. Looking back on this project more than two years after its completion, and keeping the current state of the art in mind, he asked: “What Would NewsEye 2 Have to Achieve?” Among the many ideas he highlighted were a focus on NER (Named Entity Recognition), to address the high frequency of named entities in user queries, as well as the importance of, and recent progress in, reliable article separation (e.g. via the LIAS and STRAS methods).
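For readers unfamiliar with the NER step that such platforms rely on, here is a minimal sketch using the Hugging Face transformers pipeline; the default model and the example sentence are stand-ins for illustration, not NewsEye’s actual tooling.

```python
# Minimal NER sketch over an (invented) OCRed newspaper snippet.
# The pipeline downloads a default English NER model; a real project
# would pick a model suited to its languages and period.
from transformers import pipeline

# "aggregation_strategy" merges word pieces back into whole entity spans.
ner = pipeline("ner", aggregation_strategy="simple")

snippet = (
    "Emperor Franz Joseph received the delegation from Budapest "
    "at Schoenbrunn Palace in Vienna on Tuesday."
)

for entity in ner(snippet):
    print(f"{entity['entity_group']:>5}  {entity['word']!r}  "
          f"(score={entity['score']:.2f})")
```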

Maud Ehrmann (Ecole Polytechnique Fédérale de Lausanne) & Marten Düring (Luxembourg Centre for Contemporary and Digital History): In their presentation titled “Impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio,” Maud and Marten gave an insight into the Impresso 2 project, which aims to analyze and present not only historical newspapers but also historical radio broadcasts. As part of this international and interdisciplinary project, the team is also building the Impresso DataLab which, alongside the Impresso WebApp, will provide users with access to an API, a Python library, models, and a variety of Jupyter Notebooks to interact with enriched historical media collections. The team also identified an intriguing dilemma for GLAM Labs, uncovered through a survey: there seems to be a discrepancy between a lab’s relatively small number of users, its perceived importance, and its need to demonstrate impact. The discussion revealed that the sustainability of Jupyter Notebooks remains an issue insufficiently addressed within the project and the broader research community.
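Since the DataLab’s API and Python library were still forthcoming at the time of the Symposium, the following is only a generic sketch of how a client might query such a media-collection service; the base URL, endpoint, parameters, and response shape are hypothetical placeholders, not the Impresso API.

```python
# Hedged sketch of a REST client for a newspaper/radio collection API.
# Everything below the imports is a placeholder, not a real service.
import requests

BASE_URL = "https://example.org/api/v1"  # hypothetical endpoint
API_TOKEN = "..."                        # a personal access token would go here

def search_articles(query: str, limit: int = 10) -> list[dict]:
    """Full-text search over enriched historical media items (hypothetical)."""
    response = requests.get(
        f"{BASE_URL}/search",
        params={"q": query, "limit": limit},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["items"]  # assumed response envelope

for item in search_articles("radio broadcast"):
    print(item.get("date"), item.get("title"))
```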

Eva Pfanzelter (University of Innsbruck): Turning to broader issues in the digital humanities, Eva’s talk, “Old Challenges, New Solutions? Changing Approaches for Historical Newspaper Research,” explored, among other topics, the ethical concerns surrounding the creation and use of data sets. As a means of counteracting potential negative or harmful results, she stressed the importance of collaboration during the stage of data set curation as well as the importance of accessibility. When asked about the historian’s role in addressing biases in AI-driven research, she observed that historians often resist new technologies. However, more open-minded historians could act as mediators, bridging the gap to reach a broader audience.

Christoph Steindl & Johannes Knüchel (ONB Labs): In the final presentation of the first day, Christoph and Johannes looked back on more than five years of ONB Labs, presented new data sets, updates on existing ones (Musical Manuscripts, Planned Languages, and Papyri), and further new features (such as the glossary) implemented in 2024. They also gave an outlook on what is potentially to come in the coming months and years.

Panel 3

Tan Lu (KBR – Royal Library of Belgium): Using the collections of BelgicaPress as an example, Tan demonstrated the specific challenges of “Recognizing Front Pages of Historical Newspapers: From Deep Learning to AI Explainability.” He provided an in-depth overview of the ResNeSt model used by his project team, showcasing visualizations of representative samples in three-dimensional space and as a Gaussian mixture. He also demonstrated how modeling visual concepts reveals which parts of each page the neural network relies on to make its decisions.
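As a rough illustration of the kind of setup involved (not the KBR team’s actual code, weights, or preprocessing), a ResNeSt backbone with a two-class head can be instantiated via the timm library; the freshly initialized head would of course need fine-tuning on labelled front and non-front pages before its outputs mean anything.

```python
# Sketch: ResNeSt backbone for binary front-page classification (assumption:
# two labels, front page vs. not). Requires torch and timm.
import timm
import torch

model = timm.create_model("resnest50d", pretrained=True, num_classes=2)
model.eval()

# Stand-in for a preprocessed page scan; a real pipeline would resize and
# normalize the image to match the backbone's pretraining.
dummy_page = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    probs = torch.softmax(model(dummy_page), dim=1)

print(probs)  # meaningless until the two-class head is fine-tuned
```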

Javier de la Rosa (National Library of Norway): Javier presented the “Mímir Project,” which aims at “Evaluating the Impact of Copyrighted Materials on Generative Large Language Models for Norwegian Languages.” He explained how the project originated in a letter from Norwegian rights holders’ organizations to the government demanding compensation for the use of their materials. The project’s goal is thus to scientifically assess the value of copyrighted material within Norwegian language models. One early indication is that copyrighted material seems to improve model performance.

Simon Mayer (Austrian National Library): In his presentation entitled “Bibliotheca Eugeniana: Using Machine Learning in DH Research,” Simon discussed the recently completed Bibliotheca Eugeniana Digital project, detailing how the project team applied machine learning to rediscover books from the ONB collection that once belonged to Prince Eugene of Savoy. He also showcased prototypes of the digital edition of the handwritten catalogue and of a rich and explorable visualization of the collection within the State Hall.

Jörg Lehmann (Berlin State Library): Jörg began his presentation entitled “Intermediaries, Crafted by Trustees: Datasheets for Digital Cultural Heritage” with the observation that cultural heritage institutions enjoy a high degree of public trust. He noted that machine learning models have several downsides if used outside a specific context or in combination with data sets that reflect undesirable social biases. Datasheets, according to him, can be “intermediaries” between the spheres of cultural heritage and machine learning. To facilitate their production, Jörg and his colleagues are working on a web application, which will be released soon after funding has been secured.
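To make the idea concrete, here is a heavily abbreviated, invented illustration of the kind of information a datasheet for a digitized newspaper collection might record, loosely following the question categories of “Datasheets for Datasets” (Gebru et al.); the field names and answers are examples only, not the schema of the Berlin State Library’s planned web application.

```python
# Invented example of datasheet content for a digitized newspaper collection.
datasheet = {
    "motivation": "Digitized to preserve and open up a national newspaper heritage.",
    "composition": "Page scans with ALTO OCR full text and METS metadata.",
    "collection_process": "Scanned from microfilm over several years.",
    "preprocessing": "OCR without manual post-correction; originals retained.",
    "known_limitations": "Urban, German-language titles are over-represented.",
    "recommended_uses": "Text and data mining, layout analysis research.",
    "uses_to_avoid": "Training models without auditing for historical biases.",
}

for section, answer in datasheet.items():
    print(f"{section:>20}: {answer}")
```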

Panel 4

Christian Lendl (University of Vienna): Christian opened the panel on reports from researchers by presenting his ongoing PhD project on “The Wiener Salonblatt as a Social Network of the Habsburg Nobility.” He is investigating the social relationship network of this specific group of people over time, using the digitized newspaper as a basis. To extract structured data from the newspaper corpus, he uses custom Transkribus models. Along the way, he also traced how the role of advertisements developed over time.
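For readers curious what turning recognition output into structured data can look like in practice: Transkribus can export recognized text as PAGE XML, which is straightforward to parse. A minimal sketch, assuming a standard PAGE 2013 export with region-level text (the file name is a placeholder, and whether region-level text is present depends on the export settings):

```python
# Parse a PAGE XML export and collect the recognized text of each region.
import xml.etree.ElementTree as ET

# Standard PAGE 2013 schema namespace.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def region_texts(page_xml_path: str) -> list[str]:
    """Collect the recognized text of each text region on one page."""
    root = ET.parse(page_xml_path).getroot()
    texts = []
    for region in root.iter(f"{{{NS['pc']}}}TextRegion"):
        # Region-level TextEquiv may be absent depending on export settings.
        unicode_el = region.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            texts.append(unicode_el.text)
    return texts

print(region_texts("wiener_salonblatt_page_001.xml"))  # placeholder file name
```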

Sarah Oberbichler (Leibniz Institute of European History): In her presentation “Large-Scale Research with Historical Newspapers: A Turning Point through Generative AI,” Sarah discussed her ongoing habilitation project and the role of generative AI within it. Since her goal is to analyze a wide variety of international and multilingual historical newspapers, she stressed that corpus building remains the most challenging task before models can be trained to further support the research. Several models are currently being tested, and Sarah shared insights into her LLM-integrated workflow and the evaluation results of preliminary tests.

Nina C. Rastinger (Austrian Academy of Sciences): To conclude the Symposium, Nina presented her PhD project in a talk entitled “Love for Lists: Rediscovering an underrated newspaper text type.” She investigates a broad range of German historical newspapers from the period 1600–1850 by first identifying lists and list types, then analyzing textual patterns within them, and finally conducting a case study on automatic information extraction. Her mixed-methods approach for identifying lists combines the reuse of existing annotations, layout recognition, full-text search, word reuse detection, and close reading. Nina’s preliminary findings make it clear that periodically published lists in historical newspapers were highly varied and that they constitute both a valuable research object and a valuable source.
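One ingredient of such a mixed-methods approach, full-text search for recurring list headings, can be sketched in a few lines; the heading patterns below are illustrative guesses at typical early modern formulas, not her actual query set.

```python
# Search OCRed pages for recurring headings that signal list-like sections.
import re

# Illustrative patterns: arrival lists and "register/index" headings.
HEADING_PATTERNS = [
    r"Angekommenen?\s+(Personen|Fremde)",  # "arrived persons/strangers"
    r"Verzeichni(s|ß)",                    # "register/index"
    r"Liste\s+der",                        # "list of ..."
]
pattern = re.compile("|".join(HEADING_PATTERNS), flags=re.IGNORECASE)

def find_list_candidates(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (page_id, matched heading) pairs as candidates for close reading."""
    return [
        (page_id, match.group(0))
        for page_id, text in pages.items()
        for match in pattern.finditer(text)
    ]

# Toy example: one OCRed page containing an arrivals register.
pages = {"1703-08-08_p4": "Verzeichniß der angekommenen Personen zu Wien ..."}
print(find_list_candidates(pages))
```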

Presentations

The presentations are hosted on the ÖNB Labs GitLab platform; see labs-symposium-2024. Direct links to the speakers’ slides as PDF documents can be found on the ÖNB Labs website.
