Developing HuBERT: a Natural Language Processing algorithm for extending the Seshat Global History Databank

Project lead: Maria del Rio-Chanona

Institution: Complexity Science Hub Vienna

Project duration: 01.03.2023 – 31.12.2024

This project addresses the challenges of expanding and increasing the cross-utilization of historical datasets by developing a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. Firstly, the project involves the re-organization of the Seshat Literature Repository, which contains over 8,000 academic articles and books on various cultures and civilizations. This re-organization makes the repository easily and automatically accessible to researchers, increasing the usability of the data contained within it. Secondly, the project develops an algorithm called HuBERT that partially automates document screening and data collection from archaeological and historical materials and existing databases. This helps overcome the challenges of slow data collection and error-prone manual entry, thus increasing the translatability of data across projects and enabling researchers to answer a wider range of research questions. In addition to these contributions, the project includes updates to the project website, tutorials, and other resources to make the data and algorithm accessible to a broad range of researchers across multiple disciplines.

Project

A wide range of historical and archaeological works have documented the dynamics of past complex societies from across the globe. Recent projects have focused on recompiling this information across time for different societies (e.g. Seshat and D-place databases), enabling research from multiple disciplines across the Humanities and Social Sciences [1-5]. Increasing the usability of these databases and increasing the information on them would allow for many more research questions to be answered.

The main challenge for expanding and increasing the cross-utilization of historical datasets is that data collection is slow and that translating information across projects is tedious and prone to error. It takes many human hours to screen through the existing literature, thoroughly read selected articles, and manually record variables or reenter the information into a different framework. The goal of this project is to develop a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. In particular, we will build upon the recent BERT language model [6] to partially automate document screening and data collection from archeological and historical materials and existing databases.

This project is organized in three parts:

  1. Organizing and refining labels of the repository of research articles.
    The Seshat team has recompiled more than 8000 academic articles and books on different cultures and civilizations. However, there is currently no structured repository that allows for easy or automatic access. Here we will build the Seshat Literature Repository by updating and organizing the current repository so that access can be automatized. We also refine references to link to the particular paragraph(s) where the information was inferred from. This detailed referencing system allows for other researchers to access the Seshat data more easily and is useful for fine tuning NLP algorithms.
  2. Developing a Natural Language Process to aid data extraction data extraction.
    Recent developments in NLP have allowed researchers to train language models with the whole English Wikipedia text corpus [6]. Here we will develop HuBERT, a BERT model fine-tuned with the text of social science and humanities research articles in the Seshat literature repository. In particular we aim for HuBERT to able to both screen research articles and pre-select those that might have information on a specific variable and then allow researchers to query Seshat about variables not yet defined there; i.e. HuBERT will do the pre-selection of articles and identify data suitable for reuse by future researchers.
  3. Maintenance and expansion of documentation and tutorials.
    We will continue to update and document the Seshat database and the website. We will add data visualizations and documentation, and tutorials for the Seshat Literature Repository and HumanBERT.

Outcomes

This project’s goal was to expand and foster the cross-utilization of historical datasets. The Seshat database has been successfully expanded so that its rich metadata is amendable to large-scale, automated analysis. This includes the links between historical facts, their source in the literature, and additional free-text explanations that together provide a unique dataset for developing and training natural language processing (NLP) solutions. Moreover, until November 2024, funding could be secured to continue the efforts to automate data extraction through large language models.

Data Collection and Cleaning

The Seshat databank was originally created with the main goal of allowing the quantitative analysis of historical data by automated computational methods. However, so far this included only the data points (i.e. historical facts) themselves, while the corresponding metadata was stored in a free text format. Hence, an online tool was developed that allowed efficient processing of this metadata, linking over 12,000 citations to Zotero database objects. Consequently, 140,000 data points in the Seshat databank could be linked to these objects. Furthermore, 50,000 unique research assistant explanations could be processed in the Seshat databank. These explanations were used to create a dataset with two tables:
The first table contains 7,000 text segments from source documents that were referenced in the explanations. The second table contains 40,000 of the research assistant descriptions that are longer than 50 characters. Both of these tables hold one or more topic labels for each of the text segments and further information on the facts and the polities to which the texts refer.  

polityvariablevaluereferencedescriptionquote
InGaroLWeightpresentBurling, Robbins 1963The following seems to indicate that …: ‘When women cook rice, they measure the … When women cook rice, they measure the …
Example of a data-point

Data release

The main result of this project is the dataset of references with text explanations that link to historical facts (i.e. data points). This dataset can be reused to train large language models, in particular as a labeled text dataset, where a text is labeled with the information that can be extracted.

The data repository can be found via Figshare: https://figshare.com/s/d6c947b89e57b89a272c

This dataset provides expert-validated facts in a specialized domain (i.e. history), and thus can serve as the basis for devloping LLMs with domain-specifix knowledge or models that are able to link information to their sources, a problem that current NLP-based solutions still struggle with. The current dataset will be updated continuously, the final version will be published via Zenodo.

Dissemination

Future directions

With the secured funding until November 2024, the team aims to i) write a Nature Scientific data paper on the NLP-Seshat dataset, where text can be inked to historical information and ii) train a large language model, such as BERT to automate the data extraction. Code will be made public and uploaded to Zenodo.