Developing HuBERT: a Natural Language Processing algorithm for extending the Seshat Global History Databank

Project lead: Maria del Rio-Chanona

Institution: Complexity Science Hub Vienna

Project duration: 01.03.2023 – 31.12.2024

This project addresses the challenges of expanding and increasing the cross-utilization of historical datasets by developing a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. Firstly, the project involves the re-organization of the Seshat Literature Repository, which contains over 8,000 academic articles and books on various cultures and civilizations. This re-organization makes the repository easily and automatically accessible to researchers, increasing the usability of the data contained within it. Secondly, the project develops an algorithm called HuBERT that partially automates document screening and data collection from archaeological and historical materials and existing databases. This helps overcome the challenges of slow data collection and error-prone manual entry, thus increasing the translatability of data across projects and enabling researchers to answer a wider range of research questions. In addition to these contributions, the project includes updates to the project website, tutorials, and other resources to make the data and algorithm accessible to a broad range of researchers across multiple disciplines.

Project

A wide range of historical and archaeological works have documented the dynamics of past complex societies across the globe. Recent projects have focused on compiling this information over time for different societies (e.g. the Seshat and D-PLACE databases), enabling research across multiple disciplines in the Humanities and Social Sciences [1-5]. Increasing the usability of these databases and expanding the information they contain would allow many more research questions to be answered.

The main challenge in expanding and increasing the cross-utilization of historical datasets is that data collection is slow and translating information across projects is tedious and error-prone. It takes many human hours to screen the existing literature, thoroughly read selected articles, and manually record variables or re-enter the information in a different framework. The goal of this project is to develop a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. In particular, we will build upon the recent BERT language model [6] to partially automate document screening and data collection from archaeological and historical materials and existing databases.

This project is organized in three parts:

  1. Organizing and refining labels of the repository of research articles.
    The Seshat team has compiled more than 8,000 academic articles and books on different cultures and civilizations. However, there is currently no structured repository that allows easy or automatic access. Here we will build the Seshat Literature Repository by updating and organizing the current repository so that access can be automated. We will also refine references so that they link to the particular paragraph(s) from which the information was inferred. This detailed referencing system allows other researchers to access the Seshat data more easily and is useful for fine-tuning NLP algorithms.
  2. Developing a Natural Language Processing model to aid data extraction.
    Recent developments in NLP have allowed researchers to train language models on the entire English Wikipedia text corpus [6]. Here we will develop HuBERT, a BERT model fine-tuned on the text of the social science and humanities research articles in the Seshat Literature Repository. In particular, we aim for HuBERT to be able to both screen research articles, pre-selecting those that might contain information on a specific variable, and allow researchers to query Seshat about variables not yet defined there; i.e. HuBERT will pre-select articles and identify data suitable for reuse by future researchers (see the sketch after this list).
  3. Maintenance and expansion of documentation and tutorials.
    We will continue to update and document the Seshat database and the website. We will add data visualizations, documentation, and tutorials for the Seshat Literature Repository and HuBERT.
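
As a forward reference for part 2, the snippet below is a minimal sketch of how a BERT checkpoint could be fine-tuned for the article-screening step, assuming text snippets from the repository have been labelled as relevant or not relevant to a target variable. The file names, column names, and hyperparameters are illustrative placeholders, not the project's final configuration.

```python
# Minimal sketch of the planned screening step, assuming snippets have been
# labelled 1 (mentions the target variable) or 0 (does not). File names,
# column names, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # starting checkpoint before domain fine-tuning

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSV files with columns "text" and "label".
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hubert-screening", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports validation loss; task-specific metrics need a compute_metrics fn
```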

Interim Outcomes

This project’s goal is to expand historical datasets and foster their cross-utilization. The first stage of the project involves curating a comprehensive and clean dataset of historical references. We are close to finalizing this stage.

Data Collection and Cleaning

Online tool for cleaning references. The first step was developing an online tool that allowed the historical research assistant team at Oxford to refine references, that is, to ensure there were no duplicate, spurious, or untraceable references. These references were compiled in a Zotero repository (see deliverables). This tool enabled the Oxford team to clean all 12,000 references.
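
For illustration, the snippet below is a minimal sketch of the kind of near-duplicate check such a cleaning tool can support, assuming the references are exported from Zotero as a CSV with Title, Author, and Publication Year columns. The file name, column names, and similarity threshold are assumptions, not a description of the actual tool.

```python
# Flag likely duplicate references in a hypothetical Zotero CSV export.
import csv
from difflib import SequenceMatcher

def normalise(ref):
    # Build a comparable key from author, year, and title (lower-cased).
    return f"{ref['Author']} {ref['Publication Year']} {ref['Title']}".lower().strip()

def similar(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() >= threshold

with open("references.csv", newline="", encoding="utf-8") as f:
    refs = list(csv.DictReader(f))

keys = [normalise(r) for r in refs]
candidates = []
# Quadratic pairwise comparison: acceptable for a one-off cleaning pass like this sketch.
for i in range(len(keys)):
    for j in range(i + 1, len(keys)):
        if similar(keys[i], keys[j]):
            candidates.append((refs[i]["Title"], refs[j]["Title"]))

print(f"{len(candidates)} possible duplicate pairs flagged for manual review")
```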

Variable Identification: After cleaning the references, we are working on linking references, PDFs, and text to the historical information. We have implemented a data processing pipeline that extracts all relevant information from the free-text representation used in the original Seshat data collection process, including the page numbers and quotes recorded in the database. Quotes from the Seshat database are then matched to their occurrence in the corresponding PDF document. An example is shown in the table below.

polity  | variable | value   | reference             | description                                                                        | quote
InGaroL | Weight   | present | Burling, Robbins 1963 | The following seems to indicate that …: ‘When women cook rice, they measure the … | When women cook rice, they measure the …

Example of a data point.
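
For illustration, the following is a minimal sketch of the quote-matching step described above, assuming PDF text is extracted with pypdf and that the best fuzzy match above a fixed threshold is taken as the quote's location. The file name, step size, and threshold are illustrative, not the pipeline's actual parameters.

```python
# Locate the page of a PDF that best matches a quote from the database.
from difflib import SequenceMatcher
from pypdf import PdfReader

def find_quote(pdf_path, quote, threshold=0.85):
    """Return (page_number, similarity) for the text window that best matches the quote."""
    reader = PdfReader(pdf_path)
    best_page, best_score = None, 0.0
    window = len(quote)
    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        # Slide a quote-sized window across the page text in coarse steps.
        for start in range(0, max(1, len(text) - window + 1), 50):
            score = SequenceMatcher(None, quote.lower(), text[start:start + window]).ratio()
            if score > best_score:
                best_page, best_score = page_number, score
    return (best_page, best_score) if best_score >= threshold else (None, best_score)

# Hypothetical file name; the quote is the (truncated) example from the table above.
page, score = find_quote("Burling_1963.pdf", "When women cook rice, they measure the")
print(page, round(score, 2))
```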

We currently have roughly 140,000 data points linking to the 12,000 references. Of these, we have fully cleaned and processed 15,000. We are improving our current algorithm, which should considerably speed up the process. We expect to finish 80% of the sample within a month.

Next Steps:
Data Processing: Prioritize finalizing the cleaning process for the outstanding variables.
Algorithm Training: After data preparation, the subsequent step will be fine-tuning a large language model, such as GPT, Llama, or BERT, that can be used for detecting historical variables. This will be considerably facilitated by recent advances in large language models, including the release of open-source models such as Llama 2.
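
As one possible preparation step for this fine-tuning, the snippet below sketches how cleaned data points could be converted into instruction-style JSONL examples of the kind commonly used to fine-tune open-source models such as Llama 2. The prompt template, field names, and output file are assumptions, not the project's final format.

```python
# Convert cleaned data points into hypothetical instruction-tuning examples.
import json

data_points = [
    {
        "polity": "InGaroL",
        "variable": "Weight",
        "value": "present",
        "quote": "When women cook rice, they measure the ...",
    },
]

with open("hubert_finetune.jsonl", "w", encoding="utf-8") as f:
    for dp in data_points:
        example = {
            "instruction": (f"Does the following passage contain evidence about the "
                            f"Seshat variable '{dp['variable']}'? Answer with the coded value."),
            "input": dp["quote"],
            "output": dp["value"],
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```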