Developing HuBERT: a Natural Language Processing algorithm for extending the Seshat Global History Databank

Abstract

This project addresses the challenges of expanding and increasing the cross-utilization of historical datasets by developing a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. Firstly, the project involves the re-organization of the Seshat Literature Repository, which contains over 8,000 academic articles and books on various cultures and civilizations. This re-organization makes the repository easily and automatically accessible to researchers, increasing the usability of the data contained within it. Secondly, the project develops an algorithm called HuBERT that partially automates document screening and data collection from archaeological and historical materials and existing databases. This helps overcome the challenges of slow data collection and error-prone manual entry, thus increasing the translatability of data across projects and enabling researchers to answer a wider range of research questions. In addition to these contributions, the project includes updates to the project website, tutorials, and other resources to make the data and algorithm accessible to a broad range of researchers across multiple disciplines.

Project

A wide range of historical and archaeological works have documented the dynamics of past complex societies from across the globe. Recent projects have focused on compiling this information across time for different societies (e.g. the Seshat and D-PLACE databases), enabling research from multiple disciplines across the Humanities and Social Sciences [1-5]. Making these databases easier to use and expanding the information they contain would allow many more research questions to be answered.

The main challenge in expanding historical datasets and increasing their cross-utilization is that data collection is slow and that translating information across projects is tedious and prone to error. It takes many hours of human work to screen the existing literature, thoroughly read selected articles, and manually record variables or re-enter the information into a different framework. The goal of this project is to develop a Natural Language Processing (NLP) algorithm that can help expand current databases and increase the translatability of data across projects. In particular, we will build upon the recent BERT language model [6] to partially automate document screening and data collection from archaeological and historical materials and existing databases.

This project is organized in three parts:

Organizing and refining labels of the repository of research articles. The Seshat team has compiled more than 8,000 academic articles and books on different cultures and civilizations. However, there is currently no structured repository that allows for easy or automatic access. Here we will build the Seshat Literature Repository by updating and organizing the current collection so that access can be automated. We will also refine references so that they link to the particular paragraph(s) from which each piece of information was inferred. This detailed referencing system allows other researchers to access the Seshat data more easily and is useful for fine-tuning NLP algorithms.
Developing a Natural Language Processing algorithm to aid data extraction. Recent developments in NLP have allowed researchers to train language models on very large text corpora, including the whole English Wikipedia [6]. Here we will develop HuBERT, a BERT model fine-tuned on the text of social science and humanities research articles in the Seshat Literature Repository. In particular, we aim for HuBERT to be able to screen research articles and pre-select those that might have information on a specific variable, and then to allow researchers to query Seshat about variables not yet defined there. In other words, HuBERT will pre-select articles and identify data suitable for reuse by future researchers.
Maintenance and expansion of documentation and tutorials. We will continue to update and document the Seshat database and the website. We will add data visualizations, documentation, and tutorials for the Seshat Literature Repository and HuBERT.
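
As a rough illustration of how paragraph-level references and article pre-selection could fit together, the following Python sketch stores each paragraph with a document identifier and position, then ranks paragraphs by their relevance to a variable. It is only a sketch under stated assumptions: the Paragraph schema, the preselect function, and the threshold value are hypothetical names chosen for this example, and the keyword-overlap scorer is a deliberately simple stand-in for the fine-tuned language model, not the HuBERT model itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Paragraph:
    """One paragraph of a repository document (hypothetical schema)."""
    doc_id: str  # illustrative document identifier
    index: int   # position of the paragraph within the document
    text: str


# A scorer maps (variable description, paragraph text) to a relevance score.
Scorer = Callable[[str, str], float]


def _words(s: str) -> set:
    """Lowercase alphabetic tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", s.lower()))


def keyword_overlap(variable: str, text: str) -> float:
    """Placeholder scorer: fraction of the variable's words found in the text.

    Assumption: a fine-tuned model would replace this function.
    """
    query = _words(variable)
    return len(query & _words(text)) / len(query) if query else 0.0


def preselect(
    variable: str,
    paragraphs: List[Paragraph],
    scorer: Scorer = keyword_overlap,
    threshold: float = 0.5,  # illustrative cut-off, not a tuned value
) -> List[Tuple[float, Paragraph]]:
    """Return (score, paragraph) pairs at or above the threshold, best first."""
    scored = [(scorer(variable, p.text), p) for p in paragraphs]
    hits = [sp for sp in scored if sp[0] >= threshold]
    return sorted(hits, key=lambda sp: sp[0], reverse=True)


# Made-up example paragraphs, for illustration only.
sample = [
    Paragraph("doc-001", 0, "The capital was linked to provincial towns by paved roads."),
    Paragraph("doc-001", 1, "Pottery styles changed little over the period."),
    Paragraph("doc-002", 3, "Roads were paved only near the capital."),
]

for score, p in preselect("paved roads", sample):
    print(p.doc_id, p.index, round(score, 2))
```

Because the scorer is passed in as a plain function, a model-based scorer could be substituted without changing how results are linked back to specific paragraphs.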

References

Botero CA, Gardner B, Kirby KR, Bulbulia J, Gavin MC, Gray RD. The ecology of religious beliefs. Proceedings of the National Academy of Sciences. 2014 Nov 25;111(47):16784-9.
Haynie HJ, Kavanagh PH, Jordan FM, Ember CR, Gray RD, Greenhill SJ, Kirby KR, Kushnick G, Low BS, Tuff T, Vilela B. Pathways to social inequality. Evolutionary Human Sciences. 2021;3.
Turchin P, Currie TE, Whitehouse H, François P, Feeney K, Mullins D, Hoyer D, Collins C, Grohmann S, Savage P, Mendel-Gleason G. Quantitative historical analysis uncovers a single dimension of complexity that structures global variation in human social organization. Proceedings of the National Academy of Sciences. 2018 Jan 9;115(2):E144-51.
Turchin P, Currie TE, Turner EA, Gavrilets S. War, space, and the evolution of Old World complex societies. Proceedings of the National Academy of Sciences. 2013 Oct 8;110(41):16384-9.
Turchin P. Arise ‘cliodynamics’. Nature. 2008 Jul;454(7200):34-5.
Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.
Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. 2019 Mar 26.
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234-40.
