History and archaeology are often perceived to be non-quantitative fields, the purview of bespectacled technophobes languishing among the dusty tomes of dark library aisles or covered in sand, trowel in hand. This could not be further from the truth for many modern-day practitioners. Recent years have seen the application of cutting-edge techniques to various fields in the humanities, including artificial intelligence, machine learning, and natural language processing. One such field is Assyriology, the study of texts discovered in the ancient Near East and written in the cuneiform script on clay tablets.
Cuneiform is one of the earliest known writing systems in the world and was used to document millennia of human civilisations. Hundreds of thousands of cuneiform texts were found in the 19th and 20th centuries, most of which are written in the ancient Akkadian language. Sadly, countless tablets sit in museums and collections where they remain unread and unpublished, despite being of vital importance to world history and world heritage. One of the greatest obstacles for Assyriology, if not the greatest, continues to be the extremely limited number of people sufficiently skilled to access and interpret the data and make it accessible to the rest of the world. Traditional methods have not yet found a solution to this problem.
At Ariel University in Israel, Dr Shai Gordin and his collaborators are tackling this issue head-on, using ultra-modern techniques to reveal the ancient history of the Near East. As the founder and director of the Digital Pasts Lab (DigPast-Lab), and with funding from the Israeli Ministry of Science and Technology, and a joint LMU Munich and Tel Aviv University grant, Dr Gordin is working to create a human–machine interface for the development of the digital humanities.
Cuneiform and Akkadian
Cuneiform is one of the world’s earliest known writing systems. It was developed in Mesopotamia, in what is now Iraq, around the late fourth millennium BCE (before common era, equivalent to ‘Before Christ’). This area was the centre of the Akkadian Empire (focused on the city of Akkad) in the late third millennium BCE, and later of the Assyrian (to the north) and Babylonian (to the south) empires in the first millennium BCE. The cuneiform system comprises approximately 900 characters, each of which can represent a syllable within a word or act as a symbol for a whole word. The characters were formed by pressing cut pieces of reed into a tablet of moist clay, which was then allowed to dry. The impressions form wedge shapes, which give the system its modern name (the Latin for ‘wedge’ is ‘cuneus’).
Cuneiform is one of the earliest known writing systems in the world and was used to document millennia of human civilisations.
It was the system of writing for a number of languages, principal among which was Akkadian (and its dialects), and was used to record 2,500 years of human history across the Near East. Akkadian also served as a lingua franca across the ancient Near East in the later second millennium BCE, used as the language of diplomacy between the great empires of the time. In the first millennium BCE, Akkadian was gradually replaced by Aramaic, which uses an alphabetic system of writing. Today, approximately 600,000 cuneiform tablets of Akkadian text have been identified, accounting for more than 10 million words.
The Babylonian Engine project, led by Dr Shai Gordin, aims to bring Assyriology into the 21st century. Through collaboration with computer scientists, the project team has developed two tools to help decipher Akkadian cuneiform texts: ‘Atrahasis’, which fills missing gaps in the text, and ‘Akkademia’, which provides automated transliteration (conversion from one writing system to another; that is, from cuneiform to the Latin alphabet) and segmentation (joining individual signs into words).
Atrahasis: Filling the gaps
Clay tablets that have been exposed to the elements have often deteriorated, becoming eroded, brittle, flaked, or fragmented. As a result, anything from individual characters to whole sections of text can be missing. Until now, reconstructing these characters has been a painstaking and time-consuming manual process, requiring expertise in both cuneiform and Akkadian.
Atrahasis is one of the names for the Babylonian Noah, the sole human survivor of an ancient flood, and his name literally means “beyond wisdom”. Dr Shai Gordin explains, “We thought it an appropriate name for our model, which uses recurrent neural networks (RNN), a type of artificial intelligence, to reconstruct gaps in Akkadian texts”. Neural networks are computer systems in which multiple nodes form a network; RNNs are a special form designed to predict outcomes based on sequential data. Currently, the model requires the input text to be transliterated into the Latin alphabet (see more on that below), as automated character recognition of cuneiform remains in its infancy.
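Atrahasis itself is a neural network trained on real transliterations, but the fill-in-the-blank task it performs can be illustrated with a far simpler stand-in: a bigram model that ranks candidates for a gap based on the word immediately before it. The sketch below uses an invented toy ‘corpus’ (the words and sentences are not real Akkadian), not the project’s actual code or data.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words tend to follow it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            following[prev][nxt] += 1
    return following

def predict_missing(model, prev_word, top_k=3):
    """Rank candidate words for a gap, given the preceding word."""
    return [word for word, _ in model[prev_word].most_common(top_k)]

# Toy 'corpus' of transliterated lines (invented, not real Akkadian;
# 'pn' stands in for a personal name)
corpus = [
    ["pn", "gave", "silver", "to", "pn"],
    ["pn", "gave", "barley", "to", "pn"],
    ["pn", "gave", "silver", "to", "the", "temple"],
]

model = train_bigram_model(corpus)
print(predict_missing(model, "gave"))  # → ['silver', 'barley']
```

A real RNN conditions on the whole preceding sequence rather than a single word, which is what lets it capture the longer-range structure of these formulaic documents.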
Similar language modelling for English texts is simplified by the sheer volume of text available to train the model; for Akkadian, only a small volume of digital text is currently available. Atrahasis was trained using 1,400 transliterated Late Babylonian texts (539 to 331 BCE). Though they provide only a small volume of data, these tablets are suitable for model training because they comprise official documents such as legal proceedings and contracts. They are generally short, highly structured, and follow simple grammatical conventions – tedious for a human to read perhaps, but perfect for algorithms to analyse. These training materials contained a total of 220,926 words, most of which were repeated multiple times (ie, they were made up of 1,549 unique words: only 3,175 words appeared just once and only 923 words appeared twice). These numbers are very small compared with English-language datasets used for similar purposes, which can contain more than a million total words and 10,000 unique words.
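Vocabulary statistics of the kind quoted above (total words, unique words, words appearing once or twice) are straightforward to compute from a tokenised corpus. A minimal sketch, using an invented toy token stream rather than the real training data:

```python
from collections import Counter

def vocab_stats(tokens):
    """Summarise a corpus: total words, unique words, and how many
    unique words appear exactly once or exactly twice."""
    counts = Counter(tokens)
    return {
        "total": len(tokens),
        "unique": len(counts),
        "once": sum(1 for c in counts.values() if c == 1),
        "twice": sum(1 for c in counts.values() if c == 2),
    }

# Invented toy token stream, not the real Late Babylonian corpus
tokens = "a b a c d d b a e".split()
print(vocab_stats(tokens))  # → {'total': 9, 'unique': 5, 'once': 2, 'twice': 2}
```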
The team used two main tests to assess the accuracy of the tool. First, they took random sentences, removed one or more words, and then tried to predict the missing word using the Atrahasis algorithm. The algorithm predicted the correct word 85% of the time; moreover, the correct word was among the top ten suggestions 94% of the time. Accuracy dropped when more words were removed, but not to the extent that the tool became useless. The team then created a multiple-choice test for the algorithm. For each of 52 sentences with one word missing, the model was offered four possible solutions, only one of which was correct. Of the three wrong options, one was semantically incorrect (ie, the words didn’t make sense), one was syntactically incorrect (ie, the structure didn’t make sense), and one was incorrect in both respects.
The model picked the correct option more than 88% of the time. Interestingly, as the team had hypothesised, the most commonly chosen wrong answer was the semantically incorrect option. This means the algorithm is particularly good at recreating sentence structures and grammar, and better than hoped at predicting semantics. This provides a good complement to human-based skills, which are often the opposite way around.
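Both evaluations reduce to standard measures over ranked predictions: top-1 accuracy for predicting the exact word, and top-k accuracy for the true word appearing among the model’s best k suggestions. A minimal sketch, with invented predictions standing in for model output:

```python
def top_k_accuracy(ranked_predictions, answers, k):
    """Fraction of gaps where the true word is among the model's
    top-k ranked suggestions (k=1 gives plain accuracy)."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, answers)
               if truth in ranked[:k])
    return hits / len(answers)

# Invented example: each inner list is ranked best-first
preds = [["silver", "barley"], ["temple", "house"], ["gave", "took"]]
truth = ["silver", "house", "sold"]
print(top_k_accuracy(preds, truth, k=1))  # one of three gaps exact → ~0.33
print(top_k_accuracy(preds, truth, k=2))  # two of three in top two → ~0.67
```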
Akkademia: transliteration and segmentation
The second tool developed by Dr Gordin and his colleagues focuses on an earlier step in the process—the sign-by-sign transliteration and segmentation of cuneiform characters that have been extracted from the original tablets.
The transliteration of cuneiform script is particularly challenging as most signs can have multiple uses; a sign can represent (1) a word, (2) a syllable, or (3) a determinative – a clue to help correctly read the signs that make up the following word (such as the sign for “wood” appearing before objects made of wood). To train the algorithm, the team used royal inscriptions of the Neo-Assyrian Period, comprising 23,526 lines of text, each of which was treated as one sample. Of the samples, 80% were used for training, 10% for validation, and 10% for testing the algorithm. The results showed that Akkademia can reach 97% accuracy in transliterating cuneiform characters into correctly segmented words in the Latin alphabet.
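The 80/10/10 division described above is a conventional train/validation/test split. A minimal sketch of how such a split might be made (the shuffling and fixed seed are assumptions for illustration, not the team’s actual procedure):

```python
import random

def split_samples(samples, train=0.8, valid=0.1, seed=0):
    """Shuffle the samples and split them into training, validation,
    and test sets (defaults give an 80/10/10 split)."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

lines = [f"line-{i}" for i in range(100)]  # stand-ins for lines of text
tr, va, te = split_samples(lines)
print(len(tr), len(va), len(te))  # → 80 10 10
```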
The greatest obstacle for Assyriology continues to be the extremely limited number of people sufficiently skilled to access and interpret the data.
A second test examined the efficacy of the model on four texts chosen from outside the original training set. Although these naturally did not reach the same accuracy level, the results were still very high, considering the temporal distances between the different texts. Furthermore, a qualitative analysis of these results showed that many of the inaccuracies are predictable and therefore easy to correct with minimal human intervention, or are not inaccuracies at all, but rather readings open to interpretation. This marks a huge step forward in the use of machine learning in analysing Akkadian texts.
A new partnership
Models such as Atrahasis and Akkademia show huge potential to revolutionise the digital humanities. They by no means replace the need for humans, but instead complement our own analytical skills. While completely automated analysis remains some way off, the models are invaluable tools to help with text restoration. Both models show impressive accuracy which, with more training materials, can be improved still further. The quicker and more accurately we can decipher our cuneiform documents, the more we will understand about the ancient Near East, one of the so-called cradles of civilisation.
The code for these algorithms has been made freely available online via the Digital Pasts Lab. Dr Gordin and his team hope it will offer humanities researchers – and ultimately the wider public – new opportunities to more completely document and better understand our world heritage.
For more information, visit Digital Pasts Lab https://digitalpasts.github.io
What are the implications of your algorithms for our understanding of other ancient writing systems?
Akkadian and cuneiform are an excellent case study of the capabilities of advanced machine learning (ML) and artificial intelligence (AI) models on low-resource languages: languages for which only a limited, sometimes fixed, amount of data is available, whether because of a lack of digitised texts or simply the small number of texts preserved – something that is often the case with ancient languages and writing systems. It is well known that ML and AI models achieve astounding results when data availability is not a problem (think of the many AI models for English). Our research shows that good results can be achieved even with scarce data, when the methods and data are chosen carefully to suit the problem at hand.