Reliably exploring the presence of languages on the Internet
The Internet is a precious resource for linguists as it offers an easily accessible and broad space where they can observe the evolution of languages over time. Daniel Pimienta, Head of the Observatory of Linguistic and Cultural Diversity in the Internet (OBDILCI), has developed a method to measure the presence of languages on the Web, which has been greatly enhanced in recent years. This method informed the development of a comprehensive database that could support linguistic research, language-related public policy, and e-commerce strategies.
The advent of the Internet and its widespread use opened interesting new avenues for the study of languages. For instance, the presence of different languages online offers valuable hints about their future use and development.
Reliable estimates of the use of languages on the Internet could ultimately guide the development of public policies aimed at influencing their presence in cyberspace. The Observatory of Linguistic and Cultural Diversity in the Internet, a research institute founded in 1996, specialises in the development of effective methods to measure the presence and evolution of languages online.
While language recognition algorithms – computational tools that can identify written languages – might seem like ideal tools for determining the prevalence of languages online, the Web has now become so large that applying these tools to all online content is highly impractical. Some studies thus used these algorithms to analyse subsets of online content, yet this experimental approach has been found to be ineffective, leading to biased and often unreliable results.
Up until recently, the most consulted source of statistics related to the use of languages online (W3Techs) relied on algorithms to analyse the websites classed as the most visited. While these statistics offer some interesting insight, they might not accurately reflect the presence of languages online, because they do not account for the fact that many websites are multilingual, which introduces significant biases.
In 2017, the Observatory of Linguistic and Cultural Diversity in the Internet devised a new approach that could help to better follow the progress and prevalence of languages online. Using this approach, Daniel Pimienta and his colleagues were able to identify meaningful indicators outlining the presence of 343 languages on the Internet.
Indicators of online language presence
Back in 1998, researchers at the observatory had introduced an approach to study the presence of seven languages online, which relied on data collected by search engines such as AltaVista and Google. In 2007, however, they started realising that search engine reports were becoming unreliable, so they looked for alternative methods.
The new approach introduced in 2017 addresses the strong biases associated with previous efforts in this field of research. Initially, the researchers applied this approach to 138 languages, namely those with over 5 million native speakers, but they were recently able to extend it to 343 languages, those natively spoken by over 1 million people.
Using their proposed method, Pimienta and his colleagues compiled a set of indicators of the presence of these 343 languages online. These indicators were divided into three broad categories, namely intermediary, macro, and advanced indicators.
Intermediary indicators, all of which are expressed as percentages, include the number of internauts (ie, speakers of a given language connected to the Internet), the usage of specific Internet services or applications, and the traffic reported on websites or applications. They also include an approximation of the level of digital support available for each language and so-called indexes, which are country ratings based on Information Society parameters that are weighted into language ratings.
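As an illustration of how country ratings might be weighted into a language rating, the following minimal sketch takes a speaker-weighted average. The weighting scheme, the function name `language_rating`, and all figures are assumptions made for this example; the observatory's actual model may weight countries differently.

```python
def language_rating(country_ratings, speakers_by_country):
    """Speaker-weighted average of country ratings for one language.

    country_ratings: hypothetical rating per country (e.g. 0 to 100).
    speakers_by_country: hypothetical number of speakers of the language
    living in each country. Both inputs are illustrative assumptions.
    """
    total_speakers = sum(speakers_by_country.values())
    if total_speakers == 0:
        raise ValueError("the language has no recorded speakers")
    return sum(
        country_ratings[country] * count / total_speakers
        for country, count in speakers_by_country.items()
    )


# Invented figures: a language spoken in two countries.
ratings = {"Country A": 80.0, "Country B": 40.0}
speakers = {"Country A": 3_000_000, "Country B": 1_000_000}
rating = language_rating(ratings, speakers)
print(f"Language rating: {rating:.1f}")
```

In this sketch, three quarters of the speakers live in the higher-rated country, so the language rating sits closer to that country's rating than to the other's.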
The second set of indicators, referred to as macro-indicators or model outputs, comprises connected speakers (ie, the percentage of first- and second-language speakers globally who are connected to the Internet), the percentage of Web content in each language, content productivity (the ratio between Web content and internauts), and virtual presence (the ratio between Web content and speakers).
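The two ratios described above can be sketched in a few lines. This is an illustration only: it assumes that each quantity is expressed as a language's share of the world total, and the class name, function names, and figures are hypothetical rather than taken from the observatory's model.

```python
from dataclasses import dataclass


@dataclass
class LanguageShares:
    """Hypothetical inputs, each a percentage of the world total."""
    content: float     # share of Web content in the language
    internauts: float  # share of the world's connected speakers
    speakers: float    # share of the world's first- and second-language speakers


def content_productivity(shares: LanguageShares) -> float:
    """Ratio between Web content and internauts."""
    return shares.content / shares.internauts


def virtual_presence(shares: LanguageShares) -> float:
    """Ratio between Web content and speakers."""
    return shares.content / shares.speakers


# Invented figures for a single language.
example = LanguageShares(content=6.0, internauts=4.0, speakers=8.0)
print(content_productivity(example))  # above 1: more content than its share of internauts
print(virtual_presence(example))      # below 1: less content than its share of speakers
```

Under this reading, a value above 1 means the language holds a larger share of Web content than its share of internauts (or speakers), and a value below 1 means the opposite.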
Finally, the more advanced indicators identified by the researchers include the cyber-geography of languages or, in other words, the division of languages on the Web into geographical groups (ie, European, Asian, Arabic, American, and African) and the so-called Cyber-Globalisation indicator. This value, calculated using some of the other indicators, essentially summarises the ‘strategic advantages’ of a given language on the Internet.
A new method to study the presence of languages
The observatory’s new approach consists of indirectly approximating the relative amount of Web content per language. In doing so, it also considers crucial factors that are often ignored when describing a language’s Internet presence, but should be considered to prevent errors or biases.
Firstly, the team considers the likely existence of an ‘economic law’ relating to online communication, which links the offer (ie, Web content available in a language) with the demand (ie, number of speakers of that language who are connected to the Internet). Past findings suggest that the more speakers of a given language are connected to the Internet, the more webpages in that language tend to exist.
In addition, past research suggests that Internet users often prefer to communicate in their mother tongue when content is available in that language, yet they are happy to use their second language or languages in the absence of this content. In some cases, Internet users might also create content in their second language for economic reasons and could use translation services to do so.
A language’s presence online is also linked to the amount of Internet traffic in different places, the number of subscriptions to social networks, and the progress of different countries in terms of Internet-related services. The indicators of Internet presence created by the researchers collectively consider all these factors, thus painting a more detailed picture of how much and in what ways different languages exist online.
A readily accessible database
Using their approach, Pimienta and colleagues set out to calculate indicators of online presence for languages natively spoken by over 1 million people globally. This allowed them to compile a comprehensive database summarising the presence of these languages online, which the observatory plans to update every year.
The values they derived are very interesting, as they often do not match the values obtained by other linguistic research efforts. For instance, between 2011 and 2022, the website W3Techs presented statistics suggesting that English has steadily accounted for by far the largest share of online content (over 50%), yet the observatory’s analysis indicates that this is not the case (around 20% today).
Pimienta and his colleagues found that the languages with the most Web content are English and Chinese. Each of these languages was estimated to account for between 16 and 26% of all Web content, followed by Spanish (7–9%), Arabic, Hindi, Russian, French, and Portuguese (3–4%), Japanese, German, and Malay (2–3%), Bengali (1.5–2.3%), and Turkish, Vietnamese, Italian, Korean and Persian (0.8–1.2%). Overall, the team’s results suggest that content online has become increasingly multilingual, with the prevalence of English progressively declining from 80% back in 1998 to 20% today. Notably, this does not mean that the amount of English content has diminished over time, but rather that online content in many other languages has increased, in turn reducing the percentage of English content.
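A small worked example shows how the share of English content can fall even while the amount of English content grows. The content volumes below are invented units chosen only to reproduce the 80% and 20% shares mentioned above; they are not measurements from the observatory's database.

```python
def share(part: float, total: float) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Invented content volumes, in arbitrary units.
english_then, others_then = 80.0, 20.0
english_now, others_now = 200.0, 800.0

share_then = share(english_then, english_then + others_then)
share_now = share(english_now, english_now + others_now)

print(f"English share then: {share_then:.0f}%")
print(f"English share now: {share_now:.0f}%")
print(f"English content grew: {english_now > english_then}")
```

In this illustration, English content more than doubles, yet its share drops from 80% to 20% because content in other languages grows far faster.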
Collectively, the remaining 215 languages were found to account for 18–26% of Web content. By looking at several indicators of language presence online, the researchers’ database offers several other insights that could inform linguistic research, public policies, and e-commerce strategies.
For instance, Pimienta and colleagues found that people who speak Norwegian are the most connected to the Internet, with a staggering 98.8% of connected speakers, followed by Danish (98.7%), Swiss German (94.1%), Catalan (94.5%), and Finnish (92.8%) speakers. In addition, Japanese speakers appear to be the most virtually present, generating more content, relative to their number of connected speakers, than speakers of other languages.
Indicators related to the cyber-geography of languages also provided valuable insights. For instance, they showed that while the number of languages spoken by over 1 million people in Africa is greater than the number of languages spoken in each of the other geographical regions, African language speakers are less connected to the Internet than people speaking languages from other regions, although the recent trend is for growth.
Finally, the Cyber-Globalisation indicator emphasised the strategic advantage of speaking English and French. In other words, speaking these two languages appeared to open more future opportunities for Internet users.
Reliably exploring the Internet prevalence of languages
The recent work by the Observatory of Linguistic and Cultural Diversity in the Internet led to the production of reliable new data summarising the presence of different languages online. As Internet use continues to grow globally, this data could shed light on how the languages used by Internet users are progressing on the Web.
Many existing statistics about the Internet presence of languages have proved to be highly misleading, failing to adequately represent the extent to which languages exist online. Notably, some of these statistics have been widely used in linguistic studies and reported on media platforms, leading to further confusion and misinformation.
The observatory’s database is publicly available online and could soon be used by linguists who are exploring the presence of different languages online. Concurrently, it could also serve as a reference for policymakers, e-commerce strategists, and Internet providers, helping them to gain insight into the use of different languages on the Internet.
Personal Response
What inspired you to conduct this research?
In 1995, during the Francophonie Summit in Cotonou (Benin), France’s President Chirac painted the Internet as a 100% English-speaking realm. At the time, I was an Internet evangelist with the Network & Development Foundation (FUNREDES), and I felt that statement was incorrect and not supported by proven data. In reaction, I decided to initiate a research effort aimed at measuring the prevalence of languages on the Internet. That project matured in 1998 with the help of Union Latine and produced a series of results until 2007. It went through a difficult period between 2007 and 2017, when it was revived on the basis of a new and promising approach that made it possible to extend the range of languages processed.
How could the database of language indicators you compiled inform public policy aimed at strengthening the presence of languages online?
Today, strategies for strengthening languages must focus primarily on cyberspace due to its powerful global impact. Whatever policy you develop in whatever field, you need meaningful, reliable, and perennial indicators to define your strategy, to frequently assess the results of your actions, and to adapt interventions accordingly. The indicators for such language policies in cyberspace have been characterised for too long by biased data, which widely overestimate the presence of English and underestimate multilingualism on the Internet, thereby discouraging efforts to produce local content. This misinformation needs to stop, and we are glad to have helped in this way and to support all actions towards multilingualism on the Net.