Speaking in Tongues: Teaching Local Languages to Machines
ChatGPT[1], the charming, somewhat unreliable chatbot from OpenAI, went viral at the end of 2022, engaging people in conversations that ranged from the existential[2] to the mundane. Other generative AI tools, such as DALL-E,[3] which creates images from simple text descriptions, and Make-a-Video,[4] which generates short video clips from them, have also sparked enthusiasm, and work produced with such tools has even won art competitions.[5] Similar tools are rapidly becoming fixtures in homes, where Alexa and Siri banter with and amuse people[6] when not following commands to turn off the lights or play the current No. 1 hit on a smart speaker.
But it’s not all fun and games. If the evangelists are to be believed, the impact of these tools will soon show up in productivity data,[7] as chatbots begin to do things like write code, create public relations materials, and replace research assistants. The implications of such a productivity boost for international development are obvious.
As things stand now, however, people who don’t speak or write English, or one of the other “major” languages generally spoken in advanced economies, are out of luck when trying to access or use these tools and services. ChatGPT, for instance, largely understands the world through the eyes of English-speaking content creators. English makes up the bulk of its training data; other languages, such as Spanish, French, German, Italian, Portuguese, Dutch, Russian, Arabic, Chinese, Japanese, Korean, and Hindi, are also represented. Other tools, such as voice assistants, likewise support only a small number of the world’s languages. Currently, Google Home doesn’t support Zulu,[8] which is widely spoken in South Africa, one of the more developed markets in Africa.
It’s easy to see why this is the case. Machines and algorithms learn through exposure to a sufficiently large corpus of knowledge, which is typically available through written, video, and audio materials (e.g., books, articles, movies, and cartoons[9]), ideally online in digital format. This corpus powers the natural language processing (NLP) libraries that provide the intelligence embedded in these machines.
The unfortunate reality is that the quantity and quality of available explicit knowledge about developing countries are relatively low, and even lower in local languages.