Discovered: Dec 6, 2025 23:15 (UTC) ME:: go ‘Te Hiku, Apertus and other smaller models for preserving small languages’ go :-) ! ; Ethan Zuckerman:: Gramsci’s Nightmare: AI, Platform Power and the Automation of Cultural Hegemony

QUOTE:

I’m inspired by a project in New Zealand launched by a cultural organization called Te Hiku Media. The CEO of Te Hiku, Peter-Lucas Jones, is Maori and the organization has worked since 1990 to put voices in te reo Maori, the Maori language, on the radio, ending decades during which the New Zealand government suppressed the teaching and speaking of the language. A few years ago, he asked his husband, Keoni Mahelona, a native Hawaiian and a polymathic scholar and technologist, to help the organization build a website. The two ended up realizing that the three decades of recorded Maori speech in the radio archives represented a cultural heritage that likely existed nowhere else.

That archive of spoken te reo Maori becomes more powerful if it’s indexed, and that requires transcribing many thousands of hours of tape, or building a speech to text model. Mahelona visited with gatherings of Maori elders and explained how machine learning models worked, what a speech to text model might make possible and got buy-in and consent from the broader community. That allowed Te Hiku to recruit a truly amazing number of participants into the project of recording snippets of spoken Maori. Over ten days, 2500 speakers recorded 300 hours of the language, creating 200,000 labeled snippets of speech. Jones and Mahelona relied on a set of cultural institutions to accomplish this – they held a contest between traditional canoe racing teams to see which could record the most phrases. They ended up with training data that powers a language transcription model that is 92% accurate… and they’ve got engaged volunteers who could work to correct transcription errors and add data.

This Maori ML project is already being used to power a language learning application, similar to DuoLingo, but using tools built by the community rather than extracted from them. Young Maori speakers are able to check their pronunciation against a database of voices of elders who’ve worked to keep the language alive. And the 30+ years of audio that Te Hiku has collected are now both an indexable archive, but also potentially the corpus for a small LLM built around Maori language, knowledge and values.

The Te Hiku corpus is probably too small to build a large language model using the techniques we use today, which rely on ready access to massive data sets. But there are projects similar in spirit, notably Apertus, a Swiss project to create an open, multilingual language model that emphasizes the importance of non-English languages, especially Romansh and Swiss German. 40% of Apertus’s model is non-English… which gives you a sense of just how dominant English is in most models… and the goal is to build models for chatbots, translation systems and educational tools that emphasize transparency and diversity.

Leave a comment on github