International Wakashan AI Consortium
One-line solution summary:
The international Wakashan language communities are uniting to build a shared Voice AI, overcoming borders and barriers.
Pitch your solution.
The Wakashan language family comprises multiple languages including Kwak'wala on northern Vancouver Island in BC and Makah in Neah Bay, Washington. These languages are at risk of going silent. Though traditionally interdependent with the Wakashan community in British Columbia, the Makah people are isolated by the American border. This solution is a technological bridge between Wakashan languages with a 'Think Big' goal: building Automatic Speech Recognition AI. This solution will leverage internationally cooperating community talent and resources to build an invaluable annotated dataset for a novel machine learning strategy.
Developing AI for Wakashan languages is significant because: these languages are highly polysynthetic, a language type that is under-researched in machine learning; data collection will respect Indigenous Data Sovereignty; the AI models generated can be shared among the Wakashan languages, amplifying the effect; the project will provide a model for sustainable AI research; and the results will enable next-generation curricula and research.
What specific problem are you solving?
Generally, the problem this solution works to address is a lack of research and technology enabling advanced curriculum for the Makah and other Wakashan languages. Modern advancements in Automatic Speech Recognition (ASR) and natural language processing have enabled voice assistants, such as Siri, and education apps like Duolingo. Unfortunately, as with most Indigenous languages, this technology does not exist for Makah.
The specific problem is a lack of the research and engineering needed to build ASR for Makah and related languages. There are several barriers to this work. In addition to the community being rural and isolated from other Wakashan communities, there is a lack of annotated audio suitable for machine learning, of basic research on polysynthetic ASR for Makah, and of community skills to conduct this work. Time is threatening the revitalization of this language: there are just 16 speakers of Makah.
The scale of this problem impacts the Makah nation, of nearly 4,000 members, plus the communities of the Wakashan language family in Canada. There are over 25,000 members internationally ("Report on the Status of B.C. First Nations Languages", 2018).
Broadly, research on polysynthetic ASR affects hundreds of language communities across North America, and millions of Indigenous people and thousands of endangered languages worldwide.
What is your solution?
The goal of this solution is to empower the Makah to build advanced voice experiences and conduct research on their own language in order to revitalize their endangered language. To bridge this technical gap, international brethren will be united to collect data for, conduct research for, and build skills to create Automatic Speech Recognition AI. This AI will be a Deep Neural Net model implemented on the machine learning platform TensorFlow using Open Source Software.
The ASR model will take audio files as input and return text corresponding to the audio in the Makah orthography. Using the resulting text, future Makah software engineers can use simple logic to build interactive experiences. Because Makah is closely related to other Wakashan languages this model can be used to jumpstart custom models accommodating unique attributes in each language and vice-versa.
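As a minimal sketch of the "simple logic" such an experience might use, the Python example below checks a learner's spoken attempt against an expected phrase. The `transcribe` argument stands in for the ASR model and is an assumption, as are the function names; any strings passed in are placeholders, not Makah words. Unicode normalization is included because orthographies with diacritics can encode the same character in more than one way.

```python
import unicodedata


def normalize(text):
    """Return text in NFC form with surrounding and repeated whitespace removed."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


def practice_turn(audio_path, expected, transcribe):
    """Compare the ASR transcript of one recording with the expected phrase.

    transcribe is a hypothetical callable (audio path -> text) standing in
    for the ASR model; it is not part of any existing library.
    """
    transcript = transcribe(audio_path)
    if normalize(transcript) == normalize(expected):
        return "match"
    return "try again"
```

An app could call `practice_turn` once per prompt and decide what to show the learner next based on the returned string.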
Furthermore, because Makah and the other Wakashan languages are highly polysynthetic, this solution will contribute to the broader Indigenous language research community working to build ASR for polysynthetic languages across the globe. Polysynthetic languages have as many words as there are sentences in English, or as there are stars in the sky. Current ASR assumes that languages have a relatively small dictionary of words.
Strong preference will be given to Native-led solutions that directly benefit and are located within the Indigenous communities. Which community(s) does your solution benefit?
Our International Wakashan AI Consortium will continue the work with the Sanyakola Foundation, a collaborative group led by Kwakwa̱ka̱’wakw fluent speakers and language activists from four Kwak’wala speaking communities on Northern Vancouver Island, British Columbia (Kwagu’ł in Tsax̱is Fort Rupert, Gwa’sala and ’Nakwaxda’xw in Tsulquate, and G̱usg̱imux̱w in Quatsino). Thanks to the MIT Solve project we will unite the Wakashan language communities across international borders by including the Makah Tribe ( ̓Qʷidiččaʔa·tx̌) of Neah Bay, Washington in this project and develop an automatic speech recognition (ASR) system for both Kwak’wala and Makah with the possibility of expanding to other Wakashan languages in the future.
The Makah Tribe (www.makah.com) has called the Neah Bay area on the most northwestern tip of the Olympic Peninsula in Washington State, USA home since time immemorial. They are a fairly small community with a population of 3,099 (of which 1,550 live in Neah Bay) and about 16 active speakers. The Makah are working hard to preserve and restore their traditional language that has suffered under colonization and the deliberate measures taken by the US government to eradicate tribal languages during the 19th century and the early part of the 20th century. The last fluent, first-language speaker passed away in early 2000, but thankfully their language is thoroughly documented and their language preservation efforts are centralized through the Makah Language Program of the Makah Cultural and Research Center. They have hundreds of hours of audio data in the Makah central archive and the language is currently taught by seven certified Makah language teachers at the local K-12 state school and in adult language classes. Only local residents have access to language classes however, as there are no language materials online.
In comparison, Kwak’wala is spoken in several dialects across 15 communities on northern Vancouver Island and the smaller islands and mainland directly to the east. According to the 2018 Report on the Status of B.C. Languages (http://www.fpcc.ca/files/PDF/FPCC-LanguageReport-180716-WEB.pdf), the population of Kwakwaka’wakw is approximately 6,224 with 763 active language learners and 8.1% fluent and semi-fluent speakers.
Both the language and the culture are famously well-researched, a trend started by the anthropologist Franz Boas. The wealth of existing materials on the language and culture of the Kwakwaka’wakw remains minimally accessible, however, to those who need it most: teachers and learners actively engaged in reclaiming their languages.
With funding from the National Research Council of Canada (NRC) our partnership among the Sanyakola Foundation, the Gwa’sala-’Nakwaxda’xw Language Revitalization Program, and the University of British Columbia (UBC) initiated our Indigenous, community-led Kwak’wala corpus collection and ASR project in summer 2019. (https://nrc-publications.canada.ca/eng/view/object/?id=d4f10144-c711-43c5-b80b-5ace7df5e68b) Our project team has since identified hundreds of hours of analog, digitized, and born-digital audio data in Kwak’wala, scattered among institutions and communities, and also held in personal, local, museum, and provincial archives. We have trained 20 community members in skills such as audio recording, data management, ear training, Kwak’wala orthographies and transcription, and annotation in ELAN transcription software, while assembling a corpus of machine-readable time-aligned transcriptions of recorded Kwak’wala audio for the purpose of automating speech-to-text. The team also developed a data pipeline to support the project. So far our project has compiled approximately 30 hours of machine-readable Kwak’wala audio data, consisting of basic daily conversational speech, pedagogical materials, and Elders’ storytelling.
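One step in a pipeline of this kind is turning ELAN annotations into time-aligned (start, end, text) segments that can be paired with audio. The Python sketch below reads those segments from the XML of an ELAN .eaf file using only the standard library. It is an illustration, not the project's actual pipeline; the function name is hypothetical, and any tier name passed to it depends on how a given project names its tiers.

```python
import xml.etree.ElementTree as ET


def read_eaf_segments(eaf_xml, tier_id):
    """Return sorted (start_ms, end_ms, text) tuples for one tier of an .eaf document."""
    root = ET.fromstring(eaf_xml)
    # Map each time-slot id to its millisecond offset (unaligned slots are skipped).
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for annotation in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(annotation.get("TIME_SLOT_REF1"))
            end = slots.get(annotation.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # no usable time alignment for this annotation
            text = (annotation.findtext("ANNOTATION_VALUE") or "").strip()
            segments.append((start, end, text))
    return sorted(segments)
```

Each returned tuple can then be used to cut the matching stretch of audio and store it alongside its transcription.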
The corpus creation project will continue in the coming months and years, working towards making these existing, distributed recordings more accessible through digitization, transcription and a graded-access website, creating new audio records with fluent Elders to accompany existing language learning materials, in addition to contributing to the development of ASR tools to lessen the transcription bottleneck (Himmelmann 1998) and support Indigenous language revitalization. The ASR we are developing for Kwak’wala is one component of a suite of technical language infrastructure modules (Audio Recognition, Natural Language Understanding/Interpretation, Text-to-Speech and Speech-to-Text synthesis) which will enable the creation of a rich voice-forward interactive education system (e.g. xR or extended reality and voice-assistant appliances such as Siri or Google Voice).
Uniting across borders
Our work developing open-source ASR for Kwak’wala provides a strong platform from which to develop similar systems for Makah and other Wakashan languages (Ditidaht, Haisla, Heiltsukvla, Nuu-chah-nulth, Oowekyala and X̄enaksialak̓ala/X̄a’islak̓ala).
The Makah are the only representatives of the Wakashan language family that were cut off from their relatives in present-day British Columbia through the International Boundary that was drawn along the 49th parallel. Despite that forced separation the Makah were able to stay connected with their relatives in the north through intermarriage and through the canoe family. Canoes are still a vital part of Makah life and every year they embark on long journeys, paddling hundreds of miles in ocean going canoes to visit and celebrate with tribes along the coast of Washington (USA) and British Columbia (Canada). It is through these family and canoe family relations between the Makah and the Kwagu’ł that our solution team was introduced to the community.
Benefits to the Makah
By cooperating internationally in this technological endeavor, the Makah will be part of a 25,000-strong Wakashan language community whose members can benefit from work on each other’s languages. Because the Wakashan languages are closely related, the machine learning models created by one community can be leveraged to jumpstart a model specialized for another Wakashan language. This strategy, called transfer learning, will allow the Makah to benefit from the work of their international brethren.
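To make the idea concrete, the toy Python sketch below shows the initialization step behind this kind of model reuse: the shared layers of a source-language model are copied into a new model, selected layers are marked as frozen, and the output layer is re-created for the target language's symbol inventory. The dictionary layout, layer names, and function name are assumptions made for illustration; a real system would use a machine learning framework rather than plain lists.

```python
import copy
import random


def transfer_weights(source_model, target_output_size, frozen_layers, seed=0):
    """Build a target-language model that starts from a source-language model.

    source_model is a hypothetical dict: {"layers": {name: list of weight rows}}.
    Every layer except "output" is copied; "output" is re-initialized with
    target_output_size rows so it can cover a different orthography.
    """
    rng = random.Random(seed)
    target = {"layers": {}, "frozen": set(frozen_layers)}
    for name, weights in source_model["layers"].items():
        if name == "output":
            hidden_size = len(weights[0])
            target["layers"][name] = [
                [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                for _ in range(target_output_size)
            ]
        else:
            # Deep copy so later fine-tuning does not alter the source model.
            target["layers"][name] = copy.deepcopy(weights)
    return target
```

Fine-tuning would then update only the layers not listed in `target["frozen"]`, using the smaller target-language dataset.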
Additionally, through this international technological effort alongside capacity-building measures, the Makah community will be empowered with valuable skills to: collect annotated data for machine learning, conduct AI research, and leverage the results to build engaging voice experiences for language revitalization.
As an Indigenous-led team we are fully aware of the history of extractive research. Just as our relations with the Kwakwa̱ka̱’wakw community partners follow Indigenous methodologies and values, we plan to build similar long-term, respectful relations with the Makah. The Makah are equal community partners in all phases of the project. This means that we define the scope and the goals of the project together. The Makah are in control of data collection, sharing, and dissemination. All work will be conducted according to their comfort level and abide by Makah customs and protocols, ensuring Indigenous Data Sovereignty.
The work of our International Wakashan AI Consortium will further the goals of the Makah Language Program to preserve and restore the Makah language and support next-generation curriculum development within a community-oriented framework that allows for job creation and capacity building across Wakashan communities, while providing a model for sustainable AI research.
All language data is sacred.
Which dimension of the Fellowship does your solution most closely address?
Support language and cultural revitalization, quality K-12 education, and support for first-generation college students
Explain how the problem, your solution, and your solution’s target population relate to the Fellowship and your selected dimension.
This solution is aligned with supporting language revitalization, though the results will not be immediate. Building ASR for Makah and Wakashan languages will enable innovative app development and language research by the Makah.
Makah is a language with only 16 speakers, and our solution will reinforce their revitalization efforts by connecting them with their international brethren around a common goal. This goal will enable a new class of immersive voice experiences for the Makah nation. Furthermore, the creation of polysynthetic ASR for Wakashan languages will build technical capacity within the Makah community to participate in the knowledge economy.
In what city, town, or region is your solution team headquartered?
Seattle, WA, USA
What is your solution’s stage of development?
Prototype: A venture or organization building and testing its product, service, or business model
Who is the primary delegate for your solution?
Michael Running Wolf
Please indicate the tribal affiliation of your primary delegate.
Northern Cheyenne, Lakota, and Blackfeet
Is your primary delegate a member of the community in which your project is based?
Which of the following categories best describes your solution?
A new technology
Describe what makes your solution innovative.
Statistics-centric approaches to ASR are difficult to apply to North American Indigenous languages, which typically only have scant text and audio corpora, let alone hand-tagged training data. ASR word-recognition uses whitespace-based demarcation — but Wakashan languages are polysynthetic. Information-dense words build up by stacking non-standalone meaningful elements (lexical affixes), each conveying as much as one English word, to create single-word expressions corresponding to entire phrases/sentences:
‘Apparently she wants to be an expert at dyeing straw.’
Instead of statistics, we exploit restrictive positional relationships between these elements: identifying one limits the search-space possibilities of the next, and so on. Recognition follows a cascading series of restrictions on what the next element can be: eliminating the need for large training corpora, and suggesting more resource-efficient solutions for all languages. Our immediate goal is for Wakashan ASR to recognize the small core vocabulary feeding most high-frequency, everyday usage. This enables audio transcription automation, and developing VR-based training of unlimited numbers of learners in beginner fundamentals—so that fluent speakers can instead teach complex higher-order information (like usage norms and contextual nuance)—and lets shy learners practice with an interactive voice experience until ready to handle face-to-face conversation. This community-driven approach deploys the technology thoughtfully: creating comfortable, meaningful interactions with carefully prioritized core-language content, and helping address the individual/community-level emotional challenges of language reclamation.
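The cascading restriction described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the positional classes, the morpheme labels, and the per-position score dictionaries are all hypothetical placeholders (not real Wakashan data or real acoustic model output), and a greedy search is used only to keep the example short.

```python
# Hypothetical positional grammar: each class lists the classes that may follow it.
ALLOWED_NEXT = {
    "START": {"root"},
    "root": {"lexical_suffix", "aspect"},
    "lexical_suffix": {"lexical_suffix", "aspect"},
    "aspect": {"person"},
    "person": set(),
}

# Placeholder morpheme labels mapped to their positional class.
MORPHEME_CLASS = {
    "ROOT1": "root",
    "ROOT2": "root",
    "LEX1": "lexical_suffix",
    "LEX2": "lexical_suffix",
    "ASP1": "aspect",
    "PERS1": "person",
}


def candidates_after(previous_class):
    """Return the morphemes whose class may follow previous_class."""
    allowed = ALLOWED_NEXT[previous_class]
    return sorted(m for m, c in MORPHEME_CLASS.items() if c in allowed)


def decode(position_scores):
    """Greedy decode: at each position keep only grammar-legal morphemes,
    then pick the one with the highest score.

    position_scores is a list of dicts (morpheme -> score), one per position.
    """
    previous_class = "START"
    output = []
    for scores in position_scores:
        legal = candidates_after(previous_class)
        if not legal:
            break  # the grammar allows nothing further in this word
        best = max(legal, key=lambda m: scores.get(m, float("-inf")))
        output.append(best)
        previous_class = MORPHEME_CLASS[best]
    return output
```

In this toy setup, a suffix that scores highest at the first position is never chosen there, because only roots are legal at the start of a word; each choice then narrows the candidates for the next position.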
Describe the core technology that powers your solution.
We will build on ASR work for two unrelated polysynthetic North American languages, Seneca (Jimerson and Prud'hommeaux, 2018) and Inuktitut (Gupta and Boulianne, 2020). The former showed that ASR could work, albeit imperfectly, with small datasets, while the latter showed that it was possible to use a bi-directional long short-term memory architecture, but required moving the n-gram language model from whitespace-based standalone word units to word-internal morpheme (= lexical and grammatical affix) units.
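To illustrate what moving the n-gram model to morpheme units means, the Python sketch below counts bigrams over word-internal units rather than whitespace-separated words. It assumes the words have already been segmented, with hyphens marking morpheme boundaries; that input convention, the boundary markers, and the function names are assumptions for this example, and add-one smoothing is used only for simplicity.

```python
from collections import Counter


def morpheme_bigrams(segmented_words):
    """Count left-context unigrams and bigrams over morphemes.

    Each word is a hyphen-segmented string; "<w>" and "</w>" mark word edges.
    """
    unigrams = Counter()
    bigrams = Counter()
    for word in segmented_words:
        units = ["<w>"] + word.split("-") + ["</w>"]
        for left, right in zip(units, units[1:]):
            unigrams[left] += 1
            bigrams[(left, right)] += 1
    return unigrams, bigrams


def bigram_probability(left, right, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of P(right | left)."""
    return (bigrams[(left, right)] + 1) / (unigrams[left] + vocab_size)
```

Because the units are morphemes, a long word never seen in training can still receive a sensible probability as long as its parts and their orderings have been observed.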
Our approach extends this work in a new direction by applying loss functions heavily penalizing individual morpheme identifications that are disallowed by the adjacent morpheme sets. We are unaware of any research of this kind using loss functions, but we anticipate that creating a loss function informed by linguistic co-occurrence constraints will significantly enhance the usability of smaller datasets. From here, we will further minimize word error rate by implementing data augmentation via speed perturbation (Ko et al., 2015) and SpecAugment (Park et al., 2019), which can increase these small datasets’ size significantly; increase computational efficiency via selective-backprop (Jiang et al., 2019), which would skip the computationally expensive gradient calculation in certain cases; and bypass the need for a language model by using RNN transducers (Battenberg et al., 2017) and low-rank transformers (Winata et al., 2020) as alternative sequence-to-sequence architectures.
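One possible shape for such a constraint-informed loss is sketched below in plain Python. It adds, to ordinary cross-entropy, a penalty proportional to the probability the model places on morphemes that are not allowed in the current position. This is only an illustration of the general idea, not the formulation the project will necessarily adopt; the function names and the `penalty_weight` value are assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_loss(logits, target_index, allowed_indices, penalty_weight=5.0):
    """Cross-entropy on the target morpheme plus a penalty on probability
    mass assigned to morphemes disallowed after the preceding morpheme.

    allowed_indices is the set of morpheme indices the grammar permits here;
    penalty_weight is a hypothetical tuning knob.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target_index])
    disallowed_mass = sum(
        p for i, p in enumerate(probs) if i not in allowed_indices
    )
    return cross_entropy + penalty_weight * disallowed_mass
```

With this form, two predictions that give the correct morpheme the same probability are no longer treated alike: the one that spreads its remaining probability over grammatically impossible morphemes incurs the larger loss.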
Provide evidence that this technology works.
Our solution will be built with widely used, open-source technologies implemented in the TensorFlow machine learning platform. Our current work is based upon Mozilla’s DeepSpeech, an open-source ASR system built on TensorFlow: it provides a foundation while allowing the modifications needed to support Wakashan-specific polysynthetic language recognition. Though prior work with Seneca and Inuktitut relies extensively on the Kaldi speech recognition toolkit, TensorFlow can support comparable models. Unlike Kaldi, however, TensorFlow has the added benefit of being widely supported by industry and easily deployable to many platforms, especially mobile apps. Offline mobile apps are particularly important to rural Indigenous communities without easy or reliable internet access. This technology will scale and be usable in a rich range of future immersive language-learning apps.
Please select the technologies currently used in your solution:
What is your theory of change?
Building Automatic Speech Recognition for a language family that spans international borders will both advance language revitalization and contribute to the participating communities’ technological capacity.
Short term impact
Community members become proficient at rigorously documenting their own language.
Initial early models will demonstrate that community-driven AI development is attainable, which then encourages further work.
Community members receive training to carry out limited software engineering.
Learning one's heritage language becomes attractive.
Medium term impact
A critical mass of training data is created, which is invaluable for linguistic and machine learning research.
Community members gain capacity to conduct limited independent machine learning research.
Community members are comfortable with software engineering.
Initial prototype language education apps are being developed.
Learning one's heritage language becomes normalized.
Long term impact
New AI research, unimaginable now, is being conducted by the community using ASR work and training data.
Immersive voice experiences are being developed by community members.
Education outside of language revitalization courses is conducted in the community’s language. For example, math and science are taught in Makah.
Select the key characteristics of your target population.
Which of the UN Sustainable Development Goals does your solution address?
In which state(s) do you currently operate?
In which state(s) will you be operating within the next year?
How many people does your solution currently serve? How many will it serve in one year? In five years?
Currently our work with the Kwak’wala language includes 20 community members who have been or are being trained in language documentation for machine learning. In the Makah community we will be working with 6 community researchers this year, for a total of 26 served by this solution this year.
A year from now, we expect to expand the Kwak’wala ASR solution to include more of the 15 Kwakwa̱ka̱’wakw communities (population ca. 6,500).
In five years we anticipate serving all Wakashan language communities (population ca. 25,000) with innovative voice based educational material.
This will be in addition to other Indigenous language communities we extend our polysynthetic ASR solution to. Our team has particularly long experience with the distinctive polysynthetic word structure of Algonquian languages (especially Penobscot, Passamaquoddy-Maliseet, Mi'kmaw, Abenaki, and Long Island Algonquian) and very well-established collaborations with those communities. This will readily allow us to extend our ASR solution to those languages, which, especially if including the related and similarly-structured Cree and Ojibwe languages among others, potentially reaches over 500,000 people, offering them a scalable solution for broad deployment of community-centered immersive language research and pedagogy.
What are your goals within the next year and within the next five years?
Within the next year our goal is to make significant progress in solving and implementing a reliable polysynthetic ASR model for Makah and Kwak’wala. This includes training community members in annotating data, collaboratively building a machine learning training pipeline, and building prototype apps demonstrating how to use the models for community-driven educational initiatives.
Within the following years we expect to be mere consultants as other Wakashan communities implement their own ASR models. From there, our team will focus on generalizing our polysynthetic toolset for and building relationships with Algonquian-speaking communities—as we already have decades of linguistic research and community-collaboration experience with several groups in that language family.
By year five we expect to have a fully functional polysynthetic machine learning platform, based upon open source software like TensorFlow, scaled to be independently implemented by Indigenous communities. Our long-term goal is to strategically build partnerships with Indigenous language communities to advance our solution’s capabilities and allow others to use and extend our platform sustainably and organically.
What barriers currently exist for you to accomplish your goals in the next year and in the next five years?
The biggest immediate challenge is COVID-19, which has several ramifications:
Risk to elder knowledge keepers.
Limits our ability to conduct in-person data-gathering, mentoring, and consultation.
Educational resources within communities are being diverted to mitigate the pandemic, which includes converting all K-12 and adult courses online.
The international border between Canada and the USA is closed for non-essential travel.
Year one risks beyond COVID-19:
Lack of machine learning infrastructure, such as high performance computers to conduct research, within the Makah community.
Lack of a secure clearinghouse for data storage.
Large text corpus to enhance training does not exist; many documents are handwritten.
Data cleanliness: status of existing data (analog / digital / machine-readable) needs to be assessed.
Makah community researchers are currently not funded.
How to protect Indigenous Data Sovereignty?
How to ensure IP rights for communities involved are protected in an international effort such as this?
Unknown if this is a sustainable for-profit endeavor or best conducted as a NGO.
Year five risks:
Scaling and generalizing our solution to all Wakashan languages and beyond.
Making this effort sustainable within communities and on a broader scale.
Indigenous Data Sovereignty must continually be protected.
How can we scale this effort beyond North America?
How do you plan to overcome these barriers?
For the immediate COVID-19 pandemic, we will ensure the continuity of community knowledge by protecting elders from the virus. Essentially, all work will be conducted through teleconference.
For year one risk mitigations:
ML Infrastructure: explore leveraging high-performance computing resources in Canada being used for Kwak’wala. Initial research can be conducted upon personal computers.
Clearinghouse: follow Indigenous knowledge access protocols and assume no data can be stored insecurely.
Text corpus: partner with researchers in OCR AI fields and create an ASR strategy that does not strictly rely on a text corpus.
Data cleanliness: assess the current state of Makah and Kwak’wala data and build skills within the communities to create machine-readable language data.
Financial: begin partnering with the Makah community to seek funding such as grants from the US Government Administration for Native American Language Revitalization program.
Legal: seek mentorship from MIT Solve on how to develop a legal framework to protect international Indigenous Data Sovereignty.
Market: seek mentorship from MIT Solve on a sustainable financial model to conduct work amongst the Wakashan language family.
Scaling: seek MIT Solve mentorship for scalable enterprise AI design practices and in the meantime use industry-standard open-source software.
Financial: seek MIT Solve mentorship on how to make this effort scale throughout the Wakashan languages and beyond.
Legal: seek mentorship from MIT Solve on how to scale a legal framework across the globe.
Market: seek mentorship from MIT Solve on making fiscal decisions now that will scale to include thousands of languages and peoples in the future.
What type of organization is your solution team?
Not registered as any organization
How many people work on your solution team?
- Full time staff: 0
- Part-time staff: 1
- Consultant: 1
- Volunteers: 2
- Makah consultants & contractors: 6
The 20+ people currently working on the Kwak'wala (Wakashan) language in British Columbia are not reflected above.
How many years have you worked on your solution?
1 year with the Kwak’wala (Wakashan) language, building a data pipeline for machine learning
Why are you and your team well-positioned to deliver this solution?
Collectively our solution team has decades of experience working in the fields of Computer Science, Math/Data Science, Technical Program Management, and Linguistics. More importantly, our team has collaborated extensively with Indigenous communities in North America and conducted advanced technological development.
Michael Running Wolf was raised by his Lakota father in his Cheyenne mother's village where his mother tongue was the dominant language. Despite growing up without reliable running water and electricity, he has an MS in Computer Science and currently works in industry on Voice Assistant AI. https://www.linkedin.com/in/runningwolf
Shawn Tsosie, Chippewa/Navajo, is a Machine Learning Data Scientist working on Indigenous cultural revitalization. He is an Iraq War Veteran raised in rural Montana. He received his undergraduate degree from MIT and his master's and PhD degrees in Math from UC Santa Cruz. https://shawn-tsosie.github.io/blog/2019/11/03/ASR-Project
Caroline Running Wolf, nee Old Coyote, is a Crow PhD student at UBC Vancouver focusing on technology-assisted language reclamation for Kwak'wala. Caroline created Indigenous VR-projects supported by Oculus/Facebook and the Sundance New Frontier Program. She gave talks on Indigenous Data Sovereignty at NeurIPS 2019: https://aiforsocialgood.github.io/neurips2019/schedule.htm
Conor McDonough Quinn is a documentary and revitalization/reclamation linguist who has worked since the mid-1990s on the morphosyntax and morphosemantics of polysynthetic stem structure and related grammatical properties of Eastern Algonquian languages, with special focus on how formal/scholarly models enable practical approaches to community language revitalization/reclamation pedagogy. http://www.conormquinn.com/professional.html
What organizations do you currently partner with, if any? How are you working with them?
Sanyakola Foundation, a Kwakwa̱ka̱’wakw nonprofit, to continue our Kwak’wala corpus collection and ASR project
Makah Language Program to plan, develop, and implement a Makah corpus and ASR project
Makah Cultural & Research Center to assist with archival research
Indigenous Languages Technology (ILT) project at National Research Council of Canada to network and share insights.
What is your path to financial sustainability?
In this early phase of the solution’s life-cycle, the most viable path to funding is through grants. We will continue to seek grants in Canada and the USA that support our community-driven AI research and implementations promoting and enhancing Indigenous language and culture reclamation efforts.
Our team is also exploring the possibility of pursuing this work as a for-profit startup in addition to considering the forming of an international NGO.
What are your estimated expenses for 2020?
Makah estimated expenses: $55,000 USD. The current work/budget for Kwak'wala (Wakashan language) is not included.
Do you primarily provide products or services directly to individuals, or to other organizations?
Individual consumers or stakeholders (B2C)
Why are you applying to Solve?
This team is applying to further our goal of language revitalization through the access to significant networks in AI, industry, and linguistic research represented by MIT scholarship and the MIT Solve organization. MIT Solve mentorship for legal and business related issues would help us build a sustainable and scalable business model for our innovative technology while securing Indigenous Data Sovereignty.
But MIT Solve is a stepping stone to a larger goal: though we face significant technological hurdles, the barriers faced in our target communities are greater. The Wakashan language communities persevere despite broad and entrenched socioeconomic and latent colonial-era oppression. These communities are ancient and persist despite genocidal policies from two colonial governments, but struggle needlessly to participate in the modern knowledge economy.
The Makah, like their international brethren, require infrastructure to support sustainable technological and economic advancement, and MIT Solve is an avenue toward this goal. By participating in advanced AI research, and forging partnerships with institutions within MIT Solve, the Makah and Wakashan communities can obtain an equitable economic future. More importantly, they can engineer a future that respects and leverages ancient traditions to drive innovation.
In which of the following areas do you most need partners or support?
Please explain in more detail here.
As an early-stage collective of activists, engineers, and linguists that is not yet registered as any organization, we require advice and mentorship in business (NGO / for-profit) and legal (Indigenous Data Sovereignty, IP rights) topics. These requests are in addition to technological guidance and/or partners in the language revitalization space.
What organizations would you like to partner with, and how would you like to partner with them?
To advance our cutting-edge solution and explore scalability we would like to collaborate and share resources/insights with other individuals and institutions who are researching similar ASR challenges. We currently know of:
CRIM (Vishwa Gupta & Gilles Boulianne)
Robbie Jimerson at Rochester Institute of Technology
We would love to communicate with the alumni and current recipients of the AI Innovations prize and any other AI or ASR initiatives to share resources and insights.
We would also be interested to connect with linguists who are specialized in Indigenous and/or endangered languages, such as Dr. Norvin Richards from the MIT Linguistics Department.
- Michael Running Wolf Founder, Indigenous in AI