Lingobaraza

Empowering communities through enriching languages

We envision a future where advanced Natural Language Processing (NLP) technologies fully empower low-resource languages. By constructing comprehensive, multilingual, and multimodal language datasets and developing robust language models, we aim to enhance the linguistic and communicative capabilities of NLP applications significantly. This will not only foster greater inclusivity and accessibility but also spur innovation and development across educational, technological, and social spheres in Uganda and beyond.

Who we are

"LingoBaraza" is a web portal that provides a vibrant and culturally rich identity for researchers and innovators in Language (Lingo) technology a meeting place (Baraza) for research in NLP. We are strengthening efforts in computational linguistics and NLP by constructing extensive datasets and developing pre-trained models for major Ugandan languages. Our focus encompasses languages such as Luganda, Runyankore, and Rukiga, representing a spectrum of linguistic diversity across Uganda.


Our approach incorporates linguistic tagging, such as part-of-speech and morphological information, using standards like the Universal Dependencies (UD) treebanks and Universal Morphology (UniMorph).
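As a rough illustration of the tagging described above: UD treebanks are stored in the CoNLL-U format, where each token is one line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The Python sketch below reads such lines into simple records. The sample sentence is an English placeholder rather than data from our corpora, and the names `Token`, `parse_feats`, and `parse_conllu` are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass

# The ten tab-separated columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


@dataclass
class Token:
    """A reduced view of one CoNLL-U token (hypothetical helper type)."""
    id: str
    form: str
    lemma: str
    upos: str      # universal part-of-speech tag
    feats: dict    # morphological features, e.g. {"Number": "Plur"}
    head: str      # ID of the syntactic head ("0" for the root)
    deprel: str    # dependency relation to the head


def parse_feats(raw):
    """Turn 'Number=Plur|Tense=Pres' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text):
    """Parse token lines, skipping blank lines and '#' comment lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        cols = dict(zip(CONLLU_FIELDS, values))
        tokens.append(Token(
            id=cols["id"], form=cols["form"], lemma=cols["lemma"],
            upos=cols["upos"], feats=parse_feats(cols["feats"]),
            head=cols["head"], deprel=cols["deprel"],
        ))
    return tokens


# Placeholder English sentence, used only to show the format.
SAMPLE = "\n".join([
    "# text = Dogs bark",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur",
               "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "Number=Plur|Tense=Pres",
               "0", "root", "_", "_"]),
])

for token in parse_conllu(SAMPLE):
    print(token.form, token.upos, token.feats, token.deprel)
```

The sketch ignores multiword-token ranges and empty nodes, which a full CoNLL-U reader would also need to handle.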

Our core objectives

At Lingobaraza, we work within six SMART objectives that guide our day-to-day processes.

Collect

Collect parallel corpora from diverse study participants to establish a robust dataset for translation studies.

Annotate

Annotate the collected textual dataset utilizing Universal Dependencies (UD) treebank standards and the Universal Morphology (UniMorph) framework to facilitate advanced linguistic analysis and machine learning applications.

Process Audio

Record and compile hours of audio data from native speakers and broadcast media in local languages, creating a language dataset for speech processing and auditory analysis.

Support

Construct comprehensive computational grammars and lexicons for local languages to support sophisticated language processing tasks, including parsing and text generation.

Automate

Utilize the developed computational grammars and lexicons to automate the translation of texts from English based on the Universal Dependencies framework, aiming to improve accuracy and fluency in translation models.

Develop

Develop general-purpose large language models (LLMs) for local languages capable of performing various NLP tasks, thereby extending the benefits of advanced NLP technology to these under-resourced languages.

We are setting the stage to leverage state-of-the-art deep learning techniques for advancing NLP tasks, ultimately paving the way for a linguistically inclusive digital future.

Our core team

In today's interconnected world, our team isn't confined by borders. We are a global force, a vibrant tapestry woven from threads of talent scattered across the corners of the earth.

David Bamutura
Dr. Kimera Richard
Dr. Buri Gershom
Dr. Wilson Tumuhimbise
Mbabazi Margaret

Research dissemination

Our research will commence with the extensive collection, curation, and annotation of multilingual and multimodal datasets for Ugandan languages. This phase integrates the development and application of computational resource grammars to facilitate automatic data synthesis, followed by a thorough assessment of both the datasets and the grammars used.

Our team will then conduct experimental analyses to fine-tune large language models (LLMs) using these datasets, comparing the results against existing benchmarks to gauge performance improvements. These efforts will culminate in two comprehensive research papers that detail our methodology and results, from data collection to model fine-tuning. The papers will then be submitted for peer review and publication in leading journals or conferences focused on computational linguistics and NLP.