We envision a future where advanced Natural Language Processing (NLP) technologies fully empower low-resource languages. By constructing comprehensive, multilingual, and multimodal language datasets and developing robust language models, we aim to enhance the linguistic and communicative capabilities of NLP applications significantly. This will not only foster greater inclusivity and accessibility but also spur innovation and development across educational, technological, and social spheres in Uganda and beyond.
"LingoBaraza" is a web portal that provides a vibrant and culturally rich identity for researchers and innovators in Language (Lingo) technology a meeting place (Baraza) for research in NLP. We are strengthening efforts in computational linguistics and NLP by constructing extensive datasets and developing pre-trained models for major Ugandan languages. Our focus encompasses languages such as Luganda, Runyankore, and Rukiga, representing a spectrum of linguistic diversity across Uganda.
Our approach incorporates linguistic annotation, such as part-of-speech and morphological information, following standards such as Universal Dependencies (UD) treebanks and Universal Morphology (UniMorph).
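As a rough illustration of what this kind of annotation looks like in practice, the sketch below reads one sentence in CoNLL-U, the tab-separated ten-column format used by Universal Dependencies treebanks. The English sample sentence, the `Token` class, and the helper names are illustrative assumptions only; they are not taken from the LingoBaraza data or tooling.

```python
from dataclasses import dataclass

# The ten tab-separated CoNLL-U columns, in their standard order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


@dataclass
class Token:
    """Hypothetical container for the columns this sketch keeps."""
    id: int
    form: str
    lemma: str
    upos: str      # universal part-of-speech tag
    feats: dict    # morphological features, e.g. {"Number": "Plur"}
    head: int      # ID of the syntactic head (0 = root)
    deprel: str    # dependency relation to the head


def parse_feats(raw):
    """Turn 'Number=Plur|Tense=Pres' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of Token objects."""
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        row = dict(zip(FIELDS, cols))
        if not row["id"].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append(Token(
            id=int(row["id"]),
            form=row["form"],
            lemma=row["lemma"],
            upos=row["upos"],
            feats=parse_feats(row["feats"]),
            head=int(row["head"]),
            deprel=row["deprel"],
        ))
    return tokens


# Illustrative English sentence; a real treebank entry would hold a
# sentence in the target language with the same column layout.
SAMPLE = "# text = Dogs bark.\n" + "\n".join("\t".join(r) for r in [
    ["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"],
    ["2", "bark", "bark", "VERB", "_",
     "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin",
     "0", "root", "_", "SpaceAfter=No"],
    [".", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"][:0]
    or ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
])

if __name__ == "__main__":
    for tok in parse_conllu_sentence(SAMPLE):
        print(tok.id, tok.form, tok.upos, tok.feats, tok.head, tok.deprel)
```

Keeping the features as a dictionary makes it straightforward to filter tokens by a single morphological category, which is the sort of query annotated corpora are typically used for.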
At LingoBaraza, we work within six SMART objectives that guide our day-to-day processes.
Collect parallel corpora from diverse study participants to establish a robust dataset for translation studies.
Annotate the collected textual dataset utilizing Universal Dependencies (UD) treebank standards and the Universal Morphology (UniMorph) framework to facilitate advanced linguistic analysis and machine learning applications.
Record and compile hours of audio data from native speakers and broadcast media in local languages, creating a language dataset for speech processing and auditory analysis.
Construct comprehensive computational grammars and lexicons for local languages to support sophisticated language processing tasks, including parsing and text generation.
Utilize the developed computational grammars and lexicons to automate the translation of texts from English based on the Universal Dependencies framework, aiming to improve accuracy and fluency in translation models.
Develop general-purpose large language models (LLMs) for local languages capable of performing various NLP tasks, thereby extending the benefits of advanced NLP technology to these under-resourced languages.
We are setting the stage to leverage state-of-the-art deep learning techniques for advancing NLP tasks, ultimately paving the way for a linguistically inclusive digital future.
In today's interconnected world, our team isn't confined by borders. We are a global force, a vibrant tapestry woven from threads of talent scattered across the corners of the earth.
The project will commence with the extensive collection, curation, and annotation of multilingual and multimodal datasets for Ugandan languages. This phase integrates the development and application of computational resource grammars to facilitate automatic data synthesis, followed by a thorough assessment of both the datasets and the grammars used.
Our team will then conduct experimental analyses to fine-tune large language models (LLMs) using these datasets, comparing the results against existing benchmarks to gauge performance improvements. These efforts will culminate in two comprehensive research papers detailing our methodology and results, from data collection to model fine-tuning, which will be submitted for peer review and publication in leading journals or conferences focused on computational linguistics and NLP.
This website is a central hub for all project-related information, including updates, publications, and accessible datasets. Additionally, social media platforms will be utilized to foster engagement with both the academic community and the public, ensuring continuous dissemination of findings and updates.
The datasets, curated and annotated, will be made available under an appropriate Creative Commons license to promote open access and encourage their use by other researchers. This proactive approach in communication and open sharing is aimed at stimulating further research, fostering collaborations, and enhancing the utility of our work within the global NLP community.