Projects

Previous Projects

The Turkic Interlingua (TIL) Multilingual Corpus for Machine Translation was a project conducted by our community that resulted in state-of-the-art pre-trained machine translation systems as well as multi-way parallel training and evaluation resources in 22 Turkic languages. For more details, please visit the project GitHub page: https://github.com/turkic-interlingua/til-mt

The Turkic Unified Multilingual Language Understanding (TUMLU) Benchmark is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. GitHub page: https://github.com/ceferisbarov/TUMLU

Ongoing Projects

We are currently establishing standards and a methodology for building a large-scale multilingual Turkic corpus with linguistic annotations, to be used for developing and evaluating systems for natural language processing tasks such as part-of-speech tagging, named entity recognition, text summarization, and question answering. If you are interested in contributing to this project, please get in touch with us.

Future Plans

As a rapidly growing special interest group, we aim to hold official meetings beginning in 2023 to promote and accelerate progress in the computational linguistics of Turkic languages. We also intend to keep refining our open and collaborative infrastructure and to start building tooling that will give the Turkic NLP ecosystem more interoperable modules, with intermediate representations shared across members of the language family. We hope that anyone interested in Turkic languages will get involved and participate in our future workshops. More details will be available through our website and newsletters.