Introduction
The special interest group for Uralic languages hosts an up-to-date list of resources for Uralic languages. This matrix tries to capture the state of computational resources for the Uralic languages, using linkable, downloadable and usable resources as references (rather than expert judgment, as some other similar matrices do). For a full list of available resources, users are advised to turn to services such as META-SHARE.
Please help us keep the list up to date: send information about new resources, and fixes to current ones, to our ticket tracking system on GitHub.
The matrix
The columns are the following:
- ISO 639: closest applicable standard language code
- Language: the name of the language, or the shared name of related or similarly named languages that have separate language codes
- (group): the part of the language name that differentiates such languages from each other
- Orth: describes the status of standard orthography
- Kbd (keyboard): freely available keyboard layouts for commonly used operating systems
- Corpora: freely available language data, both spoken and written, annotated or not, carefully selected or not
- Speech: speech technology resources such as speech synthesisers
- Morph: various text analysers, morpho-syntactic or otherwise
- Treebank: treebanks and parsebanks with annotations above the word level
- MT: machine translators
- CLDR: resources available in the Common Locale Data Repository
The resources listed are ones that we have verified to be free to use, at least for research purposes but usually for everyone: free as in no cost and free as in no restrictions on the purpose of use. Copyleft licences are acceptable. Systems must also be usable, ideally having been used by us or by researchers we know.
ISO 639 | Language | (group) | Orth | Kbd | Corpora | Speech | Morph | Treebank | MT | CLDR |
---|---|---|---|---|---|---|---|---|---|---|
fin | Finnish | | ??? | ++++ | ??? | ??? | ++++ | ++ | +- | + |
fkv | Kven | | ? | ? | ? | ? | ? | ? | ? | |
fit | Meänkieli | | ? | ? | ? | ? | ? | ? | ? | |
hun | Hungarian | | ? | ? | ? | ? | + | ? | ? | + |
est | Estonian | | ? | + | ? | ? | ++++ | ? | + | + |
ekk | (Estonian) | | | | | | | | | |
vro | Võro | | ? | ? | ? | ? | ? | ? | ? | |
sme | Sámi | North | + | ++ | + | ? | + | ? | + | + |
smj | Sámi | Lule | ? | ? | ? | ? | ? | ? | ? | |
sma | Sámi | South | ? | ? | ? | ? | ? | ? | ? | |
smn | Sámi | Inari | ? | ? | ? | ? | ? | ? | ? | + |
sms | Sámi | Skolt | ? | ? | ? | ? | ? | ? | ? | |
sjd | Sámi | Kildin | ? | ? | ? | ? | ? | ? | ? | |
kpv | Komi | Zyrian | ? | ? | ? | ? | ? | ? | | |
koi | Komi | Permyak | ? | ? | ? | ? | ? | ? | | |
udm | Udmurt | | ? | ? | ? | ? | ? | ? | ? | |
mrj | Mari | Hill | ? | ? | ? | ? | ? | ? | | |
mhr | Mari | Meadow | ? | ? | ? | ? | ? | ? | | |
myv | Mordvin | Erzya | ? | ? | ? | ? | ? | ? | | |
mdf | Mordvin | Moksha | ? | ? | ? | ? | ?? | ? | | |
mns | Mansi | | ? | ? | ? | ? | ? | ? | ? | |
kca | Khanty | | ? | ? | ? | ? | ? | ? | ? | |
nio | Nganasan | | ? | ? | ? | ? | ? | ? | ? | |
enh | Enets | Tundra | ? | ? | ? | ? | ? | ? | | |
enf | Enets | Forest | ? | ? | ? | ? | ? | ? | | |
yrk | Nenets | Tundra | ? | ? | ? | ? | ? | ? | | |
yrk | Nenets | Forest | ? | ? | ? | ? | ? | ? | | |
krl | Karelian | Varsinais- | ? | ? | ? | ? | ? | ? | | |
izh | Ingrian | | ? | ? | ? | ? | ? | ? | | |
olo | Olonets | | ? | ? | ? | ? | ? | ? | - | |
sel | Selkup | | ? | ? | ? | ? | ? | ? | ? | |
vot | Votic | | ? | ? | ? | ? | + | ? | ? | |
We use a plus sign (+) for most resources; an occasional hyphen-minus (-) denotes a work-in-progress version of data or software.
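For readers who want to process the matrix programmatically, the following is a minimal Python sketch of how one row could be read. It is not part of any tooling behind this page. The function names `parse_cell` and `parse_row` are hypothetical, and two interpretations are assumptions on our part: that each plus sign can be counted as one resource, and that a question mark means the status is unknown.

```python
# Column names copied from the matrix header.
COLUMNS = ["ISO 639", "Language", "(group)", "Orth", "Kbd", "Corpora",
           "Speech", "Morph", "Treebank", "MT", "CLDR"]
RESOURCE_COLUMNS = COLUMNS[3:]


def parse_cell(cell):
    """Summarise one resource cell such as '++', '+-' or '?'.

    Assumption: every '+' is one resource and every '-' is one
    work-in-progress resource; '?' is read as 'status unknown'.
    """
    cell = cell.strip()
    return {
        "plus": cell.count("+"),
        "minus": cell.count("-"),
        "unknown": "?" in cell,
    }


def parse_row(line):
    """Split one pipe-separated matrix row into a small dictionary."""
    # Drop trailing pipes, then split on the remaining separators.
    cells = [c.strip() for c in line.rstrip().rstrip("|").split("|")]
    if len(cells) > len(COLUMNS):
        raise ValueError("row has more cells than the header")
    # Pad short rows so that every column has a (possibly empty) value.
    cells += [""] * (len(COLUMNS) - len(cells))
    named = dict(zip(COLUMNS, cells))
    return {
        "iso": named["ISO 639"],
        "language": named["Language"],
        "group": named["(group)"],
        "resources": {name: parse_cell(named[name])
                      for name in RESOURCE_COLUMNS},
    }


if __name__ == "__main__":
    row = parse_row("sme | Sámi | North | + | ++ | + | ? | + | ? | + | + |")
    print(row["iso"], row["language"], row["group"])
    print(row["resources"]["Kbd"])
```

Only the resource columns are scanned for plus and minus signs, so a hyphen inside a name such as "Varsinais-" is not miscounted as a work-in-progress marker.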
References
We have tried to link all resources while avoiding spamming the list with derivations and forks of the same resource.
by Language
This is the language index:
Finnish
Finnish keyboard
- Kotoistus keyboard layout, for national (SFS 5966) and international standards
Comes with all common OSes and systems: Microsoft’s, Linux, Apple’s and Android-based.
Finnish Morphology
- Omorfi (see also: apertium-fin, giella-fin)
- Voikko (also: suomi-malaga, vfst morphology)
- GF Finnish
- UralicNLP (uses Omorfi)
Finnish Treebanks
- Universal Dependencies Finnish (see also: Turku dependency treebank)
- Universal Dependencies Finnish FTB (see also: FinnTreeBanks)
Finnish Machine Translation
- Apertium Finnish-English (high coverage, low quality)
- GF Finnish to any
North Saami
North Saami keyboards
- Official keyboard layout, for national and international standards
Comes with all common OSes and systems: Microsoft’s, Linux, Apple’s and Android-based.
Hungarian
Hungarian morphologies
by Resource
This is the resource-type index:
Orthography
Keyboards
Corpora
Speech technology
Morphology
- Omorfi (see also: apertium-fin, giella-fin)
- Voikko (also: suomi-malaga, vfst morphology)
- hunmorph
- GF (Available: Finnish, Hungarian…)
- UralicNLP (Available models/data: Finnish, North Saami…)
Treebanks
Machine Translation
Other references
Larger collections:
- Universal Dependencies, treebanks and dependency syntax conventions for Finnish, Estonian and Hungarian plus other world languages (includes Uralic guidelines)
- Akusanat, an online dictionary
- UralicNLP resources, different resources for Uralic languages
- OPUS, open source parallel corpora for most of the world’s languages
- Giellatekno, a repository of Uralic analysers and tools covering most Uralic languages
- Apertium, machine translation dictionaries including some Uralic languages
- Grammatical Framework, Haskell descriptions of linguistic data, including a few Uralic languages
- Korp at CSC, a corpus search interface for CSC.fi-managed corpora
- Wanca, corpora from the SUKI project on harvesting the internet for Uralic texts
- Voikko, spell-checking for many Uralic languages
- Language Bank of Finland, Finland’s central repository of language resources
- Centre for Estonian Language Resources
- Divvun, writers’ tools for Saami languages and many others