Introduction

The special interest group for Uralic languages hosts an up-to-date list of resources for Uralic languages. This matrix tries to capture the state of computational resources for the Uralic languages, using linkable, downloadable and usable resources as references (rather than expert judgment, as some similar matrices do). For a full list of available resources, users are advised to turn to services such as META-SHARE.

Please help us keep the list up to date: send information about new resources and fixes to existing entries to our ticket tracking system on GitHub.

The matrix

The columns are the following:

  • ISO 639: closest applicable standard language code
  • Language: the name of the language
  • (group): the differentiating part of the name, used for related or similarly named languages that have separate language codes
  • Orth: describes the status of standard orthography
  • Keyboard (Kbd): freely available keyboard layouts for commonly used operating systems
  • Corpora: freely available language data, both spoken and written, annotated or not, carefully selected or not
  • Speech: speech technology resources such as synthesised voices
  • Morph: various text analysers, morpho-syntactic or otherwise
  • Treebank: treebanks and parsebanks with annotation above the word level
  • MT: machine translators
  • CLDR: resources available in the Unicode Common Locale Data Repository

The resources listed are ones that we have verified to be free to use, at least for research purposes but usually for everyone: free as in no cost and free as in no restrictions on the purpose of use. Copyleft licences are acceptable. Systems must also be usable, ideally used by us or by researchers we know.

ISO 639  Language     (group)     Orth  Kbd   Corpora  Speech  Morph  Treebank  MT  CLDR
fin      Finnish                  ???   ++++  ???      ???     ++++   ++        +-  +
fkv                   Kven        ?     ?     ?        ?       ?      ?         ?
fit                   Meänkieli   ?     ?     ?        ?       ?      ?         ?
hun      Hungarian                ?     ?     ?        ?       +      ?         ?   +
est      Estonian                 ?     +     ?        ?       ++++   ?         +   +
ekk      (Estonian)
vro      Võro                     ?     ?     ?        ?       ?      ?         ?
sme      Sámi         North       +     ++    +        ?       +      ?         +   +
smj                   Lule        ?     ?     ?        ?       ?      ?         ?
sma                   South       ?     ?     ?        ?       ?      ?         ?
smn                   Inari       ?     ?     ?        ?       ?      ?         ?   +
sms                   Skolt       ?     ?     ?        ?       ?      ?         ?
sjd                   Kildin      ?     ?     ?        ?       ?      ?         ?
kpv      Komi         Zyrian      ?     ?     ?        ?       ?      ?
koi                   Permyak     ?     ?     ?        ?       ?      ?
udm      Udmurt                   ?     ?     ?        ?       ?      ?         ?
mrj      Mari         Hill        ?     ?     ?        ?       ?      ?
mhr                   Meadow      ?     ?     ?        ?       ?      ?
myv      Mordvin      Erzya       ?     ?     ?        ?       ?      ?
mdf                   Moksha      ?     ?     ?        ?       ??     ?
mns      Mansi                    ?     ?     ?        ?       ?      ?         ?
kca      Khanty                   ?     ?     ?        ?       ?      ?         ?
nio      Nganasan                 ?     ?     ?        ?       ?      ?         ?
enh      Enets        Tundra      ?     ?     ?        ?       ?      ?
enf                   Forest      ?     ?     ?        ?       ?      ?
yrk      Nenets       Tundra      ?     ?     ?        ?       ?      ?
yrk                   Forest      ?     ?     ?        ?       ?      ?
krl      Karelian     Varsinais-  ?     ?     ?                ?      ?         ?
izh                   Ingrian     ?     ?     ?                ?      ?         ?
olo                   Olonets     ?     ?     ?        ?       ?      ?         -
sel      Selkup                   ?     ?     ?        ?       ?      ?         ?
vot      Votic                    ?     ?     ?        ?       +      ?         ?

We use a plus sign (+) for most resources; an occasional hyphen-minus (-) denotes a work-in-progress version of data or software.
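For readers who want to process the matrix programmatically, the notation can be tallied with a few lines of Python. This is only a minimal sketch: the function name `summarise_cell` and the result keys are hypothetical, and reading `?` as "not yet verified" is an assumption, since the text above defines only `+` and `-`.

```python
from collections import Counter


def summarise_cell(cell: str) -> dict:
    """Count the marks in one matrix cell such as '++++', '+-' or '?'."""
    counts = Counter(cell.strip())
    return {
        "available": counts["+"],         # '+' marks
        "work_in_progress": counts["-"],  # '-' marks
        "unverified": counts["?"],        # '?' marks (assumed meaning, not stated in the text)
    }


# Two cells copied from the Finnish row of the matrix.
finnish_row = {"Kbd": "++++", "MT": "+-"}
for column, cell in finnish_row.items():
    print(column, summarise_cell(cell))
```

An empty cell yields zeros for all three keys, so rows with blank columns can be passed through unchanged.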

References

We have tried to link all resources while avoiding spamming the list with derivations and forks of the same resource.

by Language

This is the index by language:

Finnish

Finnish keyboard

  1. Kotoistus keyboard layout, for national (SFS 5966) and international standards

Comes with all common OSes and systems: Microsoft’s, Linux, Apple’s and Android-based.

Finnish Morphology

  1. Omorfi (see also: apertium-fin, giella-fin)
  2. Voikko (also: suomi-malaga, vfst morphology)
  3. GF Finnish
  4. UralicNLP (uses Omorfi)

Finnish Treebanks

  1. Universal Dependencies Finnish (see also: Turku dependency treebank)
  2. Universal Dependencies Finnish FTB (see also: FinnTreeBanks)

Finnish Machine Translation

  1. Apertium Finnish-English (high coverage, low quality)
  2. GF Finnish to any

North Saami

North Saami keyboards

  1. Official keyboard layout, for national and international standards

Comes with all common OSes and systems: Microsoft’s, Linux, Apple’s and Android-based.

  2. Divvun’s North Saami keyboard

Hungarian

Hungarian morphologies

  1. hunmorph

by Resource

This is the index by resource type:

Orthography

Keyboards

Corpora

Speech technology

Morphology

  • Omorfi (see also: apertium-fin, giella-fin)
  • Voikko (also: suomi-malaga, vfst morphology)
  • hunmorph
  • GF (Available: Finnish, Hungarian…)
  • UralicNLP (Available models/data: Finnish, North Saami…)

Treebanks

Machine Translation

Other references

Larger collections: