Turkic Transliteration Suite
Web interface for exploring Turkic language transliteration tools
Explore transliteration capabilities for Turkic languages between Cyrillic, Latin, and IPA representations. Navigate through the tabs below to access different features.
Corpus Downloader: Stream sentences from public corpora (OSCAR or Wikipedia) directly in the browser. Select a source and language, optionally cap the number of sentences, and decide whether to filter by FastText language ID.
Keep only sentences whose FastText language-ID matches the code above (uses lid.176 model).
Transliteration Options
Get both original + transliterated corpus files
Preview
| Corpus Source | Language | Max Sentences (empty = all) | Filter by FastText LangID | Min Lang-ID Confidence Threshold | Also create transliterated version | Transliteration Format |
|---|---|---|---|---|---|---|
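
The language-ID filter and confidence threshold listed above can be sketched roughly as follows. This is a minimal illustration, not the suite's actual code: `filter_by_lang`, `fasttext_predictor`, and `demo_predict` are hypothetical names, the default threshold of 0.5 is an assumption, and the fastText wrapper assumes the `fasttext` package and a local copy of `lid.176.bin`.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical predictor type: sentence -> (language code, confidence in [0, 1]).
Predictor = Callable[[str], Tuple[str, float]]


def filter_by_lang(
    sentences: Iterable[str],
    predict: Predictor,
    lang: str,
    min_conf: float = 0.5,              # assumed default, not taken from the app
    max_sentences: Optional[int] = None,  # None means "keep all"
) -> Iterator[str]:
    """Yield sentences whose predicted language matches `lang` with enough confidence."""
    kept = 0
    for sentence in sentences:
        if max_sentences is not None and kept >= max_sentences:
            break
        text = sentence.strip()
        if not text:
            continue  # skip blank lines
        code, conf = predict(text)
        if code == lang and conf >= min_conf:
            kept += 1
            yield text


def fasttext_predictor(model_path: str = "lid.176.bin") -> Predictor:
    """Wrap a fastText language-ID model as a Predictor (needs the fasttext package)."""
    import fasttext  # imported lazily so the filter itself has no dependencies

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText rejects newlines in input; labels look like "__label__kk".
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def demo_predict(text: str) -> Tuple[str, float]:
    """Stand-in predictor for demonstration only: 'kk' if the text contains қ."""
    return ("kk", 0.9) if "қ" in text else ("ru", 0.8)


kept = list(filter_by_lang(["қазақ", "русский текст", "  ", "қала"], demo_predict, "kk"))
```

Passing the predictor in as a callable keeps the filtering logic independent of the model, so `fasttext_predictor()` can be swapped in for `demo_predict` when the real model is available.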
SentencePiece Training: Train your own SentencePiece model for tokenization. You can use this model for custom tokenization tasks with the Turkic Transliteration toolkit.
Provide text in the box above, or upload a file, or both
Number of tokens in the vocabulary
SentencePiece algorithm
Fraction of characters covered by the model
Optional special tokens to include in the vocabulary
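
The four settings above correspond to standard SentencePiece trainer options. The sketch below shows one way they might be passed to the `sentencepiece` Python package; `build_trainer_args` and `train_model` are hypothetical helpers, and the defaults shown are SentencePiece's own, which may differ from the defaults in this interface.

```python
from typing import Any, Dict, Optional, Sequence

VALID_MODEL_TYPES = {"unigram", "bpe", "char", "word"}


def build_trainer_args(
    input_path: str,
    model_prefix: str,
    vocab_size: int = 8000,              # number of tokens in the vocabulary
    model_type: str = "unigram",         # SentencePiece algorithm
    character_coverage: float = 0.9995,  # fraction of characters covered
    user_defined_symbols: Optional[Sequence[str]] = None,  # optional special tokens
) -> Dict[str, Any]:
    """Validate the settings and return keyword arguments for the trainer."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    if not 0.0 < character_coverage <= 1.0:
        raise ValueError("character_coverage must be in (0, 1]")
    args: Dict[str, Any] = {
        "input": input_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": character_coverage,
    }
    if user_defined_symbols:
        args["user_defined_symbols"] = list(user_defined_symbols)
    return args


def train_model(**settings: Any) -> None:
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # imported lazily; requires the sentencepiece package

    spm.SentencePieceTrainer.train(**build_trainer_args(**settings))


example_args = build_trainer_args(
    "corpus.txt", "turkic_sp", vocab_size=16000, model_type="bpe",
    user_defined_symbols=["<lang_kk>"],
)
```

Calling `train_model(input_path="corpus.txt", model_prefix="turkic_sp")` would then run the training itself, assuming `corpus.txt` holds one sentence per line.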
Direct Transliteration: Convert text directly to Latin or IPA using language-specific rules. Select the language, choose your output format, and click Transliterate.
Or upload a text file:
File content replaces text input
Only applies when using Latin output format
| Input Text | Language | Output Format | Also transliterate Arabic script |
|---|---|---|---|
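
To give a feel for what rule-based transliteration involves, here is a toy sketch of a character-level Cyrillic-to-Latin mapping. The rule table is a hypothetical handful of Kazakh letters chosen for illustration; it is not the toolkit's actual rule set, and the function name `transliterate` is likewise an assumption.

```python
from typing import Mapping

# Toy rule table: a few Kazakh Cyrillic letters mapped to Latin.
# Illustrative only; the real tool uses its own language-specific rules.
TOY_RULES = dict(zip("қазтл", "qaztl"))


def transliterate(text: str, rules: Mapping[str, str] = TOY_RULES) -> str:
    """Map each character through `rules`, preserving case and unknown characters."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in rules:
            latin = rules[lower]
            # Re-apply upper case if the source character was upper case.
            out.append(latin.upper() if ch != lower else latin)
        else:
            out.append(ch)  # digits, punctuation, unmapped letters pass through
    return "".join(out)


result = transliterate("Қазақ")
```

A real rule set would also need context-sensitive rules and multi-character outputs, which a one-to-one table like this cannot express.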
Mutual Intelligibility Analysis
Compare how similar two language models are in their understanding of a specific Turkic language.
Results
Processing Log
Click 'Compare Models' to start analysis...