Turkic Transliteration Suite
Web Interface for exploring Turkic language transliteration tools
Explore transliteration tools for Turkic languages across Cyrillic, Latin, and IPA representations. Navigate through the tabs below to access the different features.
Keep only sentences whose FastText language ID matches the code above (uses the lid.176 model).
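For reference, this kind of filtering can be reproduced offline with the fasttext package; a minimal sketch, where the model path, language code, and confidence threshold are illustrative rather than the app's actual values:

```python
# pip install fasttext
import fasttext

# lid.176.bin is fastText's 176-language identification model; download it
# from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def keep_matching(sentences, lang_code="kk", min_conf=0.9):
    """Keep sentences fastText labels as lang_code with confidence >= min_conf."""
    kept = []
    for s in sentences:
        labels, probs = model.predict(s.replace("\n", " "))  # predict() rejects newlines
        if labels[0] == f"__label__{lang_code}" and probs[0] >= min_conf:
            kept.append(s)
    return kept

print(keep_matching(["Қазақстан — Орталық Азиядағы мемлекет.", "Hello, world."]))
```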
Transliteration Options
Get both original + transliterated corpus files
Preview
The corpus download form exposes the following options (see the sketch below):

- Corpus Source
- Language
- Max Sentences (empty = all)
- Filter by FastText LangID
- Min Lang-ID Confidence Threshold
- Also create transliterated version
- Transliteration Format
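The corpus source behind the form isn't specified here; as an illustration, a public corpus can be streamed with the Hugging Face datasets library and capped at a maximum sentence count (the dataset name, config, and cap are placeholder assumptions):

```python
# pip install datasets
from datasets import load_dataset

MAX_SENTENCES = 10_000  # illustrative cap; omit to keep everything

# Stream so the full corpus is never downloaded up front.
# Dataset name and config are placeholders, not the app's actual source.
stream = load_dataset("wikimedia/wikipedia", "20231101.kk",
                      split="train", streaming=True)

sentences = []
for record in stream:
    # Each record is one article; treat each non-empty line as a sentence.
    sentences.extend(line.strip() for line in record["text"].splitlines() if line.strip())
    if len(sentences) >= MAX_SENTENCES:
        sentences = sentences[:MAX_SENTENCES]
        break

print(f"collected {len(sentences)} sentences")
```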
You can provide text directly in the box above, upload a file below, or both.
Training options:

- Number of tokens in the vocabulary
- SentencePiece algorithm (unigram, BPE, char, or word)
- Fraction of characters covered by the model
- Special tokens to include in the model
Click 'Train Model' to begin training
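These options correspond directly to SentencePiece's training parameters; a minimal offline sketch, where the file names and parameter values are illustrative:

```python
# pip install sentencepiece
import sentencepiece as spm

# Train on a plain-text corpus, one sentence per line (paths/values illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="turkic",        # writes turkic.model and turkic.vocab
    vocab_size=8000,              # number of tokens in the vocabulary
    model_type="unigram",         # SentencePiece algorithm: unigram, bpe, char, word
    character_coverage=0.9995,    # fraction of characters covered by the model
    user_defined_symbols=["<translit>"],  # special tokens to include (illustrative)
)

sp = spm.SentencePieceProcessor(model_file="turkic.model")
print(sp.encode("сәлем әлем", out_type=str))
```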
This form exposes the following controls:

- Input Text
- RU Mask Threshold
- Min Token Length
Or upload a text file:
If provided, the file's content replaces the text input.
Only applies when using Latin output format
This form exposes the following controls:

- Input Text
- Language
- Output Format
- Also transliterate Arabic script
Upload a file with Latin transliteration
Upload a file with IPA transliteration
Number of lines to sample (leave empty to use all)
Mutual Intelligibility Analysis
Compare how similar two language models are in their understanding of a specific Turkic language.
This tool measures how similarly two different language models understand and process the same language. You can compare any two models to see if they would be mutually intelligible in real-world applications.
How It Works:
- Select two models to compare (pre-trained from Hugging Face or local models)
- Choose a Turkic language for evaluation
- Select a similarity metric (see details below)
- Adjust the sample size for the comparison (larger samples provide more accurate results but take longer)
Results
Processing Log
Click 'Compare Models' to start analysis...
About the Similarity Metrics:
Cosine Similarity - How similarly do the models represent language? Cosine similarity measures how closely aligned the internal representations of language are between two models. It tells you whether the models "think" about the language in the same way (a sketch follows the list below).
- Values range from 0 to 1, where higher values mean greater similarity
- Above 0.9: Models are practically identical in how they understand the language
- 0.7-0.9: Very high similarity; models likely have similar training data or architecture
- 0.5-0.7: Moderate similarity; models share some common understanding
- Below 0.5: Low similarity; models have significant differences
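The app's exact procedure isn't documented here; one standard way to compare models whose embedding spaces differ is to embed the same sentences with each model and correlate the pairwise cosine-similarity structures they induce. A minimal sketch under that assumption, with placeholder model names and sample sentences:

```python
# pip install torch transformers numpy
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def embed(model_name, sentences):
    """Mean-pooled last-hidden-state embeddings, one vector per sentence."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (N, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (N, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def cosine_matrix(emb):
    """Pairwise cosine similarities between the row vectors of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

sentences = [                      # placeholder evaluation samples
    "Сәлем, қалайсың?",
    "Мен кітап оқып отырмын.",
    "Бүгін ауа райы жақсы.",
    "Ол қалаға кетті.",
]
a = cosine_matrix(embed("model-a", sentences))  # placeholder model names
b = cosine_matrix(embed("model-b", sentences))
iu = np.triu_indices_from(a, k=1)               # each sentence pair once
print(f"similarity-structure correlation: {np.corrcoef(a[iu], b[iu])[0, 1]:.3f}")
```

Note that a correlation computed this way can be negative, whereas the scale above runs from 0 to 1; the tool's metric may be normalized differently.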
Perplexity - How surprised is each model by the language? Perplexity measures how "confused" a model is by new text in a specific language. Lower perplexity means the model is less surprised by the text and therefore understands the language better (a sketch follows the list below).
- The tool measures perplexity for each model separately against the same language samples
- The model with lower perplexity has a better understanding of the language
- Significant differences in perplexity (>10%) indicate one model has a substantially better grasp of the language
- This is useful for determining which model would be more effective for applications in that language
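For reference, per-model perplexity over the same samples can be computed as the exponential of the mean token-level cross-entropy of a causal language model; a minimal sketch, with placeholder model names and sample text:

```python
# pip install torch transformers
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name, texts):
    """exp(mean next-token cross-entropy) of a causal LM over the given texts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids the model returns the mean
            # cross-entropy of predicting each next token.
            loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1                 # number of predicted tokens
        total_loss += loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)

samples = ["Сәлем, қалайсың?", "Бүгін ауа райы жақсы."]  # placeholder samples
print("model-a:", perplexity("model-a", samples))        # placeholder names
print("model-b:", perplexity("model-b", samples))
```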