Turkic Transliteration Suite
Web Interface for exploring Turkic language transliteration tools
Explore transliteration tools for Turkic languages across Cyrillic, Latin, and IPA representations. Navigate through the tabs below to access the different features.
Keep only sentences whose FastText language ID matches the code above (uses the lid.176 model).
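For reference, this kind of filtering can be reproduced offline with the fasttext package; a minimal sketch, where the model path, language code, and confidence threshold are illustrative rather than the app's actual values:

```python
# pip install fasttext
import fasttext

# lid.176.bin is fastText's 176-language identification model; download it
# from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def keep_matching(sentences, lang_code="kk", min_conf=0.9):
    """Keep sentences fastText labels as lang_code with confidence >= min_conf."""
    kept = []
    for s in sentences:
        labels, probs = model.predict(s.replace("\n", " "))  # predict() rejects newlines
        if labels[0] == f"__label__{lang_code}" and probs[0] >= min_conf:
            kept.append(s)
    return kept

print(keep_matching(["Қазақстан — Орталық Азиядағы мемлекет.", "Hello, world."]))
```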
Transliteration Options
Get both original + transliterated corpus files
Preview
The corpus download form exposes the following options (see the sketch below):

- Corpus Source
- Language
- Max Sentences (empty = all)
- Filter by FastText LangID
- Min Lang-ID Confidence Threshold
- Also create transliterated version
- Transliteration Format
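The corpus source behind the form isn't specified here; as an illustration, a public corpus can be streamed with the Hugging Face datasets library and capped at a maximum sentence count (the dataset name, config, and cap are placeholder assumptions):

```python
# pip install datasets
from datasets import load_dataset

MAX_SENTENCES = 10_000  # illustrative cap; omit to keep everything

# Stream so the full corpus is never downloaded up front.
# Dataset name and config are placeholders, not the app's actual source.
stream = load_dataset("wikimedia/wikipedia", "20231101.kk",
                      split="train", streaming=True)

sentences = []
for record in stream:
    # Each record is one article; treat each non-empty line as a sentence.
    sentences.extend(line.strip() for line in record["text"].splitlines() if line.strip())
    if len(sentences) >= MAX_SENTENCES:
        sentences = sentences[:MAX_SENTENCES]
        break

print(f"collected {len(sentences)} sentences")
```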
You can provide text directly in the box above, upload a file below, or both.
Training options:

- Number of tokens in the vocabulary
- SentencePiece algorithm (unigram, BPE, char, or word)
- Fraction of characters covered by the model
- Special tokens to include in the model
Click 'Train Model' to begin training
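These options correspond directly to SentencePiece's training parameters; a minimal offline sketch, where the file names and parameter values are illustrative:

```python
# pip install sentencepiece
import sentencepiece as spm

# Train on a plain-text corpus, one sentence per line (paths/values illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="turkic",        # writes turkic.model and turkic.vocab
    vocab_size=8000,              # number of tokens in the vocabulary
    model_type="unigram",         # SentencePiece algorithm: unigram, bpe, char, word
    character_coverage=0.9995,    # fraction of characters covered by the model
    user_defined_symbols=["<translit>"],  # special tokens to include (illustrative)
)

sp = spm.SentencePieceProcessor(model_file="turkic.model")
print(sp.encode("сәлем әлем", out_type=str))
```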
This form exposes the following controls:

- Input Text
- RU Mask Threshold
- Min Token Length
Or upload a text file:
If provided, the file's content replaces the text input.
Only applies when using Latin output format
This form exposes the following controls:

- Input Text
- Language
- Output Format
- Also transliterate Arabic script
Upload a file with Latin transliteration
Upload a file with IPA transliteration
Number of lines to sample (leave empty to use all)
Mutual Intelligibility Analysis
Compare how similar two language models are in their understanding of a specific Turkic language.
This tool measures how similarly two different language models understand and process the same language. You can compare any two models to see if they would be mutually intelligible in real-world applications.
How It Works:
- Select two models to compare (pre-trained from Hugging Face or local models)
- Choose a Turkic language for evaluation
- Select a similarity metric (see details below)
- Adjust the sample size for the comparison (larger samples provide more accurate results but take longer)
Results
Processing Log
Click 'Compare Models' to start analysis...
About the Similarity Metrics:
Cosine Similarity - How similarly do the models represent language? Cosine similarity measures how closely aligned the internal representations of language are between two models. It tells you whether the models "think" about the language in the same way (a sketch follows the list below).
- Values range from 0 to 1, where higher values mean greater similarity
- Above 0.9: Models are practically identical in how they understand the language
- 0.7-0.9: Very high similarity; models likely have similar training data or architecture
- 0.5-0.7: Moderate similarity; models share some common understanding
- Below 0.5: Low similarity; models have significant differences
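The app's exact procedure isn't documented here; one standard way to compare models whose embedding spaces differ is to embed the same sentences with each model and correlate the pairwise cosine-similarity structures they induce. A minimal sketch under that assumption, with placeholder model names and sample sentences:

```python
# pip install torch transformers numpy
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def embed(model_name, sentences):
    """Mean-pooled last-hidden-state embeddings, one vector per sentence."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (N, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (N, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def cosine_matrix(emb):
    """Pairwise cosine similarities between the row vectors of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

sentences = [                      # placeholder evaluation samples
    "Сәлем, қалайсың?",
    "Мен кітап оқып отырмын.",
    "Бүгін ауа райы жақсы.",
    "Ол қалаға кетті.",
]
a = cosine_matrix(embed("model-a", sentences))  # placeholder model names
b = cosine_matrix(embed("model-b", sentences))
iu = np.triu_indices_from(a, k=1)               # each sentence pair once
print(f"similarity-structure correlation: {np.corrcoef(a[iu], b[iu])[0, 1]:.3f}")
```

Note that a correlation computed this way can be negative, whereas the scale above runs from 0 to 1; the tool's metric may be normalized differently.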
Perplexity - How surprised is each model by the language? Perplexity measures how "confused" a model is by new text in a specific language. Lower perplexity means the model is less surprised by the text and therefore understands the language better (a sketch follows the list below).
- The tool measures perplexity for each model separately against the same language samples
- The model with lower perplexity has a better understanding of the language
- Significant differences in perplexity (>10%) indicate one model has a substantially better grasp of the language
- This is useful for determining which model would be more effective for applications in that language
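For reference, per-model perplexity over the same samples can be computed as the exponential of the mean token-level cross-entropy of a causal language model; a minimal sketch, with placeholder model names and sample text:

```python
# pip install torch transformers
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name, texts):
    """exp(mean next-token cross-entropy) of a causal LM over the given texts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids the model returns the mean
            # cross-entropy of predicting each next token.
            loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1                 # number of predicted tokens
        total_loss += loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)

samples = ["Сәлем, қалайсың?", "Бүгін ауа райы жақсы."]  # placeholder samples
print("model-a:", perplexity("model-a", samples))        # placeholder names
print("model-b:", perplexity("model-b", samples))
```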