Multilingual Support
Actyze supports natural language queries in 50+ languages, enabling global teams to query data in their native language without requiring English proficiency.
Overview
Multilingual semantic search uses transformer-based sentence embeddings that map queries from different languages into one shared vector space, so a query can be matched to your database schemas regardless of the language it is written in.
Key Benefits:
- Query in your native language
- No translation required
- Consistent accuracy across languages
- Global team collaboration
- Reduced language barriers to data access
Supported Languages
European Languages (19)
- Western: English, German, French, Spanish, Italian, Portuguese, Dutch
- Nordic: Swedish, Danish, Norwegian, Finnish, Icelandic
- Eastern: Polish, Russian, Czech, Bulgarian, Romanian, Greek, Ukrainian
Asian Languages (21)
- East Asian: Chinese (Simplified), Chinese (Traditional), Japanese, Korean
- Southeast Asian: Thai, Vietnamese, Indonesian, Malay, Tagalog (Filipino)
- South Asian: Hindi, Bengali, Tamil, Telugu, Marathi, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Sinhala, Nepali
Middle Eastern Languages (4)
- Arabic
- Hebrew
- Persian (Farsi)
- Turkish
Other Languages (15+)
Afrikaans, Albanian, Azerbaijani, Basque, Belarusian, Bosnian, Catalan, Croatian, Estonian, Galician, Georgian, Hungarian, Irish, Kazakh, Kurdish, Kyrgyz, Latvian, Lithuanian, Macedonian, Maltese, Mongolian, Serbian, Slovak, Slovenian, Somali, Swahili, Tajik, Tatar, Uzbek, Welsh, Yiddish
Total: 50+ languages
How It Works
Multilingual Embedding Model
Actyze uses paraphrase-multilingual-mpnet-base-v2, a state-of-the-art sentence transformer trained on parallel data from 50+ languages.
Technical Details:
- Architecture: XLM-RoBERTa encoder, trained by knowledge distillation from an English MPNet paraphrase model (hence "mpnet" in the model name)
- Embedding Dimension: 768
- Training Data: Billions of sentence pairs across 50+ languages
- Semantic Understanding: Captures meaning, not just keywords
- Cross-lingual: Queries in one language match schemas in another
Process:
- User query in any language → Embedded into 768-dimensional vector
- Schema metadata → Embedded into same vector space
- FAISS vector search → Finds semantically similar schemas
- Results ranked by cosine similarity (highest first)
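The four steps above can be sketched roughly as follows. This is a minimal illustration, not Actyze's actual implementation: `schema_to_text` and `search` are hypothetical helper names, the " | " flattening format is an assumption, and the snippet assumes the `sentence-transformers` and `faiss` packages are installed.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"


def schema_to_text(table: str, columns: List[str], description: str = "") -> str:
    """Flatten one table's metadata into a single string to embed (assumed format)."""
    parts = [table, ", ".join(columns)]
    if description:
        parts.append(description)
    return " | ".join(parts)


def search(query: str, schemas: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Embed the query and the schema texts, then rank schemas by cosine similarity."""
    if not schemas:
        return []
    # Imported lazily so schema_to_text works without the heavy dependencies.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    names = list(schemas)
    # normalize_embeddings=True yields unit-length vectors,
    # so the inner product below equals cosine similarity.
    schema_vecs = model.encode(
        [schemas[n] for n in names], normalize_embeddings=True, convert_to_numpy=True
    )
    index = faiss.IndexFlatIP(schema_vecs.shape[1])
    index.add(schema_vecs)
    query_vec = model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
    scores, ids = index.search(query_vec, min(k, len(names)))
    return [(names[i], float(s)) for s, i in zip(scores[0], ids[0])]
```

For example, `search("Montrez-moi les 10 meilleurs clients par revenu", {"sales.customers": schema_to_text("sales.customers", ["revenue", "customer_name"])})` would return the table name with its similarity score.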
Language-Agnostic Schema Matching
Your database schema (table names, column names, metadata) is typically in English, but queries can be in any supported language:
French Query: "Montrez-moi les 10 meilleurs clients par revenu"
↓ (Embedded)
↓ (Semantic Match)
English Schema: sales.customers (revenue, customer_name, total_purchases)
↓
Generated SQL:
SELECT customer_name, SUM(revenue) as total_revenue
FROM sales.customers
GROUP BY customer_name
ORDER BY total_revenue DESC
LIMIT 10
Named Entity Recognition (NER)
For enhanced entity detection (product names, locations, dates), Actyze uses spaCy's English NER model as a lightweight supplement:
- Model: en_core_web_md (English only)
- Role: Extracts entities (PERSON, ORG, GPE, DATE, MONEY, etc.)
- Impact: Improves accuracy for entity-heavy queries
- Note: Not required for multilingual queries; the primary semantic search handles all languages
Important: While NER is English-only, it's a lightweight enhancement, not a requirement. The multilingual embedding model performs the heavy lifting and works independently across all 50+ languages.
Query Examples
Spanish
Query: "¿Cuáles son las ventas totales por región en 2025?"
Translation: "What are the total sales by region in 2025?"
Generated SQL:
SELECT region, SUM(sales) as total_sales
FROM sales_data
WHERE year = 2025
GROUP BY region
Chinese (Simplified)
Query: "显示过去三个月的客户增长趋势"
Translation: "Show customer growth trend for the past three months"
Generated SQL:
SELECT DATE_TRUNC('month', signup_date) as month, COUNT(*) as new_customers
FROM customers
WHERE signup_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY month
ORDER BY month
German
Query: "Zeige mir die umsatzstärksten Produkte im letzten Quartal"
Translation: "Show me the highest revenue products in the last quarter"
Generated SQL:
SELECT product_name, SUM(revenue) as total_revenue
FROM products
WHERE quarter = 'Q4'
GROUP BY product_name
ORDER BY total_revenue DESC
Japanese
Query: "先月の部門別売上を表示してください"
Translation: "Please display last month's sales by department"
Generated SQL:
SELECT department, SUM(sales) as total_sales
FROM sales_data
WHERE month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
GROUP BY department
Arabic
Query: "أظهر لي أفضل 5 عملاء حسب إجمالي الطلبات"
Translation: "Show me top 5 customers by total orders"
Generated SQL:
SELECT customer_name, COUNT(*) as total_orders
FROM orders
GROUP BY customer_name
ORDER BY total_orders DESC
LIMIT 5
Hindi
Query: "पिछले साल के राजस्व की तुलना इस साल से करें"
Translation: "Compare last year's revenue with this year"
Generated SQL:
SELECT
YEAR(order_date) as year,
SUM(revenue) as total_revenue
FROM orders
WHERE YEAR(order_date) IN (2024, 2025)
GROUP BY year
ORDER BY year
Best Practices
Metadata in Multiple Languages
For optimal accuracy, add metadata descriptions in the languages your team uses:
English Metadata:
{
"table_name": "customers",
"description": "Customer records including contact info and purchase history"
}
Multilingual Metadata:
{
"table_name": "customers",
"description": "Customer records | Registros de clientes | Dossiers clients | Kundendaten | 客户记录"
}
The multilingual embedding model will understand all language variants and match appropriately.
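If translations are kept per language, the combined description can be assembled programmatically. The helper below is a hypothetical sketch: `build_table_metadata` is not part of Actyze, and the " | " separator simply mirrors the example above.

```python
from typing import Dict


def build_table_metadata(table_name: str, descriptions: Dict[str, str]) -> Dict[str, str]:
    """Join per-language descriptions into one metadata entry.

    `descriptions` maps a language code to its description; empty
    entries are skipped and insertion order is preserved.
    """
    parts = [text.strip() for text in descriptions.values() if text and text.strip()]
    return {"table_name": table_name, "description": " | ".join(parts)}


metadata = build_table_metadata(
    "customers",
    {"en": "Customer records", "es": "Registros de clientes", "de": "Kundendaten"},
)
```

Keeping translations in a dictionary makes it easier to add or remove a language later without hand-editing the combined string.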
Natural Phrasing
Users should phrase queries naturally in their language:
Good (Natural):
- Spanish: "¿Cuántos clientes tenemos en México?"
- French: "Quels sont nos meilleurs produits ce trimestre?"
- German: "Wie hoch ist der Gesamtumsatz diese Woche?"
Avoid (Awkward/Literal Translation):
- Spanish: "Mostrar cuenta de clientes donde país es México"
- French: "Afficher produits où revenu est maximum"
- German: "Zeigen Summe Verkäufe Woche aktuelle"
Mixed Language Queries
For technical terms or proper nouns, mixing languages is acceptable:
Spanish + English: "Muestra los pedidos de Amazon y eBay"
French + English: "Affiche les ventes de Black Friday"
German + English: "Zeige die API requests der letzten Stunde"
The model understands context and correctly interprets mixed-language queries.
Limitations
SQL Generation Language
While queries can be in any of 50+ languages, the SQL generation and LLM responses depend on your configured LLM provider:
- Most LLMs support multilingual SQL generation
- Generated SQL is standard SQL (language-agnostic)
- LLM responses (explanations) may vary by model capability
Recommended: Claude Sonnet 4.5, GPT-4o, or equivalent models provide excellent multilingual SQL generation quality and accuracy.
Entity Detection (NER)
Named Entity Recognition is English-only, but this has minimal impact:
- Primary semantic search works across all 50+ languages
- NER is a lightweight enhancement for entity-heavy queries
- Multilingual queries work accurately without NER
For advanced multilingual NER, consider:
- Language-specific spaCy models (e.g., de_core_news_md for German)
- Multilingual transformer NER models (e.g., xlm-roberta-large-finetuned)
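One way to wire in language-specific spaCy models is a small lookup with a multilingual fallback. This is a sketch under assumptions: the caller supplies the language code (Actyze itself does not detect language), `pick_ner_model` and `extract_entities` are hypothetical names, and each spaCy model must be downloaded separately before use.

```python
from typing import List, Tuple

# Published spaCy pipelines; extend with the languages your team uses.
SPACY_MODELS = {
    "en": "en_core_web_md",
    "de": "de_core_news_md",
    "fr": "fr_core_news_md",
    "es": "es_core_news_md",
}
# spaCy's small multilingual NER pipeline, used when no specific model is listed.
FALLBACK_MODEL = "xx_ent_wiki_sm"


def pick_ner_model(lang_code: str) -> str:
    """Return the spaCy model name for a language code such as 'de' or 'de-AT'."""
    return SPACY_MODELS.get(lang_code.lower()[:2], FALLBACK_MODEL)


def extract_entities(text: str, lang_code: str) -> List[Tuple[str, str]]:
    """Run NER with the selected model and return (text, label) pairs."""
    import spacy  # imported lazily; requires spaCy and the chosen model

    nlp = spacy.load(pick_ner_model(lang_code))
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

Note that label sets differ between spaCy pipelines (for example, some non-English models use PER and LOC rather than PERSON and GPE), so downstream code should not assume the English label names.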
Column Name Language
If your database uses non-English column names, add metadata descriptions in English (and in any other languages your team queries in) for better cross-language understanding:
Table: ventes (French)
Columns: nom_client, revenu_total, date_achat
Metadata:
- nom_client → "Customer name | Nombre del cliente | 客户名"
- revenu_total → "Total revenue | Ingresos totales | 总收入"
- date_achat → "Purchase date | Fecha de compra | 购买日期"
Technical Details
Embedding Model Specifications
Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Architecture:
- Base: XLM-RoBERTa (base size), distilled from the English paraphrase-mpnet-base-v2 teacher model
- Fine-tuning: Parallel sentence pairs from 50+ languages
- Output: 768-dimensional dense vectors
Performance:
- Encoding Speed: ~50-100ms per query (single GPU)
- Memory: ~1.1GB model size (about 278M parameters in float32)
- Accuracy: 85-90% semantic similarity across languages
Training Data:
- Billions of sentence pairs
- Wikipedia, parallel corpora, web data
- Cross-lingual alignment
Vector Search (FAISS)
Index Type: IndexFlatIP (Inner Product for cosine similarity)
Process:
- Normalize embeddings to unit length (L2 normalization)
- Compute cosine similarity via inner product
- Return top-k most similar schemas
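To see why an inner-product index gives cosine similarity once vectors are normalized, here is a dependency-free sketch of what an exact flat search computes. It is illustrative only: the function names are hypothetical, and FAISS performs the same arithmetic far more efficiently.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_k_inner_product(
    query: Sequence[float], rows: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Exact search: score every row against the query and keep the k best."""
    q = l2_normalize(query)
    scored = []
    for i, row in enumerate(rows):
        r = l2_normalize(row)
        # For unit vectors, the inner product is the cosine of the angle between them.
        scored.append((i, sum(a * b for a, b in zip(q, r))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


hits = top_k_inner_product([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

Here the query's length (2.0) has no effect on the ranking, because normalization removes magnitude and leaves only direction.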
Performance:
- Search Speed: <50ms for 10,000 schemas
- Scalability: Search time grows linearly with schema count (every vector is compared)
- Accuracy: Exact nearest neighbor search
Language Detection
Actyze does not require explicit language detection:
- Multilingual model handles all languages in same vector space
- No need to specify query language
- Automatic cross-lingual matching
Users can switch languages mid-conversation without configuration.
API Usage
Query in Any Language
Example: Spanish Query
curl -X POST https://your-actyze.com/api/query \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "Muestra los 10 productos más vendidos este mes",
"database": "postgres"
}'
Response:
{
"sql": "SELECT product_name, SUM(quantity) AS units_sold FROM sales WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', CURRENT_DATE) GROUP BY product_name ORDER BY units_sold DESC LIMIT 10",
"schemas_used": ["sales"],
"confidence": 0.92
}
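The same request can be sent from Python using only the standard library. This sketch assumes the endpoint, headers, and body shown in the curl example; `build_query_request` and `run_query` are hypothetical helper names, not part of an official client.

```python
import json
import urllib.request
from typing import Any, Dict


def build_query_request(
    base_url: str, token: str, query: str, database: str
) -> urllib.request.Request:
    """Build the POST /api/query request without sending it."""
    # ensure_ascii=False keeps non-Latin queries readable; the body is sent as UTF-8.
    payload = json.dumps(
        {"query": query, "database": database}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/query",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(req: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_query_request(
    "https://your-actyze.com",
    "YOUR_TOKEN",
    "Muestra los 10 productos más vendidos este mes",
    "postgres",
)
```

Calling `run_query(req)` against a live deployment would return a dictionary with the fields shown in the response above.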
Metadata in Multiple Languages
Add Multilingual Metadata:
curl -X POST https://your-actyze.com/api/metadata \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"table_name": "customers",
"description": "Customer records | Registros de clientes | Dossiers clients | 客户记录",
"tags": ["customer", "cliente", "client", "客户"]
}'
Configuration
Model Caching
The multilingual model is downloaded during container build and cached:
# From Dockerfile
RUN python3 download_models.py
Cache location: /app/model_cache/sentence_transformers/
Model size: ~1.1GB
Custom Models
To use a different multilingual model, update the embedder configuration:
# schema-service/app/embedder.py
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" # Default
# Alternative multilingual models:
# "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" # Faster, smaller
# "sentence-transformers/LaBSE" # More languages (109)
Troubleshooting
Query Not Matching Schemas
Issue: Query in non-English language not finding relevant tables.
Causes:
- Schema metadata is too technical (no business context)
- Query is too vague or ambiguous
Solutions:
- Add multilingual metadata descriptions
- Rephrase query more specifically
- Use table/column names explicitly
Poor Results in Specific Language
Issue: Queries in Language X work poorly compared to English.
Check:
- Is the language in the supported 50+ list?
- Is metadata available in that language?
- Are queries phrased naturally?
Improve:
- Add metadata in user's language
- Provide example queries in that language
- Consider language-specific spaCy NER model
Mixed Results
Issue: Schema search returns irrelevant tables for multilingual queries.
Debug:
- Check schema metadata quality
- Verify embedding model is loaded correctly
- Test with simpler queries
API Debug:
# Check model status
curl -X GET https://your-actyze.com/api/schema-service/health \
-H "Authorization: Bearer YOUR_TOKEN"
Additional Resources
- Schema Boosting - Improve recommendations
- Metadata - Add context for better matching
- Quick Start - Try multilingual queries
- Hugging Face Model: paraphrase-multilingual-mpnet-base-v2
Support
For multilingual query issues:
- Verify query language is in supported list
- Check schema metadata includes business context
- Test with example queries in that language
- Review embedding model health status
- Contact support with specific query examples