Multilingual Support

Actyze supports natural language queries in 50+ languages, enabling global teams to query data in their native language without requiring English proficiency.

Overview

Multilingual semantic search is powered by transformer-based sentence embeddings that map queries from different languages into one shared vector space. A query is therefore matched to your database schemas by meaning, regardless of the language it is written in.

Key Benefits:

  • Query in your native language
  • No translation required
  • Consistent accuracy across languages
  • Global team collaboration
  • Reduced language barriers to data access

Supported Languages

European Languages (19)

  • Western: English, German, French, Spanish, Italian, Portuguese, Dutch
  • Nordic: Swedish, Danish, Norwegian, Finnish, Icelandic
  • Eastern: Polish, Russian, Czech, Bulgarian, Romanian, Greek, Ukrainian

Asian Languages (21)

  • East Asian: Chinese (Simplified), Chinese (Traditional), Japanese, Korean
  • Southeast Asian: Thai, Vietnamese, Indonesian, Malay, Tagalog (Filipino)
  • South Asian: Hindi, Bengali, Tamil, Telugu, Marathi, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Sinhala, Nepali

Middle Eastern Languages (4)

  • Arabic
  • Hebrew
  • Persian (Farsi)
  • Turkish

Other Languages (31)

Afrikaans, Albanian, Azerbaijani, Basque, Belarusian, Bosnian, Catalan, Croatian, Estonian, Galician, Georgian, Hungarian, Irish, Kazakh, Kurdish, Kyrgyz, Latvian, Lithuanian, Macedonian, Maltese, Mongolian, Serbian, Slovak, Slovenian, Somali, Swahili, Tajik, Tatar, Uzbek, Welsh, Yiddish

Total: 50+ languages

How It Works

Multilingual Embedding Model

Actyze uses paraphrase-multilingual-mpnet-base-v2, a state-of-the-art sentence transformer trained on parallel data from 50+ languages.

Technical Details:

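  • Architecture: XLM-RoBERTa base encoder, trained by distillation to reproduce the embeddings of the English paraphrase-mpnet-base-v2 model (MPNet: Microsoft's Masked and Permuted Pre-training)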
  • Embedding Dimension: 768
  • Training Data: Billions of sentence pairs across 50+ languages
  • Semantic Understanding: Captures meaning, not just keywords
  • Cross-lingual: Queries in one language match schemas in another

Process:

  1. User query in any language → Embedded into 768-dimensional vector
  2. Schema metadata → Embedded into same vector space
  3. FAISS vector search → Finds semantically similar schemas
  4. Results ranked by semantic similarity (cosine similarity, highest first)
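
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Actyze's implementation: TOY_EMBEDDINGS is a hand-written stand-in for the embedding model (3 dimensions instead of 768), the schema strings are made up for the example, and rank_schemas does a brute-force scan where the real service uses FAISS.

```python
import math

# Hypothetical stand-in for the embedding model: real vectors have 768
# dimensions and come from the model, not from a lookup table.
TOY_EMBEDDINGS = {
    "Montrez-moi les 10 meilleurs clients par revenu": [0.9, 0.1, 0.2],
    "sales.customers (revenue, customer_name)": [0.8, 0.2, 0.1],
    "hr.employees (salary, hire_date)": [0.1, 0.9, 0.3],
    "ops.servers (hostname, uptime)": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_schemas(query, schemas, embed=TOY_EMBEDDINGS.__getitem__, top_k=2):
    """Return the top_k schemas, ordered from most to least similar."""
    query_vec = embed(query)                                # step 1
    scored = [(cosine_similarity(query_vec, embed(s)), s)   # steps 2 and 3
              for s in schemas]
    scored.sort(reverse=True)                               # step 4
    return [schema for _, schema in scored[:top_k]]
```

Passing the embedder in as a parameter keeps the ranking logic independent of the model, so the toy lookup can be replaced by a real encoder without touching rank_schemas.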

Language-Agnostic Schema Matching

Your database schema (table names, column names, metadata) is typically in English, but queries can be in any supported language:

French Query: "Montrez-moi les 10 meilleurs clients par revenu"
↓ (Embedded)
↓ (Semantic Match)
English Schema: sales.customers (revenue, customer_name, total_purchases)

Generated SQL:
SELECT customer_name, SUM(revenue) as total_revenue
FROM sales.customers
GROUP BY customer_name
ORDER BY total_revenue DESC
LIMIT 10

Named Entity Recognition (NER)

For enhanced entity detection (product names, locations, dates), Actyze uses spaCy's English NER model as a lightweight supplement:

  • Model: en_core_web_md (English only)
  • Role: Extracts entities (PERSON, ORG, GPE, DATE, MONEY, etc.)
  • Impact: Improves accuracy for entity-heavy queries
  • Note: Not required for multilingual queries—the primary semantic search handles all languages

Important: While NER is English-only, it's a lightweight enhancement, not a requirement. The multilingual embedding model performs the heavy lifting and works independently across all 50+ languages.
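
One way to treat NER as an optional supplement is to load the spaCy model lazily and return an empty result when it is unavailable. The sketch below assumes that pattern; the function names are hypothetical and Actyze's actual integration may differ.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _load_ner(model_name):
    """Load the spaCy pipeline once; return None if it cannot be loaded."""
    try:
        import spacy
        return spacy.load(model_name)
    except (ImportError, OSError):
        # ImportError: spaCy is not installed.
        # OSError: the model package has not been downloaded.
        return None


def extract_entities(text, model_name="en_core_web_md"):
    """Return (text, label) pairs such as ("Amazon", "ORG").

    Returns an empty list when NER is unavailable, so callers can
    continue with semantic search alone.
    """
    nlp = _load_ner(model_name)
    if nlp is None:
        return []
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

Because the fallback is an empty list rather than an error, a deployment without the spaCy model still answers queries; it only loses the entity hints.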

Query Examples

Spanish

Query: "¿Cuáles son las ventas totales por región en 2025?"
Translation: "What are the total sales by region in 2025?"

Generated SQL:
SELECT region, SUM(sales) as total_sales
FROM sales_data
WHERE year = 2025
GROUP BY region

Chinese (Simplified)

Query: "显示过去三个月的客户增长趋势"
Translation: "Show customer growth trend for the past three months"

Generated SQL:
SELECT DATE_TRUNC('month', signup_date) as month, COUNT(*) as new_customers
FROM customers
WHERE signup_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY month
ORDER BY month

German

Query: "Zeige mir die umsatzstärksten Produkte im letzten Quartal"
Translation: "Show me the highest revenue products in the last quarter"

Generated SQL:
SELECT product_name, SUM(revenue) as total_revenue
FROM products
WHERE quarter = 'Q4'
GROUP BY product_name
ORDER BY total_revenue DESC

Japanese

Query: "先月の部門別売上を表示してください"
Translation: "Please display last month's sales by department"

Generated SQL:
SELECT department, SUM(sales) as total_sales
FROM sales_data
WHERE month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
GROUP BY department

Arabic

Query: "أظهر لي أفضل 5 عملاء حسب إجمالي الطلبات"
Translation: "Show me top 5 customers by total orders"

Generated SQL:
SELECT customer_name, COUNT(*) as total_orders
FROM orders
GROUP BY customer_name
ORDER BY total_orders DESC
LIMIT 5

Hindi

Query: "पिछले साल के राजस्व की तुलना इस साल से करें"
Translation: "Compare last year's revenue with this year"

Generated SQL:
SELECT
  EXTRACT(YEAR FROM order_date) as year,
  SUM(revenue) as total_revenue
FROM orders
WHERE EXTRACT(YEAR FROM order_date) IN (2024, 2025)
GROUP BY year
ORDER BY year

Best Practices

Metadata in Multiple Languages

For optimal accuracy, add metadata descriptions in the languages your team uses:

English Metadata:

{
  "table_name": "customers",
  "description": "Customer records including contact info and purchase history"
}

Multilingual Metadata:

{
  "table_name": "customers",
  "description": "Customer records | Registros de clientes | Dossiers clients | Kundendaten | 客户记录"
}

The multilingual embedding model will understand all language variants and match appropriately.
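
If translations are maintained per language, the combined description can be generated rather than typed by hand. The helper below is a hypothetical sketch: the " | " separator simply follows the example above, and the function name and language keys are illustrative.

```python
import json


def build_table_metadata(table_name, descriptions):
    """Join per-language descriptions into one ' | '-separated string.

    `descriptions` maps a language code to its description. Insertion
    order is kept; blank entries and exact duplicates are skipped.
    """
    parts = []
    for text in descriptions.values():
        text = text.strip()
        if text and text not in parts:
            parts.append(text)
    return {"table_name": table_name, "description": " | ".join(parts)}


metadata = build_table_metadata(
    "customers",
    {
        "en": "Customer records",
        "es": "Registros de clientes",
        "fr": "Dossiers clients",
        "de": "Kundendaten",
        "zh": "客户记录",
    },
)
# ensure_ascii=False keeps non-Latin text readable in the output.
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```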

Natural Phrasing

Users should phrase queries naturally in their language:

Good (Natural):

  • Spanish: "¿Cuántos clientes tenemos en México?"
  • French: "Quels sont nos meilleurs produits ce trimestre?"
  • German: "Wie hoch ist der Gesamtumsatz diese Woche?"

Avoid (Awkward/Literal Translation):

  • Spanish: "Mostrar cuenta de clientes donde país es México"
  • French: "Afficher produits où revenu est maximum"
  • German: "Zeigen Summe Verkäufe Woche aktuelle"

Mixed Language Queries

For technical terms or proper nouns, mixing languages is acceptable:

Spanish + English: "Muestra los pedidos de Amazon y eBay"
French + English: "Affiche les ventes de Black Friday"
German + English: "Zeige die API requests der letzten Stunde"

The model understands context and correctly interprets mixed-language queries.

Limitations

SQL Generation Language

While queries can be in any of 50+ languages, the SQL generation and LLM responses depend on your configured LLM provider:

  • Most LLMs support multilingual SQL generation
  • Generated SQL is standard SQL (language-agnostic)
  • LLM responses (explanations) may vary by model capability

Recommended: Claude Sonnet 4.5, GPT-4o, or an equivalent model for high-quality multilingual SQL generation.

Entity Detection (NER)

Named Entity Recognition is English-only, but this has minimal impact:

  • Primary semantic search works across all 50+ languages
  • NER is a lightweight enhancement for entity-heavy queries
  • Multilingual queries work accurately without NER

For advanced multilingual NER, consider:

  • Language-specific spaCy models (e.g., de_core_news_md for German)
  • Multilingual transformer NER models (e.g., xlm-roberta-large-finetuned)

Column Name Language

If your database uses non-English column names, add English metadata descriptions for better cross-language understanding:

Table: ventes (French)
Columns: nom_client, revenu_total, date_achat

Metadata:
- nom_client → "Customer name | Nombre del cliente | 客户名"
- revenu_total → "Total revenue | Ingresos totales | 总收入"
- date_achat → "Purchase date | Fecha de compra | 购买日期"

Technical Details

Embedding Model Specifications

Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Architecture:

  • Base: xlm-roberta-base encoder (student), distilled from the English paraphrase-mpnet-base-v2 model (teacher)
  • Fine-tuning: Parallel sentence pairs from 50+ languages
  • Output: 768-dimensional dense vectors

Performance:

  • Encoding Speed: ~50-100ms per query (single GPU)
  • Memory: ~420MB model size
  • Accuracy: 85-90% semantic similarity across languages

Training Data:

  • Billions of sentence pairs
  • Wikipedia, parallel corpora, web data
  • Cross-lingual alignment

Vector Search (FAISS)

Index Type: IndexFlatIP (Inner Product for cosine similarity)

Process:

  1. Normalize embeddings to unit length (L2 normalization)
  2. Compute cosine similarity via inner product
  3. Return top-k most similar schemas
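
An inner-product index yields cosine similarity because, for unit-length vectors, the two quantities are equal. The pure-Python sketch below walks through the three steps with an exhaustive scan, which is what a flat index does. It does not call FAISS, and the function names are illustrative.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (step 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Dot product; equals cosine similarity for unit vectors (step 2)."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(query, vectors, k):
    """Exhaustive top-k search by inner product (step 3).

    Returns (score, index) pairs, best first. Vectors are normalized
    here for clarity; a real index would normalize them once when
    they are added.
    """
    q = l2_normalize(query)
    scores = ((inner_product(q, l2_normalize(v)), i)
              for i, v in enumerate(vectors))
    return heapq.nlargest(k, scores)
```

The scan touches every stored vector, which is why search time grows linearly with schema count and why the result is exact rather than approximate.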

Performance:

  • Search Speed: <50ms for 10,000 schemas
  • Scalability: Linear with schema count
  • Accuracy: Exact nearest neighbor search

Language Detection

Actyze does not require explicit language detection:

  • Multilingual model handles all languages in same vector space
  • No need to specify query language
  • Automatic cross-lingual matching

Users can switch languages mid-conversation without configuration.

API Usage

Query in Any Language

Example: Spanish Query

curl -X POST https://your-actyze.com/api/query \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "Muestra los 10 productos más vendidos este mes",
"database": "postgres"
}'

Response:

{
"sql": "SELECT product_name, SUM(quantity) as units_sold FROM sales WHERE sale_date >= DATE_TRUNC('month', CURRENT_DATE) GROUP BY product_name ORDER BY units_sold DESC LIMIT 10",
"schemas_used": ["sales"],
"confidence": 0.92
}
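
The same request can be issued from Python using only the standard library. This is a sketch under the assumptions visible in the curl example (endpoint path, bearer token, and the "query" and "database" fields); the function names are hypothetical and no official client library is implied.

```python
import json
import urllib.request


def build_query_request(base_url, token, query, database):
    """Build (but do not send) the POST request from the curl example."""
    # json.dumps escapes non-ASCII characters, so any query language
    # is transmitted safely as UTF-8 encoded JSON.
    body = json.dumps({"query": query, "database": database}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send_query(request, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request(
    "https://your-actyze.com",
    "YOUR_TOKEN",
    "Muestra los 10 productos más vendidos este mes",
    "postgres",
)
# result = send_query(request)   # requires a reachable Actyze instance
# print(result["sql"], result["confidence"])
```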

Metadata in Multiple Languages

Add Multilingual Metadata:

curl -X POST https://your-actyze.com/api/metadata \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"table_name": "customers",
"description": "Customer records | Registros de clientes | Dossiers clients | 客户记录",
"tags": ["customer", "cliente", "client", "客户"]
}'

Configuration

Model Caching

The multilingual model is downloaded during container build and cached:

# From Dockerfile
RUN python3 download_models.py

Cache location: /app/model_cache/sentence_transformers/

Model size: ~420MB

Custom Models

To use a different multilingual model, update the embedder configuration:

# schema-service/app/embedder.py
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" # Default

# Alternative multilingual models:
# "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" # Faster, smaller
# "sentence-transformers/LaBSE" # More languages (109)
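
When swapping models, note that the output dimension may change: paraphrase-multilingual-MiniLM-L12-v2 produces 384-dimensional vectors, while the default model and LaBSE produce 768. The sketch below shows one way to select a model and check it against an existing index. The ACTYZE_EMBEDDING_MODEL environment variable and both function names are hypothetical and not part of Actyze's documented configuration.

```python
import os

DEFAULT_MODEL = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

# Output dimensions of the models listed above.
MODEL_DIMENSIONS = {
    DEFAULT_MODEL: 768,
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": 384,
    "sentence-transformers/LaBSE": 768,
}


def resolve_model(env=os.environ):
    """Return (model_name, dimension), falling back to the default model.

    ACTYZE_EMBEDDING_MODEL is a hypothetical variable name.
    """
    name = env.get("ACTYZE_EMBEDDING_MODEL", DEFAULT_MODEL)
    if name not in MODEL_DIMENSIONS:
        raise ValueError(f"unknown embedding model: {name}")
    return name, MODEL_DIMENSIONS[name]


def check_index_dimension(index_dimension, model_dimension):
    """Fail early if the stored index cannot hold the model's vectors."""
    if index_dimension != model_dimension:
        raise ValueError(
            f"index has {index_dimension} dimensions but the model "
            f"outputs {model_dimension}; rebuild the index"
        )
```

Even when the dimensions match, vectors produced by different models are not comparable, so schema metadata should be re-embedded after any model change.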

Troubleshooting

Query Not Matching Schemas

Issue: Query in non-English language not finding relevant tables.

Causes:

  1. Schema metadata is too technical (no business context)
  2. Query is too vague or ambiguous

Solutions:

  1. Add multilingual metadata descriptions
  2. Rephrase query more specifically
  3. Use table/column names explicitly

Poor Results in Specific Language

Issue: Queries in Language X work poorly compared to English.

Check:

  1. Is the language in the supported 50+ list?
  2. Is metadata available in that language?
  3. Are queries phrased naturally?

Improve:

  • Add metadata in the user's language
  • Provide example queries in that language
  • Consider a language-specific spaCy NER model

Mixed Results

Issue: Schema search returns irrelevant tables for multilingual queries.

Debug:

  1. Check schema metadata quality
  2. Verify embedding model is loaded correctly
  3. Test with simpler queries

API Debug:

# Check model status
curl -X GET https://your-actyze.com/api/schema-service/health \
-H "Authorization: Bearer YOUR_TOKEN"

Additional Resources

Support

For multilingual query issues:

  1. Verify query language is in supported list
  2. Check schema metadata includes business context
  3. Test with example queries in that language
  4. Review embedding model health status
  5. Contact support with specific query examples