Multilingual Support

Actyze supports natural language queries in 50+ languages, enabling global teams to query data in their native language without requiring English proficiency.

Overview

Multilingual semantic search in Actyze uses transformer-based sentence embeddings that map queries from different languages into one shared vector space, so a query can be matched to your database schemas regardless of the language it is written in.

Key Benefits:

Query in your native language
No translation required
Consistent accuracy across languages
Global team collaboration
Reduced language barriers to data access

Supported Languages

European Languages (19)

Western: English, German, French, Spanish, Italian, Portuguese, Dutch
Nordic: Swedish, Danish, Norwegian, Finnish, Icelandic
Eastern: Polish, Russian, Czech, Bulgarian, Romanian, Greek, Ukrainian

Asian Languages (21)

East Asian: Chinese (Simplified), Chinese (Traditional), Japanese, Korean
Southeast Asian: Thai, Vietnamese, Indonesian, Malay, Tagalog (Filipino)
South Asian: Hindi, Bengali, Tamil, Telugu, Marathi, Urdu, Gujarati, Kannada, Malayalam, Punjabi, Sinhala, Nepali

Middle Eastern Languages (4)

Arabic
Hebrew
Persian (Farsi)
Turkish

Other Languages (15+)

Afrikaans, Albanian, Azerbaijani, Basque, Belarusian, Bosnian, Catalan, Croatian, Estonian, Galician, Georgian, Hungarian, Irish, Kazakh, Kurdish, Kyrgyz, Latvian, Lithuanian, Macedonian, Maltese, Mongolian, Serbian, Slovak, Slovenian, Somali, Swahili, Tajik, Tatar, Uzbek, Welsh, Yiddish

Total: 50+ languages

How It Works

Multilingual Embedding Model

Actyze uses paraphrase-multilingual-mpnet-base-v2, a state-of-the-art sentence transformer trained on parallel data from 50+ languages.

Technical Details:

Architecture: XLM-RoBERTa encoder, distilled from an MPNet-based English teacher model (which is where the "mpnet" in the name comes from)
Embedding Dimension: 768
Training Data: Billions of sentence pairs across 50+ languages
Semantic Understanding: Captures meaning, not just keywords
Cross-lingual: Queries in one language match schemas in another

Process:

User query in any language → Embedded into 768-dimensional vector
Schema metadata → Embedded into same vector space
FAISS vector search → Finds semantically similar schemas
Results ranked by semantic similarity (cosine similarity, highest first)
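
The four steps above can be sketched in a few lines of Python. This is only an illustration, not the Actyze implementation: `rank_schemas` and `toy_embed` are hypothetical names, the 3-dimensional vectors are hand-written stand-ins for real 768-dimensional embeddings, and `hr.employees` is an invented second schema. In a real deployment the `embed` argument would wrap the sentence-transformer model and the scoring loop would be handled by FAISS.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_schemas(
    query: str,
    schema_texts: dict[str, str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Embed the query and each schema description, then rank by similarity."""
    query_vec = embed(query)
    scored = [
        (name, cosine_similarity(query_vec, embed(text)))
        for name, text in schema_texts.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-in for the embedding model (assumption: hand-written vectors).
TOY_VECTORS = {
    "Montrez-moi les 10 meilleurs clients par revenu": (0.9, 0.1, 0.0),
    "customers: revenue, customer_name, total_purchases": (0.8, 0.2, 0.1),
    "employees: employee_name, hire_date": (0.0, 0.1, 0.9),
}


def toy_embed(text: str) -> Vector:
    return TOY_VECTORS[text]


schemas = {
    "sales.customers": "customers: revenue, customer_name, total_purchases",
    "hr.employees": "employees: employee_name, hire_date",  # hypothetical table
}
results = rank_schemas(
    "Montrez-moi les 10 meilleurs clients par revenu", schemas, toy_embed, top_k=2
)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Passing the embedding function in as a parameter keeps the ranking logic independent of any particular model, which also makes it easy to test with fixed vectors.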

Language-Agnostic Schema Matching

Your database schema (table names, column names, metadata) is typically in English, but queries can be in any supported language:

French Query: "Montrez-moi les 10 meilleurs clients par revenu"
↓ (Embedded)
↓ (Semantic Match)
English Schema: sales.customers (revenue, customer_name, total_purchases)
↓
Generated SQL:
SELECT customer_name, SUM(revenue) as total_revenue
FROM sales.customers
GROUP BY customer_name
ORDER BY total_revenue DESC
LIMIT 10

Named Entity Recognition (NER)

For enhanced entity detection (product names, locations, dates), Actyze uses spaCy's English NER model as a lightweight supplement:

Model: en_core_web_md (English only)
Role: Extracts entities (PERSON, ORG, GPE, DATE, MONEY, etc.)
Impact: Improves accuracy for entity-heavy queries
Note: Not required for multilingual queries—the primary semantic search handles all languages

Important: While NER is English-only, it's a lightweight enhancement, not a requirement. The multilingual embedding model performs the heavy lifting and works independently across all 50+ languages.
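
One way to treat NER as an optional enhancement is to fall back to an empty entity list when spaCy or its English model is missing. The sketch below assumes that approach; `load_ner` and `extract_entities` are hypothetical helper names, not part of Actyze. The spaCy calls themselves (`spacy.load`, `doc.ents`, `ent.label_`) are standard.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_ner(model_name: str = "en_core_web_md"):
    """Return a spaCy pipeline, or None if spaCy or the model is not installed."""
    try:
        import spacy  # imported lazily so the module loads without spaCy

        return spacy.load(model_name)
    except (ImportError, OSError):
        # ImportError: spaCy missing. OSError: model package not downloaded.
        return None


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity text, entity label) pairs, or [] when NER is unavailable."""
    nlp = load_ner()
    if nlp is None:
        return []
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


entities = extract_entities("Show Amazon orders from March 2025")
print(entities)
```

With this shape, the semantic search path never depends on NER: callers get extra hints when the model is present and an empty list otherwise.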

Query Examples

Spanish

Query: "¿Cuáles son las ventas totales por región en 2025?"
Translation: "What are the total sales by region in 2025?"

Generated SQL:
SELECT region, SUM(sales) as total_sales
FROM sales_data
WHERE year = 2025
GROUP BY region

Chinese (Simplified)

Query: "显示过去三个月的客户增长趋势"
Translation: "Show customer growth trend for the past three months"

Generated SQL:
SELECT DATE_TRUNC('month', signup_date) as month, COUNT(*) as new_customers
FROM customers
WHERE signup_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY month
ORDER BY month

German

Query: "Zeige mir die umsatzstärksten Produkte im letzten Quartal"
Translation: "Show me the highest revenue products in the last quarter"

Generated SQL:
SELECT product_name, SUM(revenue) as total_revenue
FROM products
WHERE quarter = 'Q4'
GROUP BY product_name
ORDER BY total_revenue DESC

Japanese

Query: "先月の部門別売上を表示してください"
Translation: "Please display last month's sales by department"

Generated SQL:
SELECT department, SUM(sales) as total_sales
FROM sales_data
WHERE month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
GROUP BY department

Arabic

Query: "أظهر لي أفضل 5 عملاء حسب إجمالي الطلبات"
Translation: "Show me top 5 customers by total orders"

Generated SQL:
SELECT customer_name, COUNT(*) as total_orders
FROM orders
GROUP BY customer_name
ORDER BY total_orders DESC
LIMIT 5

Hindi

Query: "पिछले साल के राजस्व की तुलना इस साल से करें"
Translation: "Compare last year's revenue with this year"

Generated SQL:
SELECT 
  YEAR(order_date) as year,
  SUM(revenue) as total_revenue
FROM orders
WHERE YEAR(order_date) IN (2024, 2025)
GROUP BY year
ORDER BY year

Best Practices

Metadata in Multiple Languages

For optimal accuracy, add metadata descriptions in the languages your team uses:

English Metadata:

{
  "table_name": "customers",
  "description": "Customer records including contact info and purchase history"
}

Multilingual Metadata:

{
  "table_name": "customers",
  "description": "Customer records | Registros de clientes | Dossiers clients | Kundendaten | 客户记录"
}

The multilingual embedding model will understand all language variants and match appropriately.
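
If per-language descriptions are maintained separately, a small helper can assemble the pipe-separated description shown above. This is a sketch under that assumption; `build_table_metadata` is a hypothetical name, and the language codes used as dictionary keys are only for the author's bookkeeping.

```python
import json


def build_table_metadata(table_name: str, descriptions: dict[str, str]) -> dict:
    """Join per-language descriptions into one pipe-separated string."""
    combined = " | ".join(
        text.strip() for text in descriptions.values() if text.strip()
    )
    return {"table_name": table_name, "description": combined}


metadata = build_table_metadata(
    "customers",
    {
        "en": "Customer records",
        "es": "Registros de clientes",
        "fr": "Dossiers clients",
        "de": "Kundendaten",
        "zh": "客户记录",
    },
)
# ensure_ascii=False keeps non-Latin text readable in the output
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```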

Natural Phrasing

Users should phrase queries naturally in their language:

Good (Natural):

Spanish: "¿Cuántos clientes tenemos en México?"
French: "Quels sont nos meilleurs produits ce trimestre?"
German: "Wie hoch ist der Gesamtumsatz diese Woche?"

Avoid (Awkward/Literal Translation):

Spanish: "Mostrar cuenta de clientes donde país es México"
French: "Afficher produits où revenu est maximum"
German: "Zeigen Summe Verkäufe Woche aktuelle"

Mixed Language Queries

For technical terms or proper nouns, mixing languages is acceptable:

Spanish + English: "Muestra los pedidos de Amazon y eBay"
French + English: "Affiche les ventes de Black Friday"
German + English: "Zeige die API requests der letzten Stunde"

The model understands context and correctly interprets mixed-language queries.

Limitations

SQL Generation Language

While queries can be in any of 50+ languages, the SQL generation and LLM responses depend on your configured LLM provider:

Most LLMs support multilingual SQL generation
Generated SQL is standard SQL (language-agnostic)
LLM responses (explanations) may vary by model capability

Recommended: Claude Sonnet 4.5, GPT-4o, or equivalent models provide excellent multilingual SQL generation quality and accuracy.

Entity Detection (NER)

Named Entity Recognition is English-only, but this has minimal impact:

Primary semantic search works across all 50+ languages
NER is a lightweight enhancement for entity-heavy queries
Multilingual queries work accurately without NER

For advanced multilingual NER, consider:

Language-specific spaCy models (e.g., de_core_news_md for German)
Multilingual transformer NER models (e.g., xlm-roberta-large-finetuned)

Column Name Language

If your database uses non-English column names, add English metadata descriptions for better cross-language understanding:

Table: ventes (French)
Columns: nom_client, revenu_total, date_achat

Metadata:
- nom_client → "Customer name | Nombre del cliente | 客户名"
- revenu_total → "Total revenue | Ingresos totales | 总收入"
- date_achat → "Purchase date | Fecha de compra | 购买日期"

Technical Details

Embedding Model Specifications

Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Architecture:

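Base: xlm-roberta-base (student model), trained to reproduce the embeddings of the MPNet-based paraphrase-mpnet-base-v2 teacher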
Fine-tuning: Parallel sentence pairs from 50+ languages
Output: 768-dimensional dense vectors

Performance:

Encoding Speed: ~50-100ms per query (single GPU)
Memory: ~420MB model size
Accuracy: 85-90% semantic similarity across languages

Training Data:

Billions of sentence pairs
Wikipedia, parallel corpora, web data
Cross-lingual alignment

Vector Search (FAISS)

Index Type: IndexFlatIP (Inner Product for cosine similarity)

Process:

Normalize embeddings to unit length (L2 normalization)
Compute cosine similarity via inner product
Return top-k most similar schemas

Performance:

Search Speed: <50ms for 10,000 schemas
Scalability: Search time grows linearly with schema count (a flat index compares the query against every stored vector)
Accuracy: Exact nearest neighbor search
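
The reason an inner-product index can serve cosine similarity is that the two are identical once every vector has unit length. The pure-Python sketch below illustrates that identity and the exhaustive scan a flat index performs. It is not the FAISS API: `l2_normalize` and `top_k` are hypothetical names and the 2-dimensional vectors are toy data.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Exact search: score every stored vector and keep the k best."""
    q = l2_normalize(query)
    scores = [(i, inner_product(q, l2_normalize(v))) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


stored = [[3.0, 4.0], [4.0, 3.0], [-3.0, 4.0]]
# The query has twice the length of stored[0] but the same direction,
# so after normalization its score against stored[0] is 1.0.
hits = top_k([6.0, 8.0], stored, k=2)
for index, score in hits:
    print(index, round(score, 2))
```

Because every stored vector is scored, the result is exact rather than approximate, at the cost of work proportional to the number of schemas.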

Language Detection

Actyze does not require explicit language detection:

Multilingual model handles all languages in same vector space
No need to specify query language
Automatic cross-lingual matching

Users can switch languages mid-conversation without configuration.

API Usage

Query in Any Language

Example: Spanish Query

curl -X POST https://your-actyze.com/api/query \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Muestra los 10 productos más vendidos este mes",
    "database": "postgres"
  }'

Response:

{
  "sql": "SELECT product_name, SUM(quantity) AS units_sold FROM sales WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', CURRENT_DATE) GROUP BY product_name ORDER BY units_sold DESC LIMIT 10",
  "schemas_used": ["sales"],
  "confidence": 0.92
}
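
The same request can be built from Python with only the standard library. The sketch below mirrors the curl call above (same endpoint path, headers, and body fields); `build_query_request` is a hypothetical helper, and the host and token are the placeholders from the curl example, so the actual send is left commented out.

```python
import json
import urllib.request


def build_query_request(
    base_url: str, token: str, query: str, database: str
) -> urllib.request.Request:
    """Build the POST /api/query request without sending it."""
    payload = json.dumps(
        {"query": query, "database": database}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/query",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "https://your-actyze.com",
    "YOUR_TOKEN",
    "Muestra los 10 productos más vendidos este mes",
    "postgres",
)

# Sending is commented out because the host above is a placeholder:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
#     print(result["sql"], result["confidence"])
```

Encoding the body as UTF-8 keeps accented and non-Latin query text intact on the wire.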

Metadata in Multiple Languages

Add Multilingual Metadata:

curl -X POST https://your-actyze.com/api/metadata \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "table_name": "customers",
    "description": "Customer records | Registros de clientes | Dossiers clients | 客户记录",
    "tags": ["customer", "cliente", "client", "客户"]
  }'

Configuration

Model Caching

The multilingual model is downloaded during container build and cached:

# From Dockerfile
RUN python3 download_models.py

Cache location: /app/model_cache/sentence_transformers/

Model size: ~420MB

Custom Models

To use a different multilingual model, update the embedder configuration:

# schema-service/app/embedder.py
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  # Default

# Alternative multilingual models:
# "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"  # Faster, smaller
# "sentence-transformers/LaBSE"  # More languages (109)
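
Switching models affects the stored schema vectors. The smaller MiniLM variant outputs 384-dimensional embeddings rather than 768, and even two models with the same dimension place text in different vector spaces. The sketch below is a hypothetical pre-switch check, not an Actyze feature; `index_action` is an invented name, and the dimensions are the ones published for these three models.

```python
# Output dimensions of the models named above.
EMBEDDING_DIMS = {
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": 768,
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": 384,
    "sentence-transformers/LaBSE": 768,
}


def index_action(current_model: str, new_model: str) -> str:
    """Describe what must happen to the vector index when the model changes."""
    if current_model == new_model:
        return "keep index"
    old_dim = EMBEDDING_DIMS[current_model]
    new_dim = EMBEDDING_DIMS[new_model]
    if old_dim != new_dim:
        return f"rebuild index ({old_dim} -> {new_dim} dimensions)"
    return "re-embed schemas (same dimensions, different vector space)"


default = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
print(index_action(default, "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"))
print(index_action(default, "sentence-transformers/LaBSE"))
```

In either case the existing schema embeddings should be regenerated with the new model before queries are served.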

Troubleshooting

Query Not Matching Schemas

Issue: Query in non-English language not finding relevant tables.

Causes:

Schema metadata is too technical (no business context)
Query is too vague or ambiguous

Solutions:

Add multilingual metadata descriptions
Rephrase query more specifically
Use table/column names explicitly

Poor Results in Specific Language

Issue: Queries in Language X work poorly compared to English.

Check:

Is the language in the supported 50+ list?
Is metadata available in that language?
Are queries phrased naturally?

Improve:

Add metadata in user's language
Provide example queries in that language
Consider language-specific spaCy NER model

Mixed Results

Issue: Schema search returns irrelevant tables for multilingual queries.

Debug:

Check schema metadata quality
Verify embedding model is loaded correctly
Test with simpler queries

API Debug:

# Check model status
curl -X GET https://your-actyze.com/api/schema-service/health \
  -H "Authorization: Bearer YOUR_TOKEN"

Additional Resources

Schema Boosting - Improve recommendations
Metadata - Add context for better matching
Quick Start - Try multilingual queries
Hugging Face Model: paraphrase-multilingual-mpnet-base-v2

Support

For multilingual query issues:

Verify query language is in supported list
Check schema metadata includes business context
Test with example queries in that language
Review embedding model health status
Contact support with specific query examples
