AI/ML/NLP Research Track · Lesson 13 of 16
Research Project: Norwegian + Urdu Multilingual AI Assistant
This is a flagship research-style project that combines two underrepresented languages — Norwegian and Urdu — in a practical AI assistant. The goal is not just to build something that works, but to understand the specific challenges each language presents, measure performance systematically, and document findings the way a researcher would.
Why these languages? Norwegian has strong public NLP resources but is underrepresented in production AI tools relative to English. Urdu is one of the most spoken languages globally but severely underrepresented in NLP benchmarks — Roman Urdu (Urdu written in Latin script, common in informal digital communication) is almost entirely absent from most models' training data.
What you will build:
- Multilingual sentiment analysis pipeline
- Norwegian-English and English-Urdu translation
- Domain-specific text classification (immigrant support queries)
- Multilingual chatbot with language detection
- Research-quality evaluation report
Setup
```bash
pip install transformers datasets torch langdetect sacrebleu evaluate pandas tqdm sentencepiece
```
```python
from transformers import (
    pipeline, AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM, MarianMTModel, MarianTokenizer
)
from datasets import load_dataset
import langdetect
from langdetect import DetectorFactory
import pandas as pd
import torch

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

print(f"PyTorch: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
```
Phase 1: Baseline Features
Language Detection
```python
def detect_language(text: str) -> str:
    """Return an ISO 639-1 language code, or "unknown" if detection fails."""
    try:
        return langdetect.detect(text)
    except Exception:
        # langdetect raises on empty or feature-less input (e.g. digits only)
        return "unknown"

test_texts = [
    "Hva er reglene for foreldrepermisjon i Norge?",  # Norwegian
    "مجھے اپنا پاسپورٹ تجدید کرنا ہے",  # Urdu (Nastaliq script)
    "Mujhe apna passport renew karna hai",  # Roman Urdu
    "I need to renew my passport",  # English
]

for text in test_texts:
    lang = detect_language(text)
    print(f"[{lang}] {text[:50]}")
```
```text
[no] Hva er reglene for foreldrepermisjon i Norge?
[ur] مجھے اپنا پاسپورٹ تجدید کرنا ہے
[en] Mujhe apna passport renew karna hai   ← Roman Urdu detected as English (known limitation)
[en] I need to renew my passport
```
Note: Roman Urdu detection is a known challenge. Most language detectors have no Roman Urdu profile and classify it as English or another Latin-script language. Record this as a research finding.
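One way to work around the Roman Urdu limitation is a lightweight post-check on the detector's output. The sketch below is only an illustration: the word list `ROMAN_URDU_HINTS`, the `min_hits` threshold, and the `roman_ur` code are all hypothetical choices, not part of langdetect or of this project's pipeline. A real system would need a much larger lexicon or a trained classifier.

```python
import re

# Hypothetical, deliberately small list of frequent Roman Urdu function words.
ROMAN_URDU_HINTS = {
    "mujhe", "mujhay", "hai", "hain", "ka", "ki", "ke", "ko", "mein",
    "nahi", "nahin", "kya", "kaise", "chahiye", "karna", "karni", "apna", "aap",
}

def arabic_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Arabic Unicode block (U+0600 to U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)

def looks_like_roman_urdu(text: str, min_hits: int = 2) -> bool:
    """Heuristic: Latin-script text containing at least `min_hits` hint words."""
    if arabic_script_ratio(text) > 0.0:
        return False  # Urdu in its native script is handled by the detector itself
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for token in tokens if token in ROMAN_URDU_HINTS)
    return hits >= min_hits

def refine_language(text: str, detected: str) -> str:
    """Override the detector's guess with "roman_ur" when the heuristic fires."""
    if detected != "ur" and looks_like_roman_urdu(text):
        return "roman_ur"
    return detected

# Usage: pass in whatever code the language detector returned
for sample, detected in [
    ("Mujhe apna passport renew karna hai", "en"),
    ("I need to renew my passport", "en"),
]:
    print(f"[{refine_language(sample, detected)}] {sample}")
```

Requiring two or more hint words keeps a single coincidental match (for example a short word shared with another language) from triggering the override. The false-positive rate of any such list should itself be measured and reported.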
Multilingual Sentiment Analysis
```python
# XLM-RoBERTa is pretrained on many languages, including Norwegian and Urdu.
# Per its model card, this checkpoint's sentiment fine-tuning covered eight
# languages that do not include Norwegian or Urdu, so results for both rely
# on cross-lingual transfer and should be evaluated, not assumed.
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    tokenizer="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

test_sentiments = [
    "Dette er et fantastisk system!",  # Norwegian: "This is a fantastic system!"
    "Jeg er veldig frustrert over ventetiden",  # Norwegian: "I am very frustrated with the wait time"
    "یہ سروس بہت اچھی ہے",  # Urdu: "This service is very good"
    "مجھے اس نظام سے مسائل ہیں",  # Urdu: "I have problems with this system"
]

for text in test_sentiments:
    result = sentiment_pipeline(text, truncation=True, max_length=512)
    print(f"[{result[0]['label']} {result[0]['score']:.2f}] {text}")
```
Translation Pipeline
```python
class TranslationPipeline:
    def __init__(self):
        # Norwegian → English
        self.no_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-tc-big-no-en")
        self.no_en_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-no-en")
        # English → Urdu
        self.en_ur_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-ur")
        self.en_ur_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ur")
        # English → Norwegian is not loaded: opus-mt-en-ROMANCE targets Romance
        # languages, not Norwegian. Norwegian as a target requires a multi-target
        # model, which is a research limitation to document.

    def translate(self, text: str, src: str, tgt: str) -> str:
        # Select the model and tokenizer for the requested direction
        if src == "no" and tgt == "en":
            model, tok = self.no_en_model, self.no_en_tok
        elif src == "en" and tgt == "ur":
            model, tok = self.en_ur_model, self.en_ur_tok
        else:
            return f"[Translation not supported: {src}→{tgt}]"
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
        output = model.generate(**inputs, max_new_tokens=200)
        return tok.decode(output[0], skip_special_tokens=True)

translator = TranslationPipeline()

test_query = "Hva er reglene for å søke om familieinnvandring?"
# "What are the rules for applying for family immigration?"

# Norwegian → Urdu goes through English as a pivot, so errors can compound
en_translation = translator.translate(test_query, "no", "en")
ur_translation = translator.translate(en_translation, "en", "ur")

print(f"Norwegian: {test_query}")
print(f"English: {en_translation}")
print(f"Urdu: {ur_translation}")
```
Phase 2: Domain Adaptation (Immigrant Support Queries)
Build a classifier that routes queries to the right support category.
Dataset Construction
```python
# Create a small labelled dataset of immigrant support queries
# In a real project this would come from actual support ticket data
training_data = {
    "text": [
        # Norwegian queries
        "Hva er reglene for foreldrepermisjon?",
        "Jeg trenger hjelp med skattemeldingen",
        "Hvordan søker jeg om familiegjenforening?",
        "Hva koster det å fornye oppholdstillatelse?",
        "Jeg har mistet jobben og trenger hjelp",
        # English queries
        "How do I apply for parental leave?",
        "I need help with my tax return",
        "What documents do I need for family reunification?",
        "How much does it cost to renew residence permit?",
        "I lost my job and need support",
        # Urdu (Nastaliq)
        "میں اپنی اقامت کی اجازت کیسے تجدید کروں؟",
        "مجھے ٹیکس ریٹرن میں مدد چاہیے",
        "خاندانی اتحاد کے لیے کیا کاغذات درکار ہیں؟",
        # Roman Urdu
        "Mujhe residence permit renew karni hai",
        "Tax return mein madad chahiye",
        "Family reunification ke liye kya documents chahiye?",
    ],
    "category": [
        "parental_leave", "tax", "family_reunion", "residence_permit", "employment",
        "parental_leave", "tax", "family_reunion", "residence_permit", "employment",
        "residence_permit", "tax", "family_reunion",
        "residence_permit", "tax", "family_reunion",
    ]
}

df = pd.DataFrame(training_data)
print(df["category"].value_counts())
```
```python
# Use zero-shot classification for low-resource categorisation
# (we don't have enough labelled data for fine-tuning)
# Note: bart-large-mnli was trained on English NLI data, so Norwegian and Urdu
# inputs are out of distribution. Expect lower accuracy unless the query is
# translated to English first, and measure the gap.
zero_shot = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
)

candidate_labels = [
    "parental leave", "tax and finance", "family reunification",
    "residence permit", "employment and unemployment", "healthcare", "education"
]

def classify_query(text: str) -> dict:
    # The pipeline returns labels sorted by descending score
    result = zero_shot(text, candidate_labels)
    return {
        "predicted_category": result["labels"][0],
        "confidence": result["scores"][0],
        "all_scores": dict(zip(result["labels"], result["scores"]))
    }

# Test
queries = [
    "Jeg trenger hjelp med foreldrepengeordningen",
    "مجھے میری بچت پر ٹیکس کیسے ادا کرنا ہے؟",
    "Mujhe maternity leave ke baare mein jaan'na hai",
]

for q in queries:
    result = classify_query(q)
    print(f"Query: {q[:60]}")
    print(f"  → {result['predicted_category']} (confidence: {result['confidence']:.2f})")
```
Phase 3: Multilingual Chatbot
```python
class MultilingualSupportBot:
    def __init__(self):
        self.translator = TranslationPipeline()
        self.sentiment = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-xlm-roberta-base-sentiment"
        )
        self.classifier = zero_shot
        # Knowledge base in English (source of truth for the prototype;
        # check all figures against NAV and UDI before relying on them)
        self.knowledge_base = {
            "parental_leave": (
                "In Norway, parental leave totals 49 weeks at 100% pay or 59 weeks at 80% pay. "
                "Both parents are entitled to leave. The father's quota is 15 weeks."
            ),
            "residence_permit": (
                "To renew your residence permit, apply via UDI.no at least 1 month before expiry. "
                "Required documents: valid passport, documentation of accommodation, proof of income."
            ),
            "tax": (
                "Tax returns in Norway are pre-filled and sent in March. You have until April 30 to submit. "
                "If you have income from abroad, you must add it manually."
            ),
            "family_reunion": (
                "Family reunification requires the sponsor to have a valid residence permit. "
                "Processing time is typically 3-6 months. Fees vary by relationship type."
            ),
            "employment": (
                "Unemployed residents can apply for dagpenger (unemployment benefit) through NAV. "
                "You must have worked for at least 12 months in the last 2 years."
            ),
        }
        # Map zero-shot labels (exact strings from candidate_labels) to knowledge base keys
        self.key_map = {
            "parental leave": "parental_leave",
            "tax and finance": "tax",
            "family reunification": "family_reunion",
            "residence permit": "residence_permit",
            "employment and unemployment": "employment",
        }

    def respond(self, user_message: str) -> str:
        # 1. Detect language
        lang = detect_language(user_message)

        # 2. Analyse sentiment (compare case-insensitively: label casing varies by model config)
        sentiment = self.sentiment(user_message, truncation=True)[0]
        is_frustrated = sentiment["label"].lower() == "negative" and sentiment["score"] > 0.8

        # 3. Translate to English for classification if needed
        # (Urdu is passed through untranslated: TranslationPipeline has no ur→en direction)
        english_message = user_message
        if lang == "no":
            english_message = self.translator.translate(user_message, "no", "en")

        # 4. Classify query and map the predicted label to a knowledge base key
        classification = classify_query(english_message)
        kb_key = self.key_map.get(classification["predicted_category"])

        # 5. Retrieve answer from knowledge base
        if kb_key and kb_key in self.knowledge_base:
            answer_en = self.knowledge_base[kb_key]
        else:
            answer_en = "I don't have specific information about that. Please contact NAV or UDI directly."

        # 6. Add empathy if user is frustrated
        if is_frustrated:
            answer_en = "I understand this process can be stressful. " + answer_en

        # 7. Translate answer back to user's language
        if lang == "no":
            # English → Norwegian is not wired up in this prototype, so the
            # English answer is returned with a marker.
            # Production would use a Norwegian-targeted model.
            return f"[NO→EN→NO translation prototype]\n{answer_en}"
        elif lang == "ur":
            answer_ur = self.translator.translate(answer_en, "en", "ur")
            return answer_ur
        else:
            return answer_en

bot = MultilingualSupportBot()

# Test conversations
test_queries = [
    "Hva er reglene for foreldrepermisjon?",
    "I need to renew my residence permit urgently",
    "مجھے اپنے ٹیکس کے بارے میں مدد چاہیے",
]

for query in test_queries:
    print(f"\nUser: {query}")
    response = bot.respond(query)
    print(f"Bot: {response[:200]}")
```
Phase 4: Research-Style Evaluation
Translation Quality — BLEU Score
```python
from evaluate import load

sacrebleu = load("sacrebleu")

# Reference translations (ground truth from a professional translator)
# Each prediction takes a list of one or more references
references_no_en = [
    ["I need help with parental leave application"],
    ["What documents do I need for family reunification?"],
    ["How much does it cost to renew a residence permit?"],
]

hypotheses_no_en = [
    translator.translate("Jeg trenger hjelp med søknad om foreldrepermisjon", "no", "en"),
    translator.translate("Hvilke dokumenter trenger jeg for familiegjenforening?", "no", "en"),
    translator.translate("Hva koster det å fornye oppholdstillatelse?", "no", "en"),
]

# With only three sentences the score is illustrative; use a larger test set for the report
result = sacrebleu.compute(
    predictions=hypotheses_no_en,
    references=references_no_en
)
print(f"Norwegian→English BLEU: {result['score']:.2f}")

# Urdu translation quality (harder to evaluate without native speaker references)
# Document this as a limitation in the report
```
Per-Language Accuracy Report
```python
# Test classification accuracy per language
test_cases = pd.DataFrame({
    "text": [
        "Hva er reglene for foreldrepermisjon?",
        "مجھے اقامت کی اجازت تجدید کرنی ہے",
        "Mujhe residence permit renew karni hai",
        "How do I apply for unemployment benefits?",
        "Jeg har mistet jobben",
    ],
    "language": ["no", "ur", "roman_ur", "en", "no"],
    "true_category": [
        "parental_leave", "residence_permit", "residence_permit",
        "employment", "employment"
    ]
})

# Map zero-shot labels back to the dataset's category names.
# Substring matching fails for some pairs: "family reunion" is not
# a substring of "family reunification".
label_to_category = {
    "parental leave": "parental_leave",
    "tax and finance": "tax",
    "family reunification": "family_reunion",
    "residence permit": "residence_permit",
    "employment and unemployment": "employment",
}

results = []
for _, row in test_cases.iterrows():
    pred = classify_query(row["text"])
    predicted = label_to_category.get(pred["predicted_category"])
    correct = predicted == row["true_category"]
    results.append({
        "language": row["language"],
        "correct": correct,
        "confidence": pred["confidence"],
    })

results_df = pd.DataFrame(results)
print("\nAccuracy by language:")
print(results_df.groupby("language")["correct"].mean().round(2))
print("\nConfidence by language:")
print(results_df.groupby("language")["confidence"].mean().round(3))
```
Hallucination and Error Categories
```python
# Systematically document error types (for the research report)
error_categories = {
    "Roman Urdu misdetected as English": 0,
    "Low-confidence classification (< 0.5)": 0,
    "Translation quality insufficient for classification": 0,
    "Out-of-vocabulary cultural concepts": 0,
    "Sentiment false negative on indirect language": 0,
}

# Fill in from your test results (the counts below are placeholders)
error_categories["Roman Urdu misdetected as English"] = 8  # out of 10 test cases
error_categories["Low-confidence classification (< 0.5)"] = 3

print("\nError category frequencies:")
for category, count in error_categories.items():
    print(f"  {category}: {count}")
```
Deliverables
1. GitHub repo containing:
   - [ ] scripts/translate.py: translation pipeline
   - [ ] scripts/classify.py: zero-shot classification
   - [ ] scripts/chatbot.py: multilingual bot
   - [ ] notebooks/evaluation.ipynb: all evaluation results
   - [ ] data/test_cases.csv: labelled test set with ground truth
2. Evaluation report (markdown or PDF) containing:
   - [ ] BLEU scores for Norwegian→English translation (with confidence intervals)
   - [ ] Classification accuracy broken down by language (Norwegian, Urdu, Roman Urdu, English)
   - [ ] Error analysis table (error type, frequency, example, proposed fix)
   - [ ] Qualitative examples: 3 successful and 3 failed responses with explanation
   - [ ] Fairness analysis: does the system perform equally across languages?
   - [ ] Recommendations for production deployment
3. Demo:
   - [ ] Short video (3-5 minutes) walking through the chatbot in all three languages
   - [ ] Or a README with screenshots showing multilingual conversations

Key Research Findings to Document
After running your evaluation, your report should address these questions honestly:
Roman Urdu: How does the system handle queries in Urdu written in Latin script? What is the language detection accuracy? How does this affect downstream classification?
Domain coverage: Which support categories does the system handle well and which does it struggle with? Are there Norwegian-specific concepts (NAV, UDI, dagpenger) that translate poorly?
Fairness: Is the system's accuracy consistent across languages? If Norwegian queries get 90% accuracy and Urdu queries get 60%, what are the implications for deployment to an immigrant support service?
Limitations vs. production readiness: What would need to change before this system could be deployed in a real immigrant support context?
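For the fairness question above, it helps to report an interval around each per-language accuracy, because small test sets make point estimates unreliable. The sketch below uses the Wilson score interval, a standard choice for proportions. The 90% and 60% figures come from the hypothetical in the question, and the sample size of 10 queries per language is an assumption made for illustration.

```python
import math
from typing import Tuple

def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical counts: (correct, total) per language
counts = {"no": (9, 10), "ur": (6, 10)}

for language, (correct, total) in counts.items():
    low, high = wilson_interval(correct, total)
    print(f"{language}: accuracy {correct / total:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

With 10 queries per language, the intervals are roughly 0.60 to 0.98 for Norwegian and 0.31 to 0.83 for Urdu, so they overlap. A gap of 90% versus 60% on a test set this small is suggestive but not conclusive, and the report should state the sample size next to every accuracy figure.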
Documenting failures honestly is what separates research from marketing. A system that acknowledges it cannot handle Roman Urdu reliably is more trustworthy than one that claims multilingual support and silently fails.