A list of Nepali Dataset sources. (Hoping that it will encourage everyone to research more on Nepali language)

pemagrg1/Nepali-Datasets


Nepali-Datasets: Comprehensive NLP Resource Collection

A thoroughly verified and curated collection of Nepali datasets for NLP research, development, and benchmarking. This resource aggregates 100+ datasets across 20+ categories to encourage and support research on Nepali, a low-resource language.

NOTE: We hope this list encourages more research on the Nepali language. You are welcome to add any sources that are not listed here 📌


Benchmarks & Standards

Comprehensive evaluation frameworks and shared tasks for Nepali NLP.

  • NLUE (Nepali Language Understanding Evaluation) ✓ - 9 classification + 3 structural prediction tasks (sentiment, hate speech, toxicity, QA, NER). arXiv: 2411.19244
  • Nep-gLUE Benchmark - Official Nepali GLUE-style benchmark (7 NLU tasks). Limited direct access; see NLUE for comprehensive alternatives.
  • FLORES-101 Evaluation Benchmark - Machine translation evaluation across 101 languages including Nepali. GitHub: facebookresearch/flores
  • IndicBench - Benchmark for 11 Indic languages including Nepali (13 tasks). New 2025 addition.
  • SemEval 2026 Task 9 - Polarization type classification with Nepali data. Codabench New 2026.

Nepali Text Corpus

Large-scale text collections for language modeling, pre-training, and linguistic analysis.

Ultra-Large Corpora (>1GB)

  • Nepali-Text-Corpus (IRIISNEPAL) ✓ - 6.4M articles, 10.1 GB - Largest Nepali corpus from 99 news websites. State-of-the-art pre-training resource. HF: IRIISNEPAL/nepali-text-corpus | arXiv: 2411.15734

  • OSCAR Corpus Nepali ✓ - 3.8 GB, 100M+ sentences from Common Crawl. Kaggle: hsebarp/oscar-corpus-nepali

  • CC100-Nepali ✓ - Common Crawl 2019 subset, 200GB uncompressed. Foundation data for multilingual models. MetaText: cc100-nepali

  • Lamsal (2020) Corpus - 12M+ words, professionally compiled. Note: the original DOI returns 404; consider the IRIISNEPAL corpus as the primary substitute.
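
For corpora of this size, streaming lets you inspect a few records before committing to a multi-gigabyte download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID listed above exposes a `train` split (not verified here). The helper names `peek` and `stream_hf_dataset` are illustrative and not part of any library.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def stream_hf_dataset(repo_id, split="train"):
    """Open a Hugging Face dataset lazily instead of downloading it whole.

    Assumption: `repo_id` exists on the Hub and has a split named `split`.
    """
    # Imported here so that `peek` stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Example (needs network access; the split name is an assumption):
#   corpus = stream_hf_dataset("IRIISNEPAL/nepali-text-corpus")
#   for row in peek(corpus, 2):
#       print(sorted(row.keys()))  # inspect column names before relying on them
```

Printing the column names first is deliberate: the field that holds the article text is not documented in this list, so it is safer to look it up than to guess.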

Large Curated Collections (100MB-1GB)

Specialized Text Collections


Classification Datasets

News classification, topic modeling, and text categorization.


Named Entity Recognition (NER) Datasets

Annotated datasets for entity recognition (person, organization, location, etc.).


Sentiment Analysis & Hate Speech Datasets

Social media, news, and online text with sentiment/toxicity annotations.

Sentiment Analysis

Hate Speech & Offensive Language


Question Answering (QA) Datasets

Extractive, generative, and domain-specific QA datasets.


Summarization Datasets

Abstractive & extractive summarization, headline generation.


Speech Datasets (ASR & TTS)

Audio data for automatic speech recognition and text-to-speech synthesis.

Large-Scale ASR

TTS & Synthesized Speech

Speech Analysis & Emotion

Multilingual Benchmarks

  • Google FLEURS ✓ - Multilingual benchmark including Nepali (101 languages). HF: google/fleurs

Image & Video Datasets (Computer Vision)

Datasets for image/video captioning, object detection, and multimodal learning.

Sign Language & Gesture

Image Captioning & Multimodal

Face Recognition & Emotion

Domain-Specific Objects


OCR & Handwriting Datasets

Character recognition, document digitization, and license plate detection.

Handwriting & Character Recognition

License Plate & Vehicle Recognition

Academic OCR Research

  • Nepali Handwritten Character Recognition (Zenodo) ✓ - Research dataset with detailed annotations. Zenodo: 7472398

  • Improving Tesseract-OCR for Nepali (Zenodo) ✓ - 5,000+ images with preprocessing techniques (DOI: 10.5281/zenodo.4361896). Zenodo: 4361896


Translation Datasets

Parallel corpora for machine translation and low-resource language pairs.

Large-Scale Parallel Corpora

  • English-Nepali Parallel Corpus (Kathmandu University) ✓ - 1,800,000 sentence pairs; gold standard for EN-NE MT. Largest parallel resource. ELRA: W0077

  • Kathmandu University English-Nepali Corpus ✓ - 1.8M sentence pairs (direct source confirmation). AI4Bharat: indicnlp_catalog

Medium-Scale Corpora

Multilingual & Specialized

Historical & Shared Tasks

  • WMT19 Parallel Corpus ✓ - Shared task corpus with filtering challenge. statmt.org/wmt19

  • English - Nepali translated strings - UI/software localization strings. Note: the original link returns 503, and the TDIL-DC alternative is not a direct link; use the ELRA entry above instead.


Word Embeddings & Pre-trained Models

Pre-computed word vectors and language models with training datasets.

Word Embeddings

Large Language Models & Transformers


Lexicons, Linguistics & Resources

Linguistic resources, dictionaries, and instruction-tuned datasets.

Dictionaries & Word Lists

Morphology & Syntax

Instruction Tuning & Multilingual

  • Bactrian-X (Instruction Tuning) ✓ - Nepali included in multilingual instruction-tuning dataset (50+ languages). HF: MBZUAI/Bactrian-X

  • Aya Dataset (Instruction Tuning) ✓ - Nepali included in community-driven instruction dataset (101 languages). HF: cohere/aya_dataset


Code-Mixed & Multilingual NLP Datasets

Datasets for code-mixing, cross-lingual learning, and low-resource adaptation.


Specialized Collections & Aggregators

One-stop resources for finding related Nepali datasets.


Open Data & Government Resources

Official government datasets and open data portals.

  • Open Data Nepal ✓ - Official open data portal with 500+ government datasets (health, education, infrastructure). opendatanepal.com

  • Census Nepal ✓ - Official census data from Central Bureau of Statistics (demographic, geographic, economic). censusnepal.cbs.gov.np/results

  • Local Government of Nepal - Municipal & district government data (federal structure). Note: Original link insufficient; recommend using Open Data Nepal instead.


Tools & NLP Frameworks

Complete NLP toolkits and utilities for Nepali processing.


Research Papers & Benchmarks

Peer-reviewed publications on Nepali NLP and related work.

Recent & High-Impact (2024-2026)

  • NepaliGPT: A Generative Language Model for the Nepali Language ✓ - Recent LLM research. arXiv: 2506.16399

  • NLUE (Nepali Language Understanding Evaluation) ✓ - Comprehensive benchmark with 9 classification + 3 structural prediction tasks. arXiv: 2411.19244

  • IRIISNEPAL RoBERTa: State-of-the-art Nepali LM ✓ - 27.5 GB training corpus from 99 news sites. arXiv: 2411.15734

  • Code-Mixed Nepali-English Abuse Detection ✓ - 5k annotated code-mixed dataset. arXiv: 2504.21026

  • Nepali Transformers@NLU of Devanagari Script Languages 2025 ✓ - Transformer architectures for Devanagari. ACL: 2025.chipsal-1.36

Sentiment Analysis & Classification

  • Aspect Based Sentiment Analysis of Nepali Text Using SVM and Naive Bayes ✓ - Comparative ML approach. ResearchGate

  • An Analysis of Classification Algorithms for Nepali News ✓ - Benchmark of various classifiers. ResearchGate

  • Nepali Text Document Classification Using Deep Neural Network ✓ - Deep learning approaches. NEPJOL

  • Application of Nepali Large Language Models to Improve Sentiment ✓ - LLM applications. ACM New 2024.

NLP Tasks & Applications

  • A Machine Learning Approach to Anaphora Resolution in Nepali Language ✓ - Pronoun resolution task. IEEE

  • Nepali Image Captioning ✓ - Vision-language multimodal task. IEEE: 8947436

  • Named-Entity Based Sentiment Analysis of Nepali News Media Texts ✓ - NER + sentiment joint modeling. ACL Anthology

  • Topic Modeling for Nepali Political News ✓ - Topic analysis in news domain. IEEE: 11004776 New.

  • NepKanun: A RAG-Based Nepali Legal Assistant ✓ - RAG systems for legal domain. OpenReview New 2025.

  • Exploring NLP Challenges for Nepali ✓ - Overview of remaining challenges. Preprints: 202409.1229 New 2024.

Linguistic & Historical

  • Natural language processing for Nepali text: a review ✓ - Comprehensive NLP review. Springer

  • A Descriptive Grammar of Nepali and an Analyzed Corpus ✓ - Linguistic grammar reference. Google Books

  • Nepali Spell Checker 1.1 and the Thesaurus ✓ - Early spell checking research. Wayback: NEP05.pdf

  • Nepali Spell Checker ✓ - Earlier spell checking work. Wayback: NEP04.pdf

Research Aggregators


Ethical Considerations

  • Sentiment/Hate Speech Data: Contains potentially offensive language; bias mitigation recommended for model training
  • Social Media Data (Tweets, Instagram): May contain personal information; comply with GDPR and other applicable privacy rules when using it
  • Copyright: Wikipedia, news articles sourced responsibly; attribution recommended
  • Multilingual Data: Code-mixed datasets reflect real-world language use; social biases may be present
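
As one illustration of the personal-information point above, social media text can be masked before it is stored or shared. This is a rough sketch rather than a complete anonymization pipeline: the three regular expressions and the placeholder tokens (`<URL>`, `<EMAIL>`, `<USER>`) are assumptions chosen for the example, and they will not catch names, phone numbers, or other identifiers.

```python
import re

# Illustrative patterns only; real PII removal needs much broader coverage.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")


def scrub(text):
    """Replace URLs, e-mail addresses, and @handles with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)
    # E-mails are masked before handles so "@example" is never treated as a handle.
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = HANDLE_RE.sub("<USER>", text)
    return text
```

Masking with tokens instead of deleting the spans keeps sentence structure intact, which tends to matter for sentiment and hate speech models trained on the result.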

How to Contribute

  1. Verify Link: Test that dataset is publicly accessible
  2. Document Metadata: Include the name, size, domain, language(s), and annotation scheme
  3. Format Entry: Follow category structure with title, description, link
  4. Submit PR: To pemagrg1/Nepali-Datasets
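
The first three steps can be sketched as a small pre-submission check. This is a hypothetical helper, not tooling that ships with the repository: the `DatasetEntry` fields mirror the metadata named in step 2, the output format imitates the existing bullets, and the link check assumes the host answers `HEAD` requests (some hosts reject them, so a `False` result is only a hint to test manually).

```python
from dataclasses import dataclass
from urllib.request import Request, urlopen


@dataclass
class DatasetEntry:
    name: str
    size: str
    domain: str
    languages: list
    annotation: str
    description: str
    link: str


def link_is_reachable(url, timeout=10):
    """Step 1: True if the URL answers with a non-error status (needs network)."""
    try:
        req = Request(url, method="HEAD", headers={"User-Agent": "link-check"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError means a malformed URL.
        return False


def missing_fields(entry):
    """Step 2: return the names of metadata fields that are still empty."""
    return [key for key, value in vars(entry).items() if not value]


def format_entry(entry):
    """Step 3: render the entry as a bullet in the list's title - description. link style."""
    return f"  • {entry.name} - {entry.description.rstrip('.')}. {entry.link}"
```

A contributor would run `link_is_reachable(entry.link)` and `missing_fields(entry)` first, and paste the output of `format_entry(entry)` under the matching category only when the first returns `True` and the second returns an empty list.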

Additional Resources
