Fine-tuning NepBERTa for Nepali NER
September 30, 2024
By Priyanshu Koirala
Advancing Nepali NLP with NepBertA-NER for Named Entity Recognition
Natural Language Processing (NLP) and large language models (LLMs) have gained significant traction in recent years, helping machines understand and interpret human language more effectively. However, low-resource languages like Nepali still lag behind in data, models, resources, and research. One key challenge is the lack of a culture of contributing to public datasets and tools for free, which hinders the development of NLP models for such languages. To bridge this gap, we developed Finetuned NepBertA-NER, a Nepali Named Entity Recognition (NER) model based on NepBERTa (originally introduced in "NepBERTa: Nepali Language Model Trained in a Large Corpus").
This blog will walk you through the background of this model, how it was fine-tuned, its capabilities, how you can use it, future improvements, and its potential applications.
Background
Named Entity Recognition (NER) is a key NLP task that involves identifying and classifying entities (like people, organizations, and locations) in text. In widely spoken languages like English, state-of-the-art models for NER exist, but Nepali—spoken by millions—lacks similar resources.
This gap inspired the creation of NepBERTa, a BERT-based model designed for the Nepali language. Building on that, the Finetuned NepBertA-NER model was trained to recognize specific entities in Nepali, making it an important tool for a variety of use cases, such as:
- Extracting key information from Nepali news articles
- Analyzing customer feedback or social media posts in Nepali
- Processing legal and financial documents in Nepali
How the Model Was Fine-Tuned
The fine-tuning process of Finetuned NepBertA-NER involved adapting the pre-trained NepBERTa model to the task of Named Entity Recognition. Here’s a step-by-step breakdown of the process:
1. Dataset Preparation
The first step was preparing a custom dataset annotated in the BIO scheme with labels such as:
- B-PER (Beginning of a person’s name)
- I-PER (Inside of a person’s name)
- B-ORG (Beginning of an organization’s name)
- I-ORG (Inside of an organization’s name)
- B-LOC (Beginning of a location’s name)
- I-LOC (Inside of a location’s name)
- O (Other words that don’t belong to any entity)
The dataset was split into training and validation sets using an 80-20 ratio.
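To make the annotation format concrete, a minimal sketch of how BIO-tagged sentences and the 80-20 split might be represented is shown below. The example sentences, the in-memory (token, label) format, and the random split are illustrative assumptions, not the exact pipeline used to build the dataset.

import random

# Each example is a list of (token, BIO-label) pairs; the sentences here are illustrative only.
dataset = [
    [("काठमाडौं", "B-LOC"), ("उपत्यकाको", "O"), ("ऐतिहासिक", "O"), ("पशुपतिनाथ", "B-LOC"), ("मन्दिर", "O"), ("धेरै", "O"), ("प्रसिद्ध", "O"), ("छ", "O")],
    [("शेरबहादुर", "B-PER"), ("देउवा", "I-PER"), ("प्रधानमन्त्री", "O"), ("हुन्", "O")],
    # ... more annotated sentences
]

random.seed(42)
random.shuffle(dataset)
split = int(0.8 * len(dataset))  # 80-20 train/validation split
train_data, val_data = dataset[:split], dataset[split:]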
2. Model Setup
We used Hugging Face's AutoModelForTokenClassification to load the NepBERTa model and fine-tune it for the NER task. The model was configured to predict the seven label classes described above (six entity tags plus the O tag).
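For illustration, the setup might look like the sketch below. The base checkpoint name ("NepBERTa/NepBERTa") and the ordering of the label map are assumptions made for this example and may differ from the actual fine-tuning configuration.

from transformers import AutoTokenizer, AutoModelForTokenClassification

# The seven label classes described above; the exact index order is an assumption.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Load the pre-trained NepBERTa checkpoint with a fresh token-classification head.
tokenizer = AutoTokenizer.from_pretrained("NepBERTa/NepBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "NepBERTa/NepBERTa",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)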
3. Training
The training was conducted using the AdamW optimizer with a learning rate of 5e-5. The model was trained for 5 epochs on a GPU to keep training time manageable. During training, batches of tokenized text were fed to the model, and the loss was computed between the predicted and true entity labels.
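A simplified version of such a training loop is sketched below. It assumes a train_dataloader (hypothetical here) that yields batches already containing input_ids, attention_mask, and labels tensors, and that the model was set up as in the previous step; the actual training script may instead use the Hugging Face Trainer.

import torch
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 5

model.train()
for epoch in range(num_epochs):
    total_loss = 0.0
    for batch in train_dataloader:  # assumed DataLoader of tokenized, labeled batches
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)    # the `labels` field lets the model return a token-classification loss
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"epoch {epoch + 1}: mean loss {total_loss / len(train_dataloader):.4f}")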
4. Validation
After each epoch, the model’s performance was validated using a validation set. The evaluation metrics included precision, recall, and F1-score to ensure that the model was effectively recognizing entities.
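One way to compute entity-level precision, recall, and F1 is with the seqeval library. The snippet below is a sketch under the common convention that padded and special-token positions are labeled -100, and it assumes a val_dataloader analogous to the training loader; the actual evaluation code may differ.

import torch
from seqeval.metrics import precision_score, recall_score, f1_score

model.eval()
true_labels, pred_labels = [], []
with torch.no_grad():
    for batch in val_dataloader:  # assumed validation DataLoader
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(**batch).logits
        preds = logits.argmax(dim=-1)
        for pred_row, label_row in zip(preds, batch["labels"]):
            true_seq, pred_seq = [], []
            for p, l in zip(pred_row, label_row):
                if l.item() == -100:  # skip special tokens and padding
                    continue
                true_seq.append(model.config.id2label[l.item()])
                pred_seq.append(model.config.id2label[p.item()])
            true_labels.append(true_seq)
            pred_labels.append(pred_seq)

print("precision:", precision_score(true_labels, pred_labels))
print("recall:", recall_score(true_labels, pred_labels))
print("f1:", f1_score(true_labels, pred_labels))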
Model Capabilities
The Finetuned NepBertA-NER model is capable of recognizing and classifying the following types of entities in Nepali text:
- Persons (PER): Names of people
- Organizations (ORG): Names of companies, institutions, and other entities
- Locations (LOC): Names of places, such as cities, countries, and landmarks
For example, when given the sentence:
काठमाडौं उपत्यकाको ऐतिहासिक पशुपतिनाथ मन्दिर धेरै प्रसिद्ध छ।
The model will output:
- पशुपतिनाथ -> Location (LOC)
- काठमाडौं -> Location (LOC)
This enables the model to assist in a range of NLP applications, including document parsing, information extraction, and sentiment analysis in the Nepali language.
How to Use the Model
You can easily use Finetuned NepBertA-NER through Hugging Face’s transformers library. Below is an example of how to load and use the model in your own projects.
Step 1: Load the Model and Tokenizer
First, load the pre-trained model and tokenizer.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
# Set up the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the pre-trained model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
tokenizer = AutoTokenizer.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
model.to(device)
Step 2: Define a Function for Chunked Predictions
For large texts, it’s good practice to process them in chunks. Here’s a function to predict NER tags for chunked text:
def predict_ner_chunked(text, model, tokenizer, device, max_length=512):
    model.eval()
    words = text.split()
    ner_results = []
    # Process the text in chunks of at most max_length words. Note that a chunk of
    # max_length words can still exceed max_length sub-word tokens, in which case
    # the tokenizer truncates it; use a smaller chunk size if that is a concern.
    for i in range(0, len(words), max_length):
        chunk = ' '.join(words[i:i + max_length])
        tokens = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True, max_length=max_length)
        tokens = {k: v.to(device) for k, v in tokens.items()}
        with torch.no_grad():
            outputs = model(**tokens)
        predictions = torch.argmax(outputs.logits, dim=2)
        predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]
        chunk_words = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
        for word, label in zip(chunk_words, predicted_labels):
            # Keep every entity tag (including I-ORG and I-LOC) and drop special tokens.
            if label in ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"] and word not in ["[CLS]", "[SEP]", "[PAD]"]:
                ner_results.append((word, label))
    return ner_results
Step 3: Use the Model
Feed Nepali text into the function and get the entities:
text = "सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"
ner_results = predict_ner_chunked(text, model, tokenizer, device)
print(ner_results)
The output will look similar to:
[('शेरबहादुर', 'B-PER'), ('देउवा', 'I-PER'), ('नेपाल', 'B-LOC')]
This allows you to easily extract key entities from Nepali text for downstream tasks such as document analysis or sentiment analysis.
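Because the function returns tokenizer wordpieces, a single entity word can occasionally come back as several sub-tokens. If that happens, a small post-processing step can stitch them back together; the helper below is a sketch that assumes the BERT-style "##" continuation marker used by WordPiece tokenizers.

def merge_wordpieces(ner_results):
    # Merge '##'-prefixed continuation pieces into the preceding token, keeping its label.
    merged = []
    for token, label in ner_results:
        if token.startswith("##") and merged:
            prev_token, prev_label = merged[-1]
            merged[-1] = (prev_token + token[2:], prev_label)
        else:
            merged.append((token, label))
    return merged

print(merge_wordpieces(ner_results))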
Future Work
While the Finetuned NepBertA-NER model performs well for recognizing named entities in Nepali, there is always room for improvement. Here are a few areas that could enhance the model further:
- Larger, More Diverse Dataset: Currently, the model is trained on a custom dataset, but expanding this dataset to include more varied domains, such as medical, legal, and financial texts, could improve the model's versatility.
- Addition of More Entity Types: The current model focuses on identifying persons, organizations, and locations. Adding more entity types (such as dates, monetary values, and geopolitical entities) would enhance the model's utility in a wider range of applications.
- Cross-Lingual Training: Incorporating training data from other similar languages, like Hindi, could make the model more robust, especially when dealing with ambiguous or unseen entities.
- Leveraging More Advanced Architectures: Exploring newer transformer architectures like DeBERTa or RoBERTa, or even hybrid approaches using sequence-to-sequence models for NER, could potentially yield better results and increase the model's overall accuracy.
Conclusion
The Finetuned NepBertA-NER model is a step forward in enhancing natural language processing capabilities for the Nepali language. By providing a robust way to recognize and categorize named entities, it opens doors to various applications, from text analysis and information extraction to sentiment analysis and legal document processing. If you are working with Nepali text, this model can serve as a valuable tool to simplify and streamline entity recognition.
We encourage researchers, developers, and enthusiasts to explore and use this model in their projects, contribute to its improvement, and expand its capabilities by sharing data and feedback. The model is available on Hugging Face: Finetuned NepBertA-NER, and we look forward to seeing how it evolves with community support and more fine-tuning.