
Training punctuation model


In this repository we fine-tune an IndicBERT model (a multilingual ALBERT model trained on large-scale corpora covering 12 major Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu).

The code is linked with Wandb to monitor training in real-time, and all input data, intermediate results, and resulting checkpoints are stored in a Google Cloud Platform (GCP) bucket.

1. Data Preparation

Make changes in config.yaml based on your data, then run the data-preparation script to generate the train, test, and valid CSVs.
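The split step can be sketched as below. The config keys (`train_frac`, `test_frac`, `seed`) and output file names are illustrative assumptions, not the repository's actual schema; the real code would load them from config.yaml (e.g. with `yaml.safe_load`).

```python
import csv
import random

# Stand-in for values loaded from config.yaml; these key names
# are illustrative, not the repository's actual schema.
config = {"train_frac": 0.8, "test_frac": 0.1, "seed": 42}

def split_corpus(sentences, config):
    """Shuffle sentences and split them into train/test/valid lists."""
    rng = random.Random(config["seed"])
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * config["train_frac"])
    n_test = int(len(shuffled) * config["test_frac"])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

def write_csv(path, sentences):
    """Write one sentence per row with a 'sentence' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence"])
        for s in sentences:
            writer.writerow([s])

corpus = [f"sentence {i}" for i in range(100)]
train, test, valid = split_corpus(corpus, config)
for name, part in [("train.csv", train), ("test.csv", test), ("valid.csv", valid)]:
    write_csv(name, part)
```

A fixed seed keeps the split reproducible across runs, so the same sentences always land in the same CSV.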

Cleaning steps:

  1. normalize the text corpus
  2. tokenize the text corpus
  3. replace foreign characters (other than punctuation and numerals) with spaces

For punctuation symbols we have taken only these three: [".", ",", "?"]; this set can be changed in the configuration.
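For a Hindi corpus, the cleaning steps plus the three-symbol restriction might look like the sketch below; the exact normalizer and character ranges the repository uses are assumptions here.

```python
import re
import unicodedata

# Keep Devanagari letters and signs, ASCII digits, whitespace, and the
# three punctuation symbols; everything else is treated as foreign and
# replaced with a space. The character range is an assumption for Hindi.
FOREIGN = re.compile(r"[^\u0900-\u097F0-9\s.,?]")

def clean_line(line: str) -> str:
    text = unicodedata.normalize("NFC", line)  # 1. normalize
    text = " ".join(text.split())              # 2. whitespace tokenize
    text = FOREIGN.sub(" ", text)              # 3. replace foreign chars
    return " ".join(text.split())              # collapse repeated spaces

# "!" is not in the kept set, so it is dropped; "?" survives.
print(clean_line("नमस्ते! आप कैसे हैं?"))  # -> नमस्ते आप कैसे हैं?
```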

2. Start Training

Format of the input CSV files for training:


where the label indicates which punctuation symbol follows the corresponding word in the sentence.
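As an illustration, a punctuated sentence can be flattened into such word/label pairs; the label names used below (`blank`, `comma`, `period`, `qm`) are hypothetical and not necessarily the repository's tag set.

```python
# Map each punctuation symbol to a hypothetical label name.
PUNCT2LABEL = {",": "comma", ".": "period", "?": "qm"}

def to_word_labels(sentence: str):
    """Return (words, labels) where labels[i] names the punctuation
    symbol that follows words[i], or 'blank' if none follows."""
    words, labels = [], []
    for token in sentence.split():
        if token and token[-1] in PUNCT2LABEL:
            words.append(token[:-1])
            labels.append(PUNCT2LABEL[token[-1]])
        else:
            words.append(token)
            labels.append("blank")
    return words, labels

words, labels = to_word_labels("hello, how are you?")
# words  -> ['hello', 'how', 'are', 'you']
# labels -> ['comma', 'blank', 'blank', 'qm']
```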

To start training, change the training parameters and run the training script.

3. Inference

To run inference on sentences, check the inference file.

There is also a separate repository, indic-punct, to try already-built punctuation models for Indic languages.

4. To use our models

```shell
git clone
cd indic-punct
python setup.py bdist_wheel
pip install -e .
```

Currently (v2.0.6) we support punctuation for the following languages:

  • Hindi ('hi')
  • English ('en')
  • Gujarati ('gu')
  • Telugu ('te')
  • Marathi ('mr')
  • Kannada ('kn')
  • Punjabi ('pa')
  • Tamil ('ta')
  • Bengali ('bn')
  • Odia ('or')
  • Malayalam ('ml')
  • Assamese ('as')
```python
from punctuate.punctuate_text import Punctuation

hindi = Punctuation('hi')  # loads the Hindi model into memory
hindi.punctuate_text(["इस श्रेणी में केवल निम्नलिखित उपश्रेणी है", "मेहुल को भारत को सौंप दिया जाए"])
```