
The Legislative Assembly of Punjab has achieved a first of its kind: it has opened all its archives, dating back to 1947, to the public in digital form.
Using a search engine, people can now easily browse the archives for debates and discussions held in the Vidhan Sabha (Legislative Assembly). The digitisation work itself was completed in 2023 under the Punjab Digital Library Project.
The search engine was developed by a consortium of the International Institute of Information Technology, Hyderabad (IIITH), Punjabi University, Patiala, and C-DAC, Noida, under the guidance of Prof. Gurpreet Lehal, Consultant, Punjabi University, and Prof. C.V. Jawahar, IIITH.
The project is an initiative of the National Language Translation Mission, Bhashini. With Punjab leading the way, the hope is that other State legislative assemblies will follow suit and move towards better, more transparent governance by giving the public simple online access to government activities.
Digital India BHASHINI, launched in 2022 by Prime Minister Narendra Modi from Gandhinagar in Gujarat, seeks to enable easy access to the internet and digital services in Indian languages, including voice-based access, and to support the creation of content in Indian languages.
As the country’s Artificial Intelligence (AI)-led language translation platform, it also aims to enable large-scale citizen engagement in building multilingual datasets through a crowd-sourcing initiative called Bhasha Daan.
IIITH, Punjab Digital Project
The objective of the Punjab Digital Library project, completed in 2023, was to preserve the cultural heritage of Punjab in digital format. “But those PDFs were not searchable images,” notes Prof. Gurpreet Lehal, adding that a single PDF often contained three languages, English, Hindi, and Punjabi, each written in its own script: Latin, Devanagari, and Gurmukhi, respectively.
“The first challenge was to develop an OCR that could recognise the appropriate script and then convert it into text with high accuracy. The next was to make them searchable so that anyone who wanted to go through historical debates could retrieve the appropriate text. For instance, if I type ‘Punjabi Suba’ – the movement that ultimately led to the creation of Haryana – in Hindi, the engine will search through the two lakh page-database and pull out all references to the movement in the three languages,” he says.
The search engine has not only widened public access to the digital archives but also made them more inclusive through integration with audiobooks.
The search engine can also handle fuzziness in the search criteria, such as similar-sounding words or variant spellings of names, says a report from IIITH.
“Suppose you are typing in ‘Prakash Singh Badal’, you could type it as ‘Parkash’ as well, and the engine will auto-correct for minor spelling errors and retrieve the correct output. Essentially, it reveals insights and fosters accountability in governance when one can search for and retrieve all topics debated by any MLA, along with their frequency of participation and so on,” states Prof. Lehal.
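The kind of spelling tolerance Prof. Lehal describes can be illustrated with a small sketch. The report does not say how the engine performs its matching, so the example below simply uses the similarity matcher in Python's standard `difflib` module as a stand-in. The function name `fuzzy_lookup`, the 0.8 similarity cutoff, and the short list of names are all assumptions made for illustration only.

```python
from difflib import get_close_matches

# Illustrative index of speaker names (an assumption, not the real archive index).
NAMES = ["Prakash Singh Badal", "Partap Singh Kairon", "Giani Zail Singh"]

def fuzzy_lookup(query, names, cutoff=0.8):
    """Return indexed names whose spelling is close to the query.

    Comparison is case-insensitive; the original spellings are returned.
    """
    # Map lowercased names back to their original spelling.
    by_lower = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[h] for h in hits]

print(fuzzy_lookup("Parkash Singh Badal", NAMES))
```

With this cutoff, the query "Parkash Singh Badal" resolves to the indexed spelling "Prakash Singh Badal", while the other two names fall below the similarity threshold. A production system handling three scripts would likely also need transliteration or phonetic matching, which this sketch does not attempt.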
Another inclusive feature is that the archives have been made accessible to the visually impaired by converting them into audiobooks.
According to Krishna Tulsyan, a researcher at IIITH and part of Bhashini’s efforts to convert Indian language books into audiobooks, “We use consortium OCR to extract Unicode text from the PDFs and then use the Bhashini TTS to convert the text to speech that can either be played on the spot in the application itself or downloaded as a reader-compatible format like MP3 or DAISY.”
“The ultimate aim is to make the legislative archives accessible in all Indian languages, so that if a debate is in Punjabi, it can be made available in, say, Marathi, to a native speaker of Marathi,” says Prof. Lehal, referring to Phase 2 of the project.
According to him, conversion of textual matter into Unicode has helped lay the foundation for all other language services such as search, translation, conversion to speech, and so on. “With Unicode conversion, integrating the search engine with Bhashini’s machine translation system will become very easy,” he adds.
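One reason Unicode text makes a convenient foundation is that each script occupies its own block of code points, so a downstream service can tell which of the three scripts a passage uses before routing it to search, translation, or speech. The sketch below is only an illustration of that idea, not the consortium's method, and it works on already-extracted text rather than on page images as the OCR must. The block ranges for Devanagari (U+0900 to U+097F) and Gurmukhi (U+0A00 to U+0A7F) come from the Unicode Standard; the function names are hypothetical.

```python
from collections import Counter

# Unicode block ranges for the two Indic scripts used in the archives.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Gurmukhi": (0x0A00, 0x0A7F),
}

def script_of(ch):
    """Return the script name for one character, or None if it is not a letter
    or sign in one of the three scripts considered here."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    # Simplification: treat ASCII letters as Latin (English) text.
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None

def dominant_script(text):
    """Return the most frequent script in the text, or None if none is found."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None

for sample in ("Punjabi Suba", "पंजाबी सूबा", "ਪੰਜਾਬੀ ਸੂਬਾ"):
    print(sample, "->", dominant_script(sample))
```

Counting characters rather than checking only the first one keeps the result stable for mixed passages, such as a Punjabi sentence that quotes an English name.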
The availability of audiobooks in any of the Indian languages will increase the digital accessibility of the archives. Additionally, integration with a Large Language Model could enable intelligent, conversational search capabilities. “Users might be able to ask questions in natural language—such as ‘What were the key discussions on agricultural reforms in the 1980s’ or ‘Compare political stances on Punjabi Suba across party lines’—and receive context-aware, summarised responses,” describes Prof. Lehal.