Project documentation

Vidhia Tech AI

By Balwinder Kaur

Nonprofit AI heritage-language preservation — Punjabi (Gurmukhi) OCR, digitization, and AI audiobook generation from public domain works.

The problem

No indexed digital Punjabi corpus exists. Public domain Punjabi literature — centuries of poetry, fiction, and heritage texts — sits in scanned image PDFs that are unsearchable, unindexed, and unusable for any AI or LLM application. This single blockage prevents every downstream use case: heritage search, AI audiobook generation, language-model training data, modern publishing.

The downstream pain hits two distinct groups: diaspora heritage seekers, who cannot find audio versions of classic Punjabi works, struggle to read Gurmukhi script, and often simply give up; and researchers and authors, who cannot run even basic searches across the existing corpus.

The solution

Vidhia Tech AI ships a full pipeline: PDF ingest → Gurmukhi OCR (Gemini Vision) → dictionary-augmented correction → human-in-the-loop validation → indexed searchable text → TTS audiobook generation (Gemini 2.5 TTS Pro).

The strategic move is foundational: by building the first large-scale digital Punjabi corpus as a public good, every downstream product (audiobooks, search, research tools, publishing services) becomes possible — and Vidhia owns the substrate.

How it works

Submission inputs: PDF of the book, title, author, format, script (Gurmukhi), language, genre, submitter contact. The pipeline runs PDF parsing → page-level OCR → page-level OCR correction → chapter-level TTS generation. Three discrete prompts drive the stages, and all are validated through manual evaluation before output.
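
The stage order above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Submission` field names, the `run_pipeline` function, and the `split_chapters` grouping step are all hypothetical, and each stage is passed in as a plain callable so that no real OCR or TTS API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Submission:
    # Hypothetical field names mirroring the submission inputs listed above.
    pdf_path: str
    title: str
    author: str
    book_format: str
    script: str
    language: str
    genre: str
    submitter_contact: str


def run_pipeline(
    submission: Submission,
    parse_pdf: Callable[[str], List[bytes]],
    ocr_page: Callable[[bytes], str],
    correct_page: Callable[[str], str],
    split_chapters: Callable[[List[str]], List[str]],
    synthesize_chapter: Callable[[str, str], bytes],
) -> List[bytes]:
    """Run the stages in order and return one audio blob per chapter."""
    # Stage 1: PDF parsing, producing one image per page.
    page_images = parse_pdf(submission.pdf_path)
    # Stage 2: page-level OCR.
    raw_pages = [ocr_page(image) for image in page_images]
    # Stage 3: page-level OCR correction.
    clean_pages = [correct_page(text) for text in raw_pages]
    # Assumed helper: group corrected pages into chapter texts.
    chapters = split_chapters(clean_pages)
    # Stage 4: chapter-level TTS; genre is passed so the voice can vary.
    return [synthesize_chapter(chapter, submission.genre) for chapter in chapters]
```

Injecting the stages as callables keeps the sketch testable with simple stand-ins and leaves room for a manual review step between correction and audio generation.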

Models: PDF parsing runs on the cheapest frontier model under evaluation. Gurmukhi OCR runs on the Gemini Vision API ($1.50 per 1,000 images), which outperformed Tesseract, Indic language models, and crowd-sourced manual entry in POC testing. Punjabi TTS runs on Gemini 2.5 TTS Pro ($2–4 per book; Bird3d was prohibitively expensive by comparison). The TTS voice persona is genre-aware but secular: respectful and measured for heritage and historical texts, neutral for novels, expressive for poetry. It is never religious, political, or community-specific.
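
To make the quoted prices concrete, here is a rough per-book estimate. It is a sketch under stated assumptions: one OCR image per page, OCR and TTS costs that simply add, and no charge counted for PDF parsing or correction. The function name and the 300-page example are hypothetical; only the two price figures come from the text above.

```python
from typing import Tuple

OCR_USD_PER_1000_IMAGES = 1.50   # OCR price quoted above
TTS_USD_PER_BOOK = (2.00, 4.00)  # low and high ends of the quoted TTS range


def estimate_book_cost(page_count: int) -> Tuple[float, float]:
    """Return (low, high) USD estimates, assuming one OCR image per page."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    ocr_cost = page_count * OCR_USD_PER_1000_IMAGES / 1000
    tts_low, tts_high = TTS_USD_PER_BOOK
    return (round(ocr_cost + tts_low, 2), round(ocr_cost + tts_high, 2))


# A hypothetical 300-page book: OCR is 300 * 1.50 / 1000 = 0.45 USD.
low, high = estimate_book_cost(300)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Under these assumptions the OCR step is a small fraction of the total, so the TTS price dominates the per-book cost.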

Who it's for

Persona 1 (Phase 1): the diaspora Punjabi speaker in the USA, UK, Canada, or Australia, audiobook-native, who wants to reconnect with Punjabi literary heritage via audio but can't find any in their language. Often second or third generation, with limited Gurmukhi literacy.

Persona 2 (Phase 2): the Punjabi author who writes in Gurmukhi script and wants to reach a wider audience through audio without the cost or gatekeeping of traditional publishers. This author pays for high-quality AI audiobook conversion of their own works.

Persona 3 (Phase 3, future): Punjabi publishing houses needing scalable digitization and AI audio infrastructure — the B2B layer that becomes possible once the substrate exists.

Why it matters

The global audiobook market is growing at ~26% CAGR, and AI TTS is improving fast enough to deliver natural-sounding heritage-language audio at a fraction of the cost of human narration. For underserved languages like Punjabi — 100M+ speakers but limited digital presence — that combination is the difference between a language that participates in the AI era and one that doesn't.

The community-built positioning matters strategically: because it is built by diaspora tech professionals with cultural authenticity rather than by outside vendors, Vidhia clears the trust bar that commercial heritage-tech projects routinely fail. Zero licensing cost on public domain content, combined with a first-mover position on the digital corpus, is a moat that compounds as the corpus grows.
