Open to new opportunities · Bengaluru, India

Prathamesh Mahale

Building the geometry of meaning.

I build agentic AI systems, LLM-powered products, semantic search, and NLP pipelines at production scale — retrieval, ranking, summarization, and content understanding across catalogs with millions of items.

stories embedded: 5M+
indexed in Hubble: 500K+
embedding cost reduced: 60%
pipeline throughput
Experience

Where I've shipped systems

Production LLM, retrieval, and NLP work — from research internships to the VP of Tech's office at Pratilipi.

  1. AI Engineer, VP of Tech's Office · Pratilipi

    Sept 2025 – Present

    Bengaluru, India

    • Built Hubble, a two-stage retrieval + reranking system over a 500K+ story catalog for OTT adaptation discovery (Netflix, Prime Video, SonyLIV). To bound latency and token cost, the LLM reranker never sees more than 100 candidates regardless of catalog size — embedding-based recall does the heavy lifting upstream. Cut analyst discovery time from 4h → 30min and drove a 2× increase in IP exploration.
    • Designed the retrieval stack to keep the expensive LLM step at the end of the funnel: Gemini 3 Flash decomposes each brief into a tag set, Qwen (1024-d) embeddings + cosine ANN narrow 500K stories to a top-1K candidate pool, a top-100 working set is sliced by cosine score, and Gemini reranks those against the original brief for the top-10. 95% top-10 relevance on internally labeled queries from IP and licensing teams.
    • Built StorySpine, a hierarchical LangGraph workflow that converts full-length stories into micro-drama adaptation plans. A supervisor agent coordinates chapter-grounded sub-agents via structured state passing — keeping each sub-agent's context window scoped to its own chapter slice while preserving narrative consistency across the full story. Two modes trade cost for quality: fast (Gemini 3.1 Flash Lite) for volume, deep (GPT-5.5 x-high) for retention-critical adaptations. Increased production throughput by 8+ adaptations/week.
    • Built batch embedding infrastructure for a 5M+ story catalog on Vertex AI Batch using Qwen (1024-d) embeddings. Moved generation from synchronous online inference to scheduled distributed batch jobs, publishing vectors as shared S3 artifacts consumed by retrieval, recommendation ranking, and diversity monitoring — eliminating duplicate recomputation across teams. Reduced embedding cost by 60% while improving downstream indexing throughput.
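The shape of the Hubble funnel above can be sketched in a few lines. This is an illustrative stand-in, not the production code: exact cosine search substitutes for the real ANN index, and `rerank_fn` stands in for the Gemini reranking call; the one invariant it demonstrates is the hard cap on what the LLM ever sees.

```python
import numpy as np

RERANK_CAP = 100  # hard bound on candidates the LLM reranker ever sees

def two_stage_retrieve(query_vec, story_vecs, rerank_fn, recall_k=1000, top_n=10):
    """Stage 1: cosine recall over the full catalog (exact search here as a
    stand-in for an ANN index). Stage 2: rerank only a capped working set,
    so latency and token cost stay bounded regardless of catalog size."""
    q = query_vec / np.linalg.norm(query_vec)
    m = story_vecs / np.linalg.norm(story_vecs, axis=1, keepdims=True)
    scores = m @ q
    pool = np.argsort(-scores)[:recall_k]      # top-1K candidate pool
    working = pool[:RERANK_CAP]                # top-100 slice by cosine score
    return rerank_fn(list(working))[:top_n]    # expensive step orders the slice
```

Because the cap is applied before the rerank call, the cost of the LLM stage is constant whether the catalog holds 500K stories or 5M.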
  2. Data Scientist Intern · Pratilipi

    Mar 2025 – Jun 2025

    Bengaluru, India

    • Implemented SVP-CF sampling for the recommendation training pipeline — a coreset-style selection method that retained representative interactions under heavy class imbalance, cutting training data volume by 60% while preserving downstream A/B test performance.
    • Refactored the Facebook ad generation pipeline for modularity, observability, and idempotency, introducing a generation cache keyed on prompt + content fingerprints. Reduced OpenAI API usage by 60–80% by eliminating redundant calls for previously generated variants, with full request/response tracing for failure inspection.
    • Re-architected the pipeline using AWS Step Functions + Metaflow DAG orchestration, decomposing sequential generation stages into parallelizable tasks with retry-aware execution and intermediate artifact persistence. Increased workflow throughput while improving observability and recovery from partial failures.
    • Contributed to a LangGraph-based story-writing agent that decomposes long-form generation into chapter-scoped sub-tasks under constrained context windows — producing narrative drafts 35× faster and 10× more cost-effectively than manual workflows.
  3. AI Research Intern · University of Arkansas, Little Rock

    Nov 2024 – Mar 2025

    Little Rock, AR

    • Improved biomedical NER under severe label imbalance using weighted cross-entropy with class-frequency-inverse weighting — lifting balanced accuracy by 4% by trading majority-class precision for minority-class recall.
    • Designed a BioBERT → student-model distillation pipeline using temperature-scaled soft logits and token-level alignment losses to preserve minority-class behavior under severe class imbalance. Reduced inference cost while improving F1 and balanced accuracy by 2–5%.
    • Integrated focal loss into the NER training loop to down-weight easy negatives — achieving state-of-the-art balanced accuracy on the task.
    • Built 2D and 3D embedding visualization workflows for BERT-family models using projection + clustering, accelerating error analysis and surfacing minority-class confusion patterns invisible in aggregate metrics.
  4. AI Research Intern · HCL Tech

    Sept 2023 – Jan 2024

    Pune, India

    • Built an OCR + LLM document-understanding pipeline to extract dimensional rules from technical hardware-manufacturing PDFs, normalizing layout variance across vendor formats before LLM extraction — improving rule-extraction accuracy by 30%.
    • Fine-tuned DistilBERT on a custom rule-extraction dataset with class-weighted training to handle imbalanced rule/non-rule sentences, achieving 98% test accuracy — chosen over a larger model to fit the inference budget required for downstream review tooling.
    • Built evaluation and visualization workflows over 10+ technical PDFs, surfacing extraction-failure modes by document section to guide manual review prioritization.
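Normalizing layout variance before LLM extraction, as in the OCR pipeline above, comes down to a set of text-repair rules. A minimal sketch with a few illustrative rules (the real rule set was vendor-specific):

```python
import re

def normalize_layout(raw_page_text: str) -> str:
    """Normalize OCR'd page text before LLM extraction: rejoin hyphenated
    line breaks, flatten layout whitespace, and unify dimension notation.
    Illustrative rules only, not the full production set."""
    text = re.sub(r"-\n(?=\w)", "", raw_page_text)            # join hyphenated words
    text = re.sub(r"\s*\n\s*", " ", text)                     # flatten line breaks
    text = re.sub(r"\s{2,}", " ", text)                       # collapse space runs
    text = re.sub(r"(\d)\s*[xX×]\s*(\d)", r"\1 x \2", text)   # unify "10x20" forms
    return text.strip()
```

Running this before the extraction prompt means the LLM sees one consistent surface form regardless of which vendor's PDF the text came from.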
Projects

Selected projects

Things I've built end-to-end: RAG, document understanding, and an LLM-powered legal drafting platform.

proj · 01

PDF-to-Rule Converter for Manufacturing Insights

in collaboration with HCL Tech

LLM/NLP pipeline that extracts and parses dimensional rules from technical PDFs into structured JSON.

  • Fine-tuned an LLM/NLP classifier with class-weighted training to handle imbalanced rule/non-rule sentences across vendor PDF formats — improving classification accuracy by 48% over baseline.
  • Built a LLaMA 3 prototype that parses extracted rules into structured JSON via constrained generation, achieving 78% label match on a 300-sample internal test set.
  • Reduced downstream manual parsing workload, enabling an estimated 4× faster review workflow for engineering teams.
Python · PyMuPDF · PyTorch · Transformers · LLaMA 3
proj · 02

LLM-powered RAG System — Airport Authority of India

Streaming RAG QA system over institutional content with quantized inference and semantic chunking.

  • Designed a hybrid retrieval pipeline combining semantic chunking, embedding search, and KMeans candidate clustering to improve precision over long institutional documents — 95% relevant-context recall on internal context-grounded QA benchmarks.
  • Built a FastAPI streaming generation API serving concurrent users at 300–500 ms p50 latency, with backpressure on a bounded inference queue.
  • Quantized transformer inference to 4-bit to fit memory budgets on commodity GPUs — reducing footprint and improving inference speed with negligible quality regression on the eval set.
FastAPI · Transformers · Hugging Face · Python
proj · 03

Legal Document Generation Engine

Hackathon-winning platform that drafts legal documents and answers legal questions through an LLM chatbot.

  • Built a legal document generation platform with parameterized templates and a typed JSON contract between frontend and backend, supporting 6+ document types and reducing drafting time by 70–80%.
  • Implemented backend document rendering with LaTeX, exposed through FastAPI with input validation against per-document-type schemas.
  • Developed an LLM-powered chatbot for legal information retrieval and assistance on the platform.
LLaMA 3 · React · TailwindCSS · FastAPI · LaTeX
Stack

What I reach for

The tools I use most when going from a fuzzy problem to a deployed system.

Languages

  • Python
  • C++

AI / ML

  • PyTorch
  • Transformers
  • LangGraph
  • LangChain
  • Hugging Face
  • TensorFlow
  • LLMs
  • RAG
  • Semantic Search
  • Recommendation Systems
  • Knowledge Distillation
  • NLP

Backend & APIs

  • FastAPI
  • Django REST Framework
  • REST APIs

Data & Analysis

  • NumPy
  • Pandas
  • Matplotlib

Cloud & MLOps

  • AWS (S3, Step Functions, Batch)
  • Vertex AI
  • Azure Blob
  • Docker
  • Git
  • Metaflow
About

A little more context

I'm a final-year AI & Data Science student at VIIT Pune, and an AI Engineer in the VP of Tech's Office at Pratilipi. I work on systems that turn unstructured content into useful structure — embeddings for catalogs in the millions, retrieval that finds the right story for the right brief, and LLM pipelines that cut human review time in half.

My instinct is to start with a measurable problem, ship a thin slice end-to-end, and then make it faster and cheaper. I care about latency, cost per call, and the quiet kind of reliability that only shows up in production.

Education

Vishwakarma Institute of Information Technology

B.Tech, Artificial Intelligence & Data Science

Aug 2022 – May 2026 · Pune, India

Selected Highlights

  • Regional Industry Summit 2024

    Project selected for representation at the summit organized by IFCCI, IGCC, and DSCI — presented to 50+ industry leaders and startup founders in GenAI, ML, and data privacy.

  • Overall Winner — Vortexa Tech Hackathon 2024

    First prize among 250+ participants. Built a Legal Document Generation Engine and a GenAI chatbot in a 12-hour hackathon.

  • Smart India Hackathon 2024 — Grand Finalist

    Grand finalist among 380 teams. Built an AI-driven legal research prototype designed to accelerate legal information retrieval.