Building at the intersection of large language models, enterprise data, and real-world utility. Less hype, more shipping.
Enterprise documents are chaotic: scanned invoices with inconsistent layouts, regulatory filings buried in legalese, multi-column research papers with nested tables. SynthDoc is an LLM-powered extraction pipeline that transforms these unstructured documents into queryable, structured data. It chains Claude for semantic understanding with a custom layout parser for spatial reasoning, feeding results into Pinecone for vector-based retrieval. The system processes 10,000+ pages per run with 96.4% extraction accuracy on benchmark datasets, handling edge cases like rotated text, merged cells, and handwritten annotations through a multi-pass verification loop. Currently deployed in a Salesforce-integrated workflow where extracted contract data auto-populates opportunity records.
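A minimal sketch of what one pass of that verification loop could look like, assuming the Anthropic Python SDK; extract_fields, check_consistency, the prompt, and the invoice schema are all illustrative stand-ins, not the production pipeline:

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_fields(page_text: str, feedback: str = "") -> dict:
    """One extraction pass: ask Claude to emit the target fields as bare JSON."""
    prompt = "Extract invoice fields (vendor, date, line_items, total) as bare JSON, no prose."
    if feedback:
        prompt += f"\nFix these issues found in the previous pass: {feedback}"
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{prompt}\n\n{page_text}"}],
    )
    return json.loads(msg.content[0].text)

def check_consistency(fields: dict) -> str:
    """Toy rule-based check: line items must sum to the stated total."""
    items_sum = sum(item.get("amount", 0) for item in fields.get("line_items", []))
    if abs(items_sum - fields.get("total", 0)) > 0.01:
        return f"line_items sum to {items_sum} but total reads {fields.get('total')}"
    return ""  # empty feedback means this pass verified clean

def extract_with_verification(page_text: str, max_passes: int = 3) -> dict:
    """Re-extract with the checker's feedback until clean or out of passes."""
    feedback = ""
    for _ in range(max_passes):
        fields = extract_fields(page_text, feedback)
        feedback = check_consistency(fields)
        if not feedback:
            break
    return fields
```

The point of the loop is that the checker's feedback string rides into the next pass, so the model re-extracts with the failure spelled out rather than retrying blind.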
The best AI systems disappear into the workflow. If the user has to think about the model, you have already failed at the design level.
Operating philosophy for every experiment in this lab

Multi-agent system using Claude and tool-use to plan, execute, and self-correct complex business processes. Agents negotiate task allocation, escalate edge cases, and produce audit trails for every decision made.
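The audit-trail half of that design, reduced to a sketch; AuditedAgent and the escalation rule are hypothetical simplifications of the real negotiation logic:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class AuditEntry:
    agent: str
    action: str   # "tool_call", "escalate", "delegate", ...
    detail: dict
    timestamp: float = field(default_factory=time.time)

class AuditedAgent:
    """Routes every tool call through a shared trail so no decision goes unrecorded."""

    def __init__(self, name: str, tools: dict, trail: list):
        self.name, self.tools, self.trail = name, tools, trail

    def call_tool(self, tool: str, **kwargs):
        self.trail.append(AuditEntry(self.name, "tool_call", {"tool": tool, "args": kwargs}))
        if tool not in self.tools:
            # Edge-case path: escalate rather than guess at an unknown tool.
            self.trail.append(AuditEntry(self.name, "escalate", {"reason": f"unknown tool {tool}"}))
            raise LookupError(tool)
        return self.tools[tool](**kwargs)

trail: list = []
agent = AuditedAgent("invoice-bot", {"add": lambda a, b: a + b}, trail)
agent.call_tool("add", a=2, b=3)
print(json.dumps([asdict(e) for e in trail], indent=2))
```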
RAG-powered code review tool that ingests an entire repository, maps dependency graphs, and provides architecture-aware suggestions. Uses hybrid search with BM25 + dense embeddings for precise retrieval.
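A toy version of the hybrid retrieval step, using the rank_bm25 and sentence-transformers packages; the three documents and the alpha blend are illustrative, and the real tool runs over a full repository index:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["def parse_config(path): ...",
        "class DependencyGraph: builds the import graph",
        "README: build instructions"]

bm25 = BM25Okapi([d.lower().split() for d in docs])
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5, k: int = 2):
    """Blend lexical and semantic scores; alpha weights the dense side."""
    sparse = np.array(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() or 1.0)  # scale lexical scores to [0, 1]
    dense = doc_vecs @ encoder.encode(query, normalize_embeddings=True)
    blended = alpha * dense + (1 - alpha) * sparse
    return [docs[i] for i in np.argsort(blended)[::-1][:k]]

print(hybrid_search("where is the dependency graph built?"))
```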
Multimodal pipeline combining GPT-4V and Claude for architectural photo analysis. Extracts spatial relationships, identifies materials, and classifies design style, emitting structured JSON output.
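One way the structured-output contract might be enforced downstream, sketched with Pydantic; the PhotoAnalysis schema is an illustrative guess at the fields, not the pipeline's actual schema:

```python
from pydantic import BaseModel

class PhotoAnalysis(BaseModel):
    """Illustrative target schema the vision models are prompted to fill."""
    materials: list[str]          # e.g. ["brick", "glass", "steel"]
    design_style: str             # e.g. "brutalist"
    spatial_relations: list[str]  # e.g. ["atrium opens onto the courtyard"]

raw = '{"materials": ["brick"], "design_style": "brutalist", "spatial_relations": []}'
analysis = PhotoAnalysis.model_validate_json(raw)  # malformed model output raises here
print(analysis.design_style)
```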
Fine-tuned Mistral 7B on 15K legal and compliance documents to generate executive summaries that preserve critical clauses. Outperforms zero-shot GPT-4 on domain-specific ROUGE-L by 18%.
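For reference, this is roughly how a ROUGE-L comparison like that is computed with Google's rouge-score package; the reference and candidate strings here are invented:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "The indemnification clause in Section 9 survives termination."
candidates = {
    "fine_tuned": "Section 9's indemnification clause survives termination.",
    "zero_shot": "The contract discusses liability and ending the agreement.",
}
for name, summary in candidates.items():
    score = scorer.score(reference, summary)["rougeL"].fmeasure
    print(f"{name}: ROUGE-L F1 = {score:.3f}")
```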
Custom evaluation framework for comparing LLM outputs across accuracy, latency, cost, and hallucination rate. Runs head-to-head benchmarks with human-in-the-loop scoring and automated regression detection.
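A stripped-down sketch of the head-to-head shape, covering just accuracy, latency, and cost (hallucination scoring needs a judge model and is omitted); every name here is illustrative:

```python
import time
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalResult:
    model: str
    accuracy: float
    p50_latency_s: float
    cost_usd: float

def run_benchmark(model: str, generate, cases: list, usd_per_call: float) -> EvalResult:
    """Score one model: exact-match accuracy plus median latency and total cost."""
    hits, latencies = [], []
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = generate(prompt)
        latencies.append(time.perf_counter() - start)
        hits.append(answer.strip() == expected)  # swap in a judge model for open-ended tasks
    return EvalResult(model, mean(hits),
                      sorted(latencies)[len(latencies) // 2],
                      usd_per_call * len(cases))

# Toy usage with a stub standing in for a real API call.
cases = [("2+2?", "4"), ("capital of France?", "Paris")]
print(run_benchmark("stub", lambda p: "4" if "2+2" in p else "Paris", cases, 0.002))
```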
Git-like version control for prompt templates with A/B testing, cost tracking, and latency monitoring. Integrates with LangSmith and custom dashboards to catch regressions before they reach production.
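The git-like core of that idea fits in a few lines: content-addressed revisions with parent pointers. PromptStore is a toy, not the project's storage layer:

```python
import hashlib
import json
import time

class PromptStore:
    """Every template revision is content-addressed, like a git object."""

    def __init__(self):
        self.objects: dict = {}  # sha -> revision record
        self.heads: dict = {}    # template name -> latest sha

    def commit(self, name: str, template: str, note: str = "") -> str:
        sha = hashlib.sha256(template.encode()).hexdigest()[:12]
        self.objects[sha] = {"name": name, "template": template, "note": note,
                             "parent": self.heads.get(name), "ts": time.time()}
        self.heads[name] = sha
        return sha

    def log(self, name: str):
        """Walk the parent chain from the head, newest first."""
        sha = self.heads.get(name)
        while sha:
            yield sha, self.objects[sha]["note"]
            sha = self.objects[sha]["parent"]

store = PromptStore()
store.commit("summarize", "Summarize:\n{doc}", "baseline")
store.commit("summarize", "Summarize, keep critical clauses:\n{doc}", "preserve clauses")
print(json.dumps(list(store.log("summarize")), indent=2))
```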
My research interests sit at the boundary where large language models meet messy, real-world enterprise data. I am particularly drawn to problems where off-the-shelf solutions fall short and custom pipelines are the only path to production-grade reliability. Three threads I keep pulling on:
Retrieval-Augmented Generation at scale. Most RAG demos work on a handful of documents. I focus on what breaks when you point the same architecture at 50,000 PDFs with inconsistent formatting, mixed languages, and no clean metadata. Chunking strategy, re-ranking, and hybrid search become the real engineering challenges.
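One of those knobs, sliding-window chunking with overlap, in sketch form; the sizes are placeholders you would tune per corpus:

```python
def chunk_with_overlap(text: str, chunk_chars: int = 1200, overlap: int = 200):
    """Sliding-window chunking: overlap keeps clauses from being cut at boundaries."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        chunks.append({"text": text[start:end],
                       "start": start,  # offsets double as citation metadata
                       "end": end})
        if end == len(text):
            break
        start = end - overlap
    return chunks

doc = "Lorem ipsum " * 500
print(len(chunk_with_overlap(doc)), "chunks")
```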
Agentic systems with guardrails. Autonomous agents are powerful but brittle. My work emphasizes structured tool-use, explicit reasoning traces, and human-in-the-loop checkpoints that let agents operate in regulated environments like financial services and healthcare without sacrificing auditability.
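A minimal sketch of a human-in-the-loop checkpoint, with a hypothetical risky-tool policy; a real deployment would route approve() to a reviewer queue and persist the trace:

```python
RISKY_TOOLS = {"wire_transfer", "delete_record"}  # illustrative policy, not a real ruleset

def gated_call(tool: str, args: dict, approve) -> str:
    """Structured checkpoint: risky tools block until a human approver signs off."""
    trace = {"tool": tool, "args": args, "gated": tool in RISKY_TOOLS}
    if trace["gated"] and not approve(trace):  # approve() would page a reviewer queue
        return f"rejected: reviewer declined {tool}"
    return f"executed {tool}"  # real dispatch happens here

# Stub approver that declines everything, standing in for a human reviewer.
print(gated_call("wire_transfer", {"amount": 50_000}, approve=lambda trace: False))
```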
Evaluation-driven development. You cannot improve what you cannot measure. I build custom evaluation harnesses before writing the first line of application code, establishing baselines against which every prompt revision and architecture change is tested.
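The baseline-first workflow in miniature; the metric names and file path are illustrative:

```python
import json
import pathlib

BASELINE = pathlib.Path("baselines/summarizer.json")  # path is illustrative

def record_baseline(metrics: dict) -> None:
    """Step one, before application code: pin the numbers every change is judged against."""
    BASELINE.parent.mkdir(exist_ok=True)
    BASELINE.write_text(json.dumps(metrics))

def find_regressions(metrics: dict, tolerance: float = 0.01) -> list:
    """Compare a new run to the pinned baseline; return every metric that slipped."""
    baseline = json.loads(BASELINE.read_text())
    return [f"{name}: {metrics.get(name, 0.0):.3f} < baseline {floor:.3f}"
            for name, floor in baseline.items()
            if metrics.get(name, 0.0) < floor - tolerance]

record_baseline({"rougeL": 0.40, "accuracy": 0.95})
print(find_regressions({"rougeL": 0.41, "accuracy": 0.93}))  # flags the accuracy drop
```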