Finds the right HR policy across 5,420 chunks — and answers in seconds, with sources.
A hybrid-RAG assistant over 109 HR policy documents. It runs semantic and keyword search in parallel, fuses and reranks the results, and answers strictly from the source — cited, and in whatever language the employee asks.
The impact
A scattered doc pile becomes a unified, measurable index that answers in seconds.
The brief
The problem
The same answer, dug out of a hundred docs.
An HR department maintains 109 policy documents — hiring, onboarding, benefits, time off, compensation, offboarding. Employees ask the same questions hundreds of times: “How many vacation days do I get?”, “What's the parental-leave policy?”, “How do I submit an expense?” Each time, someone manually searches the docs, copy-pastes fragments, and writes a reply. Average lookup: 5–10 minutes. Multiply by dozens of questions a week and that's hours of HR capacity burned on repetitive work — and answers are inconsistent: people interpret policy differently, and outdated info slips through. The company bleeds time, consistency, and employee trust.
The solution
One assistant, every policy, instant cited answers.
I built a hybrid RAG assistant that indexes the entire HR knowledge base and answers questions in seconds — with source citations, in any language the employee writes in.
The core is a two-track retrieval pipeline. Every question runs through both semantic search (Qdrant, 3072-dim Gemini embeddings) and keyword search (BM25 with NLTK tokenization) in parallel. Results are fused via Reciprocal Rank Fusion, then a FlashRank cross-encoder re-scores the fused candidates and keeps the 5 most relevant chunks. Gemini is instructed to answer strictly from that retrieved context, which guards against hallucinated policies and invented numbers.
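The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the project's actual code: the function name, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions, while the cut to 15 fused candidates follows the figure given under "Under the hood".

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=15):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so a chunk found by both retrievers is counted once, with a higher score.
    k=60 is an assumed default, not a value taken from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:top_n]


# Hypothetical chunk ids returned by the two retrievers.
semantic = ["c7", "c2", "c9", "c4"]
keyword = ["c2", "c5", "c7", "c1"]
fused = reciprocal_rank_fusion([semantic, keyword], top_n=3)
print(fused)
```

Because the score depends only on rank positions, the two retrievers never need comparable score scales, which is the usual reason to pick RRF over weighted score averaging.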
It supports on-the-fly uploads: drop a new PDF or DOCX into the chat and it's chunked, embedded, and searchable within seconds. Session memory tracks the last 5 exchanges, so follow-ups work naturally. Evaluated on 114 QA pairs with Ragas — Answer Relevancy 0.82, Faithfulness 0.75.
How it works
Two retrievers run in parallel, fuse, and rerank — so the LLM only ever sees the 5 best chunks.
Key features
Hybrid search, dual retrieval
Every question runs two retrievers in parallel — semantic over Qdrant (3072-dim embeddings) and BM25 keyword — then Reciprocal Rank Fusion merges and dedupes both lists. Catches what either alone would miss.
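For intuition on the keyword track, here is a small standard-library sketch of BM25 scoring. It is an illustration under stated assumptions: the project tokenizes with NLTK, whereas this sketch swaps in a plain lowercase whitespace split, and the parameters `k1=1.5` and `b=0.75` plus the smoothed IDF variant are common defaults rather than values confirmed by the source. The sample documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption); the project uses NLTK tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Full-time employees accrue vacation days monthly",
    "Submit expense reports within thirty days",
    "Parental leave applies after six months of service",
]
scores = bm25_scores("vacation days", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Exact-term matching like this is what catches policy codes, form names, and other literal strings that an embedding search may blur.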
Cross-encoder reranking
A FlashRank cross-encoder (ms-marco-MiniLM-L-12-v2) re-scores the fused candidates and keeps only the top 5 — fine-grained relevance the first-pass retrievers can't provide.
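The shape of the rerank stage can be sketched as follows. This is not the FlashRank API: `rerank` and `overlap_score` are hypothetical names, and the token-overlap scorer is only a stand-in for the cross-encoder so the example runs without downloading a model. In the real pipeline the scoring function would be the cross-encoder's query-passage relevance score.

```python
from typing import Callable, List


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[str]:
    """Score every (query, passage) pair and keep the top_k passages."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # The sort is stable, so equally scored passages keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def overlap_score(query: str, passage: str) -> float:
    # Hypothetical stand-in for a cross-encoder: share of query tokens
    # that also occur in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "Expense reports are due monthly",
    "The parental leave policy grants paid weeks",
    "Leave requests go to your manager",
]
top = rerank("parental leave policy", candidates, overlap_score, top_k=2)
print(top)
```

Scoring each pair jointly is slower than a vector lookup, which is why it is applied only to the short fused list rather than to the whole index.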
Multilingual answers
Ask in any language. Gemini auto-detects and answers in kind, while semantic retrieval stays language-agnostic — so an English knowledge base still answers a question typed in Russian or Spanish.
Session memory
Each session keeps the last 5 exchanges, so follow-up questions — “and for part-timers?” — resolve against the running context instead of starting cold.
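A bounded per-session history like this can be sketched with a fixed-length deque. The class and method names below are hypothetical and the storage is a plain in-process dict; only the limit of 5 exchanges comes from the description above.

```python
from collections import deque

MAX_EXCHANGES = 5  # from the description: the last 5 exchanges are kept


class SessionMemory:
    """In-memory store of recent (question, answer) pairs per session."""

    def __init__(self, max_exchanges=MAX_EXCHANGES):
        self._max = max_exchanges
        self._sessions = {}

    def add(self, session_id, question, answer):
        # deque(maxlen=...) silently drops the oldest exchange when full.
        history = self._sessions.setdefault(session_id, deque(maxlen=self._max))
        history.append((question, answer))

    def history(self, session_id):
        # Oldest first; empty list for an unknown session.
        return list(self._sessions.get(session_id, ()))


memory = SessionMemory()
for i in range(7):
    memory.add("user-1", f"question {i}", f"answer {i}")
recent = memory.history("user-1")
print(len(recent), recent[0])
```

The returned history can be prepended to the prompt so that a follow-up such as "and for part-timers?" is read against the earlier question.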
Automated evaluation
Quality is measured, not assumed: 114 QA pairs scored with Ragas on Faithfulness, Answer Relevancy, Context Precision and Recall — Answer Relevancy lands at 0.82.
Streaming chat, drop-in docs
A Chainlit chat streams answers token-by-token. Drop a new PDF or DOCX into the conversation and it's chunked, embedded, and searchable within seconds — no redeploy.
Under the hood
- Hybrid retrieval: Parallel semantic + lexical search merged via Reciprocal Rank Fusion; beats either alone.
- Two-stage ranking: RRF narrows 40 candidates → 15; FlashRank cross-encoder picks the top 5.
- Pydantic config: Chunk size, top-k, vector dims, model names, all via BaseSettings + .env overrides.
- Stateless API: FastAPI serves 4 endpoints; an in-memory session store keeps context per user.
- Eval-driven: 114 QA pairs with Ragas metrics keep retrieval + generation measurable.
Knowledge buried in a hundred documents?
This turns a scattered doc pile into instant, cited answers — in any language, with retrieval quality measured by real eval metrics. I scope it, build it end to end, and prove it works with numbers.