Case study 03 · Lab — RAG & Search

Finds the right HR policy across 5,420 chunks — and answers in seconds, with sources.

A hybrid-RAG assistant over 109 HR policy documents. It runs semantic and keyword search in parallel, fuses and reranks the results, and answers strictly from the source — cited, and in whatever language the employee asks.

Role: Solo, design & build
Domain: HR knowledge base / search
Core stack: Qdrant · BM25 · FlashRank · Gemini
Quality: 0.82 Answer Relevancy · Ragas
01

The impact

A scattered doc pile becomes a unified, measurable index that answers in seconds.

Before (manual lookup) vs. After (hybrid RAG)

Time
  Before: ~5–10 min (manual search through docs)
  After: ~3 seconds (instant AI-powered answer)
Base
  Before: Scattered documents (no single source of truth)
  After: 109 docs, unified index (5,420 searchable chunks)
Quality
  Before: No measurement (answers vary by person)
  After: 0.82 Answer Relevancy (Ragas-evaluated · 114 QA pairs)
Work
  Before: Full manual lookup (HR answers every question)
  After: Optional QA review (AI handles routine queries)
Language
  Before: English only (single-language responses)
  After: Multilingual · auto-detect (ask in any language)
Availability
  Before: Business hours only (no evenings or weekends)
  After: 24/7 autonomous (always-on assistant)

02

The brief

The problem
The same answer, dug out of a hundred docs.

An HR department maintains 109 policy documents — hiring, onboarding, benefits, time off, compensation, offboarding. Employees ask the same questions hundreds of times: “How many vacation days do I get?”, “What's the parental-leave policy?”, “How do I submit an expense?” Each time, someone manually searches the docs, copy-pastes fragments, and writes a reply. Average lookup: 5–10 minutes. Multiply by dozens of questions a week and that's hours of HR capacity burned on repetitive work — and answers are inconsistent: people interpret policy differently, and outdated info slips through. The company bleeds time, consistency, and employee trust.

The solution
One assistant, every policy, instant cited answers.

I built a hybrid RAG assistant that indexes the entire HR knowledge base and answers questions in seconds — with source citations, in any language the employee writes in.

The core is a two-track retrieval pipeline. Every question runs through both semantic search (Qdrant, 3072-dim Gemini embeddings) and keyword search (BM25 with NLTK tokenization) in parallel. Results are fused via Reciprocal Rank Fusion, then a FlashRank cross-encoder reranks the fused list and keeps the 5 most relevant chunks. Gemini is prompted to answer strictly from the retrieved context, which guards against hallucinated policies and invented numbers.
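A minimal sketch of the parallel fan-out, assuming each retriever is a plain callable that returns ranked chunk IDs. The names `retrieve_both`, `semantic_search`, and `keyword_search` are hypothetical stand-ins for the Qdrant and BM25 calls, and a thread pool is just one way to run them side by side, not necessarily how this project does it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Retriever = Callable[[str, int], List[str]]  # (query, top_k) -> ranked chunk IDs


def retrieve_both(
    query: str,
    semantic: Retriever,
    keyword: Retriever,
    top_k: int = 20,  # top 20 per retriever
) -> Tuple[List[str], List[str]]:
    """Run both retrievers concurrently and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        semantic_future = pool.submit(semantic, query, top_k)
        keyword_future = pool.submit(keyword, query, top_k)
        # result() blocks until each retriever finishes and re-raises its errors.
        return semantic_future.result(), keyword_future.result()


# Hypothetical stand-ins; a real version would query Qdrant and a BM25 index.
def semantic_search(query: str, top_k: int) -> List[str]:
    return ["chunk-7", "chunk-2", "chunk-9"][:top_k]


def keyword_search(query: str, top_k: int) -> List[str]:
    return ["chunk-2", "chunk-4"][:top_k]


semantic_hits, keyword_hits = retrieve_both(
    "How many vacation days do I get?", semantic_search, keyword_search
)
```

Threads suit this step because both retrievers mostly wait on I/O (a network call to the vector store, an index lookup), so the slower of the two sets the latency rather than their sum.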

It supports on-the-fly uploads: drop a new PDF or DOCX into the chat and it's chunked, embedded, and searchable within seconds. Session memory tracks the last 5 exchanges, so follow-ups work naturally. Evaluated on 114 QA pairs with Ragas — Answer Relevancy 0.82, Faithfulness 0.75.


03

How it works

Two retrievers run in parallel, fuse, and rerank — so the LLM only ever sees the 5 best chunks.

User question
↓
Hybrid retrieval (both run in parallel)
  • Semantic search: Qdrant · Gemini embeddings → top 20
  • Keyword search: BM25 · NLTK tokenizer → top 20
↓
Reciprocal Rank Fusion: merge + deduplicate ranked lists → top 15
↓
FlashRank reranker: ms-marco-MiniLM cross-encoder → top 5
↓
Google Gemini Flash: system prompt + context + session memory
↓
Answer + source citations

04

Key features

Hybrid search, dual retrieval

Every question runs two retrievers in parallel — semantic over Qdrant (3072-dim embeddings) and BM25 keyword — then Reciprocal Rank Fusion merges and dedupes both lists. Catches what either alone would miss.
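Reciprocal Rank Fusion can be sketched in a few lines. This is an illustrative version, not the project's code: each chunk earns 1 / (k + rank) from every list it appears in, and the sums decide the fused order. The constant `k = 60` is the conventional default and an assumption here, since the source does not state the value used; the function name and chunk IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,      # assumed smoothing constant; not stated in the source
    top_n: int = 15,  # size of the fused list
) -> List[str]:
    """Merge ranked lists of chunk IDs; each hit adds 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Summing per ID deduplicates: a chunk found by both retrievers appears once.
    # Ties are broken by ID so the output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


semantic_hits = ["chunk-7", "chunk-2", "chunk-9"]
keyword_hits = ["chunk-2", "chunk-4", "chunk-7"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
# fused == ["chunk-2", "chunk-7", "chunk-4", "chunk-9"]
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which live on different scales.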

Cross-encoder reranking

A FlashRank cross-encoder (ms-marco-MiniLM-L-12-v2) re-scores the fused candidates and keeps only the top 5 — fine-grained relevance the first-pass retrievers can't provide.
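The second stage boils down to "score every (query, passage) pair, keep the best few". The sketch below shows that shape with a pluggable scorer. `rerank` and `token_overlap` are hypothetical names, and the toy overlap scorer is only a stand-in so the example runs without a model; in the real pipeline the scorer would be the FlashRank cross-encoder. The passages are made up for illustration.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (query, passage) -> relevance score


def rerank(
    query: str, passages: List[str], score: Scorer, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and keep the top_k highest."""
    scored = [(passage, score(query, passage)) for passage in passages]
    # list.sort is stable, so ties keep their first-pass order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer: fraction of query tokens that also occur in the passage.

    A stand-in only; a cross-encoder reads query and passage jointly.
    """
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    return len(query_tokens & passage_tokens) / max(len(query_tokens), 1)


# Made-up passages, not taken from the HR knowledge base.
candidates = [
    "Expense reports are due monthly.",
    "Full-time staff accrue vacation days each month.",
    "Vacation days per year depend on tenure.",
]
top = rerank("vacation days per year", candidates, token_overlap, top_k=2)
```

Keeping the scorer as a parameter also makes the stage easy to unit-test without loading model weights.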

Multilingual answers

Ask in any language. Gemini auto-detects and answers in kind, while semantic retrieval stays language-agnostic — so an English knowledge base still answers a question typed in Russian or Spanish.

Session memory

Each session keeps the last 5 exchanges, so follow-up questions — “and for part-timers?” — resolve against the running context instead of starting cold.
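A rolling window like this maps naturally onto a bounded deque. The sketch below is one possible in-memory version under that assumption; the class name `SessionMemory` and its methods are hypothetical, not the project's API.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Exchange = Tuple[str, str]  # (user question, assistant answer)

MAX_EXCHANGES = 5  # from the text: the last 5 exchanges per session


class SessionMemory:
    """In-memory store that keeps only the most recent exchanges per session."""

    def __init__(self, max_exchanges: int = MAX_EXCHANGES) -> None:
        self._sessions: Dict[str, Deque[Exchange]] = defaultdict(
            lambda: deque(maxlen=max_exchanges)
        )

    def add(self, session_id: str, question: str, answer: str) -> None:
        # A full deque drops its oldest item automatically on append.
        self._sessions[session_id].append((question, answer))

    def history(self, session_id: str) -> List[Exchange]:
        # .get avoids creating an empty entry for unknown sessions.
        return list(self._sessions.get(session_id, ()))


memory = SessionMemory()
for i in range(1, 8):
    memory.add("session-a", f"question {i}", f"answer {i}")
recent = memory.history("session-a")  # exchanges 3 through 7
```

The history returned here is what would be appended to the prompt, so a follow-up like "and for part-timers?" arrives with the preceding question attached.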

Automated evaluation

Quality is measured, not assumed: 114 QA pairs scored with Ragas on Faithfulness, Answer Relevancy, Context Precision and Recall — Answer Relevancy lands at 0.82.

Streaming chat, drop-in docs

A Chainlit chat streams answers token-by-token. Drop a new PDF or DOCX into the conversation and it's chunked, embedded, and searchable within seconds — no redeploy.


05

Under the hood

Tech stack
Python 3.11 · Google Gemini Flash · Gemini Embedding-2 · LangChain 0.3 · Qdrant Cloud · BM25 (rank-bm25) · FlashRank · FastAPI · Chainlit 2.3 · Ragas 0.2 · NLTK · Docker
Design principles
  • Hybrid retrieval: Parallel semantic + lexical search merged via Reciprocal Rank Fusion; beats either alone.
  • Two-stage ranking: RRF narrows up to 40 candidates (20 per retriever) → 15; FlashRank cross-encoder picks the top 5.
  • Pydantic config: Chunk size, top-k, vector dims, and model names are all set via BaseSettings + .env overrides.
  • Stateless API: FastAPI serves 4 endpoints; an in-memory session store keeps context per session.
  • Eval-driven: 114 QA pairs with Ragas metrics keep retrieval + generation measurable.
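To illustrate the config principle, here is a stdlib-only approximation of typed settings with environment overrides. It is not the Pydantic BaseSettings code the project uses, just the same idea in a form that runs without dependencies. The defaults are the figures quoted in this case study; the field names and the `HR_` prefix are assumptions, and chunk size and model names are left out because the source gives no values to default to.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Defaults taken from the figures quoted in this case study.
    retriever_top_k: int = 20  # candidates per retriever
    fusion_top_n: int = 15     # kept after RRF
    rerank_top_k: int = 5      # chunks passed to the LLM
    vector_dim: int = 3072     # embedding size
    memory_exchanges: int = 5  # session history length


def load_settings(
    env: Mapping[str, str] = os.environ,
    prefix: str = "HR_",  # hypothetical prefix
) -> Settings:
    """Build Settings, overriding any field that has a PREFIX + FIELD_NAME variable."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)  # every field here is an int
    return Settings(**overrides)


settings = load_settings({"HR_RERANK_TOP_K": "3"})
# settings.rerank_top_k is 3; every other field keeps its default
```

Passing the environment as a mapping keeps the loader testable: a test hands in a dict instead of mutating `os.environ`.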
hr-assistant / query flow
Chainlit chat UI (streaming + file upload)
↓ POST /api/chat
FastAPI + session memory (last 5 pairs per session)
↓
Hybrid search: Qdrant (semantic) + BM25 (lexical)
↓
RRF fusion → FlashRank reranker → top-5 chunks
↓
Gemini Flash (system prompt + context + history)
↓
Answer + source citations

Knowledge buried in a hundred documents?

This turns a scattered doc pile into instant, cited answers — in any language, with retrieval quality measured by real eval metrics. I scope it, build it end to end, and prove it works with numbers.