Case study 03 · Lab — RAG & Search

A portfolio build — modeled on a real use case and taken end to end. It actually runs.

Finds the right HR policy across 5,420 chunks — and answers in seconds, with sources.

A hybrid-RAG assistant over 109 HR policy documents. It runs semantic and keyword search in parallel, fuses and reranks the results, and answers strictly from the source — cited, and in whatever language the employee asks.


Role: Solo (design & build)

Domain: HR knowledge base / search

Core stack: Qdrant · BM25 · FlashRank · Gemini

Quality: 0.82 Answer Relevancy (Ragas)

The impact

A scattered doc pile becomes a unified, measurable index that answers in seconds.

Before (manual lookup) → After (hybrid RAG)

Time
Before: ~5–10 min (manual search through docs)
After: ~3 seconds (retrieve → rerank → answer with sources)

Base
Before: Scattered documents (no single source of truth)
After: 109 docs, unified index (5,420 searchable chunks)

Quality
Before: No measurement (answers vary by person)
After: 0.82 Answer Relevancy (Ragas-evaluated · 114 QA pairs)

Work
Before: Full manual lookup (HR answers every question)
After: Optional QA review (AI handles routine queries)

Language
Before: English only (single-language responses)
After: Multilingual · auto-detect (ask in any language)

Availability
Before: Business hours only (no evenings or weekends)
After: 24/7 autonomous (always-on assistant)

The brief

The problem: The same answer, dug out of a hundred docs.

An HR department maintains 109 policy documents — hiring, onboarding, benefits, time off, compensation, offboarding. Employees ask the same questions hundreds of times: “How many vacation days do I get?”, “What's the parental-leave policy?”, “How do I submit an expense?” Each time, someone manually searches the docs, copy-pastes fragments, and writes a reply. Average lookup: 5–10 minutes. Multiply by dozens of questions a week and that's hours of HR capacity burned on repetitive work — and answers are inconsistent: people interpret policy differently, and outdated info slips through. The company bleeds time, consistency, and employee trust.

The solution: One assistant, every policy, instant cited answers.

I built a hybrid RAG assistant that indexes the entire HR knowledge base and answers questions in seconds — with source citations, in any language the employee writes in.

The core is a two-track retrieval pipeline. Every question runs through both semantic search (Qdrant, 3072-dim Gemini embeddings) and keyword search (BM25 with NLTK tokenization) in parallel, each returning its top 20 chunks. Reciprocal Rank Fusion merges the two lists into 15 candidates, and a FlashRank cross-encoder reranks them down to the 5 most relevant chunks. Gemini is instructed to answer only from that retrieved context, which is what keeps invented policies and numbers in check.

It supports on-the-fly uploads: drop a new PDF or DOCX into the chat and it's chunked, embedded, and searchable within seconds. Session memory tracks the last 5 exchanges, so follow-ups work naturally. Evaluated on 114 QA pairs with Ragas — Answer Relevancy 0.82, Faithfulness 0.75.

How it works

Two retrievers run in parallel, fuse, and rerank — so the LLM only ever sees the 5 best chunks.

User question

↓

Hybrid retrieval (both run in parallel)

Semantic search (Qdrant · Gemini embeddings) → top 20

Keyword search (BM25 · NLTK tokenizer) → top 20

↓

Reciprocal Rank Fusion (merge + deduplicate ranked lists) → top 15

↓

FlashRank reranker (ms-marco-MiniLM cross-encoder) → top 5

↓

Google Gemini Flash (system prompt + context + session memory)

↓

Answer + source citations

Key features

Hybrid search, dual retrieval

Every question runs two retrievers in parallel — semantic over Qdrant (3072-dim embeddings) and BM25 keyword — then Reciprocal Rank Fusion merges and dedupes both lists. Catches what either alone would miss.
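The keyword track can be sketched in a few lines of plain Python. This is a hand-rolled Okapi BM25 variant for illustration only: the project uses rank-bm25 with NLTK tokenization, whereas the regex tokenizer, the class name, the parameters k1=1.5 and b=0.75, and the sample policy sentences below are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption); the project uses NLTK instead.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Minimal BM25 scorer over a small in-memory list of chunks."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        n = len(docs)
        df = Counter(term for t in self.tokens for term in set(t))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        length_norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + length_norm)
        return total

    def top_k(self, query, k=20):
        scored = [(self.score(query, i), i) for i in range(len(self.tokens))]
        ranked = sorted((pair for pair in scored if pair[0] > 0), key=lambda p: (-p[0], p[1]))
        return [i for _, i in ranked[:k]]


# Invented sample chunks, not text from the real knowledge base.
chunks = [
    "Full-time employees accrue 20 vacation days per year.",
    "Submit expense reports within 30 days of purchase.",
    "Parental leave is 16 weeks for all employees.",
]
hits = TinyBM25(chunks).top_k("vacation days")
```

Exact-term matching like this is what lets the keyword track catch policy names, form numbers, and other literal strings that an embedding search may rank lower.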

Cross-encoder reranking

A FlashRank cross-encoder (ms-marco-MiniLM-L-12-v2) re-scores the fused candidates and keeps only the top 5 — fine-grained relevance the first-pass retrievers can't provide.

Multilingual answers

Ask in any language. Gemini auto-detects and answers in kind, while semantic retrieval stays language-agnostic — so an English knowledge base still answers a question typed in Russian or Spanish.

Session memory

Each session keeps the last 5 exchanges, so follow-up questions — “and for part-timers?” — resolve against the running context instead of starting cold.
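A bounded per-session history like this can be sketched with a deque. The class and method names are hypothetical; only the limit of 5 exchanges comes from the description above.

```python
from collections import defaultdict, deque

MAX_EXCHANGES = 5  # from the description: last 5 exchanges per session


class SessionMemory:
    """In-memory store that keeps only the most recent exchanges per session."""

    def __init__(self, max_exchanges=MAX_EXCHANGES):
        # deque(maxlen=...) silently drops the oldest exchange when full.
        self._store = defaultdict(lambda: deque(maxlen=max_exchanges))

    def add(self, session_id, question, answer):
        self._store[session_id].append((question, answer))

    def history(self, session_id):
        # .get avoids creating an empty entry for unknown sessions.
        return list(self._store.get(session_id, ()))


memory = SessionMemory()
for i in range(7):
    memory.add("session-a", f"question {i}", f"answer {i}")
recent = memory.history("session-a")
```

After seven exchanges only the last five remain, so the prompt stays a fixed size no matter how long the conversation runs.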

Automated evaluation

Quality is measured, not assumed: 114 QA pairs scored with Ragas on Faithfulness, Answer Relevancy, Context Precision and Recall — Answer Relevancy lands at 0.82, Faithfulness at 0.75 (roughly three in four claims trace back to a retrieved source — exactly what the inline citations let you verify).
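For readers unfamiliar with the Faithfulness metric, the arithmetic behind a score such as 0.75 can be sketched as follows. In Ragas an LLM extracts the claims from each answer and judges whether the retrieved context supports them; here those verdicts are hand-written placeholders, and the function name is hypothetical.

```python
from statistics import mean


def faithfulness(claim_verdicts):
    """Share of an answer's claims that the retrieved context supports."""
    if not claim_verdicts:
        raise ValueError("an answer needs at least one claim to be scored")
    return sum(claim_verdicts) / len(claim_verdicts)


# Placeholder verdicts for three answers (True = claim supported by context).
per_answer = [
    faithfulness([True, True, True, False]),  # 3 of 4 claims supported
    faithfulness([True, True]),               # fully supported
    faithfulness([True, False]),              # half supported
]
# The dataset-level score is the mean of the per-answer scores.
dataset_score = mean(per_answer)
```

The reported figure is an average over the whole QA set, so individual answers can sit well above or below it.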

Streaming chat, drop-in docs

A Chainlit chat streams answers token-by-token. Drop a new PDF or DOCX into the conversation and it's chunked, embedded, and searchable within seconds — no redeploy.
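The chunking step of an upload can be illustrated with a simple sliding window over the extracted text. This is a stand-in sketch: the page does not say which splitter the project uses, and the chunk size of 500 characters and overlap of 50 are assumed values, not the project's settings.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that overlap slightly.

    chunk_size and overlap are illustrative defaults (assumptions).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1,200-character document standing in for an extracted PDF.
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.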

Under the hood

Tech stack

Python 3.11 · Google Gemini Flash · Gemini Embedding-2 · LangChain 0.3 · Qdrant Cloud · BM25 (rank-bm25) · FlashRank · FastAPI · Chainlit 2.3 · Ragas 0.2 · NLTK · Docker

Design principles

Hybrid retrieval: Parallel semantic + lexical search merged via Reciprocal Rank Fusion, which catches matches that either retriever alone would miss.
Two-stage ranking: RRF narrows 40 candidates (20 per retriever) → 15; the FlashRank cross-encoder picks the top 5.
Pydantic config: Chunk size, top-k, vector dims, and model names are all set via BaseSettings, with .env overrides.
Stateless API: FastAPI serves 4 endpoints; the only server-side state is an in-memory session store that keeps context per user.
Eval-driven: 114 QA pairs scored with Ragas metrics keep retrieval and generation measurable.

hr-assistant / query flow

→ Chainlit chat UI (streaming + file upload)

↓ POST /api/chat

FastAPI + session memory (last 5 pairs per session)

↓

Hybrid search: Qdrant (semantic) + BM25 (lexical)

↓

RRF fusion → FlashRank reranker → top-5 chunks

↓

Gemini Flash (system prompt + context + history)

↓

Answer + source citations
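The last step of the flow, combining system prompt, context, and history into one request, might look roughly like this. The prompt wording, the function name, the dictionary keys, and the file name are all hypothetical; the real system prompt is not shown on this page.

```python
# Hypothetical system prompt; the project's actual wording is not published here.
SYSTEM_PROMPT = (
    "You are an HR policy assistant. Answer only from the numbered context. "
    "If the context does not contain the answer, say so. "
    "Cite sources as [n] and reply in the language of the question."
)


def build_prompt(question, chunks, history):
    """Assemble one prompt from the top chunks, recent turns, and the question."""
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{turns or '(none)'}\n\n"
        f"Question: {question}"
    )


# Invented example inputs.
top_chunks = [{"source": "pto-policy.pdf", "text": "Full-time staff accrue 20 days per year."}]
past_turns = [("How many vacation days do I get?", "Full-time staff accrue 20 days [1].")]
prompt = build_prompt("And for part-timers?", top_chunks, past_turns)
```

Numbering the chunks in the prompt is what allows the answer's citations to be mapped back to specific source documents.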

Knowledge buried in a hundred documents?

This turns a scattered doc pile into instant, cited answers — in any language, with retrieval quality measured by real eval metrics. I scope it, build it end to end, and prove it works with numbers.
