Oleksii Nikiforov (nikiforovall) · nikiforovall.blog
Phase 1: INGEST (online)
  User: "ingest Figma"
  → Scrape websites
  → Clean → Chunk → Embed
  → Store in Qdrant

Phase 2: QUERY (offline)
  User: "Who are competitors?"
  → Hybrid retrieval (Qdrant)
  → RRF fusion (dense + BM25)
  → LLM generates grounded answer
Key: No internet access during query phase — answers from stored knowledge only
Ingest triggered
→ Normalize company name
→ Wipe existing data (idempotent)
→ Scrape all sources (Crawl4AI)
→ Clean HTML → Markdown
→ Semantic chunking (256–512 tokens)
→ Dense embedding (arctic-embed-s, 384-dim)
→ Sparse vectors (BM25 tokenization)
→ Upsert to Qdrant
→ Knowledge base ready
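The chunking step can be approximated with a short sketch. This is not the article's implementation: `chunk_markdown` is a hypothetical name, and token counts are approximated by whitespace splitting rather than a real tokenizer.

```python
# Hypothetical sketch of the chunking step: split Markdown on paragraph
# boundaries and pack paragraphs into chunks of at most ~512 tokens
# (token count approximated by whitespace splitting).
def chunk_markdown(text: str, max_tokens: int = 512) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk when the next paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real semantic chunker would also respect heading boundaries and sentence limits; this only shows the packing logic.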
Embedding model: snowflake-arctic-embed-s (384-dim)
Point ID: sha256(url + chunk_index), deterministic, so re-ingesting a page overwrites its chunks instead of duplicating them
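The deterministic ID scheme can be sketched as follows. The `point_id` helper is hypothetical; folding the digest into a UUID is an assumption, made because Qdrant point IDs must be either an unsigned integer or a UUID.

```python
import hashlib
import uuid

# Deterministic point ID: sha256 of the source URL plus the chunk index.
# The same (url, chunk_index) pair always yields the same ID, so upserting
# during re-ingestion overwrites existing points (idempotent ingest).
def point_id(url: str, chunk_index: int) -> str:
    digest = hashlib.sha256(f"{url}{chunk_index}".encode("utf-8")).hexdigest()
    # Take the first 128 bits of the digest and present them as a UUID.
    return str(uuid.UUID(digest[:32]))
```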
User query
→ Dense embed + BM25 tokenize
→ Company filter (LLM-inferred from context)
→ Qdrant prefetch: top-20 dense + top-20 sparse
→ RRF fusion (k=60) → top-10
→ Similarity threshold (cosine ≥ 0.45)
→ ≤ 4,000 tokens context
→ LLM
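The last two filtering steps can be sketched in a few lines. `build_context` is a hypothetical helper, and token counts are approximated by whitespace splitting, not the model's tokenizer.

```python
# Hypothetical post-retrieval filtering: drop chunks below the cosine
# similarity threshold, then greedily pack the best-scoring chunks until
# the context token budget is spent.
def build_context(hits: list[tuple[float, str]], min_score: float = 0.45,
                  budget: int = 4000) -> list[str]:
    kept, used = [], 0
    for score, text in sorted(hits, key=lambda h: h[0], reverse=True):
        if score < min_score:
            continue  # below the similarity threshold
        n = len(text.split())
        if used + n > budget:
            break  # context budget exhausted
        kept.append(text)
        used += n
    return kept
```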
RRF (Reciprocal Rank Fusion) combines dense and sparse rankings
dotnet run
Golden dataset (curated Q&A pairs with expected facts)
→ Ingest raw data into vector store
→ Run each query through retrieval pipeline
→ Check: did retrieved chunks contain expected facts?
→ Compute Hit Rate & Context Recall
→ Assert thresholds in CI
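The metric computation can be sketched as follows. `evaluate`, `retrieve`, and the golden-case shape are hypothetical names; fact checking uses case-insensitive substring matching, the trade-off the article makes instead of LLM-as-judge.

```python
# Hypothetical eval sketch over a golden dataset of {"query", "facts"} cases.
# retrieve(query) -> list of retrieved chunk texts.
# - Hit Rate: fraction of queries where at least one expected fact appears.
# - Context Recall: fraction of all expected facts found across queries.
def evaluate(golden: list[dict], retrieve) -> tuple[float, float]:
    hits, found, total = 0, 0, 0
    for case in golden:
        context = " ".join(retrieve(case["query"])).lower()
        matched = [f for f in case["facts"] if f.lower() in context]
        hits += bool(matched)       # query counts as a hit if any fact matched
        found += len(matched)
        total += len(case["facts"])
    return hits / len(golden), found / total
```

In CI these two numbers would be asserted against thresholds, failing the build on retrieval regressions.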
Key: Evaluation runs as an integration test — Aspire starts all services, eval runs end-to-end
dotnet test
Benefit: Confidence that eval results reflect production behavior — not a simulated environment
Our trade-off: LLM-as-judge requires a fast, capable model. With a local Qwen3 8B, substring matching on expected facts gives a reliable signal in seconds instead of minutes.