Company Intelligence

RAG-based company research assistant

Oleksii Nikiforov

  • Lead Software Engineer at EPAM Systems
  • AI Engineering Coach
  • 10+ years in software development
  • Open Source and Blogging

nikiforovall
Oleksii Nikiforov
nikiforovall.blog

Problem Statement

  • Company research is manual, repetitive, scattered
  • Data lives across websites, Wikipedia, news, Crunchbase
  • Goal: ask questions → get grounded answers

Solution Overview

Two-phase architecture

Phase 1: INGEST (online)          Phase 2: QUERY (offline)
─────────────────────────         ────────────────────────
User: "ingest Figma"              User: "Who are competitors?"
  → Scrape websites               → Hybrid retrieval (Qdrant)
  → Clean → Chunk → Embed         → RRF fusion (dense + BM25)
  → Store in Qdrant               → LLM generates grounded answer

Key: No internet access during query phase — answers from stored knowledge only

Chat — Chat Window


Chat — Q&A with Citations


Chat — Detailed Research


Backoffice — Data Ingestion


Tech Stack

| Layer | Choice | Why |
| --- | --- | --- |
| LLM | Qwen3 8B via Ollama | 32K context, local-only |
| Embeddings | snowflake-arctic-embed-s | 384-dim, fast on CPU |
| Vector Store | Qdrant | Hybrid search, pre-filtering |
| Agent | Pydantic AI + AG-UI | SSE streaming to UI |
| Frontend | CopilotKit + Next.js 15 | React 19, chat UX |
| Orchestration | .NET Aspire | Service discovery, OTel |
| Scraping | Crawl4AI | Async, headless browser |

Phase 1: Data Ingestion

Scrape → Clean → Chunk → Embed → Store

Ingestion Pipeline

Ingest triggered
  → Normalize company name
  → Wipe existing data (idempotent)
  → Scrape all sources (Crawl4AI)
  → Clean HTML → Markdown
  → Semantic chunking (256–512 tokens)
  → Dense embedding (arctic-embed-s, 384-dim)
  → Sparse vectors (BM25 tokenization)
  → Upsert to Qdrant
  → Knowledge base ready
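The pipeline above can be sketched as a thin skeleton; the function names and stage signatures here are assumptions for illustration, not the project's actual API:

```python
# Minimal sketch of the ingestion entry point. Each stage is injected
# so the real implementations (Crawl4AI, fastembed, Qdrant) slot in.
import re

def normalize_company_name(raw: str) -> str:
    """Lowercase, trim, and collapse whitespace so 'Figma ' and 'figma'
    map to the same knowledge-base key, making re-ingest idempotent."""
    return re.sub(r"\s+", " ", raw.strip()).lower()

def ingest(company: str, scrape, clean, chunk, embed, store) -> str:
    """Run scrape -> clean -> chunk -> embed -> store for one company."""
    key = normalize_company_name(company)
    pages = scrape(key)                          # raw HTML per source
    docs = [clean(p) for p in pages]             # HTML -> Markdown
    chunks = [c for d in docs for c in chunk(d)] # semantic chunking
    store(key, [(c, embed(c)) for c in chunks])  # wipe old data + upsert
    return key
```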

Scraping

  • Crawl4AI — async, headless browser, Markdown output
  • Sources: company website, Wikipedia, news articles
  • Bounds: max 20 pages/source, 30s timeout, 1 req/s rate limit
  • Respects robots.txt, English-only filter
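A minimal sketch of enforcing those bounds; the limits mirror the slide, while the injected `fetch` callable stands in for Crawl4AI's async crawl call:

```python
# Enforce the scraping bounds: max 20 pages/source, 30 s timeout,
# 1 req/s rate limit (rps is a parameter so tests can speed it up).
import asyncio

MAX_PAGES_PER_SOURCE = 20
TIMEOUT_S = 30
REQUESTS_PER_SECOND = 1

async def crawl_source(urls, fetch, rps=REQUESTS_PER_SECOND):
    """Fetch at most MAX_PAGES_PER_SOURCE pages sequentially, dropping
    any page that exceeds the timeout rather than stalling the run."""
    results = []
    for url in urls[:MAX_PAGES_PER_SOURCE]:
        try:
            results.append(await asyncio.wait_for(fetch(url), TIMEOUT_S))
        except asyncio.TimeoutError:
            continue  # skip slow pages, keep the pipeline moving
        await asyncio.sleep(1 / rps)  # simple rate limiting
    return results
```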

Chunking & Embedding

  • Semantic chunking — split by headings, then paragraphs, then sentences
  • Target: 256–512 tokens, 50-token overlap
  • Dense vectors: snowflake-arctic-embed-s (384-dim, L2-normalized)
  • Sparse vectors: BM25 tokenization via fastembed
  • Deterministic chunk IDs: sha256(url + chunk_index)
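A rough illustration of heading-first chunking and the deterministic IDs. Token counts are approximated by word counts here; the real pipeline counts tokens and adds the 50-token overlap:

```python
# Sketch: split on headings first, then paragraphs, and derive stable
# chunk IDs so re-ingesting the same page overwrites in place.
import hashlib
import re

def chunk_markdown(md: str, max_words: int = 400) -> list[str]:
    """Heading-first split, then pack paragraphs under the size target."""
    chunks = []
    for sec in re.split(r"(?m)^#{1,6} ", md):
        if not sec.strip():
            continue
        buf: list[str] = []
        for para in sec.split("\n\n"):
            words_in_buf = sum(len(p.split()) for p in buf)
            if buf and words_in_buf + len(para.split()) > max_words:
                chunks.append("\n\n".join(buf))
                buf = []
            buf.append(para)
        if buf:
            chunks.append("\n\n".join(buf))
    return chunks

def chunk_id(url: str, index: int) -> str:
    """Deterministic ID: sha256(url + chunk_index)."""
    return hashlib.sha256(f"{url}{index}".encode()).hexdigest()
```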

Phase 2: RAG Query

Retrieve → Augment → Ground → Cite

Hybrid Retrieval

User query
  → Dense embed + BM25 tokenize
  → Company filter (LLM-inferred from context)
  → Qdrant prefetch: top-20 dense + top-20 sparse
  → RRF fusion (k=60) → top-10
  → Similarity threshold (cosine ≥ 0.45)
  → ≤ 4,000 tokens context → LLM

RRF (Reciprocal Rank Fusion) combines dense and sparse rankings
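RRF itself is only a few lines; this is a plain-Python version of what the fusion step computes (in the project, Qdrant performs the fusion server-side):

```python
# Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) per
# document; k=60 damps the influence of top ranks, and documents that
# appear in both lists accumulate score from each.
def rrf_fuse(dense_ids, sparse_ids, k=60, top_n=10):
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note how "b" wins below despite never ranking first in either list: appearing high in both rankings beats topping only one.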

Chat Agent

  • Grounded answers — exclusively from retrieved context
  • Citations — every answer references source URL + title
  • Confidence — "I don't have enough information" when cosine < 0.45
  • Multi-turn — full history within 32K context window
  • Context budget — ≤ 3,000 tokens retrieved context per query
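The grounding gate and context budget can be sketched as follows; the thresholds come from the slide, while the word-count tokenizer is a simplification:

```python
# Grounding gate: drop low-similarity chunks and cap the context budget
# before prompting; an empty context triggers the refusal message.
MIN_COSINE = 0.45
MAX_CONTEXT_TOKENS = 3000

def build_context(hits):
    """hits: list of (cosine_score, chunk_text), best-first.
    Keep chunks above the threshold until the budget is spent."""
    kept, budget = [], MAX_CONTEXT_TOKENS
    for score, text in hits:
        if score < MIN_COSINE:
            continue
        cost = len(text.split())  # stand-in for a real token counter
        if cost > budget:
            break
        kept.append(text)
        budget -= cost
    return kept

def answer_or_refuse(hits):
    ctx = build_context(hits)
    return ctx if ctx else "I don't have enough information"
```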

Phase 3: Orchestration & Observability

.NET Aspire + OpenTelemetry

.NET Aspire

  • Polyglot orchestrator — Python (FastAPI) + Node.js (Next.js) + containers (Ollama, Qdrant)
  • Single dotnet run starts everything
  • Service discovery — connection strings injected automatically
  • Health monitoring via Aspire dashboard

OpenTelemetry

  • Instrumented via Logfire (HTTP/protobuf to Aspire dashboard)
  • Traces for: scraping, chunking, embedding, retrieval, generation
  • GenAI semantic conventions — token usage per LLM call
  • End-to-end distributed tracing across all services

Distributed Traces


Scraper Metrics


Phase 4: RAG Evaluation

Measure retrieval quality before it reaches the LLM

Evaluation Approach

Golden dataset (curated Q&A pairs with expected facts)
  → Ingest raw data into vector store
  → Run each query through retrieval pipeline
  → Check: did retrieved chunks contain expected facts?
  → Compute Hit Rate & Context Recall
  → Assert thresholds in CI

Key: Evaluation runs as an integration test — Aspire starts all services, eval runs end-to-end

Why Aspire for Eval?

  • Clean environment — fresh Qdrant container per test run, no stale data
  • Isolation — all services (Ollama, Qdrant, agent) spun up and torn down automatically
  • No port conflicts — Aspire assigns random ports, test discovers them via service name
  • One command — dotnet test boots the entire stack, runs eval, asserts thresholds
  • Same pipeline — eval exercises the real ingestion + retrieval

Benefit: Confidence that eval results reflect production behavior — not a simulated environment

Metrics

| Metric | What it measures |
| --- | --- |
| Hit Rate | % of queries where at least one expected fact is retrieved |
| Context Recall | Average % of expected facts found per query |

  • Substring matching — each expected fact checked against retrieved chunks
  • No LLM involvement — deterministic, fast
  • We could use LLM-as-judge for semantic matching, but substring is sufficient here
  • Golden dataset: 18 queries, 2–4 reference facts each
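Both metrics follow directly from their definitions; a sketch under the assumption that the golden dataset is a list of (query, expected_facts) pairs:

```python
# Hit Rate and Context Recall via case-insensitive substring matching,
# as described above: deterministic and fast, no LLM in the loop.
def evaluate(dataset, retrieve):
    """dataset: list of (query, expected_facts);
    retrieve(query) returns the retrieved chunk texts."""
    hits, recalls = 0, []
    for query, facts in dataset:
        text = " ".join(retrieve(query)).lower()
        found = sum(1 for fact in facts if fact.lower() in text)
        hits += found > 0                 # at least one fact retrieved
        recalls.append(found / len(facts))  # fraction of facts found
    return {
        "hit_rate": hits / len(dataset),
        "context_recall": sum(recalls) / len(recalls),
    }
```

In CI, the returned dict is what the integration test asserts thresholds against.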

Best Practices for RAG Evaluation

  • Golden datasets — curated Q&A pairs with ground-truth contexts
  • Separate retrieval from generation — measure each independently
  • LLM-as-judge — use strong LLM to score answer faithfulness & relevance
  • Multiple metrics — precision, recall, MRR, NDCG at different K values
  • Regression testing — run eval in CI, fail on quality drops
  • RAGAS framework — standardized metrics for RAG pipelines

Our trade-off: LLM-as-judge requires a fast, capable model — with local Qwen3 8B, substring matching gives reliable signal in seconds vs. minutes

Thank You

Questions?