himanshugarg06/data-context
Context Layer

This repository implements a Context Layer: a service that turns a natural-language question into a Resolved Query Plan that is backed by evidence, bound to real schema, and compiled into verifiable SQL.

Demo: Google Drive


The problem we are solving


Business metric definitions are scattered across tables, Slack threads, and docs/PDFs. When business users ask questions in English, the system has to guess:

  • which metric definition to use,
  • which tables/columns map to it,
  • and what business rules apply.

That guessing creates ambiguity, broken dashboards, and untrustworthy SQL, and it is why AI tools hallucinate. In large orgs the data lives in tables, but the metric definitions live in Slack, docs, tickets, and tribal knowledge.

Solution

A pluggable Context Layer service that resolves user questions into a Resolved Query Plan using schema context, lineage expansion, unstructured evidence, deterministic binding, and verification checks. In essence, it is a layer that supplies the business context and exact metric definitions needed to query the data tables.

Given a question, it:

  1. QuerySpec agent extracts a formal intent object (strict JSON, Zod‑validated).
  2. Schema agent searches the catalog and expands via lineage (plus optional join inference).
  3. Evidence agent pulls Slack/docs/PDF chunks using hybrid retrieval.
  4. Metric agent generates multiple MetricSpecCandidates with evidence citations.
  5. Binding + join planner deterministically maps hints → columns and discards invalid candidates.
  6. Ranking agent scores candidates and selects the best.
  7. SQL agent builds deterministic SQL and optionally refines on SQL errors.
  8. Verification agent runs safety checks (join duplication, null join keys, rate bounds).
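The eight stages compose sequentially. As a rough sketch (all types and stage bodies below are simplified stand-ins, not the repository's actual modules):

```typescript
// Illustrative pipeline sketch; every function here is a hypothetical
// stand-in for the corresponding agent described above.
interface QuerySpec { metric_terms: string[]; group_by: string[] }
interface Candidate { metric: string; columns: string[] }

const extractQuerySpec = async (q: string): Promise<QuerySpec> =>
  ({ metric_terms: [q], group_by: [] });                      // 1. formal intent
const generateCandidates = async (spec: QuerySpec): Promise<Candidate[]> =>
  spec.metric_terms.map((m) => ({ metric: m, columns: [] })); // 4. metric candidates
const bind = (c: Candidate): Candidate | null => c;           // 5. deterministic binding
const rank = (cs: Candidate[]): Candidate => cs[0];           // 6. pick the best

async function resolveQuestion(q: string): Promise<Candidate> {
  const spec = await extractQuerySpec(q);
  const candidates = await generateCandidates(spec);
  // Binding can reject a candidate (null); invalid candidates are discarded.
  const bound = candidates.map(bind).filter((c): c is Candidate => c !== null);
  return rank(bound); // SQL generation and verification would follow here
}
```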

Multi‑agent mode

Multi-agent mode is used primarily for complex queries, because the task spans multiple failure modes that a single pass often misses:

  • Different experts catch different errors: schema fit, evidence grounding, join plausibility, and SQL syntax are distinct checks.
  • Debate/critique reduces hallucinations: one agent proposes, another challenges mismatches against actual tables/columns.
  • It improves robustness on messy data: if one agent overfits to noisy evidence, another can correct it using schema constraints.
  • It gives a safety net: verification and SQL-repair agents can fix issues before execution.
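The debate loop itself is simple; a hedged sketch of one propose/critique round (the real agents, prompts, and stopping criteria live in the codebase, with MULTI_AGENT_MAX_ROUNDS bounding the loop):

```typescript
// Simplified propose/critique loop; Proposal and the three agent functions
// are illustrative stand-ins, not the repository's real implementation.
interface Proposal { sql: string; issues: string[] }

const propose = async (question: string): Promise<Proposal> =>
  ({ sql: `SELECT 1 /* plan for: ${question} */`, issues: [] });

// The critic challenges mismatches against actual tables/columns;
// an empty issue list means the proposal is accepted.
const critique = async (p: Proposal): Promise<string[]> =>
  p.sql.trim().toUpperCase().startsWith("SELECT") ? [] : ["not a SELECT statement"];

const revise = async (p: Proposal, issues: string[]): Promise<Proposal> =>
  ({ ...p, issues });

async function debate(question: string, maxRounds = 1): Promise<Proposal> {
  let current = await propose(question);
  for (let round = 0; round < maxRounds; round++) {
    const issues = await critique(current);
    if (issues.length === 0) break; // critic accepts: stop early
    current = await revise(current, issues);
  }
  return current;
}
```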

How to run

  1. Install dependencies
npm install
  2. Start pgvector (if you don’t already have it) and create the pgvector extension (once per pgvector DB)
docker run --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_DB=context \
  -p 5433:5432 \
  -d pgvector/pgvector:pg16
psql "postgres://postgres:postgres@127.0.0.1:5433/context" -c "CREATE EXTENSION IF NOT EXISTS vector;"

  3. Set environment variables

Update .env with correct values:

# Core
PROVIDER_MODE=real
OPENAI_API_KEY=...
PORT=3000

# Data DB (your real warehouse)
DATA_DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5432/finance
WAREHOUSE_DIALECT=postgres

# Vector DB
VECTOR_STORE=pgvector
PGVECTOR_DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/context

# Unstructured sources
DOCS_DIR=./docs/finance
SLACK_BOT_TOKEN=...
SLACK_CHANNEL_ALLOWLIST=...   # optional

MULTI_AGENT_MODE=debate
MULTI_AGENT_MAX_ROUNDS=1
SQL_REPAIR_MODE=llm
SQL_REFINER_MODE=llm
  4. Load mock financial/product data (enter finance when prompted for the database name)
node src/mock_data/scripts/load_finance_static.mjs
  5. Run migrations (pgvector DB)
PGVECTOR_DATABASE_URL=postgres://postgres:postgres@127.0.0.1:5433/context \
  node --import tsx src/scripts/migrate.ts
  6. Run the server
npm run dev
  7. Create schema embeddings and generate synonyms
curl -X POST http://127.0.0.1:3000/admin/schema/refresh
curl -X POST http://127.0.0.1:3000/admin/schema/synonyms/refresh \
  -H 'Content-Type: application/json' \
  -d '{"mode":"llm","include_described":true,"max_tables":25,"max_columns":300,"max_synonyms":6}'
  8. Sync docs + Slack
curl -X POST http://127.0.0.1:3000/admin/docs/sync
curl -X POST http://127.0.0.1:3000/admin/slack/sync

The /admin/slack/sync endpoint triggers Slack ingestion: it fetches channels (filtered by SLACK_CHANNEL_ALLOWLIST if set), pulls recent messages and thread replies, normalizes them into chunks, and stores them in the unstructured store (JSONL and/or pgvector). It returns counts for channels, messages, and chunks indexed.

Embeddings are stored in pgvector (the unstructured_chunks table).

  9. Open the frontend at http://127.0.0.1:3000/ and start asking questions. Behind the scenes it calls the following endpoint:
curl -X POST http://127.0.0.1:3000/resolve \
  -H 'content-type: application/json' \
  -d '{"question_text":"What is MTD net P&L by portfolio for active accounts?"}'
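The same call from Node, for scripting (a minimal sketch; it assumes the server from the steps above is running on port 3000 and returns a JSON body):

```typescript
// Minimal client for the /resolve endpoint. The endpoint and request body
// come from the curl example above; the response is treated as opaque JSON.
function buildResolvePayload(questionText: string): string {
  return JSON.stringify({ question_text: questionText });
}

async function resolve(
  questionText: string,
  base = "http://127.0.0.1:3000"
): Promise<unknown> {
  const res = await fetch(`${base}/resolve`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: buildResolvePayload(questionText),
  });
  if (!res.ok) throw new Error(`resolve failed: HTTP ${res.status}`);
  return res.json();
}
```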

Sample queries

“What is MTD net P&L by portfolio for active accounts?”

“Give YTD P&L (total, realized, unrealized) by desk.”

“Exclude FX impact: report net P&L by portfolio for QTD.”

“Average balance by region for active accounts this quarter.”

“Show daily gross vs net P&L for the Wealth desk last week.”

“What are total cash balances by product line for the most recent date?”

“What is NAV by portfolio for 2025‑01‑31?”

“List top 5 accounts by ending balance last month.”

Health Endpoint

  • GET /health — returns active providers and connectivity status

QuerySpec

{
  question_text: string;
  metric_terms: string[];
  measure_intent: { kind: "count"|"sum"|"avg"|"rate"|"distinct_count", numerator?: string|null, denominator?: string|null };
  grain: string|null;
  time_window: { start?: string, end?: string, tz?: string } | null;
  group_by: string[];
  filters: Array<{ field_hint: string; op: string; value: any }>;
  entities: string[];
  ambiguity_notes: string[];
  assumptions: string[];
}
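The QuerySpec agent emits this shape as strict JSON, validated with Zod before use. A dependency-free sketch of the kind of structural check that validation performs (only a few fields shown; the real Zod schema checks every field strictly):

```typescript
// Partial structural check mirroring what the Zod schema enforces for a few
// QuerySpec fields; illustrative only, not the repository's validator.
const MEASURE_KINDS = ["count", "sum", "avg", "rate", "distinct_count"] as const;

function isQuerySpecShape(x: unknown): boolean {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Record<string, unknown>;
  const intent = o.measure_intent as Record<string, unknown> | null | undefined;
  return (
    typeof o.question_text === "string" &&
    Array.isArray(o.metric_terms) &&
    o.metric_terms.every((t) => typeof t === "string") &&
    intent != null &&
    MEASURE_KINDS.includes(intent.kind as (typeof MEASURE_KINDS)[number])
  );
}
```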

Verification Checks

Implemented in src/verify/checks.ts:

  1. sql_runs
  2. row_count_nonnegative
  3. join_duplication_check (dup_ratio <= 1.05)
  4. null_join_key_rate (<= 0.2)
  5. rate_bounds (0 ≤ value ≤ 1)
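Each check reduces to a simple ratio or bound. A sketch of the arithmetic using the thresholds listed above (function names are illustrative; the real implementations in src/verify/checks.ts run against query results, not plain numbers):

```typescript
// Illustrative versions of three checks; thresholds match the list above.
function joinDuplicationOk(rowsAfterJoin: number, rowsBefore: number): boolean {
  const dupRatio = rowsBefore === 0 ? 1 : rowsAfterJoin / rowsBefore;
  return dupRatio <= 1.05; // joins may not inflate row counts by more than 5%
}

function nullJoinKeyRateOk(nullKeys: number, totalRows: number): boolean {
  const rate = totalRows === 0 ? 0 : nullKeys / totalRows;
  return rate <= 0.2; // at most 20% of join keys may be null
}

function rateInBounds(value: number): boolean {
  return value >= 0 && value <= 1; // rates must be valid proportions
}
```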

Context Layer Architecture

About

The context layer for your data
