qmd/finetune/data
Tobi Lütke 3b87e3e224
feat: query document format, lex phrase/negation syntax, training data
The 'query document' is now a first-class concept in QMD: a structured
document with typed sub-queries that combine for best recall.

## Query types
- lex:    BM25 keyword search with phrase and negation syntax
- vec:    Semantic vector search (natural language questions)
- hyde:   Hypothetical document (write the expected answer)
- expand: Auto-expand via local LLM (max 1, default for plain queries)
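As a rough illustration of how typed sub-queries might be read from a query document, here is a minimal sketch. It assumes one sub-query per line written as `type: text`, which this commit message does not spell out; `parse_query_document` and `QUERY_TYPES` are hypothetical names, not QMD's actual API. Lines without a recognized type fall back to `expand`, and more than one `expand` is rejected, following the "max 1, default for plain queries" rule above.

```python
from typing import List, Tuple

# The four sub-query types named in the commit message.
QUERY_TYPES = ("lex", "vec", "hyde", "expand")


def parse_query_document(text: str) -> List[Tuple[str, str]]:
    """Split a query document into (type, body) pairs.

    Assumption: each non-empty line is one sub-query of the form
    "type: body". Lines without a known type prefix are treated as
    plain queries and default to "expand".
    """
    subqueries: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        prefix, sep, rest = line.partition(":")
        kind = prefix.strip().lower()
        if sep and kind in QUERY_TYPES:
            body = rest.strip()
        else:
            # No recognized prefix: the whole line is a plain query.
            kind, body = "expand", line
        if body:
            subqueries.append((kind, body))

    # The commit message allows at most one expand sub-query.
    if sum(1 for kind, _ in subqueries if kind == "expand") > 1:
        raise ValueError("at most one expand sub-query is allowed")
    return subqueries


if __name__ == "__main__":
    doc = (
        'lex: "connection pool" timeout -redis\n'
        "vec: why do database connections time out under load\n"
    )
    for kind, body in parse_query_document(doc):
        print(kind, "|", body)
```

A plain query such as `connection pool timeout` yields a single `expand` entry under these assumptions.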

## Lex syntax
Full BM25 operator support:
  "exact phrase"     verbatim match, no prefix matching
  -term              exclude documents containing term
  -"exact phrase"    exclude documents containing phrase

Examples:
  "C++ performance" optimization -sports -athlete
  "connection pool" timeout -redis
  "machine learning" -sports -athlete
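The syntax above can be sketched as a small parser. This is not the parser added in this commit, only an illustration under stated assumptions: `LexQuery`, `parse_lex`, and `to_fts5` are hypothetical names, empty input is assumed to return `None` just like a negation-only query, and the FTS5 rendering (prefix `*` on plain terms, `NOT` over an `OR` group for exclusions) is one plausible mapping rather than QMD's confirmed output.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One token: optional leading "-", then a quoted phrase or a bare word.
_TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


@dataclass
class LexQuery:
    terms: List[str] = field(default_factory=list)
    phrases: List[str] = field(default_factory=list)
    excluded_terms: List[str] = field(default_factory=list)
    excluded_phrases: List[str] = field(default_factory=list)


def parse_lex(text: str) -> Optional[LexQuery]:
    """Parse plain terms, "phrases", -term and -"phrase" exclusions."""
    q = LexQuery()
    for neg, phrase, word in _TOKEN.findall(text):
        if word:
            # Bare word; drop stray dashes and unbalanced quotes.
            value = word.lstrip("-").strip('"')
            target = q.excluded_terms if neg else q.terms
        else:
            value = phrase.strip()
            target = q.excluded_phrases if neg else q.phrases
        if value:
            target.append(value)
    # A query with nothing to match positively cannot be run:
    # FTS5's NOT is a binary operator and needs a left-hand side.
    if not q.terms and not q.phrases:
        return None
    return q


def _quote(s: str) -> str:
    # FTS5 strings escape an embedded double quote by doubling it.
    return '"' + s.replace('"', '""') + '"'


def to_fts5(q: LexQuery) -> str:
    """Render as an FTS5 MATCH expression (assumed mapping)."""
    positive = [_quote(t) + "*" for t in q.terms]  # prefix match
    positive += [_quote(p) for p in q.phrases]     # verbatim match
    expr = " AND ".join(positive)
    negative = [_quote(x) for x in q.excluded_terms + q.excluded_phrases]
    if negative:
        expr = f"({expr}) NOT ({' OR '.join(negative)})"
    return expr


if __name__ == "__main__":
    parsed = parse_lex('"C++ performance" optimization -sports -athlete')
    print(to_fts5(parsed))
```

Returning `None` for a negation-only query mirrors the FTS5 constraint noted in the unit tests: there is no positive match set to subtract from.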

## MCP tool description rewritten
The 'query' tool description now fully teaches AI agents the query
document format, lex syntax, and strategy for combining types.
It includes worked examples, among them intent-aware lex (C++ performance,
not sports), which is critical for disambiguation in dense corpora.

## Unit tests
11 new lex parser tests covering:
- plain terms, quoted phrases, negation, combined
- intent-aware disambiguation (performance -sports -athlete)
- only-negation returns null (FTS5 constraint)
- empty/whitespace handling

## Training data
12 new intent-aware examples for next model training round:
- Real technical topics with lex phrase+negation combinations
- Covers: C++ perf, Python memory, DB connections, rate limiting,
  SQL optimization, ML overfitting, Docker, JWT, async/await,
  git conflicts, Kubernetes, React state
- Each shows how context/intent shapes lex query construction
  (e.g. performance with C++ context → -sports -athlete exclusions)
2026-02-19 06:52:58 -05:00
| Name | Last commit | Date |
| --- | --- | --- |
| train | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| train-lfm2 | feat(finetune): hyde-first ordering, relative paths, structured format | 2026-02-17 06:31:35 -05:00 |
| best_glm_prompt.txt | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| convert_to_chatml.py | feat(finetune): hyde-first ordering, relative paths, structured format | 2026-02-17 06:31:35 -05:00 |
| convert_to_structured.py | feat(finetune): hyde-first ordering, relative paths, structured format | 2026-02-17 06:31:35 -05:00 |
| fix_hyde.py | feat(finetune): improve query expansion dataset v3 | 2026-02-17 06:19:59 -05:00 |
| fix_lex_filler.py | feat(finetune): improve query expansion dataset v3 | 2026-02-17 06:19:59 -05:00 |
| gepa_generated.prompts.json | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_balanced_deduped.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_diverse_addon.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_handcrafted_only.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_handcrafted.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_lex_phrases_negation.jsonl | feat: query document format, lex phrase/negation syntax, training data | 2026-02-19 06:52:58 -05:00 |
| qmd_expansion_locations.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_people.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_short_nontech.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_v2.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| qmd_expansion_v3_structured.jsonl | feat(finetune): hyde-first ordering, relative paths, structured format | 2026-02-17 06:31:35 -05:00 |
| qmd_expansion_v3.jsonl | feat(finetune): improve query expansion dataset v3 | 2026-02-17 06:19:59 -05:00 |
| qmd_only_sampled.jsonl | lots of training stuff | 2026-01-31 23:02:23 +00:00 |
| verify_data.py | feat(finetune): hyde-first ordering, relative paths, structured format | 2026-02-17 06:31:35 -05:00 |