qmd/data at 3b87e3e22439a29b5776a18a82229dc58ef10b7c - qmd


Tobi Lütke 3b87e3e224	2026-02-19 06:52:58 -05:00

feat: query document format, lex phrase/negation syntax, training data

The 'query document' is now a first-class concept in QMD: a structured
document with typed sub-queries that combine for best recall.

## Query types

- lex: BM25 keyword search with phrase and negation syntax
- vec: Semantic vector search (natural language questions)
- hyde: Hypothetical document (write the expected answer)
- expand: Auto-expand via local LLM (max 1, default for plain queries)

## Lex syntax

Full BM25 operator support:

  "exact phrase"     verbatim match, no prefix
  -term              exclude documents containing term
  -"exact phrase"    exclude documents containing phrase

Examples:

  "C++ performance" optimization -sports -athlete
  "connection pool" timeout -redis
  "machine learning" -sports -athlete

## MCP tool description rewritten

The 'query' tool description now fully teaches AI agents the query
document format, lex syntax, and strategy for combining types. Includes
worked examples including intent-aware lex (C++ performance, not sports)
which is critical for disambiguation in dense corpora.

## Unit tests

11 new lex parser tests covering:

- plain terms, quoted phrases, negation, combined
- intent-aware disambiguation (performance -sports -athlete)
- only-negation returns null (FTS5 constraint)
- empty/whitespace handling

## Training data

12 new intent-aware examples for next model training round:

- Real technical topics with lex phrase+negation combinations
- Covers: C++ perf, Python memory, DB connections, rate limiting,
  SQL optimization, ML overfitting, Docker, JWT, async/await,
  git conflicts, Kubernetes, React state
- Each shows how context/intent shapes lex query construction
  (e.g. performance with C++ context → -sports -athlete exclusions)
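
The lex syntax described in the commit message above can be illustrated with a small tokenizer. This is a minimal sketch, not QMD's actual parser: the function name `parse_lex` and the shape of its return value are assumptions made for illustration. It only separates included from excluded terms and phrases, and it returns `None` when no positive term is present, in the spirit of the "only-negation returns null" case listed under the unit tests.

```python
import re
from typing import List, Optional, Tuple, TypedDict

# One token: an optional leading '-', then either a "quoted phrase" or a bare term.
_TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

# (text, is_phrase) pairs; is_phrase marks tokens that were written in quotes.
Item = Tuple[str, bool]


class LexQuery(TypedDict):
    include: List[Item]
    exclude: List[Item]


def parse_lex(query: str) -> Optional[LexQuery]:
    """Split a lex query into included and excluded terms/phrases.

    Hypothetical helper: returns None for empty input or when every token
    is negated, since a purely negative query has nothing to match against.
    """
    include: List[Item] = []
    exclude: List[Item] = []
    for match in _TOKEN.finditer(query):
        negated = match.group(1) == "-"
        is_phrase = match.group(2) is not None
        text = (match.group(2) if is_phrase else match.group(3)).strip()
        if not text:
            continue  # skip empty quoted phrases such as ""
        (exclude if negated else include).append((text, is_phrase))
    if not include:
        return None
    return {"include": include, "exclude": exclude}


if __name__ == "__main__":
    print(parse_lex('"C++ performance" optimization -sports -athlete'))
    print(parse_lex("-sports -athlete"))
```

Keeping the `is_phrase` flag on each item lets a later stage treat quoted phrases as verbatim matches while handling bare terms differently, which is the distinction the syntax table draws.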
train	lots of training stuff	2026-01-31 23:02:23 +00:00
train-lfm2	feat(finetune): hyde-first ordering, relative paths, structured format	2026-02-17 06:31:35 -05:00
best_glm_prompt.txt	lots of training stuff	2026-01-31 23:02:23 +00:00
convert_to_chatml.py	feat(finetune): hyde-first ordering, relative paths, structured format	2026-02-17 06:31:35 -05:00
convert_to_structured.py	feat(finetune): hyde-first ordering, relative paths, structured format	2026-02-17 06:31:35 -05:00
fix_hyde.py	feat(finetune): improve query expansion dataset v3	2026-02-17 06:19:59 -05:00
fix_lex_filler.py	feat(finetune): improve query expansion dataset v3	2026-02-17 06:19:59 -05:00
gepa_generated.prompts.json	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_balanced_deduped.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_diverse_addon.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_handcrafted_only.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_handcrafted.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_lex_phrases_negation.jsonl	feat: query document format, lex phrase/negation syntax, training data	2026-02-19 06:52:58 -05:00
qmd_expansion_locations.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_people.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_short_nontech.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_v2.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
qmd_expansion_v3_structured.jsonl	feat(finetune): hyde-first ordering, relative paths, structured format	2026-02-17 06:31:35 -05:00
qmd_expansion_v3.jsonl	feat(finetune): improve query expansion dataset v3	2026-02-17 06:19:59 -05:00
qmd_only_sampled.jsonl	lots of training stuff	2026-01-31 23:02:23 +00:00
verify_data.py	feat(finetune): hyde-first ordering, relative paths, structured format	2026-02-17 06:31:35 -05:00