ai-workspace-services/qmd

Author	SHA1	Message	Date
Tobi Lütke	03a25d69d9	Add QMD architecture diagram to README Generated with PaperBanana (Gemini 3 Pro). Shows query expansion fanning HyDE+Vec into vector searches, Lex into BM25, merged via reciprocal rank fusion and LLM reranking.	2026-02-14 19:17:11 -05:00
Claude	73136e4f59	fix: verify sqlite-vec readiness after extension load. Closes #169	2026-02-14 19:15:21 -05:00
Claude	96643a28ed	fix: reactivate deactivated documents on re-index. Closes #168	2026-02-14 19:15:21 -05:00
Claude	0eabfe73db	fix: allow $ route filenames in handelize. Closes #162	2026-02-14 19:14:46 -05:00
Claude	da79e77d34	feat: add --version/-v flag. Closes #88	2026-02-14 19:14:46 -05:00
Claude	5dec3ab662	fix: disable following symlinks in glob.scan. Closes #134	2026-02-14 19:14:46 -05:00
Tobi Lütke	96634da39b	feat: promote query as primary search command, add CLI aliases List query first in --help as the recommended search method. Add vector-search and deep-search as undocumented CLI aliases matching MCP tool names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 00:34:29 -05:00
Tobi Lütke	993628e768	fix: add missing context to search results markdown and XML formatters searchResultsToMarkdown and searchResultsToXml in formatter.ts were silently dropping the context field. Added formatter.test.ts covering context visibility across all output formats. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 00:34:23 -05:00
Ilya Grigorik	785bbcf319	MCP: Streamable HTTP, scoring fixes, tool improvements (#149 ) * feat: MCP HTTP transport with daemon lifecycle Add streaming HTTP transport as an alternative to stdio for the MCP server. A long-lived HTTP server avoids reloading 3 GGUF models (~2GB) on every client connection, reducing warm query latency from ~16s (CLI) to ~10s. New CLI surface: qmd mcp --http [--port N] # foreground, default port 3000 qmd mcp --http --daemon # background, PID in ~/.cache/qmd/mcp.pid qmd mcp stop # stop daemon via PID file qmd status # now shows MCP daemon liveness Server implementation (mcp.ts): - Extract createMcpServer(store) shared by stdio and HTTP transports - HTTP transport uses WebStandardStreamableHTTPServerTransport with JSON responses (stateless, no SSE) - /health endpoint with uptime, /mcp for MCP protocol, 404 otherwise - Request logging to stderr with timestamps, tool names, query args Daemon lifecycle (qmd.ts): - PID file + log file management with stale PID detection - Absolute paths in Bun.spawn (process.execPath + import.meta.path) so daemon works regardless of cwd - mkdirSync for cache dir on fresh installs - Removes top-level SIGTERM/SIGINT handlers before starting HTTP server so async cleanup in mcp.ts actually runs Move hybridQuery() and vectorSearchQuery() into store.ts as standalone functions that take a Store as first argument. Both CLI and MCP now call the identical pipeline, eliminating the class of bugs where one copy drifts from the other. Shared pipeline (store.ts): - hybridQuery(): BM25 probe → expand → FTS+vec search → RRF → chunk → rerank (chunks only) → position-aware blending → dedup - vectorSearchQuery(): expand → vec search → dedup → sort - SearchHooks interface for optional progress callbacks - Constants: STRONG_SIGNAL_MIN_SCORE, STRONG_SIGNAL_MIN_GAP, RERANK_CANDIDATE_LIMIT (40), addLineNumbers() Bugs fixed by unification: - MCP now gets strong-signal short-circuit (was CLI-only) - Reranker candidate limit unified at 40 (MCP had 30) - File dedup added to hybrid query (MCP was missing it) - Collection filter pushed into searchVec DB query - Filter-then-slice ordering fixed (MCP was slice-then-filter) * feat: type-routed query expansion — lex→FTS, vec/hyde→vector expandQuery() now returns typed ExpandedQuery[] instead of string[], preserving the lex/vec/hyde type info from the LLM's GBNF-structured output. hybridQuery() and vectorSearchQuery() route searches by type: lex queries go to FTS only, vec/hyde go to vector only. Previously, every expanded query ran through BOTH backends — keyword variants wasted embedding forward passes, semantic paraphrases wasted BM25 lookups. Type routing eliminates ~4 calls/query with zero quality loss (cross-backend noise actually hurt RRF fusion). Cache format changed from newline-separated text to JSON (preserves types). Old cache entries gracefully re-expand on first access. CLI expansion tree now shows query types: ├─ original query ├─ lex: keyword variant ├─ vec: semantic meaning └─ hyde: hypothetical document... Benchmark (5 queries, 1756-doc index, warm LLM, Apple Silicon): Metric Old (untyped) New (typed) Delta Avg backend calls 10.0 6.0 -40% Total wall time 1278ms 549ms -57% Avg saved/query — — 146ms "authentication setup" 12 → 7 calls 511 → 112ms "database migration strategy" 10 → 6 calls 182 → 106ms "how to handle errors in API" 10 → 6 calls 216 → 121ms "meeting notes from last week" 10 → 6 calls 228 → 110ms "performance optimization" 8 → 5 calls 141 → 100ms Savings come from skipped embed() calls (~30-80ms each). FTS is synchronous SQLite (~0ms), so lex→FTS routing is free while vec/hyde→vector-only avoids wasted embedding passes. * fix: MCP query snippets now use reranker's best chunk, not full body extractSnippet() was scanning the entire document body for keyword matches to build the snippet. But hybridQuery() already identified the most relevant chunk via cross-attention reranking — rescanning the full body is redundant and can land on a less relevant section if the query terms appear elsewhere in the document. CLI was already using bestChunk (set during the refactor). MCP was still using body — a pre-existing inconsistency, not a regression. * feat: dynamic MCP instructions + tool annotations The MCP server now generates instructions at startup from actual index state and injects them into the initialize response. LLMs see collection names, document counts, content descriptions, and search strategy guidance in their system prompt — zero tool calls needed for orientation. Previously, the only guidance was generic static tool descriptions and a user-invocable "query" prompt that no LLM would discover on its own. An LLM connecting to QMD had no idea what collections existed, what they contained, or how to scope searches effectively. * change default port to 8181 * fix: BM25 score normalization was inverted The normalization formula `1 / (1 + \|bm25\|)` is a decreasing function of match strength. FTS5 BM25 scores are negative where more negative = better match (e.g., -10 is strong, -0.5 is weak). The formula mapped: strong match (raw -10) → 1/(1+10) = 9% ← should be highest weak match (raw -0.5) → 1/(1+0.5) = 67% ← should be lowest Three downstream effects: 1. `--min-score 0.5` (or MCP minScore: 0.5) filtered OUT strong matches and kept only weak ones. The MCP instructions recommend this threshold. 2. CLI `formatScore()` color bands never showed green for BM25 results (best matches scored ~9%, green threshold is 70%). 3. The strong signal optimization in hybridQuery (skip ~2s LLM expansion when BM25 already has a clear winner) was dead code — strong matches scored ~0.09, never reaching the 0.85 threshold. Fix: `\|x\| / (1 + \|x\|)` — same (0,1) range, monotonic, no per-query normalization needed, but now correctly maps strong → high, weak → low. The normalization was born broken (Math.max(0, x) clamped all negative BM25 to 0 → every score = 1.0), then PR #76 changed to Math.abs which made scores vary but inverted the direction. Neither state was ever correct. * fix: rerank cache key ignores chunk content The rerank cache key was (query, file, model) but the actual text sent to the reranker is a keyword-selected chunk that varies by query terms. Two different queries hitting the same file can select different chunks, but the second query gets a stale cached score from the first chunk. Example: Query "auth flow" → selects chunk about authentication → score 0.92 Query "auth tokens" → same file, selects chunk about tokens → cache HIT on (query, file, model) → returns 0.92 from wrong chunk Fix: include full chunk text in cache key. getCacheKey() already SHA-256 hashes its inputs, so this adds no key bloat — just disambiguation. Old cache entries become natural misses (different key shape) and re-warm on next query. * rename MCP tools for clarity, rewrite descriptions for LLM tool selection Rename MCP tools: vsearch → vector_search, query → deep_search. LLMs see these names — self-documenting names reduce reliance on descriptions for tool selection. CLI commands stay unchanged (qmd vsearch, qmd query) — different namespace, users type those. Rewrite all search tool descriptions to be action-oriented: - search: "Search by keyword. Finds documents containing exact words and phrases in the query." - vector_search: "Search by meaning. Finds relevant documents even when they use different words than the query — handles synonyms, paraphrases, and related concepts." - deep_search: "Deep search. Auto-expands the query into variations, searches each by keyword and meaning, and reranks for top hits across all results." Rewrite instructions ladder — each tool says what it does, no "start here" / "escalate as needed" strategy language. Delete the "query" prompt (registerPrompt) — it restated what descriptions + instructions already cover. No LLM proactively calls prompts/get to learn how to use tools. * supress HTTP server logs during tests	2026-02-10 16:37:33 -05:00
Matt Galligan	63028fd5e9	feat: add Claude Code plugin support with inline status check (#99 ) - Add marketplace.json for Claude Code plugin installation - Simplify skill status check to inline `qmd status` (portable across agents) - Update SKILL.md MCP section, reference mcp-setup.md for manual config - Clean up mcp-setup.md (remove redundant prerequisites) - Rename MCP-SETUP.md to mcp-setup.md Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 14:14:24 -05:00
David Gil	47b705409e	fix: BM25 score normalization - use Math.abs instead of Math.max (#76 ) BM25 scores in SQLite FTS5 are negative (lower = better match). The previous code used Math.max(0, score) which clamped all negative scores to 0, resulting in all results showing 100% (score = 1.0). Fix: Use Math.abs(score) to properly convert negative BM25 scores to positive values for the normalization formula. Before: All results show Score: 100% After: Scores vary based on actual BM25 relevance (e.g., 16%, 5%, 6%) Fixes #74	2026-02-01 16:38:52 -05:00
Christopher Stöckl	0f87e2429d	fix: workaround Bun UTF-8 path corruption bug (#82 ) Replace Bun.file() async calls with Node.js fs sync methods to work around a Bun bug that corrupts UTF-8 file paths containing non-ASCII characters. Bug: Bun.file(filepath).stat() and Bun.file(filepath).text() internally mangle UTF-8 encoding, causing ENOENT errors with mojibake paths when accessing files in iCloud Drive and other locations. Changes: - src/qmd.ts: Use readFileSync instead of Bun.file().text() - src/qmd.ts: Use statSync instead of Bun.file().stat() for file metadata - src/store.ts: Use statSync for SQLite custom path detection	2026-02-01 16:37:04 -05:00
Matthías Páll Gissurarson	5de063ae96	Fix: Add missing --index option to argument parser (#84 ) * Fix: Add missing --index option to argument parser The --index flag was documented and used in code but not defined in parseArgs options, causing it to be ignored. Now properly handles custom index names like: qmd --index test status * Feature: Use index name for config files too Now --index <name> loads ~/.config/qmd/<name>.yml instead of index.yml. This allows completely separate indexes with their own collections. Example: qmd --index hackage status → Uses ~/.config/qmd/hackage.yml + ~/.cache/qmd/hackage.sqlite Moved hackage collection to hackage.yml for separation.	2026-02-01 16:36:51 -05:00
Tobi Lütke	102ff861d3	fix: use Qwen3 recommended sampling params to prevent repetition loops - Changed temperature from 0/0.1 to 0.7 (Qwen3 non-thinking mode default) - Added topK=20, topP=0.8 per Qwen3 docs - Added repeatPenalty with presencePenalty=0.5 for query expansion - Fixes infinite loop on acronyms like DHH, BFCM Qwen3 docs explicitly warn: 'DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions'	2026-02-01 03:24:20 +00:00
Tobi Lütke	479b68bbf1	add qmd model pull and refresh logic	2026-01-31 23:02:23 +00:00
Tobi Lütke	bf1b8fc90a	lots of training stuff	2026-01-31 23:02:23 +00:00
Tobi Lutke	17c201ea81	fix: correct QMD acronym to Query Markup Documents Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 13:22:54 -05:00
Tobi Lutke	739038e1a7	docs: add explicit HuggingFace repo destinations - List all HuggingFace repos in CLAUDE.md (model, gguf, sft, grpo, train) - Update jobs scripts to use tobil/qmd-query-expansion-train (no -v2) - Clarify rules: no versioned repos, update in place Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 12:26:02 -05:00
Tobi Lutke	38073799c0	chore: clean up finetune folder and fix training workflow - Remove versioned files (sft_v4.yaml, prepare_v4_dataset.py, train_v2/) - Update configs to use local data/train/ directory - Add glob pattern support to prepare_data.py and train.py - Update .gitignore to properly ignore outputs/ and data/train*/ - Document data preparation step in CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 12:21:09 -05:00
Tobi Lutke	533f0eed37	docs: add finetune CLAUDE.md and update training workflow - Add finetune/CLAUDE.md documenting the training pipeline - Update configs to output to local outputs/ directory (gitignored) - Document that all data/*.jsonl files are training data - Document local CUDA training vs HuggingFace Jobs cloud training - Enforce eval requirement before any model upload - Single model repo (no -v1, -v2, -v4 versioning) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 12:15:56 -05:00
Tobi Lutke	7de18ee066	Merge main into finetune Brings in: - /only: variants for single-type expansions - LLM session management for lifecycle safety - skills.sh integration for AI agent discovery - Various bug fixes for vector search and embeddings Merge conflicts resolved by keeping hyde-first format ordering from finetune branch while accepting expanded templates and new features from main. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 12:10:22 -05:00
Tobi Lutke	785620467a	refactor: reorder output format to put hyde line first Move the hyde (hypothetical document) line to the beginning of the output format, before lex and vec lines. This better reflects the logical flow where the hypothetical document is generated first and then informs the keyword/semantic expansions. Also adds auto-download of eval_common.py in training scripts for standalone HuggingFace Jobs execution. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 12:09:04 -05:00
Tobi Lütke	8cc7d8c138	Add sampled /only: variants (399) for training balance	2026-01-31 16:29:02 +00:00
Tobi Lütke	20aef8a3e9	Change format to /only:lex (slash prefix)	2026-01-31 16:24:18 +00:00
Tobi Lütke	46ff098361	Change only: format to only:lex (no space after colon)	2026-01-31 16:23:28 +00:00
Tobi Lütke	806a0cfc14	Add 'only:' mode support for single-type expansions - generate_only_variants.py: Creates training data where queries end with 'only: lex', 'only: vec', or 'only: hyde' and output contains ONLY that type - reward.py: Updated scorer to handle 'only:' mode separately - Penalizes presence of unwanted types - Type-specific quality checks - Filters templated low-quality hyde outputs - 4,444 high-quality 'only:' variants from v2 + handcrafted data	2026-01-31 16:15:59 +00:00
Tobi Lütke	32d313ad6b	Add LLM session management for lifecycle safety Adds a session layer that prevents LLM contexts from being disposed mid-operation during long-running tasks like batch embedding or multi-step search workflows (expand → embed → rerank). Key changes: - Add LLMSessionManager with reference counting for active sessions - Add LLMSession class for scoped access with automatic acquire/release - Add withLLMSession() API for multi-step workflows - Update idle timer to check canUnloadLLM() before disposing - Wrap querySearch, vectorSearch, and embed command in sessions - Add optional session parameter to searchVec and getEmbedding Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 15:20:20 +00:00
Christopher Jones	6d9871d2f5	Fix DisposedError during slow batch embedding (#41 )	2026-01-29 18:28:48 -08:00
Algimantas Krasauskas	f6a987a642	Add skills.sh integration for AI agent discovery (#64 )	2026-01-29 18:27:50 -08:00
Tobi Lutke	7b98d4d308	Link fine-tuned model to HuggingFace in README Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>	2026-01-28 23:37:18 -08:00
Tobi Lutke	5cf4958bfa	Add HuggingFace model card YAML metadata to finetune README Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>	2026-01-28 23:33:55 -08:00
Tobias Lütke	eb1b77c8cb	Deploy fine-tuned GRPO model as default query expansion (#67 ) * Add query expansion model finetuning infrastructure - Training scripts for Qwen3-0.6B and 1.7B models - Dataset generation from s-emanuilov/query-expansion - Evaluation scripts comparing finetuned vs baseline models - GRPO RL training script (optional improvement) - Export script for GGUF conversion Results: - 0.6B finetuned: 95% format compliance (lex/vec/hyde) - Baseline: 0% format compliance - Dataset: 5,157 examples on HuggingFace Hub Models available at: - tobil/qmd-query-expansion-0.6B (recommended) - tobil/qmd-query-expansion-train (dataset) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix GRPO training script for TRL API compatibility - Use max_completion_length instead of max_new_tokens - Use processing_class instead of tokenizer - Use args instead of config for GRPOTrainer - Add __name__ attribute to reward function class - Accept *kwargs in reward function for extra TRL args - Add new LoRA adapter after merging SFT weights Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Update README with final evaluation results - 0.6B SFT: 95% format compliance (best) - 0.6B GRPO: 0% (catastrophic forgetting from RL) - 1.7B v2: training completed, evaluation pending - Added GRPO evaluation results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add comprehensive scoring system for query expansion New scoring criteria (0-100 points): - Format (30): Must have lex: and vec: prefixes - Diversity (30): Multiple types, no echoing query, diverse expansions - Hyde (20): Optional, concise, no newlines, no word repetition - Quality (20): Lex=keywords, vec=natural language See SCORING.md for full documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add HuggingFace login and comprehensive scoring to GRPO v2 training - Add explicit HF_TOKEN login before training - Use SCORING.md criteria as RL reward function - Conservative training: LR 1e-6, LoRA rank 4 - Reward scores: good=0.94, bad=0.38 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Refactor finetune folder: train/rl scripts with YAML configs Major changes: - train.py: Generic SFT training script using YAML config - rl.py: Generic GRPO training script using YAML config - configs/: YAML configs per training run (sft_v4.yaml, grpo_v4.yaml) - dataset/: Data preparation scripts moved here - tui.py: Interactive model testing interface Training results: - SFT v4: 98.8% avg score (all Excellent) - GRPO v4: 0% (failed - model drifted to verbose explanations) Removed per-model scripts (train_0.6B.py, train_1.7B.py, etc) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add named entity extraction to GRPO reward function Key changes: - Extract named entities (acronyms, proper nouns, technical terms) - Heavy penalty (-30) when lex queries miss named entities - Penalty (-15) for generic filler phrases like "find information about" - Compound entity detection (TDS motorsports -> both words) - Update GRPO config with KL regularization (beta=0.04) - Lower learning rate (5e-7) and add max_steps (200) Test results: - "who is TDS motorsports" good: 1.00, bad: 0.30 (was 0.75) - "how to use React hooks" good: 0.87, bad: 0.45 (was 0.75) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add chat template leakage detection to reward function Zero reward for outputs containing: - <\|im_start\|>, <\|im_end\|> tokens - <think>, </think> tags (Qwen3 thinking mode) - Role markers like \nassistant\n, \nuser\n - <\|endoftext\|> token Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Strict format validation: every line must be lex:/vec:/hyde: Any line that doesn't start with a valid prefix now returns 0.0 instead of just counting as a penalty. This prevents any prose, explanations, bullet points, or other invalid content. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Clean up evaluation files - Remove old versioned evaluation files (0.6B, 1.7B, baseline) - Rename evaluation_v4.json -> evaluation_sft.json - Rename evaluation_v4_grpo.json -> evaluation_grpo_failed.json Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Refactor evals into separate run and score scripts New structure: - evals/run.py: Generate model outputs to JSONL - evals/score.py: Score outputs with detailed breakdown - evals/queries.txt: Test queries (26 total) Features: - Supports both HF Hub and local model paths - Named entity preservation scoring - Chat template leakage detection - Strict format validation (every line must be lex:/vec:/hyde:) - Generic phrase detection Usage: uv run evals/run.py --model tobil/qmd-query-expansion-0.6B-v4 uv run evals/score.py evals/results_.jsonl Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Fix GRPO model loading to use SFT base first The GRPO adapter was trained on merged SFT weights, so loading it directly on the base model results in 0% score. Added --sft-model parameter to evals/run.py to load SFT first, then apply GRPO adapter. With correct loading: GRPO scores 89.7% (all 26 queries Excellent). Updated README with correct GRPO score and loading instructions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix TUI to load GRPO models with SFT base first GRPO adapters were trained on merged SFT weights, so they need SFT loaded and merged first before applying the GRPO adapter. Updated MODELS config to include sft_base path for GRPO models, and load_model() now handles the SFT -> merge -> GRPO flow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Update README for unified model repository structure All models (0.6B, 1.7B, 4B) with SFT and GRPO variants now go into a single HuggingFace repo (tobil/qmd-query-expansion) with subfolders for each size and training method. Updated loading examples to show subfolder-based model loading. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Update README with separate model repos Changed from subfolder approach to separate repos per model since trainer.push_to_hub() doesn't support subfolder argument. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add 1.7B and 4B GRPO training and GGUF conversion scripts Training scripts for GRPO fine-tuning: - train_1.7B_grpo.py: GRPO training for Qwen3-1.7B - train_4B_grpo.py: GRPO training for Qwen3-4B GGUF conversion scripts: - convert_1.7B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF - convert_4B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF All scripts use PEP 723 inline dependencies for HuggingFace Jobs. Models published: - tobil/qmd-query-expansion-1.7B-sft - tobil/qmd-query-expansion-1.7B-grpo - tobil/qmd-query-expansion-1.7B-gguf - tobil/qmd-query-expansion-4B-sft - tobil/qmd-query-expansion-4B-grpo - tobil/qmd-query-expansion-4B-gguf Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Remove beads issue tracking Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Remove beads reference from CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix GRPO reward function to handle think blocks and end tokens - Strip <\|im_end\|> token from completions (model output includes it) - Change think_penalty to skipped_think bonus (+20 for not using think) - Adjust max_possible to account for bonus (120/140) - Fix typo in chat template artifact check Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Make TUI model list dynamic from HuggingFace Hub - Fetch available qmd-query-expansion models from tobil/ on Hub - Auto-detect model size (0.6B, 1.7B, 4B) and use correct base model - Group models by type (SFT vs GRPO) in menu - Skip GGUF repos in model listing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix GRPO training: apply chat template to prompts The SFT model was trained with chat template format but GRPO was passing raw prompts. Now prompts are formatted with tokenizer.apply_chat_template() so the model sees the same format it learned during SFT. Also update extract_query_from_prompt to strip chat template artifacts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Finetune 2.0: consolidate and simplify the entire training pipeline Consolidate ~2,800 lines of duplicated code across 12 files into 5 clean, well-documented files targeting Qwen3-1.7B end-to-end. Key changes: - Extract reward function into single source of truth (reward.py) Previously duplicated 3x with divergent bugs across rl.py, train_1.7B_grpo.py, and train_4B_grpo.py - Unify training into one script with sft/grpo subcommands (train.py) Replaces train.py + rl.py + train_1.7B_grpo.py + train_4B_grpo.py - Merge eval generate+score into single eval.py Replaces evals/run.py + evals/score.py - Parameterize GGUF conversion by --size (convert_gguf.py) Replaces convert_1.7B_gguf.py + convert_4B_gguf.py - Fix critical bug: rl.py silently ignored beta/temperature from config, causing the exact catastrophic drift its own comments warned about - Fix prompt consistency: all files use /no_think chat template format - Retarget configs from 0.6B to 1.7B - Comprehensive README documenting the full pipeline Removed: rl.py, train_1.7B_grpo.py, train_4B_grpo.py, convert_1.7B_gguf.py, convert_4B_gguf.py, tui.py, evals/run.py, evals/score.py Net: -3,429 lines, +382 lines Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com> * Add HF Jobs scripts, temporal query examples, and training results - jobs/sft.py and jobs/grpo.py: self-contained scripts for `hf jobs uv run` (no local GPU needed) - 12 temporal/recency query examples in training data (e.g. "recent news about Shopify" -> lex with years 2025/2026) - 4 temporal test queries in evals/queries.txt - README updated with HF Jobs workflow, training results, and updated file structure - Remove .beads tracking SFT and GRPO successfully trained on A10G via HF Jobs: SFT: eval loss 0.321, token accuracy 92.4% GRPO: mean reward 0.757, 200 steps, KL 0.00048 Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com> * Deploy fine-tuned GRPO model as default for query expansion Switch from generic Qwen3-1.7B-Q8_0 (~2.2GB) to fine-tuned qmd-query-expansion-1.7B-q4_k_m (~1.1GB). The fine-tuned Q4 scores 91.7% avg with 30/30 Excellent, outperforming the base Q8. - Update default generate model in src/llm.ts - Update README model table, architecture diagram, config block - Add v2 training data, eval scripts, and quantize job - Remove superseded v1 training data (5,742 → 1,000 examples) - Update finetune README with v2 results and file structure Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 23:25:17 -08:00
Tobi Lutke	8572c2fd94	Deploy fine-tuned GRPO model as default for query expansion Switch from generic Qwen3-1.7B-Q8_0 (~2.2GB) to fine-tuned qmd-query-expansion-1.7B-q4_k_m (~1.1GB). The fine-tuned Q4 scores 91.7% avg with 30/30 Excellent, outperforming the base Q8. - Update default generate model in src/llm.ts - Update README model table, architecture diagram, config block - Add v2 training data, eval scripts, and quantize job - Remove superseded v1 training data (5,742 → 1,000 examples) - Update finetune README with v2 results and file structure Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>	2026-01-28 23:24:58 -08:00
Tobi Lutke	5ab78d00a2	Add HF Jobs scripts, temporal query examples, and training results - jobs/sft.py and jobs/grpo.py: self-contained scripts for `hf jobs uv run` (no local GPU needed) - 12 temporal/recency query examples in training data (e.g. "recent news about Shopify" -> lex with years 2025/2026) - 4 temporal test queries in evals/queries.txt - README updated with HF Jobs workflow, training results, and updated file structure - Remove .beads tracking SFT and GRPO successfully trained on A10G via HF Jobs: SFT: eval loss 0.321, token accuracy 92.4% GRPO: mean reward 0.757, 200 steps, KL 0.00048 Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>	2026-01-28 15:46:44 -08:00
Tobi Lutke	354744af53	Finetune 2.0: consolidate and simplify the entire training pipeline Consolidate ~2,800 lines of duplicated code across 12 files into 5 clean, well-documented files targeting Qwen3-1.7B end-to-end. Key changes: - Extract reward function into single source of truth (reward.py) Previously duplicated 3x with divergent bugs across rl.py, train_1.7B_grpo.py, and train_4B_grpo.py - Unify training into one script with sft/grpo subcommands (train.py) Replaces train.py + rl.py + train_1.7B_grpo.py + train_4B_grpo.py - Merge eval generate+score into single eval.py Replaces evals/run.py + evals/score.py - Parameterize GGUF conversion by --size (convert_gguf.py) Replaces convert_1.7B_gguf.py + convert_4B_gguf.py - Fix critical bug: rl.py silently ignored beta/temperature from config, causing the exact catastrophic drift its own comments warned about - Fix prompt consistency: all files use /no_think chat template format - Retarget configs from 0.6B to 1.7B - Comprehensive README documenting the full pipeline Removed: rl.py, train_1.7B_grpo.py, train_4B_grpo.py, convert_1.7B_gguf.py, convert_4B_gguf.py, tui.py, evals/run.py, evals/score.py Net: -3,429 lines, +382 lines Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>	2026-01-28 14:00:36 -08:00
jdvmi00	64c6e6c2e3	fix: rename collectionId to collectionName in searchVec for proper filtering (#61 )	2026-01-27 22:03:02 -08:00
Freeman Jiang	bfb0eebc3e	fix: use sequential embedding on CPU-only systems to avoid race condition (#54 ) * fix: add promise guard to ensureEmbedContext to prevent race condition Root cause: ensureEmbedContext() was not thread-safe. When multiple parallel embedding requests called ensureEmbedContext() simultaneously, all would see embedContext === null and start creating new contexts. This race condition caused 'Context is disposed' errors as contexts were overwritten/orphaned. The fix adds a promise guard (embedContextCreatePromise) to ensure only one context creation runs at a time - identical to the pattern already used in ensureGenerateModel(). Changes: - Add embedContextCreatePromise field to track in-progress context creation - Modify ensureEmbedContext() to wait for existing creation if in progress - Update test comment and timeout for CPU-only systems Testing: - Fresh model download + qmd embed: 28/28 chunks succeeded (was 14/27) - All embedBatch tests pass - No warmup hack needed - full parallel performance from the start Environment tested: - Ubuntu 24.04 LTS (x64), Bun 1.3.6, node-llama-cpp 3.14.5, no GPU * test: improve race condition test to verify single context creation The previous test only verified embeddings succeeded but didn't prove the fix actually prevents multiple context creation. This improved test: - Instruments createEmbeddingContext to count invocations - Runs 5 concurrent embedBatch calls on a fresh LlamaCpp instance - Asserts exactly 1 context is created (fails with 5 without the fix) Verified locally: - With fix: 1 context created (PASS) - Without fix: 5 contexts created (FAIL) * chore: clear embedContextCreatePromise in dispose() for consistency	2026-01-27 22:02:36 -08:00
Copilot	053252ca24	Add Windows path utilities with cross-platform test coverage (#51 ) * Initial plan * Add Windows path utility functions and comprehensive tests Co-authored-by: tobi <347+tobi@users.noreply.github.com> * Add clarifying comments for Git Bash path detection logic Co-authored-by: tobi <347+tobi@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tobi <347+tobi@users.noreply.github.com>	2026-01-27 09:05:47 -05:00
sh54	ba7391832d	Add org-mode title extraction support (#50 ) Refactor extractTitle to use extension-based extractors: - .md: preserves original markdown logic (Notes skip behavior) - .org: extracts from #+TITLE: property or first * heading Extensions are lowercased for case-insensitive matching. Easy to add more file types in the future. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 11:38:49 -05:00
sh54	65c0f89560	Enable SQLite extension loading in devshell (#48 ) Override sqlite in devShell to enable extension loading for sqlite-vec support when running tests. Only sets BREW_PREFIX if not already defined to avoid overriding user's existing setup. Package build remains unchanged. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 11:37:03 -05:00
Tobi Lutke	9b3a209a97	Fix GRPO training: apply chat template to prompts The SFT model was trained with chat template format but GRPO was passing raw prompts. Now prompts are formatted with tokenizer.apply_chat_template() so the model sees the same format it learned during SFT. Also update extract_query_from_prompt to strip chat template artifacts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 17:21:22 -05:00
Tobi Lutke	3ea85eff50	Make TUI model list dynamic from HuggingFace Hub - Fetch available qmd-query-expansion models from tobil/ on Hub - Auto-detect model size (0.6B, 1.7B, 4B) and use correct base model - Group models by type (SFT vs GRPO) in menu - Skip GGUF repos in model listing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 17:17:40 -05:00
Tobi Lutke	891f3262cf	Fix GRPO reward function to handle think blocks and end tokens - Strip <\|im_end\|> token from completions (model output includes it) - Change think_penalty to skipped_think bonus (+20 for not using think) - Adjust max_possible to account for bonus (120/140) - Fix typo in chat template artifact check Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 16:32:13 -05:00
Tobi Lutke	66bb8ed963	Remove beads reference from CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 16:24:14 -05:00
Tobi Lutke	2267986302	Remove beads issue tracking Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 16:23:53 -05:00
Tobi Lutke	8a1c4cdab0	Add 1.7B and 4B GRPO training and GGUF conversion scripts Training scripts for GRPO fine-tuning: - train_1.7B_grpo.py: GRPO training for Qwen3-1.7B - train_4B_grpo.py: GRPO training for Qwen3-4B GGUF conversion scripts: - convert_1.7B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF - convert_4B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF All scripts use PEP 723 inline dependencies for HuggingFace Jobs. Models published: - tobil/qmd-query-expansion-1.7B-sft - tobil/qmd-query-expansion-1.7B-grpo - tobil/qmd-query-expansion-1.7B-gguf - tobil/qmd-query-expansion-4B-sft - tobil/qmd-query-expansion-4B-grpo - tobil/qmd-query-expansion-4B-gguf Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 11:35:27 -05:00
Tobi Lutke	b9b1b39a76	Update README with separate model repos Changed from subfolder approach to separate repos per model since trainer.push_to_hub() doesn't support subfolder argument. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 08:13:30 -05:00
Tobi Lutke	312c281109	Update README for unified model repository structure All models (0.6B, 1.7B, 4B) with SFT and GRPO variants now go into a single HuggingFace repo (tobil/qmd-query-expansion) with subfolders for each size and training method. Updated loading examples to show subfolder-based model loading. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 01:00:17 -05:00
Tobi Lutke	2648512b7c	Fix TUI to load GRPO models with SFT base first GRPO adapters were trained on merged SFT weights, so they need SFT loaded and merged first before applying the GRPO adapter. Updated MODELS config to include sft_base path for GRPO models, and load_model() now handles the SFT -> merge -> GRPO flow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 00:47:59 -05:00
Tobi Lutke	f96766cce8	Fix GRPO model loading to use SFT base first The GRPO adapter was trained on merged SFT weights, so loading it directly on the base model results in 0% score. Added --sft-model parameter to evals/run.py to load SFT first, then apply GRPO adapter. With correct loading: GRPO scores 89.7% (all 26 queries Excellent). Updated README with correct GRPO score and loading instructions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 00:46:07 -05:00

1 2 3 4

182 Commits