173 Commits

Author SHA1 Message Date
Matt Galligan
63028fd5e9
feat: add Claude Code plugin support with inline status check (#99)
- Add marketplace.json for Claude Code plugin installation
- Simplify skill status check to inline `qmd status` (portable across agents)
- Update SKILL.md MCP section, reference mcp-setup.md for manual config
- Clean up mcp-setup.md (remove redundant prerequisites)
- Rename MCP-SETUP.md to mcp-setup.md

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 14:14:24 -05:00
David Gil
47b705409e
fix: BM25 score normalization - use Math.abs instead of Math.max (#76)
BM25 scores in SQLite FTS5 are negative (lower = better match).
The previous code used Math.max(0, score) which clamped all negative
scores to 0, resulting in all results showing 100% (score = 1.0).

Fix: Use Math.abs(score) to properly convert negative BM25 scores
to positive values for the normalization formula.

Before: All results show Score: 100%
After:  Scores vary based on actual BM25 relevance (e.g., 16%, 5%, 6%)

Fixes #74
2026-02-01 16:38:52 -05:00
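The clamping behaviour described in 47b705409e is easy to reproduce. The sketch below is in Python rather than the project's TypeScript, and it assumes a normalization of the form 1 / (1 + x). That shape is consistent with "clamped to 0 gives 1.0" but is not quoted from the source; the raw scores are made up for illustration.

```python
def normalize_clamped(bm25: float) -> float:
    """Old behaviour: negative scores are clamped to 0 before normalizing."""
    return 1.0 / (1.0 + max(0.0, bm25))


def normalize_abs(bm25: float) -> float:
    """Fixed behaviour: the magnitude of the negative score is used."""
    return 1.0 / (1.0 + abs(bm25))


# Hypothetical raw FTS5 bm25() values; FTS5 reports them as negative numbers.
raw_scores = [-5.25, -19.0, -15.5]

clamped = [normalize_clamped(s) for s in raw_scores]  # every entry collapses to 1.0
fixed = [normalize_abs(s) for s in raw_scores]        # entries now differ per result
```

With the clamp, every negative input becomes 0 and the formula returns the same value for all rows, which is why every result displayed 100%.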
Christopher Stöckl
0f87e2429d
fix: workaround Bun UTF-8 path corruption bug (#82)
Replace Bun.file() async calls with Node.js fs sync methods to work
around a Bun bug that corrupts UTF-8 file paths containing non-ASCII
characters.

Bug: Bun.file(filepath).stat() and Bun.file(filepath).text() internally
mangle UTF-8 encoding, causing ENOENT errors with mojibake paths when
accessing files in iCloud Drive and other locations.

Changes:
- src/qmd.ts: Use readFileSync instead of Bun.file().text()
- src/qmd.ts: Use statSync instead of Bun.file().stat() for file metadata
- src/store.ts: Use statSync for SQLite custom path detection
2026-02-01 16:37:04 -05:00
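Commit 0f87e2429d does not say how exactly Bun mangles the bytes. The Python sketch below only illustrates the general class of failure it names (mojibake), using one common cause as an assumption: UTF-8 bytes reinterpreted under a single-byte codec. The path is hypothetical.

```python
path = "Stöckl/notes.md"  # hypothetical path containing a non-ASCII character

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1).
# The two bytes of "ö" turn into two separate characters.
mangled = path.encode("utf-8").decode("latin-1")

# The mangled name no longer matches the name on disk, so a lookup by this
# string would fail with a "file not found" style error.
names_match = mangled == path
```

Because the damage here is a pure re-decoding, it can be reversed by encoding as Latin-1 and decoding as UTF-8 again; a real corruption bug may not be reversible in this way.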
Matthías Páll Gissurarson
5de063ae96
Fix: Add missing --index option to argument parser (#84)
* Fix: Add missing --index option to argument parser

The --index flag was documented and used in code but not defined
in parseArgs options, causing it to be ignored. Now properly handles
custom index names like: qmd --index test status

* Feature: Use index name for config files too

Now --index <name> loads ~/.config/qmd/<name>.yml instead of index.yml.
This allows completely separate indexes with their own collections.

Example:
  qmd --index hackage status
  → Uses ~/.config/qmd/hackage.yml + ~/.cache/qmd/hackage.sqlite

Moved hackage collection to hackage.yml for separation.
2026-02-01 16:36:51 -05:00
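The path mapping described in 5de063ae96 can be sketched as follows. This is a Python illustration, not the project's code: the function name `index_paths` and the `home` parameter are hypothetical, and the default cache file name `index.sqlite` is an assumption (the commit only names `index.yml` as the default config).

```python
from pathlib import Path
from typing import Optional, Tuple


def index_paths(index_name: str = "index", home: Optional[Path] = None) -> Tuple[Path, Path]:
    """Return (config file, SQLite cache file) for a named index."""
    base = home if home is not None else Path.home()
    config = base / ".config" / "qmd" / f"{index_name}.yml"
    cache = base / ".cache" / "qmd" / f"{index_name}.sqlite"
    return config, cache


# Mirrors the example in the commit message: qmd --index hackage status
config_file, cache_file = index_paths("hackage", home=Path("/home/user"))
```

Deriving both paths from the same name is what keeps each index fully separate, with its own collections and its own database.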
Tobi Lütke
102ff861d3
fix: use Qwen3 recommended sampling params to prevent repetition loops
- Changed temperature from 0/0.1 to 0.7 (Qwen3 non-thinking mode default)
- Added topK=20, topP=0.8 per Qwen3 docs
- Added repeatPenalty with presencePenalty=0.5 for query expansion
- Fixes infinite loop on acronyms like DHH, BFCM

Qwen3 docs explicitly warn: 'DO NOT use greedy decoding, as it can
lead to performance degradation and endless repetitions'
2026-02-01 03:24:20 +00:00
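To show what the parameters in 102ff861d3 control, here is a minimal Python sketch of temperature, top-k and top-p filtering over a toy distribution. It is not the sampler used by the project or by any particular inference library (libraries differ in the order they apply these filters), and it leaves out the repeat and presence penalties. The function names and the toy logits are hypothetical.

```python
import math
import random
from typing import Dict


def filter_distribution(
    logits: Dict[str, float],
    temperature: float = 0.7,
    top_k: int = 20,
    top_p: float = 0.8,
) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p, and renormalize."""
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
filtered = filter_distribution(toy_logits)
token = sample(filtered, random.Random(0))
```

Greedy decoding always picks the single most likely token, so a model that has started repeating itself has no way out; sampling from a filtered distribution leaves a small chance of choosing a different continuation.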
Tobi Lütke
479b68bbf1
add qmd model pull and refresh logic
2026-01-31 23:02:23 +00:00
Tobi Lütke
bf1b8fc90a
lots of training stuff
2026-01-31 23:02:23 +00:00
Tobi Lutke
17c201ea81
fix: correct QMD acronym to Query Markup Documents
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 13:22:54 -05:00
Tobi Lutke
739038e1a7
docs: add explicit HuggingFace repo destinations
- List all HuggingFace repos in CLAUDE.md (model, gguf, sft, grpo, train)
- Update jobs scripts to use tobil/qmd-query-expansion-train (no -v2)
- Clarify rules: no versioned repos, update in place

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 12:26:02 -05:00
Tobi Lutke
38073799c0
chore: clean up finetune folder and fix training workflow
- Remove versioned files (sft_v4.yaml, prepare_v4_dataset.py, train_v2/)
- Update configs to use local data/train/ directory
- Add glob pattern support to prepare_data.py and train.py
- Update .gitignore to properly ignore outputs/ and data/train*/
- Document data preparation step in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 12:21:09 -05:00
Tobi Lutke
533f0eed37
docs: add finetune CLAUDE.md and update training workflow
- Add finetune/CLAUDE.md documenting the training pipeline
- Update configs to output to local outputs/ directory (gitignored)
- Document that all data/*.jsonl files are training data
- Document local CUDA training vs HuggingFace Jobs cloud training
- Enforce eval requirement before any model upload
- Single model repo (no -v1, -v2, -v4 versioning)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 12:15:56 -05:00
Tobi Lutke
7de18ee066
Merge main into finetune
Brings in:
- /only: variants for single-type expansions
- LLM session management for lifecycle safety
- skills.sh integration for AI agent discovery
- Various bug fixes for vector search and embeddings

Merge conflicts resolved by keeping hyde-first format ordering
from finetune branch while accepting expanded templates and
new features from main.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 12:10:22 -05:00
Tobi Lutke
785620467a
refactor: reorder output format to put hyde line first
Move the hyde (hypothetical document) line to the beginning of the
output format, before lex and vec lines. This better reflects the
logical flow where the hypothetical document is generated first and
then informs the keyword/semantic expansions.

Also adds auto-download of eval_common.py in training scripts for
standalone HuggingFace Jobs execution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 12:09:04 -05:00
Tobi Lütke
8cc7d8c138
Add sampled /only: variants (399) for training balance
2026-01-31 16:29:02 +00:00
Tobi Lütke
20aef8a3e9
Change format to /only:lex (slash prefix)
2026-01-31 16:24:18 +00:00
Tobi Lütke
46ff098361
Change only: format to only:lex (no space after colon)
2026-01-31 16:23:28 +00:00
Tobi Lütke
806a0cfc14
Add 'only:' mode support for single-type expansions
- generate_only_variants.py: Creates training data where queries end with
  'only: lex', 'only: vec', or 'only: hyde' and output contains ONLY that type
- reward.py: Updated scorer to handle 'only:' mode separately
  - Penalizes presence of unwanted types
  - Type-specific quality checks
  - Filters templated low-quality hyde outputs
- 4,444 high-quality 'only:' variants from v2 + handcrafted data
2026-01-31 16:15:59 +00:00
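A sketch of how a query suffix could be split off and how an output could be checked against the requested type. It uses the slash-prefixed `/only:lex` form that the log shows as the latest spelling, not the original `only: lex` form from 806a0cfc14. The function names and the regular expression are hypothetical; this is not the logic in `reward.py`.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "/only:lex", "/only:vec" or "/only:hyde".
ONLY_SUFFIX = re.compile(r"\s*/only:(lex|vec|hyde)\s*$")


def split_only_mode(query: str) -> Tuple[str, Optional[str]]:
    """Return (query without suffix, requested type or None)."""
    match = ONLY_SUFFIX.search(query)
    if match is None:
        return query.strip(), None
    return query[: match.start()].strip(), match.group(1)


def violates_only_mode(output: str, only: str) -> bool:
    """True if any non-empty output line uses a type other than `only`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return any(not line.startswith(f"{only}:") for line in lines)


query, mode = split_only_mode("react hooks /only:lex")
```

A scorer built on this could penalize any output for which `violates_only_mode` returns `True`, which corresponds to the "penalizes presence of unwanted types" rule in the commit message.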
Tobi Lütke
32d313ad6b
Add LLM session management for lifecycle safety
Adds a session layer that prevents LLM contexts from being disposed
mid-operation during long-running tasks like batch embedding or
multi-step search workflows (expand → embed → rerank).

Key changes:
- Add LLMSessionManager with reference counting for active sessions
- Add LLMSession class for scoped access with automatic acquire/release
- Add withLLMSession() API for multi-step workflows
- Update idle timer to check canUnloadLLM() before disposing
- Wrap querySearch, vectorSearch, and embed command in sessions
- Add optional session parameter to searchVec and getEmbedding

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 15:20:20 +00:00
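The reference-counting idea in 32d313ad6b can be sketched in a few lines. The class name follows the commit message, but the body is a hypothetical Python illustration rather than the project's TypeScript implementation; `can_unload` stands in for `canUnloadLLM()` and the `session()` context manager for `withLLMSession()`.

```python
from contextlib import contextmanager


class LLMSessionManager:
    """Counts active sessions so an idle timer knows when unloading is safe."""

    def __init__(self) -> None:
        self._active = 0

    def acquire(self) -> None:
        self._active += 1

    def release(self) -> None:
        if self._active == 0:
            raise RuntimeError("release() without matching acquire()")
        self._active -= 1

    def can_unload(self) -> bool:
        # The idle timer would call this before disposing the LLM context.
        return self._active == 0

    @contextmanager
    def session(self):
        # Scoped access: acquire on entry, always release on exit.
        self.acquire()
        try:
            yield self
        finally:
            self.release()


manager = LLMSessionManager()
with manager.session():
    # A multi-step workflow (expand, embed, rerank) would run here.
    unloadable_while_busy = manager.can_unload()
unloadable_when_idle = manager.can_unload()
```

Releasing in a `finally` block means a failed step still drops its reference, so an error in one workflow cannot keep the model loaded forever.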
Christopher Jones
6d9871d2f5
Fix DisposedError during slow batch embedding (#41)
2026-01-29 18:28:48 -08:00
Algimantas Krasauskas
f6a987a642
Add skills.sh integration for AI agent discovery (#64)
2026-01-29 18:27:50 -08:00
Tobi Lutke
7b98d4d308
Link fine-tuned model to HuggingFace in README
Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
2026-01-28 23:37:18 -08:00
Tobi Lutke
5cf4958bfa
Add HuggingFace model card YAML metadata to finetune README
Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
2026-01-28 23:33:55 -08:00
Tobias Lütke
eb1b77c8cb
Deploy fine-tuned GRPO model as default query expansion (#67)
* Add query expansion model finetuning infrastructure

- Training scripts for Qwen3-0.6B and 1.7B models
- Dataset generation from s-emanuilov/query-expansion
- Evaluation scripts comparing finetuned vs baseline models
- GRPO RL training script (optional improvement)
- Export script for GGUF conversion

Results:
- 0.6B finetuned: 95% format compliance (lex/vec/hyde)
- Baseline: 0% format compliance
- Dataset: 5,157 examples on HuggingFace Hub

Models available at:
- tobil/qmd-query-expansion-0.6B (recommended)
- tobil/qmd-query-expansion-train (dataset)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO training script for TRL API compatibility

- Use max_completion_length instead of max_new_tokens
- Use processing_class instead of tokenizer
- Use args instead of config for GRPOTrainer
- Add __name__ attribute to reward function class
- Accept **kwargs in reward function for extra TRL args
- Add new LoRA adapter after merging SFT weights

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README with final evaluation results

- 0.6B SFT: 95% format compliance (best)
- 0.6B GRPO: 0% (catastrophic forgetting from RL)
- 1.7B v2: training completed, evaluation pending
- Added GRPO evaluation results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add comprehensive scoring system for query expansion

New scoring criteria (0-100 points):
- Format (30): Must have lex: and vec: prefixes
- Diversity (30): Multiple types, no echoing query, diverse expansions
- Hyde (20): Optional, concise, no newlines, no word repetition
- Quality (20): Lex=keywords, vec=natural language

See SCORING.md for full documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add HuggingFace login and comprehensive scoring to GRPO v2 training

- Add explicit HF_TOKEN login before training
- Use SCORING.md criteria as RL reward function
- Conservative training: LR 1e-6, LoRA rank 4
- Reward scores: good=0.94, bad=0.38

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor finetune folder: train/rl scripts with YAML configs

Major changes:
- train.py: Generic SFT training script using YAML config
- rl.py: Generic GRPO training script using YAML config
- configs/: YAML configs per training run (sft_v4.yaml, grpo_v4.yaml)
- dataset/: Data preparation scripts moved here
- tui.py: Interactive model testing interface

Training results:
- SFT v4: 98.8% avg score (all Excellent)
- GRPO v4: 0% (failed - model drifted to verbose explanations)

Removed per-model scripts (train_0.6B.py, train_1.7B.py, etc)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add named entity extraction to GRPO reward function

Key changes:
- Extract named entities (acronyms, proper nouns, technical terms)
- Heavy penalty (-30) when lex queries miss named entities
- Penalty (-15) for generic filler phrases like "find information about"
- Compound entity detection (TDS motorsports -> both words)
- Update GRPO config with KL regularization (beta=0.04)
- Lower learning rate (5e-7) and add max_steps (200)

Test results:
- "who is TDS motorsports" good: 1.00, bad: 0.30 (was 0.75)
- "how to use React hooks" good: 0.87, bad: 0.45 (was 0.75)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add chat template leakage detection to reward function

Zero reward for outputs containing:
- <|im_start|>, <|im_end|> tokens
- <think>, </think> tags (Qwen3 thinking mode)
- Role markers like \nassistant\n, \nuser\n
- <|endoftext|> token

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Strict format validation: every line must be lex:/vec:/hyde:

Any line that doesn't start with a valid prefix now returns 0.0
instead of just counting as a penalty. This prevents any prose,
explanations, bullet points, or other invalid content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Clean up evaluation files

- Remove old versioned evaluation files (0.6B, 1.7B, baseline)
- Rename evaluation_v4.json -> evaluation_sft.json
- Rename evaluation_v4_grpo.json -> evaluation_grpo_failed.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor evals into separate run and score scripts

New structure:
- evals/run.py: Generate model outputs to JSONL
- evals/score.py: Score outputs with detailed breakdown
- evals/queries.txt: Test queries (26 total)

Features:
- Supports both HF Hub and local model paths
- Named entity preservation scoring
- Chat template leakage detection
- Strict format validation (every line must be lex:/vec:/hyde:)
- Generic phrase detection

Usage:
  uv run evals/run.py --model tobil/qmd-query-expansion-0.6B-v4
  uv run evals/score.py evals/results_*.jsonl

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO model loading to use SFT base first

The GRPO adapter was trained on merged SFT weights, so loading it
directly on the base model results in 0% score. Added --sft-model
parameter to evals/run.py to load SFT first, then apply GRPO adapter.

With correct loading: GRPO scores 89.7% (all 26 queries Excellent).

Updated README with correct GRPO score and loading instructions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix TUI to load GRPO models with SFT base first

GRPO adapters were trained on merged SFT weights, so they need SFT
loaded and merged first before applying the GRPO adapter.

Updated MODELS config to include sft_base path for GRPO models,
and load_model() now handles the SFT -> merge -> GRPO flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README for unified model repository structure

All models (0.6B, 1.7B, 4B) with SFT and GRPO variants now go into
a single HuggingFace repo (tobil/qmd-query-expansion) with subfolders
for each size and training method.

Updated loading examples to show subfolder-based model loading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update README with separate model repos

Changed from subfolder approach to separate repos per model since
trainer.push_to_hub() doesn't support subfolder argument.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add 1.7B and 4B GRPO training and GGUF conversion scripts

Training scripts for GRPO fine-tuning:
- train_1.7B_grpo.py: GRPO training for Qwen3-1.7B
- train_4B_grpo.py: GRPO training for Qwen3-4B

GGUF conversion scripts:
- convert_1.7B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF
- convert_4B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF

All scripts use PEP 723 inline dependencies for HuggingFace Jobs.

Models published:
- tobil/qmd-query-expansion-1.7B-sft
- tobil/qmd-query-expansion-1.7B-grpo
- tobil/qmd-query-expansion-1.7B-gguf
- tobil/qmd-query-expansion-4B-sft
- tobil/qmd-query-expansion-4B-grpo
- tobil/qmd-query-expansion-4B-gguf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove beads issue tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove beads reference from CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO reward function to handle think blocks and end tokens

- Strip <|im_end|> token from completions (model output includes it)
- Change think_penalty to skipped_think bonus (+20 for not using think)
- Adjust max_possible to account for bonus (120/140)
- Fix typo in chat template artifact check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Make TUI model list dynamic from HuggingFace Hub

- Fetch available qmd-query-expansion models from tobil/ on Hub
- Auto-detect model size (0.6B, 1.7B, 4B) and use correct base model
- Group models by type (SFT vs GRPO) in menu
- Skip GGUF repos in model listing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix GRPO training: apply chat template to prompts

The SFT model was trained with chat template format but GRPO was
passing raw prompts. Now prompts are formatted with tokenizer.apply_chat_template()
so the model sees the same format it learned during SFT.

Also update extract_query_from_prompt to strip chat template artifacts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Finetune 2.0: consolidate and simplify the entire training pipeline

Consolidate ~2,800 lines of duplicated code across 12 files into 5 clean,
well-documented files targeting Qwen3-1.7B end-to-end.

Key changes:
- Extract reward function into single source of truth (reward.py)
  Previously duplicated 3x with divergent bugs across rl.py,
  train_1.7B_grpo.py, and train_4B_grpo.py
- Unify training into one script with sft/grpo subcommands (train.py)
  Replaces train.py + rl.py + train_1.7B_grpo.py + train_4B_grpo.py
- Merge eval generate+score into single eval.py
  Replaces evals/run.py + evals/score.py
- Parameterize GGUF conversion by --size (convert_gguf.py)
  Replaces convert_1.7B_gguf.py + convert_4B_gguf.py
- Fix critical bug: rl.py silently ignored beta/temperature from config,
  causing the exact catastrophic drift its own comments warned about
- Fix prompt consistency: all files use /no_think chat template format
- Retarget configs from 0.6B to 1.7B
- Comprehensive README documenting the full pipeline

Removed: rl.py, train_1.7B_grpo.py, train_4B_grpo.py, convert_1.7B_gguf.py,
convert_4B_gguf.py, tui.py, evals/run.py, evals/score.py

Net: -3,429 lines, +382 lines

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

* Add HF Jobs scripts, temporal query examples, and training results

- jobs/sft.py and jobs/grpo.py: self-contained scripts for
  `hf jobs uv run` (no local GPU needed)
- 12 temporal/recency query examples in training data (e.g. "recent
  news about Shopify" -> lex with years 2025/2026)
- 4 temporal test queries in evals/queries.txt
- README updated with HF Jobs workflow, training results, and
  updated file structure
- Remove .beads tracking

SFT and GRPO successfully trained on A10G via HF Jobs:
  SFT: eval loss 0.321, token accuracy 92.4%
  GRPO: mean reward 0.757, 200 steps, KL 0.00048

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

* Deploy fine-tuned GRPO model as default for query expansion

Switch from generic Qwen3-1.7B-Q8_0 (~2.2GB) to fine-tuned
qmd-query-expansion-1.7B-q4_k_m (~1.1GB). The fine-tuned Q4
scores 91.7% avg with 30/30 Excellent, outperforming the base Q8.

- Update default generate model in src/llm.ts
- Update README model table, architecture diagram, config block
- Add v2 training data, eval scripts, and quantize job
- Remove superseded v1 training data (5,742 → 1,000 examples)
- Update finetune README with v2 results and file structure

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 23:25:17 -08:00
Tobi Lutke
8572c2fd94
Deploy fine-tuned GRPO model as default for query expansion
Switch from generic Qwen3-1.7B-Q8_0 (~2.2GB) to fine-tuned
qmd-query-expansion-1.7B-q4_k_m (~1.1GB). The fine-tuned Q4
scores 91.7% avg with 30/30 Excellent, outperforming the base Q8.

- Update default generate model in src/llm.ts
- Update README model table, architecture diagram, config block
- Add v2 training data, eval scripts, and quantize job
- Remove superseded v1 training data (5,742 → 1,000 examples)
- Update finetune README with v2 results and file structure

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
2026-01-28 23:24:58 -08:00
Tobi Lutke
5ab78d00a2
Add HF Jobs scripts, temporal query examples, and training results
- jobs/sft.py and jobs/grpo.py: self-contained scripts for
  `hf jobs uv run` (no local GPU needed)
- 12 temporal/recency query examples in training data (e.g. "recent
  news about Shopify" -> lex with years 2025/2026)
- 4 temporal test queries in evals/queries.txt
- README updated with HF Jobs workflow, training results, and
  updated file structure
- Remove .beads tracking

SFT and GRPO successfully trained on A10G via HF Jobs:
  SFT: eval loss 0.321, token accuracy 92.4%
  GRPO: mean reward 0.757, 200 steps, KL 0.00048

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
2026-01-28 15:46:44 -08:00
Tobi Lutke
354744af53
Finetune 2.0: consolidate and simplify the entire training pipeline
Consolidate ~2,800 lines of duplicated code across 12 files into 5 clean,
well-documented files targeting Qwen3-1.7B end-to-end.

Key changes:
- Extract reward function into single source of truth (reward.py)
  Previously duplicated 3x with divergent bugs across rl.py,
  train_1.7B_grpo.py, and train_4B_grpo.py
- Unify training into one script with sft/grpo subcommands (train.py)
  Replaces train.py + rl.py + train_1.7B_grpo.py + train_4B_grpo.py
- Merge eval generate+score into single eval.py
  Replaces evals/run.py + evals/score.py
- Parameterize GGUF conversion by --size (convert_gguf.py)
  Replaces convert_1.7B_gguf.py + convert_4B_gguf.py
- Fix critical bug: rl.py silently ignored beta/temperature from config,
  causing the exact catastrophic drift its own comments warned about
- Fix prompt consistency: all files use /no_think chat template format
- Retarget configs from 0.6B to 1.7B
- Comprehensive README documenting the full pipeline

Removed: rl.py, train_1.7B_grpo.py, train_4B_grpo.py, convert_1.7B_gguf.py,
convert_4B_gguf.py, tui.py, evals/run.py, evals/score.py

Net: -3,429 lines, +382 lines

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
2026-01-28 14:00:36 -08:00
jdvmi00
64c6e6c2e3
fix: rename collectionId to collectionName in searchVec for proper filtering (#61)
2026-01-27 22:03:02 -08:00
Freeman Jiang
bfb0eebc3e
fix: use sequential embedding on CPU-only systems to avoid race condition (#54)
* fix: add promise guard to ensureEmbedContext to prevent race condition

Root cause: ensureEmbedContext() was not thread-safe. When multiple parallel
embedding requests called ensureEmbedContext() simultaneously, all would see
embedContext === null and start creating new contexts. This race condition
caused 'Context is disposed' errors as contexts were overwritten/orphaned.

The fix adds a promise guard (embedContextCreatePromise) to ensure only one
context creation runs at a time - identical to the pattern already used in
ensureGenerateModel().

Changes:
- Add embedContextCreatePromise field to track in-progress context creation
- Modify ensureEmbedContext() to wait for existing creation if in progress
- Update test comment and timeout for CPU-only systems

Testing:
- Fresh model download + qmd embed: 28/28 chunks succeeded (was 14/27)
- All embedBatch tests pass
- No warmup hack needed - full parallel performance from the start

Environment tested:
- Ubuntu 24.04 LTS (x64), Bun 1.3.6, node-llama-cpp 3.14.5, no GPU

* test: improve race condition test to verify single context creation

The previous test only verified embeddings succeeded but didn't prove the fix
actually prevents multiple context creation. This improved test:

- Instruments createEmbeddingContext to count invocations
- Runs 5 concurrent embedBatch calls on a fresh LlamaCpp instance
- Asserts exactly 1 context is created (fails with 5 without the fix)

Verified locally:
- With fix: 1 context created (PASS)
- Without fix: 5 contexts created (FAIL)

* chore: clear embedContextCreatePromise in dispose() for consistency
2026-01-27 22:02:36 -08:00
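The promise-guard pattern from bfb0eebc3e, together with the "count the creations" test idea, can be sketched with Python's asyncio. This is an analogue of the technique, not the project's code: the `Embedder` class, its fields and the artificial delay are all hypothetical.

```python
import asyncio


class Embedder:
    def __init__(self) -> None:
        self.context = None
        self._create_task = None  # in-progress creation, shared by all callers
        self.created = 0          # instrumentation: number of contexts built

    async def _create_context(self):
        await asyncio.sleep(0.01)  # stand-in for slow context setup
        self.created += 1
        return object()

    async def ensure_context(self):
        if self.context is not None:
            return self.context
        # Only the first caller starts creation; later callers await the same task.
        if self._create_task is None:
            self._create_task = asyncio.create_task(self._create_context())
        try:
            self.context = await self._create_task
        finally:
            self._create_task = None  # allow a retry if creation failed
        return self.context


async def main():
    embedder = Embedder()
    contexts = await asyncio.gather(*(embedder.ensure_context() for _ in range(5)))
    return embedder.created, contexts


created, contexts = asyncio.run(main())
```

Without the shared task, each of the five concurrent callers would see `context is None` and build its own context, which is the race the commit describes.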
Copilot
053252ca24
Add Windows path utilities with cross-platform test coverage (#51)
* Initial plan

* Add Windows path utility functions and comprehensive tests

Co-authored-by: tobi <347+tobi@users.noreply.github.com>

* Add clarifying comments for Git Bash path detection logic

Co-authored-by: tobi <347+tobi@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tobi <347+tobi@users.noreply.github.com>
2026-01-27 09:05:47 -05:00
sh54
ba7391832d
Add org-mode title extraction support (#50)
Refactor extractTitle to use extension-based extractors:
- .md: preserves original markdown logic (Notes skip behavior)
- .org: extracts from #+TITLE: property or first * heading

Extensions are lowercased for case-insensitive matching.
Easy to add more file types in the future.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 11:38:49 -05:00
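The extension-based dispatch described in ba7391832d might look like the following Python sketch. The regular expressions are simplified assumptions: the Markdown extractor just takes the first heading and does not reproduce the "Notes skip behavior" mentioned in the commit, and the org extractor only considers top-level `*` headings. All function names are hypothetical.

```python
import os
import re
from typing import Callable, Dict, Optional


def title_from_markdown(text: str) -> Optional[str]:
    # Simplified: first ATX heading of any level.
    match = re.search(r"^#{1,6}[ \t]+(.+?)[ \t]*$", text, re.MULTILINE)
    return match.group(1) if match else None


def title_from_org(text: str) -> Optional[str]:
    # Prefer the #+TITLE: property, then fall back to the first "* " heading.
    match = re.search(r"^#\+TITLE:[ \t]*(.+?)[ \t]*$", text, re.MULTILINE | re.IGNORECASE)
    if match:
        return match.group(1)
    match = re.search(r"^\*[ \t]+(.+?)[ \t]*$", text, re.MULTILINE)
    return match.group(1) if match else None


# Adding a file type means adding one entry here.
EXTRACTORS: Dict[str, Callable[[str], Optional[str]]] = {
    ".md": title_from_markdown,
    ".org": title_from_org,
}


def extract_title(filename: str, text: str) -> Optional[str]:
    # Lowercase the extension so ".ORG" and ".org" behave the same.
    extension = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(extension)
    return extractor(text) if extractor else None


org_title = extract_title("notes.ORG", "#+TITLE: My Notes\n* First heading\n")
```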
sh54
65c0f89560
Enable SQLite extension loading in devshell (#48)
Override sqlite in devShell to enable extension loading for sqlite-vec
support when running tests. Only sets BREW_PREFIX if not already defined
to avoid overriding user's existing setup.

Package build remains unchanged.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 11:37:03 -05:00
Tobi Lutke
9b3a209a97
Fix GRPO training: apply chat template to prompts
The SFT model was trained with chat template format but GRPO was
passing raw prompts. Now prompts are formatted with tokenizer.apply_chat_template()
so the model sees the same format it learned during SFT.

Also update extract_query_from_prompt to strip chat template artifacts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 17:21:22 -05:00
Tobi Lutke
3ea85eff50
Make TUI model list dynamic from HuggingFace Hub
- Fetch available qmd-query-expansion models from tobil/ on Hub
- Auto-detect model size (0.6B, 1.7B, 4B) and use correct base model
- Group models by type (SFT vs GRPO) in menu
- Skip GGUF repos in model listing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 17:17:40 -05:00
Tobi Lutke
891f3262cf
Fix GRPO reward function to handle think blocks and end tokens
- Strip <|im_end|> token from completions (model output includes it)
- Change think_penalty to skipped_think bonus (+20 for not using think)
- Adjust max_possible to account for bonus (120/140)
- Fix typo in chat template artifact check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 16:32:13 -05:00
Tobi Lutke
66bb8ed963
Remove beads reference from CLAUDE.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 16:24:14 -05:00
Tobi Lutke
2267986302
Remove beads issue tracking
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 16:23:53 -05:00
Tobi Lutke
8a1c4cdab0
Add 1.7B and 4B GRPO training and GGUF conversion scripts
Training scripts for GRPO fine-tuning:
- train_1.7B_grpo.py: GRPO training for Qwen3-1.7B
- train_4B_grpo.py: GRPO training for Qwen3-4B

GGUF conversion scripts:
- convert_1.7B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF
- convert_4B_gguf.py: Merge SFT+GRPO adapters and convert to GGUF

All scripts use PEP 723 inline dependencies for HuggingFace Jobs.

Models published:
- tobil/qmd-query-expansion-1.7B-sft
- tobil/qmd-query-expansion-1.7B-grpo
- tobil/qmd-query-expansion-1.7B-gguf
- tobil/qmd-query-expansion-4B-sft
- tobil/qmd-query-expansion-4B-grpo
- tobil/qmd-query-expansion-4B-gguf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 11:35:27 -05:00
Tobi Lutke
b9b1b39a76
Update README with separate model repos
Changed from subfolder approach to separate repos per model since
trainer.push_to_hub() doesn't support subfolder argument.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 08:13:30 -05:00
Tobi Lutke
312c281109
Update README for unified model repository structure
All models (0.6B, 1.7B, 4B) with SFT and GRPO variants now go into
a single HuggingFace repo (tobil/qmd-query-expansion) with subfolders
for each size and training method.

Updated loading examples to show subfolder-based model loading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 01:00:17 -05:00
Tobi Lutke
2648512b7c
Fix TUI to load GRPO models with SFT base first
GRPO adapters were trained on merged SFT weights, so they need SFT
loaded and merged first before applying the GRPO adapter.

Updated MODELS config to include sft_base path for GRPO models,
and load_model() now handles the SFT -> merge -> GRPO flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:47:59 -05:00
Tobi Lutke
f96766cce8
Fix GRPO model loading to use SFT base first
The GRPO adapter was trained on merged SFT weights, so loading it
directly on the base model results in 0% score. Added --sft-model
parameter to evals/run.py to load SFT first, then apply GRPO adapter.

With correct loading: GRPO scores 89.7% (all 26 queries Excellent).

Updated README with correct GRPO score and loading instructions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:46:07 -05:00
Tobi Lutke
f6a6716c44
Refactor evals into separate run and score scripts
New structure:
- evals/run.py: Generate model outputs to JSONL
- evals/score.py: Score outputs with detailed breakdown
- evals/queries.txt: Test queries (26 total)

Features:
- Supports both HF Hub and local model paths
- Named entity preservation scoring
- Chat template leakage detection
- Strict format validation (every line must be lex:/vec:/hyde:)
- Generic phrase detection

Usage:
  uv run evals/run.py --model tobil/qmd-query-expansion-0.6B-v4
  uv run evals/score.py evals/results_*.jsonl

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:40:33 -05:00
Tobi Lutke
857a85ab58
Clean up evaluation files
- Remove old versioned evaluation files (0.6B, 1.7B, baseline)
- Rename evaluation_v4.json -> evaluation_sft.json
- Rename evaluation_v4_grpo.json -> evaluation_grpo_failed.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:20:39 -05:00
Tobi Lutke
dc8f5a2335
Strict format validation: every line must be lex:/vec:/hyde:
Any line that doesn't start with a valid prefix now returns 0.0
instead of just counting as a penalty. This prevents any prose,
explanations, bullet points, or other invalid content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:08:08 -05:00
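The all-or-nothing rule from dc8f5a2335 can be sketched as a gate that runs before any finer scoring. This is a hypothetical Python illustration, not the project's reward function; ignoring blank lines and returning 1.0 for "passes the gate" are assumptions.

```python
VALID_PREFIXES = ("lex:", "vec:", "hyde:")


def format_gate(output: str) -> float:
    """Return 0.0 if any non-empty line lacks a valid prefix, else 1.0."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return 0.0
    if any(not line.startswith(VALID_PREFIXES) for line in lines):
        return 0.0
    return 1.0


good = format_gate("lex: react hooks\nvec: how do react hooks manage state")
bad = format_gate("Here are some expansions:\nlex: react hooks")
```

Returning a hard zero instead of subtracting a penalty removes any partial credit for outputs that mix valid lines with prose or bullet points.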
Tobi Lutke
2ad507a86e
Add chat template leakage detection to reward function
Zero reward for outputs containing:
- <|im_start|>, <|im_end|> tokens
- <think>, </think> tags (Qwen3 thinking mode)
- Role markers like \nassistant\n, \nuser\n
- <|endoftext|> token

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:07:12 -05:00
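The leakage check from 2ad507a86e amounts to a substring scan over the completion. The marker list below is taken from the commit message; the function names and the way the check wraps a base reward are hypothetical.

```python
LEAK_MARKERS = (
    "<|im_start|>",
    "<|im_end|>",
    "<think>",
    "</think>",
    "\nassistant\n",
    "\nuser\n",
    "<|endoftext|>",
)


def has_template_leak(completion: str) -> bool:
    """True if the completion contains any chat-template artifact."""
    return any(marker in completion for marker in LEAK_MARKERS)


def reward_with_leak_gate(completion: str, base_reward: float) -> float:
    # Leaked template tokens zero the reward regardless of other quality.
    return 0.0 if has_template_leak(completion) else base_reward


clean = reward_with_leak_gate("lex: react hooks", 0.9)
leaky = reward_with_leak_gate("<think>hmm</think>\nlex: react hooks", 0.9)
```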
Tobi Lutke
6062dc769f
Add named entity extraction to GRPO reward function
Key changes:
- Extract named entities (acronyms, proper nouns, technical terms)
- Heavy penalty (-30) when lex queries miss named entities
- Penalty (-15) for generic filler phrases like "find information about"
- Compound entity detection (TDS motorsports -> both words)
- Update GRPO config with KL regularization (beta=0.04)
- Lower learning rate (5e-7) and add max_steps (200)

Test results:
- "who is TDS motorsports" good: 1.00, bad: 0.30 (was 0.75)
- "how to use React hooks" good: 0.87, bad: 0.45 (was 0.75)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:05:40 -05:00
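A rough sketch of the entity and filler penalties from 6062dc769f, using the penalty values from the commit message. The entity heuristic here is a deliberately crude assumption (all-caps acronyms plus capitalized words that are not the first word); it does not attempt the compound-entity detection the commit mentions, and applying the -30 penalty once per output rather than once per entity is also an assumption. All names are hypothetical.

```python
import re
from typing import List

FILLER_PHRASES = ("find information about",)


def extract_entities(query: str) -> List[str]:
    """Crude heuristic: acronyms and mid-sentence capitalized words."""
    entities = []
    for position, word in enumerate(query.split()):
        token = word.strip(".,?!\"'")
        is_acronym = re.fullmatch(r"[A-Z]{2,}", token) is not None
        is_proper = position > 0 and token[:1].isupper()
        if is_acronym or is_proper:
            entities.append(token.lower())
    return entities


def entity_penalty(query: str, output: str) -> int:
    """Return 0, -15, -30 or -45 depending on which rules are violated."""
    lex_text = " ".join(
        line[len("lex:"):].lower()
        for line in output.splitlines()
        if line.startswith("lex:")
    )
    penalty = 0
    if any(entity not in lex_text for entity in extract_entities(query)):
        penalty -= 30  # a named entity is missing from the lex lines
    if any(phrase in output.lower() for phrase in FILLER_PHRASES):
        penalty -= 15  # generic filler phrase
    return penalty


kept = entity_penalty(
    "who is TDS motorsports",
    "lex: TDS motorsports team\nvec: who runs TDS motorsports",
)
dropped = entity_penalty(
    "who is TDS motorsports",
    "lex: racing team\nvec: find information about racing",
)
```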
Tobi Lutke
32706a720f
Refactor finetune folder: train/rl scripts with YAML configs
Major changes:
- train.py: Generic SFT training script using YAML config
- rl.py: Generic GRPO training script using YAML config
- configs/: YAML configs per training run (sft_v4.yaml, grpo_v4.yaml)
- dataset/: Data preparation scripts moved here
- tui.py: Interactive model testing interface

Training results:
- SFT v4: 98.8% avg score (all Excellent)
- GRPO v4: 0% (failed - model drifted to verbose explanations)

Removed per-model scripts (train_0.6B.py, train_1.7B.py, etc)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 20:26:46 -05:00
Tobi Lutke
d32e13c172
Add HuggingFace login and comprehensive scoring to GRPO v2 training
- Add explicit HF_TOKEN login before training
- Use SCORING.md criteria as RL reward function
- Conservative training: LR 1e-6, LoRA rank 4
- Reward scores: good=0.94, bad=0.38

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:37:49 -05:00
Tobi Lutke
c35dbd6cbd
Add comprehensive scoring system for query expansion
New scoring criteria (0-100 points):
- Format (30): Must have lex: and vec: prefixes
- Diversity (30): Multiple types, no echoing query, diverse expansions
- Hyde (20): Optional, concise, no newlines, no word repetition
- Quality (20): Lex=keywords, vec=natural language

See SCORING.md for full documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:00:55 -05:00
Tobi Lutke
994a094546
Update README with final evaluation results
- 0.6B SFT: 95% format compliance (best)
- 0.6B GRPO: 0% (catastrophic forgetting from RL)
- 1.7B v2: training completed, evaluation pending
- Added GRPO evaluation results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:45:48 -05:00