Tuning a 1B classifier
February 12, 2026

TL;DR — gemma3:1b, a one-billion-parameter language model, scored 50% on yes/no classification with vague conditions like “the file contains source code.” Parenthetical clauses, example lists, and negation each broke it in distinct ways. Verb-based conditions with concrete observable features — “the content is programming language source code with functions, classes, or imports” — reached 100%. Nothing changed but the words.
I built screen to answer one question: does a given condition apply to a given input? Screen sends each question to a local language model — gemma3:1b via ollama, default temperature, one billion parameters. The prompt is a condition in XML tags, an input in XML tags, and the instruction “Answer yes or no only.” All three classifiers share this single prompt template.
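A minimal sketch of that shared template (the function name and exact tag names are my assumptions; the post only specifies the three parts — condition, input, instruction):

```python
def build_prompt(condition: str, content: str) -> str:
    """Assemble the single shared prompt: a condition in XML tags,
    an input in XML tags, and the yes/no instruction."""
    return (
        f"<condition>{condition}</condition>\n"
        f"<input>{content}</input>\n"
        "Answer yes or no only."
    )


# The resulting string is what gets sent to gemma3:1b via ollama;
# all three classifiers differ only in the condition they pass in.
prompt = build_prompt(
    "the file contains source code",
    "def add(a, b):\n    return a + b",
)
```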
Three conditions. Given a file, is this source code? Given a file, is this test code? Given a user prompt, is this a naming decision? Each answer determines whether screen injects relevant guidelines into my context. The test harness runs 5 positive and 5 negative fixtures per condition, 3 trials each with majority voting. Target: 30/30.
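The harness itself reduces to two small functions. A sketch, assuming the classifier is exposed as a callable that returns a raw answer string (the names `ask`, `majority_vote`, and `run_suite` are mine, not screen's):

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(ask: Callable[[str], str], text: str, trials: int = 3) -> str:
    """Ask the model `trials` times and keep the most common answer."""
    answers = [ask(text).strip().lower() for _ in range(trials)]
    return Counter(answers).most_common(1)[0][0]


def run_suite(ask: Callable[[str], str],
              fixtures: List[Tuple[str, str]]) -> int:
    """Count fixtures whose majority answer matches the expected label.

    fixtures: (input_text, expected_answer) pairs, expected in {"yes", "no"}.
    """
    return sum(majority_vote(ask, text) == expected
               for text, expected in fixtures)
```

With 5 positive and 5 negative fixtures per condition across three conditions, a perfect run is `run_suite` summing to 30.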
Before tuning, the harness scored 19/30. Every positive fixture passed. Nearly every negative failed. The model said “yes” to anything technology-adjacent — meeting notes mentioning “vendor API” classified as source code, a deploy script classified as test code.
The first attempt targeted the fixtures. Replacing ambiguous negatives with non-technical content moved the score from 19 to 22. This was the wrong correction. The model still said “yes” to standup notes about software projects. The fixtures were not the problem. The conditions were.
What the model hears
The source code condition was “the file contains source code.” To gemma3:1b, that means “is this related to software.” Changing it to “the content is programming language source code with functions, classes, or imports” moved accuracy from 50% to 90%. Each added word — “programming language,” “functions,” “classes,” “imports” — is a concrete signal the model can verify. Vague topic association replaced with structural detection.
The test code condition was more resistant. “The file contains test code or test configuration” scored 60%. Tightening it to “the content is test code (unit tests, integration tests, or test framework configuration)” produced worse results — 1/5 positives. The parenthetical clause broke the model. It lost the main predicate and answered “no” to everything.
First finding: parenthetical clauses are poison for gemma3:1b.
Example lists failed next: “uses a test framework like minitest, rspec, pytest, or jest.” The model said “yes” to everything — a server file, a deploy script, a README. Listing examples caused it to pattern-match too loosely. Second finding: “like X, Y, Z” triggers over-association.
Negation failed as well: “the file is NOT related to testing” with inverted interpretation. All “no,” regardless of input. The model picks up the negation word and applies it uniformly. Third finding: negation biases the output toward “no.”
Nine trials
The path from 19/30 to 30/30 took nine trials. Trial 1 tuned fixtures; trial 2 tuned conditions. Trials 3 through 6 were dead ends — each explored a distinct strategy and each failed:
1. Better fixtures, same conditions. 22/30. Marginal — confirmed the conditions were the problem.
2. Tightened conditions with parenthetical clauses. Source code improved; test code collapsed to 1/5 positives.
3. Path-based detection (“file path starts with test/”). All yes, every input. The model cannot perform string matching on paths.
4. Negative framing. All no, every input.
5. Worked examples embedded in the prompt. 2/4 on hard cases, but requires per-classifier templates — defeats the shared template.
6. Classification format (“classify as test or not test”). All “test.” The repeated word primes the answer.
7. Condition variations without parentheses. Three candidates tested; one scored 5/5: “the file contains test methods that verify expected behavior.”
8. Stability check on the winner across 9 inputs, 3 trials each. 26/27 correct. The one miss: continuous integration config, which invokes tests but contains no test methods. Acceptable.
9. Full 30-fixture suite with all three tuned conditions. First run: 28/30. Two failures on tech-adjacent vocabulary in negative fixtures — a changelog mentioning “dashboard,” a prompt containing “function.” Replaced with unambiguous negatives. Second run: 30/30.
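The condition-variant sweep in the list above (three candidate phrasings, pick the one that scores best on the same fixtures) is just a loop. A sketch with hypothetical names — `classify(condition, text)` stands in for the shared prompt plus a gemma3:1b call:

```python
from typing import Callable, Dict, List, Tuple


def sweep(candidates: List[str],
          fixtures: List[Tuple[str, str]],
          classify: Callable[[str, str], str]) -> Tuple[str, Dict[str, int]]:
    """Score each candidate condition phrasing against the same fixtures.

    classify(condition, text) -> "yes" | "no"
    Returns the best-scoring condition and the full score table.
    """
    scores = {
        condition: sum(classify(condition, text) == expected
                       for text, expected in fixtures)
        for condition in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores
```

The design point is that the fixtures stay fixed while only the condition wording varies — otherwise a score change cannot be attributed to the words.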
Trial 8 is the most informative result. It tests edge cases the conditions were not tuned against and holds at 96%.
What worked
The final conditions:
| Classifier | Before | After |
|---|---|---|
| Source code | “the file contains source code” | “the content is programming language source code with functions, classes, or imports” |
| Test code | “the file contains test code or test configuration” | “the file contains test methods that verify expected behavior” |
| Naming | “involves naming a new repo, tool, or project” | “the user is choosing a name for a new repo, tool, or project” |
The pattern: verb-based conditions with concrete observable features. “Contains test methods that verify expected behavior” succeeds because every word maps to something the model can verify. “Test methods” maps to `def test_*` definitions and `it` blocks. “Verify” maps to `assert` and `expect` statements. “Expected behavior” maps to the purpose of those constructs. No parentheses. No examples. No negation.
The 30/30 score reflects both condition tuning and fixture selection. Trial 9’s first run scored 28/30 with tuned conditions against harder negatives. The final 30/30 required replacing ambiguous negatives with unambiguous ones. Both changes contributed. A changelog mentioning “beta program” and “dashboard” is ambiguous to a model that associates software vocabulary with source code. A lunch order is not.
Majority voting smooths single-trial noise but cannot correct systematic bias. If the condition is wrong, three trials confirm the error three times.
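That point follows from the arithmetic. Assuming independent trials with per-trial accuracy p, best-of-3 voting is correct with probability p³ + 3p²(1 − p) — which amplifies whichever side of 50% the classifier already sits on:

```python
def best_of_3(p: float) -> float:
    """Probability that the majority of 3 independent trials is correct,
    given per-trial accuracy p."""
    return p**3 + 3 * p**2 * (1 - p)


# Voting helps only above chance:
best_of_3(0.9)  # 0.972 — noise smoothed out
best_of_3(0.5)  # 0.5   — no effect at chance
best_of_3(0.4)  # 0.352 — a systematic error gets worse
```

A condition that biases every trial toward “yes” violates the independence assumption entirely; three trials then just confirm the same error three times.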
Limits
These findings derive from one model (gemma3:1b) at default generation parameters, on a 30-fixture evaluation set, for well-separated categories. Whether the three failure modes — parenthetical clauses, example lists, negation framing — transfer to other small models is untested. The categories classified here have clear structural markers. Finer distinctions — refactoring versus new feature, for instance — may not yield to vocabulary tuning alone.
What the nine trials demonstrate: for gemma3:1b on this task, the gap between 50% and 100% was not the model’s discrimination ability. It was mine — in choosing the words.