
GPT-OSS-20B Sampling & Prompting for Style Control


Benchmarking style control with local LLMs such as GPT-OSS-20B, we find that it is dominated by two variables: system prompt design and sampling regime. Short, minimal directives consistently outperform persona-heavy or formatting-dense prompts in both behavioral stability and throughput. Longer prompts increase prefill cost, but the larger performance impact in our runs came from changes in generation behavior (verbosity, constraint-checking, termination dynamics), which altered total token count and latency more than prefill alone[1]. Longer prompts can also degrade instruction reliability when constraints are buried mid-context (“lost in the middle”)[2]. Greedy decoding is not safer for style and often locks the model into high-probability templates. Moderate stochasticity (bounded temperature with constrained top-p/top-k and modest repeat control) produces more natural, less templated output while maintaining structural reliability.

Logit bias meaningfully suppresses specific lexical tics once the prompt and sampling regime are near the desired behavioral basin, but it does not repair structural compliance or reasoning brittleness.

The best benchmark results rarely coincide with the fastest configuration. See the full results (updated 2026-02-11).

[Benchmark scatterplot: throughput]

Style Control on Local LLMs

When deploying local instruct-tuned models (e.g., GPT-OSS-20B via llama.cpp), “style control” means shaping surface behavior without fine-tuning. Our objectives are pragmatic and operational: reduce boilerplate and corporate tone, suppress sycophancy and meta-commentary, maintain structural compliance when required, and maximize throughput under constrained hardware with minimal effort. The model must produce plain, direct, technically precise output while avoiding common LLM artifacts.

We assume no retraining and no external moderation layers. Available levers are limited to system prompt design, sampling parameters (temperature, top-k/top-p, min_p), repeat controls, logit bias, and stop/max_tokens bounding. The core challenge is that these controls interact.

The Three Primary Control Planes

Prompt framing sets the behavioral prior before decoding begins, sampling (entropy control) governs how probability mass is explored at each step, and logit bias makes localized adjustments to specific tokens (token-level steering). Understanding their hierarchy prevents over-optimizing secondary controls while neglecting first-order effects.

System Prompt - Prior Shaping

The system prompt shifts the model’s distribution before the first token is generated. Even minimal wording changes (e.g., removing persona framing like “You are a robotic assistant”) materially shifted tone, verbosity, boilerplate tendencies, and throughput, demonstrating that prompt framing sets a strong behavioral prior. Persona framing biases register; formatting-heavy prompts increase deliberative behavior and can inflate responses. Longer prompts also increase prefill latency, directly impacting time-to-first-token[1].

The most reliable pattern is a minimum-effective directive to define role, output contract, and non-negotiable constraints. Structural meta-instructions should only be introduced when strictly required. In practice, prompt wording produces larger behavioral shifts than small sampler adjustments.

Example system prompt variants from our experiments:

1. Robotic / persona-based

You are a robotic assistant. Output only the solution. No prefaces or explanations.

2. Ultra-minimal directive

Output only the answer. No prefaces or explanations.

3. Formatting-first / constraint-heavy

Formatting constraints take priority over style.
Follow any explicit requirements in the user message (lines, sentences, word limit, bullet symbol, ASCII-only, banned tokens) exactly.
No extra lines. No explanations. Output only the answer.

Sampling - Entropy Control

Sampling parameters shape how the model navigates its probability distribution at each decoding step. After logits are produced (and optionally biased), temperature typically rescales them, then top-k/top-p/min-p truncate the candidate set before selection[3].

Key levers:

  • Temperature: Controls randomness. Greedy (T=0) is deterministic but often brittle and template-bound. Moderate values (~0.85–1.0) reduce boilerplate and stylistic ruts.
  • Top-k / Top-p: Bound exploration. Prevent low-probability tail drift while maintaining diversity.
  • Min-p: Prunes weak tail tokens relative to the peak; particularly effective at reducing generic connective tissue.
  • Repeat penalty / repeat_last_n: Suppress local loops and scaffolding reuse.
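As an illustration, here is a minimal pure-Python sketch of this filtering pipeline. Real implementations such as llama.cpp apply these filters in a configurable order and then sample from the surviving set, so treat this as conceptual rather than a faithful reimplementation:

```python
import math

def filter_candidates(logits, temperature=0.85, top_k=60, top_p=0.92, min_p=0.04):
    """Apply temperature, then top-k / top-p / min-p truncation, and return
    the renormalized candidate distribution as {token_id: probability}.
    (temperature=0 is conventionally special-cased to argmax; not handled here.)"""
    # Temperature rescales logits before any truncation.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: rank candidates by probability and keep the k most probable.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    # min-p: drop tokens weaker than min_p times the peak probability.
    peak = kept[0][1]
    kept = [(tok, p) for tok, p in kept if p >= min_p * peak]

    # Renormalize over the surviving candidate set.
    z = sum(p for _, p in kept)
    return {tok: p / z for tok, p in kept}
```

A greedy setting (top_k=1) collapses the candidate set to a single token per step, which is why greedy runs keep reusing the same high-probability templates.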

These parameters change behavior, which changes output length and termination dynamics. Most observed performance swings are downstream effects of token count variance, not sampling overhead. If you care about throughput/latency, bounding output (stop strings, max_tokens) is the first-order control; sampler tweaks mostly change length distributions.

DRY Multiplier

dry_multiplier penalizes regeneration of previously seen text spans (not just recent tokens), reducing longer-form repetition and template reuse. It is more structural than repeat_penalty, which only targets recent token reuse.

Useful for long outputs that drift into repeated scaffolding. Too high causes awkward synonym churn or avoidance of necessary technical terms. Test it separately. DRY meaningfully changes decoding behavior and can confound sampler comparisons.

Conceptually:

  • repeat_penalty → discourages reusing recent tokens.
  • dry_multiplier → discourages regenerating previously seen sequences.
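The distinction can be sketched in a few lines of Python. The exact matching rules and penalty formulas in llama.cpp differ in detail (and the parameter values below are illustrative), so this is a conceptual sketch, not the implementation:

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Classic repeat penalty: dampen every token seen in the recent window
    (llama.cpp-style: divide positive logits, multiply negative ones)."""
    out = list(logits)
    for tok in set(recent_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def dry_penalty(logits, context, dry_multiplier=0.8, dry_base=1.75, allowed_length=2):
    """DRY-style penalty sketch: for each candidate token, find the longest
    context suffix that, extended by that candidate, repeats an earlier span;
    matches longer than allowed_length are penalized exponentially."""
    out = list(logits)
    for tok in range(len(logits)):
        seq = context + [tok]
        # Longest suffix of `seq` that also occurs earlier in `seq`.
        match = 0
        for n in range(1, len(context) + 1):
            suffix = seq[-n:]
            if any(seq[i:i + n] == suffix for i in range(len(seq) - n)):
                match = n
        if match > allowed_length:
            out[tok] -= dry_multiplier * dry_base ** (match - allowed_length)
    return out
```

Note how `dry_penalty` only fires when a candidate would *extend* a repeated span past the allowed length, whereas `apply_repeat_penalty` indiscriminately dampens anything recently emitted.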

Logit Bias - Token-Level Steering

Logit bias adds a constant offset to selected token logits before sampling:

$$ p_i \propto \exp((z_i + b_i)/T) $$

where $z_i$ is the original logit for token $i$, $b_i$ is the bias, and $T$ is temperature.

It operates strictly at the token level. Multi-token phrases must be decomposed, and tokenization variants (beginning-of-sequence vs space-prefixed forms) require separate handling[4]. Hard bans (large negative bias) effectively zero probability; softer negatives downweight without eliminating.
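Numerically, a large negative bias drives the post-softmax probability to effectively zero, while a soft negative only downweights. A small pure-Python sketch of the formula above:

```python
import math

def biased_probs(logits, bias=None, temperature=1.0):
    """p_i ∝ exp((z_i + b_i) / T): the bias is added before the softmax."""
    bias = bias or {}
    shifted = [(z + bias.get(i, 0.0)) / temperature for i, z in enumerate(logits)]
    m = max(shifted)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Two equally likely tokens: a hard ban (-100) zeroes one out,
# while a soft bias of -ln(2) merely halves its relative weight.
hard = biased_probs([1.0, 1.0], {1: -100})
soft = biased_probs([1.0, 1.0], {1: -math.log(2)})
```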

Logit bias is precise but narrow. It can suppress recurrent lexical tics (e.g., corporate buzzwords, over-formal transitions, Unicode punctuation variants), but it cannot correct structural non-compliance or reasoning errors. Overuse distorts fluency by forcing the model into lower-probability continuations.

Example logit_bias entries demonstrating token-level suppression mechanics, applied via the logit_bias parameter in the llama.cpp API:

// -------- annotated (item+token order; piece + variant) --------
[
  [194222,-100], // ' tapestry' (space)
  [122273,-100], // ' delve' (space)
  [19008,-100], // ' crucial' (space)
  [2322,-100], // '—' (bos)
  [2733,-100], // ' —' (space)
  [1131,-100], // '…' (bos)
  [3762,-100], // ' …' (space)
  [1100,-100], // '“' (bos)
  [966,-100], // ' “' (space)
  [693,-100], // '”' (bos)
  [10736,-100], // ' ”' (space)
  [438,-100], // '’' (bos)
  [9556,-100], // ' ’' (space)
  ... // etc.
]

To use logit_bias with the llama.cpp server’s /completion endpoint, you provide a JSON payload in a POST request. The example below decreases the likelihood of the token ID 194222 (" tapestry").

curl --request POST \
  --url http://<your-llamacpp-server-endpoint>/completion \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "Output only the answer. No prefaces or explanations.\n\n<user prompt here>",
    "logit_bias": [[194222,-100]]
  }'

We’ve generated model-specific lists for each test.
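A sketch of assembling such a payload in Python. The tic-to-token-id table below is a hand-written stand-in: real ids are tokenizer-specific and must be regenerated per model (e.g. via the llama.cpp server’s /tokenize endpoint); the stop string and token limit are likewise illustrative:

```python
import json

# Hypothetical tic -> token-id table; these ids are only valid for one
# specific tokenizer and must be regenerated per model.
TIC_TOKEN_IDS = {
    " tapestry": 194222,
    " delve": 122273,
    " crucial": 19008,
}

def build_payload(prompt, bans, bias=-100, max_tokens=512):
    """Assemble a llama.cpp /completion request body with hard-ban
    logit biases and bounded output (n_predict / stop)."""
    return json.dumps({
        "prompt": prompt,
        "n_predict": max_tokens,  # bounding output is the first-order latency control
        "stop": ["\n\n\n"],       # illustrative stop string
        "logit_bias": [[TIC_TOKEN_IDS[t], bias] for t in bans],
    })
```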

Minimal Prompt Evaluation Matrix (4 Variants)

We initially tested 20 variants and later narrowed to 8 with two system prompt variations; we would now recommend a more focused 4-variant matrix: use the short, minimal system prompt as the default and evaluate targeted sampler variants covering determinism, our current best setting, and entropy bounds. This captures most behavioral differences without the overhead of a formatting-heavy prompt, which should only be tested if strict structural compliance is a product requirement.

  • Greedy baseline: temperature=0, top_k=1, top_p=1, repeat_penalty=1, repeat_last_n=0
  • Primary candidate: temperature=0.85, top_k=60, top_p=0.92, min_p=0.04, repeat_penalty=1.1, repeat_last_n=128

Run each with and without logit biases. Optionally, a secondary system prompt raises the count to 8 variants. For some models we would also vary entropy around the primary candidate to test sensitivity, e.g. a lower-entropy run at temperature 0.6 and a higher-entropy run at temperature 1.0 (or top_p=0.97).
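One way to enumerate such a grid, starting from the two listed settings. The naming scheme and structure here are our own illustration, not the benchmark harness:

```python
from itertools import product

GREEDY = dict(temperature=0, top_k=1, top_p=1, repeat_penalty=1, repeat_last_n=0)
PRIMARY = dict(temperature=0.85, top_k=60, top_p=0.92, min_p=0.04,
               repeat_penalty=1.1, repeat_last_n=128)

def build_matrix(samplers=None, prompts=("minimal",), bias_options=(False, True)):
    """Cross sampler settings, system prompts, and logit-bias on/off
    into named run configurations."""
    samplers = samplers or {"greedy": GREEDY, "primary": PRIMARY}
    return [
        {"name": f"{prompt}/{name}{'+bias' if biased else ''}",
         "prompt": prompt, "params": params, "logit_bias": biased}
        for prompt, (name, params), biased
        in product(prompts, samplers.items(), bias_options)
    ]
```

`build_matrix()` yields the 4-variant default; a second prompt (`prompts=("minimal", "formatting")`) doubles it to 8, and an entropy sweep is just extra samplers, e.g. `{"low": {**PRIMARY, "temperature": 0.6}}`.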

Earlier results with more variants:

[Benchmark scatterplot: throughput, earlier runs]

Empirical Findings

Across runs, system prompt wording and sampling regime explained most observable variance. Short, minimal directives improved throughput and reduced verbosity with negligible quality loss. Persona framing (“robotic assistant”) measurably shifted tone toward stiffness and increased boilerplate tendencies. Formatting-heavy prompts sometimes improved narrow structural compliance, but in our runs they did not reliably improve Orthogonal Compliance Stress and often reduced TPS through altered generation behavior and higher latency variance.

Greedy decoding underperformed on stylistic dimensions in our runs, frequently amplifying template lock-in and repetitive connective scaffolding despite its determinism. Moderate stochasticity (temperature ~0.85–1.0 with bounded top-p/top-k and modest repeat controls) improved overall style compliance and reduced default LLM tics. Importantly, most “performance” swings tracked output length variance, not decode efficiency. As noted, sampling does not meaningfully change per-token compute; it changes generation behavior, which alters token count and termination patterns.

Orthogonal compliance stress remained the hardest category. Neither logit bias nor formatting-heavy prompts reliably fixed it. Improvements appeared when entropy was introduced, suggesting brittle deterministic paths rather than missing lexical suppression. Reducing repeat_last_n from 256 to 128 showed no consistent quality degradation and in some configurations improved throughput, suggesting diminishing returns beyond moderate local memory windows under the tested repeat penalty.

Logit bias produced localized stylistic shifts but did not materially change structural compliance or reasoning behavior. Its impact was incremental and interaction-dependent, most effective when combined with moderate sampling. Overall, the evidence supports a control hierarchy: prompt framing > sampling regime > token-level bias, with output length as the dominant hidden confounder in throughput metrics.

Benchmark category composition materially affects observed scores. Adding freestyle or blog-style probes increases sensitivity to lexical tic suppression and tone drift. Comparisons across runs must hold benchmark composition constant to avoid false progress attribution.

AI-Style “Tics” and Practical Suppression

Instruction-tuned models exhibit recurring lexical and structural patterns that signal “LLM voice”: over-formal transitions, hedging meta-language, consulting buzzwords, sales intensifiers, templated scaffolding, and explicit self-reference. These patterns reflect high-probability continuations in the model’s training distribution[5]. Greedy decoding amplifies them. Moderate sampling reduces their dominance, but deterministic suppression requires targeted intervention.

Practically, we are tiering suppressions. Hard bans are appropriate only for near-certain artifacts (self-reference such as “As an AI…”, unwanted Unicode punctuation, house-style violations). Strong downweighting works well for high-signal transitions and corporate hype terms (“Moreover,” “Furthermore,” “leverage,” “groundbreaking”). Light downweighting can trim hedging and scaffolding (“It’s worth noting,” “Let’s explore,” repeated modal verbs). Logit bias must operate at the token level: bias only single-token realizations, include both BOS and space-prefixed variants, and keep the list small to avoid collateral fluency damage. Over-biasing forces awkward paraphrasing and can increase repetition by pushing the model into lower-probability regions.
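The tiering can be expressed as data. The phrase lists and bias magnitudes below are illustrative assumptions, not measured values from our runs, and each phrase must actually resolve to a single token for the target model:

```python
# Tier -> phrases (space-prefixed variants shown; BOS variants would need
# their own entries). Bias magnitudes are illustrative, not tuned values.
SUPPRESSION_TIERS = {
    -100: ["\u2014", "\u2026"],                         # hard bans: Unicode punctuation
    -6:   [" Moreover", " Furthermore", " leverage"],   # strong downweight
    -2:   [" It's worth noting", " Let's explore"],     # light downweight
}

def to_logit_bias(tiers, token_id_of):
    """Flatten tiers into llama.cpp-style [[token_id, bias], ...] pairs.
    `token_id_of` must resolve each phrase to a single token id via the
    model's tokenizer; multi-token phrases cannot be biased directly."""
    return [[token_id_of(text), bias]
            for bias, texts in tiers.items()
            for text in texts]
```

A phrase that tokenizes to multiple ids must either be dropped or decomposed into its constituent tokens before biasing.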

Conclusions

Most of the observable variance in style, compliance, and throughput comes from two factors: system prompt framing and decoding entropy. Small changes in wording at the system level shift the model’s prior more than most sampler tweaks, and moderate stochasticity consistently outperforms strict determinism for avoiding template lock-in and boilerplate. Logit bias operates at a much narrower layer and redistributes token probability mass but does not alter structural behavior.


  1. Hugging Face. Prefill vs Decode: Understanding LLM Inference Performance. Explains how prompt length increases prefill cost and affects time-to-first-token and overall latency. https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

  2. Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Demonstrates brittleness and retrieval degradation in long contexts, relevant to greedy template lock-in and over-reliance on high-probability continuations. https://arxiv.org/abs/2307.03172

  3. Holtzman et al. (2020). The Curious Case of Neural Text Degeneration. Introduces nucleus (top-p) sampling and explains degeneration under greedy and beam search. https://arxiv.org/abs/1904.09751

  4. Kudo & Richardson (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Describes subword tokenization and boundary variants (BOS vs space-prefixed tokens). https://arxiv.org/abs/1808.06226

  5. Wei et al. (2022). Finetuned Language Models Are Zero-Shot Learners. Shows how instruction tuning induces patterned stylistic priors in output distributions. https://arxiv.org/abs/2109.01652