by davila7
链式思考推理模型会产生冗长的自我反思令牌,增加成本和延迟。本技能实现了 NOWAIT 技术,可在推理过程中抑制不必要的反思令牌,将令牌使用量减少 27-51%,同时保持基于 RL 的推理模型的准确性。
1. 打开 Claude 聊天界面
2. 点击下方 "📋 复制" 按钮
3. 粘贴到 Claude 聊天框中并发送
4. 输入 "使用 nowait-reasoning-optimizer 技能" 开始使用
=== nowait-reasoning-optimizer 技能 === 作者: davila7 描述: 链式思考推理模型会产生冗长的自我反思令牌,增加成本和延迟。本技能实现了 NOWAIT 技术,可在推理过程中抑制不必要的反思令牌,将令牌使用量减少 27-51%,同时保持基于 RL 的推理模型的准确性。 使用方法: 1. 调用技能: "使用 nowait-reasoning-optimizer 技能" 2. 提供相关信息: 根据技能要求提供必要参数 3. 查看结果: 技能会返回处理结果 示例: "使用 nowait-reasoning-optimizer 技能,帮我分析一下这段代码"
这种方法适用于所有 Claude 用户,不需要安装额外工具。
productivity
safe
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.
| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
from scripts.nowait_processor import NOWAITLogitProcessor
# Initialize processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)
# Use during generation
outputs = model.generate(
inputs,
logits_processor=[processor],
max_new_tokens=32768
)
See references/keywords.md for the complete list. Core keywords:
wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah
Logits (Before) Logits (After)
Wait 0.8 → Wait -inf
First 0.6 → First 0.6
Hmm 0.5 → Hmm -inf
Let 0.4 → Let 0.4
| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor
model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)
response = model.generate(
tokenizer(prompt, return_tensors="pt").input_ids,
logits_processor=[processor],
max_new_tokens=32768,
do_sample=True,
temperature=0.7
)
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids
llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())
sampling_params = SamplingParams(
max_tokens=32768,
bad_words_ids=bad_words_ids
)
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
references/keywords.mdscripts/nowait_processor.pyView Count
0
Download Count
0
Favorite Count
0
Quality Score
71