by supercent-io
轻量级LLM模型常因因果注意力机制而难以对齐上下文与问题。此技能可自动重复提示词,让模型在第二次处理时获得完整上下文,最多可提升76个百分点的准确性。
1. 打开 Claude 聊天界面
2. 点击下方 "📋 复制" 按钮
3. 粘贴到 Claude 聊天框中并发送
4. 输入 "使用 prompt-repetition 技能" 开始使用
=== prompt-repetition 技能 === 作者: supercent-io 描述: 轻量级LLM模型常因因果注意力机制而难以对齐上下文与问题。此技能可自动重复提示词,让模型在第二次处理时获得完整上下文,最多可提升76个百分点的准确性。 使用方法: 1. 调用技能: "使用 prompt-repetition 技能" 2. 提供相关信息: 根据技能要求提供必要参数 3. 查看结果: 技能会返回处理结果 示例: "使用 prompt-repetition 技能,帮我分析一下这段代码"
这种方法适用于所有 Claude 用户,不需要安装额外工具。
coding
safe
LLMs are trained as Causal Language Models, where each token attends only to previous tokens. This leads to:
Prompt repetition enables the second pass to reference the entire first pass, effectively mimicking some benefits of bidirectional attention.
[Context] → [Question]
↓
Cannot reference Question content when processing Context tokens
Attention weights for Context are already finalized by the time Question tokens appear
[First Pass] [Second Pass]
Context → Question → Context' → Question'
↑ ↑
Can reference entire first pass
In the second repetition, the model reprocesses information across the entire first prompt and strengthens attention weights on key concepts, resulting in improved performance.
Note: This does not change the model architecture to bidirectional; it is a prompt engineering technique to mitigate the limitations of causal models.
| Metric | Result |
|---|---|
| Significant improvement (p < 0.1) | 47 / 70 benchmarks |
| Performance degradation | 0 |
| Neutral | 23 |
| Improvement rate | 67% |
Most dramatic improvement: Gemini 2.0 Flash-Lite on NameIndex: 21.33% → 97.33% (+76%p)
| Provider | Auto-apply models | Excluded models |
|---|---|---|
| Claude | haiku series | opus, sonnet |
| Gemini | flash, flash-lite | pro, ultra |
| OpenAI | gpt-4o-mini, gpt-low | gpt-4o, gpt-4 |
| Task Type | Keyword Pattern | Repetitions | Expected Improvement |
|---|---|---|---|
| Options-First MCQ | A. B. C. D. choices first | 2× | +15-40%p |
| Index/Position | slot, position, index, N-th | 3× | +50-76%p |
| Context + Question | General question | 2× | +5-15%p |
| With CoT | step by step, think through | 0× (not applied) | ~0% |
# Check context before auto-apply
max_context = model_context_window * 0.8 # 80% safety margin
if len(prompt_tokens) * repetitions > max_context:
repetitions = max(1, int(max_context / len(prompt_tokens)))
def apply_prompt_repetition(prompt: str, times: int = 2) -> str:
"""Repeat the prompt a specified number of times
Args:
prompt: Original prompt
times: Number of repetitions (default 2)
Returns:
Repeated prompt
"""
if times <= 1:
return prompt
return "\n\n".join([prompt] * times)
Before:
A. Paris
B. London
C. Berlin
D. Madrid
Which city is the capital of France?
Reply with one letter.
After (repetition ×2 applied):
A. Paris
B. London
C. Berlin
D. Madrid
Which city is the capital of France?
Reply with one letter.
A. Paris
B. London
C. Berlin
D. Madrid
Which city is the capital of France?
Reply with one letter.
Expected output:
A
Accuracy: original 78% → after repetition 93% (+15%p)
Before:
Inventory:
1. Iron Sword
2. Leather Armor
3. Health Potion (x5)
4. Magic Staff
...
25. Dragon Scale
...
50. Ancient Map
What item is in slot 25?
After (repetition ×3 applied): Prompt repeated 3 times
Expected output:
Dragon Scale
Accuracy: original 21% → after repetition 97% (+76%p)
Note: Prompts containing tool call instructions are also repeated in their entirety. The full-repetition approach was adopted for implementation simplicity and consistency.
Before:
Use the calculator tool to compute 234 * 567.
What is the result?
After (repetition ×2):
Use the calculator tool to compute 234 * 567.
What is the result?
Use the calculator tool to compute 234 * 567.
What is the result?
Research results show that full repetition including tool call sections is also effective.
"""prompt_repetition_transformer.py"""
from dataclasses import dataclass, field
from typing import Optional, Callable, List
import re
# Context window per model (in tokens)
MODEL_CONTEXT_WINDOWS = {
"claude-3-haiku": 200_000,
"claude-haiku": 200_000,
"gemini-flash": 1_000_000,
"gemini-flash-lite": 1_000_000,
"gemini-2.0-flash": 1_000_000,
"gpt-4o-mini": 128_000,
"gpt-low": 128_000,
}
# Models targeted for auto-apply
AUTO_APPLY_MODELS = list(MODEL_CONTEXT_WINDOWS.keys())
# CoT patterns (excluded from apply)
COT_PATTERNS = [
r"step by step",
r"think through",
r"let's think",
r"reasoning:",
r"chain of thought",
]
# Position/Index patterns (3× repetition)
POSITION_PATTERNS = [
r"slot \d+",
r"position \d+",
r"index \d+",
r"\d+(st|nd|rd|th)",
r"item \d+",
r"row \d+",
r"column \d+",
]
@dataclass
class PromptRepetitionConfig:
"""Prompt repetition configuration"""
default_repetitions: int = 2
position_repetitions: int = 3
separator: str = "\n\n"
max_context_ratio: float = 0.8
applied_marker: str = "<!-- prompt-repetition-applied -->"
class PromptRepetitionTransformer:
"""Auto-apply prompt repetition transformer for lightweight models"""
def __init__(self, config: Optional[PromptRepetitionConfig] = None):
self.config = config or PromptRepetitionConfig()
def should_apply(self, model: str, prompt: str) -> bool:
"""Determine whether to auto-apply"""
# Skip if already applied
if self.config.applied_marker in prompt:
return False
# Check target model
model_lower = model.lower()
if not any(m in model_lower for m in AUTO_APPLY_MODELS):
return False
# Skip when CoT pattern detected
prompt_lower = prompt.lower()
for pattern in COT_PATTERNS:
if re.search(pattern, prompt_lower):
return False
return True
def determine_repetitions(self, prompt: str, model: str) -> int:
"""Determine repetition count based on task type"""
prompt_lower = prompt.lower()
# Position/Index pattern detected → 3×
for pattern in POSITION_PATTERNS:
if re.search(pattern, prompt_lower):
return self.config.position_repetitions
return self.config.default_repetitions
def estimate_tokens(self, text: str) -> int:
"""Simple token count estimation (speed over precision)"""
# Estimate approximately 4 characters = 1 token
return len(text) // 4
def transform(self, prompt: str, model: str) -> str:
"""Apply repetition to prompt"""
if not self.should_apply(model, prompt):
return prompt
repetitions = self.determine_repetitions(prompt, model)
# Check context limit
model_lower = model.lower()
max_tokens = 128_000 # Default value
for m, tokens in MODEL_CONTEXT_WINDOWS.items():
if m in model_lower:
max_tokens = tokens
break
max_allowed = int(max_tokens * self.config.max_context_ratio)
prompt_tokens = self.estimate_tokens(prompt)
# Reduce repetitions if token limit exceeded
while prompt_tokens * repetitions > max_allowed and repetitions > 1:
repetitions -= 1
if repetitions <= 1:
return prompt
# Apply repetition + add marker
repeated = self.config.separator.join([prompt] * repetitions)
return f"{self.config.applied_marker}\n{repeated}"
def wrap_llm_call(self, llm_fn: Callable, model: str) -> Callable:
"""Wrap LLM call function"""
def wrapped(prompt: str, **kwargs):
transformed = self.transform(prompt, model)
return llm_fn(transformed, **kwargs)
return wrapped
def run_ab_test(prompts: List[str], llm_fn, model: str, ground_truth: List[str]):
"""A/B test for prompt repetition effectiveness"""
transformer = PromptRepetitionTransformer()
results = {"baseline": [], "repeated": []}
for prompt, expected in zip(prompts, ground_truth):
# Baseline
response_a = llm_fn(prompt)
results["baseline"].append(response_a == expected)
# With Repetition
repeated_prompt = transformer.transform(prompt, model)
response_b = llm_fn(repeated_prompt)
results["repeated"].append(response_b == expected)
baseline_acc = sum(results["baseline"]) / len(prompts)
repeated_acc = sum(results["repeated"]) / len(prompts)
print(f"Baseline accuracy: {baseline_acc:.2%}")
print(f"Repeated accuracy: {repeated_acc:.2%}")
print(f"Improvement: {repeated_acc - baseline_acc:+.2%}p")
| Metric | Measurement Method |
|---|---|
| Accuracy | Compare correct answer rates |
| Consistency | Variance across 10 runs of same prompt |
| Token cost | Input token increase rate |
| Latency | Compare p50, p99 latency |
| Case | Reason |
|---|---|
| Using CoT | Reasoning process already provides context |
| Reasoning models (opus, sonnet) | Already optimized; minimal effect |
| Very long prompts | Risk of exceeding context limit |
| Already repeated | Duplicate application wastes tokens |
| Metric | Baseline | With Repetition | Change |
|---|---|---|---|
| Input tokens | 500/req | 1000/req | +100% |
| Output tokens | 100/req | 100/req | 0% |
| Latency (p50) | 450ms | 460ms | +2% |
| Latency (p99) | 1200ms | 1250ms | +4% |
| Accuracy | 78% | 89% | +14%p |
| Cost per correct answer | $0.019 | $0.020 | +5% |
Key insight: The prefill phase is highly parallelized on GPU, so doubling input tokens has minimal impact on latency.
| Agent | Model | Repetition Applied | Applied At |
|---|---|---|---|
| Claude Orchestrator | opus/sonnet | Optional | - |
| Claude Executor | haiku | Auto | skill_loader.py |
| Gemini Analyst | flash | Auto | On MCP call |
| OpenAI | gpt-4o-mini | Auto | skill_loader.py |
To prevent duplicate application in multi-agent pipelines:
<!-- prompt-repetition-applied --> markerx-prompt-repetition-applied: true header between agents[Claude Sonnet] Planning (no repetition needed)
↓
[Gemini Flash] Analysis (repetition ×2 auto-applied, marker added)
↓
[Claude Haiku] Execution (marker detected → skip duplicate apply)
# Code to add to skill_loader.py
from prompt_repetition_transformer import PromptRepetitionTransformer
class SkillLoader:
def __init__(self, ...):
# ... existing code ...
self.prompt_transformer = PromptRepetitionTransformer()
def apply_auto_skills(self, prompt: str, model: str) -> str:
"""Handle auto-apply skills"""
# Auto-apply prompt-repetition
for skill in self.skills.values():
auto_apply = skill.get('data', {}).get('auto-apply', {})
if auto_apply.get('trigger') == 'auto':
target_models = auto_apply.get('models', [])
if any(m in model.lower() for m in target_models):
prompt = self.prompt_transformer.transform(prompt, model)
return prompt
. etc. has no effect (per research)=== Auto-Apply Target Models ===
claude-3-haiku, claude-haiku
gemini-flash, gemini-flash-lite, gemini-2.0-flash
gpt-4o-mini, gpt-low
=== Repetition Count ===
General tasks: 2×
Position/Index (slot/position/index keywords): 3×
With CoT: 0× (not applied)
=== Effect (Google Research 2025) ===
Improvement rate: 67% (47/70 benchmarks)
Performance degradation: 0 cases
Maximum improvement: +76%p (NameIndex)
=== Cost ===
Input tokens: +100%
Latency: +2% (Prefill parallelization)
Cost per correct answer: +5%
=== Duplicate Application Prevention ===
Marker: <!-- prompt-repetition-applied -->
View Count
0
Download Count
0
Favorite Count
0
Quality Score
72