# Empirical Results

## 2025 Controlled Experiments

A controlled study published in December 2025 measured output truncation across several frontier models, including GPT-4 variants and DeepSeek. Three experiments were conducted:

### Experiment A: Multi-Part Instruction Compliance

Models were given complex prompts with multiple explicit requirements (formatting constraints, length requirements, mandatory sections). Results:

- No model fully satisfied both length requirements and all sub-part instructions natively
- Models frequently omitted mandatory output sections
- Required formatting constraints were routinely skipped
- Explicit length requirements were consistently undershot

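The three failure modes listed above (missing sections, skipped formatting, undershot length) can be checked mechanically. The following is a minimal sketch, not the study's actual harness: the `Requirements` class, the `check_compliance` function, and the choice of Markdown headings as section markers are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Requirements:
    """Hypothetical spec for one multi-part prompt (names are illustrative)."""
    min_words: int
    required_sections: list = field(default_factory=list)   # heading titles
    required_patterns: dict = field(default_factory=dict)   # label -> regex


def check_compliance(output: str, req: Requirements) -> dict:
    """Report which explicit requirements a model output fails to meet."""
    word_count = len(output.split())

    # Assumption: mandatory sections appear as Markdown ATX headings.
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+[ \t]+(.*)$", output, re.MULTILINE)
    }
    missing_sections = [
        s for s in req.required_sections if s.lower() not in headings
    ]

    # A formatting constraint passes if its regex matches anywhere.
    failed_formats = [
        label
        for label, pattern in req.required_patterns.items()
        if not re.search(pattern, output, re.MULTILINE)
    ]

    length_ok = word_count >= req.min_words
    return {
        "word_count": word_count,
        "length_ok": length_ok,
        "missing_sections": missing_sections,
        "failed_formats": failed_formats,
        "fully_compliant": length_ok and not missing_sections and not failed_formats,
    }


# Made-up sample output that undershoots every requirement.
sample = "## Summary\nShort answer.\n"
req = Requirements(
    min_words=50,
    required_sections=["Summary", "Risks"],
    required_patterns={"bullet list": r"^- "},
)
report = check_compliance(sample, req)
```

Scoring each requirement separately, rather than as a single pass/fail flag, makes it possible to see which kind of instruction a model drops first.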

### Experiment B: Decoding Suboptimality

This experiment tested whether truncated outputs result from suboptimal token selection (the model "knowing" the right answer but selecting a worse token). Results:

- Limited evidence of decoding suboptimality on simple reasoning tasks
- The model's greedy, truncated output generally aligned with its highest-confidence solution
- Truncation is a deliberate behavioral choice, not a decoding failure

### Experiment C: Context Degradation

This experiment tested whether models lose track of instructions during long, multi-turn conversations. Results:

- Models showed surprising resilience to context degradation in 200-turn conversational tests
- Models maintained key facts and instructions significantly better than hypothesized
- Context loss is not the primary cause of truncation

### Key Conclusion

Laziness is not a failure of memory, context processing, or core model capabilities. It is a behavioral artifact triggered by:

1. Instruction complexity exceeding internal effort thresholds
2. Aggressively calibrated stopping pressure
3. Economic constraints embedded in the alignment layer

## Prompt Stimulus Effectiveness (Microsoft Research)

A Microsoft Research study documented controlled testing of psychological prompt stimuli:

| Stimulus | Measured Effect |
|:---|:---|
| Financial incentive framing ("$200 tip") | +45% output quality and length |
| Step-by-step instruction ("take a deep breath") | Accuracy: 34% to 80% on logic tasks |
| Stakes framing ("critical to my career") | +10% average performance |
| Combined (multiple stimuli) | Up to +115% overall performance |

These effects are reproducible and stem from statistical correlations in the training data between stakes language and high-effort human outputs.
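
To compare stimuli like those in the table, one option is to generate a control prompt plus one variant per stimulus and one combined variant, then send each to the same model. The sketch below only builds the prompt strings. The full phrasings in `STIMULI` and the `build_variants` helper are illustrative assumptions; the source gives only the short quoted fragments.

```python
# Assumed phrasings built around the fragments quoted in the table.
STIMULI = {
    "tip": "I'm going to tip $200 for a perfect solution.",
    "breath": "Take a deep breath and work on this problem step-by-step.",
    "stakes": "This is critical to my career.",
}


def build_variants(base_prompt: str, stimuli: dict = STIMULI) -> dict:
    """Return a control prompt, one variant per stimulus, and a combined variant."""
    variants = {"control": base_prompt}
    for name, phrase in stimuli.items():
        # Each single-stimulus variant appends exactly one phrase.
        variants[name] = f"{base_prompt}\n\n{phrase}"
    # The combined variant appends every phrase, in dictionary order.
    variants["combined"] = base_prompt + "\n\n" + " ".join(stimuli.values())
    return variants


variants = build_variants("Write a complete migration guide for the service.")
```

Keeping the base prompt identical across variants means any difference in output length or quality can be attributed to the appended stimulus rather than to rewording of the task.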

## Seasonal Output Variation

Statistical analysis of ChatGPT outputs during November-December 2023 versus January-March 2024 confirmed:

- Measurable decrease in average output length during December
- Correlation with reduced work output in the training data during holiday periods
- Output length increased when the system prompt explicitly stated a non-winter month
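
The last finding suggests a simple test: pin the date in the system prompt and compare output lengths against a baseline. The sketch below shows one way to set that up. The system prompt wording, the helper names, and the use of character count as the length metric are assumptions, and the sample strings are placeholders, not measured data.

```python
from datetime import date
from statistics import mean

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def system_prompt_with_date(today: date) -> str:
    """Build a system prompt that states the date explicitly (assumed wording)."""
    return (
        "You are a helpful assistant. "
        f"Current date: {MONTHS[today.month - 1]} {today.day}, {today.year}."
    )


def relative_length_change(baseline: list, treated: list) -> float:
    """Relative change in mean output length (characters) versus the baseline."""
    base = mean(len(text) for text in baseline)
    new = mean(len(text) for text in treated)
    return (new - base) / base


prompt = system_prompt_with_date(date(2024, 5, 15))

# Placeholder outputs standing in for real model responses.
change = relative_length_change(["aaaa", "aaaa"], ["aaaaa", "aaaaa"])
```

Spelling out the month name avoids relying on locale-dependent date formatting, so the same prompt text is produced on every machine.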