# Empirical Results

## 2025 Controlled Experiments

A controlled study published in December 2025 measured output truncation across several frontier models, including GPT-4 variants and DeepSeek. Three experiments were conducted:

### Experiment A: Multi-Part Instruction Compliance

Models were given complex prompts with multiple explicit requirements (formatting constraints, length requirements, mandatory sections). Results:

- No model fully satisfied both length requirements and all sub-part instructions natively
- Models frequently omitted mandatory output sections
- Required formatting constraints were routinely skipped
- Explicit length requirements were consistently undershot

### Experiment B: Decoding Suboptimality

Tested whether truncated outputs resulted from suboptimal token selection (the model "knowing" the right answer but selecting a worse token). Results:

- Limited evidence of decoding suboptimality on simple reasoning tasks
- The model's greedy, truncated output generally aligned with its highest-confidence solution
- Truncation is a deliberate behavioral choice, not a decoding failure

### Experiment C: Context Degradation

Tested whether models lose track of instructions during long, multi-turn conversations. Results:

- Surprising resilience against context degradation during 200-turn conversational tests
- Models maintained key facts and instructions significantly better than hypothesized
- Context loss is not the primary cause of truncation

### Key Conclusion

Laziness is not a failure of memory, context processing, or core model capabilities. It is a behavioral artifact triggered by:
1. Instruction complexity exceeding internal effort thresholds
2. Aggressively calibrated stopping pressure
3. Economic constraints embedded in the alignment layer

## Prompt Stimulus Effectiveness (Microsoft Research)

Controlled testing of psychological prompt stimuli documented in a Microsoft Research study:

| Stimulus | Measured Effect |
|:---|:---|
| Financial incentive framing ("$200 tip") | +45% output quality and length |
| Step-by-step instruction ("take a deep breath") | Accuracy: 34% to 80% on logic tasks |
| Stakes framing ("critical to my career") | +10% average performance |
| Combined (multiple stimuli) | Up to +115% overall performance |

These effects are reproducible and stem from statistical correlations in the training data between stakes language and high-effort human outputs.

## Seasonal Output Variation

Statistical analysis of ChatGPT outputs during November-December 2023 versus January-March 2024 confirmed:

- Measurable decrease in average output length during December
- Correlation with reduced work output in the training data during holiday periods
- Output length increased when the system prompt explicitly stated a non-winter month