Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1)