Large language models (LLMs) are increasingly used in software development, primarily for code generation, but their impact on overall productivity, reliability, and delivery speed remains contested. While proponents claim transformative gains, empirical data and theoretical frameworks suggest more modest outcomes.
Overview
LLMs have become a common tool in programming workflows, enabling rapid code generation from natural language prompts or specifications. This capability has led to claims of revolutionary improvements in developer productivity. However, long-standing principles in software engineering, particularly Fred Brooks’ 'No Silver Bullet' argument, challenge the notion that faster code generation alone can yield order-of-magnitude improvements in software development.
Brooks distinguished between 'essential' difficulties—those inherent to software design, such as specification, architecture, and testing—and 'accidental' difficulties, such as syntax errors or boilerplate code. He argued that most gains from automation target accidental complexity, which constitutes a small fraction of total development effort. Even eliminating all accidental difficulty cannot produce a 10x improvement if essential tasks dominate the workload. Historical estimates suggest coding itself accounts for less than 20% of total effort, meaning that accelerating only code production has limited systemic impact.
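Brooks' bound is an Amdahl's-law-style argument: if only a fraction of total effort is accelerated, overall speedup is capped by the unaccelerated remainder. The sketch below (illustrative figures, not from Brooks) makes the arithmetic concrete.

```python
def max_speedup(fraction_accelerated: float, factor: float = float("inf")) -> float:
    """Amdahl's-law bound on overall speedup when only part of the
    work is accelerated. `fraction_accelerated` is the share of total
    effort sped up by `factor` (infinite factor = that work is free)."""
    remaining = 1 - fraction_accelerated
    if factor == float("inf"):
        return 1 / remaining
    return 1 / (remaining + fraction_accelerated / factor)

# If coding is ~20% of effort and LLMs made it instantaneous,
# the whole project still only gets ~1.25x faster -- nowhere near 10x.
print(max_speedup(0.20))        # 1.25
# A more realistic 2x coding speedup yields ~1.11x overall.
print(round(max_speedup(0.20, factor=2), 2))  # 1.11
```

Even generous assumptions about code-generation speed leave the overall schedule dominated by the essential 80%.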
What the data shows
Recent industry reports support this skepticism. The 2026 DORA report on AI-assisted software development states that AI acts as an 'amplifier'—enhancing high-performing teams while exacerbating dysfunction in struggling ones. While AI adoption correlates with increased throughput (measured by lead time, deployment frequency, and recovery time), it also increases delivery instability. Specifically, change failure rates and rework rates have risen, indicating that more deployments require immediate fixes or unplanned follow-up work.
The report attributes gains not to the tools themselves but to organizational maturity: strong version control, small batch sizes, robust internal platforms, and clear workflows. These practices are long-established fundamentals, not novel requirements for AI use.
Similarly, CircleCI’s 2026 'State of Software Delivery' report finds that most teams see little to no throughput improvement from LLMs. Key stability indicators worsened: recovery time for broken main branches increased by 13%, and for feature branches by 25%. Main branch success rates fell to 70.8%, far below the recommended 90% benchmark. For a team merging five changes daily, this translates to 1.5 showstopping failures per day—triple the previous rate—and an additional 250 hours of debugging annually.
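The report's per-day figures follow directly from its success-rate numbers. A brief sketch of that arithmetic, using the 70.8% observed rate and the 90% benchmark as the baseline:

```python
def daily_failures(merges_per_day: int, success_rate: float) -> float:
    """Expected broken-main incidents per day, assuming each merge
    fails independently at the given success rate."""
    return merges_per_day * (1 - success_rate)

observed = daily_failures(5, 0.708)  # ~1.46/day, the report's ~1.5
baseline = daily_failures(5, 0.90)   # 0.5/day at the 90% benchmark
print(round(observed, 2))                 # 1.46
print(round(observed / baseline, 1))      # 2.9 -- "triple the previous rate"
```

The quoted 250 hours of annual debugging compounds these daily incidents across a working year; the exact hours-per-incident assumption is the report's, not derived here.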
Tradeoffs and workflow challenges
A common counterargument is that older LLMs or outdated workflows explain poor results, and that newer models or adversarial review systems (e.g., one LLM generating code, another reviewing it) will resolve quality issues. However, real-world examples