
Large Language Models and Code Speed: An Experimental Review

TLDR: A study evaluated two leading LLMs (OpenAI o4-mini and Gemini 2.5 Pro) on their ability to generate code patches that improve performance in real-world Java projects. While LLMs can improve code performance in most cases, human developers still achieve significantly better optimizations. The study found that providing detailed problem descriptions to LLMs leads to more human-like and effective solutions, and that LLMs can occasionally propose novel optimizations that human developers miss, highlighting their potential as performance co-pilots.

Large Language Models (LLMs) have become incredibly adept at generating code, but a crucial question remains: can they generate code that is not just correct, but also fast? A recent experimental study, titled An Experimental Study of Real-Life LLM-Proposed Performance Improvements, delves into this very topic, evaluating the capabilities of leading LLMs in optimizing real-world Java software for performance.

Conducted by Lirong Yi, Gregory Gay, and Philipp Leitner from Chalmers University of Technology and University of Gothenburg, Sweden, the research explored whether LLMs could propose code changes that significantly improve the speed of existing programs. They used a unique dataset called PerfOpt, comprising 65 real-world performance optimization tasks sourced from popular open-source Java projects like Apache Kafka, Netty, Presto, and RoaringBitmap. These tasks were specifically chosen because human developers had already achieved substantial speedups in them, and each included developer-provided benchmarks to measure performance.
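The article does not reproduce any of the benchmark code, but Java projects of this kind typically measure such changes with JMH micro-benchmarks. The sketch below is a hypothetical example (the class name, workload, and parameters are invented, not taken from PerfOpt) of what a developer-provided benchmark comparing a slow baseline against a candidate optimization might look like.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical JMH micro-benchmark illustrating the kind of developer-provided
// benchmark the PerfOpt tasks rely on; the workload here is purely illustrative.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class StringJoinBenchmark {

    @Param({"10", "1000"})
    int size;

    String[] parts;

    @Setup
    public void setUp() {
        parts = new String[size];
        for (int i = 0; i < size; i++) {
            parts[i] = "item-" + i;
        }
    }

    // Baseline: repeated string concatenation in a loop.
    @Benchmark
    public String concatInLoop() {
        String result = "";
        for (String p : parts) {
            result += p;
        }
        return result;
    }

    // Candidate optimization: StringBuilder with a pre-sized buffer.
    @Benchmark
    public String concatWithBuilder() {
        StringBuilder sb = new StringBuilder(size * 8);
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```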

The study employed an automated pipeline to generate performance-improving patches using two prominent LLMs: OpenAI o4-mini and Gemini 2.5 Pro. These models were tested under four different prompting strategies, ranging from a basic request to improve performance to providing detailed problem descriptions and benchmark information. The generated patches were then rigorously benchmarked against the original code and the human-authored solutions.
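The paper's exact prompt templates are not reproduced in this article. The sketch below is only an illustration of the underlying idea, escalating amounts of context in the prompt; it collapses the study's four strategies into three levels, and all class, enum, and parameter names are hypothetical.

```java
// Illustrative sketch of context-escalating prompts, not the authors' pipeline.
public final class PromptBuilder {

    // Roughly: from a bare "make this faster" request up to full context.
    enum ContextLevel { BASIC, WITH_PROBLEM_DESCRIPTION, WITH_BENCHMARK_INFO }

    static String build(ContextLevel level,
                        String sourceCode,
                        String problemDescription,
                        String benchmarkInfo) {
        StringBuilder prompt = new StringBuilder();
        prompt.append("Improve the runtime performance of the following Java code ")
              .append("without changing its observable behavior.\n\n")
              .append(sourceCode).append('\n');

        if (level.ordinal() >= ContextLevel.WITH_PROBLEM_DESCRIPTION.ordinal()) {
            // Natural-language description of the performance problem; the study
            // found this context made patches far more likely to match the human fix.
            prompt.append("\nKnown performance problem:\n")
                  .append(problemDescription).append('\n');
        }
        if (level.ordinal() >= ContextLevel.WITH_BENCHMARK_INFO.ordinal()) {
            // The developer-provided benchmark that will be used to judge the patch.
            prompt.append("\nBenchmark used to evaluate the change:\n")
                  .append(benchmarkInfo).append('\n');
        }
        return prompt.toString();
    }
}
```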

LLMs Show Promise, But Humans Still Lead

The findings revealed that LLMs do have the potential to improve code performance. Approximately 62% of the generated solutions were considered ‘plausible’, meaning they could be integrated, compiled, and passed existing unit tests. Of these plausible patches, 64% led to performance improvements over the original baseline code; taken together, that works out to roughly 40% of all generated patches both passing the tests and running faster than the baseline. This suggests that LLMs can be a valuable tool for identifying and addressing performance bottlenecks.

However, the study also highlighted clear limitations. While LLMs often improved performance, they rarely matched the effectiveness of human developers. Human-authored solutions consistently outperformed LLM fixes by a statistically significant margin. For instance, human developers achieved ‘massive improvements’ (over 50% speedup) in more than 62% of cases, whereas Gemini achieved this in 46% of cases (with sufficient context) and o4-mini in 36%.

The Importance of Context and Novelty

One of the key takeaways was the impact of prompting strategies. When LLMs were provided with a natural-language description of the performance problem, they were much more likely to propose solutions that were functionally identical or very similar to what human developers had implemented. This ‘strategy match’ often led to the highest performance gains. Conversely, when LLMs received little to no context, they tended to propose ‘strategy divergent’ solutions – entirely different optimization approaches. While these original ideas were interesting, they only occasionally yielded substantial performance improvements and sometimes even led to performance regressions.

Interestingly, the study found that the prompting strategy often mattered more than the model size. In some configurations, the smaller o4-mini model, when given ample context, achieved performance comparable to the larger Gemini model.


LLMs as Performance Co-Pilots

The researchers conclude that LLMs are not yet ready to replace human performance engineers. Their ‘creativity’ in proposing novel solutions may reflect pattern-matching more than a deep understanding of complex performance bottlenecks. However, the study also identified rare but significant instances where an LLM proposed an optimization that an experienced human developer had missed, particularly in projects like Kafka and Netty. This suggests a powerful role for LLMs as ‘performance co-pilots’ or assistants.

The vision is for integrated developer tools that embed LLMs within an automated framework. Such a system would not just suggest code, but also automatically extract context, generate candidate patches, verify correctness, and run relevant benchmarks before presenting the results to a human for review. This approach would transform LLMs into reliable partners in the performance engineering process, augmenting human expertise rather than replacing it. The study also emphasizes the need for more realistic, validated datasets like PerfOpt to accurately assess and improve LLM capabilities in this complex domain.
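As a rough illustration of that workflow, the skeleton below sketches how such a co-pilot could chain the steps together: generate a candidate patch, verify that it compiles and passes tests, benchmark it, and only then surface it for human review. This is not an existing tool or the authors' implementation; every interface and method name here is an assumption.

```java
// Hypothetical skeleton of the "performance co-pilot" loop described above.
import java.util.ArrayList;
import java.util.List;

public final class PerformanceCopilot {

    interface LlmClient { String proposePatch(String prompt); }
    interface BuildSystem { boolean compilesAndPassesTests(String patch); }
    interface BenchmarkRunner { double speedupOverBaseline(String patch); }

    private final LlmClient llm;
    private final BuildSystem build;
    private final BenchmarkRunner benchmarks;

    PerformanceCopilot(LlmClient llm, BuildSystem build, BenchmarkRunner benchmarks) {
        this.llm = llm;
        this.build = build;
        this.benchmarks = benchmarks;
    }

    /** Returns only patches that compile, pass tests, and measurably speed up the benchmark. */
    List<String> candidatesForReview(String prompt, int attempts, double minSpeedup) {
        List<String> vetted = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            String patch = llm.proposePatch(prompt);                 // 1. generate a candidate
            if (!build.compilesAndPassesTests(patch)) {              // 2. verify correctness
                continue;
            }
            double speedup = benchmarks.speedupOverBaseline(patch);  // 3. measure performance
            if (speedup >= minSpeedup) {
                vetted.add(patch);                                   // 4. hand off to a human reviewer
            }
        }
        return vetted;
    }
}
```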

Meera Iyer (http://edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
