TLDR: The paper investigates how to select and combine Large Language Models (LLMs) for scalable code clone detection. It screens 76 LLMs and identifies top performers such as CodeT5+110M and CuBERT. Key findings suggest that smaller embedding sizes, smaller tokenizer vocabularies, and tailored training data are advantageous: CodeT5+110M reaches 39.71% precision on a commercial dataset, double that of CodeBERT. The study also shows that ensembling multiple LLMs, with proper score normalization and aggregation, further improves precision on larger datasets, with the best ensemble reaching 46.91%.
Source code clones, which are duplicated sections of code within or across software projects, pose significant risks. These risks range from potential intellectual property violations to the introduction of unintended vulnerabilities. Detecting these clones, especially those that have diverged over time, remains a complex and challenging task for software developers and engineers.
Recently, Large Language Models (LLMs) have emerged as powerful tools with applications in various programming tasks, including code clone detection. However, with the rapid proliferation of new LLMs, two critical questions have arisen: how to select the most effective LLM for large-scale clone detection, and whether combining multiple LLMs (known as ensembling) can further enhance their performance.
This research paper dives deep into these questions, starting from a pool of 76 LLMs. Through a rigorous filtering process, this pool was narrowed down to 9 candidates suitable for large-scale code clone detection. These candidates were then evaluated on the public BigCloneBench benchmark and on a commercial large-scale dataset to assess their effectiveness.
The evaluation revealed that no single LLM consistently outperformed all others across every scenario. However, models like CodeT5+110M, CuBERT, and SPTCode emerged as top performers. An interesting insight from the analysis was that LLMs with smaller embedding sizes, smaller tokenizer vocabularies, and those trained on tailored datasets tended to perform better. For instance, on a commercial large-scale dataset, CodeT5+110M achieved an impressive 39.71% precision, which is twice the precision of the previously used CodeBERT model.
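To make the underlying comparison concrete, the sketch below shows a typical embedding-based clone scoring setup: a pretrained code encoder turns each fragment into a vector, and cosine similarity between vectors serves as the clone score. The model name, mean-pooling strategy, and scoring details here are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: embedding-based clone candidate scoring with a code LLM.
# The model name and pooling strategy are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # any encoder from the study could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(code: str) -> torch.Tensor:
    """Encode a code fragment into a single vector by mean-pooling token embeddings."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

def clone_score(code_a: str, code_b: str) -> float:
    """Cosine similarity between two fragments; higher means more clone-like."""
    return torch.nn.functional.cosine_similarity(
        embed(code_a), embed(code_b), dim=0
    ).item()

if __name__ == "__main__":
    a = "def add(x, y):\n    return x + y"
    b = "def sum_two(a, b):\n    return a + b"
    print(f"similarity: {clone_score(a, b):.3f}")
```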
Beyond individual model performance, the paper also explored ensembling LLMs, combining the strengths of multiple models to improve overall effectiveness. The findings suggest that proper score normalization, and favoring aggregation methods such as ‘maximum’ or ‘sum’ over simple ‘averaging’, are crucial for successful ensembling. The gains from ensembling were statistically significant and especially pronounced on larger datasets: the best-performing ensemble reached 46.91% precision on the commercial large-scale dataset, surpassing every individual LLM.
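A minimal sketch of this kind of score-level ensembling is shown below, assuming each model produces a similarity score per candidate pair, scores are min-max normalized per model, and then combined with ‘maximum’, ‘sum’, or plain averaging. The min-max normalization and the placeholder scores are assumptions for illustration.

```python
# Minimal sketch of score-level ensembling: normalize each model's similarity
# scores, then aggregate per candidate pair. The normalization choice and the
# example scores are illustrative assumptions.
from typing import Dict, List

def min_max_normalize(scores: List[float]) -> List[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble(per_model_scores: Dict[str, List[float]], method: str = "max") -> List[float]:
    """Combine scores from several models for the same list of candidate pairs."""
    normalized = [min_max_normalize(scores) for scores in per_model_scores.values()]
    combined = []
    for pair_scores in zip(*normalized):
        if method == "max":
            combined.append(max(pair_scores))
        elif method == "sum":
            combined.append(sum(pair_scores))
        else:  # plain averaging, reported as the weaker choice
            combined.append(sum(pair_scores) / len(pair_scores))
    return combined

# Example: three candidate pairs scored by two hypothetical models.
scores = {
    "codet5p_110m": [0.91, 0.42, 0.77],
    "cubert":       [0.65, 0.30, 0.80],
}
print(ensemble(scores, method="max"))
```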
The study highlights several key contributions to the field. It provides empirical evidence that LLM performance is highly dependent on the dataset, architecture, training, and tokenizer vocabulary. For example, CodeT5+110M excelled on datasets with smaller clone classes, while CodeT5 and StarEncoder performed better on benchmarks with larger clone classes. The research also introduced a Borda count aggregation method for fair model comparison, identifying CuBERT as a top-performing and stable model. Furthermore, the in-situ evaluation on a private industrial dataset demonstrated that real-world outcomes can differ from public benchmarks, but ensembling still provides a significant efficacy gain.
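The Borda count idea can be illustrated with a short sketch: each dataset contributes a ranking of the models, each model earns points according to its rank, and the point totals give an overall ordering. The per-dataset rankings below are hypothetical placeholders, not the paper's actual results.

```python
# Minimal sketch of Borda count aggregation across datasets: each dataset ranks
# the models, a model earns (n_models - 1 - rank) points per dataset, and the
# totals give an overall ordering. The rankings below are made-up placeholders.
from collections import defaultdict
from typing import Dict, List, Tuple

def borda_count(rankings: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """rankings maps dataset name -> models ordered from best to worst."""
    points = defaultdict(int)
    for ranked_models in rankings.values():
        n = len(ranked_models)
        for rank, model in enumerate(ranked_models):
            points[model] += n - 1 - rank  # best model gets n-1 points, worst gets 0
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-dataset orderings (not the paper's actual results).
rankings = {
    "benchmark_A": ["CodeT5+110M", "CuBERT", "SPTCode"],
    "benchmark_B": ["CuBERT", "SPTCode", "CodeT5+110M"],
    "commercial":  ["CuBERT", "CodeT5+110M", "SPTCode"],
}
print(borda_count(rankings))
```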
In conclusion, this research provides valuable guidance for selecting and combining LLMs for scalable code clone detection. It emphasizes that a more compact LLM, with a focus on high-quality and diverse training data, can be more advantageous than simply increasing model size. Moreover, strategic ensembling, particularly with appropriate normalization and aggregation techniques, offers a powerful way to boost detection performance, especially in large and complex codebases. For more detailed information, you can refer to the full research paper: Selecting and Combining Large Language Models for Scalable Code Clone Detection.