TLDR: The paper investigates how to select and combine Large Language Models (LLMs) for scalable code clone detection. It screens 76 LLMs and identifies top performers such as CodeT5+110M and CuBERT. Key findings suggest that smaller embedding sizes, smaller tokenizer vocabularies, and tailored training data are advantageous: CodeT5+110M reaches 39.71% precision on a commercial dataset, double that of CodeBERT. The study also shows that ensembling multiple LLMs, with proper score normalization and aggregation, further improves precision on larger datasets, with the best ensemble reaching 46.91%.
Source code clones, which are duplicated sections of code within or across software projects, pose significant risks. These risks range from potential intellectual property violations to the introduction of unintended vulnerabilities. Detecting these clones, especially those that have diverged over time, remains a complex and challenging task for software developers and engineers.
Recently, Large Language Models (LLMs) have emerged as powerful tools with applications in various programming tasks, including code clone detection. However, with the rapid proliferation of new LLMs, two critical questions have arisen: how to select the most effective LLM for large-scale clone detection, and whether combining multiple LLMs (known as ensembling) can further enhance their performance.
This research paper dives deep into these questions, starting from a pool of 76 LLMs. Through a rigorous filtering process, this pool was narrowed down to 9 candidates suitable for large-scale code clone detection. These candidates were then evaluated on the public BigCloneBench benchmark and on a commercial large-scale dataset to assess their effectiveness.
The evaluation revealed that no single LLM consistently outperformed all others across every scenario. However, models like CodeT5+110M, CuBERT, and SPTCode emerged as top performers. An interesting insight from the analysis was that LLMs with smaller embedding sizes, smaller tokenizer vocabularies, and those trained on tailored datasets tended to perform better. For instance, on a commercial large-scale dataset, CodeT5+110M achieved an impressive 39.71% precision, which is twice the precision of the previously used CodeBERT model.
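To make the underlying comparison concrete, the sketch below shows a typical embedding-based clone scoring setup: a pretrained code encoder turns each fragment into a vector, and cosine similarity between vectors serves as the clone score. The model name, mean-pooling strategy, and scoring details here are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: embedding-based clone candidate scoring with a code LLM.
# The model name and pooling strategy are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # any encoder from the study could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(code: str) -> torch.Tensor:
    """Encode a code fragment into a single vector by mean-pooling token embeddings."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

def clone_score(code_a: str, code_b: str) -> float:
    """Cosine similarity between two fragments; higher means more clone-like."""
    return torch.nn.functional.cosine_similarity(
        embed(code_a), embed(code_b), dim=0
    ).item()

if __name__ == "__main__":
    a = "def add(x, y):\n    return x + y"
    b = "def sum_two(a, b):\n    return a + b"
    print(f"similarity: {clone_score(a, b):.3f}")
```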
Beyond individual model performance, the paper also explored ensembling LLMs, combining the strengths of multiple models to improve overall effectiveness. The findings suggest that proper score normalization, and favoring aggregation methods such as ‘maximum’ or ‘sum’ over simple ‘averaging’, are crucial for successful ensembling. The gains from ensembling were statistically significant and especially pronounced on larger datasets: the best-performing ensemble reached 46.91% precision on the commercial large-scale dataset, surpassing every individual LLM.
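A minimal sketch of this kind of score-level ensembling is shown below, assuming each model produces a similarity score per candidate pair, scores are min-max normalized per model, and then combined with ‘maximum’, ‘sum’, or plain averaging. The min-max normalization and the placeholder scores are assumptions for illustration.

```python
# Minimal sketch of score-level ensembling: normalize each model's similarity
# scores, then aggregate per candidate pair. The normalization choice and the
# example scores are illustrative assumptions.
from typing import Dict, List

def min_max_normalize(scores: List[float]) -> List[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble(per_model_scores: Dict[str, List[float]], method: str = "max") -> List[float]:
    """Combine scores from several models for the same list of candidate pairs."""
    normalized = [min_max_normalize(scores) for scores in per_model_scores.values()]
    combined = []
    for pair_scores in zip(*normalized):
        if method == "max":
            combined.append(max(pair_scores))
        elif method == "sum":
            combined.append(sum(pair_scores))
        else:  # plain averaging, reported as the weaker choice
            combined.append(sum(pair_scores) / len(pair_scores))
    return combined

# Example: three candidate pairs scored by two hypothetical models.
scores = {
    "codet5p_110m": [0.91, 0.42, 0.77],
    "cubert":       [0.65, 0.30, 0.80],
}
print(ensemble(scores, method="max"))
```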
The study highlights several key contributions to the field. It provides empirical evidence that LLM performance is highly dependent on the dataset, architecture, training, and tokenizer vocabulary. For example, CodeT5+110M excelled on datasets with smaller clone classes, while CodeT5 and StarEncoder performed better on benchmarks with larger clone classes. The research also introduced a Borda count aggregation method for fair model comparison, identifying CuBERT as a top-performing and stable model. Furthermore, the in-situ evaluation on a private industrial dataset demonstrated that real-world outcomes can differ from public benchmarks, but ensembling still provides a significant efficacy gain.
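The Borda count idea can be illustrated with a short sketch: each dataset contributes a ranking of the models, each model earns points according to its rank, and the point totals give an overall ordering. The per-dataset rankings below are hypothetical placeholders, not the paper's actual results.

```python
# Minimal sketch of Borda count aggregation across datasets: each dataset ranks
# the models, a model earns (n_models - 1 - rank) points per dataset, and the
# totals give an overall ordering. The rankings below are made-up placeholders.
from collections import defaultdict
from typing import Dict, List, Tuple

def borda_count(rankings: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """rankings maps dataset name -> models ordered from best to worst."""
    points = defaultdict(int)
    for ranked_models in rankings.values():
        n = len(ranked_models)
        for rank, model in enumerate(ranked_models):
            points[model] += n - 1 - rank  # best model gets n-1 points, worst gets 0
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-dataset orderings (not the paper's actual results).
rankings = {
    "benchmark_A": ["CodeT5+110M", "CuBERT", "SPTCode"],
    "benchmark_B": ["CuBERT", "SPTCode", "CodeT5+110M"],
    "commercial":  ["CuBERT", "CodeT5+110M", "SPTCode"],
}
print(borda_count(rankings))
```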
In conclusion, this research provides valuable guidance for selecting and combining LLMs for scalable code clone detection. It emphasizes that a more compact LLM, with a focus on high-quality and diverse training data, can be more advantageous than simply increasing model size. Moreover, strategic ensembling, particularly with appropriate normalization and aggregation techniques, offers a powerful way to boost detection performance, especially in large and complex codebases. For more detailed information, you can refer to the full research paper: Selecting and Combining Large Language Models for Scalable Code Clone Detection.