
Boosting Code Translation with Automated Snippet Data and Two-Stage Training

TLDR: This research introduces an automated, LLM-driven method to augment snippet-alignment (SA) data for code translation, addressing the scarcity of fine-grained parallel corpora. It proposes a pipeline that uses LLMs to insert comments into source programs and rewrite target programs to align snippets, followed by a two-stage training strategy (Program-Alignment data first, then Snippet-Alignment data). Experiments show that this approach consistently improves code translation model performance, achieving up to a 3.78% gain, particularly for static programming languages, and demonstrates the superior quality of LLM-augmented SA data over manually constructed datasets.

Code translation, the process of converting code from one programming language to another, is a vital task in software development. It helps in migrating older systems, refactoring code, and enabling cross-platform development. While Large Language Models (LLMs) have shown great promise in this area, their effectiveness heavily relies on the availability of high-quality parallel corpora – datasets of code snippets or programs translated between languages.

These parallel corpora come in two main forms: Program-Alignment (PA) data and Snippet-Alignment (SA) data. PA data consists of entire programs aligned across languages, offering complete context for semantic understanding. However, its length can make it difficult for models to learn fine-grained details. SA data, on the other hand, comprises shorter, aligned code snippets, which are excellent for teaching models precise syntactic and fine-grained semantic alignments.

A significant challenge in code translation research is the scarcity of these parallel corpora, especially SA data. Existing data augmentation methods have primarily focused on PA data, leaving a gap in augmenting SA data. This lack of fine-grained training signals can lead to syntactic or subtle semantic errors in translated code.

A Novel Approach to Data Augmentation

Researchers have introduced an innovative, automated method to generate SA data using LLMs, effectively bridging this gap. This new pipeline takes existing PA data and transforms it into valuable SA data. The process involves three key stages (a simplified sketch of the full pipeline follows the list):

1. Comment Insertion: An LLM is used to analyze a source program and insert comments at strategic points. These comments act as natural separators, breaking down the longer program into logical snippets. The goal is to insert enough comments to ensure each resulting snippet is not excessively long.

2. Comment-Based Program Rewriting: With the commented source program and the original target program as input, the LLM then rewrites the target program. The crucial aspect here is that the rewritten target program must preserve the exact content and order of the comments from the source program. This ensures that the snippets in both languages are aligned based on these comments.

3. Split and Match: In the final stage, the commented source and target programs are split into individual snippets using the comments as delimiters. These corresponding snippets are then matched to create the new SA data pairs. Any discrepancies in comment count or content between the parallel programs lead to that pair being discarded, ensuring high data quality.
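To make the three stages concrete, here is a minimal Python sketch of the pipeline under a few assumptions: `llm` stands in for whatever completion interface is used, the prompt strings are illustrative rather than the paper's actual prompts, and the alignment comments are assumed to be single-line `//` comments (as in C++ and Java).

```python
from typing import Callable, Optional

# Illustrative prompts (assumptions, not the paper's actual prompt templates).
COMMENT_PROMPT = (
    "Insert comments into the following {lang} program so that every "
    "logical snippet is preceded by a short comment:\n{code}"
)
REWRITE_PROMPT = (
    "Rewrite the following {tgt_lang} program so that it contains exactly "
    "the same comments, in the same order, as the commented {src_lang} "
    "program below.\n\nCommented source:\n{src}\n\nTarget:\n{tgt}"
)

def split_on_comments(code: str, comment_token: str = "//") -> list[str]:
    """Split a program into snippets, using single-line comments as delimiters."""
    snippets, current = [], []
    for line in code.splitlines():
        if line.strip().startswith(comment_token) and current:
            snippets.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        snippets.append("\n".join(current))
    return snippets

def extract_comments(code: str, comment_token: str = "//") -> list[str]:
    """Collect the comment lines that act as snippet separators."""
    return [l.strip() for l in code.splitlines() if l.strip().startswith(comment_token)]

def augment_pair(
    src_prog: str,
    tgt_prog: str,
    llm: Callable[[str], str],   # any completion function: prompt text -> generated text
    src_lang: str = "C++",
    tgt_lang: str = "Java",
) -> Optional[list[tuple[str, str]]]:
    """Turn one PA pair into SA pairs, or return None if alignment fails."""
    # Stage 1: comment insertion on the source program.
    commented_src = llm(COMMENT_PROMPT.format(lang=src_lang, code=src_prog))

    # Stage 2: comment-based rewriting of the target program.
    commented_tgt = llm(REWRITE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang, src=commented_src, tgt=tgt_prog))

    # Stage 3: split and match; discard the pair on any comment mismatch.
    src_comments = extract_comments(commented_src)
    tgt_comments = extract_comments(commented_tgt)
    if not src_comments or src_comments != tgt_comments:
        return None
    src_snippets = split_on_comments(commented_src)
    tgt_snippets = split_on_comments(commented_tgt)
    if len(src_snippets) != len(tgt_snippets):
        return None
    return list(zip(src_snippets, tgt_snippets))
```

The filtering in stage 3 mirrors the quality check described above: if the rewritten target does not reproduce the source comments exactly (same count, content, and order), the whole pair is dropped rather than risk a misaligned snippet pair.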

Two-Stage Training for Enhanced Performance

To make the most of both PA and the newly augmented SA data, a simple yet effective two-stage training strategy, called ‘2-Stage-PS’, was proposed. This method involves training the code translation model sequentially: first on PA data for one epoch, followed by training on SA data for another epoch. This approach guides the model to first grasp the broader semantic context from PA data and then refine its understanding with the fine-grained details provided by SA data.
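In practice, 2-Stage-PS amounts to two sequential fine-tuning passes over the same model. The sketch below assumes a Hugging-Face-style causal LM whose forward pass returns a loss, and the optimizer, learning rate, and batch size are placeholders rather than the paper's settings.

```python
import torch
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, optimizer, collate_fn, batch_size=8):
    """One pass over a parallel corpus (PA or SA) with a standard LM loss."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    model.train()
    for batch in loader:
        outputs = model(**batch)   # assumes the collate_fn yields model inputs and the model returns .loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def two_stage_ps(model, pa_dataset, sa_dataset, collate_fn, lr=2e-5):
    """2-Stage-PS: one epoch on program-aligned data, then one epoch on snippet-aligned data."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    train_one_epoch(model, pa_dataset, optimizer, collate_fn)  # Stage 1: coarse-grained PA
    train_one_epoch(model, sa_dataset, optimizer, collate_fn)  # Stage 2: fine-grained SA
    return model
```

The key design choice is simply the ordering: coarse-grained PA examples first, fine-grained SA examples second, rather than mixing the two corpora in a single pass.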

Experimental Validation and Key Findings

Experiments were conducted using powerful open-source code LLMs like DeepSeek-Coder-Instruct and Qwen2.5-Coder-Instruct, and evaluated on the TransCoder-test dataset. The results were compelling:

  • The 2-Stage-PS method consistently outperformed models trained solely on PA data, achieving an average performance gain of up to 3.78% on the pass@k metric, which measures whether translated programs are functionally equivalent to the originals (the standard estimator is sketched after this list).
  • This improvement was particularly notable for translations involving static programming languages like C++ and Java (X2C and X2J translations). Dynamic languages like Python (X2P translations) showed less consistent improvement, possibly due to their more flexible syntax rules, which might benefit less from the fine-grained syntactic alignment learned from SA data.
  • Further analysis revealed that the order of training matters. The ‘PA then SA’ (PS) sequence consistently yielded the best results, highlighting the importance of learning coarse-grained information before moving to fine-grained details.
  • The LLM-augmented SA data proved to be of high quality, with a usability rate of 97.2%. Surprisingly, this augmented data even surpassed manually constructed SA datasets (like XLCoST-Snippet) in effectiveness, despite being smaller in quantity. This suggests that the LLM-generated snippets, by preserving semantic integrity and having a moderate length, offer superior training signals.
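For reference, pass@k is typically computed with the standard unbiased estimator: generate n candidate translations per problem, count the c candidates that pass the functional tests, and average 1 - C(n-c, k) / C(n, k) over all problems. A minimal sketch (the paper's exact evaluation protocol is not restated here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c of them pass."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 translations sampled for a problem, 3 pass the tests, k = 1
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```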

Conclusion

This research marks a significant step forward in addressing the data scarcity bottleneck in code translation. By introducing an automated, LLM-driven pipeline for SA data augmentation and a strategic two-stage training approach, models can now acquire more comprehensive alignment knowledge. This work demonstrates the immense potential of leveraging LLMs not just for translation, but also for intelligently augmenting training data, leading to more robust and accurate code translation systems. You can read the full research paper here.
