
Boosting Code Translation with Automated Snippet Data and Two-Stage Training

TLDR: This research introduces an automated, LLM-driven method to augment snippet-alignment (SA) data for code translation, addressing the scarcity of fine-grained parallel corpora. It proposes a pipeline that uses LLMs to insert comments into source programs and rewrite target programs to align snippets, followed by a two-stage training strategy (Program-Alignment data first, then Snippet-Alignment data). Experiments show that this approach consistently improves code translation model performance, achieving up to a 3.78% gain, particularly for static programming languages, and demonstrates the superior quality of LLM-augmented SA data over manually constructed datasets.

Code translation, the process of converting code from one programming language to another, is a vital task in software development. It helps in migrating older systems, refactoring code, and enabling cross-platform development. While Large Language Models (LLMs) have shown great promise in this area, their effectiveness heavily relies on the availability of high-quality parallel corpora – datasets of code snippets or programs translated between languages.

These parallel corpora come in two main forms: Program-Alignment (PA) data and Snippet-Alignment (SA) data. PA data consists of entire programs aligned across languages, offering complete context for semantic understanding. However, its length can make it difficult for models to learn fine-grained details. SA data, on the other hand, comprises shorter, aligned code snippets, which are excellent for teaching models precise syntactic and fine-grained semantic alignments.

A significant challenge in code translation research is the scarcity of these parallel corpora, especially SA data. Existing data augmentation methods have primarily focused on PA data, leaving a gap in augmenting SA data. This lack of fine-grained training signals can lead to syntactic or subtle semantic errors in translated code.

A Novel Approach to Data Augmentation

Researchers have introduced an innovative, automated method to generate SA data using LLMs, effectively bridging this gap. This new pipeline takes existing PA data and transforms it into valuable SA data. The process involves three key stages (a simplified sketch of the full pipeline follows the list):

1. Comment Insertion: An LLM is used to analyze a source program and insert comments at strategic points. These comments act as natural separators, breaking down the longer program into logical snippets. The goal is to insert enough comments to ensure each resulting snippet is not excessively long.

2. Comment-Based Program Rewriting: With the commented source program and the original target program as input, the LLM then rewrites the target program. The crucial aspect here is that the rewritten target program must preserve the exact content and order of the comments from the source program. This ensures that the snippets in both languages are aligned based on these comments.

3. Split and Match: In the final stage, the commented source and target programs are split into individual snippets using the comments as delimiters. These corresponding snippets are then matched to create the new SA data pairs. Any discrepancies in comment count or content between the parallel programs lead to that pair being discarded, ensuring high data quality.
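To make the three stages concrete, here is a minimal Python sketch of the pipeline under a few assumptions: `llm` stands in for whatever completion interface is used, the prompt strings are illustrative rather than the paper's actual prompts, and the alignment comments are assumed to be single-line `//` comments (as in C++ and Java).

```python
from typing import Callable, Optional

# Illustrative prompts (assumptions, not the paper's actual prompt templates).
COMMENT_PROMPT = (
    "Insert comments into the following {lang} program so that every "
    "logical snippet is preceded by a short comment:\n{code}"
)
REWRITE_PROMPT = (
    "Rewrite the following {tgt_lang} program so that it contains exactly "
    "the same comments, in the same order, as the commented {src_lang} "
    "program below.\n\nCommented source:\n{src}\n\nTarget:\n{tgt}"
)

def split_on_comments(code: str, comment_token: str = "//") -> list[str]:
    """Split a program into snippets, using single-line comments as delimiters."""
    snippets, current = [], []
    for line in code.splitlines():
        if line.strip().startswith(comment_token) and current:
            snippets.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        snippets.append("\n".join(current))
    return snippets

def extract_comments(code: str, comment_token: str = "//") -> list[str]:
    """Collect the comment lines that act as snippet separators."""
    return [l.strip() for l in code.splitlines() if l.strip().startswith(comment_token)]

def augment_pair(
    src_prog: str,
    tgt_prog: str,
    llm: Callable[[str], str],   # any completion function: prompt text -> generated text
    src_lang: str = "C++",
    tgt_lang: str = "Java",
) -> Optional[list[tuple[str, str]]]:
    """Turn one PA pair into SA pairs, or return None if alignment fails."""
    # Stage 1: comment insertion on the source program.
    commented_src = llm(COMMENT_PROMPT.format(lang=src_lang, code=src_prog))

    # Stage 2: comment-based rewriting of the target program.
    commented_tgt = llm(REWRITE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang, src=commented_src, tgt=tgt_prog))

    # Stage 3: split and match; discard the pair on any comment mismatch.
    src_comments = extract_comments(commented_src)
    tgt_comments = extract_comments(commented_tgt)
    if not src_comments or src_comments != tgt_comments:
        return None
    src_snippets = split_on_comments(commented_src)
    tgt_snippets = split_on_comments(commented_tgt)
    if len(src_snippets) != len(tgt_snippets):
        return None
    return list(zip(src_snippets, tgt_snippets))
```

The filtering in stage 3 mirrors the quality check described above: if the rewritten target does not reproduce the source comments exactly (same count, content, and order), the whole pair is dropped rather than risk a misaligned snippet pair.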

Two-Stage Training for Enhanced Performance

To make the most of both PA and the newly augmented SA data, a simple yet effective two-stage training strategy, called ‘2-Stage-PS’, was proposed. This method involves training the code translation model sequentially: first on PA data for one epoch, followed by training on SA data for another epoch. This approach guides the model to first grasp the broader semantic context from PA data and then refine its understanding with the fine-grained details provided by SA data.
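In practice, 2-Stage-PS amounts to two sequential fine-tuning passes over the same model. The sketch below assumes a Hugging-Face-style causal LM whose forward pass returns a loss, and the optimizer, learning rate, and batch size are placeholders rather than the paper's settings.

```python
import torch
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, optimizer, collate_fn, batch_size=8):
    """One pass over a parallel corpus (PA or SA) with a standard LM loss."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    model.train()
    for batch in loader:
        outputs = model(**batch)   # assumes the collate_fn yields model inputs and the model returns .loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def two_stage_ps(model, pa_dataset, sa_dataset, collate_fn, lr=2e-5):
    """2-Stage-PS: one epoch on program-aligned data, then one epoch on snippet-aligned data."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    train_one_epoch(model, pa_dataset, optimizer, collate_fn)  # Stage 1: coarse-grained PA
    train_one_epoch(model, sa_dataset, optimizer, collate_fn)  # Stage 2: fine-grained SA
    return model
```

The key design choice is simply the ordering: coarse-grained PA examples first, fine-grained SA examples second, rather than mixing the two corpora in a single pass.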

Experimental Validation and Key Findings

Experiments were conducted using powerful open-source code LLMs like DeepSeek-Coder-Instruct and Qwen2.5-Coder-Instruct, and evaluated on the TransCoder-test dataset. The results were compelling:

  • The 2-Stage-PS method consistently outperformed models trained solely on PA data, achieving an average performance gain of up to 3.78% on the pass@k metric, which measures whether translated programs are functionally equivalent to the originals (the standard estimator is sketched after this list).
  • This improvement was particularly notable for translations involving static programming languages like C++ and Java (X2C and X2J translations). Dynamic languages like Python (X2P translations) showed less consistent improvement, possibly due to their more flexible syntax rules, which might benefit less from the fine-grained syntactic alignment learned from SA data.
  • Further analysis revealed that the order of training matters. The ‘PA then SA’ (PS) sequence consistently yielded the best results, highlighting the importance of learning coarse-grained information before moving to fine-grained details.
  • The LLM-augmented SA data proved to be of high quality, with a usability rate of 97.2%. Surprisingly, this augmented data even surpassed manually constructed SA datasets (like XLCoST-Snippet) in effectiveness, despite being smaller in quantity. This suggests that the LLM-generated snippets, by preserving semantic integrity and having a moderate length, offer superior training signals.
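For reference, pass@k is typically computed with the standard unbiased estimator: generate n candidate translations per problem, count the c candidates that pass the functional tests, and average 1 - C(n-c, k) / C(n, k) over all problems. A minimal sketch (the paper's exact evaluation protocol is not restated here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c of them pass."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 translations sampled for a problem, 3 pass the tests, k = 1
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```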

Conclusion

This research marks a significant step forward in addressing the data scarcity bottleneck in code translation. By introducing an automated, LLM-driven pipeline for SA data augmentation and a strategic two-stage training approach, models can now acquire more comprehensive alignment knowledge. This work demonstrates the immense potential of leveraging LLMs not just for translation, but also for intelligently augmenting training data, leading to more robust and accurate code translation systems. You can read the full research paper here.
