A very common application of Natural Language Processing is Machine Translation, a field that investigates the use of software to translate text from one language to another. The applications most of us have seen, such as Google Translate and similar apps, translate from one natural language to another.
Recently my college professor assigned us a task: we had to write a program in C, but I, unaware of that, wrote it in Python. Upon learning that C was required, I had to translate the entire program from Python to C, and I have never written a C program as fast as I wrote that one. As I was writing, I thought to myself, “I wish there was a way to just convert this to C code.”
Apparently there is, and companies like Facebook have been using transcoders to do it.
Facebook has released an open-source library named TransCoder, an entirely self-supervised neural transcompiler system that makes code migration more efficient.
TransCoder builds a sequence-to-sequence (seq2seq) model with attention, composed of an encoder and a decoder with a transformer architecture. It uses a single shared model, based in part on Facebook AI’s previous work on XLM, for all programming languages.
The striking feature of TransCoder is that it does not require parallel data for training. Unlike traditional supervised models, a self-supervised model does not need the same code written in both the source and target languages. That solves a major problem, because such paired source and target code simply doesn’t exist for language pairs such as COBOL and C++.
In order to better understand the mechanisms behind TransCoder, one must first consider the different elements of translation, translator programs, and transcompilers.
When I first delved into the topic, I realised that migrating a codebase from an old programming language such as COBOL to a modern alternative like Java, C++, or Python is a very difficult yet essential task. It requires a lot of expertise and human effort. I learned that there are transcompilers available. A transcompiler is a translator that converts between programming languages that operate at a similar level of abstraction. Using a transcompiler and manually adjusting the output source code may be a faster and cheaper solution than rewriting the entire codebase from scratch. The limitation is that these tools are still rule-based code translators. The resulting translations often lack readability, fail to respect the target language's conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive.
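To make the "rule-based" limitation concrete, here is a minimal sketch of what such a translator might look like for a tiny subset of Python. Everything in it (the `RULES` table, `translate_line`, `translate`) is hypothetical and written only for illustration. It is not how any real transcompiler or TransCoder works.

```python
import re

# Hypothetical rewrite rules for a tiny subset of Python: (pattern, replacement).
# Each rule only fires when a whole line matches its pattern exactly.
RULES = [
    # print("text")  ->  printf("text\n");
    (re.compile(r'^(\s*)print\("([^"]*)"\)$'), r'\1printf("\2\\n");'),
    # name = 123     ->  int name = 123;
    (re.compile(r'^(\s*)(\w+) = (\d+)$'), r'\1int \2 = \3;'),
    # name += 123    ->  name += 123;
    (re.compile(r'^(\s*)(\w+) \+= (\d+)$'), r'\1\2 += \3;'),
]


def translate_line(line):
    """Translate one line of Python to C using the first matching rule."""
    for pattern, replacement in RULES:
        if pattern.match(line):
            return pattern.sub(replacement, line)
    # No rule applies: leave a marker for a human to fix by hand.
    return "/* TODO: no rule for: " + line.strip() + " */"


def translate(source):
    """Translate a multi-line Python snippet line by line."""
    return "\n".join(translate_line(line) for line in source.splitlines())


python_src = 'count = 3\ncount += 2\nprint("done")\ntotal = sum(values)'
print(translate(python_src))
```

The first three lines translate cleanly, but `total = sum(values)` matches no rule and falls through to a TODO comment. Every construct the rule authors did not anticipate ends up as manual work, which is why the approach is expensive.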
Converting between two object-oriented languages like Java and C++ is incredibly complex, even though both have C-like syntax. The translator program would need thorough knowledge of the standard libraries of both languages to know where their behaviour differs.
Figuring out how to convert a construct in the first language to a construct in the second is another challenge. Mapping an object in C++ to an object in Java is fine, but what about C++ structs? Or functions defined outside of C++ classes?
Programming languages are similar to natural languages in many ways, and natural language translation has been studied extensively. Yet programming languages also have a distinct structure, which makes it harder to use the same tools for translation. For instance, an RNN-based sequence generator, which can easily generate phrases in a natural language, finds it difficult to generate long, syntactically correct programs.
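A small sketch of why syntactic correctness is so unforgiving: a sentence with one missing word is usually still readable, but a program with one missing token does not parse at all. The helper `is_valid_python` below is a name I made up for illustration. It uses Python's standard `ast` module as a stand-in syntax checker and has nothing to do with how a neural model is trained or evaluated.

```python
import ast


def is_valid_python(source):
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# A correct two-line program.
program = "def add(a, b):\n    return (a + b)\n"

# The same program with a single closing parenthesis dropped,
# the kind of slip a sequence generator can make on long outputs.
broken = "def add(a, b):\n    return (a + b\n"

print(is_valid_python(program))
print(is_valid_python(broken))
```

The longer the generated program, the more places there are for one such slip, and any one of them makes the whole output unusable.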
Translating source code from one Turing-complete language to another is always possible in theory. Unfortunately, building a translator is difficult in practice: different languages can have different syntax and rely on different platform APIs and standard-library functions.
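As one small illustration of behaviour that differs even when the syntax looks almost identical, consider integer division. Python's `//` rounds toward negative infinity, while C's `/` on integers truncates toward zero (since C99). The helpers `c_style_div` and `c_style_mod` below are hypothetical names that emulate the C behaviour in Python, so the two can be compared side by side.

```python
def c_style_div(a, b):
    """Emulate C integer division, which truncates toward zero (C99 onward)."""
    quotient = abs(a) // abs(b)
    # The result is negative only when the operands have different signs.
    return quotient if (a < 0) == (b < 0) else -quotient


def c_style_mod(a, b):
    """Emulate C's % operator, defined so that (a / b) * b + a % b == a."""
    return a - b * c_style_div(a, b)


# Python semantics versus emulated C semantics for the same operands.
print(-7 // 2, c_style_div(-7, 2))
print(-7 % 2, c_style_mod(-7, 2))
```

A translator that copies `-7 // 2` over to C as `-7 / 2` produces code that compiles and runs but computes a different number, so matching the syntax is not enough.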
To sum up, these advances in machine translation for programming languages would not only make code migration easier but could also help people who don’t have the time or cannot afford courses to learn to program in multiple languages, and maybe even help students like me complete assignments. However, we still have a long way to go.