Document Type : Research paper
Abstract
In recent years, increasingly capable LLMs have been trained through various regimes that typically rely on self-supervised language modelling objectives such as next-token prediction or span corruption. In parallel, machine translation (MT) systems depend on cross-lingual supervision, requiring aligned data between source and target language pairs; such parallel corpora are scarce for ELVs. To address these challenges, we develop Continuous Translation Pretraining (CTP), a novel framework that maps languages into a continuous shared space through reliable, constrained language mappings. We show that models pretrained with both self-supervised language modelling and supervised machine translation objectives perform significantly better on translation tasks across the board, and particularly well on low-resource language pairs. Extensive experiments on several language pairs demonstrate substantial gains in both zero-shot and fine-tuned settings, with improvements of up to 4.5 BLEU points over traditional methods. The proposed framework enables improvement for novel language varieties without vast parallel corpora, which is especially advantageous for under-resourced languages and dialects. Our contributions include an in-depth description of the architecture, its training procedure and applications, and a novel evaluation framework tailored to low-resource language settings.