Supervised Project
Generation of simplified texts
Loria - Synalp
Claire Gardent
Category
URL
Don't put volumes
Summary
NLP Task:
This article aims to provide a robust model for summarizing documents written in an unknown, low-resource language into a known language, or vice versa.
Training Data:
The training data for this model is the New York Times summarization corpus, which contains 670K articles. They translate 112k articles from the NYT corpus into each of the three low-resource languages (Somali, Swahili, and Tagalog) and then back into English. Each noisy round-trip article is then paired with its clean original version.
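The round-trip "noising" step above can be sketched as follows. This is a minimal illustration, not the paper's code: `translate` is a hypothetical stand-in for a real machine translation system, and the language codes are assumptions.

```python
# Sketch of round-trip translation for creating noisy training pairs.
# `translate` is a placeholder for a real MT system (hypothetical).

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: a real system would run an MT model here.
    # We simulate translation by tagging the text with the language pair.
    return f"[{src}->{tgt}] {text}"

def make_noisy_pair(article: str, summary: str, low_resource_lang: str):
    """Round-trip an English article through a low-resource language,
    then pair the noisy English text with the clean reference summary."""
    forward = translate(article, "en", low_resource_lang)
    noisy_english = translate(forward, low_resource_lang, "en")
    return noisy_english, summary

# One noisy corpus per low-resource language (ISO codes assumed).
pairs = [
    make_noisy_pair("Some NYT article text.", "Clean reference summary.", lang)
    for lang in ("so", "sw", "tl")  # Somali, Swahili, Tagalog
]
print(pairs[0][0])
```

The key point is that the summaries stay clean: only the source side of each pair is degraded, so the model learns to summarize from noisy, MT-like input.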
Models Description:
There are two models in this paper. The machine translation model was trained on a mix of clean, human-curated parallel data (about 23k sentences for Somali and Swahili, 51k for Tagalog) and data created from monolingual sources including news articles, the Common Crawl, and Wikipedia (250-600k sentences).
The summarization model starts from pre-training on the unmodified NYT corpus to obtain a baseline system; they then use each of the three noisy English corpora to further train that baseline.
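The two-stage schedule (pre-train on clean NYT, then continue training on each noisy corpus) can be sketched like this. The `train` function and corpus names are illustrative assumptions, not the paper's actual training code.

```python
# Sketch of the two-stage training schedule: pre-train on the clean NYT
# corpus, then fine-tune one system per noisy low-resource-style corpus.

def train(model: dict, corpus: str, epochs: int) -> dict:
    # Placeholder training loop: just record which corpora were seen.
    history = model.get("history", [])
    model["history"] = history + [corpus] * epochs
    return model

# Stage 1: baseline system trained on the unmodified NYT corpus.
baseline = train({"name": "summarizer"}, "nyt_clean", epochs=2)

# Stage 2: one robust system per language, continued from the baseline.
systems = {
    lang: train(dict(baseline), f"nyt_noisy_{lang}", epochs=1)
    for lang in ("somali", "swahili", "tagalog")
}
print(systems["somali"]["history"])
```

Fine-tuning on noise that matches real MT output is what makes each system robust to the errors it will see at test time.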
Key Contributions:
They provide a robust system for the task of cross-lingual summarization of low-resource languages.
Results:
They use an article in a language unseen during training (here, Arabic) to obtain results from the model.
Update