NLG Task:
Previous work studies reply suggestion only in English, so this paper presents MRS, a Multilingual Reply Suggestion dataset covering ten languages.
The task is to build a generation and a retrieval model as baselines for MRS.
MRS is publicly available at https://github.com/zhangmozhi/mrs
Training Data:
The data used to investigate reply suggestion comes from MRS, a dataset built from publicly available Reddit threads. Message-reply pairs, response sets, and machine-translated examples are extracted for ten languages.
For each language, 80% of the examples are used for training, 10% for validation, and 10% for testing. Response sets are then created for the retrieval models.
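As a rough illustration of the split and the response-set construction, here is a minimal Python sketch; the frequency-based selection of responses is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import random
from collections import Counter

def split_and_build_response_set(pairs, n_responses=1000, seed=0):
    """pairs: list of (message, reply) tuples for one language."""
    random.seed(seed)
    random.shuffle(pairs)
    n = len(pairs)
    train = pairs[: int(0.8 * n)]
    valid = pairs[int(0.8 * n): int(0.9 * n)]
    test = pairs[int(0.9 * n):]

    # Assumed heuristic: use the most frequent training replies as the
    # fixed response set for the retrieval model.
    reply_counts = Counter(reply for _, reply in train)
    response_set = [reply for reply, _ in reply_counts.most_common(n_responses)]
    return train, valid, test, response_set
```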
Model Description
Retrieval Model -> Selects the reply from a predetermined response set. It is easier to train, runs faster, and a curated response set guarantees the coherence and safety of the model output.
Generation Model -> Produces replies from scratch and is more powerful because it is not constrained by the response set.
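To make the retrieval side concrete, here is a minimal sketch of retrieval-style reply suggestion: embed the message, score every candidate in the fixed response set, and return the best one. The embedding model is a generic placeholder, not the paper's specific encoder.

```python
import numpy as np

def suggest_reply(message_vec, response_vecs, responses):
    """Pick the response whose embedding scores highest against the message.

    message_vec: (d,) embedding of the incoming message.
    response_vecs: (n, d) pre-computed embeddings of the fixed response set.
    responses: list of n response strings.
    """
    scores = response_vecs @ message_vec        # dot-product relevance scores
    return responses[int(np.argmax(scores))]
```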
Key Contributions:
This paper demonstrates that the generation model beats the retrieval model in nearly every language, producing higher relevance scores and more diverse replies.
However, unlike the retrieval model, the generation model fails to generalize across languages in the zero-shot setting, despite using Unicoder-XDAE for initialization: it "forgets" the multilingual knowledge acquired during pre-training. This suggests that reply suggestion poses unique challenges for cross-lingual transfer learning.
Results:
NLG Task
The task is to use a resource-rich monolingual Abstractive Sentence Summarization (ASSUM) system to teach a low-resource cross-lingual summarization system, on both summary word generation and attention.
https://github.com/KelleyYin/Cross-lingual-Summarization
Training Data
A Chinese-to-English summarization system is used, which takes a Chinese sentence as input and outputs an English abstractive summary. Evaluation sets for this task are built by manually translating the English sentences of existing English evaluation sets into Chinese inputs. The Gigaword corpus and DUC-2004 are also used as additional English data, for testing only. Data is further collected from the Chinese microblogging website Sina Weibo, with 2.4M sentence-summary pairs for training and 725 pairs for testing.
Model Description
A Transformer is employed. Six layers are stacked in both the encoder and the decoder, the dimensions of the embedding vectors and all hidden vectors are set to 512, and eight heads are used in the multi-head attention. The source embedding, the target embedding, and the linear sublayer are shared in the teacher networks but not in the student networks. Byte-pair encoding is employed with a vocabulary of about 32k tokens on the English and Chinese sides respectively.
Genuine summaries are paired with generated pseudo sources to train the cross-lingual summarisation system. They use a teacher-student framework in which the monolingual summarisation system is the teacher and the cross-lingual summarisation system is the student. The teacher makes the student imitate both the summary word distribution and the attention weights of the teacher network.
They use a back-translation procedure that generates pseudo source sentences paired with the true summaries to build a training corpus for the cross-lingual ASSUM.
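A minimal PyTorch sketch of the teacher-student objective described above: the student is trained to match both the teacher's summary word distribution (via KL divergence) and its attention weights (via mean squared error). The weighting terms are illustrative, and the alignment between teacher and student attention (which attend over different source sentences) is glossed over here.

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits,
                         student_attn, teacher_attn,
                         alpha=1.0, beta=1.0):
    """Distill the teacher's word distribution and attention into the student.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    student_attn, teacher_attn:     (batch, tgt_len, src_len) attention weights
    """
    # KL divergence between output word distributions (teacher is fixed).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    # MSE between attention weight matrices.
    attn_mse = F.mse_loss(student_attn, teacher_attn.detach())
    return alpha * kl + beta * attn_mse
```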
Key Contribution
Proposed a new loss function over the generative probability distribution and the attention. It performs significantly better (around 2 points on ROUGE-1 and ROUGE-2) than several baselines, and significantly reduces the performance gap between the cross-lingual ASSUM and the monolingual ASSUM on the benchmark datasets.
Results
Gigaword: ROUGE-1 30.1, ROUGE-2 12.2, ROUGE-L 27.7
DUC-2004: ROUGE-1 26.0, ROUGE-2 8.0, ROUGE-L 23.1
NLP Task:
The task is to summarise an article in a target language. This replaces the traditional pipelines (summarise-then-translate or translate-then-summarise) with direct cross-lingual summarisation.
Training Data:
The experiments are based on the NCLS dataset, which contains paired data between English and Chinese: English articles with Chinese summaries and Chinese articles with English summaries.
The cross-lingual data was generated using a machine translation model.
For pre-training data, they used Wikipedia dumps, with 83 million sentences for English and 20 million sentences for Chinese.
Model Description:
The model has 6 layers and 8 heads. The input and output dimensions of all Transformer blocks are 512 and the inner (feed-forward) dimension is 2048. The vocabulary has a size of 33,000, built from a balanced mix of the monolingual Wikipedia corpora. The model has approximately 61M parameters.
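The stated configuration maps directly onto a standard Transformer; a minimal PyTorch instantiation with these hyperparameters (a generic stand-in, not the authors' code):

```python
import torch.nn as nn

# 6-layer encoder/decoder, 8 heads, d_model 512, FFN 2048, ~33k vocabulary
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
)
embedding = nn.Embedding(num_embeddings=33_000, embedding_dim=512)
output_proj = nn.Linear(512, 33_000)
```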
Key Contributions:
They provide a pre-trained model for English and Chinese that outperforms traditional methods.
Results:
NLP Task:
This paper aims to provide a robust model for summarizing an unknown, low-resource language into a known language, or vice versa.
Training Data:
The training data for this model is the New York Times summarization corpus, which has 670K articles. They translate 112k articles from the NYT corpus into each of three low-resource languages (Somali, Swahili, and Tagalog) and then back into English, and pair the noisy round-trip output with the clean original text.
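A hedged sketch of how such noisy/clean training pairs could be constructed by round-trip translation; translate() is a hypothetical stand-in for the MT systems used, not a real library call.

```python
def make_noisy_clean_pairs(articles, translate, languages=("som", "swa", "tgl")):
    """Create (noisy_english, clean_english) training pairs via round-trip MT.

    translate(text, src, tgt) is a hypothetical MT call wrapping whatever
    translation models are available; it is not a real library API.
    """
    pairs = []
    for article in articles:
        for lang in languages:
            low_resource = translate(article, src="eng", tgt=lang)
            noisy_english = translate(low_resource, src=lang, tgt="eng")
            pairs.append((noisy_english, article))
    return pairs
```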
Models Description:
There are two models in this paper. The machine translation model was trained on a mix of clean, human-curated parallel data (about 23k sentences for Somali and Swahili and 51k for Tagalog) and data created from monolingual sources including news articles, the Common Crawl, and Wikipedia (250-600k sentences).
Key Contributions:
They provide a robust system for cross-lingual summarization of low-resource languages.
Results:
They also ran the model on articles in another, unseen language (Arabic) to obtain results.
NLG Task
Cross-lingual summarization from English to Chinese.
Training Data
Datasets:
– Gigaword
– DUC2004
– LCSTS
– CNN/DM
Model Description
A Transformer architecture is used with ScaleNorm normalisation. The mapper is a linear layer with a hidden size of 512, and the discriminator is a two-layer linear network with a hidden size of 2048. Several loss functions are defined.
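A minimal PyTorch sketch matching the stated sizes: a linear mapper with hidden size 512 and a two-layer discriminator with hidden size 2048. The activation and input dimension are assumptions.

```python
import torch.nn as nn

HIDDEN = 512  # representation size assumed to match the mapper's hidden size

# Linear mapper that projects representations from one language space to another.
mapper = nn.Linear(HIDDEN, HIDDEN)

# Two-layer discriminator that tries to tell which language a (mapped)
# representation came from; adversarial losses push the mapper to fool it.
discriminator = nn.Sequential(
    nn.Linear(HIDDEN, 2048),
    nn.LeakyReLU(),
    nn.Linear(2048, 1),
)
```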
Key Contribution
The first unsupervised cross-lingual summarisation method.
NLG Task:
This paper describes a neural machine translation system that uses a single shared encoder for the input and language-specific decoders.
Training Data:
They use monolingual news corpora from WMT 2014 for English, German, and French, and from WMT 2013 for Spanish, totalling more than 1.3 billion tokens across the four languages. They do not use parallel data.
Model Description:
They train monolingual embeddings with fastText using the skip-gram model and a vector dimension of 300, and then use the MUSE cross-lingual embedding mapping toolkit to map every language into the English embedding space. The model uses an embedding dimension of 300, a hidden dimension of 600, and a vocabulary size of 50K.
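A hedged sketch of the embedding step with the fasttext Python package (the corpus path is a placeholder); the subsequent MUSE alignment is a separate command-line step and is only indicated in a comment.

```python
import fasttext

# Train 300-dimensional skip-gram embeddings on a monolingual corpus
# ("corpus.de.txt" is a placeholder path).
model = fasttext.train_unsupervised("corpus.de.txt", model="skipgram", dim=300)
model.save_model("embeddings.de.bin")

# The resulting vectors would then be aligned to the English space with the
# MUSE toolkit (https://github.com/facebookresearch/MUSE), a separate
# command-line step not shown here.
```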
Key Contribution:
This outperforms separately trained bilingual models for all translation directions.
Result:
NLG Task:
This paper proposes cross-lingual transfer from English to distant languages.
Training Data:
The dataset used is the Universal Dependency treebanks, where English is the only labeled source. Ten distant languages and ten nearby languages are selected as targets.
Model Description:
To establish connections between the source and target languages, they employ two instances of a structured flow model, a source model and a target model, and share parameters between them.
Key Contribution:
Their model is effective at transferring to distant languages and outperforms other solutions.
Result:
NLG Task
Produce texts in multiple languages that are accurate realisations of the abstract semantic information given in the input (meaning representation, MR).
Training Data
– WOZ 2.0 English, German, Italian
– MultiWOZ + CrossWOZ English, Chinese
– WebNLG 17 / WebNLG 20 English, Russian
Model Description
They adapt the universal encoder-decoder framework. The input and output are first delexicalised using pretrained language-independent embeddings and (optionally) ordered. The multilingual generation model is trained on the delexicalised training data, and the output is relexicalised using automatic value post-editing to ensure that the values fit the context. To match MR values with the corresponding words in the text, the system maps MR values to n-grams based on the similarity of their representations; specifically, it calculates the similarity between a value v and all word n-grams w_i ... w_j in the text.
The adoption of generic placeholders creates problems for relexicalisation, as it becomes unclear which input value should replace which placeholder. They address this by ordering the model's input based on the graph formed by its RDF triples, again following Trisedya et al. (2018): they traverse every edge in the graph, starting from the node with the fewest incoming edges (or randomly in case of ties), and then visit all nodes via breadth-first search (BFS).
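A minimal sketch of the value-to-n-gram matching described above: compare the embedding of an MR value v against the mean-pooled embedding of every word n-gram w_i ... w_j in the text and relexicalise with the best-scoring span. The embedding source and pooling choice are assumptions.

```python
import numpy as np

def best_ngram_for_value(value_vec, token_vecs, tokens, max_n=5):
    """Return the text span whose mean-pooled embedding is most similar
    (by cosine similarity) to the MR value embedding."""
    best_span, best_sim = None, -1.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_n, len(tokens)) + 1):
            span_vec = token_vecs[i:j].mean(axis=0)
            sim = float(
                span_vec @ value_vec
                / (np.linalg.norm(span_vec) * np.linalg.norm(value_vec) + 1e-9)
            )
            if sim > best_sim:
                best_span, best_sim = " ".join(tokens[i:j]), sim
    return best_span, best_sim
```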
Key Contribution
Overcomes the problem that delexicalisation requires input values to appear verbatim in the output text. Achieves state-of-the-art results, with improvements of up to 29 BLEU points over competitive baselines on unseen cases.
NLG Task
Generating text in twenty-one different languages from structured data or Abstract Meaning Representation (AMR).
Training Data
The original corpus is the AMR datasets from the LDC, available at https://amr.isi.edu/download.html. From this, training data for multilingual AMR-to-text models is prepared by taking the EUROPARL multilingual corpus and automatically annotating its English data with AMRs using the JAMR semantic parser.
For encoder pretraining on silver AMRs, they take thirty million sentences from the English portion of CCNet.
Model Description
For encoding, a large Transformer is used, combining cross-lingual word embeddings with encoder and decoder pretraining. The graph embedding provides additional information to the Transformer encoder by encoding the depth of each node in the rooted graph and the subgraph each node belongs to.
The translation models used to generate into different languages are Transformer Big models trained with LayerDrop for 100k updates on public benchmark data from WMT.
To create a one-to-many multilingual model, they add a language embedding on the encoder side to allow the decoder to distinguish which language to generate into.
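A minimal PyTorch sketch of the graph-aware input described above: depth and subgraph embeddings concatenated to the (summed) word and position embeddings before the Transformer encoder. Dimensions and vocabulary sizes are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class GraphAwareEmbedding(nn.Module):
    """Concatenate graph-structure embeddings to word + position embeddings."""

    def __init__(self, vocab=32_000, d_word=512, d_graph=64,
                 max_len=512, max_depth=32, max_subgraphs=64):
        super().__init__()
        self.word = nn.Embedding(vocab, d_word)
        self.pos = nn.Embedding(max_len, d_word)
        self.depth = nn.Embedding(max_depth, d_graph)          # node depth in the rooted AMR graph
        self.subgraph = nn.Embedding(max_subgraphs, d_graph)   # which subgraph the node belongs to

    def forward(self, tokens, depths, subgraphs):
        # tokens, depths, subgraphs: (batch, seq_len) integer tensors
        positions = torch.arange(tokens.size(1), device=tokens.device)
        base = self.word(tokens) + self.pos(positions)          # (batch, seq, d_word)
        extra = torch.cat([self.depth(depths), self.subgraph(subgraphs)], dim=-1)
        return torch.cat([base, extra], dim=-1)                 # fed to the Transformer encoder
```

A language embedding for the one-to-many setting could be concatenated on the encoder side in the same way.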
Key Contribution
Additional embeddings are concatenated to the word and position embeddings. Such information allows the Transformer Encoder to capture some graph structure information, while still modeling a sequence.
A low-resource language like Estonian sees a 2.3 BLEU point improvement from multilingual modeling.
NLG Task
Introduce a large-scale multilingual summarization corpus.
Training Data
The dataset is built from online news outlets and contains over 1.5M article-summary pairs in five languages (French, German, Spanish, Russian, and Turkish), which complement an already established summarization dataset in English.
Model Description
Key Contribution
A very large multilingual summarization corpus (over 1.5M article-summary pairs).
NLG Task:
The task is to provide a pre-trained T5 model (mT5) covering 101 languages.
Training Data:
The data comes from mC4, a multilingual version of the Common Crawl-based C4 corpus, covering more than 100 languages. About 1 trillion tokens, roughly 450 GB of text drawn from the Common Crawl web space, are used for pre-training. For non-English data, the cld3 model is used to determine the language of each page.
Model Description:
A Transformer is employed, based on the T5 architecture, which spans model sizes from 60 million to 11 billion parameters. The vocabulary consists of 250,000 wordpieces. The model is pre-trained for 1 million steps on batches of 1024 length-1024 input sequences.
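Since the released checkpoints are on the Hugging Face hub, a minimal usage sketch (checkpoint name and input are illustrative, and mT5 would normally be fine-tuned on a downstream task before use):

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# mT5 is released as a pre-trained LM; this only illustrates the
# loading and generation API, not a fine-tuned system.
inputs = tokenizer("A short example input.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```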
Key Contribution:
This paper demonstrates that the T5 approach extends to the multilingual setting and performs well. It also shows how illegal predictions can be detected and removed automatically.
All the code and models used can be found on their GitHub: https://github.com/google-research/multilingual-t5
Result:
Benchmark scores for each model compared to the mT5 variants:
| Model | XNLI | PAWS-X | WikiAnn-NER | XQuAD | MLQA | TyDiQA-GoldP |
|---|---|---|---|---|---|---|
| mBERT | 65.4 | 81.9 | 62.2 | 64.5 | 61.4 | 59.7 |
| XLM | 69.1 | 80.9 | 61.2 | 59.8 | 48.5 | 43.6 |
| InfoXLM | 81.4 | - | - | - | 73.6 | - |
| X-STILTs | 80.4 | 87.7 | 64.7 | 77.2 | 72.3 | 76.0 |
| XLM-R | 79.2 | 86.4 | 65.4 | 76.6 | 71.6 | 65.1 |
| VECO | 79.9 | 88.7 | 65.7 | 77.3 | 71.7 | 67.6 |
| RemBERT | 80.8 | 87.5 | 70.1 | 79.6 | 73.1 | 77.0 |
| mT5-Small | 67.5 | 82.4 | 50.5 | 58.1 | 54.6 | 36.4 |
| mT5-Base | 75.4 | 86.4 | 55.7 | 67.0 | 64.6 | 59.1 |
| mT5-Large | 81.1 | 88.9 | 58.5 | 77.8 | 71.2 | 68.4 |
| mT5-XL | 82.9 | 89.6 | 65.5 | 79.5 | 73.5 | 77.8 |
| mT5-XXL | 85.0 | 90.0 | 69.2 | 82.5 | 76.0 | 82.0 |
NLG Task:
This paper describes Language Agnostic Delexicalisation (LAD), an approach for multilingual RDF-to-text generation.
Training Data:
The training data is provided by the organizers of the WebNLG Challenge 2020. The Russian part of the data was created by translating the English data with a machine translation system, post-editing it via crowdsourcing, and then spell-checking it. The data can be found at https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0
Model Description:
A Transformer architecture is used, implemented with the fairseq toolkit. The model has two encoders and one decoder, with 4 layers, a hidden size of 256, and a feed-forward size of 3072. It is trained with 0.4 dropout and 0.1 attention dropout.
Key Contribution:
The LAD approach outperforms other solutions for English, but could not be properly tested for Russian because there was not enough data.
Result:
NLG Task:
They base their approach on the mBART model, which is pre-trained for multilingual denoising. This allows them to use a simple, identical, end-to-end setup for both English and Russian.
Training Data:
WebNLG2020 dataset
The data contains sets of RDF triples extracted from DBpedia, accompanied by verbalizations crowdsourced from human annotators.
Git link: https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0
Model Description:
They fine-tune the pre-trained mBART model (Liu et al., 2020) on the provided training data individually for each language (English and Russian). They then feed tokenized and trivially linearized input RDF triples into the model and train it to output the ground-truth references.
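A hedged sketch of this setup with Hugging Face Transformers: triples are trivially linearized into a flat string and mBART is fine-tuned to produce the reference text. The checkpoint, delimiters, and example triple are illustrative assumptions, not necessarily the authors' exact choices.

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
model.train()

def linearize(triples):
    """Trivially linearize RDF triples, e.g. [(s, p, o), ...] -> one string."""
    return " ".join(f"{s} | {p} | {o}" for s, p, o in triples)

# Illustrative WebNLG-style example.
triples = [("Alan_Bean", "birthPlace", "Wheeler,_Texas")]
reference = "Alan Bean was born in Wheeler, Texas."

batch = tokenizer(linearize(triples), text_target=reference, return_tensors="pt")
loss = model(**batch).loss   # standard seq2seq cross-entropy
loss.backward()              # one fine-tuning step would follow with an optimizer
```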
Key Contribution:
In automatic metrics, their solution placed in the top third of the field (out of 35 submissions) for English and first or second (out of 12 submissions) for Russian. In human evaluation, it scored in the best or second-best system cluster. The authors believe that their approach, with its extreme simplicity, can serve as a benchmark for the trade-off between output quality and setup complexity.
Results:
NLG Task:
WebNLG+ offers two challenges:
(i) mapping sets of RDF triples to English or Russian text (generation) and
(ii) converting English or Russian text to sets of RDF triples (semantic parsing); 15 groups participated.
Training Data:
The English challenge data uses version 3.0 of the WebNLG corpus (Gardent et al., 2017a). The Russian WebNLG data was translated from English WebNLG for nine DBpedia categories: Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University.
Model Description:
Blinov (2020) focuses on generation into Russian. They used the pre-trained Russian GPT language model (Radford et al., 2019), augmented with a classification head and fine-tuned on the WebNLG+ RDF-to-Russian dataset. The author experimented with various sampling methods and with data augmentation; for the latter, they use the Baidu SKE dataset (194,747 RDF/Chinese text pairs) and automatically translate its text part into Russian.
Key Contribution:
Neural vs. rule-based approaches: rule-based models seem to automatically generate text comparable in quality with human texts in terms of adequacy, i.e., the generated texts express exactly the communicative goals contained in the input triple set. On the other hand, novel neural approaches produce text comparable to human texts in terms of fluency.
Results:
NLG Task:
Describes the findings of the Third Workshop on Neural Generation and Translation (2019).
Participants were tasked with creating neural machine translation (NMT) systems that were both accurate and efficient, and with developing document generation and translation (DGT) systems that generate summaries from structured data, potentially with assistance from text in another language.
Training Data:
A subset of the RotoWire dataset is used as the training data, accompanied by professional German translations that are sentence-aligned to the original English articles. The resulting parallel dataset is called the RotoWire English-German dataset.
Model Description:
The workshop provided a forum for research on applications of neural models to machine translation and other language generation tasks, including summarization (Rush et al., 2015), NLG from structured data (Wen et al., 2015), and dialog response generation (Vinyals and Le, 2015).
Key Contribution:
In the NLG track, they observed a clear difference between the constrained and unconstrained settings. The NLE group's approach showed that pre-training the document-level generation model on news corpora is effective even if the source input differs. Among the constrained systems, all teams except Team EdiNLG used the Transformer, but the results did not show any noticeable improvement compared to EdiNLG.
Results:
This paper summarized the results of the Third Workshop on Neural Generation and Translation, where a number of research advances were presented. In particular, this year introduced a new document generation and translation task that tested the efficacy of systems for both translation and generation in a single testbed.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.