Category : Corpus for Multilingual Generation

  • URL : https://aclanthology.org/2021.acl-long.97/
  • Title : A Dataset and Baselines for Multilingual Reply Suggestion
  • Venue : ACL | IJCNLP
  • Publication year : 2021
  • Author : Mozhi Zhang, Wei Wang, Budhaditya Deb, Guoqing Zheng, Milad Shokouhi, Ahmed Hassan Awadallah
  • Category : Corpus for Multilingual Generation
Summary :

NLG Task:

Previous work studies reply suggestion only for English, so this paper presents MRS, a Multilingual Reply Suggestion dataset covering ten languages.
The task is to build a generation model and a retrieval model as baselines for MRS.
MRS is publicly available at https://github.com/zhangmozhi/mrs

Training Data:

The data used to investigate reply suggestion comes from MRS, which is built from publicly available Reddit threads. Message-reply pairs, response sets, and machine-translated examples are extracted for ten languages.
For each language, 80% of the examples are used for training, 10% for validation, and 10% for testing. Response sets are then created for the retrieval models.
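
As a concrete illustration of the split above, here is a minimal sketch; the variable names and the fixed seed are illustrative assumptions, not taken from the MRS code:

    import random

    def split_pairs(pairs, seed=13):
        """Shuffle message-reply pairs, then split them 80/10/10 into train/dev/test."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n_train = int(0.8 * len(pairs))
        n_dev = int(0.1 * len(pairs))
        return (pairs[:n_train],
                pairs[n_train:n_train + n_dev],
                pairs[n_train + n_dev:])

    # Hypothetical usage for one language of MRS:
    # train, dev, test = split_pairs(english_message_reply_pairs)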

Model Description

Retrieval Model -> Selects the reply from a predetermined response set. It is easier to train, runs faster, and a curated response set guarantees the coherence and safety of the model output.

Generation Model -> Produces replies from scratch and is more powerful because it is not constrained by the response set.
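
To make the contrast concrete, a minimal sketch of the two baseline types is given below. The encoders, the seq2seq model, and the response set are generic placeholders rather than the actual MRS baselines, which the paper initialises from pretrained multilingual models:

    import torch
    import torch.nn.functional as F

    def retrieval_reply(message_vec, response_vecs, responses):
        """Retrieval baseline: score a fixed, curated response set and return the best entry.
        message_vec: (d,) encoding of the message; response_vecs: (N, d); responses: N strings."""
        scores = F.cosine_similarity(message_vec.unsqueeze(0), response_vecs, dim=-1)
        return responses[int(scores.argmax())]

    def generation_reply(model, tokenizer, message, max_new_tokens=32):
        """Generation baseline: decode a reply token by token, unconstrained by any response set.
        `model` is any seq2seq model exposing a HuggingFace-style .generate() method."""
        inputs = tokenizer(message, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)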

Key Contributions:

This paper demonstrates that the generation model beats the retrieval model in nearly every language, achieving higher relevance scores and more diverse replies.

However, unlike the retrieval model, the generation model fails to generalize across languages in the zero-shot setting, despite using Unicoder-XDAE for initialization. The generation model "forgets" the multilingual knowledge acquired during pre-training. This suggests that reply suggestion poses unique challenges for cross-lingual transfer learning.

Results:


Category : Cross-lingual Summarisation

  • URL : https://aclanthology.org/P19-1305/
  • Title : Zero-Shot Cross-Lingual Abstractive Sentence Summarization through Teaching Generation and Attention
  • Venue : ACL
  • Publication year : 2019
  • Author : Xiangyu Duan, Mingming Yin, Min Zhang, Boxing Chen, Weihua Luo
  • Category : Cross-lingual Summarisation
Summary :

NLG Task

The task is to use a resource-rich monolingual Abstractive Sentence Summarization (ASSUM) system to teach a low-resource cross-lingual summarization system both summary word generation and attention. Code: https://github.com/KelleyYin/Cross-lingual-Summarization

Training Data

A Chinese-to-English summarization system is built, which takes a Chinese sentence as input and outputs an English abstractive summary. Evaluation sets for this task are built by manually translating the English sentences of the existing English evaluation sets into Chinese inputs. The Gigaword corpus and DUC-2004 are also used as additional English datasets for testing only. Data is further collected from the Chinese microblogging website Sina Weibo, with 2.4M sentence-summary pairs for training and 725 pairs for testing.

Model Description

A Transformer is employed. Six layers are stacked in both the encoder and the decoder, and the dimensions of the embedding vectors and all hidden vectors are set to 512. Eight heads are used in the multi-head attention. The source embedding, the target embedding, and the linear sublayer are shared in the teacher network but not in the student network. Byte-pair encoding is employed, with vocabularies of about 32k tokens on the English and Chinese sides respectively.

Genuine summaries are paired with generated pseudo sources to train the cross-lingual summarization system. They use a teacher-student framework in which the monolingual summarization system is the teacher and the cross-lingual summarization system is the student. The teacher makes the student mimic both the summary word distribution and the attention weights of the teacher network. A back-translation procedure generates pseudo source sentences paired with the true summaries to build a training corpus for the cross-lingual ASSUM.
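
A rough sketch of the two teaching signals described above, assuming the teacher and student already produce per-step vocabulary distributions and attention matrices of matching shapes; the exact formulation and weighting in the paper may differ:

    import torch.nn.functional as F

    def teacher_student_loss(student_logits, teacher_probs,
                             student_attn, teacher_attn, attn_weight=1.0):
        """Make the student match the teacher's summary word distribution (KL divergence)
        and its attention weights (mean squared error).
        student_logits/teacher_probs: (T, V); student_attn/teacher_attn: (T, S)."""
        word_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                             teacher_probs, reduction="batchmean")
        attn_loss = F.mse_loss(student_attn, teacher_attn)
        return word_loss + attn_weight * attn_loss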


Key Contribution
They propose a new loss function over the generative probability distribution and the attention. It performs significantly better (around 2 points on ROUGE-1 and ROUGE-2) than several baselines, and significantly reduces the performance gap between the cross-lingual ASSUM and the monolingual ASSUM on the benchmark datasets.

Results

Gigaword : ROUGE-1 30.1, ROUGE-2 12.2, ROUGE-L 27.7
DUC-2004 : ROUGE-1 26.0, ROUGE-2 8.0, ROUGE-L 23.1


  • URL : https://aclanthology.org/2020.aacl-main.53
  • Title : Mixed-Lingual Pre-training for Cross-lingual Summarization
  • Venue : AACL
  • Publication year : 2020
  • Author : Ruochen Xu, Chenguang Zhu, Yu Shi, Michael Zeng, Xuedong Huang
  • Category : Cross-lingual Summarisation
Summary :

NLG Task:

The task is to summarise an article written in one language into a target language. This replaces the traditional summarise-then-translate or translate-then-summarise pipelines with direct cross-lingual summarisation.

Training Data:

The experiments are based on the NCLS dataset, which contains paired data between English and Chinese: English articles with Chinese summaries and Chinese articles with English summaries.
The cross-lingual data was generated using a machine translation model.
For pre-training data, they use Wikipedia dumps, with 83 million sentences for English and 20 million sentences for Chinese.

Model Description:

The model has 6 layers and 8 heads. The input and output dimensions of all Transformer blocks are 512 and the inner dimension is 2048. The vocabulary has a size of 33,000, drawn from a balanced mix of the monolingual Wikipedia corpora. The model has approximately 61M parameters.

Key Contributions:

They provide a pre-trained model for English and Chinese that outperforms the traditional pipeline methods.

Results:



  • URL : https://aclanthology.org/N19-1204
  • Title : A Robust Abstractive System for Cross-Lingual Summarization
  • Venue : NAACL
  • Publication year : 2019
  • Author : Jessica Ouyang, Boya Song, Kathy McKeown
  • Category : Cross-lingual Summarisation
Summary :

NLG Task:

This article aims to provide a robust model for summarizing text from an unknown, low-resource language into a known language, or vice versa.

Training Data:

The training data for this model is the New York Times summarisation corpus, which contains 670K articles. They translate 112k articles from the NYT corpus into each of the three low-resource languages (Somali, Swahili, and Tagalog) and then back into English, and pair the resulting noisy English articles with the original clean summaries.
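
Schematically, the noisy training corpus can be produced with a round-trip translation step like the one below; `translate` stands in for the authors' MT systems and is only a hypothetical interface:

    def make_noisy_example(article, summary, pivot_lang, translate):
        """Round-trip an English article through a low-resource language (Somali, Swahili
        or Tagalog) and pair the noisy English that comes back with the clean summary."""
        pivot_text = translate(article, src="en", tgt=pivot_lang)        # simulate low-resource input
        noisy_english = translate(pivot_text, src=pivot_lang, tgt="en")  # noisy MT output
        return noisy_english, summary   # (noisy source, clean target) pair for training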

Model Description:

There are two models in this paper. The machine translation model was trained on a mix of clean, human-curated parallel data (about 23k sentences for Somali and Swahili and 51k for Tagalog) and data created from monolingual sources including news articles, the Common Crawl, and Wikipedia (250-600k sentences). The summarization model starts from pre-training on the unmodified NYT corpus to obtain a baseline system; each of the three noisy English corpora is then used to train the baseline system further.

Key Contributions:

They provide a robust system for the task of cross-lingual summarization of low-resource languages.

Results:

They use an article in an unseen language (here, Arabic) to obtain results from the model.


  • URL : https://aclanthology.org/2020.acl-main.554
  • Title : Jointly Learning to Align and Summarize for Neural Cross-Lingual Summarization
  • Venue : ACL
  • Publication year : 2020
  • Author : Yue Cao, Hui Liu, Xiaojun Wan
  • Category : Cross-lingual Summarisation
Summary :


NLG Task

Cross-lingual summarization from English to Chinese.

Training Data

– Datasets: Gigaword, DUC2004, LCSTS, CNN/DailyMail

Model Description

A Transformer architecture is used with ScaleNorm normalisation. The mapper is a linear layer with a hidden size of 512, and the discriminator is a two-layer feed-forward network with a hidden size of 2048. Several loss functions are defined and optimised jointly.
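
A minimal sketch of the two auxiliary modules mentioned above, with the sizes taken from this summary; the input dimension, the activation, and the exact wiring are assumptions:

    import torch.nn as nn

    # Maps encoder representations from one language's space to the other's.
    mapper = nn.Linear(512, 512)

    # Two-layer discriminator with an inner size of 2048, e.g. judging which language
    # a representation comes from so the two spaces can be aligned adversarially.
    discriminator = nn.Sequential(
        nn.Linear(512, 2048),
        nn.ReLU(),
        nn.Linear(2048, 1),
    )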

Key Contribution

First unsupervised cross-lingual summarisation approach.


Category : Generating interlingual representations

  • URL : https://aclanthology.org/P19-1297/
  • Title : Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders
  • Venue : ACL
  • Publication year : 2019
  • Author : Sukanta Sen, Kamal Kumar Gupta, Asif Ekbal, Pushpak Bhattacharyya
  • Category : Generating interlingual representations
Summary :

NLG Task:

This article describes an unsupervised neural machine translation system that uses a single shared encoder for the input and a language-specific decoder for each output language.

Training Data:

They use monolingual news corpora from WMT 2014 for English, German, and French, and from WMT 2013 for Spanish. This represents more than 1.3 billion tokens across the four languages. No parallel data is used.

Model Description:

They train monolingual embeddings with FastText using the skip-gram model with a vector dimension of 300, then use the MUSE cross-lingual embedding mapping tool to map every language into the English embedding space. The translation model uses an embedding dimension of 300, a hidden dimension of 600, and a vocabulary size of 50K.
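
Structurally, the setup can be sketched as one shared encoder plus a dictionary of per-language decoders; the `encoder` and `decoders` objects below are generic stand-ins, not the authors' implementation:

    import torch.nn as nn

    class SharedEncoderNMT(nn.Module):
        """Unsupervised multilingual NMT skeleton: one shared encoder over
        cross-lingually mapped embeddings, one decoder per target language."""
        def __init__(self, encoder, decoders):
            super().__init__()
            self.encoder = encoder                   # shared across all languages
            self.decoders = nn.ModuleDict(decoders)  # e.g. {"en": ..., "de": ..., "fr": ..., "es": ...}

        def forward(self, src_tokens, tgt_lang, tgt_tokens):
            hidden = self.encoder(src_tokens)        # language-agnostic (interlingual) representation
            return self.decoders[tgt_lang](tgt_tokens, hidden)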

Key Contribution:

This outperforms the separately trained bilingual models for all translation directions.

Result:



Category : Generative models for distant languages

  • URL : https://aclanthology.org/P19-1311/
  • Title : Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections
  • Venue : ACL
  • Publication year : 2019
  • Author : Junxian He, Zhisong Zhang, Taylor Berg-Kirkpatrick, Graham Neubig
  • Category : Generative models for distant languages
Summary :

NLG Task:

This paper proposes cross-lingual syntactic transfer from English to languages distant from English.

Training Data:

The dataset used is the Universal Dependency Treebanks, with English as the only labeled source data. Ten distant languages and ten nearby languages are selected as targets.

Model Description:

To establish connections between the source and target languages, they employ two instances of the structured flow model (a source model and a target model) and share parameters between them, adapting the invertible projections to the target language in an unsupervised way.
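
The parameter-sharing idea can be sketched as follows: the structured generative parameters are shared between the two instances, while each language keeps its own invertible projection, which is what gets adapted on unlabeled target-language text. This is only an illustration of the sharing scheme, not the actual structured flow model:

    import torch
    import torch.nn as nn

    class FlowInstance(nn.Module):
        """One model instance: a language-specific invertible (square) projection of the
        embeddings, on top of structured parameters shared with the other instance."""
        def __init__(self, shared, dim=300):
            super().__init__()
            self.projection = nn.Parameter(torch.eye(dim))  # per-language invertible projection
            self.shared = shared                            # shared structured-model parameters

    shared = nn.ParameterDict()            # e.g. transition/emission parameters (placeholder)
    source_model = FlowInstance(shared)    # trained on labeled English data
    target_model = FlowInstance(shared)    # projection adapted on unlabeled target text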

Key Contribution:

Their model is effective at transferring to distant languages and outperforms other approaches.

Result:


Category : Multilingual Concept-to-Text Generation

  • URL : https://aclanthology.org/2021.acl-long.10
  • Title : Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation
  • Venue : ACL | IJCNLP
  • Publication year : 2021
  • Author : Giulio Zhou, Gerasimos Lampouras
  • Category : Multilingual Concept-to-Text Generation
Summary :


NLG Task

Produce texts in multiple languages that are accurate realisations of the abstract semantic information given in the input (a Meaning Representation, MR).

Training Data

– WOZ 2.0: English, German, Italian
– MultiWOZ + CrossWOZ: English, Chinese
– WebNLG 2017 / WebNLG 2020: English, Russian

Model Description

They adapt the universal encoder-decoder framework. The input and output are first delexicalised using pretrained language-independent embeddings and (optionally) ordered. The multilingual generation model is trained on the delexicalised training data, and the output is relexicalised using automatic value post-editing to ensure that the values fit the context. To match the MR values with corresponding words in the text, the system maps MR values to n-grams based on the similarity of their representations: it calculates the similarity between a value v and all word n-grams w_i ... w_j in the text. The adoption of generic placeholders creates problems for relexicalisation, as it becomes unclear which input value should replace which placeholder. They address this by ordering the model's input based on the graph formed by its RDF triples, again following Trisedya et al. (2018): they traverse every edge in the graph, starting from the node with the fewest incoming edges (or randomly in case of ties), and then visit all nodes via breadth-first search (BFS).
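
The value-to-n-gram matching step can be sketched as below, assuming `embed(text)` returns a language-independent vector for a string (a stand-in for the pretrained multilingual embeddings used by the paper):

    import torch.nn.functional as F

    def best_matching_span(value, tokens, embed, max_len=5):
        """Return the (start, end) of the word n-gram most similar to an MR value;
        that span is what gets replaced by a placeholder during delexicalisation."""
        value_vec = embed(value)
        best_span, best_score = None, float("-inf")
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                score = F.cosine_similarity(value_vec, embed(" ".join(tokens[i:j])), dim=0)
                if score.item() > best_score:
                    best_span, best_score = (i, j), score.item()
        return best_span, best_score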

Key Contribution

Overcomes the requirement that input values appear verbatim in the output text for delexicalisation to work. It achieves state-of-the-art results, with improvements of up to 29 BLEU points over competitive baselines on unseen cases.


Summary :


NLG Task

Generating text in twenty-one different languages from structured data or Abstract Meaning Representation (AMR).

Training Data

The original corpus consists of the AMR datasets from the LDC, which can be found at https://amr.isi.edu/download.html. From this, training data for multilingual AMR-to-Text models is prepared by taking the EUROPARL multilingual corpus and automatically annotating the English data with AMRs using the JAMR semantic parser. For encoder pretraining on silver AMR, they take thirty million sentences from the English portion of CCNet.

Model Description

For encoding, a large Transformer is used, combining cross-lingual word embeddings with encoder and decoder pretraining. The graph embedding provides additional information to the Transformer encoder by encoding the depth of each node in the rooted graph and the subgraph each node belongs to. The translation models used to generate into different languages are Transformer Big models trained with LayerDrop for 100k updates on public benchmark data from WMT. To create a one-to-many multilingual model, they add a language embedding on the encoder side to allow the decoder to distinguish which language to generate into.
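
The way these extra signals enter the encoder can be sketched as embedding concatenation; the dimensions below are illustrative, and whether the language embedding is concatenated in exactly this way is an assumption:

    import torch
    import torch.nn as nn

    class AMRInputEmbedding(nn.Module):
        """Word + position embeddings, concatenated with graph-depth, subgraph-id and
        target-language embeddings, so the encoder sees some graph structure and the
        decoder knows which language to generate into."""
        def __init__(self, vocab, n_pos, n_depth, n_subgraph, n_lang, d_word=512, d_extra=32):
            super().__init__()
            self.word = nn.Embedding(vocab, d_word)
            self.pos = nn.Embedding(n_pos, d_word)
            self.depth = nn.Embedding(n_depth, d_extra)
            self.subgraph = nn.Embedding(n_subgraph, d_extra)
            self.lang = nn.Embedding(n_lang, d_extra)

        def forward(self, tokens, positions, depths, subgraphs, lang_id):
            # tokens/positions/depths/subgraphs: (T,) LongTensors; lang_id: 0-dim LongTensor
            extra = torch.cat([self.depth(depths),
                               self.subgraph(subgraphs),
                               self.lang(lang_id).expand(tokens.size(0), -1)], dim=-1)
            return torch.cat([self.word(tokens) + self.pos(positions), extra], dim=-1)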

Key Contribution

Additional embeddings are concatenated to the word and position embeddings. This allows the Transformer encoder to capture some graph structure information while still modeling a sequence. A low-resource language like Estonian sees a 2.3 BLEU point improvement from multilingual modeling.


  • URL : https://aclanthology.org/2021.eacl-main.30/
  • Title : Bootstrapping Multilingual AMR with Contextual Word Alignments
  • Venue : EACL
  • Publication year : 2021
  • Author : Janaki Sheth, Young-Suk Lee, Ramón Fernandez Astudillo, Tahira Naseem, Radu Florian, Salim Roukos, Todd Ward
  • Category : Multilingual Concept-to-Text Generation
Summary :


Category : Multilingual Summarization Corpus

  • URL : https://aclanthology.org/2020.emnlp-main.647
  • Title : MLSUM: The Multilingual Summarization Corpus
  • Venue : EMNLP
  • Publication year : 2020
  • Author : Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
  • Category : Multilingual Summarization Corpus
Summary :


NLG Task

Building a large-scale multilingual summarization corpus (MLSUM) to support abstractive summarization in multiple languages.

Training Data

The dataset is built from online news outlets and contains over 1.5M article-summary pairs in 5 languages (French, German, Spanish, Russian, and Turkish), which complement an already established summarization dataset in English.
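
MLSUM is also distributed via the HuggingFace `datasets` hub; a quick way to inspect it, assuming the `mlsum` dataset card still exposes one config per language ('de', 'es', 'fr', 'ru', 'tu'):

    from datasets import load_dataset

    # Each config is one language; every example has the article ("text") and its "summary".
    # Depending on the datasets version, trust_remote_code=True may be required.
    mlsum_fr = load_dataset("mlsum", "fr")
    print(mlsum_fr["train"][0]["summary"])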

Model Description



Key Contribution

A very large multilingual summarization corpus (over 1.5M article-summary pairs in five languages), enabling cross-lingual comparative analyses of summarization models.


Category : Multilingual Text-to-Text

  • URL : https://aclanthology.org/2021.naacl-main.41/
  • Title : mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
  • Venue : NAACL
  • Publication year : 2021
  • Author : Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
  • Category : Multilingual Text-to-Text
Summary :

NLG Task:

The task is to provide a pre-trained T5-style model for a multitude of languages; mT5 covers 101 languages.

Training Data:

The data comes from mC4, a multilingual variant of the Common Crawl-derived C4 corpus, covering more than 100 languages. About 1 trillion tokens of data, roughly 450 GB drawn from the Common Crawl web scrape, are used for pre-training the model. For non-English data, the cld3 model is used to determine the language of each page.

Model Description:

A Transformer is employed (the T5 encoder-decoder architecture), in sizes ranging from roughly 300 million to 13 billion parameters. The vocabulary is composed of 250,000 SentencePiece wordpieces. The model is pre-trained for 1 million steps on batches of 1024 input sequences of length 1024.
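
The released checkpoints can be loaded through HuggingFace Transformers; a minimal usage sketch is shown below (note that the raw pre-trained checkpoint is not fine-tuned on any task, so its output is only meaningful after task-specific fine-tuning):

    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

    # mT5 is text-to-text: every task is cast as mapping an input string to an output string.
    inputs = tokenizer("summarize: mT5 covers 101 languages.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))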

Key Contribution:

This paper demonstrates that the T5 text-to-text approach extends to the multilingual setting and achieves strong performance across many languages. They also describe a simple technique to prevent illegal predictions (such as "accidental translation" into the wrong language) in the zero-shot setting.
All the code and models can be found on GitHub: https://github.com/google-research/multilingual-t5

Result:

Benchmark scores for baseline models compared to the mT5 variants:

Model XNLI PAWS-X WikiAnn-NER XQuAD MLQA TyDiQA-GoldP
mBERT 65.4 81.9 62.2 64.5 61.4 59.7
XLM 69.1 80.9 61.2 59.8 48.5 43.6
InfoXLM 81.4 - - - 73.6 -
X-STILTs 80.4 87.7 64.7 77.2 72.3 76.0
XLM-R 79.2 86.4 65.4 76.6 71.6 65.1
VECO 79.9 88.7 65.7 77.3 71.7 67.6
RemBERT 80.8 87.5 70.1 79.6 73.1 77.0
mT5-Small 67.5 82.4 50.5 58.1 54.6 36.4
mT5-Base 75.4 86.4 55.7 67.0 64.6 59.1
mT5-Large 81.1 88.9 58.5 77.8 71.2 68.4
mT5-XL 82.9 89.6 65.5 79.5 73.5 77.8
mT5-XXL 85.0 90.0 69.2 82.5 76.0 82.0


Category : RDF-to-Text Generation

  • URL : https://aclanthology.org/2020.webnlg-1.20/
  • Title : Train Hard, Finetune Easy: Multilingual Denoising for RDF-to-Text Generation
  • Venue : WebNLG
  • Publication year : 2020
  • Author : Zdeněk Kasner, Ondřej Dušek
  • Category : RDF-to-Text Generation
Summary :

NLG Task:

This paper describes Language Agnostic Delexicalisation (LAD), a system for multilingual RDF-to-text generation.

Training Data:

The training data is provided by the organizers of the WebNLG Challenge 2020. The Russian part of the data was created by machine-translating the English data, post-editing it via crowdsourcing, and then spell-checking it. The data can be found here: https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0

Model Description:

A Transformer architecture is used, implemented with the fairseq toolkit. There are two encoders and one decoder, with 4 layers, a hidden size of 256, and feed-forward layers of size 3072. The model is trained with 0.4 dropout and 0.1 attention dropout.

Key Contribution:

The LAD approach outperforms other solutions for English but could not be tested on Russian because there was not enough data.

Result:


  • URL : https://aclanthology.org/2020.webnlg-1.20
  • Title : Train Hard, Finetune Easy: Multilingual Denoising for RDF-to-Text Generation
  • Venue : WebNLG
  • Publication year : 2020
  • Author : Zdeněk Kasner, Ondřej Dušek
  • Category : RDF-to-Text Generation
Summary :

NLG Task:

They base their approach on the mBART model, which is pre-trained for multilingual denoising. This allows them to use a simple, identical, end-to-end setup for both English and Russian.

Training Data:

The WebNLG 2020 dataset is used. The data contains sets of RDF triples extracted from DBpedia, accompanied by verbalizations crowdsourced from human annotators. Git link: https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0

Model Description:

They finetune the pre-trained mBART model (Liu et al., 2020) on the provided training data individually for each language (English and Russian). They then feed tokenized and trivially linearized input RDF triples into the model and train it to output the ground-truth references.
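
A sketch of the "trivially linearized" input and a fine-tuning-style forward pass with the public mBART checkpoint; the separator format and the example triple are illustrative assumptions, not the authors' exact preprocessing:

    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25",
                                               src_lang="en_XX", tgt_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    # One RDF triple set, flattened into a single string.
    triples = [("Alan_Bean", "occupation", "Test_pilot")]
    source = " | ".join(" ".join(t).replace("_", " ") for t in triples)
    reference = "Alan Bean worked as a test pilot."

    batch = tokenizer(source, text_target=reference, return_tensors="pt")
    loss = model(**batch).loss   # standard seq2seq cross-entropy against the ground-truth reference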

Key Contribution:

In automatic metrics, the solution placed in the top third of the field (out of 35 submissions) for English and first or second (out of 12 submissions) for Russian. In human evaluation, it scored in the best or second-best system cluster. The authors believe that their approach, with its extreme simplicity, can serve as a benchmark for the trade-off between output quality and setup complexity.

Results:


  • URL : https://aclanthology.org/2020.webnlg-1.7/
  • Title : The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
  • Venue : WebNLG
  • Publication year : 2020
  • Author : Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, Anastasia Shimorina
  • Category : RDF-to-Text Generation
Summary :

NLG Task:

WebNLG+ offers two challenges:
(i) mapping sets of RDF triples to English or Russian text (generation) and
(ii) converting English or Russian text to sets of RDF triples (semantic parsing) (15 groups)

Training Data:

The English challenge data uses version 3.0 of the WebNLG corpus (Gardent et al., 2017a).
Russian WebNLG was translated from English WebNLG for nine DBpedia categories: Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University

Model Description:

Blinov (2020) focuses on generation into Russian. They used the pre-trained Russian GPT language model (Radford et al., 2019) augmented with a classification head and fine-tuned on the WebNLG+ RDF-to-Russian dataset. The author experimented with various sampling methods and with data augmentation. For data augmentation, they use the Baidu SKE dataset (194,747 RDF/Chinese text pairs) and automatically translate its text part into Russian

Key Contribution:

Neural vs. rule-based approaches:
The former models seem to automatically generate text comparable in quality with human texts in terms of adequacy, i.e., the generated texts express exactly the communicative goals contained in the input tripleset. On the other hand, novel neural approaches produce text comparable to human texts in terms of fluency

Results:


Category : Summarization

  • URL : https://aclanthology.org/D19-5601/
  • Title : Findings of the Third Workshop on Neural Generation and Translation
  • Venue : NGT
  • Publication year : 2019
  • Author : Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, Katsuhito Sudoh
  • Category : Summarization
Summary :

NLG Task:

Describes the findings of the Third Workshop on Neural Generation and Translation (2019). Participants were tasked with creating neural machine translation (NMT) systems that were both accurate and efficient, and with developing document generation and translation (DGT) systems that generate summaries from structured data, potentially with assistance from text in another language.

Training Data:

A subset of the RotoWire dataset is used as the training data, accompanied by professional German translations that are sentence-aligned to the original English articles.
The resulting parallel dataset is called the RotoWire English-German dataset.

Model Description:

The workshop provided a forum for research on applications of neural models to machine translation and other language generation tasks, including summarization (Rush et al., 2015), NLG from structured data (Wen et al., 2015), and dialog response generation (Vinyals and Le, 2015), among others.

Key Contribution:

In the NLG track, there was an apparent difference between the constrained and unconstrained settings. Team NLE's approach showed that pre-training the document-level generation model on news corpora is effective even if the source input differs. Among the constrained systems, it is worth noting that all systems but Team EdiNLG used the Transformer, yet the results did not show any noticeable improvement over EdiNLG.

Results:

This paper summarized the results of the Third Workshop on Neural Generation and Translation, which saw a number of research advances. In particular, this year introduced a new document-level generation and translation task that tested the efficacy of systems for both translation and generation in a single testbed.


Category : Transformers

Summary :

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
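
The core operation of this architecture is scaled dot-product attention; a minimal reference implementation is sketched below (the masking convention is a simplification, not taken from the paper's code):

    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, the Transformer's building block."""
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v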
