Supervised Project
Generation of simplified texts | Loria - Synalp | Claire Gardent
Summary
NLG Task
Generating text in twenty-one different languages from structured data or Abstract Meaning Representations (AMR).
Training Data
The original corpus is the AMR datasets from the LDC, available at https://amr.isi.edu/download.html. From this, training data for multilingual AMR-to-Text models is prepared by taking the EUROPARL multilingual corpus and automatically annotating the English data with AMRs using the JAMR semantic parser (a sketch of this step follows below). For encoder pretraining on silver AMR, they take thirty million sentences from the English portion of CCNet.
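A minimal sketch of this silver-AMR data preparation, assuming a hypothetical `jamr_parse` wrapper around the JAMR parser command line and placeholder file paths (neither is specified in the paper):

```python
# Sketch: annotate the English side of EUROPARL with silver AMRs.
# `jamr_parse` and the file layout are illustrative assumptions.
import subprocess

def jamr_parse(sentences, jamr_script="parse.sh"):
    """Run the JAMR parser on a batch of English sentences,
    returning one linearized AMR string per sentence."""
    proc = subprocess.run(
        ["bash", jamr_script],
        input="\n".join(sentences),
        capture_output=True, text=True, check=True,
    )
    # Simplification: assume one AMR per output line.
    return proc.stdout.strip().split("\n")

def build_silver_pairs(europarl_en_path, out_path):
    """Pair each English EUROPARL sentence with its silver AMR,
    producing (AMR, sentence) training examples for AMR-to-Text."""
    with open(europarl_en_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    amrs = jamr_parse(sentences)
    with open(out_path, "w", encoding="utf-8") as out:
        for amr, sent in zip(amrs, sentences):
            out.write(f"{amr}\t{sent}\n")
```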
Model Description
For encoding, a large Transformer combining crosslingual word embeddings with encoder and decoder pretraining. The graph embedding provides additional information to the Transformer encoder by encoding the depth of each node in the rooted graph and the subgraph each node belongs to (see the sketch below). The translation models used to generate into the different languages are Transformer Big models trained with LayerDrop for 100k updates on public benchmark data from WMT. To create a one-to-many multilingual model, they add a language embedding on the encoder side so that the decoder can distinguish which language to generate into.
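A minimal PyTorch sketch of such a graph-aware input embedding, where depth and subgraph embeddings are concatenated to the word and position embeddings and projected back to the model dimension; all dimensions and the projection layer are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GraphInputEmbedding(nn.Module):
    # Assumed dimensions; the paper does not fix these here.
    def __init__(self, vocab_size, d_model=1024, d_graph=64,
                 max_len=512, max_depth=32, max_subgraphs=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # word embedding
        self.pos = nn.Embedding(max_len, d_model)      # position embedding
        # Extra structural signals per linearized-graph token:
        self.depth = nn.Embedding(max_depth, d_graph)        # node depth in the rooted graph
        self.subgraph = nn.Embedding(max_subgraphs, d_graph) # subgraph the node belongs to
        self.proj = nn.Linear(d_model + 2 * d_graph, d_model)

    def forward(self, tokens, depths, subgraph_ids):
        # tokens, depths, subgraph_ids: (batch, seq_len) long tensors
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok(tokens) + self.pos(positions)     # standard sequence embedding
        g = torch.cat([self.depth(depths),
                       self.subgraph(subgraph_ids)], dim=-1)
        # Concatenate structure to content, then project back to d_model,
        # so the encoder still consumes an ordinary token sequence.
        return self.proj(torch.cat([x, g], dim=-1))
```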
Key Contribution
Additional embeddings are concatenated to the word and position embeddings. This information allows the Transformer encoder to capture some graph structure while still modeling a sequence (a sketch of how the depth signal can be derived follows below). A low-resource language like Estonian sees a 2.3 BLEU point improvement from multilingual modeling.
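To make the structural signal concrete, here is a small sketch deriving the per-node depth from a Penman-notation AMR using the `penman` library; the breadth-first traversal and the library choice are assumptions for illustration, not the paper's implementation:

```python
# Sketch: assign each AMR node its depth in the rooted graph
# via breadth-first traversal from the top node.
from collections import deque
import penman

def node_depths(amr_string):
    g = penman.decode(amr_string)
    # Build an adjacency list over variable-to-variable edges.
    adj = {}
    for src, _, tgt in g.edges():
        adj.setdefault(src, []).append(tgt)
    depths = {g.top: 0}
    queue = deque([g.top])
    while queue:
        node = queue.popleft()
        for child in adj.get(node, []):
            if child not in depths:  # first visit gives shortest depth
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths

print(node_depths("(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"))
# {'w': 0, 'b': 1, 'g': 1}
```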
Update