Supervised Project
Generation of simplified texts
Loria - Synalp
Claire Gardent
Summary
NLG Task:
The task is to provide a pre-trained T5 model covering a multitude of languages; this model (mT5) covers 101 languages.
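As an illustration (not from the paper itself), such a pre-trained checkpoint can be loaded through the Hugging Face `transformers` library. The `google/mt5-small` checkpoint name is the published one, while the `simplify:` task prefix below is a hypothetical example: the raw pre-trained model would still need fine-tuning for a downstream task such as text simplification.

```python
# Minimal sketch: load a pre-trained mT5 checkpoint with Hugging Face
# transformers (requires the `transformers` and `sentencepiece` packages).
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# mT5 is text-to-text: every task is cast as string -> string.
# The "simplify:" prefix is hypothetical; the checkpoint must be
# fine-tuned before it produces useful simplifications.
inputs = tokenizer("simplify: La phrase est difficile.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```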
Training Data:
The data comes from mC4, a multilingual variant of the C4 corpus (Common Crawl web data), covering more than 100 languages. About 1 trillion tokens, roughly 450 GB of data, are used to pre-train the model. For non-English data, the cld3 model is used to determine the language of each page.
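As a sketch of that language-identification step (assuming the `pycld3` Python binding to Google's cld3 model; this snippet is illustrative, not the paper's actual pipeline):

```python
# Minimal sketch, assuming the `pycld3` binding to the cld3
# language-identification model used when building mC4.
import cld3

pages = [
    "Le chat dort sur le canapé.",
    "The quick brown fox jumps over the lazy dog.",
]

for text in pages:
    pred = cld3.get_language(text)
    # Keep a page only if the detector is confident about its language.
    if pred is not None and pred.is_reliable:
        print(f"{pred.language}: probability={pred.probability:.2f}")
```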
Model Description:
A Transformer is employed (the T5 architecture). Model sizes range from 60 million to 11 billion parameters. The vocabulary is composed of 250,000 wordpieces. The model is pre-trained for 1 million steps on batches of 1024 input sequences of length 1024.
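For intuition, here is a toy sketch (not the authors' code) of the span-corruption objective T5-style models are pre-trained with: random contiguous spans are replaced by sentinel tokens in the input, and the target reconstructs the dropped spans. The helper `span_corrupt` and its parameters are simplified assumptions for illustration.

```python
# Toy sketch of T5-style span corruption (simplified assumption,
# not the paper's implementation).
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Replace random contiguous spans with sentinels; return (input, target)."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to mask
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate:
            span = min(mean_span_len, budget, len(tokens) - i)
            sent = f"<extra_id_{sentinel}>"
            inputs.append(sent)             # span replaced by one sentinel
            targets.append(sent)            # target: sentinel + dropped span
            targets.extend(tokens[i:i + span])
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt("the mT5 model is pre-trained on the mC4 corpus".split())
print("input: ", " ".join(inp))
print("target:", " ".join(tgt))
```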
Key Contribution:
This work demonstrates that the T5 approach carries over to the multilingual setting while remaining efficient. The authors also show that illegal predictions (answers not drawn verbatim from the input, e.g. accidental translations in zero-shot span extraction) can be detected and removed automatically.
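A hedged sketch of what such an automatic filter might look like for span-extraction tasks (the helper `is_legal_prediction` is an assumption for illustration, not the paper's code): a prediction that does not occur verbatim in the passage, for instance because it was accidentally translated, is flagged as illegal and can be dropped.

```python
# Sketch of an "illegal prediction" filter for span extraction
# (hypothetical helper; legal answers must be substrings of the context).
def is_legal_prediction(prediction: str, context: str) -> bool:
    return prediction.strip() in context

context = "mT5 wurde auf dem mC4-Korpus vortrainiert."
good = "mC4-Korpus"      # verbatim span of the context -> legal
bad = "the mC4 corpus"   # accidental translation -> illegal

for pred in (good, bad):
    status = "legal" if is_legal_prediction(pred, context) else "illegal"
    print(pred, "->", status)
```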
All the code and pre-trained models can be found on their GitHub repository:
https://github.com/google-research/multilingual-t5
Results:
F1 scores for each model compared to the mT5 variants:
| Model | XNLI | PAWS-X | WikiAnn-NER | XQuAD | MLQA | TyDiQA-GoldP |
| --- | --- | --- | --- | --- | --- | --- |
| mBERT | 65.4 | 81.9 | 62.2 | 64.5 | 61.4 | 59.7 |
| XLM | 69.1 | 80.9 | 61.2 | 59.8 | 48.5 | 43.6 |
| InfoXLM | 81.4 | - | - | - | 73.6 | - |
| X-STILTs | 80.4 | 87.7 | 64.7 | 77.2 | 72.3 | 76.0 |
| XLM-R | 79.2 | 86.4 | 65.4 | 76.6 | 71.6 | 65.1 |
| VECO | 79.9 | 88.7 | 65.7 | 77.3 | 71.7 | 67.6 |
| RemBERT | 80.8 | 87.5 | 70.1 | 79.6 | 73.1 | 77.0 |
| mT5-Small | 67.5 | 82.4 | 50.5 | 58.1 | 54.6 | 36.4 |
| mT5-Base | 75.4 | 86.4 | 55.7 | 67.0 | 64.6 | 59.1 |
| mT5-Large | 81.1 | 88.9 | 58.5 | 77.8 | 71.2 | 68.4 |
| mT5-XL | 82.9 | 89.6 | 65.5 | 79.5 | 73.5 | 77.8 |
| mT5-XXL | 85.0 | 90.0 | 69.2 | 82.5 | 76.0 | 82.0 |