Announcing DeepPavlov Library 0.16.0 release

Hello everyone and welcome!

We are happy to share our newest DeepPavlov Library v0.16.0! In this release we’ve updated the underlying Transformers library to 4.6.0, delivered Distil ruBERT for the Russian language, and continued the Spring Cleaning of the DeepPavlov Library. We’ve also added experimental multi-GPU support.

We are eager to hear your thoughts about the new version!

New Features

Distil ruBERT

Distil ruBERT (the 'student' model) is a reduced copy of ruBERT (the 'teacher' model) trained to mimic the teacher's behaviour. The result is a smaller model that performs slightly worse than the large teacher but better than a model of the same size trained from scratch. Rubert-base-cased-conversational was chosen as the teacher due to its popularity, and the DistilBERT architecture was chosen for the student. Two students, distilrubert-tiny-cased-conversational (2 layers, 768 hidden, 12 heads, 107M parameters) and distilrubert-base-cased-conversational (6 layers, 768 hidden, 12 heads, 135M parameters), were trained on the same data and vocabulary as the teacher, using the teacher's intermediate and final outputs to compute the loss. The following losses were used:
1. Kullback–Leibler divergence (between teacher and student output logits)
2. Masked language modeling loss (between token labels and student output logits)
3. Cosine embedding loss between averaged consecutive hidden states of the teacher (sequence of six hidden states for 2-layered version; sequence of two hidden states for 6-layered version) and one hidden state of the student
4. Mean squared error loss between averaged consecutive attention maps of the teacher (sequence of six attention maps for 2-layered version; for 6-layered version this loss was not used due to limitations in our computational resources) and one attention map of the student
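
To make the loss terms above more concrete, here is a minimal pure-Python sketch on toy vectors. It is not the training code used for Distil ruBERT: the function names are hypothetical, real training would operate on tensors in a deep learning framework, and the way the four terms are weighted and combined is not stated here, so the sketch computes them separately. It assumes a 12-layer teacher, so that six consecutive teacher layers map to each layer of the 2-layer student and two map to each layer of the 6-layer student.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two output distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def average_groups(teacher_layers, group_size):
    """Average every run of `group_size` consecutive teacher vectors."""
    groups = [teacher_layers[i:i + group_size]
              for i in range(0, len(teacher_layers), group_size)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def cosine_loss(u, v):
    """1 - cosine similarity: zero when the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mse_loss(u, v):
    """Mean squared error between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# Toy data: 12 teacher hidden states (assumed depth) and a 2-layer student.
teacher_hidden = [[float(i), 1.0, 0.5] for i in range(12)]
student_hidden = [[2.0, 1.0, 0.5], [9.0, 1.0, 0.5]]

# Each student layer is compared with the mean of six teacher layers.
targets = average_groups(teacher_hidden, group_size=6)
hidden_losses = [cosine_loss(s, t) for s, t in zip(student_hidden, targets)]
print("per-layer cosine losses:", hidden_losses)
print("KL on toy logits:", kl_divergence([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
```

The same `average_groups` helper with `group_size=2` would produce six targets for the 6-layer student, and `mse_loss` would be applied to flattened attention maps in the same grouped fashion.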

Models were trained on 8 NVIDIA Tesla P100-SXM2.0 16Gb GPUs. Training took about 30 hours for distilrubert-tiny and about 100 hours for distilrubert-base. After pre-training, we generated a batch of 16 random sequences of length 512 and, for all models, compared the time required to process one sample (latency) and the number of batches processed per second (throughput) on a CPU (Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz). We found that the tiny version is 4 times faster (and 40% lighter) than the BERT teacher, and the base version is 2 times faster (and 24% lighter).
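
For readers who want to run a similar comparison, the sketch below shows one possible way to measure both quantities. It is an illustration only, not the benchmarking script behind the numbers above: `benchmark`, `dummy_model`, and the vocabulary size are hypothetical, and latency is approximated here as per-batch time divided by batch size, which may differ from how it was actually measured.

```python
import random
import time


def benchmark(process_batch, batch, n_runs=10):
    """Return (seconds per sample, batches per second) for `process_batch`."""
    process_batch(batch)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_runs):
        process_batch(batch)
    seconds_per_batch = (time.perf_counter() - start) / n_runs
    latency = seconds_per_batch / len(batch)   # time per sample
    throughput = 1.0 / seconds_per_batch       # batches per second
    return latency, throughput


def dummy_model(batch):
    """Hypothetical stand-in for a real model's forward pass."""
    return [sum(sequence) for sequence in batch]


# A batch of 16 random token-id sequences of length 512, as in the text.
# The vocabulary size of 1000 is an arbitrary placeholder.
batch = [[random.randint(0, 999) for _ in range(512)] for _ in range(16)]

latency, throughput = benchmark(dummy_model, batch, n_runs=5)
print(f"latency: {latency:.2e} s/sample, throughput: {throughput:.1f} batches/s")
```

Replacing `dummy_model` with a function that runs the teacher or a student on the same batch would give directly comparable numbers.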

Fine-tuning was then done for classification (paraphraser.ru, rusentiment), NER (Persons-1000 dataset with additional LOC and ORG markup (Collection 3)), and question answering (SDSJ Task B) tasks. On average, distilrubert-base scores 2-3% lower than the teacher and distilrubert-tiny 5-7% lower. You can check scores and configs here.

Transformers Version Update

Previously, we used a version of the Transformers library that was over a year old. Many fixes and new features have been introduced since then, which is why in this release we’ve updated the underlying Transformers library to 4.6.0:

deeppavlov/configs/ner/ner_ontonotes_bert_mult_torch.json
deeppavlov/configs/ner/ner_ontonotes_bert_torch.json
deeppavlov/configs/ner/ner_conll2003_torch_bert.json
deeppavlov/configs/ner/ner_rus_bert_torch.json
deeppavlov/configs/squad/squad_ru_torch_bert.json
deeppavlov/configs/squad/squad_torch_bert.json
deeppavlov/configs/squad/squad_torch_bert_infer.json

Experimental Multi-GPU Support

Finally, we’ve introduced experimental multi-GPU support for our Classifier and SQuAD models, based on data parallelization. Multi-GPU mode is applied automatically when more than one GPU is visible to the model.

There are some caveats to be aware of:

Only NVIDIA GPUs are supported at the moment.
Only GPUs on a single machine are currently supported.
Parallelization is achieved by distributing the examples in a given batch between the cards, so the batch size should be at least equal to the number of GPUs available. Naturally, this multi-GPU mode won’t help if you’re struggling to fit even a single example onto a given GPU.
One of the GPUs acts as a “supervisor” and therefore uses more memory than the others. As a result, the current implementation doesn’t allow an even distribution of the workload between the GPUs.
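
The batch-size caveat is easier to see with a small sketch of how a batch might be divided among devices. This is a simplified illustration, not DeepPavlov's or PyTorch's actual scatter logic; `split_batch` is a hypothetical helper.

```python
def split_batch(batch, n_gpus):
    """Split a batch into `n_gpus` contiguous, nearly equal chunks.

    An empty chunk means the corresponding device would receive no work.
    """
    base, extra = divmod(len(batch), n_gpus)
    chunks, start = [], 0
    for i in range(n_gpus):
        size = base + (1 if i < extra else 0)
        chunks.append(batch[start:start + size])
        start += size
    return chunks


# Eight examples over three devices: every device gets some work.
print(split_batch(list(range(8)), n_gpus=3))

# Two examples over three devices: the last chunk is empty.
print(split_batch(list(range(2)), n_gpus=3))
```

When the batch is smaller than the number of devices, at least one chunk is empty, which is why the batch size should be at least the number of GPUs.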

To provide an example, let’s say you want to train a model to solve the MNLI task in multi-GPU mode. Assuming that you have access to 3 GPUs on a single machine with ids 0, 1, 2, you can train a model utilizing these GPUs with the following command:

CUDA_VISIBLE_DEVICES=0,1,2 python -m deeppavlov train glue_mnli_roberta

Predefined Configs for Some GLUE/SuperGLUE Tasks

In this release, we’ve added predefined configs with reproducible scores for the following tasks (English):

Multi-Genre Natural Language Inference (MNLI)
Recognizing Textual Entailment (RTE)
Boolean Questions (BoolQ)
Choice of Plausible Alternatives (COPA)

Multiple Choice Task Type

Three out of four previously mentioned configs, in essence, solve the problem of Sentence Pair Classification. However, Choice of Plausible Alternatives is a bit more elaborate: given a premise and a question (“What was the cause of this?” or “What happened as a result?”), a model is supposed to choose the most plausible option out of two possible alternatives. Effectively, the model has to rank the alternatives and choose the most likely one. We implement this using HuggingFace’s `AutoModelForMultipleChoice`, while also providing the required preprocessing steps. This means that, with some tinkering, the existing COPA config may be modified to solve a similarly structured task.
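
As a rough sketch of what this kind of preprocessing involves, the snippet below turns one COPA-style example into one text pair per alternative and then picks the highest-scoring one. It is not the preprocessing shipped with the COPA config: the helper names are hypothetical and the scores are made-up stand-ins for model outputs. In a real pipeline each pair would be tokenized and stacked into a tensor of shape (batch size, number of choices, sequence length), which is the input layout that multiple-choice heads in Transformers expect.

```python
QUESTION_PROMPTS = {
    "cause": "What was the cause of this?",
    "effect": "What happened as a result?",
}


def build_choice_pairs(premise, question, alternatives):
    """Pair the premise and question prompt with every alternative."""
    context = f"{premise} {QUESTION_PROMPTS[question]}"
    return [(context, alternative) for alternative in alternatives]


def choose(scores):
    """Return the index of the highest-scoring alternative."""
    return max(range(len(scores)), key=scores.__getitem__)


pairs = build_choice_pairs(
    premise="The man broke his toe.",
    question="cause",
    alternatives=[
        "He got a hole in his sock.",
        "He dropped a hammer on his foot.",
    ],
)
for first, second in pairs:
    print(repr(first), "->", repr(second))

# Hypothetical per-alternative scores standing in for real model logits.
scores = [0.2, 1.7]
print("chosen alternative:", pairs[choose(scores)][1])
```

Because the model only ranks the alternatives, the same structure carries over to tasks with more than two options by passing a longer `alternatives` list.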

Binary Classification Head

During our experiments on the BoolQ task, we found that formalizing it specifically as binary classification (i.e. applying sigmoid to a single scalar) instead of multi-class classification with 2 classes (i.e. applying softmax to 2 scalars) appears to improve the accuracy on the test set. To achieve this, we’ve added a custom Binary Classification head that, in theory, should be usable with most HuggingFace models. To use the Binary Classification head, set the single parameter `is_binary` to `true` in the corresponding classification config.
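
The two formulations can be compared in a few lines of plain Python. This is only an illustration of the math, not the code of the Binary Classification head, and the function names are hypothetical. It shows that a 2-class softmax yields the same probability as a sigmoid applied to the difference of the two logits, so the formulations differ in parameterization (one output instead of two) rather than in the set of probabilities they can express.

```python
import math


def sigmoid(z):
    """Probability of the positive class from a single scalar logit."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax_positive(z0, z1):
    """Probability of class 1 from two logits via a stable softmax."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


def binary_cross_entropy(logit, label):
    """Binary cross-entropy for one example with a 0/1 label."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


z0, z1 = 0.3, 1.5
print("softmax P(class 1):", softmax_positive(z0, z1))
print("sigmoid(z1 - z0):  ", sigmoid(z1 - z0))
print("BCE for a positive example:", binary_cross_entropy(z1 - z0, 1))
```

The accuracy difference reported above is therefore an empirical observation about training behaviour, not a consequence of one formulation being more expressive.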

Other Fixes

Spring Cleaning Continues: 3,132 removed code lines

Starting from the 0.15 release, we’ve deprecated several ML models, which will no longer be supported by us in this and the upcoming releases. In this release this list of deprecated models was further extended with:

17 ranking configs;
8 components (predominantly readers).

If you need one or more of the deleted models, we welcome community contributions and would be happy to see these models supported by our lovely community.

We encourage you to begin building your Conversational AI systems with our DeepPavlov Library on GitHub and let us know what you think! Feel free to test our BERT-based models by using our demo. Keep in mind that we also have a dedicated forum, where any questions concerning the framework and the models are welcome.

Follow @deeppavlov on Twitter.