DeepPavlov Topics

We present "DeepPavlov Topics", a new dataset for topic classification in the conversational domain. The dataset was collected and filtered automatically from websites and open datasets.

IMPORTANT NOTE! The data were collected with a single label per sample. In our experience, however, topic classification is a multi-label task, so we train our topic classifiers in multi-label mode, which gives the model an opportunity to correct errors introduced by the automatic data collection.
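
To illustrate the difference between the two modes, here is a minimal sketch in plain Python. It is not the released classifier: the three-topic label list, the example logits, and the 0.5 threshold are all assumptions chosen for illustration. Single-label decoding keeps only the top-scoring topic, while multi-label decoding scores each topic independently and can return several.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_single_label(logits, labels):
    """Single-label mode: return only the highest-scoring topic."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [labels[best]]


def decode_multi_label(logits, labels, threshold=0.5):
    """Multi-label mode: return every topic whose independent
    probability reaches the threshold (0.5 is an assumed default)."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Hypothetical model outputs for one utterance.
labels = ["Food", "Travel", "Music"]
logits = [2.1, 0.4, -3.0]

single = decode_single_label(logits, labels)
multi = decode_multi_label(logits, labels)
```

With these made-up logits, the single-label decoder returns only Food, whereas the multi-label decoder also keeps Travel, even though a sample like this would carry just one label in the collected data.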

THE NEXT VERSIONS! We acknowledge that the list of topics does not cover everything necessary for dialogue processing. For example, it lacks weather and sexual content topics, which are very helpful for chatbot development. We are working on improving the dataset.

DP_TOPICS_v0

We identify 33 topics and present full (4.2M samples) and down-sampled (2.2M samples) versions of "DeepPavlov Topics". The proposed topics aim to cover the conversational domain in detail while maintaining interpretability. We also release pre-trained topic classification models, including distilled and multilingual versions. The scores are reported in the original paper (see Citation).

We define the following 33 topics: Animals&Pets, Art&Hobbies, Artificial Intelligence, Beauty, Books&Literature, Celebrities&Events, Clothes, Depression, Disasters, Education, Family&Relationships, Finance, Food, Gadgets, Garden, Health&Medicine, Home&Design, Job, Leisure, MassTransit, Movies&Tv, Music, News, Personal Transport, Politics, Psychology, Religion, Science&Technology, Space, Sports, Toys&Games, Travel, VideoGames.

dp_topics_downsampled_dataset_v0.tar.gz -- down-sampled version of the dataset.
dp_topics_full_data_v0.tar.gz -- full version of the dataset.
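
Because the layout inside the archives is not described here, it can help to inspect an archive before extracting it. The sketch below uses only Python's standard `tarfile` module. The helper name `list_archive_members` is hypothetical, and the code assumes the archive has already been downloaded to a local path.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of all entries in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

For example, `list_archive_members("dp_topics_downsampled_dataset_v0.tar.gz")` would list the files in the down-sampled archive, assuming it sits in the current working directory.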

Each sample has the following structure:
text - the text of the sample;
topic - topic labels separated by ";".
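
As a sketch of how the topic field might be prepared for multi-label training, the snippet below splits it on ";" and builds a multi-hot vector over the 33 topics. It assumes the label strings in the data match the topic names listed above exactly, which should be verified against the actual files. The helper names `parse_topics` and `to_multi_hot` are hypothetical.

```python
# The 33 topics, in the order listed above.
TOPICS = [
    "Animals&Pets", "Art&Hobbies", "Artificial Intelligence", "Beauty",
    "Books&Literature", "Celebrities&Events", "Clothes", "Depression",
    "Disasters", "Education", "Family&Relationships", "Finance", "Food",
    "Gadgets", "Garden", "Health&Medicine", "Home&Design", "Job", "Leisure",
    "MassTransit", "Movies&Tv", "Music", "News", "Personal Transport",
    "Politics", "Psychology", "Religion", "Science&Technology", "Space",
    "Sports", "Toys&Games", "Travel", "VideoGames",
]
TOPIC_INDEX = {name: i for i, name in enumerate(TOPICS)}


def parse_topics(field):
    """Split a ';'-separated topic field into a list of label strings."""
    return [part.strip() for part in field.split(";") if part.strip()]


def to_multi_hot(field):
    """Encode a topic field as a 0/1 vector with one slot per topic."""
    vector = [0] * len(TOPICS)
    for label in parse_topics(field):
        if label not in TOPIC_INDEX:
            # Assumption check: fail loudly if the data uses another spelling.
            raise ValueError(f"Unknown topic label: {label!r}")
        vector[TOPIC_INDEX[label]] = 1
    return vector
```

A sample whose topic field is "Food;Travel" would then map to a vector with exactly two positions set, one for each of those topics.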

Citation

Please cite the following paper if you use the DeepPavlov Topics Dataset:

"DeepPavlov Topics: Topic Classification Dataset for Conversational Domain in English", Beksultan Sagyndyk, Dilyara Baymurzina, Mikhail Burtsev. [Will be edited after the paper's publication]