DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.

Open Domain Question Answering Skill on Wikipedia

Task definition

Open Domain Question Answering (ODQA) is the task of finding an exact answer to any question in Wikipedia articles. Given only a question, the system outputs the best answer it can find:

Question:

What is the name of Darth Vader’s son?

Answer:

Luke Skywalker

Languages

DeepPavlov provides pretrained ODQA models for English and Russian.

Models

The ODQA skill has a modular architecture that consists of two models, a ranker and a reader. The ranker is based on DrQA, proposed by Facebook Research (“Reading Wikipedia to Answer Open-Domain Questions”), and the reader is based on R-Net, proposed by Microsoft Research Asia (“R-NET: Machine Reading Comprehension with Self-matching Networks”), and on its implementation by Wenxuan Zhou.
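The two-stage design can be illustrated with a toy sketch. This is not DrQA, R-Net, or the DeepPavlov API: the function names `rank`, `read`, and `answer` are hypothetical, the ranker is a simple token-overlap scorer, and the reader returns a whole sentence rather than an exact answer span. It only shows how a ranker narrows the article set before a reader looks for the answer.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def rank(question: str, articles: List[str], top_n: int = 2) -> List[str]:
    """Toy ranker (stand-in for DrQA): keep the top_n articles that
    share the most distinct tokens with the question."""
    q = set(tokenize(question))
    ordered = sorted(articles, key=lambda a: len(q & set(tokenize(a))), reverse=True)
    return ordered[:top_n]


def read(question: str, context: str) -> str:
    """Toy reader (stand-in for R-Net): return the sentence of the
    context that shares the most tokens with the question."""
    q = set(tokenize(question))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    return max(sentences, key=lambda s: len(q & set(tokenize(s))))


def answer(question: str, articles: List[str], top_n: int = 2) -> str:
    """Chain the two stages: rank the articles, then read the best ones."""
    context = " ".join(rank(question, articles, top_n))
    return read(question, context)


articles = [
    "Luke Skywalker is the son of Darth Vader. He was raised on Tatooine.",
    "Paris is the capital of France.",
    "The Moon orbits the Earth.",
]
print(answer("What is the name of Darth Vader's son?", articles))
```

Because the stages only communicate through plain text, either one can be retrained or replaced without touching the other, which is why the ranker and the reader have separate configs.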

Running ODQA

TensorFlow 1.4.0 with GPU support is required to run this model.

Training

The ODQA ranker and the ODQA reader should be trained separately. Warning: training the ranker on English Wikipedia requires 16 GB of RAM. Run the following to train the ranker:

python -m deeppavlov train deeppavlov/configs/odqa/en_ranker_prod.json

Read about training the reader in our separate reader tutorial.

Interacting

ODQA, the reader, and the ranker can each be run interactively. Warning: interacting with the ranker or ODQA on English Wikipedia requires 16 GB of RAM. Run the following to interact with ODQA:

python -m deeppavlov interact deeppavlov/configs/odqa/en_odqa_infer_prod.json

Run the following to interact with the ranker:

python -m deeppavlov interact deeppavlov/configs/odqa/en_ranker_prod.json

Read about interacting with the reader in our separate reader tutorial.

Configuration

The ODQA configs are suitable for inference only. Use the ranker config to train the ranker and the reader config to train the reader.

Ranker

The ranker config for English can be found at deeppavlov/configs/odqa/en_ranker_prod.json

The ranker config for Russian can be found at deeppavlov/configs/odqa/ru_ranker_prod.json

ODQA

The default ODQA config for English is deeppavlov/configs/odqa/en_odqa_infer_prod.json

The default ODQA config for Russian is deeppavlov/configs/odqa/ru_odqa_infer_prod.json

The components of the ODQA config are described by the ranker config and the reader config respectively. However, the main inputs and outputs are worth explaining:

Pretrained models

Wikipedia data and pretrained ODQA models are downloaded to deeppavlov/download/odqa by default.

enwiki.db

The enwiki.db SQLite database contains 5159530 Wikipedia articles and is built in the following steps:

  1. Download a Wikipedia dump file. We used the latest enwiki dump available at the time (2018-02-11).
  2. Unpack the dump and extract the articles with WikiExtractor (with the --json, --no-templates, and --filter_disambig_pages options).
  3. Build the database with the DrQA script.
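The last step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the DrQA script itself: it assumes the WikiExtractor output has one JSON object per line with `title` and `text` fields, and the `build_db` function, the `documents` table name, and its two-column schema are chosen here for the example.

```python
import json
import sqlite3
from typing import Iterable


def build_db(json_lines: Iterable[str], db_path: str) -> int:
    """Store articles in a SQLite table and return the number of rows.

    Assumes each non-empty line is a JSON object with "title" and "text"
    keys; the table name and schema are illustrative.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, text TEXT)"
        )
        rows = []
        for line in json_lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            article = json.loads(line)
            rows.append((article["title"], article["text"]))
        conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?)", rows)
    count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    conn.close()
    return count


sample = [
    '{"id": "1", "title": "Luke Skywalker", "text": "Luke Skywalker is a Jedi."}',
    "",
    '{"id": "2", "title": "Darth Vader", "text": "Darth Vader is a Sith lord."}',
]
print(build_db(sample, ":memory:"))
```

For a real dump, `json_lines` would be the lines of the extracted files read one at a time, and `db_path` would point to a file on disk rather than `":memory:"`.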

enwiki_tfidf_matrix.npz

enwiki_tfidf_matrix.npz is a tf-idf matrix of the full Wikipedia. Its size is hash_size x number of documents, which is 2**24 x 5159530. The matrix is built with the deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer class.
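To make the hash_size x number of documents shape concrete, here is a minimal pure-Python sketch of a hashed tf-idf matrix. It is not the HashingTfidfVectorizer implementation: the `bucket` and `hashed_tfidf` names, the MD5-based hash, whitespace tokenization, and the `log(1 + tf) * log(N / df)` weighting are all assumptions made for illustration. Each token is hashed to one of hash_size rows, each document is a column, and only non-zero weights are stored.

```python
import hashlib
import math
from collections import Counter
from typing import Dict, List, Tuple


def bucket(token: str, hash_size: int) -> int:
    """Map a token to a row index in [0, hash_size) with a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_size


def hashed_tfidf(docs: List[str], hash_size: int = 2 ** 24) -> Dict[Tuple[int, int], float]:
    """Return a sparse {(row, column): weight} mapping that represents
    a matrix of shape hash_size x len(docs)."""
    # Term frequencies per document, keyed by hashed row index.
    counts = [Counter(bucket(t, hash_size) for t in doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each row index occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(docs)
    matrix = {}
    for col, c in enumerate(counts):
        for row, tf in c.items():
            weight = math.log(1 + tf) * math.log(n_docs / doc_freq[row])
            if weight > 0:  # keep the matrix sparse
                matrix[(row, col)] = weight
    return matrix


docs = ["luke skywalker jedi", "darth vader sith", "luke and vader"]
matrix = hashed_tfidf(docs)
print(len(matrix), "non-zero entries in a", 2 ** 24, "x", len(docs), "matrix")
```

Hashing fixes the number of rows in advance, so no vocabulary has to be stored, at the cost of occasional collisions between unrelated tokens. A question can be hashed into the same row space and scored against every column to rank the articles.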

ruwiki.db

The ruwiki.db SQLite database contains 1463888 Wikipedia articles and is built in the following steps:

  1. Download a Wikipedia dump file. We used the latest ruwiki dump available at the time (2018-04-01).
  2. Unpack the dump and extract the articles with WikiExtractor (with the --json, --no-templates, and --filter_disambig_pages options).
  3. Build the database with the DrQA script.

ruwiki_tfidf_matrix.npz

ruwiki_tfidf_matrix.npz is a tf-idf matrix of the full Wikipedia. Its size is hash_size x number of documents, which is 2**24 x 1463888. The matrix is built with the deeppavlov/models/vectorizers/hashing_tfidf_vectorizer.HashingTfidfVectorizer class.

References

  1. https://github.com/facebookresearch/DrQA
  2. https://github.com/HKUST-KnowComp/R-Net