DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.

License: Apache 2.0. Python: 3.6.

Automatic spelling correction pipelines

We provide two types of pipelines for spelling correction: levenstein_corrector uses simple Damerau-Levenshtein distance to find correction candidates, and brillmoore uses a statistics-based error model for the same purpose. In both cases, the final correction is chosen from the candidates based on context, with the help of a KenLM language model.
You can find the comparison of these and other approaches near the end of this readme.
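
To illustrate what "chosen based on context" means, here is a minimal sketch in which every token has a list of candidates and the combination with the best language model score wins. The bigram table and the names toy_lm_score and choose_by_context are hypothetical stand-ins invented for this example. The provided pipelines use KenLM for scoring, and this sketch does not reproduce their search procedure.

```python
from itertools import product

# Hypothetical bigram log-probabilities standing in for a real language model.
BIGRAM_LOGP = {
    ("<s>", "the"): -0.5,
    ("the", "cat"): -1.0,
    ("the", "cut"): -3.0,
    ("cat", "sat"): -1.0,
    ("cut", "sat"): -4.0,
    ("sat", "</s>"): -0.5,
}
UNSEEN_LOGP = -10.0  # flat penalty for bigrams missing from the toy table


def toy_lm_score(tokens):
    """Sum bigram log-probabilities over the sentence, including boundaries."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(BIGRAM_LOGP.get(pair, UNSEEN_LOGP)
               for pair in zip(padded, padded[1:]))


def choose_by_context(candidates_per_token, score=toy_lm_score):
    """Return the candidate combination with the highest score."""
    return list(max(product(*candidates_per_token), key=score))


# "cut" and "cat" are both valid words; only the context separates them.
corrected = choose_by_context([["the"], ["cut", "cat"], ["sat"]])
print(corrected)
```

Trying every combination grows exponentially with sentence length, so a beam search is the usual choice for anything longer than a toy example.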

Quick start

You can run the following command to try the provided pipelines:

python -m deeppavlov interact <path_to_config> [-d]

where <path_to_config> is one of the provided config files.
With the optional -d parameter, all the data required to run the selected pipeline will be downloaded, including an appropriate language model.

After downloading the required files, you can use these configs in your Python code. For example, this code reads lines from stdin and prints the corrected lines to stdout:

import json
import sys

from deeppavlov.core.commands.infer import build_model_from_config

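# Path to one of the provided pipeline configs
CONFIG_PATH = 'deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json'

with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

# Build the whole correction pipeline described by the config
model = build_model_from_config(config)

# The model is called on a batch (a list of strings); take the single result
for line in sys.stdin:
    print(model([line])[0], flush=True)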

levenstein_corrector

This component finds all the candidates in a static dictionary within a set Damerau-Levenshtein distance.
It can split one token into two, but it cannot merge two tokens into one.
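
The behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: it uses the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, scans the dictionary linearly, and only considers exact splits. The names osa_distance, find_candidates and DICTIONARY are made up for the example.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of adjacent characters (optimal string alignment)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def find_candidates(token, dictionary, max_distance=1):
    """Dictionary words within max_distance, plus two-word splits."""
    candidates = [w for w in dictionary
                  if osa_distance(token, w) <= max_distance]
    # One token may be split into two dictionary words,
    # but two input tokens are never merged into one.
    for k in range(1, len(token)):
        left, right = token[:k], token[k:]
        if left in dictionary and right in dictionary:
            candidates.append(left + " " + right)
    return candidates


DICTIONARY = {"the", "then", "a", "lot", "slot"}  # toy dictionary
print(sorted(find_candidates("teh", DICTIONARY)))
print(sorted(find_candidates("alot", DICTIONARY)))
```

A production implementation would typically search a trie or a similar index instead of comparing the token against every dictionary entry.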

Component config parameters:

brillmoore

This component is based on the paper "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore and uses a statistics-based error model to find the best candidates in a static dictionary.
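
The core idea of such an error model can be sketched as a search over ways to cut the intended word and the typed word into aligned pieces, where each piece-to-piece rewrite has its own probability. The code below is a simplified illustration, not the component's implementation: the probabilities in EDIT_PROB, the constants MATCH_PROB and MAX_PIECE, and the function names are all invented for the example, and the positional conditioning described in the paper is omitted.

```python
import math

# Hypothetical probabilities P(typed piece | intended piece).
EDIT_PROB = {
    ("ph", "f"): 0.1,
    ("e", "a"): 0.05,
    ("ent", "ant"): 0.2,
}
MATCH_PROB = 0.9  # assumed probability of typing one character correctly
MAX_PIECE = 3     # longest substring considered on either side


def piece_logp(intended, typed):
    """Log-probability of rewriting one intended piece as one typed piece."""
    if (intended, typed) in EDIT_PROB:
        return math.log(EDIT_PROB[(intended, typed)])
    if intended == typed and len(intended) == 1:
        return math.log(MATCH_PROB)
    return -math.inf  # rewrite never observed in the toy table


def error_model_logp(intended, typed):
    """Best log-probability over all aligned partitions of both strings."""
    n, m = len(intended), len(typed)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for a in range(max(0, i - MAX_PIECE), i + 1):
                for b in range(max(0, j - MAX_PIECE), j + 1):
                    if (a, b) == (i, j) or best[a][b] == -math.inf:
                        continue
                    score = best[a][b] + piece_logp(intended[a:i], typed[b:j])
                    best[i][j] = max(best[i][j], score)
    return best[n][m]


print(error_model_logp("phone", "fone"))
print(error_model_logp("dependent", "dependant"))
```

In the second call the multi-character rewrite "ent" to "ant" scores higher than the single-character rewrite "e" to "a" combined with two extra exact matches, which shows how longer pieces let the model capture context around an error.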

Component config parameters:

Training configuration

For the training phase, the config file also needs to include these parameters:

The configuration of the spelling_error_model component must also include a fit_on parameter: a list of two elements naming the component's input and its true output in the chainer's shared memory.
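
As a rough illustration of the fit_on parameter, the fragment below writes a component entry as a Python dict and prints it as JSON. Everything except the component name and the fit_on key is an assumption made for the example: "x" and "y" are placeholder names for entries in the chainer's shared memory, and the "in" and "out" keys and their values are illustrative only. Check the provided config files for the exact layout.

```python
import json

# Hypothetical component entry; only "fit_on" is described in the text above.
error_model_component = {
    "name": "spelling_error_model",
    "in": ["x"],                   # assumed name of the component's input
    "out": ["tokens_candidates"],  # assumed name of the component's output
    "fit_on": ["x", "y"],          # [input name, true output name]
}

print(json.dumps(error_model_component, indent=2))
```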

Language model

The provided pipelines use KenLM to process language models, so if you want to build your own, we suggest you consult its website. We also provide our own language models for English (5.5GB) and Russian (3.1GB).

Comparison

We compared our pipelines with Yandex.Speller, JamSpell (trained on the largest part of our Russian text corpus that JamSpell could handle), and PyHunSpell on the test set of the SpellRuEval competition on Automatic Spelling Correction for Russian:

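| Correction method          | Precision | Recall | F-measure | Speed (sentences/s) |
|----------------------------|-----------|--------|-----------|---------------------|
| Yandex.Speller             | 83.09     | 59.86  | 69.59     | 5.                  |
| Damerau Levenshtein 1 + lm | 53.26     | 53.74  | 53.50     | 29.3                |
| Brill Moore top 4 + lm     | 51.92     | 53.94  | 52.91     | 0.6                 |
| Hunspell + lm              | 41.03     | 48.89  | 44.61     | 2.1                 |
| JamSpell                   | 44.57     | 35.69  | 39.64     | 136.2               |
| Brill Moore top 1          | 41.29     | 37.26  | 39.17     | 2.4                 |
| Hunspell                   | 30.30     | 34.02  | 32.06     | 20.3                |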
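
The F-measure column is consistent with the harmonic mean of precision and recall (F1). That reading is an assumption, since the definition is not stated here, but it can be checked against a few rows with a short sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the comparison table
ROWS = {
    "Yandex.Speller": (83.09, 59.86, 69.59),
    "Damerau Levenshtein 1 + lm": (53.26, 53.74, 53.50),
    "JamSpell": (44.57, 35.69, 39.64),
}

for name, (p, r, reported) in ROWS.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {reported:.2f}")
```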