DeepPavlov

An open source library for building end-to-end dialog systems and training chatbots.

License Apache 2.0 Python 3.6

Automatic spelling correction component

The automatic spelling correction component is based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore. It uses a statistics-based error model, a static dictionary, and an ARPA language model to correct spelling errors.
We provide everything you need to build a spelling correction module for Russian and English, along with some hints on how to collect appropriate datasets for other languages.
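
To make the noisy channel idea concrete, here is a toy sketch. It is not DeepPavlov's implementation: the vocabulary, the unigram probabilities, and the single fixed per-edit penalty are made-up assumptions standing in for the static dictionary, the language model, and the learned error model. The Brill and Moore model learns probabilities for substring-to-substring edits, whereas this sketch only counts single-character edits.

```python
import math

# Hypothetical unigram probabilities standing in for the dictionary and language model.
WORD_PROBS = {"the": 0.05, "then": 0.01, "than": 0.008, "hello": 0.001}

# Hypothetical error model: every single-character edit has the same log-probability.
LOG_P_EDIT = math.log(0.01)


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def correct(word, max_distance=2):
    """Return the candidate w that maximizes log P(word | w) + log P(w)."""
    best, best_score = word, -math.inf
    for candidate, prob in WORD_PROBS.items():
        distance = edit_distance(word, candidate)
        if distance > max_distance:
            continue  # too far away to be a plausible correction
        score = distance * LOG_P_EDIT + math.log(prob)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

With these toy numbers, `correct("hallo")` picks `"hello"`, and a word with no candidate within two edits is returned unchanged.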

Usage

Component config parameters:

This module expects lowercase sentence strings with space-separated tokens as its input, so we advise adding appropriate preprocessing steps to the chainer.
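
For a quick check of what that input format looks like outside the chainer, the sketch below lowercases a string and joins its tokens with single spaces. The `preprocess` helper and its regex tokenizer are assumptions made for illustration; they are not part of DeepPavlov and will not match `nltk_tokenizer` in every case.

```python
import re


def preprocess(text):
    """Lowercase the text and join its tokens with single spaces."""
    # Assumption: words are runs of word characters, and every other
    # non-space character (punctuation) becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(tokens)


print(preprocess("Helo, Wrold!"))  # prints: helo , wrold !
```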

A working config could look like this:

{
  "chainer":{
    "in": ["x"],
    "pipe": [
      {
        "name": "str_lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokens"]
      },
      {
        "in": ["x_tokens"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv",
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "lm_file": "/data/data/enwiki_no_punkt.arpa.binary"
      }
    ],
    "out": ["y_predicted"]
  }
}
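
Because each component reads and writes named entries in the chainer's shared memory, a mistyped name in "in" or "out" is an easy mistake to make. The following sketch checks that every input name is produced by an earlier step. `check_wiring` is a hypothetical helper written for this README, not a DeepPavlov function, and the inline config is a trimmed copy of the one above.

```python
def check_wiring(config):
    """Return (component, missing input name) pairs for a chainer config."""
    chainer = config["chainer"]
    # Names available before the first component runs.
    available = set(chainer["in"]) | set(chainer.get("in_y", []))
    problems = []
    for component in chainer["pipe"]:
        label = component.get("name", component.get("ref", "<unnamed>"))
        for name in component.get("in", []):
            if name not in available:
                problems.append((label, name))
        # Outputs become available to every later component.
        available.update(component.get("out", []))
    for name in chainer["out"]:
        if name not in available:
            problems.append(("chainer.out", name))
    return problems


config = {
    "chainer": {
        "in": ["x"],
        "pipe": [
            {"name": "str_lower", "in": ["x"], "out": ["x_lower"]},
            {"name": "nltk_tokenizer", "in": ["x_lower"], "out": ["x_tokens"]},
            {"name": "spelling_error_model", "in": ["x_tokens"], "out": ["y_predicted"]},
        ],
        "out": ["y_predicted"],
    }
}
print(check_wiring(config))  # prints: []
```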

Usage example

This model expects a lowercase sentence string with space-separated tokens as its input and returns the same string with the words corrected. Here is an example script that reads input data from stdin line by line and writes the resulting text to stdout:

import json
import sys

from deeppavlov.core.commands.infer import build_model_from_config

CONFIG_PATH = 'configs/error_model/brillmoore_kartaslov_ru.json'

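# Load the pipeline configuration
with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

# Build the model once, then correct each input line as a batch of one
model = build_model_from_config(config)
for line in sys.stdin:
    print(model([line])[0], flush=True)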

If we save it as example.py, it can be used like this:

cat input.txt | python3 example.py > out.txt

Training

Error model

For the training phase, the config file also needs to include these parameters:

The component’s configuration must also include a fit_on parameter: a list of two elements naming the component’s input and its true output in the chainer’s shared memory.

A working training config could look something like:

{
  "dataset_reader": {
    "name": "typos_wikipedia_reader"
  },
  "dataset": {
    "name": "typos_dataset",
    "test_ratio": 0.05
  },
  "chainer":{
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "name": "str_lower",
        "id": "lower",
        "in": ["x"],
        "out": ["x_lower"]
      },
      {
        "name": "nltk_tokenizer",
        "id": "tokenizer",
        "in": ["x_lower"],
        "out": ["x_tokens"]
      },
      {
        "ref": "lower",
        "in": ["y"],
        "out": ["y_lower"]
      },
      {
        "ref": "tokenizer",
        "in": ["y_lower"],
        "out": ["y_tokens"]
      },
      {
        "fit_on": ["x_tokens", "y_tokens"],
        "in": ["x_tokens"],
        "out": ["y_predicted"],
        "name": "spelling_error_model",
        "window": 1,
        "dictionary": {
          "name": "wikitionary_100K_vocab"
        },
        "save_path": "error_model/error_model.tsv",
        "load_path": "error_model/error_model.tsv"
      }
    ],
    "out": ["y_predicted"]
  },
  "train": {
    "validate_best": false,
    "test_best": true
  }
}

And a script to use this config:

from deeppavlov.core.commands.train import train_model_from_config

MODEL_CONFIG_PATH = 'configs/error_model/brillmoore_wikitypos_en.json'
train_model_from_config(MODEL_CONFIG_PATH)

Language model

This model uses KenLM to process language models, so if you want to build your own, we suggest consulting the KenLM website. We also provide our own language models for English (5.5GB) and Russian (3.1GB).
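
As a rough illustration of how a KenLM model can be queried from Python, the sketch below ranks candidate sentences by their language model score. It assumes the `kenlm` Python package is installed and that you pass the path to your own model file. The two helper functions are hypothetical and are not how this component calls KenLM internally.

```python
def load_kenlm_scorer(lm_path):
    """Build a sentence scorer backed by a KenLM model file."""
    import kenlm  # imported lazily so the ranking helper works without KenLM

    model = kenlm.Model(lm_path)
    # score() returns the log10 probability of the whole sentence;
    # bos/eos add the begin- and end-of-sentence markers.
    return lambda sentence: model.score(sentence, bos=True, eos=True)


def rank_by_language_model(candidates, score):
    """Order candidate sentences from most to least probable."""
    return sorted(candidates, key=score, reverse=True)
```

Given a scorer from `load_kenlm_scorer`, calling `rank_by_language_model(candidates, scorer)` puts the sentence the language model finds most likely first.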

Comparison

We compared this module with Yandex.Speller and GNU Aspell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

| Correction method | Precision | Recall | F-measure |
|---|---|---|---|
| Yandex.Speller | 83.09 | 59.86 | 69.59 |
| Our model with the provided language model | 51.92 | 53.94 | 52.91 |
| Our model with no language model | 41.42 | 37.21 | 39.20 |
| GNU Aspell, always first candidate | 27.85 | 34.07 | 30.65 |
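
The F-measure column is consistent with the balanced F1 score, the harmonic mean of precision and recall. That reading is an assumption on our part, since the table does not name the formula, but the short check below reproduces the reported values from the precision and recall columns.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
results = {
    "Yandex.Speller": (83.09, 59.86),
    "Our model with the provided language model": (51.92, 53.94),
    "Our model with no language model": (41.42, 37.21),
    "GNU Aspell, always first candidate": (27.85, 34.07),
}
for method, (precision, recall) in results.items():
    print(f"{method}: {f_measure(precision, recall):.2f}")
```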

Ways to improve