Spelling and Grammar Checker Plug-in for DITA-OT

See the LICENSE file in the root of this project for license details.

Introduction

The Spelling and Grammar Checker plug-in for DITA OT is an extension of the base DITA Validator which adds simple rule-based spelling and grammar validation for the text elements within DITA documents.

The plug-in supports three transtypes:

auto-correct - remove reported spelling and grammar errors from a DITA document
text-rules - create an error report in Schematron Validation Report Language (SVRL) format
text-rules-echo - display the results of an SVRL error report in a human-readable format

More information about SVRL can be found at www.schematron.com

Most of the spell-checking rules are based on a list of known typographical errors and faults, and the ruleset can be easily altered to include new constraints. Checking against a list of known errors means that no false positives should occur, but the existing list will never be fully comprehensive.

Prerequisites

Requirements

The DITA spell-checker has been tested against DITA-OT 2.2.x. It is recommended that you upgrade to the latest version. Running the spell-checker plug-in against earlier versions of DITA-OT will not work as it uses the newer getVariable template. To work with DITA-OT 1.8.5 this would need to be refactored to use getMessage.

Installing DITA-OT

The spell-checker is a plug-in for the DITA open toolkit. Futhermore, it is not a stand alone plug-in as it extends the base validator plug-in (com.here.validate.svrl).

install the DITA-OT distribution JAR file dependencies by running gradle install from your clone of the DITA-OT repository.

The required dependencies are installed to a local Maven repository in your home directory under .m2/repository/org/dita-ot/dost/.

Run the Gradle distribution task to generate the plug-in distribution package:

./gradlew dist

The distribution ZIP file is generated under build/distributions.

Installing the Base Plug-in

Run the plug-in installation command:

dita -install https://github.com/heremaps/com.here.validate.svrl/archive/v1.0.0.zip

Installing the Spelling and Grammar Checker Plug-in

Run the plug-in installation command:

dita -install https://github.com/heremaps/com.here.validate.svrl.text-rules/archive/v1.0.0.zip

The dita command line tool requires no additional configuration.

Usage

Spell-checking a document from the command line

A test document can be found within the plug-in at: PATH_TO_DITA_OT/plugins/com.here.validate.svrl.text-rules/sample

Creating an SVRL file

To create an SVRL file with the spell-checker plug-in use the text-rules transform.

From a terminal prompt move to the directory holding the document to validate
Clean the output directory (named "out" in the examples below), to ensure that the result from an old validation run is not present.

rm -rf ./out

SVRL file creation can be run like any other DITA-OT transform:

PATH_TO_DITA_OT/bin/dita -f text-rules -o out -i document.ditamap

Once the command has run, an SVRL file is created:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<svrl:schematron-output>
<active-pattern xmlns:dita-ot="http://dita-ot.sourceforge.net/ns/201007/dita-ot"
                   name="grammar-and-spelling"
                   role="content">
      <fired-rule context="p"/>
      <fired-rule context="p"/>
      <fired-rule context="p"/>
      <failed-assert location="/incorrect-spelling.dita" role="error">
         <diagnostic-reference diagnostic="incorrect-spelling">Line 17: p - [incorrect-grammar]
The phrase 'separate' is written incorrectly ('seperate') in the following text:
Seperate accommodation can be found within the main building..</diagnostic-reference>
      </failed-assert>
		</active-pattern>
</svrl:schematron-output>

Echoing results to the command line

To echo results to the command line with the spell-checker plug-in use the text-rules-echo transform.

Clean the output directory (named "out" in the examples below), to ensure that the result from an old text-rules run is not present.

rm -rf ./out

Spell-checking (text-rules-echo) can be run like any other DITA-OT transform:

PATH_TO_DITA_OT/bin/dita -f text-rules-echo -i document.ditamap

Once the command has run, all errors and warnings are echoed to the command line:

[echo] [ERROR]  [/topics/incorrect-spelling.dita]
[echo]   Line 17: p - [incorrect-spelling]
[echo] The word 'separate' is spelt incorrectly ('seperate') in the following text:
[echo]
[echo] Seperate accommodation can be found within the main building.

Additionally, if an error occurs, the command will fail

[echo] [ERROR]  [/topics/incorrect-spelling.dita]
[echo]   Line 17: p - [incorrect-spelling]
[echo] The word 'separate' is spelt incorrectly ('seperate') in the following text:
[echo]
[echo] Seperate accommodation can be found within the main building.
[echo] Found 1 Errors 0 Warnings
Error: Errors detected during validation

Auto-correction from the command line

To auto-correct spelling mistakes with the spell-checker plug-in use the auto-correct transform.

Clean the output directory (named "out" in the examples below), to ensure that the result from an old run is not present.

rm -rf ./out

Auto-correction (auto-correct) can be run like any other DITA-OT transform:

PATH_TO_DITA_OT/bin/dita -f auto-correct -i document.ditamap

Once the command has run spelling mistakes will have been removed from the document.

Note: The auto-correct transtype only removes spelling, duplicate and grammar errors specified in the dictionary files

Spell-checking a document from the using `ant`

An ant build file is supplied in the same directory as the sample document. The main target can be seen below:

<!-- The path to dita-ot, correct as necesary  -->
<dirname property="dita.dir" file="path-to-dita-ot"/>
<!--the path to the DITA document to build, change as necessary -->
<property name="args.input" value="path-to-doc/document.ditamap"/>
<!-- Minimal classpath to invoke DITA OT via ANT. -->
<path id="dita.ot.classpath">
	... etc..
</path>
<target name="validate">
	<java classname="org.apache.tools.ant.launch.Launcher" fork="true" failonerror="true" classpathref="dita.ot.classpath">
		<arg value="-Dargs.input=${args.input}"/>
		<arg value="-Ddita.dir=${dita.dir}"/>
		<arg value="-buildfile"/>
		<arg value="${dita.dir}/build.xml"/>
		<arg value="-Dgenerate-debug-attributes=false"/>
		<arg value="-Doutput.dir=out/svrl"/>
		<arg value="-Dtranstype=text-rules-echo"/>
		<!-- validation transform specific parameters -->
		<arg value="-Dargs.validate.blacklist=(kilo)?metre|colour|teh|seperate"/>
		<arg value="-Dargs.validate.check.case=Bluetooth|HTTP[S]? |ID|IoT|JSON|Java|Javadoc|JavaScript|XML"/>
		<arg value="-Dargs.validate.mode=default" />
		<!-- Run the transform quietly to avoid verbose output. -->
		<arg value="-S"/>
		<arg value="-q"/>
	</java>
</target>

Parameter Reference

args.validate.ignore.rules - Comma separated list of rule IDs to be ignored
args.validate.blacklist - Comma separated list of words that should not be present in the running text
args.validate.check.case - Comma separated list of words which have a specified capitalization
args.validate.mode - Validation reporting mode. The following values are supported:
- strict - Outputs both warnings and errors. Fails on errors and warnings.
- default - Outputs both warnings and errors. Fails on errors only
- lax - Ignores all warnings and outputs errors only. Fails on Errors only
svrl.customization.dir - Specifies the customization directory
svrl.filter.file - Specifies the location of the XSL file used to filter the echo output

Configuring the plug-in

Internationalization

The spelling and grammar checker currently supports three languages:

en - English
de - German
fr - French

The language checked is based on the default.language setting of the DITA Open toolkit. This can be modified in the lib/configuration.properties file.

Please note that error messages have not been internationalized into French and currently all error messages will be displayed in English.

Sample lists of duplicated and mis-spelt words are available in all three languages. Grammar and punctuation lists have only been supplied for en.

Additional languages can be added by creating a new language folder and files under cfg/dictionary

Adding new mis-spellings to the plug-in

The list of misspelt words to check when spell-checking can be altered by amending entries in the xml files under cfg/dictionary. The plug-in recognizes four types of errors:

duplicates.xml - Duplicated words.
grammar.xml - Grammar errors (includes a ban on the use of contractions in formal text )
punctuation.xml - Punctuation marks (includes a ban on smart quotes).
spelling.xml - Spelling mistakes

Each entry takes the form of a pair

<entry>
	<mistake>accessable</mistake>
	<corrected>accessible</corrected>
</entry>
<entry>
	<mistake>acident</mistake>
	<corrected>accident</corrected>
</entry>
<entry>
	<mistake>accidentaly</mistake>
	<corrected>accidentally</corrected>
</entry>

Altering the severity of a validator rule

The severity of a validator rule can be altered by amending entries in the cfg/ruleset/default.xml file The plug-in supports four severity levels:

FATAL - Fatal rules will fail validation and cannot be overridden.
ERROR - Error rules will fail validation. Errors can be overridden as described above.
WARNING - Warning rules will display a warning on validation, but do not fail the validation. Warnings can also be individually overridden.
INACTIVE - Inactive rules are not applied.

Ignoring validator Rules

Removing a rule globally

Rules can be made inactive by altering the severity (see above). Alternatively a rule can be commented out in the XSL configuration file.

Ignoring a rule throughout a document

Individual rules can be ignored by passing the args.validate.ignore.rules parameter to the command line. The value of the parameter should be a comma-delimited list of each rule-id to ignore.

For example to ignore the latin-abbreviation validation rule within a document you would run:

PATH_TO_DITA_OT/dita -f text-rules-echo -i document.ditamap -Dargs.validate.ignore.rules=latin-abbreviation

Ignoring a specific instance of a rule

Specific instances of a rule can be ignored by adding a comment within the *.dita file. The comment should start with ignore-rule and needs to be added at the location where the error is flagged.

<!--
	This is an example of a spelling mistake
-->
<p>
	<!-- ignore-rule:incorrect-spelling -->
	I have deliberately misspelt the word accidentaly (sic) - it should be written with a double l.
</p>

Sample Document

A sample document can be found within the plug-in which can used to test plug-in rules. The document covers both positive and negative test cases. The sample document contains valid DITA which can be built as an HTML or as a PDF document - please use the html or pdf transform to read the contents or examine the *.dita files directly.

A complete list of rules covered by the plug-in can be found below. The final <chapters> of the sample document contain a set of test DITA <topics>, each demonstrating a broken validation rule.

The <topic> files are sorted as follows:

The base validation DITA-OT plugin (com.here.validate.svrl) - this <chapter> contains two common textual validation rules.
The text-rules DITA-OT plugin (com.here.validate.svrl.text-rules) - This <chapter> contains a set of English language spelling and grammar rules.

Spell-checker Error Messages

The following table list the spell-checker error messages by message ID.

Message ID	Message	Corrective Action/Comment
a-followed-by-vowel	In the following text, change 'a' to 'an' where appropriate:	In English, the general guideline is that the indefinite article in front of count nouns that begin with a vowel sound should be 'an'.
an-followed-by-consonant	In the following text, change 'an' to 'a' where appropriate:	In English, the general guideline is that the indefinite article in front of count nouns that begin with a consonant sound should be 'a'.
blacklisted-word	The word '{word}' is not allowed in the following text:	The indicated word has been banned from use. Rewrite the phrase using alternatives.
duplicated-punctuation	The punctuation mark '{char}' is duplicated in the following text:	The indicated punctuation mark is duplicated. Delete the duplicated punctuation mark.
duplicated-words	The word '{word}' is duplicated in the following text:	The indicated word is duplicated. Delete the duplicated word..
incorrect-capitalization	The word '{word}' is incorrectly capitalized in the following text:	The indicated word is not capitalized correctly. Fix the capitalization.
incorrect-grammar	The phrase '{phrase}' is grammatically incorrect in the following text:	The indicated phrase does not make sense. Rewrite the phrase using the correct grammar.
incorrect-punctuation	The punctuation mark '{char}' has been used incorrectly in the following text:	The indicated punctuation mark is non-standard. Replace with a corrected punctuation mark.
incorrect-spelling	The word '{word}' is a spelling mistake in the following text:	The indicated word is not spelled correctly. Fix the spelling. It is assumed that documentation follows US English spelling conventions.
latin-abbreviation	The accronym i.e or e.g has been used in the following text:	Latin accronyms are difficult to understand. Consider rewriting the phrase using alternatives, such as "for example"
sentence-capitalization	The run on sentence in the following text does not start with a capital letter	The indicated sentence is not punctuated correctly. Fix the punctuation.
split-inifinitive	The phrase '{phrase}' is written using a split-infinitive in the following text:	The indicated sentence includes a split-infinitive, which is considered poor grammatical style - consider rephrasing the sentence.
where-not-were	The word 'were' has been used to start a subordinate clause in the following text:	The indicated sentence does not make sense. Rewrite the phrase using the correct grammar.

heremaps/com.here.validate.svrl.text-rules