ContextSensitiveSpellingCorrection

Index

Introduction

In this assignment, I use Java to write a program that performs context-sensitive spelling
correction. The task is to detect spelling errors that result in valid words, for example adapt written where adopt was intended.

Method Used

Logistic regression, trained with the gradient ascent algorithm.
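As a rough illustration of the method, the sketch below trains a logistic regression model over sparse binary features with batch gradient ascent on the log-likelihood. The class name `LogisticRegressionSketch`, the learning rate, the iteration count, the absence of a bias term and the convention that label 1 means word1 are all assumptions made for this example; they are not taken from sctrain.java.

```java
import java.util.*;

/** Minimal logistic regression over sparse binary features (illustrative sketch). */
final class LogisticRegressionSketch {
    // One weight per feature name; features never seen in training have weight 0.
    final Map<String, Double> weights = new HashMap<>();

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** P(label = 1 | features): sigmoid of the summed weights of the active features. */
    double probability(Set<String> features) {
        double z = 0.0;
        for (String f : features) {
            z += weights.getOrDefault(f, 0.0);
        }
        return sigmoid(z);
    }

    /** Batch gradient ascent; labels[i] is 1 for word1 and 0 for word2 (assumed convention). */
    void train(List<Set<String>> examples, int[] labels, double learningRate, int iterations) {
        for (int it = 0; it < iterations; it++) {
            // Gradient of the log-likelihood: sum over examples of (y - p) for each active feature.
            Map<String, Double> gradient = new HashMap<>();
            for (int i = 0; i < examples.size(); i++) {
                double error = labels[i] - probability(examples.get(i));
                for (String f : examples.get(i)) {
                    gradient.merge(f, error, Double::sum);
                }
            }
            // Move every weight a small step in the direction of its gradient.
            for (Map.Entry<String, Double> g : gradient.entrySet()) {
                weights.merge(g.getKey(), learningRate * g.getValue(), Double::sum);
            }
        }
    }
}
```

Because each feature is either present or absent, the weighted sum reduces to adding up the weights of the features that occur in the sentence.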

Features Used

Surrounding Words Feature: Each word that appears in the sentence containing the confusable word w is a feature. All surrounding words are converted to lowercase, and stop words and punctuation symbols are removed.
Collocation Feature: A collocation $C_{i,j}$ is an ordered sequence of words in the local, narrow context of the confusable word w. Offsets i and j denote the starting and ending positions (relative to w) of the sequence, where a negative (positive) offset refers to a word to its left (right). The collocation features used are $C_{1,1}$, $C_{-1,-1}$, $C_{-2,-1}$ and $C_{1,2}$, because the words immediately before and after the confusable word have a fairly high chance of co-occurring with it (Hassel 1990). The smaller sets ($C_{1,1}$, $C_{-1,-1}$, $C_{-2,-1}$) and ($C_{1,1}$, $C_{-1,-1}$) were also tested, because they are also believed to be reliable features (Hassel 1990).
Stop-word filtering is not applied to the collocation features because, as suggested by Hassel (1990), it does not produce good results.
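The two feature types described above can be sketched as follows. This is only an illustration: the class name `FeatureExtractorSketch`, the tiny stop-word list, the `W=` and `C[i,j]=` feature labels, the punctuation test and the lowercasing of collocations are assumptions made for the example, not the encoding used by sctrain.java.

```java
import java.util.*;

/** Extracts surrounding-word and collocation features from one marked sentence (sketch). */
final class FeatureExtractorSketch {
    // Tiny illustrative stop-word list (an assumption; the list actually used is not given).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "in", "and", "by", "to", "of"));

    static Set<String> extract(String line) {
        List<String> tokens = Arrays.asList(line.trim().split("\\s+"));
        int open = tokens.indexOf(">>");
        int close = tokens.indexOf("<<");
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Missing >> << markers: " + line);
        }
        // Skip the leading sentence id, then split the context around the marked word.
        List<String> left = lower(tokens.subList(Math.min(1, open), open));
        List<String> right = lower(tokens.subList(close + 1, tokens.size()));

        Set<String> features = new LinkedHashSet<>();

        // Surrounding words: lowercase, no stop words, no pure punctuation tokens.
        List<String> context = new ArrayList<>(left);
        context.addAll(right);
        for (String w : context) {
            boolean hasLetterOrDigit = w.matches(".*[a-z0-9].*");
            if (hasLetterOrDigit && !STOP_WORDS.contains(w)) {
                features.add("W=" + w);
            }
        }

        // Collocations C(-1,-1), C(-2,-1), C(1,1), C(1,2): no stop-word filtering here.
        int n = left.size();
        if (n >= 1) features.add("C[-1,-1]=" + left.get(n - 1));
        if (n >= 2) features.add("C[-2,-1]=" + left.get(n - 2) + " " + left.get(n - 1));
        if (right.size() >= 1) features.add("C[1,1]=" + right.get(0));
        if (right.size() >= 2) features.add("C[1,2]=" + right.get(0) + " " + right.get(1));
        return features;
    }

    private static List<String> lower(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

The word between the markers is deliberately ignored, so the same method works for training sentences and for test sentences where the marked slot is empty.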

Structure:

This assignment consists of three programs: sctrain.java, sctest.java and Evaluation.java.

sctrain.java:
- Trains the model for a pair of confusable words. To run it, use the following command in a Unix shell (for example, over SSH):
- java sctrain word1 word2 train_file model_file
- word1 and word2 are the confusable words, for example adapt and adopt.
- train_file is a file containing the training sentences, for example:
- 0144 Hungary joins the European Union in May 2004 and could >> adopt << the euro by 2008 .
- model_file contains the features and weights computed by the training process. Each line of the model file has the format feature:==:weight, for example:
- big >>:==:-0.013325415928373883
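The feature:==:weight line format described above could be written and read back roughly as follows. The class `ModelFileSketch` and its method names are hypothetical, and splitting on the last occurrence of `:==:` is a design choice made for this sketch so that feature names containing a colon still parse.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

/** Reads and writes model lines of the form feature:==:weight (illustrative sketch). */
final class ModelFileSketch {
    static final String SEPARATOR = ":==:";

    static String formatLine(String feature, double weight) {
        return feature + SEPARATOR + weight;
    }

    static Map.Entry<String, Double> parseLine(String line) {
        // Split on the last separator so the feature text itself may contain colons.
        int cut = line.lastIndexOf(SEPARATOR);
        if (cut < 0) {
            throw new IllegalArgumentException("Not a model line: " + line);
        }
        String feature = line.substring(0, cut);
        double weight = Double.parseDouble(line.substring(cut + SEPARATOR.length()));
        return new AbstractMap.SimpleEntry<>(feature, weight);
    }

    static void save(Map<String, Double> weights, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            lines.add(formatLine(e.getKey(), e.getValue()));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    static Map<String, Double> load(Path file) throws IOException {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            Map.Entry<String, Double> e = parseLine(line);
            weights.put(e.getKey(), e.getValue());
        }
        return weights;
    }
}
```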
sctest.java:
- Predicts the confusable word for each sentence in the test file. To run it, use the following command in a Unix shell (for example, over SSH):
- java sctest word1 word2 test_file model_file actual_file
- word1 and word2 are the confusable words.
- test_file has the same format as train_file, except that the confusable word is not stated, for example:
- 0501 The decree allows the government to >> << unusual political , military and tax measures with an aim to restoring order .
- model_file is the model file produced by sctrain.
- For each test sentence in test_file, actual_file contains one line giving the test sentence id and the disambiguated confusable word as determined by the logistic regression classifier:
- 0501 adopt
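The prediction step can be pictured with the minimal sketch below, which turns one feature set and a weight map into an output line of the form shown above. The class `PredictorSketch`, the 0.5 threshold, the `W=` feature labels and the convention that a high probability means word1 are assumptions for illustration; sctest.java may organise this differently.

```java
import java.util.*;

/** Produces one "id word" output line from features and trained weights (sketch). */
final class PredictorSketch {
    static String predictLine(String id, Set<String> features, Map<String, Double> weights,
                              String word1, String word2) {
        double z = 0.0;
        for (String f : features) {
            // Features that never occurred in training contribute nothing.
            z += weights.getOrDefault(f, 0.0);
        }
        double probabilityOfWord1 = 1.0 / (1.0 + Math.exp(-z));
        // Assumed convention: the model estimates P(word1); otherwise choose word2.
        return id + " " + (probabilityOfWord1 >= 0.5 ? word1 : word2);
    }
}
```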
Evaluation.java:
- Checks the accuracy of the actual_file generated by sctest. To run it, use the following command:
- java Evaluation answer_file actual_file
- answer_file contains the expected correct answers and actual_file contains the answers generated by sctest. The program reports an accuracy value.
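The accuracy check might look like the sketch below. It assumes that answer_file uses the same "id word" line format as actual_file, which the description above does not state explicitly, and that accuracy is the fraction of answer ids whose predicted word matches. `AccuracySketch` is a hypothetical name.

```java
import java.util.*;

/** Computes the fraction of sentence ids whose predicted word matches the answer (sketch). */
final class AccuracySketch {
    /** Parses lines of the assumed form "id word" into an id-to-word map. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> wordById = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                wordById.put(parts[0], parts[1]);
            }
        }
        return wordById;
    }

    static double accuracy(List<String> answerLines, List<String> actualLines) {
        Map<String, String> expected = parse(answerLines);
        Map<String, String> actual = parse(actualLines);
        if (expected.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Map.Entry<String, String> e : expected.entrySet()) {
            // A missing prediction counts as wrong.
            if (e.getValue().equals(actual.get(e.getKey()))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }
}
```

Matching by sentence id rather than by line position keeps the result correct even if the two files list sentences in a different order.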
