- Removes extra blank lines.
- Removes soft-hyphens followed by new line (this typically means multi-line words).
- Searches and replaces a list of words:
The script takes a csv (replacelist.csv) that carries words to be replaced, and replacement words. - Regular expression based replacement:
- Allows for 0-X consecutive errors within a word.
- Takes wordlist.csv that carries words and X for each word
- For instance if a row in wordlist.csv reads: Available,1
- Av.{0,1}??[\r\n]*ilable ==> Available
- Ava.{0,1}??[\r\n]*lable ==> Available
Clone this repository:
git clone https://github.com/soodoku/search-and-replace.git
Navigate to search-and-replace
Run python setup.py install
The script expects the following two files in the same directory:
- replacelist.csv -- carries word pairs (original_word, replace_with_this_word). (Sample replacelist.csv.)
- wordlist.csv -- carries the correct word, and number of consecutive errors tolerated. All the variously misspelled words will be replaced with the correct word. (Sample wordlist.csv.)
postprocess.py [options] source_txt_directory
Options:
-h, --help show this help message and exit
-o OUTDIR, --outdir=OUTDIR
Text output directory (default: postprocessed)
-r, --resume Resume postprocessing (Skip if existing) (default:
False)
python postprocess.py txt_dir
The script will be post process all text files in 'txt_dir' directory and save the output file to the 'postprocessed' directory. Sample input and sample output.
The script can be used for fixing dirty data. For instance, one application for the script is postprocessing dirty OCR data. See more at: A Quick Scan: From Paper to Digital.
Scripts are released under the MIT License.