/content-extractor

Extract content from text files using a filter

Primary LanguageTypeScriptGNU General Public License v3.0GPL-3.0

Content Extractor

Node module that takes a text file and splits it into 2 files based on a filter. The split is done at the level of an individual line in the file.

Usage

yarn add content-extractor

Example

Filtering a text file based on a alpha filter. The filter will be fed the text file a line at a time and make a decision about if it matches the criteria or not.

const languageExtractor = require('content-extractor');

const filter = languageExtractor.containsOnlyAlphaCharacters;

languageExtractor.extractContent(
  'input.txt',
  'alpha.txt',
  'everythingElse.txt',
  filter
);

Expected Input and Output

Input: input.txt

ABC
!!!

Apply containsOnlyAlphaCharacters filter

Output: alpha.txt

ABC

everythingElse.txt

!!!

Providing your own filter

A filter is a function that takes a string and returns a boolean. Typescript example:

const numberRegEx: RegExp = new RegExp(/[0-9]/, 'g');

const numberFilter = (s: string): boolean => {
  const chars = s.match(numberRegEx) || [];
  return chars.length === s.length;
};

Provided Filters

Persian Filter

  • The Persian language filter is based on the Unicode Range for Arabic Script. The API will not be able to filter on different languages if they both use Arabic characters

  • That a single line would not be expected to contain a mix of Persian and Non Persian and that the presence of Persian characters is an indicator of the line being in Persian

  • That a line that contains only numbers (Hindu-Arabic numerals) cannot be assumed to be Persian.

  • That latin punctuation used in Persian, e.g. full stops, should not be used as an indicator of Persian

Alpha Filter

  • Extracts text that contains only the characters a-z. It is not case sensitive and rejects text that contains whitespace.