Content Extractor

Node module that takes a text file and splits it into two files based on a filter. The filter is applied to each line individually: lines that match go to one file and all other lines go to the other.

Usage

yarn add content-extractor

Example

Filtering a text file with the alpha filter. The filter receives the file one line at a time and decides whether each line matches the criteria.

const languageExtractor = require('content-extractor');

const filter = languageExtractor.containsOnlyAlphaCharacters;

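languageExtractor.extractContent(
  'input.txt', // file to read, one line at a time
  'alpha.txt', // receives the lines the filter accepts
  'everythingElse.txt', // receives all other lines
  filter
);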

Expected Input and Output

Input: input.txt

ABC
!!!

Apply containsOnlyAlphaCharacters filter

Output: alpha.txt

ABC

Output: everythingElse.txt

!!!

Providing your own filter

A filter is a function that takes a string (one line of the file) and returns a boolean. TypeScript example:

// Matches each individual digit in a line.
const numberRegEx: RegExp = /[0-9]/g;

// Accepts a line only if every character in it is a digit.
const numberFilter = (s: string): boolean => {
  const chars = s.match(numberRegEx) || [];
  return chars.length === s.length;
};
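
To make the line-by-line behaviour concrete, here is a minimal in-memory sketch in plain JavaScript. It is not the module's implementation: `splitLines` is a hypothetical helper that works on an array of lines instead of files, and `numberFilter` is a JavaScript port of the TypeScript filter above.

```javascript
// Hypothetical helper (not part of content-extractor): partitions lines into
// those the filter accepts and those it rejects, preserving their order.
function splitLines(lines, filter) {
  const matched = [];
  const rest = [];
  for (const line of lines) {
    (filter(line) ? matched : rest).push(line);
  }
  return { matched, rest };
}

// JavaScript port of numberFilter: every character must be a digit.
const numberFilter = (s) => (s.match(/[0-9]/g) || []).length === s.length;

const result = splitLines(['123', 'abc', '4a5'], numberFilter);
console.log(result.matched); // [ '123' ]
console.log(result.rest); // [ 'abc', '4a5' ]
```

With the real module, the same filter would be passed as the last argument to extractContent, as in the Example section.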

Provided Filters

Persian Filter

The Persian language filter is based on the Unicode range for Arabic script, so it cannot distinguish between languages that both use Arabic characters. It makes the following assumptions:
A single line is not expected to contain a mix of Persian and non-Persian text, so the presence of Persian characters indicates that the line is in Persian.
A line that contains only numbers (Hindu-Arabic numerals) cannot be assumed to be Persian.
Latin punctuation used in Persian, e.g. full stops, is not an indicator of Persian.
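
A rough sketch of a filter built on these rules might look like the following. It is an illustration only, not the module's code: the name `looksPersian` is hypothetical, and the choice of the Unicode Arabic block (U+0600 to U+06FF) as "the range for Arabic script" is an assumption.

```javascript
// Hypothetical sketch: a line counts as Persian if it contains at least one
// character from the Unicode Arabic block (U+0600 to U+06FF, assumed range).
const ARABIC_BLOCK = /[\u0600-\u06FF]/;

function looksPersian(line) {
  return ARABIC_BLOCK.test(line);
}

// '\u0633\u0644\u0627\u0645' is the word "salaam" written in Arabic script.
console.log(looksPersian('\u0633\u0644\u0627\u0645.')); // true
console.log(looksPersian('12345')); // false: ASCII digits alone are not an indicator
console.log(looksPersian('Hello.')); // false: Latin letters and punctuation only
```

Because the check looks only at the script, a line of Arabic-language text would also pass, which matches the limitation described above.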

Alpha Filter

Extracts lines that contain only the characters a-z. It is case-insensitive and rejects lines that contain whitespace.
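
The behaviour described here can be approximated with a single regular expression. This is a sketch under assumptions, not the source of containsOnlyAlphaCharacters: `onlyAlpha` is a hypothetical name, and rejecting empty lines is a choice made for this sketch.

```javascript
// Hypothetical stand-in for the alpha filter: one or more letters a-z,
// in either case, with nothing else on the line (no whitespace, digits, etc.).
const onlyAlpha = (line) => /^[a-z]+$/i.test(line);

console.log(onlyAlpha('ABC')); // true
console.log(onlyAlpha('abc def')); // false: contains whitespace
console.log(onlyAlpha('!!!')); // false: no letters
```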

susiecoleman/content-extractor