dwalker

Asynchronous rule-based file system walker. It:

  • does things most regexp-based walkers hardly can;
  • uses super simple rule definitions;
  • handles most file system errors by default;
  • provides powerful extendable API;
  • runs real fast.

This is what a simple demo app does on my old 2.7 GHz MacBook Pro:

Version 6 differs substantially from its predecessors.

The text below describes the usage, API and version history.

Usage

NB: This package needs Node.js v12.12 or higher.

Install with yarn or npm

yarn add dwalker   ## npm i -S dwalker

The following code walks all given directory trees in parallel, gathering basic statistics:

const walker = new (require('dwalker')).Walker()
const dirs = '/dev ..'.split(' ')

Promise.all(dirs.map(dir => walker.walk(dir))).then(res => {
  console.log('Done(%d):', res.length)
}).catch(error => {
  console.log('EXCEPTION!', error)
}).finally(() => {
  console.log(walker.stats)
})
// -> Done(1): { dirs: 8462, entries: 65444, errors: 2472, retries: 0, revoked: 0 }
// -> Elapsed: 1012 ms

What it does

The Walker#walk() method recursively walks the directory tree breadth-first. It scans all directory entries, invoking the handler functions as it goes and keeping track of its internal rules tree. For speed, all of this is done asynchronously.
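To illustrate the idea of an asynchronous breadth-first walk, here is a minimal sketch built only on Node.js `fs.promises.opendir` (available since v12.12). It is not dwalker's implementation: the `walkBreadthFirst` function, its `onEntry` callback signature and the shape of its `stats` object are assumptions made for this example, and it has no rule system at all.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper (not part of dwalker): visits directories level by level.
async function walkBreadthFirst (rootPath, onEntry) {
  const stats = { dirs: 0, entries: 0, errors: 0 }
  let level = [path.resolve(rootPath)]

  while (level.length > 0) {
    // Scan all directories of the current depth in parallel.
    const found = await Promise.all(level.map(async (dirPath) => {
      const subDirs = []
      try {
        const dir = await fs.promises.opendir(dirPath)
        stats.dirs += 1
        // Async iteration closes the directory handle when it finishes.
        for await (const entry of dir) {
          stats.entries += 1
          onEntry(dirPath, entry)
          if (entry.isDirectory()) subDirs.push(path.join(dirPath, entry.name))
        }
      } catch (error) {
        stats.errors += 1 // e.g. EACCES or ENOENT: count it and carry on
      }
      return subDirs
    }))
    // The sub-directories found at this depth form the next level.
    level = [].concat(...found)
  }
  return stats
}
```

A call such as `walkBreadthFirst('.', (dirPath, entry) => console.log(entry.name))` would print every entry name and resolve to the counters. Processing a whole level with `Promise.all` keeps the visiting order breadth-first while still letting directory reads overlap.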

Please have a glance at its core concepts, if you haven't done so already.

API

Contents: package exports, Walker, common helpers, special helpers, rule system

Package exports

Types referred to below are declared in src/typedefs.js.

Walker class

Most of the magic happens here. For details, see: methods, properties, class/static API, protected API, and exceptions handling.

constructor(options : {TWalkerOptions})

  • avoid : string | string[] - paths to be passed to the avoid() instance method.
  • interval : number= - instance property setting.
  • rules : * - rule definitions, or a Ruler instance to be cloned.
  • symlinks : boolean= - enable symbolic links checking by onEntry() handler.

A Walker instance stores the given options (even unrecognized ones) in its private _options property.

Walker instance methods

See the separate description of onDir(), onEntry() and onFinal() handler methods.

avoid(...path) : Walker - method
Injects the paths into the visited collection, thus preventing them from being visited. The arguments must be strings or arrays of strings containing absolute or relative paths.

getDataFor(dirPath) : * - method
For accessing the data in the internal dictionary. Empty entries are created there before calling the onDir() handler. The Walker itself does not use those values.

getOverride(error) : number - method
Returns an overriding action code (if any) for the current exception and its context. The Walker calls this method internally and assigns its numeric return value to error.context.override before calling its onError() method. A non-numeric return value has no effect. Instead of overriding this method, you can directly modify the overrides export of the package.

onError(error: Error, context: TDirContext) : * - method
Called with the trapped error after error.context has been set up. The default implementation just returns error.context.override. The returned action code is checked for special values; a non-numeric return value means the error was unexpected, and the walk promise is rejected.

The Walker may provide the following context.locus values: 'onDir', 'openDir', 'iterateDir', 'onEntry', 'closeDir', 'onFinal'. Overriding handlers may define their own locus names.

reset([hard : boolean]) : Walker - method
Resets a possible STC. With hard set, it resets all internal state properties, including those available via stats. Calling this method during a walk throws an unrecoverable error.

tick(count : number) - method
Called automatically during a walk. The default implementation does nothing. Override it for progress monitoring etc.

trace(handlerName, result, context, args) - method
Called right after every handler call. Use this for debugging only! Default is an empty function.

walk(startPath : string, [options : TWalkOptions]) : Promise - method
Walks the walk. The startPath may be any valid pathname and defaults to process.cwd(). Via options you can override the trace() method and any handler methods, as well as the data and ruler instance properties. The promise resolves to data or to a non-numeric return value from a handler, or rejects with an unexpected error instance.

Walker instance properties

duration : number - microseconds elapsed from start of the current walk batch or duration of the most recent batch.

failures : Error[] - any exceptions overridden during a walk. The Error instances in there will have a context : TDirContext property set.

ruler : Ruler - initial ruler instance for a new walk.

stats : Object r/o - general statistics as object with numeric properties:

  • dirs - number of visited directories;
  • entries - number of checked directory entries;
  • errors - number of exceptions encountered;
  • retries - number of operation retries (e.g. in case of out of file handles);
  • revoked - number of directories recognized as already visited (may happen with symlinks option set);
  • walks - number of currently active walks.

walks : number r/o - number of currently active walks.

Walker class methods and properties

All those are directly available via the package exports.

newRuler(...args) : Ruler - factory method.

overrides : Object - error override rules as a tree: ( locus -> error.code -> actionCode ).
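The tree shape ( locus -> error.code -> actionCode ) can be pictured with a small lookup sketch. Everything here is illustrative: the numeric values of `DO_SKIP` and `DO_ABORT`, the sample tree contents and the `lookupOverride` helper are assumptions for this example and are not taken from the package.

```javascript
// Hypothetical action codes; the real values are defined by dwalker itself.
const DO_SKIP = 1
const DO_ABORT = 2

// Assumed shape: locus -> error.code -> actionCode.
const sampleOverrides = {
  openDir: { EACCES: DO_SKIP, ENOENT: DO_SKIP },
  iterateDir: { EBADF: DO_ABORT }
}

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key)

// Returns the action code for a trapped error, or undefined when none applies.
function lookupOverride (tree, error) {
  const locus = error.context.locus
  if (!has(tree, locus) || !has(tree[locus], error.code)) return undefined
  return tree[locus][error.code]
}

// Adding one more entry changes how that error is treated at that locus.
sampleOverrides.openDir.EPERM = DO_SKIP
```

With a tree like this, an `EACCES` error raised at the 'openDir' locus maps to `DO_SKIP`, while the same error code at any other locus has no override.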

shadow : string[] - mask for omitting certain parts of the context parameter before injecting it into an Error instance for logging.

Walker protected API

Is described in a separate document.

Exceptions handling

The good news is: whatever happens during a walk, the Walker instance won't throw an exception!

If an exception occurs and there is an override defined for it, a new entry will be added to the failures instance property, and the walk will continue.

Without an override defined, however, we'll have an unexpected exception. In this case, the walk will terminate with an augmented Error instance via rejection, and the example program above would output something like this:

EXCEPTION! TypeError: Cannot read property 'filter' of undefined
    at ProjectWalker.onDir (/Users/me/dev-npm/nsweep/lib/ProjectWalker.js:111:38)
    at async doDir (/Users/me/dev-npm/nsweep/node_modules/dwalker/src/Walker.js:491:15) {
  context: {
    depth: 0,
    dirPath: '/Users/me/dev-npm/nsweep',
    done: undefined,
    locus: 'onDir',
    rootPath: '/Users/me/dev-npm/nsweep',
    override: undefined
  }
} 

An error stack combined with a walk context snapshot should be enough to spot the bug.
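The two outcomes described in this section (an overridden error is recorded and the walk goes on, an unexpected one rejects the promise) can be sketched as follows. This is a simplified model and not the Walker's code: `guardedStep`, the sample `overrides` tree and the value of `DO_SKIP` are hypothetical names and values introduced for the example.

```javascript
// Hypothetical action code and override tree; dwalker defines its own.
const DO_SKIP = 1
const overrides = { openDir: { EACCES: DO_SKIP } }
const failures = []

// Runs one step of a walk; `locus` names the step for error reporting.
async function guardedStep (locus, dirPath, step) {
  try {
    return await step(dirPath)
  } catch (error) {
    // Attach a context snapshot, so the error is easier to trace later.
    error.context = { dirPath, locus, override: undefined }
    const action = (overrides[locus] || {})[error.code]
    // No numeric override: unexpected error, the returned promise rejects.
    if (typeof action !== 'number') throw error
    // Overridden error: record it and let the caller continue.
    error.context.override = action
    failures.push(error)
    return action
  }
}
```

In this model an `EACCES` error during 'openDir' ends up in `failures` and yields `DO_SKIP`, whereas a `TypeError` thrown from a handler propagates as a rejection with `error.context` attached.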

Common helpers

These helpers are available via package exports and may be useful when writing handlers.

checkDirEntryType(type : TEntryType) : TEntryType - function
Returns the argument if it is a valid type code; throws an assertion error otherwise.

dirEntryTypeToLabel(type : TEntryType, [inPlural : boolean]) : string - function
Returns a human-readable type name for a valid type; throws an assertion error otherwise.

makeDirEntry(name : string , type : TEntryType, [action : number]) : TDirEntry - function
Constructs and returns a new directory entry with action defaulting to DO_NOTHING.

makeDirEntry(nativeEntry : fs.Dirent) : TDirEntry - function
Returns a new directory entry based on the native Node.js one.

Special helpers

To use these helpers, load them first, like:

const symlinksFinal = require('dwalker/symlinksFinal')

pathTranslate(path, [absolute]) : string - function
Translates the path from POSIX to native format and resolves a leading '~' to the user's home directory. If absolute is on, it makes the path absolute, always ending with a path separator.

relativize(path, [rootPath, [prefix]]) : string - function
Strips the rootPath part (defaulting to homeDir) from the given path, if it is there. An optional prefix string will be applied to the resulting relative path. This may help to make some reports easier to read.

relativize.homeDir : string - initialized to current user's home directory.
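The description of relativize() can be pictured with a small stand-alone sketch. It is an approximation written for this text, not the helper's source: the name `relativizeSketch` is hypothetical, and details such as how the separator after rootPath is treated are assumptions.

```javascript
const os = require('os')
const path = require('path')

// Hypothetical stand-in for relativize(); exact edge-case behavior is assumed.
function relativizeSketch (fullPath, rootPath = os.homedir(), prefix = '') {
  // Compare against the root with a trailing separator, so that
  // '/a' does not accidentally match '/ab/c'.
  const root = rootPath.endsWith(path.sep) ? rootPath : rootPath + path.sep
  return fullPath.startsWith(root)
    ? prefix + fullPath.slice(root.length)
    : fullPath
}

// Shorten a path under the home directory for a report line.
const sample = path.join(os.homedir(), 'dev', 'app.js')
console.log(relativizeSketch(sample, undefined, '~' + path.sep))
```

Paths outside the root are returned unchanged in this sketch, so a report can mix shortened and full paths safely.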

symlinksFinal(entries, context) : * - async handler
Use it inside an onFinal handler for following symbolic links. Example:

const onFinal = function (entries, context) {
  return this._useSymLinks
    ? symlinksFinal.call(this, entries, context) : Promise.resolve(0)
}

Rule system

The main goal here was to keep rules simple (atomic), even when describing context-sensitive rules and special exclusions.

Rule definitions are tuples (action-code, {pattern}), quite similar to bash glob patterns or .gitignore rules. Example:

ruler.add(
  DO_SKIP, '.*', '!/.git/', 'node_modules/', 'test/**/*',
  11, 'package.json', '/.git/', '/LICENSE;f', '*;l')

Here the first rule tells the walker to ignore the dreaded node_modules directory and any entries starting with '.', except the top-level .git directory. Also, nothing under a test directory, wherever it is found, will count. A trailing '/' indicates a directory.

The second rule asks for some sort of special care to be taken for all package.json entries with no regard to their type, for top-level .git directory, for top-level LICENSE file and for all symbolic links. And, yes, the .weirdos/package.json will be ignored.

Without an explicit type, all rules created are typeless or T_DIR ('d'). An explicit type must match one of those in the S_TYPES constant.

Behind the scenes, a Ruler instance creates and interprets a rule tree formed as an array of records (type, expression, ancestorIndex, actionCode). For the above example, the Ruler dump would look like this:

       node typ regex            parent  action
      -----+---+-----------------------+-------------
         0: 'd' null,               -1,  DO_NOTHING,
         1: ' ' /^\./,               0,  DO_SKIP,
         2: 'd' /^\.git$/,          -1, -DO_SKIP,
         3: 'd' /^node_modules$/,    0,  DO_SKIP,
         4: 'd' /^test$/,           -1,  DO_NOTHING,
         5: 'd' null,                4,  DO_NOTHING,
         6: ' ' /./,                 5,  DO_SKIP,
         7: ' ' /^package\.json$/,   0,  11,
         8: 'd' /^\.git$/,          -1,  11,
         9: 'f' /^LICENSE$/,        -1,  11,
        10: 'l' /./,                 0,  11,
_ancestors: [ [ 0, -1 ] ]

The internal ancestors array contains tuples (actionCode, ruleIndex).

The Ruler#check() method, typically called from Walker#onEntry(), finds all rules matching the given entry (name, type) and fills in the lastMatch array, analogous to the ancestors array. Then it returns the most prominent (the highest) action code value. DO_SKIP and other system action codes prevail over user-defined codes simply because they have higher values.

A negative value screens the actual one. Do not use negative values in rule definitions - the ruler will do this for you, when it encounters a pattern starting with '!'.
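The "highest code wins, a negative code screens" idea can be sketched in a few lines. This is a deliberately flattened model and not the Ruler's algorithm: it ignores the ancestor tree entirely (so "top-level" and nested patterns are not distinguished), the `checkSketch` function is hypothetical, and the numeric values of `DO_NOTHING` and `DO_SKIP` are assumptions.

```javascript
// Hypothetical values; only their ordering matters for this sketch.
const DO_NOTHING = 0
const DO_SKIP = 1000

// A flat subset of the example rules; ' ' stands for "any entry type".
const flatRules = [
  { type: ' ', regex: /^\./, action: DO_SKIP },
  { type: 'd', regex: /^\.git$/, action: -DO_SKIP }, // from the '!' pattern
  { type: 'd', regex: /^node_modules$/, action: DO_SKIP },
  { type: ' ', regex: /^package\.json$/, action: 11 },
  { type: 'd', regex: /^\.git$/, action: 11 }
]

function checkSketch (name, type) {
  // Collect the action codes of all rules matching the entry.
  const matched = flatRules
    .filter(r => (r.type === ' ' || r.type === type) && r.regex.test(name))
    .map(r => r.action)
  // A negative code screens (cancels) the same positive code.
  const screened = new Set(matched.filter(a => a < 0).map(a => -a))
  const effective = matched.filter(a => a > 0 && !screened.has(a))
  // The highest remaining code wins; DO_NOTHING when nothing is left.
  return Math.max(DO_NOTHING, ...effective)
}
```

Under these assumptions `checkSketch('.git', 'd')` yields 11, because the screened DO_SKIP no longer outranks the user-defined code, while `checkSketch('.env', 'f')` yields DO_SKIP.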

Sub-directories opened later will inherit new Ruler instances with ancestors set to the lastMatch contents from the upper level. So, the actual rule matching is trivial, and the rules can be switched dynamically.

For further details, check the Ruler reference and the special demo app.

Version history

  • v6.0.0 @20201225
    • cleaned code and API (breaking changes) after using dwalker in some actual projects, so the basic use cases are clear now. As the general concepts persist, migration should not be a major headache, and reading the updated core concepts should help.
  • v5.2.0 @20201202
    • added: Walker#getOverride instance method.
  • v5.1.0 @20201121
    • removed: hadAction(), hasAction() Ruler instance methods.
  • v5.0.0 @20201120
    • Walker totally re-designed (a breaking change);
    • Ruler#check() refactored (a non-breaking change);
    • documentation and examples re-designed.
  • v4.0.0 @20200218
    • several important fixes;
    • Walker throws an error if a handler returns an illegal action code;
    • added: Walker#expectedErrors, removed: Walker#getMaster;
    • added: check(), hadAction(), hasAction() to Ruler, removed: match();
    • up-to-date documentation.
  • v3.1.0 @20200217
  • v3.0.0 @20200211
  • v2.0.0 @20200126
  • v1.0.0 @20200124
  • v0.8.3 @20200123: first (remotely) airworthy version.