Sisyphe is a modular NodeJS big-data analyser & transformer

Sisyphe

Sisyphe is a generic NodeJS terminal application that recursively analyses folders. It is maintained as a (lerna) git monorepo.

Requirements

Tested with NodeJS@8.X, Redis@3.2.6

Works on Linux/OSX/Windows

Example of starting a quick local Redis with Docker:

docker run --name sisyphe-redis -p 6379:6379 redis:3.2.6
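Before launching Sisyphe it can help to confirm that Redis actually answers on the published port. The following is a minimal sketch, not part of Sisyphe: `isPong` and `pingRedis` are hypothetical helper names, and the host, port, and timeout defaults are assumptions matching the Docker command above. It uses only Node's built-in `net` module and sends Redis's inline `PING` command.

```javascript
const net = require('net');

// Returns true when a raw Redis reply is the "+PONG" simple string.
function isPong(reply) {
  return reply.toString().startsWith('+PONG');
}

// Hypothetical helper: resolves to true if a Redis server answers PING,
// and to false on connection error, unexpected reply, or timeout.
// Assumption: Redis listens on 127.0.0.1:6379 without authentication.
function pingRedis(host = '127.0.0.1', port = 6379, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    const finish = (ok) => {
      clearTimeout(timer);
      socket.destroy();
      resolve(ok);
    };
    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.on('error', () => finish(false));
    socket.on('connect', () => socket.write('PING\r\n'));
    socket.on('data', (chunk) => finish(isPong(chunk)));
  });
}
```

For example, `pingRedis().then((ok) => console.log(ok ? 'Redis is up' : 'Redis is not reachable'))` reports whether the container is ready.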

Install it

  1. Download the latest Sisyphe version
  2. Run npm install (this also triggers the npm postinstall script)
  3. ... that's it.

Test

npm run test runs the tests for Sisyphe & its workers

Help

./app.js --help prints the help

Options

-V, --version               output the version number
-n, --corpusname <name>     Corpus name
-s, --select <name>         Select all module to deal with
-c, --config-dir <path>     Configuration folder path
-t, --thread <number>       The number of process which sisyphe will take
-b, --bundle <number>       Regroup jobs in bundle of jobs
-r, --remove-module <name>  Remove module name from the workflow
-q, --quiet                 Silence output
-l, --list                  List all available workers
-h, --help                  output usage information

How does it work?

Just start Sisyphe on a folder with any files in it.

node app -n corpusname ~/Documents/customfolder/corpus

node app -n corpusname -c ~/Documents/customfolder/corpusResources ~/Documents/customfolder/corpus

Sisyphe now works in the background using all your computer's threads. Grab a coffee and wait; it will notify you when it's done :)

Sisyphe writes its results to sisyphe/out/{timestamp}-corpusname/ (errors, info, duration...)
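Since each run gets its own {timestamp}-corpusname folder, a small script can locate the most recent one. This is only a sketch under stated assumptions: `latestRun` and `listLatestResults` are hypothetical helpers, the timestamp prefix is assumed to be numeric (for example epoch milliseconds), and the output root is assumed to be `sisyphe/out` relative to the current directory.

```javascript
const fs = require('fs');
const path = require('path');

// Pick the most recent "{timestamp}-corpusname" folder name from a list.
// Assumption: the timestamp prefix is numeric (e.g. epoch milliseconds).
function latestRun(folderNames, corpusName) {
  const suffix = `-${corpusName}`;
  const runs = folderNames
    .filter((name) => name.endsWith(suffix) && name.length > suffix.length)
    .map((name) => ({ name, timestamp: Number(name.slice(0, -suffix.length)) }))
    .filter((run) => Number.isFinite(run.timestamp))
    .sort((a, b) => b.timestamp - a.timestamp);
  return runs.length > 0 ? runs[0].name : null;
}

// List the entries of the latest run folder for a corpus.
// Assumption: outDir points at the sisyphe/out folder.
function listLatestResults(corpusName, outDir = path.join('sisyphe', 'out')) {
  const latest = latestRun(fs.readdirSync(outDir), corpusName);
  return latest === null ? [] : fs.readdirSync(path.join(outDir, latest));
}
```

Calling `listLatestResults('corpusname')` from the directory that contains the sisyphe checkout would return the entries of the newest run folder, or an empty array if no run matches.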

For a control panel & a fully integrated app, go to Sisyphe-monitor. Sisyphe also has a server that lets you control it and get more information about its execution. Simply run the server with npm run server to access these features.

Modules

There is a list of default modules (focused on xml & pdf).

These URLs need to be updated once the branch is merged.

  • FILETYPE Will detect mimetype, extension, corrupted files...
  • PDF Will get info from PDF (version, author, meta...)
  • XML Will check whether files are well-formed and valid against their DTDs, get elements from tags...
  • XPATH Will generate a complete list of xpaths from the submitted folder
  • OUT Will export data to a JSON file & an ElasticSearch database
  • NB Tries to assign categories to an XML document by using its abstract
  • MULTICAT Tries to assign categories to an XML document by using its identifiers
  • TEEFT Tries to extract keywords from a full text
  • SKEEFT Tries to extract keywords from a structured full text by using the teeft algorithm and the text structure

Development on workers

When you work on a worker, just:

  • Commit your changes as usual
  • Run npm run updated (to check which workers have changed)
  • Run npm run publish (it will ask you to change the version of the worker module & publish it to GitHub)

Module information

Some bugs may occur with certain files when using the 'skeeft' module on Windows. Please deactivate it until we fix this.
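One way to apply this from a launcher script is to add the -r option only on Windows. The sketch below is an illustration, not Sisyphe's own tooling: `buildArgs` and `runSisyphe` are hypothetical names, and it assumes that -r accepts the module name exactly as written in the note above ('skeeft') and that the script is run from the Sisyphe folder.

```javascript
const { spawn } = require('child_process');

// Build the argument list for "node app", using only flags listed in --help.
function buildArgs(corpusName, corpusPath, platform = process.platform) {
  const args = ['app', '-n', corpusName];
  if (platform === 'win32') {
    // Assumption: -r takes the module name as spelled in the note ('skeeft').
    args.push('-r', 'skeeft');
  }
  args.push(corpusPath);
  return args;
}

// Hypothetical wrapper: start Sisyphe from its folder and forward its output.
// Assumption: sisypheDir is the folder that contains app.js.
function runSisyphe(corpusName, corpusPath, sisypheDir = process.cwd()) {
  return spawn('node', buildArgs(corpusName, corpusPath), {
    cwd: sisypheDir,
    stdio: 'inherit',
  });
}
```

Keeping the argument construction in a separate function makes the platform check easy to verify without actually starting a run.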