
File System Crawler

This is a file system crawler with a REST API interface for interacting with the file system at any user-defined location.

Features!

  • Crawl the file system at any user-defined location
  • Extract text data from most UTF-8 encoded files, such as .txt, .md, .config, .xml, .html, .js, etc.
  • Extract text data from pdf files
  • Extract text from image files by performing OCR on them
  • Extract text data from office files, such as .docx, .pptx, .xlsx, .odt, .odp, .ods
  • Set a custom decompression location for office files, for environments where Node.js has restricted write access

Scroll down for instructions on using it as a standalone (non-npm) package with the REST API interface

Usage as npm package

npm i filesyscrawler

To import this package into your project

const fileSysCrawler = require("filesyscrawler");

To set the crawl location

fileSysCrawler.crawlLocation = "<set desired crawl location>";

// e.g. full path (on Windows, escape each "\" in the folder hierarchy as "\\")
fileSysCrawler.crawlLocation = "C:\\Users\\guest\\Desktop";

// e.g. relative path
fileSysCrawler.crawlLocation = "/files";

To perform the crawl

fileSysCrawler.crawl();

After crawling, the names of all files (not directories) in the crawl location are stored in filesList

fileSysCrawler.filesList

// e.g. log to console
console.log(fileSysCrawler.filesList)

After crawling, the names of all directories (not files) in the crawl location are stored in dirsList

fileSysCrawler.dirsList

// e.g. log to console
console.log(fileSysCrawler.dirsList)

To get the text from any file

fileSysCrawler.parseFile("/path/to/file", function(data){
	// "data" is the text parsed from the file passed in the first argument
	console.log(data)
})

// e.g. absolute path
fileSysCrawler.parseFile("C:\\Users\\Harsh\\files\\abcd.docx", function(data){
	// "data" is the text parsed from the file passed in the first argument
	console.log(data)
})

Optionally, change the decompression location for office files, for environments with restricted write access

// Call this before parsing any office files for it to take effect.
fileSysCrawler.setDecompressionLocation("/tmp");  // New decompression location would be "/tmp/officeDist"

// Note: on Windows, each '\' in the path hierarchy must be escaped as '\\'
fileSysCrawler.setDecompressionLocation("C:\\tmp");  // New decompression location would be "C:\tmp\officeDist"

// e.g. relative path
fileSysCrawler.parseFile("files/xyzd.jpg", function(data){
	// "data" is the text parsed from the file passed in the first argument
	console.log(data)
})
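
Putting the pieces together, here is a minimal sketch that crawls a folder and parses every file found. It assumes, as the examples above suggest, that crawl() populates filesList synchronously with full file paths.

const fileSysCrawler = require("filesyscrawler");

fileSysCrawler.crawlLocation = "/files";
fileSysCrawler.crawl();

// Parse each file found by the crawl and log its extracted text
fileSysCrawler.filesList.forEach(function(file){
	fileSysCrawler.parseFile(file, function(data){
		console.log(file + ":\n" + data)
	})
})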

To activate the REST API interface

fileSysCrawler.initiateREST();

The default REST API port number is 3000

// in case port 3000 (the default) is already in use, set the port number manually before initiating the REST API interface above
fileSysCrawler.RESTPortNumber = <port number>

// e.g. to set port number 8080
fileSysCrawler.RESTPortNumber = 8080
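
For example, a minimal sketch that sets a custom port and then starts the REST interface, in that order:

const fileSysCrawler = require("filesyscrawler");

// Set the port first; initiateREST() then starts the server on it
fileSysCrawler.RESTPortNumber = 8080;
fileSysCrawler.initiateREST();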

Usage as a standalone package with the REST API interface

  1. Clone the repository on your file system
  2. Open a terminal/cmd and navigate into the repository
  3. Run "npm i" and wait for the dependencies to install
  4. Edit fileSysCrawler.js and add an "initiateREST();" call at the end of the file (see the sketch after this list)
  5. Run "node fileSysCrawler.js" to start the crawler
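
For step 4, the end of fileSysCrawler.js would then look roughly like this (a sketch; everything above the added line is the file's existing code):

// ... existing crawler code ...

// Added in step 4: start the REST interface (port 3000 by default)
initiateREST();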

REST API interface definitions

  • The program starts on port 3000 by default.
  • Call /setCrawlLocation before calling /getFilesList or /getDirsList

Verb   API                 Body
POST   /setCrawlLocation   { "crawlLocation": "<path>" }
GET    /getFilesList
GET    /getCrawlLocation
GET    /getDirsList
POST   /parseFile          { "path": "<path>" }
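
As a quick client-side check, here is a minimal sketch using the built-in fetch of Node.js 18+, assuming the crawler is already running on the default port 3000 and that /getFilesList returns JSON, as the examples below show:

const base = "http://localhost:3000";

async function demo() {
	// Set the crawl location first, as required before the list endpoints
	await fetch(base + "/setCrawlLocation", {
		method: "POST",
		headers: { "Content-Type": "application/json" },
		body: JSON.stringify({ crawlLocation: "files/" })
	});

	// Then list the files found in that location
	const files = await (await fetch(base + "/getFilesList")).json();
	console.log(files);
}

demo();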

Example

1. Set Crawl Location

REQUEST

POST http://localhost:3000/setCrawlLocation 

body
{
	"crawlLocation": "files/"
}

RESPONSE

Success. New crawl location set to files/

2. Get Files List

REQUEST

GET http://localhost:3000/getFilesList

RESPONSE

[
	"C:\\Users\\Harsh\\files\\abcd.docx",
	"C:\\Users\\Harsh\\files\\xyzd.png"
]

3. Parse File

REQUEST

POST http://localhost:3000/parseFile

body
{
	"path":  "C:\\Users\\Harsh\\files\\abcd.docx"
}

RESPONSE

{
    "data": "This is a sample document made to parse text from"
}

License

MIT

Free Software, Hell Yeah!