Web services layer for ContentMine text and data mining tools and utilities

Primary language: JavaScript. License: Apache License 2.0 (Apache-2.0).

CMServices

ContentMine Services Layer

A web services layer to allow frontend web apps to make use of core ContentMine backends and tools such as norma (and in future other ContentMine tools such as ami, getpapers and quickscrape).

Note: The initial version is at proof-of-concept status and the API is subject to change.

Installation

System requirements

Target Node.js version: v8.0.0.
Created with npm v5.0.0.

Usage

In the CMServices directory, run npm start to start CMServices as a server. The port defaults to 3000; set the PORT environment variable to override it. For instance:

$ PORT=3002 npm start
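How the server reads PORT internally is not shown in this README. As a minimal sketch of the usual Node.js pattern, assuming the value arrives through process.env, the fallback to 3000 could look like the following. The name resolvePort is hypothetical and not part of CMServices.

```javascript
// Hypothetical helper (not from the CMServices source): choose the listening
// port from an environment object, falling back to the documented default.
function resolvePort(env) {
  const parsed = parseInt(env.PORT, 10);
  // Missing, non-numeric or out-of-range values fall back to 3000.
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : 3000;
}

console.log('Would listen on port ' + resolvePort(process.env));
```

Because the helper takes the environment as an argument, it can be exercised without changing the real process environment.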

Configuration

The server configuration uses default.json.

In addition to host, the following can be configured:

fileStorageCM The directory used to store all files uploaded and generated by the ContentMine tools. If the path is relative it is interpreted relative to the directory the server is running in.

normaJar The path to the jar file for ContentMine's norma, a text and data mining application. An example jar is bundled with this project, so no installation is needed for initial use.
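The rule for relative fileStorageCM paths can be illustrated with Node's path module. This is a sketch under assumptions: resolveStorageDir and the example directories are hypothetical, and the actual server code may resolve the path differently.

```javascript
const path = require('path');

// Hypothetical helper: turn a configured fileStorageCM value into an
// absolute directory. path.resolve keeps an absolute configuredPath as it
// is and resolves a relative one against serverDir.
function resolveStorageDir(configuredPath, serverDir) {
  return path.resolve(serverDir, configuredPath);
}

// Illustrative values only: a relative path is interpreted relative to the
// directory the server is running in.
console.log(resolveStorageDir('cmfiles', process.cwd()));
```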

Service API

Note: This API is subject to change

POST methods


/api/createCorpus
Create a new corpus containing the uploaded PDF document.
HTTP verb: POST
Form data: multipart
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus to create. This will be used as a directory name
docName The document name to use for the uploaded PDF (e.g., a DOI). This will be used as a directory name and should be unique within the corpus
fileupload A PDF file

Example usage:

curl --form userWorkspace="user1" --form corpusName="corpus1" --form docName="doc1" --form "fileupload=@testpdf.pdf" http://localhost:3002/api/createCorpus
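Since corpusName and docName are used as directory names, an identifier such as a DOI, which contains a slash, may need adjusting on the client before upload. Whether the server sanitizes these values itself is not stated here, so the following is only a client-side sketch; toDirectorySafeName and its replacement rule are assumptions.

```javascript
// Hypothetical helper: map an arbitrary identifier to a string that is safe
// to use as a single directory name. Runs of characters outside
// [A-Za-z0-9._-] are replaced with one underscore.
function toDirectorySafeName(name) {
  const safe = name.trim().replace(/[^A-Za-z0-9._-]+/g, '_');
  // Reject results that are empty or refer to the current/parent directory.
  if (safe === '' || safe === '.' || safe === '..') {
    throw new Error('Cannot derive a directory name from: ' + JSON.stringify(name));
  }
  return safe;
}

console.log(toDirectorySafeName('10.1371/journal.pone.0123456'));
```

Different identifiers can map to the same name under this rule, so uniqueness within the corpus still needs to be checked by the caller.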


/api/transformPDF2SVG
For all PDF documents in the corpus, generate an SVG file for each page. This converts pages into the intermediate SVG format used for data extraction and analysis by norma.
HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus. This is used as a directory name.

Example usage:

curl -d "corpusName=corpus1&userWorkspace=user1" http://localhost:3002/api/transformPDF2SVG


/api/cropbox
Crop document according to coordinates, dimensions and page number to select a specific area for data extraction using norma. Assumes a single-document corpus.

HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus.
x0 The x coordinate of the top-left corner of the table
y0 The y coordinate of the top-left corner of the table
width The width of the table in mm
height The height of the table in mm
pageNumber The number of the page containing the table (numbering is relative to the PDF document, so page numbers start at 1).

The coordinate system defaults to ydown, with y coordinates increasing down the page, and the units default to mm.

Example usage:

curl -d userWorkspace=user1 -d corpusName=corpus1 -d x0=17.5 -d y0=26 -d width=178.5 -d height=97.5 -d pageNumber=5 http://localhost:3002/api/cropbox
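PDF tools often report positions in points (1/72 inch) with the origin at the bottom-left of the page, whereas the fields above are in mm with y increasing downwards. The sketch below converts such a box and builds the form body. It assumes that the y-down origin is the top-left corner of the page, which is not stated above; toCropboxFields and buildCropboxBody are hypothetical names.

```javascript
const querystring = require('querystring');

const MM_PER_POINT = 25.4 / 72; // 1 PDF point = 1/72 inch = 25.4/72 mm

// Hypothetical helper: convert a box given in PDF points with a bottom-left
// origin (y up) into mm, y-down values. box.top and box.bottom are the y
// coordinates of the upper and lower edges in the y-up system.
function toCropboxFields(box, pageHeightPt) {
  const round = (v) => Math.round(v * 100) / 100;
  return {
    x0: round(box.left * MM_PER_POINT),
    // Distance from the top of the page down to the upper edge of the box.
    y0: round((pageHeightPt - box.top) * MM_PER_POINT),
    width: round((box.right - box.left) * MM_PER_POINT),
    height: round((box.top - box.bottom) * MM_PER_POINT)
  };
}

// Hypothetical helper: build the x-www-form-urlencoded body for /api/cropbox.
function buildCropboxBody(userWorkspace, corpusName, fields, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be an integer starting at 1');
  }
  return querystring.stringify(
    Object.assign({ userWorkspace, corpusName }, fields, { pageNumber })
  );
}

const fields = { x0: 17.5, y0: 26, width: 178.5, height: 97.5 };
console.log(buildCropboxBody('user1', 'corpus1', fields, 5));
```

The resulting string carries the same fields as the curl example above and can be sent with any HTTP client using the application/x-www-form-urlencoded content type.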


/api/transformSVGTABLE2HTML
Convert a table in SVG format into semantically structured HTML using norma transform svgtable2html. Assumes a single-document corpus.

HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus.

Example usage:

curl -d userWorkspace=user1 -d corpusName=corpus1 http://localhost:3002/api/transformSVGTABLE2HTML


/api/transformSVGTABLE2CSV
Convert a table in SVG format into CSV using norma transform svgtable2csv. Assumes a single-document corpus.

HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus.

Example usage:

curl -d userWorkspace=user1 -d corpusName=corpus1 http://localhost:3002/api/transformSVGTABLE2CSV


GET methods


/api/getTableHTML/userWorkspace/corpusName/docName
Retrieve the semantically structured data for the previously converted table in HTML format. Assumes transformSVGTABLE2HTML has already been run.

HTTP verb: GET
URL parameters:
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus
docName The name of the document within the corpus at upload time

Example usage:

curl http://localhost:3002/api/getTableHTML/user1/corpus1/doc1/

This prints the HTML data to standard output.


/api/getTableCSV/userWorkspace/corpusName/docName
Retrieve the structured data for the previously converted table in CSV format. Assumes transformSVGTABLE2CSV has already been run.

HTTP verb: GET
URL parameters:
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus
docName The name of the document within the corpus at upload time

Example usage:

curl http://localhost:3002/api/getTableCSV/user1/corpus1/doc1/

This prints the CSV data to standard output.
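Because the three identifiers are embedded in the URL path, characters such as spaces need percent-encoding. A minimal sketch of building the retrieval URLs follows; tableUrl is a hypothetical helper, and the trailing slash simply mirrors the curl examples rather than a known server requirement.

```javascript
// Hypothetical helper: build the URL for /api/getTableHTML or
// /api/getTableCSV. format must be 'html' or 'csv'.
function tableUrl(baseUrl, format, userWorkspace, corpusName, docName) {
  let endpoint;
  if (format === 'html') {
    endpoint = 'getTableHTML';
  } else if (format === 'csv') {
    endpoint = 'getTableCSV';
  } else {
    throw new Error('format must be "html" or "csv"');
  }
  // Percent-encode each path segment separately so that reserved
  // characters inside a name do not change the path structure.
  const segments = [userWorkspace, corpusName, docName]
    .map((segment) => encodeURIComponent(segment));
  return baseUrl.replace(/\/+$/, '') + '/api/' + endpoint + '/' + segments.join('/') + '/';
}

console.log(tableUrl('http://localhost:3002', 'csv', 'user1', 'corpus1', 'doc1'));
```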


Combined POST method


/api/extractTableToHTML
Extract the table data from the specified page and area of the uploaded PDF document and return the results as semantically structured HTML.
HTTP verb: POST
Form data: multipart
Fields
userWorkspace A relative directory name in which all this user's files are stored
corpusName The name of the corpus to create. This will be used as a directory name.
docName The document name to use for the uploaded PDF (e.g., a DOI). This will be used as a directory name and should be unique within the corpus.
x0 The x coordinate of the top-left corner of the table
y0 The y coordinate of the top-left corner of the table
width The width of the table in mm
height The height of the table in mm
pageNumber The number of the page containing the table (numbering is relative to the PDF document, so page numbers start at 1).
fileupload A PDF file

Example usage:

curl --form userWorkspace="user1" --form corpusName="corpus1" --form docName="doc1" --form "fileupload=@testpdf.pdf" --form x0=17.5 --form y0=26 --form width=178.5 --form height=97.5 --form pageNumber=5 http://localhost:3002/api/extractTableToHTML


Example workflow 1: Extracting a table from a PDF file and returning HTML

Use /api/extractTableToHTML with form fields as above.

Example workflow 2: Extracting a table from a PDF file using individual calls

  1. Upload the PDF document
    /api/createCorpus
  2. Convert it to SVG (ContentMine norma intermediate format)
    /api/transformPDF2SVG
  3. Crop the specified area of the specified page to leave only the table in SVG.
    /api/cropbox
  4. Extract data and semantic structure from the table SVG. Output as HTML or CSV.
    /api/transformSVGTABLE2HTML, /api/transformSVGTABLE2CSV
  5. Retrieve extracted/structured data results after conversion:
    /api/getTableHTML, /api/getTableCSV
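
The five steps of workflow 2 can be sketched as a sequence of awaited calls. This is an illustration only: client is a hypothetical object supplied by the caller, with post(path, fields) and get(path) methods that return Promises, and the sketch assumes each POST has finished its processing by the time its response arrives.

```javascript
// Sketch of workflow 2. Nothing here is part of CMServices itself:
// extractTable, client and the opts field names are assumptions.
async function extractTable(client, opts) {
  const ids = { userWorkspace: opts.userWorkspace, corpusName: opts.corpusName };
  const suffix = opts.format === 'csv' ? 'CSV' : 'HTML';

  // 1. Upload the PDF document (multipart form, file field "fileupload").
  await client.post('/api/createCorpus',
    Object.assign({}, ids, { docName: opts.docName, fileupload: opts.pdfPath }));

  // 2. Convert every page of the PDF to SVG.
  await client.post('/api/transformPDF2SVG', ids);

  // 3. Crop the given page to the table area (mm, y increasing downwards).
  await client.post('/api/cropbox',
    Object.assign({}, ids, opts.box, { pageNumber: opts.pageNumber }));

  // 4. Convert the cropped table SVG to HTML or CSV.
  await client.post('/api/transformSVGTABLE2' + suffix, ids);

  // 5. Retrieve the converted result.
  const segments = [opts.userWorkspace, opts.corpusName, opts.docName]
    .map((segment) => encodeURIComponent(segment));
  return client.get('/api/getTable' + suffix + '/' + segments.join('/') + '/');
}
```

Passing the HTTP client in as a parameter keeps the sequence independent of any particular request library and lets it be tried against a stub before pointing it at a running server.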