
tokyo is a REST API that, given any type of document 📄, identifies its mime-type 🧐, suggests an extension 🦔, and also extracts text 💪.


tokyo

greed2411

When you hit rock-bottom, you still have a way to go until the abyss. - Tokyo, Netflix's "Money Heist" (La Casa De Papel)



image belongs to teepublic

When one is limited by the technology of the time, one resorts to Java APIs using Clojure.

This is my first attempt at using Clojure to build a REST API which, when a file is uploaded, identifies its mime-type and extension, extracts any text inside the file, and returns the information as JSON. This works for several types of files, including the ones which require OCR, thanks to Tesseract. Complete list of file formats supported by Tika.

Uses ring as the Clojure HTTP server abstraction, jetty as the actual HTTP server, and pantomime as a Clojure abstraction over Apache Tika. It can optionally be served behind traefik acting as a reverse proxy.

Installation

Two options:

  1. Download openjdk-11 and install lein, then run lein uberjar.
  2. Use the Dockerfile (Recommended)

Building

  1. You can obtain the .jar file from releases (if it's available).
  2. Otherwise, build the docker image using the Dockerfile.
docker build ./ -t tokyo
docker run tokyo:latest

Note: the server defaults to running on port 80, because that is the port exposed in the docker image. You can change the port by setting the environment variable TOKYO_PORT to whichever port number you'd like, either inside the Dockerfile or in your shell when running the .jar file.
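If a client script should follow the same setting, one option is to read TOKYO_PORT on the client side too. This is only a minimal sketch and not part of tokyo: the helper name tokyo_base_url and the host default are assumptions, and the fallback to 80 simply mirrors the default described in the note above.

```python
import os


def tokyo_base_url(host="localhost", env=None):
    """Build the base URL of a tokyo server (hypothetical client-side helper)."""
    # Allow a custom mapping to be passed in, otherwise use the real environment.
    env = os.environ if env is None else env
    # Fall back to 80, the default port mentioned in the note above.
    port = int(env.get("TOKYO_PORT", "80"))
    return f"http://{host}:{port}"
```

With TOKYO_PORT unset this yields the same http://localhost:80 address used in the usage examples.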

I've also added a docker-compose.yml which uses traefik as a reverse proxy. Use docker-compose up.

Usage

  1. The /file route. Make a POST request by uploading a file.

    • the command line approach using curl
    curl -XPOST  "http://localhost:80/file" -F file=@/path/to/file/sample.doc
    
    {"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}
    • the programmatic approach using Python's requests
    >>> import requests
    >>> import json
    
    >>> url = "http://localhost:80/file"
    >>> # open in binary mode so non-text documents are uploaded byte-for-byte
    >>> files = {"file": open("/path/to/file/sample.doc", "rb")}
    >>> response = requests.post(url, files=files)
    >>> json.loads(response.content)
    
    {'mime-type': 'application/msword', 'ext': '.bin', 'text': 'Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.'}

    The general API response (JSON schema) is of the form:

    :mime-type (string) - the mime-type of the file. eg: application/msword, text/plain etc.
    :ext       (string) - the extension of the file. eg: .txt, .jpg etc.
    :text      (string) - the text content of the file.
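
On the client side, the three fields listed above can be checked and unpacked before use. The following is a rough sketch rather than anything shipped with tokyo: the names TokyoResult and parse_result are hypothetical, and it assumes all three keys are always present in a successful response.

```python
import json
from dataclasses import dataclass


@dataclass
class TokyoResult:
    """Client-side view of a /file response (hypothetical type)."""
    mime_type: str
    ext: str
    text: str


def parse_result(body):
    """Parse a /file response body (str or bytes) into a TokyoResult."""
    data = json.loads(body)
    # Assumption: a successful response always carries all three keys.
    missing = {"mime-type", "ext", "text"} - set(data)
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return TokyoResult(data["mime-type"], data["ext"], data["text"])
```

For instance, parse_result(response.content) could replace the json.loads call in the Python example, giving attribute access such as result.mime_type.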
    

Note: Uploaded files are stored as temp files in /tmp and removed an hour later (assuming the JVM is still running for that hour or so).

  2. The / route. A GET request returns Hello World as plain text, to act as a ping.
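
The ping route is handy for waiting until the server has started, for example right after docker run. Below is a small sketch using only the Python standard library. The function names, retry count, and delay are assumptions, and it treats any successful response from / as "up" rather than comparing the body.

```python
import time
import urllib.request


def fetch_root(base_url):
    """GET / and return the response body as text."""
    with urllib.request.urlopen(base_url + "/", timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_until_up(base_url, attempts=10, delay=1.0, fetch=fetch_root):
    """Return True once GET / answers, False if every attempt fails."""
    for attempt in range(attempts):
        try:
            fetch(base_url)
            return True
        except OSError:
            # URLError, timeouts and refused connections are all OSError subclasses.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling wait_until_up("http://localhost:80") before the first upload avoids a connection error while the JVM is still booting. The fetch parameter exists only so the retry loop can be exercised without a running server.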

If you go down the path of using docker-compose, the request changes to

curl -XPOST  -H Host:tokyo.localhost http://localhost/file -F file=@/path/to/file/sample.doc

{"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}

and

>>> response = requests.post(url, files=files, headers={"Host": "tokyo.localhost"})

where tokyo.localhost is the host name mentioned in docker-compose.yml

Why?

I had to do this because neither Python's filetype (doesn't identify .doc, .docx, or plain text) nor textract (a hacky way of extracting text, and one needs to know the extension before extracting) is as good as Tika. The Go version of filetype didn't support a way to extract text. So I resorted to spiraling down the path of using Java's Apache Tika through the Clojure pantomime library.

License

Copyright © 2020 greed2411/tokyo

This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.