When you hit rock-bottom, you still have a way to go until the abyss. - Tokyo, Netflix's "Money Heist" (La Casa De Papel)
When one is limited by the technology of the time, one resorts to Java APIs using Clojure.
This is my first attempt at using Clojure to build a REST API which, when a file is uploaded to it, identifies the file's mime-type, extension and text (if present inside the file) and returns that information as JSON.
This works for several types of files, including the ones which require OCR, thanks to Tesseract. See Tika's complete list of supported file formats.
It uses ring as the Clojure HTTP server abstraction, jetty as the actual HTTP server, and pantomime as a Clojure abstraction over Apache Tika. It can also optionally be served behind traefik acting as a reverse proxy.
Two options:
- Download openjdk-11 and install lein, followed by
lein uberjar
- Use the Dockerfile (Recommended)

- You can obtain the .jar file from releases (if it's available).
- Else build the docker image using the Dockerfile.
docker build ./ -t tokyo
docker run tokyo:latest
Note: the server defaults to running on port 80, because that port has been exposed in the docker image. You can change the port number by setting the environment variable TOKYO_PORT inside the Dockerfile, or in your shell when running the .jar file, to whichever port number you'd like.
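If a client script runs in the same shell as the .jar, it can read the same variable so its URL follows the server's port. The following is only a minimal sketch of that idea: the helper names `server_port` and `base_url` are hypothetical, and reading TOKYO_PORT on the client side is a convenience assumed here, not something the project itself does.

```python
import os

DEFAULT_PORT = 80  # the default port stated in the note above


def server_port(environ=None):
    """Return the port from TOKYO_PORT, falling back to the default of 80."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TOKYO_PORT")
    if raw is None or raw.strip() == "":
        return DEFAULT_PORT
    port = int(raw)  # raises ValueError for non-numeric values
    if not 0 < port < 65536:
        raise ValueError(f"TOKYO_PORT out of range: {port}")
    return port


def base_url(host="localhost", environ=None):
    """Build the base URL a client would use for this server (hypothetical helper)."""
    return f"http://{host}:{server_port(environ)}"
```

With TOKYO_PORT unset, `base_url()` yields `http://localhost:80`, which matches the URLs used in the examples in this README. When running under docker with a published port, the host-side port may differ from TOKYO_PORT, so treat this as a starting point only.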
I've also added a docker-compose.yml which uses traefik as a reverse proxy. Use docker-compose up.
- The /file route. Make a POST request by uploading a file.
- The command line approach, using curl
curl -XPOST "http://localhost:80/file" -F file=@/path/to/file/sample.doc
{"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}
- The Python Way using requests
>>> import requests
>>> import json
>>> url = "http://localhost:80/file"
>>> # open in binary mode so non-text files such as .doc are sent unmodified
>>> files = {"file": open("/path/to/file/sample.doc", "rb")}
>>> response = requests.post(url, files=files)
>>> json.loads(response.content)
{'mime-type': 'application/msword', 'ext': '.bin', 'text': 'Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.'}
The general API response JSON schema is of the form:
:mime-type (string) - the mime-type of the file, e.g. application/msword, text/plain etc.
:ext (string) - the extension of the file, e.g. .txt, .jpg etc.
:text (string) - the text content of the file.
Note: The uploaded files are stored as temp files in /tmp and removed an hour later (assuming the JVM is still running for that hour or so).
- A GET request to / returns Hello World as plain text, to act as a ping.
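On the client side, the two routes above can be wrapped in a couple of small helpers. This is only a sketch under assumptions: `FileInfo`, `parse_file_response` and `ping` are hypothetical names, and treating a missing or null text value as an empty string is a guess based on text being returned only "if present inside the file". It uses the standard library only, so it does not depend on requests.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical container for the three fields of a /file response."""
    mime_type: str
    ext: str
    text: str


def parse_file_response(body):
    """Parse the JSON body returned by POST /file into a FileInfo.

    Raises ValueError if the body is not a JSON object or a field is not a string.
    """
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    values = {key: data.get(key) for key in ("mime-type", "ext", "text")}
    if values["text"] is None:
        values["text"] = ""  # assumption: files without text omit or null this key
    for key, value in values.items():
        if not isinstance(value, str):
            raise ValueError(f"expected a string for {key!r}, got {value!r}")
    return FileInfo(values["mime-type"], values["ext"], values["text"])


def ping(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return True if GET / answers 200 with the plain-text body Hello World."""
    try:
        with opener(base_url.rstrip("/") + "/", timeout=timeout) as response:
            body = response.read().decode().strip()
            return response.status == 200 and body == "Hello World"
    except OSError:
        # URLError and HTTPError are OSError subclasses: server down or non-2xx reply
        return False
```

For example, `parse_file_response(response.content)` could replace the bare `json.loads` call in the Python session above, and `ping("http://localhost:80")` could be called before uploading. The `opener` parameter exists only so the check can be exercised without a running server.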
If you go down the path of using docker-compose, the request changes to
curl -XPOST -H Host:tokyo.localhost http://localhost/file -F file=@/path/to/file/sample.doc
{"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}
and
>>> response = requests.post(url, files=files, headers={"Host": "tokyo.localhost"})
where tokyo.localhost is the host name mentioned in docker-compose.yml.
I had to do this because neither Python's filetype (which doesn't identify .doc, .docx or plain text) nor textract (a hacky way of extracting text, where one needs to know the extension before extracting) is as good as Tika. The Go version of filetype doesn't support extracting text either. So I resorted to spiraling down the path of using Java's Apache Tika through the Clojure pantomime library.
Copyright © 2020 greed2411/tokyo
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.