/pdf2word

CRUD REST API file server and Convert pdf files to word files using Expressjs, MongoDB, Nginx, Mongo-Express

Primary LanguageJavaScriptApache License 2.0Apache-2.0

A Dockerized PDF to Word converter using Express.JS & MongoDB

An Application to convert text and scanned PDF files to word document. We use MongoDB as database for store files and metadata.

Used packages:

  • Mongoose (This package will translate the node.JS code to MongoDB)
  • Config (It lets you define a set of default parameters, and extend them for different deployment environments.
  • Express (You’ll need this package for any HTTP requests you want to run)
  • BodyParser (This package lets you receive content from HTML forms)
  • Multer (This package enables easy file upload into MongoDB
  • Gridfs-stream (Easily stream files to and from MongoDB GridFS.)
  • Multer-gridfs-storage (You need this package to implement the MongoDB GridFS feature with multer).
  • pdf-extract (Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable pdf files as well as performing OCR on pdfs which are just scanned images of text.)
  • pdf-parse (Pure javascript cross-platform module to extract texts from PDFs.)
  • stream-to-array (Concatenate a readable stream's data into a single array. The data that we fetch from the database is in the form of a stream, it is necessary to buffer the data to convert the stream to PDF.)
  • cors (CORS is a node.js package for providing a Connect/Express middleware that can be used to enable CORS with various options.)
  • officegen (Creating Office Open XML files (Word, Excel and Powerpoint) for Microsoft Office 2007 and later without external tools, just pure Javascript.)

Pdf-extract prerequisites:

  • pdftk

    pdftk splits multi-page pdf into single pages.

  • pdftotext

    pdftotext is used to extract text out of searchable pdf documents

  • ghostscript

    ghostscript is an ocr preprocessor which convert pdfs to tif files for input into tesseract

  • tesseract

    tesseract performs the actual ocr on your scanned images

More explanations for installing each of these packages on any operating system are written here I have written these prerequisites in the docker file.

Dockerfile:

FROM node:14
RUN apt update
RUN apt install -y pdftk poppler-utils ghostscript tesseract-ocr tesseract-ocr-fas
RUN apt autoclean && apt autoremove
RUN mkdir /app
WORKDIR /app
COPY package*.json ./
RUN npm install 
COPY . .
EXPOSE 3000
CMD ["npm", "run", "start"]

NOTE Install tesseract-ocr-fas for support persian language, Visit this Github project for more information on using your preferred language.

docker-compose.yml:

version: "3"
services:
  backend-file-server:
    image: file-server
    container_name: file-server-container
    build:
      context: .
    restart: on-failure
    volumes:
      - "./word/:/app/word/"
    depends_on: 
      - mongodb
    networks:
      - file-net
    ports: 
      - "3000:3000"
  mongodb:
    image: mongo:4.2
    container_name: mongodb
    restart: on-failure
    env_file: ./mongo_env
    volumes: 
      - ./mongo-data:/data/db
    networks:
      - file-net
  mongo-express:
    image: mongo-express:0.54.0
    container_name: mongo-express
    depends_on:
      - mongodb
    networks:
      - file-net
    env_file: ./mongo-express_env 
  nginx:
    image: nginx:1.21
    container_name: nginx_proxy
    restart: on-failure
    depends_on:
      - backend
    networks:
      - file-net
    ports:
      - "8080:8080"
      - "8081:8081"
    volumes:
      - ./conf.d/:/etc/nginx/conf.d/
networks:
  file-net: