This program is designed to convert files to text format. It supports Word documents (.docx, .doc), PDF files (.pdf), and images (.jpg).
- Python 3.x
- Libraries:
- extracting_pdf_to_text
- extracting_word_to_text
- pytesseract
- PIL (Python Imaging Library)
- Place the files you want to convert in the appropriate directories:
- Word documents: "./InputDataLoad"
- PDF files: "./InputDataLoad"
- Images: "./InputDataLoad"
- Run the
main()
function to initiate the conversion process. - The program will scan the input directories for supported files and convert them to text.
- The converted text will be stored in separate text files in the "./OutputData" directory.
- The program will print the file paths of the converted text files.
- For Word documents, the
extracting_word_to_text.wordtotext()
function will be used to convert them to text. - For PDF files, the
extracting_pdf_to_text.pdftotext()
function will be used to extract the text. - For images, the program will use OCR (optical character recognition) to extract text using the
pytesseract
library.
Note: Make sure to install the required libraries and provide the correct file paths in the code for image and Word document conversion.
Suppose you have the following files in the "./InputDataLoad" directory:
- document1.docx
- document2.doc
- document3.pdf
- image1.jpg
After running the program, the text files will be created in the "./OutputData" directory as follows:
- word_to_txt.txt
- word_to_txt.txt
- pdf_to_txt.txt
- jpg_to_txt.txt
The program will print the file paths of the converted text files: