This code is designed to traverse through a specified folder and its subfolders, search for files with certain extensions, and extract words from those files. The extracted words are then processed to create a dictionary of unique words with their capitalized forms. Finally, the dictionary is written to a JSON file.
This code requires the following:
- Python (version 3 or above) installed on your system.
- The `os`, `re`, and `json` modules, which are part of the Python standard library.
- Set the value of the `folder_path` variable to the path of the folder you want to traverse. This folder should contain the files you want to analyze.
- Optionally, modify the `extensions` list to include or exclude file extensions according to your requirements.
- Optionally, modify the `excluded_chars` list to include any characters that should be excluded from the word dictionary.
- Run the code.
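As a rough illustration of the settings described above, the configuration might look like the following. Only the three variable names come from this document; every value shown is a hypothetical placeholder to replace with your own.

```python
# Hypothetical example values; only the variable names come from the description above.

# Folder to traverse (its subfolders are visited as well).
folder_path = "path/to/your/folder"

# File extensions to include in the scan (assumed here to carry a leading dot).
extensions = [".txt", ".md"]

# Entries that should never be added to the word dictionary.
excluded_chars = ["_", "__"]
```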
- The code starts by initializing an empty dictionary called `word_dict` to store the words and their capitalized forms.
- It uses the `os.walk()` function to traverse the specified folder and its subfolders.
- For each file encountered, the code checks whether it has one of the allowed extensions. If it does, the file is opened and its content is read.
- The `re.findall()` function is then used to extract words from the file's content. The regular expression pattern `\b(?<![A-Z])(?<![a-z])(?![A-Z])(?![a-z])(?!\d)\w+\b` is used to match words. The `\b` anchors keep each match on whole-word boundaries, so a match is never part of a larger word. Note that the negative lookaheads `(?![A-Z])(?![a-z])(?!\d)` apply to the first character of the match: as written, and with Python's default Unicode matching, the pattern only matches tokens that begin with a word character other than an ASCII letter or a digit (for example an underscore or a non-ASCII letter). The intent is to extract individual words while leaving out other elements such as function names, variable names, and numbers.
- Each extracted word is checked against the `excluded_chars` list. If it is not present in the list, it is added to the `word_dict` dictionary with its capitalized form as the value. If the word is already present in the dictionary, it is skipped to avoid duplicates.
- After processing all the files, the `word_dict` dictionary is written to a JSON file named `output.json`. The `json.dump()` function is used to serialize the dictionary as JSON and write it to the file.
- The code prints a message indicating the location where the output file has been written.
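The steps above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the original script: the function names `build_word_dict` and `write_word_dict`, the use of `os.path.splitext()` for the extension check, the reading of "capitalized form" as `str.capitalize()`, and the `ensure_ascii`/`indent` arguments are all assumptions. The regular expression is copied verbatim from the description.

```python
import json
import os
import re

# Pattern copied verbatim from the description above.
WORD_PATTERN = r"\b(?<![A-Z])(?<![a-z])(?![A-Z])(?![a-z])(?!\d)\w+\b"


def build_word_dict(folder_path, extensions, excluded_chars, pattern=WORD_PATTERN):
    """Collect unique words from matching files under folder_path (hypothetical helper)."""
    word_dict = {}
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            # Skip files whose extension is not in the allowed list.
            if os.path.splitext(name)[1] not in extensions:
                continue
            with open(os.path.join(root, name), encoding="utf-8") as handle:
                content = handle.read()
            for word in re.findall(pattern, content):
                # Skip excluded entries and words that were already recorded.
                if word in excluded_chars or word in word_dict:
                    continue
                # Assumption: "capitalized form" means str.capitalize().
                word_dict[word] = word.capitalize()
    return word_dict


def write_word_dict(word_dict, output_path="output.json"):
    """Write the dictionary as JSON and report where it went (hypothetical helper)."""
    with open(output_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False and indent=4 are assumptions made for readability.
        json.dump(word_dict, handle, ensure_ascii=False, indent=4)
    print(f"Output written to {os.path.abspath(output_path)}")
```

A call such as `write_word_dict(build_word_dict(folder_path, extensions, excluded_chars))` would run the whole pipeline. Passing the pattern as a parameter makes it easy to try a different expression if the default matches fewer words than expected.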
The output of the code is a JSON file named `output.json`. The file contains a dictionary where each key represents a unique word extracted from the specified files, and the corresponding value is its capitalized form.
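To consume the result elsewhere, the file can be read back into a plain Python dictionary. The helper below is a small sketch; the name `load_word_dict` is hypothetical, and it assumes the file was written as UTF-8.

```python
import json


def load_word_dict(path="output.json"):
    """Read the JSON output back into a dict of word -> capitalized form."""
    # Assumption: the output file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Each key of the returned dictionary is a unique extracted word, so `sorted(load_word_dict())` gives an alphabetical word list.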
- Ensure that the specified `folder_path` exists and contains the files you want to analyze.
- Make sure the necessary permissions are granted to read the files and write the output file in the specified location.
- The code assumes that the files in the specified folder are encoded using UTF-8. If your files are encoded differently, you may need to modify the encoding in the `open()` function calls.
- The code does not perform any error handling or validation, so it is recommended to provide valid inputs and handle exceptions as per your requirements.
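One possible way to add the missing error handling is to wrap the file read in a small helper that reports unreadable files instead of stopping the run. This is only a sketch: the name `read_text_or_none` and the choice to skip a file rather than abort are assumptions, not part of the original code.

```python
def read_text_or_none(path, encoding="utf-8"):
    """Return the file's text, or None if it cannot be read or decoded."""
    try:
        with open(path, encoding=encoding) as handle:
            return handle.read()
    except (OSError, UnicodeDecodeError) as error:
        # OSError covers missing files and permission problems;
        # UnicodeDecodeError covers content that is not valid in `encoding`.
        print(f"Skipping {path}: {error}")
        return None
```

A caller can then skip any file for which the helper returns `None`, or retry with a different `encoding` value such as `"latin-1"` when the files are known not to be UTF-8.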