- Víctor Pérez Cester (vp19885@essex.ac.uk)
- Joel Valiente Sanchez (jv19228@essex.ac.uk)
To run the project, you must first install all the dependencies by running the following command in your Linux terminal:
pip3 install -r requirements.txt
Once all the dependencies are installed, you can execute the code as follows:
python3 main.py --url <your_url_to_parse>
If you want the system to be verbose, you can add the --verbose flag as follows:
python3 main.py --url <your_url_to_parse> --verbose
Once the code has finished executing, you can check the outputs inside the output/ folder.
The modules described below are provided in the /src/ folder.
The URLDownloader module downloads a given URL and returns the HTML as plain text.
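A minimal sketch of what such a downloader might look like, assuming it is built on the requests library (the actual implementation in /src/ may differ):

```python
# Hypothetical sketch of the URLDownloader behaviour; assumes `requests`
# is among the installed dependencies.
import requests

def download(url: str) -> str:
    """Download the given URL and return its HTML as plain text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```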
The Parser module downloads a given URL using the URLDownloader module and parses its content to extract all the visible data.
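One way to extract only the visible text, assuming BeautifulSoup (bs4) is available (the module may extract visible data differently):

```python
# Hypothetical sketch of visible-text extraction; assumes bs4 is installed.
from bs4 import BeautifulSoup

def extract_visible_text(html: str) -> str:
    """Strip tags, scripts and styles, keeping only human-visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove elements a reader never sees
    return soup.get_text(separator="\n", strip=True)
```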
The Preprocessing module receives plain text and preprocesses it, splitting it into the documents and tokens of the collection (see the sketch after this list), where:
- The downloaded HTML is called the collection
- A document is every line or sentence in the HTML
- A token is every word contained in a document
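Under those definitions, the preprocessing step could be sketched as follows, assuming nltk's tokenizers are used (the real module may split documents by line instead of by sentence):

```python
# Hypothetical sketch of the preprocessing step; assumes nltk and its
# `punkt` tokenizer data are installed (nltk.download("punkt")).
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(collection: str):
    """Split the collection into documents, and each document into tokens."""
    documents = sent_tokenize(collection)               # one document per sentence
    tokens = [word_tokenize(doc) for doc in documents]  # words of each document
    return documents, tokens
```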
The Pos module computes the Part-of-Speech (POS) tag for every token.
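Since the project already depends on nltk, the tagging step is presumably a thin wrapper around nltk.pos_tag; a sketch (exact tags may vary by nltk version):

```python
# Hypothetical sketch of POS tagging; assumes the
# `averaged_perceptron_tagger` nltk data is installed.
import nltk

def tag_tokens(tokens):
    """Return (token, POS tag) pairs for a list of tokens."""
    return nltk.pos_tag(tokens)

# Example: tag_tokens(["The", "cat", "sat"])
# -> [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]
```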
The KeywordsFilter module computes which words and sentences are most important for indexing. To do so, it computes the Inverse Document Frequency (IDF) as IDF_t = log(N / DF_t), where:
- N: the number of documents in our collection.
- DF_t: the document frequency of term t, i.e. the number of documents that contain t.
The Term Frequency (TF) table is computed as well and, by combining both results, the TF-IDF table is obtained.
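A compact sketch of that computation, written directly from the formula above (the KeywordsFilter module may organise it differently):

```python
# Hypothetical sketch of the TF-IDF computation described above.
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one TF-IDF table per document."""
    n = len(documents)                     # N: number of documents in the collection
    df = Counter()                         # DF_t: documents containing term t
    for doc in documents:
        df.update(set(doc))                # count each term once per document
    tables = []
    for doc in documents:
        tf = Counter(doc)                  # raw term frequency in this document
        tables.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return tables
```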
The Stemming module reduces every token to its base form. It uses the WordNetLemmatizer class from the nltk module (which, strictly speaking, produces lemmas rather than stems).
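Example usage of that class (assumes the `wordnet` nltk data is installed via nltk.download("wordnet")):

```python
# Demonstration of WordNetLemmatizer, the class the Stemming module uses.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))              # -> "car"
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run" (verb hint needed)
```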