Uses the Screaming Frog Internal HTML export with extracted text, together with a shingling algorithm, to measure content duplication across the pages of a crawled site.
-
pip install -r requirements.txt
-
Run a Screaming Frog crawl and use Custom Extraction to pull the text content out of a specific DOM element.
-
Export the internal HTML to a CSV file.
-
Run the script using the following arguments.
Arguments:
-i : Input filename
-o : Output filename
-c : Column from Screaming Frog that contains your extracted content.
Example invocation:
python sf_shingling.py -i internal_html_ap.csv -o output_html_ap.csv -c "BodyContent 1"
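
For readers unfamiliar with shingling, the idea can be sketched roughly as follows. This is an illustrative sketch only, not necessarily how sf_shingling.py works internally: it assumes word-level shingles and Jaccard similarity, and the function names `shingles` and `jaccard`, the shingle size `k`, and the sample strings are all hypothetical.

```python
import re


def shingles(text, k=5):
    """Return the set of overlapping k-word shingles found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two hypothetical page bodies that differ by a single word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"

score = jaccard(shingles(page_a, k=3), shingles(page_b, k=3))
print(round(score, 2))  # 0.4
```

A score near 1.0 suggests two pages share most of their content, while a score near 0.0 suggests little overlap. Smaller values of `k` make the comparison more tolerant of small edits; larger values make it stricter.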