URL Extraction - Filtering
jjackson37 opened this issue · 4 comments
jjackson37 commented
Once the URLs have been retrieved from the RegEx they will most likely need to be filtered.
I think separating the filters out into their own classes would be the best approach here; I can then create instances of them in a collection and loop the URL collection through all of them.
I can think of two filters that we need so far:
- Error/malformed URLs (this one might need to run before the incomplete-URL building?)
- URLs that link to different domains
- Create filter data structure and interface
- Create media file filter
- Create duplicate address filter
- Create domain URL filter
- Create error URLs filter
- Create 404 filter (Not sure about this one yet)
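The filter-classes-in-a-collection idea could look something like this. A minimal sketch, assuming Python; the names (`UrlFilter`, `DomainFilter`, `DuplicateFilter`, `apply_filters`) are hypothetical and not from the repo:

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class UrlFilter(ABC):
    """Filter interface: each filter decides whether a URL is kept."""

    @abstractmethod
    def keep(self, url: str) -> bool:
        ...


class DomainFilter(UrlFilter):
    """Drops URLs that point at a different domain than the crawl root."""

    def __init__(self, root_domain: str):
        self.root_domain = root_domain

    def keep(self, url: str) -> bool:
        return urlparse(url).netloc == self.root_domain


class DuplicateFilter(UrlFilter):
    """Drops URLs that have already been seen in this run."""

    def __init__(self):
        self.seen = set()

    def keep(self, url: str) -> bool:
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


def apply_filters(urls, filters):
    """Loop the URL collection through every filter in turn."""
    return [u for u in urls if all(f.keep(u) for f in filters)]
```

One nice property of this shape is that the 404 filter (if it happens) can slot in as just another `UrlFilter` at the end of the collection, after the cheap filters have already removed most URLs.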
BoggartBear commented
- Filter out unnecessary media content such as images, videos, and sound files.
The list of types could be managed in a config for now?
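A config-driven media filter could be as simple as a JSON file of extensions that the filter loads at startup. A sketch, assuming Python; the file name and the `media_extensions` key are made up for illustration:

```python
import json
from urllib.parse import urlparse

# Hypothetical config file, e.g. filters.json:
# {"media_extensions": [".jpg", ".png", ".gif", ".mp4", ".mp3", ".wav"]}


def load_media_extensions(path: str) -> set:
    """Read the extension list out of the JSON config."""
    with open(path) as f:
        return set(json.load(f)["media_extensions"])


def is_media_url(url: str, extensions: set) -> bool:
    """True when the URL's path ends in a configured media extension."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in extensions)
```

Keeping the list in a config means new types can be excluded without a code change.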
BoggartBear commented
- Filter out pages that are unresponsive (returning 404 etc.), though this could be an expensive process.
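The expense comes from needing one network round trip per URL. A cheap-ish version is a HEAD request with a short timeout; a sketch, assuming Python's standard library (function name is hypothetical):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def is_responsive(url: str, timeout: float = 5.0) -> bool:
    """HEAD request: False on 4xx/5xx, timeout, or connection failure.

    Still one network round trip per URL, which is why running this
    over every extracted link can get expensive on large pages.
    """
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (HTTPError, URLError, TimeoutError):
        return False
```

Running this last, after the duplicate/domain/media filters have shrunk the list, would keep the cost down.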
jjackson37 commented
BoggartBear commented
Pushed duplicate and media removal filters under 5706d93
We need configs to store the media types to remove.