Simplify HTML IP address parser in crawler
cermmik opened this issue · 2 comments
cermmik commented
The IP address search in HTML matches strings that are not addresses. Reduce the regular expressions so they match only common address formats. Maybe also add info about the address source (connection, HTML, DNS, ...) to the output YAML file.
When using the attached file test.zip, the crawler produces two incorrectly identified addresses: '7.0.5.010' and '::'.
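To illustrate why a string such as '7.0.5.010' is probably a version number rather than an address, here is a minimal sketch of a stricter IPv4 check. The names `IPV4_STRICT` and `is_plain_ipv4` are hypothetical and not taken from the crawler. The rule it applies (four octets in 0-255, no leading zeros) is one possible way to narrow the match to common address strings, not necessarily the one the project uses.

```python
import re

# Hypothetical strict octet: 0-255 with no leading zeros.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)"
IPV4_STRICT = re.compile(rf"{_OCTET}(?:\.{_OCTET}){{3}}")


def is_plain_ipv4(candidate: str) -> bool:
    """Return True only if the whole string is a dotted quad without leading zeros."""
    return IPV4_STRICT.fullmatch(candidate) is not None


if __name__ == "__main__":
    for s in ("7.0.5.010", "192.168.1.10", "256.1.1.1"):
        print(s, is_plain_ipv4(s))
```

With this rule, '7.0.5.010' is rejected because its last octet has a leading zero.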
TomasMadeja commented
IPv4 and IPv6 matching now checks some leading and trailing characters. The output YAML now provides info about matched protocols, first occurrence, and number of occurrences. 992f1e2
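The idea of checking the characters around a match could be sketched roughly as follows for IPv4. This is not the code from the commit: the pattern, the function `summarize_ipv4`, and the keys `first_occurrence` and `occurrences` are assumptions made for illustration, and "first occurrence" is taken here to mean the character offset of the first match.

```python
import re

# Hypothetical octet pattern: 0-255 with no leading zeros.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)"

# Leading check: not preceded by a word character or a dot.
# Trailing check: not followed by a word character or by ".<digit>",
# so a sentence-ending period after an address is still accepted.
IPV4_IN_TEXT = re.compile(
    rf"(?<![\w.]){_OCTET}(?:\.{_OCTET}){{3}}(?!\w|\.\d)"
)


def summarize_ipv4(text: str) -> dict:
    """Map each matched address to its first offset and its match count."""
    summary = {}
    for match in IPV4_IN_TEXT.finditer(text):
        entry = summary.setdefault(
            match.group(),
            {"first_occurrence": match.start(), "occurrences": 0},
        )
        entry["occurrences"] += 1
    return summary


if __name__ == "__main__":
    sample = "version 7.0.5.010, host 10.0.0.1 and again 10.0.0.1."
    print(summarize_ipv4(sample))
```

The lookarounds keep the pattern from matching inside longer dotted strings such as version numbers, while the resulting dictionary has a shape that could be dumped to YAML.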
TomasMadeja commented
Crawler and Normalizer ignore (some) IP and MAC addresses with special meaning. ee3cf53
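As an illustration of ignoring addresses with special meaning, here is a minimal sketch based on Python's standard `ipaddress` module. Which categories the commit actually ignores is not stated here, so the set below (unspecified, loopback, multicast, link-local, reserved) is an assumption, as is the function name `is_reportable`. MAC addresses are not covered.

```python
import ipaddress


def is_reportable(candidate: str) -> bool:
    """Return False for unparsable strings and for special-purpose addresses."""
    try:
        addr = ipaddress.ip_address(candidate)
    except ValueError:
        # Not a syntactically valid IPv4 or IPv6 address.
        return False
    # Assumed list of special categories; adjust to match project policy.
    return not (
        addr.is_unspecified
        or addr.is_loopback
        or addr.is_multicast
        or addr.is_link_local
        or addr.is_reserved
    )


if __name__ == "__main__":
    for s in ("::", "127.0.0.1", "8.8.8.8"):
        print(s, is_reportable(s))
```

Under these assumptions, '::' is dropped because it is the IPv6 unspecified address, even though it is syntactically valid.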