Normalization of Relative Links
michascholz opened this issue · 2 comments
Thanks a lot for all the work!
Several websites include links with relative references (e.g., "page-1.html" instead of "http://domain.com/page-1.html"). The LinkNormalization function works fine for absolute links but fails to normalize relative links correctly. Could you please extend the function so that it recognizes relative links and, where necessary, adds not only the protocol but also the base URL to a link?
Best wishes,
Michael
Could you be more specific about the link structure you want to normalize?
To process relative links, you should also set the "current" argument, which represents the URL of the current web document.
Thanks for your feedback; I will try to improve the function in the next release.
For now, these are the supported link structures:

    links <- c("http://www.twitter.com/share?url=http://glofile.com/page.html",
               "/finance/banks/page-2017.html",
               "./section/subscription.php",
               "//section/",
               "www.glofile.com/home/",
               "glofile.com/sport/foot/page.html",
               "sub.glofile.com/index.php",
               "http://glofile.com/page.html#1",
               "?tags%5B%5D=votingrights&sort=popular")

    > LinkNormalization(links, "http://glofile.com")
    [1] "http://glofile.com/finance/banks/page-2017.html"
    [2] "http://glofile.com/section/subscription.php"
    [3] "http://www.glofile.com/home/"
    [4] "http://glofile.com/sport/foot/page.html"
    [5] "http://sub.glofile.com/index.php"
    [6] "http://glofile.com/page.html"
    [7] "http://glofile.com?tags%5B%5D=votingrights&sort=popular"
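For anyone curious what resolving a relative link against the current document URL involves, here is a minimal base R sketch. It is not Rcrawler's implementation: `resolve_link` is a hypothetical helper written for illustration, it assumes `current` starts with `http://` or `https://`, it does not handle `../` segments, and, unlike LinkNormalization, it treats bare host-like strings such as "www.glofile.com/home/" as ordinary relative paths.

```r
# Hypothetical helper, NOT part of Rcrawler: resolve a single link against
# the URL of the page it was found on. Assumes `current` is http(s).
resolve_link <- function(link, current) {
  scheme <- sub("^(https?:)//.*$", "\\1", current)         # e.g. "http:"
  origin <- sub("^(https?://[^/?#]+).*$", "\\1", current)  # scheme + host
  # Path of the current document without query string or fragment.
  path <- sub("[?#].*$", "", substring(current, nchar(origin) + 1))
  # Directory part of that path, always ending in "/".
  dir <- sub("[^/]*$", "", path)
  if (dir == "") dir <- "/"

  link <- sub("#.*$", "", link)                            # drop the fragment
  if (link == "") return(sub("#.*$", "", current))         # fragment-only link

  if (grepl("^https?://", link)) return(link)              # already absolute
  if (startsWith(link, "//")) return(paste0(scheme, link)) # protocol-relative
  if (startsWith(link, "/")) return(paste0(origin, link))  # root-relative
  if (startsWith(link, "?")) return(paste0(origin, path, link))  # query only
  link <- sub("^(\\./)+", "", link)                        # strip leading "./"
  paste0(origin, dir, link)                                # document-relative
}

# Apply the helper to several links found on the same page.
links <- c("/finance/banks/page-2017.html",
           "./section/subscription.php",
           "http://glofile.com/page.html#1")
resolved <- vapply(links, resolve_link, character(1),
                   current = "http://glofile.com", USE.NAMES = FALSE)
print(resolved)
```

For these three inputs the sketch produces the same URLs as the corresponding entries in the LinkNormalization output above. With a deeper `current`, such as "http://glofile.com/news/index.html", a bare "page-1.html" would resolve into the "/news/" directory, which is why the document URL has to be supplied along with the links.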
Rcrawler v0.1.9 has been released with many new features.
Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s