salimk/Rcrawler

Crawling pages with same url

ioanasend opened this issue · 1 comments

Hello,

I'm trying to scrape press releases from the UN Office of the High Commissioner for Human Rights. The problem is that the website uses the same URL for its news search tool and any specific search that one runs -- it's always http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx. I should note that while the articles themselves have unique URLs, I also need the data from the search tables for my project.

So how can I crawl a website structured like this using Rcrawler? The program doesn't seem to find the table segments even if I specify them using CSS.

I've run the following script for a whole day without the crawler finding any match:
Rcrawler(Website = "http://www.ohchr.org/EN/NewsEvents/Pages/NewsSearch.aspx",
         ExtractCSSPat = c("#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblTitle",
                           "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblDate",
                           "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_NewsType li",
                           "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_CountryID li",
                           "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_MandateID li",
                           "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_SubjectID li"),
         ManyPerPattern = TRUE,
         PatternsNames = c("Title", "Date", "News type", "Country ID", "Mandate", "Subject"))

Any help you can provide would be very much appreciated!

Hello,
At the moment, our crawler can only reach pages by following the links it extracts from other pages, so it cannot get at search results that are all served under a single URL.
There is still a way to achieve your goal.
A press release URL looks like this:
https://www.ohchr.org/EN/NewsEvents/Pages/DisplayNews.aspx?NewsID=**ID**&LangID=**LANG**
I noticed that the most recent post has ID 23740 and the oldest one (from 2014) has ID 15464. You can therefore loop over all of these posts with the ContentScraper function:

library(Rcrawler)
library(foreach)
# Scrape each press release by NewsID; with 'pass', failed requests are kept as error objects.
# Note: these selectors are copied from the search page and may need adjusting for article pages.
DATA <- foreach(id = 15464:23740, .verbose = FALSE, .inorder = FALSE, .errorhandling = 'pass') %do%
        {
         ContentScraper(Url = paste0("https://www.ohchr.org/EN/NewsEvents/Pages/DisplayNews.aspx?NewsID=", id, "&LangID=E"), CssPatterns = c("#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblTitle", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_lblDate", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_NewsType li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_CountryID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_MandateID li", "#ctl00_PlaceHolderMain_SearchNewsID_gvNewsSearchresult_ctl03_SubjectID li"), ManyPerPattern = TRUE)
        }
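
If it helps, here is a rough base-R sketch of the same loop that can be tried without touching the network first. It is only an illustration under some assumptions: `build_news_urls`, `scrape_all` and `fake_scraper` are hypothetical helper names (not part of Rcrawler), and I am assuming that some IDs in the range may not exist, so errors are simply skipped.

```r
# Build the press-release URL for each NewsID (URL pattern taken from the reply above).
build_news_urls <- function(ids, lang = "E") {
  paste0(
    "https://www.ohchr.org/EN/NewsEvents/Pages/DisplayNews.aspx?NewsID=",
    ids, "&LangID=", lang
  )
}

# Apply `scrape_fn` to every URL, optionally pausing between requests, and
# turn errors into NULL so that one bad page does not abort the whole run.
# `scrape_fn` is any function of a single URL, e.g. a wrapper around ContentScraper.
scrape_all <- function(urls, scrape_fn, delay = 0) {
  results <- lapply(urls, function(u) {
    if (delay > 0) Sys.sleep(delay)
    tryCatch(scrape_fn(u), error = function(e) NULL)
  })
  names(results) <- urls
  # Drop the pages that failed (gaps in the ID range are assumed, not verified).
  Filter(Negate(is.null), results)
}

# Dry run with a stand-in scraper that fails for one ID (no network access needed).
fake_scraper <- function(u) {
  if (grepl("NewsID=15465", u, fixed = TRUE)) stop("page not found")
  list(url = u)
}
urls <- build_news_urls(15464:15466)
pages <- scrape_all(urls, fake_scraper)
length(pages)  # 2: the failing ID was skipped
```

For a real run, `scrape_fn` could be something like `function(u) ContentScraper(Url = u, CssPatterns = my_selectors, ManyPerPattern = TRUE)`, where `my_selectors` is a placeholder for selectors that match the article page, and a `delay` of a second or so keeps the load on the site reasonable.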