salimk/Rcrawler

Rcrawler not crawling some websites

Rifakh opened this issue · 4 comments

Hi Salim,

I am running Rcrawler on a vector of websites and have noticed that it fails to crawl some of them, for example:

http://www.alahleia.com
http://www.almalki.com

I have tried several depth levels and timeout values.

Thank you.

I'm having the same problem. I was only having the issue with https:// sites, but confirmed that the ones you listed were not working for me as well. Some that I was having trouble with were:

https://manager.submittable.com/beta/discover/?page=1&sort=
https://www.estheticapostle.com/

@Rifakh
Both websites can be crawled now.

http://www.alahleia.com
http://www.almalki.com


@amarbut
HTTPS websites can be crawled since version 0.1.3. Can you be more specific about the issue?
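For reference, a minimal HTTPS crawl looks like this. This is a sketch only: the URL is one of the sites reported above, and the depth/connection settings are placeholder values you should tune for your own run.

```r
library(Rcrawler)

# Crawl an HTTPS site (supported since Rcrawler 0.1.3).
# MaxDepth, no_cores and no_conn here are illustrative values.
Rcrawler(Website = "https://www.estheticapostle.com/",
         no_cores = 2,
         no_conn  = 2,
         MaxDepth = 2)
```

After the crawl finishes, the collected pages are available in the `INDEX` data frame that Rcrawler creates in the global environment.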

Subscribe to our mailing list to receive release notifications: http://eepurl.com/dMv_7s

@amarbut
Good news
Password-protected websites can be scraped with the latest version. For your case:

# Start a headless browser and log in to the site
LS <- run_browser()
LS <- LoginSession(Browser = LS,
                   LoginURL = 'https://manager.submittable.com/login',
                   LoginCredentials = c('your email', 'your password'),
                   cssLoginFields = c('#email', '#password'),
                   XpathLoginButton = '//*[@type="submit"]')
# Then scrape data with the authenticated session
DATA <- ContentScraper(Url = 'https://manager.submittable.com/beta/discover/119087',
                       XpathPatterns = c('//*[@id="submitter-app"]/div/div[2]/div/div/div/div/div[3]',
                                         '//*[@id="submitter-app"]/div/div[2]/div/div/div/div/div[2]/div[1]/div[1]'),
                       PatternsName = c("Article", "Title"),
                       astext = TRUE,
                       browser = LS)

Check the update to learn about the new features.