salimk/Rcrawler

Rcrawler not crawling some websites

Rifakh opened this issue · 4 comments

Hi Salim,

I am running Rcrawler on a vector of websites and have noticed that it fails to crawl some of them, for example:

http://www.alahleia.com
http://www.almalki.com

I have tried several depth levels and timeout values.

Thank you.

I'm having the same problem. I was only having the issue with https:// sites, but confirmed that the ones you listed were not working for me as well. Some that I was having trouble with were:

https://manager.submittable.com/beta/discover/?page=1&sort=
https://www.estheticapostle.com/

@Rifakh
Both websites can be crawled now.

http://www.alahleia.com
http://www.almalki.com


@amarbut
HTTPS websites can be crawled since version 0.1.3. Can you be more specific about the issue?
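For reference, a minimal HTTPS crawl looks like this. This is a sketch only: the URL is one of the sites reported above, and the depth/connection settings are placeholder values you should tune for your own run.

```r
library(Rcrawler)

# Crawl an HTTPS site (supported since Rcrawler 0.1.3).
# MaxDepth, no_cores and no_conn here are illustrative values.
Rcrawler(Website = "https://www.estheticapostle.com/",
         no_cores = 2,
         no_conn  = 2,
         MaxDepth = 2)
```

After the crawl finishes, the collected pages are available in the `INDEX` data frame that Rcrawler creates in the global environment.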

Subscribe to our mailing list to receive release notifications: http://eepurl.com/dMv_7s

@amarbut
Good news
Password-protected websites can be scraped with the latest version. For your case:

# Start a headless browser and log in to the site
LS <- run_browser()
LS <- LoginSession(Browser = LS,
                   LoginURL = 'https://manager.submittable.com/login',
                   LoginCredentials = c('your email', 'your password'),
                   cssLoginFields = c('#email', '#password'),
                   XpathLoginButton = '//*[@type="submit"]')
# Then scrape data with the authenticated session
DATA <- ContentScraper(Url = 'https://manager.submittable.com/beta/discover/119087',
                       XpathPatterns = c('//*[@id="submitter-app"]/div/div[2]/div/div/div/div/div[3]',
                                         '//*[@id="submitter-app"]/div/div[2]/div/div/div/div/div[2]/div[1]/div[1]'),
                       PatternsName = c("Article", "Title"),
                       astext = TRUE,
                       browser = LS)

Check the update to learn about the new features.