RsparkleR
RsparkleR provides an R interface for launching virtual machines and deploying Sparkler as painlessly as possible, with a few lines of code from your local R session.
Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and Felix.
See the full documentation on the Sparkler website.
Detailed instructions: https://data-seo.com/2017/12/17/advanced-r-programming-seo-crawler
- Configure an OVH Cloud Project with billing: https://api.ovh.com/createToken/index.cgi?GET=/*&POST=/*&PUT=/*&DELETE=/*
- Create your SSH keys: sshPubKeyPath, sshPrivKeyPath
- Choose your regionVM (SBG3, BHS3, WAW1, UK1, DE1, GRA3):
  - SBG3 datacenter is in France
  - BHS3 datacenter is in Canada
  - WAW1 datacenter is in Poland
  - UK1 datacenter is in the UK
  - DE1 datacenter is in Germany
  - GRA3 datacenter is in France
- Choose your typeVM (s1-2, s1-4, ...) and SSH key. For the range of cloud servers, see https://www.ovh.co.uk/public-cloud/instances/prices/
- Run:
  library(RsparkleR)
  ovh <- importOvh()
  client <- loadClient(ovh, endpoint, application_key, application_secret, consumer_key)
- Run:
  vm <- createSparkler(client, regionVM='UK1', typeVM='s1-4', sshPubKeyPath, sshPrivKeyPath)
- Wait for the installation to finish. When your instance is ready, you get a vm object with its IP address, and port 22 is open.
- Now you can deploy your Sparkler.
- Deploy your Docker container with Sparkler. Run:
  startSparkler(vm, prod=TRUE, debug=TRUE)
  Be patient the first time.
- Launch the crawl:
  crawlid <- startCrawl(vm, url="https://data-seo.com", topUrls=100, topGroups=5, maxIter=2, debug=TRUE)
- Get the results from Solr:
  crawlDF <- readSolr(vm, pattern, crawlid, topUrls=100, extracted=TRUE)
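
The steps above pass several values (API credentials, region code, SSH key paths) that are easy to get wrong before any VM is created. The following is a minimal sketch of a pre-flight check, not part of RsparkleR: the helpers `read_ovh_credentials()` and `check_sparkler_inputs()` are hypothetical, and the environment variable names (`OVH_ENDPOINT`, `OVH_APPLICATION_KEY`, `OVH_APPLICATION_SECRET`, `OVH_CONSUMER_KEY`) are assumptions chosen for illustration. The region codes are the ones listed in the setup steps. typeVM is not validated because the full list of instance types is not given here.

```r
# Region codes accepted for regionVM, with locations as listed in the setup steps.
ovh_regions <- c(
  SBG3 = "France",
  BHS3 = "Canada",
  WAW1 = "Poland",
  UK1  = "UK",
  DE1  = "Germany",
  GRA3 = "France"
)

# Hypothetical helper (not part of RsparkleR): read the OVH API credentials
# from environment variables. The variable names are assumptions.
read_ovh_credentials <- function() {
  vars <- c(
    endpoint           = "OVH_ENDPOINT",
    application_key    = "OVH_APPLICATION_KEY",
    application_secret = "OVH_APPLICATION_SECRET",
    consumer_key       = "OVH_CONSUMER_KEY"
  )
  values <- Sys.getenv(vars)          # unset variables come back as ""
  is_missing <- !nzchar(values)
  if (any(is_missing)) {
    stop("Missing environment variable(s): ",
         paste(vars[is_missing], collapse = ", "))
  }
  names(values) <- names(vars)        # rename to the argument names used by loadClient()
  as.list(values)
}

# Hypothetical helper (not part of RsparkleR): fail early on an unknown
# region code or on SSH key files that do not exist.
check_sparkler_inputs <- function(regionVM, sshPubKeyPath, sshPrivKeyPath) {
  if (!(regionVM %in% names(ovh_regions))) {
    stop("Unknown regionVM '", regionVM, "'. Expected one of: ",
         paste(names(ovh_regions), collapse = ", "))
  }
  key_paths <- c(sshPubKeyPath, sshPrivKeyPath)
  not_found <- key_paths[!file.exists(key_paths)]
  if (length(not_found) > 0) {
    stop("SSH key file(s) not found: ", paste(not_found, collapse = ", "))
  }
  invisible(TRUE)
}

# Usage with the calls from this README (requires RsparkleR and an OVH project):
# creds  <- read_ovh_credentials()
# client <- loadClient(ovh, creds$endpoint, creds$application_key,
#                      creds$application_secret, creds$consumer_key)
# check_sparkler_inputs("UK1", sshPubKeyPath, sshPrivKeyPath)
# vm <- createSparkler(client, regionVM = "UK1", typeVM = "s1-4",
#                      sshPubKeyPath, sshPrivKeyPath)
```

Keeping the credentials in environment variables rather than in the script is one possible design choice; it avoids committing secrets alongside the crawl code.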
- Thamme Gowda and USC Data Science ( http://irds.usc.edu ) for creating Sparkler
- Mark Edmondson for the googleComputeEngineR package for providing an R interface to the Google Cloud Compute Engine API, for launching virtual machines.
- Scott Chamberlain for the analogsea package for launching Digital Ocean VMs, which inspired the SSH connector functions for this one.
- Winston Chang for the harbor package, where the docker functions come from. If harbor is published on CRAN, it will become a dependency of this package.
GitHub version:
library(devtools)
install_github("voltek62/RsparkleR")
CRAN version:
Waiting...