A Scala web scraping library, based on Scalext, for building Akka actor systems that scrape and collect data from any type of website.
Scalescrape is available on Maven Central (since version 0.4.0), and it is cross-compiled and published for Scala 2.12 and 2.11.
Older artifact versions are no longer available because my self-hosted Nexus Repository was shut down in favour of Bintray.
Using SBT, add the following dependency to your build file:
libraryDependencies ++= Seq(
  "io.bfil" %% "scalescrape" % "0.4.1"
)
If you have issues resolving the dependency, you can add the following resolver:
resolvers += Resolver.bintrayRepo("bfil", "maven")
The library offers two main actor traits that can be extended:
- A ScrapingActor, which can be used to define the web scraping logic of an actor
- A CollectionActor, which can be used to communicate with a ScrapingActor and collect all the data needed
The following example gives some insight into how to use the library.
The first step is to create a representation of the website that we are going to scrape, something like the following:
class ExampleWebsite {
  private val baseUrl = "http://www.example.com"
  val homePage = s"$baseUrl/home"
  def loginForm(username: String, password: String) =
    Form(s"$baseUrl/vm_sso/idp/login.action", Map(
      "username" -> username,
      "password" -> password))
  def updateAccountEmailRequest(newEmail: String) =
    Request(s"$baseUrl/account/update", s"""{"email": "$newEmail" }""")
}
The ExampleWebsite class defines the URL of the homepage, a login form, and a request object that can be used to update the account email on the example website.
Form and Request are part of the library, and are used to define the forms or requests that you need to submit in order to scrape the website.
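To see what these definitions evaluate to without the library on the classpath, the sketch below substitutes two hypothetical case classes for the library's Form and Request. The field names (url, data, body) are assumptions made for illustration only; the real types may be shaped differently.

```scala
// Hypothetical stand-ins for the library's Form and Request types.
// The field names are assumptions, chosen only so the sketch compiles on its own.
final case class Form(url: String, data: Map[String, String])
final case class Request(url: String, body: String)

class ExampleWebsite {
  private val baseUrl = "http://www.example.com"
  val homePage = s"$baseUrl/home"
  def loginForm(username: String, password: String) =
    Form(s"$baseUrl/vm_sso/idp/login.action", Map(
      "username" -> username,
      "password" -> password))
  def updateAccountEmailRequest(newEmail: String) =
    Request(s"$baseUrl/account/update", s"""{"email": "$newEmail" }""")
}

object WebsiteSketch {
  val website = new ExampleWebsite
  // The values that would later be handed to postForm and post.
  val form = website.loginForm("alice", "secret")
  val request = website.updateAccountEmailRequest("alice@example.com")
}
```

Because the JSON body is built by string interpolation, an email value containing a double quote would produce invalid JSON; a real scraper may want to build the body with a JSON library instead.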
The following will be the message protocol used by the actors to communicate:
object ExampleProtocol {
  case class UpdateAccountEmailWithCredentials(username: String, password: String, newEmail: String)
  case class Login(username: String, password: String)
  case object LoggedIn
  case object LoginFailed
  case class UpdateAccountEmail(newEmail: String)
  case object EmailUpdated
  case object EmailUpToDate
}
An example scraping actor can be defined like this:
class ExampleScraper extends ScrapingActor {
  // actor logic
}
In our actor logic we are going to create an instance of our ExampleWebsite for later use. We also create a variable to store some session cookies:
val website = new ExampleWebsite
var savedCookies: Map[String, HttpCookie] = Map.empty
In order to do anything on the website we have to log in first, so let's define a method on the actor that logs a user in using their credentials:
private def login(username: String, password: String) =
  scrape { // (1)
    postForm(website.loginForm(username, password)) { response => // (2)
      response.asHtml { doc => // (3)
        doc.$("title").text match { // (4)
          case "Login error" => complete(LoginFailed) // (5)
          case _ =>
            cookies { cookies => // (6)
              savedCookies = cookies // (7)
              complete(LoggedIn) // (8)
            }
        }
      }
    }
  }
1. Uses the scrape method to initialize the scraping action
2. Posts the login form using the postForm method, passing the Form instance
3. Parses the response as HTML and provides a JSoup document
4. Uses JSoup to get the text of the title tag
5. If the title tag is "Login error" it completes by sending back LoginFailed
6. Otherwise it gets the cookies from the current session
7. Stores the session cookies in our actor variable
8. Completes and sends back LoggedIn
Please note that the ScrapingActor retains the cookies automatically between requests that are part of the same action (between scrape and complete). The cookies can be manipulated using the actions addCookie, dropCookie, and withCookies.
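The three cookie actions can be pictured as transformations of a cookie map carried by the scraping context. The sketch below is a simplified model of that idea, not the library's implementation: Context, Action, and run are hypothetical names, and cookies are plain strings rather than HttpCookie values.

```scala
object CookieJarSketch {
  // Hypothetical, simplified context: only the cookie jar is modelled.
  final case class Context(cookies: Map[String, String])

  // An action is modelled as a function that receives the current context.
  type Action = Context => Unit

  // Replaces the whole jar before running the inner action.
  def withCookies(cookies: Map[String, String])(inner: Action): Action =
    ctx => inner(ctx.copy(cookies = cookies))

  // Adds (or overwrites) a single cookie before running the inner action.
  def addCookie(name: String, value: String)(inner: Action): Action =
    ctx => inner(ctx.copy(cookies = ctx.cookies + (name -> value)))

  // Removes a cookie by name; a missing name is ignored.
  def dropCookie(name: String)(inner: Action): Action =
    ctx => inner(ctx.copy(cookies = ctx.cookies - name))

  // Runs the built action against an empty jar and returns
  // the jar that the innermost action observed.
  def run(build: Action => Action): Map[String, String] = {
    var seen = Map.empty[String, String]
    val capture: Action = ctx => { seen = ctx.cookies }
    build(capture)(Context(Map.empty))
    seen
  }

  val result: Map[String, String] = run { inner =>
    withCookies(Map("session" -> "abc", "tracking" -> "xyz")) {
      addCookie("lang", "en") {
        dropCookie("tracking")(inner)
      }
    }
  }
}
```

In this model each action only changes the context seen by the actions nested inside it, which is why the nesting order of the calls matters.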
After logging in we can use the session cookies to perform other actions as an authenticated user. Let's create a method to update our email address on the website:
private def updateAccountEmail(newEmail: String) =
  scrape { // (1)
    withCookies(savedCookies) { // (2)
      get(website.homePage) { response => // (3)
        response.asHtml { doc => // (4)
          val currentEmail = doc.$("#account-email").text // (5)
          if (currentEmail != newEmail) { // (6)
            post(website.updateAccountEmailRequest(newEmail)) { response => // (7)
              response.asJson { jsonResponse => // (8)
                (jsonResponse \ "error") match {
                  case JString(message) => fail // (9)
                  case _ => complete(EmailUpdated) // (10)
                }
              }
            }
          } else complete(EmailUpToDate) // (11)
        }
      }
    }
  }
1. Uses the scrape method to initialize the scraping action
2. Adds the session cookies we saved previously to the scraping context so that they will be sent with the following requests
3. Gets the homepage of the example website
4. Parses the response as HTML
5. Gets the value of the current account email from the JSoup document
6. If the current email is different from the one we want to set
7. Posts a JSON request to the website to update our email
8. Parses the response as JSON and checks if there is an error message
9. Fails if the update email response contains an error message
10. Completes and sends back EmailUpdated if the email update was successful
11. If the current email is the same as the one we want to set, it completes and sends back EmailUpToDate
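Stripped of the HTTP calls, the branching in the method above reduces to a small decision function. The sketch below restates it with plain values so the three outcomes are easy to see. Outcome, Failed, decide, and the submit parameter are hypothetical names introduced for illustration; submit stands in for the POST request and returns an optional error message.

```scala
object UpdateEmailLogic {
  // Local outcome type; EmailUpdated and EmailUpToDate mirror the protocol messages.
  sealed trait Outcome
  case object EmailUpdated extends Outcome
  case object EmailUpToDate extends Outcome
  case object Failed extends Outcome

  // submit models the POST: Some(error) means the response carried an "error" field.
  def decide(currentEmail: String, newEmail: String)(submit: String => Option[String]): Outcome =
    if (currentEmail != newEmail) {
      submit(newEmail) match {
        case Some(_) => Failed        // (9) the response contains an error message
        case None    => EmailUpdated  // (10) the update succeeded
      }
    } else EmailUpToDate              // (11) nothing to change, no request is sent
}
```

For example, `UpdateEmailLogic.decide("a@example.com", "a@example.com")(_ => None)` never calls submit, matching the fact that the actor skips the POST when the email is already up to date.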
Finally, we can define our actor's receive method:
def receive = {
  case Login(username, password) => login(username, password)
  case UpdateAccountEmail(newEmail) => updateAccountEmail(newEmail)
}
This actor can now be used to log in to our example website and update our email address by sending the appropriate messages to it.
Let's continue and create an actor that performs both actions for us.
An example collection actor can be defined like this:
class ExampleCollector extends CollectionActor[ExampleScraper] {
  // actor logic
}
Here's the actor logic:
def receive = {
  case UpdateAccountEmailWithCredentials(username, password, newEmail) =>
    collect { // (1)
      askTo(Login(username, password)) { // (2)
        case LoggedIn => // (3)
          askTo(UpdateAccountEmail(newEmail)) { // (4)
            case x => complete(x) // (5)
          }
        case LoginFailed => complete(LoginFailed) // (6)
      }
    }
}
1. Uses the collect method to initialize the collection action by creating an ExampleScraper actor under the hood
2. Asks the scraper to log in with the credentials received
3. If the scraper returns LoggedIn
4. It goes on by asking it to UpdateAccountEmail with the new email
5. Then it completes and sends back whatever is received from the scraper as the response of the action (the complete action kills the internal scraping actor)
6. In case the login fails it sends back LoginFailed
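The collector's flow is an ask followed by a second ask that depends on the first reply. The sketch below models that chain with plain scala.concurrent.Future values instead of actors. The ask parameter and fakeScraper are hypothetical stand-ins for the scraping actor, and the protocol messages are redeclared locally so the block compiles on its own.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object CollectorFlowSketch {
  // Local copies of the protocol messages used by the flow.
  case class Login(username: String, password: String)
  case object LoggedIn
  case object LoginFailed
  case class UpdateAccountEmail(newEmail: String)
  case object EmailUpdated

  // ask stands in for sending a message to the scraper and awaiting its reply.
  def updateWithCredentials(ask: Any => Future[Any])(
      username: String, password: String, newEmail: String)(
      implicit ec: ExecutionContext): Future[Any] =
    ask(Login(username, password)).flatMap {
      case LoggedIn    => ask(UpdateAccountEmail(newEmail)) // forward whatever the scraper replies
      case LoginFailed => Future.successful(LoginFailed)
      // Not handled in the example above; added here so the match is total.
      case other       => Future.failed(new IllegalStateException(s"Unexpected reply: $other"))
    }

  // A fake scraper that accepts a single password.
  def fakeScraper(validPassword: String): Any => Future[Any] = {
    case Login(_, password) =>
      Future.successful(if (password == validPassword) LoggedIn else LoginFailed)
    case UpdateAccountEmail(_) => Future.successful(EmailUpdated)
    case other => Future.failed(new IllegalArgumentException(s"Unknown message: $other"))
  }

  // Blocks for the result; acceptable in a sketch, best avoided inside actors.
  def runBlocking(password: String): Any = {
    import ExecutionContext.Implicits.global
    Await.result(
      updateWithCredentials(fakeScraper("secret"))("alice", password, "alice@example.com"),
      5.seconds)
  }
}
```

The model leaves out what the collection DSL adds on top of the chain: replying to the original sender and stopping the internal scraping actor on complete.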
This was a simple example of some of the capabilities of the library. For more details, see the documentation.
The main components of Scalescrape are the ScrapingActor and the CollectionActor traits.
To understand the details of the internal mechanics of the DSL read the documentation of Scalext.
You can create a scraping Akka actor and use the scraping DSL by extending the ScrapingActor trait.
scrape
def scrape[T](scrapingAction: Action)(implicit ac: ActorContext): Unit
It creates a ScrapingContext with a reference to the current message sender and an empty cookie jar, and passes it to the inner action:
scrape {
  ctx => println(ctx.requestor, ctx.cookies) // current sender, cookies
}
get
def get(url: String)(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]
It sends a GET request to the URL provided and passes the response into the inner action:
get("http://www.example.com/home") { response =>
  ctx => Unit
}
post
def post[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]
It sends the POST request and passes the response into the inner action:
post(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}
postForm
def postForm[T](form: Form)(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]
It sends the POST request with form data and passes the response into the inner action:
postForm(Form("http://www.example.com/submit-form", Map("some" -> "data"))) { response =>
  ctx => Unit
}
put
def put[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]
It sends the PUT request and passes the response into the inner action:
put(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}
delete
def delete[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]
It sends the DELETE request and passes the response into the inner action:
delete(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}
cookies
def cookies: ChainableAction1[Map[String, HttpCookie]]
It extracts the cookies from the current context and passes them into the inner function:
cookies { cookies =>
  ctx => Unit
}
withCookies
def withCookies(cookies: Map[String, HttpCookie]): ChainableAction0
It replaces the cookies of the current context with the ones specified and calls the inner function with the new context:
withCookies(newCookies) {
  ctx => Unit
}
addCookie
def addCookie(cookie: HttpCookie): ChainableAction0
It adds a cookie to the current context and calls the inner function with the new context:
addCookie(newCookie) {
  ctx => Unit
}
dropCookie
def dropCookie(cookieName: String): ChainableAction0
It drops the cookie with the given name from the current context and calls the inner function with the new context:
dropCookie("someCookie") {
  ctx => Unit
}
complete
def complete[T](message: Any): ActionResult
Completes the scraping action by sending the specified message back to the original sender:
complete("done")
fail
def fail: ActionResult
Returns an Akka status failure message back to the original sender:
fail
You can create a collection Akka actor and use the collection DSL by extending the CollectionActor[T] trait, where T is a ScrapingActor.
collect
def collect(collectionAction: Action)(implicit tag: ClassTag[Scraper], ac: ActorContext): Unit
It spawns an instance of the ScrapingActor specified as a type parameter under the hood. It creates a CollectionContext with a reference to the scraping actor and to the current message sender, and passes the context to the inner action:
collect {
  ctx => println(ctx.requestor, ctx.scraper) // current sender, scraping actor
}
collectUsingScraper
def collectUsingScraper(scraper: ActorRef)(collectionAction: Action)(implicit ac: ActorContext): Unit
It creates a CollectionContext with a reference to the scraping actor specified and to the current message sender, and passes the context to the inner action:
collectUsingScraper(myScrapingActor) {
  ctx => println(ctx.requestor, ctx.scraper) // current sender, scraping actor
}
askTo
def askTo(messages: Any)(implicit ec: ExecutionContext): ChainableAction1[Any]
It sends messages (using akka.pattern.ask) to the scraping actor in the collection context and passes the received messages to the inner action.
Please note: it currently only handles up to 3 messages correctly.
askTo("say hello") {
  case "hello" => complete("thanks")
  case _ => fail
}
askTo("say hello", "say world") {
  case ("hello", "world") => complete("thanks")
  case _ => fail
}
askTo("say hello", "say world", "say bye") {
  case ("hello", "world", "bye") => complete("bye")
  case _ => fail
}
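Asking for several messages at once amounts to issuing the requests and gathering the replies into a tuple, in the order the messages were given. A minimal model of the two-message case with plain Futures follows. The names ask2 and responder are hypothetical, the real askTo is backed by akka.pattern.ask rather than by a function, and whether the library issues its asks concurrently is not stated here; the sketch simply starts both before waiting on either.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object MultiAskSketch {
  // Sends both messages, then pairs the replies in the order the messages were given.
  def ask2(ask: Any => Future[Any])(first: Any, second: Any)(
      implicit ec: ExecutionContext): Future[(Any, Any)] = {
    val firstReply = ask(first)   // both requests are started here,
    val secondReply = ask(second) // before either reply is awaited
    for {
      a <- firstReply
      b <- secondReply
    } yield (a, b)
  }

  // A responder that strips the "say " prefix, mirroring the messages used above.
  val responder: Any => Future[Any] = {
    case text: String if text.startsWith("say ") => Future.successful(text.drop(4))
    case other => Future.failed(new IllegalArgumentException(s"Unknown message: $other"))
  }

  // Blocks for the result; acceptable in a sketch, best avoided inside actors.
  def runBlocking(): String = {
    import ExecutionContext.Implicits.global
    val outcome = ask2(responder)("say hello", "say world").map {
      case ("hello", "world") => "thanks"
      case _                  => "failed"
    }
    Await.result(outcome, 5.seconds)
  }
}
```

If either reply fails, the combined Future fails as well, so the tuple pattern is only ever matched against a complete set of replies.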
scraper
def scraper: ChainableAction1[ActorRef]
It extracts the scraper from the current context and passes it into the inner function:
scraper { scraper =>
  ctx => Unit
}
withScraper
def withScraper(scraper: ActorRef): ChainableAction0
It replaces the scraper of the current context with the one specified and calls the inner function with the new context:
withScraper(newScraper) {
  ctx => Unit
}
notify
def notify[T](message: Any): ChainableAction0
Sends a message back to the original sender and calls the inner action:
notify("hello") {
  ctx => Unit
}
complete
def complete[T](message: Any): ActionResult
Completes the collection action by sending the specified message back to the original sender:
complete("done")
keepAlive
def keepAlive: ActionResult
Completes the collection action without sending any message back to the original sender, and keeps the scraping actor alive:
keepAlive
fail
def fail: ActionResult
Returns an Akka status failure message back to the original sender and kills the scraping actor:
fail
This software is licensed under the Apache 2 license, quoted below.
Copyright © 2014-2017 Bruno Filippone http://bfil.io
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.