get37 🪠

get37 is a Scala / ZIO based web scraper/spider built as part of a technical assignment at 13|37.

🏃‍♂️ Usage

After the project is assembled into an "über-JAR" (see Development below), you can run it like this:

$ java -jar target/*/get37.jar https://tretton37.com
$ java -jar target/*/get37.jar --maxFibers 10 --preFetchDelay 70 --maxDepth 4 https://zio.dev
$ java -jar target/*/get37.jar --help # for more help

get37 currently supports three configuration flags that can be passed when the tool is started:

  • maxFibers (default: 10) tells the ZIO runtime how many concurrent fibers can be used when sub-requests are being made.
  • preFetchDelay (default: 10 milliseconds) adds a delay before each subsequent request is made.
  • maxDepth (default: 3) serves as a hard limit on how deep the spider goes into the site's structure.
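
To make the maxDepth and preFetchDelay flags more concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over a made-up in-memory site. It is not get37's actual implementation: the object, the `site` map, and the `crawl` function are hypothetical, it uses only the Scala standard library instead of ZIO, it runs sequentially (so maxFibers is not modelled), and it assumes the start URL counts as depth 0.

```scala
// Illustrative only: none of these names come from get37 itself.
object DepthLimitedCrawl {

  // Hypothetical "site": each page maps to the links found on it.
  val site: Map[String, List[String]] = Map(
    "/"      -> List("/a", "/b"),
    "/a"     -> List("/a/1", "/"),
    "/b"     -> List("/a"),
    "/a/1"   -> List("/a/1/x"),
    "/a/1/x" -> Nil
  )

  /** Breadth-first crawl that returns pages in the order they were visited.
    * Assumption: the start page is depth 0 and pages deeper than
    * `maxDepth` are never visited.
    */
  def crawl(start: String, maxDepth: Int, preFetchDelayMs: Long): List[String] = {
    @annotation.tailrec
    def loop(frontier: List[String], depth: Int, visited: Vector[String]): Vector[String] =
      if (frontier.isEmpty || depth > maxDepth) visited
      else {
        val seen = visited ++ frontier
        val next = frontier
          .flatMap { page =>
            Thread.sleep(preFetchDelayMs) // stand-in for the pre-fetch delay
            site.getOrElse(page, Nil)     // stand-in for fetch + link extraction
          }
          .distinct
          .filterNot(seen.contains)       // never revisit a page
        loop(next, depth + 1, seen)
      }

    loop(List(start), 0, Vector.empty).toList
  }
}
```

Under these assumptions, `DepthLimitedCrawl.crawl("/", 2, 10L)` visits `/`, `/a`, `/b` and `/a/1`, but stops before `/a/1/x`, which sits at depth 3.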

🏗 Development

This project uses Nix Shell (shell.nix) for dependency management. JDK and SBT are the only dependencies.

$ sbt "run https://tretton37.com"

To build the "über-JAR", this project uses the sbt-assembly and sbt-native-packager plugins.

$ sbt assembly
$ java -jar target/*/get37.jar

Testing

This project also comes with tests, which can be run with SBT, and with a CircleCI setup.

$ sbt test

Dependencies

  • zio - High-performance, type-safe, composable asynchronous and concurrent programming library and framework for Scala.
  • zio-cli - Powerful command-line applications framework for ZIO.
  • zio-http (ex-zhttp) - A Scala library for building HTTP apps. It is powered by ZIO and Netty and aims at being the de facto solution for writing highly scalable and performant web applications using idiomatic Scala.
  • jsoup - A Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. In this project it is only used for content/link extraction.
  • os-lib - A simple, flexible, high-performance Scala interface to common OS filesystem and subprocess APIs.

Author

Oto Brglez