This project has educational goals: it explores how to use Scala in Databricks, consuming an API and handling the response with Spark and with native solutions.
link to the solution in Databricks --> here <--
The Reddit API was used. It returns the top 50 stocks discussed in the Wallstreetbets subreddit over the last 15 minutes, including a sentiment analysis of the discussions. Documentation is available here.
For this project, DBR 13.3 was used*, and the cluster environment was configured to use JDK 11:
JNAME=zulu11-ca-amd64
This JVM is necessary to enable `java.net.http.HttpRequest`** in Databricks and is required by the request libraries described below:
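As a quick sanity check that the cluster JVM really exposes `java.net.http`, the sketch below only *builds* an `HttpRequest` without sending it; on a JDK 8 cluster this is exactly where the `NoClassDefFoundError` mentioned below would surface. The URI is a placeholder, not the project's endpoint.

```scala
// Sanity check: java.net.http requires JDK 11+ (e.g. JNAME=zulu11-ca-amd64).
import java.net.URI
import java.net.http.HttpRequest

object JvmCheck {
  def main(args: Array[String]): Unit = {
    // Building the request does not open a connection, so this is safe to
    // run anywhere; it only fails if java.net.http is missing from the JVM.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://example.com")) // placeholder endpoint
      .GET()
      .build()
    println(request.method())
  }
}
```

If this cell throws `BootstrapMethodError` or `NoClassDefFoundError`, revisit the `JNAME` setting above before installing the libraries.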
Scala package | Description | Maven coordinates | Reference |
---|---|---|---|
sttp.client3 | Scala library that provides HTTP request and response handlers. | com.softwaremill.sttp.model:core_2.12:1.7.10 | Documentation |
sttp.model | Provides HTTP models such as headers, URIs, methods, etc. Required for sttp.client. | com.softwaremill.sttp.tapir:tapir-sttp-client_2.12:1.10.6 | Documentation |
*DBR 11.3 through 14.3 have been tested; no incompatibility is expected.
**Error found: `BootstrapMethodError: java.lang.NoClassDefFoundError: java/net/http/HttpRequest`. Solution found here.
A class to handle the sttp.client `Response`, with the attributes:

- `client`: A `SimpleHttpClient` instance from sttp.client, used to execute the request.
- `requestEndpoint`: The informed endpoint.
- `successStatusCode`: The 200 status code.

And the methods:

- `getResponse`: Returns the GET response from the informed endpoint.
- `checkRequestStatusCode`: Raises an exception if the response status code is different from 200.
- `transformResponseToDataframe`: Returns a Spark DataFrame if the request was successful.
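A minimal sketch of the status-check logic described above. The names follow this README, but this is not the project's implementation: to stay dependency-free, the sttp `SimpleHttpClient` and the Spark DataFrame step are omitted, and `checkRequestStatusCode` is shown against a plain integer status code.

```scala
// Simplified sketch of the response-handling class (no sttp, no Spark),
// so the core check is runnable on its own.
class ResponseHandler(val requestEndpoint: String) {
  // The status code that marks a successful request.
  val successStatusCode: Int = 200

  // Raises an exception if the response status code is different from 200.
  def checkRequestStatusCode(statusCode: Int): Unit =
    if (statusCode != successStatusCode)
      throw new RuntimeException(
        s"Request to $requestEndpoint returned status $statusCode, expected $successStatusCode")
}

object ResponseHandlerDemo {
  def main(args: Array[String]): Unit = {
    val handler = new ResponseHandler("https://example.com/api") // placeholder endpoint
    handler.checkRequestStatusCode(200) // passes silently
    try handler.checkRequestStatusCode(404)
    catch { case e: RuntimeException => println(e.getMessage) }
  }
}
```

In the real class, `getResponse` would execute the request through `SimpleHttpClient` and `transformResponseToDataframe` would parse the successful body into a Spark DataFrame.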
- Upload `request_with_scala.dbc` or `request_with_scala.scala` to your Databricks Workspace;
- Install the packages listed in Cluster Configs;
- Open a PR with your improvements!
- Use the tables for a logistic regression model.
- Build a star schema with the current layers.