Small Elixir project that scrapes HTML and converts its tags into a map. It showcases Elixir features such as pattern matching, behaviours, doctests, and test mocks.
- Elixir 1.14.2
- Erlang/OTP 25
- Clone:
git clone https://github.com/icarooliv/scrapex.git
- Install the dependencies:
cd scrapex
mix deps.get
- Open the project with IEx:
iex -S mix
You can call the Scrapex.run/2 function, passing a valid URL and a list of tuples in the form {key_name, html_tag, html_attr}.
Successful:
iex> Scrapex.run("http://www.columbia.edu/~fdc/sample.html", [{"assets", "img", "src"}, {"links", "a", "href"}])
{:ok,
%{
"assets" => ["http://www.columbia.edu/~fdc/picture-of-something.jpg"],
"links" => ["http://www.columbia.edu/~fdc/",
"https://kermitproject.org/newdeal/",
"http://www.columbia.edu/cu/computinghistory",
"http://www.columbia.edu/~fdc/family/dcmall.html",
"http://www.columbia.edu/~fdc/family/hallshill.html",
...
]
}}
Invalid string:
iex> Scrapex.run("abc", [{"assets", "img", "src"}, {"links", "a", "href"}])
{:error, :invalid_format}
Page not found:
iex> Scrapex.run("https://elixirforum.com/t/page-do-not-exist", [{"assets", "img", "src"}, {"links", "a", "href"}])
{:error, 404}
Here I discuss some thoughts and decisions that I had during this challenge.
I wanted to create the simplest solution I could without missing opportunities to show my skills, even though it took more than the suggested time.
I also used TDD to guide me through the code design and its capabilities.
I used the Arrange Act Assert strategy to write my tests.
As José Valim explained here: "Mocks/stubs do not eliminate the need to define an explicit interface between your components." If I were to go with a non-contract approach, I would couple my Scrapex module with HTTPoison. However, I do not want to test HTTPoison; I want to test the contract between Scrapex and an HTTP client. Assuming that the project will evolve, it will be easier to keep the contract and change the implementation. Also, it'll not break tests that depend on this implementation.
In the test environment, I can define the HTTP.ClientMock as my client using Mox. This will rely on the Scrapex.HTTP.Client behaviour. In other environments, I could use either a new module defined at the environment (such as a Tesla implementation) or a default one, such as HTTPoison in this case.
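To make the contract idea above concrete, here is a minimal sketch of how such a behaviour and its runtime resolution could look. The module names mirror the ones mentioned in the text, but the `impl/0` helper and the config keys are assumptions, not necessarily the project's actual code.

```elixir
defmodule Scrapex.HTTP.Client do
  @moduledoc "Contract that any HTTP client adapter must satisfy."

  # Every adapter (HTTPoison, Tesla, a Mox mock, ...) implements this callback.
  @callback get(url :: String.t()) :: {:ok, String.t()} | {:error, term()}

  # Resolve the adapter at runtime so tests can swap in Scrapex.HTTP.ClientMock.
  # The :http_client config key and the default module are hypothetical.
  def impl do
    Application.get_env(:scrapex, :http_client, Scrapex.HTTP.HTTPoisonClient)
  end
end

# In config/test.exs (hypothetical):
#   config :scrapex, :http_client, Scrapex.HTTP.ClientMock
#
# In test/test_helper.exs, Mox derives the mock from the behaviour:
#   Mox.defmock(Scrapex.HTTP.ClientMock, for: Scrapex.HTTP.Client)
```

With this in place, Scrapex only ever calls `Scrapex.HTTP.Client.impl().get(url)`, so swapping HTTPoison for Tesla later touches one config line instead of every test.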
Because my time window for this task was small and the other modules are small enough to rely on integration tests. Obviously, it would be a good idea to unit test the HTTP behaviour. Since the Scrapex module has only one public function, the private ones are tested indirectly.
Because it's a function with side effects. See here.
- The URL should accept the formats `www.example.com`, `https://example.com`, and other variations. If the URL doesn't come with a scheme prefix such as `http` or `https`, `https://` must be prepended to it.
- The scrape function must merge partial URL strings with the base URL.
- The scrape function will not handle cases where a link points to an image and ends up added to `links` instead of `assets`.
- This project is "platform agnostic", meaning it should be built in a way that allows it to be ported to an installable library, used as an umbrella app, imported as a git submodule, copy-pasted inside a Phoenix project, etc.
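The first two URL rules above can be sketched as follows. The module and function names are illustrative, not Scrapex's actual internals; `URI.merge/2` from the standard library does the heavy lifting for partial URLs.

```elixir
defmodule UrlRules do
  # Leave URLs that already carry a scheme untouched; prepend "https://" otherwise.
  def normalize("http://" <> _ = url), do: url
  def normalize("https://" <> _ = url), do: url
  def normalize(url), do: "https://" <> url

  # Merge a possibly-relative link (e.g. from an href or src attribute)
  # against the page's base URL.
  def absolutize(base_url, link) do
    base_url |> URI.merge(link) |> URI.to_string()
  end
end

UrlRules.normalize("www.example.com")
# => "https://www.example.com"

UrlRules.absolutize("http://example.com/a/b.html", "img/pic.jpg")
# => "http://example.com/a/img/pic.jpg"
```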
- Define a better structure for the HTTP responses. Right now it's just a map, and if someone mistakenly changes its shape, the code will break without warnings.
- Do unit tests in the HTTP client behaviour.
- Cache the results using Cachex or Nebulex. The k/v store can look like `{url, scraping_result}`. The `Scrapex.run/2` function will accept an `opts` keyword list where the user can define the TTL and whether the data should be refreshed, like so: `[refresh: true, ttl: 1000]`.
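The caching idea above could be sketched like this. A plain Agent stands in for Cachex/Nebulex, TTL handling is omitted for brevity, and the `ScrapeCache` module is hypothetical; only the `refresh: true` behaviour from the proposed opts is shown.

```elixir
defmodule ScrapeCache do
  use Agent

  def start_link(_opts \\ []) do
    # The cache is a bare map of url => scraping_result.
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Returns the cached result for `url`, or runs `fun` and caches its result.
  # Passing `refresh: true` forces `fun` to run again even on a cache hit.
  def fetch(url, opts \\ [], fun) do
    cached = Agent.get(__MODULE__, &Map.get(&1, url))

    if cached != nil and not Keyword.get(opts, :refresh, false) do
      cached
    else
      result = fun.(url)
      Agent.update(__MODULE__, &Map.put(&1, url, result))
      result
    end
  end
end
```

`Scrapex.run/2` could then wrap its scraping pipeline in `ScrapeCache.fetch/3`, forwarding the user's opts; a real implementation would also honor the `ttl` option via Cachex's expiration support.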
- mix test.watch
- Dashbit's Bytepack project for URL validations
- Using Mox with behaviours
- Scraping data with Floki
- This discussion about HTTP clients. I usually choose Tesla but wanted to try HTTPoison.