
Shortest URL

URL shortener API example using Go and Chi.

You enter a URL such as https://github.com/darioblanco and it returns a short URL such as http://urlshortener/64fc5e.

Dependencies

The following dependencies are required:

  • go
  • git

The following dependency is optional (but recommended for testing in a production-like scenario):

  • docker

The make init-deps command always checks whether any of them is missing, and it runs as part of every other Makefile command.

Configuration

A default configuration is set in cmd/config.default.yaml, but it can be overridden using environment variables. This already gives flexibility for any kind of deployment.

  • environment / SHORTESTURL_ENVIRONMENT (default: dev): the environment used to define different logging and cache strategies. It can be one of prod, stage, test and dev. If dev is provided, Redis is not needed as a dependency. The test environment is used in docker-compose.
  • httpHost / SHORTESTURL_HTTP_HOST (default: localhost): the http host for the server, swagger and encoded urls.
  • httpPort / SHORTESTURL_HTTP_PORT (default: 3000): the http port for the server, swagger and encoded urls.
  • httpScheme / SHORTESTURL_HTTP_SCHEME (default: http): the http scheme to use in the server, swagger and encoded urls.
  • redisHost / SHORTESTURL_REDIS_HOST (default: localhost): the redis host. Ignored in the dev environment.
  • redisPort / SHORTESTURL_REDIS_PORT (default: 6379): the redis port. Ignored in the dev environment.
  • urlLength / SHORTESTURL_URL_LENGTH (default: 6): the length of the shortened url slug. A bigger number reduces possible collisions (solving collisions requires extra computational effort).
  • urlExpirationInHours / SHORTESTURL_URL_EXPIRATION_IN_HOURS (default: 0): the maximum time in hours a shortened url will live in the system. A value of 0 keeps urls indefinitely.
  • version / SHORTESTURL_VERSION (default: unknown): the version of the released application. Useful for CI/CD pipelines.
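
For example, to override the HTTP port and expire urls after one day using the environment variables above:

SHORTESTURL_HTTP_PORT=8080 SHORTESTURL_URL_EXPIRATION_IN_HOURS=24 make run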

How to Run

Local

To run the application locally:

make run

To run with hot reload, so that any changes saved to your files are reflected in the running server:

make run-hmr

Docker

Alternatively, to run the application in docker:

  • Make sure you have run make init first (docker is needed as a dependency)
  • Type docker-compose up --build; both the redis and server containers will run in the foreground

See the docker-compose CLI for more ways to run the dockerized application.

Swagger

You can browse the swagger documentation at http://localhost:3000/docs/index.html.

Deploy

This application can be deployed with docker, by building the Dockerfile and running the image somewhere, or directly with the binary generated in tmp/shortesturl.

A reverse proxy that performs SSL termination in front of this application is strongly recommended. The application does not handle HTTPS and expects a proxy to take care of that.

Commit style

The commit messages follow the Conventional Commits style. Thanks to this, automated pipelines can create release notes based on developer commits and the type of change each commit declares.

I have an open source tool called release-wizard which does exactly this task.

Project Structure

The code is structured in three folders:

  • app: holds all application code
  • cmd: defines different entrypoints for the application (in case we would like to run it as a server, or as a script, etc...)
  • docs: has autogenerated Swagger code and documentation

In addition, there are a few relevant files:

  • .air.toml: configures HMR
  • .goverreport.yml: defines test coverage settings
  • docker-compose.yaml: sets up the server and redis docker containers and their network
  • Dockerfile: contains commands to assemble the server docker image
  • Makefile: sets useful rules for running the application, testing, etc...
  • shortesturl.http: allows testing the endpoints with VisualStudioCode/humao.rest-client. The application should be running first.

app folder

The app folder defines the app package, which exposes an abstraction that can be configurable in different entrypoints.

In addition, this folder defines a series of internal packages (which cannot be imported outside the app package scope):

  • cache: fast store abstraction with basic Get and SetIfNotExists commands. It uses redis under the hood. If the environment is dev, miniredis is loaded instead, eliminating the need to have redis as a project dependency (see the sketch below).
  • config: the configuration auto loader. It uses viper under the hood.
  • http: http abstraction that conforms to Go's http.Handler. It uses chi under the hood.
  • logging: logging abstraction that uses zap under the hood.
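
As a rough sketch of the shape the cache abstraction can take (only the Get and SetIfNotExists names come from this document; the context, TTL parameter and return types are assumptions, not the actual code):

    // Hypothetical shape of the cache abstraction; parameter and return
    // types are assumptions, only the two method names come from this README.
    package cache

    import (
        "context"
        "time"
    )

    type Cache interface {
        // Get returns the value stored under key.
        Get(ctx context.Context, key string) (string, error)
        // SetIfNotExists stores value under key only if the key is free
        // (mirroring redis SETNX) and reports whether the write happened.
        SetIfNotExists(ctx context.Context, key, value string, ttl time.Duration) (bool, error)
    }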

cmd folder

The cmd folder holds the default configuration stored as a YAML file. This configuration is used whenever environment variables do not override it. See Configuration for more details.

In addition, every subfolder defines a different entrypoint to the application, with a main.go file inside. At the moment, we only need a server entrypoint.

Tests

To run the tests (with their respective coverage):

make test

This project aims for a 98% test coverage of all files, with the exception of cmd/server/main.go and app/app.go. Those files are considered entry points of the application, thus they are tested with make run.

In addition, app/internal/cache/cache.go is not properly reflected in the coverage report due to Go limitations: the app/internal/http/api_test.go file technically exercises the optimistic lock and could eventually trigger its retry mechanism. Therefore, a threshold of 98% test coverage has been set instead of 100%. It would be possible to test the retry and error branches of the SetIfNotExists function, but that would require mocking the cache's underlying client instead of using miniredis. I decided not to spend extra time on that mock wrapper, as the benefit for this task is small and I consider the code production ready. However, if the project were to grow, I would strongly consider reaching 100% test coverage there with unit tests.

All files ending in _test.go perform unit tests, except api_test.go, which follows an integration test approach for the /encode and /decode endpoints.

Files ending in _mock.go are meant to expose mocks and stubs to other packages that might need them. Therefore, these mocks are defined within the package of the real implementation, even if that package does not use them itself. Mockery is an interesting tool to generate such mocks automatically, though this project does not use it, as the codebase is not complex enough to justify it. The mock files are automatically excluded from the coverage report.

Manual tests

You can manually check the endpoints using the shortesturl.http file and VisualStudioCode/humao.rest-client.

I consider strong automated testing the way to go in order to have a maintainable codebase (especially when developing collaboratively), but manual testing is also helpful during development and for demo purposes.

Benchmarks

In addition, I have created a few basic benchmarks that help analyze the speed of the encode and decode functions. These are set in app/internal/http/benchmarks_test.go.

These are not tests! Therefore, no assertions are made in this file. However, they are very helpful to analyze the impact of possible improvements to the algorithms.
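
As an illustration, a Go benchmark for the encode path can look roughly like this (the encode helper below is a hypothetical stand-in following the MD5 strategy described in the Algorithm section, not the code in benchmarks_test.go):

    // Hypothetical benchmark shape; the encode helper is a stand-in for
    // the real function.
    package shortesturl

    import (
        "crypto/md5"
        "encoding/hex"
        "testing"
    )

    // encode hashes the url with MD5 and keeps the first 6 hex characters,
    // mirroring the strategy described in this document.
    func encode(url string) string {
        sum := md5.Sum([]byte(url))
        return hex.EncodeToString(sum[:])[:6]
    }

    func BenchmarkEncode(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = encode("https://github.com/darioblanco")
        }
    }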

Algorithm

The encode/decode algorithm has no restrictions on how it should work, beyond the following requirements picked from the task README:

  1. The encoded url MUST be decodable into its original form
  2. The short urls MUST be kept in memory. This requirement actually reduces the overhead of the challenge, as a persistence solution is not required.

In order to match the requirements, we need a bijective function f (with inverse g) that, for a pairing between L (the set of long urls l) and S (the set of short urls s), holds the following properties:

  1. Each element of L must be paired with one (and only one) element of S: f(l1) = s1 and f(l1) != f(l2) if l1 != l2
  2. Each element of S must be paired with one (and only one) element of L: g(s1) = l1 and g(s1) != g(s2) if s1 != s2

Persistence

I have selected Redis for production for several reasons:

  • The data structure required is very simple, a key/value store is enough
  • It is easy to set keys to expire (e.g. if expiring urls is required in the future)
  • I am very familiar with this store
  • It has high availability options and can be configured to persist to disk if necessary while retaining the in-memory speed
  • It gives close to O(1) complexity for search, insert and delete at any scale*

*One could think that the O(n) worst case would be terrible if Redis held millions of entries, but Redis uses separate chaining hash tables, splitting the n items into k buckets, where k grows with the number of entries, giving O(1 + n/k). In practice Redis keeps n/k low, because it performs the rehashing in the background. Therefore, a solution like Redis could back our url shortener at scale.

However, the task explicitly states: "You do not need to persist short URLs to a database. Keep them in memory." I thought about creating a very simple hash table myself, which, in order to work at scale, would have to be split into smaller hash tables partitioned by keyspace. In addition, I strongly considered this other requirement: "Please organize, design, test and document your code as if it were going into production." I would not recommend going to production with a custom hash table, because there are many implications, such as race conditions, that would have to be solved: we would basically need to protect the data structure from other goroutines whenever we check whether a value is present and then insert it. Redis does this very easily with the SetNX operation, for instance.

In order to keep this task faithful to what has been asked, I have determined the following approach:

  • If the environment is dev, miniredis is used in the background, which is a simple, cheap, in-memory redis replacement. The urls are therefore kept in memory, and this greatly reduces the amount of code I need to write.
  • If the environment is not dev, the application will read the redis connection parameters and use an external instance. This can be tested also locally, but it requires docker.
  • I have developed a cache package that transparently uses miniredis or a real redis based on the given environment variable.

With this decision I hope to satisfy the requirement of keeping the urls in memory (without having a persistence dependency in development mode) while defining a codebase that can easily be deployed to production with a more resilient setup.
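
A rough sketch of how that environment switch can be wired, assuming the go-redis and miniredis libraries (the function and its signature are illustrative, not the actual code):

    // Hypothetical wiring: pick miniredis in dev, a real redis otherwise.
    package main

    import (
        "fmt"
        "log"

        "github.com/alicebob/miniredis/v2"
        "github.com/redis/go-redis/v9"
    )

    func newRedisClient(env, host, port string) (*redis.Client, error) {
        addr := fmt.Sprintf("%s:%s", host, port)
        if env == "dev" {
            // Spin up an in-process redis replacement; no external dependency.
            mr, err := miniredis.Run()
            if err != nil {
                return nil, err
            }
            addr = mr.Addr()
        }
        return redis.NewClient(&redis.Options{Addr: addr}), nil
    }

    func main() {
        client, err := newRedisClient("dev", "localhost", "6379")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("connected to", client.Options().Addr)
    }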

Thought process

First approach: auto incremented counter

My first approach to a URL shortener was to store the urls as they come and keep a counter, incremented for each url that is shortened. Technically, that would already be a URL shortener, because we could use the database id, like http://localhost:3000/200, but it makes it obvious that somebody could iterate through those urls and consume /decode to figure out the original long urls, which is not what we want.

Transforming the database id with base62 (for instance) would make the problem less apparent, but url iteration would still be possible if somebody decided to encode sequential ids in that format and use /decode.

Basically, any solution relying only on the auto-increment value set up in the database is a no-go for me, because it would be quite simple to browse urls shortened by others. I want a solution that makes such url iteration reasonably difficult, so that nobody bothers trying when there is so little to gain from it (in the end they would not even know which user a url belongs to).

The MD5 strategy

Therefore, I decided to use an MD5 hash, because it is fast (for a hash) and seems like a good fit for this use case, as it should be enough to discourage url iteration. I am also not worried about MD5 collisions: using the rule of thumb that the first collision appears after hashing 2^(n/2) items, and given that MD5 produces 128-bit strings, we could technically have 2^64 "collision-free" urls, and we will never store that many entries in our database, as I would set up an expiration to reduce the load long before.

However, we want a url shortener, not a url "longener". If I output the MD5 directly, a website like https://darioblanco.com would become http://localhost:3000/72b746fa3f83d4d932516cb27313c0eb, which is terrible. Therefore, I pick the first 6 digits of the MD5, creating something like http://localhost:3000/72b746. If we deploy this to production, for instance on the shy.to domain, we get https://shy.to/72b746, which saves 2 characters compared to a root domain url that is usually not one we would shorten anyway.

Picking the first 6 digits of an MD5 greatly increases the probability of collisions. MD5 yields hexadecimal digits ([0-9a-f]); each hex character is a nibble (4 bits), and the hash has 128 bits, so 128/4 gives 32 characters. If we use only 6 of them, we keep 24 bits, and by the same rule of thumb only 2^12 = 4096 "collision-free" urls, so collisions might happen often as soon as our service gets loaded with data!

In order to solve the collision problem, I use a simple in-memory key/value store (Redis), abstracted in the cache package. If the first 6 digits of the MD5 are already in the store for a different url, the encoder shifts along the MD5 and keeps trying until it finds a slug that can be stored. I am not worried about the complexity of looking up the MD5, because Redis claims close to O(1) complexity for that operation, and collisions can be greatly reduced by increasing the url length or by aggressively expiring the urls stored in the system.

In addition, the advantage of the "short MD5" approach is that we won't have duplicated urls in our database, a problem that arises with some auto-increment strategies. So I am fine committing to the trade-off of looking up an MD5 at least once for each url encode in an in-memory database, in order to protect against automatic url iteration, with no duplicates in the store as a side effect.
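
As an illustration of the shifting strategy, here is a minimal sketch with a plain map standing in for Redis (the names and the exact shifting step are assumptions; concurrency is deliberately ignored here and addressed in the next section):

    // Hypothetical sketch: hash the url, take the first urlLength hex
    // characters, and slide a window along the hash on collision.
    package main

    import (
        "crypto/md5"
        "encoding/hex"
        "errors"
        "fmt"
    )

    const urlLength = 6

    func candidateSlugs(url string) []string {
        sum := md5.Sum([]byte(url))
        hash := hex.EncodeToString(sum[:]) // 32 hex characters
        var slugs []string
        for i := 0; i+urlLength <= len(hash); i++ {
            slugs = append(slugs, hash[i:i+urlLength]) // shifted candidates
        }
        return slugs
    }

    // encode stores the first free candidate in store and returns it.
    func encode(store map[string]string, url string) (string, error) {
        for _, slug := range candidateSlugs(url) {
            if existing, ok := store[slug]; ok {
                if existing == url {
                    return slug, nil // already encoded: idempotent
                }
                continue // collision with a different url: shift
            }
            store[slug] = url
            return slug, nil
        }
        return "", errors.New("no free slug found")
    }

    func main() {
        store := map[string]string{}
        slug, err := encode(store, "https://github.com/darioblanco")
        if err != nil {
            panic(err)
        }
        fmt.Println(slug) // a 6-character hex slug
    }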

The url decode is then trivial: the encoded slug is stored as a key whose value is the original url, so the /decode endpoint also has close to O(1) complexity.

The race condition

However, when encoding URLs there could be a race condition if two operations happen over the same URL, because there is a check to verify whether the shortened md5 for the url was already generated. That one is simple to solve, because the set is idempotent in the worst case.

But what about a race condition between two completely different URLs that resolve to the same 6-digit shortened slug? This is where things get problematic, because the bijective requirement would no longer hold. When the system already has a slug for one URL and a second URL resolves to the same slug, by design I shift that slug (what I call a "collision"). Since a GET operation verifies whether the candidate slug holds the same url (or a different one), the SET operation that follows should only be performed if that condition has not changed in the meantime.

Initially, I put a mutex around the GET and SET operations, which fixed the problem, but it could impact performance under heavy load and it would not fix the problem in a multi-replica scenario.

Finally, I decided to implement a Redis transaction in conjunction with the WATCH command, following the optimistic locking approach: conflicts will be rare, it does not lock the system (pessimistic locking, like a mutex, would burden performance), and we expect a single retry to be enough to resolve a conflict.

The transaction is therefore aborted if the watched key changed (as the assumption made during GET no longer holds). With optimistic locking we simply retry the redis operation; on the second attempt the GET retrieves the already shortened url, and the race condition cannot happen again for that url. The maximum number of attempts is set to 5 (4 retries).
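
A minimal sketch of that WATCH-based optimistic lock, assuming the go-redis client (the function name, key layout and error handling are illustrative; the real code also shifts to the next slug candidate on a genuine collision instead of returning an error):

    // Hypothetical optimistic lock: WATCH the slug, GET it, and only SET
    // inside the transaction if the watched key did not change meanwhile.
    package cache

    import (
        "context"
        "errors"
        "fmt"

        "github.com/redis/go-redis/v9"
    )

    const maxAttempts = 5 // 1 try + 4 retries, as described above

    func setIfNotExists(ctx context.Context, rdb *redis.Client, slug, url string) error {
        txf := func(tx *redis.Tx) error {
            current, err := tx.Get(ctx, slug).Result()
            if err != nil && !errors.Is(err, redis.Nil) {
                return err
            }
            if err == nil && current != url {
                return fmt.Errorf("slug %q taken by another url", slug)
            }
            // The MULTI/EXEC block is aborted if the watched key changed.
            _, err = tx.TxPipelined(ctx, func(pipe redis.Pipeliner) error {
                pipe.Set(ctx, slug, url, 0)
                return nil
            })
            return err
        }
        for i := 0; i < maxAttempts; i++ {
            err := rdb.Watch(ctx, txf, slug)
            if !errors.Is(err, redis.TxFailedErr) {
                return err // nil on success, or a non-conflict error
            }
            // The watched key changed under us: retry.
        }
        return errors.New("too many conflicts")
    }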

Improvements

Possible improvements to this project:

  • A very useful endpoint would be GET /{shortUrlSlug}, which would automatically redirect to the long url. Then, when somebody opens https://{shortesturl-hostname}/64fc5e, they would automatically be redirected to https://github.com/darioblanco. I have not developed it because it was not in the scope of the task.
  • For a production deployment, it is recommended to have a reverse proxy in front that does SSL termination. To handle the load, I would use Kubernetes with horizontal pod autoscaling, which automatically creates replicas of this application based on CPU and memory usage. As we use Redis and its transaction tooling to solve race conditions, this code can already scale across replicas.
  • Instead of using Redis as the final persistence, it could be used as a cache, with an expiration for the keys that is refreshed on cache hits, or with an LRU approach. The final persistence could then be Cassandra or MongoDB, using the shortUrlSlug as a partition key (or sharding key) to distribute the storage and increase search performance. These databases would give us different tools to handle the above-mentioned race condition, even with multiple replicas. However, Redis can scale very well without introducing more complex persistence layers; it has high availability options and can also persist to disk, so I would think carefully before adopting a different database.