Progress report

Question

Progress report

Opened this issue 5 years ago · 0 comments

Goal

The user need for this project can be described as, "How, in a rapidly growing company of >100 people writing SQL and exploring data every day, can everyone in the company stay abreast of what data is useful, what is cruft, and where to find data relating to specific questions?"

Design and similar software

There is the very impressive https://github.com/lyft/amundsen that solves this very issue, but after spending some time in that application and running it myself locally I felt it was possible to deliver a better user experience with much simpler / fewer lines of code-- especially given the use case I'm interested in of companies typically using a single data lake in an MPP database like AWS Redshift. Lyft and companies of that ilk are using Hive/Hadoop and multiple datastores, so they have a slightly different problem.

The solution is essentially to make available a website where users can view and search useful metadata about the data warehouse (e.g. tables and columns), and add their own comments about what data is useful. Other features like having an "owner" for given tables, liking/favoriting, might also be useful.

What I think would be the killer app would be including metadata about queries run, so you can see a timeline of activity for a given table, which tables are commonly co-queried, etc. This will wait for a v2, but I note that Redshift makes available the stv_queries table which could be useful.

Architecture

I knew I was going to write the front-end in svelte because after writing a few apps with jquery, and then a few in React, I feel svelte delivers both an incredible developer experience and a great feeling UI. I recently discovered https://bulma.io/ and decided to give that a try as well.

I was inclined to write the back end in elixir which has been my favorite language lately to develop in, but because I'm doing this work with the engineering team at better.com who work almost 100% in python, I'm opting to do the API in python (flask) which has been a learning experience.

I opted for Elasticsearch as the only datastore. It delivers a really amazing and flexible search experience, which with suitable snapshots should be safe enough to store this largely uncritical data.

Progress

UI is largely built, with pages to see tables and the columns for a specific table. This part takes by far the longest and there will be many tweaks to come.
Back end is getting there. Flask is... weird, and minimal, but I found a setup that works and it's kind of nice not to worry about schemas and migrations. Just added an end-to-end working flow of viewing a list of tables, clicking on one to view the list of columns/relations of that table, and then editing a column description and saving it.
- Generalizing this to table descriptions as well as table and column tags will largely complete the v1 front-end experience.
Search is going to be the next big push-- the user will want to query for column names, table names, or tags, and rank those searches by heavy usage. We may also want to add a way to get an easy visual overview of the most used tables.