We have a few projects on GitHub ...
We want to instantly see the popularity of each of our repos to know what people are finding useful and help us decide where we need to be investing our time.
While GitHub has a basic "traffic" tab which displays page view stats, it only keeps the data for the past 14 days before it is reset. The data is not relayed to the "owner" in "real time", so you would need to "poll" the API for it ... Manually checking who has viewed a project is exceptionally tedious when you have more than a handful of projects.
A simple & easy way to see how many people have viewed your GitHub Repository.
There are already many "badges" that people use in their repos.
See: github.com/dwyl/repo-badges
But we haven't seen one that gives a "hit counter" of the number of times a GitHub page has been viewed ...
So, in today's mini project we're going to create a basic Web Counter.
https://en.wikipedia.org/wiki/Web_counter
If you simply want to display a "hit count badge" in your project's GitHub page, visit: http://hits.dwyl.io to get the Markdown!
To run the code on your localhost in 3 easy steps:
Download (clone) the code to your local machine:
git clone https://github.com/dwyl/hits.git && cd hits
Note: you will need to have Node.js installed on your localhost.
Install dependencies:
npm install
Run locally:
npm run dev
Now open two web browser windows/tabs:
- first tab: http://localhost:8000/ (this is the hits "home page")
- second tab: http://localhost:8000/any/url/count.svg
In case anyone wants to know the thought process that went into building this...
The first question we asked ourselves was: What is the minimum possible amount of (useful/unique) info we can store per visit (to one of our projects)?
- date + time (timestamp) when the person visited the site/page. See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now
- url being visited, i.e. which project was viewed.
- user-agent: the browser/device (or "crawler") visiting the site/page. See: https://en.wikipedia.org/wiki/User_agent
- IP Address of the client (for checking uniqueness).
- Language of the person's web browser.

Note: While not "essential", we added Browser Language as the 5th piece of data (when it is set/sent by the browser/device) because it's insightful to know what language people are using so that we can determine if we should be translating/"localising" our content.
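As a rough illustration of collecting these five pieces of data, here is a minimal sketch for a Node.js request object. The `extractHit` function and the field names it returns are hypothetical (they are not taken from the project's code), and reading the client IP from `x-forwarded-for` assumes the app sits behind a proxy that sets that header.

```javascript
// Hypothetical helper: gathers the five pieces of data for one visit
// from a Node.js http.IncomingMessage (or any object of the same shape).
function extractHit (req) {
  const headers = req.headers || {};
  // Assumption: behind a proxy the client IP arrives in x-forwarded-for.
  const forwarded = headers['x-forwarded-for'];
  const ip = forwarded
    ? forwarded.split(',')[0].trim()
    : (req.socket && req.socket.remoteAddress) || '';
  // accept-language looks like "en-GB,en;q=0.9"; keep only the first tag.
  const language = (headers['accept-language'] || '').split(',')[0].trim();
  return {
    timestamp: Date.now(),                   // date + time of the visit
    url: req.url,                            // which project was viewed
    userAgent: headers['user-agent'] || '',  // browser/device/crawler
    ip: ip,                                  // for checking uniqueness
    language: language                       // empty string when not sent
  };
}

// Usage with a plain object standing in for a real request:
const hit = extractHit({
  url: '/dwyl/the-book',
  headers: {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
    'accept-language': 'en-GB,en;q=0.9',
    'x-forwarded-for': '84.91.136.21'
  }
});
console.log(hit);
```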
We initially considered using the "Common Log Format" (CLF) because it's well-known/understood. see: https://en.wikipedia.org/wiki/Common_Log_Format
An example log entry:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Real example:
84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247
The data makes sense when viewed as a table:
IP Address of Client | User Identifier | User ID | Date+Time of Request | Request "Verb" and URL of Request | HTTP Status Code | Size of Response |
---|---|---|---|---|---|---|
84.91.136.21 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 007 | [05/Aug/2017:16:50:51 -0000] | "GET github.com/dwyl/phase-two HTTP/1.0" | 200 | 42247 |
On further reflection, we think the "Common Log Format" is inefficient as it contains a lot of duplicate and some useless data.
We can do better.
From the CLF we can remove:
- IP Address, User Identifier and User ID can be condensed into a single hash (see below).
- "GET" - the word is implied by the service we are running (we only accept GET requests)
- Response size is irrelevant and will be the same for most requests.
Timestamp | URL | User Agent | IP Address | Language | Hit Count |
---|---|---|---|---|---|
1436570536950 | github.com/dwyl/the-book | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 84.91.136.21 | EN-GB | 42 |
In the log entry (example) described above the first 3 pieces of data will identify the "user" requesting the page/resource, so rather than duplicating the data in an inefficient string, we can hash it!
Any repeating user-identifying data should be concatenated and hashed.
Log entries are stored as a ("pipe" delimited) String
which can be parsed and re-formatted into any other format:
1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
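To show how a pipe-delimited entry like the one above can be parsed back into a structured object, here is a small sketch. The `parseLogEntry` function and its field names are hypothetical, and it assumes none of the fields contain a `|` character themselves.

```javascript
// Hypothetical parser for the six-field pipe-delimited log entry.
// Assumption: no field contains a "|" character.
function parseLogEntry (line) {
  const parts = line.split('|');
  if (parts.length !== 6) {
    throw new Error('Expected 6 fields but found ' + parts.length);
  }
  return {
    timestamp: Number(parts[0]), // milliseconds since the Unix epoch
    url: parts[1],
    userAgent: parts[2],
    ip: parts[3],
    language: parts[4],
    count: Number(parts[5])
  };
}

const entry = parseLogEntry(
  '1436570536950|github.com/dwyl/phase-two|' +
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42'
);
console.log(JSON.stringify(entry, null, 2));
```

Once parsed, the object can be re-serialised to JSON, CSV or any other format.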
If a person views multiple pages, three pieces of data are duplicated: User Agent, IP Address and Language for each request/log. Rather than storing this data multiple times, we hash the data and store the hash as a lookup.
If we run the following Browser|IP|Language String:
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'
through a SHA hash function we get: 8HKg3NB5Cf (always) [1].
Sample code:
var hash = require('./lib/hash.js');
var user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
var agent_hash = hash(user_agent_string, 10); // 8HKg3NB5Cf
[1] Note: SHA hash is always 40 characters, but we truncate it because 10 alphanumeric characters (selected from a set of 26 letters + 10 digits) means there are 36^10 = 3,656,158,440,062,976 (over three and a half quadrillion) possible strings which we consider "enough" entropy. (if you disagree, tell us why in an issue!)
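For readers curious how a truncated hash of this kind could be produced, here is a minimal sketch using Node's built-in `crypto` module. This is not the project's `lib/hash.js`: the `shortHash` name, the choice of SHA-512 and the base64 encoding are all assumptions made for illustration, so it will not reproduce the `8HKg3NB5Cf` value above. Note also that base64 yields upper-case and lower-case letters, so the alphabet in this sketch has 62 characters rather than 36.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for a truncating hash helper (assumptions:
// SHA-512, base64 output, non-alphanumeric characters removed).
function shortHash (str, length) {
  return crypto
    .createHash('sha512')
    .update(str)
    .digest('base64')
    .replace(/[^A-Za-z0-9]/g, '') // drop "+", "/" and "=" padding
    .slice(0, length);
}

const agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
// The same input always gives the same 10-character string.
console.log(shortHash(agent, 10) === shortHash(agent, 10)); // true
```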
1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42
We're sure you will agree this is considerably more compact.
Note: our log also strips the github.com/ from the url so it's:
1436570536950|dwyl/the-book|8HKg3NB5Cf|42
Which is a considerable saving on "CLF" (see above)
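Putting the pieces together, a compact entry could be assembled along these lines. The `formatEntry` function is a hypothetical illustration rather than the project's own code, and it assumes the hash for the visitor has already been computed.

```javascript
// Hypothetical formatter for the compact log entry:
// timestamp|owner/repo|hash|count
function formatEntry (timestamp, url, agentHash, count) {
  // Strip an optional protocol and the leading "github.com/".
  const path = url.replace(/^(https?:\/\/)?github\.com\//, '');
  return [timestamp, path, agentHash, count].join('|');
}

console.log(formatEntry(1436570536950, 'github.com/dwyl/the-book', '8HKg3NB5Cf', 42));
// prints: 1436570536950|dwyl/the-book|8HKg3NB5Cf|42
```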
We aren't using a "Database", rather we are using the filesystem.
For implementation see: /lib/db_filesystem.js
Yes, we know Heroku does not give access to the Filesystem... If you want to run this on Heroku see: dwyl#54
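To give a feel for what filesystem-backed storage can look like, here is a minimal sketch that appends one line per hit to a per-URL log file and derives the count from the number of lines. It is not the contents of /lib/db_filesystem.js: the `recordHit` function, the file naming scheme and the use of a temporary directory are all assumptions for illustration, and the synchronous calls are used only to keep the example short.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical: append a compact entry to <dir>/<encoded url>.log
// and return the new hit count for that url.
function recordHit (dir, url, agentHash) {
  const file = path.join(dir, encodeURIComponent(url) + '.log');
  // The previous count is the number of non-empty lines already stored.
  const previous = fs.existsSync(file)
    ? fs.readFileSync(file, 'utf8').split('\n').filter(Boolean).length
    : 0;
  const count = previous + 1;
  const line = [Date.now(), url, agentHash, count].join('|') + '\n';
  fs.appendFileSync(file, line);
  return count;
}

// Usage: store hits in a fresh temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'hits-'));
console.log(recordHit(dir, 'dwyl/the-book', '8HKg3NB5Cf')); // 1
console.log(recordHit(dir, 'dwyl/the-book', '8HKg3NB5Cf')); // 2
```

Re-reading the file on every hit is fine for a sketch but would get slow for busy pages; caching the count in memory is one possible refinement.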
How many user agents (web browsers + crawlers) are there? There appear to be fewer than a couple of thousand (see: http://www.useragentstring.com/pages/useragentstring.php), which means we could store them using a numeric index: 1 - 3000.
But storing the user agents using a numeric index means we need to perform a lookup on each hit, which requires network IO ... (expensive!)
What if there was a way of deriving a String representation of the user-agent string ... oh, that's right, here's one I made earlier...
https://github.com/dwyl/aguid
- Apache Log Sample: http://www.monitorware.com/en/logsamples/apache.php (looked at the existing log formats, all were too verbose/wasteful for us!)
https://nodejs.org/api/http.html#http_message_rawheaders
The test suite includes tests for 3 databases, therefore running the tests on your localhost requires all 3 to be running.
Deploying and using the app only requires one of the databases to be available.