hits

Why?

We have a few projects on GitHub ...

We want to instantly see the popularity of each of our repos to know what people are finding useful and help us decide where we need to be investing our time.

While GitHub has a basic "traffic" tab which displays page view stats, it only keeps the data for the past 14 days, after which it is reset. The data is not relayed to the "owner" in "real time"; you would need to use the API and "poll" for it. Manually checking who has viewed a project is exceptionally tedious when you have more than a handful of projects.

What?

A simple & easy way to see how many people have viewed your GitHub Repository.

There are already many "badges" that people use in their repos. See: github.com/dwyl/repo-badges
But we haven't seen one that gives a "hit counter" of the number of times a GitHub page has been viewed ...
So, in today's mini project we're going to create a basic Web Counter.

https://en.wikipedia.org/wiki/Web_counter

How?

If you simply want to display a "hit count badge" in your project's GitHub page, visit: http://hits.dwyl.io to get the Markdown!

Want to Run it Yourself?!

To run the code on your localhost in 3 easy steps:

1. Download the Code:

Download (clone) the code to your local machine:

git clone https://github.com/dwyl/hits.git && cd hits

Note: you will need to have Node.js installed on your local machine.

2. Install the Dependencies

Install dependencies:

npm install

3. Run the Server

Run locally:

npm run dev

Now open two web browser windows/tabs:

first tab: http://localhost:8000/ (this is the hits "home page")
second tab: http://localhost:8000/any/url/count.svg
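
For readers curious what a request to a URL ending in count.svg involves, here is a minimal sketch of a hit counter that answers with an SVG badge. It is an illustration only, not the project's actual server code: the names hit, renderBadge and createServer, the badge markup and the in-memory Map are all assumptions made for this example.

```javascript
const http = require('http');

// In-memory counts keyed by URL (assumption: the real app persists its data).
const counts = new Map();

// Increment and return the hit count for a URL.
function hit(url) {
  const next = (counts.get(url) || 0) + 1;
  counts.set(url, next);
  return next;
}

// Render a very small SVG badge showing the count (hypothetical markup).
function renderBadge(count) {
  return '<svg xmlns="http://www.w3.org/2000/svg" width="80" height="20">' +
    '<rect width="80" height="20" fill="#555"/>' +
    '<text x="40" y="14" fill="#fff" font-family="sans-serif" ' +
    'font-size="11" text-anchor="middle">hits ' + count + '</text>' +
    '</svg>';
}

// Build (but do not start) an HTTP server that counts GET requests for *.svg.
function createServer() {
  return http.createServer((req, res) => {
    if (req.method !== 'GET' || !req.url.endsWith('.svg')) {
      res.writeHead(404);
      return res.end();
    }
    // Ask clients not to cache, so that every view triggers a request.
    res.writeHead(200, {
      'Content-Type': 'image/svg+xml',
      'Cache-Control': 'no-cache'
    });
    res.end(renderBadge(hit(req.url)));
  });
}

// To try it: createServer().listen(8000);
```

Reloading the same .svg URL increments its count; a different URL starts its own count at 1.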

Implementation Detail

In case anyone wants to know the thought process that went into building this...

What Data to Capture/Store?

The first question we asked ourselves was: What is the minimum possible amount of (useful/unique) info we can store per visit (to one of our projects)?

date + time (timestamp) when the person visited the site/page. See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now
url being visited. i.e. which project was viewed.
user-agent the browser/device (or "crawler") visiting the site/page https://en.wikipedia.org/wiki/User_agent
IP Address of the client. (for checking uniqueness)
Language of the person's web browser. Note: While not "essential", we added Browser Language as the 5th piece of data (when it is set/sent by the browser/device) because it's insightful to know what language people are using so that we can determine if we should be translating/"localising" our content.
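
As a rough illustration of the five pieces of data listed above, the sketch below collects them from a Node.js http request object. The helper name extractHitData and the returned field names are hypothetical (they are not taken from the project's code), and it assumes the app is not behind a proxy, so the socket address is the client's address.

```javascript
// Hypothetical helper: gathers the five pieces of data from a Node.js request.
function extractHitData(req) {
  const headers = req.headers || {}; // Node.js lower-cases header names
  return {
    timestamp: Date.now(),
    url: req.url,
    userAgent: headers['user-agent'] || '',
    // Assumption: no proxy in front of the app, so this is the client IP.
    ip: (req.socket && req.socket.remoteAddress) || '',
    // Accept-Language may list several languages; keep only the first tag.
    language: (headers['accept-language'] || '').split(',')[0].trim()
  };
}

// Usage with a hand-built object shaped like an incoming request:
const exampleHit = extractHitData({
  url: '/dwyl/the-book.svg',
  headers: {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
    'accept-language': 'en-GB,en;q=0.9'
  },
  socket: { remoteAddress: '84.91.136.21' }
});
```

The language field is empty when the browser/device does not send the header, which matches the "when it is set/sent" caveat above.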

"Common Log Format" (CLF) ?

We initially considered using the "Common Log Format" (CLF) because it's well-known/understood. see: https://en.wikipedia.org/wiki/Common_Log_Format

An example log entry:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Real example:

84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247

The data makes sense when viewed as a table:

IP Address of Client	User Identifier	User ID	Date+Time of Request	Request "Verb" and URL of Request	HTTP Status Code	Size of Response
84.91.136.21	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)	007	[05/Aug/2017:16:50:51 -0000]	"GET github.com/dwyl/phase-two HTTP/1.0"	200	42247
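
To make the seven CLF fields concrete, here is a minimal sketch that splits the standard example entry into named parts. The function name parseClf and the field names are our own choices for illustration. Note that it assumes the first three fields contain no spaces, so it would not parse the "real example" above, where the user agent string itself contains spaces.

```javascript
// Hypothetical parser for a single Common Log Format line.
// Assumes the first three fields contain no spaces.
const CLF_PATTERN = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$/;

function parseClf(line) {
  const match = CLF_PATTERN.exec(line);
  if (!match) return null; // not a well-formed CLF line
  return {
    ip: match[1],
    identity: match[2],
    userId: match[3],
    time: match[4],
    request: match[5],
    status: Number(match[6]),
    // CLF uses "-" when no response size is available.
    size: match[7] === '-' ? 0 : Number(match[7])
  };
}

const clfEntry = parseClf(
  '127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] ' +
  '"GET /apache_pb.gif HTTP/1.0" 200 2326'
);
```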

On further reflection, we think the "Common Log Format" is inefficient as it contains a lot of duplicate and some useless data.

We can do better.

Alternative Log Format ("ALF")

From the CLF we can remove or condense:

IP Address, User Identifier and User ID: these can be condensed into a single hash (see below).
"GET": the word is implied by the service we are running (we only accept GET requests).
Response size: it is irrelevant and will be the same for most requests.

Timestamp	URL	User Agent	IP Address	Language	Hit Count
1436570536950	github.com/dwyl/the-book	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)	84.91.136.21	EN-GB	42

In the log entry (example) described above, the User Agent, IP Address and Language together identify the "user" requesting the page/resource, so rather than duplicating the data in an inefficient string, we can hash it!

Any repeating user-identifying data should be concatenated before it is hashed.

Log entries are stored as a ("pipe" delimited) String which can be parsed and re-formatted into any other format:

1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
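
A minimal sketch of writing and reading such a pipe-delimited entry follows. The function and field names are hypothetical, and the sketch assumes no field contains a "|" character itself (a user agent containing one would need escaping).

```javascript
// Field order of the pipe-delimited entry shown above (names are our own).
const FIELDS = ['timestamp', 'url', 'userAgent', 'ip', 'language', 'count'];

// Join the fields of a hit object into a single log line.
function formatEntry(hitData) {
  return FIELDS.map((name) => String(hitData[name])).join('|');
}

// Split a log line back into an object, restoring the numeric fields.
function parseEntry(line) {
  const parts = line.split('|');
  const entry = {};
  FIELDS.forEach((name, i) => { entry[name] = parts[i]; });
  entry.timestamp = Number(entry.timestamp);
  entry.count = Number(entry.count);
  return entry;
}

const line = '1436570536950|github.com/dwyl/phase-two|' +
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42';
const parsed = parseEntry(line);
```

Because the format is just a delimited String, parseEntry(line) can feed any other representation (JSON, CSV, a table) without changing what is stored.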

Reducing Storage (Costs)

If a person views multiple pages, three pieces of data are duplicated: User Agent, IP Address and Language for each request/log. Rather than storing this data multiple times, we hash the data and store the hash as a lookup.

Hash Long Repeating (Identical) Data

If we run the following Browser|IP|Language String:

'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'

through a SHA hash function we get: 8HKg3NB5Cf (always)¹.

Sample code:

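var hash = require('./lib/hash.js'); // the project's hashing helper
var user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'; // same Browser|IP|Language String as above
var agent_hash = hash(user_agent_string, 10); // 8HKg3NB5Cf (truncated to 10 characters)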

¹Note: a SHA hash is always 40 characters, but we truncate it because 10 alphanumeric characters (selected from a set of 26 letters + 10 digits) give 36¹⁰ = 3,656,158,440,062,976 (roughly three and a half quadrillion) possible strings, which we consider "enough" entropy. (If you disagree, tell us why in an issue!)
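
The contents of lib/hash.js are not shown here, so the following is only a sketch of the general idea using Node's built-in crypto module: hash the string, keep alphanumeric characters only, and truncate. The function name shortHash, the choice of SHA-512 and the base64 encoding are assumptions, so it should not be expected to reproduce the exact value 8HKg3NB5Cf. The last line checks the 36¹⁰ arithmetic from the note.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for lib/hash.js (algorithm and encoding are assumptions).
function shortHash(input, length) {
  return crypto.createHash('sha512')
    .update(input)
    .digest('base64')
    .replace(/[^a-zA-Z0-9]/g, '') // drop "+", "/" and "=" padding
    .slice(0, length);            // truncate to the requested length
}

const sample = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US';
const agentHash = shortHash(sample, 10);

// The same input always yields the same hash, which is what makes it a usable lookup key.
const sameAgain = shortHash(sample, 10) === agentHash;

// Worked check of the arithmetic in the note: 36 possible characters, 10 positions.
const combinations = 36n ** 10n; // 3656158440062976n
```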

Hit Data With Hash

1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42

We're sure you will agree this is considerably more compact.

Note: our log also strips the github.com/ from the url so it's:

1436570536950|dwyl/the-book|8HKg3NB5Cf|42

This is a considerable saving compared to "CLF" (see above).
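
Putting the pieces together, the sketch below builds the compact entry and keeps the long Browser|IP|Language String only once, keyed by its hash. All names (agents, shortHash, compactEntry) are hypothetical and the hashing details are assumptions, not the project's actual code.

```javascript
const crypto = require('crypto');

// Lookup table: hash -> 'UserAgent|IP|Language' (stored once per visitor).
const agents = new Map();

// Assumed hashing scheme, for illustration only.
function shortHash(input, length) {
  return crypto.createHash('sha512').update(input).digest('base64')
    .replace(/[^a-zA-Z0-9]/g, '').slice(0, length);
}

// Build the compact log line and record the agent string on first sight.
function compactEntry(hitData) {
  const agentString = [hitData.userAgent, hitData.ip, hitData.language].join('|');
  const agentHash = shortHash(agentString, 10);
  if (!agents.has(agentHash)) agents.set(agentHash, agentString);
  const url = hitData.url.replace(/^github\.com\//, ''); // strip the common prefix
  return [hitData.timestamp, url, agentHash, hitData.count].join('|');
}

const visitor = {
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
  ip: '84.91.136.21',
  language: 'EN-US'
};
const first = compactEntry(Object.assign(
  { timestamp: 1436570536950, url: 'github.com/dwyl/the-book', count: 42 }, visitor));
const second = compactEntry(Object.assign(
  { timestamp: 1436570536999, url: 'github.com/dwyl/phase-two', count: 7 }, visitor));
```

Two page views by the same visitor produce two short log lines but only one entry in the lookup table.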

Data Storage

We aren't using a "Database", rather we are using the filesystem.

Filesystem

For implementation see: /lib/db_filesystem.js

Yes, we know Heroku does not give access to the Filesystem... If you want to run this on Heroku see: dwyl#54
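
For a feel of what filesystem-based storage can look like, here is a minimal sketch that appends each hit as one line to a log file and counts the lines for a given URL. It is not the contents of /lib/db_filesystem.js; the function names, the single-file layout and the temporary directory are assumptions made for this example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Append one pipe-delimited hit to the log file (created on first write).
function appendHit(logFile, line) {
  fs.appendFileSync(logFile, line + '\n');
}

// Count the logged hits whose second field (the URL) matches.
function countHits(logFile, url) {
  if (!fs.existsSync(logFile)) return 0;
  return fs.readFileSync(logFile, 'utf8')
    .split('\n')
    .filter((line) => line.split('|')[1] === url)
    .length;
}

// Usage: write to a throwaway directory so the example leaves no project files behind.
const logDir = fs.mkdtempSync(path.join(os.tmpdir(), 'hits-'));
const logFile = path.join(logDir, 'hits.log');
appendHit(logFile, '1436570536950|dwyl/the-book|8HKg3NB5Cf|1');
appendHit(logFile, '1436570536999|dwyl/the-book|8HKg3NB5Cf|2');
appendHit(logFile, '1436570537050|dwyl/phase-two|8HKg3NB5Cf|1');
const bookHits = countHits(logFile, 'dwyl/the-book');
```

Re-reading the whole file on every count is fine for a sketch but would get slow as the log grows; a real implementation would likely keep a running total per URL.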

Research

User Agents

How many user agents (web browsers + crawlers) are there? There appear to be fewer than a couple of thousand (see: http://www.useragentstring.com/pages/useragentstring.php), which means we could store them using a numeric index: 1 - 3000.

But storing the user agents using a numeric index means we need to perform a lookup on each hit, which requires network IO (expensive!). What if there was a way of deriving a short String representation of the user-agent string? Oh, that's right, here's one I made earlier: https://github.com/dwyl/aguid

Log Formats

Apache Log Sample: http://www.monitorware.com/en/logsamples/apache.php (we looked at the existing log formats; all were too verbose/wasteful for us!)

Node.js http module headers

https://nodejs.org/api/http.html#http_message_rawheaders

Running the Test Suite locally

The test suite includes tests for 3 databases; therefore, running the tests on your localhost requires all 3 to be running.

Deploying and using the app only requires one of the databases to be available.
