npm
This is a failure-resilient replication process from the npm registry to an Algolia index. It replicates all npm packages to an Algolia index and keeps it up to date.
The state of the replication is saved in Algolia index settings.
The replication should always be running, and only one instance may run per Algolia index at any given time. If the process fails, restart it and replication will resume from the last saved state.
- Algolia Index
- Usage
- Env variables
- How does it work?
- Tests
- Deploying new version
- Forcing a complete re-index
For every npm package, we create a record in the Algolia index. The resulting records have the following schema:
{
  "name": "babel-core",
  "concatenatedName": "babelcore",
  "downloadsLast30Days": 10978749,
  "downloadsRatio": 0.08310651682685861,
  "humanDownloadsLast30Days": "11m",
  "jsDelivrHits": 11684192,
  "popular": true,
  "version": "6.26.0",
  "versions": {
    // [...]
    "7.0.0-beta.3": "2017-10-15T13:12:35.166Z"
  },
  "tags": {
    "latest": "6.26.0",
    "old": "5.8.38",
    "next": "7.0.0-beta.3"
  },
  "description": "Babel compiler core.",
  "dependencies": {
    "babel-code-frame": "^6.26.0"
    // [...]
  },
  "devDependencies": {
    "babel-helper-fixtures": "^6.26.0"
    // [...]
  },
  "repository": {
    "url": "https://github.com/babel/babel/tree/master/packages/babel-core",
    "host": "github.com",
    "user": "babel",
    "project": "babel",
    "path": "/tree/master/packages/babel-core",
    "branch": "master"
  },
  "readme":
    "# babel-core\n\n> Babel compiler core.\n\n\n [... truncated at 200kb]",
  "owner": {
    // either GitHub owner or npm owner
    "name": "babel",
    "avatar": "https://github.com/babel.png",
    "link": "https://github.com/babel"
  },
  "deprecated": false,
  "badPackage": false,
  "homepage": "https://babeljs.io/",
  "license": "MIT",
  "keywords": [
    "6to5",
    "babel",
    "classes",
    "const",
    "es6",
    "harmony",
    "let",
    "modules",
    "transpile",
    "transpiler",
    "var",
    "babel-core",
    "compiler"
  ],
  "created": 1424009748555,
  "modified": 1508833762239,
  "lastPublisher": {
    "name": "hzoo",
    "email": "hi@henryzoo.com",
    "avatar": "https://gravatar.com/avatar/851fb4fa7ca479bce1ae0cdf80d6e042",
    "link": "https://www.npmjs.com/~hzoo"
  },
  "owners": [
    {
      "email": "me@thejameskyle.com",
      "name": "thejameskyle",
      "avatar": "https://gravatar.com/avatar/8a00efb48d632ae449794c094f7d5c38",
      "link": "https://www.npmjs.com/~thejameskyle"
    }
    // [...]
  ],
  "lastCrawl": "2017-10-24T08:29:24.672Z",
  "dependents": 3321,
  "humanDependents": "3.3k",
  "changelogFilename": null, // if babel-core had a changelog, it would be the raw GitHub url here
  "objectID": "babel-core",
  "_searchInternal": {
    "popularName": "babel-core",
    "downloadsMagnitude": 8,
    "jsDelivrPopularity": 5
  }
}
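Several fields in this record are derived from others. The sketch below shows one way such values could be computed. The helper names and the rounding rules are assumptions inferred from the example record above, not necessarily what the project does.

```javascript
// Hypothetical helpers: names and rules are inferred from the example record.

// "babel-core" -> "babelcore": drop everything that is not a letter or digit.
function concatenateName(name) {
  return name.replace(/[^a-zA-Z0-9]/g, '');
}

// 10978749 -> "11m", 3321 -> "3.3k": at most one decimal, trailing ".0" dropped.
function humanize(n) {
  const units = [
    [1e9, 'b'],
    [1e6, 'm'],
    [1e3, 'k'],
  ];
  for (const [size, suffix] of units) {
    if (n >= size) {
      return (n / size).toFixed(1).replace(/\.0$/, '') + suffix;
    }
  }
  return String(n);
}

// 10978749 -> 8: assumed to be the number of digits of the download count.
function downloadsMagnitude(downloads) {
  return String(Math.trunc(downloads)).length;
}

console.log(concatenateName('babel-core'), humanize(10978749), humanize(3321));
```

With the values from the record above, these helpers reproduce "babelcore", "11m", "3.3k" and a magnitude of 8.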
If you want to learn more about how Algolia's ranking algorithm works, you can read this blog post.
We restrict the search to a subset of the attributes:
- _searchInternal.popularName
- name
- description
- keywords
- owner.name
- owners.name
Algolia provides prefix search by default (matching a word from only its beginning). This is disabled for the owner.name and owners.name attributes.
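As a rough illustration, the two rules above map onto Algolia's searchableAttributes and disablePrefixOnAttributes index settings. The object below is only a sketch: the attribute order mirrors the list above, and whether the project adds modifiers or other settings is not stated here.

```javascript
// Sketch of index settings. The setting names are Algolia's; the exact
// values used by this project are an assumption based on the text above.
const indexSettings = {
  // Only these attributes are searched.
  searchableAttributes: [
    '_searchInternal.popularName',
    'name',
    'description',
    'keywords',
    'owner.name',
    'owners.name',
  ],
  // Owner names must match as whole words, not as prefixes.
  disablePrefixOnAttributes: ['owner.name', 'owners.name'],
};

console.log(JSON.stringify(indexSettings, null, 2));
```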
Algolia provides default typo-tolerance.
Using Algolia's optionalFacetFilters feature, we boost exact matches on the package name so that they always appear at the top of the results.
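A minimal sketch of how a query could carry such a boost. optionalFacetFilters is a real Algolia search parameter, but buildSearchParams is a hypothetical helper, and filtering on the name attribute is an assumption: the attribute actually used must be declared in attributesForFaceting.

```javascript
// Hypothetical helper: builds search parameters that favor records whose
// faceted attribute equals the query exactly. Records that do not match the
// optional filter are still returned, only ranked lower.
function buildSearchParams(query) {
  return {
    query,
    optionalFacetFilters: [`name:${query}`], // assumed facet attribute
  };
}

console.log(buildSearchParams('babel-core'));
```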
For each package, we use the number of downloads in the last 30 days as Algolia's customRanking setting. This is used to sort results that have the same textual relevance against each other.
For instance, a search for babel will match both babel-core and babel-messages. From a textual-relevance point of view, those 2 packages match in exactly the same way. In such a case, Algolia relies on the customRanking setting and therefore puts the package with the highest number of downloads in the past 30 days first.
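The tie-breaking behaviour can be pictured with a small simulation. This is not Algolia's engine: textualScore is a made-up stand-in for its textual criteria, and the download count for babel-messages is a hypothetical value.

```javascript
// Simulated hits for the query "babel". Both match equally well on text.
const hits = [
  { name: 'babel-messages', textualScore: 1, downloadsLast30Days: 5000000 }, // hypothetical count
  { name: 'babel-core', textualScore: 1, downloadsLast30Days: 10978749 },
];

// Sort by textual relevance first, then fall back to downloads (descending),
// which is the role customRanking plays for otherwise equal results.
function rank(results) {
  return [...results].sort(
    (a, b) =>
      b.textualScore - a.textualScore ||
      b.downloadsLast30Days - a.downloadsLast30Days
  );
}

console.log(rank(hits).map((hit) => hit.name));
```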
Some packages are considered popular if they have been downloaded "more" than others. We currently consider packages with more than 0.005% of the total number of downloads on the whole registry to be popular. This popular flag is also used to boost some records over non-popular ones.
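The popularity rule can be written down as a short sketch. It assumes that downloadsRatio is expressed as a percentage of all registry downloads, which is consistent with the example record but not confirmed here. The registry total is a hypothetical figure and the function names are illustrative.

```javascript
// Threshold taken from the text above, in percent of total registry downloads.
const POPULAR_THRESHOLD_PERCENT = 0.005;

// Share of registry-wide downloads for one package, as a percentage.
function downloadsRatio(downloads, totalDownloads) {
  return (downloads / totalDownloads) * 100;
}

function isPopular(downloads, totalDownloads, threshold = POPULAR_THRESHOLD_PERCENT) {
  return downloadsRatio(downloads, totalDownloads) > threshold;
}

const totalDownloads = 13e9; // hypothetical registry-wide total for 30 days
console.log(isPopular(10978749, totalDownloads)); // babel-core's download count
console.log(isPopular(1000, totalDownloads));
```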
yarn
apiKey=... yarn start
To restart from a particular point (or from the beginning):
seq=0 apiKey=... yarn start
This is useful when you want to completely resync the npm registry because:
- you changed the way you format packages
- you added more metadata (like GitHub stars)
- you are in an unsure state and you just want to restart everything
seq represents a change sequence in CouchDB lingo.
Since the code is written in ES6 for Node.js, we compile it to ES5 during the install process. To avoid having to rebuild while developing, use:
seq=0 apiKey=... yarn dev
When developing, be careful to use a different index than the production one.
See config.js:
- apiKey: Algolia apiKey - required
- appId: Algolia appId - default OFCNCOG2CU
- indexName: Algolia indexName - default npm-search
- bootstrapConcurrency: How many docs to grab from npm registry at once in the bootstrap phase - default 100
- replicateConcurrency: How many changes to grab from npm registry at once in the replicate phase - default 10
- seq: npm registry first change sequence to start replication. In normal operations you should never have to use this. - default 0
- npmRegistryEndpoint: npm registry endpoint to replicate from - default https://replicate.npmjs.com/registry. This should be the only valid endpoint to replicate (even if a bit slow), see this comment.
- npmDownloadsEndpoint: Where to look for the last 30 days download of packages - default https://api.npmjs.org/downloads
- popularDownloadsRatio: % of total npm downloads needed for a package to be considered popular - default 0.2. This is a bit lower than the jQuery download range.
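For illustration, here is a sketch of how such variables could be read with their defaults. It is not the project's config.js: loadConfig is a hypothetical function, and only the variable names and default values come from the list above.

```javascript
// Defaults as documented above. apiKey has no default because it is required.
const defaults = {
  appId: 'OFCNCOG2CU',
  indexName: 'npm-search',
  bootstrapConcurrency: 100,
  replicateConcurrency: 10,
  seq: 0,
  npmRegistryEndpoint: 'https://replicate.npmjs.com/registry',
  npmDownloadsEndpoint: 'https://api.npmjs.org/downloads',
  popularDownloadsRatio: 0.2,
};

// Hypothetical loader: environment values are always strings, so numeric
// settings are converted based on the type of their default.
function loadConfig(env = process.env) {
  if (!env.apiKey) {
    throw new Error('apiKey is required');
  }
  const config = { apiKey: env.apiKey };
  for (const [key, fallback] of Object.entries(defaults)) {
    const raw = env[key];
    if (raw === undefined) {
      config[key] = fallback;
    } else {
      config[key] = typeof fallback === 'number' ? Number(raw) : raw;
    }
  }
  return config;
}

console.log(loadConfig({ apiKey: 'example-key', replicateConcurrency: '5' }));
```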
Our goal with this project is to:
- be able to quickly do a complete rebuild
- be resilient to failures
- clean the package data
When the process starts with seq=0:
- save the current sequence of the npm registry in the state (Algolia settings)
- bootstrap the initial index content by using /_all_docs
- replicate registry changes since the current sequence
- watch for registry changes continuously and replicate them
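The steps above can be sketched as a single function. Everything here is hypothetical: state, registry and index are stand-in objects passed in by the caller, not the project's modules, and the sketch ignores a crash in the middle of the bootstrap phase.

```javascript
// Hypothetical orchestration of the startup sequence.
// state:    { save({ seq }), load() }  persisted replication state
// registry: { currentSeq(), allDocs(), changesSince(seq) }
// index:    { saveAll(docs) }
async function run({ startSeq, state, registry, index }) {
  let seq;

  if (startSeq === 0) {
    // 1. Remember where the registry is now, before copying anything.
    seq = await registry.currentSeq();
    await state.save({ seq });
    // 2. Bootstrap: bulk copy of every document.
    await index.saveAll(await registry.allDocs());
  } else {
    // Normal restart: continue from the last saved point.
    ({ seq } = await state.load());
  }

  // 3. Replicate the changes that happened since the saved sequence,
  //    saving progress after each one so a restart can resume.
  for (const change of await registry.changesSince(seq)) {
    await index.saveAll([change.doc]);
    seq = change.seq;
    await state.save({ seq });
  }

  // 4. A continuous watch would start from `seq` here.
  return seq;
}
```

Saving the registry sequence before the bootstrap means that changes published while the bulk copy is running are picked up again in step 3 rather than lost.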
Replicate and watch are separated because:
- In replicate we want to replicate a batch of documents quickly
- In watch we want new changes as fast as possible, one by one. If watch asked for batches of 100, new packages would be added to the index too late
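To make the difference concrete, here is a sketch of the two kinds of requests against CouchDB's _changes endpoint. The since, limit, include_docs and feed parameters belong to the CouchDB API, but changesUrl is a hypothetical helper, and the project may well use a client library instead of raw URLs.

```javascript
// Hypothetical helper that builds a CouchDB _changes URL.
function changesUrl(endpoint, { since, limit, continuous = false }) {
  const params = new URLSearchParams({
    since: String(since),
    include_docs: 'true',
  });
  if (limit !== undefined) {
    params.set('limit', String(limit)); // batch size for the replicate phase
  }
  if (continuous) {
    params.set('feed', 'continuous'); // stream changes as they happen
  }
  return `${endpoint}/_changes?${params}`;
}

const endpoint = 'https://replicate.npmjs.com/registry';

// Replicate: catch up in batches.
console.log(changesUrl(endpoint, { since: 42, limit: 10 }));
// Watch: receive each new change as soon as it is available.
console.log(changesUrl(endpoint, { since: 52, continuous: true }));
```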
yarn test
This currently only runs linting.
Set up Heroku, then:
git push heroku master
This forces a reindex without removing any existing packages:
heroku config:add seq=0
# check logs to see if it re-started
heroku logs -t
heroku config:remove seq
# check logs to see if it re-started
heroku logs -t