Codelauf is a source-code search system.
It is a work in progress; this design document describes how it will be architected.
Codelauf mirrors Git repositories and uses Elasticsearch to index files and commits on tracked branches.
Code is passed through language-specific syntax analysers before being loaded into the index.
You can search the index by commit id, or for a string that appears in the codebase on one of the tracked remotes and branches.
ELB -> ASG[ Web Frontends ] -> Elasticsearch <- codelauf worker -> sqlite
              |                                      |
              +------------> ZooKeeper <-------------+
There can be any number of web frontends, each of which is stateless.
A separate project provides the web front-end and API.
The web frontends provide an API that can be used to query the cluster state as it is in ZooKeeper, and also to perform searches.
There is a single codelauf worker at any one time; this is enforced via ZooKeeper. In future we could use leader election to allow failover, or partition the repositories into buckets spread across a cluster of workers.
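The single-active-worker rule can be pictured with a small in-memory stand-in for the ZooKeeper ephemeral node. This is a sketch of the semantics only; the type and method names, and the use of a local mutex instead of a real ZooKeeper session, are all illustrative assumptions:

```rust
use std::sync::Mutex;

/// In-memory stand-in for the ZooKeeper ephemeral node that marks
/// the single active worker (illustrative, not the project's code).
struct WorkerLock {
    holder: Mutex<Option<String>>, // Some(worker_id) while a worker session is live
}

impl WorkerLock {
    fn new() -> Self {
        WorkerLock { holder: Mutex::new(None) }
    }

    /// Try to become the single active worker. Fails if one already
    /// exists, just as creating an already-existing ephemeral znode fails.
    fn try_acquire(&self, worker_id: &str) -> bool {
        let mut holder = self.holder.lock().unwrap();
        if holder.is_none() {
            *holder = Some(worker_id.to_string());
            true
        } else {
            false
        }
    }

    /// Models the worker's ZooKeeper session expiring, which deletes the
    /// ephemeral node and lets a replacement worker take over.
    fn release(&self, worker_id: &str) {
        let mut holder = self.holder.lock().unwrap();
        if holder.as_deref() == Some(worker_id) {
            *holder = None;
        }
    }
}

fn main() {
    let lock = WorkerLock::new();
    assert!(lock.try_acquire("worker-0"));  // first worker wins
    assert!(!lock.try_acquire("worker-1")); // second is turned away
    lock.release("worker-0");               // session expires
    assert!(lock.try_acquire("worker-1"));  // failover now succeeds
}
```

A future leader-election scheme would replace `try_acquire` with watching for the node's deletion and racing to recreate it.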
ZooKeeper is used for two things:
- long-lived configuration data:
  - the list of repositories that need to be indexed
- ephemeral state of the worker process:
  - when it started
  - what it's doing
Codelauf stores mirrored Git repositories on its local filesystem, and also uses sqlite to track program state that should persist across application restarts but does not need to outlive the mirrored Git repositories themselves.
If the worker machine is lost, it can be recovered by starting a new one and re-mirroring the Git repositories named in ZooKeeper; this process is automatic. ZooKeeper also holds the indexed commit id of each branch as a backup, so no re-indexing is needed.
If ZooKeeper is lost, its configuration will need to be recreated and the codelauf worker restarted.
If the Elasticsearch cluster is lost, the worker will need to re-index everything.
If your repository setup is anything other than trivial, it is recommended that you create a script that drives the web API to add the repositories automatically.
/codelauf (root)
  /repositories
    /{43223-21998392-3232-123294}
      - type: git
        url: https://github.com/...
        branches:
          - name: master
            indexed_commit_id: blah
            last_indexed: Monday
            wanted_indexed: Tuesday
    /{09238-24234233-3242-432981}
      - type: hg?
        url: blah
        blah: blah
  /workers
    /0
      - start_time: Tuesday
      /repositories
        /{43223-21998392-3232-123294}
          - status: cloning
          - progress: 80%
        /{09238-24234233-3242-432981}
          - status: indexing_files
          - progress: 20%
/repositories index,get,patch,delete
/workers index,get
/search get
Note that there's no way to directly add or remove repositories on a worker; at the moment this is done via the worker watching the ZooKeeper /repositories node, which makes this API somewhat redundant. In future it will be used to coordinate ownership of repositories among workers.
/repositories index,get
/repositories/{id}/sync post // trigger immediate fetch and sync
/repositories/{id}/recreate post // clone fresh copy and sync
/status get
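As a sketch, the worker API surface above could be dispatched like this. The enum and function names are assumptions for illustration, not the project's actual handlers:

```rust
/// One variant per worker API route listed above (names are illustrative).
#[derive(Debug, PartialEq)]
enum WorkerRoute {
    ListRepositories,
    GetRepository(String),
    SyncRepository(String),      // trigger immediate fetch and sync
    RecreateRepository(String),  // clone fresh copy and sync
    Status,
}

/// Map an HTTP method and path onto a worker route, if any.
fn route(method: &str, path: &str) -> Option<WorkerRoute> {
    let parts: Vec<&str> = path.trim_matches('/').split('/').collect();
    match (method, parts.as_slice()) {
        ("GET", ["repositories"]) => Some(WorkerRoute::ListRepositories),
        ("GET", ["repositories", id]) => Some(WorkerRoute::GetRepository(id.to_string())),
        ("POST", ["repositories", id, "sync"]) => Some(WorkerRoute::SyncRepository(id.to_string())),
        ("POST", ["repositories", id, "recreate"]) => Some(WorkerRoute::RecreateRepository(id.to_string())),
        ("GET", ["status"]) => Some(WorkerRoute::Status),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        route("POST", "/repositories/abc/sync"),
        Some(WorkerRoute::SyncRepository("abc".into()))
    );
    assert_eq!(route("GET", "/status"), Some(WorkerRoute::Status));
    // the worker API deliberately has no delete, per the note above
    assert_eq!(route("DELETE", "/repositories/abc"), None);
}
```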
- open sqlite db
- create top-level nodes in zookeeper under /workers
- start watch on zk repositories node
- create nodes per project as per rows in sqlite db
- begin sync tasks:
- loop over projects defined in sqlite db
- for each watched remote start sync thread
- create entry in sqlite
- start new sync thread
- find the repo dir and check its consistency against the sqlite db:
  - if the dir doesn't exist, clone it
  - if the sqlite commit id doesn't exist in the repo, clear it
- run git fetch to manually sync with the remote
- use a revwalk to find all the commits back to the merge base(s):
  - include in the revwalk the tips of all the repo's tracked branches from the branches table
  - for each tracked branch, hide the merge bases of (branch tip commit id, indexed commit id)
- add all commits found by the revwalk to the commits work table in sqlite
  - crash recovery: ignore duplicate row errors
- scroll through the commits work table and add each commit to elasticsearch:
  - mark each row in the work table as done, periodically committing the elasticsearch batch as we go; all updates to the search index are idempotent
  - remove from the search index any files deleted or renamed by a commit
  - add to the repo_files table any files that are added or updated; if they're already there, update the change commit id if newer
  - crash recovery: no special logic needed; elasticsearch will eventually converge
- when all rows are done, save each branch tip commit id as the indexed commit id in the branches table, clear the work table, and update each branch commit id in zookeeper
  - crash recovery: update the branches table and delete the work table rows in the same transaction; the zookeeper branch commit id is eventually consistent
- for each file in the repo_files table whose change commit id is newer than its indexed commit id, add it to the search index, updating the repo_files indexed commit id as we go
  - crash recovery: the indexed commit id is monotonic, so no special logic needed
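The repo_files bookkeeping above can be sketched with an in-memory map standing in for sqlite. "Newer" is modelled here with a plain sequence number, whereas the real system compares commits via git history; all names are illustrative assumptions:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct FileRow {
    change_seq: u64,  // stands in for commit_id: when the file last changed
    indexed_seq: u64, // stands in for indexed_commit_id: when last indexed
}

/// In-memory stand-in for the repo_files table (one repo, keyed by path).
struct RepoFiles {
    rows: HashMap<String, FileRow>,
}

impl RepoFiles {
    fn new() -> Self {
        RepoFiles { rows: HashMap::new() }
    }

    /// A commit touched `path`: record the change, keeping only the newest.
    fn record_change(&mut self, path: &str, seq: u64) {
        let row = self
            .rows
            .entry(path.to_string())
            .or_insert(FileRow { change_seq: 0, indexed_seq: 0 });
        if seq > row.change_seq {
            row.change_seq = seq; // update change commit id only if newer
        }
    }

    /// A commit deleted or renamed `path` away: drop it.
    fn record_delete(&mut self, path: &str) {
        self.rows.remove(path);
    }

    /// Phase two: (re)index every file whose change is newer than its last
    /// index, advancing indexed_seq monotonically. Safe to rerun after a crash.
    fn index_pending(&mut self) -> Vec<String> {
        let mut indexed = Vec::new();
        for (path, row) in self.rows.iter_mut() {
            if row.change_seq > row.indexed_seq {
                // (send the file contents to elasticsearch here)
                row.indexed_seq = row.change_seq;
                indexed.push(path.clone());
            }
        }
        indexed
    }
}

fn main() {
    let mut files = RepoFiles::new();
    files.record_change("src/lib.rs", 1);
    files.record_change("src/lib.rs", 5);
    files.record_change("src/lib.rs", 3); // stale change is ignored
    assert_eq!(files.index_pending(), vec!["src/lib.rs".to_string()]);
    assert!(files.index_pending().is_empty()); // rerun after crash: no-op
}
```

Because indexing only ever moves indexed_seq forward to change_seq, replaying the loop after a crash re-does at most the unfinished files, matching the "monotonic, no special logic" note above.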
- started
- start_fail: couldn't open the sqlite db, find the data dir, or reach zookeeper
- cloning
- clone_fail: couldn't access the remote repo
- cloned
- fetching
- fetch_fail: couldn't access the remote repo
- fetched
- indexing_commits
- index_commits_fail: error twiddling git or poking elasticsearch or sqlite
- indexed_commits
- indexing_files
- index_files_fail: error poking elasticsearch or sqlite or git
- indexed_files
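The states above map naturally onto a Rust enum; a sketch, with type and function names that are assumptions rather than the project's actual code:

```rust
/// The per-repo sync states listed above (illustrative names).
#[derive(Clone, Copy, Debug, PartialEq)]
enum SyncState {
    Started,
    StartFail,
    Cloning,
    CloneFail,
    Cloned,
    Fetching,
    FetchFail,
    Fetched,
    IndexingCommits,
    IndexCommitsFail,
    IndexedCommits,
    IndexingFiles,
    IndexFilesFail,
    IndexedFiles,
}

/// Which failure state an error in a given working state leads to.
fn failure_of(state: SyncState) -> Option<SyncState> {
    use SyncState::*;
    match state {
        Started => Some(StartFail),
        Cloning => Some(CloneFail),
        Fetching => Some(FetchFail),
        IndexingCommits => Some(IndexCommitsFail),
        IndexingFiles => Some(IndexFilesFail),
        _ => None, // completed states and failure states don't fail further
    }
}

fn main() {
    assert_eq!(failure_of(SyncState::Cloning), Some(SyncState::CloneFail));
    assert_eq!(failure_of(SyncState::IndexedFiles), None);
}
```

Encoding the failure transitions in one place keeps the worker's status reporting (the zookeeper `status` node above) consistent with what the sync thread actually did.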
- id uuid string (hyphen formatted, 36 chars)
- repo uri (e.g. https://github.com/me/foo.git)
- indexed_datetime for information only
- sync state (see above)
- local filesystem path
unique indexes on id and repo
- repo_id
- name
- indexed_commit_id
unique index on (repo_id,name)
- id git oid of commit (40-char ascii hex string)
- repo_id uuid string of repo
- state enum indexed or not_indexed
unique index on (repo_id, id)
- repo_id uuid string of repo
- path relative path in repo of file
- commit_id id of commit when last changed
- indexed_commit_id id of commit when last indexed
unique index on (repo_id, path)
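The four tables above, as the worker might represent a row of each in Rust. This is only a sketch: the struct and field types are assumptions, and the unique indexes are noted in comments rather than enforced:

```rust
#![allow(dead_code)] // several rows are defined but not constructed below

struct RepositoryRow {
    id: String,               // uuid string, hyphen-formatted, 36 chars (unique)
    repo: String,             // repo uri, e.g. https://github.com/me/foo.git (unique)
    indexed_datetime: String, // for information only
    sync_state: String,       // one of the sync states above
    local_path: String,       // local filesystem path of the mirror
}

struct BranchRow {
    repo_id: String,                   // unique together with name
    name: String,
    indexed_commit_id: Option<String>, // None until the branch is first indexed
}

struct CommitRow {
    id: String,      // git oid, 40-char hex (unique together with repo_id)
    repo_id: String,
    indexed: bool,   // state enum: indexed or not_indexed
}

struct RepoFileRow {
    repo_id: String,                   // unique together with path
    path: String,                      // relative path of the file in the repo
    commit_id: String,                 // commit when last changed
    indexed_commit_id: Option<String>, // commit when last indexed
}

fn main() {
    let branch = BranchRow {
        repo_id: "43223-21998392-3232-123294".to_string(),
        name: "master".to_string(),
        indexed_commit_id: None,
    };
    assert_eq!(branch.name, "master");
    assert!(branch.indexed_commit_id.is_none());
}
```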
The Rust code uses Paths where appropriate; the sqlite db uses C strings. Converting between the two is done in modules::types.rs.
No paths are created from things that aren't already paths or otherwise known to be safe (e.g. ASCII): a hash of the remote url, rather than the url itself, is used as the directory to clone the repo into, and branch names aren't used in paths anywhere.
Hopefully that's enough to be reasonably cross-platform and tolerant of non-UTF-8 inputs.
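The url-to-directory scheme can be sketched as follows. The function name is an assumption, and the standard library's `DefaultHasher` stands in for whatever hash the project actually uses; the point is only that the directory name is pure ASCII hex regardless of the input bytes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Derive the clone directory for a remote url by hashing the url,
/// so no untrusted bytes ever reach the filesystem path.
fn clone_dir(data_dir: &Path, remote_url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    remote_url.hash(&mut hasher);
    // 16 lowercase hex chars: safe ASCII, valid on every filesystem
    data_dir.join(format!("{:016x}", hasher.finish()))
}

fn main() {
    let dir = clone_dir(Path::new("/var/lib/codelauf"), "https://github.com/me/foo.git");
    let name = dir.file_name().unwrap().to_str().unwrap();
    assert_eq!(name.len(), 16);
    assert!(name.chars().all(|c| c.is_ascii_hexdigit()));
    // deterministic: the same url always maps to the same directory
    assert_eq!(
        dir,
        clone_dir(Path::new("/var/lib/codelauf"), "https://github.com/me/foo.git")
    );
}
```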