hpjansson/fornalder

a workflow to remove a repository after ingestion?

gasche opened this issue · 1 comment

It has happened to me several times now that I ingest a large set of repositories, I look at the data, and I notice oddities caused by a repository that should not have been there in the first place.

Is there a workflow to remove a repository from the database, and rerun the plotting?

Currently I don't know of such a workflow, so I manually remove the repository, delete the database, and restart ingestion from scratch. This is ok, but it can be annoying when ingestion is slow (several minutes on large repository sets).

I thought about running sqlite on the database and doing a DELETE operation on all raw_commits coming from this directory. However, if I understand correctly, the plotting data comes from the authors table that I would need to update with new aggregates, and I don't know how to do it easily.

Assuming this does not currently exist, my proposal would be to have a command fornalder reanalyze foo.db that would drop the current authors table and recompute it from the raw_commits table as it currently exists.

(Another option, of course, would be a fornalder repo-remove foo.db repo.git command that removes a repository from the database, the inverse of what fornalder ingest foo.db repo.git does. But that sounds like more work.)

The authors table gets derived from raw_commits every run, so it should be safe to poke around in the latter. See:

fornalder/src/commitdb.rs

Lines 204 to 233 in 43f3d48

// Generate table with per-author stats like time of first and
// last commit.
self.conn.execute ("drop table authors;", NO_PARAMS).ok();
self.conn.execute ("
    create table authors as
        select author_name,
               first_time,
               first_year,
               last_time,
               last_year,
               last_time-first_time as active_time,
               n_commits,
               n_changes
        from
        (
            select author_name,
                   min(author_time) as first_time,
                   min(author_year) as first_year,
                   max(author_time) as last_time,
                   max(author_year) as last_year,
                   count(id) as n_commits,
                   sum(n_insertions) + sum(n_deletions) as n_changes
            from raw_commits
            group by author_name
        );
    create index index_author_name on authors (author_name);
    create index index_first_time on authors (first_time);
    create index index_active_time on authors (active_time);
", NO_PARAMS).chain_err(|| "Could not create author summaries")?;

I intended to re-run postprocess() only if something changed (e.g. store a hash of the meta file provided, clear a flag whenever a fornalder command like ingest changes the database), but it wasn't too slow in practice, so I didn't feel the need to optimize it, at least not yet. I left a reminder here:

cdb.postprocess(&meta.domains)?; // FIXME: Skip if metadata is unchanged
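For reference, the skip-if-unchanged idea could look roughly like the sketch below. None of this exists in fornalder: PostprocessState, its fields and its methods are hypothetical names, and FNV-1a is just one choice of a hash that stays stable across runs, so a stored value can be compared later.

```rust
/// 64-bit FNV-1a. The result does not depend on the Rust version or on
/// per-process randomness, so it is suitable for storing and comparing later.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical bookkeeping; a real implementation would presumably
/// persist these two values inside the database file.
#[derive(Debug, Default)]
struct PostprocessState {
    /// Hash of the meta file used by the last postprocess run, if any.
    meta_hash: Option<u64>,
    /// True while the derived tables match raw_commits; cleared by any
    /// command that modifies raw_commits (e.g. ingest).
    derived_fresh: bool,
}

impl PostprocessState {
    /// Postprocessing is needed if the commits changed or the metadata differs.
    fn needs_postprocess(&self, meta_bytes: &[u8]) -> bool {
        !self.derived_fresh || self.meta_hash != Some(fnv1a_64(meta_bytes))
    }

    /// Record a successful postprocess run against the given metadata.
    fn mark_postprocessed(&mut self, meta_bytes: &[u8]) {
        self.meta_hash = Some(fnv1a_64(meta_bytes));
        self.derived_fresh = true;
    }

    /// Called by commands that change raw_commits.
    fn mark_commits_changed(&mut self) {
        self.derived_fresh = false;
    }
}
```

One caveat with a scheme like this: rows deleted by hand through sqlite would not clear the flag, so manual edits would also need a way to force a re-run.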

Anyway, the bottom line is that manually editing raw_commits is safe, for now.
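To illustrate why that works, here is a small in-memory model of the same logic, not fornalder code: delete the unwanted repository's rows, then rebuild the per-author summary from whatever is left. RawCommit and AuthorStats are simplified stand-ins, and the repo field in particular is an assumption; check the actual raw_commits schema for how a commit's repository is recorded.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a raw_commits row. `repo` is a hypothetical
/// field name; the real column may be called something else.
#[derive(Debug, Clone)]
struct RawCommit {
    repo: String,
    author_name: String,
    author_time: i64,
    n_insertions: u64,
    n_deletions: u64,
}

/// Mirrors the per-author columns of the query above (year columns omitted).
#[derive(Debug, PartialEq)]
struct AuthorStats {
    first_time: i64,
    last_time: i64,
    active_time: i64,
    n_commits: u64,
    n_changes: u64,
}

/// Equivalent of deleting all rows that came from one repository.
fn remove_repo(commits: &mut Vec<RawCommit>, repo: &str) {
    commits.retain(|c| c.repo != repo);
}

/// Rebuild the summaries from scratch, like the `group by author_name` query.
fn summarize_authors(commits: &[RawCommit]) -> BTreeMap<String, AuthorStats> {
    let mut authors: BTreeMap<String, AuthorStats> = BTreeMap::new();
    for c in commits {
        let s = authors.entry(c.author_name.clone()).or_insert(AuthorStats {
            first_time: c.author_time,
            last_time: c.author_time,
            active_time: 0,
            n_commits: 0,
            n_changes: 0,
        });
        s.first_time = s.first_time.min(c.author_time);
        s.last_time = s.last_time.max(c.author_time);
        s.active_time = s.last_time - s.first_time;
        s.n_commits += 1;
        s.n_changes += c.n_insertions + c.n_deletions;
    }
    authors
}
```

Because the summary is always recomputed from the remaining rows, nothing from the removed repository can linger in it: calling remove_repo followed by summarize_authors gives the same result as never having ingested that repository.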

I like the idea of having CLI commands for common database edits (like removing a repo, or maybe a date range). Let's keep this issue open for repo-remove (or remove-repo?).