etix/mirrorbits

Mirrorbits stats pile up forever, there should be a way to delete old stats

elboulangero opened this issue · 1 comments

Mirrorbits saves some stats in the Redis database, in keys prefixed with STATS_. However, these keys have no expiry set, and mirrorbits doesn't provide a configuration option to set one. As a result, stats remain forever, and since Redis keeps its data in RAM, memory usage slowly increases over time.

A bit more detail now. Here are the stats available in the database:

  • STATS_TOTAL: just one key, the total number of requests processed by mirrorbits. Total so far.
  • STATS_MIRROR: number of requests sent to each mirror. Total so far.
  • STATS_MIRROR_BYTES: number of bytes served by each mirror. Total so far. (NB: It's a theoretical number, based on file size, and assuming the mirror served the file entirely).
  • STATS_MIRROR_${date} and STATS_MIRROR_BYTES_${date}: same as above, but totals per year (when date=YYYY), per month (when date=YYYY_MM) and per day (when date=YYYY_MM_DD).
  • STATS_FILE: number of requests per file. Total so far.
  • STATS_FILE_${date}: same as above, but totals per year (when date=YYYY), per month (when date=YYYY_MM) and per day (when date=YYYY_MM_DD).
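To make the naming scheme above concrete, here is a minimal bash sketch that classifies a key by the granularity of its date suffix. The function name stats_granularity is made up for illustration and is not part of mirrorbits; it only assumes the STATS_ prefix and the YYYY, YYYY_MM and YYYY_MM_DD suffixes listed above.

```bash
#!/bin/bash

# Hypothetical helper (not part of mirrorbits): print the granularity of a
# stats key, based only on the shape of its date suffix.
stats_granularity() {
    local key=$1
    case $key in
        # Most specific pattern first: YYYY_MM_DD, then YYYY_MM, then YYYY
        STATS_*_[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]) echo daily ;;
        STATS_*_[0-9][0-9][0-9][0-9]_[0-9][0-9]) echo monthly ;;
        STATS_*_[0-9][0-9][0-9][0-9]) echo yearly ;;
        # No date suffix: a running total (STATS_TOTAL, STATS_MIRROR, ...)
        STATS_*) echo total ;;
        *) echo unknown ;;
    esac
}
```

For example, stats_granularity STATS_FILE_2023_05_17 prints daily, while stats_granularity STATS_MIRROR_BYTES prints total.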

In practice, most of those stats occupy little memory, as they are overall, yearly, or monthly totals. It's mostly the daily totals that I'm concerned about. STATS_MIRROR_${date} and STATS_MIRROR_BYTES_${date} are hashes, and the number of fields in each is the number of mirrors. So they shouldn't be too big either.

So we're left with the daily STATS_FILE_${date} keys. Each one is also a hash, and its number of fields is the number of files that were served that day. So depending on the number of files in the repo, it can be big.

For context, I'm evaluating using mirrorbits for Kali Linux, and we have around 500,000 files in the repo. The number fluctuates, and can go up to a million. Of course, we don't serve all the files every day, but still, a good number. Long-term, we need a way to clean the old stats in order to save some RAM.
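As a rough illustration of what such a cleanup could look like, here is a bash sketch that deletes daily STATS_FILE_YYYY_MM_DD keys older than a given number of days. It is only a sketch under assumptions, and not necessarily what #145 does: the function names are made up, the cutoff computation assumes GNU date, and it leaves yearly, monthly and overall totals untouched.

```bash
#!/bin/bash

# Hypothetical helper: succeed if KEY is a daily STATS_FILE key dated
# strictly before CUTOFF (both dates in YYYY_MM_DD form).
is_expired_daily_file_key() {
    local key=$1 cutoff=$2 day
    case $key in
        STATS_FILE_[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]) ;;
        *) return 1 ;;   # totals, yearly and monthly keys are never expired
    esac
    day=${key#STATS_FILE_}
    # Zero-padded YYYY_MM_DD strings sort chronologically,
    # so a plain string comparison is enough.
    [[ $day < $cutoff ]]
}

# Hypothetical helper: prune_daily_file_stats DAYS [redis-cli options...]
# Deletes every daily STATS_FILE key older than DAYS days.
prune_daily_file_stats() {
    local days=$1 cutoff key
    shift
    cutoff=$(date -d "$days days ago" +%Y_%m_%d)   # assumes GNU date
    redis-cli "$@" --scan --pattern "STATS_FILE_*" | while read -r key; do
        if is_expired_daily_file_key "$key" "$cutoff"; then
            redis-cli "$@" del "$key" >/dev/null
        fi
    done
}
```

Something like prune_daily_file_stats 90 -n 0 would then drop daily per-file stats older than 90 days from database 0. Defining the functions does not touch Redis; only calling prune_daily_file_stats does.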


For anyone interested, here's a little bash script to print the size of the STATS keys in RAM:

#!/bin/bash

set -eu

# Build the redis-cli command line; any arguments (eg. -n 0) are passed through
REDIS="redis-cli $*"

# Get all the STATS_* keys
echo "Scanning the redis db for STATS_* keys, this might take a while ..."
if redis-cli --help 2>&1 | grep -q -- " --count "; then
    KEYS=$($REDIS --scan --count 1000 --pattern "STATS_*")
else
    KEYS=$($REDIS --scan --pattern "STATS_*")
fi

KEYS=$(echo "$KEYS" | LC_ALL=C sort -u)

# Do the maths
count=0
count_file=0
bytes=0
bytes_file=0
for key in $KEYS; do
    # MEMORY USAGE prints nothing if the key vanished since the scan; count it as 0
    b=$($REDIS --raw memory usage "$key")
    b=${b:-0}
    count=$((count + 1))
    bytes=$((bytes + b))
    case $key in
        STATS_FILE_*)
            count_file=$((count_file + 1))
            bytes_file=$((bytes_file + b))
            ;;
    esac
    echo -n .
done
echo

echo "TOTAL for all STATS_* keys : $count keys, estimated size: $bytes bytes, or $((bytes / 1024)) kB, or $((bytes / 1024 / 1024)) MB"
echo "TOTAL for STATS_FILE_* keys: $count_file keys, estimated size: $bytes_file bytes, or $((bytes_file / 1024)) kB, or $((bytes_file / 1024 / 1024)) MB"

Run it with the Redis database number as an argument, e.g. ./stats-size.sh -n 0 for database 0.


I propose #145 (a contrib script) as a solution. It would be better to implement this directly in mirrorbits, though.

Thanks!

Merged #145