bitmapist4

Next incarnation of bitmapist: powerful analytics and cohort library using Redis bitmaps

NEW! Try out our new standalone bitmapist-server, which improves memory efficiency 443 times and makes your setup cheaper and more scalable. It's fully compatible with bitmapist running on Redis.

bitmapist: a powerful analytics library for Redis

This Python library makes it possible to implement real-time, highly scalable analytics that can answer the following questions:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action "X"?
  • How many users have been active this month? This hour?
  • How many unique users have performed action "X" this week?
  • What percentage of users who were active last week are still active?
  • What percentage of users who were active last month are still active this month?
  • What users performed action "X"?

This library is very easy to use and enables you to create your own reports easily.

Using Redis bitmaps, you can store events for millions of users in very little memory (megabytes).

Be careful with huge ids, however: a bitmap must be long enough to hold its largest id, so large ids require more memory. Ids should be in the range [0, 2^32).
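
To see why the size of the ids matters, here is a minimal in-memory sketch of a single bitmap key. It is not part of bitmapist; the TinyBitmap class and its method names are made up for illustration and only mimic the Redis SETBIT, GETBIT and BITCOUNT commands, assuming one bit per user id.

```python
class TinyBitmap:
    """In-memory stand-in for one Redis bitmap key (illustration only)."""

    def __init__(self):
        self.data = bytearray()

    def setbit(self, offset, value=1):
        byte, bit = divmod(offset, 8)
        if byte >= len(self.data):
            # Like Redis, grow the string with zero bytes up to the offset.
            self.data.extend(bytes(byte + 1 - len(self.data)))
        mask = 1 << (7 - bit)  # Redis numbers bits from the most significant one
        if value:
            self.data[byte] |= mask
        else:
            self.data[byte] &= ~mask & 0xFF

    def getbit(self, offset):
        byte, bit = divmod(offset, 8)
        if byte >= len(self.data):
            return 0
        return (self.data[byte] >> (7 - bit)) & 1

    def bitcount(self):
        return sum(bin(b).count("1") for b in self.data)


active = TinyBitmap()
for uid in (1, 123, 999_999):
    active.setbit(uid)

print(active.getbit(123))  # 1
print(active.bitcount())   # 3
print(len(active.data))    # 125000
```

Three marked users cost 125000 bytes here because the largest id is 999999. By the same arithmetic, a single id close to 2^32 would make one key grow to about 512 MiB.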

Additionally, bitmapist can generate cohort graphs that show the following:

  • Cohort over user retention
  • What percentage of users who were active in the last [days, weeks, months] are still active?
  • What percentage of users who performed action X also performed action Y (and how this changes over time)
  • And a lot of other things!

If you want to read more about bitmaps, please read the following:

Installation

Install it with pip:

$ pip install bitmapist4

Ports

Examples

Setting things up:

import bitmapist4
b = bitmapist4.Bitmapist()

Mark user 123 as active and as having played a song:

b.mark_event('active', 123)
b.mark_event('song:played', 123)

Answer if user 123 has been active this month:

assert 123 in b.MonthEvents('active')
assert 123 in b.MonthEvents('song:played')

How many users have been active this week?

len(b.WeekEvents('active'))

Iterate over all users active this week:

for uid in b.WeekEvents('active'):
    print(uid)

Remove the marks recording that user 123 was active and played a song:

b.unmark_event('active', 123)
b.unmark_event('song:played', 123)

To explore a specific day, week, month or year instead of the current one, create an event from any datetime object with the from_date static method.

import datetime

specific_date = datetime.datetime(2018, 1, 1)
ev = b.MonthEvents.from_date('active', specific_date)
print(len(ev))

The prev and next methods return "sibling" events, so you can walk through events in time without any sophisticated iterators. The delta method jumps forward or backward by more than one step. The uniform API lets you use all types of base events (from hour to year) with the same code.

current_month = b.MonthEvents('active')
prev_month = current_month.prev()
next_month = current_month.next()
year_ago = current_month.delta(-12)
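
As a rough illustration of what stepping by whole months means, the following sketch does the calendar arithmetic on plain (year, month) pairs. The shift_month helper is hypothetical and is not how bitmapist implements delta; it only shows which month a given step should land on.

```python
def shift_month(year, month, delta):
    """Return the (year, month) pair that lies `delta` months away."""
    # Count months from year 0 so that year boundaries need no special case.
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


print(shift_month(2018, 1, -1))   # (2017, 12)
print(shift_month(2018, 1, 1))    # (2018, 2)
print(shift_month(2018, 1, -12))  # (2017, 1)
```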

Every event object has period_start and period_end methods that return the time span of the event. This is useful for caching values when caching "events in the future" is not desirable:

# dt is any datetime inside the month of interest; cache is your own cache client
ev = b.MonthEvents.from_date('active', dt)
if ev.period_end() < datetime.datetime.utcnow():
    cache.set('active_users_<...>', len(ev))
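
To make the idea of a period span concrete, here is a small sketch that computes the boundaries of a calendar month with the standard library. The month_span function is hypothetical, and it assumes that a period ends at the last microsecond of its last day; the exact boundary returned by period_end may differ.

```python
import calendar
import datetime


def month_span(year, month):
    """Return (start, end) datetimes covering one calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    start = datetime.datetime(year, month, 1)
    # Assumption: the period ends at the last microsecond of the last day.
    end = datetime.datetime(year, month, last_day, 23, 59, 59, 999999)
    return start, end


start, end = month_span(2018, 2)
print(start)  # 2018-02-01 00:00:00
print(end)    # 2018-02-28 23:59:59.999999

# A finished period is safe to cache: its end lies before "now".
now = datetime.datetime(2018, 3, 1)
print(end < now)  # True
```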

Hourly tracking is disabled by default to save memory. You can enable it with a constructor argument.

b = bitmapist4.Bitmapist(track_hourly=True)

Additionally, you can pass an extra argument to mark_event to override the default value:

b.mark_event('active', 123, track_hourly=False)

Unique events

Sometimes the date of an event makes little or no sense, and you are more interested in whether that event has ever happened for a user.

There is a UniqueEvents model for this purpose. The model creates only one Redis key and doesn't depend on the date.

You can combine unique events with other types of events.

A/B testing example:

active = b.DayEvents('active')
classic = b.UniqueEvents('signup_form:classic')
new = b.UniqueEvents('signup_form:new')

print("Active users, signed up with classic form", len(active & classic))
print("Active users, signed up with new form", len(active & new))

You can mark these users with b.mark_unique, or you can automatically populate the extra unique cohort for all marked keys:

b = bitmapist4.Bitmapist(track_unique=True)
b.mark_event('premium', 1)
assert 1 in b.UniqueEvents('premium')

Perform bit operations

How many users who were active last month are still active this month?

ev = b.MonthEvents('active')
active_2months = ev & ev.prev()
print(len(active_2months))

# Is 123 active for 2 months?
assert 123 in active_2months

The operators &, |, ^ and ~ are supported.

This works with nested bit operations (imagine what you can do with this ;-))!
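
To show what these operators compute, here is a sketch that uses plain Python integers as stand-in bitmaps, with bit N set when user N performed the event. The bitmap and members helpers and the sample data are invented for illustration; bitmapist itself runs these operations inside Redis.

```python
def bitmap(*uids):
    """Build an integer whose bit N is set for every user id N."""
    value = 0
    for uid in uids:
        value |= 1 << uid
    return value


def members(value):
    """List the user ids whose bits are set."""
    return [i for i in range(value.bit_length()) if value >> i & 1]


this_month = bitmap(1, 2, 3, 5)
last_month = bitmap(2, 3, 4)
premium = bitmap(3, 4)

retained = this_month & last_month
print(members(retained))                 # [2, 3]
print(members(this_month | last_month))  # [1, 2, 3, 4, 5]
print(members(this_month ^ last_month))  # [1, 4, 5]

# Nested operation: retained users who are not premium.
print(members(retained & ~premium))      # [2]

# Share of last month's users who are still active this month.
rate = 100 * bin(retained).count("1") / bin(last_month).count("1")
print(round(rate, 1))                    # 66.7
```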

Delete events

If you want to permanently remove marked events for any time period you can use the delete() method:

ev = b.MonthEvents.from_date('active', last_month)
ev.delete()

If you want to remove all bitmapist events use:

b.delete_all_events()

Results of bit operations are cached by default: for 60 seconds if the operation involves a period that has not finished yet, and for 24 hours otherwise.
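
The caching rule can be sketched as a small function. This is only an illustration of the rule as stated here, not the library's code; the name bitop_cache_ttl and its arguments are assumptions.

```python
import datetime


def bitop_cache_ttl(period_ends, now):
    """Return the cache lifetime in seconds for a bit operation.

    period_ends holds the end datetime of every period that takes part
    in the operation.
    """
    if any(end >= now for end in period_ends):
        return 60            # some period is still open, so data may change
    return 24 * 60 * 60      # all periods are finished


now = datetime.datetime(2018, 12, 15)
november_end = datetime.datetime(2018, 11, 30, 23, 59, 59)
december_end = datetime.datetime(2018, 12, 31, 23, 59, 59)

print(bitop_cache_ttl([november_end, december_end], now))  # 60
print(bitop_cache_ttl([november_end], now))                # 86400
```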

You may want to reset the cache explicitly:

ev = b.MonthEvents('active')
active_2months = ev & ev.prev()
# Delete the temporary AND operation
active_2months.delete()

# delete all bit operations (slow if you have many millions of keys in Redis)
b.delete_temporary_bitop_keys()

Bulk updates with transactions

If you often perform multiple updates at once, you can benefit from Redis pipelines, which bitmapist wraps as transactions.

with b.transaction():
    b.mark_event('active', 123)
    b.mark_event('song:played', 123)

Migration from previous version

The API of the "bitmapist4.Bitmapist" instance is mostly compatible with the module-level API of the previous version of bitmapist. Notable changes are outlined below.

  • Removed the "system" attribute for choosing the server. You are supposed to use different Bitmapist class instances instead. If you used "system" to work with pipelines, you should switch to transactions instead.
  • The bitmapist.TRACK_HOURLY and bitmapist.TRACK_UNIQUE module-level constants have moved to bitmapist4.Bitmapist attributes and can be set with the class constructor.
  • At the database level, bitmapist4 uses the "bitmapist_" prefix for Redis keys, while the old bitmapist uses "trackist_" for historical reasons. If you want to keep using the old database, or want to use bitmapist and bitmapist4 against the same database, you need to explicitly set the key prefix to "trackist_".
  • If you use bitmapist-server, make sure you run version 1.2 or newer. This version adds support for the EXPIRE command, which is used to expire temporary bitop keys.

Replace old code which could look like this:

import bitmapist
bitmapist.setup_redis('default', 'localhost', 6380)
...
bitmapist.mark_event('active', user_id)

With something looking like this:

from bitmapist4 import Bitmapist
bitmapist = Bitmapist('redis://localhost:6380', key_prefix='trackist_')
...
bitmapist.mark_event('active', user_id)

Bitmapist cohort

A cohort is a group of subjects who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation).

You can get the cohort table with the bitmapist4.cohort.get_cohort_table() function.

Each row of this table answers the question "what part of the cohort performed the activity over time", and the Nth cell of that row holds the number of users (absolute or as a percentage) who still perform the activity N days (or weeks, or months) later.

Each successive column shows how these similar cohorts behave one period later. The last row displays the behavior of the cohort provided as an argument, the row above it displays the behavior of a similar cohort shifted 1 day (or week, or month) back, and so on.

For example, consider the following cohort statistics:

from bitmapist4.cohort import get_cohort_table
table = get_cohort_table(b.WeekEvents('registered'), b.WeekEvents('active'))

This table shows what share of registered users is still active in the week of registration, then one week after it, then two weeks after it, etc.

By default the table displays 20 rows.

The first row represents the statistics for the cohort of users registered 20 weeks ago. The second row represents the same statistics for users registered 19 weeks ago, and so on, until the last row shows users registered this week. Naturally, the last row contains only one cell: the number of users who registered this week AND were active this week as well.
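
The triangular shape of the table can be reproduced with plain Python sets. The sketch below is not the get_cohort_table implementation; the cohort_table function and the sample data are made up, and it assumes that each cell counts the cohort members who were active in the given period.

```python
def cohort_table(cohorts, activity):
    """Build cohort rows from per-period sets of user ids (oldest first).

    Each row is [cohort size, active in period 0, active in period 1, ...],
    where period 0 is the period in which the cohort was formed.
    """
    table = []
    for i, cohort in enumerate(cohorts):
        row = [len(cohort)]
        for j in range(i, len(activity)):
            row.append(len(cohort & activity[j]))
        table.append(row)
    return table


# Users registered in each of three weeks, and users active in each week.
registered = [{1, 2, 3}, {4, 5}, {6}]
active = [{1, 2, 3}, {1, 2, 4, 5}, {1, 5, 6}]

for row in cohort_table(registered, active):
    print(row)
# [3, 3, 2, 1]
# [2, 2, 1]
# [1, 1]
```

Later cohorts have fewer cells because fewer periods have passed since they were formed, which is why the last row holds a single activity value.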

You can then render it to HTML yourself, or export it to a pandas DataFrame with the df() method.

Sample from user activity on http://www.gharchive.org/

In [1]: from bitmapist4 import Bitmapist, cohort

In [2]: b = Bitmapist()

In [3]: cohort.get_cohort_table(b.WeekEvents('active'), b.WeekEvents('active'), rows=5, use_percent=False).df()
Out[3]:
             cohort       0        1        2        3        4
05 Nov 2018  137420  137420  25480.0  18358.0  21575.0  18430.0
12 Nov 2018  150975  150975  22195.0  25833.0  21165.0      NaN
19 Nov 2018  121417  121417  22477.0  15796.0      NaN      NaN
26 Nov 2018  152027  152027  25606.0      NaN      NaN      NaN
03 Dec 2018  130470  130470      NaN      NaN      NaN      NaN

The dataframe can be further colorized (to be displayed in Jupyter notebooks) with stylize().


Copyright: 2012-2019 by Doist Ltd.

License: BSD