automated-analytics

Library of useful functions for automating reporting.


Python Analyst Library

This is a collection of functions I wrote as an analyst at Eventbrite (www.eventbrite.com) to automate company reports.

Some are OS dependent (windows_excel_api).

  • There are several Python libraries that can write to Excel independent of OS, but they are terrible for maintaining complex Excel reports (i.e. ones with charts, formulas, dynamic tables, formatting, etc.): when they save the workbook they strip out all of that functionality. The Windows COM interface (win32com) is the best way to easily write large datasets while preserving any complex functionality.

The way I set these up to run at Eventbrite was on a Windows machine I kept running at the office. Using Dropbox and a mix of pycron and the native Windows Task Scheduler, this computer ran 24/7, kicking off reports to various groups and VPs throughout the night. In my opinion, sending out Excel reports is far superior to the static analysis you could deliver through a variety of other methods: being able to give people a lot of complex data in a format they are comfortable digging into themselves is extremely beneficial.
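For reference, the Windows Task Scheduler side of this can be driven entirely from the command line with schtasks, which ships with Windows; the path, task name, and time below are placeholders, not the actual setup I used:

```bat
:: run the report script every morning at 6:00
schtasks /Create /SC DAILY /ST 06:00 /TN "DailyUserReport" ^
    /TR "C:\Python27\python.exe C:\Dropbox\reports\sample_report.py"
```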

Analysts from different departments (marketing & product) leveraged these functions and processes to automate their reports as well.

windows_excel_api.py

Example usage:

import datetime
from numpy import (
    column_stack,
    vstack,
)
from mysqlfunctions import mgdbget
from windows_excel_api import ExcelApi, xlsxwrite
from emailfunctions import send_email

# logerrors (defined in this library) emails the traceback if main() raises
@logerrors(['mwhahn@gmail.com'], 'Sample Report')
def main():
    # pull data from a database (MySQL etc.)
    query = """
    SELECT
        count(*) as num_users,
        DATE_FORMAT(created, '%Y-%m-01') as month_created
    FROM Users
    WHERE created > '2012-01-01'
    GROUP BY month_created
    """
    data_set = mgdbget(query)

    # data_set maps column name -> list of values, e.g.:
    # {'num_users': [744998, 877500, 1041307, 1028923, 950758, 254082],
    #  'month_created': ['2012-01-01', '2012-02-01', '2012-03-01',
    #                    '2012-04-01', '2012-05-01', '2012-06-01']}

    # construct a table for output
    headers = ['Month Created', 'Number of Users']
    output_table = column_stack((
        data_set['month_created'],
        data_set['num_users'],
    ))

    # output_table is now:
    # array([['2012-01-01', '744998'],
    #        ['2012-02-01', '877500'],
    #        ['2012-03-01', '1041307'],
    #        ['2012-04-01', '1028923'],
    #        ['2012-05-01', '950758'],
    #        ['2012-06-01', '254082']],
    #       dtype='|S10')

    output_table = vstack((headers, output_table))

    # output_table is now:
    # array([['Month Created', 'Number of Users'],
    #        ['2012-01-01', '744998'],
    #        ['2012-02-01', '877500'],
    #        ['2012-03-01', '1041307'],
    #        ['2012-04-01', '1028923'],
    #        ['2012-05-01', '950758'],
    #        ['2012-06-01', '254082']],
    #       dtype='|S15')

    filename = 'mysamplefile.xlsx'

    # open the workbook once so you can write to it multiple times
    with ExcelApi() as excel_api:
        excel_api.open_workbook(filename)
        # in a real report these labels would already be in the template
        # file; this just demonstrates writing multiple times
        excel_api.write('Data', 'My automated report', 'A1')
        excel_api.write('Data', str(datetime.datetime.now()), 'A2')
        # write out the data
        excel_api.write('Data', output_table, 'C1')

    # or if you just want to write the output to excel
    xlsxwrite(filename, 'Data', output_table, 'C1')

    # email it out to the team

    email_params = {
        'to': ['myteam@mycompany.com'],
        'subject': 'My automated report',
        'attachment': filename,
        'attachmentname': 'Daily Report for %s' % str(
            datetime.datetime.now()
        ),
        'message': 'Here is the automated report',
    }
    send_email(**email_params)

if __name__ == '__main__':
    main()

The way I prefer to structure these reports is to have everything in Excel reference data defined on one sheet. Charts, tables, pivot tables, and formulas can all pull from this "data" sheet. Then all you need is a script that pulls the data from whatever sources you have (MySQL, Hadoop, MongoDB, etc.), aggregates and processes it in Python, and writes it out to the "data" sheet. If you set things up right, you have fully automated reports you never need to worry about again.
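As a sketch of that aggregation step (the source names, column names, and numbers below are made up for illustration, not part of the library):

```python
from numpy import column_stack, vstack


def build_data_sheet(sources):
    """Merge several {'column_name': [values]} result sets that share a
    'month' column into one header + rows table, ready to be written to
    the workbook's "data" sheet.

    Assumes every source covers the same months in the same order.
    """
    months = sources[0]['month']
    headers = ['Month']
    columns = [months]
    for source in sources:
        for name, values in sorted(source.items()):
            if name == 'month':
                continue
            headers.append(name)
            columns.append([str(v) for v in values])
    table = column_stack(columns)
    return vstack((headers, table))


# e.g. one result set from MySQL, one from Hadoop (made-up numbers)
mysql_rows = {
    'month': ['2012-01-01', '2012-02-01'],
    'num_users': [744998, 877500],
}
hadoop_rows = {
    'month': ['2012-01-01', '2012-02-01'],
    'page_views': [91234, 105678],
}
sheet = build_data_sheet([mysql_rows, hadoop_rows])
```

Once the combined table exists, writing it to the "data" sheet is a single ExcelApi.write call, and every chart and pivot table in the workbook refreshes off it.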