A simple API to explore, clean, transform and visualize data. This API is:
- Minimalistic: short names, simple functionalities, minimal parameters
- Pragmatic: focuses on raw efficiency rather than idiomatic code
- Simple stupid: easy to understand, in both the code and the API
It relies on:
- Pandas to work with data
- Dataset and SqlAlchemy to work with databases
- Seaborn, Holoviews with Bokeh, and Altair to chart data

Supported databases:
- Postgresql, Sqlite and all those that SqlAlchemy supports
- Influxdb
Using conda:
conda install pandas sqlalchemy seaborn
conda install -c ioam holoviews bokeh
conda install altair --channel conda-forge
pip install dataset pytablewriter goerr gencharts chartjspy
pip install dataswim --no-deps
Using pip:
pip install dataswim
To get Altair charts running in notebooks, this command is required:
jupyter nbextension enable vega --py --sys-prefix
Read the API documentation.
Note: the API may change and is not stable yet. The doc is in sync with master. To build the docs for your installed version, clone the repository and run:
pip install sphinx
cd docs
make html
firefox _build/html/api_auto.html&
Jupyter demo notebooks are available as examples.
Users demo notebook: uses Django user data loaded from csv or directly from a database. Note: the data in this demo is autogenerated and fake, but it is possible to run it on real user data by connecting to a Django database.
Resample data demo: demonstrates how to resample data by time periods
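Resampling by a time period boils down to bucketing timestamps by period and aggregating each bucket. A minimal pure-Python sketch of the idea (the demo itself does this through Pandas; the timestamps here are made up):

```python
from collections import Counter
from datetime import datetime

# hypothetical registration timestamps
dates = [
    datetime(1970, 3, 1), datetime(1970, 7, 15),
    datetime(1971, 1, 2), datetime(1971, 11, 30),
    datetime(1972, 6, 9),
]

# bucket by year and count entries per bucket: an annual ("1A") resample
per_year = Counter(d.year for d in dates)
print(sorted(per_year.items()))  # [(1970, 2), (1971, 2), (1972, 1)]
```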
To run the notebooks online, click the badge:
Goal: chart user registrations over time from fake Django user data
from dataswim import ds
# Load from a Django database
#ds.connect('postgresql://user:xxxx@localhost/dbname')
#ds.connect('sqlite:////path/to/db.sqlite3')
#ds.load("auth_user")
# Load demo data from csv
ds.load_csv("data/users.csv")
# for a data description:
#ds.describe()
# for a quick look:
#ds.look()
# for a sample:
ds.show()
[START] Loading csv...
[END] Finished loading csv in 0.12 seconds
[INFO] The dataframe has 1007 rows and 11 columns:
id, password, last_login, is_superuser, first_name, last_name, email, is_staff, is_active,
date_joined, username
| | id | password | last_login | is_superuser | first_name | last_name | email | is_staff | is_active | date_joined | username |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | pbkdf2_sha256$36000$TGjJvvfpuFhR$jVD3mP8MNxNYD... | 2017-12-02 14:05:30.796573 | True | NaN | NaN | NaN | True | True | 2017-12-02 13:57:53.393755 | ggg |
| 1 | 2 | Cupiditate accusamus velit sit dolor. Doloribu... | 1986-07-16 11:22:02.000000 | False | Lindsey | Thompson | farmermeghan@henderson.com | False | False | 1997-08-06 16:23:37.000000 | jrichards |
| 2 | 3 | Nobis quisquam voluptatibus nulla.\nEa archite... | 1981-04-06 06:48:03.000000 | False | Kurt | Black | kellycharles@marsh.com | True | True | 2013-08-29 18:04:10.000000 | eroberts |
| 3 | 4 | Inventore ea quia ducimus eligendi quod. Velit... | 1991-08-08 22:23:45.000000 | True | Sandra | Wilson | antoniobowers@hotmail.com | True | False | 1981-04-22 05:32:55.000000 | emilykelley |
| 4 | 5 | Veniam sequi aut nisi vitae. Quasi explicabo v... | 1971-02-08 15:53:42.000000 | True | Andrew | Guzman | hannahconner@yahoo.com | False | False | 2010-11-10 17:53:54.000000 | lsmith |
# Resample data by one year
# see Pandas frequencies for units:
# https://github.com/pandas-dev/pandas/blob/master/pandas/tseries/frequencies.py#L98
# try "1D" for one day
ds.rsum("1A", dateindex="date_joined", num_col="Registrations", index_col="Date")
# Convert zero values to nan
ds.zero_nan("Registrations")
# Keep only the fields we need
ds.keep("Date", "Registrations")
ds.backup()
ds.show()
[OK] Added a datetime index from column date_joined
[OK] Data resampled by 1A
[OK] Column Date added from index
[OK] Replaced 0 values by nan in column Registrations
[OK] Setting dataframe to columns Date Registrations
[OK] Dataframe backed up
[INFO] The dataframe has 48 rows and 2 columns:
Date, Registrations
| date_joined | Date | Registrations |
|---|---|---|
| 1970-12-31 | 1970-12-31 | 20 |
| 1971-12-31 | 1971-12-31 | 21 |
| 1972-12-31 | 1972-12-31 | 20 |
| 1973-12-31 | 1973-12-31 | 25 |
| 1974-12-31 | 1974-12-31 | 26 |
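For reference, the pipeline above can be approximated in plain Pandas, which dataswim wraps. This is a sketch on made-up data, using a groupby on the year instead of `resample` with the `"1A"` alias; how dataswim implements these steps internally may differ:

```python
import pandas as pd

# fabricated sample of a date_joined column
df = pd.DataFrame({"date_joined": pd.to_datetime([
    "1970-03-01", "1970-07-15", "1971-01-02", "1972-06-09"])})

# count rows per calendar year, like ds.rsum("1A", ...)
res = (df.groupby(df["date_joined"].dt.year)
         .size()
         .rename_axis("Date")
         .reset_index(name="Registrations"))

# like ds.zero_nan: turn zero counts into missing values
res["Registrations"] = res["Registrations"].replace(0, float("nan"))

# like ds.keep: restrict the dataframe to the two columns of interest
res = res[["Date", "Registrations"]]
```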
import holoviews as hv
hv.extension('bokeh')
ds.chart("Date", "Registrations", label="Registrations")
ds.line_()
ds.opts(dict(width=940, height=300))
line = ds.line_()
style = dict(color="orange", size=10)
opts = dict(tools=["hover"])
point = ds.point_(style=style, opts=opts)
# The * operator merges the two charts into one
line*point
ds.width(300)
ds.height(200)
ds.color("orange")
area = ds.area_()
ds.color("green")
point = ds.point_()
ds.color("grey")
bar = ds.bar_()
area+point+bar
ds.color("green")
ds.width(850)
ds.height(300)
c7 = ds.hline_("Registrations")
c8 = ds.area_("Registrations").hist()
c8*c7
Prepare the data: convert all values to integers to make these charts work
ds.to_int("Date")
ds.to_int("Registrations")
[OK] Converted column values to integers
[OK] Converted column values to integers
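The regression and distribution charts below expect plain numeric axes, hence the integer conversion. A plain-pandas sketch of the same preparation on made-up data (the exact conversion `ds.to_int` applies may differ):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["1970-12-31", "1971-12-31"]),
                   "Registrations": [20.0, 21.0]})

# keep only the year as an integer, and cast the float counts to int
df["Date"] = df["Date"].dt.year
df["Registrations"] = df["Registrations"].astype(int)
```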
Draw the chart
%matplotlib inline
ds.chart("Registrations", "Date")
c = ds.distrib_()
ds.chart("Date", "Registrations")
c = ds.density_()
ds.opts(dict(xticks=(1970, 2017), yticks=(0, 35)))
c = ds.dlinear_()
c = ds.residual_()