Lightweight Python data frames without bloat or typecasting, using only the standard library.
Take your data with you to the cloud without the bloat of a 200kg full-grown bear that refuses to mate.
git clone
cd duffel
import duffel as pd
df = pd.read_csv('duffel/data/MOCK_DATA.csv',index_col=0)
>>> (1000, 6)
index first_name last_name email gender ip_address
-- ---- ---- ---- ---- ----
1 Brinn Herity Female
2 Wylma Lavell Female
duffel.DataFrame (1000, 5)
first_name last_name email gender ip_address
-- ---- ---- ---- ---- ----
576 Cesaro Ohrtmann Male
duffel.Row (1, 5)
df.loc[5:7, ['first_name','gender']]
index first_name gender
-- ---- ----
6 Jolynn Female
7 Moina Female
duffel.DataFrame (2, 2)
Pandas is great for hardcore analytical workloads. However, if you use Pandas for convenient-but-basic dataframe operations in a non-analytical use case, you might encounter the following limitations:
- The Pandas package is large, which makes it hard to use in size-constrained environments, e.g. Lambda functions
- The NumPy package is large, with the same consequence
- Pandas transforms numbers to numpy types and dates to pandas.Timestamp - this leads to unpredictable results
- Pandas has a bloated API with several ways to accomplish a goal
- Pandas sometimes returns some subset of a dataframe with a link to the original, instead of making a new dataframe
- Pandas throws strange errors while allowing operations to work - instead of throwing clear errors that are real exceptions
I'm building ~~red-pandas~~ duffel to solve these problems: a smaller, simpler dataframe tool that relies only on the standard library and is generally a drop-in replacement for the Pandas API.
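To make the idea concrete, here is a minimal sketch of a standard-library-only table that keeps values as plain Python objects, always returns a new object from a selection, and raises ordinary exceptions on bad input. `TinyFrame` and its `select` method are hypothetical names invented for this illustration and are not duffel's actual code or API; the `age` values are made up.

```python
import copy


class TinyFrame:
    """Hypothetical stand-in for a stdlib-only data frame (not duffel's code)."""

    def __init__(self, columns, rows, index=None):
        self.columns = list(columns)
        # Deep-copy the rows so the frame never aliases the caller's data.
        self.rows = [list(copy.deepcopy(row)) for row in rows]
        self.index = list(index) if index is not None else list(range(len(self.rows)))
        if len(self.index) != len(self.rows):
            raise ValueError("index and rows must have the same length")
        for row in self.rows:
            if len(row) != len(self.columns):
                raise ValueError("every row must have one value per column")

    @property
    def shape(self):
        return (len(self.rows), len(self.columns))

    def select(self, labels, columns):
        """Return a new TinyFrame for the given index labels and column names."""
        missing = [c for c in columns if c not in self.columns]
        if missing:
            # A real exception instead of a warning that lets the call proceed.
            raise KeyError(f"unknown columns: {missing}")
        col_pos = [self.columns.index(c) for c in columns]
        row_pos = []
        for label in labels:
            if label not in self.index:
                raise KeyError(f"unknown index label: {label!r}")
            row_pos.append(self.index.index(label))
        rows = [[self.rows[r][c] for c in col_pos] for r in row_pos]
        return TinyFrame(columns, rows, index=labels)


people = TinyFrame(
    columns=["first_name", "gender", "age"],  # "age" is invented for the example
    rows=[["Jolynn", "Female", 31], ["Moina", "Female", 27]],
    index=[6, 7],
)
subset = people.select([6], ["first_name", "age"])
subset.rows[0][0] = "changed"  # mutating the subset leaves the original untouched
```

Because the selection is a copy, `people` still holds `"Jolynn"` after the assignment, and the `age` value in `subset` is still a built-in `int`.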
Some inspiration on organization, structure, and some copypasta from
duffel borrows much from @paleolimbot's implementation of loc, iloc, and __repr__.
Uses the black code style.
Build a dataframe solution that can be easily used in AWS Lambda functions for most non-massive-scale-analytical dataframe operations.
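As a rough illustration of the target use case, the sketch below shows a Lambda-style handler that filters tabular data using only the standard library. It does not use duffel. The event keys `csv_text` and `gender` are assumptions made up for this example, and the sample rows reuse names from the mock data above. The point is that values read this way stay built-in Python types, so `json.dumps` accepts them without conversion.

```python
import csv
import io
import json


def handler(event, context):
    """Lambda-style entry point: filter CSV rows and return them as JSON."""
    # "csv_text" and "gender" are hypothetical event keys for this sketch.
    reader = csv.DictReader(io.StringIO(event["csv_text"]))
    wanted = event["gender"]
    rows = [row for row in reader if row["gender"] == wanted]
    # Every value is a plain str, so json.dumps needs no custom encoder.
    return {"statusCode": 200, "body": json.dumps(rows)}


sample_event = {
    "csv_text": "id,first_name,gender\n6,Jolynn,Female\n576,Cesaro,Male\n",
    "gender": "Male",
}
response = handler(sample_event, None)
```

The deployment package for a function like this contains no third-party dependencies, which is the size constraint the goal above is aimed at.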
Implement a significant subset of the "minimally sufficient" Pandas API as laid out in
Implemented functionality names are struck through.
- ~~columns~~
- dtypes
- ~~index~~
- ~~shape~~
- T
Subset Selection
- ~~head~~
- ~~iloc~~
- ~~loc~~
- ~~tail~~
- ~~scalar comparison~~
- ~~vector comparison~~
- ~~getitem selection~~
Missing Value Handling
- dropna
- fillna
- interpolate
- isna
- notna
- expanding
- groupby
- pivot_table
- resample
- rolling
Joining Data
- ~~append~~
- merge
- asfreq
- astype
- copy
- ~~drop~~
- drop_duplicates
- equals
- isin
- melt
- plot
- rename
- replace
- ~~reset_index~~
- ~~sample~~
- select_dtypes
- shift
- ~~sort_index~~
- ~~sort_values~~
- ~~to_csv~~
- ~~to_json~~
- to_sql
Aggregation Methods
- all
- any
- count
- describe
- ~~idxmax~~
- ~~idxmin~~
- max
- mean
- median
- min
- mode
- nunique
- sum
- std
- var
Non-Aggregation Statistical Methods
- abs
- clip
- corr
- cov
- cummax
- cummin
- cumprod
- cumsum
- diff
- nlargest
- nsmallest
- pct_change
- prod
- quantile
- rank
- round
- ~~pd.concat~~
- pd.crosstab
- pd.cut
- pd.qcut
- ~~pd.read_csv~~
- ~~pd.read_json~~
- pd.read_sql
- pd.to_datetime
- pd.to_timedelta