/lambdata

Lambdata Python Package Project

Primary LanguagePython

Pandas sublcass with additional helper methods

Install

pip install -i https://test.pypi.org/simple/ lambdata-richmondtest

Helper Methods:

Tabulate

Format the dataframe table for pretty printing. This is a wrapper method for https://pypi.org/project/tabulate/ It checks for display.max_rows and display.max_columns and uses them by default for generating the output table.

Example Usage:

>>> from lambdata_richmondtest import DataFrameWithHelpers
>>> from faker import Faker
>>> import pandas as pd
>>> import datetime
>>> fake = Faker()
>>> pd.set_option('display.max_rows', 20)
>>> #Create a DataFrame from random dates
>>> start_date = datetime.date(year=2010, month=1, day=1)
>>> end_date = datetime.date(year=2020, month=1, day=1)
>>> fake_dates = [fake.date_between(start_date=start_date, end_date=end_date) for x in range(1000)]
>>> df = DataFrameWithHelpers(fake_dates,columns=['Date'])
>>> print(df.tabulate())
---  ----------
0    2017-10-21
1    2017-11-09
2    2017-03-12
3    2016-10-28
4    2013-12-29
5    2018-08-25
6    2012-01-19
7    2015-03-12
8    2011-08-21
9    2010-04-01
...  ...
990  2010-10-23
991  2017-03-30
992  2014-03-11
993  2013-02-18
994  2019-10-12
995  2018-11-05
996  2012-06-22
997  2010-05-30
998  2019-12-11
999  2017-08-04
---  ----------
Github Flavored Markdown:
>>> print(df.tabulate(headers='keys', tablefmt="github",))
|     | Date       |
|-----|------------|
| 0   | 2017-10-21 |
| 1   | 2017-11-09 |
| 2   | 2017-03-12 |
| 3   | 2016-10-28 |
| 4   | 2013-12-29 |
| 5   | 2018-08-25 |
| 6   | 2012-01-19 |
| 7   | 2015-03-12 |
| 8   | 2011-08-21 |
| 9   | 2010-04-01 |
| ... | ...        |
| 990 | 2010-10-23 |
| 991 | 2017-03-30 |
| 992 | 2014-03-11 |
| 993 | 2013-02-18 |
| 994 | 2019-10-12 |
| 995 | 2018-11-05 |
| 996 | 2012-06-22 |
| 997 | 2010-05-30 |
| 998 | 2019-12-11 |
| 999 | 2017-08-04 |
HTML:
>>> print(df.tabulate(headers='keys', tablefmt="html",))
<table>
<thead>
<tr><th>   </th><th>Date      </th></tr>
</thead>
<tbody>
<tr><td>0  </td><td>2017-10-21</td></tr>
<tr><td>1  </td><td>2017-11-09</td></tr>
<tr><td>2  </td><td>2017-03-12</td></tr>
<tr><td>3  </td><td>2016-10-28</td></tr>
<tr><td>4  </td><td>2013-12-29</td></tr>
<tr><td>5  </td><td>2018-08-25</td></tr>
<tr><td>6  </td><td>2012-01-19</td></tr>
<tr><td>7  </td><td>2015-03-12</td></tr>
<tr><td>8  </td><td>2011-08-21</td></tr>
<tr><td>9  </td><td>2010-04-01</td></tr>
<tr><td>...</td><td>...       </td></tr>
<tr><td>990</td><td>2010-10-23</td></tr>
<tr><td>991</td><td>2017-03-30</td></tr>
<tr><td>992</td><td>2014-03-11</td></tr>
<tr><td>993</td><td>2013-02-18</td></tr>
<tr><td>994</td><td>2019-10-12</td></tr>
<tr><td>995</td><td>2018-11-05</td></tr>
<tr><td>996</td><td>2012-06-22</td></tr>
<tr><td>997</td><td>2010-05-30</td></tr>
<tr><td>998</td><td>2019-12-11</td></tr>
<tr><td>999</td><td>2017-08-04</td></tr>
</tbody>
</table>

Check out: https://pypi.org/project/tabulate/ for formats and faeatures documentation.

train_test_val_split

Split the data frame into random train, test and val subsets. Uses sklearn.model_selection.train_test_split to split the data into train/test then splits train furtuer to train/val

Example Usage:

>>> train, test, val = df.train_test_val_split()
>>> print(train.shape, test.shape, val.shape)
(562, 1) (250, 1) (188, 1)

Compatible with train_test_split's parameters

>>> train, test, val = df.train_test_val_split(test_size=0.30)
>>> print(train.shape, test.shape, val.shape)
(490, 1) (300, 1) (210, 1)

split_dates

Split a date column into multiple columns for day, month and year.

df.split_dates('Date')
Date day month year
0 2018-08-28 28 8 2018
1 2013-08-23 23 8 2013
2 2011-05-21 21 5 2011
3 2011-03-01 1 3 2011
4 2014-09-29 29 9 2014
5 2018-05-13 13 5 2018
6 2010-05-15 15 5 2010
7 2015-12-27 27 12 2015
8 2011-06-13 13 6 2011
9 2018-05-15 15 5 2018
... ... ... ... ...
990 2017-08-02 2 8 2017
991 2010-10-31 31 10 2010
992 2012-02-25 25 2 2012
993 2010-08-29 29 8 2010
994 2014-09-11 11 9 2014
995 2018-08-18 18 8 2018
996 2019-09-02 2 9 2019
997 2011-10-07 7 10 2011
998 2010-01-11 11 1 2010
999 2018-12-05 5 12 2018

With custom prefix:

df.split_dates('Date', prefix='date_')
Date date_day date_month date_year
0 2018-11-28 28 11 2018
1 2012-10-14 14 10 2012
2 2019-04-22 22 4 2019
3 2015-08-03 3 8 2015
4 2011-11-28 28 11 2011
5 2016-01-15 15 1 2016
6 2018-02-01 1 2 2018
7 2019-08-07 7 8 2019
8 2010-10-07 7 10 2010
9 2019-10-07 7 10 2019
... ... ... ... ...
990 2017-12-19 19 12 2017
991 2016-10-27 27 10 2016
992 2010-10-31 31 10 2010
993 2013-05-02 2 5 2013
994 2019-07-04 4 7 2019
995 2014-03-23 23 3 2014
996 2012-05-23 23 5 2012
997 2014-09-12 12 9 2014
998 2018-03-18 18 3 2018
999 2013-10-02 2 10 2013