Built on top of pandas, sichuanMojo is a grammar of data manipulation for data frames, providing a consistent series of utility functions that help you solve the most common data manipulation challenges:
- `unite_col`: unite columns into one column
- `separate_col`: separate a column by a pattern into individual columns
- `groupby_col`: provide a summary table of a selected column grouped by desired column(s)
- `groupby_across`: provide a summary table of a selected column with basic statistical information (min, max, sum, median) grouped by desired column(s)
- `pivot_tb`: provide a pivot (count/frequency) table of a selected column grouped by desired column(s)
- `pivot_rate`: provide a pivot (rate/percentage) table of a selected column grouped by desired column(s)
- `simplify_network_df`: simplify a network dataframe from directed to undirected
Use the package manager pip to install sichuanMojo.
pip install sichuanMojo
Sichuan province is the panda's hometown, which explains the name of this library!
import sichuanMojo as sm
import pandas as pd
# Example input data
data = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike', 'Jan'],
        'age': [20, 21, 19, 18, 21, 33],
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'major': ['biology', 'english', 'biology', 'english', 'biology', 'biology'],
        'response': ['good', 'good', 'ok', 'bad', 'bad', 'bad']}

# Create DataFrame
test_df = pd.DataFrame(data)
| | name | age | group | major | response |
|---|---|---|---|---|---|
| 0 | Tom | 20 | A | biology | good |
| 1 | nick | 21 | A | english | good |
| 2 | krish | 19 | A | biology | ok |
| 3 | jack | 18 | B | english | bad |
| 4 | Mike | 21 | B | biology | bad |
| 5 | Jan | 33 | B | biology | bad |
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
sm.unite_col(test_df, unite_by = ["group", "major", "response"], pattern="; ", united_col_name="New_col", output_json=False)
| | name | age | group | major | response | New_col |
|---|---|---|---|---|---|---|
| 0 | Tom | 20 | A | biology | good | A; biology; good |
| 1 | nick | 21 | A | english | good | A; english; good |
| 2 | krish | 19 | A | biology | ok | A; biology; ok |
| 3 | jack | 18 | B | english | bad | B; english; bad |
| 4 | Mike | 21 | B | biology | bad | B; biology; bad |
| 5 | Jan | 33 | B | biology | bad | B; biology; bad |
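For reference, here is a minimal plain-pandas sketch of an equivalent operation (my assumption about what `unite_col` computes, not the library's actual implementation); it reuses the test_df defined above:

```python
import pandas as pd

def unite_sketch(df, unite_by, pattern="; ", united_col_name="New_col"):
    # Join the selected columns as strings, separated by `pattern`,
    # and attach the result as a new column.
    out = df.copy()
    out[united_col_name] = out[unite_by].astype(str).agg(pattern.join, axis=1)
    return out

# Reproduces the New_col values shown above, e.g. "A; biology; good"
unite_sketch(test_df, unite_by=["group", "major", "response"])
```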
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
# Test data
data2 = {'name': ['Tom', 'nick', 'krish', 'jack', 'Mike'],
         'age': [20, 21, 19, 18, 21],
         'group': ['A', 'A', 'A', 'B', 'B'],
         'major': ['biology', 'english', 'biology', 'english', 'biology'],
         'date': ['2022-01-13', '2022-06-23', '2022-03-12', '2022-08-23', '2022-09-13']}

# Create DataFrame
test_df2 = pd.DataFrame(data2)
| | name | age | group | major | date |
|---|---|---|---|---|---|
| 0 | Tom | 20 | A | biology | 2022-01-13 |
| 1 | nick | 21 | A | english | 2022-06-23 |
| 2 | krish | 19 | A | biology | 2022-03-12 |
| 3 | jack | 18 | B | english | 2022-08-23 |
| 4 | Mike | 21 | B | biology | 2022-09-13 |
sm.separate_col(test_df2, sep_by="date", pattern="-", sep_to_names=["year", "month", "date"], output_json=False)
| | name | age | group | major | date | year | month |
|---|---|---|---|---|---|---|---|
| 0 | Tom | 20 | A | biology | 13 | 2022 | 01 |
| 1 | nick | 21 | A | english | 23 | 2022 | 06 |
| 2 | krish | 19 | A | biology | 12 | 2022 | 03 |
| 3 | jack | 18 | B | english | 23 | 2022 | 08 |
| 4 | Mike | 21 | B | biology | 13 | 2022 | 09 |
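A plain-pandas sketch of the same split, using the test_df2 defined above (assumed behaviour; `separate_sketch` is a hypothetical helper, not part of sichuanMojo):

```python
def separate_sketch(df, sep_by, pattern, sep_to_names):
    # Split the target column on `pattern` and spread the pieces into
    # columns named by `sep_to_names`; a pre-existing column with the
    # same name (here "date") is overwritten, matching the table above.
    out = df.copy()
    parts = out[sep_by].str.split(pattern, expand=True)
    parts.columns = sep_to_names
    for name in sep_to_names:
        out[name] = parts[name]
    return out

separate_sketch(test_df2, sep_by="date", pattern="-", sep_to_names=["year", "month", "date"])
```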
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
The operator argument should be one of: mean, sum, min, max, median, unique, count, nunique.
sm.groupby_col(test_df, group_by = ["group", "major"], summmarize_at = "name", operator = "nunique", output_json = True)
#returns [{'group': 'A', 'major': 'biology', 'name': 2},
# {'group': 'A', 'major': 'english', 'name': 1},
# {'group': 'B', 'major': 'biology', 'name': 2},
# {'group': 'B', 'major': 'english', 'name': 1}]
sm.groupby_col(test_df, group_by = ["group", "major"], summmarize_at = "name", operator = "nunique")
| | group | major | name |
|---|---|---|---|
| 0 | A | biology | 2 |
| 1 | A | english | 1 |
| 2 | B | biology | 2 |
| 3 | B | english | 1 |
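The same summary can be produced with plain pandas; the snippet below is only an equivalent sketch, not necessarily how `groupby_col` is implemented:

```python
# Group by the chosen columns and apply the requested operator ("nunique")
# to the summarised column.
summary = (
    test_df.groupby(["group", "major"])["name"]
    .nunique()
    .reset_index()
)
print(summary)

# The JSON-style output corresponds to a records-oriented dict:
print(summary.to_dict(orient="records"))
```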
Provides a summary table of a selected column with basic statistical information (min, max, sum, median), grouped by desired column(s).
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
sm.groupby_across(test_df, ["group", "major"], summmarize_at="age", operation=None, output_json = False)
| | group | major | age_min | age_max | age_sum | age_median |
|---|---|---|---|---|---|---|
| 0 | A | biology | 19 | 20 | 39 | 19.5 |
| 1 | A | english | 21 | 21 | 21 | 21 |
| 2 | B | biology | 21 | 33 | 54 | 27 |
| 3 | B | english | 18 | 18 | 18 | 18 |
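A rough plain-pandas equivalent of the call above (my assumption about which statistics `groupby_across` computes):

```python
# min/max/sum/median of "age" for each group/major combination,
# with the summarised column's name used as a prefix.
across = (
    test_df.groupby(["group", "major"])["age"]
    .agg(["min", "max", "sum", "median"])
    .add_prefix("age_")
    .reset_index()
)
print(across)
```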
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
sm.pivot_tb(test_df, group_by=["group", "major"], summmarize_at="response", operation="count", output_json = False, na_fill=0)
| | group | major | bad | good | ok |
|---|---|---|---|---|---|
| 0 | A | biology | 0 | 1 | 1 |
| 1 | A | english | 0 | 1 | 0 |
| 2 | B | biology | 2 | 0 | 0 |
| 3 | B | english | 1 | 0 | 0 |
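The count table can be approximated with a pandas crosstab; this is a sketch of equivalent behaviour, not the library's code:

```python
import pandas as pd

# One row per group/major combination, one column per response value;
# combinations with no observations are filled with 0 (mirroring na_fill=0).
counts = pd.crosstab(
    [test_df["group"], test_df["major"]],
    test_df["response"],
).reset_index()
print(counts)
```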
Note: if arg output_json = True, the output will be in JSON format; otherwise, the output will be a data frame.
sm.pivot_rate(test_df, group_by=["major", "group"], summmarize_at="response", output_json = False, na_fill=0)
| | major | group | response_count_bad | response_count_good | response_count_ok | response_perc_bad | response_perc_good | response_perc_ok |
|---|---|---|---|---|---|---|---|---|
| 0 | biology | A | 0 | 1 | 1 | 0 | 50 | 50 |
| 1 | biology | B | 2 | 0 | 0 | 100 | 0 | 0 |
| 2 | english | A | 0 | 1 | 0 | 0 | 100 | 0 |
| 3 | english | B | 1 | 0 | 0 | 100 | 0 | 0 |
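A hedged sketch of how the counts and row-wise percentages could be derived with plain pandas (assumed to be equivalent to `pivot_rate`, not its actual implementation):

```python
import pandas as pd

counts = pd.crosstab([test_df["major"], test_df["group"]], test_df["response"])
perc = counts.div(counts.sum(axis=1), axis=0) * 100  # percentage within each row
result = pd.concat(
    [counts.add_prefix("response_count_"), perc.add_prefix("response_perc_")],
    axis=1,
).reset_index()
print(result)
```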
Note: if arg keep = "last", the output dataframe will keep the last row of duplicated rows. If arg keep = "first", the output dataframe will keep the first row of duplicated rows.
data = {'from': ['Tom', 'Jack', 'Jen', 'Sam'],
        'overlap': [20, 21, 19, 18],
        'to': ['Jack', 'Tom', 'Emily', 'John']}
test_df = pd.DataFrame(data)
| | from | overlap | to |
|---|---|---|---|
| 0 | Tom | 20 | Jack |
| 1 | Jack | 21 | Tom |
| 2 | Jen | 19 | Emily |
| 3 | Sam | 18 | John |
In test_df, the Tom <--> Jack pair is duplicated (it appears once in each direction).
sm.simplify_network_df(test_df, from_col = "from", to_col = "to", keep = "first")
| | from | overlap | to |
|---|---|---|---|
| 0 | Tom | 20 | Jack |
| 2 | Jen | 19 | Emily |
| 3 | Sam | 18 | John |
We can see that the output keeps only one Tom <--> Jack pair.
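For comparison, a minimal plain-pandas sketch of deduplicating undirected edges (an assumption about what `simplify_network_df` does, not its actual implementation):

```python
# Build an order-independent key from the two endpoint columns,
# then keep only the first occurrence of each undirected pair.
edge_key = test_df[["from", "to"]].apply(lambda row: tuple(sorted(row)), axis=1)
simplified = test_df[~edge_key.duplicated(keep="first")]
print(simplified)
```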