/danfoMojo

A javascript data analysis toolkit based on danfojs

Primary LanguageJavaScriptMIT LicenseMIT

danfoMojo: A javascript data analysis toolkit based on danfojs

What is it?

danfoMojo is a javascript package that provides a series utility functions to tidy and manipulate data frame based on danfojs. It is heavily inspired by both Pandas and tidyr library which means that users familiar with Pandas or tidyr from data science community, can easily pick up the functionalities of danfoMojo.

Main Features

  • dfd_unite_cols: Unite multiple columns into one by pasting strings together
  • dfd_separate_col: Separate a character column into multiple columns with a regular expression or numeric locations
  • dfd_groupby_pivot: dfd_groupby_pivot() takes an existing data frame and converts it into a grouped data frame where operations are performed "by group".
  • dfd_pivot_wider: "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is dfd_pivot_longer()
  • dfd_pivot_longer: increasing the number of rows and decreasing the number of columns. The inverse transformation is dfd_pivot_wider().

Installation

npm install danfomojo

Use cases

const dfm = require("danfomojo")

const dfd = require("danfojs")
const _ = require("lodash")


data = {'name':['Tom', 'nick', 'krish', 'jack', 'Mike','Jan'],
        'age':[20, 21, 19, 18, 21, 33],
       'group':['A','A', 'A', 'B', 'B', "B"],
       'major':['biology', 'english', 'biology', 'english', 'biology', 'biology'],
       'response':['good', 'good', 'ok', 'bad', 'bad','bad'],
       'date':['2020-04-01', '2021-03-12', '2021-03-12', '2020-04-01', '2020-04-01','2020-07-01']}

df = new dfd.DataFrame(data)
name age group major response date
0 Tom 20 A biology good 2020-04-01
1 nick 21 A english good 2021-03-12
2 krish 19 A biology ok 2021-03-12
3 jack 18 B english bad 2020-04-01
4 Mike 21 B biology bad 2020-04-01
5 Jan 33 B biology bad 2020-07-01

unite cols into one column

result = dfm.dfd_unite_cols(df, united_with = ["major","response"], pattern = "; ", name_united="new_concat_col")
result.print()
name age group major response date new_concat_col
0 Tom 20 A biology good 2020-04-01 biology; good
1 nick 21 A english good 2021-03-12 english; good
2 krish 19 A biology ok 2021-03-12 biology; ok
3 jack 18 B english bad 2020-04-01 english; bad
4 Mike 21 B biology bad 2020-04-01 biology; bad
5 Jan 33 B biology bad 2020-07-01 biology; bad

Separate a character column into multiple columns with a regular expression or numeric locations

result = dfm.dfd_separate_col(df, sep_by="date",new_col_name = ["Year", "Month", "Date"], pattern="-")
result.print()
name age group major response date year month
0 Tom 20 A biology good 01 2020 04
1 nick 21 A english good 12 2021 03
2 krish 19 A biology ok 12 2021 03
3 jack 18 B english bad 01 2020 04
4 Mike 21 B biology bad 01 2020 04
5 Jan 33 B biology bad 01 2020 07

dfd_groupby_pivot() takes an existing data frame and converts it into a grouped data frame where operations are performed "by group".

result_pivot = dfm.dfd_groupby_pivot(df, group_by = ['group',"major", "response"], pivot_at = ["age"], operator="mean")
result_pivot.print()
group major response age_mean
0 A biology good 20
1 A biology ok 19
2 A english good 21
3 B biology bad 27
4 B english bad 18

dfd_pivot_wider: "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is dfd_pivot_longer()

Let's use result_pivot (result from above) as input data frame for this use case.

result_wide = dfm.dfd_pivot_wider(result_pivot,  groupby_cols = ["group",  "response"], at_col = "major", value_col = "age_mean")
result_wide.print()
group response age_mean_biology age_mean_english
0 A good 20 21
1 A ok 19 NaN
2 B bad 27 18

dfd_pivot_longer: increasing the number of rows and decreasing the number of columns. The inverse transformation is dfd_pivot_wider().

Let's use result_wide (result from above) as input data frame for this use case.

result_long = dfm.dfd_pivot_longer(result_wide, keep_cols = ["group", "response"])
result_long.print()
group response new_col_name new_val_name
0 A good age_mean_biology 20
1 A ok age_mean_biology 19
2 B bad age_mean_biology 27
3 A good age_mean_english 21
4 A ok age_mean_english NaN
5 B bad age_mean_english 18