jtablesaw/tablesaw

Improvement suggestion: need an equivalent of Pandas value_counts()

minhster99 opened this issue · 1 comments

Hi

I have a bit of background using Python's Pandas and I've been evaluating Tablesaw from this perspective.

One of the most useful functions in Pandas is value_counts. It tells us more about the data in a column that holds a small set of distinct values, e.g. days of the week, months of the year, or enums. It is extremely useful during exploratory data work.

Here is an example in Pandas, with its output:

table['Marital_Status'].value_counts()
# groups the data by Marital_Status, counts each value, then sorts by count in descending order

Married     864
Together    580
Single      480
Divorced    232

To do the equivalent in Tablesaw:

table.summarize("Marital_Status", count).by("Marital_Status").sortDescendingOn("Count [Marital_Status]")
// note how the column name has to be repeated

 Marital_Status  |  Count [Marital_Status]  |
---------------------------------------------
        Married  |                     864  |
       Together  |                     580  |
         Single  |                     480  |
       Divorced  |                     232  |

It would be great if there were a convenience function like:

table.valueCounts("Marital_Status")

The important thing is not having to repeat the column name.

I have since discovered countBy which does precisely this!

table.countBy("Marital_Status")

Edit: OK, it doesn't sort, so you'll have to do that part yourself.
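
To pin down the behaviour I'm after (group, count, then sort by descending count), here is a rough sketch in plain Java with no Tablesaw involved. The class name ValueCountsSketch and the helper valueCounts are made up for illustration and are not part of the Tablesaw API; the sample data is invented too.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ValueCountsSketch {

    // Hypothetical helper (not a Tablesaw method): counts each distinct value
    // and returns the counts ordered from most to least frequent.
    static Map<String, Long> valueCounts(List<String> values) {
        // Step 1: group identical values and count them, keeping first-seen order.
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), LinkedHashMap::new, Collectors.counting()));

        // Step 2: sort the entries by count, descending. The sort is stable,
        // so values with equal counts stay in first-seen order.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the Marital_Status column.
        List<String> maritalStatus = List.of(
                "Married", "Single", "Married", "Together", "Married", "Together");
        valueCounts(maritalStatus).forEach((value, count) ->
                System.out.println(value + " " + count));
    }
}
```

The point of the sketch is only the shape of the call: the column (here, a list) is named once, and the result comes back already sorted.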