Simple-data-analysis benchmarks

Comparing the performance of different versions of simple-data-analysis with popular Python and R libraries for data analysis.

To test the performance of simple-data-analysis@2.0.1, we calculated the average temperature per decade and city using the daily temperatures from the Adjusted and Homogenized Canadian Climate Data (AHCCD).

We ran the same calculations with simple-data-analysis@1.8.1 (on both Node.js and Bun), Pandas (Python), and the tidyverse (R).

In each script, we perform the following steps (a code sketch follows the list):

  1. Load a CSV file (Importing)
  2. Select four columns, remove rows with missing temperatures, and convert date strings to dates and temperature strings to floats (Cleaning)
  3. Add a new decade column computed from each date (Modifying)
  4. Calculate the average temperature per decade and city (Summarizing)
  5. Write the cleaned-up data, from which the averages were computed, to a new CSV file (Writing)
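
To make these steps concrete, here is a minimal sketch of the same pipeline using Pandas, one of the libraries compared here. It is an illustration only: the column names (station, city, date, temp) and the output file name are assumptions, not the actual AHCCD headers or the exact script from this repository.

```python
import pandas as pd

# 1. Importing: load the CSV file
df = pd.read_csv("ahccd-samples.csv")

# 2. Cleaning: keep four columns, drop rows with missing temperatures,
#    and parse dates and temperatures (column names are hypothetical)
df = df[["station", "city", "date", "temp"]]
df = df.dropna(subset=["temp"])
df["date"] = pd.to_datetime(df["date"])
df["temp"] = df["temp"].astype(float)

# 3. Modifying: derive the decade from the date (e.g. 1987 -> 1980)
df["decade"] = df["date"].dt.year // 10 * 10

# 4. Summarizing: average temperature per decade and city
averages = df.groupby(["decade", "city"], as_index=False)["temp"].mean()

# 5. Writing: save the cleaned-up data to a new CSV file
df.to_csv("ahccd-cleaned.csv", index=False)
```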

Each script was run ten times on a MacBook Pro (Apple M1 Pro, 16 GB of RAM), and the durations were averaged.
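
A run-and-average harness can be as simple as the sketch below: launch a benchmark script ten times as a child process, time each run, and average the durations. This is only one possible setup, and the script path is a placeholder, not the repository's actual harness.

```python
import statistics
import subprocess
import time

RUNS = 10
durations = []

for _ in range(RUNS):
    start = time.perf_counter()
    # The path below is a placeholder for one of the benchmark scripts
    subprocess.run(["python", "pandas/main.py"], check=True)
    durations.append(time.perf_counter() - start)

print(f"Average duration over {RUNS} runs: {statistics.mean(durations):.2f} s")
```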

The charts displayed below come from this Observable notebook.

Small file

With ahccd-samples.csv:

  • 74.7 MB
  • 19 cities
  • 20 columns
  • 971,804 rows
  • 19,436,080 data points

As we can see, simple-data-analysis@1.8.1 was the slowest, but simple-data-analysis@2.0.1 is now the fastest.

[Chart: processing duration of the scripts in the various languages, small file]

Big file

With ahccd.csv:

  • 1.7 GB
  • 773 cities
  • 20 columns
  • 22,051,025 rows
  • 441,020,500 data points

The file was too big for simple-data-analysis@1.8.1, so it's not included here.

Again, simple-data-analysis@2.0.1 is the fastest option.

[Chart: processing duration of the scripts in the various languages, big file]