acowley/Frames

Clunkiness

Opened this issue · 6 comments

Rasmus Bååth wrote an interesting blog: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/ where he remarks that Pandas feels clunky compared to R's frames. So I thought I would try to reproduce it using Frames and https://hackage.haskell.org/package/Frames-map-reduce (@adamConnerSax) hoping it would be as easy as R. This is as far as I got but it doesn't seem as slick as the R. Does anyone have any better ideas?

{-# OPTIONS_GHC -Wall            #-}

{-# LANGUAGE TypeApplications    #-}
{-# LANGUAGE TemplateHaskell     #-}
{-# LANGUAGE DataKinds           #-}
{-# LANGUAGE FlexibleContexts    #-}
{-# LANGUAGE FlexibleInstances   #-}
{-# LANGUAGE OverloadedStrings   #-}
{-# LANGUAGE ExplicitForAll      #-}
{-# LANGUAGE TypeOperators       #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies        #-}

module Main (main) where

import           Frames
import qualified Data.Foldable as F
import qualified Data.List as L
import           Lens.Micro.Extras

import           Control.Foldl (fold, sum)
import qualified Control.Foldl as Foldl
import           Prelude hiding (sum)

import qualified Frames.MapReduce as FMR
import qualified Frames.Folds as FF

tableTypes "Purchases" "purchases.csv"

loadBenchmarks :: IO (Frame Purchases)
loadBenchmarks = inCoreAoS (readTable "purchases.csv")

unpack :: FMR.Unpack Purchases Purchases
unpack = FMR.unpackFilterOnField @Country (const True)

assign :: FMR.Assign (Record '[Country]) Purchases (Record '[Amount, Discount])
assign = FMR.splitOnKeys @'[Country]

reduce :: FMR.Reduce (Record '[Country])
                     (Record '[Amount, Discount])
                     (Frame Purchases)
reduce = FMR.foldAndAddKey $ (FF.foldAllConstrained @Num @'[Amount, Discount]) sum

mrFold :: FMR.Fold Purchases (Frame Purchases)
mrFold = FMR.concatFold $ FMR.mapReduceFold unpack assign reduce

rhead :: Show a => Frame a -> IO ()
rhead  = \ms -> mapM_ print (((take 6) . F.toList) ms)

main :: IO ()
main = do
  ms <- loadBenchmarks
  rhead ms
  print $ fold sum (view amount <$> ms)
  let result = FMR.fold mrFold ms
  putStrLn $ (L.intercalate "\n" $ fmap show $ fold Foldl.list result)

which gives

{country :-> "USA", amount :-> 2000, discount :-> 10}
{country :-> "USA", amount :-> 3500, discount :-> 15}
{country :-> "USA", amount :-> 3000, discount :-> 20}
{country :-> "Canada", amount :-> 120, discount :-> 12}
{country :-> "Canada", amount :-> 180, discount :-> 18}
{country :-> "Canada", amount :-> 3100, discount :-> 21}
17210
{country :-> "Australia", amount :-> 600, discount :-> 60}
{country :-> "Brazil", amount :-> 460, discount :-> 46}
{country :-> "Canada", amount :-> 3400, discount :-> 51}
{country :-> "France", amount :-> 500, discount :-> 50}
{country :-> "Germany", amount :-> 570, discount :-> 57}
{country :-> "India", amount :-> 720, discount :-> 72}
{country :-> "Italy", amount :-> 630, discount :-> 63}
{country :-> "Japan", amount :-> 690, discount :-> 69}
{country :-> "Spain", amount :-> 660, discount :-> 66}
{country :-> "UK", amount :-> 480, discount :-> 48}
{country :-> "USA", amount :-> 8500, discount :-> 45}

That blog post is great, thank you for sharing it! The conciseness of the R version is definitely inspirational, and worth aiming for. I’m looking forward to spending some time with it.

I added another of Rasmus' analyses

purchases |>
filter(amount <= median(amount) * 10) |> #👈
group_by(country) |>
summarize(total = sum(amount - discount))

reduce1 :: FMR.Reduce (Record '[Country])
                      (Record '[Amount, Discount])
                      (Frame Purchases)
reduce1 = FMR.foldAndAddKey $ aggDataFold1

aggDataFold1 :: FMR.Fold (Record '[Amount, Discount]) (Record '[Amount, Discount])
aggDataFold1 =
  let t = Foldl.premap (\r -> rgetField @Amount r - rgetField @Discount r) sum
      d = Foldl.premap (rgetField @Discount) sum
  in FF.sequenceRecFold $ FF.toFoldRecord t V.:&
                          FF.toFoldRecord d V.:&
                          V.RNil

mrFold1 :: FMR.Fold Purchases (Frame Purchases)
mrFold1 = FMR.concatFold $ FMR.mapReduceFold unpack assign reduce1

I'd like to have a field called total but it seems all I can do is manipulate the amount.

Instead of

{country :-> "Australia", amount :-> 600, discount :-> 60}
{country :-> "Brazil", amount :-> 460, discount :-> 46}
{country :-> "Canada", amount :-> 3400, discount :-> 51}
{country :-> "France", amount :-> 500, discount :-> 50}
{country :-> "Germany", amount :-> 570, discount :-> 57}
{country :-> "India", amount :-> 720, discount :-> 72}
{country :-> "Italy", amount :-> 630, discount :-> 63}
{country :-> "Japan", amount :-> 690, discount :-> 69}
{country :-> "Spain", amount :-> 660, discount :-> 66}
{country :-> "UK", amount :-> 480, discount :-> 48}
{country :-> "USA", amount :-> 8500, discount :-> 45}

I can now get the discounted amount

{country :-> "Australia", amount :-> 540, discount :-> 60}
{country :-> "Brazil", amount :-> 414, discount :-> 46}
{country :-> "Canada", amount :-> 3349, discount :-> 51}
{country :-> "France", amount :-> 450, discount :-> 50}
{country :-> "Germany", amount :-> 513, discount :-> 57}
{country :-> "India", amount :-> 648, discount :-> 72}
{country :-> "Italy", amount :-> 567, discount :-> 63}
{country :-> "Japan", amount :-> 621, discount :-> 69}
{country :-> "Spain", amount :-> 594, discount :-> 66}
{country :-> "UK", amount :-> 432, discount :-> 48}
{country :-> "USA", amount :-> 8455, discount :-> 45}

@adamConnerSax I copied this from one of your examples but it does feel clunky in comparison to the R:

aggDataFold1 :: FMR.Fold (Record '[Amount, Discount]) (Record '[Amount, Discount])
aggDataFold1 =
  let t = Foldl.premap (\r -> rgetField @Amount r - rgetField @Discount r) sum
      d = Foldl.premap (rgetField @Discount) sum
  in FF.sequenceRecFold $ FF.toFoldRecord t V.:&
                          FF.toFoldRecord d V.:&
                          V.RNil

So R or maybe tidyverse has a concept of grouped frames

> attributes(purchases |> group_by(country))
attributes(purchases |> group_by(country))
$class
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32

$names
[1] "country"  "amount"   "discount"

$spec
cols(
  country = col_character(),
  amount = col_double(),
  discount = col_double()
)

$problems
<pointer: 0x60000365b7d0>

$groups
# A tibble: 11 × 2
   country         .rows
   <chr>     <list<int>>
 1 Australia         [3]
 2 Brazil            [2]
 3 Canada            [3]
 4 France            [3]
 5 Germany           [3]
 6 India             [3]
 7 Italy             [3]
 8 Japan             [3]
 9 Spain             [3]
10 UK                [3]
11 USA               [3]

Sorry to chime in late here. I love that you're trying these comparisons! And, selfishly, that you are working with Frames-map-reduce, because I too think the API needs work.

I often don't focus on the issues you point out here because I am more stymied by the type-inference issues involved in writing more general purpose functions, ones with some columns left as type-parameters. But it would be great to have a better API for these sorts of explorations.

Anyway, some observations:

  1. map-reduce-folds and, thus, Frames-map-reduce, have a noUnpack function so you don't need the unpack you defined. In fact, I so rarely use unpack, I wonder if there should be an interface that doesn't require it?
  2. I love it when the field-level-fold can be expressed via foldAllConstrained! Otherwise, specifying the field-by-field folds is awkward. This is, I think, because Haskell lacks true row-types. Vinyl/Frames is a lovely substitute but the type-level-list and type-indexed-list machinery is clunky.
  3. Lens might help a bit here, allowing more concise expressions of functions like
    (\r -> -> rgetField @Amount r - rgetField @Discount r) as (\r -> r ^. amount - r ^. discount) and probably something even slicker.
  4. Not sure if it's helpful, but I have an unreleased layer over Frames, Frames-streamy that adds a few bells & whistles. And is, at least on my machine, at least twice as fast to load CSV data. It also allows column-selection and renaming at load time which makes things faster and more concise. And it has output functions which are useful for building formatted CSV output, though the API there has the same clunkiness as the field-by-field folds, only this time for field-by-field formatting. I've not released it because we really ought to figure out a way to get it into Frames proper but it would involve breaking changes for both the speed increases (a switch to Streamly from Pipes for the internal streaming) and larger changes for the column-selection since that adds fields to the RowGen type.
  5. One thing I'm not sure of, is how well R does at merging the folds. My guess is fairly well since it's so heavily used by people working with large data sets. But I like that the Haskell, expressed as a fold, is explicitly doing this work in two passes (the grouping pass and the reduce pass) over the data. I think R makes this possible as well, but less directly.
  6. I think the grouped tables more-or-less amounts to having a data-type to hold the result of the unpack and assign steps in the the map-reduce-fold. Which we could do! It would be a list (or Seq or Streamly.Stream) of (key, value) pairs. The type would be parameterized by the key and value types. Then we could have operations which act on such things. We have them already, actually, Reduce. It would still be clunkier than R because the grouped data would not be a Frame but something quite different. But I suppose some type classes might make some operations available on Frames and "grouped Frames"? Not sure if that's a good idea...
  7. So maybe you are pointing to a different API! One that simplifies operating at the interface between assign (group) and reduce. And a more concrete "ungroup"--an operation that is harder in Haskell though straightforward in the Frames version. Though maybe if we require the grouping operation (assign) to be an isomorphism between the plain row and the grouped row? Let me think about how that might work.

@adamConnerSax thanks for the feedback. I came up with a way which doesn't use Frames-map-reduce. I think this (below) is not too far off the R version and has the advantage that we can see the types.

It's annoying that folds does not have a median function.

A Simple Example
----------------

> rhead :: Show a => Frame a -> IO ()
> rhead  = \ms -> mapM_ print (((take 6) . F.toList) ms)

> frame :: FL.Fold a (Frame a)
> frame = FL.Fold (\x a -> x . (Frame 1 (const a) <>)) id ($ mempty)

> data GroupedFrame k t b = GroupedFrame (Map.Map k (t b))

> instance (Show a, Foldable t) => Show (GroupedFrame k t a) where
>   show (GroupedFrame x) = L.intercalate "\n" $ map (L.intercalate "\n") $ fmap (fmap show) $
>                           fmap (fold FL.list) $
>                           map F.toList $ F.toList $ Map.elems x

> data OrdinaryFrame t b = OrdinaryFrame (t b)

> instance (Show b, Foldable t) => Show (OrdinaryFrame t b) where
>   show (OrdinaryFrame x) = L.intercalate "\n" $ fmap show $ fold FL.list x

> test :: IO ()
> test = do
>   ms <- loadBenchmarks

How much do we sell? Let’s take the total sum!

>   putStrLn $ show $ FL.fold FL.sum ((^. amount) <$> ms)

Ah, they wanted it by country...

>   let groupedByCountry :: Map.Map (Record '[Country]) (Frame Purchases)
>       groupedByCountry = FL.fold (FL.groupBy (rcast @'[Country]) frame) ms

>   putStrLn $ show $ GroupedFrame groupedByCountry

>   let countrySum :: Map.Map (Record '[Country]) Int
>       countrySum = Map.map (\ns -> (FL.fold FL.sum ((^. amount) <$> ns))) groupedByCountry
>   let summary :: Frame (Record '[Country, Total])
>       summary = toFrame $ Map.mapWithKey (\k x -> (k ^. country) &: x &: V.RNil) countrySum

>   putStrLn $ show $ OrdinaryFrame summary

And I guess I should deduct the discount.

>   let countrySumDscnt :: Map.Map (Record '[Country]) Int
>       countrySumDscnt =
>         Map.map (FL.fold (FL.premap (\r -> r ^. amount - r ^. discount) FL.sum)) groupedByCountry

>   let summaryDscnt :: Frame (Record '[Country, Total])
>       summaryDscnt = toFrame $ Map.mapWithKey (\k x -> (k ^. country ) &: x &: V.RNil) countrySumDscnt

>   putStrLn $ show $ OrdinaryFrame summaryDscnt

Let’s remove everything 10x larger than the mean using the mean within each country.

>   let countryMean :: Map.Map (Record '[Country]) Double
>       countryMean =
>         Map.map (\ns -> (FL.fold FL.mean (fmap fromIntegral $ (^. amount) <$> ns))) groupedByCountry

>   let filteredGroupByCountry :: Map.Map (Record '[Country]) (Frame Purchases)
>       filteredGroupByCountry = Map.map (uncurry f) (Map.intersectionWith (,) groupedByCountry countryMean)
>         where
>           f :: Frame Purchases -> Double -> Frame Purchases
>           f df m = filterFrame (\r -> fromIntegral (r ^. amount) <= 10.0 * m) df

>   putStrLn $ show $ GroupedFrame filteredGroupByCountry
median :: (Fractional a, Ord a) => FL.Fold a a
median = FL.Fold (flip L.insert) [] f
  where
    f [] = error ""
    f xs | odd l = xs !! m
         | otherwise = (xs !! (m - 1) + xs !! m) / 2
      where l = length xs
            m = l `div` 2