This package compares streamly, a blazing fast streaming library with native high-level, declarative, and composable concurrency support, against popular streaming libraries, e.g. vector, streaming, pipes and conduit. The package was motivated by streamly; however, it is general purpose and compares more libraries and benchmarks than shown here. Please send an email or a pull request if the benchmarking code has a problem or is unfair to some library in any way.
A stream of one million consecutive numbers is generated using the monadic unfold API `unfoldrM`; these elements are then processed using the streaming combinator under test (e.g. `map`). The total time to process all one million operations and the maximum resident set size (rss) are measured and plotted for each library. The underlying monad for each stream is the IO monad. All the libraries are compiled with GHC-8.4.3. All the benchmarks were run on an Apple MacBook Pro with a single 2.2 GHz Intel Core i7 processor (4 cores) and 16 GB RAM.
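As a rough sketch of the shape of each benchmark, here is how the approach might look with the `vector` stream API (a minimal illustration only; the real harness also measures time and rss, and the names `source` and `step` are invented here, not taken from the benchmark code):

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S

-- One million consecutive numbers produced by a monadic unfold.
source :: Int -> S.Stream IO Int
source from = S.unfoldrM step from
  where
    step i
        | i >= from + 1000000 = return Nothing
        | otherwise           = return (Just (i, i + 1))

-- Apply the combinator under test (here `map`) and drain the
-- stream in the IO monad.
main :: IO ()
main = S.mapM_ (\_ -> return ()) (S.map (+ 1) (source 1))
```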
- `streamly` shows the best overall performance in terms of time as well as space.
- `streamly` and `vector` show similar performance except for the `append` operation, where `streamly` is much better, and the `filter` operation, where `vector` is faster.
- The `append` operation scales well only for `streamly` and `conduit`. All other libraries show quadratic complexity on this operation.
- `streaming` performs slightly better than `conduit` when multiple operations are composed together, even though in terms of individual operations it is slightly worse than `conduit`.
- `conduit` and `pipes` show unusually large space utilization for the `take` and `drop` operations (more than 100-150 MiB vs 3 MiB).
- `drinkery` shows very good performance too, though it is not plotted here because of a small issue in measurement and lack of space.
- `machines` is roughly 2x slower than the slowest library here, and its maximum resident set size is close to 100 MiB for all operations (touching 300 MiB for `take`) compared to 3 MiB for all other libraries. I am not sure if there is something wrong with the measurements or the benchmarking code; the majority of the code is common to all libraries. Any improvements in the machines benchmarking code are welcome.
The following diagram plots the time taken by key streaming operations to process a million stream elements. Note: the time for streamly and vector is very low (600-700 microseconds) and therefore can barely be seen in this graph.
For those interested in the heap allocations, the following diagram plots the overall heap allocations during each measurement period, i.e. the total allocations for processing one million stream elements.
The following diagram plots the maximum resident set size (rss) during the measurement of each operation. In plain terms, it is the maximum amount of physical memory that is utilized at any point during the measurement.
Benchmark | Description |
---|---|
drain | Discards all the elements in the stream |
drop-all | Drops all the elements using the `drop` operation |
last | Extracts the last element of the stream |
fold | Sums all the numbers in the stream |
map | Increments each number in the stream by 1 |
take-all | Uses `take` to retain all the elements in the stream |
filter-even | Uses `filter` to keep the even numbers and discard the odd numbers in the stream |
scan | Scans the stream using the `+` operation |
mapM | Transforms the stream using a monadic action |
zip | Combines the corresponding elements of two streams together |
A million single-element streams are created and appended together to create a stream of a million elements. The total time taken by this operation is measured. Note that vector, streaming and pipes show quadratic complexity (O(n^2)) on this benchmark and do not finish in a reasonable time; the time shown in the graph for these libraries is just indicative, the actual time taken is much higher.
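For concreteness, one possible shape of this benchmark with the `vector` stream API (a hypothetical sketch; the name `appended` is invented here). With right-nested appends every element of a later stream passes through all the enclosing append dispatch layers, which is where the quadratic behavior comes from:

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S

-- A million single-element streams appended into one stream.
appended :: S.Stream IO Int
appended = foldr (S.++) S.empty (map S.singleton [1 .. 1000000])

main :: IO ()
main = S.mapM_ (\_ -> return ()) appended
```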
A stream of a million elements is generated using `unfoldrM` and then converted to a list.
A stream operation, or a combination of stream operations, is performed four times in a row to measure how the composition scales for each library. A million elements are passed through the composition.
Note: the time for streamly and vector is very low (600-700 microseconds) and therefore can barely be seen in this graph.
Benchmark | Description |
---|---|
mapM | `mapM` four times in a row |
all-in-filters | Four filters in a row, each passing all the elements through |
map-with-all-in-filter | `map` followed by `filter`, composed four times serially |
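For example, the mapM composition above might be written as follows with the `vector` stream API (a sketch; `composed` and `step` are names invented here):

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S

-- The same monadic transformation applied four times in a row.
composed :: S.Stream IO Int -> S.Stream IO Int
composed = S.mapM step . S.mapM step . S.mapM step . S.mapM step
  where
    step x = return (x + 1)
```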
To quickly compare packages:
```
# Chart all the default packages
./run.sh --quick

# Compare a given list of packages
# Available package names are: streamly, vector, streaming, pipes,
# conduit, machines, drinkery, list, pure-vector
./run.sh --quick --select "streamly,vector"

# Show full results for the first package and delta from that for
# the rest of the packages
./run.sh --quick --select "streamly,vector" --delta
```
After running, you can find the generated charts in the `charts` directory.
If you have the patience to wait longer for the results, remove the `--quick` option; the results are likely to be a tiny bit more accurate.
The `list` package above is the standard Haskell lists from the base package, and `pure-vector` is the vector package using the pure API instead of the monadic API.
Note that if different optimization flags are used on different packages, performance can sometimes suffer badly because GHC inlining and specialization do not work optimally. If you want to be absolutely sure that all packages and dependencies are compiled with the same optimization flags (`-O2`), use `run.sh --pedantic`; it will install the stack snapshot in a private directory under the current directory and build everything fresh with the GHC flags specified in `stack-pedantic.yaml`. Be aware that this will require 1-2 GB of extra disk space.
It is trivial to add a new package. This is what a benchmark file for a streaming package looks like. Pull requests are welcome; I will be happy to help, just join the gitter chat and ask!
Benchmarking is a tricky business. Though the benchmarks have been carefully designed, there may still be issues with the way benchmarking is done or the way the benchmarks are coded. If you find that something is being measured unfairly or incorrectly, please bring it to our notice by raising an issue, sending an email, or via the gitter chat.
**Benchmarking Tool**: We use the gauge package for measurements instead of criterion. There were several issues with criterion that we fixed in gauge to get correct results. Each benchmark is run in a separate process to avoid any interaction between benchmarks.
**IO Monad**: We run the benchmarks in the IO monad so that they are close to real life usage. Note that most existing streaming benchmarks use pure code or the Identity monad, which may produce entirely different results.

- `unfoldrM` is used to generate the stream for two reasons: (1) it is monadic, and (2) it reduces the generation overhead so that the cost of the actual streaming operation is amplified. If we generated the stream from a list, there would be a significant overhead in the generation itself because of the intermediate list structure.
- Unless we perform some real IO operation, the operation being benchmarked can get completely optimized out in some cases. We generate a random number in the IO monad and feed it to the operation being benchmarked to avoid that issue, as sketched after this list.
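A minimal sketch of that trick (not the harness's actual code; the bounds here are made up):

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S
import System.Random (randomRIO)

main :: IO ()
main = do
    -- A runtime-supplied random starting value keeps GHC from
    -- constant-folding the whole pipeline away.
    n <- randomRIO (1, 1000)
    S.mapM_ (\_ -> return ()) (S.map (+ 1) (S.unfoldrM step n))
  where
    step i = return (if i > 1000000 then Nothing else Just (i, i + 1))
```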
**Inlining**: The GHC simplifier is very fragile, and inlining may affect the results in unpredictable ways unless you have spent enough time scrutinizing and optimizing everything carefully. Inlining is the biggest source of fragility in performance benchmarking; it can easily result in an order of magnitude drop in performance just because some operation is not correctly inlined. Note that this applies to the benchmarking code as well.

**GHC Optimization Flags**: To make sure we are comparing fairly, we compile the benchmarking code, the library code, and all dependencies using exactly the same GHC flags. GHC inlining and specialization optimizations can make the code unpredictable if mixed flags are used. See the `--pedantic` option of the `run.sh` script.

**Single file vs multiple files**:
The best way to avoid issues is to have all the benchmarking code in a single file. However, in real life that is not the case, and we also needed some modularity to scale the benchmarks to an arbitrary number of libraries, so we split the code into one file per package. Some observations:

- As soon as the code was split into multiple files, the performance of some libraries dropped, in some cases by 3-4x. Careful sprinkling of INLINE pragmas was required to bring it back to the original level; even functions that were just 2 lines of code were not automatically inlined (see the sketch after this list).
- When all the code was in a single file, not a single INLINE pragma was needed. But when it was split across multiple files, even functions that were not exported from a file needed an INLINE pragma for equivalent performance. This is something that GHC may have to look at.
- The effect of inlining varied depending on the library. To make sure that we are using the fully optimized combination of inlined and non-inlined code for each library, we carefully studied the impact of inlining individual operations for each package. The current code is the best we could get for each package.
- There is something magical about streamly; we are not sure what it is. Even though all other libraries were impacted significantly for many operations, streamly seemed almost unaffected by splitting the benchmarking operations into a separate file! If we can find out why that is so, we could perhaps understand and use GHC inlining in a more predictable manner. Edit: CPS seems to be more immune to inlining; as soon as streamly started using direct style, it too became sensitive to inlining.
- This kind of unpredictable, non-uniform impact of moving functions to different files shows that we are at the mercy of the GHC simplifier and always need to tune performance carefully after refactoring, to be sure that everything is fine. In other words, benchmarking and optimizing are crucial not just for the libraries but for the users of the libraries as well.
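To illustrate, a hypothetical per-package helper of the kind discussed above (the name `mapBench` is invented here):

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S

-- A tiny benchmark op that lives in its own module. Without the
-- INLINE pragma, small cross-module helpers like this cost some
-- libraries 3-4x in these measurements.
{-# INLINE mapBench #-}
mapBench :: S.Stream IO Int -> IO ()
mapBench = S.mapM_ (\_ -> return ()) . S.map (+ 1)
```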
There are two dual paradigms for stream processing in Haskell. In the first paradigm we represent a stream as a data type and use functions to work on it. In the second paradigm we represent stream processors as data types and provide them individual data elements to process; there is no explicit representation of the stream as a data type. In the first paradigm we work with a data representation, and in the second we work with a function representation. Both of these paradigms have equal expressive power. The latter uses monadic composition for data flow, whereas the former does not need monadic composition for straight-line stream processing and can therefore use it for higher level composition, e.g. to compose streams in a product style.
To see an example of the first paradigm, let us use the `vector` package to represent a monadic stream of integers as `Stream IO Int`. This data representation of the stream is passed explicitly to stream processing functions like `filter` and `drop` to manipulate it:
```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S

stream :: S.Stream IO Int
stream = S.fromList [1..100]

main :: IO ()
main = do
    let str = (S.filter even . S.drop 10) stream
    S.toList str >>= putStrLn . show
```
Pure lists and vectors are the most basic examples of streams in this paradigm. The streaming IO libraries just extend the same paradigm to monadic streaming. The API of these libraries is very similar to that of lists, with a monad parameter added.
The second paradigm is the direct opposite of the first one: there is no stream representation in this paradigm; instead, we represent stream processors as data types. A stream processor represents a particular process rather than data, and we compose processors together to create composite processors. We can call them stream transducers or simply pipes. Using the `machines` package:
```haskell
import qualified Data.Machine as S

producer :: S.SourceT IO Int
producer = S.enumerateFromTo 1 100

main :: IO ()
main = do
    let processor = producer S.~> S.dropping 10 S.~> S.filtered even
    S.runT processor >>= putStrLn . show
```
Both of these paradigms look almost the same, right? To see the difference let's take a look at some types. In the first paradigm we have an explicit stream type and the processing functions take the stream as input and produce the transformed stream:
```haskell
stream :: S.Stream IO Int
filter :: Monad m => (a -> Bool) -> Stream m a -> Stream m a
```
In the second paradigm, there is no stream data type; there are stream processors. Let's call them boxes that represent a process: a `SourceT` box represents a single-ended producer, a `Process` box, or pipe, has two ends, an input end and an output end, and a `MachineT` represents any kind of box. We put these boxes together using the `~>` operator and then run the resulting machine using `runT`:
```haskell
producer :: S.SourceT IO Int
filtered :: (a -> Bool) -> Process a a
dropping :: Int -> Process a a
(~>) :: Monad m => MachineT m k b -> ProcessT m b c -> MachineT m k c
```
Custom pipes can be created using monadic composition and primitives to receive and send data, usually called `await` and `yield`.
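For example, with the `conduit` package a custom pipe built from these primitives might look like this (a minimal sketch; the name `evenPlusOne` is invented here, while `await`, `yield`, `runConduit`, `yieldMany`, `.|` and `mapM_C` are conduit's own API):

```haskell
import Conduit
import Control.Monad (when)

-- A custom pipe written with the monadic await/yield primitives:
-- keep only the even numbers, incrementing each by one.
evenPlusOne :: Monad m => ConduitT Int Int m ()
evenPlusOne = do
    mx <- await                            -- receive from upstream
    case mx of
        Nothing -> return ()               -- upstream exhausted
        Just x  -> do
            when (even x) $ yield (x + 1)  -- send downstream
            evenPlusOne

main :: IO ()
main = runConduit $ yieldMany [1 .. 10 :: Int] .| evenPlusOne .| mapM_C print
```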
Streaming libraries using the direct paradigm. | |
---|---|
Library | Remarks |
vector | The simplest in this category; provides transformation and combining of monadic streams, but no monadic composition of streams. Provides a very simple, list-like API. |
streaming | |
list-t | Provides straight-line composition of streams as well as a list-like monadic composition. The API is simple, just like vector. |
streamly | Like list-t; in addition to straight-line composition it provides a list-like monadic composition of streams, supports combining streams concurrently, and supports concurrent applicative and monadic composition. The basic API is very much like lists and almost identical to vector streams. |
Streaming libraries using the pipes paradigm. | |
---|---|
Library | Remarks |
conduit | `await` and `yield` data to upstream or downstream pipes; supports pushing leftovers back. |
pipes | `await` and `yield` data to upstream or downstream pipes. |
machines | Can `await` from two sources, left and right. |
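As an aside, the leftover support mentioned for conduit above lets a pipe push an element back upstream. A small sketch (the name `peekNext` is invented here; `leftover` is conduit's own primitive):

```haskell
import Conduit

-- Look at the next element, then push it back upstream so the
-- following consumer still sees it.
peekNext :: Monad m => ConduitT i o m (Maybe i)
peekNext = do
    mx <- await
    case mx of
        Nothing -> return Nothing
        Just x  -> leftover x >> return (Just x)
```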