/easyLambda

distributed dataflows with functional list operations for data processing with C++14

Primary LanguageC++Boost Software License 1.0BSL-1.0

Build Status codecov

ezl: easyLambda

Parallel data processing made easy using functional and dataflow programming with modern C++

Welcome to easyLambda and thanks for your interest. The site aims to be a comprehensive guide for easyLambda.

What is easyLambda

EasyLambda is header only C++14 library for data processing in parallel with functional list operations (map, filter, reduce, scan, zip) that are tied together in type--safe dataflow.

EasyLambda is parallel, it scales from multiple cores to hundreds of distributed nodes without any need to deal with parallelism in user code.

EasyLambda is fast. It has minimal overhead in serial execution and builds upon high performance MPI parallelism that is known to be more efficient than any other comparable work [1].

EasyLambda is expressive and succinct, thanks to the column selection for composition of functions and many generic algorithms such as configurable parallel file reader, predicates, correlation, summary etc.

EasyLambda is intuitive and easy to understand with its uniform property based (or ExpressionBuilder) interface for everything from configuring parallelism to changing behavior of generic algorithms to routing dataflow.

EasyLambda is easily interoperable with other libraries like standard library or raw MPI code, since it uses standard data types and enforces no special structure, data-types or requirements on the user functions.

Why easyLambda

EasyLambda is a good fit for the following tasks:

  • table/list processing and analysis from CSV or flat text files.
  • post-processing of scientific simulation results.
  • running iterative machine learning algorithms.
  • parallel type-safe data reading.
  • to play with dataflow programming and functional list operations.

Since, it can smoothly interoperate with other libraries, it is possible to add distributed parallelism using easyLambda to the existing libraries or codebase when its programming abstraction fits well e.g. it can be used along with bare MPI code or with a machine learning library to add distributed training and testing.

EasyLambda will also interest you if you

  • are a modern C++ enthusiast
  • want to dabble with metaprogramming
  • like functional and dataflow programming
  • have cluster resources that you want to put to use in everyday tasks without much effort.
  • have always wanted a high-level MPI interface.

Benchmarks

EasyLambda combines the efficiency of MPI with a high level programming abstraction. With easyLambda you get easy to understand code with good run-time performance. Check out the benchmarks and comparisons for performance and ease of use.

benchmarks

Getting Started

Check out the Getting Started section of the library webpage to know how to install and begin with easyLambda. The library can also be used on aws elastic cloud or single instance.

Examples

A detailed walkthrough of the library is given here, The examples directory contains various examples and demonstrations with explanations of features and options.

Here we mention some examples in short.

The following program calculates frequency of each word in the data files.

auto reader = fromFile<string>(argv[1]).rowSeparator('s').colSeparator("");
ezl::rise(reader)
  .reduce<1>(ezl::count(), 0).dump()
  .run();

The dataflow pipeline starts with rise and subsequent operations are added to it. In the above example, the pipeline begins by reading in data from the specified file(s). fromFile is a library function that takes column types and the specified file(s) glob pattern as input and reads the file(s) in parallel. It has a lot of properties for controlling data-format, parallelism, denormalization etc (shown in demoFromFile).

In reduce we pass the index of the key column to group by, the library function for counting and initial value of the result.

Following is a dataflow for calculating pi using Monte-Carlo method.

ezl::rise(ezl::kick(10000)) // 10000 trials shared over all processes
  .map([] { 
    return pow(rnd(), 2) + pow(rnd(), 2);
  })
  .filter(ezl::lt(1.))
  .reduce(ezl::count(), 0)
  .map([](int inCircleCount) { 
    return (4.0 * inCircleCount / 10000); 
  }).dump()
  .run();

The dataflow starts with rise in which we pass a library function to call the next unit a number of times. The steps in the algorithm have been expressed with the composition of small operations, some are common library functions like count(), lt() (less-than) and some are user-defined functions specific to the problem.

Here is another example from cods2016. A stripped version of the input data-file is given with ezl here. The data contains student profiles with scores, gender, job-salary, city etc.

auto scores = ezl::fromFile<char, array<float, 3>>(fileName)
                .cols({"Gender", "English", "Logical", "Domain"})
                .colSeparator("\t");

ezl::rise(scores)
  .filter<2>(ezl::gtAr<3>(0.F))   // filter valid domain scores > 0
  .map<1>([] (char gender) {      // transforming with 0/1 for isMale
    return float(gender == 'm');
  }).colsTransform()
  .reduceAll(ezl::corr<1>())
    .dump("", "Corr. of gender with scores\n(gender|E|L|D)")
  .run();

The above example prints the correlation of English, logical and domain scores with respect to gender. We can find similarity of the above code with steps in a spreadsheet analysis or with SQL query. We select the columns to work with viz. gender and three scores. We filter the rows based on a column and predicate. Next, we transform a selected column in-place and then find an aggregate property (correlation) for all the rows.


Contributing

Suggestions and feedback are welcome. Feel free to contact via mail or issues for any query.

Some of the possible directions of improvement:

  • compile time optimization
  • use of specialized data structures in various units like reduce etc.
  • addition of more examples e.g. neural nets, simulations etc.
  • design simplifications
  • parallelism optimization
  • code reviews
  • documentation

Possible ideas for future extenstions:

  • fault tolerance
  • algorithms / functions to plot streaming and buffered data
  • domain specific algorithms
  • MPI single-sided communications
  • Experiments to extend current programming abstraction to cover more problems like domain-decomposition etc.

Check internals and blog for design and implementation details.

Acknowledgments

A big thanks to cppcon, meetingc++ and other conferences and all C++ expert speakers, committee members and compiler implementers for modernising C++ and teaching it with so much enthusiasm. I had fun implementing this, hoping you will have fun using it. Looking forward to learn more from the community.

I wish to thank eicossa and Nitesh for their (less online, more offline :P) contributions.