TimelyDataflow/timely-dataflow

Human learning loop / compilation times

rw opened this issue ยท 6 comments

rw commented

Heya!

@frankmcsherry I've been noodling on timely-dataflow and differential-dataflow for months, and this week I finally got a chance to read the mdbook and go through the examples. Very fun, mind-expanding stuff! The runtime speed numbers are impressive, and I like how much TLC is put into these libraries. Thank you for all of your hard work bringing these projects to the world.

As I learn the idioms of TD/DD, I've noticed that I want to tweak the provided examples (maybe by adding a function to a pipeline, or changing the input generator to use stdin, and so on). This kind of fiddling is part of my normal hands-on tinkering-to-learn process.

Unfortunately, I haven't been able to get compile times low enough to make this learning process pleasant. I've found that compiling one of the first examples from the mdbook, with a hot incremental build cache (Rust 1.33.0), and debug mode, still takes 10+ seconds.

I looked through the verbose cargo logs, but nothing stood out to me as being a problem I could fix to speed up compilation.

I know these libraries are often described as working in "mysterious ways" -- are there techniques we could employ to speed up the compile-test-edit cycle, so that fans like me can improve our intuition of these libraries?

Thanks!

I agree! I'm a bit tight on time right at the moment, but I'll try and explain more and sketch some alternatives later on today. But, you are right to be frustrated! :D

My experience has been that

  1. timely and differential dataflow builds take a while, and
  2. the vast majority of the time is spent in LLVM, compiling the code that Rust hands it.

My understanding is that Rust provides a fairly massive volume of code, at one point surprising some of the core devs: we had one example where a fraction of a gigabyte of intermediate representation was handed to LLVM, with the instructions "optimize this". I have a differential dataflow program that takes 20 minutes to build in release mode.

Release and debug builds of the simple example, whose source is literally

extern crate timely;

use timely::dataflow::operators::*;

fn main() {
    timely::example(|scope| {
        (0..10).to_stream(scope)
               .inspect(|x| println!("seen: {:?}", x));
    });
}

takes

echidnatron% cargo build --release --example simple
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished release [optimized + debuginfo] target(s) in 26.44s
echidnatron% cargo build --example simple 
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 6.39s
echidnatron% 

which I agree is deeply objectionable.

I don't have a lot of visibility into why Rust takes so long to compile things. There is the cargo bloat subcommand which attempts to account for the space in binaries, which may be an indicator of what took so long to compile. Unfortunately, its output appears to be relatively unhelpful:

echidnatron% cargo bloat --example simple
Compiling ...
Analyzing target/debug/examples/simple

 File  .text    Size                Crate Name
32.6%  95.2%  1.9MiB                      [11592 Others]
 0.2%   0.5% 10.5KiB            [Unknown] _read_line_info
 0.1%   0.4%  8.0KiB               timely <timely::progress::reachability::Tracker<T>>::propagate_all
 0.1%   0.4%  7.5KiB                  std std::sys_common::backtrace::_print::{{closure}}
 0.1%   0.4%  7.4KiB                  std _backtrace_dwarf_add
 0.1%   0.3%  5.9KiB timely_communication timely_communication::allocator::zero_copy::initialize::initialize_networking
 0.1%   0.3%  5.3KiB                  std std::sys_common::backtrace::_print
 0.1%   0.3%  5.1KiB               timely <timely::progress::subgraph::SubgraphBuilder<TOuter, TInner>>::build
 0.1%   0.2%  4.8KiB            [Unknown] _read_attribute
 0.1%   0.2%  4.8KiB               timely <timely::progress::reachability::Builder<T>>::is_acyclic
 0.1%   0.2%  4.1KiB               timely timely::progress::reachability::summarize_outputs
 0.1%   0.2%  3.5KiB               timely <timely::worker::Worker<A>>::step
 0.1%   0.2%  3.3KiB               timely <timely::progress::subgraph::Subgraph<TOuter, TInner>>::propagate_pointstamps
 0.1%   0.2%  3.3KiB                  std <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%  3.3KiB                  std <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
34.2% 100.0%  2.0MiB                      .text section size, the file size is 5.7MiB
echidnatron% 

The --release build is smaller, but the bulk is still in Others.

There is also the cargo llvm-lines subcommand, which when pointed at a new src/bin/simple.rs produces

echidnatron% cargo llvm-lines --bin simple | head -n20
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 7.35s
  17942   25  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
  13511  641  core::ptr::real_drop_in_place
  12250   35  <alloc::raw_vec::RawVec<T, A>>::reserve_internal
   6701   72  <core::option::Option<T>>::map
   6239   55  <core::result::Result<T, E>>::map
   5272   46  <core::result::Result<T, E>>::map_err
   4028   43  <core::result::Result<T, E>>::unwrap_or_else
   3755   16  std::collections::hash::map::search_hashed_nonempty
   3534    8  alloc::slice::merge_sort
   3460   28  core::iter::iterator::Iterator::try_fold
   3146   26  <alloc::raw_vec::RawVec<T, A>>::allocate_in
   2898    4  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
   2625   35  <alloc::raw_vec::RawVec<T, A>>::amortized_new_size
   2520    8  alloc::slice::insert_head
   2508   64  <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut
   2479    7  <std::collections::hash::map::HashMap<K, V, S>>::try_resize
   2403   55  core::ptr::read
   2172    8  alloc::slice::merge
   2141   26  <alloc::vec::Vec<T>>::push
   2062   34  core::ptr::swap_nonoverlapping_one
echidnatron% 

which appears to be telling us that there is a massive amount of random Rust cruft. If you add up all the lines for simple.rs, you get 295,988 lines. That seems like quite a lot, but I'm not certain what LLVM is expected to handle efficiently.

My sense is that what this means is that Rust's "zero cost abstractions" are very much not zero cost to compile, and the encouragement to use them is perhaps an error for compilation-minded users.

I'm about to head out for some vacation, but I'll try and look into things and see what I can learn. For example, it may make sense to #ifdef out much of the networking code for "learning", because even though it is dead code for timely::example, Rust may force LLVM to rediscover this on each build. I'll see what I can do about pointing serious Rust folks at this problem, as it worries me too.

One thing that helped a little (not greatly, but ...) is to replace timely example code with

extern crate timely;

use timely::dataflow::operators::*;

fn main() {

    let alloc = timely::communication::allocator::thread::Thread::new();
    let mut worker = timely::worker::Worker::new(alloc);

    worker.dataflow::<u64,_,_>(|scope|  {
        (0..10).to_stream(scope)
               .inspect(|x| println!("seen: {:?}", x));
    });
}

which removes the timely::example() and timely::execute() calls, making much of the communication crate dead code. Instead, it uses only the single-threaded allocator and directly employs the main thread as the compute thread.

It still has lots of llvm-lines, and I'll try looking in to what could possibly be the source of all of them.

echidnatron% cargo llvm-lines --bin simple | head -n20
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 6.01s
  16495   23  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
  10500   30  <alloc::raw_vec::RawVec<T, A>>::reserve_internal
   9755  460  core::ptr::real_drop_in_place
   5664   50  <core::result::Result<T, E>>::map
   4965   54  <core::option::Option<T>>::map
   4215   37  <core::result::Result<T, E>>::map_err
   3840   41  <core::result::Result<T, E>>::unwrap_or_else
   3534    8  alloc::slice::merge_sort
   3191   26  core::iter::iterator::Iterator::try_fold
   2904   24  <alloc::raw_vec::RawVec<T, A>>::allocate_in
   2898    4  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
   2590   11  std::collections::hash::map::search_hashed_nonempty
   2520    8  alloc::slice::insert_head
   2250   30  <alloc::raw_vec::RawVec<T, A>>::amortized_new_size
   2232   57  <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut
   2172    8  alloc::slice::merge
   2122    6  <std::collections::hash::map::HashMap<K, V, S>>::try_resize
   1872   24  <core::option::Option<T>>::unwrap_or_else
   1803   22  <alloc::vec::Vec<T>>::push
   1800   36  <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer
echidnatron% 

I just pushed d65e96d which adds the timely::execute_directly(..) method. It bypasses much of the communication construction logic, taking no arguments from the command line and just using the main thread.

This also improved cargo test by about 20s (from 220s to 200s) as most of the doctest examples use timely::example() which was ported over to use this too.

rw commented

@frankmcsherry I just verified that timely::execute_directly shaves multiple seconds off of my debug compilation loop -- thanks! Do you know of other avenues for improving these numbers further? From your explanations above, it sounds like this problem might fundamentally be a rustc issue.

Great to hear! I'm not immediately aware of other issues, but I'm going to try exploring a bit to find out. Timely dataflow aggressively monomorphises, meaning it creates new types whenever it can tell a difference between things, but perhaps it should do less of that and more "boxing" of types behind abstract trait interfaces.

I'll explore a bit. I've also asked some Rust core dev folks, who seem receptive to thinking about it (esp if folks show up with specific data, so I'm looking into that first).