Human learning loop / compilation times
rw opened this issue · 6 comments
Heya!
@frankmcsherry I've been noodling on timely-dataflow and differential-dataflow for months, and this week I finally got a chance to read the mdbook and go through the examples. Very fun, mind-expanding stuff! The runtime speed numbers are impressive, and I like how much TLC is put into these libraries. Thank you for all of your hard work bringing these projects to the world.
As I learn the idioms of TD/DD, I've noticed that I want to tweak the provided examples (maybe by adding a function to a pipeline, or changing the input generator to use stdin, and so on). This kind of fiddling is part of my normal hands-on tinkering-to-learn process.
Unfortunately, I haven't been able to get compile times low enough to make this learning process pleasant. I've found that compiling one of the first examples from the mdbook, with a hot incremental build cache (Rust 1.33.0), and debug mode, still takes 10+ seconds.
I looked through the verbose cargo logs, but nothing stood out to me as being a problem I could fix to speed up compilation.
I know these libraries are often described as working in "mysterious ways" -- are there techniques we could employ to speed up the compile-test-edit cycle, so that fans like me can improve our intuition of these libraries?
Thanks!
I agree! I'm a bit tight on time right at the moment, but I'll try and explain more and sketch some alternatives later on today. But, you are right to be frustrated! :D
My experience has been that
- timely and differential dataflow builds take a while, and
- the vast majority of the time is spent in LLVM, compiling the code that Rust hands it.
My understanding is that Rust provides a fairly massive volume of code, at one point surprising some of the core devs: we had one example where a fraction of a gigabyte of intermediate representation was handed to LLVM, with the instructions "optimize this". I have a differential dataflow program that takes 20 minutes to build in release mode.
Release and debug builds of the `simple` example, whose source is literally
```rust
extern crate timely;
use timely::dataflow::operators::*;

fn main() {
    timely::example(|scope| {
        (0..10).to_stream(scope)
               .inspect(|x| println!("seen: {:?}", x));
    });
}
```
takes
```
echidnatron% cargo build --release --example simple
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished release [optimized + debuginfo] target(s) in 26.44s
echidnatron% cargo build --example simple
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 6.39s
echidnatron%
```
which I agree is deeply objectionable.
I don't have a lot of visibility into why Rust takes so long to compile things. There is the `cargo bloat` subcommand, which attempts to account for the space in binaries and may be an indicator of what took so long to compile. Unfortunately, its output appears to be relatively unhelpful:
```
echidnatron% cargo bloat --example simple
Compiling ...
Analyzing target/debug/examples/simple

 File  .text     Size                 Crate  Name
32.6%  95.2%   1.9MiB                        [11592 Others]
 0.2%   0.5%  10.5KiB             [Unknown]  _read_line_info
 0.1%   0.4%   8.0KiB                timely  <timely::progress::reachability::Tracker<T>>::propagate_all
 0.1%   0.4%   7.5KiB                   std  std::sys_common::backtrace::_print::{{closure}}
 0.1%   0.4%   7.4KiB                   std  _backtrace_dwarf_add
 0.1%   0.3%   5.9KiB  timely_communication  timely_communication::allocator::zero_copy::initialize::initialize_networking
 0.1%   0.3%   5.3KiB                   std  std::sys_common::backtrace::_print
 0.1%   0.3%   5.1KiB                timely  <timely::progress::subgraph::SubgraphBuilder<TOuter, TInner>>::build
 0.1%   0.2%   4.8KiB             [Unknown]  _read_attribute
 0.1%   0.2%   4.8KiB                timely  <timely::progress::reachability::Builder<T>>::is_acyclic
 0.1%   0.2%   4.1KiB                timely  timely::progress::reachability::summarize_outputs
 0.1%   0.2%   3.5KiB                timely  <timely::worker::Worker<A>>::step
 0.1%   0.2%   3.3KiB                timely  <timely::progress::subgraph::Subgraph<TOuter, TInner>>::propagate_pointstamps
 0.1%   0.2%   3.3KiB                   std  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
 0.1%   0.2%   3.3KiB                   std  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
34.2% 100.0%   2.0MiB                        .text section size, the file size is 5.7MiB
echidnatron%
```
The `--release` build is smaller, but the bulk is still in `Others`.
There is also the `cargo llvm-lines` subcommand, which when pointed at a new `src/bin/simple.rs` produces
```
echidnatron% cargo llvm-lines --bin simple | head -n20
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 7.35s
  17942   25  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
  13511  641  core::ptr::real_drop_in_place
  12250   35  <alloc::raw_vec::RawVec<T, A>>::reserve_internal
   6701   72  <core::option::Option<T>>::map
   6239   55  <core::result::Result<T, E>>::map
   5272   46  <core::result::Result<T, E>>::map_err
   4028   43  <core::result::Result<T, E>>::unwrap_or_else
   3755   16  std::collections::hash::map::search_hashed_nonempty
   3534    8  alloc::slice::merge_sort
   3460   28  core::iter::iterator::Iterator::try_fold
   3146   26  <alloc::raw_vec::RawVec<T, A>>::allocate_in
   2898    4  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
   2625   35  <alloc::raw_vec::RawVec<T, A>>::amortized_new_size
   2520    8  alloc::slice::insert_head
   2508   64  <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut
   2479    7  <std::collections::hash::map::HashMap<K, V, S>>::try_resize
   2403   55  core::ptr::read
   2172    8  alloc::slice::merge
   2141   26  <alloc::vec::Vec<T>>::push
   2062   34  core::ptr::swap_nonoverlapping_one
echidnatron%
```
which appears to be telling us that there is a massive amount of random Rust cruft. If you add up all the lines for `simple.rs`, you get 295,988 lines. That seems like quite a lot, but I'm not certain what LLVM is expected to handle efficiently.
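For reference, that total is just the sum of the first column of the tool's output. A minimal sketch of the arithmetic, with a two-line inline sample standing in for the real `cargo llvm-lines --bin simple` pipe (the sample rows are copied from the listing above):

```shell
# Sum the "lines" column (column 1) of llvm-lines-style output.
# Piping the real `cargo llvm-lines` output through the same awk
# one-liner yields the 295,988 total quoted above.
printf '17942 25 try_fold\n13511 641 real_drop_in_place\n' \
  | awk '{ total += $1 } END { print total }'
# prints 31453
```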
My sense is that Rust's "zero cost abstractions" are very much not zero cost to compile, and the encouragement to use them is perhaps an error for compilation-minded users.
I'm about to head out for some vacation, but I'll try and look into things and see what I can learn. For example, it may make sense to `#ifdef` out much of the networking code for "learning", because even though it is dead code for `timely::example`, Rust may force LLVM to rediscover this on each build. I'll see what I can do about pointing serious Rust folks at this problem, as it worries me too.
One thing that helped a little (not greatly, but ...) is to replace the timely example code with
```rust
extern crate timely;
use timely::dataflow::operators::*;

fn main() {
    let alloc = timely::communication::allocator::thread::Thread::new();
    let mut worker = timely::worker::Worker::new(alloc);
    worker.dataflow::<u64,_,_>(|scope| {
        (0..10).to_stream(scope)
               .inspect(|x| println!("seen: {:?}", x));
    });
}
```
which removes the `timely::example()` and `timely::execute()` calls, making much of the communication crate dead code. Instead, it uses only the single-threaded allocator and directly employs the main thread as the compute thread.
It still has lots of `llvm-lines`, and I'll try looking into what could possibly be the source of all of them.
```
echidnatron% cargo llvm-lines --bin simple | head -n20
   Compiling timely v0.8.0 (/Users/mcsherry/Projects/timely-dataflow)
    Finished dev [unoptimized + debuginfo] target(s) in 6.01s
  16495   23  <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::try_fold
  10500   30  <alloc::raw_vec::RawVec<T, A>>::reserve_internal
   9755  460  core::ptr::real_drop_in_place
   5664   50  <core::result::Result<T, E>>::map
   4965   54  <core::option::Option<T>>::map
   4215   37  <core::result::Result<T, E>>::map_err
   3840   41  <core::result::Result<T, E>>::unwrap_or_else
   3534    8  alloc::slice::merge_sort
   3191   26  core::iter::iterator::Iterator::try_fold
   2904   24  <alloc::raw_vec::RawVec<T, A>>::allocate_in
   2898    4  <core::slice::IterMut<'a, T> as core::iter::iterator::Iterator>::try_fold
   2590   11  std::collections::hash::map::search_hashed_nonempty
   2520    8  alloc::slice::insert_head
   2250   30  <alloc::raw_vec::RawVec<T, A>>::amortized_new_size
   2232   57  <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut
   2172    8  alloc::slice::merge
   2122    6  <std::collections::hash::map::HashMap<K, V, S>>::try_resize
   1872   24  <core::option::Option<T>>::unwrap_or_else
   1803   22  <alloc::vec::Vec<T>>::push
   1800   36  <alloc::raw_vec::RawVec<T, A>>::dealloc_buffer
echidnatron%
```
I just pushed d65e96d, which adds the `timely::execute_directly(..)` method. It bypasses much of the communication construction logic, taking no arguments from the command line and just using the main thread.
This also improved `cargo test` by about 20s (from 220s to 200s), as most of the doctest examples use `timely::example()`, which was ported over to use this too.
@frankmcsherry I just verified that `timely::execute_directly` shaves multiple seconds off of my debug compilation loop -- thanks! Do you know of other avenues for improving these numbers further? From your explanations above, it sounds like this problem might fundamentally be a `rustc` issue.
Great to hear! I'm not immediately aware of other fixes, but I'm going to try exploring a bit to find out. Timely dataflow aggressively monomorphizes, meaning it generates specialized code for every concrete type it can distinguish, but perhaps it should do less of that and more "boxing" of types behind abstract trait interfaces.
I'll explore a bit. I've also asked some Rust core dev folks, who seem receptive to thinking about it (especially if folks show up with specific data, so I'm looking into that first).