dtolnay/cargo-tally

Improve performance of --transitive

Closed this issue · 4 comments

Currently graphing direct dependencies takes about 4 seconds and graphing transitive dependencies takes 50 seconds. There is probably low-hanging fruit to speed this up.

Here is a look at valgrind --tool=callgrind cargo tally --graph "" --transitive syn:
2,008,253,106,551  PROGRAM TOTALS

281,457,705,838  cargo_tally::intern::crate_name
180,275,075,479  cargo_tally::tally::Resolve::add_crate'2
163,842,871,243  core::iter::traits::iterator::Iterator::try_for_each::{{closure}}
147,235,485,561  semver::version_req::VersionReq::matches
117,579,396,828  cargo_tally::tally::Universe::resolve
104,796,305,465  hashbrown::raw::RawTable<T>::reserve_rehash
 95,662,557,545  malloc.c:_int_free
 90,610,321,614  cargo_tally::tally::Resolve::add_crate_feature'2
 81,847,693,728  cargo_tally::tally::Resolve::add_crate_feature
 67,075,480,232  <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold'2
 62,561,498,453  malloc.c:malloc
 59,789,394,084  __memcmp_avx2_movbe
 59,577,203,335  malloc.c:_int_malloc
 47,053,764,829  std::collections::hash::map::HashMap<K,V,S>::insert
 43,335,854,045  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter
 41,735,730,260  spin::once::Once<T>::call_once
 40,962,961,157  std::collections::hash::set::HashSet<T,S>::insert
 29,247,846,267  malloc.c:free
 29,019,940,094  std::collections::hash::set::HashSet<T,S>::remove
 28,956,799,160  cargo_tally::tally::Universe::resolve_and_add_to_graph
 27,082,241,716  <string_interner::InternalStrRef as core::cmp::PartialEq>::eq
 23,602,520,053  cargo_tally::tally::tally
 22,763,683,867  malloc.c:malloc_consolidate
 18,789,504,656  cargo_tally::tally::Resolve::add_dep_or_crate_feature'2
 14,714,313,859  cargo_tally::tally::Resolve::add_dep_or_crate_feature
 13,518,171,690  cargo_tally::tally::Resolve::add_crate
 13,012,911,115  core::ptr::real_drop_in_place'2
 12,252,192,341  __memcpy_avx_unaligned_erms
  9,306,334,264  std::sys::unix::alloc::__rdl_alloc
  8,919,551,753  core::slice::<impl [T]>::copy_from_slice
  6,968,387,700  __atomic_load_8
  6,442,603,306  alloc::raw_vec::RawVec<T,A>::reserve
  6,244,765,659  string_interner::InternalStrRef::as_str
  6,244,679,652  string_interner::InternalStrRef::from_str
  6,016,772,296  core::ptr::real_drop_in_place
  4,540,695,687  <alloc::vec::Vec<T> as core::clone::Clone>::clone'2
  3,007,069,092  <core::iter::adapters::Cloned<I> as core::iter::traits::iterator::Iterator>::fold'2
  2,658,952,652  std::alloc::__rdl_alloc
  2,228,194,951  <core::iter::adapters::Cloned<I> as core::iter::traits::iterator::Iterator>::fold
  2,177,678,356  __memset_avx2_unaligned_erms
  2,132,635,116  _ZN4core3ptr18real_drop_in_place17h4c31a7ac400b6ca7E.llvm.1572264842351706568
  1,729,101,712  regex::backtrack::Bounded<I>::backtrack
  1,724,879,124  __memset_avx2_erms
And perf record:
11.35%  cargo_tally::intern::crate_name
 9.98%  core::iter::traits::iterator::Iterator::try_for_each::{{closure}}
 8.70%  cargo_tally::tally::Resolve::add_crate
 6.90%  cargo_tally::tally::Resolve::add_crate_feature
 4.79%  std::collections::hash::set::HashSet<T,S>::insert
 4.59%  cargo_tally::tally::Universe::resolve
 3.98%  <string_interner::InternalStrRef as core::cmp::PartialEq>::eq
 3.88%  std::collections::hash::set::HashSet<T,S>::remove
 3.22%  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter
 2.85%  semver::version_req::VersionReq::matches
 2.68%  cargo_tally::tally::Universe::resolve_and_add_to_graph
 2.61%  std::collections::hash::map::HashMap<K,V,S>::insert
 2.41%  malloc
 1.85%  <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
 1.82%  cargo_tally::tally::tally
 1.80%  cfree
 1.41%  hashbrown::raw::RawTable<T>::reserve_rehash
 1.33%  hashbrown::raw::RawTable<T>::reserve_rehash
 1.11%  cargo_tally::tally::Resolve::add_dep_or_crate_feature
 0.91%  hashbrown::raw::RawTable<T>::reserve_rehash
 0.83%  spin::once::Once<T>::call_once
 0.71%  core::ptr::real_drop_in_place
 0.48%  string_interner::InternalStrRef::as_str
 0.47%  string_interner::InternalStrRef::from_str
 0.34%  __rdl_alloc
 0.26%  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter
 0.26%  <core::iter::adapters::Cloned<I> as core::iter::traits::iterator::Iterator>::fold
 0.23%  core::ptr::real_drop_in_place
 0.21%  <alloc::string::String as core::clone::Clone>::clone
 0.19%  __rust_dealloc
 0.17%  __atomic_load_8
 0.16%  __rust_alloc
 0.14%  <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
 0.13%  <alloc::vec::Vec<T> as core::clone::Clone>::clone
 0.13%  alloc::raw_vec::RawVec<T,A>::reserve
 0.12%  __atomic_load_8@plt
 0.11%  <alloc::vec::Vec<T> as core::clone::Clone>::clone
 0.10%  <core::iter::adapters::Cloned<I> as core::iter::traits::iterator::Iterator>::fold
 0.10%  core::ptr::real_drop_in_place

Using rayon would lead to a performance improvement right off the bat, no? Right now everything is done on one thread. Looking into changing the string interner could also lead to significant gains.

Fixed by #30.