rleap-project/dlplan

What is the most efficient underlying data structure for ConceptDenotation and RoleDenotation?

Closed this issue · 19 comments

Option A: Sparse representation: using ConceptDenotation = std::vector<int>; and using RoleDenotation = std::vector<std::pair<int, int>>;

Option B: Bitset representation: using ConceptDenotation = std::vector<bool>; and using RoleDenotation = std::vector<bool>;

I expect option B to be more efficient on small instances but option A to be more efficient on large instances.

Yes, I'd expect similar. For option A, you can likely adjust the datatype for representing object IDs to the max. number of objects in any instance you're processing, e.g. in many problems, 1 byte (uint8_t) could be enough to represent them?

From the perspective of using the library with a small number of features (for example in a planner), this would not change much because it only reduces memory by small amounts and does not affect the runtime.

In the case of generating features, where the number of features can become large, one could implement a lossless compression function that maps from the usual int to uint8_t for more compact storage. And, of course, throw an error if the range is not sufficient for a given instance. It can be a limiting factor for a user, though.
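
As an illustration only (not dlplan's actual API; the function name and parameters are made up), such a narrowing function could look roughly like this:

#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: narrow object IDs to uint8_t when the instance is small enough.
std::vector<std::uint8_t> compress_denotation(const std::vector<int>& denotation, int num_objects) {
    if (num_objects > 256) {
        throw std::runtime_error("Instance has too many objects for uint8_t storage.");
    }
    std::vector<std::uint8_t> result;
    result.reserve(denotation.size());
    for (int object_id : denotation) {
        result.push_back(static_cast<std::uint8_t>(object_id));
    }
    return result;
}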

For future work: it makes sense to take the existing implementation of option B from d2l and compare it against option A, which is currently used in dlplan, on instances of different sizes.

> In the case of generating features, where the number of features can become large, one could implement a lossless compression function that maps from the usual int to uint8_t for more compact storage. And, of course, throw an error if the range is not sufficient for a given instance. It can be a limiting factor for a user, though.

Yes. I was thinking of the use case where you have all the states in advance, and can precompute the maximum number of PDDL objects that exist in any state, and hence can choose the best representation (whether 8 bits or more).
Even if you don't have that information, if you want to get fancy, then instead of throwing an error as you describe, you could switch representations on the fly and change your data structures (likely with some remapping overhead). Not sure this is worth the effort though.

Another approach, which I think Simon was using for the duplicate concept pruning, is to compute a hash of the concept denotation and go forward only with that (and of course hope for the best :-)). When detecting duplicate denotations over large sets of states, the likelihood of a hash collision should be extraordinarily small, and this technique removes the dependency of the space needed to keep track of duplicates on the number of states in your sample.


We currently use this technique with SHA-256. It decreases the required memory by a huge amount. Option A and option B are both compatible with this approach, so again, it is interesting to see which is better for evaluating features on states.
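
To make the idea concrete, here is a minimal sketch of that pruning scheme. The function names are hypothetical, and the simple std::size_t mixer merely stands in for the real SHA-256/MurmurHash3 digest:

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Stand-in for the real digest: combine the element hashes of the
// (canonically sorted) denotations of one feature over all states.
std::size_t compute_denotations_hash(const std::vector<std::vector<int>>& denotations_per_state) {
    std::size_t result = 0;
    std::hash<int> hasher;
    for (const auto& denotation : denotations_per_state) {
        for (int object_id : denotation) {
            // Simple order-dependent mixing; assumes each denotation is sorted.
            result = result * 31 + hasher(object_id);
        }
        result = result * 31 + denotation.size();
    }
    return result;
}

// Duplicate pruning: keep only the fixed-size hash per feature, so memory no
// longer grows with the total size of the stored denotations.
bool is_duplicate(const std::vector<std::vector<int>>& denotations_per_state,
                  std::unordered_set<std::size_t>& seen_hashes) {
    return !seen_hashes.insert(compute_denotations_hash(denotations_per_state)).second;
}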

sha-256 is a cryptographic hash function, which is not needed here as far as I can see. You'll get faster hashing with non-cryptographic hash functions such as Murmur hash, which is used in Fast Downward, IIRC.

I added an implementation with MurmurHash3 with 128 bits. The speed improved slightly in comparison to SHA-256, but the number of hash bits also decreased by a factor of 2 (256 to 128).

> From the perspective of using the library with a small number of features (for example in a planner), this would not change much because it only reduces memory by small amounts and does not affect the runtime.

If you use a bitset type that supports Boolean operations (like the DynamicBitset in Fast Downward, copied from Boost), I think the choice of representation can make a drastic difference in runtime. For example, currently the "and" concept uses time O(n²) for evaluating a state, where n is the size of the universe. With DynamicBitset, you can use the intersect() method (possibly even in-place to save memory allocations) to compute the denotation in time O(n). Similarly for "not". DynamicBitsets would probably also speed up evaluating "diff" and "or". An operation that becomes more complicated though is counting the number of "active" objects. But C++ 20 features an efficient implementation for that (https://en.cppreference.com/w/cpp/numeric/popcount), which can be copied to earlier C++ versions, I guess.
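
As a rough illustration of the word-parallel operations meant here (a sketch using boost::dynamic_bitset rather than Fast Downward's DynamicBitset; the function names are made up, not dlplan code):

#include <cstddef>
#include <boost/dynamic_bitset.hpp>

// Illustrative only: denotations as bitsets over the universe of objects.
using BitsetDenotation = boost::dynamic_bitset<>;

// "and" concept: intersection of the two child denotations, computed word-parallel.
BitsetDenotation evaluate_and(const BitsetDenotation& left, const BitsetDenotation& right) {
    BitsetDenotation result = left;
    result &= right;  // in-place intersection, no per-element loop
    return result;
}

// "not" concept: complement within the universe.
BitsetDenotation evaluate_not(const BitsetDenotation& child) {
    return ~child;
}

// Counting "active" objects uses the bitset's popcount.
std::size_t count_objects(const BitsetDenotation& denotation) {
    return denotation.count();
}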

This is not meant as a suggestion to switch data types, just food for thought :-) In any case, before switching to a different data structure, we should think about how this affects the rest of the algorithms.

BTW: for "not", a simpler way of obtaining O(n) runtime is to use swap-and-pop instead of erase(). And for "and", you could sort the two input vectors and use std::set_intersection(), which gives O(n log n + 2n) = O(n log n).

[Edit: I missed that the operations are performed on hash sets, not vectors, so my runtime analysis is wrong.]
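
For a vector representation, the two techniques mentioned above would look roughly like this (illustrative sketch only, not dlplan code, and subject to the caveat in the edit above):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

using VectorDenotation = std::vector<int>;

// Swap-and-pop: remove the element at position i in O(1) instead of erase()'s O(n).
void swap_and_pop(VectorDenotation& denotation, std::size_t i) {
    std::swap(denotation[i], denotation.back());
    denotation.pop_back();
}

// "and" via sorting + std::set_intersection: O(n log n) overall.
VectorDenotation intersect(VectorDenotation left, VectorDenotation right) {
    std::sort(left.begin(), left.end());
    std::sort(right.begin(), right.end());
    VectorDenotation result;
    std::set_intersection(left.begin(), left.end(),
                          right.begin(), right.end(),
                          std::back_inserter(result));
    return result;
}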

The more dlplan code I see, the more I wonder whether we shouldn't just use a hash set of ints for ConceptDenotation. This would fit the fact that denotations are defined as sets, not sequences. Currently, we always convert between vectors and hash sets, which seems wasteful.

To make this efficient, we should use a cache-friendly hash set implementation though and I'd recommend Abseil's flat_hash_set (https://abseil.io/docs/cpp/guides/container) for this.
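
A minimal sketch of what that could look like (illustrative only; the alias names and the intersect helper are made up, not a proposed interface):

#include <utility>
#include "absl/container/flat_hash_set.h"

// Illustrative: denotations directly as (cache-friendly) hash sets.
using HashSetConceptDenotation = absl::flat_hash_set<int>;
using HashSetRoleDenotation = absl::flat_hash_set<std::pair<int, int>>;

// Intersection by probing the smaller set against the larger one.
HashSetConceptDenotation intersect(const HashSetConceptDenotation& left,
                                   const HashSetConceptDenotation& right) {
    const auto& smaller = left.size() <= right.size() ? left : right;
    const auto& larger = left.size() <= right.size() ? right : left;
    HashSetConceptDenotation result;
    for (int object_id : smaller) {
        if (larger.contains(object_id)) {
            result.insert(object_id);
        }
    }
    return result;
}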

As I said above, let's first analyze the effects of such a change more globally.

Here is my suggestion for how to benchmark this. We can run the Python generator in parallel, each run on one small instance from a different domain, and report the runtimes for each domain, up to complexity c, plus accumulated values. Once this is set up, we can evaluate different data structures with just a few clicks on a diverse set of problems.

The drawback of this approach is that we are calling the Python code. Hence, I am unsure whether we can get the same in-depth information as we would with Valgrind's callgrind tool.

Alternatively, we could do the same thing directly in C++, or, to make it simpler, hardcode states from instances of different domains directly into the C++ code.

Sounds good to me. I think for comparing different data structures it's fine to use the Python generator. Be sure to use a realistically sized set of states for each domain though. And don't forget to measure memory usage in addition to runtime (even though it's unlikely that it's affected much). Also, it's good to add some checks that ensure that each version computes the same number of features. Apart from the vector representation, I'd try dynamic_bitset (see above) and flat_hash_set (see above). No need to use vector, I think.

I've been working a bit on the experimental side and we are ready to implement and test different data structures now. There is one piece left to discuss: we need to be able to compute a hash value for denotations. Correct me if I am wrong, but this requires a canonical representation of the denotation, e.g., a sorted vector, that we can iterate and compute a hash value from.
The good thing is that we already do this at the root of each element. For instance:

ConceptDenotation evaluate(const Element<ConceptDenotation>* parent, const State& state) const override {
    if (state.get_instance_info()->get_vocabulary_info() != parent->get_vocabulary_info()) {
        throw std::runtime_error("ConceptImpl::evaluate - mismatched vocabularies of Concept and State.");
    }
    ConceptDenotation result = parent->get_element()->evaluate(state);
    // Sorting yields a canonical representation that can be iterated to compute a hash value.
    std::sort(result.begin(), result.end());
    return result;
}

Or is there some other clever way to compute hashes from unordered sets? If not, then we can keep returning a vector from the public interface, as above, and think about changing this later.

Interesting problem! I googled a bit and found https://crypto.stackexchange.com/questions/54544/how-to-to-calculate-the-hash-of-an-unordered-set and https://www.preprints.org/manuscript/201710.0192/v1/download. So it seems to boil down to the question "does the hash set store the elements in a canonical order?". If yes, we can iterate over the elements without explicitly sorting them. Otherwise, yes, sorting them explicitly is probably the best approach.

So we do need to sort for vector, but not for DynamicBitset. I'm not sure about std::unordered_set or abseil::flat_hash_set.

Paul found another solution: https://stackoverflow.com/questions/30734848/order-independent-hash-algorithm

I see that this was already mentioned in the first post: using XOR on the individual hashes.
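
For completeness, such an order-independent hash could look roughly like this (sketch only; XOR is commutative and associative, so no canonical iteration order is needed):

#include <cstddef>
#include <functional>
#include <unordered_set>

// Order-independent hash: combine per-element hashes with XOR, so the
// iteration order of the unordered set does not matter.
std::size_t hash_unordered(const std::unordered_set<int>& denotation) {
    std::size_t result = 0;
    std::hash<int> hasher;
    for (int object_id : denotation) {
        result ^= hasher(object_id);
    }
    return result;
}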

Here are the experimental results:

Summary:

  • same generated and novel features among all three implementations.
  • dynamic_bitset is ~40% faster than std::unordered_set and dominates on all instances.
  • similar runtimes for std::unordered_set and std::vector; however, std::unordered_set solves one more task than std::vector.
  • no significant difference in memory usage.

(1) std::vector (baseline):
data-vector.zip
(2) std::unordered_set:
data-unordered_set.zip
(3) dynamic_bitset:
data-dynamic_bitset.zip

I stumbled over http://bitmagic.io/, which provides a "compressed" bitset and could maybe solve the memory problem we're seeing for large tasks with the current bitset implementation. Would be cool to include it in the next benchmark round (together with phmap::flat_hash_set). (For that I recommend encapsulating all methods on Denotations in a new class and then providing implementations of that class for each of the data structures (bitset, hash set, etc.).)

I tried to encapsulate the methods on denotations today, but I was not so happy with the amount of code required to do that and with moving so much code from the evaluate function of each element into the denotation class. Also, for some elements like c_all, c_some, c_equal, r_composition, I did not find a nice way to generalize the implementation because of the way we iterate over the objects. However, I liked my idea of using templates, which moves the decision of which container to use to compile time, so we can easily change the data structure. Inheritance does not solve the above problems and also seems to be the wrong choice because we never want dynamic dispatch.

// Sketch: choose the container at compile time via templates.
template<typename T, typename Container>
class Denotation {
public:
    Denotation& operator&=(const Denotation& other);
    // ...
};

using ConceptDenotationBitset = Denotation<int, Bitset>;
using RoleDenotationBitset = Denotation<std::pair<int, int>, Bitset>;

// Explicit template specialization (maybe partial template specialization
// could be used here instead?).
template<>
ConceptDenotationBitset& ConceptDenotationBitset::operator&=(const ConceptDenotationBitset& other) {
    // Bitset-specific intersection goes here.
    return *this;
}
template<>
RoleDenotationBitset& RoleDenotationBitset::operator&=(const RoleDenotationBitset& other) {
    // Bitset-specific intersection goes here.
    return *this;
}

// Explicit template instantiation (after the specializations).
template class Denotation<int, Bitset>;
template class Denotation<std::pair<int, int>, Bitset>;

// Switch between different implementations.
using ConceptDenotation = ConceptDenotationBitset;
using RoleDenotation = RoleDenotationBitset;

Yes, I agree it's not easy to abstract away some of the required functionality. Have you considered providing begin() and end() iterators? (see task_proxy.h for an example of how to provide these)

But choosing the implementation at compile time is completely fine (I agree we don't want dynamic dispatch) and if templates simplify the code, that's good.

Yes, I have considered implementing iterators. The problem with begin and end is that the bitset is sorted but the unordered set is not. The implementation of, e.g., c_some then looks different, because in sorted iteration we can add an early break (https://github.com/rleap-project/dlplan/blob/main/src/core/elements/concepts/some.h), whereas in unsorted iteration we cannot. Something similar happens even for simple set operations like intersection, union, or difference, where naive implementations using iterators are not optimized for Bitset.

I suggest that we just live with the code duplication for now and move the evaluation inside the Denotations such that we do not have to expose the underlying container as we currently do.

I think we need to give the new class methods for operations like intersection, union, etc. Basically, we want an IntegerSet class for ConceptDenotation that has methods for all the required operations on sets (including looping over all elements).
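
A rough interface sketch of such a class could look as follows (names and signatures are purely illustrative, not a design proposal):

// Illustrative interface sketch for an IntegerSet-style ConceptDenotation.
class IntegerSet {
public:
    explicit IntegerSet(int num_objects);

    void insert(int object_id);
    bool contains(int object_id) const;
    int size() const;

    // Set operations, in-place to avoid allocations.
    IntegerSet& intersect(const IntegerSet& other);
    IntegerSet& unite(const IntegerSet& other);
    IntegerSet& subtract(const IntegerSet& other);
    IntegerSet& complement();

    // Iteration over all contained object IDs.
    class const_iterator;
    const_iterator begin() const;
    const_iterator end() const;
};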