JuliaCollections/Iterators.jl

length() isn't required for iterators

Closed this issue · 15 comments

According to the Julia language docs, iterators do not need to support the length() method. The length() method is expected of "General Collections". The biggest problem with length() is that some iterators have infinite length() and this cannot be represented by any Integer.

Often iterators are functions of other iterators, and so their length is a function of other iterators. The technique of selectively declaring the length() method when it is known, and using applicable() to test for its presence doesn't work well in this case - how can you selectively define length() dependent on whether an argument type has it declared?

The cleanest easy solution to this is not to declare length for any iterators. A better solution would be to allow length() to return :unknown or :infinite for cases where the length cannot be determined without iterating through the entire sequence (eg reading tokens from a file), or when it is known to be unbounded. A set of basic math operators would need to be provided for combining finite lengths with either unknown or infinite lengths. But this would affect Julia beyond just this package.

julia> collect( product( chain( [1], [2] ), [ 3 ] ) )
ERROR: length has no method matching length(::Chain)

Iterators that are functions of other iterators could be parametrized by whether they have a known length (e.g. Chain{false} has unknown length and Chain{true} has known length), and then we could define length only for Chain{true}. I believe we could do this without introducing type instability if we use a staged function to test the arguments for applicability of length when the iterator is constructed.

Interesting. This is a different solution to the one I was thinking of. The disadvantage is the baggage of having to define true and false versions etc. (I define a lot of iterators - they represent signals in signal processing applications) As for type stability, I believe that in some ways this is just moving the problem around - the test for length being available still needs to be made and this isn't that different that code the compiler put put in testing the type. My assumption is that length can be hoisted out of loops, so efficiency shouldn't be affected much. I use multiple return types a lot - outside of loops, and only for "good" reasons - I figure that the compiler generated code for testing this is as least as efficient as the code I would write to effect the same thing - effectively tagged unions - and I can have smaller and neater code exploiting this.

For the record, this is where I was going. It makes a lot more sense to use singleton types than the symbols I originally suggested - and then I can add specializations for the + operator.

type UnknownLength
end
+( lhs::UnknownLength, rhs ) = UnknownLength()
+( lhs, rhs::UnknownLength ) = UnknownLength()
+( lsh::UnknownLength, rhs::UnknownLength ) = UnknownLength()

Then chain() could just use sum() over map() of length() - and if any were UnknownLength() then the result is UnknownLength(). I have also in the past had the concept of infinite length which could be handled the same way.

However, I'm modifying my application code to not assume the presence of length() to avoid all these issues.

A quick thought: since infinite length will typically be determined for a whole iterator type, what if we were to define length(::MyInfiniteIterator) = Inf?

Although it might seem a bit odd at first to use a float in this context, there are some advantages:

  • it would always compare longer than any finite-length chain, so the definition of Zip2 would work as expected, and we could avoid the need for TakeStrict by defining
length(t::Take) = min(length(t.xs), n)

Similarly, we could have

length(c::Chain) = sum([length(i) for i in c.xss])
  • We could distinguish between infinite-length iterators (whose behaviour is fairly predictable), and unknown-length iterators (whose behaviour can cause problems if they finish too early), which would continue to throw a MethodError.

length(::MyInfiniteIterator) = Inf seems sensible to me.

Also, since Inf - n == Inf for any finite n, this would also work for Drop:

length(d::Drop) = max(length(d.xs) - d.n, 0)

I'd worry that the type instability this introduces would cause more problems than this really solves.

This doesn't neecessarily introduce any type instability. length(::MyInfiniteIterator) would always return a Float64 (Inf), and length(::Array) would still always return an Int.

Type instability doesn't mean that a given function (length) must always return the same type. To be stable, you only need to return the same type for the same argument types.

It is true that length(::Any) would now return an unknown type, but that would only matter when type inference has already failed (so that the compiler doesn't know the type of the object you are taking the length of).

Wouldn't we want to check if there are any places where code is using length(::Any)?
I have a suspicion that it might show up in places like ODBC, or DataFrames, where it is frequent to have rows with many different types, so you end up with a Vector{::Any}.

I think there should just be two different interfaces: an AbstractFiniteIterator where length is defined and AbstractIterator where its just start, next, and done. "Composite" iterators like Chain and Product should be parameterized on the class of iterator they hold and define length{I<:AbstractFiniteIterator}(c::Chain{I}) = ....

Is it possible to determine a single type parameter from a typejoin of the parameters in the constructor arguments?

@phobon's suggestion sounds reasonable to me

I've had to design iterators that could handle unknown or essentially infinite streams of data, for example,
you don't want to have to iterate over terabytes of a B+tree to find a length, and by the time you've done that, the length might not even be accurate any longer, if other processes are adding/deleting to/from the database.

@ScottPJones I don't think anyone is proposing to make length iterate over the collection (for one thing, this would cause all sorts of problems with streaming data where you only get to iterate once).

My point, muddled as it was, was just that in some cases, like the one I mentioned, it was impossible to determine in advance whether the iteration was of unknown length, or infinite, even for the same structure. Same thing is true for iterating over a filesystem. I didn't think people were proposing making length iterate of the entire collection, just that conceptually, to me, @phobon's idea seems correct, not changing length to try to return an out-of-band value. If you really want length to return an out-of-band value, why not just return -1 (since length returns an Int usually instead of a UInt which in and of itself is conceptually strange)

I think returing Inf would be unsemantic,
because you can have an iterator of unknown (and unknowable) length that is certainly finite:
For example

type ValuesThatSumToAtLeast100
end

Base.start(b::ValuesThatSumToAtLeast100) = 1
Base.done(b::ValuesThatSumToAtLeast100,state) = state>=100
function Base.next(b::ValuesThatSumToAtLeast100,state)
    value = rand(1:10)
    value, state+value
end

ValuesThatSumToAtLeast100() |> collect

This iterators length is always between 10 and 100 elements long

Better than returning some out of band value like Inf or -1
is having different parameterised types, as suggested in #42 (comment)
, or different interfaces as suggested in #42 (comment)

This is also more julian, since it makes it easy to define things using multiple dispatch later.

Keno commented

This is fixed on 0.5, since I declared Product as SizeUnknown, which is the only thing we can do, since the iteratorsize is a type property and we do no parameterize on types.