nvie/itertools

Adding a tee

BreizHell opened this issue · 4 comments

Hi, sorry if this is a FAQ, but is there anything preventing the addition of a tee function within this library (mimicking itertools' own tee: https://docs.python.org/3.9/library/itertools.html#itertools.tee)?

import { tee } from "itertools";

const iterable = {
  *[Symbol.iterator]() {
    yield "a";
    yield "b";
    yield "c";
    yield "d";
    yield "e";
  },
};

const [iter1, iter2, iter3] = tee(iterable, 3);

console.log(...iter1);
console.log(...iter2);
console.log(...iter3);

// output: 
// "a"   "b"   "c"  "d"  "e"
// "a"   "b"   "c"  "d"  "e"
// "a"   "b"   "c"  "d"  "e"

From my point of view, the gain of adding this function would be the ability to have several iterators "inheriting" from the same iterable, each iterating over it without any one of them consuming the base iterator before the others.
In an ideal world I could see myself using this for async or even parallel computations... But seeing as this was problematic for the original Python implementation, I don't hold out too much hope there :p

nvie commented

I don't think there is anything preventing this from being implemented, but I wonder how useful this is going to be in practice. Suppose you have a long (or potentially infinite, even) iterable and then wrap it in a tee. You are then responsible for "simultaneously" consuming them both (or "all", if n > 2), which may be hard to do in practice.

For example, in your code example, if iterable produced a very long stream of values, tee would effectively have to keep all those values in memory while consuming iter1, and it would need to retain them until iter3 gets consumed. Only then could we start freeing memory again. If the input stream is long enough, code like this will run out of memory.

I think that is a big caveat that makes it hard to find practical applications for this. (Cases that do fit in memory could arguably be read into Array.from(iterable) anyway.) Or you would have to go to great lengths as a consumer to protect yourself against that, for example by reading from all the forks simultaneously, perhaps in chunks:

  • Read 100 items from iter1
  • Read 100 items from iter2
  • Read 100 items from iter3
  • Read 100 items from iter1 again
  • etc.
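The chunked round-robin consumption described above could be sketched like this. Note that `readChunk` is a hypothetical helper (not part of the library), and the `tee` call in the usage comment assumes the signature proposed in this issue:

```typescript
// Hypothetical helper: pull up to `size` items from an iterator into an
// array, stopping early if the iterator is exhausted.
function readChunk<T>(it: Iterator<T>, size: number): T[] {
  const chunk: T[] = [];
  for (let i = 0; i < size; i++) {
    const { done, value } = it.next();
    if (done) break;
    chunk.push(value);
  }
  return chunk;
}

// Usage sketch: alternate between the forks 100 items at a time, so no
// single fork races far ahead and forces the buffer to grow unbounded.
//
// const [iter1, iter2, iter3] = tee(iterable, 3);
// let chunks: unknown[][];
// do {
//   chunks = [iter1, iter2, iter3].map((it) => readChunk(it, 100));
//   // ...process the three chunks together...
// } while (chunks.some((c) => c.length > 0));
```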

I think the Python implementation suffers from the same problem, though.

All of that said, I'm not opposed to the idea of adding it and would definitely accept a pull request for this, but I would love to at least document the above subtleties and possibly provide some example code for how to best deal with the returned iterators in practice. If you have suggestions for this, they are more than welcome!

I think tee could actually be a function that takes these issues away from its users:

For simplicity I'll assume iterable and n refer to the arguments within tee(iterable, n).
Let's say we returned 3 iterators: our slowest iterator is currently at iteration i, the middle one at i+a, and the fastest at i+b (so a <= b).
We also have a shared array* A of variable length, where A.length always equals b.

I've got these ideas:

  • tee could check the type of iterable. In case iterable is known to be finite (for example, if Array.isArray(iterable) returns true), then we could just return range(n).map(() => iterable[Symbol.iterator]())
  • If it cannot be proven that iterable is finite, then we have two choices:
    • Operate in a forcefully synchronized fashion:
      None of the returned iterators is allowed to advance to iteration i+1 while at least one of the iterators is still at iteration i-1
      -> I could imagine a third argument for tee that would specify the size of a buffer (basically an array containing the elements of iterable that the slowest iterator hasn't reached yet). This buffer size also dictates how much of a gap there can be between the fastest and slowest iterators.
    • Operate asynchronously:
      Every time b increases, so does A.length. A[0] corresponds to the oldest value of iterable that the slowest iterator hasn't gotten to yet, while A[b-1] is the value just read by the fastest iterator.

*

  • I've been talking about an array, but it would most likely be better to use a FIFO queue.
  • To hold the buffer array I've been talking about, I would add an internal Map within the tee function itself, where each key maps to the corresponding buffer array; each returned iterator would therefore have to hold a key to access its buffer. (I haven't thought a whole lot about this, so maybe there is a simpler/better way of doing it.)
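A minimal sketch of this buffered approach, with one simplification: instead of a single shared array plus a Map of keys, each fork owns its own FIFO queue of not-yet-consumed values (whenever a fork's queue runs dry, one value is pulled from the shared source and pushed onto every queue). Memory use is then proportional to the gap between the fastest and slowest fork, as described above. This is only an illustration of the idea, not the library's API:

```typescript
// Sketch: tee(iterable, n) returns n independent forks backed by
// per-fork FIFO queues over a single shared source iterator.
function tee<T>(iterable: Iterable<T>, n: number = 2): Array<IterableIterator<T>> {
  const source = iterable[Symbol.iterator]();
  const queues: T[][] = Array.from({ length: n }, () => []);
  let sourceDone = false;

  function makeFork(queue: T[]): IterableIterator<T> {
    return {
      [Symbol.iterator]() {
        return this;
      },
      next(): IteratorResult<T> {
        if (queue.length === 0) {
          // This fork is the furthest ahead: advance the shared source
          // and hand the value to every fork's queue (including ours).
          if (sourceDone) return { done: true, value: undefined };
          const res = source.next();
          if (res.done) {
            sourceDone = true;
            return { done: true, value: undefined };
          }
          for (const q of queues) q.push(res.value);
        }
        return { done: false, value: queue.shift()! };
      },
    };
  }

  return queues.map(makeFork);
}
```

Consuming one fork to completion before the others still buffers the whole stream (the memory caveat discussed earlier); interleaved consumption keeps the queues short.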

So ... What do you think ?

We have detected this issue has not had any activity during the last 60 days. That could mean this issue is no longer relevant and/or nobody has found the necessary time to address the issue. We are trying to keep the list of open issues limited to those issues that are relevant to the majority and to close the ones that have become 'stale' (inactive). If no further activity is detected within the next 14 days, the issue will be closed automatically.
If new comments are posted and/or a solution (pull request) is submitted for review that references this issue, the issue will not be closed. Closed issues can be reopened at any time in the future. Please remember those participating in this open source project are volunteers trying to help others and create a better collaboration environment for all. Thank you for your continued involvement and contributions!