Questions, improvements
dumblob opened this issue · 0 comments
This looks like a very promising library. I have a couple of questions and ideas.
1. Establish a name for these preemptive user-space threads. There are `green threads`, `greenlets`, `eventlets`, `fibers`, workers, microthreads, goroutines, etc., but none of them is preemptive. There are also `pthreads`, `user space threads`, etc., which are just wrappers around OS-specific primitives. What about `subthreads` or `psubthreads` (preemptive `subthreads`)?
2. Make it into a single-header library (like e.g. stb or SQLite).
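   For reference, the stb-style single-header pattern guards the implementation behind a macro so the header can be included everywhere but compiled only once. A minimal sketch; the `PSUB_`/`psub_` names are made up for illustration and are not this library's API:

   ```c
   /* psub.h -- hypothetical single-header layout (stb-style) */
   #ifndef PSUB_H
   #define PSUB_H

   /* Public interface: declarations only. */
   int psub_spawn(void (*fn)(void *), void *arg);

   #endif /* PSUB_H */

   #ifdef PSUB_IMPLEMENTATION
   /* Implementation section, compiled into exactly one translation
    * unit that defines PSUB_IMPLEMENTATION before the include. */
   int psub_spawn(void (*fn)(void *), void *arg)
   {
       (void)fn; (void)arg;
       return 0; /* real implementation goes here */
   }
   #endif /* PSUB_IMPLEMENTATION */
   ```

   One .c file does `#define PSUB_IMPLEMENTATION` before including the header; every other file includes it normally.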
3. What is the performance on Linux (compared to native OS threads) with the 5000 µs timer? Are there any measurements anywhere?
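   Lacking official numbers, a baseline for the native-thread side is easy to get: two pthreads ping-ponging a byte over a pair of pipes, counting switches per second. The psubthread side would be the same loop written against this library's spawn/switch API (not shown, since I'm not assuming its exact signatures):

   ```c
   /* os_switch_bench.c -- context-switch rate of two native threads
    * ping-ponging one byte over a pipe pair.
    * Build: cc -O2 os_switch_bench.c -lpthread */
   #include <pthread.h>
   #include <stdio.h>
   #include <time.h>
   #include <unistd.h>

   #define ROUNDS 200000

   static int ab[2], ba[2]; /* a->b and b->a pipes */

   static void *peer(void *unused)
   {
       char c;
       for (int i = 0; i < ROUNDS; i++) {
           read(ab[0], &c, 1);   /* wait for ping */
           write(ba[1], &c, 1);  /* pong back */
       }
       return unused;
   }

   int main(void)
   {
       pthread_t t;
       char c = 'x';
       struct timespec t0, t1;

       pipe(ab);
       pipe(ba);
       pthread_create(&t, NULL, peer, NULL);

       clock_gettime(CLOCK_MONOTONIC, &t0);
       for (int i = 0; i < ROUNDS; i++) {
           write(ab[1], &c, 1);
           read(ba[0], &c, 1);
       }
       clock_gettime(CLOCK_MONOTONIC, &t1);
       pthread_join(t, NULL);

       double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
       /* each round is two switches (there and back) */
       printf("%.0f switches/s\n", 2.0 * ROUNDS / s);
       return 0;
   }
   ```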
4. What is the memory consumption (compared to native OS threads)? Currently the stacks are fixed-size, so memory use is directly proportional to the number of psubthreads, but OS threads are handled differently, and a real measurement could shed some light on the differences.
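   A rough way to measure the OS-thread side, assuming Linux and the default pthread stack size: spawn idle threads and watch VmRSS grow, then compare against the same count of psubthreads with their fixed stacks:

   ```c
   /* rss_bench.c -- VmRSS growth per idle native thread.
    * Build: cc -O2 rss_bench.c -lpthread */
   #include <pthread.h>
   #include <stdio.h>
   #include <unistd.h>

   #define NTHREADS 1000

   static void *idle(void *arg) { pause(); return arg; }

   static long vm_rss_kib(void)
   {
       char line[256];
       long kib = -1;
       FILE *f = fopen("/proc/self/status", "r");
       if (!f)
           return -1;
       while (fgets(line, sizeof line, f))
           if (sscanf(line, "VmRSS: %ld kB", &kib) == 1)
               break;
       fclose(f);
       return kib;
   }

   int main(void)
   {
       long before = vm_rss_kib();
       pthread_t t;
       for (int i = 0; i < NTHREADS; i++)
           pthread_create(&t, NULL, idle, NULL);
       long after = vm_rss_kib();
       /* note: pthread stacks are lazily committed, so RSS per thread
        * is far below the (typically 8 MiB) virtual reservation */
       printf("%ld KiB RSS for %d threads (%.1f KiB/thread)\n",
              after - before, NTHREADS, (double)(after - before) / NTHREADS);
       return 0;
   }
   ```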
5. There seems to be missing support for a voluntary `yield` for tasks that determine early in their time slice that they have nothing to do (typically waiting for I/O), but don't want to yield to the OS-level scheduler, as that would put the whole OS-level thread the psubthread is running in at a disadvantage. Having a native `yield` in psubthreads allows for implementing work stealing and zero-overhead MPSC/SPSC queues (think of goroutines and their channels), because it skips syscalls completely and makes tight loops better optimizable (cacheable, vectorizable). One could also see it from a different angle and say it is not a yield but rather a "steal" request, as in work stealing (work stealing is the most efficient known strategy: it keeps all caches hot, gives the OS scheduler the impression that the thread genuinely needs its full CPU time so it gets scheduled more frequently, and has other "benefits"). See the consumer-loop sketch below.
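   To make the idea concrete, here is a consumer loop over a lock-free SPSC ring that backs off into the user-space scheduler instead of blocking in the kernel. `psub_yield()`, `spsc_pop()`, and `handle()` are hypothetical names used only for illustration:

   ```c
   #include <stdbool.h>

   struct spsc;                           /* some lock-free SPSC ring */
   bool spsc_pop(struct spsc *q, void **item);
   void handle(void *item);
   void psub_yield(void);                 /* hypothetical: switch to the
                                           * next runnable psubthread */

   static void consumer(struct spsc *q)
   {
       void *item;
       for (;;) {
           if (spsc_pop(q, &item)) {
               handle(item);              /* work available: do it */
           } else {
               /* Empty: hand the slice back to the user-space
                * scheduler with a plain function call -- no syscall,
                * so the OS-level thread stays busy and its caches
                * stay warm. */
               psub_yield();
           }
       }
   }
   ```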
6. Any plans to support multicore non-uniform systems (e.g. big.LITTLE or NUMA architectures, or CPU grids like the Parallella with 256 cores)? I mean, the best approach seems to be to spawn N OS-level threads, where N is the number of CPU cores. But N has to shrink and grow automatically in response to events such as power saving (big.LITTLE, smartphones, notebooks, etc.), hot plug (CPUs removed or added), etc. In other words, this would mean enhancing the scheduler with migration of psubthreads among OS-level threads and with dynamic addition and removal of OS-level threads plus rescheduling. A worker-per-core sketch follows below.
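   The worker-per-core starting point is straightforward on Linux; a sketch that sizes the pool once at startup (reacting to hotplug or power events would mean re-reading the count and resizing the pool, which is the hard part):

   ```c
   /* workers.c -- one scheduler worker per online core.
    * Build: cc -O2 workers.c -lpthread */
   #include <pthread.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   static void *worker(void *arg)
   {
       /* hypothetical: run the psubthread scheduler loop here */
       (void)arg;
       return NULL;
   }

   int main(void)
   {
       long n = sysconf(_SC_NPROCESSORS_ONLN); /* cores online now */
       if (n < 1)
           n = 1;
       pthread_t *t = malloc(n * sizeof *t);
       for (long i = 0; i < n; i++)
           pthread_create(&t[i], NULL, worker, (void *)i);
       printf("spawned %ld workers\n", n);
       for (long i = 0; i < n; i++)
           pthread_join(t[i], NULL);
       free(t);
       return 0;
   }
   ```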
7. What about optimizing the cost of the switch between psubthreads by making it friendlier to caching, e.g. by running closely related tasks with temporal locality as mentioned in https://youtu.be/KXuZi9aeGTw, or at least by grouping them accordingly in OS-level threads, assuming (6) gets implemented?
8. Another would-be "optimization" could be using a different, "approximate" timer that doesn't depend on Linux-internal scheduling. First, ITIMER_VIRTUAL seems to require synchronization across all CPU cores to sum the consumed CPU times (probably at least at the frequency with which ITIMER_VIRTUAL fires; I didn't check, though), and second, it might lead to slight starvation if other processes keep all CPUs really busy. In this vein, I'd guess ITIMER_REAL would kind of "force" Linux to interrupt other processes, so the starvation would probably not happen (at the cost of less fair scheduling of psubthreads), but the issue with synchronization across all CPU cores would probably remain. Ideas? A sketch combining this with (9) follows after the next item.
9. I'd increase the timer frequency. The current 5000 µs is a very conservative choice (it will work well even on very old Linux systems) but rather unacceptable e.g. for real-time stuff like audio or video (e.g. an isochronous data stream of 250 packets per second). I'd suggest using the current Linux default tick frequency, which should be 250 Hz (a 4000 µs period).
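   A minimal sketch covering both (8) and (9): wall-clock preemption via ITIMER_REAL/SIGALRM at a 250 Hz (4000 µs) period. Whether this drops into the library's existing signal plumbing is an assumption, and the handler body is hypothetical:

   ```c
   /* timer_setup.c -- wall-clock preemption tick at 250 Hz */
   #include <signal.h>
   #include <string.h>
   #include <sys/time.h>

   static void on_tick(int sig)
   {
       (void)sig;
       /* hypothetical: flag the current psubthread for preemption */
   }

   void setup_preempt_timer(void)
   {
       struct sigaction sa;
       struct itimerval it;

       memset(&sa, 0, sizeof sa);
       sa.sa_handler = on_tick;
       sigaction(SIGALRM, &sa, NULL);  /* SIGVTALRM for ITIMER_VIRTUAL */

       it.it_interval.tv_sec = 0;
       it.it_interval.tv_usec = 4000;  /* 250 Hz instead of 5000 us */
       it.it_value = it.it_interval;
       /* ITIMER_REAL counts wall time, so the process gets its tick
        * even when other processes keep the CPUs busy; ITIMER_VIRTUAL
        * counts only this process's own user CPU time. */
       setitimer(ITIMER_REAL, &it, NULL);
   }
   ```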
10. With regard to (4), what about optimizing the stack size in a similar manner to Go? A lazy-commit sketch follows below.
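    Go grows stacks by reallocating and copying them, which needs compiler cooperation. Without that, a cheaper approximation on Linux is a large lazily-committed `mmap` reservation with a guard page, so only the pages a psubthread actually touches cost physical memory. A sketch, assuming the library allocates its own stacks:

    ```c
    /* stack_alloc.c -- big virtual reservation, lazy physical commit */
    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define STACK_RESERVE (1 << 20) /* 1 MiB virtual, committed on touch */
    #define PAGE 4096

    void *alloc_stack(void)
    {
        /* MAP_NORESERVE: don't charge swap for untouched pages. */
        char *p = mmap(NULL, STACK_RESERVE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            return 0;
        /* Guard page at the low end: an overflow faults instead of
         * silently corrupting the neighboring stack. */
        mprotect(p, PAGE, PROT_NONE);
        return p + STACK_RESERVE; /* stacks grow down; return the top */
    }
    ```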
11. Another optimization could be to use an array instead of a list, due to the huge difference in cache friendliness (every pointer dereference means "jumping randomly in memory", which in practice defeats any cache 😢). A run-queue sketch follows below.
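    Concretely, that would mean keeping the run queue as one contiguous array of contexts scanned round-robin instead of a linked list. A sketch with made-up types (nothing here is the library's actual layout):

    ```c
    /* Hypothetical run queue: contiguous array, round-robin walk.
     * Neighboring entries share cache lines, so the scan prefetches
     * well; a linked list would pay a cache miss per hop. */
    struct psub {
        /* ... saved context, stack pointer, ... */
        int state; /* 1 = RUNNABLE */
    };

    struct runq {
        struct psub *slots; /* one contiguous allocation */
        int          len;
        int          cur;
    };

    static struct psub *next_runnable(struct runq *q)
    {
        for (int i = 0; i < q->len; i++) {
            int j = (q->cur + 1 + i) % q->len;
            if (q->slots[j].state == 1) {
                q->cur = j;
                return &q->slots[j];
            }
        }
        return 0; /* nothing runnable */
    }
    ```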
12. Consider using `SIGURG` to avoid clashes with signals used by the app itself; for the reasons, see https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md#other-considerations. A sketch follows below.
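    The interval timers are hardwired to SIGALRM/SIGVTALRM/SIGPROF, so delivering the tick as SIGURG needs a POSIX timer (or a watchdog thread calling `pthread_kill()`). A sketch; the handler body is again hypothetical:

    ```c
    /* sigurg_timer.c -- 250 Hz preemption tick delivered as SIGURG
     * via a POSIX timer (timer_create allows any signal number).
     * Build: cc -O2 sigurg_timer.c -lrt */
    #include <signal.h>
    #include <string.h>
    #include <time.h>

    static void on_preempt(int sig)
    {
        (void)sig;
        /* hypothetical: request a psubthread switch */
    }

    void setup_sigurg_tick(void)
    {
        struct sigaction sa;
        struct sigevent sev;
        struct itimerspec its;
        timer_t tid;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_preempt;
        sa.sa_flags = SA_RESTART;      /* don't break the app's syscalls */
        sigaction(SIGURG, &sa, NULL);

        memset(&sev, 0, sizeof sev);
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGURG;      /* ignored by default, rarely used */
        timer_create(CLOCK_MONOTONIC, &sev, &tid);

        its.it_interval.tv_sec = 0;
        its.it_interval.tv_nsec = 4000000; /* 4 ms = 250 Hz */
        its.it_value = its.it_interval;
        timer_settime(tid, 0, &its, NULL);
    }
    ```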