keean/zenscript

Parallelism

Opened this issue · 79 comments

Unfortunately this discussion started in the Error handling thread.

The discussion derived from a post I made about C no longer being a low-level language, because of its mismatch with the fact that all scaling must now come from parallelism.

I hope that discussion will continue here instead.

How can we ever find any of our 1000s of comments of past discussion if they are buried in unrelated threads?

@keean wrote:

Actually you have that backwards, there is more information about parallelism available at runtime. This is why VLIW processors failed to outperform x86.

VLIW was based on the idea that the compiler could do more to extract parallelism from the code than the CPU, so they made a highly parallel CPU with many execution units that could run multiple instructions in parallel.

You mentioned this VLIW comparison and then VLSI/VHDL parallelism in our past discussion in the Concurrency thread. And ironically you made the same conflation of “low and high-level parallelism” which I will be correcting you on again herein.

I agree there’s more information at runtime relative to a compile-time approach that is parallelizing the same algorithm.

But your thought process is incorrectly applied to what I wrote. You conflated low-level runtime, with high-level compile-time. You must compare apples to apples. Your point is true for low-level runtime versus low-level compile-time.

I did not write “more information”. I wrote “more semantic information”. The distinction apparently means nothing to you but it should.

The “more information” at runtime is limited in its character by the semantics expressed in the code. The higher-level semantic information enables the programmer (not the compiler!) to make changes in the algorithm which are impossible for the low-level speculative pipelining parallelism to achieve, because the low-level transformations cannot change the high-level algorithm. A high-level algorithm designed to achieve parallelism can obtain orders-of-magnitude performance scaling; whereas the low-level speculative pipelining achieves much less than an order-of-magnitude increase, and that increase is not scaling with Moore’s law over time because the CPU designers have hit a limitation of nature which can’t scale. Advocating the futility of fighting against nature is inane, idiotic, myopic, etc.

In addition to my slamdunk point above, your point about runtime information can also be applied by a compiler+runtime at a higher semantic level, which the low-level pipelining can’t do. For example, with pure functions a map operation can parallelize the operation across all elements, but the decision about whether it should or not depends on the ratio of the overhead of parallelization relative to the runtime cost of the operation. So the runtime could measure this ratio and decide whether to parallelize the map. A compiler could estimate it, but for cases near the margin a runtime could measure it.
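
For concreteness, here is a minimal Go sketch of that runtime decision, assuming a pure per-element function and an assumed (made-up) per-goroutine dispatch overhead; a real runtime would calibrate the threshold rather than hard-code it:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// parMap applies the pure function f to every element of xs. It takes a crude
// single-sample measurement of the per-element cost and only fans out to
// goroutines when that cost plausibly exceeds the assumed dispatch overhead.
func parMap(xs []int, f func(int) int) []int {
	out := make([]int, len(xs))
	if len(xs) == 0 {
		return out
	}

	// Measure one application of f (f is pure, so the result is reusable).
	start := time.Now()
	out[0] = f(xs[0])
	perElem := time.Since(start)

	const dispatchOverhead = 2 * time.Microsecond // illustrative assumption
	if perElem < dispatchOverhead {
		for i := 1; i < len(xs); i++ { // ratio says parallelism won't pay: stay serial
			out[i] = f(xs[i])
		}
		return out
	}

	// Worth parallelizing: split the remaining elements across the cores.
	var wg sync.WaitGroup
	chunk := (len(xs) + runtime.NumCPU() - 1) / runtime.NumCPU()
	for lo := 1; lo < len(xs); lo += chunk {
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				out[i] = f(xs[i])
			}
		}(lo, hi)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(parMap([]int{1, 2, 3, 4, 5}, func(x int) int { return x * x }))
}
```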

Another tangential point is that when you claim that it’s too difficult for programmers to extract parallelism, you are ignoring the points made in the ACM article that extant PLs lack the proper designs to extract the parallelism they plausibly could without too much programmer intervention, such as the aforementioned pure function example. But my point has also been that ingenious programmers will find even more clever algorithms for parallelism.

That ACM article argued that C is no longer a low-level language because it lacks these facilities for the compiler to automatically extract the higher-level parallelism which the low-level speculative pipelining cannot extract, so CPU designers are forced to create virtual, microcoded pipelining hardware inside the CPU which C does not even expose to the programmer. IOW, parallelism is the only way to continue to scale with Moore’s law, and C encourages CPU designers to instead not scale and to waste most of the transistors on squeezing diminishing incremental improvements out of serial performance. That is a recipe for disaster, with the future of computing becoming stagnant and technology which relies on computing failing to improve. Thus it behooves us as PL designers to address this issue.

Essentially they stripped the speculative out-of-order hardware out of the CPU and used that space for extra execution units. This relied on the compiler to produce the parallel instruction bundles. What they found was it could only outperform the x86 on very specific scientific computation tasks, and failed to be faster for general computation.

Because they were running serial algorithms. Nowhere have I claimed that running serial algorithms on parallel hardware will be faster than running serial algorithms on speculative pipelining. Thus you shouldn’t write this irrelevant information which doesn’t even apply to my point.

My point was about the need to parallelize our performance critical code paths, otherwise our programs will not scale. This is a statement of fact, which none of your obfuscating BS has refuted.

I go with evidence over speculation every time.

Me too. I will also choose speculative parallelism because of the evidence, but at a higher semantic level, which is for example how 3-SAT is made practical for some working sets.

And you will go with evidence? Then why do you ignore the evidence that future improvements in serial speedups are dying? And why am I repeating this for the 4th or 5th time?

I simply can’t believe you’re this slow minded, because you’ve demonstrated to me in many instances that you have a sharp intellect. Why the cognitive dissonance here? Is it because you lack divergent thinking? Is it some obstinance issue? Are you preoccupied? Are you being lazy, not really thinking it out, and relying on past conclusions you had made instead of reanalyzing them in light of my points?

keean commented

@shelby3

3-SAT is made practical for some working sets.

And 3-SAT is not general computation, it is a very specific algorithm, for a class of problems, but it looks very different from general computation.

The general part of most programs is an event loop that queues sequential processes, that necessarily follow one another, or are connected by streams in the pipelined case.

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I agree that a new language has to handle parallelism. There are two ways I would do this: one is vectorisation, so high level operations like matrix multiplication can use SSE vector instructions for performance. The other is 'goroutine'-like parallel tasks communicating by streamed channels.

However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.
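
For concreteness, a tiny illustrative sketch of that second style in Go itself: parallel sequential stages connected by streamed channels.

```go
package main

import "fmt"

// generate and square are each a sequential task; the two stages and main run
// in parallel, connected by streamed channels, and each channel closes when
// its producer finishes.
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 0; i < n; i++ {
			out <- i
		}
	}()
	return out
}

func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // 0 1 4 9 16
	}
}
```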

@keean wrote:

The general part of most programs is a event loop that queues sequential[ly ordered] processes, that necessarily follow one another, or are [analogously sequentially ordered by being] connected by streams in the [Unix shell paradigm] pipelined case.

ftfy.

Sequentially ordered != sequential process. There may be parallelism within the individual processes each of which are sequentially ordered. I know you know that nature is fractal, so there can be parallelism within serialism and vice versa.

Also the events in the event loop which are in the performance critical path are typically not sequentially ordered and can be performed in any order or even in parallel in some cases. The GUI events which are ordered are typically not performance critical (we do not need to scale the performance of GUIs by orders-of-magnitude).

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I have not looked into the limitations of the designs of existing GPUs. It seems that the ones built into the CPU don’t have the fast GBs of dual-ported main memory that dedicated GPU cards have? And what are the issues with exchanging work between CPU threads and GPU threads? Are the cost barriers high enough that the two processors can pretty much only send messages to each other? I know you’re reasonably knowledgeable about the extant hardware designs, so I will defer to your knowledge on this.

But yeah, maybe our compiler should be doing more to leverage that GPU (and also the CPU’s multicore) hardware which often sits idle.

I agree that a new language has to handle parallelism.

I think my recent point can be recapitulated as: the more parallelism we push for in code, the more incentive CPU designers will have to prioritize high-level parallelism instead of low-level pipelining, hopefully resulting in scaling trending back to Moore’s law progress.

My point is we should do all we can to reduce dependence on the low-level speculative pipelining in our design decisions. For example, in the Error Handling thread you argued that exceptions as return values would not incur an extra performance cost because of the low-level speculative pipelining. But that design decision, if taken, would have further cemented the need for low-level speculative pipelining and helped kill future scaling. And upon further analysis of your proposal, independent of the concerns about reliance on speculative pipelining, it became clear we didn’t need to accept that design decision.

There are two ways I would do this, one is vectorisation, so high level operations like matrix multiplication can use SSE vector instructions for performance. The other is by 'goroutines' like parallel tasks communicating by streamed channels.

Not just parallelism: we also need to diminish our reliance on the model of main memory as having latency low enough to keep up with a serial processor, because this requires expending the transistor budget on complex, deleterious caching mechanisms.

The ACM article also points out other things we can improve, for example reducing the reliance on cache coherency, which is necessary in order to enable hardware designers to prioritize parallel computation units instead of the very costly (inefficient!) model of low-latency main memory.

For example, that ACM article pointed out that our PL could allow the compiler+runtime to reformat arrays of structures into structures of arrays, without intervention from the programmer (see the sketch after the quote below). Also I wrote:


Let me quote what I wrote in my original post:

So the author of the article explained that instead of complex caches (and to accommodate the large working sets that the Transputer didn’t) we perhaps need an architecture where we have so many threads running that the latency to shared memory can be offset by computation these threads have to do in between accesses to main memory such that the CPU is always fully utilized. IOW, while some threads are waiting on main memory, other threads are running computation.

Note AFAIK that most GPUs cards include a lot of shared memory with a very fast, wide bus. So perhaps they’re already close to what is optimal even for large working sets.
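
To make the arrays-of-structures point above concrete, here is a hand-written Go sketch of the AoS → SoA transformation that the article wants the compiler+runtime to perform automatically; the Particle type and its fields are purely illustrative.

```go
package main

import "fmt"

// Array of structures (AoS): fields of different particles are interleaved in
// memory, so a loop that touches only X wastes cache bandwidth on Y and Mass.
type Particle struct {
	X, Y, Mass float64
}

// Structure of arrays (SoA): each field is contiguous, so a loop over X is a
// pure sequential scan that vectorizes and caches well.
type Particles struct {
	X, Y, Mass []float64
}

func toSoA(aos []Particle) Particles {
	soa := Particles{
		X:    make([]float64, len(aos)),
		Y:    make([]float64, len(aos)),
		Mass: make([]float64, len(aos)),
	}
	for i, p := range aos {
		soa.X[i], soa.Y[i], soa.Mass[i] = p.X, p.Y, p.Mass
	}
	return soa
}

func main() {
	aos := []Particle{{1, 2, 3}, {4, 5, 6}}
	fmt.Println(toSoA(aos).X) // [1 4]
}
```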


However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.

Not always true. Again nature is fractal.

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

And the corollary is that expending it to speed up serial algorithms that are in the performance critical path of the application is futile non-scaling.

keean commented

@shelby3

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Having 4 cores does not make up for this, just look at the failure of the Power Architecture. PowerPC was IBM's attempt to follow that approach, lots of simpler cores. It was tried in the PlayStation 2, and the added complexity of programming the thing drove developers away, leading to the dominance of the Xbox, as it provided a proper PC-like CPU with caches and super-scalar out-of-order processing.

The ideas you and that article are proposing are not new, they have been tried, and they failed the test of the free market.

@keean wrote:

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Do you not understand what I intend by “are not in the performance critical path of the application”?

It can’t be slower in any relevant way, if it is not in the performance critical path.

Whoop-de-doo if you make your highly responsive GUI more responsive and nobody can discern the speedup, because it is not code in the performance critical path. Law of diminishing returns aka marginal (non-uniform) utility.

Another corollary is that only 5% (or maybe up to 20%) of your application’s code is typically in the performance critical path, although there are exceptions. Thus the less stellar programmers will still have a lot of work they’re qualified to do that can be serial algorithms.

Btw, one (or maybe 2) core in the CPU could retain that speculative pipelining and the other 8, 16, or 24 (and increasing in the future) could then not waste their transistor budget on marketing hype. Has AMD already designed a CPU like this, such as with most of the transistor budget for the GPU on the same die?

Yet another example of parallelism I think the compiler can extract without intervention from the programmer.

keean commented

@shelby3

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency: fine when you are doing a batch job, but bad for an interactive desktop computer where low latency is critical.

@keean wrote:

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency: fine when you are doing a batch job, but bad for an interactive desktop computer where low latency is critical.

I did not propose to employ parallelism in code that is not in the throughput performance critical path. Thus latency would not be impacted in the (typically up to 95% of the) code of a program that is not in the throughput performance critical path. Since that 95% of the code can only run serially, it can’t leverage the other cores. Thus one or two cores with the speculative pipelining are sufficient on the desktop (and servers will eventually discard the speculative pipelining entirely once we improve the software). The rest of the transistor budget can be allocated to parallelism, which is the only way to scale.

You could argue that running multiple copies of the program such as for a webserver where each request runs a separate copy of the program, thus would need every core to have speculative pipelining in order to minimize latency. But I would argue that a 50% reduction in CPU-bounded latency is not worth killing the scaling (by Moore’s Law) of the number of cores and thus number of requests that can be served per CPU. This is why Xeon (which sacrifices clock rate for more cores) instead of desktop Haswell/Broadwell architecture runs on most servers. So Xeon is sacrificing latency for parallelism.

Also Ryzen is kicking ass on Intel precisely because they made the right tradeoff for where computing is headed, “Ryzen CPUs offered stronger multi-threaded performance and weaker single-threaded performance relative to comparable Intel CPUs.”

Sparc was too geared towards parallelism too soon. The software had not caught up yet. AMD is probably hitting the sweet spot for the transition phase. They decided to emphasize their strengths in design and parallelism (GPUs) and outsource the foundry (eventually leveraging the coming strength of China in foundries!). AMD is following Android’s model and Intel is following Apple’s model, so I expect AMD to end up with more market share than Intel unless they fuck up again to snatch defeat from the jaws of victory. Also AMD has rising capital to expend because of the rising demand for GPUs for mining cryptocurrencies. But eventually something like Sparc or GPUs will come back much stronger in the future when the software adapts to more parallelism. And look who continues Sparc now. Japan! The Asians are going to kick ass in the future.

Also the latency is typically not bounded by the CPU but rather by the persistent storage media, e.g. SSDs. So there is sometimes (or maybe even typically?) ample headroom to allow for parallelism and to discard the speculative pipelining without increasing latency, although this can vary in each situation. You could attempt to make the argument that the latencies of the CPU and persistent storage are additive, but smart coders know to mask the CPU latency within the persistent storage latency with interleaving. Ditto for masking CPU throughput and latency within memory latency, via interleaving computation with read/write accesses.

I farted:

If for example that is an image filter, then split the image up into tiles and parallelize the entire filter chain over the tiles. If necessary, overlap the tiles slightly when the filter’s input domain exceeds its output domain.
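
A minimal Go sketch of that tiling idea, using a pointwise filter for brevity (a neighborhood filter such as a blur would additionally overlap the tiles by the filter radius, as noted above):

```go
package main

import (
	"fmt"
	"sync"
)

// filterTiled applies a per-pixel filter to a grayscale image by splitting the
// rows into tiles and running the filter over each tile in its own goroutine.
func filterTiled(img [][]float64, tileRows int, filter func(float64) float64) {
	var wg sync.WaitGroup
	for top := 0; top < len(img); top += tileRows {
		bottom := top + tileRows
		if bottom > len(img) {
			bottom = len(img)
		}
		wg.Add(1)
		go func(top, bottom int) {
			defer wg.Done()
			for y := top; y < bottom; y++ {
				for x := range img[y] {
					img[y][x] = filter(img[y][x])
				}
			}
		}(top, bottom)
	}
	wg.Wait()
}

func main() {
	img := [][]float64{{1, 2}, {3, 4}, {5, 6}, {7, 8}}
	filterTiled(img, 2, func(v float64) float64 { return v * 0.5 })
	fmt.Println(img)
}
```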

Some parallelism discussion in the Concurrency issues thread #17.

@shelby, do you know how the RAM capacity of a normal PC board will increase in the future? A massive increase in core counts is one good thing for the future, but to run multiple instances simultaneously we also need more RAM capacity.

keean commented

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.

@keean wrote:

It's still speculative out of order with up to 32 cores.

64 hyperthreads on a desktop! Wow.

Yeah I had realized that removing the speculative out-of-order doesn’t gain much. But the point remains that further increases due to Moore’s law are going into cores now. There’s no more that can be squeezed out of OoO execution.

Remember I wrote recently in the Concurrency issues thread #17:

So again I ask you, “Do you know what percentage of the silicon of Intel CPUs are for the speculative execution and register rewriting as contrasted with the caching?”

Apparently only about 10 – 15% of the transistor budget is expended on OoO execution (the percentage of the TDP power budget may be higher):

https://hothardware.com/news/amd-ryzen-die-shot-cache-structure-architecture-advantages-exposed

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Core_2

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Core_2

So apparently there is not much sense in removing it for purposes of increasing the number of cores.

Although it would still perhaps be beneficial to turn it off when we don’t need it (i.e. in I/O bound concurrency) in order to conserve power consumption (important on mobile and in data centers), and when we need our code to surely be free from timing attacks that might be lurking in the nondeterministic out-of-order speculation.

Nevertheless the point remains that the transistor budget is increasing faster than transistors can be consumed by extant core counts to increase performance. IOW, the OoO execution and superscalar features can’t be improved further. So Intel and AMD are forced to use those transistors for increased core counts.

So more extensive use of concurrency and parallelism is unavoidable.

@sighoya I don’t know the plans about DRAM. I presume they’ll scale everything up. It’s the latency which can’t keep pace. But with massive concurrency then we can coalesce main memory accesses as GPUs do.

@keean wrote:

But we need the L3 cache.

Well actually, one day in the distant future when the concurrency is massive, then we can coalesce main memory (and also persistent storage) accesses with concurrency and we won’t even need the L3 cache, but that is I think too far into the future to design for now.

For that sort of use C + OpenMP would be a good target.

Found this:

AMD told us you should theoretically be able to build a Threadripper system with up to 1TB of memory when 128GB LR-DIMMs are used—hopefully enough to hold you for a few years. (AMD also says Threadripper should technically be able to support up to 2TB of RAM, although the company hasn’t validated this because there are no DIMMs that support the capacity yet.)

keean commented

@shelby3
Think of it like this: CPUs like Threadripper are best when you have threads that need to take different paths, or run different tasks. GPUs are best when all the threads run in lockstep, all taking the same code path in parallel.

On a CPU you might launch parallel threads, and then wait for them all to complete (a rendezvous); on a GPU you know all the tasks finish at the same time. So a GPU is going to be more efficient for loop parallelisation (as long as it's a big enough chunk of work to overcome the cost of launching the task and collecting the results). A CPU is going to be more efficient at having one thread per service connection, dealing with client connections etc.

keean commented

@shelby3 A good parallel language should have different syntactic structures to represent these different kinds of parallelism, so the programmer is clear how things will execute and how much they will cost (the principle of visibility, and not paying for what you don't use). You could automate this with heuristics, but then the programmer has to play chase-the-heuristic, where small, seemingly inconsequential code changes can tip the balance and result in unpredictable changes in performance.

@keean made me aware of this:

Microsoft is working on a new paradigm for increased parallelism:

Now Microsoft ports Windows 10, Linux to homegrown CPU

Explicit data graph execution - Wikipedia

Also remember I predicted this earlier in this discussion about the death of speculative superscalar architecture and shared caching between threads:

OpenBSD disables Intel’s hyper-threading over CPU data leak fears

https://www.itwire.com/security/83347-openbsd-chief-fix-for-new-intel-cpu-bug.html

And remember that there is no way to disable timing information (can merely increment a variable in a loop):

https://gruss.cc/files/fantastictimers.pdf

The only 100% safe way to stop code from software timing is to make all code run at a continuously, randomly changing clock speed, which isn’t desirable nor realistic.
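
To illustrate the point about incrementing a variable in a loop, here is a toy Go sketch of such a software timer: one goroutine does nothing but increment a counter, and any other code can sample that counter before and after an operation to recover relative timing (the ticks are uncalibrated, which is often enough for a relative comparison).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	// The "timer": a goroutine that only increments a shared counter.
	var ticks int64
	go func() {
		for {
			atomic.AddInt64(&ticks, 1)
		}
	}()

	before := atomic.LoadInt64(&ticks)
	work := 0
	for i := 0; i < 1_000_000; i++ { // the operation being timed
		work += i
	}
	after := atomic.LoadInt64(&ticks)
	fmt.Println("work:", work, "elapsed ticks:", after-before)
}
```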

OS preemptive threading may to some extent foil the use of timers which rely on incrementing a variable and non-preempted execution, but the devil is in the details. I don’t know the minimum interval needed to measure a timing attack with software timers. And if the OS preempts too frequently then overhead increases, which eats into the performance that was supposed to be gained from those hardware performance features (speculative execution and shared caches) which make the timing attacks possible. Thus I’m not sure that removing hardware timers is a complete solution. Also some software needs high-precision timers. Thus I still propose that in the future we will need the option to have code request that it be run only on cores with that vulnerable performance enhancements turned off and/or to dictate which other code can run along with it in the other hyperthreads which share the same vulnerable performance enhancements. Code that prioritizes security over maximum performance or which employs suitable parallelism may want to turn those vulnerable performance enhancements off.

EDIT: if you want some flavor of why correlation attacks can be so insidious and we should not think we can so easily squash them just by removing a hardware counter.

keean commented

@shelby3 I am not sure, I think for most people performance is more important than security. I mean you can still exploit "rowhammer" on DRAM, but people are still using DRAM :-)

As I wrote previously, I think eventually we will end up with a way for applications to turn it off on a per core basis so that applications can decide whether they need to prioritize security (and battery life) over serial performance.

I want to make my stance clear because I learned in private messages that some misunderstood my stance. In my post that responded to Eric Raymond’s blog post and his claim about the invariant of SICK algorithms requiring serial performance, and in the discussion here in Github that has ensued, I didn’t intend to imply that we will overturn mathematical complexity theory. Rather my stance is that we will replace the need for those algorithms by solving problems with different strategies which can be parallelized and that do not require those algorithms. My stance is that we have no choice, and humans necessarily rise to the challenge and are ingenious when the market demands them to be.

We may also find algorithmic strategies to apply to existing SICK algorithms which leverage parallelism, but that wasn’t my central thesis.

@keean can you concur or correct my thought that §4. Challenge #‍1: Traffic on pg.3 of the paper Why On-Chip Cache Coherence is Here to Stay (which also appeared in a journal) contains an error in logic:

We now tackle the concerns regarding the scalability of coherence’s traffic on the on-chip interconnection network. To perform a traffic analysis, we consider for each cache miss how many bytes need to be transferred to obtain and relinquish the given block
[…]
In a coherent system, when a core reads a block that is shared, the coherence protocol may need to forward the request, but to at most one core (and thus the traffic for each read miss is independent of the number of cores). However, when a core incurs a write miss to a block that is cached by one or more other cores, the coherence protocol generates extra messages to invalidate the block from the other cores. These invalidation messages are often used to argue for the non-scalability of cache coherence, because when all cores are sharing a block, a coherent system must send an invalidation message to all other cores. However, our analysis shows that when sharers are tracked exactly, the overall traffic per miss of cache coherence is independent of the number of cores
[…]
Consider the access pattern in which a block is read by all cores and then written by one core. The writer core will indeed be forced to send an invalidation message to all cores, and each core will respond with an acknowledgment message (i.e., a cost of 2N messages for N sharers). However, such an expensive write operation can occur only after a read miss by each of those cores. More generally, for every write that invalidates N caches, it must have been preceded by N read misses. The traffic cost of a read miss is independent of the number of cores

Although they’re correct that the bus traffic overhead of the invalidation messages is constant relative to the traffic of the read misses, the cost of coherency for per-core private caches is that every write invalidation is by definition going to cause an additional read miss for each core that still needs access to anything (that wasn’t even invalid but was included) in the block (aka cache line) that was invalidated. IOW, the overall traffic for per-core caches due to write invalidation grows with the number of cores, so it doesn’t scale. Or more simply stated: sharing mutable data between cores that have private caches doesn’t scale.
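
A toy worked example of that amplification (symbolic numbers of my own, not from the paper):

```latex
% Toy model: one writer core and N reader cores share a block; the writer
% performs W successive writes and every reader re-reads the block after each
% invalidation.
\text{read misses} \;\approx\; \underbrace{N}_{\text{initial}} \;+\; \underbrace{W N}_{\text{re-reads forced by invalidations}}
\qquad
\text{coherence messages} \;\approx\; W \cdot 2N \;\;(\text{invalidations} + \text{acks})
% Traffic per miss stays O(1), which is the paper's claim, but total traffic for
% the same working set grows in proportion to N, which is the objection above.
```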


EDIT: I received an email reply from one of the authors of that paper. He concurred, but pointed out that my point is only true when there exists the “unusual situation” of “false sharing” of “unrelated data … collocated on the same cache block, leading to needless coherence traffic.” His opinion is that with “good data placement, false sharing shouldn't be a big problem.” I responded that I disagree, because there will be cases where the record of related fields being written to also needs to be read again, so it’s not necessarily due to false sharing and thus not necessarily unusual or uncommon. Thus I continue to believe that this discovery (mostly @keean’s, with my follow-up insights) is very important. However I will say I don’t have any statistics or functioning models to check the frequency of this issue and its quantitative impact.


I don’t necessarily disagree with the title of the paper though once writable shared data is disallowed in the system. There’s still some coherency required…

If we adopt @keean’s suggestion in the Pointers #38 issues thread that all data shared between threads must be immutable (aka read-only), then my realization is that the only time the per-core caches need to be invalidated is when an immutable data structure has been deallocated and the memory it occupied is allocated anew for some other data. That is because the invalid data remaining in the cache will not be accessed by any program which respects the allocation, until the same area in shared memory has been allocated anew. The allocator could optimize this further (so that other data in the same block which is still validly cached is not prematurely invalidated) by refusing to allocate anew the memory that the deallocated immutable data structure formerly occupied, until all of the surrounding memory which comprises the entire block to be invalidated (regardless of whether that entire block is loaded into any per-core cache) is also deallocated. Thus the invalidation broadcasts (to the per-core caches) would only happen when the entire block in shared memory is no longer in use and has accepted a new allocation.

This does conflate software and hardware though. One way to implement it is to have a system call for invalidating blocks and rely on software to manage this correctly for its virtual memory pages.
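
A toy Go sketch (purely illustrative) of just the bookkeeping for that policy, assuming 64-byte blocks: reuse of a block’s addresses, and the accompanying invalidation broadcast, is deferred until every object in the block has been freed.

```go
package main

import "fmt"

const blockSize = 64 // bytes per cache block (assumption)

// allocator hands out addresses with a bump pointer and tracks, per block, how
// many immutable objects allocated from it are still live. Only when a block's
// live count drops to zero would the software-driven invalidation broadcast be
// issued and the block become eligible for reuse.
type allocator struct {
	heap []byte
	next int         // bump pointer
	live map[int]int // block index -> live object count
}

func newAllocator(size int) *allocator {
	return &allocator{heap: make([]byte, size), live: map[int]int{}}
}

func (a *allocator) alloc(n int) int {
	addr := a.next
	a.next += n
	for b := addr / blockSize; b <= (addr+n-1)/blockSize; b++ {
		a.live[b]++
	}
	return addr
}

func (a *allocator) free(addr, n int) {
	for b := addr / blockSize; b <= (addr+n-1)/blockSize; b++ {
		a.live[b]--
		if a.live[b] == 0 {
			// Whole block no longer in use: only now broadcast the
			// invalidation and allow the block's addresses to be reused.
			fmt.Printf("block %d fully freed: broadcast invalidate, allow reuse\n", b)
		}
	}
}

func main() {
	a := newAllocator(1024)
	x := a.alloc(32)
	y := a.alloc(32) // same 64-byte block as x
	a.free(x, 32)    // block still has y live: no invalidation yet
	a.free(y, 32)    // block fully freed: invalidate and allow reuse
}
```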

The conclusion section of that paper states:

First, this paper does not do detailed architectural simulations with specific benchmarks. We instead show that coherence’s per-miss traffic is independent of the miss pattern and number of cores. Although less precise than detailed simulation, our results are actually more robust as they are not limited to the specific benchmarks studied.

They actually failed to consider the holistic issue, which architectural simulations under scaling would presumably have illuminated. They considered each read miss independently, without considering (and thus implicitly presuming the absence of) the fact that write invalidation drives more read misses, thus amplifying traffic.

Second, we did not compare against multicore chips without caches or without a shared address space.

They also failed to consider the case of an Erlang-like model wherein all shared data is immutable.

P.S. I emailed the authors of that paper a link to this post.


Additionally immutable shared data can facilitate less complexity for cache-to-cache transfers as explained in 3) Cache-to-Cache Transfers of subsection B. Coherence Protocol Independent Design Considerations in §III. IMPLEMENTATION CONSIDERATIONS on pg. 7 of the research paper An Evaluation of Snoop-Based Cache Coherence Protocols:

3) Cache-to-Cache Transfers: To improve the timeliness of requests for data that is cached somewhere, it may be useful to allow another cache to supply the data rather than always going to memory. To support cache-to-cache transfers, each BIU must have a way of indicating whether or not its cache has the data and is able to supply it. This can be accomplished by means of a single wired-OR bus control line, which is asserted by each BIU if its cache is able to supply the data. The memory controller will also know that it should not supply the data if this control line is asserted. If several caches are able to supply a copy of the data, there must either be a way of selecting one of those caches (by using a priority or ownership scheme, for example) or it must be guaranteed that all caches will put the same value on the bus. In [1], it is noted that this additional complexity and the potential for slowing down the processors that have to provide the data from their cache has resulted in a limited use of cache-to-cache transfers.

Another issue with cache-to-cache transfers is whether or not one of the caches has the cache block in question in a Modified state, which can be indicated by another wired-OR bus control signal asserted by the BIU. If one of the caches has a modified copy of the block, it should not only be the cache that is chosen to supply the data, but it should also inhibit memory from supplying the data since memory has a stale copy [1][3]. As the cache is supplying the modified data to the requestor on the system bus, it is possible to write the modified value back to memory at the same time (as opposed to writing it back in a separate bus transaction). While this saves a bus transfer and improves bus performance, it has the additional complexity of having three cooperating members on the bus [1].

And cache-to-cache transfers provide power-savings as explained in subsection F. Energy Efficiency of Cache Coherence Protocols of §II. SURVEY OF EXISTING CACHE COHERENCE PROTOCOLS on pg. 5 of the same research paper. And also explained in §The case for cache coherency of the article How Cache Coherency Impacts Power, Performance.

Read-only shared data is actually faster and uses less power on existing systems which employ the MESI protocol.

This post contains a shocking (at least to me) discovery: Java and Go multi-threading with a global-heap, tracing GC is entirely incompatible with scaling multi-core. Scaling multi-core is entirely incompatible with multi-threading that shares mutable data between caches. Additionally, tracing GC is incompatible with scaling multi-threading unless each thread (or small group of threads) has its own separately collected heap, because as I explained, a collector which isn’t instantaneous can’t be shared between too many threads: it would pause too many threads and make threads dependent on the good allocation behavior of other threads. So scaling multi-core will require reference counting for immutable objects shared between caches (and the immutability will aid in probing and collecting cyclic references, while not stalling any threads during such cycle detection and collection).

The GC of objects which aren’t shared between caches must be congruent with a viable general purpose programming model. Intel’s Xeon Phi (aka Larrabee) suffered from not sufficiently addressing these requirements2. One option is Rust’s zero overhead, 100% stack allocation with its lifetime and exclusive borrowing model providing safe reentrancy. Code would be grouped to be run only by hardware threads sharing the same cache (to avoid write-back and cache-to-cache transfer coherency overhead). Another option is similarly to restrict each logical grouping of code to hardware threads sharing the same cache, but to employ a separate GC’ed heap for each said grouping. For example, each green thread could have its own separate tracing GC’ed heap. Note this means an M plurality of work stealing green threads per core that has N = # of hyperthreads in M:N threading. However, this latter option lacks Rust’s zero overhead abstraction of compile-time provably safe thread reentrancy, thus suffering the race condition (e.g. live and deadlock) errors and the runtime overhead of locking. (EDIT: but this deficiency is solved with Actors.)

Additionally for cache-to-cache transfer efficiency and so that lightweight (e.g. green) thread context switches could occur to mask latency of such transfers, then ideally the programming model would enable the thread scheduler to know via software that such a transfer is required.

Serial core performance can’t scale and many-core doesn’t scale with complex cache coherence. At least it doesn’t scale if writable shared data is employed by the software. And if all the software will be modified to not use writable shared data, then the complex cache coherence circuitry is wasted silicon. In that case perhaps a less complex or even software driven coherence could be employed instead. If all cache-to-cache transfers are only needed when passing an Erlang-style, Actor model message from a thread on one cache to a thread on another cache, then the software can tell the hardware the source and destination caches for the transfer. (Note this means a plurality of Actors per core running on top of M work stealing green threads per core that has N = # of hyperthreads in M:N threading). Remember per the prior post that cache-to-cache transfers are needed so as to avoid (and if not read-only or was newly created, then also avoid the write-back to and) the load from main memory which is expensive both in terms of latency and power consumption.

The Rigel design offers some rough estimate of the improvement in computation density (e.g. FLOPS/mm² of silicon) by removing the:

  • out-of-order speculative execution
  • deep pipelines
  • complex cache coherence
  • cache-to-cache transfers
  • SSE vectorized SIMD
  • virtual paging, virtualization, etc (that a GPU doesn’t have either)

It was modeled on a 45 nm process. It has 1024 RISC cores (with two-wide-issue superscalar ILP), 15 MB total SRAM cache, operating at 1.2 GHz with a 100 W TDP. Compare the Intel i7 Lynnfield on a 45 nm process, which achieves only 4 CISC cores with 8 MB total SRAM cache, operating at 2.8 GHz with a 95 W TDP.

Ignoring latency differences and considering only throughput density1, I doubt the Lynnfield in 2009 had the 180 instructions in flight simultaneously that current modern Intel processors have. For one thing, Lynnfield didn’t even have hyperthreading. Let’s presume maybe it had 40 in flight and the Rigel only 2. So unless 1024 ÷ (40 ÷ 2) ÷ 4 ≈ 13 RISC instructions are required to match the work achieved by every CISC instruction, the Rigel may have a higher compute density than the Lynnfield. Also the in-flight instructions on the Intel include some wasted computation that will be discarded by the speculation.

The Rigel paper discusses some of the other variants2 of parallelism throughput density designs which have been attempted.

One of the key distinctions1 is between the GPU vectorized (aka SIMD) model, which is able to mask main memory latency with a massive number of cheaply context-switched threads (affording an order-of-magnitude higher memory bandwidth by trading off increased latency), and the non-vectorized (aka MIMD) general purpose programming model, which prioritizes serialized low latency and serialized performance at the expense of lower computation density and higher power consumption per unit of work completed (i.e. lower overall efficiency). The latter has an order-of-magnitude lower main memory bandwidth with lower but still high latency (c.f. also), requiring a large L3 cache to bring average latency down sufficiently for the serialized performance. As explained in the Rigel paper, the former is not suitable as a fully generalized programming model.

The following is an idea for a variant of a Rigel-esque design (yet with software-driven cache-to-cache transfers and cache coherence) that replaces the L3 (and L4) cache with a solution of comparable latency, by masking most of the latency of main memory reads. Lightweight green threads are a suitable paradigm for masking macro-scale latency1, but they aren’t efficiently preemptive enough to generally mask main memory latency. However, main memory latency will only significantly occur on accessing reference counted shared objects (i.e. shared between caches), except for cache spill-over eviction of non-shared objects, which should be minimized by the optimizing programmer. Accessing those reference counted shared objects probably mainly means reading them via cache-to-cache transfers (perhaps with software-driven cache coherency driven by Erlang-style, Actor model message passing). So the compiler can insert a green thread check-point for preemption by the memory controller (which ameliorates the lack of preemption efficiency). Contrary to the incorrect association of green thread context switches with kernel thread/scheduler context switches, green thread context switches are low latency. At a function boundary, they’re only slightly more than the ~30 cycles latency of a jmp instruction, because only the SP register has to be saved+restored (given that the other registers and flags are invalidated at the boundary). But in this case the preemption would not be at a function call boundary, so all the registers and flags have to be stored and restored, which may double the latency. The memory latency is ~100 ns on CPUs (and ~400 ns on the GPU), which on a 2.6 GHz CPU is 260 cycles. But this really needs hardware support that the software can query, so that in the case where the main memory being accessed is already cached (which the software wouldn’t know) the unnecessary context switch is avoided. The extra latency for storing and restoring the other registers and flags could sometimes be eliminated by prefetching to the cache at the start of the function containing the main memory accesses; no registers nor flags are yet in use at the start of a function.

Hardware-based cache coherency (which is required without this software directed coherency we’re contemplating) obfuscates the true costs of the hardware form of coherency.

1 Sufficient threaded parallelism can mask latency so that hardware is not idled. The GPU achieves this with a huge register file (RF) and by putting 32 threads in a contiguously context-switched warp, so context switching has low overhead (and also reduces the control circuitry needed per execution unit). The modern i7 hyperthreading employs dedicated registers for the hyperthreads, which share execution pipeline ports and run simultaneously. Thus hyperthreads, although masking some latency, are also intertwined with a focus on serial performance. Green thread work stealing, which can run on modern CPUs without hardware support, employs compiler-generated checkpointed pre-emption (not the same as full pre-emption), which could probably be made more efficient with dedicated hardware support. Green thread context-switches are reasonably efficient compared to OS threads, especially where context-switched at function boundaries, but they’re only suitable for masking the macro-scale latency of software-blocked operations, which does not include low-latency main memory accesses. They may also be able to mask efficiently the latency of waiting on a mutex lock to be freed, without hardware assistance.

The GPU trades off latency in every facet for increased parallelism, by eliminating synchronization assumptions for caches and global memory, as well as for example reducing power consumption (also) by designing a much higher, 24 cycle write latency (archived) (c.f. also) into the register file (RF) than the per-core registers of a CPU have. The RF (and shared memory) is shared by all threads in the block, so a thread context switch is very low overhead because, unlike on the CPU, no registers need to be saved.

The GPU’s shared memory also has a much higher latency than L1 cache on the CPU and doesn’t serve the same function as a CPU cache. Also the shared memory is banked (each of 16 or 32 successive words lives in a successive bank), so that if threads in the same block read different locations in the same bank on the same clock cycle then they will run serially instead of in parallel. All 32 threads of a warp must be unblocked in order for the warp to execute, but even if they are unblocked they aren’t guaranteed to execute uninterrupted. Note, given 32 threads attempting uniformly distributed random reads from 32-banked shared memory, probability math informs us that about 63% of the threads would run in parallel every cycle. Thus such a divergent, non-vectorized access pattern would require four (4) times the 24 cycle latency for each step forward of the warp (thread block) that requires all 32 threads to access the shared memory, because (0.37 ^ 4) × 32 < 1, so that less than one (1) thread remains blocked for each such step (since all threads in the warp must be unblocked in order for the warp to execute). This is a reason the GPU is not suitable for general purpose programming.
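
A quick back-of-envelope check of those figures (same simplifying independence assumption as above):

```go
package main

import (
	"fmt"
	"math"
)

// 32 threads each pick one of 32 banks uniformly at random. The expected
// fraction of threads serviced per cycle is 1-(31/32)^32, and treating the
// blocked fraction as an independent per-cycle probability gives the
// "four cycles per warp step" estimate used above.
func main() {
	serviced := 1 - math.Pow(31.0/32.0, 32)
	fmt.Printf("fraction serviced per cycle ≈ %.2f\n", serviced) // ≈ 0.63
	blocked := 1 - serviced
	fmt.Printf("expected threads still blocked after 4 cycles ≈ %.2f (< 1)\n",
		math.Pow(blocked, 4)*32)
}
```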

So on the GPU need to mask latency by either running many more than 8 threads per SM (i.e. “More concurrent threads”) and/or serially queuing up in the RF many copies of each step of an algorithm, before processing the next step for each of those copies (i.e. “More independent instructions (ILP) within a thread”).

There’s no cache coherency on the GPU. Even though the GPU has L1 and L2 caches of main memory (actually L1 may optionally be configured only for caching local memory, which is what the RF spills over into), it requires a threadfence() for global coherency of writes to main memory. Even communication between warps requires a shuffle paradigm.

2 @keean had mentioned to me the Xeon Phi as being an example of multi-core that failed to be as performant per watt as a GPU and failed to match the modern Intel CPUs on serial performance. But that was based on Larrabee which was originally intended to compete with a GPU yet also remain more generally programmable. The Xeon Phi has hardware coherent cache and out-of-order execution! So it’s not surprising that it hasn’t really solved any major need. It isn’t optimally designed for maximum general purpose programming parallelism (c.f. the Actor proposal in the next post) nor for vector, SIMD programming. And of course isn’t optimal for serial algorithms either. Thus it excels at nothing.

Actors as the paradigm for parallelism

(note having very bad whole body inflammatory attack today with a severe headache, forehead on the keyboard fatigue, mental confusion, so please excuse any lapses in my elocution and logic below)

@keean and I had discussed Actors back in 2016, for example in the Concurrency issues thread #17.

@keean and I were recently discussing in private messaging the read-only sharing via messages for Actors. Before that I was initially negative about Actors, because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory in order, for example, to support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A, How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long term and at higher scaling. Erlang has proven to be excellent for creating web servers, for example.

@keean pointed out: why not just make every UTXO record an Actor? I initially thought it would increase the overhead, but as we discussed the details we both ended up realizing there would be no overhead. I contributed the idea that even a hashmap could be modeled with Actors, by making each bucket an Actor which contains for example a linked list of values. So after deriving the bucket address from the key, the message could be sent to the Actor at that address to check for the value in the bucket or to add/remove a value to/from the bucket. For example, if the pointer to the linked list for each bucket is stored in an array and every Actor for those buckets employs the same code for processing an incoming message, then there’s no extra per-bucket storage needed for the Actor: simply call the Actor’s function with input arguments consisting of the message pointer and the aforementioned bucket address. The objects pointed to for the message and the linked list are of course immutable. The Actor modifies the linked list by creating a new copy with the required modifications and then changing the pointer at the bucket address in the aforementioned array.

Note that when a message will cause the Actor to have side effects, the message must be guarded by a mutex lock to prevent Actor function reentrancy. Thus each Actor really should have pure and impure variants of its function. The impure variant blocks on the mutex lock. These mutex locks can be 1 bit per Actor when the Actors are in an array (because these adjacent Actors allow for a bit field of these mutex lock bits). But note this mutex lock would be required anyway for mutating any shared memory data with multiple threads, so the mutex bit isn’t additional overhead due to Actors.
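
A toy Go sketch of the bucket-as-Actor hashmap just described, with an ordinary mutex standing in for the 1-bit lock and an atomic pointer standing in for the bucket slot; this is illustrative only, not a proposal for Zer0 syntax.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

// Each bucket is conceptually an Actor: its address is the bucket index, its
// state is the head of an immutable linked list, and every bucket shares the
// same message-processing code, so an Actor costs no storage beyond the slot.
type node struct {
	key, value string
	next       *node // nodes are never mutated after construction
}

type actorMap struct {
	buckets []atomic.Pointer[node]
	locks   []sync.Mutex // impure messages block here to prevent reentrancy
}

func newActorMap(n int) *actorMap {
	return &actorMap{buckets: make([]atomic.Pointer[node], n), locks: make([]sync.Mutex, n)}
}

func (m *actorMap) bucketOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(m.buckets)))
}

// Pure message: a read-only lookup walks the immutable list without locking.
func (m *actorMap) get(key string) (string, bool) {
	for n := m.buckets[m.bucketOf(key)].Load(); n != nil; n = n.next {
		if n.key == key {
			return n.value, true
		}
	}
	return "", false
}

// Impure message: build a new immutable node and swap the bucket's head under
// the bucket's lock. (A full version would also drop any stale node with the
// same key when rebuilding the list.)
func (m *actorMap) put(key, value string) {
	b := m.bucketOf(key)
	m.locks[b].Lock()
	defer m.locks[b].Unlock()
	m.buckets[b].Store(&node{key: key, value: value, next: m.buckets[b].Load()})
}

func main() {
	m := newActorMap(16)
	m.put("txid-1", "utxo-A")
	fmt.Println(m.get("txid-1")) // utxo-A true
}
```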

@keean pointed out that there’s no need to queue the messages; instead, employ the mutex lock for blocking. He also pointed out that there’s no need to store a stack pointer (SP) for each Actor, because there’s no context-switching going on at the Actor abstraction and each Actor should return to the top level of its stack when returning from its message processing function. The code sending the message (which may be inside another Actor’s message processing function) calls the Actor’s function directly, and thus there’s no context-switch required.

In order to simulate a message queue (so that the calling Actor can complete instead of blocking), have the called (impure) Actor queue the message and return. The called Actor must have registered itself to be called periodically by a thread pool (or registered as a callback, such as of another Actor) to process its queue.
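
A minimal Go sketch of that simulated mailbox, where a periodic worker goroutine stands in for the thread-pool registration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// queuedActor simulates a mailbox: send appends under the lock and returns
// immediately, and a periodic worker calls drain to process whatever has
// accumulated since the last pass.
type queuedActor struct {
	mu    sync.Mutex
	queue []string
}

func (a *queuedActor) send(msg string) { // the caller never blocks on processing
	a.mu.Lock()
	a.queue = append(a.queue, msg)
	a.mu.Unlock()
}

func (a *queuedActor) drain() {
	a.mu.Lock()
	pending := a.queue
	a.queue = nil
	a.mu.Unlock()
	for _, msg := range pending {
		fmt.Println("processed:", msg)
	}
}

func main() {
	a := &queuedActor{}
	go func() { // stands in for registration with a thread pool
		for range time.Tick(10 * time.Millisecond) {
			a.drain()
		}
	}()
	a.send("hello")
	a.send("world")
	time.Sleep(50 * time.Millisecond)
}
```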

The other key insight I had is that these Actors can run atop green threads. So for example when blocked on the message mutex lock, the green thread can context-switch. @keean initially had the thought that green threads don’t preempt, but I explained (c.f. footnote 1 in my prior post) that they can be made to “preempt” in software (aka cooperatively preempt) at key check-points where they can block, and Actor message passing calls would be one of those check-points.

If the called Actor is only executable by threads running on another core (or group of cores, if there is more than one core per software-driven-coherency cache), then green threading can context-switch from the calling Actor (to another Actor on the same core / cache group). The exception is if the return type of the called Actor is () (aka void in C): then the calling Actor can continue immediately (presuming delivery is guaranteed) while the sent message is queued by the green threading on the called Actor’s core / cache group. (This presumes that failed delivery of a sent message is an exception, although note that message delivery is not assured in the Actor model; that may not prevent an exception on failure or timeout, yet the timeout will not be definitive in that the message could have been delivered.) Actors shouldn’t presume any semantics based only on the timing of the return from calling (i.e. sending a message to) another Actor, because Actors are supposed to have autonomous state machines so that they can tolerate complex dominoes failures due to indeterminacy over the timing of delivery or failed delivery of messages.

Note although we’re describing the model of how Actors work locally on the same CPU, the message passing Actor model also scales across the bus and the network. For example, the UTXO records could be split up across servers in a cluster.

1 There’s no panacea paradigm which will prevent all cases of dead and live locks, i.e. the solution to the dining philosophers problem requires omniscience. Neither the Actor model, immutability, nor Rust’s exclusive mutable borrowing will eliminate all opportunities for dead or live locks. For example, the issuer of a message could be waiting on a response from the recipient of the message in order to complete some operation, yet the recipient of the message could (even if transitively) end up waiting on the issuer before it can complete its operation and reply to the issuer.


In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these functions always return (to the top level of their stack, thus total deallocation of all non-shared objects) and per our model shared data will always be reference counted.

So a simple model is: any object whose address the code takes and passes as a pointer in a function call or as a return value is forced to be allocated on the bump pointer young object heap. All other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump pointer heap.
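
A toy Go sketch of that allocation discipline, with a byte-slice arena standing in for the bump pointer young object heap (the runtime would of course do this beneath the language rather than in user code):

```go
package main

import "fmt"

// bumpHeap is a per-Actor young object heap. Objects whose addresses escape
// into calls or return values would be allocated here; everything else stays
// on the stack. When the Actor's message function returns to the top of its
// stack, the whole heap is reclaimed by resetting a single integer.
type bumpHeap struct {
	buf  []byte
	next int
}

func (h *bumpHeap) alloc(n int) []byte {
	if h.next+n > len(h.buf) {
		panic("young heap exhausted (a real runtime would grow or spill it)")
	}
	p := h.buf[h.next : h.next+n]
	h.next += n
	return p
}

func (h *bumpHeap) reset() { h.next = 0 } // the entire collection is one store

// actorMessage stands in for an Actor's message-processing function: escaping
// temporaries live in the heap passed to it, and the caller resets that heap
// as soon as the function returns.
func actorMessage(h *bumpHeap) {
	tmp := h.alloc(32)
	copy(tmp, "scratch")
	fmt.Println("allocated", h.next, "bytes in the young heap:", string(tmp[:7]))
}

func main() {
	h := &bumpHeap{buf: make([]byte, 1024)}
	actorMessage(h)
	h.reset() // "GC" of all non-shared objects created during the message
	fmt.Println("after reset, heap used:", h.next, "bytes")
}
```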

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus one mustn’t store, in any reference counted object, any pointer that points into the stack or the bump pointer heap. And one mustn’t store, in any object on the stack or the bump pointer heap, any pointer into a reference counted object that can have its reference count decremented by that Actor before that Actor’s function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns, unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.


Someone commented and I replied:

that all sounds very Erlang-ish, but I guess they are missing your type system constraints

Erlang has process-level context switches, which is very preemptive (although apparently also cooperative) but very heavyweight in overhead. Thus it can’t eliminate the L3 cache for scaling multi-core as I proposed in the prior post. Also AFAIK Erlang doesn’t guarantee the queue-less message passing that @keean and I devised, which is really critical to lowering the overhead of Actors to zero as I explained, and also critical for implementing software cache coherency as I explained in the prior post. Also Erlang does not have the stack allocation I proposed and instead has automated generational GC, which I have pointed out is incompatible with multi-core scaling. Yes, Erlang and Actors are the inspiration, but we have proposed significant innovations, and yes the Erlang PL is also lacking our ideas for a type system and other features we’d like to have for a complete PL, such as modules. Also I think Zer0 will likely first target Go as the compile target and VM, although in that case the reference counted objects (as declared in the code) would actually be garbage collected by Go’s GC on the Go compile target. The important point is that Zer0 can in the future be compiled to more performant code and will support a multi-core scalable future.

Note that by targeting Go initially with cooperative preemption (instead of the Erlang BEAM VM’s process-level preemption, which is apparently also partially cooperative), we could, as compared to Erlang (c.f. also), better integrate SIMD and other assembly-level optimizations without corrupting the runtime, as NIFs can do in Erlang. We’d still need to compile cooperative preemption into the SIMD code (c.f. footnote 1 in my prior post).

The Actor-like design contemplated in this post would rectify the following complaints about Erlang:

One thing Erlang isn't really good at: processing big blocks of data.

He means things like decoding mpeg data. There is too much numerical computation which Erlang is not optimized for. If the processing just involves shuffling big blocks of data from one place to another, then Erlang is quite good at it. (Files to TCP Sockets, etc)

You can't update shared blocks of data (there are no pointers in Erlang) and hence data must be shuttled across processes which in turn translate to inefficiencies.

In response to the Quora question What programming languages do software engineers use in 2018?, this answer is more evidence to think that the proposed Actor model in this thread is the future:

When Salesforce bought Mulesoft for $6.5 billion, every programmer should have sat up and taken notice. This event adds to other recent API-related developments, especially from IBM, that clearly signal that the new programming paradigm is based on APIs and microservices.

Websites are transitioning from standalone collections of web pages to dynamic microservices hosted in the cloud. This simply means that what used to be a website is now a collection of independent programs running in the cloud. These independent programs “talk” with each other and with other services in the cloud via APIs.

For example, the veterinarian where I take my dog has a traditional static website with all the usual information about hours, personnel, specializations, etc. There are links to other veterinary websites where I can create an account and buy dog food, medications, toys, etc.

Now, with the proliferation of APIs, this architecture is transformed: the formerly static website interfaces directly with other websites using APIs, so the partitioning among websites becomes transparent. Soon I will go to my veterinarian’s “website” and all services and products will be available in one place using a single domain name and a single customer account. There will no longer be hyperlinking among individual, standalone websites (at least insofar as the visitor can tell). This integration of the WWW is happening really, really fast.

This might interest you.

@sighoya ty. Very succinct blog. Good choice. Yeah I was thinking what @keean and I were contemplating is sort of a mix of CSP and Actors, because we are thinking to block on sending and guarantee message delivery. Yet we were also talking about addresses and not channels to make it more efficient.

That is an excellent point; that blog reminded me that one of the key aspects of Actors (which I had also mentioned recently and in the past to @keean) is that they don’t guarantee message delivery nor order of delivery. So the Actor model forces one to design for network failures.

So we’ll need traditional Actors as well as our proposed optimized form for optimizing local resources.

I wrote:

In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these functions always return (to the top level of their stack, thus totally deallocating all non-shared objects), and per our model shared data will always be reference counted.

So a simple model: any object whose address the code takes and passes in a function call or returns as a value is forced to be allocated on the bump pointer young object heap. All other objects are allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump pointer heap.

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus we mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And we mustn’t store in any object on the stack or the bump pointer heap any pointer that points into a reference counted object that can have its reference count decremented by that Actor before the Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns, unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.

@keean brought to my attention Ada’s new research on static memory management which he claims doesn’t require “annotating all the ‘lifetimes’” as Rust requires.

This new research is layered on top of their SPARK language. SPARK is a strict subset of Ada that was designed to have unambiguous semantics and that aimed at formal verification from the start.

However, consider the limitations of SPARK:

SPARK removes some features of the Ada language such as recursion, dynamic memory allocation, access types [i.e. pointers], dynamic dispatching and generics. It also imposes certain restrictions on the constructs of the language (e.g. array dimensions are always defined by previously declared type(s) or subtype(s)).

On top of the language restrictions, it adds annotations that allow for the specification of data-flow, creation of abstract functions and abstract datatypes and the use of structured assertions (loop invariants are available but package invariants are not). There is also a clear distinction between procedures (which may have side-effects) and functions (which are pure/free of side-effects and also have to return a value).

So SPARK has no generics. Rust’s lifetime parameters would be significantly less often annotated if Rust didn’t allow type parameters.

This new research adds pointers to SPARK by enforcing strict control of aliasing.

But Ada’s access pointers are (as I discussed in the past) already severely restricted such that they’re probably as or more cumbersome than Rust’s lifetimes. The research paper also mentioned this:

As part of its strong type checking, Ada also prevents dangling references to objects on the stack or the heap, by providing automatic compile-time checking of “accessibility” levels, which reflect the lifetimes of stack and heap objects.

Conversions between pointer types are restricted to ensure pointers never outlive the objects they designate. Values of a pointer type are by default initialized to null to prevent use of uninitialized pointers, and run-time checks verify that a null pointer is never dereferenced.

Since Ada doesn’t accurately track lifetimes, it can’t support destructors:

Standard Ada supports safe use of pointers (“access types” in Ada) via strong type checking, but safety is guaranteed only for programs where there is no explicit deallocation of pointed-to objects – explicit deallocation is considered “unchecked” programming in Ada, meaning that the programmer is responsible for ensuring that the deallocation is not performed prematurely. Ada can provide automatic reclamation of the entire memory pool associated with a particular pointer type when the pointer type goes out of scope, but it does not automatically reclaim storage prior to that point. It is possible for a user to implement abstract data types that do some amount of automatic deallocation at the object level, but this requires additional programming, and typically has certain limitations.

So although this Ada proposal reduces some lifetime annotations, it still has the same inability as Rust to allow all safe cases2, and it has additional limitations which Rust doesn’t have.

And worse the Ada proposal doesn’t even allow pointers to objects on the stack!

AFAICT, Ada is really an incredibly inflexible “What Color is Your Function” clusterfuck.

I much prefer my idea I cited at the start of this post, which attains static memory safety with 100% flexibility, 100% full features, 0% tsuris, and doesn’t even require formal verification! How about that for a paradigm shift! And if we still need Rust’s lifetime analysis in some highly critical code paths, then let’s consider my idea to make the annotations less noisy. But frankly, if we’re going to add optional lifetimes for functions that we want to be extra performant (so we can prove non-aliasing to the compiler), I think we should devise strong proving which can prove cases that are safe which Rust can’t2. However, in that case I really think we should just trust the programmer to write those critical performance paths correctly and label as non-aliased (e.g. the restrict keyword in C) those which the programmer knows aren’t aliased (although generally proving non-aliasing at compile-time is a tall order, there are scenarios where it can be determined locally, e.g. locally allocated distinct objects), unless we need formal verification for some sort of absolute guarantee of correctness.

EDIT: I’m not claiming the Ada research has no utility. The SPARK subset provides for formal verification. So within that SPARK language, the additional feature of pointers to objects on the heap is an improvement. I’m just arguing that it is not the ideal paradigm for a general purpose PL wherein productivity and ease-of-programming is a higher priority than formal verification.

I wrote:

Note thus we mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And we mustn’t store in any object on the stack or the bump pointer heap any pointer that points into a reference counted object that can have its reference count decremented by that Actor before the Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns, unless that reference is returned to the caller of the function.

The only reasons I can think of for why automatically reference counted (ARC) objects would be employed inside Actor code:

  1. For references that aren’t persisted after the Actor returns. These are objects which are either sent in a message call to another Actor and/or which have destructors (i.e. that execute when the reference count reaches zero).

  2. For references that persist after the Actor returns. These are either as persistent state owned by the Actor or objects returned to the caller of the Actor.

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns. The compiler will enforce the invariant by checking that objects with this annotation can only be assigned to references which are not re-assignable (aka const in JavaScript but not in C++) including those returned from functions that return this annotated type. And the compiler will check that every such annotated reference which was not an input argument must be returned by that function. If the function is at the top-level of the stack then no such checks are required because the reference can’t go out of scope and decrement the reference count until the Actor returns (thus such annotated reference may also be stored in persistent Actor data).

Such annotated types can then be freely assigned to non-ARC references. This mostly ameliorates the bifurcation (analogous to “What Color is Your Function?”) problem from the two kinds of references due to segregated GC schemes for shared and non-shared objects (that has been proposed in this thread).


EDIT: Unlike Pony, our non-ARC objects can’t be shared. They must be copied to a (said annotated) ARC object in order to be shared with another Actor (i.e. sent in a message). We could also adapt some of Pony’s principles on sharing objects, in that Pony allows iso and trn objects which can be locally mutated before being shared between Actors/threads. That may break the software cache coherency I invented up-thread for massively scaling multicore. So we probably don’t want to copy that flexibility from Pony, which actually makes it more abstruse. See the chart and discussion I wrote in the other thread.
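To make that copy-to-share rule concrete, here’s a minimal Go sketch, with Go’s GC and channels standing in for the proposed ARC and message send (all type names are illustrative):

```go
package main

import "fmt"

// localState is actor-private, mutable data (non-ARC in the proposal).
type localState struct {
	counts map[string]int
}

// sharedSnapshot is the message type: an immutable copy that may be shared
// with another actor (ARC in the proposal; Go's GC stands in for it here).
type sharedSnapshot struct {
	counts map[string]int // treated as read-only by convention
}

// snapshot performs the required copy before sharing.
func (s *localState) snapshot() sharedSnapshot {
	c := make(map[string]int, len(s.counts))
	for k, v := range s.counts {
		c[k] = v
	}
	return sharedSnapshot{counts: c}
}

func main() {
	inbox := make(chan sharedSnapshot)
	done := make(chan struct{})

	// The receiving actor only ever sees the immutable copy.
	go func() {
		msg := <-inbox
		fmt.Println("received:", msg.counts["x"])
		close(done)
	}()

	// The sending actor mutates its private state, then copies to share.
	local := &localState{counts: map[string]int{"x": 1}}
	local.counts["x"]++
	inbox <- local.snapshot()
	local.counts["x"] = 99 // later local mutation does not affect the receiver
	<-done
}
```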

Data race protection (which should not be conflated with protection against all forms of semantic dead and live lock faults) in both Pony and my idea is due to the fact that the Actor is not reentrant (i.e. runs only single-threaded), each Actor runs only one thread at a time, and more than one Actor can never write to the same shared object (not just not simultaneously, rather not ever, except for the exclusive local writes of iso and trn to non-shared objects before sharing them). Pony also has exception safety but in a non-ideal purist form which Zer0 will improve.

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns.

Why create an extra type for it?
A type says only something about how its values are stored.

Rather I would use keywords just like const, mutable to restrict access of some pointers.

@shelby3

I'm wondering why you need to lock shared memory at all with actors.

If different (mutating) actors want to mutate a shared state, why not wrap the shared state in a (main) actor itself. Every mutating actor has to send a request for mutation of some address with a timestamp over a specific channel which cannot be used by the other mutating actors.
The main actor tries to process the input channels for all mutating actor requests cyclically. If more than one request is available for the same address of the shared state, e.g. the same index of an array, then the main actor decides by timestamp which applies first; if two timestamps are equal then some configurable policy can apply.
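As a rough Go sketch of this coordinator pattern (names are illustrative, a buffered channel per mutating actor stands in for the dedicated channels, and a simple later-timestamp-wins rule stands in for the configurable policy):

```go
package main

import "fmt"

// mutation is a request from a mutating actor to change one slot of the
// shared state, stamped with a logical timestamp.
type mutation struct {
	index     int
	value     int
	timestamp int64
}

// owner is the single actor that owns the shared state. It drains each
// mutating actor's dedicated channel (a real implementation would poll them
// cyclically) and, when two requests target the same index, lets the later
// timestamp win; equal timestamps would fall back to some configured policy.
func owner(state []int, inputs []chan mutation, done chan<- []int) {
	latest := make(map[int]mutation)
	for _, ch := range inputs {
		for m := range ch {
			if prev, ok := latest[m.index]; !ok || m.timestamp > prev.timestamp {
				latest[m.index] = m
			}
		}
	}
	for idx, m := range latest {
		state[idx] = m.value
	}
	done <- state
}

func main() {
	state := make([]int, 4)
	a := make(chan mutation, 1) // one private channel per mutating actor
	b := make(chan mutation, 1)
	done := make(chan []int)

	go owner(state, []chan mutation{a, b}, done)

	a <- mutation{index: 2, value: 10, timestamp: 1}
	b <- mutation{index: 2, value: 20, timestamp: 2} // later, so it wins
	close(a)
	close(b)

	fmt.Println(<-done) // [0 0 20 0]
}
```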

@sighoya, they (e.g. re-assignability, mutability, and my proposal herein) are annotations (i.e. attributes) on a type. That makes them a distinct type— i.e. otherwise compatible types with incompatible attributes are thus incompatible types. IOW, these attributes have to be factored into unification of types.

I'm wondering why you need to lock shared memory at all with actors.

We don’t and haven’t proposed needing to do so. Apparently something I wrote up-thread confused you. Please refresh this page, as I had made some extensive clarifying edits to some of the up-thread posts. I suppose you’re referring to the following quoted text from the Actors as the paradigm for parallelism up-thread post?

@keean and I were recently discussing in private messaging, the read-only sharing via messages for Actors and before that I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long-term and at higher scaling. Erlang has proven to be excellent for creating web servers for example.

@keean pointed out why not just make every UTXO record an Actor […]

The above was pointing out that I thought I needed shared memory, but my private discussion with @keean had revealed that Actors subsume the need for shared memory. I explained the reasons why …

Note that the proposed Actor model I’m describing in this thread will work on top of either shared global memory or message passing. I have described how to also implement it on hardware that employs cooperation with the software for cache-to-cache transfers and to mask the latency that we would otherwise need L3 (and L4) cache for. Note it can also run on top of Go, but less efficiently than if we compile directly to the hardware (or perhaps LLVM). But Go as an interim compile target would be expedient and would also gain us some momentum, and the GopherJS JavaScript (and WASM) target comes for free with Go.

Some discussion of this Actor proposal in the context of the Mutable Objects issues thread #32.

Also discussed in the Iteration versus enumeration issues thread #16.

The Actor model is perhaps the antidote to dependencies on side-effects order? (c.f. the discussion of purity at the linked comment)

@keean tried to implicitly claim credit for the idea I thought of in this thread.

Rich Hickey, Clojure’s designer, in his explanation of his reasons for not employing the Actor model for same-process state management in Clojure, seems to have not realized that the same code could be recompiled separately for the same-process and distributed cases without incurring performance penalties on the former when the Actor message send is a pure function call, as I explained.

I do agree with him that when modeling for potentially high latency delays and timeouts (i.e. failure) resiliency needed in the distributed network case, the same-process shared memory would lose some flexibility and/or incur some (semantic and/or performance) overhead. Yet in some cases this could perhaps be parametrised.

@keean wrote in the other thread:

You clearly were the first person to explicitly mention the memory management implications of the actor approach here in these issue comments. I didn't really have time to do a detailed analysis at the time which is why I didn't directly reply.

Maybe I don't fully understand what you are proposing, but I think that the lifetime of mutables being tied to that of actors is kind of obvious,

Maybe obvious in hindsight but has anyone proposed replacing Rust’s lifetimes debacle with such an effortless paradigm as what I have proposed?

It’s not enough to just have a stack. My insight requires the compiler’s escape analysis which moves the escaped objects to the bump pointer heap (analogous to a young generation GC but without the write-barrier and mark-and-sweep overhead). The bump pointer heap in my insight is cleaned up (i.e. all the free space is recovered) with one assembly instruction, instead of the complex mark-and-sweep pause overhead of typical AMM GC collectors.

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I think the reason it was more obvious to me is because of the extensive study I did recently on the types of and designs for garbage collection.

however immutable data, if it can still contain pointers, will mean that there will still be data on the local thread stack that can get shared (although we could force a deep copy, this would be slow, and may not terminate).

Indeed. I explained the solution for that in my up-thread reply to @sighoya.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

Also actors will replace objects, and therefore the management of actor lifetimes would become the same problem as the management of object lifetimes in other languages. In the actor model you can usually pass references to actors, which is kind of like passing mutable data.

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

1 Essentially in the abstract, it is another application of my concept of “chunking” in an entirely different context of what “chunking” means compared to my use of that abstract concept in the Subtyping issues thread #8. Herein the “chunking” is one bump pointer heap for each actor. We could say that “chunking” in the abstract means divide-and-conquer or divide-and-isolate.

keean commented

@shelby3

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I would not say that this simple thing is your idea, to me your idea is the combination of bump-pointer-heap, and RC, when you want to move objects to the heap etc. Your idea is the synthesised whole mechanism, not the individual components which may have been talked about elsewhere. If you deconstruct any idea enough it is usually composed of existing ideas.

RAII in C++ uses the fact that object/actor property lifetime is tied to that of the object/actor so that when an object is destroyed all its properties are destroyed. Of course being C++ there is no guarantee that this will happen, if any of the properties are objects they will of course have to have a properly working destructor.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

I would point out that by making values first-class you can prevent this from happening without solving the halting problem. We simply make "owning pointers" have to be a DAG. This limitation would hold for all values, and would be implicit if values had referential transparency.
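For concreteness, a small Go sketch of that DAG restriction: if owning pointers can't form cycles, a recursive deep copy is guaranteed to terminate, and memoizing already-copied nodes preserves sharing (names are illustrative):

```go
package main

import "fmt"

// node's children are "owning pointers". If ownership is constrained to a
// DAG (no cycles), a recursive deep copy is guaranteed to terminate.
type node struct {
	label    string
	children []*node
}

// deepCopy copies a DAG, memoizing already-copied nodes so that shared
// subgraphs stay shared instead of being duplicated (or revisited).
func deepCopy(n *node, seen map[*node]*node) *node {
	if n == nil {
		return nil
	}
	if c, ok := seen[n]; ok {
		return c
	}
	c := &node{label: n.label}
	seen[n] = c
	for _, child := range n.children {
		c.children = append(c.children, deepCopy(child, seen))
	}
	return c
}

func main() {
	shared := &node{label: "shared"}
	root := &node{
		label:    "root",
		children: []*node{shared, {label: "left", children: []*node{shared}}},
	}
	dup := deepCopy(root, map[*node]*node{})
	fmt.Println(dup.children[0] == dup.children[1].children[0]) // true: sharing preserved
}
```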

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

In the actor model there are no closures, and you cannot pass closures in messages, (because we cannot serialise functions, and there are no references to functions or closures). Instead higher order programming is achieved by passing actor references. If we cannot pass actor references then we cannot write higher order programs directly, although we can use the "typeclass trick" I posted about where you send a datatype, and that datatype selects the implementation via a typeclass.

The problem I currently have is that typeclasses represent a global abstract algebra, something we both have said was desirable in the past. Actors are the opposite of this, and I am not sure the two conceptually work well together. With typeclasses we want both objects and types to be in typeclasses, and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

Coming back to some of my other points, Actors are a kind of container, they are not 'Values'. To make actor interfaces work everywhere, everything would need to be an actor. That can make sense for mutable objects (things that have a location) like "{3}" could be an actor. But it does not make sense for values like "3". This effectively means having two parallel "interface" systems, something like type-classes for values to allow operator overloading and an algebra on values, and a separate interface system for actors which could also be type-classes (with effects?).

@keean wrote:

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I would not say that this simple thing is your idea, to me your idea is the combination of bump-pointer-heap, and RC, when you want to move objects to the heap etc.

My proposed bump-pointer-heap is quite different than the use of a bump-pointer-heap for young generation objects in a mark-and-sweep GC. In that there’s a distinct bump-pointer-heap for each Actor and it’s completely deallocated when the Actor function/procedure completes after processing each message. Thus, there’s no write-barrier nor mark-and-sweep compacting overhead whatsoever.

The ARC (automatic reference counting) is for the globally shared objects. So it’s not connected in any way to the bump-pointer-heap.

Your idea is the synthesised whole mechanism, not the individual components which may have been talked about elsewhere. If you deconstruct any idea enough it is usually composed of existing ideas.

True, except there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme.

RAII in C++ uses the fact that object/actor property lifetime is tied to that of the object/actor so that when an object is destroyed all its properties are destroyed.

That is per object and per stack frame. Thus it does not provide the holistic freedom of cross-referencing across stack frames and between all types of objects and references. RAII bifurcates the program into non-RAII and RAII.

I agree that if you squint your eyes and tilt your head sideways, you might see some conceptual resemblance to my idea. But I could see some conceptual resemblance between a rabbit and an elephant. They both have large ears.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

I would point out that by making values first-class you can prevent this from happening without solving the halting problem. We simply make "owning pointers" have to be a DAG. This limitation would hold for all values, and would be implicit if values had referential transparency.

In general we can’t ensure that DAG invariant.

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

In the actor model there are no closures, and you cannot pass closures in messages, (because we cannot serialise functions, and there are no references to functions or closures).

That is a restriction only on the message communication between actors. We can have closures instead of the actor’s function/procedure.

Instead higher order programming is achieved by passing actor references.

That is one way. But we are also still allowed to use any programming paradigms we want inside the actor’s function/procedure. Remember my model up-thread is not of millions of actor nodes with no cache and no shared memory. Rather my model for scaling multi-core described up-thread is groups of cores sharing an L1 cache. So we don’t need to entirely code everything in actors.

If we cannot pass actor references then we cannot write higher order programs directly,

Disagree for the reasons stated above. However, I do agree your point might apply when modeling data structures with actors.

I don’t understand your point though. Why can’t we pass around actor references in messages?

although we can use the "typeclass trick" I posted about where you send a datatype, and that datatype selects the implementation via a typeclass.

Where did you describe that trick? I am not familiar with what you are describing. I don’t understand.

The problem I currently have is that typeclasses represent a global abstract algebra, something we both have said was desirable in the past. Actors are the opposite of this, and I am not sure the two conceptually work well together. With typeclasses we want both objects and types to be in typeclasses,

I see no problem. Typeclasses will work fine within actor functions/procedures.

and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

I don’t want to control effects with your idea of typeclasses. I want to eliminate effects by making actors function independently of the ordering of messages. For other types of effects, we need some model that works at the actor level.

Coming back to some of my other points, Actors are a kind of container, they are not 'Values'. To make actor interfaces work everywhere, everything would need to be an actor.

I don’t want this level of pervasive integration of Actors. I think of Actors as an API, not a first-class language entity.

I wrote:

and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

I don’t want to control effects with your idea of typeclasses. I want to eliminate effects by making actors function independently of the ordering of messages. For other types of effects, we need some model that works at the actor level.

By effects we mean management of a state machine.

What I am saying is that the state machines of Actors should be independent from each other. They should function correctly regardless of the state machines outside.

So this means abandoning determinism and accepting statistical progress.

Determinism is the wrong assumption. We can’t move forward with computing until we abandon it.

In fact, my decentralization solution to the centralization problem of blockchains which nobody else has solved, is basically at the generative essence about abandoning determinism.

So I think the entire concept of algebraic effects, purity, and monads is far too low-level to deal with the problem they are actually purporting to deal with.

keean commented

@shelby3 state is just one kind of effect. There are others, like whether code allocates memory, whether it is interruptible, whether it does IO. The point with effects is to control which are permitted. An example might be where we want to guarantee the machine cannot run out of memory in the middle of a process: we allocate all of the working space in advance, and then require no memory allocation during the critical process itself. Even within IO we might break down the effects into local IO and network IO, so that we might restrict code from accessing the network.
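To make that pre-allocation discipline concrete, here is a small Go illustration (enforced here only by convention, which is exactly the invariant an effect system would check statically; names are illustrative):

```go
package main

import "fmt"

// process runs the "critical" phase against a buffer that was allocated in
// advance; it appends only within the pre-reserved capacity, so no
// allocation happens mid-phase.
func process(samples []int, scratch []int) []int {
	scratch = scratch[:0] // reuse the capacity, start from length zero
	for _, s := range samples {
		scratch = append(scratch, s*s) // stays within cap, so no realloc
	}
	return scratch
}

func main() {
	samples := []int{1, 2, 3, 4}
	scratch := make([]int, 0, len(samples)) // all allocation happens up front
	fmt.Println(process(samples, scratch))  // [1 4 9 16]
}
```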

You can think of these a bit like the security permissions on iOS and Android, we can see from the top level program which permissions are needed, so that for example we might not want to install a password storage program that contacts the network, and conversely we can impose restrictions on the code we call.

The alternative is a capability system. With effects the authority is ambient, that is, everything is permitted unless it is specifically disallowed. By default you can do IO etc. In a capability system you cannot do anything impure at all without a 'capability' to do it. Capabilities are passed downward in code, so when your program starts it might be passed the capabilities to allocate memory and read a particular file. The capability objects would need to be passed downward as arguments in your program to where they are needed.

So in a way capabilities are the analogue of modules and effects the analogue of type-classes.

@keean

Can't we implement effects orthogonally to the type system? I mean, if you want that for security permissions only, why not append properties to functions stating their effects, which are checked in the requires section of other functions which take the said function.

I don't see any reason to pass effects as implicit arguments; this only makes sense if you override effects, but for what gain? The ability to override effects for a function destroys the meaning of the said function.

@keean wrote:

State is just one kind of effect. There are others, like whether code allocates memory, whether it is interruptible, whether it does IO. The point with effects is to control which are permitted. An example might be where we want to guarantee the machine cannot run out of memory in the middle of a process: we allocate all of the working space in advance, and then require no memory allocation during the critical process itself.

I think those effects will become more and more useless anyway.

We are headed into a massively networked and parallelized, distributed execution model. So there will be very little that can be guaranteed. Programs instead need to be written to be resilient and non-deterministic, with statistical correctness.

Besides, I am not creating a specialized programming language for a deterministic embedded system. I am creating a general purpose programming language for the wave of the future.

I have suddenly had the realization that this effects stuff is really much drama about the mostly irrelevant.

You can think of these a bit like the security permissions on iOS and Android, we can see from the top level program which permissions are needed, so that for example we might not want to install a password storage program that contacts the network, and conversely we can impose restrictions on the code we call.

AFAICT, security permissions are a different issue entirely.

keean commented

@shelby3

AFAICT, security permissions are a different issue entirely.

I don't think so. Access to a file is equivalent to permission to access that file. Most programming languages are full of 'ambient authority', that is, state hidden elsewhere determines if you can do X or Y. These are precisely the kind of effects you can model with an effect system.

Like I said capabilities and effects are two ways to do the same thing:

With effects:

f() requires FileRead("/usr/Bilbo")
   x := open("/usr/Bilbo")
   print(x)

With capabilities

f(y : FileReadPermission("/usr/Bilbo"))
   x = open(y, "/usr/Bilbo")
   print(x)

Note with capabilities we always need to pass down the permissions, all the way from the top, main would be passed the total set of permissions available to the application. I don't think this needs any language features.
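As a rough Go rendering of that capability style, where the permission is an ordinary value that main mints and passes downward (the fileReadCap type is invented purely for illustration):

```go
package main

import (
	"fmt"
	"os"
)

// fileReadCap is an illustrative capability: holding one is the only way in
// this sketch to read the named path, so authority is not ambient.
type fileReadCap struct {
	path string
}

// read exercises the capability; there is deliberately no package-level
// "open any path" escape hatch here.
func (c fileReadCap) read() ([]byte, error) {
	return os.ReadFile(c.path)
}

// f mirrors the pseudocode above: it can only touch the file it was
// explicitly handed a capability for.
func f(c fileReadCap) {
	x, err := c.read()
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Print(string(x))
}

func main() {
	// main holds the application's total authority and passes it downward.
	f(fileReadCap{path: "/usr/Bilbo"})
}
```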

With effects we can infer the effects for a function from the function definition, the same way we can infer typeclass requirements.

I think with mobile devices, and security, these kinds of permissions are going to become more important, and effects, like type-classes, help keep function argument lists simple.

@keean wrote:

I don't think so. Access to a file is equivalent to permission to access that file.

You entirely missed my point. Which is that effects will end up being indeterminate and non-deterministic and thus they aren’t something that is expressible with a permission flag because they can’t be captured by any determinism as we move into this brave new world.

In contrast, security permissions are inherently deterministic by definition.

keean commented

@shelby3

In contrast, security permissions are inherently deterministic by definition.

And so they should be. There is no grey area if you want to protect privileged information. If I do not grant an app permission to access my address book, then it should not be able to access it under any circumstances. I certainly don't want it to be non-deterministic, where it might give access if the application is lucky.

A summary in another thread, of the historical development of the “actor epiphany” that was introduced in this thread.

I wrote up-thread:

EDIT: Unlike Pony, our non-ARC objects can’t be shared. They must be copied to a (said annotated) ARC object in order to be shared with another Actor (i.e. sent in a message). We could also adapt some of Pony’s principles on sharing objects, in that Pony allows iso and trn objects which can be locally mutated before being shared between Actors/threads. That may break the software cache coherency I invented up-thread for massively scaling multicore. So we probably don’t want to copy that flexibility from Pony, which actually makes it more abstruse. See the chart and discussion I wrote in the other thread.

Data race protection (which should not be conflated with protection against all forms of semantic dead and live lock faults) in both Pony and my idea is due to the fact that the Actor is not reentrant (i.e. runs only single-threaded), each Actor runs only one thread at a time, and more than one Actor can never write to the same shared object (not just not simultaneously, rather not ever, except for the exclusive local writes of iso and trn to non-shared objects before sharing them). Pony also has exception safety but in a non-ideal purist form which Zer0 will improve.

Zer0 shall adopt Pony’s paradigm for our ARC references. I summarized this in the WD40 issues thread #35:

Whereas, if I am incorrect and can add the iso and trn for Zer0’s ARC objects, then for those ARC objects Zer0 would be as abstruse as Pony and less performant (because ARC is less performant than tracing), but for the non-ARC objects Zer0 will be less abstruse and more performant. Presuming most of the computation occurs in Actor function lifetimes with non-ARC objects, then Zer0 will be less abstruse and more performant.

Note Pony adds iso, trn and tag pointer types to the ones we had contemplated so far.

| Pony | Zer0 | Access | Shared Aliasing | Non-Shared Aliasing |
| --- | --- | --- | --- | --- |
| iso (Isolated) | — | Exclusive read-write (writable) | None | None |
| val (Value) | immutable | Immutable (non-writable) | Read-only | Read-only |
| ref (Reference) | read-write | Read-write (writable) | None | Read-write |
| box (Box) | read-only | Read-only (non-writable) | None, or Read-only | Read-write, or Read-only |
| trn (Transition) | — | Exclusive write-only (writable) | None | Read-only |
| — | write-only | Write-only (writable) | None | Read-write |
| tag (Tag) | — | Address-only (opaque pointer) | Address-only | Address-only |

The iso and trn can be moved (aka alias burying) to any other type above (except not trn to iso) because they have exclusive ownership of writing. Also these types above can be employed within a recover block. Record fields also combine with the record pointer type to give a combined capability type. There’s also subtyping of capabilities, but remember in Zer0 all subsumption is via casting so as to make type inference decidable.

There’s no utility for Zer0 to have such exclusive writable types on non-ARC objects, because (since the non-ARC objects don’t live beyond the Actor function lifetime) these can’t be shared without copying them to ARC (and copying of course can be to any type). Zer0 could adopt the iso and trn on ARC objects. Yet the exclusive reading aspect of iso for immutable ARC objects could possibly prove useful to eliminate the need to trace those for cyclic references, because only one pointer could exist.

Extending Pony’s Reference Types

I propose to rename and extend Pony’s reference capabilities for Zer0 as follows.

| Pony | Zer0 | Access | Shared Aliasing | Non-Shared Aliasing |
| --- | --- | --- | --- | --- |
| iso (Isolated) | [r/w] | Exclusive (read/write) (writable) | None | None |
| val (Value) | val | Immutable (non-writable) | Read-only | Read-only |
| ref (Reference) | r/w | Read/write (writable) | None | Read-write |
| — | [r]/w | (Exclusive read)/write (writable) | None | Write-only |
| — | r/[w] | Read/(exclusive write) (writable) | None | Read-only |
| — | [read] | Exclusive read-only (non-writable) | None | Write-only |
| box (Box) | read | Read-only (non-writable) | None, or Read-only | Read/write1, or Read-only |
| trn (Transition) | [write] | Exclusive write-only (writable) | None | Read-only |
| — | write | Write-only (writable) | None | Read/write |
| tag (Tag) | id | Address-only (opaque pointer) | Address-only | Address-only |

Note exclusive (read and/or write) capability is useful for being consumed (in addition to being consumable for other reasons such as sending between concurrent threads) for supersumption. For example, supersuming the union element type of a collection (i.e. adding constituent union members, e.g. from Int to Int | String). It’s safe to write to the resultant supertype (e.g. Int | String) because no reference can read at the consumed source subtype (e.g. Int), which writes (e.g. of String) to the supertype would otherwise violate.

The dual subsumption case (i.e. removing constituent union member(s), e.g. from Int | String to Int) is handled (without consumption) by assigning a type with (even non-exclusive) write capability to the subtype, keeping the write capability intact but discarding any read capability for the subtype reference (which would otherwise enable reading from the source supertype, thus violating the subsumed invariant). There’s no need to consume the source of the assignment, because no new types were added to the subtype (to be written to) which would otherwise violate reading from the source type.

Pony doesn’t allow sharing as box (i.e. read-only between actors) a mutable that can only be written by one “actor”. Although it wouldn’t be data race unsafe, it would vacate the optimization of employing CPU registers for values of read-only capabilities, because the single-threaded execution assumption would no longer hold. Also it would be a debugging nightmare as a substitute for event handlers.

1 When a box alias was obtained from a trn, then the trn can be consumed as a val for sharing (c.f. also).

More discussion of the importance of the Actor paradigm-shift and what it really is.

@keean regarding our prior discussion about superscalar versus massively multi-core, the 32 core (64 logical thread) AMD Threadripper seems to only excel at massively parallelized tasks which are not memory (cache coherency) bound:

https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review/7
(c.f. the 3D Particle Movement v2.1: Brownian Motion section)

https://browser.geekbench.com/processor-benchmarks

https://www.quora.com/Is-Intels-8th-gen-processors-better-than-AMDs-Ryzen-processors

And note Intel will soon respond to the Threadripper challenge with their own such design.

Given I want a mini-ITX case for traveling (what’s the point of a laptop when I need my 24" Acer screen?), then my best choices are AMD Ryzen 7 2700X or Intel i7-8086K.

Single-core benchmark of the i7-8086K is 18% faster at same clock frequency (with lower TDP also). The i7-8086K can run at least 5.0 Ghz on a single-core (and some have been overclocked to 5.1 Ghz on all 6 cores (c.f. also)); whereas, the Ryzen 7 2700X can only reach 4.3 Ghz on a single core (and perhaps on all cores). Thus the i7-8086K has 38 – 48% better single-core (and probably also up to hexa-core) performance as overclocked (5.0 – 5.1 Ghz versus 4.1 – 4.3 Ghz) and 28% not overclocked (4.0 Ghz versus 3.7 Ghz) but we will rarely be running only a single-core so when not overclocked they’re likely to often not be in single-core Turbo mode.

Non-computationally bounded multi-core benchmark of the i7-8086K is 5% faster not overclocked (4.0 Ghz versus 3.7 Ghz) and 8 – 15% faster overclocked (5.0 – 5.1 Ghz versus 4.1 – 4.3 Ghz). Whereas for purely computationally bounded, massively multi-threaded the Ryzen 7 2700X is faster.

Not overclocked, the i7-8086K is 49% faster (i.e. 67% of execution delay) than my existing i7-4770 (not mini-ITX) on single-core and 111% multi-core (i.e. 47% of execution delay). Fully overclocked compared to my existing i7-4770 (not mini-ITX) is 86% faster on single-core and 120% multi-core (i.e. 54% and 45% of execution delay).

Given for example how slow the Scala compiler is, and it being mostly bound to single-core performance, the above consideration is still significant, cutting compilation delays in half compared to my existing i7-4770.

But AMD will improve, and eventually the Actor programming paradigm will be more ubiquitous, and so massively multicore with no hardware cache coherency required could change the landscape. But for now, Intel is still on top. AMD though is gaining a lot of revenue from GPUs (cryptocurrency mining has been a significant boost as well) with which they can finance their continued improvement on CPUs, and presumably Intel is being squeezed on revenue by the competition from AMD. For the moment, Intel still owns the server market, which is where the huge revenues continue to pour in. It will perhaps require that Actor paradigm shift and a move away from hardware cache coherency to disrupt Intel’s lead in servers.

keean commented

@shelby3 Right now I would recommend AMD based on lower exposure to Spectre and Meltdown. I believe AMD are only vulnerable to about 5% of the vulnerabilities, and those are the harder ones to actually exploit. The AMD chips are cheaper, so you should calculate Performance/$, which because the AMD 2700X is about 33% cheaper than the i7-8700k, it would seem the AMD is the better choice for performance per dollar.

https://www.techadvisor.co.uk/feature/pc-components/intel-core-i7-8700k-vs-amd-ryzen-2700x-3679686/
https://wccftech.com/amd-ryzen-7-2700x-x470-review-out-beats-i7-8700k-in-7-10-game-tests/

I have had all Intel equipment since the core2 came out, before that the last time I bought AMD was the first generation 64bit CPUs when AMD really forced Intel into developing their own 64bit x86 (Intel wanted to keep x86 exclusively 32bit, and force everyone to Itanium for 64 bit, even though VLIW was a poor design choice in my opinion). The next CPU I buy will probably be a Threadripper 2990 when it comes out.

@keean wrote:

Right now I would recommend AMD based on lower exposure to Spectre and Meltdown. I believe AMD are only vulnerable to about 5% of the vulnerabilities, and those are the harder ones to actually exploit.

Yeah I thought about that, but my coding machine doesn’t really need to be secure. Most of the work I am doing will end up open-source anyway. If the NSA wants to get access to my computer, they’re going to get in anyway. For security, I should run another machine that has never been connected to the Internet and has no known vulnerability, such as a Raspberry Pi.

Note I think most consumers should consider AMD for their general purpose computing, because they’re not likely to dedicate a computer just for security.

The AMD chips are cheaper, so you should calculate Performance/$, which because the AMD 2700X is about 33% cheaper than the i7-8700k, it would seem the AMD is the better choice for performance per dollar.

https://www.techadvisor.co.uk/feature/pc-components/intel-core-i7-8700k-vs-amd-ryzen-2700x-3679686/
https://wccftech.com/amd-ryzen-7-2700x-x470-review-out-beats-i7-8700k-in-7-10-game-tests/

Those links confirm what I wrote in my prior post. On purely computationally bound, massively multi-threaded tasks, the Ryzen and Threadripper defeat an Intel that has fewer cores.

But the Threadripper (especially the 32 core 2990WX) performs even worse than the Intel (and even the Ryzen 2700X!) on tasks that are bound by the I/O to the cores. So unless you’re doing content creation, or until we start to optimize other software to mask I/O latency better with the sort of green M:N thread and Actor parallelism we are contemplating, the AMD will be slower (or just not that much faster) even with all those cores. I think the AM4 socket (e.g. Ryzen 7 2700X) is a better (more uniform performance) choice right now than the Threadripper unless you are narrowly focused on computational workstation tasks.

For me the $150 increase in cost for the i7-8086K is irrelevant compared to the time I would lose for example compiling Scala. If my (even incremental) builds end up being on the order of ~10 seconds and if I am doing an incremental build every 1 minute, then an i7-8086K overclocked to 5 Ghz will reduce my time loss from roughly 17% to 12% (assuming that all of the ~10 seconds is CPU bound and not SSD throughput bound, which is probably not the case). At 8 hours a day of effective coding time, that would be 24 minutes saved per day. Not only that, but there’s also the loss of continuity of concentration with 10 second delays compared to 7 seconds.

Also if I wait a couple more months, then the i9-9900K will be released with 8 cores (and 16 logical threads) that will allegedly still achieve 5 Ghz up to two cores (c.f. also). That will essentially erase the computational multi-core advantage of the Ryzen 7. But that won’t increase my Scala compiler performance.

However, rumor is AMD will counter-strike in 2019 (probably before summer) with 12 core Ryzen for the AM4 socket (so not required to upgrade the motherboard). And perhaps a 16 core AM4 socket Ryzen in 2020. Perhaps even a 10 core and/or faster clock speed 2800X might appear soon such as later this year or Q1 2019. A 4.5 GHz turbo clock speed would narrow the gap to the i7-8086K to ~31% up to hexa-core.

But there’s something more profound that appears to be happening and it appears to be a decadence that is pervasive in the West. Intel appears to be losing the lead in process technology (c.f. also). And one of the symptoms (if not cause) of that, is the SJW’s “bitches in tech” paradigmatic destruction of tech companies. Ousted former Intel President Renée J. James and CEO Brian Krzanich both doing that unconstitutional egalitarian Harvard-esque no-meritocracy allowed here insanity. Renée has no engineering experience nor engineering education.

Whereas, the CEO and President of AMD although a female, has an extensive electrical engineering background and she is Taiwanese likely facilitating the connections (economies-of-scale to displace Intel in capex) in Asia which is rising to defeat the West by 2032.

The writing is on the wall. AMD (and Asia) will increase the lead in process technology over Intel in 2019 and 2020. So the gap in clock speeds is going to close while the core counts will remain in the lead. AMD will begin to take the server market from Intel next year with the 7nm process technology.

So after all the more detailed insight, I think I agree with you. The hypothetical 24 minutes lost per day if I don’t buy an i7-8086K will evaporate or become insignificant in the bigger picture scheme.

keean commented

@shelby3 AMD also support ECC DRAM which I think is more important than most people think, and something I really miss on the consumer Intel chips, and why my other machine I bought second-hand Xeon chips and have a dual-xeon configuration.

And @keean, realize the implication of this: a huge implicit demand for a mainstream PL with Actors is building. We are in the right place, at the right juncture in history, with probably the correct design to meet the burgeoning hardware parallelism!

My initial thesis when I started discussing this parallelism issue with you has actually come to fruition, which is that we would abandon the more aggressive facets of ILP (some facets of superscalar out-of-order execution) that are riskier for security, and that Moore’s law would more significantly shift to increased core counts, thus forcing programmers to seek out paradigm shifts for programming for parallelism. Even I tried to argue for Intel here and have been defeated by the facts.

P.S. This is not the first time I have been correct and Eric S. Raymond has been incorrect (his thought that we are bound to serial computation by certain algorithms). I also have irrefutably disproven him and his sycophant blogging commentators (c.f. also) on the issue of 9/11 where he refused to believe (and banned me for being in his words, “bat shit crazy”) our own government and/or military perpetrated the mass-murder of its own citizens. The problem is that those who hate Islam (Eric ostensibly hates Islam because he thinks it enslaves women) or otherwise don’t want to believe, thus do not deeply study the details which are irrefutable! Thus evidently a 4 SD IQ is not always going to be correct against a 2 SD IQ.

I wrote:

@Ichoran wrote:

[…]

However, I have yet to find a case where attention to memory is not radically better yet (e.g. the memory arena versions, which have ~8s CPU time and are parallelized).

I think we need memory regions for Zer0 (which can employ efficient bump-pointer allocation and a single region-wide deallocation), because it will be very slow to build, for example, an immutable binary tree for sharing with another Actor using ARC. We can’t use the efficient Actor-local bump-pointer heap of my “actor epiphany” because that will be only for non-shared objects.

@keean you said you prefer control instead of mark-and-sweep heuristics, so that’s what I’m contemplating to provide.

@keean remember I argued that we’d see an endless stream of speculative execution vulnerabilities. Here’s a new vulnerability:

https://www.zdnet.com/article/beyond-spectre-foreshadow-a-new-intel-security-problem/

keean commented

@shelby3 The problem is not speculative execution itself, the problem is that the engineers assumed because the results would be thrown away, they did not have to carry out all the security checks on the speculative branch.

I don't think AMD are vulnerable to these new 'Foreshadow' attacks.

jdh30 commented

@shelby3 "there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme"

Cheney semi-space?

@jdh30 my proposal is that the objects which are shared between actors (thus can be longer lived) must be ARC. Thus the bump-pointer heap is only applicable to non-shared objects, which presumably are shorter-lived or at least reasonably well correlated w.r.t. lifetimes even if longer lived. If either of those two hypotheses (the young generational, or at least the correlated, generation) is violated, then we need alternatives for the programmer to work around the inefficiency. The entire (local) bump-pointer heap is reclaimed by resetting the bump-pointer at the termination of the actor function which processes an incoming message. Thus there’s no need for mark-and-sweep, because there are no objects remaining in the heap, since the collection cycle occurs upon termination of the processing of the incoming message.

I have also augmented my original proposal by acknowledging we probably also need the ability to instantiate region-based memory allocation (aka the Arena pattern), where each region is separately bump-pointer allocated. Thus a region is only ever freed all at once, with no mark-and-sweep. This increases efficiency for use cases that are compatible with such a model.
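For concreteness, a minimal Go sketch of such a region: each region is its own bump allocator over a pre-sized backing array, there is no per-object free, and the whole region is released at once (names are illustrative, and Go’s GC stands in for a real free of the block):

```go
package main

import "fmt"

// treeNode is the kind of linked structure (e.g. an immutable tree to be
// shared with another Actor) that is cheap to build inside a region.
type treeNode struct {
	key         int
	left, right *treeNode
}

// region is a bump-allocated arena of treeNodes: allocation appends into a
// pre-sized backing array, and the whole region is released at once.
type region struct {
	nodes []treeNode
}

// newRegion pre-sizes the backing array; if allocations exceeded this
// capacity the slice would be reallocated, which a real region allocator
// would handle by chaining blocks instead.
func newRegion(capacity int) *region {
	return &region{nodes: make([]treeNode, 0, capacity)}
}

// alloc hands out a pointer into the backing array; there is no per-node
// free, which is what makes allocation so cheap.
func (r *region) alloc(key int) *treeNode {
	r.nodes = append(r.nodes, treeNode{key: key})
	return &r.nodes[len(r.nodes)-1]
}

// release drops the whole region in one step (Go's GC reclaims the backing
// array; a native runtime would simply free one block).
func (r *region) release() { r.nodes = nil }

func main() {
	r := newRegion(3)
	root := r.alloc(2)
	root.left = r.alloc(1)
	root.right = r.alloc(3)
	fmt.Println(root.left.key, root.key, root.right.key) // 1 2 3
	r.release()
}
```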

I posit that these proposed models, in addition to avoiding thread synchronization of collection cycles, may cope better in some use cases (because they provide alternative assumptions about the duration and cordoning of the generation) with the issue you mentioned in your blog, as quoted below, about violations of the young generational hypothesis (which you’ve also mentioned can sometimes be ameliorated by eliminating superfluous boxing and allocations):

Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.

Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation.

You noted in another of your blogs that timely young generation collection cycles are important for maximum performance of (what you refer to as ‘symbolic’1) code that produces a lot of young garbage. So my proposed model above would require some intervention by the programmer to subdivide his algorithm between actors such that collection cycles are frequent enough to maximize performance. But this would conflate concerns, because the Pony capabilities actor model is also about data race safe threaded sharing. Thus the need for regions, and I think we also need to add the option of regioning by function scope in addition to the proposed regioning by region id.

Will the programmer be capable of this static partitioning of allocation (i.e. is it a deterministic phenomenon), or is AMM needed to stochastically solve a non-deterministic phenomenon? My thought is that although program demands can be non-deterministic, the algorithms can sometimes be broken down into relatively deterministic parts that are combined non-deterministically. This will depend on whether the algorithm can efficiently and deterministically identify which objects it wants to copy out of the short-lived, frequently collected bump-pointer allocated region. For algorithms that can’t, then mark-and-sweep is perhaps required to maximize performance. For n-Queens it seems all the young garbage can be eliminated by employing a mutable array instead of creating temporary copies, and then zeroing out the rows when backtracking.
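
As an illustration of that n-Queens point, here is a minimal sketch (standard backtracking in Rust, not anyone’s particular implementation): one mutable array is reused for the entire search and entries are simply removed on backtracking, so the hot loop allocates no young garbage at all.

```rust
/// n-Queens counting with a single reusable mutable array: `cols[r]` holds
/// the column of the queen on row `r`. Backtracking just removes the entry
/// again, so the search produces no temporary copies (no young garbage).
fn solve(n: usize, row: usize, cols: &mut Vec<usize>, count: &mut u64) {
    if row == n {
        *count += 1;
        return;
    }
    for c in 0..n {
        // A placement is safe if no earlier queen shares a column or diagonal.
        let safe = cols
            .iter()
            .enumerate()
            .all(|(r, &qc)| qc != c && (row - r) != c.abs_diff(qc));
        if safe {
            cols.push(c);              // place a queen on this row
            solve(n, row + 1, cols, count);
            cols.pop();                // "zero out" the row when backtracking
        }
    }
}

fn main() {
    let n = 8;
    let mut cols = Vec::with_capacity(n);
    let mut count = 0u64;
    solve(n, 0, &mut cols, &mut count);
    println!("{}-queens solutions: {}", n, count); // 92 for n = 8
}
```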

Again I want to reiterate that fully AMM seems to not scale to massively multicore which is where the industry is rapidly headed now. The issue is more complicated than just stopping-the-world or not, as I explained up-thread (kudos to @keean for the initial insight) it also involves scaling the CPU’s cache coherence overhead. Pony’s local mark-and-don’t-sweep, global reference counting via message passing AMM model seems to have performance issues. So it seems we need a different model. I’m trying to figure out what the model should be. Note the cache coherency issue is mostly due to sharing between threads, which is mostly resolved by Actors, yet may also have interplay with the GC strategy.

1 Presumably this means immutable objects and pure functional equational reasoning a la Haskell?

Sorting out which hardware I will buy as I need to upgrade my i7-4770 desktop and prepare to be portable and mobile (will be relocating and need to diversify my work venues so I’m not always stuck inside my cave, such as coding at the beach or in a cafe):

Parallelization Obsoletes the Desktop Computer

keean commented

@shelby3 I think the zenbook Pro is nice, although a 15" zenbook Pro might be better for extended use.

I agree that it makes more sense to rent CPU time, as either cloud instances (like AWS EC2), or even server-less compute time (like AWS Lambda). I normally keep source code in a remote repository too, and use cloud office software (Office 365 or Google Docs). I try to keep nothing of value on the computer, so if it gets lost, stolen, or just breaks I don't lose too much.

However I think a decent CPU and GPU are useful, as edge computing is definitely also a factor: mobile devices are the most used devices for accessing online services, so you want to support offline use since you are not always connected to the network. This doesn't have to be the complete functionality, but should allow people to get work done offline, for example editing a document you previously cached locally. Big processes you transfer to a cloud cluster where they can run in the background and notify you when complete.

As a general rule I would run interactive processes locally, and queue batch jobs to run remotely, as well as store shared/long-lived data remotely.

@keean I don’t need that screen inside the touchpad on the Pro. I would prefer not to have the numeric pad on the keyboard (or to have it in the touchpad instead¹). The Pro isn’t yet refreshed for Whiskey Lake and I don’t think I want a hexa-core processor with a 35W TDP in a laptop. So I’m currently thinking the recently refreshed Zenbook classic series, although I really don’t need a GPU so I might just go for a cheaper Vivobook in anticipation of replacing the laptop again in 18 months. The refreshed Vivobook S only has a 42Wh battery though. Frustrating that these laptop manufacturers hold back certain features to force us to buy their most expensive model which has other features we don’t need. I don’t want to buy a Dell because my two experiences with Dell laptops were lemons. And I prefer to buy only in a store where I can test the unit before committing to purchase. The latest generation Thinkpads just aren’t the quality they used to be and also there are none available retail in the Philippines. I’m in ASUS land.

I agree we still need some compute horsepower locally. My blog is both addressing the current situation and also the future. The eventual future is ubiquitous connectivity. And I think eventually basically nothing will run efficiently locally on the compute resources an individual could afford to own exclusively. There will be peak demand spikes because for example complex AI and data analysis will be incorporated into software. Our programming compilers might become much more complex and do complex (even whole program) analysis.

My basic point for the near-term is that for many people it doesn’t pay to sacrifice mobility by trying to always cart around enough compute resources for every possible software scenario. An eGPU can have much more horsepower and you can cart that around when you really need it, otherwise leave it on your desktop. For us, if we need to run a multi-threaded compiler then we can have a ThreadRipper on our desktop which is even portable, but not mobile. We can connect to it from our laptop when we dock.

Some current workstation and gaming users have needs that can be met with a laptop (e.g. Photoshop certified or AutoCAD certified), but I think the parallelization phenomenon is going to unleash software that simply won’t run on any laptop form factor.

Also mobility has become a focus for me lately not only because I need to travel, but because being stuck inside 14 x 7 is taking its toll on my health (at age 53.5). I need to get outdoors more both for my eyes (the blue light of the monitor may impact sleep patterns and hormones as well) and also to get daily sun exposure for vitamin D. Also for mental health.

Btw, I read today that Intel’s original 10nm process would have had transistor density equivalent to AMD’s 7nm. Apparently though Intel is scaling back the sophistication of their 10nm process so that they can get sufficient yield to ship by late 2019. Thus it may not end up actually equivalent.

¹ Although I saw a video which points out that the numeric pad in the touchpad doesn’t work that well because it can’t keep up with very fast typing.


Changing topic a bit: it seems with recent announcements that AMD is moving towards a design in which sharing data between cores (or at least between 8-core chiplets) will have more latency and cost (Infinity Fabric will remain on 14nm for the time being). Perhaps we should deprioritize being able to avoid copying when sending data between Actors, as that will become a nearly useless optimization in the future. This design point may impact the GC issue we have been discussing in the other thread.

Also I read the new Whiskey Lake Intel mobile CPUs have some hardware fixes for Meltdown. I may opt for the i7-8565 with a 15W TDP. Seems to be the sweet spot that retains battery life, mobility and which neither Intel nor AMD will be able to significantly improve upon for at least a year. Will eventually probably also build a ~10L case Ryzen or Threadripper for more local compute resources when I’m at my desk.

https://fuse.wikichip.org/news/1815/amd-discloses-initial-zen-2-details/

http://www.tomshardware.com/answers/id-3824835/noiseless-cooling-system-amd.html

@jdh30 wrote:

@shelby3 wrote:

there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme

Cheney semi-space?

I misapplied the terminology. My intention was to state that I know of no bump-pointer heap that isn’t integrated with some form of tracing garbage collection. I conflated the generalized category of a tracing garbage collector with the mark-and-sweep variant of said collector.


Mark-and-don’t-sweep

As I understand it, the mark-and-don’t-sweep variant differs from mark-and-sweep in that only reachable objects are ever visited (i.e. marked) by the collector; there is no sweep pass over the garbage. The allocator serves requests from memory fragments whose mark bit says unreachable (i.e. free space), flipping that bit to reachable as it hands them out. Tracing and marking only occurs once every fragment in the heap carries the reachable mark, i.e. when the allocator can find no more free space; by then some of the objects marked reachable may have actually become unreachable. The meaning of the mark bit is then flipped so that its current value means unreachable (i.e. free space), and before allocation proceeds, tracing+marking flips the actual bit (i.e. not its meaning) for those objects that are still reachable.
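
Here is a toy sketch of that scheme (in Rust; indices stand in for addresses and all names are hypothetical), assuming I have the description above right. The `sense` flag is the bit value that currently means “live”, so flipping it instantly turns every slot into free space, and tracing then re-marks only what is reachable:

```rust
/// Toy mark-and-don't-sweep heap over a fixed pool of slots.
struct Heap {
    mark: Vec<bool>,       // one mark bit per slot
    refs: Vec<Vec<usize>>, // outgoing references of each slot (toy object graph)
    sense: bool,           // the bit value that currently means "live"
    cursor: usize,         // allocation scan position
}

impl Heap {
    fn new(slots: usize) -> Self {
        Heap { mark: vec![false; slots], refs: vec![Vec::new(); slots], sense: true, cursor: 0 }
    }

    fn alloc(&mut self, roots: &[usize]) -> Option<usize> {
        if let Some(i) = self.scan_free() {
            return Some(i);
        }
        // No free slot: flip the meaning of the mark bit (everything becomes
        // "free"), re-mark what is reachable, then retry the scan once.
        self.sense = !self.sense;
        self.cursor = 0;
        self.trace(roots);
        self.scan_free()
    }

    /// Find a slot whose mark bit currently means "free" and claim it.
    fn scan_free(&mut self) -> Option<usize> {
        while self.cursor < self.mark.len() {
            let i = self.cursor;
            self.cursor += 1;
            if self.mark[i] != self.sense {
                self.mark[i] = self.sense; // claim: now marked live
                self.refs[i].clear();
                return Some(i);
            }
        }
        None
    }

    /// Depth-first trace from the roots, setting the mark bit to `sense`.
    fn trace(&mut self, roots: &[usize]) {
        let mut stack: Vec<usize> = roots.to_vec();
        while let Some(i) = stack.pop() {
            if self.mark[i] != self.sense {
                self.mark[i] = self.sense;
                stack.extend(self.refs[i].iter().copied());
            }
        }
    }
}

fn main() {
    let mut h = Heap::new(4);
    let a = h.alloc(&[]).unwrap(); // root
    let b = h.alloc(&[]).unwrap();
    h.refs[a].push(b);             // a -> b, so b stays reachable
    let _c = h.alloc(&[]).unwrap(); // garbage
    let _d = h.alloc(&[]).unwrap(); // garbage
    // The heap is now full of "live" marks; the next alloc flips the sense,
    // re-marks only what the root reaches (a and b), and reuses a garbage
    // slot without any sweep pass.
    let e = h.alloc(&[a]).unwrap();
    println!("reused slot {e}"); // prints 2 (the old `_c`)
}
```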

Pony’s mark-and-don’t-sweep stores the mark bits in a separate list instead of with the objects (but tracing of course still has to visit the reachable objects in order to trace the references stored in them), which I suppose has the benefit that tracing and marking doesn’t have to wait until all the memory fragments in free space have been marked as reachable by the allocator, because it will not need to visit all the objects marked as unreachable in order to flip those mark bits to reachable before the aforementioned flipping of the meaning of the mark bit.

Related to @jdh30’s point about the importance of cache locality, I also read on pg. 3 of Danko Basch et al, “Mark without much Sweep Algorithm for Garbage Collection”:

Lazy sweep can also reduce the overall GC time due to less coalescing of free objects. It also improves locality because the swept area is small and will probably be immediately used by an allocator and mutator (user program is called mutator because it mutates the connections between objects)

Tri-color abstraction

For readers who are learning about tracing garbage collection, I find the equivalence of the tri-color abstraction easy to understand in the context of Cheney’s algorithm. Cheney’s algorithm is an improvement over the naive algorithm because it reduces the overhead of updating the addresses of references to reachable (i.e. grey or black) objects when they are moved (i.e. copied, before the from-space is deallocated) to the to-space.
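
Below is a toy, index-based sketch of Cheney-style copying collection (in Rust, with hypothetical names; not tied to any real runtime) which also shows where the tri-color sets live: objects below the scan pointer are black, objects between the scan pointer and the end of to-space are grey, and uncopied from-space objects are white.

```rust
/// Toy Cheney copying collector sketch. Objects are just vectors of
/// references (indices); `forwarded` records the new address once copied.
struct Obj {
    refs: Vec<usize>,         // indices into the *current* space
    forwarded: Option<usize>, // forwarding address once evacuated
}

fn collect(from: &mut Vec<Obj>, roots: &mut [usize]) -> Vec<Obj> {
    let mut to: Vec<Obj> = Vec::new();

    // Copy an object to to-space (if not already) and return its new address.
    fn evacuate(from: &mut Vec<Obj>, to: &mut Vec<Obj>, i: usize) -> usize {
        if let Some(fwd) = from[i].forwarded {
            return fwd; // already copied: just follow the forwarding pointer
        }
        let new = to.len();
        to.push(Obj { refs: from[i].refs.clone(), forwarded: None });
        from[i].forwarded = Some(new);
        new
    }

    // Evacuate the roots first (they become the initial grey set).
    for r in roots.iter_mut() {
        *r = evacuate(from, &mut to, *r);
    }

    // The scan pointer chases the allocation pointer until no grey remain.
    let mut scan = 0;
    while scan < to.len() {
        let nrefs = to[scan].refs.len();
        for k in 0..nrefs {
            let old = to[scan].refs[k];
            let new = evacuate(from, &mut to, old);
            to[scan].refs[k] = new; // fix up the reference in place
        }
        scan += 1; // this object is now black
    }
    to // from-space is discarded wholesale; no per-object freeing
}

fn main() {
    // from-space: 0 -> 1, 1 -> 2; slot 3 is unreachable garbage.
    let mut from = vec![
        Obj { refs: vec![1], forwarded: None },
        Obj { refs: vec![2], forwarded: None },
        Obj { refs: vec![],  forwarded: None },
        Obj { refs: vec![0], forwarded: None }, // garbage
    ];
    let mut roots = vec![0];
    let to = collect(&mut from, &mut roots);
    println!("live objects after copy: {}", to.len()); // 3
}
```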

Tying all our discussions into a summary and some additional insights:

Message-based concurrency (and parallelism)

Massively multicore + shared memory doesn’t scale

I wrote on the blog Actors are not a good concurrency model:

The blog post is correct, but as others have noted, for CSP-like channels abstracted as functions with return values that [are] futures, it’s possible to make them composable. And of course these can be declared pure if they don’t have any side-effects.

Also I agree with @‍Ulf Wiger that composability exists on many levels of semantics in a program. Low-level referential transparency isn’t everything about composability. The vaunted equational reasoning of Haskell is a huge price to pay for making so [many] other facets of programming arguably worse.
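
As an aside on that composability point, a minimal sketch (in Rust, using plain threads and channels rather than async/await; purely illustrative) of CSP-like channels hidden behind functions whose return values act as futures, so callers compose them like ordinary functions:

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// A CSP-ish operation exposed as a plain function whose return value is a
/// future-like handle (here just a Receiver). Because the channel wiring is
/// hidden behind the function boundary, callers compose these like ordinary
/// functions.
fn spawn_square(x: u64) -> Receiver<u64> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send(x * x).ok();
    });
    rx
}

fn spawn_sum(a: Receiver<u64>, b: Receiver<u64>) -> Receiver<u64> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let s = a.recv().unwrap() + b.recv().unwrap();
        tx.send(s).ok();
    });
    rx
}

fn main() {
    // Composition: the channels never leak into the caller.
    let result = spawn_sum(spawn_square(3), spawn_square(4));
    println!("3^2 + 4^2 = {}", result.recv().unwrap()); // 25
}
```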

@‍Chris Quenelle, willy-nilly threading breaks multicore scalability:

https://gitlab.com/shelby/Zer0/issues/11#sharing-doesnt-scale

And we will have 64 core (128 hyperthreads) AMD processors in 2019.

@‍martin, threading synchronization destroys multicore scaling due to Amdahl’s law.

I forgot to point out explicitly that side-effects occur even in referentially transparent programs, as I explained elsewhere:

Although avoiding locking by sharing only immutable state doesn’t prevent higher-order semantic bugs including deadlocks and livelocks, because mutable state dependencies can be modeled in other ways, there are still multiple benefits gained by ensuring first-level data race safety.

Remember it was ESR’s blog about SICK algorithms that prompted the discussion that led to me writing the OP of this thread, and now he @eric-s-raymond has blogged again about his skepticism about exploitable parallelism in software.

And again I am going to disagree with some (not all) of his reasoning (and respectfully with @keean if @keean hasn’t yet come around to my perspective).

I will respond to some specific excerpts from his blog as follows.

In the terms of an earlier post on semantic locality, parallel methods seem to be applicable mainly when the data has good locality. And they run best on hardware which – like the systolic-array processors at the heart of GPUs – is designed to support only “near” communication, between close-by elements.

By contrast, writing software that does effective divide-and-conquer for input with bad locality on a collection of general-purpose (Von Neumann architecture) computers is notoriously difficult.

We can sum this up with a heuristic: Your odds of being able to apply parallel-computing techniques to a problem are inversely proportional to the degree of irreducible semantic nonlocality in your input data.

Agreed. Parallelism requires the ability to subdivide the data space.

Another limit on parallel computing is that some important algorithms can’t be parallelized at all – provably so. In the blog post where I first explored this territory I coined the term “SICK algorithm”, with the SICK expanded to “Serial, Intrinscally – Cope, Kiddo!” Important examples include but are not limited to: Dijkstra’s n-least-paths algorithm; cycle detection in directed graphs (with implications for 3-SAT solvers); depth first search; computing the nth term in a cryptographic hash chain; network-flow optimization.

Bad locality in the input data is implicated here, too, especially in graph- and tree-structure contexts. Cryptographic hash chains can’t be parallelized because their entries have to be computed in strict time order – a strictness which is actually important for validating the chain against tampering.

There’s a blocking rule here: You can’t parallelize if a SICK algorithm is in the way.

The crux of my rebuttal has been and remains that the quoted passage is a narrow-minded framing of what is actually blocking parallelism.

We as creative humans who adapt to our resource environment will necessarily lift our paradigms such that we overcome those SICK limitations. Let me give an example to illustrate how this is so.

For example, someone refuted Eric on Dijkstra’s shortest path. Who would have thought that Pi would by now be mathematically proven to be computable in parallel? And recall “The Law of 1998, Why Would Anyone Ever Need 16GB of RAM?”.

Let’s take the necessarily sequential chaining of a cryptographic hash as an example, since reinventing blockchains is my current area of focused research other than my study of programming language design. We subdivide the blockchain into shards instead.
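
A minimal sketch of that sharding idea (in Rust, using std’s DefaultHasher only as a stand-in for a real cryptographic hash; names are illustrative): any single chain must be computed strictly in order, but independent shards each extend their own chain on a separate core.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// One step of a hash chain: h_{i+1} = H(h_i || data_i).
/// Each step depends on the previous one, so a single chain is a SICK,
/// strictly sequential computation.
fn step(prev: u64, data: u64) -> u64 {
    let mut h = DefaultHasher::new(); // stand-in for a cryptographic hash
    prev.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

fn extend_chain(seed: u64, items: &[u64]) -> u64 {
    items.iter().fold(seed, |acc, &d| step(acc, d))
}

fn main() {
    // Sharding: instead of one global chain, keep one chain per shard.
    // The shards have no cross-dependencies, so they extend in parallel.
    let shards: Vec<Vec<u64>> = (0..4u64)
        .map(|s| (0..1_000_000u64).map(|i| i * 4 + s).collect())
        .collect();

    let handles: Vec<_> = shards
        .into_iter()
        .enumerate()
        .map(|(s, items)| thread::spawn(move || extend_chain(s as u64, &items)))
        .collect();

    for (s, h) in handles.into_iter().enumerate() {
        println!("shard {} head = {:x}", s, h.join().unwrap());
    }
}
```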

Additionally there’s a more fundamental myopia that plagues this focus on deterministic SICK algorithms— our universe isn’t deterministic. Exact deterministic solutions become less meaningful than stochastic approximations as the context of their application broadens beyond a deterministic partial order to a stochastic reality. Which will become more and more the case in this interconnected world where our applications are not just self-contained in the deterministic space of our personal desktop.

Humans will paradigm-shift the problem set as required. That doesn’t mean there won’t still be legacy paradigms that weren’t well thought out in terms of the new realities about where Moore’s law is still advancing.

I haven’t studied Eric’s Reposurgeon project to know if I would have any insights on whether it’s possible to paradigm-shift that legacy data structure transform. Apparently back references into the DAG are random and deep, so I suppose perhaps some batching, grouping algorithm in order to generate data locality could be contemplated. EDIT: one of the comments on Eric’s blog also had a similar idea.

But I also don’t understand why his DAG couldn’t be subdivided and then have the divisions communicate via message passing. Is the overhead of message passing greater than the performance speedup of parallelizing on multicore? Does the transform algorithm have to run sequentially or can there be parallel transformation pipelining? So as I said, I would need to study his problem set in detail before I could comment intelligently. Even if message passing would work, that wouldn’t make Python’s multi-process model appropriate, because message passing between processes has more overhead than between concurrent green threads. And Python resorts to multiple processes in the first place because its threads within a single process run effectively single-threaded due to the GIL.

Nevertheless my disagreement with his overly pessimistic skepticism remains.

We’re not done. There are at least two other classes of blocker that you will frequently hit.

One is not having the right tools.

Eric seems to not know about the Pony language model.

  1. For most desktop/laptop users the only seriously parallel computing that ever takes place on their computers is in their graphics chips.

Because programmers have not paradigm-shifted yet. How many programmers even know the Pony language model exists? And Pony has some flaws which need to be rectified, which I have been writing about. This is why I’m contemplating creating the Co programming language (recently renamed from the original Zer0 name).

UPDATE: A commenter on G+ points out that one interesting use case for multicores is compiling code really quickly. Source for a language like C has good locality – it can be compiled in well-separated units (source files) into object files that are later joined by a linker.

I have been saying this for many months also around here.

EDIT: I am not disagreeing with the point that such massively multicore CPUs will be specialized. In this thread for example, I noted that with the Pony model we could in the future have software cache coherency because extant hardware cache coherency is an Amdahl’s law limitation on scaling massively multicore. But note the hardware model remains a general purpose stack machine and not a GPU-like register-file, vector machine. I have stated in this thread we will likely only need a few general purpose CPUs heavily designed for optimizing serial general purpose algorithms.

EDIT#2: Eric is apparently now understanding that games are one example where massively multicore CPU is utilized (and not just the GPU), and I agree with his statement (quoted below) that the lack of tools (especially the programming language we need, c.f. also and also) has been holding programmers back from making the transition to massively multicore:

The languages/toolchains in which these games are authored in just aren’t sophisticated enough to support serious concurrency on the general-purpose CPUs.


META: Tangentially I explain herein why I am unable to respond directly on Eric’s blog. That is because he banned me. Apology in advance to @keean for needing to insert this here, but if Eric is going to read this then I want him to know my opinion that he needs to be hit upside the head with a metaphorical baseball bat to awaken him from whatever hole he crawled into about the realities of his juvenile banning shit.

Wouldn’t it be more informational if Eric had the courage to face me on a neutral discussion forum like this, where he has to actually win all the arguments and not cut me off in the middle of an argument with his censorship? For example for him to try to refute the facts (c.f. also and also). As I remember, it all sort of began circa 2011 when I tried on the 10th year anniversary of 9/11 to do some justice — to for example my heritage as a descendant of Isaac Shelby who fought for the freedoms Eric now enjoys in our country — on his 9/11 blog about what really happened on that day. Some of his regulars were vocal (and condescending, brow-beating, ridiculing) about how offended they are by the notion that 9/11 could have been an inside job. Which I guess caused Eric to dislike me. Eric refers to such facts as babbling nonsense. I am also a U.S. citizen and I refuse to tolerate the corruption of our country. And I am dismayed that he is unwilling to host the necessary transparency. Mr. open source and all. My study of 9/11 is much more solid by now than it was in 2011. In 2011, I didn’t have the knowledge to concisely refute the bellicose interactions on his blog. Eric might think his bans are justified because of those who he thinks derail with off-topic discussion, but his regulars also do this quite often and he allows it. And the discussion I engaged in was also a natural progression of, or fundamentally related to, the thread subject (but according to Eric’s ignorance, it was all babbling nonsense anyway).

I wonder if ESR has ulterior motives or conflicts-of-interest. Does he need to attract more donations to his Patreon by grandstanding to the anti-semitism political correctness totalitarianism? That doesn’t mean I hate Jews! Oh yeah I know he talks a good talk about his minarchist ideology, but he also seems to love politics, grandstanding, and censorship. On the one hand, one can have some appreciation, love and respect for his contributions and talents, but then be entirely frustrated by his Herculean ego and nastiness when it manifests in banning discussion and the written venom he spews out towards people whose opinions (e.g. towards Jeff Read) he disagrees with. Btw, I usually ideologically agree with Eric when he ridicules Jeff, but I cringe at the example he sets in terms of etiquette and discourse. Especially when it is political and ideologically based discussion (Jeff being an Apple and walled gardens fanboy). At least when Linus does it in email, it is nearly always ridicule about poor quality code, which is not an ad hominem attack but rather just being frank about low quality work.

That’s a Southern gentleman issue. In the South, speaking to someone the way Eric does should be resolved with a duel (and no this isn’t a veiled threat). You just don’t treat people that way even if you are correct. He would retort that noise isn’t signal, but then he must refute those facts about 9/11 I linked above which he has been censoring (as one example of the myriad of signal he censors). That is a way of saying I am a person who habitually forgives and forgets to work forward hopefully with harmony. But for some reason that guy can’t tolerate me. Maybe because I’m a Southerner and he is a city slicker Northerner? Every several months or so, I wander over to his blog to see if I can learn anything important. Sometimes I find blogs within my area of knowledge and I want to offer some comments. I am not stalking the man. Cripes I have more important things to do.

Now Freeman Dyson, that is a man I can respect. And sadly he is approaching being a centenarian so may not be with us that much longer. Amazing combination of intelligence and also devoid of outwardly nasty ego. Maybe because Freeman doesn’t even put himself in those situations in the first place. After all, Freeman doesn’t need a blog of sycophant regulars to validate his writing. Freeman is busy inside his head and talking with other scientists. As I am busy coding doing my own thing. Anyway for the record, that is why I can’t post my comments on Eric’s blog and instead do so here.

P.S. while the sycophants play with tinkertoys, the plot thickens.

As I postulated up-thread, moving towards software driven cache coherency and minimizing the moves of data between cores:

http://www.rexcomputing.com/

keean commented

@shelby3 it's great to see people trying to make progress in this area, however I am not hopeful for success. Cache is very regular (that is, it is easy to construct the blocks from hardware 'macros' when designing the chip), so in terms of hardware design effort it is cheap compared to complex parts like instruction decode, ALUs and control logic, whilst at the same time producing a very large increase in performance (10 - 1000x). In my experience people want CPU performance more than they want battery life, otherwise they could have a laptop that lasts days by using low power ARM cores. Also compatibility with existing software is important. Look at failures like Transmeta for example, and note that nobody uses x86 emulation on ARM.

You have to consider whether Microsoft's Windows-on-ARM laptop will succeed (probably not) because it cannot run x86 Windows software, compared to a Chromebook, which can be viewed as a big phone with a keyboard and can run all Android software.

Your argument is only true for some markets.

Laptop != mobile. Mobile users typically want (at least a day’s) battery life before prioritizing performance. Business laptop users (not gamers) typically want (at least a few hours’) battery life before prioritizing performance. ARM is killing Intel on mobile. ARM is preparing to make a run at the server market.

Rex Computing is I believe targeting specialized supercomputers.

Rome is not built in a single day. Intel and hardware cache coherency will die in peripheral markets before finally succumbing in other mainstream computing.

keean commented

@shelby3 Didn't work for the PlayStation 2, which had a non-cache architecture, but the PS3 (and PS4) went back to a traditional architecture because it was too difficult to program for the PS2.

We need a PL that supports the model and is popular. Rome is not built in a day. We are actively developing Co now.

Also the PS2 did not have enough cores to make it worthwhile. When the s/w cache systems have 20X more cores per watt (and thus per TDP limitation), the incentive increases.

keean commented

@shelby3 software control of scratch memory is a pain. The problem is that it makes it the compiler's job to control the cache, but the compiler cannot anticipate runtime scheduling. It may save power, but it won't work anywhere near as well as a hardware cache that can account for the actual runtime scheduling of instructions and interrupts.

@keean did you forget my upthread point that with “Actors” (actually CSP), cache coherency between cores can be managed in s/w at the juncture of the sending of messages? Removing that h/w responsibility will save power, and with the right PL it will not be hard to program.

Again I reiterate that this requires thinking about global memory as partitioned into “Actors”.
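
A minimal sketch of what “global memory partitioned into Actors” looks like in code (in Rust, with plain OS threads and channels standing in for an actor runtime; the names are illustrative): each worker exclusively owns its chunk, and data only crosses cores at the message-send juncture, by transferring ownership rather than by sharing mutable memory.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (result_tx, result_rx) = mpsc::channel();

    // Each chunk will be private to exactly one worker ("actor").
    let data: Vec<Vec<u64>> = (0..4u64)
        .map(|p| (p * 1_000..(p + 1) * 1_000).collect())
        .collect();

    for (id, chunk) in data.into_iter().enumerate() {
        let tx = result_tx.clone();
        // `move` transfers ownership of `chunk`: after this point no other
        // thread can touch it, so its cache lines never ping-pong between cores.
        thread::spawn(move || {
            let sum: u64 = chunk.iter().sum();
            tx.send((id, sum)).unwrap(); // data crosses cores only at this send
        });
    }
    drop(result_tx); // so the receive loop ends when all workers finish

    for (id, sum) in result_rx {
        println!("actor {id} -> sum {sum}");
    }
}
```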

I agree with you that cache set associativity and load/store (i.e. not coherency between cores) has to still be managed in h/w because that is very difficult to manage in s/w generally. But that is not where all the power consumption gains are coming from. Did you fail to read, remember, or assimilate the post I linked upthread:

Massively multicore + shared memory doesn’t scale

That clearly explains that partitioning the global memory for exclusive access to some core(s) drastically reduces the power consumption:

Analogous to how Internet connectivity is much slower than memory bus connectivity, core interconnectivity bandwidth (and latency) won’t scale proportionally with the increase in cores. So software has to be written to minimize unnecessary communication between cores so that computing hardware can be designed for the distributed memory variant of MIMD. Just as with the design of the client-server paradigm, multicore software must be designed to minimize distant (i.e. non-local, not cached) memory resource sharing by keeping needed data close to the core that will work with it. For example, the power consumption is ~6X greater (2 vs. 11 pJ/b) for inter-chip versus on-chip Infinity Fabric interconnections— a ratio that will worsen because […]

I emailed Rex Computing asking them to read my prior post.

I agree with you that cache set associativity and load/store (i.e. not coherency between cores) has to still be managed in h/w because that is very difficult to manage in s/w generally. But that is not where all the power consumption gains are coming from.

Apparently compiler driven methods are approaching h/w controlled methods:

https://ece.umd.edu/~barua/udayakumaran-TECS-2006.pdf

Note that the paper seems to support the fact that most of the power consumption gains are not from eliminating the h/w mgmt of the cache:

Studies [Banakar et al. 2002] have shown that scratch-pads use 34% lesser area and consume 40% lower power than a cache of the same capacity. These savings are significant since the on-chip cache typically consumes 25-50% of the processor’s area and energy consumption, a fraction that is increasing with time [Banakar et al. 2002].

Unfortunately I can’t read the Rex Computing paper as it is behind a paywall and I refuse to pay.

Massively multicore has arrived…

From private message:

I am not sure what you are refuting, these new ARM chips have a speculative out-of-order architecture. What they don't have is hyperthreading, but that is just a way to try and increase thread level parallelism. New ARM cores have added these features to enable them to enter their server space. Cortex-A15 has speculative execution for servers, Cortex-A5 does not.

The bolded has been my prediction and my salient point.

I repeat my upthread arguments. The constant speedup of OoOE pales in comparison to the exponential speedup from maximizing the number of cores. And (wasted[1]) power consumption is a huge if not the #1 factor now, which impacts the design of the silicon because power dissipation is a huge problem and constrains the degrees-of-freedom w.r.t. positioning of components on the silicon. OoOE is power-consumption inefficient and growing more insignificant as a facet of the expanding whole (as I anticipated), especially as they discover they can trade off more cores and lower power consumption for at least the intensity of the OoOE, if not the ILP (instruction level parallelism) entirely, at some point in the future.

[1] Wasteful because it is speculative and thus computes some fraction of the whole which is discarded. Attempted mitigations of Spectre will increase the waste. There will never exist a 100% solution to every possible side-channel attack that can be devised, and thus speculative execution will eventually be accepted to be a security bug and discarded. Go ahead and try to debate me on this point.

Another factor is that green threads are becoming the new paradigm for massive concurrency, and these essentially render OoOE less effective because they frequently purge the register file and instruction pipeline. As the number of green threads proliferates into the billions and trillions such that user-layer context switches become more frequent and granular, OoOE may become a net negative when factoring in the power and die area costs relative to the thread parallelism that could be gained by repurposing the silicon and lowering the TDP, which impacts the holistic economies-of-design.

And even if I am incorrect about ILP disappearing or diminishing from the silicon, it is a fact that it is diminishing as a portion of any speedup and scaling of Moore’s law. If Moore’s law stops, computing dies. So whether it becomes deprecated or not, ILP is now becoming entirely irrelevant to the future of computing. And this was simply obvious to me.

Given that green threading context switches attempt to align with purges of the instruction pipeline (e.g. branches and function calls) to minimize costs, perhaps ILP/OoOE will become to some extent more explicitly guided by software (statically or dynamically)? Why should the hardware need to attempt to interpret some static knowledge that a loop will not exit (and the runtime will not context switch) until it has looped N times wherein N is known dynamically before the loop begins? There’s semantic knowledge that the hardware is blind to, which it may benefit from, which was one of my key theses.

To expound a bit on why Spectre can never be entirely fixed, let me first quote from the research paper:

Spectre attacks involve inducing a victim to speculatively perform operations that would not occur during correct program execution and which leak the victim’s confidential information via a side channel to the adversary.

So the vulnerability is due to the fact that the speculation is not random and is in fact driven by the runtime data that is going to be pilfered with the attack. You may presume you can foreclose every footprint of the speculative execution available to side-channels (e.g. other cores), but in general this is impossible because timing attacks can even be listened to by means outside the CPU entirely! Wrap your mind around that. So the attacker does not even need to find a footprint within the CPU itself. The mere presence of speculative execution is sufficient. This is why cryptographic algorithms are designed to run at constant speed regardless of the inputs, so that they are immune to timing analysis.
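
To illustrate the constant-speed point, a minimal sketch (in Rust; not a vetted crypto implementation, and a real one would also have to guard against compiler optimizations undoing the effect): the early-exit comparison leaks how many leading bytes of the secret matched through its running time, while the constant-time version touches every byte regardless.

```rust
/// Early-exit comparison: running time depends on how many leading bytes
/// match the secret, which an attacker can measure (a timing side channel).
fn leaky_eq(secret: &[u8], guess: &[u8]) -> bool {
    if secret.len() != guess.len() {
        return false;
    }
    for i in 0..secret.len() {
        if secret[i] != guess[i] {
            return false; // the exit point depends on secret data
        }
    }
    true
}

/// Constant-time comparison: always touches every byte and folds the
/// differences into an accumulator, so the running time carries no
/// information about where the first mismatch occurred.
fn ct_eq(secret: &[u8], guess: &[u8]) -> bool {
    if secret.len() != guess.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for i in 0..secret.len() {
        diff |= secret[i] ^ guess[i];
    }
    diff == 0
}

fn main() {
    let secret = b"correct horse battery staple";
    assert!(leaky_eq(secret, secret) && ct_eq(secret, secret));
    assert!(!leaky_eq(secret, b"correct horse battery stapLe"));
    assert!(!ct_eq(secret, b"correct horse battery stapLe"));
}
```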

@keean wrote on Jun 6, 2018

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.


Ampere announces 128-core Arm server processor

Ampere – the semiconductor startup founded by former Intel executive Renee James and not to be confused with the new line of Nvidia cards – just introduced a 128-core, Arm-based server processor

[…]

While Intel and AMD have gone with hyperthreading, allowing for two threads per core, Ampere went for single-threaded cores. Ampere says that as it adds cores, performance scales linearly, according to the Stress-NG benchmark.

“You'd think everyone must scale [as they add cores], but they don't. AMD gets bottlenecked from system level functions. A thread is not the same as a core. You are sharing resources among multiple threads. Scalability was the prime reason we went with single-threaded cores,” Wittich says.

Amazon Launches a Killer ARM Server Chip With the Graviton2

Amazon also launched its own Graviton CPU back in 2018, but the CPU wasn’t all that great a deal. Graviton2 looks to be a different story.

The Graviton2 is based on ARM’s Neoverse N1, with a slightly lower clock than ARM specified (2.5GHz). Each Graviton2 core offers 128KB of L1 (64K i-cache, 64K d-cache), and a 1MB L2. The entire 64-core CPU is backed up with a 32MB L3. The Neoverse N1 is a derivative of the Cortex-A76 and Amazon is using ARM’s CMN-600 mesh interconnect. There are 8 memory channels supporting DDR4-3200.

Graviton2 is a completely different product than Graviton, according to Anandtech, which recently reviewed it against competing solutions from Intel and AMD.

[…]

It’s Official: NVIDIA To Acquire Arm For $40 Billion

Over the last half-decade NVIDIA has undergone significant growth – both in regards to revenue and market capitalization – thanks in big part to NVIDIA’s newfound success in the server and datacenter market with their deep learning accelerators. While the company is well off of its 52-week high that it set earlier this month, NVIDIA now has a sizable $330B market cap that they are leveraging to make this deal possible.

And according to the company, it’s that success in the server market that is driving their interest in and plans for Arm. NVIDIA expects the server market to remain a high-growth opportunity, and that by acquiring Arm they can leverage Arm’s recent success with Neoverse and other server products to reach an even bigger chunk of that market.

[…]

Ultimately, the Arm deal will be a significant shift in the industry, one far bigger than the $40 billion price tag indicates. In one form or another, Arm and its IP are at the heart of billions of chips sold every year, a reach that few other companies can match. For NVIDIA, acquiring Arm will cement their position as a top-tier chip designer, all the while giving them an even larger technology portfolio to go after the server market, and new opportunities to sell IP to other vendors.

Nvidia’s $40 billion Arm acquisition is about bringing AI down from the cloud

A big deal that could be a big deal

As Huang put it: “AI is moving from the cloud to the edge, where smart sensors connected to AI computers can speed checkouts, direct forklifts, orchestrate traffic, save power. In time, there will be trillions of these small autonomous computers, powered by AI, connected to massively powerful cloud data centers in every corner of the world.”
"Nvidia supplies the data centers, Arm supplies the mobile chips"

As the creator of the GPUs used in many AI data centers, Nvidia can supply the first half of this equation, while Arm, designer of cheap and energy-efficient mobile chips, takes care of the second. It doesn’t take a genius to work out there’s potentially profitable overlap between these two businesses. (Though it does take $40 billion to force that overlap into existence.)

[…]

Fittingly enough, Nvidia’s ability to offer extra capitalization is, in part, only due to interest in artificial intelligence. As journalist Alistair Barr noted on Twitter, back in 2016 Nvidia was worth $30 billion, but its valuation has since increased to over $300 billion because of demand generated for its GPUs by machine learning (among other things). So while demand for AI motivated this deal, it also made it possible in the first place.

Nvidia's Arm Deal Would Make It the Center of the Chip World

Combining the two chipmakers would unite leaders in two big tech trends—artificial intelligence and mobile computing.

In recent years, Nvidia has ridden one of the biggest waves in technology, selling chips needed to build increasingly clever artificial intelligence algorithms. Now, the company plans to catch another big swell—mobile computing—with a $40 billion acquisition of Arm, which designs the chips found in virtually all smartphones.

The deal would reshape the chip industry overnight—putting Nvidia at the center of much of the action.

[…]

The two companies are dominant in AI and mobile, but Intel remains king of the data center. Only recently have some companies, including Amazon, begun adapting Arm’s designs to make more efficient new data center chips.

Data center sales are a fast-growing part of Nvidia’s business, generating $3 billion in revenue last year, up from $830 million in 2016. In April, the company acquired Mellanox, an Israeli company that makes hardware transmitting data between chips in cloud computing systems, for $6.9 billion.

In an interview with industry analyst Patrick Moorhead, Nvidia CEO Jensen Huang said adding Arm to Nvidia would help sell more Arm chips to data centers. “What will change is the rate of our road map,” Huang said. “We know for sure that data centers and clouds are clamoring for the Arm microprocessor, the Arm CPU.”

In coming years, more computing is likely to drift into the cloud, especially as more companies make use of AI for business applications and as 5G wireless networks create new possibilities for sharing data and applications.

[…]

“We're about to enter a phase where we're going to create an internet that is thousands of times bigger than the internet that we enjoy today. A lot of people don't realize this,” Huang told Moorhead. “We would like to create a computing company for this age of AI.”

Mike Demler, an analyst at the Linley Group, which specializes in microchip technology, says this needs to be a long-term strategy, as Arm designs are just starting to find their way into data centers.

Demler believes Nvidia’s plan may be to combine Arm’s general purpose chips with its own specialized AI ones, an approach that could improve the capabilities of both. “Nvidia’s intent is to proliferate a new computing model by having a tighter integration of CPU and AI architectures,” he says.


Remember I noted that ARM supports “tagged pointers,” which can make null pointers, tagged unions, dynamic dispatch for existential quantification, ARC and/or interior pointers (i.e. those that point within an allocated block of memory) more efficient. I wrote:

Elimination of tracing and ARC removes the extra overhead tradeoffs required for “interior” pointers into the body of a data structure. But it doesn’t remove the overhead issue for those that persist, which require tracing or ARC.
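
For readers unfamiliar with the technique, here is a minimal, portable sketch of pointer tagging (in Rust, stealing the low alignment bits; hypothetical names, not any particular runtime’s layout). ARM’s Top Byte Ignore goes further by letting software stash a tag in the high byte that the hardware ignores on loads, which is what makes the cases above cheaper.

```rust
/// Heap allocations are at least word-aligned, so the low bits of a pointer
/// are always zero and can carry a small tag (e.g. a variant discriminant,
/// a "refcounted?" flag, or a null/none marker).
const TAG_MASK: usize = 0b111; // 8 distinct tags with 8-byte alignment

fn tag_ptr<T>(p: *const T, tag: usize) -> usize {
    let addr = p as usize;
    debug_assert_eq!(addr & TAG_MASK, 0, "pointer must be aligned");
    debug_assert!(tag <= TAG_MASK);
    addr | tag
}

fn untag_ptr<T>(tagged: usize) -> (*const T, usize) {
    ((tagged & !TAG_MASK) as *const T, tagged & TAG_MASK)
}

fn main() {
    let boxed = Box::new(42u64); // u64 is 8-byte aligned on the heap
    let raw: *const u64 = &*boxed;

    // Pack a 3-bit tag (say, "this is an ARC-managed object") into the pointer.
    let tagged = tag_ptr(raw, 0b101);
    let (ptr, tag) = untag_ptr::<u64>(tagged);

    assert_eq!(tag, 0b101);
    // Safe here only because `boxed` is still alive and `ptr` points into it.
    unsafe { assert_eq!(*ptr, 42) };
    println!("value = {}, tag = {:03b}", unsafe { *ptr }, tag);
}
```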


I wrote on Sep 10, 2018

@keean regarding our prior discussion about superscalar versus massively multi-core, the 32 core (64 logical thread) AMD Threadripper seems to only excel at massively parallelized tasks which are not memory (cache coherency) bound:

[…]

I wrote on Oct 20, 2018:

I have also augmented my original proposal by acknowledging we probably also need the ability to instantiate region-based memory allocation (aka the Arena pattern), with each region separately bump-pointer heap allocated. Thus a region is only ever freed all at once with no mark-and-sweep. This increases efficiency for use cases that are compatible with such a model.

I posit that these proposed models in addition to avoiding thread synchronization of collection cycles may cope better in some use cases (because they provide alternative assumptions about duration and cordoning of the generation) with the issue you mentioned in your blog as quoted below about violations of the young generational hypothesis assumption (which you’ve also mentioned can be sometimes ameliorated by eliminating superfluous boxing and allocations):

Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.
Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation.

You noted in another of your blogs that timely young generation collection cycles are important for maximum performance of (what you refer to as ‘symbolic’1) code that produces a lot of young garbage. So my proposed model above would require some intervention by the programmer to subdivide his algorithm between actors such that collection cycles are frequent enough to maximize performance. But this would conflate concerns because the Pony capabilities actor model is also about data race safe threaded sharing. Thus the need for regions, and I think we also need to add the option of regioning by function scope in addition to the proposed region id.

Will the programmer be capable of this static partitioning of allocation (i.e. is it a deterministic phenomenon), or is AMM needed to stochastically solve a non-deterministic phenomenon? My thought is that although program demands can be non-deterministic, the algorithms can sometimes be broken down into relatively deterministic parts that are combined non-deterministically. This will depend on whether the algorithm can efficiently and deterministically identify which objects it wants to copy out of the short-lived, frequently collected bump-pointer allocated region. For algorithms that can’t, then mark-and-sweep is perhaps required to maximize performance. For n-Queens it seems all the young garbage can be eliminated by employing a mutable array instead of creating temporary copies, and then zeroing out the rows when backtracking.

Again I want to reiterate that fully AMM seems to not scale to massively multicore which is where the industry is rapidly headed now. The issue is more complicated than just stopping-the-world or not, as I explained up-thread (kudos to @keean for the initial insight) it also involves scaling the CPU’s cache coherence overhead. Pony’s local mark-and-don’t-sweep, global reference counting via message passing AMM model seems to have performance issues. So it seems we need a different model. I’m trying to figure out what the model should be. Note the cache coherency issue is mostly due to sharing between threads, which is mostly resolved by Actors, yet may also have interplay with the GC strategy.

I wrote on Nov 15, 2018:

Changing topic a bit: it seems with recent announcements that AMD is moving towards a design in which sharing data between cores (or at least between 8-core chiplets) will have more latency and cost (Infinity Fabric will remain on 14nm for the time being). Perhaps we should deprioritize being able to avoid copying when sending data between Actors, as that will become a nearly useless optimization in the future. This design point may impact the GC issue we have been discussing in the other thread.

@keean wrote on Jan 12, 2019:

@shelby3 it's great to see people trying to make progress in this area, however I am not hopeful for success. Cache is very regular (that is, it is easy to construct the blocks from hardware 'macros' when designing the chip), so in terms of hardware design effort it is cheap compared to complex parts like instruction decode, ALUs and control logic, whilst at the same time producing a very large increase in performance (10 - 1000x). In my experience people want CPU performance more than they want battery life, otherwise they could have a laptop that lasts days by using low power ARM cores. Also compatibility with existing software is important. Look at failures like Transmeta for example, and note that nobody uses x86 emulation on ARM.

One of the reasons for the improved cost-efficiency of ARM servers is the lower heat output due to lower power consumption.

I will end up being correct. I usually am proven correct over the long-term.