sync: eliminate global Mutex in Pool operations

In the review for https://golang.org/cl/101715 (“regexp: use sync.Pool to cache regexp.machine objects”), @ianlancetaylor notes:

A similar change was made in https://golang.org/cl/44150043, by [@bradfitz]. It was rolled back in https://golang.org/cl/48190043. [@dvyukov] said: "The remaining concern is: Are there cases where a program creates lots of local short-lived regexps (one-shot), and it's not possible to replace them with global ones? For this case we've introduced a global contended mutex."

I'm guessing that comment refers to allPoolsMu here:

go/src/sync/pool.go, line 242 in 2e84dc2:

allPoolsMu Mutex

It strikes me as odd that sync.Pool should have a global Mutex at all. After all, sync.Pool explicitly cooperates with the garbage collector (and will likely cooperate even more after #22950), and the garbage collector will trace all of the live sync.Pool instances in the course of a normal garbage collection.

The allPools slice appears to exist so that the garbage collector can clear all of the sync.Pool instances before tracing them. We have to identify Pools before tracing them so that we don't over-retain pooled objects, but it seems like we could do that without acquiring global locks beyond the usual GC safepoints.
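To make the contention concrete, here is a simplified model of the registration pattern described above. It is not the runtime's code: registry, toyPool, and clearAll are hypothetical names, and clearAll only stands in for the cleanup the garbage collector triggers.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical model of the allPools slice guarded by allPoolsMu.
type registry struct {
	mu    sync.Mutex
	pools []*toyPool
}

// toyPool is a hypothetical stand-in for sync.Pool that keeps only what is
// needed to show registration and clearing.
type toyPool struct {
	reg        *registry
	mu         sync.Mutex
	registered bool
	items      []any
}

// Put stores x. On the first use after creation (or after a clear), the pool
// registers itself under the registry-wide lock.
func (p *toyPool) Put(x any) {
	p.mu.Lock()
	first := !p.registered
	p.registered = true
	p.items = append(p.items, x)
	p.mu.Unlock()
	if first {
		p.reg.mu.Lock() // every newly used pool contends here
		p.reg.pools = append(p.reg.pools, p)
		p.reg.mu.Unlock()
	}
}

// clearAll models the GC-time cleanup: it drops the cached items of every
// registered pool and forgets the pools until they are used again.
// It returns the number of pools it cleared.
func (r *registry) clearAll() int {
	r.mu.Lock()
	pools := r.pools
	r.pools = nil
	r.mu.Unlock()
	for _, p := range pools {
		p.mu.Lock()
		p.items = nil
		p.registered = false
		p.mu.Unlock()
	}
	return len(pools)
}

func main() {
	r := &registry{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p := &toyPool{reg: r} // a short-lived, one-shot pool
			p.Put("x")
		}()
	}
	wg.Wait()
	fmt.Println(r.clearAll()) // 100: each pool took r.mu once to register
}
```

In this model a program that creates many one-shot pools serializes on r.mu once per pool, which is the pattern the quoted concern describes.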

For example, we could add newly-allocated pools to a per-G list, and move that list to the global list only when the thread running that G acquires the scheduler lock.

If we happen to miss a new Pool on the current GC cycle (for example, because its goroutine wasn't descheduled until near the end of the cycle), that's ok: we'll just wait until the next GC cycle to clear it.
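The deferred registration idea can be sketched in ordinary Go as follows. This is only an illustration under stated assumptions: worker stands in for a G, flush stands in for the moment its thread already holds the scheduler lock, and globalList, pool, and the locks counter are hypothetical names that do not exist in the runtime.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical placeholder for a sync.Pool instance.
type pool struct{ name string }

// globalList models the global list of pools. mu stands in for a lock that
// is already being acquired for other reasons (the scheduler lock).
type globalList struct {
	mu    sync.Mutex
	pools []*pool
	locks int // how many times mu was acquired, for illustration only
}

// worker models one goroutine (G). pending is touched only by its owner,
// so appending to it needs no synchronization.
type worker struct {
	pending []*pool
}

// newPool creates a pool and records it locally without taking any lock.
func (w *worker) newPool(name string) *pool {
	p := &pool{name: name}
	w.pending = append(w.pending, p)
	return p
}

// flush moves the pending pools to the global list in one critical section.
// Pools that have not been flushed yet are simply missed by a cleanup that
// runs in the meantime and are picked up on a later cycle.
func (w *worker) flush(g *globalList) {
	if len(w.pending) == 0 {
		return
	}
	g.mu.Lock()
	g.locks++
	g.pools = append(g.pools, w.pending...)
	g.mu.Unlock()
	w.pending = nil
}

func main() {
	g := &globalList{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			w := &worker{}
			for j := 0; j < 1000; j++ {
				w.newPool(fmt.Sprintf("w%d-p%d", id, j))
			}
			w.flush(g)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(g.pools), g.locks) // 4000 4
}
```

The point of the sketch is the ratio: 4000 pools are registered with only 4 acquisitions of the shared lock, instead of one acquisition per pool.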

That could make sync.Pool substantially more efficient for objects that tend to be long-lived but are sometimes short-lived too (such as compiled regexps).

(See also #24411, the motivating example.)

For example, we could add newly-allocated pools to a per-G list

I tried to solve this problem today.
My guess is that the per-G list is effectively P-local storage, so there are two ways to implement it.
Option 1 is to modify the runtime to support it directly. Option 2 is to obtain a function from the runtime via linkname that returns the ID of the current P, and to use that function to build P-local storage.
I tried option 2. If the P-local storage is implemented with a map, a lock is still needed for synchronization. It could also be implemented with a slice.
In the end I found that it may be possible to use an API similar to the one in https://go-review.googlesource.com/c/go/+/552515, with two sharded values, to solve the problem.
Note: if GOMAXPROCS changes, sharded does not fully guarantee that the same P always uses the same memory, or that different Ps always use different memory. In other words, using sharded only makes the lock no longer global, which reduces contention on any single lock.
These are my attempts so far.
I am sharing them in the hope that they help resolve this issue.
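The effect described in the note can be illustrated with generic lock sharding. The sketch below is not the API from the linked CL and does not use linkname: shardedRegistry, shard, add, and drain are hypothetical names, and the hint argument is an assumption standing in for a P ID obtained from the runtime.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical placeholder for a sync.Pool instance.
type pool struct{ id int }

// shard is one independently locked slice of the registry.
type shard struct {
	mu    sync.Mutex
	pools []*pool
}

// shardedRegistry spreads registrations over several locks.
type shardedRegistry struct {
	shards []shard
}

func newShardedRegistry(n int) *shardedRegistry {
	return &shardedRegistry{shards: make([]shard, n)}
}

// add registers p in the shard selected by hint. Two different hints can
// map to the same shard, so a lock is still required; it is just contended
// by fewer callers than a single global lock would be.
func (r *shardedRegistry) add(hint uint, p *pool) {
	s := &r.shards[hint%uint(len(r.shards))]
	s.mu.Lock()
	s.pools = append(s.pools, p)
	s.mu.Unlock()
}

// drain collects the pools from every shard and empties the shards,
// modeling a cleanup pass that visits each shard in turn.
func (r *shardedRegistry) drain() []*pool {
	var all []*pool
	for i := range r.shards {
		s := &r.shards[i]
		s.mu.Lock()
		all = append(all, s.pools...)
		s.pools = nil
		s.mu.Unlock()
	}
	return all
}

func main() {
	r := newShardedRegistry(4)
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(hint uint) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				r.add(hint, &pool{id: i})
			}
		}(uint(w))
	}
	wg.Wait()
	fmt.Println(len(r.drain())) // 800
}
```

With 8 callers and 4 shards, each lock is shared by 2 callers here rather than by all 8, which matches the observation that sharding reduces contention without removing the lock.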

Change https://go.dev/cl/562336 mentions this issue: sync: eliminate global Mutex in (*Pool).pinSlow operations