estarriolvetch/solidity-bits

Yul optimisation

Opened this issue · 8 comments

Solidity does a pretty good job generally but some things help, for example:

        //uint8(LOOKUP_TABLE_256[(isolateLS1B256(bb) * DEBRUIJN_256) >> 248]);
        bytes memory table = LOOKUP_TABLE_256;
        assembly {
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            result := mload(add(table, add(bb, 1)))
            result := and(0xff, result)
        }
    }

Whereas this function was the same either way:

    function set(BitMap storage bitmap, uint256 index) internal {
        uint256 bucket = index >> 8;
        uint256 mask = MASK_INDEX_ZERO >> (index & 0xff);
        bitmap._data[bucket] |= mask;
        //assembly {
        //    let bucket := shr(8, index)
        //    let mask := shr(and(index, 0xff), 0x8000000000000000000000000000000000000000000000000000000000000000)
        //    mstore(0, bucket)
        //    mstore(0x20, bitmap.slot)
        //    let slot := keccak256(0, 0x40)
        //    let current := sload(slot)
        //    sstore(slot, or(current, mask))
        //}
    }

However, I've done a test version with Chiru's 4.2 ERC721A (combined with a lengthless array optimisation around storage for this and another struct) and it was easiest to simply port across the two functions used and merge them in rather than me go through and yul/test the solidity-bits lib.

Happy to help if you want the lib improved though.

Another function I converted:

        bytes memory table = LOOKUP_TABLE_256;
        assembly {
             let bucketIndex
             let bb
            //bucket = index >> 8;
            let bucket := shr(8, index)
            // index within the bucket
            //uint256 bucketIndex = (index & 0xff);
            bucketIndex := and(index, 0xff)

            for {
                // load a bitboard from the bitmap.
                //uint256 bb = bitmap._data[bucket];
                mstore(0, bucket)
                mstore(0x20, bitmap.slot)
                bb := sload(keccak256(0, 0x40))
                // offset the bitboard to scan from `bucketIndex`.
                //bb = bb >> (0xff ^ bucketIndex); // bb >> (255 - bucketIndex)
                bb := shr(xor(0xff, bucketIndex), bb)
                if iszero(bb) {
                    bucketIndex := 255
                }
            } iszero(bb) {
                mstore(0, bucket)
                bb := sload(keccak256(0, 0x40))
            } {
                if iszero(bucket) {
                    revert(0, 0)
                }
                bucket := sub(bucket, 1)
            }

            //bb.bitScanForward256()
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            bb := mload(add(table, add(bb, 1)))
            bb := and(0xff, bb)
            //result = (bucket << 8) | (bucketIndex - bb.bitScanForward256());
            result := or(shl(8, bucket), sub(bucketIndex, bb))
        }
    }

The 721A version and notes can be found in my comment here:

chiru-labs/ERC721A#391 (comment)

@sillytuna This is interesting. Do you have a gas usage comparison (with solidity optimizer enabled)?

I was testing in situ. Marginal gains from the second two funcs but worth it on something like this, esp if could be called a lot.

Should be quick for you to put in your code and run a quick gas comparison tho.

My concern is it may affect the code readability. I also saw you lengthless array PR for ERC721A.

Maybe shifting from mapping to lengthless array for the bitMap datastructure will be a bigger improvement.

My view on that is it depends on use. Code like this can be needed in gas performant functions and will be write once use many. I'd keep the original solidity in hand to be read for sure and yul would need well testing and gas checks.

Your code is already quite hard to understand just because of the algorithm so I'm not sure yul really changes that. Can be well commented and include solidity comments.

        //uint8(LOOKUP_TABLE_256[(isolateLS1B256(bb) * DEBRUIJN_256) >> 248]);
        bytes memory table = LOOKUP_TABLE_256;
        assembly {
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            result := mload(add(table, add(bb, 1)))
            result := and(0xff, result)
        }
    }

I took a look and it seems the gas saving comes from two factors:

  • removing the require(bb>0) check.
  • inline the isolateLS1B

Unless there is a good reason, I don't think require(bb>0) should be removed. As for the merging the function, solidity optimizer now have this optimization stage and should take care that automatically based on the optimization parameter you specify. It's a deployment gas vs run time gas problem.

Makes sense. I forgot that I'd removed that require because of the specific use case essentially having it covered.

It's certainly worth checking what's worth being yul / inline and what isn't.