zherczeg/sljit

Port PCRE2 JIT to Linux on IBMz (s390x)

edelsohn opened this issue · 71 comments

Enable PCRE2 JIT on Linux on IBMz (s390x-linux) and optimize to achieve equivalent speedup over non-JIT code as x86_64. Goal is full functionality and passing testsuite with JIT enabled.

An unfinished port of PCRE2 JIT to s390x exists in the Linux-on-IBM-z Github account and can be used as a starting point. IBM will contribute the code as necessary.

https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental

A $5000 bounty from IBM is posted on Bountysource:
https://www.bountysource.com/issues/92353837-port-pcre2-jit-to-linux-on-ibmz-s390x

Systems access is available through the LinuxONE Community Cloud at Marist
https://linuxone.cloud.marist.edu/#/register?flag=VM

This issue is tied to a feature request to the Exim community for PCRE2 support.
https://bugs.exim.org/show_bug.cgi?id=2635

@zherczeg Do you know of someone in the PCRE2 JIT community who would like to work on this issue?

Perhaps @carenas might be interested.
As for me, I work at a university that has many industrial collaborations; that could be an option, although it would be pricier.

the development branch s390x shows (only in the cloud though, not qemu) some encouraging progress :

$ bin/regex_test
Pass -v to enable verbose, -s to disable this hint.

REGEX tests: all tests are PASSED on s390x 64bit (big endian + unaligned)

pcre2 itself fails (even without JIT) and will need work functionality-wise (pcre2test segfaults with anything that touches invalid UTF), ex: not including a commented-out test that would segfault in the JIT part:

Successful test ratio: 50% (330 failed)
Invalid UTF8 successful test ratio: 0% (129 failed)

and as zherczeg pointed out will likely need a significant amount of work to get close to that performance objective

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

How much effort do you estimate to make PCRE2 functional?

How much effort to optimize it for the platform?

What amount of bounty would make it interesting, probably split into sub-milestones?

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

my bad, fixed the link. it is mostly the linux-on-ibm-z code but with a few fixes on top to fight the bitrot so it will build on top of the current tree for further development.

How much effort

it is too early to know but the original code didn't support an FPU and will need to have put_label support added before it can work with current PCRE2 (including bad UTF8), so the sooner we can get it in a shape good enough for merging (even if it doesn't perform well) the better IMHO to make sure it will not bitrot further.

IBM can redirect the current bounty towards the basic enablement. We would appreciate guidance from the PCRE2 community on a plan and a sizing for the various steps to achieve basic enablement suitable for merger and then further optimization to achieve optimal performance.

@edelsohn: is instruction cache coherency an issue with Z? the original code went to great lengths to avoid clearing the instruction cache when updating code is required (ex: when jump addresses or constants were updated, and now when put_labels are updated), and even though it uses a somewhat independent pool to keep those values, it is still in the same memory segment as the rest of the code and therefore likely to be considered for caching with the instructions AFAIK.

on the positive side, while it doesn't yet fully work (because of put_labels), at least it doesn't segfault.

Thanks for the progress.

IBMz has a fairly strong memory consistency model, like x86. Are you experiencing unusual behavior or just checking? I'm not aware of other self-modifying IBMz code explicitly flushing / clearing caches.
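For readers less familiar with the concern, here is a minimal C sketch of the portable pattern being discussed: patch a constant inside a code buffer, then ask the compiler runtime to synchronize the instruction cache. The helper name `patch_const64` is hypothetical and is not taken from sljit; `__builtin___clear_cache` is the GCC/Clang builtin, which has no effect on targets that do not need an explicit flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not sljit's API): overwrite a 64-bit constant
 * stored inside a JIT code buffer, then request instruction cache
 * synchronization for the patched range. */
static void patch_const64(uint8_t *code, size_t offset, uint64_t value)
{
    assert(code != NULL);
    /* memcpy avoids alignment assumptions about the patch location. */
    memcpy(code + offset, &value, sizeof value);
    /* GCC/Clang builtin; has no effect where no flush is required. */
    __builtin___clear_cache((char *)(code + offset),
                            (char *)(code + offset + sizeof value));
}
```

Keeping the call unconditional in a sketch like this costs nothing on a platform with a strong consistency model and keeps the patching path identical across architectures.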

just checking, and hoping to have a (likely incomplete) version that could pass all tests by end of the week.

building and running all tests successfully in :

https://travis-ci.com/github/carenas/sljit/builds/182588796

as well as when applied to a TRUNK version of PCRE as shown by (running in the IBM cloud) :

make  check-TESTS
make[2]: Entering directory '/home/linux1/src/pcre'
make[3]: Entering directory '/home/linux1/src/pcre'
PASS: pcre2_jit_test
PASS: RunTest
PASS: RunGrepTest
============================================================================
Testsuite summary for PCRE2 10.36-RC1
============================================================================
# TOTAL: 3
# PASS:  3

note that it is not enabled by default and it is mainly available to coordinate further development, as it is missing functionality which would trigger abort(), has no FPU (or SIMD) support, and has not been optimized or even profiled so it might be even slower than the interpreter.

Awesome progress, @carenas ! Let us know when you and the community have analyzed the situation and have sizings for the next steps.

Let us know when you and the community have analyzed the situation and have sizings for the next steps.

@zherczeg: would the following make sense for sizing and next steps?

  1. get s390x branch developer ready so it is maintained in-tree to avoid further bitrot (ETA end of the week, will need review and while I agree it has some rough edges should be safe as it is disabled from autodetection and I am hoping to keep further development rebased and under CI), to hopefully aid testing and even contributions from the wider community (still hoping those IBM insiders bring back some of their very useful architecture knowledge, this way).
  2. get a newer version that completes the implementation needed to fully support s390x (FPU and vector operations, mainly what I expect will be needed, with a target CPU of Z13), this I am expecting to be about 40 man hours, and will also include a repository with patches to PCRE to be kept under CI
  3. tie up loose ends (most of the abort() calls should be gone by now, but there are likely still some important TODOs and other work needed for performance, as well as support for anything that was not on the critical path, like a fallback for non-FPU or older Z, with a target to hopefully make it run in QEMU). this is likely where most of the optimization and performance work will be done, with a suggested option to include this code (still disabled for auto) in the next PCRE release to broaden the user base, recruit contributors and gather information to help guide future development, and hopefully enable it in auto in PCRE + 1.

note that the original fork was maintained for almost 2 years but it might still be only 60% done and there is also the need to learn the architecture (which by being proprietary makes things slightly more difficult), but I am optimistic it could be done thanks to the cloud availability and a good starting base, which is why I'd been focused last week on making sure all previous investment is not wasted.

I would like to avoid what happened with the TileGX port. It has never been completed and has no maintainers, and thus it would be best to remove it. As for a new port, I would prefer to have a cross compiler, qemu, and gdb tools which I can use to build and test the new port, and also some cpu / abi documentation since I need to learn it at some level. I am curious what "proprietary" exactly means here. No public documentation available? No free tools available?

Yes, s390x is proprietary, compared with, say, RISC-V, but not compared with Intel x86_64. Proprietary doesn't mean secret. GCC, LLVM, PyPy (including NumPyPy), node.js, OpenJDK, and Mono all are ported to s390x. There is plenty of documentation.

z/Architecture ISA

Linux on z ABI

QEMU for s390x

You can request access to a system (VM/VPS) through the LinuxONE Community cloud that I mentioned in the first message.

The LinuxONE Community Cloud also hosts Travis-CI instances, as Carlo used.

IBM s390x experts are available to answer questions.

40 hours seems a reasonable estimate. We can continue to discuss how to set up incremental milestones and associated bounties.

I would prefer to have a cross compiler, qemu, gdb tools which I can use to build and test the new port and also some cpu / abi documentation since I need to learn it at some level.

both gcc and clang can crosscompile code for s390x (and indeed I have both set up in the CI in 2 versions of Ubuntu), and I was originally using the standard gcc crosscompiler package from Ubuntu 20.04 to build the code and develop it further, so unlike TileGX this is fairly more open (with CI / cloud VMs available and several linux distributions to use in real hardware)

qemu-user targets s390x and does work, but it fails tests either because the instructions we are using are not correctly emulated in the default CPU it uses, or because the original code targets z12 and the implementation differs enough. running with an emulated z12 or z13 might help but using a real VM is easier.

which is why I mention CI and was hoping in the third step to broaden cpu support, which will also help real users with older hardware, even if initially the target will be z13 (which includes vector support).

in that context my "proprietary" label meant we will need to rely for now on the LinuxONE cloud for development, but unlike TileGX it is likely to get better later.

I have checked the documentation. At first glance the system design matches well to sljit: it uses two's complement number format, the ABI is similar to PowerPC, and it has condition codes, wide multiply, and IEEE single / double precision floating point. It also has an incredible number of instruction forms, even more than ARM-T2, probably because it is an old architecture. I have one more question: do we need to support EBCDIC? PCRE2 has many ASCII / UTF related optimizations which will not work with it.

Linux on IBMz is ASCII and we only are asking to port PCRE2 JIT for Linux, not to z/OS in EBCDIC.

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

is z14 what is being used by the Linux ONE cloud provided vm? the travis containers are running older than usual Ubuntu versions as well so I was suspecting they might be constrained by the CPU version there as well.

note the main concern here is on being able to maintain the code base moving forward, which is why QEMU was ideal (since it can run locally on each developer workstation, and we have a lot of different OS to support there, including one folk with z/OS that is likely to benefit from this port if done in a compatible way). Of course CI and remote access to a VM in native hardware is a good enough substitute but it is less scalable, hence why I was hoping it would only be needed through the original bootstrap.

I believe that the LinuxONE systems (hosted at Marist College) now are z15. Mainframe processors are not available in laptops, sorry. I understand the appeal, but I would recommend relying on the remote access over QEMU.

I am not certain what OS levels are available through Travis CI on s390x.

looked at the qemu side, and definitely we are unlikely to get that going regardless of how much we tweak the code generator CPU support; FWIW even qemu system (running fedora rawhide for s390x) will segfault with "interruption code 0010 ilc:3" :

[Image: Screen Shot 2020-09-04 at 7 32 12 PM]

I think this issue should not be closed

How is the enablement work proceeding?

How is the enablement work proceeding?

slower than planned, mainly due to my poor planning, but on the right track after the first phase was completed (with a lot more changes and still a few more reservations than were originally expected).

I have started to do some fixes but I need to learn a lot more before I can do them in a way I want and my time on voluntary work is quite limited.

@carenas Any updates about this project?

phase 1 is included in the RC1 for PCRE 10.36, which means that with the right setup we could now get more people working in parallel to cleanup and complete the implementation to be ready for end users.

after merging phase 1 it was clear that my original plan was going to incur too much tech debt and so it is reasonable to expect that phase 2 (getting vector operation support) and phase 3 (cleaning up old tech debt) will benefit from wider distribution and therefore more hands than what I was originally expecting to have (only me)

this obviously doesn't qualify for the bounty terms (which in all fairness I have to admit I was uncomfortable with, after pulling so much volunteer work reviewing phase 1), but I am still committed to getting this out (hopefully with a little more help, and even if that means I will have to "subcontract" that work).

apologies for the delays, but I am hoping that in the end the PCRE release that enables JIT on s390x for end users will be of better quality.

it is important to note that most of the work I'd been doing (in PCRE, not in sljit) was to add FPU support for phase 2 in the same way it is done for the other architectures, but my concern is that it might be too hacky and therefore increase tech debt unnecessarily even if it is possible (I was hoping to get mostly vector instructions in, while leaving everything else untouched as was planned originally for phase 1). I am also concerned that with adding a third implementation it might make sense to refactor the other 2, moving code from PCRE into sljit as well, which will obviously be a bigger undertaking than planned.

@carenas Let's figure out how to solve this. There are multiple options:

  1. IBM can increase the bounty, within reason, if the project is larger than originally estimated.
  2. Bounties can be split among multiple people in any proportion.
  3. IBM can re-arrange the bounties into multiple parts associated with incremental milestones.

I'm a little concerned with the architecture redesign because IBM would like to have some PCRE JIT available on IBMz sooner rather than later. We can collaborate on a redesign as a second phase.

@zherczeg for fairness sake could we figure out what would be required for you to implement phase2 (without doing refactoring, and since it will be easier to follow your current pattern) and how big do we need phase3 to be to make sure everything that is currently in TODO/FIXME gets fixed within reason?

it is obviously too late for 10.36, but I could maintain a semi stable experimental patchset on top of it to aid anyone interested in backporting/testing this feature into that release, until we hopefully can get everything cleaned up and released for user consumption with 10.37.

If I understand correctly phase 2 means getting vector operation support. In PCRE2, there are SIMD functions, which can search characters or character pairs. Each of them has a corresponding HAS macro, so you don't need to implement all, only those you want. It is true that SIMD registers often use the "same" registers as the FPU, but these two things have nothing to do with each other. On 32 bit ARM for example, it is a bad practice if fpu and simd instructions affect the same registers, because internally they are different registers, and the CPU has to copy things. The reason why I never attempted to do a vector instruction set in sljit is that these instruction sets have many specialized instructions, and their approach (concept) for handling things is surprisingly different. Hence each SIMD accelerated function, while doing the same thing from the perspective of PCRE2, may work quite differently.
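The opt-in pattern described above can be sketched in plain C. This is only an illustration under stated assumptions: the macro `HAS_FAST_FORWARD_CHAR_SIMD` and both function names are hypothetical stand-ins, and real PCRE2 JIT code emits machine code through sljit rather than calling C functions like these.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature macro: a port would define it only if it
 * provides a SIMD version of the character search below. */
/* #define HAS_FAST_FORWARD_CHAR_SIMD 1 */

#ifdef HAS_FAST_FORWARD_CHAR_SIMD
/* Would be supplied by the architecture-specific file (not shown). */
const unsigned char *fast_forward_char_simd(const unsigned char *s,
                                            const unsigned char *end,
                                            unsigned char c);
#endif

/* Return a pointer to the first occurrence of c in [s, end),
 * or end if c does not occur. */
static const unsigned char *fast_forward_char(const unsigned char *s,
                                              const unsigned char *end,
                                              unsigned char c)
{
    assert(s != NULL && s <= end);
#ifdef HAS_FAST_FORWARD_CHAR_SIMD
    return fast_forward_char_simd(s, end, c);
#else
    /* Generic scalar fallback used when the port has no SIMD version. */
    while (s < end && *s != c)
        s++;
    return s;
#endif
}
```

With the macro left undefined, a new port still works through the scalar path, and each accelerated search can be added later one at a time.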

I would like to know something: how much work from my side is needed for this project? If the effort is bigger, I would like to have a formal contract, preferably with the University where I am working.

IBM would like PCRE on Linux on Z to achieve feature and optimization parity with other architectures, such as x86, ARM and Power. I'm confused if the FPU and SIMD vector support are enabled on other architectures or this is new functionality. Or you want to engineer it in a different manner on Z. Or you want to use this as an opportunity to redesign the support.

Also, I would prefer to avoid a university contract because that introduces a huge amount of bureaucracy and delays. I have been able to flexibly work with many other Open Source projects through bounties, from LLVM to PyPy to OpenBLAS to OpenCV to VLC to Sleef with developers in a variety of continents. I hope that we can make progress without undue complexity.

SIMD is currently used on x86 and aarch64. FPU is supported as an sljit feature, but PCRE2 does not use it.

IBM would like sljit for PCRE2 to be enabled and optimized on Linux on Z, equivalent to x86 and AArch64. And, ideally, SIMD optimization, equivalent to x86 and AArch64.

I am unclear if @carenas is suggesting a re-engineering of sljit as part of the implementation. Why can't sljit for Linux on Z be implemented in a manner equivalent to x86 and AArch64? I thought that was the proposal.

I would love to suggest that we just jump into a meeting, to get all our ideas clarified in a more effective way. I am available anytime you need me to, and could set up a google meeting if given appropriate ids

@carenas Who are you proposing for the meeting? You and I? Or @zherczeg as well? I'm available Friday after 11:00 ET, but that may be too late for you and Zoltan.

perfect for me (I am PST), but might be too late for Zoltan who is AFAIK somewhere in Europe.
@zherczeg do you have a suggestion, should we include Philip for the PCRE part?

@edelsohn I know you are very busy but if that time doesn't work for Zoltan I am also available for a 1-on-1 which I think it is long overdue anyway at the time of your convenience and based on the availability below.

hopefully it wouldn't take more than 30min; to ease coordination of the time I have set up the following:
http://whenisgood.net/bka3b5y

I don't think Philip is needed unless you want to touch code outside of jit. I am in CET time zone, and a call starting after 9.30pm on Friday (which seems 3.30pm in ET and 12.30pm in PST) could work for me. I would prefer a service which works in a browser under Linux, and no registration is needed to join. But you can have a call without me of course.

I am in CET time zone, and a call starting after 9.30pm on Friday (which seems 3.30pm in ET and 12.30pm in PST) could work for me.

sadly I already have a conflict that I won't be able to reschedule around that time (which is why my proposed times in the "whenisgood" link above started at 3PM PST on Friday, which I now realize was pushing you into Saturday because of the time differences).

my earlier available hours don't work with @edelsohn's constraint of "after 11AM his time", but I could do anytime before 11AM PST in case that could be resolved (even though I think we are too close for comfort and might be better served with a later schedule)

I have plenty of time during the weekend though, or we could push it a few more days until beginning of next week, which will allow us also some more time to collaboratively come out with some minutes to make this more effective.

I would prefer a service which works in a browser under Linux, and no registration is needed to join.

@zherczeg: could you host such a service?, all the ones I can think of might require some sort of user account or a proprietary solution (ex: slack), if using slack with google accounts is good enough I can provide a slack channel, which could be used also long term to make sure there are no more misunderstandings:
https://join.slack.com/share/zt-jfs6uef4-Ojeu02hll4EL0dXFhULgmA

But you can have a call without me of course

my one-on-one with David would be mostly to make sure that all the misunderstandings between the two of us are resolved earlier and in preparation with talking with you, the same also applies if you would like to have a one-on-one at a time when David might not be available.

I am afraid though that without your participation there is no way to solve the current impasse I might have gotten us into, and for that I apologize.

@carenas

You wrote:

I am available anytime you need me to,

I wrote:

I'm available Friday after 11:00 ET, but that may be too late for you and Zoltan.

I didn't write every day after 11:00 ET. You said any time and I proposed the first time available. Apparently you are not available any time. Please be precise in what you write.

I am available other days after 8:00 AM ET, but you did not provide those times in whenisgood.

I also do not understand what is so complicated about the proposed project that we need to talk in person. IBM wants sljit to function in PCRE2 on Linux on Z with equivalent functionality to x86 and AArch64 - at least integer and, ideally, SIMD. Presumably the Z support can use the same design and infrastructure as existing architectures.

It seems the main question is quality. For example:

  • The code currently uses 3 temporary registers, while 2 would be enough (one more for the application).
  • Currently the code saves all registers on function entry (it should only save the used registers).
  • The arch is very old, and it has a large number of instruction forms. A good code generator uses as many as possible, and does not emit 2-3 instructions when one is enough.
  • The implementation should look more like the implementation of other CPUs.

These are all quality questions. The code can work without them, and you may consider it as "equivalent functionality". Probably these can be discussed without a call, but it would be good to know the quality targets for IBM.

I read the terms of use of the bounty provider, and it seems it does not handle taxes. This looks like a big difficulty for me.

I think I can set up a video meeting with the required constraints using Jitsi Meet (video encouraged but not required)

https://meet.jit.si/pcre2-linux-s390x

could you both be available for 1h (hopefully will take less) around Mon Nov 16th noon ET (AKA EST/GMT-5/UTC-5)? whenisgood UX might be a little confusing so I will avoid it this time, but I am available for 2h around that time and hopefully it fits everyone's constraints or can be adjusted easily.

From my own experience (and I understand the frustrations) getting "together" to meet and understand each other's motivations could go a long way towards resolving difficult issues and finding a common ground that benefits us all with compassion.

Agenda would be (open to further adjustments and not to be followed too strictly) :

  • short introduction of each participant (indicating background, reason for involvement, constraints and personal objectives)
  • presentation of current state (as seen by the organizer) followed with short discussion to correct perceptions from all other participants
  • presentation of business requirements by sponsor (based on what was written in the "spec" above) with an emphasis on market fit. It should be at this point that we should have a common understanding of how well the currently merged implementation that will be released with PCRE 10.36 will fill the ask, as well as if the original plan is still valid, but nonetheless we should come out with a plan agreed by all participants of how we will proceed from here, including clear MUST and SHOULD objectives and a rough timeline for implementing it.
  • brainstorming to evaluate tasks that could be taken to solve all remaining blockers and assign action items to participants (including ETA).

Discussion will be done in English and I would share minutes afterwards for revision (within participants, which might require a google account to allow for collaborative editing) and, once agreed to be accurate, publish them here (in text) for the rest of the community.

I can meet on Monday, Nov 16, at 12n EST.

That is 6pm for me, I can join for half an hour.

That is 6pm for me, I can join for half an hour.

if we move it earlier or later could we get a full 1h? (my hope is we won't really need a full hour). otherwise I will adapt the agenda for a shorter timeframe, but I am hard to understand when I speak too fast

thanks both for your help and understanding, below an alternative bluejeans which might be more robust as a fallback:

Meeting URL
https://bluejeans.com/313712622?src=join_info

Meeting ID
313 712 622

Want to dial in from a phone?

Dial one of the following numbers:
+1.408.419.1715 (United States(San Jose))
+1.408.915.6290 (United States(San Jose))
(see all numbers - https://www.bluejeans.com/numbers)

Enter the meeting ID and passcode followed by #

Connecting from a room system?
Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode

and that we could use as a backup if we have logistical problems with Jitsi and to avoid wasting precious time

I never saw any Jitsi invitation. I can join via Bluejeans.

I never saw any Jitsi invitation. I can join via Bluejeans.

Jitsi doesn't do invitations AFAIK, but I was going to create the meeting and share the link at that time;
agree Bluejeans is a nicer solution though and can manage the calendar but I am not sure if it will work for Zoltan (hence why I would like to keep it as a backup until we confirm otherwise)

sent you an invite to the email associated with your github so you can manage your calendar, anyway

Maybe I can be there a bit longer. Jitsi seems like a good solution.

And share the link in Github comments? This really isn't the right medium for an interactive conversation to schedule a meeting, nor to share links to a meeting.

I have tried to do some minor improvements in the code, so I tried to log into the virtual machine I created. However I got an "An unknown error has occurred,Please try again later." After a few days I still got the same error, so I decided to delete the vm and create a new one (the vm quota is 1). However, when I try to create or upload an ssh key for the new vm I get the same error. Since this is the only service where we can test the code, we probably need to wait until they fix it.

There is no known, system-wide problem with the VMs at Marist. Have you reported the problem through the support system?

Is there an easy way to insert a breakpoint instruction on s390? I tried svc and trap but no success so far. I could probably call a function as the worst case, but that is not the nicest solution.

I'm not a s390x expert. The GDB breakpoint instructions seem to be bytes 0x0,0x1 . I don't know to what instruction that corresponds. Have you tried examining a s390x breakpoint in GDB to see what instruction it inserts?

It looks like gdb cannot decode it as an instruction with disassemble, but it stops at least, and can continue the execution, so it is good in practice. Thank you for the help.
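As a rough illustration of the workaround, the two bytes mentioned above (0x00, 0x01) could be written into a code buffer at the point of interest. The helper below is a hypothetical sketch, not sljit's instruction emitter, and it only assumes what the thread reports: that execution stops under GDB when these bytes are reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: append the two bytes GDB uses as its s390x
 * breakpoint (0x00, 0x01) to a code buffer. Returns the new write
 * offset, or 0 if the buffer has no room for them. */
static size_t emit_gdb_break(uint8_t *buf, size_t offset, size_t capacity)
{
    assert(buf != NULL);
    if (capacity < 2 || offset > capacity - 2)
        return 0;
    buf[offset] = 0x00;
    buf[offset + 1] = 0x01;
    return offset + 2;
}
```

Since a successful call always returns at least 2, a return value of 0 is unambiguous as the failure signal.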

I found another strange thing. The display/i $pc does not work in JIT code; gdb says PC not saved. The following session shows that it works in normal functions, but not in JIT code:

165       rc = convert_executable_func.call_executable_func(&arguments);
(gdb) display/i $pc
1: x/i $pc
=> 0x2aa00165b98 <pcre2_jit_match_8+672>:       lg      %r1,232(%r11)
(gdb) si
165       rc = convert_executable_func.call_executable_func(&arguments);
1: x/i $pc
=> 0x2aa00165b9e <pcre2_jit_match_8+678>:       aghik   %r2,%r11,264
(gdb) si
0x000002aa00165ba4      165       rc = convert_executable_func.call_executable_func(&arguments);
1: x/i $pc
=> 0x2aa00165ba4 <pcre2_jit_match_8+684>:       basr    %r14,%r1
(gdb) si
PC not saved
(gdb) si
PC not saved

I can type x/i $pc after si as a workaround, but it takes a bit more time. Is there a gdb command to fix this?

I have another question about vector registers. It seems the CPU has 32 vector registers: the first 16 are mapped to the fpu registers and the other 16 can be accessed with the RXB field of the instruction. However the ABI does not mention anything about the vector registers:

https://refspecs.linuxbase.org/ELF/zSeries/lzsabi0_zSeries.html#AEN413

The libc memchr routine uses vector registers above 16 without saving. Can I assume that registers < 16 have the same saving rules as the fpu registers and registers >= 16 are all volatile?
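To make the RXB mechanism concrete, here is a small C sketch of how the field could be computed. It rests on my reading of the z/Architecture Principles of Operation, which should be checked against the manual: each vector register number is 5 bits, the low 4 bits go into the instruction's register field, and the high bit of up to four operands goes into the 4-bit RXB field, with the first operand in the most significant RXB bit. The function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: build the 4-bit RXB field from up to four
 * vector register numbers (0..31). Unused operands are passed as 0.
 * Assumption: operand 1 maps to the most significant RXB bit. */
static unsigned rxb_for(unsigned v1, unsigned v2, unsigned v3, unsigned v4)
{
    assert(v1 < 32 && v2 < 32 && v3 < 32 && v4 < 32);
    /* (v >> 4) is 1 only for registers 16..31. */
    return ((v1 >> 4) << 3) | ((v2 >> 4) << 2) | ((v3 >> 4) << 1) | (v4 >> 4);
}
```

Under that assumption, an instruction whose first operand is v17 and whose second is v3 would get RXB = 8, while the register fields themselves hold 1 and 3.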

0x0,0x1 seems to be the byte sequence that GDB uses for a breakpoint, but I didn't suggest that it was a normal instruction. I suggested that you could examine a breakpoint set by GDB (which should be the same 0x0,0x1), not that GDB would provide a useful disassembly of the sequence.

I don't know about the x/i $pc error. Is it possible that the JIT code is not setting up the registers completely? The JIT code doesn't have to follow the normal calling convention and maybe it is avoiding some steps for speed, but would complicate debugging.

I'm not an expert on IBMz and don't know about the vector register ABI. I will try to find someone from IBM to join this issue and answer the questions.

(gdb) si
PC not saved

This is a bug. It looks like a known problem in GDB that may occur on s390x when the code being stepped has no binary associated with it.

Can I assume that registers < 16 have the same saving rules as the fpu registers and registers >= 16 are all volatile?

All vector registers are volatile, except for the bits that overlap with the nonvolatile FPRs.
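A code generator could encode this rule as a small lookup. The sketch below assumes, from my reading of the 64-bit Linux on z ELF ABI and not from anything stated in this thread, that f8 to f15 are the nonvolatile FPRs and that each FPR overlays the leftmost 8 bytes of the vector register with the same number. The function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: how many bytes of vector register vN survive
 * a function call. Assumption: f8..f15 are the nonvolatile FPRs and
 * each overlays the leftmost 8 bytes of v8..v15. */
static unsigned vreg_preserved_bytes(unsigned vreg)
{
    assert(vreg < 32);
    return (vreg >= 8 && vreg <= 15) ? 8 : 0;
}
```

In practice this means a JIT that keeps full 16-byte values live across a call has to save them itself, whichever vector register it picks.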

@zherczeg: to work around not being able to use gdb to disassemble the generated code I integrated sljit with capstone and added a feature to dump the assembler that you could use from my capstone branch (not yet ready to be merged though and a little rusty but stable enough to be useful IMHO)

[...] not being able to use gdb to disassemble the generated code [...]

Really? From reading the comments in this issue, I only understand that GDB doesn't disassemble the breakpoint instruction "0x00 0x01". Everything else should be disassembled as usual, right? And the breakpoint instruction shouldn't normally occur in any code. GDB does insert it when the user says "break ", but then GDB hides the breakpoint instruction in the disassembly and shows whatever was there before. So this should only be an issue if the breakpoint instruction was inserted in some other way, and I wonder why that would be necessary.

[...] not being able to use gdb to disassemble the generated code [...]

Really? From reading the comments in this issue, I only understand that GDB doesn't disassemble the
breakpoint instruction "0x00 0x01".

sorry for oversimplifying the problem, but you are correct: the problem is not in disassembling the code, but in disassembling the code and being able to step into it, which my code changes also don't address (as I found it more useful to avoid setting breakpoints in the generated code anyway)

my proposed changes add two options to the "runTest" that comes with sljit, to allow for dumping the generated code assembler (like gdb would do) and to jump into it to validate it does the right thing after a test was created for it. it is in the test where I add logic to validate that the generated code does what is expected, so then I have no need for either gdb or breakpoints.

Carlo originally was looking for some instruction to manually set a breakpoint. When his attempts failed, I suggested the BREAK instruction sequence (0x0,0x1) inserted by GDB.

I think some of the confusion stems from ambiguity about what Carlo really wants to accomplish. What does he want the instruction to do? He wants an instruction that generates some sort of a "trap" operation that non-destructively suspends the program? He wants an instruction that is visible to a debugger (as opposed to the magic 0x0,0x1 sequence that GDB knows to ignore)?

I believe that the solutions are operating correctly but not what Carlo expected and we need more clarification about what Carlo really is trying to accomplish.

Carlo originally was looking for some instruction to manually set a breakpoint. When his attempts failed,
I suggested the BREAK instruction sequence (0x0,0x1) inserted by GDB.

this was work Zoltan was doing though, but yes we both had the same need which I will try to explain below and that is not even specific to s390x but also why we were earlier mentioning QEMU support as something that would be nice to have (since with an emulator you have full control on how the CPU state changes from instruction to instruction)

sljit is creating machine code dynamically, and we write the generator based on our understanding of the documentation for each instruction.

more often than not (at least for me) I get the opcodes wrong, or misunderstand the side effects of the instructions and end up running something that could trap badly (if lucky)

at that point the only way to know what went wrong is to isolate the generated code and step through it, looking at the CPU and memory effects, and for that, adding a breakpoint at a specific memory address (when loaded into the PC) helps to get to the interesting part of the opcodes faster.

the alternative I was using (and proposed in my dev branch) sidesteps the issue by allowing me to programmatically tell the sljit program to print the generated assembler (which capstone allows, and which helps spot invalid opcodes or even invalid sequences without having to crash), and once I feel confident enough, execute the function instead and make sure its effects are what was expected (based on the documentation), which is what the test suite does for every single instruction on every supported cpu.

In this last step, I might add some asserts and iterate with some throwaway variations of the test case, to confirm all documented side effects are really observed (to guard against any documentation bug) and to make sure the generator fails safely if pushed hard against the specific instruction constraints (ex: when some possible combinations result in invalid instructions)

I believe that the solutions are operating correctly but not what Carlo expected and we need more
clarification about what Carlo really is trying to accomplish.

would let Zoltan explain here, but from his comments I suspect he is trying to add the missing FPU instructions that are needed to generate vector operations.

The 0x0, 0x1 works nicely from my side. If you run the code without gdb, it stops with illegal instruction, but it does not matter for me. In gdb, it stops after 0x0, 0x1, and I can disassemble the rest of the code (from $pc). And the vm is more responsive than it was last year, so I can do some meaningful development this week.

PCRE2 has 3 simd accelerated functions, and I have landed code for the simplest one:
https://lists.exim.org/lurker/message/20210106.075205.4d7fc4b1.hu.html

The good news is it seems s390x simd is quite good for text processing.

The bad news is that I had to add some TODOs because the s390x port emulates CPU status flags. I don't know why since the CPU has hardware support for flags. I also noticed that many instructions are generated for simple computations which needs to be improved as well.

The bad news is that I had to add some TODOs because the s390x port emulates CPU status flags. I don't know why since the CPU has hardware support for flags. I also noticed that many instructions are generated for simple computations which needs to be improved as well.

Right, I wondered about that, too, and asked the original author Mike Monday about this. He answered:

SLJIT supports two flag registers: one representing whether the result of
an operation is zero or not and the other representing a different
condition: for example, whether the left operand was less than the right
operand or the operation overflowed. The way SLJIT is architected we need
to represent both flag registers simultaneously which IBM Z can't natively
do since the ISA only has a single 2-bit condition code register (e.g. the
result of an add logical may be 0 and it may have overflowed - the
condition code can only tell you one of those facts despite both possibly
being true).

I don't understand that explanation, though, because z/Architecture instructions usually do indicate a zero result in the condition code (CC), if applicable. Maybe SLJIT has a special way of dealing with these flags that makes it difficult to map them to the CC, but I'd be very surprised if that couldn't be handled somehow. At the time when I asked Mike about this, he was already off the project, and he considered the port to be in a prototype state, so we didn't follow up on this issue.

I identified the following problems:

  • ADD / SUBTRACT : they can indicate that the result is == 0, < 0 or > 0 but only if no overflow occurs. Otherwise this information is not available. In other words only overflows can be detected reliably. However, LOGICAL variants can at least be used for detecting zero result (normal and logical addition is the same operation except setting flags). Subtracting something and checking for a zero result is a frequent operation.
  • Some instructions set value 0 when the result is zero, others set either 0 or 2. I don't see any other variants in Appendix C of the ISA document. Currently conditional opcodes are not aware which variant needs to be used. This is probably one reason for the emulation.
  • It seems to me that the COMPARE and ADD / SUBTRACT LOGICAL instructions use the flags in a different way. On other systems COMPARE and SUBTRACT do the same thing, except COMPARE does not store the result. By the way, SUBTRACT LOGICAL can tell if the result is == 0 (equal) or < 0 (borrow). Is it possible to tell if the result is > 0? The source registers cannot be swapped similar to ADD LOGICAL.
  • It is not possible to prevent setting the status flags by any instructions

Overall it seems to me that SUBTRACT instructions cannot be used to set status flags in general, since the COMPARE instruction produces different flags. ADD / SUBTRACT LOGICAL can be used if the zero result needs to be checked, but they might set two different values. Setting some combinations (e.g. overflow + zero) might need some emulation, but they are probably never needed in practice.
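To make the "two different values" point concrete, here is a minimal C model of the condition code of a 64-bit ADD LOGICAL, assuming the usual encoding (0 = zero without carry, 1 = non-zero without carry, 2 = zero with carry, 3 = non-zero with carry). The names `add_logical_cc` and `cc_means_zero` are made up for this sketch and are not part of sljit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the condition code set by a 64-bit ADD LOGICAL (assumed encoding):
 *   0 = result zero, no carry      1 = result not zero, no carry
 *   2 = result zero, carry         3 = result not zero, carry */
static int add_logical_cc(uint64_t a, uint64_t b, uint64_t *sum)
{
    assert(sum != NULL);
    uint64_t r = a + b;      /* unsigned wraparound is well defined in C */
    int carry = r < a;       /* carry out of the most significant bit */
    *sum = r;
    return (carry << 1) | (r != 0);
}

/* The zero result shows up as CC 0 or CC 2, depending on the carry. */
static int cc_means_zero(int cc)
{
    return cc == 0 || cc == 2;
}
```

For example, `add_logical_cc(UINT64_MAX, 1, &s)` yields CC 2 with `s == 0`, so a "result is zero" test has to accept both CC 0 and CC 2.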

As for the operation, a COMPARE is likely needed after an operation when non-zero status flags are requested. It cannot be done before the operation, since the operation may overwrite flags. Hence, if the result register is the same as one of the source registers, its original value needs to be preserved for the comparison. As for shifting, an or operation with the same register operands can set the zero result flag.

Whatever flag model we choose, we need to manually check / update every instruction which sets the condition code, one by one. There are a lot of them, so this looks like a lot of work.

  • ADD / SUBTRACT : they can indicate that the result is == 0, < 0 or > 0 but only if no overflow occurs. Otherwise this information is not available.

Right. However, in C semantics a signed integer overflow would be an undefined condition. I don't know how SLJIT expects those to be handled.

Also, if necessary, or until a better way is found, you can always add a "load and test" instruction such as LTR or LTGR to compare a signed integer against zero. (Source and target registers can be the same here.)

  • It seems to me that the COMPARE and ADD / SUBTRACT LOGICAL instructions use the flags in a different way. On other systems COMPARE and SUBTRACT do the same thing, except COMPARE does not store the result.

Well, COMPARE indicates the same flags as SUBTRACT, except when SUBTRACT would overflow. Basically, COMPARE sets the flags according to the full arithmetic result. Similarly, in C semantics two signed integers can always be compared, but not always be subtracted.
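A small C sketch of that difference, assuming the signed condition codes 0 = equal/zero, 1 = low/negative, 2 = high/positive, and 3 = overflow (SUBTRACT only). The function names are hypothetical; the overflow test is written so that the C code itself never performs an overflowing subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Signed COMPARE: 0 = equal, 1 = first operand low, 2 = first operand high. */
static int compare_cc(int64_t a, int64_t b)
{
    return a == b ? 0 : (a < b ? 1 : 2);
}

/* Signed SUBTRACT: 0 = zero, 1 = less than zero, 2 = greater than zero,
 * 3 = overflow. */
static int subtract_cc(int64_t a, int64_t b)
{
    /* Range check first, so that a - b below is always representable. */
    if ((b > 0 && a < INT64_MIN + b) || (b < 0 && a > INT64_MAX + b))
        return 3;
    int64_t r = a - b;
    return r == 0 ? 0 : (r < 0 ? 1 : 2);
}

static void demo(void)
{
    /* Without overflow both instructions report the same code. */
    assert(subtract_cc(3, 7) == compare_cc(3, 7));
    /* With overflow only COMPARE still reports the ordering. */
    assert(compare_cc(INT64_MIN, 1) == 1);
    assert(subtract_cc(INT64_MIN, 1) == 3);
}
```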

By the way, SUBTRACT LOGICAL can tell if the result is == 0 (equal) or < 0 (borrow). Is it possible to tell if the result is > 0?

All the "LOGICAL" instructions operate on unsigned integers, so a "LOGICAL" result can never be smaller than zero.

  • It is not possible to prevent setting the status flags by any instructions

Not in general, but there are cases where certain instructions can be used in place of others to prevent the condition code from being set. An example is "rotate then and/or/xor/insert selected bits", where there are variants like RISBGN.

Overall it seems to me that SUBTRACT instructions cannot be used to set status flags in general, since COMPARE instruction produces different flags.

Not sure. It really depends on how SLJIT expects signed integer overflows to be handled.

Let me explain how status flags work in sljit. Sljit only supports two status flag bits: one is "result is zero", the other depends on the instruction and is customizable. For example, it can be a signed greater flag, an unsigned less-than-or-equal flag, or carry for a subtraction operation. Setting a status flag must be explicitly requested, otherwise the flag is undefined. For example, if the carry flag is requested for an addition, and the next jump depends on the zero or signed less flag, the behavior is undefined (in debug mode this triggers an assertion in the compiler). So valid sljit code can only use the carry flag after the previously mentioned addition. Unless it is explicitly stated that an instruction does not modify flags (e.g. mov, call, jump instructions), the flags are undefined. Hence it is valid to do some mov operations before the carry flag is used, but not another addition.
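The rule described above can be sketched as a tiny validity tracker. This is only an illustration of the model as stated in this comment; all names (`flag_state`, `op_sets_flags`, and so on) are hypothetical and do not correspond to real sljit identifiers.

```c
#include <assert.h>

/* Hypothetical names for the "variable" flag kinds mentioned above. */
enum var_flag { VAR_NONE, VAR_CARRY, VAR_SIG_LESS, VAR_SIG_GREATER, VAR_LESS_EQUAL };

struct flag_state {
    int zero_valid;          /* zero flag was requested by the last flag-setting op */
    enum var_flag variable;  /* variable flag requested by that op, if any */
};

/* An arithmetic op: only the requested flags are defined afterwards. */
static void op_sets_flags(struct flag_state *s, int want_zero, enum var_flag want_var)
{
    s->zero_valid = want_zero;
    s->variable = want_var;
}

/* mov, call, jump: documented as not modifying flags, so the state is kept. */
static void op_keeps_flags(struct flag_state *s)
{
    (void)s;
}

static int may_use_zero(const struct flag_state *s)
{
    return s->zero_valid;
}

static int may_use_variable(const struct flag_state *s, enum var_flag f)
{
    assert(f != VAR_NONE);
    return s->variable == f;
}
```

With this model, an addition that requests only carry, followed by a mov, still allows a carry-dependent jump, while a jump on zero or signed less would be rejected.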

Sljit does not have an "undefined on overflow" behavior: if the zero flag is requested to be set, it must be set. Obviously emulation might be needed in some cases, but currently 2+ instructions are used whenever the zero flag is used, and that is a lot. The ideal is 0 extra instructions in the commonly used cases.

I may misunderstand something, but COMPARE LOGICAL sets 0 for equal, 1 for less, and 2 for greater, and SUBTRACT LOGICAL sets 1 for less, 2 for equal, and 3 for greater. Apart from less, these do not match. We can use the latter all the time, and discard the result when compare is needed, but compare might have instruction forms which are useful.

  • ADD / SUBTRACT : they can indicate that the result is == 0, < 0 or > 0 but only if no overflow occurs. Otherwise this information is not available.

Right. However, in C semantics a signed integer overflow would be an undefined condition. I don't know how SLJIT expects those to be handled.

sljit assumes that negative numbers are represented in Two's Complement format and therefore mostly ignores overflows (even the ones that could be considered as undefined per C)

in practical terms this is not a problem because all supported CPUs use that format, and most modern CPUs that might be added will also support that format, but it is for sure something to consider if we ever have to generate code for a CPU that uses ones' complement or something else.

Let me explain how status flags work in sljit. [...]

Thanks, that explanation helps.

Sljit does not have an "undefined on overflow" behavior: if zero flag is requested to be set, it must be set.

Well, the only case where the result of a signed addition can be zero and an overflow occurs is when calculating INT_MIN + INT_MIN. So SLJIT expects the zero flag to be set in this case, right? (I find it a bit unfortunate that we need to introduce special handling just to cover this case.)
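The claim is easy to check exhaustively at a smaller width. The sketch below uses 8-bit signed integers as a stand-in for the machine word and counts the operand pairs whose exact sum overflows while the wrapped result is zero. The function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts 8-bit signed pairs (a, b) whose exact sum overflows although the
 * wrapped 8-bit result is zero, and records the last such pair found. */
static int count_zero_overflow_pairs(int *last_a, int *last_b)
{
    assert(last_a != NULL && last_b != NULL);
    int count = 0;
    for (int a = INT8_MIN; a <= INT8_MAX; a++) {
        for (int b = INT8_MIN; b <= INT8_MAX; b++) {
            int exact = a + b;                  /* computed in int, cannot overflow */
            int overflow = exact < INT8_MIN || exact > INT8_MAX;
            uint8_t wrapped = (uint8_t)exact;   /* keep only the low 8 bits */
            if (overflow && wrapped == 0) {
                count++;
                *last_a = a;
                *last_b = b;
            }
        }
    }
    return count;
}
```

Exactly one pair qualifies, namely `INT8_MIN + INT8_MIN`, which is the 8-bit analogue of the INT_MIN + INT_MIN case.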

I may misunderstand something, but COMPARE LOGICAL sets 0 for equal, 1 for less, and 2 for greater, and SUBTRACT LOGICAL sets 1 for less, 2 for equal, and 3 for greater.

Your understanding is correct. COMPARE LOGICAL sets the CC as usual, while SUBTRACT LOGICAL is designed to prepare the borrow for a multi-precision integer subtraction.
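The two encodings quoted above can be put side by side in a short C model. The function names are hypothetical; the condition code values are the ones given in this exchange.

```c
#include <assert.h>
#include <stdint.h>

/* COMPARE LOGICAL: 0 = equal, 1 = first operand low, 2 = first operand high. */
static int compare_logical_cc(uint64_t a, uint64_t b)
{
    return a == b ? 0 : (a < b ? 1 : 2);
}

/* SUBTRACT LOGICAL: 1 = borrow (a < b), 2 = zero result, 3 = non-zero result
 * without borrow. CC 0 does not occur. */
static int subtract_logical_cc(uint64_t a, uint64_t b)
{
    if (a < b)
        return 1;
    return a == b ? 2 : 3;
}

static void demo(void)
{
    /* Only the "less" case produces the same code in both encodings. */
    assert(compare_logical_cc(3, 7) == subtract_logical_cc(3, 7));
    assert(compare_logical_cc(5, 5) != subtract_logical_cc(5, 5));
    assert(compare_logical_cc(7, 3) != subtract_logical_cc(7, 3));
}
```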

Perhaps we could add another s390-specific field to the sljit_compiler structure that indicates the "CC mode" of the last operation performed, such as "carry/borrow mode" versus "normal mode"? Then we could make the result of get_cc dependent on that. Just a thought.
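One way such a "CC mode" could work is sketched below: the mode remembered from the last flag-setting operation picks the branch mask for the "result is zero" condition. This is an illustration only. The enum, `zero_mask`, and `mask_selects` are invented for this sketch and are not existing sljit code; the mask bits follow the usual convention where 8, 4, 2, 1 select CC 0, 1, 2, 3.

```c
#include <assert.h>

/* Hypothetical "CC mode" recorded for the last flag-setting operation. */
enum cc_mode { CC_MODE_NORMAL, CC_MODE_ADD_LOGICAL, CC_MODE_SUB_LOGICAL };

/* Branch mask that means "result is zero" in each mode. */
static int zero_mask(enum cc_mode mode)
{
    switch (mode) {
    case CC_MODE_NORMAL:      return 8;      /* zero is CC 0 */
    case CC_MODE_ADD_LOGICAL: return 8 | 2;  /* zero is CC 0 or CC 2 */
    case CC_MODE_SUB_LOGICAL: return 2;      /* zero is CC 2 */
    }
    assert(0 && "unknown cc mode");
    return 0;
}

/* Returns non-zero if a branch with this mask is taken for the given CC. */
static int mask_selects(int mask, int cc)
{
    assert(cc >= 0 && cc <= 3);
    return (mask >> (3 - cc)) & 1;
}
```

The design choice here is to keep the emitted arithmetic instruction unchanged and move the mode-dependent part into the mask selection of the following conditional branch.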

Zoltan reported the following progress:

I had some free time last week and I have added SIMD support for s390x. The code has landed and the next step could be measuring the improvement, but virtual machines are not suitable for this. The SIMD support is currently enabled by default, although we might need to support a compile-time or runtime check for availability.

I have noticed that the code generator needs improvement. For example, adding 16 to a register (r3 = r4 + 16) is translated to four instructions:

lghi %r1,16
lgr %r0,%r3
algr %r0,%r1
lgr %r4,%r0

This is less efficient compared to other ports, which can do this with one machine instruction. Probably the code generator is less complex than in other ports, which try to exploit the strengths of their corresponding instruction sets. I suspect improving this is not a trivial task (maybe we need to completely redesign it), but it should be done, since this is a serious disadvantage for the s390x port.
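As a rough illustration of what better instruction selection could look like, the sketch below picks an add form from the size of the immediate. It assumes that a signed 16-bit immediate can be added directly (e.g. AGHI) and a signed 32-bit one as well (e.g. AGFI), and that anything larger goes through a scratch register as today. The enum and function are hypothetical, not the port's actual code, and a real selector would also have to consider whether the destination equals the source and which flags are requested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical selector outcome for "dst = src + imm". */
enum add_form {
    ADD_IMM16,       /* immediate fits in signed 16 bits (e.g. AGHI) */
    ADD_IMM32,       /* immediate fits in signed 32 bits (e.g. AGFI) */
    ADD_VIA_SCRATCH  /* load the constant into a scratch register first */
};

static enum add_form pick_add_form(int64_t imm)
{
    if (imm >= INT16_MIN && imm <= INT16_MAX)
        return ADD_IMM16;
    if (imm >= INT32_MIN && imm <= INT32_MAX)
        return ADD_IMM32;
    return ADD_VIA_SCRATCH;
}

static void demo(void)
{
    /* The "+ 16" case from the example above needs no scratch register. */
    assert(pick_add_form(16) == ADD_IMM16);
}
```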

@aarnez Can you help measure the performance when you have a moment? And maybe you have a suggestion for Zoltan's code generation observation.

#106 should be the next step in this work. The patch focuses only on flags; optimal instruction selection is still far away. I cannot measure the perf on a virtual machine.