capstone-engine/capstone

`auto-sync` progress tracker: Refactor and implement architectures

Rot127 opened this issue · 26 comments

Rot127 commented

Note to x86: x86 is not part of this list, because we can not generate all tables in C.
Refer to capstone-engine/llvm-capstone#13 for details.

Note about changes introduced with auto-sync:
For a preview of the changes coming in v6, please take a look at the WIP release guide.


This issue tracks the auto-sync refactoring and implementation effort of architecture modules.

The table below lists the responsible developers for each architecture.

In progress

| Arch | CS PR | llvm-capstone PR | Part of (planned) release | Assigned developer(s) | Based on LLVM repo |
|---|---|---|---|---|---|
| MIPS | None yet | None yet | v6 | @wargio | LLVM-project |
| SPARC | None yet | None yet | v6 | @DMaroo | LLVM-project |
| LoongArch | #2349 | capstone-engine/llvm-capstone#47 | v6 | @jiegec | LLVM-project |
| BPF | None yet | None yet | v6 | @0pendev | LLVM-project |
| SystemZ | None yet | None yet | v6 | @Rot127 | LLVM-project |
| Xtensa | None yet | None yet | v6 | @ashiskumarnaik | LLVM-project |
| ARC | None yet | None yet | v6 | @R33v0LT | LLVM-project |

.td edits upstreamed

Most LLVM td files are missing some information about instructions (memory reads/writes, operands incorrectly assigned as in/out etc.). Since we rely on this information, we need to fix it. Those fixes should be upstreamed to LLVM.

Done

| Arch | PR | Part of release | Assigned developer(s) | LLVM repo |
|---|---|---|---|---|
| Alpha | #2071 | v6 | @R33v0LT | LLVM-project (release v3.0) |
| AArch64 | #2026 | v6 | @Rot127 | LLVM-project |
| ARM | #1949 | v6 | @Rot127 | LLVM-project |
| PPC | #2013 | v6 | @Rot127 | LLVM-project |
| TriCore | #1973 | v5 | @imbillow | TriDis |
| HPPA | #2265 | v6 | @R33v0LT | Not Auto-sync based |

Arch extensions

Adding CPU extensions which are not part of upstream LLVM is easier now.
They are tracked here.

| Arch | Extension name | Issue | Previous attempt | Done |
|---|---|---|---|---|
| PPC | VLE | #2241 | https://lists.llvm.org/pipermail/llvm-dev/2014-July/074613.html | No |
| PPC | PS (Paired-Single) | None | https://reviews.llvm.org/D85137 | Yes |

Effort level of not refactored/implemented archs

| Arch | Number of operand groups | Generates | Note | Implementation type | Difficulty level |
|---|---|---|---|---|---|
| AVR | ~3 | Yes | None | New | Easy |
| CSKY | ~7 | Yes | None | New | Medium |
| DirectX | ~1 | Yes | Deviates from common design. | New | Medium-Hard |
| Hexagon | ~2 | No | Deviates from common design. | New | Hard |
| Lanai | ~10 | Yes | None | New | Easy |
| M68k | ~28 | Yes | None | Refactor | Medium |
| MSP430 | ~6 | Yes | None | New | Easy |
| RISCV | ~7 | No | Possibly new edge cases to implement | Refactor | Hard |
| SPIRV | ~9 | No | td files faulty | New | Medium |
| VE | ~8 | Yes | None | New | Medium |
| XCore | ~15 | No | td files faulty | Refactor | Medium |

  • Number of operand groups: Operand groups which have distinct print functions. Indicates the effort needed to implement the LLVM <-> CS mapping code (fill cs_detail and the like).
  • Generates: Whether the inc files generate with the most recent backends.
  • Note: Anything worth noting.
  • Implementation type: Refactor the current implementation or implement a new arch module.
  • Difficulty level: Guessed difficulty of this arch (based on the points above and on complexity, such as the number of instructions). Note that even "Easy" still means you have to familiarize yourself with how the LLVM definitions and the updater work. My guess is it will take at least a week of work.

Getting started

  • If you would like to refactor an architecture module or implement a new one, please comment here and we will add you. We can also point you to important information.
  • Please add a draft PR once you've done the first commit, so the progress is visible and there is a place for discussion.
  • Please refer to the auto-sync documentation to learn how to refactor or implement an architecture with auto-sync.

TODO for refactored archs

List of missing things which should be done before v6 to get a nice round package.

Capstone

  • Update docs with ASUpdater.py instructions.
  • #1984
  • Update all archs to LLVM 18
  • Remove tablegen files from suite.
  • Add CS assert version and add the asserts to the LLVM files again.
  • Wrap all possible code into CAPSTONE_DIET.
  • Run 0x0 to 0xffffffff as input once on ARM, PPC, AArch64 (with details enabled) to check for segfaults.
  • name2id docs. Parameter max should be changed to table size and in the loop be max - 1
  • Consider having alias details and real details live side by side, so users do not need to decide on one (how would this play together with CAPSTONE_DIET?).
  • Possibly #2152
  • #2196
  • Expose PPC instruction formats on the public interface
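
The exhaustive run over all 32-bit encodings mentioned in the list above could be driven by a loop like the following minimal sketch. The function name `sweep_encodings` and its parameters are hypothetical and not part of Capstone. The commented usage assumes the Capstone Python bindings (`Cs`, `CS_ARCH_ARM`, `CS_MODE_ARM`, `detail`, `disasm`). A segfault in the library would still terminate the Python process, so in practice the range would likely be split into chunks that run in separate processes.

```python
import struct
from typing import Callable


def sweep_encodings(disasm: Callable[[bytes], object],
                    start: int = 0,
                    stop: int = 1 << 32,
                    little_endian: bool = True) -> int:
    """Feed every 32-bit word in [start, stop) to `disasm` and return the count.

    `disasm` is any callable taking the 4 encoded bytes (hypothetical hook,
    e.g. a wrapper around a disassembler handle with details enabled).
    """
    fmt = "<I" if little_endian else ">I"
    count = 0
    for word in range(start, stop):
        disasm(struct.pack(fmt, word))
        count += 1
    return count


# Possible usage with the Capstone Python bindings (assumed to be installed):
#
#   from capstone import Cs, CS_ARCH_ARM, CS_MODE_ARM
#   md = Cs(CS_ARCH_ARM, CS_MODE_ARM)
#   md.detail = True
#   sweep_encodings(lambda code: list(md.disasm(code, 0)))
```

Iterating over the full 2^32 range in pure Python is slow, so a C driver or a chunked multi-process run is probably the more realistic choice; the sketch only shows the shape of the loop.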

LLVM revisions

Auto-Sync

  • add refactor setting to auto-sync updater.
  • Add auto-sync unit tests
  • Translate template functions as functions, not as macros.

Backends

  • Generate decoding/printing macros as functions, if there is only a single version (allows proper debugging, which would be a blessing).

ARM

  • Add general alias and alias operand handling.
  • Add vector layout information
  • Set post_index when base register is tied. Just to make sure to hit every case.
  • Encoding info
  • Move data type insn mapping to own auto-sync class.
  • #2193

PPC

  • Encoding info

AArch64

  • Encoding info

XVilka commented

@kabeor @aquynh @Rot127 I propose making the next release, with the auto-sync changes, a 6.0 rather than a 5.1, because:

  • There are slight API changes
  • The amount of code changes is HUGE

https://github.com/orgs/capstone-engine/projects/1 - then it would need to be updated too.

aquynh commented

please can you summarize the API changes here?

Rot127 commented

@aquynh

ARM

  • Enum changes:
    • ARM_CC_ -> ARMCC
    • System registers are renamed to match C++ namespaces. Also group Banked and system registers into different groups.
    • Some instr. enum entries no longer exist (e.g. VPUSH, VPOP).
  • Some instruction groups which are not part of LLVM were removed (e.g. GROUP_INT)
    • Groups like RET, INT should be added via Mapper separately.
  • Feature groups like ARM_GRP_CRC are renamed to match LLVM naming: ARM_FEATURE_HasCRC
  • Features are now checked more strictly (V8, MCLASS, ARM, THUMB) because instruction aliases are supported now, and those aliases might change depending on the enabled features.
  • The memory offset register or immediate is now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the [] brackets were added.
  • writeback is part of detail and no longer of detail.arm.
  • Register aliases not defined in LLVM (r15 = pc etc.) are no longer printed by default. They must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS or -a for the cstool.
  • The immediate value of operands is now of type uint32_t, no longer int32_t.

PPC

  • Predicate enums members are renamed. They now use the LLVM name (e.g. PPC_BC_NU_PLUS -> PPC_PRED_NU_PLUS).
  • Branch conditions are now saved in more detail in cs_ppc.bc.
  • The base register of a PPC memory operand was not present if reg = r0. This is fixed now.
  • ppc_ops_crx is removed (wasn't used).

AArch64

  • Renamed all ARM64 -> AArch64 (filenames, enums, variable names). Necessary to be consistent with LLVM.
  • SME operands changed (contain more detail, terminology is closer to the docs).
  • System operands changed (now categorized into SysAlias, SysImm, SysReg).

This list is also part of the PR.

aquynh commented

cant we avoid breaking compatibility?

Rot127 commented

@aquynh The short answer is no.

But let me go into more details also for others:

The problem with automatic Capstone updates is that due to the C++ and C difference we have many cases to handle when C is not equivalent to C++.

To reduce those cases we need to stay as close to the original C++ syntax and semantics as possible, because every renaming (e.g. of enum values) and every semantic or overall design change almost always adds manual work during an update.

This is why those breaking changes are needed.
Each of them moves Capstone code semantically or syntactically closer to the LLVM definitions.

This is of course a pain for compatibility, but it is definitely worth it in the long run.
Because:

  • All auto-sync archs are semantically pretty much equivalent to LLVM.
    • Which gives more correct results.
    • Results are comparable to llvm-objdump.
    • Eases test generation.
    • Easier to extend CS in the future with other information known in LLVM (see #2045 which adds instruction encodings for all auto-sync archs without much trouble).
  • Unifies how modules work, so we can share code between them (see the new Mapping.* files).
  • Reduces the effort to update (less manual work, due to fewer edge cases). So hopefully more people update their archs.

Here is more detail on each breaking change.

Enum changes:

Done to match LLVM naming. It saves us from having to change enum names across several files whenever we update.

ARM

Some instruction groups which are not part of LLVM were removed (e.g. GROUP_INT)

Capstone-specific instruction groups (like RET, INT) will be added separately via the Mapper (this is on the todo list). Because they are not defined in LLVM, we cannot generate them without adding exceptions again.

Features are now checked more strictly (V8, MCLASS, ARM, THUMB) because instruction aliases are supported now, and those aliases might change depending on the enabled features.

A simple necessity: with the new instructions, whether the same bytes have a valid decoding depends on the enabled features.

The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the [] brackets were added.

This moves closer to the LLVM logic. The displacement of a memory access doesn't need to be within the [] brackets (e.g. strt fp, [sp], 4), but it is defined as part of the memory operand. This was incorrectly represented in CS before.
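
To illustrate the difference for strt fp, [sp], 4, here is a small sketch of the two operand layouts. The classes `Reg`, `Imm`, `Mem` and the helper `fold_trailing_offset` are hypothetical stand-ins, not Capstone types; the only assumption taken from the description above is that the old layout kept an offset outside the brackets as a separate operand, while the new layout stores it inside the memory operand.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Reg:
    name: str


@dataclass
class Imm:
    value: int


@dataclass
class Mem:
    base: str
    index: Optional[str] = None
    disp: int = 0


Operand = Union[Reg, Imm, Mem]


def fold_trailing_offset(ops: List[Operand]) -> List[Operand]:
    """Merge an offset operand that directly follows a memory operand into it.

    Illustration only: an immediate becomes the displacement, a register
    becomes the index (if the memory operand has no index yet).
    """
    out: List[Operand] = []
    for op in ops:
        prev = out[-1] if out else None
        if isinstance(prev, Mem) and isinstance(op, Imm):
            out[-1] = Mem(prev.base, prev.index, prev.disp + op.value)
        elif isinstance(prev, Mem) and isinstance(op, Reg) and prev.index is None:
            out[-1] = Mem(prev.base, op.name, prev.disp)
        else:
            out.append(op)
    return out


# Old-style layout: the post-indexed offset is a separate operand.
old_layout = [Reg("fp"), Mem("sp"), Imm(4)]
# New-style layout: the offset is part of the memory operand.
new_layout = fold_trailing_offset(old_layout)
```

Code that consumed the old layout would therefore need to read the offset from the memory operand instead of looking for an extra trailing operand.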

writeback is part of detail and no longer of detail.arm.

We now support the concept of tied operands (the way LLVM describes writeback registers). So writeback information is now known for all auto-sync archs.

Register aliases not defined in LLVM (r15 = pc etc.) are no longer printed by default. They must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS or -a for the cstool.

As said before, the output of the modules will match the llvm-objdump results more closely, including the register naming and the decoded asm string.

The immediate value of operands is now of type uint32_t, no longer int32_t.

See: #2056

PPC

Branch conditions are now saved in more detail in cs_ppc.bc.

Just a nice feature we are now able to provide.

The base register of a PPC memory operand was not present if reg = r0. This is fixed now.

A semantic fix. The base register should have been set.

ppc_ops_crx is removed (wasn't used).

Wasn't used.

AArch64

Renamed all ARM64 -> AArch64 (filenames, enums, variable names). Necessary to be consistent with LLVM.

This is a big one. But having two names for the same architecture in the code is a nightmare for generation. Also it just doesn't bring any value. Being closer to LLVM is the choice here.

SME operands changed (contain more detail, terminology is closer to the docs).

Again a nice feature addition because we save more detail. Being closer to the official docs when it comes to naming eases integrating Capstone in other projects.

System operands change (now categorized into SysAlias, SysImm, SysReg).

Again, a move to LLVM semantics because:

  • It wasn't correct before (system immediate and other alias were incorrectly identified as system registers or not categorized at all).
  • This mimics the inheritance of system operands within the LLVM code.
  • Eases generation.

Personally, I think that Capstone will become more and more irrelevant as a disassembler engine if we do not:

  • Modernize it (update archs, update testing)
  • Provide people a relatively easy way to add more features and architectures (e.g. the instruction encoding, instruction form information (for PPC) and others).

If we do not go through the pain of breaking compatibility once, to gain long-term improvements, Capstone just won't be competitive with other disassemblers in the future.

aquynh commented

lets take one example: we want to rename ARM_CC to ARMCC.

can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?

Rot127 commented

I will answer your suggestions later; it is difficult to write well on the phone. As a general note though: I second @XVilka here. I have been working on this for half a year now and opened the ARM PR as a draft after two to three months, exactly to ask for this kind of feedback and for suggestions on design and other choices. This is also why I added the list of breaking changes to the PR description, updated it continuously and asked for feedback on the big ones, so that no one needed to read through the code and everyone could save time. I also think I made clear that I am more than happy to provide detailed answers if requested.

As I stated in the ARM PR, the time I can spend on this is limited (until the end of July), and we want and need to start building on it in Rizin. With all due respect, there were at least four and a half months to discuss a big decision like this, and we really need to carry on.

On 26 Jun 2023 17:23:51, Nguyen Anh Quynh wrote:
> I never say it is not merged, Anton.
>
> On Mon, Jun 26, 2023, 23:12 Anton Kochkov wrote:
> > If auto-sync work is not merged, I am afraid we have to fork the capstone. It's your choice - you want updated architectures or not.

aquynh commented

understood - we are all short of time, especially those who are maintaining this project without being paid, in spare time.

we will try to merge this in July.

Rot127 commented

@aquynh I agree that the compatibility concern is very valid. But since I asked in the PRs and got no pushback, I assumed modernizing was more important.

I would propose finishing up the v5 release and adding a note that v6 will bring big changes.

If people try out the next branch and find that they rely heavily on some of the old behavior, we can think about how to make it compatible for them in a different branch, or guide them to do this on their own.

can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?

This specifically is not just a syntax change; the values change as well (ARM_CC_invalid = 0 is replaced by ARMCC_UNDEF = 15). Reversing this also means:

  1. Having two different CC enums (for CS and LLVM).
  2. Having to translate between those enums.

But there is little point in keeping this complexity (other than for compatibility reasons, of course).
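
To make the cost of keeping both enums concrete, here is a minimal sketch of the translation layer that would be required. `OldCC`, `NewCC`, `old_to_new` and `new_to_old` are hypothetical Python stand-ins for the C enums. Only ARM_CC_invalid = 0 and ARMCC_UNDEF = 15 are taken from the comment above; the remaining numeric values (conditions starting at 1 in the old enum and at 0 in the new one) and the invalid <-> UNDEF pairing are assumptions for illustration.

```python
from enum import IntEnum

# Condition names shared by both enums (order assumed to follow the encoding).
_CONDITIONS = ["EQ", "NE", "HS", "LO", "MI", "PL", "VS", "VC",
               "HI", "LS", "GE", "LT", "GT", "LE", "AL"]

# Old Capstone style: 0 is the invalid marker, conditions start at 1 (assumed).
OldCC = IntEnum("OldCC", {"INVALID": 0,
                          **{name: i + 1 for i, name in enumerate(_CONDITIONS)}})

# LLVM style: conditions start at 0 (assumed), UNDEF = 15 (from the comment).
NewCC = IntEnum("NewCC", {**{name: i for i, name in enumerate(_CONDITIONS)},
                          "UNDEF": 15})


def old_to_new(cc: OldCC) -> NewCC:
    """Translate by name, because the numeric values do not line up."""
    return NewCC.UNDEF if cc is OldCC.INVALID else NewCC[cc.name]


def new_to_old(cc: NewCC) -> OldCC:
    """Inverse translation; UNDEF is mapped back to the old invalid marker."""
    return OldCC.INVALID if cc is NewCC.UNDEF else OldCC[cc.name]
```

Under these assumptions the same integer means different things in the two enums (15 would be AL in the old one and UNDEF in the new one), so a plain alias or `#define` of the old names would not be enough; every value crossing the API boundary would need to go through such a translation.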

XVilka commented

Because it takes longer than I expected, I suggest targeting upcoming LLVM 17.0 release with a few nice updates in ARMv9 and RISC-V extensions: https://discourse.llvm.org/t/llvm-17-0-0-release-planning-and-update/71762

  • July 25th - release/17.x branch created
  • July 27th - 17.0.0-rc1 released
  • August 9th - 17.0.0-rc2 released

https://llvm.org/docs/ReleaseNotes.html#non-comprehensive-list-of-changes-in-this-release

kabeor commented

@XVilka In that way, should we merge #1949 after 17.x release?

Suggest to continue this topic at capstone-engine/llvm-capstone#11

It would probably also make sense to remove now obsolete suite/synctools as well, especially after AArch64 PR is merged.

@xen0n As we finished most of the planned architectures and refactorings, with instruction details being the major missing piece, it's a good time to start implementing other architectures as well e.g., LoongArch.

@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.

Thanks, I will work on it.

Got an error running ASUpdater.py -a AArch64:

INFO  - Clean build directory
INFO  - Generating Disassembler tables...
INFO  - Generating AsmWriter tables...
INFO  - Generating RegisterInfo tables...
INFO  - Generating InstrInfo tables...
INFO  - Generating SubtargetInfo tables...
INFO  - Generating Mapping tables...
INFO  - Generating SystemOperand tables...
INFO  - Compile Cpp language
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp
INFO  - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src
INFO  - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o
INFO  - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o
INFO  - cc -shared /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o -o /home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so
INFO  - Load language '/home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so'
INFO  - Unresolved template calls: dict_keys([b'unsigned UnscaledVal = MI->getOperand(OpNum).getImm();']). Patch them by hand!
INFO  - Translate '/home/jiegec/capstone/capstone/suite/auto-sync/llvm-capstone/llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp'
Traceback (most recent call last):
  File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 228, in <module>
    Updater.update()
  File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 141, in update
    self.translate()
  File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 118, in translate
    translator.translate()
  File "/home/jiegec/capstone/capstone/suite/auto-sync/Updater/CppTranslator/CppTranslator.py", line 390, in translate
    query: Query = self.ts_cpp_lang.query(pattern)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiegec/capstone/capstone/venv/lib/python3.11/site-packages/tree_sitter/__init__.py", line 93, in query
    return _language_query(self.language_id, source)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Invalid syntax at offset 43

@jiegec I think this one might help: #2255

@jiegec Added you in the list above for LoongArch. Let me know if you want to be removed again.

WIP at https://github.com/jiegec/capstone

@Rot127 As discussed in the rizin Mattermost I have some free time and would like to update the BPF support to auto-sync 🙌

Great! Please go ahead. Start with the documentation and let me know if you need help with something. If something is not clearly written or needs clarification in your opinion, please let me know as well. The docs haven't been read by many people yet, so any fresh look at them is welcome.

Also notify me when you have a fork pushed and a draft PR opened. So we can link it here. Draft PR is preferred, because we can comment on it more easily.

@jiegec Just wanted to ask quickly how it is going with LoongArch. Please be aware of capstone-engine/llvm-capstone#45 when developing further. My plan is to merge it after ARM, AArch64 and PPC are updated here in the Capstone repo. But it is better to base further development on capstone-engine/llvm-capstone#45, since there were almost certainly changes to LoongArch since LLVM 16.

Sorry, I forgot about this after a long spring vacation. I will continue to work on it.