`auto-sync` progress tracker: Refactor and implement architectures
Rot127 opened this issue ยท 28 comments
Note to x86: x86
is not part of this list, because we can not generate all tables in C.
Refer to capstone-engine/llvm-capstone#13 for details.
Note about changes introduced with auto-sync
:
For a preview what changes will come in v6
, please take a look at the WIP release guide.
This issue tracks the auto-sync
refactoring and implementation effort of architecture modules.
The table below lists the responsible developers for each architecture.
In progress
Arch | CS PR | llvm-capstone PR | Part of (planned) release | Assigned developer(s) | Based on LLVM repo |
---|---|---|---|---|---|
SPARC | None yet | None yet | v6 |
@DMaroo | LLVM-project |
BPF | None yet | None yet | v6 |
@0pendev | LLVM-project |
SystemZ | #2462 | capstone-engine/llvm-capstone#55 | v6 |
@Rot127 | LLVM-project |
Xtensa | #2380 | None yet | v6 |
@imbillow | LLVM-project |
ARC | None yet | None yet | v6 |
@R33v0LT | LLVM-project |
.td edits upstreamed
Most LLVM td
files miss some information about instructions (memory read/writes, operands incorrectly assigned as in/out etc.). Since we rely on this we need to fix it. Those fixes should be upstreamed to LLVM.
-
Alhpa(no longer maintained) - ARM
- AArch64
- MIPS
- PPC
- SPARC
Done
Arch | PR | Part of release | Assigned developer(s) | LLVM repo |
---|---|---|---|---|
Alpha | #2071 | v6 |
@R33v0LT | LLVM-project (release v3.0 ) |
AArch64 | #2026 | v6 |
@Rot127 | LLVM-project |
ARM | #1949 | v6 |
@Rot127 | LLVM-project |
PPC | #2013 | v6 |
@Rot127 | LLVM-project |
TriCore | #1973 | v5 |
@imbillow | TriDis |
HPPA | #2265 | v6 |
@R33v0LT | Not Auto-sync based |
LoongArch | #2349 | v6 |
@jiegec | LLVM-project |
MIPS | #2410 | v6 |
@wargio | LLVM-project |
Arch extensions
Adding CPU extensions which are not part of upsteram LLVM is easier now.
Here are they tracked.
Arch | Extension name | issue | previous attempt/notes | Done |
---|---|---|---|---|
PPC | VLE | #2241 | https://lists.llvm.org/pipermail/llvm-dev/2014-July/074613.html | No |
PPC | PS (Paired-Single) | None | https://reviews.llvm.org/D85137 | Yes |
Mips | NanoMips | None | Mediatek LLVM: https://github.com/MediaTek-Labs/llvm-project/tree/mtk-pub/nanomips-llvm16, more context: rizinorg/ideas#5 | Yes |
Mips | EE | None | Not in LLVM, see: #940 (comment) | No |
Effort level of not refactored/implemented archs
Arch | Number of operand groups | Generates | Note | Implementation type | Difficulty level |
---|---|---|---|---|---|
AVR | ~3 | Yes | None | New | Easy |
CSKY | ~7 | Yes | None | New | Medium |
DirectX | ~1 | Yes | Deviates from common design. | New | Medium-Hard |
EVM | ~2 | Not tested | Very small module, llvm repo: https://github.com/etclabscore/evm_llvm | New | Easy |
Hexagon | ~2 | No | Deviates from common design. | New | Hard |
Lanai | ~10 | Yes | None | New | Easy |
M68k | ~28 | Yes | None | Refactor | Medium |
MSP430 | ~6 | Yes | None | New | Easy |
SPIRV | ~9 | No | td files faulty |
New | Medium |
VE | ~8 | Yes | None | New | Medium |
XCore | ~15 | No | td files faulty |
Refactor | Medium |
Note to RISC-V: RISC-V will not be generated via LLVM because the LLVM architecture definitions are not precise enough for our use case. Instead, a SAIL based generator will be used (#2392).
Legend
Number of operand groups
: Operand groups which have a distinctprint
functions. Indicates effort to implement the LLVM <-> CS mapping code (fillcs_detail
and the like).Generates
:inc
files generate with most recent backends.Note
: Worthy to note.Implementation type
: Refactor current implementation or implement new arch module.Difficulty level
: Guessed difficulty of this arch (base on points above and complexity like number of instructions etc.). Though "Easy" still means you have to familiarize yourself how LLVM definitions and the updater work. My guess is it will take at least a week of work.
Getting started
- If you like to refactor an architecture module or implement a new one, please comment here and we add you. Also we can give hints to important information.
- Please add a draft PR once you've done the first commit, so the progress is visible and there is a place for discussion.
- Please refer to the
auto-sync
documentation to learn how to refactor or implement an architecture withauto-sync
TODO for refactored archs
List of missing things which should be done before v6
to get a nice round package.
Capstone
- Missing alias support for SystemZ
- Missing alias support for LoongArch
- Update docs with
ASUpdater.py
instructions. - #1984
- Update all archs to LLVM 18
- Remove tablegen files from suite.
- Add CS
assert
version and add the asserts to the LLVM files again. - Wrap all possible code into
CAPSTONE_DIET
. - Run 0x0 to 0xffffffff as input once on ARM, PPC, AArch64 (with details enabled) to check for segfaults.
-
name2id
docs. Parametermax
should be changed to table size and in the loop bemax - 1
- Consider to have alias details and real details live along. So users do not need to decide for one (how would this play together with
CAPSONE_DIET
). - Possibly #2152
- #2196
- Expose
PPC
instruction formats on the public interface
LLVM revisions
- ARM td files
- PPC td files (have there been any)
- ARM missing
pop
alias (d64f749) - Segfaults for PPC
- AArch64 missing
FeatureAll
check - More PPC
predicates
- PPC false instruction defintions (which should be alias)
- AArch64 - Open issue about missing memory operand info.
- AArch64 - Open issue about patterns not assigned to their instructions (SVE, SME).
- capstone-engine/llvm-capstone@c3484b1
- capstone-engine/llvm-capstone#39
- capstone-engine/llvm-capstone#40
Auto-Sync
- add
refactor
setting toauto-sync
updater. - Add auto-sync unit tests
- Translate template functions as functions, not as macros.
Backends
- Generate decoding/printing macros as functions, if there is only a single version (allows proper debugging, which would be a blessing).
ARM
- Add general alias and alias operand handling.
- Add vector layout information
- Set
post_index
when base regsister is tied. Just to make sure to hit every case. -
Encoding info - Move data type insn mapping to own auto-sync class.
- #2193
PPC
-
Encoding info
AArch64
-
Encoding info
@kabeor @aquynh @Rot127 I propose to make the next release, with auto-sync changes a 6.0, not 5.1 because:
- There are slight API changes
- The amount of code changes is HUGE
https://github.com/orgs/capstone-engine/projects/1 - then it would need to be updated too.
please can you summarize the API changes here?
ARM
- Enum changes:
ARM_CC_
->ARMCC
- System registers are renamed to match C++ namespaces. Also group Banked and system registers into different groups.
- Some instr. enum entries no longer exist (e.g.
VPUSH
,VPOP
).
- Some instruction groups which are not part of LLVM were removed (e.g.
GROUP_INT
)- Groups like
RET
,INT
should be added via Mapper separately.
- Groups like
- Feature groups like
ARM_GRP_CRC
are renamed to match LLVM nameing:ARM_FEATURE_HasCRC
- Features are now checked more strictly (
V8
,MCLASS
,ARM
,THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features. - The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added. writeback
is part ofdetail
and no longer ofdetail.arm
.- Register alias not defined in LLVM (
r15 = pc
etc.) are no longer printed as default. Must be enabled viaCS_OPT_SYNTAX_CS_REG_ALIAS
or-a
for thecstool
. - The immediate value of operands is no of type
uint32_t
, no longerint32_t
.
PPC
- Predicate enums members are renamed. They now use the LLVM name (e.g.
PPC_BC_NU_PLUS
->PPC_PRED_NU_PLUS
). - Branch conditions are now saved in more detail in
cs_ppc.bc
. - The base register of an PPC memory operand was not present if
reg = r0
. This is fixed now. ppc_ops_crx
is removed (wasn't used).
AArch64
- Renamed all
ARM64
->AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM. SME
operands changed (contin more detail, terminology is closer to the docs).- System operands change (now categorized into SysAlias, SysImm, SysReg).
This list is also part of the PR.
ARM
Enum changes:
ARM_CC_
->ARMCC
- System registers are renamed to match C++ namespaces. Also group Banked and system registers into different groups.
- Some instr. enum entries no longer exist (e.g.
VPUSH
,VPOP
).Some instruction groups which are not part of LLVM were removed (e.g.
GROUP_INT
)
- Groups like
RET
,INT
should be added via Mapper separately.Feature groups like
ARM_GRP_CRC
are renamed to match LLVM nameing:ARM_FEATURE_HasCRC
Features are now checked more strictly (
V8
,MCLASS
,ARM
,THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features.The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added.
writeback
is part ofdetail
and no longer ofdetail.arm
.Register alias not defined in LLVM (
r15 = pc
etc.) are no longer printed as default. Must be enabled viaCS_OPT_SYNTAX_CS_REG_ALIAS
or-a
for thecstool
.The immediate value of operands is no of type
uint32_t
, no longerint32_t
.PPC
- Predicate enums members are renamed. They now use the LLVM name (e.g.
PPC_BC_NU_PLUS
->PPC_PRED_NU_PLUS
).- Branch conditions are now saved in more detail in
cs_ppc.bc
.- The base register of an PPC memory operand was not present if
reg = r0
. This is fixed now.ppc_ops_crx
is removed (wasn't used).AArch64
- Renamed all
ARM64
->AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM.SME
operands changed (contin more detail, terminology is closer to the docs).- System operands change (now categorized into SysAlias, SysImm, SysReg).
This list is also part of the PR.
cant we avoid breaking compatibility?
@aquynh The short answer is no.
But let me go into more details also for others:
The problem with automatic Capstone updates is that due to the C++ and C difference we have many cases to handle when C is not equivalent to C++.
To reduce those cases we need to be as close to the original C++ syntax and semantic as possible. Because every renaming (i.e. enum values), semantic and overall design changes, almost always add manual work during an update.
This is why those breaking changes are needed.
Each of them moves Capstone code semantically or syntactically closer to the LLVM definitions.
This is of cause a pain for compatibility, but it is definitely worth it in the long run.
Because:
- All
auto-sync
archs are semantically pretty much equivalent to LLVM.- Which gives more correct results.
- Results are comparable to
llvm-objdump
. - Eases test generation.
- Easier to extend CS in the future with other information known in LLVM (see #2045 which adds instruction encodings for all
auto-sync
archs without much trouble).
- Unifies how modules work, so we can share code between them (see the new
Mapping.*
files). - Reduces the effort to update (less manual work, due to less edge cases). So hopefully more people update their archs.
Here more detail to each breaking change.
Enum changes
:
Done to match LLVM naming. It saves us to change enum names over several files whenever we update.
ARM
Some instruction groups which are not part of LLVM were removed (e.g.
GROUP_INT
)
Capstone unique instr. groups (like RET
, INT
) are added via Mapper separately (is on the toddo list). Because they are not defined in LLVM, we can not generate them without adding exceptions again.
Features are now checked more strictly (
V8
,MCLASS
,ARM
,THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features.
Simple necessity, because with the new instructions the same bytes have a valid decoding depending on the enabled features.
The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added.
Move closer to the LLVM logic. The disponent of a memory access doesn't need to be within the []
brackets (e.g. strt fp, [sp], 4
). But the disponent is defined as part of the memory operand. This was incorrectly represented in CS before.
writeback
is part ofdetail
and no longer ofdetail.arm
.
We support now the concept of Tied
operands (the way LLVM describes writeback registers). So writeback information is now known for all auto-sync
archs.
Register alias not defined in LLVM (
r15 = pc
etc.) are no longer printed as default. Must be enabled viaCS_OPT_SYNTAX_CS_REG_ALIAS
or-a
for thecstool
.
As said before, modules will be more equivalent to the llvm-objdump
results. Also the register naming and the decoded asm string.
The immediate value of operands is not of type
uint32_t
, no longerint32_t
.
See: #2056
PPC
Branch conditions are now saved in more detail in
cs_ppc.bc
.
Just a nice feature we are now able to provide.
The base register of an PPC memory operand was not present if
reg = r0
. This is fixed now.
A semantical fix. The base register should have been set.
ppc_ops_crx
is removed (wasn't used).
Wasn't used.
AArch64
Renamed all
ARM64
->AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM.
This is a big one. But having two names for the same architecture in the code is a nightmare for generation. Also it just doesn't bring any value. Being closer to LLVM is the choice here.
SME
operands changed (contain more detail, terminology is closer to the docs).
Again a nice feature addition because we save more detail. Being closer to the official docs when it comes to naming eases integrating Capstone in other projects.
System operands change (now categorized into SysAlias, SysImm, SysReg).
Again, move to LLVM semantic because:
- It wasn't correct before (system immediate and other alias were incorrectly identified as system registers or not categorized at all).
- This mimics the inheritance of system operands within the LLVM code.
- Eases generation.
Personally, I think that Capstone will become more and more irrelevant as disassembler engine if we:
- Do not modernize it (update archs, update testing)
- Provide people a relative easy way to add more features and architectures (e.g. the instruction encoding, instruction form information (for
PPC
) and others).
If we do not go through the pain of breaking compatibility once, to gain log term improvements, Capstone just won't be competitive to other disassemblers in the future.
lets take one example: we want to rename ARM_CC
to ARMCC
.
can we have compatibility by keeping ARM_CC
, and add (new) ARMCC
, so everyone is happy?
I answer your suggestions later. On the phone it is difficult to write well. As a general note though: I second @XVilkas here. I work on this now for half a year and opened the ARM PR as draft after two-three months. Exactly to ask for this kind of feedback, suggestions on design and other choices. This is also why I added the list of breaking changes to the PR description, updated it continuously and asked for feedback on the big ones. So no one needed to read through the code and can save time. Also, I think I made clear that I am more than happy to provide detailed answers and provide more details if requested. As I stated in the ARM PR, my time I can spend on this is limited (until end of July). And we want and need to start building on it in Rizin. With all due respect, but there were at least four and a half months to discuss a big decision like this. And we really need to carry on. 26 Jun 2023 17:23:51 Nguyen Anh Quynh @.>:
โฆ
I never say it is not merged, Anton. On Mon, Jun 26, 2023, 23:12 Anton Kochkov @.> wrote: > If auto-sync work is not merged, I am afraid we have to fork the capstone. > It's your choice - you want updated architectures or not. > > โ > Reply to this email directly, view it on GitHub > <#2015 (comment)>, > or unsubscribe > https://github.com/notifications/unsubscribe-auth/ABNQNYGJCQLEGSSBCAUYCGTXNGRFNANCNFSM6AAAAAAX3POTSU > . > You are receiving this because you were mentioned.Message ID: > @.***> > โ Reply to this email directly, view it on GitHub[#2015 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AK5ET6CWWG5MAD4QVHRZWLTXNGSQLANCNFSM6AAAAAAX3POTSU]. You are receiving this because you were mentioned.[Tracking image][https://github.com/notifications/beacon/AK5ET6B4GEQVBVWGEALNKVLXNGSQLA5CNFSM6AAAAAAX3POTSWWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTS72PDV4.gif]
understood - we are all short of time, especially those who are maintaining this project without paying, in spare time.
we will try to merge this in July.
@aquynh I agree that the compatibility concern is very valid. But since I asked in the PRs and got no push back I assumed modernizing is more important.
I would propose to finish up the v5
release and add a note that v6
will bring big change.
If people try out the next branch and figure they rely desperately on some of those old stuff, we can think about how to make it compatible for them in a different branch. Or guide them to do this on their own.
can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?
This specifically is not just a syntax change, but also the values change (ARM_CC_invalid = 0
is removed with ARMCC_UNDEF = 15
). Reversing this also means to:
- Have two different
CC
enums (for CS and LLVM) - Having to translate between each of those enums.
But there is little meaning in keeping this complexity (other then compatibility reasons of cause).
Because it takes longer than I expected, I suggest targeting upcoming LLVM 17.0 release with a few nice updates in ARMv9 and RISC-V extensions: https://discourse.llvm.org/t/llvm-17-0-0-release-planning-and-update/71762
- July 25th - release/17.x branch created
- July 27th - 17.0.0-rc1 released
- August 9th - 17.0.0-rc2 released
https://llvm.org/docs/ReleaseNotes.html#non-comprehensive-list-of-changes-in-this-release
@XVilka In that way, should we merge #1949 after 17.x release?
Suggest to continue this topic at capstone-engine/llvm-capstone#11
It would probably also make sense to remove now obsolete suite/synctools
as well, especially after AArch64 PR is merged.
@xen0n As we finished most of the planned architectures and refactorings, with instruction details being the major missing piece, it's a good time to start implementing other architectures as well e.g., LoongArch.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
Thanks, I will work on it.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
Got error running ASUpdater.py -a AArch64
:
INFO - Clean build directory
INFO - Generating Disassembler tables...
INFO - Generating AsmWriter tables...
INFO - Generating RegisterInfo tables...
INFO - Generating InstrInfo tables...
INFO - Generating SubtargetInfo tables...
INFO - Generating Mapping tables...
INFO - Generating SystemOperand tables...
INFO - Compile Cpp language
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src
INFO - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o
INFO - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o
INFO - cc -shared /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o -o /home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so
INFO - Load language '/home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so'
INFO - Unresolved template calls: dict_keys([b'unsigned UnscaledVal = MI->getOperand(OpNum).getImm();']). Patch them by hand!
INFO - Translate '/home/jiegec/capstone/capstone/suite/auto-sync/llvm-capstone/llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp'
Traceback (most recent call last):
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 228, in <module>
Updater.update()
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 141, in update
self.translate()
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 118, in translate
translator.translate()
File "/home/jiegec/capstone/capstone/suite/auto-sync/Updater/CppTranslator/CppTranslator.py", line 390, in translate
query: Query = self.ts_cpp_lang.query(pattern)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiegec/capstone/capstone/venv/lib/python3.11/site-packages/tree_sitter/__init__.py", line 93, in query
return _language_query(self.language_id, source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Invalid syntax at offset 43
@jiegec Added you in the list above for LoongArch. Let me know if you want to be removed again.
@jiegec Added you in the list above for LoongArch. Let me know if you want to be removed again.
@Rot127 As discussed in the rizin Mattermost I have some free time and would like to update the BPF support to auto-sync ๐
Great! Please go ahead. Start with the documentation and let me know if you need help with something. If something is not clearly written or needs clarification in you opinion please let me know as well. The docs haven't been read by many people yet. So any fresh looks at it are welcome.
Also notify me when you have a fork pushed and a draft PR opened. So we can link it here. Draft PR is preferred, because we can comment on it more easily.
@jiegec Just wanted to ask quickly, how it is going with LoongArch? Please be aware of capstone-engine/llvm-capstone#45 when developing further. My plan is to merge it after ARM
, AArch64
and PPC
are updated here in the Capstone repo. But better use this capstone-engine/llvm-capstone#45 for further development, since there were almost certainly changes to LoongArch since LLVM 16.
@jiegec Just wanted to ask quickly, how it is going with LoongArch? Please be aware of capstone-engine/llvm-capstone#45 when developing further. My plan is to merge it after
ARM
,AArch64
andPPC
are updated here in the Capstone repo. But better use this capstone-engine/llvm-capstone#45 for further development, since there were almost certainly changes to LoongArch since LLVM 16.
Sorry, I have forgotten this thing after a long spring vacation.. I will continue to work on it.