riscv/riscv-opcodes

including applicable extensions in the opcode syntax

neelgala opened this issue · 13 comments

I have been using this repo as the official source of encodings for our internal design and verification tools. One issue I keep running into is the lack of concrete information about which extension(s) an instruction belongs to. I want to decode only the instructions that are applicable to a user-defined ISA, so if the user specifies RV64IMC, only instructions under those three extensions should be decoded. The current file-naming convention is somewhat useful, but it does not fully address the problem. Two of the issues are described below:

  1. c.flw should be applicable only when F and C are both implemented. Placing it inside opcodes-rvc confuses the tools, and having a separate file opcodes-rv32fc increases maintenance.
  2. Instructions like pack are present under multiple extensions (zbp, zbf and zbe). Placing pack into the opcodes file of each sub-extension might work but is not scalable: one has to remember to edit all those files for any change to the instruction.

So, a file-naming convention alone might not work, and the syntax of opcode entries will need to change slightly. The following is a very quick and dirty proposal (it will need refining) of what I think can address the above issues:

Add the list of comma-separated extensions under which the encoding is a legal instruction; wrapped within | | at the end of the line.

Examples

c.flw      1..0=0 15..13=3 12=ignore 11..2=ignore |RV32FC|

pack       rd rs1 rs2 31..25=4  14..12=4 6..2=0x0C 1..0=3 |RV32Zbp, RV32Zbf, RV32Zbe, RV64Zbp, RV64Zbf, RV64Zbe|

Tools can then use substring matching to identify if that instruction is applicable for the user-defined ISA or not.

A better way of doing the above would be to use regex (less readable but extremely powerful):

c.flw      1..0=0 15..13=3 12=ignore 11..2=ignore |RV(32).*(F).*(C).*|

pack       rd rs1 rs2 31..25=4  14..12=4 6..2=0x0C 1..0=3 |RV(32|64).*(Zbp|Zbf|Zbe).*|

The regex will need to follow a few strict guidelines while writing but that should be manageable.
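As a rough illustration of how a tool might consume the regex form, the sketch below splits a line at its first and last | (the pattern itself may contain |) and matches the pattern against a user-supplied ISA string. The helper names split_entry and is_applicable are hypothetical, and treating untagged lines as always applicable is an assumption, not part of the proposal.

```python
import re


def split_entry(line):
    """Split an opcode line into (encoding_text, isa_pattern).

    The ISA pattern is taken as everything between the first and the
    last '|' on the line, so that '|' inside the regex is preserved.
    Returns (line, None) when the line carries no |...| tag.
    """
    first = line.find("|")
    last = line.rfind("|")
    if first == -1 or first == last:
        return line.strip(), None
    return line[:first].strip(), line[first + 1:last]


def is_applicable(line, isa):
    """Return True if the tagged pattern matches the whole ISA string."""
    _, pattern = split_entry(line)
    if pattern is None:
        return True  # assumption: untagged entries are always legal
    return re.fullmatch(pattern, isa) is not None


C_FLW = "c.flw      1..0=0 15..13=3 12=ignore 11..2=ignore |RV(32).*(F).*(C).*|"
PACK = ("pack       rd rs1 rs2 31..25=4  14..12=4 6..2=0x0C 1..0=3 "
        "|RV(32|64).*(Zbp|Zbf|Zbe).*|")

if __name__ == "__main__":
    for isa in ("RV32IMAFC", "RV64IMAFC", "RV64IMCZbp"):
        print(isa, is_applicable(C_FLW, isa), is_applicable(PACK, isa))
```

Note that a pattern such as RV(32).*(F).*(C).* only matches when F appears before C in the ISA string, so the user-supplied string would have to follow a fixed ordering. That is presumably one of the guidelines such patterns would need.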

Pros of the proposal:

  • the syntax is pretty regex-able and simply adds on to the current syntax. Current tools depending on this repo will simply need to ignore everything between | |.
  • minimal changes to existing scripts in this repo to generate the current set of artifacts
  • does not require a strict file naming convention - improves scalability
  • number of files in the repo will reduce - improves maintenance

Before I start working on a PR for the above, I wanted to get a sense of whether such a change would be welcome/acceptable.

My immediate reaction is that I prefer a different approach that’s more similar to what we’re currently doing: use the file names to make this distinction, rather than adding metadata to the individual instructions.

For instructions that belong to multiple extensions, we could use the existing @ aliasing scheme when they appear in multiple files, or invent some new prefix that means “I know this is defined elsewhere, but I’m including it here anyway, without explicitly rewriting its operands”.

Regardless, I agree we should solve the problem you’re trying to solve, and your solution is a reasonable approach. I’d like others to weigh in.

I spent a little more time on the file based distinction scheme and could come up with the following. Let me know your thoughts. X and Y below represent extension characters/strings.

  1. rv_x_y - contains instructions common within the 32-bit and 64-bit modes when both x and y extensions are enabled.
  2. rv32_x_y - contains instructions present in rv32xy only (absent in rv64X_Y, e.g. ???)
  3. rv64_x_y - contains instructions present in rv64xy only (absent in rv32X_Y, e.g. addw)
  4. _y in the above is optional and can be null
  5. For instructions present in multiple extensions, the instruction encoding must be present in the first extension when alphabetically canonically ordered. All other extensions can simply include a $import prefix followed by <filename> and <instruction_name> separated by ::. For example, pack would be present in the rv32_zbe file as
    pack rd rs1 rs2 31..25=4 14..12=4 6..2=0x0C 1..0=3 and the rv32_zbf and rv32_zbp files would each have the following entry: $import rv32_zbe::pack
  6. For pseudo ops we use $pseudo_op <filename>::<instruction> <overloaded fields/patterns> to indicate the original instruction that this pseudo op depends on and the fields that need to change. For example, when shfli gets ratified, zip can be represented in rv32_zbkb as: $pseudo_op rv32_zbp::shfli shamtw=15

In the above scheme I am basically reserving $ to indicate that a keyword follows.
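To make the $import and $pseudo_op entries concrete, here is a minimal sketch of how a parser might handle them. It works on an in-memory dict of file contents instead of real files; the function names resolve and parse_pseudo_op are hypothetical, and skipping blank lines and # comments is an assumption about the file format.

```python
def parse_pseudo_op(line):
    """Parse '$pseudo_op <file>::<instr> field=value ...'.

    Returns (source_file, instruction, overrides) where overrides maps
    each overloaded field name to its value as a string.
    """
    parts = line.split()
    src_file, instr = parts[1].split("::")
    overrides = dict(p.split("=", 1) for p in parts[2:])
    return src_file, instr, overrides


def resolve(files, name):
    """Return {instruction: encoding_text} for the file called `name`.

    `files` maps a file name to its list of lines. A '$import' line is
    resolved by looking the instruction up in the file it refers to.
    '$pseudo_op' lines are skipped in this sketch.
    """
    result = {}
    for raw in files[name]:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are ignored
        if line.startswith("$import"):
            src_file, instr = line.split()[1].split("::")
            result[instr] = resolve(files, src_file)[instr]
        elif line.startswith("$pseudo_op"):
            continue
        else:
            instr, encoding = line.split(None, 1)
            result[instr] = encoding
    return result


FILES = {
    "rv32_zbe": ["pack rd rs1 rs2 31..25=4 14..12=4 6..2=0x0C 1..0=3"],
    "rv32_zbf": ["$import rv32_zbe::pack"],
}

if __name__ == "__main__":
    print(resolve(FILES, "rv32_zbf"))
    print(parse_pseudo_op("$pseudo_op rv32_zbp::shfli shamtw=15"))
```

The sketch does not guard against circular imports; a real implementation would probably want to reject them.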

The above scheme will still require significant rearrangement of the current repo files. For example, rv32i will move to rv_i, rv64_i will contain the additional 64-bit mode base instructions, and so on.

maybe "canonically" ordered instead of "alphabetically" ordered makes more sense ?

@aswaterman I have gone ahead with the implementation of my proposal and have an initial draft of what the revised repo will look like: https://github.com/incoresemi/riscv-opcodes/tree/restructuring-opcodes. I have yet to fix the parse_opcodes.py file, but before I do that I wanted to get feedback on whether the revised structure is acceptable.

Important points to note:

  1. I use $import to indicate that an extension is borrowing an instruction from another extension (look at zkn for example)
  2. I use $pseudo_op to indicate instructions which are defined by the spec as pseudo ops for standard instructions (again, look at rv_pseudo)
  3. I have also cleaned up compressed instruction support significantly.
  4. the concept of aliases or usage of '@' is no longer supported
  5. There are a few places, like in zbkb, where we have instructions like pack/packh which are pseudo ops for unratified instructions under zbp/e. I have kept those as pseudo ops. Let me know if they should be treated as standard ops for now.

Feedback is highly appreciated - post which I will start working on the python code.

And to address greg's points, we can have rv_*_unratified as a file-naming convention: when an extension gets ratified, its file simply drops the _unratified postfix, so everyone knows what's ratified and what's not.

Also, for my current bitmanip draft, I have gone ahead with the extensions mentioned in the 0.94 draft for the unratified instructions.

Yeah, I think this is going in the right direction. And I appreciate that you sought feedback on the design before doing all of the software hacking.

I'd like others who have skin in the riscv-opcodes game to chime in before @neelgala goes off and does a bunch more work.

@aswaterman I have got the scripting work done for most of it, but I am having a hard time with pseudo ops.

Let's take the example of slli of the base ISA.
In the current framework, for rv32, slli is defined as a pseudo op in opcodes-pseudo. For rv64, slli is defined as a standard op in opcodes-rv64i.

However, in my approach slli appears in both rv32_i and rv64_i, each with its respective valid encoding (bit 25 being zero for the rv32_i version). So if someone was looking for instruction encodings for ISA=RV32I, they would look at rv_i + rv32_i to find the right set of encodings.

Is my approach okay or would you prefer treating slli in rv32_i as a pseudo op of slli in rv64_i ?
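The lookup described above (rv_i + rv32_i for RV32I) could be sketched roughly as follows. The function name candidate_files is hypothetical, and the way two-extension file names are formed (pairs taken in the order the extensions are listed) is an assumption, since the thread does not fix that ordering.

```python
from itertools import combinations


def candidate_files(xlen, extensions):
    """List the file names a tool would consult for a given ISA.

    For each extension x: rv_x (common to 32/64-bit) and rv<xlen>_x.
    For each pair (x, y): rv_x_y and rv<xlen>_x_y, with the pair order
    following the order of `extensions` (an assumption of this sketch).
    Files that do not exist in the repo would simply be skipped by the
    caller.
    """
    exts = [e.lower() for e in extensions]
    names = []
    for ext in exts:
        names.append(f"rv_{ext}")
        names.append(f"rv{xlen}_{ext}")
    for x, y in combinations(exts, 2):
        names.append(f"rv_{x}_{y}")
        names.append(f"rv{xlen}_{x}_{y}")
    return names


if __name__ == "__main__":
    print(candidate_files(32, ["i"]))
    print(candidate_files(32, ["c", "f"]))
```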

Going over it again I think you can discard my previous comment - treating slli in rv32_i as a pseudo op of the rv64_i version makes more sense and keeps the scripting work simple. I no longer need to parse pseudo opcodes as long as the corresponding standard op has been parsed. I checked in spike also - the encodings for the pseudo ops (like slli_rv32) are never used. So I guess this approach is better.

On the latex front, I see riscv-isa-manual uses the output from this repo for the instruction encoding tables. I wanted to know if the following was doable:

  • Can we rearrange the tables to be alphabetically organized, starting with ADD and ending with XORI?
  • For instructions like ECALL and EBREAK, where the entire encoding is static, can we avoid the vertical bars? Basically this:
    |00000000000000000000000001110011 | ECALL
    
    instead of
    |000000000000 | 00000 | 000 | 00000 | 1110011 | ECALL
    
    This would make the whole latex-generation code very simple and contributors will have one less issue to worry about when
    adding new instructions.
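A minimal sketch of that second suggestion, assuming each instruction is available as a list of field strings ordered from the most significant bits down, where a field is either a literal bit string or an operand name. The function name format_row and this input representation are hypothetical and not taken from the existing LaTeX generator.

```python
def format_row(name, fields):
    """Render one table row in the plain-text form used above.

    If every field is a literal bit string, the fields are merged into
    a single cell; otherwise each field keeps its own cell.
    """
    fully_static = all(f and set(f) <= {"0", "1"} for f in fields)
    separator = "" if fully_static else " | "
    return "|" + separator.join(fields) + " | " + name


if __name__ == "__main__":
    print(format_row("ECALL", ["000000000000", "00000", "000", "00000", "1110011"]))
    print(format_row("ADDI", ["imm[11:0]", "rs1", "000", "rd", "0010011"]))
```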

closed in #106