Add grouping registers together, like how CMSIS maps a struct of types to a base address
Closed this issue · 16 comments
Issue
As saw in #5 , there is a discrepancy between CMSIS and Cppreg which was found to be due to the CMSIS style allowing the ability to inform the compiler about groupings of registers. This allows the compiler to, when reading and writing from memory, use relative memory operations like this
C++(Cppreg) CMSIS vs CPPREG
int a[0] = SomeField0::read(); LD R0, R10 [#0] LD R0, R10
int a[1] = SomeField1::read(); LD R1, R10 [#4] LD R1, R11
int a[2] = SomeField2::read(); LD R2, R10 [#8] LD R2, R12
Since CppReg doesn't have that information explicitly stated, the compiler is stuck assuming that the registers are totally unrelated and is unable to get the address via offsets from a base address. I guess the optimization process is unable to do that extensive inspection for memory addresses (though it can do that for immediate as shown in the sidenote of #6 it seems).
Solution
Add a way to group registers together. I do not have a syntax example off the top of my head, but as @sendyne-nclauvelin mentioned in #5 this would require a not trivial amount of API rework, so before implementing, lets see what other potential API points of pain there are.
@hak8or here are some thoughts about this issue ...
Prologue
This issue was created after it was noticed in #5 that cppreg
was not accounting for the fact that registers might be related to simple offsets and could therefore be related by using immediate values.
A typical example is given here.
What is exactly happening
CMSIS maps C structures directly to memory, and registers are accessed as members of these structures., for example:
typedef struct {
__IO uint8_t TXFIFO; // Base + 0x00
__IO uint8_t STATUS; // Base + 0x01
} UART_TypeDef;
#define PERIPH_BASE ((uint32_t)0xF0A03110)
#define UART_Base (PERIPH_BASE + 0x0010)
#define UART ((UART_TypeDef *) UART_Base)
UART->TXFIFO = 'a';
With such an approach, the compiler is clearly aware the registers (here TXFIFO
and STATUS
) are part of the same structure and the generated code will leverage this fact.
On the other hand in cppreg
there is no such construct. Instead each register is defined as a specific type and the fields contained in the register are also explicitly defined. Therefore the compiler is never made aware that the registers are part of a structure. This leads to the register addresses to be used explicitly (versus using immediate values in the CMSIS approach).
This is clearly the case and the example here is identical to the first one except that for the CMSIS code we now access the registers directly as pointers. And this leads to the exact same code compared to cppreg
.
What's next
This issue highlights a fundamental difference between CMSIS and cppreg
. In CMSIS the register is the fundamental unit when it comes to read/write operations. This requires the value to be properly masked and shifted before a write operation and after a read operation. In cppreg
, these details are abstracted from the caller and the Field
implementation takes care of it for free.
For cppreg
to be aware (and make the compiler aware) that a structure containing registers can be mapped to memory we need to provide some higher order construct.
Also as a side note, it is not very clear how heavy the performance hit ... but let's give it a try.
A first attempt
So the first step was to not deal explicitly with pointers anymore but rather rely on reference in cppreg
. This is not much work and only a few spots are affected (mostly the implementations of the various access policies).
The second step is to provide some syntax construct so that it is possible to express that some registers are grouped. This requires some register cluster definition, for example:
// Here MemoryMap is the type corresponding to the structure to be mapped.
template <Address_t ClusterBase, typename MemoryMap>
struct Cluster {
constexpr static const Address_t cluster_base = ClusterBase;
using map_t = MemoryMap;
static MemoryMap* const _mem_map;
};
template <Address_t ClusterBase, typename MemoryMap>
MemoryMap* const
Cluster<ClusterBase, MemoryMap>::_mem_map = reinterpret_cast<
MemoryMap* const
>(Cluster<ClusterBase, MemoryMap>::cluster_base);
Then we can define a GroupedRegister
type where we actually access the structure member:
template <
typename Cluster,
Width_t RegWidth,
std::uint32_t offset,
volatile typename RegisterType<RegWidth>::type Cluster::map_t::*member
>
struct GroupedRegister :
Register<
Cluster::cluster_base + offset,
RegWidth
>
{
using base_reg = Register<
Cluster::cluster_base + offset,
RegWidth
>;
static typename base_reg::MMIO_t& rw_mem_pointer() {
return std::ref(Cluster::_mem_map->*member);
};
static const typename base_reg::MMIO_t& ro_mem_pointer() {
return std::ref(Cluster::_mem_map->*member);
};
template <typename F, typename base_reg::type value>
inline static void write() {
typename base_reg::type fmt_value = 0;
F::policy::template write<
typename base_reg::type,
typename base_reg::type,
F::mask,
F::offset,
value
>(fmt_value);
Cluster::_mem_map->*member = fmt_value;
};
};
This is still very experimental ... for a GroupedRegister
-based type the original API is still accessible. In addition a new write method is also available.
And ... we are getting closer to CMSIS (same number of instructions). But ... it is a different assembly code, so @hak8or it would be great if you could take a look ... I am curious as to what is happening according to you.
It is here.
The assembly difference was surprising, but looking closer here is what I see.
Diffirences
-
UART::STATUSg::write<UART::STATUSg::Enable, 1>()
wipes out the rest of the fields in the register (does not OR with not explicitly set fields, instead just erases them). This is mostly why you saw the differing assembly. -
This still does not coerce the compiler to use immediate mode offsets to registers instead of just storing two full addresses.
Even smaller example + fiddling
I made a super bare bones example here based off your snippet but with some renaming and removal of unrelated components to increase clarity.
Using the above example, if we swap UART::STATUSg::write<UART::STATUSg::BIT0, 1>();
with UART::STATUS::BIT0::write<1>();
then we get the same correct assembly (OR instruction is there to keep the unset fields the same) excluding the lack of addressing via offset from a base address.
Making the wiping of old bits (removal of the OR) consistent with the CMSIS example via just writing 1 to STATUS, lets try the clustering implementation. This looks as follows:
// ============= CPPReg =============
typedef struct {
volatile uint8_t TXFIFO; // Base + 0x00
volatile uint8_t STATUS; // Base + 0x01
} UART_TypeDef;
struct UART {
struct UART_Cluster : cppreg::Cluster<0xF0A03120, UART_TypeDef> {};
struct TXFIFOg : cppreg::GroupedRegister<UART_Cluster, 8u, 0, &UART_TypeDef::TXFIFO> {
using DATA = cppreg::Field<TXFIFOg, 8u, 0 * 2, cppreg::write_only>;
};
struct STATUSg : cppreg::GroupedRegister<UART_Cluster, 8u, 8, &UART_TypeDef::STATUS> {
using BIT0 = cppreg::Field<STATUSg, 1u, 0, cppreg::read_write>;
using RES = cppreg::Field<STATUSg, 7u, 1, cppreg::read_only>;
};
};
void Demo_CPPReg(void){
UART::STATUSg::write<UART::STATUSg::BIT0, 1>();
// Introducing this does cause a new address to show up in .L2
// instead of using the an offset from the base address.
UART::TXFIFOg::DATA::write<0x12>();
}
// ============= CMSIS =============
#include <stdint.h>
typedef struct {
volatile uint8_t TXFIFO; // Base + 0x00
volatile uint8_t STATUS; // Base + 0x01
} UART_TypeDef;
#define UART ((UART_TypeDef *)0xF0A03120)
void Demo_CMSIS(){
UART->STATUS = 0b0000'0001u;
UART->TXFIFO = 0x12;
}
Sadly, the assembler still is unable to notice how close the addresses are and use relative addressing.
Grand Overview
I do not know how much work should be put in to getting the assembly to fully match and be just as performant as the CMSIS example. The reason why I am so worried about this is because it does introduce overhead in the form of increased binary size and register pressure. It is not rare to encounter situations where you have an array of registers that you want to write to quickly and packed closely together.
For example, when register pressure is high enough (due to having to load so many register addresses) then we get register spillover which can be catastrophic in a hot path. For example, like this. Interestingly, if you remove the -mthumb
flag, then the compiler is able to come in and optimize that away, but that relies on the optimizer instead of the language at that point, which can change between compiler updates (like here, why doesn't the optimizer optimize this away under -mthumb?). A roundabout way to fix this would be to find in GCC why this can't get optimized away and "fix" it but ehh.
@hak8or for review, this is a more "stable" API and this adds the ability to simply index the registers (or the fields) of interest.
The proposed API is in the register_pack
branch.
This works as follow:
-
Create a register pack (
RegisterPack
):struct SomePack : cppreg::RegisterPack<base_address, n_bytes_in_the_pack>;
-
Then add register to the pack by defining
PackedRegister
:using SomePackedReg = cppreg::PackedRegister< SomePack, width_in_bits, offset_in_pack_in_bytes(!!!), (optional_reset_value), (optional_shadow_value_flag) >;
After that any field can be added using the
SomePackedReg
register type and the whole API is accessible. -
If needed define an indexing (the same can be done with fields) as shown in the review example.
And a somewhat very concise and interesting version and cppreg
still generates the same assembly.
That iterator example is pretty sweet. Seems like it could be worthwhile to use for scenarios like resetting a bunch of registers to their default/reset values.
I am seeing that the interface is still fairly compact and easy to understand at a glance. It still generates the same assembly across various architectures and optimizations. Humorously enough this seems to cause an "internal compiler error" in all of the MSVC toolchains. Also, ICC seems to not be using relative addressing for some reason from CPPReg, but I am not clear on why it's doing that.
Depricating old API
Going further with the new API, it seems to be extremely similar to the previous one. I think it may be worthwhile to actually remove the old API and replace it with this.
-
Registers are pretty much always grouped together in chunks. A single lone register is very rare from what I've seen.
-
Removing the old API will make it harder to use CppReg in a way that causes non optimal performance (accidentally using the old Register interface instead of PackedRegister).
-
The API seems to be extremely similar to the originally intuitive API, so I don't think there is any loss in that aspect.
What do you think?
Ok so first let's discuss the new API and focus on the limitations (also the API is still in progress).
-
The current RegisterPack and PackedRegister API can only deal with a group of homogeneous registers in terms of size. For example, it can manage a group of 8-bits register or a group of 16-bits but it cannot (as of now) manage a group of four registers with two being 8-bits and the other two being 16-bits.
-
I think it is possible to implement a more generic version that could manage groups with mixed register sizes but as far as I can say now this will come with limitations; for example:
- sizes in order 8, 8, 16 32, 8 will work,
- sizes in order 8, 32, 8 or 8, 16, 8 will not work,
- the rule being that 8-bits register can be anywhere, but 16-bits or 32-bits registers will need to be at an offset that is a multiple of 2 or 4 (for 16-bits and 32-bits) with respect to the pack base.
If you think this is useful I can implement that; this is not much work.
-
So taking into account these limitations I believe we still need to maintain the current Register implementation. But this raises some questions/remarks:
- How common are groups of registers with mixed sizes? I looked at reference manuals for some MCUs and this seems rather uncommon (but it does exist). I suspect that such cases do not need the pack implementation (in other words there are most likely not subject to the hot path issue raised above).
- For a given peripheral/device it is still possible to define a register pack and the remaining registers as standalone.
- This seems to be very MCU and application dependent on top of being somewhat compiler dependent (as mentioned without Thumb mode the offset issue was not present) so keeping the two implementations seem like the best choice in my opinion.
- Now I understand that this makes the API somewhat sub-optimal: the user has to make a decision between using a register pack or standalone registers and performance might be affected by the decision (hence the need to at some point move to SVD scripting; see below).
This also start to mean we have to think about interfacing cppreg
with SVD ... One of the original idea was to leverage the cmsis-svd scripts to parse SVD files and generate cppreg
-compliant headers. This will significantly simplify using the library. But this first requires to stabilize the API and I believe this issue is last item on the list.
Here is a rephrasing of the question underlying the current limitation:
cppreg
is currently handling registers with sizes up to 32 bits (side note: it only take adding a trait specialization to implement 64-bits support but this is not present at the moment); if we have the guarantee that the address of any register pack will be 4 bytes aligned (i.e., 32 bits aligned) then we can safely implement mixed-size support.
This obviously boils down to: for a N-bits MCU are the base addresses of the various peripherals (N/8) bytes aligned?
So after a bit of research it seems that in most cases, peripheral registers will be aligned in ways that makes it possible always use RegisterPack (granted we add mixed-size support):
- ARM recommends that independently of the size, registers should be 32-bits aligned,
- MSP430 memory map has a region for 16-bits registers and another one for 8-bits registers.
It seems we can then assume this will the most generic case.
@hak8or, to keep it moving here is my proposal:
- I will implement mixed-size support in the RegisterPack API.
- This implementation will detect alignment issues at compile time.
- The current Register API will be maintained if only to provide a fallback for architectures with odd registers setup.
Ok so the final implementation is in the register_pack
branch. Here is a somewhat artificial but meaningful example.
Next steps:
- If we agree on the current register pack API I will merge it; I still believe we should keep the
Register
type itself for (unexpected and probably very uncommon) cases in which the register pack one cannot be used. - Documentation needs to be updated and should reflect that the best approach is now to use register pack when applicable.
Oh dang, lots of activity, awesome!
Regarding register sizes, I can confirm that across a few different MCU's from different vendors (only looked at ARM Cortex-M) the registers are all 32 bits large (and 4 byte aligned), with usually the upper bits marked as reserved or read only (though sometimes mentioned that they have to be kept set to 1).
Given the edge cases you described, I agree that the none packed register API should be kept. Furthermore, I think the new packed registers API is solid for now. When updating the performance comparison (hopefully finishing it up by the end of today, barring any other potential issues), I will have a better idea of if there are any pain spots in the new API.
Unrelated notes:
- I remember you mentioning that you had a test suite for cppreg, though I do not see it available in this repo. I assume it's not available under a permissible license?
- Regarding svd, I totally agree that we should be able to use the svd scripts to generate the headers, and document how to do so. I can work on that (those are python scripts if I recall).
I remember you mentioning that you had a test suite for cppreg, though I do not see it available in this repo. I assume it's not available under a permissible license?
This is not really a licensing issue but this was more to avoid bloating the repository. The test suite (based on googletest) requires some mock up—which I hope to have the time to describe at some point—to be able to run on desktop. In addition, this relies heavily on some CMake boilerplate (see that and this). So this is more a time issue as this will require some documentation. But ideally I would prefer to put the test suite in a dedicated repository. cppreg
is designed for embedded projects and I am not sure it would be good to add to the repository desktop testing code.
Regarding svd, I totally agree that we should be able to use the svd scripts to generate the headers, and document how to do so. I can work on that (those are python scripts if I recall).
Once we stabilize the API and finish the performance review I agree that this should be the next step. The scripts in cmsis-svd provide all the necessary tools (parsing and assembling data). So what will be left is mostly writing a cppreg
generator.
RegisterPack todo list
Ok let's gather here what is needed before I merge the register pack implementation. This requires to take some decisions.
- Implement an enumeration type for register width (@hak8or) to limit the possibility of typo/error and enforce supported widths.
- RegisterPack takes as template argument the size of the pack in bytes; we keep it this way or we decide that the size should be in bits.
- PackedRegister takes as template argument the offset with respect to the pack base in bits; we keep it this way or we decide that the offset should be in bytes.
- Naming: should we change RegisterPack and PackedRegister to something else (Peripheral and PeripheralRegister, Device and DeviceRegister) ...
- Alignment checking is currently done in a "naive" way; this should be revised using a safer implementation and the limitations should be documented (in a pack, of register of size N bits can only be defined is the pack base is aligned on a N bits boundary and if the register address is at an offset multiple of N).
Ok considering this list I decided to start a new issue (#12). This will be easier.