SpenceKonde/DxCore

SPI and ISRs -- Why? I have the answer for you.

xxxajk opened this issue · 18 comments

xxxajk commented

First off, thank you for continuing the development to support these chips. You saved me a ton of development time.

So, I can explain why SPI and ISRs are actually important, and why the common claims that using SPI within an ISR is "bad" rest on misconceptions.

SPI.attachInterrupt() and SPI.detachInterrupt(): These allow one to write a library that avoids polling and instead simply reads the result, storing it for use elsewhere. This has been very handy when using SPI in client mode, since you are expecting to handle the data elsewhere; you may have a long-running stretch of code, and you don't wish to lose anything. You do that by capturing the information and processing it later, much the same way Serial does. It also allows you to send data in host mode with queuing. Remember, SPI doesn't actually require the use of the /SS line, and the communication can be one-way.

usingInterrupt(): This is super important when you don't want anything to disturb the sequence of the data stream, be it a collision or something else. Consider that most neopixel-type addressable LEDs can be quite finicky if a delay lands in the wrong place; you don't want an ISR to mess with that. Secondly, it allows multiple devices to operate whether or not they are in an ISR. I can give you several cases where this becomes quite important: UHS2, used under multitasking (such as my XMEM library), UHS30, various WiFi chips, etc. The actual list is pretty large, and I realize that most MCU coders either don't have the experience to avoid ISR deadlocks, or don't have the correct way of thinking about code flow. That's where developers like me come in: we handle all of the messy parts, hidden away in an ISR. Polling is crap: it can cause missed data, missed timing requirements for a device on SPI, or a collision when some code uses an ISR to handle a specific device on an SPI multidrop while other code does the stupid thing of polling, and loses the event and the data. I have a similar argument over heap, and yes, I do use new/delete/malloc/free in ISRs; I even include a replacement, or tie into the libc functionality. RTOSes on MCUs are pretty popular as well, so it makes sense to do these things.

Here is a sample of very popular libraries that use this convention; some of them I either wrote or helped to write.
xmem2
UHS2.0
UHS3.0

There are others, such as Ethernet, too.
So I suggest that you keep the above functionality. It's kinda important, and you aren't the first one to whine and question the rationale. The fact remains that it's useful, and in a lot of cases a requirement.

In order to help you become educated on the "controversy", and for a complete discussion of the topic containing the relevant links to the Arduino mailing list, go read this:
SPI sharing between interrupts and main program

I see the utility of attachInterrupt() and detachInterrupt() (which are entirely undocumented) -- but I'm curious how you would make an ISR that forms a coherent system with the tools that SPI.h provides. I mean, there's never been an ISR provided for it, so as it was, attachInterrupt() would cause the next SPI transfer to jump to the undefined vector and from there to badISR, which jumps to 0 -- which (prior to 1.4ish) was an ungraceful crash into a broken state where no interrupts would run and peripherals were in different states than initialization code assumed. Since whatever version the dirty reset detection went in, that is caught and a software reset is fired* -- but neither of those achieves the goal...

So the way you're expected to use those member functions is that you write your own ISR and... how exactly does that interact with the SPI library? That seems un-Arduino-like in terms of the depth of code it would require, and I'm not sure how to make it into something coherent with the SPI.h API. I'm also not sure where in any of those massive libraries I'd look for that part. And I wonder how doing it that way makes things any easier: the ISR will have to be specific to the architecture, so putting the attach/detach functions (which don't work without advanced programming) into SPI.h fails to serve the purpose of SPI.h -- that being to bundle the architecture-dependent SPI-related stuff up so SPI-using libraries have a standard, portable API -- since architecture-specific ISR code is still needed.

When I made that change to the library, I had no idea what the intended use case was (Are those documented anywhere? If so, where? They're not documented on the Arduino website's SPI.h documentation)

As for usingInterrupt() -- the problem here arises for the same reason that makes attachInterrupt() suck so much more on these parts.

On classic AVRs, there were two kinds of pin interrupts, INTn and PCINTn; a part would have a PCINT on all or most pins, but only a few INTn interrupts (sometimes only one). attachInterrupt() and SPI.usingInterrupt() only worked on INTn interrupts, which kept things simple and let attachInterrupt be used without blocking off all pin interrupts. Modern AVRs dispense with the two-tiered pin interrupt system and merge the two into one, by making the all-pin PCINT-like functionality trigger on rising/falling/low/change instead of just change. But this leaves no coherent way to implement attachInterrupt without blocking off every pin interrupt.

I rewrote attachInterrupt with great difficulty, in assembly written with aggressive optimizations, to prevent most of the duplication of code that would be present when keeping a separate copy of the ISR for each port, splitting out each port's stub ISR (which bypasses calling conventions to reach the initial dispatch function). The eventual dream was that I would be able to use things like __builtin_constant_p(pin) and so on to "reference" only the file for the relevant port (as long as only constant-foldable values were passed as the pin), and thus allow user interrupts on other ports. One of the most compelling of those is the one-line do-nothing ISR that is the natural choice for waking from sleep when you have no other interrupts on that port (which is the majority of the time, even if you're not trying to design for this, and it's almost always trivial to arrange if you are) -- something like ISR(PORTx_PORT_vect) {VPORTx.INTFLAGS = VPORTx.INTFLAGS;}. It pissed me off that if I used any library that used attachInterrupt, then instead of that interrupt -- which compiles to push push in push in out pop out pop pop reti, 17 clocks and 11 words, in the compiler-generated case; hand-written asm can exploit the fact that it needs just one register and that neither in nor out changes SREG, condensing it to a naked ISR of just push in out pop reti, 9 clocks and 5 words -- I would get a monstrosity that took hundreds of clocks regardless of what my ISR did. Unfortunately, I wasn't able to achieve that primary objective, only the secondary objective of greatly reducing the flash and time overhead. I still think it's possible, but I have not yet gained the necessary depth of understanding to make it happen.

Anyway - it turned out that this broke under the new attachInterrupt logic (and in fact it may have never worked on mTC). As I looked at it, an inherited-from-stock-core function at that point, I grew increasingly disgusted with the implementation, and as a fix was needed urgently, I just made all pin interrupts do the global-interrupt thing. I intended to come back to this, but had kinda forgotten about it. I see better ways to implement it now -- while nothing I'd call performant, they would be more than twice as fast as the horror that was originally implemented.

* A dirty reset is when execution reaches 0x0000 but no hardware reset has occurred. That happens when some idiot jumps to 0x0000, the stack is smashed, a bad or null function pointer is called, or an interrupt is triggered for which no ISR exists (in that last case, on a modern AVR, interrupts will not run even when enabled, because RETI has not been executed, so CPUINT still has LVL0EX set and won't run other non-elevated interrupts). Dirty reset detection on my cores checks the reset cause flags: if they're non-zero, it writes that value back to the flag register to clear the flags, then stashes it in GPIOR0 for user code that needs it. If no flags are set, no hardware reset has occurred, and we immediately issue a software reset to convert the dirty reset into a clean one.

So, I agree on usingInterrupt(), though I still disapprove of that sort of thing. I may make it conditional on having at least 8k of flash (this library is the same one that mTC gets, and the 2/4k people would give me hell), because the overhead will be non-negligible.

SPI.attachInterrupt() and SPI.detachInterrupt(): I'd like to get confirmation from you about what I described above.

I will note that you are the first one to mention this in well over a year.

xxxajk commented

I am currently working on a project that (if I can get the events, timers, and LUTs to cooperate) will not only restore the "undocumented" attach/detach, but also include an extra parameter to enable client mode.

attachInterrupt() shouldn't take any args; it should simply turn the interrupt on (and detachInterrupt() turn it off).
For smaller RAM/flash parts, sure, don't make it available, that makes total sense.

Explanation:
Client mode is needed because -- it's a client :-)
Events/logic is so I can control the client as if it were a master, and do all of this using only 3 wires (2 data + ground), not 4.
TX is connected to RX (via some external logic) -- not going to explain here; just tie them together at both ends and this will work for 2 nodes. More than 2 nodes, or long wires, requires the logic (which is basically a differential driver that loops back).

Communication is such that I only need 7 bits, not the full 8 bits of SPI, and it's synchronous.
With those facts, I can use the first bit as an arbitrator/signal to start the "client clock", run it for only 8 pulses, then stop.
If I want to send something, I can just put what to send in the SPI out buffer and trigger an event, which will cause it all to happen for that mode.
I can see if what I transmitted matches what came back, and if not, resend it, because some other node sent at the same time and stomped on the transmission. :-D
There are more details, but that's the basics of the design -- enough to tell roughly what I am doing, without going into unimportant detail.

NOTE-- using an AVR64DD14, if I end up needing more pins, I'll go with the 20... I shouldn't have to though.

So far:
I have been able to get TCA0 set to the desired rate (1MHz), and use that to event clock TCB0.
I can have TCB0 stop after the desired clock count (which is 8 clocks), or at least the count register says that's what it is doing...

The issue is that LUT1 isn't "seeing" the WO from TCB0; instead it emits a simple pulse, so I might have to route events to the LUTs and use an S/R latch instead? Haven't gotten that far yet, but WO totally doesn't track in the LUT.
What is it actually tracking?!?! The first pulse? Something else? Why is it repeating if I never triggered the event via software or hardware?? Questions, questions... Observations after the code.

Here's the basic code, minus the SPI stuff; it just outputs to pins (/SS and SCK have to do that anyway).

#include <Arduino.h>
#include <Event.h>
#include <Logic.h>

#define LINK_SPEED 125000LU // actual link speed, Multiply this by 8. 125000 = 1mhz clock
#define PHY_LINK_SPEED (8LU*LINK_SPEED) // the actual speed that SPI clocks at.

void setup() {
        Serial.begin(115200); // debug
        // delay, so that we have enough time to view output
        for(unsigned int i=0; i< 40000; i++) {
          Serial.flush();
        }
        Serial.print("\r\n\r\nStarting...\r\n");
        Serial.flush();

        pinMode(PIN_PD4, INPUT_PULLUP); // for now, but when using SPI:  PD4 RX (MOSI) -- needs to detect falling edge event
        pinMode(PIN_PC1, OUTPUT);
        pinMode(PIN_PC2, OUTPUT); // not used in this example, but is debug in my code.
        pinMode(PIN_PC3, OUTPUT);
        pinMode(PIN_PD7, OUTPUT); // for now... this will be redirected elsewhere later on (PC3) and will get the /SS signal from there

        unsigned long tempperiod = (F_CPU / (1+PHY_LINK_SPEED));
        byte presc = 0;
        while (tempperiod/(1<<presc) >= 65535 && presc < 7) {
                presc++;
        }
        unsigned int Period = tempperiod/(1<<presc);
        PORTMUX.EVSYSROUTEA = PORTMUX_EVOUTD_ALT1_gc;
        PORTMUX.TCAROUTEA = (PORTMUX.TCAROUTEA & ~(PORTMUX_TCA0_gm)) | PORTMUX_TCA0_PORTC_gc; // out to C1
        // WORKING!
        takeOverTCA0();
        TCA0.SINGLE.CTRLB = (TCA_SINGLE_CMP1EN_bm | TCA_SINGLE_WGMODE_SINGLESLOPE_gc);
        TCA0.SINGLE.CTRLA = (1 << presc) | TCA_SINGLE_ENABLE_bm;
        TCA0.SINGLE.PER = Period;
        TCA0.SINGLE.CMP1 = map(128, 0, 255, 0, Period);
        TCB0.CTRLA = TCB_CLKSEL_EVENT_gc;
        TCB0.CCMPL = 7;
        TCB0.CCMPH = 0;
        TCB0.CNT = 0;
        TCB0.CTRLA = TCB_CLKSEL_EVENT_gc | TCB_ENABLE_bm;
        TCB0.CTRLB = TCB_CNTMODE_SINGLE_gc | TCB_CCMPEN_bm;
        Event0.set_generator(gen::tca0_cmp1);
        Event0.set_user(user::tcb0_cnt);
        Event0.start();

        Event2.set_generator(gen2::pin_pd4);
        Event2.set_user(user::tcb0_capt);
        Event2.set_user(user::evoutd_pin_pd7); // debug
        Event2.start();



        Logic1.enable = true;
        Logic1.input0 = logic::in::tcb; // B0 -- supposedta be WO!!?? isn't??
        Logic1.output = logic::out::enable; // PC3
        Logic1.filter = logic::filter::disable;
        Logic1.truth = 0xFE;
        Logic1.init();
        Logic::start();
        Serial.print("OK.\r\n");
        Serial.flush();
}

void loop() {
        // continually generate trigger
        // PC3 shows a pulse of only 1 cycle, THERE WAS NO EVENT! 
        // The cycle REPEATS at ~737.5Hz. Where's that coming from???
        //Event2.soft_event(); // THIS is supposed to trigger one shot, and it does not matter
        Serial.println(TCB0.CNT);
        for(unsigned int i=0; i< 60000; i++) {
                Serial.flush();
        }
}

PC1 carries the 1MHz clock, which would be connected to SCK.
PC3 should carry the output of TCB0's WO and last 8 clocks, but....
Put a 'scope probe on PC3, which shouldn't be showing ANYTHING -- see the rage comment in loop().

Periodically print out the value of TCB0.CNT - you should get a range of values from 0 to 7. If you're only getting 0's, then the timer isn't ticking.

I suspect the issue here is that either TCBs only clock from events in some modes, or WO is not defined in the compare modes. I think it's the latter....

Ifn ya don't need millis (delay() and delayMicroseconds() still work when millis is disabled), you could use a pair of TCBs. (Or, on a DD28 -- see below for why you might want that anyway -- there's an extra TCB, so you would be able to keep millis.)

I've looked at similar methods. I do not believe that you can run in slave mode without SS being driven low during transactions and high otherwise; SS going high doesn't just disable master mode, it also resets slave mode (clearing incomplete transfers and so on).

You do realize that there aren't any AVR64DD20's in the nice little QFN, only the brick-sized SOIC-20, so if size is a priority, that's not going to be a winner. And it costs more than the AVR64DD28 in a slightly larger QFN (and only pennies less than the AVR64DD28 in SOIC packaging). The die wouldn't fit in the 3x3 QFN -- confirmed by filing the top off of an AVR32DD20 and an AVR64DD14. (I am not going to sit there filing for an hour to get into an SOIC-20; the 14- and 20-pin parts share a die, as do the 28- and 32-pin parts, so the 14 and 20 should be the same. The 16k and 32k ones in each class are the same too; they do that pretty often. So there are 4 DD dies -- and 12 DDs. My guess is that the factory "calibration" includes telling the chip what it is.) The 32DD is about as large a die as fits in the 3x3 QFN, and the 64k one is a little bit larger. (My understanding is that this is also why there's no 3x3 QFN 3217 -- they were using a less advanced process node then, so a 32k modern AVR with peripherals wasn't able to fit.) The difference between the two dies is smaller than you might expect based on the premium they charge on the 64k ones, but the fact that they use the same die for the two smallest sizes on many products is consistent with the really disappointing...

xxxajk commented

Timer stops at 7; the "running" bit claims it's not running after counting to 7. The real question is where the pulse is coming from...

Note also that the only timing I need is the 32kHz clock; delay won't be used in the final design, and serial won't either. There are plenty of pins, so I could have done it all bitbang-style, but the problem with that is the IRQ storm -- hence the attraction of the LUTs/CCL. There's also a bug someplace (possibly a missing SEI? bad vector?) where, if I run this with millis/micros disabled, it wedges. Could also be the optimizer messing things up, as usual... I have tripped over a few other bugs too, but I'm guessing the features are so little-used that nobody has noticed. Shall I just attach my findings here, or open a new report?

xxxajk commented

Here's what the scope is pooping out.
IMG_20230926_075134
IMG_20230926_075246

xxxajk commented
08:01:00.159 -> 0
08:01:00.159 -> 1
08:01:00.159 -> 1
08:01:00.159 -> 1
08:01:00.159 -> 1
08:01:00.159 -> 2
08:01:00.159 -> 2
08:01:00.159 -> 2
08:01:00.159 -> 2
08:01:00.159 -> 3
08:01:00.159 -> 3
08:01:00.159 -> 3
08:01:00.159 -> 4
08:01:00.159 -> 4
08:01:00.159 -> 4
08:01:00.159 -> 5
08:01:00.159 -> 5
08:01:00.159 -> 5
08:01:00.159 -> 6
08:01:00.159 -> 6
08:01:00.159 -> 6
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.159 -> 7
08:01:00.225 -> 
08:01:00.225 -> TCB0.CTRLA: F
08:01:00.225 -> TCB0.CTRLB: 56
08:01:00.225 -> TCB0.EVCTRL: 1
08:01:00.258 -> TCB0.INTCTRL: 0
08:01:00.258 -> TCB0.INTFLAGS: 1
08:01:00.258 -> TCB0.STATUS: 0
08:01:00.258 -> TCB0.CNT: 7
08:01:00.258 -> TCB0.CCMP: 7

CNT == CCMP == stop
INTFLAGS says it is done too
STATUS = 0 which means the clock stopped. (expected)
there should be nothing happening (but there is)

xxxajk commented

I've managed to get closer to a solution, but unfortunately, Microchip really screwed up on timers and docs, as usual.
Sure, I can get a one-shot to go -- but from software only, by setting CNT to 0. Apparently the event system doesn't reset the count when you fling it a capt event; it totally ignores the event and starts counting. Lovely, isn't it?
Perhaps I can somehow gate TCA0 (which provides the count clock) and force that into a capture mode which will reset the counter; the question is whether the compare will still emit a signal or not. Guessing "no".

xxxajk commented

There's one last detail needed: automatic switching to slave. There are several different, and all incompatible, ways to do this (there's no real standard yet). Here's what I've seen in the wild...

separate SPISlave library -- bad, because of code duplication
looking at the pin register for /SS and changing mode based on that -- bad, because a lot of code does not use the default pin
a method that enables/disables slave mode -- not too bad, only breaks on incompatible libs
an overloaded begin() that takes an optional bool defaulting to false, where true turns on slave mode -- this one makes sense.

Note that when in slave mode, transactions aren't used anymore; transfer takes data from a queue (a pointer to data with a length, reset to 0 on the /SS high edge), and if the queue is empty, either the output is disabled and reflects the IDLE state, or a default value with the idle state (e.g. 0xFF or 0x00) is preloaded. Ideally RX is queued too, spanning an entire /SS transaction.

Thoughts?

xxxajk commented

Also found the issue with the trigger for TCD. Basically, it boils down to this: the chip won't clear if the start event is already low; to start, it wants a simple pulse. The fix is to pass the event through an edge-detecting LUT :-D
This explains a few things.
The interesting thing is that the output from CNT == TOP is a pulse, which is fine.

So, I've been working on my project, and here's where the need for the ISR comes in.
First off, you don't use the TX and RX ISRs; I agree that they're useless unless you are going to be running at a very, very low rate. The correct thing to do is enable the select (/SS) interrupt, which gives you just enough time to do the full transfer without missing the first byte.

My process is the following when entering the ISR:
0x1: Disable global interrupts
0x2: Disable the SPI /SS interrupt
0x3: Enable global interrupts
0x4: Push whatever data was previously requested into the SPI
0x5: While SS is low check for an incoming byte, and when there is one, push whatever NEXT data out, and store what came in.
0x6: Disable global interrupts
0x7: Do the user callback
0x8: Enable the SPI /SS interrupt, and prepare a new receive/transmit set of buffers
0x9: Enable global interrupts
0xA: RETI

The whole trick here is that while it is in an ISR, it is expected to stall the user's code, but allow other stuff to happen such as serial, millis, etc.

blink Uhhhhh..... What you describe doesn't do what you say it does.
So you have this, essentially (names whose spelling I don't remember have ... as placeholders; the point is the locations where interrupts are turned on and off, and when they can actually run):

ISR(SPI0_vect) { /* or whatever it's called */
  cli();
  SPI0.INTCTRL = ...;   /* disable the /SS interrupt */
  sei();
  SPI0.DATA = *dataptr++;
  while (!digitalReadFast(SS)) {
    if (SPI0.INTFLAGS & ...) {
      SPI0.DATA = *(dataptr + 1);
      *(dataptr++) = SPI0.DATA;
    }
  }
  cli();
  onCompleteSlave();
  SPI0.INTCTRL = ...;   /* re-enable the /SS interrupt */
  sei();
}

Upon entering the ISR, and until a reti instruction is executed, the LVL0EX bit in CPUINT.STATUS will be set. As long as LVL0EX is set, only the level 1 interrupt vector, if configured, can be triggered, and only a single lvl1 interrupt can be specified. Therefore, the second half of your final sentence does not reflect the documented behavior of interrupts. The interrupt system does not work like classic AVRs.

The global interrupt flag is not cleared on entering an interrupt; instead, the CPUINT.STATUS LVLnEX bit appropriate for the interrupt's priority is set, and it is cleared on RETI. I don't know the exact result when RETIs and interrupts are not matched with each other, but I suspect that when a RETI is executed, it clears the highest set bit of CPUINT.STATUS. It has no way to know which RETI is associated with which vector, so all it can do is rely on the fact that if any LVLnEX bit is set, it is in an interrupt of that level, and so: clear that bit as it executes RETI, allow interrupts to trigger only when STATUS is 0 or the pending interrupt is elevated, and set the LVLnEX bit on entering an ISR based on that interrupt's priority. (I think the xmegas did it like this too -- only, being xmegas, it had to be overly complicated: they had at least 4 priorities, and I think the same less-than-useful NMI.) The more I think about it, the more I am convinced that other approaches are implausible.

Also, because of this, steps 0x1, 0x2, 0x3, 0x6, 0x8, and 0x9 have no effect on the behavior of the code other than slowing it down -- unless you have an elevated interrupt...

This core does not use elevated interrupts, and their use is very rare in Arduino circles.

Unfortunately, the stock core and MegaCoreX, as far as I know, still have the original implementation for Serial, which does use an elevated interrupt in some cases: when the TX buffer is full and you try to write more to it, and when you call Serial.flush(). This is despite there being no need for it, and despite both hangs in the field (at very high baud rates) and theoretical concerns in multiple scenarios (mostly at low bauds).

Considering it's "secret sauce", the steps are fairly close to what is done, but not exact.
Classic AVR totally allows reentrant ISRs, which is both a pain and a benefit, if you know how to manage that.
I manage it using a trick, though, where the code changes the execution path to emulate this.
There's two ways to do so, depending on the interrupt controller.
I'll use the NVIC (used on ARM; MIPS and RISC-V have equivalents) as an example -- although different, the idea is pretty much the same.
First thing we do is find an unused interrupt. It could, for example, be an IRQ not attached to anything, or one for a block of IP that isn't used in the program. On the NVIC, we can trigger these from software. We set it to the lowest possible priority (I realize this is not an option on the modern AVR). When we have something that is reentrant, or otherwise long-running, we can push a callback address to the lower-priority IRQ, trigger it via software, and just return.

For the modern AVR, it's slightly different... we need to change the code path instead.
To do this, we push our new "fake" address onto the stack for the alternate path and execute RETI, which clears the modern AVR's interrupt state -- but instead of returning to the user's code, we are now running our code. When that's done, we finally return to the user's code.

Basically we split the ISR into a top half and a bottom half. The top half services the actual interrupt and its hardware. The bottom half runs in the normal context until it is done, then returns to the top half, which returns with RET instead of RETI.

Yeah, a bit convoluted, but that's probably a better description than the terse one above. :-)

If you are at all curious on how I accomplish this on classic AVR, ARM and MIPS, I have some code on github that can show you how this is done across those platforms.

Here's a classic AVR one. This does task switching, but needs extra hardware. https://github.com/xxxajk/xmem2

...and the universal one (Classic AVR/NVIC on MIPS/ARM) https://github.com/felis/UHS30/tree/master/libraries/dyn_SWI

At some point I will add modern AVR, but perhaps you can understand what is going on more easily by looking at actual production code.

That answers my technical reaction to your scheme.

Considering how convoluted a situation that is, it does sort of circle back to my original line of comments, which is: that's a plausible, if convoluted, solution that effectively simulates reentrant code. But how does this involve SPI.h? The point of the standard libraries is to push architecture-dependent code behind the standard library so the app code isn't architecture-dependent... yet this library would have to be architecture-dependent.

It applies in two cases: either monitoring of /SS, or RX/TX via ISR for something gawd-awful slow.
Sometimes this is in the SPI API, sometimes not -- depends on the vendor. For the slow case, which is probably going to show up in an educational setting (the original intent of these boards), attach/detach is usually "good enough".
For the rest of us, an expanded API which can turn the ISR on and off would be great.
Just have it as inline asm; it doesn't even need a register (if I am remembering correctly) if you are just enabling/disabling the whole thing. Depends on whether you want it to be dirty or not, I guess. I'll have to look at AVR asm (I have to do this anyway, to optimize the current ISR I'm using).

Edit: Yeah, SBI CBI, forgot about those, that's going to be my "fix" ;-)

CBI/SBI only work on the low I/O space.

There's also nothing special about using them anymore (there sorta was on classic AVR). If you do this in C:

GPIOR3 |= 1<<n; //where n is a compile time known constant value between 0 and 7 inclusive

will compile to sbi 0x1F, n

GPIOR3 = 1<<n; //same assumptions about n

compiles to ldi (some reg), (1<<n -- a constant, so the math is optimized away), then out 0x1F, somereg.
And this

GPIOR3 |= (1<<n) | (1 <<m); //same assumptions about n apply, and to m as well. Additionally n != m

Yet because two bits were set there in one line, you get the worst result: ldi, in, ori, out -- which is NOT INTERRUPT SAFE. If an interrupt modifies that register, you must disable interrupts when you do that (not the case with the others). So you may need to wrap it in a cli/sei.

If the register isn't in the low I/O -- and that includes all the registers that enable and disable a peripheral-specific interrupt on a modern AVR -- all of these will produce the even slower lds (2 words, 3 clocks), ori, sts (2 words, 2 clocks), and it's similarly interrupt-unsafe.

On the modern AVRs the low I/O is the following:
VPORTA-VPORTG, 4 bytes per port for IN, OUT, DIR, and INTFLAGS - so you can CBI/SBI those. And you get the 4 GPIOR (GPIOR0-3) which can also be CBI/SBI'ed. But that's all. Nothing else can use SBI/CBI (see, the opcode contains 5 bits to specify low I/O register number 0 through 31 - there can only be 32 registers on an AVR that are accessible with cbi sbi sbic sbis - and 3 bits for the number of the bit, and there's one each for set clear testskipifclear and testskipifset. So each of the registers requires 4*8 = 32 opcodes, 32^2 = 1024 opcodes (1/64th of the total). AVR's allocation of opcodes is very surprising if you aren't expecting it to look like this....

  • "Immediates" -- 4096 opcodes each (16 registers times 256 possible values of the immediate), and there are 6 of them, so 3/8ths of the possible opcodes go right there.
  • STD/LDD take 8192 opcodes (plain st/ld are not significant users of opcodes -- the rest of all st/ld is something like 288 opcodes, I think). So this is another 1/8th of the opcodes, not counting st/ld.
  • rcall and rjmp are each 1/16th of the opcodes (12-bit offset, 4096 each), so together that's another 1/8th of the opcode space. Up to 5/8ths now, and we're only through 10 instructions!
  • In and Out are 32 * 64 = 2048 each, so that's another 1/16th between the two of them.
  • Conditional branches -- 8 flag-set plus 8 flag-cleared tests, each times 128 different offsets it can jump to: 16 * 128 = 2048. (This is either 16 instructions, if you consider brne and the like separate instructions, or 2 if you adopt the more reasonable position that the fundamental instructions are brbs/brbc -- the other mnemonics are just syntactic sugar.)
  • Then there are the two-operand binary instructions -- add, adc, and, or, sub, sbc, eor, and so on. The ones that can take any register pair number at least 8, at 32 * 32 = 1024 opcodes each, so that's another 1/8th of the total. Roughly 3/4ths of the possible opcodes are now accounted for by 24 instructions. The remaining ~100 instructions include 20-someodd synonyms and syntactic-sugar mnemonics (mostly the branches), leaving >60 distinct instructions to be shoehorned into the remaining ~8192 opcodes. About 1.5k are unused (generally badly fragmented). Some of these also likely do something special that Microchip doesn't tell us mere mortals about.

This explains the paucity of registers with bit-level access on AVR, as well as other painful limitations, like only having std/ldd on Y and Z (it would have taken another 4096 opcodes to do it for X too).

Classic AVRs frequently put what they expected to be used often in the low I/O, after they'd filled it with the pin-related registers -- but like most manufacturers, they're not great at predicting which features we use. On modern AVRs they effectively admitted this and chucked all the riffraff out of the low and high I/O spaces.

This is why I always disable global interrupts, then do the change, and re-enable them.