wchill/SwitchInputEmulator

Recommended handshaking for a Python Client


Thanks for working on this.

I am trying to write a Python client that will sync with the AVR and issue commands via serial. (I chose this route because I didn't want to use your Qt project.) I had a question about the state machine you implemented in the AVR logic.

The state machine in the AVR code waits for a start sync command, and then responds with a message back to the client. It goes through a number of intermediary sync stages until it reaches an "in sync" state. At this point, it starts listening for a packet of 9 bytes - 8 bytes for joystick data, 1 byte for CRC8. (Brilliant, BTW.) If it receives a properly formatted packet, it will send a new USB packet and then send a USB ACK message back to the client. If the CRC8 check fails, it will send a NACK message back to the client.
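
For reference, here is a minimal sketch of what building that packet looks like from the Python side. The byte order follows the firmware's unpacking (button high byte, button low byte, HAT, LX, LY, RX, RY, vendor byte). The CRC8 routine below assumes the avr-libc CRC8-CCITT algorithm (polynomial 0x07); confirm against the AVR source before relying on it.

def crc8_ccitt(data):
    """CRC8-CCITT, polynomial 0x07 (assumed to match the AVR side)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

def build_packet(buttons, hat, lx, ly, rx, ry, vendor=0):
    """Pack the 8 joystick data bytes and append the CRC8 (9 bytes total)."""
    data = bytes([(buttons >> 8) & 0xFF, buttons & 0xFF,
                  hat, lx, ly, rx, ry, vendor])
    return data + bytes([crc8_ccitt(data)])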

My question: If my Python script is set up to be an "open serial / sync / send packets / close serial" procedural script rather than a service, what is the best way to return to the sync state every time? Should I just try to send a packet, and if I get a NACK or a start sync response attempt to sync?

Hi @IBNobody ,

Please see issue #2. I describe how to reliably sync with the AVR in that issue.

(As for the CRC8, I had to add it because I noticed that very rarely some packets would come in corrupted, which would of course mess everything up - I even saw the repercussions on stream for some other people who decided to do a Twitch Plays when their code went out of sync. It was pretty funny.)

Thanks. That helped and will let me improve my Python script. I will submit a pastebin of a template here when I am finished with it for others referencing this issue.

Sure, if you open a PR I'd be happy to accept the check-in (the Python code I have in the repo is from a really old version).

https://pastebin.com/Ni0A0bGV

I will do a PR once I vet this later.

However, one troubling thing I found was that the system was not as high-performing as I was expecting. I recently made a jump rope routine for FFIX using an alternate Teensy++ payload, and I was able to get better timing with it than I am with this implementation.

With your current implementation, it takes 4.3ms to send a packet when connected to my PC, and roughly 8ms when the Switch is connected. The bulk of the PC communication time is tied to the baud rate. (Increasing the baud rate, sending single bytes, and ignoring incoming traffic all reduced the communication time.) I am not sure why the Switch takes longer. Does the Switch not poll its USB as fast?

I will look into it more, but one idea that came to mind was to reduce the size of the packet. Right off the bat, I can remove the vendor byte to shrink the packet to 8 bytes. If I make the first byte a command byte, I could have variable-length packets, which would also speed things up. (An ABXY button press packet would only need to be 3 bytes: COMMAND, ABXY, CRC.) Since we have 5 types of events (low button, high button, dpad, left stick, right stick), we would need around 60 separate commands.
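
For illustration, here is a sketch of that variable-length framing, reusing the crc8_ccitt sketch from above. The command IDs are hypothetical and not part of the current firmware.

# Hypothetical command IDs, one per event type
CMD_BTN_LOW = 0x01
CMD_BTN_HIGH = 0x02
CMD_DPAD = 0x03
CMD_LSTICK = 0x04  # payload: LX, LY
CMD_RSTICK = 0x05  # payload: RX, RY

def build_cmd_packet(command, payload):
    """Frame a short packet: COMMAND, payload bytes, CRC8."""
    data = bytes([command]) + bytes(payload)
    return data + bytes([crc8_ccitt(data)])

# An ABXY press is then only 3 bytes on the wire: COMMAND, ABXY, CRC
pkt = build_cmd_packet(CMD_BTN_LOW, [0x0F])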

Do you know of any other speed increases? Would I be able to bump the HID_Task to the serial ISR to force a USB packet?

I've done some extensive research into this to see where it's possible to reduce latency.

USB 1.1 and 2.0 work based on polling, where the host has to explicitly schedule time slots for each peripheral device. So we can't force a packet to be sent, the host has to request an update from us. (This is not true of USB 3.0 which works asynchronously, but USB 3.0 is a completely different beast and I'm not sure if HID would even work via USB 3.0 instead of downgrading to 1.1.)

The Switch will only ever be able to poll USB HID devices at 125Hz (8ms), unless something has changed since the last time I checked (firmware 6.0). As far as I know, this is something that is not possible to overcome without either avoiding USB entirely, or without modifying the Switch's USB drivers. So this will always be a hard cap.

On the PC side of things, I would expect around 1ms of poll latency (1000Hz poll rate) and the rest would be transfer time. At 19200bps, 9 bytes take about 3.75ms to transfer, plus 0.5ms average latency (depending on when you transfer, you could be really close to the next poll or you might have just missed it, so we average), giving us about 4.25ms. At 115200bps, 9 bytes would take about 0.6ms to transfer, plus the 0.5ms latency, so roughly 1.1ms on average. So we can get a cheap win there (I personally forgot why I set it to 19200 and not 115200).
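
The arithmetic as a quick sanity check (these figures count 8 bits per byte, matching the estimates above; with standard 8N1 framing each byte actually costs 10 bits on the wire, so real transfer times run a little higher):

def transfer_ms(num_bytes, baud, bits_per_byte=8):
    """Serial transfer time in milliseconds."""
    return num_bytes * bits_per_byte / baud * 1000.0

POLL_MS = 0.5  # average wait for the next 1000 Hz host poll

print(transfer_ms(9, 19200) + POLL_MS)   # 3.75 + 0.5  = 4.25 ms
print(transfer_ms(9, 115200) + POLL_MS)  # 0.625 + 0.5 = ~1.1 ms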

You're right that we can drop the vendor byte, but we would still need to keep a mechanism that allows a fully atomic update when you need to change inputs for everything (buttons, dpad, sticks). I guess we can repurpose the vendor byte for this, since we're just hardcoding it to 0 right now anyway.

But yeah, unfortunately it's not going to be possible to drop the worst-case latency below about 9.2ms (8ms Switch poll + 0.2ms transmit for 3 bytes + 1ms PC).

If you do need something significantly faster, the Joycons when connected to the Switch communicate via ~3Mbps UART, but you are going to need much better hardware to support that data rate and will likely need to either modify your Joycons or buy Joycon rails and rig something up.

There is a lot more you would have to emulate to get that to work, and it's something I've been exploring but have not had enough time to delve into. I successfully emulated part of the custom protocol, and it might be enough to get something going with attached fake Joycons, as the underlying protocol is the same no matter what communication medium is used (USB/BT/hardwired). But since I haven't tried it, I can't say for sure.

Another thing to keep in mind is that this implementation fires and forgets on the PC side, so because of the Switch polling you could see anywhere from 0 to 8ms of latency. If the variability is the problem, as opposed to the actual latency number, you might want to look into uncommenting the line that sends 0x91 bytes on success and synchronizing on that byte before sending out inputs.
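
A sketch of that synchronize-on-ACK idea, assuming pyserial with a short read timeout; the 0x91 value is the success byte mentioned above, so verify it against the firmware source.

import time

RESP_USB_ACK = 0x91  # success byte noted above; confirm against the AVR code

def send_packet_synced(ser, packet, timeout=0.05):
    """Send one packet and block until the firmware reports it went out over USB."""
    ser.write(packet)
    deadline = time.perf_counter() + timeout
    while time.perf_counter() < deadline:
        b = ser.read(1)  # ser is a pyserial Serial opened with a small timeout
        if b and b[0] == RESP_USB_ACK:
            return True
    return False  # no ACK seen; the caller may want to resync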

Very thoughtful and useful responses!

(I personally forgot why I set it to 19200 and not 115200).

It is likely because the MCU you are using cannot output the baud rate you want: the base clock divider introduces an error between the target baud rate and the actual baud rate.

https://www.pjrc.com/teensy/td_uart.html

The baud register formula, Baud = F_CPU / (16 * (x + 1)), has some error-free values at 50k, 62.5k, 100k, and 125k. Since I use an FTDI USB-to-serial adapter, the host end has much more flexibility with baud rates. I will give those even values a shot.
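
A quick way to see which rates the divider can hit cleanly, assuming the usual 16MHz clock (which matches the error-free values listed above):

F_CPU = 16000000  # assumed 16 MHz (Teensy++ 2.0)

def baud_error(target):
    """Actual baud and % error for Baud = F_CPU / (16 * (x + 1))."""
    x = round(F_CPU / (16 * target)) - 1  # nearest integer register value
    actual = F_CPU / (16 * (x + 1))
    return actual, 100.0 * (actual - target) / target

for rate in (19200, 31250, 50000, 62500, 100000, 115200, 125000):
    actual, err = baud_error(rate)
    print('%6d -> %8.1f (%+.2f%%)' % (rate, actual, err))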

I think what I need to strive for is consistent (monotonic) timing rather than raw speed. The FFIX jump rope minigame was killer on button presses, but even at its fastest speed it required a period of 383ms for a double-jump event (press A, release A, press A, release A, wait to pad out to 383ms). On the host end (and not using your payload), I was able to get some accuracy gains in Windows/Python by using time.perf_counter(), achieving a host-level accuracy of 0.3ms. However, I don't know what the accuracy on the Switch side was, because I did not have any polling feedback.
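
For example, the pacing I am after looks roughly like this; press_a() and release_a() are hypothetical wrappers around the packet send, and the hold times are made up for illustration.

import time

def busy_wait(seconds):
    """Spin on perf_counter for sub-millisecond accuracy."""
    t = time.perf_counter()
    while time.perf_counter() - t < seconds:
        pass

t_0 = time.perf_counter()
press_a()       # hypothetical helper wrapping a packet send
busy_wait(0.05)
release_a()
busy_wait(0.05)
press_a()
busy_wait(0.05)
release_a()
busy_wait(0.383 - (time.perf_counter() - t_0))  # pad out the 383 ms period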

I will continue the experiments tonight (without implementing the packet changes yet) to see if I can measure how much timing error is in the system. (I need to figure out a way to get feedback from the Switch.)

#6 created.

31250 was the highest baud I could get. I will scope out the lines and see if I can see any issue.

I ran into another handshaking issue that has me scratching my head, and the solution may end up being an edit to the C code.

If I am synced, the MCU constantly spams RESP_USB_ACKs at me every time the HID packet is sent. If I purposefully send a packet with a bad CRC, the RESP_UPDATE_NACK gets lost in the flood of RESP_USB_ACK packets.

I corrected this behavior in the joystick.c file by adding a boolean latch that keeps the HID_Task from spamming ACKs.

            } else {
                // Everything is ok
                buffer.Button = (usbInput.input[0] << 8) | usbInput.input[1];
                buffer.HAT = usbInput.input[2];
                buffer.LX = usbInput.input[3];
                buffer.LY = usbInput.input[4];
                buffer.RX = usbInput.input[5];
                buffer.RY = usbInput.input[6];
                buffer.VendorSpec = usbInput.input[7];
                // Latch the ACK instead of sending it here
                // (send_USB_ack is a new file-scope bool, initially false)
                send_USB_ack = true;
                // send_byte(RESP_UPDATE_ACK);
            }
        // We'll then populate this report with what we want to send to the host.
        disable_rx_isr();
        if (state == SYNCED) {
            memcpy(&JoystickInputData, &buffer, sizeof(USB_JoystickReport_Input_t));
            // Send exactly one ACK per received packet, not one per HID report
            if (send_USB_ack) {
                send_byte(RESP_USB_ACK);
                send_USB_ack = false;
            }
        } else {
            memcpy(&JoystickInputData, &defaultBuf, sizeof(USB_JoystickReport_Input_t));
        }
        enable_rx_isr();

Is this the easiest solution, or is there a better solution I can implement on my client end?

Also, I will need to submit another PR. I discovered a bug in my read_byte_latest() function.

That would be fine if the behavior you're looking for is "the last packet you sent is correct" instead of "I sent a packet to the Switch successfully".

I personally just read all available bytes and checked to see if any was a NACK, and then resent the last state if one was found.

I personally just read all available bytes and checked to see if any was a NACK, and then resent the last state if one was found.

I do not think that will work for my script, because it would require me to wait a few ms and time out. What was the reason you commented out the RESP_UPDATE_ACK statement? That approach would give me the same functionality, but sending only one RESP_USB_ACK has the added benefit of knowing when the packet went live.

In other news, my optimization is not going well; it has gotten worse. A full packet send takes 16ms, and the bulk of this is caused by reading the packet response. I noticed an odd behavior in my Python script: if I use time.perf_counter() as a wait method, the full packet send time varies between 5ms and 21ms. If I remove my time.perf_counter()-based wait method, I am back to a steady 16ms. I wonder if this is caused by time.perf_counter() not using the true performance counter on Windows 10. This is causing quite a bit of jitter in my button presses and has prevented me from using my script to jump rope.

I commented out the ACK because it made debugging hard due to all the noise, and to be honest I wasn't really using it. I did previously have something similar to your approach, where I would only send an ACK after successfully receiving a packet; I just don't remember why I changed it.

Regarding the timer, I know Windows 10 does some funky things with timers, but time.perf_counter() seems to have good enough resolution that I don't think that's the issue (tested empirically).

What's the approach you're taking when not using time.perf_counter()?

So, I think I am falling victim to a Win10 issue with the 16ms PC RTC tick and the pyserial read.

In the snippet below (latest Python, latest pyserial), if I uncomment any of the statements between t_1 and t_2, the time delta between t_0 and t_1 changes!

import time

def p_wait(wait_time):
    """Wait x seconds (precise)."""
    t_0 = time.perf_counter()
    t_1 = t_0
    while t_1 - t_0 < wait_time:
        t_1 = time.perf_counter()

def testbench_packet_speed(count=100, debug=False):
    """Test packet speed."""
    testbench_sum = 0
    testbench_min = 999
    testbench_max = 0
    testbench_avg = 0
    testbench_err = 0

    for _ in range(count):

        # Send packet and check time (send_packet is defined elsewhere in the script)
        t_0 = time.perf_counter()
        status = send_packet(debug)

        t_1 = time.perf_counter()
        # p_wait(0.01)
        # time.sleep(0.01)
        # input('key')
        t_2 = time.perf_counter()
        print('%.6f' % ((t_1 - t_0) * 1000), '%.6f' % ((t_2 - t_0) * 1000))

        # Count errors
        if not status:
            testbench_err += 1
            print('Packet Error!')

        # Compute times
        testbench_delta = t_1 - t_0
        if testbench_delta < testbench_min:
            testbench_min = testbench_delta
        if testbench_delta > testbench_max:
            testbench_max = testbench_delta
        testbench_sum = testbench_sum + testbench_delta
In the case of p_wait(0.01), I get the results below...

15.990600 25.992300
5.851800 15.853600
5.852200 15.854900
5.868000 15.870300
5.906400 15.908700
5.870200 15.872400
5.807900 15.811200
5.845300 15.847800
5.911600 15.913700
5.864300 15.866200
5.868700 15.870900

I think what is happening is that I am shifting the pyserial read to a different RTC tick after the first iteration.

I'm not sure I have a solution to this other than to spawn a thread that reads the serial input buffer in the background like you were doing.
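
A sketch of that background-reader approach, assuming pyserial: a daemon thread keeps draining the OS serial buffer so the main loop never blocks on a read.

import queue
import threading

def start_reader(ser):
    """Spawn a daemon thread that drains ser into a queue of response bytes."""
    responses = queue.Queue()

    def reader():
        while True:
            b = ser.read(1)  # ser opened with a short timeout
            if b:
                responses.put(b[0])

    threading.Thread(target=reader, daemon=True).start()
    return responses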

That was it. I am marking this as a success and am closing this issue.

To recap...

  • I took the stock joystick.c and made it send only one USB ACK per received packet, and I moved the HID task into the RX ISR. Finally, I bumped the baud rate to 31250.

  • I changed the latency timer setting in my FTDI USB serial port driver from 16ms to 1ms.

  • I modified my Python p_wait routines to use a balance of time.sleep() and busy-waiting, since the Win10 sleep resolution is about 1.5ms (versus roughly 16ms on Win7). Finally, I began using a timer function (p_timer) that ensures the packet writes are evenly timed. Both are shown below.

def p_wait(wait_time, min_sleep_time=0.002):
    """Wait x seconds (precise)."""
    t_0 = time.perf_counter()
    if wait_time > min_sleep_time:
        # Coarse sleep first, then busy-wait the remainder
        time.sleep(wait_time - min_sleep_time)
    t_1 = time.perf_counter()
    t_2 = t_1
    while t_2 - t_0 < wait_time:
        t_2 = time.perf_counter()
    return t_2 - t_0

def p_timer(t_0, wait_time, min_sleep_time=0.002):
    """Pad time to x seconds after t_0 (precise)."""
    t_1 = time.perf_counter()
    remaining = wait_time - (t_1 - t_0)
    if remaining > min_sleep_time:
        # Guard against passing a negative value to time.sleep() when the
        # work since t_0 has already consumed most of the period
        time.sleep(remaining - min_sleep_time)
    t_2 = time.perf_counter()
    while t_2 - t_0 < wait_time:
        t_2 = time.perf_counter()
    return t_2 - t_0
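
Usage looks something like this; send_packet() stands in for the script's own send routine, and the 8ms period is chosen to line up with the Switch's 125Hz poll.

PERIOD = 0.008  # one Switch poll interval

t_0 = time.perf_counter()
for state in input_sequence:  # hypothetical list of controller states
    send_packet(state)
    p_timer(t_0, PERIOD)      # pad this iteration out to exactly 8 ms
    t_0 = time.perf_counter()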

@IBNobody Are these modifications available anywhere? I am encountering similar timing issues.