raspberrypi/linux

[regression] alsa system call blocks on record between 5.4.83 and 5.5.19

HinTak opened this issue · 22 comments

I am looking at a device driver issue ( respeaker/seeed-voicecard#290 ); the later part of the investigation was all conducted with a fresh raspios_lite_armhf-2021-03-25/2021-03-04-raspios-buster-armhf-lite.zip, just swapping the kernel: raspberrypi-kernel_1.20210303-1 (5.10.17) and raspberrypi-kernel_1.20210108-1 (5.4.83), and the last of the rpi-5.4.y (also 5.4.83 but a dozen of commits after raspberrypi-kernel_1.20210108-1), rpi-5.5.y, rpi-5.6.y I built. The driver code does not change between 5.4 and 5.6, so I am 100% sure it is a regression within the raspberry pi kernel between 5.4.83 and 5.5.19 (the last of rpi-5.4.y and rpi-5.5.y). I just don't know what caused it yet.

Here is what I know so far:

  • the breakage happens between 5.4.83 and 5.5.19 (the last of rpi-5.4.y and rpi-5.5.y).

  • snd soc core seems to finish loading the driver even in 5.10.x - The driver is able to reach the end of soc_init_pcm_runtime() -> soc_new_pcm() in snd_soc sucessfully even in 5.10.x . I chased a red herring through soc-pcm due to 1d5cd52 ( a message changed from dev_info to dev_dbg and got suppressed).

  • it seems to be devicetree / overlay related, since the 6-mic device and the 4-mics device share some hardware, and the 4-mic is still working fine. Also I accidentally used the 5.6 device tree for rpi-5.5.y and rpi-5.4.y (just missed the copy step whle downgrading). 5.5.x was able to boot with the 5.6 device tree, but 5.4 failed. (I fixed it by mounting on desktop linux, looked around, realized the problem, and copied the correct dtb and re-powered). So there are definitely some incompatibility between 5.4 and 5.5/5.6 in the device tree, even for the same hardware.

Trying to record with arecord results in pcm_read:...: read error: system call interrupted, and it seems to block on the system call from alsa.

Since the problem now boils down to differences between the last of rpi-5.4.y and rpi-5.5.y, and I can go and see what's the difference is, but is there any help / tips I can get ? Especially in device tree / soc area.

I can bisect through those branches, but seeing as I had one failed-to-boot mistake already, I assume some of rpi-5.5.y would also fail to boot; any guidance on that? (though I'd look at diffs in arch/arm/boot/dts/ first).

While looking at this, I found a previous "regression" raspberrypi/documentation#1671 had an in-tree fix #3602 a few months before it was noticed in the out-of-tree drivers. I am mentioning this because late 5.4.y has this fix (for the draws-overlays, and a few other audio devices), but the fix is not in 5.5/5.6, and is collapsed in rpi-5.7.y . So it appears that I might need to check 5.7 too, as it seems that 5.4 is better maintained than 5.5/5.6 .

I have built HEAD of rpi-5.4.y, rpi-5.5.y and rpi-5.10.y branches, after I realised rpi-5.5.y was not really worked on. Anyway, I went ahead and enabled dynamic debugging on all of them, and switched on "file sound/* +p" (the whole of the sound directory. The kernel debug messages are still identical!

When the driver overlay is being applied:

[  456.043453] snd_soc_seeed_voicecard: loading out-of-tree module taints kernel.
[  460.830943]  ac10x i2c_id number: 1
[  460.830958]  ac10x data protocol: 0
[  461.047836]  ac10x codec count  : 1
[  461.048879]  ac10x i2c_id number: 0
[  461.048893]  ac10x data protocol: 0
[  461.283648]  ac10x codec count  : 2
[  461.544868] ac10x-codec 1-0035: ASoC: dai register 1-0035 #1
[  461.544886] ac10x-codec 1-0035: ASoC: dynamically register DAI 1-0035
[  461.544907] ac10x-codec 1-0035: ASoC: Registered DAI 'ac10x-codec.1-0035'
[  461.545788] seeed-voicecard soc:sound: ASoC: binding bcm2835-i2s-ac10x-codec0
[  461.545981] ac10x-codec 1-0035: ASoC: adding Playback widget
[  461.546000] ac10x-codec 1-0035: ASoC: adding Capture widget
[  461.560011] input: seed-voicecard-headset as /devices/platform/soc/fe804000.i2c/i2c-1/1-0035/input/input0
[  461.572485] seeed-voicecard soc:sound: ASoC: probe seeed-8mic-voicecard dai link 0 late -2
[  461.572505] seeed-voicecard soc:sound: ASoC: probe seeed-8mic-voicecard dai link 0 late -1
[  461.572523] seeed-voicecard soc:sound: ASoC: probe seeed-8mic-voicecard dai link 0 late 0
[  461.572540] seeed-voicecard soc:sound: ASoC: probe seeed-8mic-voicecard dai link 0 late 1
[  461.572556] seeed-voicecard soc:sound: ASoC: probe seeed-8mic-voicecard dai link 0 late 2
[  461.572576] ac10x-codec 1-0035: ac108_set_sysclk freq = 24000000 clk = 0
[  461.572590] ac108_set_sysclk  :24000000
[  461.585978] AC101 as Master
[  461.588134] seeed-voicecard soc:sound: ASoC: registered pcm #0 bcm2835-i2s-ac10x-codec0 ac10x-codec.1-0035-0
[  461.589364] seeed-voicecard soc:sound: ac10x-codec.1-0035 <-> fe203000.i2s mapping ok
[  461.589389] ac10x-codec 1-0035: Channel 1 AAF -> Capture
[  461.589403] ac10x-codec 1-0035: Channel 2 AAF -> Capture
[  461.589421] ac10x-codec 1-0035: Channel 3 AAF -> Capture
[  461.589434] ac10x-codec 1-0035: Channel 4 AAF -> Capture
[  461.589445] ac10x-codec 1-0035: ADC1 -> Capture
[  461.589456] ac10x-codec 1-0035: ADC2 -> Capture
[  461.589467] ac10x-codec 1-0035: ADC3 -> Capture
[  461.589478] ac10x-codec 1-0035: ADC4 -> Capture

and each time when I try to run arecord:

[  541.302341] ASoC: ac10x-codec.1-0035 <-> fe203000.i2s info:
[  541.302359] ASoC: rate mask 0xde
[  541.302375] ASoC: min ch 8 max ch 8
[  541.302390] ASoC: min rate 8000 max rate 48000
[  541.370827] ac10x-codec 1-0035: Update DAI routes for ac10x-codec.1-0035 capture
[  541.370849] ac10x-codec 1-0035: Connecting DAI route Channel 4 AAF -> Capture
[  541.370867] ac10x-codec 1-0035: Connecting DAI route Channel 3 AAF -> Capture
[  541.370884] ac10x-codec 1-0035: Connecting DAI route Channel 2 AAF -> Capture
[  541.370900] ac10x-codec 1-0035: Connecting DAI route Channel 1 AAF -> Capture
[  541.373889] bcm2835-i2s fe203000.i2s: I2S SYNC error!
[  541.373908] bcm2835-i2s fe203000.i2s: slots: 2 width: 32 rx mask: 0x03 tx_mask: 0x03
[  541.373921] bcm2835-i2s fe203000.i2s: frame len: 64 sync len: 1 data len: 32
[  541.373935] bcm2835-i2s fe203000.i2s: rx pos: 1,33 tx pos: 1,33
[  541.373947] bcm2835-i2s fe203000.i2s: sampling rate: 48000 bclk rate: 3072000
[  541.373961] bcm2835-i2s fe203000.i2s: CLKM: 1 CLKI: 1 FSM: 1 FSI: 0 frame start: rising edge

and yet it is stuck at system call in 5.5/5.10, and okay in 5.4.

The only difference is the line:

[  461.544868] ac10x-codec 1-0035: ASoC: dai register 1-0035 #1

but apparently it was first moved then removed in:

7ca24386a7c2fb3828303b7c694cb0b4af1eac5c ASoC: soc-core: merge snd_soc_register_dai() and soc_add_dai()
71cb85f5e9da4aa3ab62020389b513275121083d ASoC: soc-core: call snd_soc_register_dai() from snd_soc_register_dais()

So I am out of ideas - unless it is this group of snd_soc_register_dai in v5.5 which broke it subtly.

@pelwell apoplogies for pinging - identical out-of-tree driver/overlay code between 5.4 and 5.5, driver side and sound soc dynamic debug messages identical (except for the one line the snd_soc_register_dai() merge), both manage to get to bcm2835-i2s hw_params; 5.4 can record, 5.5 cannot... what other part of the kernel can I switch on dynamic debugging besides "file sound/* +p" (all of sound)? and other debugging ideas? I feel like possibly posting to alsa-devel ...

I have looked at the history of rpi-5.4.y and rpi-5.5.y - they looks like they are fresh and independent side-branches of linux-stable so cannot be meaningfully bisected. It looks more like a selective reverse-patch to find what breaks...

I did something quite barbaric - since the out-of-tree driver requires no change to build against 5.4 vs 5.5 , and I suspect it is the internal rewrite and tidy up in sound/soc that broke the driver in some subtle way, I simply delete include/sound and sound in 5.5 and grafted the two subdirectories from 5.4, then make mrproper and build and install 5.5 with 5.4 sound, and I am able to restore functionality this way. Somewhere in "sound" and "include/sound" is where the regression lies.
Cc @popcornmix @pelwell

I shall next try "include/sound/soc*" and "sound/soc/"....

I have narrowed it down to just "include/sound/soc*.h" and "sound/soc/soc-*.c", which is only 14(?) files with about 14 commits , the bulk of them in https://lore.kernel.org/alsa-devel/87ftj23jph.wl-kuninori.morimoto.gx@renesas.com/ , and maybe about 4-5 authors. I'd likely end up having to modify the driver code to work around it (since it broke between 5.5 to 5.10 current) , but mostly certainly I need to post to alsa-devel, before or after I narrow it further.

I have narrowed it down to 13 commits in 14 files from 5 people. The Intel / acpi one is unlikely, but here is the details.

Time to write to alsa-devel.

$ git diff --numstat origin/rpi-5.5.y 
0       3       include/sound/soc-acpi-intel-match.h
0       4       include/sound/soc-acpi.h
9       43      include/sound/soc-component.h
9       9       include/sound/soc-dpcm.h
9       31      include/sound/soc.h
78      65      sound/soc/soc-component.c
43      15      sound/soc/soc-compress.c
687     555     sound/soc/soc-core.c
17      3       sound/soc/soc-dapm.c
107     60      sound/soc/soc-generic-dmaengine-pcm.c
9       2       sound/soc/soc-ops.c
59      82      sound/soc/soc-pcm.c
58      34      sound/soc/soc-topology.c
7       4       sound/soc/soc-utils.c

Edit: wrong incantation of git log. The change set is rather larger and it is 140 commits involving 23 people in 14 files. Luckily I didn't e-mail the wrong people...

Written to alsa-devel https://mailman.alsa-project.org/pipermail/alsa-devel/2021-April/183770.html and CC'ed the 5 people whose kernel code changes are involved. Let's see what follows on alsa-devel .

It is down to about 80 commits involving 8 files and about 10 people, and I have started bisecting. Since it takes about an hour to build, it will take about 2 days to bisect 80 commits...

It really shouldn't take an hour to do a build; I almost never do a "make clean" between builds, even when switching kernel versions - the Linux Makefiles are good like that.

@pelwell this is on the pi 4 building for itself so the timing should be similar on everybody's pi 4. Strictly speaking I am just bisecting though snd_soc_core and snd_pcm* so "make modules" should be sufficient. Somehow it does not - I am doing a "poor man's bisect", just git archive ... 8 fIles across, and somehow the module does not remade... So I am doing "make mrproper ;.copy .config ; make oldconfig"... Down to 10 commits from 80 now...

It's hard to guess why "make modules" isn't working for you, but at least the end is in sight for you.

Maybe I am being cautious - I'd like everything that depends on snd_soc_core to remade, but only itself was. Anyway, it is not too bad, just coming back to it every hour or so ... Should finish later today or early tomorrow. At the moment, it looks like one of the commits which changes soc_*_pcm_runtime() quite early in v5.5 merge window .

The problematic commit is this one. I have flipped it back and forth to make sure.

commit 4378f1fbe924054a09ff0d4e39e1a581b9245252
Author: Peter Ujfalusi <peter.ujfalusi@ti.com>
Date:   Fri Sep 27 10:16:46 2019 +0300

    ASoC: soc-pcm: Use different sequence for start/stop trigger
    
    On stream stop currently we stop the DMA first followed by the CPU DAI.
    This can cause underflow (playback) or overflow (capture) on the DAI side
    as the DMA is no longer feeding data while the DAI is still active.
    It can be observed easily if the DAI side does not have FIFO (or it is
    disabled) to survive the time while the DMA is stopped, but still can
    happen on relatively slow CPUs when relatively high sampling rate is used:
    the FIFO is drained between the time the DMA is stopped and the DAI is
    stopped.
    
    It can only fixed by using different sequence within trigger for 'stop' and
    'start':
    case SNDRV_PCM_TRIGGER_START:
    case SNDRV_PCM_TRIGGER_RESUME:
    case SNDRV_PCM_TRIGGER_PAUSE_RELEASE:
            Trigger order: dai_link, DMA, CPU DAI then the codec
    
    case SNDRV_PCM_TRIGGER_STOP:
    case SNDRV_PCM_TRIGGER_SUSPEND:
    case SNDRV_PCM_TRIGGER_PAUSE_PUSH:
            Trigger order: codec, CPU DAI, DMA then dai_link
    
    Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
    Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
    Link: https://lore.kernel.org/r/20190927071646.22319-1-peter.ujfalusi@ti.com
    Signed-off-by: Mark Brown <broonie@kernel.org>

The ti.com address bounced, so I am hoping the rest of the alsa-devel people can help.

I can be double-sure when/if I unbreak 5.10 by reverting this. I'm going to give that a try now.

soc-pcm.c has since a lot between 5.5 and 5.10, but I was able to reverse the idea of 4378f1f on v5.10.27 ( rpi-5.10.y, 7fb9d00 about a week ago) with this change, and restore the functionality of the driver:

diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
index 91bf33958..20077dd8c 100644
--- a/sound/soc/soc-pcm.c
+++ b/sound/soc/soc-pcm.c
@@ -1042,7 +1042,7 @@ static int soc_pcm_trigger(struct snd_pcm_substream *substream, int cmd)
        case SNDRV_PCM_TRIGGER_START:
        case SNDRV_PCM_TRIGGER_RESUME:
        case SNDRV_PCM_TRIGGER_PAUSE_RELEASE:
-               ret = snd_soc_link_trigger(substream, cmd);
+               ret = snd_soc_pcm_dai_trigger(substream, cmd);
                if (ret < 0)
                        break;
 
@@ -1050,8 +1050,9 @@ static int soc_pcm_trigger(struct snd_pcm_substream *substream, int cmd)
                if (ret < 0)
                        break;
 
-               ret = snd_soc_pcm_dai_trigger(substream, cmd);
+               ret = snd_soc_link_trigger(substream, cmd);
                break;
+
        case SNDRV_PCM_TRIGGER_STOP:
        case SNDRV_PCM_TRIGGER_SUSPEND:
        case SNDRV_PCM_TRIGGER_PAUSE_PUSH:

I agree with the idea of 4378f1f though, so I think the seeed studio respeaker driver needing the previous behavior is broken...

I agree with the idea of 4378f1f though, so I think the seeed studio respeaker driver needing the previous behavior is broken...

I'm relieved to hear that - reverting that change seems rather risky.

Yes, I think the seeed studio respeaker driver needs fixing... For now, I think I have spend enough time of this and would leave it stew for a bit on alsa-devel, and see if I get some helps/tips from the alsa folks.

@pelwell I have an ugly-but-functional change to the driver https://mailman.alsa-project.org/pipermail/alsa-devel/2021-April/184023.html , could do with some suggestions on how to achieve the same result in a more elegant way. If you have a moment and clued up about devicetree's, can I trouble you to have a look to see how it might be done better?

I can't see how a DTS change could solve a problem which currently seems to need an explicit call between drivers - sorry.

@pelwell the device is somewhat strange in using the playback codec chip's clock for recording too (for echo cancelation kind of usage where recording happens together with playback). Anyway, I think I have a less-ugly way of doing it, which is modifying the set_clock prototype and passing back the dai to the recording codec to extract the playback codec struct. This way at least the dai link code does not include /link or call the codec code directly.

Thanks for taking the time to look at it.

@pelwell I have come up with a different way of getting the same result: HinTak/seeed-voicecard@19067f3 . The more I look at the code the less happy I am with the way the vendor scattering start/stop code across the codec/dai-link, but that's for another time...