mcuee/libusb-win32

Missing data for In transfer of size >65536 bytes

sonatique opened this issue ยท 38 comments

Hello,
I discovered by accident that as soon as I want to use transfer size > 65536, for inbound transfers, I start randomly missing some pieces of data, even if the actual length of transferred data is well beyond 65536.

I am only using bulk In transfer on a USB2.0 High Speed link/device.

For instance: I am using libusb-1.0 with libusb0.sys driver (and therefore libusbK.dll in the middle).

  1. I submit one transfer with a buffer of 131072 bytes and set the Length field at this value to read data on my endpoint
  2. I get a callback saying (via ActualLength field) 1024 bytes have been transferred. Everything OK, I resubmit
  3. I get another callback saying 1592 bytes have been transferred. Everything OK, I resubmit
  4. I get another callback saying 1328 have been transferred: I can clearly see from the data that there is an large amount of missing bytes, in the order of 1000.

If I use 65536 or less, everything is fine. If I use 65536+512, or any value larger than 66536 + N * 512, I get the missing data issue.

If I use Zadig to switch to libusbK.sys or WinUSB.sys, I get no issues at all, for larger or smaller than 65536 transfer buffer size.

I can see from the comment at line 352 of transfer.c that "each bulk/interrupt irp/urb pair can request at the most 65536 bytes" but I don't see how this value is actually used in the code, I find no other occurrence of this value. Instead I can see that "maxTransferSize" is used somehow, but I do not expect maxTransferSize to be equal to 65536 on my system (Windows 10).

And even if maxTransferSize was equal to 65536, I would expect the code to work, but it doesn't seem it does correctly.

Is this a known issue / limitation?

mcuee commented

Does this occure if you do not use libusb-1.0 API, rather using libusb-win32 API or libusbK API?

As far as I know, there is no such issue if I use libusbK API (using kBench) and libusb0.sys.

Can you post the libusb-1.0 debug log (and probably also libusb0.sys debug log if you can).

I didn't try without libusb-1.0, I only tried with different drivers, but the same top-level components.
I can not easily try without libusb-1.0 since my device requires complexe initialization to work, and this is currently implemented in my main application using libusb-1.0.
I currently do not have another device I could use to test without libusb-1.0 .

I will try to post some debug logs, yes, OK.

What is the trick for getting some output from libsub0.sys by the way? It's quite frustrating that I cannot get a single line of output.
I downloaded the Debug binary package from sourceforge and carefully installed libusb0.sys in place of the existing one (I triple checked the date and size of the only libusb0.sys to be found in my system).
I use DebugView with all options checked
I cannot see a single line of output for libusb0.sys.

On the opposite side of the spectrum, libsusb-1.0 with debug log level is so verbose that it slow down my application to the point where I get timeouts I don't usually have.

What is your technique for debugging libusb0.sys?

Here is some libusb-1.0 debug logs.
First 2 are OK, the third one has missing data, but I cannot see anything wrong here.
The log stops when my toplevel application detects missing data:

[16:43:13.594 P:26184 T:001 A:00000 DBG] LibUsbFactory: [libusb_submit_transfer] transfer 00000195FA08FCB0
[16:43:13.598 P:26184 T:001 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] matched endpoint 86 with interface 0
[16:43:13.600 P:26184 T:001 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] reading 524288 bytes
[16:43:13.953 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() returned 0
[16:43:13.953 P:26184 T:017 A:00000 DBG] LibUsbFactory: [handle_event_trigger] event triggered
[16:43:13.964 P:26184 T:061 A:00000 DBG] LibUsbFactory: [windows_iocp_thread] transfer 00000195FA08FCB0 completed, length 1032
[16:43:13.968 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() for 2 HANDLEs with timeout in 60000ms
[16:43:13.974 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() returned 0
[16:43:13.974 P:26184 T:017 A:00000 DBG] LibUsbFactory: [handle_event_trigger] event triggered
[16:43:13.974 P:26184 T:017 A:00000 DBG] LibUsbFactory: [windows_handle_transfer_completion] handling transfer 00000195FA08FCB0 completion with errcode 0, length 1032
[16:43:13.983 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_handle_transfer_completion] transfer 00000195FA08FCB0 has callback 00000191003BC37C
[16:43:13.983 P:26184 T:017 A:00000 DBG] LibUsbFactory: [libusb_handle_events_timeout_completed] doing our own event handling
[16:43:13.983 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() for 2 HANDLEs with timeout in 60000ms
[16:43:13.993 P:26184 T:043 A:00000 DBG] LibUsbFactory: [libusb_submit_transfer] transfer 00000195FA08FCB0
[16:43:13.993 P:26184 T:043 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] matched endpoint 86 with interface 0
[16:43:13.993 P:26184 T:043 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] reading 524288 bytes
[16:43:14.003 P:26184 T:061 A:00000 DBG] LibUsbFactory: [windows_iocp_thread] transfer 00000195FA08FCB0 completed, length 1600
[16:43:14.003 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() returned 0
[16:43:14.003 P:26184 T:017 A:00000 DBG] LibUsbFactory: [handle_event_trigger] event triggered
[16:43:14.003 P:26184 T:017 A:00000 DBG] LibUsbFactory: [windows_handle_transfer_completion] handling transfer 00000195FA08FCB0 completion with errcode 0, length 1600
[16:43:14.013 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_handle_transfer_completion] transfer 00000195FA08FCB0 has callback 00000191003BC37C
[16:43:14.013 P:26184 T:017 A:00000 DBG] LibUsbFactory: [libusb_handle_events_timeout_completed] doing our own event handling
[16:43:14.013 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() for 2 HANDLEs with timeout in 60000ms
16:43:14.013851 [P:26184|T:086] TRACE<MultiStreamV1DataProcessor.cs:ProcessData:61> 1032
[16:43:14.099 P:26184 T:086 A:00000 DBG] LibUsbFactory: [libusb_submit_transfer] transfer 00000195FA08FCB0
[16:43:14.099 P:26184 T:086 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] matched endpoint 86 with interface 0
[16:43:14.104 P:26184 T:086 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] reading 524288 bytes
16:43:14.104191 [P:26184|T:086] TRACE<MultiStreamV1DataProcessor.cs:ProcessData:61> 1600
[16:43:14.113 P:26184 T:061 A:00000 DBG] LibUsbFactory: [windows_iocp_thread] transfer 00000195FA08FCB0 completed, length 23144
[16:43:14.134 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() returned 0
[16:43:14.134 P:26184 T:017 A:00000 DBG] LibUsbFactory: [handle_event_trigger] event triggered
[16:43:14.137 P:26184 T:017 A:00000 DBG] LibUsbFactory: [windows_handle_transfer_completion] handling transfer 00000195FA08FCB0 completion with errcode 0, length 23144
[16:43:14.143 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_handle_transfer_completion] transfer 00000195FA08FCB0 has callback 00000191003BC37C
[16:43:14.143 P:26184 T:017 A:00000 DBG] LibUsbFactory: [libusb_handle_events_timeout_completed] doing our own event handling
[16:43:14.143 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() for 2 HANDLEs with timeout in 60000ms
[16:43:14.422 P:26184 T:086 A:00000 DBG] LibUsbFactory: [libusb_submit_transfer] transfer 00000195FA08FCB0
[16:43:14.424 P:26184 T:086 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] matched endpoint 86 with interface 0
[16:43:14.426 P:26184 T:086 A:00000 DBG] LibUsbFactory: [winusbx_submit_bulk_transfer] reading 524288 bytes
16:43:14.429049 [P:26184|T:086] TRACE<MultiStreamV1DataProcessor.cs:ProcessData:61> 23144
[16:43:14.429 P:26184 T:061 A:00000 DBG] LibUsbFactory: [windows_iocp_thread] transfer 00000195FA08FCB0 completed, length 1312
[16:43:14.433 P:26184 T:017 A:00000 DBG] LibUsbFactory: [usbi_wait_for_events] WaitForMultipleObjects() returned 0

Hi @sonatique, thanks for the report.

It would seem that the problem can be in many places.

libusb-1.0 DLL calls libusbk DLL to access libusb0, so there are a few places this can go bad (apart from the device itself).
Could you try making a tiny program in libusbk to see if the problem is libusb-1.0 or deeper ?

WinUSB is supported directly in libusb-1.0, so the problem could lie in libusbk.

After that as @mcuee stated, you could try porting the same program to libusb-0.1.

The problem is still interesting to us, as running through libusb-1.0 is not an unlikely scenario.

@dontech , thank you for your interest in my issue.
As I wrote: if I replace libusb0.sys with libusbK.sys (not changing anything else, so still with libusb-1.0 and libusbK.dll): everything is fine. I think this decreases the probability of having the issue in libusbK.dll or libusb-1.0.
Using WinUSB.dll and WinUSB.sys with libsub-1.0 also works without problem, further decreasing the probability of the issue being in libusb-1.0.

Given the comment in transfer.c about some limit involving 65536, but given that the code below doesn't use the value 65536 at all, I have the feeling that something may be "out of sync" around here, maybe insufficiently tested.

I'll be happy to try to reproduce by writing a libsub-0.1 program, but as said, not easy with my device, it would require porting a lot of code. And let's say that I do, and that I share the program with you: you would probably still be unable to reproduce, given you don't have my device. We should share a common device, firmware or something, without it it seems tedious.

So this makes me want to ask these questions:

  1. What is the general common way of debugging or unit testing the code that ends-up in libusb0.sys? Are there some tutorial out there that you would recommend? How do you do on your side?

  2. Where are we supposed to see Debug messages when using Debug compiled libusb0.sys ? How can I see messages output by the Debug libusb0.sys that is available to download on sourceforge?

  3. Is there a well known device (I mean a devkit, or whatever) that is commonly used in the libusb-XYZ community to reproduce issues? There is this "sam3uBenchmark" in libusb-1.0 that could easily be modified to reproduce my issue (just need to increase the size of the buffer to 2x65536 for instance), but what is the "sam3U" USB device?
    Is that a device that is easy to get and that everybody could buy, use a common firmware, then have a common hardware / firmware / driver basis to work together on issues? Does anybody have a reference to it? Is there a firmware available to download somewhere?

  4. I have not been able to compile a working libusb0.sys by myself. Using Visual Studio to build the projet produces a non-functional driver, about 2 times to small, for some unknown reason.
    Using the various batch file and script seems to work better but there seems to be hardcoded references to some tools in some folders that I don't have / don't known what it is, and therefore things fail before the end.
    Are there instruction about how to build libusb0.sys using these scripts?

I am willing to help, but the curve to enter seems steep.

Thank you for any information you may provide to me.

First of all thanks for spending time on this! :-)
Make sure you use the latest (3.0.0.1) to test with (download here on github).

@dontech , thank you for your interest in my issue. As I wrote: if I replace libusb0.sys with libusbK.sys (not changing anything else, so still with libusb-1.0 and libusbK.dll): everything is fine. I think this decreases the probability of having the issue in libusbK.dll or libusb-1.0. Using WinUSB.dll and WinUSB.sys with libsub-1.0 also works without problem, further decreasing the probability of the issue being in libusb-1.0.

Great work. This means the problem is either in libusbK DLL when it calls libusb0.sys or in libusb0.sys itself. Let me see if I can reproduce this with one of my HS controllers (e.g. an STM32 controller or similar).

Given the comment in transfer.c about some limit involving 65536, but given that the code below doesn't use the value 65536 at all, I have the feeling that something may be "out of sync" around here, maybe insufficiently tested.

Yeah probably both. The limits could very well be the problem, as they are almost never tested. Also, I do not think this was the focus previously, as very few people transfer that much in each packet. But as data-rates go up, this becomes normal.

I'll be happy to try to reproduce by writing a libsub-0.1 program, but as said, not easy with my device, it would require porting a lot of code. And let's say that I do, and that I share the program with you: you would probably still be unable to reproduce, given you don't have my device. We should share a common device, firmware or something, without it it seems tedious.

Probably I do not need your device. If the problem is in the library/driver, a loopback device in high speed should be fine.

  1. What is the general common way of debugging or unit testing the code that ends-up in libusb0.sys? Are there some tutorial out there that you would recommend? How do you do on your side?

There is currently no support for unit-testing or similar. Only on-device testing. WDM drivers are not typically written with unit-testing in mind, as the WDM system is from a time before that was a thing.

  1. Where are we supposed to see Debug messages when using Debug compiled libusb0.sys ? How can I see messages output by the Debug libusb0.sys that is available to download on sourceforge?

DebugView from microsoft should spit out debug messages if you use the debug build. Please file a report if that does not work.

https://learn.microsoft.com/en-us/sysinternals/downloads/debugview

  1. Is there a well known device (I mean a devkit, or whatever) that is commonly used in the libusb-XYZ community to reproduce issues? There is this "sam3uBenchmark" in libusb-1.0 that could easily be modified to reproduce my issue (just need to increase the size of the buffer to 2x65536 for instance), but what is the "sam3U" USB device?

A legacy controller

https://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-6430-32-bit-Cortex-M3-Microcontroller-SAM3U4-SAM3U2-SAM3U1_Datasheet.pdf

It was hot when the library was written. Personally i use STM32s for most of my testing, and embedded linux in host and device modes (gadget mode).

Is that a device that is easy to get and that everybody could buy, use a common firmware, then have a common hardware / firmware / driver basis to work together on issues? Does anybody have a reference to it? Is there a firmware available to download somewhere?

I would get an STM32F, or STM32H as they are cheap and accessible and have HS parts (most micros are FS only). But there are many brands and names that would work. For testing you typically only need the basic features and a simple loopback program running on the device.

  1. I have not been able to compile a working libusb0.sys by myself. Using Visual Studio to build the project produces a non-functional driver, about 2 times to small, for some unknown reason.

That should work out-of-the-box. Install Visual Studio 2019, and simply "build". If that does not work, file a bug report here on github.

Using the various batch file and script seems to work better but there seems to be hardcoded references to some tools in some folders that I don't have / don't known what it is, and therefore things fail before the end.
Are there instruction about how to build libusb0.sys using these scripts?

That should also work out-of-the-box if you use the right EWDK version 22000.

  1. Map the EWDK to "d:" drive or change the letter in the make.cfg
  2. Open a prompt and "make dist", and it should build everything and place it in "packages".

https://learn.microsoft.com/en-us/windows-hardware/drivers/other-wdk-downloads

If that or any other make target does not work, file a bug report here on github.
I have changed allot in the build system for 3.X.X.X series so some target may be a bit flaky.
E.g. test-signing is currently broken.

I am willing to help, but the curve to enter seems steep.

I think I should probably update the README on how to build this... :-). It should not be that steep.

Thanks for helping,

/pedro

@sonatique, please state what version of libusb-1.0 libraries you are using

mcuee commented

I would get an STM32F, or STM32H as they are cheap and accessible and have HS parts (most micros are FS only). But there are many brands and names that would work. For testing you typically only need the basic features and a simple loopback program running on the device.

Yes STM32F/STM32H is good.

Another good option is EZ-USB FX2LP (high speed, very cheap boards from online shops) and EZ-USB FX3 (superspeed).

Reference discussion here: you can see that I have many boards.
libusb/libusb#1052

EZ-USB FX3 libusbK benchmark testing FW and test results. You can either use Cypress' original FW, or the modified FW by Travis.
https://github.com/mcuee/libusbk/tree/master/BmFW/CYPRESS_FX3

@sonatique Also please post the skeleton libusb-1.0 code you are using to write/read from/to the device. From your description you are using the async API, but please post some details.

Also a dump of the descriptors on your device would be helpful. E.g. the size of the physical endpoints would be great.

@dontech : sorry for the large delay before replying, but I am now back at work on this issue,

I made a libusb-1.0 based program that can easily reproduce the issue using a device running a loopback firmware (I used a EZ-USB FX3 with default SDK "auto" loopback firmware).
I made a commit here with said program: libusb/libusb@master...sonatique:libusb:loopbacktest

As said in this commit comments:

The idea is to write 64bit incrementing numbers, by bunch of 1000 (i.e. 8000 bytes write transfer, value chosen to ensure short packets) to loopback device, and concurrently read from loopback device using a buffer of a given size, and compare received value with an incrementing counter.

If received value doesn't match expected value, it indicates some data has been lost.

I can see no loss data for IN transfer buffer size <= 65536 and lost data with read buffer size > 65536

Steps to reproduce issue with CYUSBKIT-003 EZ-USB FX3 dev board and FX3 SDK demo firmware "dma_examples\cyfxbulklpauto" (binary file "USBBulkLoopAuto.img" put in he commit for simplification, source code available as part of Cypress/Infineon FX3 SDK)

  1. Upload USBBulkLoopAuto.img to EZ-USB FX3 RAM (for instance using Infineon USB Control Center)
  2. Use Zadig Version 2.8 to force libusb-win32 (v1.2.7.3) driver for "FX3" device
  3. Compile and run loopbacktest program: everything should work (no "Integrity error!" message, only printf of current received data)
  4. Stop running program, change source code line 86 from "static uint8_t buf[OK_ReadBufferSize];" to "static uint8_t buf[NotOK_ReadBufferSize];", recompile, run.
  5. Program will promptly stop with "Integrity error!" message. From expected and received value we can easily see that some data has been lost.
  6. Use Zadig to force libusbK device, run same program again: no issues.

Note 1: I verified using a USB physical traffic sniffer that IN and OUT data present on the bus match expectation in any cases (traces available on demand)

Note 2: Any loopback capable USB device should work for reproducing the issue

Note 3: When I first see the bug I was using libusb-win32 v1.3.0.0, but for ease of testing I used latest Zadig which only carries 1.2.7.3. Issue is there in both cases.

I hope that will help finding and solving the issue, thank you very much in advance for your work!

OK @sonatique i reproduced the problem with your example, and with a similar libusb0 example. I have located the source of the problem.

The problem is that the maximum size for a transfer to the core USB driver that libusb0 sends IRPs to is 64K. At least in Windows 10. Not sure what the limit was in previous Window versions.

In the driver there are 2 paths. Small ones (below 64K) and "transfer", and large ones are "large transfer". Not sure why its made like this, as it would probably have been better just to have 1 path, so it got tested more.

For large transfers, it splits the transfers in sub transfers and sends all the transfers to the driver below. This is where there is a problem. If the first transfer is "short", meaning that the entire transfer window is not used, the remaining transfers are cancelled.

This is where it gets nasty. The "larger transfer" code is inventive, but has some problems, and some bugs.

There are 2 problems/bugs here:

  1. The code decides that a transfer is "short" if the returned size is less than 64K. This is not true, as short transfers are defined by not being dividable by the endpoint size. Since all endpoints are less than 64K, it ends up cancelling all remaining transfers 100% of the time. Not critical but super inefficient.

  2. The other problem is that when it cancels the the remaining transfers, it sometimes receives data from the underlying driver even though the transfer is cancelled. This is the problem you are facing, as this causes an entire transfers to get lost.

Need to do some more testing and fixing to get this under control.

@sonatique: As a side-note, I can mention that I do not think going above 64K actually makes sense for an application.
The maximum you can receive in a microframe (USB 3.0) is:

1024*48 = 49152 bytes.

So going above 64K should not yield any significant advantage. In your application, do you get more bandwidth by going above 64K? This would be interresting to know.

@dontech Thansks a lot for having looked at this, for your detailed explanations, and in advance for your efforts at fixing these problems.

To answer your question: actually with my application, I do not get an higher throughput if I, for instance, use 128K instead of 64K, I can reach my maximum (180MBytes/s) in both cases. However, when doubling the buffer size, I divide the CPU consumption by about 2, and this is very interesting for my needs. By increasing the buffer size even more (I found that 2MBytes was a nice value) I can make CPU consumption of the USB acquisition part going to something negligible, for the same maximal rate of 180MBytes/s
I cannot say that this would apply to all applications / use cases, but in my case increasing the buffer size seems to decrease the overhead percentage and therefore give more bytes per CPU cycle, which I appreciate ;-)

OK I have made a conclusion. The entire "large_transfers" code is a no-go.

  1. It works by over-committing allot of transfers, and then cancelling once its gets a short packet.

  2. It then calls IoCancelIrp() to cancel the remaining IRPs, so the "main" request can complete.

The problem is that the behavior for IoCancelIrp() for USB transfers is not well defined.

A) In some cases the Cancel function ends up cancelling pending packets which have not started. OK.

B) In some cases the Cancel causes the driver below to abort in the middle of an operation where data fetching has already started. This means the device sees this as sent, but the driver now throws it away.

This will never work.

I have made a patch that simply removes "large_transfers" support completely, as I am sure it is fundamentally broken.

5beec48

@mcuee @tormodvolden @TravisRo do you have any comments on this?

Now, after this we need to find out why you have high CPU usage. My guess would be that for some reason your application IOCTL requests from your app is more expensive than the IOCTL from the driver to the core USB stack.

IOCTL from applications in general to the kernel require more than intra-kernel IOCTLs I guess

I have some questions:

A) What is your endpoint parameters? Could you dump your USB descriptor?

B) Is the CPU load smaller with libusbk ?

@dontech thanks for your question.
My application is an high-level application with lots of things going on, and my own USB handling user code is for sure responsible for some of the CPU usage anyway. Generally speaking it uses N concurrent buffers each of size S to read data as fast as possible with as low CPU usage as possible.
I don't think I am experiencing an really too high CPU usage when buffer size is 64K, I would say it is within the expected range. At least am not alarmed by it.
But I am in the process of optimizing my application, so I am trying to squeeze some CPU percentage out.
I had S = 64K by default so I first started to play with N, but I quickly realized that I could easily reduce CPU usage by increasing S. The sweet spot seems to be N = 3 and S = 2MBytes in my case.
This is where I realized that I had "data corruption" when using libusb0.sys.ยจ
Regarding your question:

A) Is this what you mean?
image (1)

B) Quick answer: no.
Here are more detailed rough "measurements" (approximative numbers, but obtained repeatedly)
libusb0.sys:
N=16, S=64K: 180MBytes/s, 16% CPU
N=3, S=512K: 180MBytes/s, 4% CPU
N=3, S=2048K: 180MBytes/s, 1.5% CPU

libusbK0.sys:
N=16, S=64K: 180MBytes/s, 17% CPU
N=3, S=512K: 178MBytes/s, 5% CPU
N=3, S=2048K: 178MBytes/s, 1.5% CPU

WinUSB.sys:
N=16, S=64K: 180MBytes/s, 17% CPU
N=3, S=512K: 177MBytes/s, 3% CPU
N=3, S=2048K: 177MBytes/s, 1.5% CPU

Wow those are really great numbers. Thanks.

I should probably study the libusbk code to see how they solve the problem of large transfers.

OK i figured it out. The difference is that libusbk queues all the writes, and then fires them serially after each other, until it gets a short packet. This will work.

We could change libusb0 to do the same, but then i need to change quite a bit in the driver transfer code....

@sonatique

Minor thing -- you can build Zadig to use libusb-win32 1.3.0.1. Or you can use the binary I built.
https://github.com/mcuee/libusb-win32/releases/tag/snapshot_1.3.0.1 (zadig_2.8_mod_1301.exe)

There is a driver display issue though, probably a libwdi bug.

It is interesting to know that libusb-win32 has issues to deal with large transfer.

Normally I use libusbK kBench program to carry out the test using different drivers (libusbK.sys, libusb0.sys and WinUSB) under Windows.
https://github.com/mcuee/libusbk/tree/master/libusbK/src/kBench

Or libusbdotnet BenchmarkCon application under macOS (it also works under Windows but not as good as kBench). It does not seem to work with isochronous transfer though and it has not been updated for a long time.
https://github.com/LibUsbDotNet/LibUsbDotNet/tree/v2/stage/BenchmarkCon

I have not encountered such issues for USB Bulk transfer -- maybe the test program can not catch the issue.

@dontech
libusb-win32's benchmark example may need some updates to catch up with kBench. It has not been updated for a long time.
https://github.com/mcuee/libusb-win32/blob/master/libusb/examples/benchmark.c

@sonatique I think I have fixed the problem.

  1. I have extended the current transfer functionality, to be able to re-use the IRP and URB multiple times efficiently in a row. Should be just as good or better than libusbk.

  2. I have removed the capped transfer, and now you can transfer of to 4GB in 1 transfer if you want. Warning: This could take a very, very long time.

  3. Code for large transfers should now be safe and allot less complex than before.

https://github.com/mcuee/libusb-win32/tree/libusb-max-size

@sonatique: I made this for you to test with:

https://github.com/mcuee/libusb-win32/releases/tag/test_release

Please make sure to enable test signing before using it, or to load it up with zadig.

Please report back if this works for you, and what performance you get. It should be good!

It is interesting to know that libusb-win32 has issues to deal with large transfer.

Normally I use libusbK kBench program to carry out the test using different drivers (libusbK.sys, libusb0.sys and WinUSB) under Windows. https://github.com/mcuee/libusbk/tree/master/libusbK/src/kBench

Or libusbdotnet BenchmarkCon application under macOS (it also works under Windows but not as good as kBench). It does not seem to work with isochronous transfer though and it has not been updated for a long time. https://github.com/LibUsbDotNet/LibUsbDotNet/tree/v2/stage/BenchmarkCon

I have not encountered such issues for USB Bulk transfer -- maybe the test program can not catch the issue.

Not any more... :-)

load it up with zadig.

@dontech : hum: how do you tell Zadig to install driver files from a given folder? I can not figure that out, I can only select one from the built-in list.
Maybe I didn't understand what you meant.

@sonatique:

  1. Open zadig
  2. Select "extract files" instead of "install driver"
  3. Go to c:/Users/<user>/usb_driver
  4. Replace libusb0.sys in the amd64 folder
  5. Install INF from device manager

This of course only works if you remember to disable driver signing:

As admin:

bcdedit.exe -set loadoptions DISABLE_INTEGRITY_CHECKS
bcdedit.exe -set TESTSIGNING ON

Looking forward to your results

@dontech and @sonatique

It seems to me that @dontech has signed properly the .sys files with the right digital signature. In that case, you can rebuild Zadig and no need to use test signing.

libwdi 1.5.0 source code:
https://github.com/pbatard/libwdi

Just modify the configure file to point to the right locations.
https://github.com/pbatard/libwdi/blob/master/msvc/config.h

#ifndef WDK_DIR
#define WDK_DIR "C:/Program Files (x86)/Windows Kits/8.0"
#endif

/* WDK WDF coinstaller version */
#define WDF_VER 1011

/* CoInstaller subdirectory for WinUSB redist files ("winusb" or "wdf") */
#define COINSTALLER_DIR "wdf"

/* 64bit subdirectory for WinUSB redist files ("x64" or "amd64") */
#define X64_DIR "x64"

/* embed libusb0 driver files from the following location */
#ifndef LIBUSB0_DIR
#define LIBUSB0_DIR "D:/libusb-win32"
#endif

/* embed libusbK driver files from the following location */
#ifndef LIBUSBK_DIR
#define LIBUSBK_DIR "D:/libusbK/bin"
#endif

@mcuee : thanks I will try.

It seems to me that @dontech has signed properly the .sys files with the right digital signature. In that case, you can rebuild Zadig and no need to use test signing.

No. I only signed it with my signature. So you need to turn on test signing. Submitting to microsoft is a pain that i try to avoid unless necessary.

The rest of what @mcuee said is correct. But turn ON testsigning.

@dontech : I just tested your correct libusb0.sys for performance: results are equal to what I had with the bugged version, that's very good! From my measurement it appears that libusb0.sys seems to be the most efficient driver amongst the 3, excellent.

Just to be sure I will test for data integrity on Monday, and let you know, but I assume it will be OK.

That's great, thanks a lot for your work!

@sonatique and @dontech

I did some tests with libusbK kBench as well last time and indeed libusb0.sys is the most efficient. I assume it is because it is based on WDM and not KMDF (WinUSB.sys and libusbk.sys are based on KMDF).

cyusb3.sys is also based on KMDF but it has the same performance as libusb0.sys (using Cypress's own host application)

Reference discussion in OSR ntdev
https://community.osr.com/discussion/comment/302128

Benchmark data: superspeed bulk IN/OUT transfer with the bulksrcsink example from Cypress 
(1024B packet size, burst length 16, theoretical max speed up to 454,300 KB/sec 
as per https://www.cypress.com/file/125281/download ).

Real world speed using the same Windows 10 computer and same Cypress BulkSrcSink FW.
437,800 KB/sec for cyusb3.sys using Cypress Cystream application
437,700 KB/sec for libusb0.sys using libusbk kBench application
386,600 KB/sec for libusbk.sys using libusbk kBench application
385,300 KB/sec for WinUSB using libusbk kBench application

The above is my test results using a 9th Core i7 Dell Laptop with 16GB RAM.

With a faster desktop PC with 64GB RAM, Travis got 426,088 KB/sec with libusbk.sys 
and similar results with WinUSB. libusb0.sys is still a bit faster at 440,504 KB/sec.

BTW, I believe kBench (using libusbK API) is more efficient than libusb-1.0 API.

@mcuee: libusb0 might be even faster now, to take the lead.

The reason is that previously each bulk transfer call would result in one transfer in the driver below.

Now, with the latest changes i made, even smaller transfers can be split in several USBD calls.

So e.g.:

before: 1 transfer of 1MB of data on an 4KB endpoint would result in 1 USBD transfer of 4KB = 4KB
now: 1 transfer of 1MB of data on an 4KB endpoint would result in 256 USBD transfers of 4KB = 1MB

Each IOCTL has an overhead on CPU and bandwidth, so cutting them down is more efficient.

@dontech : I just tested your correct libusb0.sys for performance: results are equal to what I had with the bugged version, that's very good! From my measurement it appears that libusb0.sys seems to be the most efficient driver amongst the 3, excellent.

Just to be sure I will test for data integrity on Monday, and let you know, but I assume it will be OK.

That's great, thanks a lot for your work!

Excellent. It also turned out the code was easier than i thought. Re-using the IRPs and URBs gave it the last boost.

Now we just need someone to review.

@tormodvolden ?
@TravisRo ?

@dontech : I just test with the simple loopback test I provided on December 7 and I can no longer see any "data corruption", that is great, thanks a lot!
I tested with my desired buffer size of 2MBytes and it was fine.
EDIT: I also tested with my real application (3 buffers of 2MBytes each) and I no longer see corruption!

However: I then tried to increase buffer size even more, just for fun (since you claimed we could go up to 4GBytes ;-) )
4, 8 , 16, 32, MBytes was OK.
But, then at 64MB or more, I immediately get a BSOD, every time. Sometimes Stop code is BAD_POOL_HEADER, sometimes KERNEL_MODE_HEAP_CORRUPTION.

I have no idea where this comes from. Did you try on your side with 64 MBytes or more?

OK that is probably a separate issues.

The number 4GB was supposed to be 2GB, which is the limit imposed by the IOCTL from user-space to the driver.

I never tried it, just looked at the possible values.

Why 64MB is the limit before a crash is beyond me, but lets take this is a separate bug-report.

Verified closed by reporter.