SheetJS/sheetjs

Discourage use of the FileReader

jimmywarting opened this issue · 3 comments

First and formost i would like to discourage the use of readAsBinaryString

Javascript dose not handle binary strings particular well...

Note that this method was once removed from the File API specification, but re-introduced for backward compatibility.
Using FileReader.readAsArrayBuffer() is recommended.[2]

  • If they think it's okay to use a binary string then they might think that getting a binary string from XHR or fetch(url).then(res => res.text()) works well also - when in fact it dose not, ajax performs text encoding on binary strings when being fetched[6]
  • They removed readAsBinaryString from the spec for a reason, so we should stop using it.
  • When the ArrayBuffer interface appeared, readAsBinaryString has been deprecated because all its use cases could be done in a better way using that new interface.
    Indeed, readAsBinaryString only converts the binary data into a DOMString (UTF-16). There is not much you can do from it afterward. Also, storing it as an UTF-16 string implies that it takes far more space in memory than the original data size. Add to this that strings are immutable, I guess you can see how inefficient it is to work from this. [1]
  • Also, chrome can't handle 500+ MiB strings but buffers can[3]

I do not know what the reason behind using a binary string is, but if you for some reason need to get some part of the chunk as a text then a better conversion would then be to use str = new TextDecoder().decode(data.slice(0, 1024))?
The sheets are also in binary format so maybe it would be easier to parse the binary using DataView. (Using the DataView instead of Buffer would reduce the dependency on the browser build)


Now for the 2nd reason of why not to encourage the use of FileReader any more:

  • NodeJS v15.7 introduced support for Blobs[4]. They are looking into adding support for getting blobs from the FileSystem as well.
  • What NodeJS don't intend to implement is the old FileReader. It's a old legacy API now days when there is this new Blob reading methods supported directly onto the blob[5]
  • I think the new Blob read methods have a simpler syntax with the use of async/await + promises
  • And everyone will be able to read any instance of any Blob polyfill like the fetch-blob package for example
    • Not every FileReader Polyfill would be able to read any Blob Polyfill since they would probably read some internal/private _buffer property if they don't use the new read methods

If we instead encouraged everyone to use Uint8Array instead then i think it would have a better cross platform support on Deno, NodeJS, and also in the browser. (if it's even possible to use sheetjs on the server - i don't know)

It would also make the code simpler if there is just one way do read the data instead of multiple ways to reduce the code complexity
It would also make it easier for PPl to know how to import data from ajax since it has no way of getting data as binary string


I'm willing to help with PR's if we can change any occurrence of FileReader

-      $elm.on('change', function (changeEvent) {
-        var reader = new FileReader();
-
-        reader.onload = function (e) {
-          /* read workbook */
-          var bstr = e.target.result;
-          var workbook = XLSX.read(bstr, {type:'binary'});
-
-          /* DO SOMETHING WITH workbook HERE */
-        };
-
-      reader.readAsBinaryString(changeEvent.target.files[0]);
-     })

into something like this:

+      $elm.on('change', async evt => {
+        const file = evt.target.files[0];
+        const buffer = await file.arrayBuffer();
+        const workbook = XLSX.read(new Uint8Array(buffer), { type: 'array' });
+       });

I could do multiple smaller PR's for each file if that is easier to review, or i could do several at once.

I could also help create some depreciation warning to the console if anyone are using the binary,
XLSX.read(data, { type: 'binary' }) syntax

I would also like to emphasize that the read method could be a tiny bit smarter if it could do auto detection based on the kind of data it is. instead of setting { type: 'array' } it could instead switch based on the first argument and look if it is a ArrayBuffer or ArrayBufferView.

Then it could be a tiny bit easier to just do:

$elm.on('change', async evt => {
  const file = evt.target.files[0]
  const workbook = XLSX.read(await file.arrayBuffer())
})

  const res = await fetch(url)
  const workbook = XLSX.read(await res.arrayBuffer())

The library functions support binary strings, ArrayBuffer and Uint8Array, NodeJS Buffer including the 0.8 pre-Uint8Array buffers, and arrays of numbers that represent bytes. It runs in the browser; NodeJS and other server-side JS engines like Rhino; react-native and other mobile frameworks; Photoshop, Duktape and other JS environments.

The original reason for preferring binary strings in the browser demos is IE10 performance. "Old Habits Die Hard". We've accepted PRs that have phased out use of readAsBinaryString in favor of readAsArrayBuffer.

MDN still recommends binary strings for XMLHttpRequest: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data#receiving_binary_data_in_older_browsers

function load_binary_resource(url) {
  var req = new XMLHttpRequest();
  req.open('GET', url, false);
  //XHR binary charset opt by Marcus Granado 2006 [http://mgran.blogspot.com]
  req.overrideMimeType('text\/plain; charset=x-user-defined');
  req.send(null);
  if (req.status != 200) return '';
  return req.responseText;
}

That having been said, we'd accept PRs that phase out use of readAsBinaryString in the demos. Demos for "older" frameworks like AngularJS should not use ES6+ language features. Demos for "newer" frameworks like React should use modern idioms.

PS: the parsers already detect input type. The type parameter is only needed to distinguish binary strings from plaintext JS strings (since both are strings).

The original reason for preferring binary strings in the browser demos is IE10 performance. "Old Habits Die Hard".

Well, IE is pretty dead now, just 0.7% marketshare worldwide. MS will stop support IE in August. MS 360 apps themself have dropped support for IE11 already. It's not only them who stop supporting IE, me and other bigger website stopped as well

We've accepted PRs that have phased out use of readAsBinaryString.

Cool, i can do that, how about

PS: the parsers already detect input type. The type parameter is only needed to distinguish binary strings from plaintext JS strings (since both are strings).

Cool

Moving the discussion to the PR