zen-fs/dom

Can't read root directory with `StorageStore`

Closed this issue · 13 comments

With the localStorage Storage backend,
built on top of SyncStore,
we can't read the root directory.

This is caused by asymmetric serialization/deserialization
of Uint8Array data into a string:

  • serialization, SyncStore.js:L40: data.toString() // "1,2,3,4..."
  • deserialization, SyncStore.js:L31: encode(data); // TextEncoder asciiToUint8
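A minimal standalone repro of the asymmetry (an illustrative sketch, not the actual SyncStore code):

```javascript
const data = new Uint8Array([1, 2, 3, 4]);

// What gets serialized (Uint8Array.prototype.toString joins with commas):
const stored = data.toString(); // "1,2,3,4"

// What deserialization produces (TextEncoder encodes the *characters* of the string):
const restored = new TextEncoder().encode(stored);
// restored is [49, 44, 50, 44, 51, 44, 52] — the char codes of "1,2,3,4",
// not the original bytes [1, 2, 3, 4].
```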

stack:

    h ApiError.js:99
    FileError ApiError.js:65
    ENOTDIR ApiError.js:80
    getDirListing SyncStore.js:364
    readdirSync SyncStore.js:251
    BrowserFS filesystem.js:503
    BrowserFS filesystem.js:9
    E filesystem.js:5
    readdir filesystem.js:502

Note: strings are stored as UTF-16, i.e. about 2 bytes per character.

Simple fix: (storage size = data length * 7) (about 2.57+1 ASCII chars per data byte)

  • serialization, SyncStore.js:L40: data.toString() // "1,2,3,4..."
  • deserialization, SyncStore.js:L31: Uint8Array.from(data.split(','), Number)
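A quick round-trip sketch of this simple fix:

```javascript
const input = new Uint8Array([0, 1, 128, 255]);

// Serialization stays as-is; deserialization parses the comma-separated list:
const serialized = input.toString();                          // "0,1,128,255"
const deserialized = Uint8Array.from(serialized.split(','), Number);
// deserialized now equals input byte-for-byte.
```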

Base64 fix: (storage size = data length * 2.7) (about 1.33 ASCII chars per data byte)

  • serialisation: window.btoa(Array.from(data, x=>String.fromCharCode(x)).join(''))
  • deserialization: Uint8Array.from(window.atob(data), x=>x.charCodeAt(0))
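A round-trip sketch of the base64 fix (`btoa`/`atob` are also globals in modern browsers and Node ≥ 16, so the `window.` prefix is optional):

```javascript
const input = new Uint8Array([0, 1, 128, 255]);

// Each byte becomes one char of a "binary string", then base64:
const b64 = btoa(Array.from(input, x => String.fromCharCode(x)).join(''));
const output = Uint8Array.from(atob(b64), x => x.charCodeAt(0));
// output equals input; b64 holds 4 base64 chars per 3 input bytes.
```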

UTF-16 fix: (storage size = data length * 1) (needs to handle odd data length)

  • itoc = (x)=>String.fromCodePoint(x>>11!==27?x:x|0x10000)
  • ctoi = (x)=>x.codePointAt(0) // Uint16Array will discard the high bit
  • serialization: Array.from(new Uint16Array(data.buffer), itoc).join('')
  • deserialization: new Uint8Array(Uint16Array.from(data, ctoi).buffer)
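A minimal round-trip sketch of the even-length, zero-offset case, using the itoc/ctoi pair above (odd lengths and non-zero byte offsets need the extra handling described next):

```javascript
const itoc = x => String.fromCodePoint(x >> 11 !== 27 ? x : x | 0x10000);
const ctoi = x => x.codePointAt(0); // Uint16Array truncates the extra high bit

// Includes a word in the surrogate range (0xD800) to exercise the itoc escape:
const input = new Uint8Array([0x41, 0x42, 0x00, 0xD8]);

const serial = Array.from(new Uint16Array(input.buffer), itoc).join('');
const output = new Uint8Array(Uint16Array.from(serial, ctoi).buffer);
// output equals input; each 16-bit word costs one stored character
// (two UTF-16 code units for the escaped surrogate words).
```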

With UTF-16, to handle odd data length:

  • on serialization: (append the last odd byte plus a count of the extra data bytes; be careful with LE/BE)
    • ...new Uint16Array(data.buffer, 0, length/2)... // discard the odd last byte
    • if (length%2)
    • then append String.fromCodePoint(data[length-1] * 0x101, 0x303)
    • else append String.fromCodePoint(0x202)
  • on deserialization: (remove the extra bytes)
    • len = data.length, extra = data[len - 1]
    • return new Uint8Array(data, 0, len - extra)

When manipulating buffers, the byteOffset is usually 0.
But what if the byte offset is not 0, and especially if it is odd?
Then do ...new Uint16Array(data.buffer, offset/2, (offset%2 + length)/2)...
and carry {o2: offset%2, e2: (offset%2 + length)%2 + 2} into the extra char:

  • extraCodePoint = ((o2<<2) + e2) * 0x101
  • len = data.length, extra = data[len - 1] & 3, offset = data[len - 1] >> 2
  • return new Uint8Array(data.buffer, offset, len - extra - offset)

To avoid extra bugs, we could slice the buffer instead:

  • return new Uint8Array(data.buffer.slice(offset, len - extra))

function utf16Encode(data8) {
  const itoc = (x) => String.fromCodePoint((x >> 11 !== 27) ? x : x | 0x10000);
  const btoc = (x) => itoc(x * 0x101); // duplicate the byte into both halves, to work around endianness

  let {length, buffer, byteOffset, byteLength} = data8;
  const pfx = byteOffset % 2;                // extra start byte to embed
  const sfx = (byteOffset + byteLength) % 2; // odd end byte to duplicate

  byteOffset -= pfx;        // embed the start byte
  byteLength += pfx - sfx;  // ... and drop the odd end byte
  const data16 = new Uint16Array(buffer, byteOffset, byteLength / 2);

  const dataChars = Array.from(data16, itoc);
  if (sfx) dataChars.push(btoc(data8[length - 1] | 0)); // odd end byte (|0 guards the empty case)
  dataChars.push(btoc(4 * pfx + sfx + 2));              // padding bytes count

  return dataChars.join('');
}

function utf16Decode(string) {
  const ctoi = (x) => x.codePointAt(0); // Uint16Array will truncate the extra high bit

  const data16 = Uint16Array.from(string, ctoi);
  const data8 = new Uint8Array(data16.buffer);

  let {length, buffer, byteOffset, byteLength} = data8;
  const pfx = data8[length - 1] >> 2; // embedded start byte to skip
  const sfx = data8[length - 1] & 3;  // duplicated odd end byte and counter word to strip

  byteOffset += pfx;
  byteLength -= pfx + sfx;
  const byteEnd = byteOffset + byteLength;

  return new Uint8Array(buffer, byteOffset, byteLength); // performance
  // return new Uint8Array(buffer.slice(byteOffset, byteEnd)); // compatibility
}

Thanks for the report. We'd appreciate a PR!

@saoirse-iontach,

The encoding/decoding of Uint8Arrays was fixed in 2054a8e, which was not published. I went ahead and published all of the changes on core up until now (core@0.1.0), and I've updated all of the backends to use the new core version and published those changes as well.

Your issue with encoding/decoding should be fixed in v0.0.5.

As for UTF-16: I don't think BFS supports using UTF-16 with backends, since some (e.g. zip) do not support that encoding, and all the backends must implement the internal BFS API.

Does this resolve your issue?

The encoding/decoding of Uint8Array in core is about serializing a directory object into Uint8Array inode data.
Here the issue is about encoding Uint8Array inode data into a string.
I also use your core fix in my code and still have this bug.

So that fix doesn't resolve this issue, which is specific to fs-dom SyncStore.js.
And UTF-16 is a concern specific to localStorage. (Most people don't think about it, because the implementation details are not really publicized, but it is a problem for large files or large filesystems.)

I have to write a test suite for my proposed code,
and then integrate it into my code for a live test.
After that I can do a git clone and pull request; it is simple.
(Maybe one or two weeks depending on my schedule.)

About perf/compat: I think anybody switching between TypedArray and ArrayBuffer should learn the {byteOffset, byteLength} concern, so I prefer the performance solution (and I have updated my OS.js adapter accordingly; it is not so terrible):
((length === byteLength ? uint8array.buffer : uint8array.buffer.slice(byteOffset, byteEnd)))

I saw your core fix 2 weeks ago, and I assumed you had not published it because some other fixes were still to be tested. Thank you.

Test:

/**
 * intData:        ... < 128
 * surrogateData:  0 or D8-DF
 * mixedData:      0...255 including D8-DF
 * randomData:     any
 */
surrogateData = // 0 or D8-DF
[ '',
  'D800',        '00D9',
  'DADB',        '00DCDE00',
  'D8D9DA00',    '00DBDCDE',
  'D8D9DADB',    '00DCDEDFD800',
  'D8D9DADBDC00','00DEDFD8D9DA',
  'D8D9DADBDCDE','00DFD8D9DADBDC00',
  ''
].join('0000'+'0000')
 .match(/../g).map(x=>parseInt(x,16))
intData = Array.from(surrogateData, (v,i)=>i);
mixedData = Array.from(surrogateData, (v,i)=>v||i*17);
randomData = Array.from(surrogateData, (v,i)=>v||Math.random()*256|0);

assert=(test,msg)=>{
  if (test) return;
  debugger;
  throw new Error(`Assertion failed: ${msg}`)
}
test=(data)=>{
  buffer = Uint8Array.from(data).buffer;
  len = data.length;
  nbSurrogate = new Uint16Array(buffer).reduce(
    (a,v)=>((v>>11===27)?a+1:a), 0
  );
  for(i=0;i<8;i++)
  for(j=0;j<8;j++)
  {
    input = new Uint8Array(buffer,i,len-i-j);
    serial = utf16Encode(input)
    output = utf16Decode(serial)

    assert(`${input}`===`${output}`, 'equality')
    // extra bytes:
    //  - counter: 2 bytes
    //  - padding: 2 bytes
    //  - surrogate for copied last byte: 2 byte
    //  - other surrogate: 2*nbSurrogate
    assert(2*serial.length >= input.length+2, 'too short')
    assert(2*serial.length <= input.length+6+2*nbSurrogate, 'too long')
  }
}
test(intData);
test(surrogateData);
test(mixedData);
test(randomData);
"success"

Edit: the utf16Encode/utf16Decode above have been fixed
==> the test now passes

To be exhaustive, the serial data should be written into localStorage in a first pass;
then, after closing and reopening the browser, we should reload the data to finish the test...

@saoirse-iontach,

My apologies, I fixed the usage of Uint8Array.toString() in the core but completely missed it being used in StorageStore.put. I've made the fix in 99e92c7. Does v0.0.6 fix the issue?

TextEncoder/TextDecoder are for Unicode strings; they don't apply to arbitrary binary data.

If you decode raw data, TextDecoder will either throw or "correct" bad Unicode sequences;
and then, in turn, TextEncoder will encode different data.

Try that:

input = new Uint8Array([192,192,192]);
string = new TextDecoder().decode(input);
output = new TextEncoder().encode(string);
({input, output, string})

Output: Uint8Array(9) [ 239, 191, 189, … ]

The TextDecoder/TextEncoder round trip only preserves 7-bit ASCII or otherwise valid UTF-8, not raw binary.

And my code (x>>11 !== 27) ? ... : ... is there to avoid producing such invalid UTF-16 code units.
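To make that test concrete: UTF-16 surrogates occupy 0xD800-0xDFFF, which is exactly the set of 16-bit values whose top five bits equal 0b11011 = 27. A small sketch:

```javascript
// x >> 11 keeps only the top five bits of a 16-bit value;
// they equal 27 (0b11011) precisely for the surrogate range 0xD800-0xDFFF.
const isSurrogate = x => x >> 11 === 27;

// Boundary checks:
// isSurrogate(0xD7FF) === false
// isSurrogate(0xD800) === true
// isSurrogate(0xDFFF) === true
// isSurrogate(0xE000) === false
```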

No, still Error: Error: ENOTDIR: File is not a directory., '/'
but the inode data are completely different:

previous inode mode : 12332 / 0x302c
new inode mode : 49135 / 0xbfef

stored inode data : "\u0000\u0010\u0000\u0000�A\u0000���\u0001�xB\u0000���\u0001�xB\u0000���\u0001�xB\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000de4547d9-906a-42e4-9db6-c804a84d1b00"

id of deserialized inode: "xB\u0000���\u0001�xB\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000de4547d9-906a-42e4-9db6-c804a84d1b00"

As you can see, the ASCII data (the node id appended at the end of the data buffer) is well encoded, but the raw binary (the inode numbers) results in different and too-large (>38 bytes) output data.

This should be fixed by 792cefe on the core. I will close the issue after I've updated all of the DOM backends to use the latest internal API (probably 0.2.0).

That should work with 'hex' or 'base64' encoding,
but storage is inefficient (it will consume 2.7x more storage than the binary data).
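As a rough sketch of the storage cost (assuming ~2 bytes per stored character, since localStorage holds UTF-16 strings):

```javascript
const byteCount = 300; // any raw data length

// hex: 2 chars per byte -> ~4 bytes of UTF-16 storage per data byte
const hexChars = byteCount * 2;

// base64: 4 chars per 3 bytes -> ~2.67 bytes of storage per data byte
const base64Chars = Math.ceil(byteCount / 3) * 4;
```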

This was an issue with the core that was resolved.