piskvorky/smart_open

Compatibility issue with soundfile

alexjbuck opened this issue · 4 comments

Problem description

Somewhere between smart_open and soundfile, data is getting lost when writing to or reading from S3 (MinIO in this case).

The following is a mostly minimal example that reproduces the issue. When reading a FLAC file from S3 by passing a file handle from smart_open to the soundfile library, soundfile appears to find one sample fewer (2 bytes in this case) than was written. You can successfully read len(input)-1 samples from the file, where input is the original array used to create the audio file. When you try to read every sample, libsndfile raises an error: LibsndfileError: Internal psf_fseek() failed.

I do not know whether this is a soundfile, libsndfile, or smart_open issue. The audio library seems to think it's getting shorted on data: the seek fails for a length that should work (the length of the original signal).

To confirm that it isn't an indexing issue: the actual length of the returned samples is one short (24999 instead of 25000 in my example).

There is a related issue that I'll file separately that happens when you don't specify the number of frames/samples to read from the audio file when reading it through smart_open.

I also included an example of using smart_open with a filesystem target to demonstrate that it works for local files but not for remote S3 connections. That also makes me think the behavior at fault might be inside smart_open.
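One way to probe the local-versus-S3 difference is to compare what each file handle reports about itself before it is handed to soundfile. The sketch below is only a diagnostic, written under the assumption that seekability is the relevant difference, which nothing in this report has confirmed. `stream_capabilities` and `NonSeekableWriter` are hypothetical names; the latter is a stand-in for a remote write handle, not smart_open's actual class.

```python
import io


def stream_capabilities(fileobj):
    """Return what a file-like object reports about itself (hypothetical helper)."""
    return {
        "readable": fileobj.readable(),
        "writable": fileobj.writable(),
        "seekable": fileobj.seekable(),
    }


class NonSeekableWriter(io.RawIOBase):
    """Stand-in for a write-only, append-only stream.

    This is an assumption for illustration, not smart_open's real S3 writer.
    io.RawIOBase reports seekable() == False unless overridden.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


# An in-memory buffer behaves like a local file: it can seek.
local_like = stream_capabilities(io.BytesIO())
# The stand-in can only append.
remote_like = stream_capabilities(NonSeekableWriter())
print(f"{local_like=}")
print(f"{remote_like=}")
```

In the reproduction, the same helper could be called on the handles returned by smart_open.open('test.flac', "wb") and smart_open.open(filepath.as_uri(), "wb", **transport_params) to see whether they differ.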

My final objective is to read and write audio files (FLAC preferably) to and from S3 storage from within Python.

Steps/code to reproduce the problem

import io
import boto3
import numpy as np
import smart_open
import soundfile as sf
from numpy import pi
from s3path import S3Path

session = boto3.Session(
    aws_access_key_id='<id>',
    aws_secret_access_key='<key>',
)
client = session.client('s3',endpoint_url="http://localhost:9000")

stop = 5
sample_rate = 5000
start=0
t = np.linspace(start=start, stop=stop, num=stop*sample_rate)
x1 = np.sin(2*pi*100*t)
x2 = np.sin(2*pi*200*t)
signal = x1+x2

path = S3Path.from_uri('s3://etl')
filepath = path / 'test.flac'

transport_params = {'transport_params':{'client':client}}

# Writing to local filesystem through soundfile - Apparent Success
with smart_open.open('test.flac', "wb", **transport_params) as file:
    with sf.SoundFile(file,mode='w', samplerate=sample_rate,channels=1,format='flac') as f:
        f.write(signal)
        print(f"Filesystem IO: {f.frames=}")
# > Filesystem IO: f.frames=25000

# Reading full sample from filesystem - Success
with smart_open.open('test.flac', 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"Filesystem IO: {len(samples)=}")
# > Filesystem IO: len(samples)=25000

# Writing to S3 through soundfile - Apparent Success
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with sf.SoundFile(file,mode='w', samplerate=sample_rate,channels=1,format='flac') as f:
        f.write(signal)
        print(f"S3 IO: {f.frames=}")
# > S3 IO: f.frames=25000

# Reading 1 less sample than was written - Success
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal)-1)
        print(f"S3 IO: {len(samples)=}")
# > S3 IO: len(samples)=24999

# Reading the same number of samples as was written - Failure
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"S3 IO: {len(samples)=}")
# > LibsndfileError: Internal psf_fseek() failed.

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-14.0-arm64-arm-64bit
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
smart_open 6.4.0

Full Error

---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
Cell In[330], line 58
     56 with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
     57     with sf.SoundFile(io.BytesIO(file.read())) as fin:
---> 58         samples = fin.read(len(signal))
     59         print(f"S3 IO: {len(samples)=}")
     60 # > LibsndfileError: Internal psf_fseek() failed.

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:895, in SoundFile.read(self, frames, dtype, always_2d, fill_value, out)
    893     if frames < 0 or frames > len(out):
    894         frames = len(out)
--> 895 frames = self._array_io('read', out, frames)
    896 if len(out) > frames:
    897     if fill_value is None:

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1344, in _array_io(self, action, array, frames)
   1342 ctype = self._check_dtype(array.dtype.name)
   1343 assert array.dtype.itemsize == _ffi.sizeof(ctype)
-> 1344 cdata = _ffi.cast(ctype + '*', array.__array_interface__['data'][0])
   1345 return self._cdata_io(action, cdata, ctype, frames)

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1356, in _cdata_io(self, action, data, ctype, frames)
   1354 frames = func(self._file, data, frames)
   1355 _error_check(self._errorcode)
-> 1356 if self.seekable():
   1357     self.seek(curr + frames, SEEK_SET)  # Update read & write position
   1358 return frames

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:802, in SoundFile.seek(self, frames, whence)
    800 self._check_if_closed()
    801 position = _snd.sf_seek(self._file, frames, whence)
--> 802 _error_check(self._errorcode)
    803 return position

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1407, in _error_check(err, prefix)
   1405 def _error_check(err, prefix=""):
   1406     """Raise LibsndfileError if there is an error."""
-> 1407     if err != 0:
   1408         raise LibsndfileError(err, prefix=prefix)

LibsndfileError: Internal psf_fseek() failed.

I have further narrowed this down.

If you write through an intermediate io.BytesIO object, it works.

# Writing to S3 through soundfile - Apparent Success
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with io.BytesIO() as temp:
        with sf.SoundFile(temp,mode='w', samplerate=sample_rate,channels=1,format='flac') as flac:
            flac.write(signal)
            print(f"S3 IO: {flac.frames=}")
        file.write(temp.getvalue())
# > S3 IO: flac.frames=25000

# Reading without defining samples - Success
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        print(f"S3 IO: {fin.frames=}")
        samples = fin.read()
        print(f"S3 IO: {len(samples)=}")
# > S3 IO: fin.frames=25000
# > S3 IO: len(samples)=25000
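The buffering workaround above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of smart_open or soundfile: `write_via_buffer`, `toy_encoder`, and `AppendOnly` are hypothetical names. The toy encoder only imitates an encoder that seeks back to patch a header; it is an assumption for illustration and makes no claim about what libsndfile actually does.

```python
import io


def write_via_buffer(dest, encode):
    """Run encode(buffer) against a seekable in-memory buffer, then copy the
    finished bytes to dest in a single write (hypothetical helper)."""
    with io.BytesIO() as temp:
        encode(temp)
        payload = temp.getvalue()
    dest.write(payload)
    return len(payload)


def toy_encoder(buffer):
    """Toy encoder that writes a placeholder header, then the body, then
    seeks back to fill in the body length. Illustration only."""
    body = b"audio-bytes"
    buffer.write(b"\x00\x00\x00\x00")           # placeholder for the length
    buffer.write(body)
    buffer.seek(0)                               # needs a seekable target
    buffer.write(len(body).to_bytes(4, "big"))   # patch the header in place


class AppendOnly(io.RawIOBase):
    """Write-only, non-seekable destination (stand-in for a remote stream)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


dest = AppendOnly()
written = write_via_buffer(dest, toy_encoder)
result = b"".join(dest.chunks)
print(written, result)
```

In the reproduction, `encode` would be a function that opens sf.SoundFile(buffer, mode='w', samplerate=sample_rate, channels=1, format='flac') and calls write(signal), and `dest` would be the handle returned by smart_open.open.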

Compare this to the version that failed (which is how the examples led me to believe this should work):

# Writing to S3 through soundfile - Apparent Success (actual failure!)
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with sf.SoundFile(file,mode='w', samplerate=sample_rate,channels=1,format='flac') as f:
        f.write(signal)
        print(f"S3 IO: {f.frames=}")
# > S3 IO: f.frames=25000
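To see where the two write paths diverge, one option is to read both resulting objects back as raw bytes and locate the first mismatch. This is a sketch of such a comparison; `first_difference` is a hypothetical helper, and the byte strings below are made-up placeholders rather than real FLAC data.

```python
def first_difference(a: bytes, b: bytes):
    """Return the index of the first differing byte, or None if identical.

    If one input is a strict prefix of the other, the index where the
    shorter one ends is returned.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Placeholder payloads standing in for the buffered and the direct write.
buffered = b"fLaC" + bytes([1, 2, 3, 4])
direct = b"fLaC" + bytes([1, 2, 0, 4])

same = first_difference(buffered, buffered)
changed = first_difference(buffered, direct)
truncated = first_difference(buffered, buffered[:-1])
print(same, changed, truncated)
```

In practice the two inputs would come from file.read() on the object written through io.BytesIO and on the object written by passing the smart_open handle directly to sf.SoundFile.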

might this be a side-effect of #796 cc @jakkdl @mpenkov ?

@alexjbuck can you re-run your MWE using smart_open==6.4.0 and confirm it is already occurring before that PR was released?

This issue (Oct 2023) is older than #796 (merged Feb 2024), so that PR can't be the cause.
In the OP they specify that the test was run on smart_open==6.4.0.

ah right, missed that 👍