Compatibility issue with soundfile
alexjbuck opened this issue · 4 comments
Problem description
Somewhere between smart_open and soundfile, data is getting lost when writing to or reading from S3 (MinIO in this case).
The following is a mostly minimal example that reproduces this issue. When reading a FLAC file from S3 by passing a file handle from smart_open to the soundfile library, soundfile appears to find one sample fewer (2 bytes in this case) than was written. You can successfully read len(input)-1 samples out of the file, where input is the original array used to create the audio file (signal in the code below). When you try to read every sample out, libsndfile errors: LibsndfileError: Internal psf_fseek() failed.
I do not know if this is a soundfile, libsndfile, or smart_open issue, because the audio library seems to think it's getting shorted on data, in that the seek fails when seeking a length that should work (the length of the original sample).
To confirm that it isn't an indexing issue, the actual length of the returned samples is one short (24999 versus 25000 in my example).
There is a related issue that I'll file separately that happens when you don't specify the number of frames/samples to read from the audio file when reading it through smart_open.
I also included an example of using smart_open with a filesystem target, to demonstrate that it works for local filesystem objects but not for remote S3 connections, which also makes me think the fault might lie in behavior inside smart_open.
My final objective is to read and write audio files (FLAC preferably) to and from S3 storage from within Python.
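One way to check whether bytes are actually lost between the local and the S3 round trip is to compare the two payloads directly. The sketch below is only an illustration: compare_payloads is a hypothetical helper (not part of smart_open or soundfile), and the byte strings are stand-in data. In the reproduction, expected would be the bytes of the local test.flac and actual the bytes returned by file.read() on the S3 handle.

```python
def compare_payloads(expected: bytes, actual: bytes) -> dict:
    """Summarize how two byte strings differ (hypothetical diagnostic helper)."""
    common = min(len(expected), len(actual))
    # Index of the first byte that differs within the shared prefix, if any.
    first_diff = next(
        (i for i in range(common) if expected[i] != actual[i]),
        None,
    )
    # If one payload is a strict prefix of the other, the first
    # difference sits at the end of the shorter payload.
    if first_diff is None and len(expected) != len(actual):
        first_diff = common
    return {
        "expected_len": len(expected),
        "actual_len": len(actual),
        "missing_bytes": len(expected) - len(actual),
        "first_diff": first_diff,
    }


# Stand-in data: 'actual' is 'expected' with the last two bytes cut off.
expected = bytes(range(16))
actual = expected[:-2]
report = compare_payloads(expected, actual)
print(report)
```

A non-zero missing_bytes would suggest the stored object itself is short, while identical payloads would point toward how the bytes are decoded rather than how they are transported.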
Steps/code to reproduce the problem
import io
import os
import boto3
import numpy as np
import smart_open
import soundfile as sf
from s3path import S3Path
from soundfile import SoundFile

session = boto3.Session(
    aws_access_key_id='<id>',
    aws_secret_access_key='<key>',
)
client = session.client('s3', endpoint_url="http://localhost:9000")

# Build a 5-second test signal: the sum of a 100 Hz and a 200 Hz sine wave
stop = 5
sample_rate = 5000
start = 0
t = np.linspace(start=start, stop=stop, num=stop*sample_rate)
x1 = np.sin(2*np.pi*100*t)
x2 = np.sin(2*np.pi*200*t)
signal = x1 + x2

path = S3Path.from_uri('s3://etl')
filepath = path / 'test.flac'
transport_params = {'transport_params': {'client': client}}

# Writing to local filesystem through soundfile - Apparent Success
with smart_open.open('test.flac', "wb", **transport_params) as file:
    with sf.SoundFile(file, mode='w', samplerate=sample_rate, channels=1, format='flac') as f:
        f.write(signal)
        print(f"Filesystem IO: {f.frames=}")
# > Filesystem IO: f.frames=25000

# Reading full sample from filesystem - Success
with smart_open.open('test.flac', 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"Filesystem IO: {len(samples)=}")
# > Filesystem IO: len(samples)=25000

# Writing to S3 through soundfile - Apparent Success
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with sf.SoundFile(file, mode='w', samplerate=sample_rate, channels=1, format='flac') as f:
        f.write(signal)
        print(f"S3 IO: {f.frames=}")
# > S3 IO: f.frames=25000

# Reading 1 less sample than was written - Success
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal)-1)
        print(f"S3 IO: {len(samples)=}")
# > S3 IO: len(samples)=24999

# Reading the same number of samples as was written - Failure
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"S3 IO: {len(samples)=}")
# > LibsndfileError: Internal psf_fseek() failed.
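Since the local and S3 code paths differ only in the handle that smart_open returns, it may help to compare what each handle claims to support. The following is a minimal sketch using only the standard library. describe_stream and WriteOnlyStream are hypothetical names introduced for illustration, and nothing here asserts what smart_open's S3 handle actually reports.

```python
import io


def describe_stream(stream) -> dict:
    """Report which file-like operations a stream claims to support."""
    info = {}
    for name in ("readable", "writable", "seekable"):
        method = getattr(stream, name, None)
        info[name] = bool(method()) if callable(method) else None
    try:
        info["tell"] = stream.tell()
    except (AttributeError, OSError, ValueError):
        # io.UnsupportedOperation derives from both OSError and ValueError.
        info["tell"] = None
    return info


class WriteOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a write-only, non-seekable stream."""

    def writable(self):
        return True

    def write(self, data):
        return len(data)


buffer_info = describe_stream(io.BytesIO(b"abc"))
write_only_info = describe_stream(WriteOnlyStream())
print(buffer_info)
print(write_only_info)
```

In the reproduction, the same function could be called on the handle returned by smart_open.open('test.flac', "wb") and on the one returned for the S3 URI, to see whether the two differ in any of these capabilities.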
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-14.0-arm64-arm-64bit
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
smart_open 6.4.0
Full Error
---------------------------------------------------------------------------
LibsndfileError Traceback (most recent call last)
Cell In[330], line 58
56 with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
57 with sf.SoundFile(io.BytesIO(file.read())) as fin:
---> 58 samples = fin.read(len(signal))
59 print(f"S3 IO: {len(samples)=}")
60 # > LibsndfileError: Internal psf_fseek() failed.
File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:895, in SoundFile.read(self, frames, dtype, always_2d, fill_value, out)
893 if frames < 0 or frames > len(out):
894 frames = len(out)
--> 895 frames = self._array_io('read', out, frames)
896 if len(out) > frames:
897 if fill_value is None:
File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1344, in _array_io(self, action, array, frames)
1342 ctype = self._check_dtype(array.dtype.name)
1343 assert array.dtype.itemsize == _ffi.sizeof(ctype)
-> 1344 cdata = _ffi.cast(ctype + '*', array.__array_interface__['data'][0])
1345 return self._cdata_io(action, cdata, ctype, frames)
File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1356, in _cdata_io(self, action, data, ctype, frames)
1354 frames = func(self._file, data, frames)
1355 _error_check(self._errorcode)
-> 1356 if self.seekable():
1357 self.seek(curr + frames, SEEK_SET) # Update read & write position
1358 return frames
File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:802, in SoundFile.seek(self, frames, whence)
800 self._check_if_closed()
801 position = _snd.sf_seek(self._file, frames, whence)
--> 802 _error_check(self._errorcode)
803 return position
File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1407, in _error_check(err, prefix)
1405 def _error_check(err, prefix=""):
1406 """Raise LibsndfileError if there is an error."""
-> 1407 if err != 0:
1408 raise LibsndfileError(err, prefix=prefix)
LibsndfileError: Internal psf_fseek() failed.
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
I have further narrowed this down.
If you write through an io.BytesIO object then it works.
# Writing to S3 through soundfile via an in-memory buffer - Success
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with io.BytesIO() as temp:
        with sf.SoundFile(temp, mode='w', samplerate=sample_rate, channels=1, format='flac') as flac:
            flac.write(signal)
            print(f"S3 IO: {flac.frames=}")
        # Copy the finished FLAC bytes to S3 only after SoundFile has closed
        file.write(temp.getvalue())
# > S3 IO: flac.frames=25000

# Reading without defining samples - Success
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        print(f"S3 IO: {fin.frames=}")
        samples = fin.read()
        print(f"S3 IO: {len(samples)=}")
# > S3 IO: fin.frames=25000
# > S3 IO: len(samples)=25000
Compare this to the version that failed (which, judging from the examples, is the way I was led to believe this should work):
# Writing to S3 through soundfile - Apparent Success (actual failure!)
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with sf.SoundFile(file, mode='w', samplerate=sample_rate, channels=1, format='flac') as f:
        f.write(signal)
        print(f"S3 IO: {f.frames=}")
# > S3 IO: f.frames=25000
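The io.BytesIO workaround can be wrapped in a small helper so the encoder always writes to a seekable in-memory buffer and the destination only ever receives one sequential write. This is a sketch under stated assumptions: write_via_buffer, AppendOnlySink, and toy_encoder are hypothetical names. The toy encoder seeks back to patch a length field purely to show why a seekable buffer can matter. Whether libsndfile does something similar for FLAC is not confirmed in this thread.

```python
import io
import struct


class AppendOnlySink(io.RawIOBase):
    """Hypothetical stand-in for a write-only, non-seekable upload stream."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def toy_encoder(stream, samples):
    """Toy format: 4-byte sample count, then one byte per sample.

    The count is written as a placeholder first and patched at the end,
    which requires a seekable stream (an assumption for illustration).
    """
    stream.write(struct.pack("<I", 0))  # placeholder sample count
    stream.write(bytes(samples))
    stream.seek(0)
    stream.write(struct.pack("<I", len(samples)))


def write_via_buffer(dest, encode):
    """Run encode() against an in-memory buffer, then write the result once."""
    with io.BytesIO() as buffer:
        encode(buffer)
        payload = buffer.getvalue()
    dest.write(payload)
    return len(payload)


sink = AppendOnlySink()
written = write_via_buffer(sink, lambda buf: toy_encoder(buf, [1, 2, 3]))
stored = b"".join(sink.chunks)
print(written, struct.unpack("<I", stored[:4])[0])
```

In the reproduction above, dest would be the handle from smart_open.open(filepath.as_uri(), "wb", ...), and encode would open sf.SoundFile(buf, mode='w', ...) on the buffer and write signal, as in the working example.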
might this be a side-effect of #796 cc @jakkdl @mpenkov ?
@alexjbuck can you re-run your MWE using smart_open==6.4.0 and confirm it is already occurring before that PR was released?
this issue (Oct 2023) is older than #796 (merged Feb 2024), so that PR can't be the cause.
In the OP they specify that the test was run on smart_open==6.4.0
ah right, missed that 👍