io.UnsupportedOperation: read
Closed this issue · 9 comments
Hello! I'm new to Avro and Spavro, but thanks for a great library!
I've just started playing around and testing the capabilities and possible use-cases for Avro by using the following toy code (derived from the official Avro docs). Everything works as expected until the final attempt to append data to the existing users.avro
file, where the io.UnsupportedOperation
exception is raised.
It's very plausible that I'm doing something wrong, but I can't immediately see what that might be. For context, my ultimate goal is to append data to an existing file, optionally with a new schema (as an aside, it would appear that it is not possible to update the schema and append new data in the same operation?).
from spavro.datafile import DataFileWriter, DataFileReader
from spavro.io import DatumWriter, DatumReader
from spavro.schema import make_avsc_object
# An initial schema
base_schema = {
    'namespace': 'example.avro',
    'type': 'record',
    'name': 'User',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'favorite_number', 'type': ['int', 'null']},
        {'name': 'favorite_color', 'type': ['string', 'null']}
    ]
}
# Turn it into a Python object which the DataFileWriter can use
schema = make_avsc_object(base_schema)
# Create a DataFileWriter with our Schema and a path to write to
writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter(), schema)
# Write some data
writer.append({'name': 'Alyssa', 'favorite_number': 256})
writer.append({'name': 'Ben', 'favorite_number': 7, 'favorite_color': 'red'})
# Close the file
writer.close()
# Re-open and read the file we just wrote to
reader = DataFileReader(open('users.avro', 'rb'), DatumReader())
for user in reader:
    # Print each user within the file
    print('user', user)
# Get the schema we used to write to the file
written_schema = reader.datum_reader.writers_schema
# Close the reader
reader.close()
print(f'written_schema [{type(written_schema).__name__}]', written_schema)
# Get the original schema as a dictionary (not sure why this method is called
# `to_json` since it spits out a Python object)
new_schema = written_schema.to_json()
# Append a new field to the schema
new_schema['fields'].append({'name': 'first_pet', 'type': ['string', 'null']})
# Turn the new schema into a Python Schema object
new_schema = make_avsc_object(new_schema)
# Open a writer to use the new schema and write a new user record
writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter(), new_schema)
writer.append({'name': 'Roger', 'favorite_number': 1, 'first_pet': 'Hector'})
# Close the writer
writer.close()
# Create a reader to re-read our file
reader = DataFileReader(open('users.avro', 'rb'), DatumReader())
for user in reader:
    print('user', user)
# Print the schema contained within the file to ensure it is correct
print('edited schema', reader.datum_reader.writers_schema.to_json())
reader.close()
writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter())
writer.append({'name': 'Roger', 'favorite_number': 1, 'first_pet': 'Hector'})
writer.append({'name': 'James', 'favorite_number': 12, 'another_pet': 'Jessie'})
# Close the writer
writer.close()
Exception:
Traceback (most recent call last):
File "/Users/george/sites/notebooks/spavro_tests.py", line 69, in <module>
writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter())
File "/Users/george/sites/notebooks/venv/lib/python3.6/site-packages/spavro/datafile.py", line 102, in __init__
dfr = DataFileReader(writer, io.DatumReader())
File "/Users/george/sites/notebooks/venv/lib/python3.6/site-packages/spavro/datafile.py", line 236, in __init__
self._read_header()
File "/Users/george/sites/notebooks/venv/lib/python3.6/site-packages/spavro/datafile.py", line 302, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "/Users/george/sites/notebooks/venv/lib/python3.6/site-packages/spavro/io.py", line 802, in read_data
return datum_reader(decoder.reader)
File "src/spavro/fast_binary.pyx", line 121, in spavro.fast_binary.make_record_reader.record_reader
File "src/spavro/fast_binary.pyx", line 173, in spavro.fast_binary.make_fixed_reader.fixed_reader
io.UnsupportedOperation: read
It looks like you just missed an argument in the DataFileWriter
that causes the error. If you call the method as writer = DataFileWriter(open('users.avro', 'wb'), DatumWriter(), new_schema)
it works as it should.
EDIT:
Didn't realize this could run without passing a schema by reading the previous schema. That does look buggy. Spavro is trying to perform a read operation on a file object which was opened as wb. Since it can't be both, it probably needs to get a reference to the file and open it again with the right permissions.
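The failure is easy to reproduce with a plain file object, independent of Spavro: a handle opened in 'wb' mode simply refuses reads. A minimal sketch (the temp-file path is just for the demo):

```python
import io
import os
import tempfile

# Hypothetical temp path, purely for illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo.avro')

# A file opened 'wb' is write-only: any read raises
# io.UnsupportedOperation -- the exact error in the traceback above.
f = open(path, 'wb')
try:
    f.read()
except io.UnsupportedOperation as exc:
    print('read failed:', exc)  # read failed: read
finally:
    f.close()

# The same file opened 'ab+' supports both directions.
f = open(path, 'ab+')
f.seek(0)
f.read()  # no exception this time
f.close()
```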
You should notice that what ends up in the file at the end of your script is:
{'favorite_color': None, 'name': 'Roger', 'favorite_number': 1, 'first_pet': 'Hector'}
{'favorite_color': None, 'name': 'James', 'favorite_number': 12, 'first_pet': None}
This is because:
- The old records aren't added to the file before it's overwritten.
- The schema doesn't have an another_pet field, so it's ignored.
- James's record is valid because first_pet can be null.
If you want to mutate the schema and add new records you should:
- Create a new schema that's a union of the old and new requirements.
- Read the old records into memory from the file object.
- Write the new and old records into your file object using the new inclusive schema.
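The first step above can be sketched with plain schema dictionaries. This is a hypothetical helper, not part of Spavro; the actual read and re-write of records would then go through DataFileReader/DataFileWriter as in the script above:

```python
# Sketch of building a widened schema from the old schema dict plus
# new fields. New fields are forced to be nullable (a union with
# 'null') so records written under the old schema stay valid.
def widen_schema(old_schema, new_fields):
    widened = dict(old_schema)
    widened['fields'] = list(old_schema['fields'])  # copy, don't mutate
    for field in new_fields:
        ftype = field['type']
        if not (isinstance(ftype, list) and 'null' in ftype):
            # Make the added field optional so old records need no value.
            field = dict(field, type=[ftype, 'null'])
        widened['fields'].append(field)
    return widened

base = {
    'namespace': 'example.avro', 'type': 'record', 'name': 'User',
    'fields': [{'name': 'name', 'type': 'string'}],
}
merged = widen_schema(base, [{'name': 'first_pet', 'type': 'string'}])
print(merged['fields'][-1])  # {'name': 'first_pet', 'type': ['string', 'null']}
```

The merged dict can then be handed to make_avsc_object and used to re-write the old records plus the new ones into a fresh file.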
Great, thanks @shawnsarwar - I think your last comment highlights where I'm going wrong. I had incorrectly assumed that it's possible to append new records (and set a new schema) to an existing Avro file. It seems I should therefore consider existing Avro files to be read-only.
However, I do think you're right that the bug itself is still valid, since the writers_schema
kwarg on DataFileWriter
is optional.
Thanks for the help!
Thanks guys, I'll check the DataFileWriter. That code should be the same, or nearly the same, as the original Apache reference implementation (one of the goals of Spavro was to be API compatible with the reference implementation).
Also I agree on the "to_json" method name being terrible. I kept it because that's how it is in the reference implementation. One of my todos is to add a spavro API in addition to the default Apache implementation API, with the intent of cleaning up a lot of cruft.
Thanks @mikepk - appreciate you taking a look at the DataFileWriter. Apologies for leaving in my comment re to_json - I actually read your comment saying the same thing (in the code base or in another issue, can’t quite remember) before posting here.
So looking through the tests... it looks like if you intend to append to an Avro data file, the expectation is that you will open the output file in append mode "plus" to allow both reading and writing, i.e.
writer = DataFileWriter(open('users.avro', 'ab+'), DatumWriter())
That's not super clear from the call signature, but the append tests in the test suite all open the writer in read/write append mode "ab+". The "write" mode will automatically nuke the file and write a whole new file. The missing writer schema is implicitly expected when appending records. It might be good to check the mode of the file handle and throw a more meaningful error that says "open in append mode if you're appending records", or something to that effect.
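The difference between the two modes can be seen with a plain file, no Avro involved: 'ab+' preserves the existing bytes and allows reads, while 'wb' truncates on open. A small demo (the path and byte strings are made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'users.avro')

# 'wb' creates (or truncates) the file.
with open(path, 'wb') as f:
    f.write(b'header')

# 'ab+' keeps the existing bytes and allows reads as well as writes --
# which is what DataFileWriter needs in order to read back the stored
# schema before appending new blocks.
with open(path, 'ab+') as f:
    f.seek(0)
    print(f.read())  # b'header'
    f.write(b'|block')  # append-mode writes always go to the end

# Re-opening with 'wb' nukes everything, even if you never write.
with open(path, 'wb') as f:
    pass
print(os.path.getsize(path))  # 0
```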
Also to be clear, the spec doesn't really allow writing different records with different schemas, the data file container is expected to have all of the records written with the same writer schema. So to @shawnsarwar 's point, if you want to change the schema you should read all the records first, then re-write them with the new schema or write two separate data files each with their own schema.
http://avro.apache.org/docs/current/spec.html#Object+Container+Files
Avro includes a simple object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding.
I pushed a small tweak that just checks whether the file object passed in is in the right mode (rb+ or ab+) and throws a more helpful exception if not. wb+ unfortunately truncates the file before writing so it's always blank even when "appending".
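A check along those lines could look something like the following sketch (a hypothetical helper, not the actual committed code):

```python
import io

def check_appendable(fileobj):
    # Appending means reading the existing header (schema + sync
    # marker) and then writing new blocks, so the handle must be
    # opened 'rb+' or 'ab+'. 'wb+' is readable and writable but
    # truncates on open, so there is nothing left to append to.
    # Note: Python normalizes mode strings, e.g. 'r+b' -> 'rb+'.
    mode = getattr(fileobj, 'mode', '')
    if mode not in ('rb+', 'ab+'):
        raise io.UnsupportedOperation(
            "open the file in 'rb+' or 'ab+' mode if you're appending "
            "records (got mode %r)" % mode)
```

Called from the DataFileWriter constructor's append path, this would turn the opaque "io.UnsupportedOperation: read" into an error that names the fix.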