Metadata field can't handle strings longer than 255 characters
I tried to upload image captions as a metadata field, the idea being that I could then filter the dataset based on the contents of the captions. I ran into an error:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 542247/542247 [00:00<00:00, 2294043.12it/s]
0%| | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TransportQueryError Traceback (most recent call last)
Cell In[13], line 15
12 for start in tqdm(range(0, total, batch)):
13 data = all_metadata[start:start+batch]
---> 15 with ds.metadata_context() as ctx, open(annotations_file) as f:
16 for image, metadata in data:
17 ctx.update_metadata(image, metadata)
File ~/.miniforge3/envs/dagstest/lib/python3.10/contextlib.py:142, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
140 if typ is None:
141 try:
--> 142 next(self.gen)
143 except StopIteration:
144 return False
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/dagshub/data_engine/model/datasource.py:118, in Datasource.metadata_context(self)
116 ctx = MetadataContextManager(self)
117 yield ctx
--> 118 self._upload_metadata(ctx.get_metadata_entries())
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/dagshub/data_engine/model/datasource.py:183, in Datasource._upload_metadata(self, metadata_entries)
182 def _upload_metadata(self, metadata_entries: List[DatapointMetadataUpdateEntry]):
--> 183 self.source.client.update_metadata(self, metadata_entries)
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/dagshub/data_engine/client/data_client.py:109, in DataClient.update_metadata(self, datasource, entries)
102 assert len(entries) > 0
104 params = GqlMutations.update_metadata_params(
105 datasource_id=datasource.source.id,
106 datapoints=[e.to_dict() for e in entries]
107 )
--> 109 return self._exec(q, params)
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/dagshub/data_engine/client/data_client.py:82, in DataClient._exec(self, query, params)
80 logger.debug(f"Params: {params}")
81 q = gql.gql(query)
---> 82 resp = self.client.execute(q, variable_values=params)
83 return resp
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/gql/client.py:403, in Client.execute(self, document, variable_values, operation_name, serialize_variables, parse_result, get_execution_result, **kwargs)
400 return data
402 else: # Sync transports
--> 403 return self.execute_sync(
404 document,
405 variable_values=variable_values,
406 operation_name=operation_name,
407 serialize_variables=serialize_variables,
408 parse_result=parse_result,
409 get_execution_result=get_execution_result,
410 **kwargs,
411 )
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/gql/client.py:221, in Client.execute_sync(self, document, variable_values, operation_name, serialize_variables, parse_result, get_execution_result, **kwargs)
219 """:meta private:"""
220 with self as session:
--> 221 return session.execute(
222 document,
223 variable_values=variable_values,
224 operation_name=operation_name,
225 serialize_variables=serialize_variables,
226 parse_result=parse_result,
227 get_execution_result=get_execution_result,
228 **kwargs,
229 )
File ~/.miniforge3/envs/dagstest/lib/python3.10/site-packages/gql/client.py:860, in SyncClientSession.execute(self, document, variable_values, operation_name, serialize_variables, parse_result, get_execution_result, **kwargs)
858 # Raise an error if an error is returned in the ExecutionResult object
859 if result.errors:
--> 860 raise TransportQueryError(
861 str(result.errors[0]),
862 errors=result.errors,
863 data=result.data,
864 extensions=result.extensions,
865 )
867 assert (
868 result.data is not None
869 ), "Transport returned an ExecutionResult without data or errors"
871 if get_execution_result:
TransportQueryError: {'message': 'pq: value too long for type character varying(255)', 'path': ['updateMetadata']}
This was the code:
annotations_file = 'labels.tsv'

all_metadata = []
with open(annotations_file) as f:
    for row in tqdm(f.readlines()):
        image, caption, score = row.split('\t')[:3]
        all_metadata.append((image, {'caption': caption, 'score': score}))

total = len(all_metadata)
batch = 1000

for start in tqdm(range(0, total, batch)):
    data = all_metadata[start:start+batch]

    with ds.metadata_context() as ctx, open(annotations_file) as f:
        for image, metadata in data:
            ctx.update_metadata(image, metadata)
The workaround was to replace:
all_metadata.append((image, {'caption': caption, 'score': score}))
with:
all_metadata.append((image, {'caption': caption[:255], 'score': score}))
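In case it's useful, here's a slightly more general version of the same clipping, applied to every string value before upload. This is just a sketch; the truncate_strings helper is something I wrote for this, and the limit is inferred from the character varying(255) error above:

MAX_STR_LEN = 255  # inferred from the `character varying(255)` error above

def truncate_strings(metadata, limit=MAX_STR_LEN):
    # Clip every string value to the column limit; leave other value types untouched.
    return {k: (v[:limit] if isinstance(v, str) else v) for k, v in metadata.items()}

# Used inside the upload loop from above:
# with ds.metadata_context() as ctx:
#     for image, metadata in data:
#         ctx.update_metadata(image, truncate_strings(metadata))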
Interesting, would you consider it OK to clip the content of the value?
I think a caption is not exactly metadata as much as it's some kind of label. @kbolashev I think that's a good case in which maybe the backend needs to treat text larger than X as a blob type.
I think there's a fine line between metadata and labels 😄
My intended use case was to be able to filter the images based on keywords in the caption, e.g. look for all images with the word "squirrel" in the caption. If you changed this to a blob type, I probably wouldn't be able to do that, right?
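Something like this is what I had in mind, just as a sketch; I'm assuming the query syntax supports a contains-style string filter (the "Contains" query mentioned further down), and the exact method and attribute names are guesses on my part:

# Hypothetical filter: find datapoints whose caption mentions "squirrel".
# Assumes a contains-style string predicate exists on metadata fields;
# `.contains()` and `.path` below are assumptions, not confirmed API.
squirrel_points = ds[ds["caption"].contains("squirrel")].all()
for dp in squirrel_points:
    print(dp.path)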
Another option I, as a user, would have is to run some NLP functions to pull out nouns, verbs, and adjectives from the caption and then just upload those as metadata.
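Roughly like this, as a sketch using spaCy (any POS tagger would do; the en_core_web_sm model and the keyword format are my own choices, not anything the Data Engine requires):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def caption_keywords(caption, limit=255):
    # Keep deduplicated, lowercased lemmas of nouns, verbs, and adjectives.
    doc = nlp(caption)
    keywords = sorted({t.lemma_.lower() for t in doc if t.pos_ in {"NOUN", "VERB", "ADJ"}})
    # Join into one short string so it stays under the 255-character limit.
    return " ".join(keywords)[:limit]

# caption_keywords("A squirrel eating a red acorn on a fence")
# -> something like "acorn eat fence red squirrel" (exact output depends on the model)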
> I think that's a good case in which maybe the backend needs to treat text larger than X as a blob type.
What's the difference backend performance-wise? Not wasting time LIKE-ing it?
We can totally introduce a blob object and then I'll convert to it, but we may need to explain this limitation to users.
I'm a bit wary about introducing blobs though, because THEN you can actually build a "versioning solution on top of Data Engine", e.g. have a .json file where the actual object being labeled sits in the metadata on said file for some reason. That sounds like a huge misuse and will probably tank the performance for real.
Don't know if it's worth it to think about preventing users from doing that though
> My intended use case was to be able to filter the images based on keywords in the caption, e.g. look for all images with the word "squirrel" in the caption. If you changed this to a blob type, I probably wouldn't be able to do that, right?
That's legit, but I think it would require us to have a dedicated kind of indexing over text blobs for that use case. Using a "Contains" query to find datapoints in big datasets will probably be too slow.
> What's the difference backend performance-wise? Not wasting time LIKE-ing it?
Yes mostly.
Why would it tank performance?