After the extraction the index is not updated
I'm not sure if I understand the handling correctly. As far as I can tell, thanks to celeryd all new uploads (e.g. of a PDF file) are automatically extracted. But then the result is not yet present in the search index. So to actually make use of the extracted fulltext for the search, I have to rebuild the index.
Is this correct? Or should the index eventually be updated?
I just saw that apparently the indexing fails with the following error:
Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/ckan/ckanext/ckanext-extractor/ckanext/extractor/tasks.py", line 82, in extract
    index_for('package').update_dict(pkg_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 103, in update_dict
    self.index_package(pkg_dict, defer_commit)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 215, in index_package
    rel_dict[type].append(model.Package.get(rel['subject_package_id']).name)
KeyError: 'subject_package_id'
But paster --plugin=ckan search-index rebuild works. Any idea what could cause this behavior?
I've never seen that error and currently have no idea what causes it. Is it reproducible? Does it affect only certain resources/datasets or all of them? As far as I understand, subject_package_id is used in package relationships, which we don't really use.
@torfsen after more investigation, I think this is not really a problem of this extension, but rather caused by ckan/ckan#2332. I have code that creates a relationship in the after_create hook, and this error prevented the indexing from running successfully.
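For anyone who runs into the same traceback, here is a minimal sketch of the failure pattern and one defensive way to handle it. This is not CKAN's actual indexing code: the function `related_names`, the `lookup_name` callable, and the shape of the sample relationship dicts are all hypothetical, and the only detail taken from the traceback is that a relationship dict was accessed with `rel['subject_package_id']` and did not have that key.

```python
def related_names(relationships, lookup_name):
    """Group related package names by relationship type.

    Hypothetical helper, not part of CKAN. Entries without a
    'subject_package_id' key are collected in `skipped` instead of
    raising KeyError, so one malformed entry cannot abort the whole run.
    """
    rel_dict = {}
    skipped = []
    for rel in relationships:
        subject_id = rel.get('subject_package_id')
        if subject_id is None:
            # A direct rel['subject_package_id'] lookup would raise
            # KeyError here, as in the traceback above.
            skipped.append(rel)
            continue
        rel_dict.setdefault(rel['type'], []).append(lookup_name(subject_id))
    return rel_dict, skipped


# Sample data: the second entry uses a different (assumed) key layout.
names_by_id = {'abc': 'my-dataset'}
relationships = [
    {'type': 'depends_on', 'subject_package_id': 'abc'},
    {'type': 'depends_on', 'subject': 'other-dataset'},
]
result, skipped = related_names(relationships, names_by_id.get)
```

Whether skipping such entries is acceptable depends on the use case; in my setup the simpler route was to deal with the relationship created in after_create, since that is what triggered the error.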