stadt-karlsruhe/ckanext-extractor

After the extraction the index is not updated

Closed this issue · 4 comments

I'm not sure I understand the handling correctly. As far as I can tell, thanks to celeryd all new uploads (e.g. of a PDF file) are extracted automatically. But the result is then not yet present in the search index, so to actually use the extracted full text for search, I have to rebuild the index.

Is this correct? Or should the index eventually be updated?

I just saw that apparently the indexing fails with the following error:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/ckan/ckanext/ckanext-extractor/ckanext/extractor/tasks.py", line 82, in extract
    index_for('package').update_dict(pkg_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 103, in update_dict
    self.index_package(pkg_dict, defer_commit)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 215, in index_package
    rel_dict[type].append(model.Package.get(rel['subject_package_id']).name)
KeyError: 'subject_package_id'

But paster --plugin=ckan search-index rebuild works. Any idea what could cause this behavior?

I've never seen that error and currently have no idea regarding its cause. Does this happen reproducibly? Does it affect only certain resources/datasets or all of them? As far as I understand, subject_package_id is used in package relations which we don't really use.

@torfsen after more investigation, I think this is not really a problem with this extension, but rather caused by ckan/ckan#2332. I have code that creates a relationship in the after_create hook, and this error prevented the indexing from running successfully.
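
For illustration, the failure in the traceback comes down to a dict lookup on a key that is not present. The sketch below is not CKAN code: it uses plain dicts as stand-ins for relationship data, and the dict shapes and the helper `collect_related_ids` are assumptions made up for this example. It only shows how a relationship dict without `subject_package_id` raises the same `KeyError`, and how a defensive lookup could skip such entries instead.

```python
# Illustrative only: plain dicts stand in for relationship data.
# The dict shapes and the helper name are assumptions, not CKAN's actual code.

def collect_related_ids(relationships, key="subject_package_id"):
    """Return the values stored under `key` plus the dicts that lack it."""
    found, skipped = [], []
    for rel in relationships:
        if key in rel:
            found.append(rel[key])
        else:
            # Keep the incomplete entry so the caller can log or inspect it.
            skipped.append(rel)
    return found, skipped


rels = [
    {"subject_package_id": "pkg-a", "type": "child_of"},
    # Hypothetical shape that lacks the expected key.
    {"subject": "pkg-b", "type": "child_of"},
]

# Direct indexing fails on the second dict, like the lookup in the traceback.
error_key = None
try:
    [rel["subject_package_id"] for rel in rels]
except KeyError as exc:
    error_key = exc.args[0]

# The defensive variant separates usable entries from incomplete ones.
found, skipped = collect_related_ids(rels)
```

Whether skipping incomplete entries is the right behavior depends on the use case; failing loudly, as the indexer does here, at least makes the inconsistent data visible.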

@metaodi OK, thanks for digging into it!