syncutil - window size for reading datastream information can be too small

Question

Closed this issue 8 years ago · 3 comments

Creating new issue based on conversations from Issue #15.

The problem arises when the datastream information is particularly long (e.g. long labels), making it longer than the moving window used for reading datastream information.

Bumping the window size on line 206 and lines 252-255 from 200 / 250 to something like 750 worked for a particular set of objects with long datastream labels, but might not be a permanent solution.
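
To illustrate the failure mode, here is a minimal sketch. The regex, the `find_dsinfo` and `make_section` helpers, and the sample FOXML are all hypothetical stand-ins, not the actual code in syncutil.py. It only shows how a fixed-size trailing window can miss the start of a datastream header when the label is long.

```python
import re

# Hypothetical pattern loosely modeled on a FOXML datastreamVersion header;
# the real dsinfo_regex in syncutil.py may look quite different.
DSINFO_RE = re.compile(
    r'<foxml:datastreamVersion[^>]*\bID="(?P<id>[^"]+)"[^>]*'
    r'\bLABEL="(?P<label>[^"]*)"[^>]*\bMIMETYPE="(?P<mimetype>[^"]+)"[^>]*>'
)


def find_dsinfo(section, window):
    """Search only the last `window` characters of the section (assumed behavior)."""
    return DSINFO_RE.search(section[-window:])


def make_section(label):
    """Build a made-up chunk of export text that precedes datastream content."""
    return (
        '<foxml:datastream ID="DS1" STATE="A">'
        '<foxml:datastreamVersion ID="DS1.0" LABEL="%s" MIMETYPE="text/xml">'
        '<foxml:binaryContent>' % label
    )


short_section = make_section("short label")
long_section = make_section("x" * 400)  # label alone exceeds a 250-char window

short_match = find_dsinfo(short_section, 250)        # header fits in the window
long_small_window = find_dsinfo(long_section, 250)   # header start is cut off
long_large_window = find_dsinfo(long_section, 750)   # larger window sees it all
```

With the 250-character window the long-label header is truncated and the search returns `None`; at 750 it matches again. Any fixed size can still be exceeded by a long enough label, which is why a bigger number may not be a permanent fix.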

Answer 1 · 2016-03-16T16:05:03.000Z

@ghukill thanks for opening this; I think we may be adding some notes here soon with some other related issues and/or edge cases we've been running into.

Answer 2 · 2016-04-11T20:38:01.000Z

I'm just adding some errors I encountered:

Error importing emory:d743q to dev: 400 <?xml version="1.0" encoding="UTF-8"?><management:validation  xmlns:management="http://www.fedora.info/definitions/1/0/management/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.fedora.info/definitions/1/0/management/ http://www.fedora.info/definitions/1/0/validation.xsd" pid="unknown"  valid="true">
  <management:contentModels>
  </management:contentModels>
  <management:problems>
    <management:problem>Schematron validation failed:org.xml.sax.SAXParseException; lineNumber: 921; columnNumber: 2; The value of attribute "REF" associated with an element type "foxml:contentLocation" must not contain the '<' character.</management:problem>
  </management:problems>
  <management:datastreamProblems>
  </management:datastreamProblems>
</management:validation>

ChecksumMismatch even with --archive-xml and --requires-auth, e.g.:

repo-cp --archive-xml --requires-auth prod dev emory:pg3k9

Traceback (most recent call last):
  File "/home/jsvarn/eulf/bin/repo-cp", line 137, in <module>
    repo_copy()
  File "/home/jsvarn/eulf/bin/repo-cp", line 121, in repo_copy
    requires_auth=args.requires_auth)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 104, in sync_object
    export_data = export.object_data().getvalue()
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 298, in object_data
    dsinfo = self.get_datastream_info(previous_section)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 258, in get_datastream_info
    infomatch = self.dsinfo_regex.search(force_text(dsinfo))
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/util.py", line 44, in force_text
    s = six.text_type(bytes(s), encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

When I actually catch the error for the one above, I get:

Unexpected error on emory:bcd79: <type 'exceptions.ValueError'> __len__() should return >= 0
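
One plausible cause of the UnicodeDecodeError in the traceback above (an assumption, not confirmed in this thread) is that a byte-based window starts in the middle of a multi-byte UTF-8 character. The sketch below reproduces the same message with made-up sample text; it does not use eulfedora.

```python
# The copyright sign U+00A9 encodes to two bytes in UTF-8: 0xC2 0xA9.
data = "Copyright \u00a9 Emory".encode("utf-8")  # hypothetical sample text

# Simulate a window that happens to begin on the second byte of that character.
cut = data.index(b"\xa9")
window = data[cut:]

try:
    window.decode("utf-8")
    error = None
except UnicodeDecodeError as err:
    # 0xA9 is a continuation byte, so it is invalid as the first byte.
    error = err

# A tolerant decode replaces the broken byte instead of raising.
tolerant = window.decode("utf-8", errors="replace")
```

Decoding with `errors="replace"` (or `"ignore"`) avoids the exception, which may be acceptable if the decoded text is only searched for header attributes, but it would hide the damaged character if that character falls inside a label.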

Answer 3 · 2016-07-28T15:40:47.000Z

I think setting a larger size for the chunk used for datastream info should be fine, and it shouldn't cause an issue with the regex since we're splitting on datastream start and end - that chunk shouldn't ever include datastream info for a previous datastream. My testing indicated that it worked fine for objects that can be successfully synced (excepting the problem record mentioned above, which seems to have other issues).
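
The reasoning about boundaries can be sketched as follows. The pattern, the `current_header_ids` helper, and the sample text are hypothetical simplifications, not the library's actual implementation. The point is only that a chunk cut at the previous datastream's end tag cannot contain the previous header, however large the window is.

```python
import re

# Hypothetical, simplified header pattern and end-tag marker.
VERSION_RE = re.compile(r'<foxml:datastreamVersion\s+ID="([^"]+)"')
DS_END = "</foxml:datastream>"


def current_header_ids(text_so_far, window=750):
    """Return version IDs found after the most recent datastream end tag."""
    chunk = text_so_far.rsplit(DS_END, 1)[-1]  # drop everything before the boundary
    return VERSION_RE.findall(chunk[-window:])


# Made-up export text: one complete datastream, then the start of a second one.
export_prefix = (
    '<foxml:datastream ID="DC">'
    '<foxml:datastreamVersion ID="DC.0" LABEL="Dublin Core" MIMETYPE="text/xml">'
    '</foxml:datastreamVersion></foxml:datastream>'
    '<foxml:datastream ID="MODS">'
    '<foxml:datastreamVersion ID="MODS.0" LABEL="Descriptive metadata" '
    'MIMETYPE="text/xml">'
)

ids_bounded = current_header_ids(export_prefix)           # boundary applied first
ids_unbounded = VERSION_RE.findall(export_prefix[-750:])  # window only, no boundary
```

Without the boundary, a 750-character window over this short sample would see both headers, and a first-match search would pick the wrong one. With the split applied first, only the current datastream's header remains.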