syncutil - window size for reading datastream information can be too small
Creating new issue based on conversations from Issue #15.
The problem arises when the datastream information is particularly long (e.g. long labels), making it longer than the moving window used to read datastream information.
Bumping the window size on line 206 and lines 252-255 from 200 / 250 to something like 750 worked for a particular set of objects with long datastream labels, but might not be a permanent solution.
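To illustrate the failure mode, here is a minimal sketch, not the actual eulfedora code. The regex, the sample FOXML fragment, and the `find_dsinfo` helper are all hypothetical; the only things taken from this issue are the idea of a fixed-size window and the sizes 250 and 750.

```python
import re

# Hypothetical pattern; eulfedora's real dsinfo_regex may look different.
DSINFO_RE = re.compile(
    rb'<foxml:datastreamVersion ID="(?P<id>[^"]+)" LABEL="(?P<label>[^"]*)"'
)


def find_dsinfo(section, window):
    """Search only the last `window` bytes of the section for datastream info."""
    return DSINFO_RE.search(section[-window:])


# A made-up datastream header with a 300-byte label, followed by the
# start of the binary content.
section = (
    b'<foxml:datastream ID="DC">\n'
    b'<foxml:datastreamVersion ID="DC.0" LABEL="' + b"L" * 300 + b'" MIMETYPE="text/xml">\n'
    b"<foxml:binaryContent>"
)

# With a 250-byte window the slice starts inside the label, so the opening
# of the datastreamVersion tag is never seen and the search fails.
small = find_dsinfo(section, 250)

# With a 750-byte window the whole header fits and the match succeeds.
large = find_dsinfo(section, 750)

print(small)
print(large.group("id"))
```

Any fixed window can be defeated by a long enough label, which is why a bigger number helps in practice but is not a guaranteed fix.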
@ghukill thanks for opening this; I think we may be adding some notes here soon with some other related issues and/or edge cases we've been running into.
I'm just adding some errors I encountered:
Error importing emory:d743q to dev: 400 <?xml version="1.0" encoding="UTF-8"?><management:validation xmlns:management="http://www.fedora.info/definitions/1/0/management/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.fedora.info/definitions/1/0/management/ http://www.fedora.info/definitions/1/0/validation.xsd" pid="unknown" valid="true">
<management:contentModels>
</management:contentModels>
<management:problems>
<management:problem>Schematron validation failed:org.xml.sax.SAXParseException; lineNumber: 921; columnNumber: 2; The value of attribute "REF" associated with an element type "foxml:contentLocation" must not contain the '<' character.</management:problem>
</management:problems>
<management:datastreamProblems>
</management:datastreamProblems>
</management:validation>
ChecksumMismatch even with --archive-xml and --requires-auth, e.g.:
repo-cp --archive-xml --requires-auth prod dev emory:pg3k9
Traceback (most recent call last):
  File "/home/jsvarn/eulf/bin/repo-cp", line 137, in <module>
    repo_copy()
  File "/home/jsvarn/eulf/bin/repo-cp", line 121, in repo_copy
    requires_auth=args.requires_auth)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 104, in sync_object
    export_data = export.object_data().getvalue()
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 298, in object_data
    dsinfo = self.get_datastream_info(previous_section)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 258, in get_datastream_info
    infomatch = self.dsinfo_regex.search(force_text(dsinfo))
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/util.py", line 44, in force_text
    s = six.text_type(bytes(s), encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
When I actually catch the error for the one above, I get:
Unexpected error on emory:bcd79: <type 'exceptions.ValueError'> __len__() should return >= 0
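One possible explanation for the UnicodeDecodeError above, not confirmed anywhere in this thread: if the byte window happens to start in the middle of a multi-byte UTF-8 character, decoding fails at position 0. The sketch below reproduces that message with a made-up label containing "©" (encoded as `0xc2 0xa9`); the label and the `try_decode` helper are hypothetical. Content that is actually Latin-1 encoded could produce a similar error, so treat this only as an illustration.

```python
# Hypothetical label; "©" is encoded in UTF-8 as the two bytes 0xc2 0xa9.
label = "Photo © 2016".encode("utf-8")

# Simulate a window that begins on the second byte of "©".
cut = label.index(b"\xa9")
window = label[cut:]


def try_decode(chunk):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return chunk.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


text, err = try_decode(window)
print(err)

# Starting one byte earlier keeps the character whole and decodes cleanly.
whole_text, whole_err = try_decode(label[cut - 1:])
print(whole_text)
```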
I think setting a larger size for the chunk used for datastream info should be fine, and it shouldn't cause an issue with the regex since we're splitting on datastream start and end - that chunk shouldn't ever include datastream info for a previous datastream. My testing indicated that it worked fine for objects that can be successfully synced (excepting the problem record mentioned above, which seems to have other issues).
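As a rough sketch of that reasoning, under the assumption that the export is split on datastream end tags (the delimiters, regex, and sample data below are invented for illustration and are not taken from syncutil.py): each section holds only one datastream's header, so even a very large window cannot reach back into the previous datastream.

```python
import re

# Hypothetical delimiters and pattern, for illustration only.
DS_END = b"</foxml:datastream>"
CONTENT_START = b"<foxml:binaryContent>"
DSINFO_RE = re.compile(rb'<foxml:datastreamVersion ID="(?P<id>[^"]+)"')

# Made-up export with two datastreams.
foxml = (
    b'<foxml:datastream ID="DC"><foxml:datastreamVersion ID="DC.0" LABEL="Dublin Core">'
    b"<foxml:binaryContent>AAAA</foxml:binaryContent>"
    b"</foxml:datastreamVersion></foxml:datastream>"
    b'<foxml:datastream ID="MODS"><foxml:datastreamVersion ID="MODS.0" LABEL="MODS">'
    b"<foxml:binaryContent>BBBB</foxml:binaryContent>"
    b"</foxml:datastreamVersion></foxml:datastream>"
)


def datastream_ids(data, window):
    """Find one datastream version ID per section, searching a trailing window."""
    ids = []
    for section in data.split(DS_END):
        # Keep only the part before the binary content, then look at its tail.
        header = section.split(CONTENT_START)[0]
        match = DSINFO_RE.search(header[-window:])
        if match:
            ids.append(match.group("id"))
    return ids


# The window is far larger than any header, yet each section still yields
# only its own datastream version ID.
print(datastream_ids(foxml, 10000))
```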