I suspect a possible bug
eschulte opened this issue · 13 comments
It appears that parsing is getting thrown off by the attached hello.gtirb.zip protobuf file (zipped for upload).
Specifically, using:
- this library: https://github.com/grammatech/gtirb
- with this .proto file: https://github.com/GrammaTech/gtirb/blob/master/proto/Section.proto
- using this package: https://github.com/GrammaTech/gtirb/blob/master/cl/gtirb.lisp (but the package shouldn't matter, as the issue is in the protocol buffer parsing code, which is automatically generated by protoc-gen-lisp from the .proto files).
The CL parser parses the top-level IR, then down into the containing Module, then into the Section. The last message parsed successfully is the SectionFlag, after which point the parser is thrown off and it signals an error in skip-field, when it should have finished that section and moved on to the next one (the SectionFlag was the last message in that section). This same file appears to parse successfully using the C++ and Python protobuf implementations.
This should be reproducible by:
- cloning the above gtirb project into your quicklisp local-projects
- running (ql:quickload :gtirb)
- evaluating (gtirb:read-gtirb "/path/to/hello.gtirb")
Any advice?
FWIW, the following change (delimited with FIX HACK FIX comments) allows the attached file to parse successfully. I don't know why :)
(cl:defmethod pb:merge-from-array ((self section) buffer start limit)
  (cl:declare (cl:type com.google.base:octet-vector buffer)
              (cl:type com.google.base:vector-index start limit))
  (cl:do ((index start index))
      ((cl:>= index limit) index)
    (cl:declare (cl:type com.google.base:vector-index index))
    (cl:multiple-value-bind (tag new-index)
        (varint:parse-uint32-carefully buffer index limit)
      (cl:setf index new-index)
      (cl:case tag
        ;; bytes uuid = 1[json_name = "uuid"];
        ((10)
         (cl:multiple-value-bind (value new-index)
             (wire-format:read-octets-carefully buffer index limit)
           (cl:setf (cl:slot-value self 'uuid) value)
           (cl:setf (cl:ldb (cl:byte 1 0) (cl:slot-value self '%has-bits%)) 1)
           (cl:setf index new-index)))
        ;; string name = 2[json_name = "name"];
        ((18)
         (cl:multiple-value-bind (value new-index)
             (wire-format:read-octets-carefully buffer index limit)
           (cl:setf (cl:slot-value self 'name) (pb:string-field value))
           (cl:setf (cl:ldb (cl:byte 1 1) (cl:slot-value self '%has-bits%)) 1)
           (cl:setf index new-index)))
        ;; repeated .proto.ByteInterval byte_intervals = 5[json_name = "byteIntervals"];
        ((42)
         (cl:multiple-value-bind (length new-index)
             (varint:parse-uint31-carefully buffer index limit)
           (cl:when (cl:> (cl:+ new-index length) limit)
             (cl:error "buffer overflow"))
           (cl:let ((message (cl:make-instance 'proto::byte-interval)))
             (cl:setf index (pb:merge-from-array message buffer new-index (cl:+ new-index length)))
             (cl:when (cl:not (cl:= index (cl:+ new-index length)))
               (cl:error "buffer overflow"))
             (cl:vector-push-extend message (cl:slot-value self 'byte-intervals)))))
        ;; repeated .proto.SectionFlag section_flags = 6[json_name = "sectionFlags"];
        ((50)
         (cl:multiple-value-bind (value new-index)
             (varint:parse-int32-carefully buffer index limit)
           ;; XXXXX: when valid, set field, else add to unknown fields
           ;; FIX HACK FIX
           (cl:case value
             (3 (cl:incf new-index 3))
             (4 (cl:incf new-index 4)))
           ;; FIX HACK FIX
           (cl:vector-push-extend value (cl:slot-value self 'section-flags))
           (cl:setf index new-index)))
        (cl:t
         (cl:when (cl:= (cl:logand tag 7) 4)
           (cl:return-from pb:merge-from-array index))
         (cl:setf index (wire-format:skip-field buffer index limit tag)))))))
The problem looks like lack of support for unknown fields. The proto file says that tags 3 and 4 are reserved. Most likely the encoded data you're trying to read contains those fields. The generated protobuf code is supposed to either skip over unknown fields or (much better) retain them. Unfortunately, my code implements neither of those. A workaround for now is to define fields with the unknown tag numbers. Look back in the git history to see what types they were before they were removed, then add back deprecated fields with the same tag numbers and types:
int32 DEPRECATED_FOO = 3;
string DEPRECATED_BAR = 4;
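For reference, skipping an unknown field only requires looking at the wire type (the low three bits of the tag), which tells the parser how much payload to step over. Here is a minimal sketch of the idea in Python (the function names are hypothetical, not part of any library mentioned in this thread):

```python
def read_varint(buf, i):
    """Decode a base-128 varint starting at buf[i]; return (value, next_index)."""
    result = shift = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def skip_unknown_field(buf, i, tag):
    """Skip one unknown field's payload; return the index just past it."""
    wire_type = tag & 7
    if wire_type == 0:                  # varint
        _, i = read_varint(buf, i)
    elif wire_type == 1:                # fixed64
        i += 8
    elif wire_type == 2:                # length-delimited
        length, i = read_varint(buf, i)
        i += length
    elif wire_type == 5:                # fixed32
        i += 4
    else:
        raise ValueError(f"cannot skip wire type {wire_type}")
    return i
```

Retaining unknown fields works the same way, except the skipped byte range is saved instead of discarded.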
Ah, thanks for the explanation and suggested temporary work around.
(In this case I'm actually happy to know these fields are still being generated, because that should change upstream.)
I'll leave this issue open to track the lack of support for unknown fields.
Oh, actually the field numbers 3 and 4 in this case are not the deprecated IDs 3 and 4 on the Section message, but are the valid IDs 3 and 4 on the SectionFlag message. Given that, I'm not sure what the problem is here.
Yeah, there is definitely something wrong with (at least) repeated enums. See this minimal attached example (run (ql:quickload 'pbtest) (pbtest:run) in a REPL, then make at the shell).
Oh, after looking into this a little more the problem is that proto3 switched to a packed encoding for repeated enums by default and it looks like that is not currently supported.
https://developers.google.com/protocol-buffers/docs/encoding#optional
https://developers.google.com/protocol-buffers/docs/encoding#packed
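The two encodings differ only in framing, but that difference breaks a parser that assumes one tag/value pair per element. Note that tag 50 (field 6, wire type 2, i.e. length-delimited) is exactly the tag the generated code above matches for section_flags, yet it reads the payload as a single varint. A minimal Python sketch of both encodings for a repeated varint field (hypothetical helper names; field number 6 mirrors section_flags):

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_repeated_enum_unpacked(field_number, values):
    """proto2 default: one (tag, value) pair per element, wire type 0."""
    tag = encode_varint(field_number << 3 | 0)
    return b"".join(tag + encode_varint(v) for v in values)

def encode_repeated_enum_packed(field_number, values):
    """proto3 default: one length-delimited blob (wire type 2) of concatenated varints."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint(field_number << 3 | 2) + encode_varint(len(payload)) + payload
```

For field 6 with values [1, 2, 3], the unpacked form is 30 01 30 02 30 03 (tag byte 0x30 before each element) while the packed form is 32 03 01 02 03 (tag byte 0x32, then a length, then the bare values).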
This is a fairly serious bug which is fixed by my pull request. Do you anticipate having time to review this in the next couple of weeks?
As you have discovered, the problem has to do with the handling of packed repeated fields.
Unfortunately, your pull request is not a complete fix. For repeated fields of varint type, the protobuf reading code needs to be able to handle both packed and unpacked encodings. Currently, the generated protobuf message reading code dispatches on a combination of field tag and encoding method. Instead, it should dispatch only on field tag, and the code for a packable field should then look at the encoding method and handle both the packed and unpacked case. I have some code on a branch that's closer to what's required ... I'll try to commit it soonish.
I have implemented packed repeated fields. I have not tested your specific problem, but it's likely to be fixed. Please give it a try and let me know.
Confirmed, this is now fixed with your changes on the master branch. Thanks
That's great news. I will update this week--thanks for the note.