ChrisRx/quickavro

Optional enum encoding can fail

gth828r opened this issue · 8 comments

When an optional enum exists in a schema, part of the python_to_union call may pass in a None object into validate_enum. This will cause a segfault, because validate_enum expects to be able to do a lookup of the symbol name based on the provided value.

I'm less sure what The Right Way is for fixing this issue. For now, I put a check at the top of validate_enum of the form:

    if (obj == Py_None) {
        return -1;
    }

Another approach would be in your python_to_union function to just do the check for Py_None before you check for the enum type.

This one has some dependencies/potential merge conflicts with the other ones I am working on. I'm going to do this one in a second batch after the others get merged in.

This one turns out to only require another replacement of PyUnicode_Check with _PyUnicode_CheckExact in order to get enums variables in string form to work under Python2 (this time in the validate_enum function. No other change is needed, including the things I mentioned above in previous comments.

I am also working on a unit test, but that one conflicts with the unit test changes I made in #5. Once #11 is in I can start working on this one.

I actually did need to check for the None type in the end. That wasn't a problem for the unit tests here, but it was a problem for the schema we are working on in our project. I'm not really sure what the cause of that is, but the fix is to explicitly check for the none type in validate_enum. One difference in our schema is that any unions with a null value allowed always define the null value first. Our schema also has a bit of nesting of types, and it is far more complex. Unfortunately, I don't think I can share the schema. But I can confirm that the PR I will submit works with both my schema and with the unit tests.

I'm merging the rest of the commits and changing the version and pushing to PyPi. These have all looked great and I really appreciate including tests (I was going to ask for tests but you already had them)! Is there a reason you wanted to do all these as separate PRs? Nothing wrong with it, just wondering (I usually roll them into one myself).

As a side note, I'm thinking of changing the avro encoder to use Cython instead of AvroC/Janssen. This would allow it to compile on older versions of Windows trivially and I think might be a bit faster and easier to work with overall. I have a prototype locally that is already so much more simple but providing the same interface (it nearly drops right into the current files and works). But yeah this would be a separate branch as to not mess anything up and if you are interested in taking a look when I put that up it would be definitely very appreciated!

It auto-closed this issue and the others when I merged the PR BTW, so if you still had some things that were left feel free to open those back up or whatnot.

Is there a reason you wanted to do all these as separate PRs?

The main thing I really wanted to do was break out separate issues into independent pieces. I typically work on feature branches associated with issues, and then I merge those in / submit PRs based on those if there is no pre-defined process for a project. This ensures that individual commits can be pulled out later if a regression was introduced. It also allows for individual fixes to be iterated on during review without blocking the inclusion of other fixes which don't need as much review.

I suppose for this, I could have rolled up my feature branches into an integration branch and submitted that as a PR because the fixes were mostly one or two liners with an accompanying unit test change. In places where there were dependencies/conflicts (like the union unit test) it probably would have made more sense to do it that way.

I'm thinking of changing the avro encoder to use Cython instead of AvroC/Janssen

I'm certainly interested in trying it out locally to make sure it still works for our stuff. We are already using the deserialize path, so if it didn't work for us I'd definitely want to know that in advance.

That makes sense, I almost rolled them into a version branch, but ended up just doing them individually ultimately. Your point about being able to pull out regressions later is very agreeable, thanks for sharing that viewpoint!

I won't switch it to Cython while someone is still using it as is with the AvroC library. I would probably even just start another project since it is sufficiently different, so it shouldn't negatively affect anything.