pyca/pyopenssl

X509Name.get_components() doesn't process Subject values like X509Name.__getattr__() does with Unicode strings.

zeriny opened this issue · 2 comments

Hello,

I recently encountered a problem when parsing X.509 certificates with Unicode in the Subject DN fields.

An example PEM certificate (sha1=a324f399248e42e218ec40ae771bf27c4f5aea1d):
-----BEGIN CERTIFICATE-----
MIIFKjCCBBKgAwIBAgIBczANBgkqhkiG9w0BAQsFADBHMQswCQYDVQQGEwJVUzEW
MBQGA1UEChMNR2VvVHJ1c3QgSW5jLjEgMB4GA1UEAxMXR2VvVHJ1c3QgRVYgU1NM
IENBIC0gRzUwHhcNMTQxMTE2MTI0MzI1WhcNMTUwNzIxMDE1MzAyWjCB/TEdMBsG
A1UEDxMUUHJpdmF0ZSBPcmdhbml6YXRpb24xEzARBgsrBgEEAYI3PAIBAxMCREUx
GzAZBgsrBgEEAYI3PAIBAR4K/v8ASwD2AGwAbjESMBAGA1UEBRMJSFJCIDIxNjIw
MQswCQYDVQQGEwJERTEcMBoGA1UECBMTTm9yZHJoZWluLVdlc3RmYWxlbjEOMAwG
A1UEBxMFS29lbG4xODA2BgNVBAoTL1lhemFraSBFdXJvcGUgTGltaXRlZCwgWndl
aWduaWVkZXJsYXNzdW5nIEtvZWxuMSEwHwYDVQQDExhtYXRyaXgueWF6YWtpLWV1
cm9wZS5jb20wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDAqFmrBTfc
W2rj8JKBjp48snoSsWCUE/Sbt53eotH/LwngaLMoZpx4s2KD4G6SfD1NlooJaUAF
yDwT/2g4EaRCUN8RiRoPlilXGJosKi2evS3+rjCvd05Zy+v24hQR9MvH6ZRL2ArC
xG7yLl1WM2hNCBtytbuMyZoT4IToEqZl+mO1ev5eT2oiPRYnUT5r3Ok6LqiW12lp
b9L7rPBJJdrsdnA2FLJeC20/+rSWOryeDOYhdTPdNTReK1b4aNAGGYhkKdUQlba9
8dz0DJBi5MOkehkfYZTZqsNtFDE+rWDtuT5Q/5rGbAoUVhD1qdMiv5Tr6KcB2oe8
9WwhWdfQR4RLAgMBAAGjggFoMIIBZDAfBgNVHSMEGDAWgBQIJQchR2wx/ghYDqbq
L+fbG2JoyjBXBggrBgEFBQcBAQRLMEkwHwYIKwYBBQUHMAGGE2h0dHA6Ly9neC5z
eW1jZC5jb20wJgYIKwYBBQUHMAKGGmh0dHA6Ly9neC5zeW1jYi5jb20vZ3guY3J0
MA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIw
IwYDVR0RBBwwGoIYbWF0cml4LnlhemFraS1ldXJvcGUuY29tMCsGA1UdHwQkMCIw
IKAeoByGGmh0dHA6Ly9neC5zeW1jYi5jb20vZ3guY3JsMAwGA1UdEwEB/wQCMAAw
WQYDVR0gBFIwUDBOBgkrBgEEAfAiAQYwQTA/BggrBgEFBQcCARYzaHR0cHM6Ly93
d3cuZ2VvdHJ1c3QuY29tL3Jlc291cmNlcy9yZXBvc2l0b3J5L2xlZ2FsMA0GCSqG
SIb3DQEBCwUAA4IBAQCGqxvB42yVVQlneK7RNXM1pkFYYmwAnFbbLEPhOLoQOo/K
mk8k4X8pDEA6I6x73k7ejTDYdZUsEjEM3r1BJF2/XjPTB9rbfKqC518dyYVrtcdN
rUrb07ruRxS+scLFaYLztI42HQEeCVx+AaGWVrkZsz9oWY8k3WzCW8SQRQImLzVD
8z9rWEcCgDtGqjlrtmhlMFfVcP5bgBi5b8AbCDvhXJ3BThPGM7Ct/QCRzYXwr8WT
Tu9+isD+7UT+j9UzAhQKOw8jsaDblBG+ABNGJq1Egv19HxUpb+Toj5amY0NbZjbg
PRC+vKC1qyo5gXWj8ODHRvSLZ8aRueg5X4VdrvGN
-----END CERTIFICATE-----

In this certificate, the value of the Subject.jurisdictionLocalityName field is '\ufeffKöln'.

I first tried to read the field with certobj.get_subject().jurisdictionL (which internally calls the __getattr__() function) and got the correct value ('\ufeffKöln'.encode('utf-8') is b'\xef\xbb\xbfK\xc3\xb6ln').

However, when I try to get this field with certobj.get_subject().get_components(), it returns a list of (name, value) pairs in which the value of the jurisdictionL field is b'\xfe\xff\x00K\x00\xf6\x00l\x00n', which cannot be decoded as UTF-8.

I traced this inconsistency through the source code and found that:
X509Name.__getattr__() converts the value with _lib.ASN1_STRING_to_UTF8, whereas X509Name.get_components() calls ASN1_STRING_get0_data and ASN1_STRING_length directly, which returns the raw bytes of the value. Here those raw bytes cannot be decoded as UTF-8.
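
To illustrate the difference, here is a minimal sketch using only the standard library. It assumes the raw bytes are big-endian UTF-16 (as in an ASN.1 BMPString), which I have not confirmed from the certificate's ASN.1 tag. The variable names are just for illustration.

```python
# Raw value bytes that get_components() returns for jurisdictionL in this report.
raw = b'\xfe\xff\x00K\x00\xf6\x00l\x00n'

# These bytes are not valid UTF-8: 0xfe can never start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the value is big-endian UTF-16. Decoding with 'utf-16-be'
# keeps the leading byte order mark as the character U+FEFF.
text = raw.decode('utf-16-be')
print(repr(text))            # '\ufeffKöln'

# Re-encoding as UTF-8 gives the same bytes as the value from __getattr__().
print(text.encode('utf-8'))  # b'\xef\xbb\xbfK\xc3\xb6ln'
```

If that assumption holds, both methods return the same text and differ only in whether the value has been converted to UTF-8.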

I'm not familiar with Unicode, so I wonder whether this is a bug and which method is the correct way to parse an X.509 subject.

Code:

from OpenSSL import crypto

# pem holds the PEM certificate shown above
certobj = crypto.load_certificate(crypto.FILETYPE_PEM, pem)
subject_obj = certobj.get_subject()

# Attribute access goes through X509Name.__getattr__()
subject_jurisdictionLocalityName1 = subject_obj.jurisdictionL
print(subject_jurisdictionLocalityName1)

# get_components() returns a list of (name, value) pairs of bytes
subjects = subject_obj.get_components()
for subject in subjects:
    try:
        key = subject[0].decode()
        if key == 'jurisdictionL':
            print(subject[1])
            subject_jurisdictionLocalityName2 = subject[1].decode("utf-8")
            print(subject_jurisdictionLocalityName2)
    except Exception as e:
        print(e)

Output:
'\ufeffKöln'
b'\xfe\xff\x00K\x00\xf6\x00l\x00n'
UnicodeDecodeError('utf-8', b'\xfe\xff\x00K\x00\xf6\x00l\x00n', 0, 1, 'invalid start byte')

alex commented

Our X509 APIs are deprecated with the intent to remove them; people should use pyca/cryptography's X509 APIs instead.