CaliDog/certstream-server

Process terminated after error and no new process is started

Closed this issue · 11 comments

Hello,

First, thanks for this project, it's a great tool for CT :)

We installed our own instance of certstream-server, and after some time we noticed this message:
[error] GenServer #PID<0.300.0> terminating.
This error log happened after different errors which are :

  • ** (MatchError) no match of right hand side value
  • ** (RuntimeError) Unhandled CRL distrobution point
    The issue is that when this error happens, the process ends, and the log server watched by that process is no longer monitored because no new process is started.

Thanks in advance,
Best regards,

Marie-Luce.

Hi there, thanks for the report.

As I understand it, I've configured the supervisor chain to use a DynamicSupervisor with a :one_for_one strategy, which should restart the watchers in the event that a watcher process crashes.

def init(:ok) do
  children = [
    # Web services
    Certstream.WebsocketServer,

    # Agents
    Certstream.ClientManager,
    Certstream.CertifcateBuffer,

    # Watchers
    {DynamicSupervisor, name: WatcherSupervisor, strategy: :one_for_one}
  ]

  Supervisor.init(children, strategy: :one_for_one)
end

In this case the watcher process is crashing because EasySSL doesn't know how to parse a certificate's CRL distribution point (also 🤦‍♂️ for my typo there). That should have the dynamic supervisor kick in and restart it.

https://github.com/CaliDog/EasySSL/blob/1b5de1907b3a28185f8ee72a9c942729a09afca9/lib/easy_ssl.ex#L395

I think this will unfortunately mean we'll lose all certificates that watcher was parsing and restart from the top, though, as we currently don't have any persistence through something like an ETS table. PRs are welcome though if you want to add it! :D
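For anyone interested in that PR, here is a minimal sketch of what ETS-based checkpointing could look like, so a restarted watcher can resume from its last-seen position instead of starting from the top. All of the names here (:watcher_state, save_position/2, load_position/1) are hypothetical and are not part of certstream-server:

defmodule WatcherState do
  # Hypothetical checkpoint store, not part of certstream-server.
  # :public + :named_table lets any watcher process read/write the table.
  def init do
    :ets.new(:watcher_state, [:set, :public, :named_table])
  end

  # Record the last tree size a watcher has processed for a given CT log.
  def save_position(log_url, tree_size) do
    :ets.insert(:watcher_state, {log_url, tree_size})
  end

  # On (re)start, look up where this log's watcher left off; 0 if unknown.
  def load_position(log_url) do
    case :ets.lookup(:watcher_state, log_url) do
      [{^log_url, tree_size}] -> tree_size
      [] -> 0
    end
  end
end

One caveat with this approach: an ETS table dies with the process that created it, so init/0 would need to be called from a long-lived process (e.g. the supervisor or a dedicated Agent) rather than from the watchers themselves, or the checkpoints would be lost in exactly the crash scenario they're meant to survive.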

Oh also, this is where I add the watchers to the DynamicSupervisor's children/watch list. Definitely open to suggestions/bugfixes if it's not the proper way to do it!

def start_and_link_watchers(name: supervisor_name) do
  Logger.info("Initializing CT Watchers...")

  fetch_all_logs()
  |> Enum.each(fn log ->
    DynamicSupervisor.start_child(supervisor_name, child_spec(log))
  end)
end

Hi Fitblip,
Thanks for the quick answer.

A small caveat: I didn't know Elixir before looking at your code, so I'm not sure I understand everything correctly.

When I read your code and the Supervisor documentation, I agree with you: normally the DynamicSupervisor "should restart the watchers in the event that the watcher process crashes". But it doesn't seem to happen that way.
When I look at the logs for my instance of certstream-server, no watcher restarts after a watcher process crashes. In addition, I launched two instances, and after a few hours (once a watcher process crashed on one of them) the two instances no longer returned the same new certificates, because one instance had stopped monitoring the log server tied to the dead watcher process.

I was wondering how you handle this issue on your own server, certstream.calidog.io?

Hey @marieluce-allee,

Odd, I'll have to do some failure testing to see if I can reproduce it.

As for the certstream.calidog.io service, we run the same code as is here, so if these watchers are indeed dying, they're probably only resuming after Heroku restarts the instance (which happens automatically every night).

I'll dig into this and let you know what I find! Thanks for the info.

@marieluce-allee I spent some time this weekend toying with this to address the above issues. This took the form of two fixes:

First, I fixed EasySSL (the library that was barfing) to have saner failure/error handling (CaliDog/EasySSL#3). It feels more in line with Elixir thinking not to have things explode anyway 🤷‍♂️

Secondly, I did some failure testing and fixed some invocation issues with the current DynamicSupervisor setup. This should fix the crash-and-die-forever issue. The screenshot below should be what happens from now on 🎉
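To illustrate the kind of invocation issue that can bite here (this is a hypothetical sketch, not the actual diff): if a hand-built child spec ends up with restart: :temporary, a DynamicSupervisor will happily let a crashed child stay dead. Making the restart policy explicit in child_spec/1 avoids that surprise:

# Hypothetical sketch of an explicit child spec for a watcher.
# A GenServer's generated child_spec defaults to restart: :permanent,
# but a hand-written spec (or one overridden to :temporary) will NOT
# be restarted by DynamicSupervisor after a crash.
def child_spec(log) do
  %{
    id: log[:url],
    start: {Certstream.CTWatcher, :start_link, [log]},
    restart: :permanent   # always restart this watcher if it dies
  }
end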

[screenshot: watcher crashing and being restarted by the supervisor]

Please pull the latest version (https://github.com/CaliDog/certstream-server/releases/tag/1.1.0), and let me know if you still see issues!

Thanks @Fitblip for your answers and your work.

Now I understand why the issue wasn't seen before. It tends to happen after many hours, so if your certstream-server instance restarts every night, the issue was "fixed" every day :)
I will update certstream-server on my test instance and let you know if I still see issues.

Perfect, if I don't hear back from you in a while I'll close this out! Thanks again for the bug report :)

Hi,

Unfortunately, I still have the same error.

08:18:31.608 [info]  Worker #PID<0.256.0> with url ct.googleapis.com/logs/xenon2021/ found 1 certificates [58697 -> 58698].
08:18:31.608 [info]  GETing https://ct.googleapis.com/logs/xenon2021/ct/v1/get-entries?start=58697&end=58697
08:18:31.673 [error] GenServer #PID<0.256.0> terminating
 ** (MatchError) no match of right hand side value: {:error, {:asn1, {{:"Type not compatible with table constraint", {:error, {:asn1, {:wrong_tag, {{:expected, 5}, {:got, 21, {21, ""}}}}}}}, [{:"OTP-PUB-KEY", :dec_SignatureAlgorithm, 2, [file: 'OTP-PUB-KEY.erl', line: 15266]}, {:"OTP-PUB-KEY", :dec_OTPCertificate, 2, [file: 'OTP-PUB-KEY.erl', line: 15913]}, {:"OTP-PUB-KEY", :decode, 2, [file: 'OTP-PUB-KEY.erl', line: 1109]}, {:pubkey_cert_records, :decode_cert, 1, [file: 'pubkey_cert_records.erl', line: 40]}, {:public_key, :pkix_decode_cert, 2, [file: 'public_key.erl', line: 357]}, {EasySSL, :parse_der, 2, [file: 'lib/easy_ssl.ex', line: 64]}, {Certstream.CTParser, :parse_certificate_chain, 2, [file: 'lib/ct_parser.ex', line: 65]}, {Certstream.CTParser, :parse_entry, 1, [file: 'lib/ct_parser.ex', line: 27]}]}}}
     (public_key) public_key.erl:361: :public_key.pkix_decode_cert/2
     (easy_ssl) lib/easy_ssl.ex:64: EasySSL.parse_der/2
     (certstream) lib/ct_parser.ex:65: Certstream.CTParser.parse_certificate_chain/2
     (certstream) lib/ct_parser.ex:27: Certstream.CTParser.parse_entry/1
     (certstream) lib/ct_watcher.ex:153: anonymous fn/2 in Certstream.CTWatcher.broadcast_updates/2
     (elixir) lib/enum.ex:1314: Enum."-map/2-lists^map/1-0-"/2
     (certstream) lib/ct_watcher.ex:152: anonymous fn/2 in Certstream.CTWatcher.broadcast_updates/2
     (elixir) lib/enum.ex:765: Enum."-each/2-lists^foreach/1-0-"/2
 Last message: :update
Worker #PID<0.260.0> with url ct.googleapis.com/pilot/ found 16 certificates [476690279 -> 476690295].
...

As you can see, when an error happens (here in EasySSL) the process terminates (GenServer #PID<0.256.0> terminating), but no new process is started.

Hey there, I think I've figured out the issue and fixed it in the latest master. Can you pull it and try again? Things seem much more stable now!

Hi,
Great, I will pull it and try again.
After some days, I'll let you know if it seems more stable.

Thanks !

Sorry for the delay, but yes, it works fine now. Every time a process is killed, a new one is created.

Thanks :)