podaac/swodlr

Documentation for OPS error tracing

Closed this issue · 4 comments

@ymchenjpl doc is ready.

@viviant100 @joshgarde Good start on the doc. Here are my suggestions:

  1. Is there a way to add the user's Product_ID number to each of the different log group events? When I look at each CloudWatch log group, I only see the Request ID numbers, and when I expand the details, I don't see the user's Product_ID, which makes it very hard to troubleshoot.
  2. Under the Request section, please add the API URL and an example curl command.
  3. Under the Response section, please add an example of a fully completed Product_ID so we can see all the stages it goes through.
  4. For ERROR status, what are the corrective actions?
  5. If a user asks about a Product_ID that was started a week ago, the CloudWatch log groups will have many new logs since then, so what is the best way to find the old Product_ID and where it got stuck? Would the Kibana metrics dashboard be another way to troubleshoot?
  6. Also, how do we find the details for a user, and what requests they have submitted? OK, I see the example on the Getting Started page.

@ymchenjpl

Is there a way to add the user's Product_ID number to each of the different log group events? When I look at each CloudWatch log group, I only see the Request ID numbers, and when I expand the details, I don't see the user's Product_ID, which makes it very hard to troubleshoot.

Yes, every action taken as a result of a user request is logged with a product_id prefix. The only time that field isn't added is during initialization, before a user's request has been started. Example + explanation added - https://github.com/podaac/swodlr/wiki/User-Request-Error-Tracing-in-Production-Environments#example-log-entry

Under the Request section, please add the API URL and example of curl command

SWODLR probably isn't the easiest to query via curl because it's GraphQL driven. Maybe an example using Postman/Insomnia would be better? The API URL is also probably better suited to somewhere else in the docs.
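
That said, GraphQL requests are ordinarily plain HTTP POSTs with a JSON body, so a scripted request is possible if one is wanted in the docs. A minimal sketch in Python, where the endpoint URL, the bearer-token header, and the query's field names are all placeholders for illustration and are not taken from the SWODLR schema:

```python
import json
import urllib.request

# Hypothetical endpoint; the real API URL is not given in this thread.
API_URL = "https://example.invalid/api/graphql"

# Hypothetical query shape; the field names are illustrative only.
QUERY = """
query ($id: ID!) {
  product(id: $id) {
    id
    status { state timestamp }
  }
}
"""


def build_graphql_request(url, query, variables, token):
    """Build (but do not send) an HTTP POST carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; substitute whatever the deployment requires.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_graphql_request(API_URL, QUERY, {"id": "<product_id>"}, "<token>")
# Sending it would be: urllib.request.urlopen(request)
```

The same JSON body is what a curl command or a Postman/Insomnia request would carry.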

Under the Response section, please add example of a fully completed Product_ID so we can see all the stages it goes through.

The product ID provided as an example is a completed job currently in the UAT environment, but it's not guaranteed to stay there because we reset/update that environment. It also won't appear in the OPS/SIT environments.

For ERROR status, what are the corrective actions?

Depends on the error. In the description, I noted that ERRORs that result from SDS issues aren't correctable from the SWODLR system's perspective. The next step in those cases is to alert the dev team, and then the SDS team, so they can look into what occurred.

An error occurred during product generation; the user should be given some basic information about this error.
If the error was caused by the SDS, no further tracing can be performed on the SWODLR end of the system.

If a user asks about a Product_ID which was started a week ago, the CloudWatch log groups will have many new logs since then, so what is best way to find the old Product_ID and where it got stuck? Would the Kibana metrics dashboard be another way to troubleshoot?

The section on the page that covers timestamps should help with where to start tracing those types of cases. The basic idea is to grab SWODLR's status logs and their timestamps, then use those timestamps to narrow the time range to search in the log groups - https://github.com/podaac/swodlr/wiki/User-Request-Error-Tracing-in-Production-Environments#efficiently-tracing-errors
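
As a rough sketch of that idea: CloudWatch Logs searches take their start and end times as milliseconds since the Unix epoch, so the status timestamps can be turned into a padded search window. The function name, the five-minute padding, and the sample timestamps below are assumptions for illustration, not values from SWODLR:

```python
from datetime import datetime, timedelta


def trace_window(status_timestamps, padding=timedelta(minutes=5)):
    """Return (start_ms, end_ms) covering every status timestamp, plus padding.

    Timestamps are ISO 8601 strings with an explicit UTC offset.
    """
    times = [datetime.fromisoformat(ts) for ts in status_timestamps]
    start = min(times) - padding
    end = max(times) + padding
    # Convert to epoch milliseconds, the unit CloudWatch Logs time ranges use.
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


# Illustrative timestamps standing in for a product's first and last status.
statuses = ["2024-01-01T00:00:00+00:00", "2024-01-01T00:10:00+00:00"]
start_ms, end_ms = trace_window(statuses)
```

The resulting pair can be used as the time range of a log group search, with the product ID as the filter text, so a week-old request is not buried under newer logs.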

Also, how do we find the details for a user, and what requests they have submitted? Ok, I see the example in the Getting Started page.

Yeah, that's a future feature we're planning to implement this PI. We need to add administrator commands to allow that type of search. For now, the user should be able to get that information from the UI.

Hope this answers some of the concerns.