contiv-experimental/cluster

Proposal: return logs and status of the provision job as REST endpoint

mapuri opened this issue · 7 comments

This is a proposal to address #103.

I welcome comments on the proposed approach or any alternate approaches that might address this better.

Problem description:

At present the logs of a provisioning job are written using the standard stdout/stderr logging facility, which makes them accessible through systemd's journalctl.

This has a few issues:

  • journalctl prints all clusterm logs, so the user needs to manually filter out the relevant Ansible logs
  • there is no easy way to correlate the logs to a particular provisioning job that the user ran. For instance, back-to-back runs of two provisioning jobs may generate two sets of Ansible logs that are very close to each other in time and hence a bit difficult to tell apart.
  • there is no easy way to just get a job's status (without parsing through a lot of logs). The user can infer it from a host's lifecycle state, but that is slightly indirect and may not always imply a provisioning failure; for example, a failure to update the inventory state after provisioning was done can prevent the host's lifecycle state from being updated.

Proposal:

  • logs for a provisioning job shall be accessible through a new clusterm REST endpoint
    • this will not be a streaming interface, i.e. each call to the interface shall return all the logs recorded for the job so far.

Log Size consideration:

  • A very conservative Ansible log size estimate, assuming 10 nodes per job:
    • provision: ~1000 lines per node
    • cleanup (triggered in case of failure): ~500 lines per node
    • each line: about 250 chars
    • assume each job provisions 10 nodes
    • total: 1500 * 250 * 10 = 3.75M chars/bytes, i.e. roughly 4MB; call it ~5MB with some headroom

High-level changes:

  • given the log size estimate it might be ok to keep logs for one active and one last job in memory
    • in future if we need to store more historical logs, we can perhaps store the logs in a file (one per job) or just fall back to journalctl
    • one disadvantage of this, however, is that the logs will be lost if clusterm restarts. Is it critical to restore job logs on restart? Probably not, as these are more of a debug facility, and the logs will still be available in journalctl anyway.
  • Extend manager.Job to store the logs associated with a job in memory (a minimal sketch follows this list)
    • store the job structure for the active and the last job in the manager instance
    • going forward, if we see a use for extending this to the last few jobs, we will need to revisit this with respect to the log size requirements
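
The sketch below is one hypothetical way to wire this up; the type, field, and constant names are illustrative assumptions, not the actual clusterm code. The job doubles as an io.Writer so the Ansible runner's stdout/stderr can be captured straight into the in-memory buffer:

package manager

import (
    "bytes"
    "sync"
)

// JobStatus captures the lifecycle of a provisioning job.
type JobStatus int

const (
    Running JobStatus = iota
    Complete
    Errored
)

// Job holds the status and the accumulated logs of one provisioning job
// (~5MB worst case per the estimate above).
type Job struct {
    mu     sync.Mutex
    status JobStatus
    logs   bytes.Buffer
}

// Write lets the job be used as an io.Writer for the provisioner's
// stdout/stderr, appending to the in-memory buffer.
func (j *Job) Write(p []byte) (int, error) {
    j.mu.Lock()
    defer j.mu.Unlock()
    return j.logs.Write(p)
}

// Info returns a point-in-time snapshot of the job's status and logs.
func (j *Job) Info() (JobStatus, string) {
    j.mu.Lock()
    defer j.mu.Unlock()
    return j.status, j.logs.String()
}

// Manager keeps only the active and the last completed job in memory.
type Manager struct {
    activeJob *Job
    lastJob   *Job
}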

UX consideration:

REST API:

  • add a new REST endpoint: GET: info/job/active
    • this shall return the current logs and status of an active job
    • Note that it is not a streaming endpoint; each call to this endpoint shall return all logs available up to that point in the job's execution
  • add a new REST endpoint: GET: info/job/last
    • this shall return the recorded logs and status of the last run job (a handler sketch follows below)
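
A hypothetical handler for these two endpoints, continuing the Job/Manager sketch above; gorilla/mux-style routing is assumed here purely for illustration and is not necessarily what clusterm uses:

import (
    "encoding/json"
    "net/http"

    "github.com/gorilla/mux"
)

// jobGet serves GET /info/job/{job:active|last}. It is not a streaming
// handler: it returns all logs recorded so far in one JSON response.
func (m *Manager) jobGet(w http.ResponseWriter, r *http.Request) {
    job := m.activeJob
    if mux.Vars(r)["job"] == "last" {
        job = m.lastJob
    }
    if job == nil {
        http.Error(w, "no such job", http.StatusNotFound)
        return
    }
    status, logs := job.Info()
    json.NewEncoder(w).Encode(struct {
        Status JobStatus `json:"status"`
        Logs   string    `json:"logs"`
    }{status, logs})
}

// route registration, e.g.:
//   r := mux.NewRouter()
//   r.HandleFunc("/info/job/{job:active|last}", m.jobGet).Methods("GET")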

CLI:

  • clusterctl job get [active|last]
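
For completeness, the CLI maps onto the same endpoints; hitting them directly would look like the following (host and port are deployment-specific placeholders):

curl http://<clusterm-host>:<port>/info/job/active
curl http://<clusterm-host>:<port>/info/job/last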
vvb commented

@mapuri good proposal overall. Couple of things that come to my mind.

  • We also need a global timeline of events at a high level in one common place. If it is available as part of clusterm, instead of going through logs, it would be really good. Maybe another REST endpoint, info/jobs/timeline, which will list the sequence of the last X events that happened. It is like lastlog in Linux.
  • About whether to restore logs on a clusterm restart: I think we can do without it. But maybe as part of the restart/stop targets in clusterm.sh we could dump the in-memory logs to files that can be examined later manually.

mapuri commented

@vvb

We also need a global timeline of events at a high level in one common place. If it is available as part of clusterm, instead of going through logs, it would be really good.

yeah, this is a good point. But I think this might be more useful per node than per job, right?

It happens automatically in Collins already, i.e. when the state/status of a node changes in the inventory it also gets logged. We lost this with the boltdb-based inventory, though it shouldn't be that hard to get it done there as well. We could then fetch these event logs along with the node info (right now we just print the previous state and status of a node). This will be useful info. Let me track it with an issue, if it sounds reasonable.

maybe as part of the restart/stop targets in clusterm.sh we could dump the in-memory logs to files

good idea, but note that if clusterm crashes and restarts we lose these again. I think we can start simple and see how these logs get used. My feeling is that usually these would only be useful when the user wants to see what's going on with a job, or to find the failure in the last job in order to rectify and retry, so losing these on a clusterm restart might not be a big deal.

vvb commented

@mapuri just to be clear: by cluster timeline history I mean something that can present a sequence of events across nodes. The REST endpoint is more at a cluster level and not at any single node's level.

Time X:  node1 commissioned; result: pass
Time X+1: node2 commissioned; result: fail
Time X+2: node2 decommissioned; result: pass
Time X+3: node2 commissioned; result: pass
Time X+4: node3 commissioned; result: pass

mapuri commented

by cluster timeline history I mean something that can present a sequence of events across nodes

I see. This info can be derived if we have per-node history, right? Also, it still might not be suited to track at the job level, as we can at most keep info on the last N jobs (due to log size limits), whereas these event histories can potentially be kept for longer.

Node1:
  - Time X:   commissioned; result: pass
Node2:
  - Time X+1: commissioned; result: fail
  - Time X+2: decommissioned; result: pass
  - Time X+3: commissioned; result: pass
Node3:
  - Time X+4: commissioned; result: pass
vvb commented

Also, it still might not be suited to track at the job level.

Agreed, this is not a job-level endpoint; it is something that clusterm can maintain. Agreed that the data is available at a per-node level, but sequencing them in the right order is the key, and since clusterm is a single point of execution, it can very easily keep track of this. We don't have to save a long history; the last 20-30 events should be good enough to present a picture. This is one of those things we can ask customers to attach when they file an issue, just to know what all was done recently.

mapuri commented

Yes, I agree that the presentation of the data is what makes it useful.

However, since clusterm is primarily a RESTful service, we need to structure and organize the info around the correct set of constructs/resources/endpoints (for instance, job vs. node) inside clusterm; then we have all the info in the right places.

This will enable a user (or us) to feed it to some of the available (or custom) log and time-series analysis tools to make more sense of the data in different ways.

sequencing them in the right order is the key.

Continuing from above, here we just need to get the information structured correctly. For instance, node-level event logs must have timestamps, which will allow them to be sorted (see the sketch below).
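
For illustration only (hypothetical types, not clusterm code): given timestamped per-node event histories, the cluster-wide timeline is just a flatten-and-sort:

package main

import (
    "fmt"
    "sort"
    "time"
)

// NodeEvent is a single entry in a node's event history.
type NodeEvent struct {
    Node   string
    Action string // e.g. "commissioned", "decommissioned"
    Result string // "pass" or "fail"
    Time   time.Time
}

// clusterTimeline flattens per-node histories into one chronologically
// ordered list of events.
func clusterTimeline(perNode map[string][]NodeEvent) []NodeEvent {
    var all []NodeEvent
    for _, events := range perNode {
        all = append(all, events...)
    }
    sort.Slice(all, func(i, j int) bool { return all[i].Time.Before(all[j].Time) })
    return all
}

func main() {
    t := time.Now()
    perNode := map[string][]NodeEvent{
        "node1": {{Node: "node1", Action: "commissioned", Result: "pass", Time: t}},
        "node2": {
            {Node: "node2", Action: "commissioned", Result: "fail", Time: t.Add(time.Minute)},
            {Node: "node2", Action: "commissioned", Result: "pass", Time: t.Add(3 * time.Minute)},
        },
    }
    for _, e := range clusterTimeline(perNode) {
        fmt.Printf("%s: %s %s; result: %s\n", e.Time.Format(time.RFC3339), e.Node, e.Action, e.Result)
    }
}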

This is one of those things we can ask customers to attach, when they file an issue, just to know what all was done recently.

Again, following from above, we can just ask the customer to give us the clusterctl nodes get and clusterctl job get <> outputs. As long as the info is structured (JSON output can be useful here), we will have all the info that we can feed to the correct set of presentation/analytics scripts thereafter.

addressed by #117