taskcluster/taskcluster-tools

two 3000+ task taskgroup inspectors hang nightly

Opened this issue · 8 comments

I'm hoping to be able to open 2-3 active nightly graphs without having a hung browser. It's possible this is a nightly issue, and possible it's a tools.tc.net issue, and certainly possible that both could use improvements.

11:27 <aki> two tabs with taskgraphs of 3000+tasks + treeherder seem to hang nightly
11:31 <aki> i wonder if we can lazy load the tasks in a task group, kind of like how gmaps doesn't load the entire world's tiles, just the ones in the area you're looking at
11:32 — aki waves hands vaguely
11:33 <@dustin> sounds like a good PR for you to make :)
11:33 <@dustin> I think it only loads the taskgraph, not fetching each individual task
11:34 <aki> i do have the use case where i'm following a single task, and it's keeping track of the taskgraph

We just unified the two views (task inspector / taskgroup inspector) which is an overall win for UX. Hanging when I'm viewing a single task + a second taskgroup isn't ideal, though; I'm wondering if there's anything we can do on the tools.tc.net side to make this better.

(I do like the current unified UX quite a bit: the task completed, and I was easily able to then easily look at the current running tasks that had been unscheduled because they depended on the first task. It's the hang that I'm trying to address, so if we can address that without affecting the UI, that's the ideal solution for me.)

Here's the code for the loading of tasks, which is a recursive call:

async loadTasks(props, token) {
const { taskGroupId, queue } = props;
if (!taskGroupId) {
return;
}
this.updateLocalHistory(taskGroupId, taskGroupItemKey);
this.createGroupListener(taskGroupId);
try {
const { tasks, continuationToken } = await queue
.listTaskGroup(taskGroupId, token ? { continuationToken: token, limit: 100 } : { limit: 100 });
this.setState({
tasks: this.state.tasks ? this.state.tasks.concat(tasks) : tasks
});
if (continuationToken) {
await this.loadTasks(props, continuationToken);
}
} catch (err) {
this.setState({ error: err });
}
}

Upon looking at this, I do notice a couple problems:

  • The updateLocalHistory and createGroupListener is done every iteration, and should probably only be done if there is no token.
  • We only load 100 tasks at a time, we could probably raise this or make it a little smarter.
  • All this code is being awaited. We could probably put each iteration in a requestIdleCallback and try not to hang as much.

There is probably more we could do, too.

Filed on the Fx side as bug 1399229 .

I've been noticing such an issue for about a month, but I hadn't thought this is because of Task Group inspector, until last Friday. I'm on Linux, I thought this was a system issue, until I realize a text editor didn't show the symptom below:

Symptoms

Type some in text in a text field of whatever page and notice the input stops from time to times. From times to times, the input hangs. Sometimes because of this hang, the key gets repeated more than 10 times (as in "buggggggggggggggggggg").

Data set

I ran a profile, and I notice every spike happen in the same content process. This content process points to tools.taskcluster.net, on task-group-inspector like: https://tools.taskcluster.net/groups/NXi31S58R6OEAtWdahm9LQ (it contains more than 4400 tasks). Please note the whole task group had been loaded several minutes ago.

Interpretation

@julienw from the profiler team helped me to understand the issue. He narrowed down to this profile https://perfht.ml/2uLosxM. We see 3 issues:

  1. rendering the task list takes a lot of time even though nothing changes from a user POV (> 350 ms)
  2. componentWillReceiveProps() takes a significant amount of time (> 200 ms)
  3. zooming on another part of the profile, namely https://perfht.ml/2q2eza9, we can see the reduce call in Inspector is also costly (180ms)

General rendering time

There are several issues here:

  • rendering the whole list is long, maybe using something like https://github.com/bvaughn/react-virtualized could help here.
  • we noticed that in the reduce call in
    reduction[index] = task; // eslint-disable-line no-param-reassign
    we unconditionally store a new object, so the component will always render. We may want to check the status and keep the previous object in the array if the status hasn't changed.
  • The tasks array always changes even if nothing needs to change inside it. A solution could be to keep a boolean indicating that something changed. If false we'd keep the old tasks array instead of using the new one, cutting out the rendering process.

componentWillReceiveProps() too long

Sorting tasks by names takes > 200 ms. We may want to generate the sorted (by name) array as soon as we build in it (in the reduce call, for instance)

The reduce call itself

findIndex() has a complexity of n*m/2, n being > 4400 total tasks and m being 200 fetched tasks. This may be avoided by using a map, indexed by the slugId.

Another optimization (we don't know if that's already in place) could be to strip the task data to what's needed for the group details, i.e.: {slugId: {name, status}}.

Impact

Such a run happens every 2 seconds and causes a hang in the content process of approximately 800ms (each). Therefore, if somebody has like 10 taskgroups open (which may happen in releng), we may see more frequent hangs.

To sum up

@julienw and I investigated these potential explanation. We're not too familiar with this code, though. Feel free to ask us anything that isn't clear :)

What do you think @eliperelman ?

Thanks for the detailed information! You're right that a lot of this comes down to fetching and rendering too much data. While we could spend a lot of time trying to optimize this for displaying all the data we are fetching, we are actually going to tackle this another way: fetching less data.

We are iterating a new version of the web app [1] that speaks with our own gateway to the Taskcluster APIs via GraphQL [2], instead of talking to the APIs directly. Right now the APIs return an abundance of data, much more than the UI needs, and so we will wrap these APIs in a GraphQL layer that will cut this down significantly. My initial back-of-the-napkin estimates shows a data reduction of about 90%.

Second, this new implementation will stop loading the entire task group in a single pass. Instead, we are going to start paginating this data. We are also going to stop loading the task group when only viewing the task.

We currently use react-virtualized for the logviewer [3] implementation, but I don't think we will need to use it for the task group once we start paginating.

Feel free to follow along our progress in the links below, and let us know if you have any further advice or questions!

[1] https://github.com/taskcluster/taskcluster-web
[2] https://github.com/taskcluster/taskcluster-web-server
[3] https://github.com/mozilla-frontend-infra/react-lazylog

Hey @eliperelman !

Clearly rendering less data will make the problem less present.

May I suggest to at least implement the change in

reduction[index] = task; // eslint-disable-line no-param-reassign
?

I think you'd need to extract this part into a PureComponent too:

<tr key={`inspector-task-row-${index}`}>
<td>
<Label bsSize="sm" bsStyle={labels[task.status.state]}>
{task.status.state}
</Label>
</td>
<td>
<Link
to={`/groups/${taskGroupId}/tasks/${
task.status.taskId
}/details`}
replace>
{task.task.metadata.name}&nbsp;&nbsp;&nbsp;<Icon name="long-arrow-right" />
</Link>
</td>
</tr>
))

Then checking that the status actually changed before changing the task would IMO help a lot with the rendering, as no Task would actually be rendered if the "task" object doesn't change and the (new) TaskInspector component is pure.

I certainly think that's reasonable! @helfi92 thoughts?

For

reduction[index] = task; // eslint-disable-line no-param-reassign

I believe it was done purposely so that we can keep the ordering of tasks whenever the listener calls loadTasks. When something changes in the response, we don't want to change the ordering of the tasks that are already there as this leads the user to a lot of confusion.

@helfi92 I'm not opposing doing it when the status actually changes :) The problem is that we always change it, even when not needed.