eclipse-langium/langium

Any possibility of letting documents building process become synchronous conditionally?

jindong-zhannng opened this issue · 4 comments

The problem

I want to parse input in a class constructor, but currently the parserHelper and the underlying DocumentBuilder.build are asynchronous (as discussed).

The cause

DocumentBuilder.build is asynchronous because of these two lines:

await this.emitUpdate(documents.map(e => e.uri), []);
await this.buildDocuments(documents, options, cancelToken);

The first line is awaited because asynchronous event listeners could be registered from outside.

The second line is awaited to allow interruption and throttling while tasks execute. If the current task takes too long, it is interrupted and control goes back to the event loop, so that other pending tasks can take priority.
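To make the mechanism concrete, here is a minimal sketch of time-sliced yielding to the event loop. It is not Langium's actual implementation: `CancelToken`, `INTERRUPTION_PERIOD_MS`, `yieldToEventLoop`, `interruptIfNeeded`, and `processAll` are hypothetical names chosen for illustration.

```typescript
// Hypothetical stand-in for a cancellation token (not the real library type).
interface CancelToken {
    isCancellationRequested: boolean;
}

// Assumed time budget before control is handed back to the event loop.
const INTERRUPTION_PERIOD_MS = 10;

let lastTick = Date.now();

// Resolves on a later macrotask, so pending events get a chance to run first.
function yieldToEventLoop(): Promise<void> {
    return new Promise<void>(resolve => setTimeout(resolve, 0));
}

// Called between units of work: yields once the budget is used up,
// then checks whether cancellation was requested in the meantime.
async function interruptIfNeeded(token: CancelToken): Promise<void> {
    if (Date.now() - lastTick >= INTERRUPTION_PERIOD_MS) {
        await yieldToEventLoop();
        lastTick = Date.now();
    }
    if (token.isCancellationRequested) {
        throw new Error('Operation cancelled');
    }
}

// The work itself is synchronous; the function is async only because
// it calls interruptIfNeeded between items.
async function processAll(items: number[], token: CancelToken): Promise<number> {
    let sum = 0;
    for (const item of items) {
        await interruptIfNeeded(token);
        sum += item;
    }
    return sum;
}
```

Note how a single `await` inside the loop forces every caller up the chain to become async as well, which is the effect described above.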

The idea

For line 1: I wonder whether the listeners' results really matter here. Generally speaking, listeners should not affect the main workflow, IMO.

For line 2: using the native event loop to implement interruptions is pretty smart, and I can imagine how efficient and easy it is in a highly asynchronous environment such as a language server.

But I have to say it's a little tricky and implicit, because control over task scheduling belongs to the system rather than to us.

Besides, building documents should be synchronous by nature, IMO (I didn't find any I/O there; please correct me if I'm wrong), so the appearance of async is confusing.

If I were the designer, I would prefer generator functions (function*) for implementing interruptions. They express interruptible tasks naturally:

// define a generator
function* buildDocument() {
  yield step1()
  yield step2()
  return step3()
}

// execute it step by step
const gen = buildDocument()
gen.next()  // finish step 1
gen.next()  // finish step 2
gen.next()  // finish all

It hands control over scheduling the subtasks to the caller, who can then decide how to schedule them, either synchronously or asynchronously. That makes it easy to configure the behavior in the outermost layer.

An example async scheduler:

async function runAsync(gen, token) {
    let lastTick = Date.now();
    for (const stepResult of gen) {  // generators are iterable
        if (token === CancellationToken.None) {
            continue;
        }
        const current = Date.now();
        if (current - lastTick >= globalInterruptionPeriod) {
            lastTick = current;
            await delayNextTick();
        }
        if (token.isCancellationRequested) {
            throw OperationCancelled;
        }
    }
}

Generators can also be nested easily, as in this example.
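A minimal sketch of nesting with `yield*`, together with a synchronous scheduler as the counterpart to the async one. The step functions (`parseStep`, `linkStep`), `runSync`, and the log messages are hypothetical and only illustrate the pattern; they are not taken from Langium.

```typescript
// Hypothetical sub-steps; each `yield` marks a point where the caller may pause.
function* parseStep(log: string[]): Generator<void, string, void> {
    log.push('tokenize');
    yield;
    log.push('build AST');
    return 'ast';
}

function* linkStep(ast: string, log: string[]): Generator<void, string, void> {
    log.push(`link ${ast}`);
    yield;
    return `${ast}+links`;
}

// `yield*` delegates to the inner generator and evaluates to its return value,
// so the inner pause points are forwarded to the outer caller.
function* buildDocument(log: string[]): Generator<void, string, void> {
    const ast = yield* parseStep(log);
    return yield* linkStep(ast, log);
}

// Synchronous scheduler: drives the generator to completion without pausing.
function runSync<T>(gen: Generator<void, T, void>): T {
    for (;;) {
        const result = gen.next();
        if (result.done) {
            return result.value;
        }
    }
}

const log: string[] = [];
const built = runSync(buildDocument(log));
```

The same `buildDocument` generator could be passed to an async scheduler instead. One caveat of the `for...of` form is that it discards the generator's return value, which is why `runSync` calls `next()` explicitly.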

I know it's hard or even impossible to refactor the whole project to apply a new pattern, and I'm not sure whether a partial renovation is possible. I'm just sharing my thoughts here for reference; any comments are welcome.

Hey @jindong-zhannng, thank you for the input.

While I would welcome an optional sync document builder model, I'm not sure it's technically feasible. There are quite a few use cases which actually require real async behavior. I see two main points that we have seen in adopter projects or here in the forums:

  1. We have had questions/requirements like #1308, which needed to load additional documents during the document build lifecycle, abort the current build, and rebuild with the new documents.
  2. We have found that single-threaded parsing is a bottleneck for Langium on large projects (multiple million LoC). Parsing 1mil+ LoC takes roughly 15-20 seconds for Langium. The performance isn't the main issue, but rather that during this time the main thread is completely unresponsive, meaning that we cannot even abort the parsing process. That's why we recently introduced parser workers in #1352. Since this makes the parsing async, the document builder needs to be async as well.

I might be missing something here, but AFAIK a sync generator based approach wouldn't be able to handle those use cases. What do you think about this?

OK, I got it. Maybe I can think of the entire document building process as a pipeline of tasks.

As the comments in the code show:

// 0. Parse content
await this.runCancelable(documents, DocumentState.Parsed, cancelToken, doc =>
    this.langiumDocumentFactory.update(doc, cancelToken)
);
// 1. Index content
await this.runCancelable(documents, DocumentState.IndexedContent, cancelToken, doc =>
    this.indexManager.updateContent(doc, cancelToken)
);
// 2. Compute scopes
await this.runCancelable(documents, DocumentState.ComputedScopes, cancelToken, async doc => {
    const scopeComputation = this.serviceRegistry.getServices(doc.uri).references.ScopeComputation;
    doc.precomputedScopes = await scopeComputation.computeLocalScopes(doc, cancelToken);
});
// 3. Linking
await this.runCancelable(documents, DocumentState.Linked, cancelToken, doc => {
    const linker = this.serviceRegistry.getServices(doc.uri).references.Linker;
    return linker.link(doc, cancelToken);
});
// 4. Index references
await this.runCancelable(documents, DocumentState.IndexedReferences, cancelToken, doc =>
    this.indexManager.updateReferences(doc, cancelToken)
);
// 5. Validation
const toBeValidated = documents.filter(doc => this.shouldValidate(doc));
await this.runCancelable(toBeValidated, DocumentState.Validated, cancelToken, doc =>
    this.validate(doc, cancelToken)
);

It can be summarized as pseudocode like this:

pipe(
  parse,
  index,
  computeScope,
  link,
  indexReferences,
  validate,
)

AFAIK most of these tasks are in-memory operations and are therefore synchronous by nature.

As for the two cases you mentioned above, both come from introducing I/O into a few of the tasks:

  1. They'd like to import remote resources in their language. It would be better to add a new async step like resolveRemoteResources after the parsing step, IMO.
  2. Parsing becomes async because workers were introduced for better performance.

Both of these sources of asynchrony can be optional and configurable. If the whole architecture were more modular, the outer interface could be async only when there are async tasks, and synchronous when there are none.
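A rough sketch of what such a pipeline could look like, assuming each step may return either a value or a promise. `MaybePromise`, `Step`, `pipe`, the `Doc` shape, and the step names are all hypothetical and not part of Langium's API.

```typescript
// A step may return its result directly or as a promise.
type MaybePromise<T> = T | Promise<T>;
type Step<T> = (input: T) => MaybePromise<T>;

// Runs the steps in order. The result stays a plain value until some step
// actually returns a promise; from then on the rest is chained with `then`.
function pipe<T>(...steps: Step<T>[]): Step<T> {
    return (input: T): MaybePromise<T> => {
        let current: MaybePromise<T> = input;
        for (const step of steps) {
            current = current instanceof Promise ? current.then(step) : step(current);
        }
        return current;
    };
}

// Hypothetical document shape and steps, for illustration only.
interface Doc {
    text: string;
    state: string[];
}

const parse: Step<Doc> = doc => ({ ...doc, state: [...doc.state, 'parsed'] });
const link: Step<Doc> = doc => ({ ...doc, state: [...doc.state, 'linked'] });
const resolveRemoteResources: Step<Doc> = async doc =>
    ({ ...doc, state: [...doc.state, 'resolved'] });

// Only synchronous steps: the result is a plain Doc at runtime.
const syncResult = pipe(parse, link)({ text: 'a', state: [] });

// One async step: the result is a Promise<Doc> at runtime.
const asyncResult = pipe(parse, resolveRemoteResources, link)({ text: 'a', state: [] });
```

In this sketch the static return type is always `MaybePromise<Doc>`, so callers still have to check which case they got. Making the synchronous case visible in the type system would need overloads or a separate entry point, which is part of the added API complexity.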

Seems a generator is not the best option for this scenario, lol.

> Both of these sources of asynchrony can be optional and configurable. If the whole architecture were more modular, the outer interface could be async only when there are async tasks, and synchronous when there are none.

Right, I'm not against refactoring some of the document builder API - but I don't want to make it too complicated either. I'm fairly happy with the current state of most APIs in Langium (except for the completion and formatting APIs, but that's for different reasons).

Note that Langium actually used to have a sync document builder, back when we were still at version 0.1, see #244. While the initial PR only introduced asynchronous document building for interruption purposes, there are now more use cases for async behavior, as outlined above.

Sure, I understand the difficulty and agree that overly disruptive changes are not worth it at the current stage. Thanks for your input.