kotlinx/ast

Parsing large files is too slow

drieks opened this issue · 7 comments

These files are currently not included in SelfTest.kt because the processing does not finish within a reasonable time:

  • KotlinLexer.kt
  • KotlinParser.kt
  • UnicodeClasses.kt

Hi @martinflorek,

Please try version fd6123da02. Can you tell me the parsing time of the old and the new version? Thank you very much!

I cannot properly measure the parsing time in isolation, because I process several source code repositories at once and search them for specific files before parsing those.

But the new version runs a bit faster. My whole processing run went from ~33 seconds to ~32 seconds. The version using Kastree runs in 1.7 seconds.
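To separate the parsing time from the file search, the parser call alone could be wrapped in a timer. The following is only a minimal sketch: `ParseTimer` and `timed` are hypothetical names, not part of kotlinx.ast or Kastree, and the lambda body is a placeholder for the real parser call.

```kotlin
// Hypothetical helper: accumulates only the time spent inside the wrapped call,
// so file discovery and filtering are not counted.
class ParseTimer {
    var totalNanos = 0L
        private set
    var calls = 0
        private set

    val totalMillis: Double
        get() = totalNanos / 1_000_000.0

    fun <T> timed(parse: () -> T): T {
        val start = System.nanoTime()
        try {
            return parse()
        } finally {
            // Runs even if the parser throws, so failed parses are counted too.
            totalNanos += System.nanoTime() - start
            calls++
        }
    }
}

fun main() {
    val timer = ParseTimer()
    // Stand-ins for the contents of the files selected for parsing.
    val sources = listOf("fun a() = 1", "class B")
    for (source in sources) {
        // Placeholder: replace the lambda body with the actual parser call.
        timer.timed { source.length }
    }
    println("parsed ${timer.calls} inputs in ${timer.totalMillis} ms")
}
```

Running the same inputs through both parser versions with such a timer would give comparable numbers, although JVM warm-up can still distort short runs.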

I refactored kotlinx.ast so that it is now possible to use either antlr-kotlin or antlr-java to parse Kotlin sources.
Example: https://github.com/kotlinx/ast/blob/master/grammar-kotlin-parser-antlr-java/src/test/kotlin/kotlinx/ast/example/ExampleMain.kt

But sadly, it seems that antlr-kotlin is not much slower than antlr-java, so switching to antlr-java alone does not make parsing much faster. I will try to figure out how to speed up parsing.

@ShikaSD pointed me to antlr-optimized, so I implemented support for this antlr fork in kotlinx.ast. But sadly, it is not as fast as hoped.
I will try to implement a lexer and parser using antlr4 grammar files, only supporting the features that are required to parse kotlin files.
I already added support to parse antlr4 grammar files for this use case in kotlinx.ast:grammar-antlr4-parser-antlr-java.

The time for ./gradlew clean check was reduced from 3min 30s in commit c7dd6bb to 2min 30s in commit f088b3c.

Because of this, all Kotlin files are now scanned in the self test.

This still needs to be sped up; I think we need some patch to the Kotlin parser/lexer for this.

The build time for commit 95db180 is 44s, so we can assume that testing the previously excluded files takes around 1 minute 45s.

  • KotlinLexer.kt
  • KotlinParser.kt
  • UnicodeClasses.kt
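The estimate above is a simple subtraction, sketched here with `kotlin.time.Duration`. It assumes that the 2min 30s build includes the three files and the 44s build does not; `estimateExcludedFilesCost` is a hypothetical name used only for illustration.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.minutes
import kotlin.time.Duration.Companion.seconds

// Assumption: the only relevant difference between the two builds is
// whether the previously excluded files are scanned.
fun estimateExcludedFilesCost(fullBuild: Duration, baselineBuild: Duration): Duration =
    fullBuild - baselineBuild

fun main() {
    val withAllFiles = 2.minutes + 30.seconds // build time reported for commit f088b3c
    val baseline = 44.seconds                 // build time reported for commit 95db180
    val cost = estimateExcludedFilesCost(withAllFiles, baseline)
    println("estimated cost of the three files: ${cost.inWholeSeconds} s")
}
```

The difference is 106 seconds, which matches the rough figure of 1 minute 45s.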

Can you have a look at my comment in #50? Why is a large garbage string faster to parse than a large string containing JSON?