/indexer

Primary language: Kotlin. License: MIT.

What

A basic (text) file indexer library in Kotlin. Given a character sequence, an indexer finds all occurrences of that sequence as a substring in a given set of files. The primary use case is indexing a moderately-sized codebase for subsequent searches.

How to build

gradle(w) clean shadowJar kotlinSourcesJar

Place indexer-$version-all.jar somewhere on your classpath. Point your IDE towards indexer-$version-sources.jar if needed.

How to use

Point an indexer towards the directory you need indexed

val index = IndexBuilderCoroutines()
    .with(dirName)
    .buildAsync().await()

Shoot your queries at it once the build is done:

// this gets you just a set of filenames that contain this query string
// on the plus side, it doesn't have to read the files for that
val entry = index.query("foobar")
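The README does not say how the index is organised internally. As a rough illustration of how a substring query can be answered without re-reading any file, here is a minimal sketch of a positional trigram index. The class name TrigramIndex and its methods are hypothetical and are not part of this library's API.

```kotlin
// Hypothetical sketch, not the library's implementation: a positional trigram index.
// For every 3-character window it remembers which file it came from and at which offset.
class TrigramIndex {
    // trigram -> (file name -> offsets at which the trigram starts in that file)
    private val postings = mutableMapOf<String, MutableMap<String, MutableSet<Int>>>()

    fun add(fileName: String, text: String) {
        // The range is empty when the text is shorter than three characters.
        for (i in 0..text.length - 3) {
            postings.getOrPut(text.substring(i, i + 3)) { mutableMapOf() }
                .getOrPut(fileName) { mutableSetOf() }
                .add(i)
        }
    }

    // Returns the names of the files that contain `query` as a substring.
    // Assumption of this sketch: queries are at least three characters long.
    fun query(query: String): Set<String> {
        require(query.length >= 3) { "this sketch only supports queries of 3+ characters" }
        val grams = (0..query.length - 3).map { query.substring(it, it + 3) }
        val candidates = postings[grams[0]] ?: return emptySet()
        return candidates.filter { (file, starts) ->
            // A real match at `start` means trigram number k of the query sits at start + k.
            starts.any { start ->
                grams.indices.all { k -> postings[grams[k]]?.get(file)?.contains(start + k) == true }
            }
        }.keys
    }
}
```

Storing offsets makes the answer exact at the cost of memory; an index that stored only file names per trigram would be smaller but could only return candidate files.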

To get more details:

// this gets you lines and line positions for each file
// on the flip side, index has to re-read the files for that
val richEntry = index.queryAndScan("lorem ipsum")
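To make "lines and line positions" concrete, here is a minimal sketch of the kind of scan that needs the file contents. The Hit type and the scan function are hypothetical names; the actual shape of the value returned by queryAndScan is described in the library's javadoc, not here.

```kotlin
// Hypothetical result type for this sketch: 1-based line and column of a match.
data class Hit(val line: Int, val column: Int)

// Finds every occurrence of `query` inside each line of `text`.
// Matches that span a line break are not reported by this sketch.
fun scan(text: String, query: String): List<Hit> {
    require(query.isNotEmpty()) { "query must not be empty" }
    val hits = mutableListOf<Hit>()
    text.lineSequence().forEachIndexed { lineIndex, line ->
        var from = line.indexOf(query)
        while (from >= 0) {
            hits += Hit(lineIndex + 1, from + 1)
            // Advance by one character so overlapping matches are found too.
            from = line.indexOf(query, from + 1)
        }
    }
    return hits
}
```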

To update the index:

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)

var index = indexBuilder.buildAsync().await()

// something something something

runBlocking {
    //update is a suspend fun. Control flow is completely up to you.
    index = indexBuilder.update()
}

Advanced usage

Want more filesystem roots? Sure. As many as you would reasonably want.

val indexBuilder = IndexBuilderCoroutines()
    .with(dirNameA)
    .with(dirNameB)
    .with(listOf(dirNameC, dirNameD, dirNameE))

Want only specific files? Apply a file filter. The default behaviour is to accept every file. Directories are always accepted.

val filter = object:java.io.FileFilter { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .filter(filter)
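As one possible filter, here is a minimal sketch that keeps only Kotlin and Java sources. java.io.FileFilter is a standard JDK interface; the names sourceExtensions and sourcesOnly and the choice of extensions are assumptions made for this example.

```kotlin
import java.io.File
import java.io.FileFilter

// Hypothetical choice of extensions for this sketch.
val sourceExtensions = setOf("kt", "java")

// Accepts directories (so traversal can continue) and files with a matching extension.
val sourcesOnly = FileFilter { file ->
    file.isDirectory || file.extension.lowercase() in sourceExtensions
}
```

An instance like this can be passed wherever the snippet above passes filter.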

Want only large files? Or only small files? Or want to run a complex heuristic on file contents? There's an extension point for that. See the javadoc for details. The default behaviour is to accept everything; there's a sample whitelist-based implementation that discards a file as soon as it encounters too many non-whitelisted characters.

val inspector = object: org.maurezen.indexer.ContentInspector { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .inspectedBy(inspector)
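The ContentInspector signature is not shown in this README, so the following is only a standalone sketch of the whitelist idea described above, not an implementation of that interface. The function name looksLikeText, the printable-ASCII whitelist and the meaning of maxRejected are all assumptions.

```kotlin
// Standalone sketch of a whitelist heuristic (assumed semantics, not the library's):
// give up on the content as soon as more than `maxRejected` characters fall outside the whitelist.
fun looksLikeText(content: CharSequence, maxRejected: Int): Boolean {
    var rejected = 0
    for (ch in content) {
        // Assumed whitelist: printable ASCII plus common whitespace.
        val allowed = ch == '\n' || ch == '\r' || ch == '\t' || ch in ' '..'~'
        if (!allowed && ++rejected > maxRejected) return false // bail out early
    }
    return true
}
```

Bailing out early means a large binary file costs only a few characters of inspection rather than a full read.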

Want to deal with non-standard file formats or encodings? Implement your own reader. See the javadoc for details. The default behaviour is to assume files are UTF-8 encoded.

val reader = object: org.maurezen.indexer.FileReader { /* whatever */ }
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
    .readBy(reader)
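The FileReader interface's methods are not listed here, so this is only a sketch of the decoding logic a custom reader might wrap: try strict UTF-8 first and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The function name readUtf8OrLatin1 and the fallback charset are assumptions.

```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction

// Decodes the file as UTF-8, rejecting malformed input instead of silently replacing it.
// If strict decoding fails, the bytes are reinterpreted as ISO-8859-1 (assumed fallback).
fun readUtf8OrLatin1(file: File): String {
    val bytes = file.readBytes()
    return try {
        Charsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        String(bytes, Charsets.ISO_8859_1)
    }
}
```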

Want to share index between threads? Share a builder instance and request an index.

//thread A
val indexBuilder = IndexBuilderCoroutines()
    .with(dirName)
//thread B
val index = indexBuilder.get()

Have a more prolonged lifecycle? Want an update? Keep a builder instance to yourself and trigger a build again when needed. indexBuilder.get() keeps returning the previous index version until the new computation completes.

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName) 

var index = indexBuilder.buildAsync().await()

//things happen here
//...
//and now it's time for a refresh

index = indexBuilder.buildAsync().await()
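The "previous version until the new computation completes" behaviour can be pictured as a snapshot swap. Here is a minimal sketch of that pattern using only the JDK; SnapshotHolder is a hypothetical class and says nothing about how IndexBuilderCoroutines is actually implemented.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch: readers always see a complete snapshot, never a half-built one.
class SnapshotHolder<T>(initial: T) {
    private val current = AtomicReference(initial)

    // Safe to call from any thread; returns the last fully built snapshot.
    fun get(): T = current.get()

    // Builds a fresh snapshot and publishes it only after `compute` has finished.
    // Assumption of this sketch: rebuilds are not started concurrently with each other.
    fun rebuild(compute: () -> T): T {
        val fresh = compute()
        current.set(fresh)
        return fresh
    }
}
```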

Changed your mind and don't want that refresh anymore?

val indexBuilder = IndexBuilderCoroutines()
    .with(dirName) 

var indexDeferred = indexBuilder.buildAsync()

indexDeferred.cancel()

Performance

No robust benchmark setup exists yet. Here is anecdotal data for indexing all the files of an intellij-community-master snapshot dated late 2020 on a mostly idle (under 10% background usage) Ryzen 9 5950X:

Size: 563 MB (590,787,068 bytes)
Contains: 120,686 Files, 26,343 Folders
Created: Saturday, December 12, 2020

IndexBuilderCoroutines()
    .with(INTELLIJ_COMMUNITY_MASTER)
    .inspectedBy(WhitelistCharacterInspector(5))
    .filter(ACCEPTS_EVERYTHING)
    .buildAsync().await()

-Xmx    Time
12g     26.7s
2g      41.3s

Likewise, no robust memory footprint measurement exists, but a ready-to-query index of intellij-community-master has a memory footprint of ~230 MB. The indexing process itself requires anywhere from 12 GB down to as little as 1 GB, depending on indexing pipeline settings that trade throughput for memory footprint.