ostafen/clover

Most efficient solution for paging query results with a big collection

axllent opened this issue · 16 comments

Hi there. Firstly thank you for this cool project - it seems quite well-suited for what I am building! I am hitting some performance issues though, so thought I'd explain what I'm doing (to give some context) and where the two bottlenecks are, and hopefully there is a different approach which I have simply overlooked.

So first of all, I'm using CloverDB to store parsed emails. Each document contains some basic info about the email, including from, to, subject, timerstamp it was received (time.Time), as well as the entire raw email. Then I have a frontend which displays a paginated overview of the message "summaries" (ie: the basic data excluding the raw email itself), from newest to old (25 per page). The bottlenecks are caused by two things here, namely:

  1. Sorting by received timestamp (.Sort(clover.SortOption{Field: "Created", Direction: -1}))
  2. A count of all records (db.Count(mailbox))

With about 20k emails in a collection a typical request for 25 records (let's say the latest 25) takes about 9 seconds, which includes a Count() of all documents. Removing the Count() of the all documents in the collection reduces the request to about 5 seconds, and then when also removing the sort I get the results in about 0.7 seconds.

What is the most efficient manner to reverse sort the order that documents are stored in the collection?
What is the most efficient way to count the total number of documents in a collection?

In relation to the the above, I am considering spitting the email "summary fields" (to, from, subject etc) from the actual raw email content (which can also contains attachments), and storing them separately in two separate collections. Before I refactor all my code, do you think this approach is better? Preliminary tests (which simply exclude storing the raw email data in my collection) appear to more than halve the execution time above, although I don't know whether by storing the raw data in a separate collection it would slow things down again (I don't know quite how badger handles collections).

Thank you for your time.

Hi, @axllent. Initially, clover has been designed for very simple workloads, with a few thousand of documents.
This is why, it currently lacks some optimization which could be performed in specific situations.

That's also why I'm working on a v2 version of the library which add several important additions (index support, for example).

Counting all records in a collection is a query which could be performed very efficiently, simply by storing internally a count of all collection documents (actually, time is wasted because documents are always iterated).

Regarding sorting, If I understand correctly, in your specific use case, you need to get your documents in reverse insertion order. Actually, when iterating on a collection, documents are returned in sort order with respect to the "_id" special field, which is a v4 uuid. This means that insertion order is not preserved.
Since this can be desirable in some situations, like your one, I'll replace uuids with ulids, which preserve insertion order.

I'll work on both this aspects during this week. Your perfomance issues should totally disappear after this.

@ostafen you understood perfectly what I am trying to do, and the ordering of documents in the reverse order they were stored. That's amazing, I'll definitely wait for the v2 release as it sounds like it will solve the performance issues. Thanks!

Just to have all the information, do you run the query each time the user switches to a new page, using skip and limit?

P.S. If you try the v2 branch, you should already see an improvement, since I implemented the Count() optimization

Yes @ostafen, I do run the query each time using skip and limit (and the database is left open while the daemon is running). I have also just tried the v2 branch and can confirm that the Count() issue is completely fixed ;-)

After my initial bug report yesterday I did end up separating the raw email data into a separate collection from the "summary collection" which made a significant speedup: on that laptop the requests dropped from 9s down to about 6s per request. Clearly the amount of data in each document does reflect in the sorting processing time (keeping in mind that some documents have anywhere between 2-9MB's worth of attachments encoded in in the raw email data). That raw data is not required when listing the emails (to, from, subject etc) so it made sense to split this out.

I'm currently at work on a different machine, and I have a slightly different email dataset, but the general conclusions are the same: (With the separate summary/raw-data collections) - the v1 of the branch (including the total count) takes around 3.2s per query. With the v2 branch it takes 1.6s (including the total count)! That's great progress.

Another point noting is that in one your earlier (deleted) comments, your example converted the time.Time to an int64 - whereas I am storing it as a time.Time. I though this may make a speed difference if I stored it as as an int64 instead, however that makes little difference to the current sort speed (so I've gone back to storing it as a time.Time).

@axllent, after thinking a bit about it, I came out with a solution for your problem, which I think is the best possible, since it allows you to perform exactly n disks accesses, where n is the size of your page.

Basically, we create an index on the ""Created" field with the following statement:

db.CreateIndex("mailbox", "Created")

Supposing that we retrieve page sequentially, one by one, we can use the following query to retrieve the i-th page:

db.Query("mailbox").Where(clover.Field("Created").Lt(instant)).Sort(clover.SortOption{Field: "Created", Direction: -1}).Limit(25)

where instant is set to time.Now() if for i = 0, and to the timestamp of the last document of page i-1 if i > 0.
Fetching a page takes less than a millisecond on my PC with 100k documents. Using indexing, the condition inside the Where() can be exploited to seek directly at the beginning of the page we want to fetch, and, since the index stores values in sorted order, sorting is implicit, so the query is satisfied in just 25 steps.

Using skip() without a Where() clause is not equally efficient, because skip() simply discard the first n documents, but is terrible in terms of disk accesses as soon as you get closer and closer to the last page (when fetching the last page, you skip almost the entire collection size, because you skip, in your case, 20k-25 documents).

This script simulates fetching pages sequentially. Note that it is important for this to work to use timestamps at nano precision (it should also work with time.Time), so that each document gets an unique timestamp.

Remember to fetch the last commit of the v2 branch, before running the script, because I implemented sorting with indexes yersterday.

Let me know how this works on your side :=)

package main

import (
	"log"
	"time"

	"github.com/ostafen/clover"
)

func checkErr(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	db, err := clover.Open("./db")
	defer db.Close()
	checkErr(err)

	has, err := db.HasCollection("test")
	checkErr(err)

	if !has {
		err = db.CreateCollection("test")
		checkErr(err)
		err = db.CreateIndex("test", "timestamp")

		n := 100000
		for i := 0; i < n; i++ {
			doc := clover.NewDocument()
			doc.Set("timestamp", time.Now().UnixNano())
			_, err := db.InsertOne("test", doc)
			checkErr(err)
		}
	}

	avgQueryTime := 0.0
	count := 0
	var lastDoc *clover.Document = nil
	for {
		count++

		var instant int64
		if lastDoc == nil {
			instant = time.Now().UnixNano()
		} else {
			instant = lastDoc.Get("timestamp").(int64)
		}

		sortOpt := clover.SortOption{Field: "timestamp", Direction: -1}
		now := time.Now()
		all, err := db.Query("test").Where(clover.Field("timestamp").Lt(instant)).Sort(sortOpt).Limit(25).FindAll()
		checkErr(err)

		avgQueryTime += time.Since(now).Seconds() * 1000

		if len(all) > 0 {
			lastDoc = all[len(all)-1]
		} else {
			break
		}
	}

	log.Println("query time: ", avgQueryTime/float64(count), "milliseconds")
}

P.S. I deleted the previous comment, because I was about to post this solution
P.P.S I changed the issue title to make it more easier to find for users which jump into the same isssue

@ostafen This is very interesting - I would not have thought that solution would be more efficient, but (using variations of your example) that is exactly what it is - MUCH faster, specifically at the low "page numbers"!

It does get slower the higher the page number you want (iterations of your solution, which is by the sounds of it just like how Skip() works internally anyway), for instance is you are looking for "page 1" the query using this method is vs 2ms vs 4s (using the standard Skip()), but if you're looking for page 1500 is takes 1.9s vs 4s. So with 100k documents / 25 per page, there is a total of 4,000 pages - and when looking for page 4,000 (ie: the last results) there is basically no time difference between the two methods.

Personally this is not an issue at all for me, as the low page numbers (eg: "page 1" - or Skip(0)) is by far the most common. Tomorrow afternoon I will refactor my code to use this method and report back. Thanks again for your suggestions (and software)!

@axllent, yes, it is true that if you jump to a page which is not consecutive with respect to the current one, you have to iterate the query intermediate pages. This is unavoidable.

However, as soon as you move to the next page (or to the previous, if you run the inverse query) query time remains constant, which is not true with the Skip() based method (if you want page n, you have to always skip all documents of the previous n-1 pages, regardless of your current position).
Moreover, if you cache the values of lastDocument.timestamp, for the pages you already visited, you can avoid iterating the process for already visited pages.

Going further, you could also store the timestamps marking the beginning of a each page to a separate collection (this can be done at insertion time, every 25 documents), and use this intermediate collection to know exactly the seek point of each page.

Anyway, the most important part here is the use of the index, which allows us to save sorting time.

Naturally, feel free to adopt the solution you feel more confident with and which better suits your needs :=)
For further feedback, just let me know

@axllent, just so you know, I found another bottleneck, which is the gob encoding, which used internally to store documents.
I tried to switch to msgpack encoding, and tested the following query:

_, err = db.FindAll(clover.NewQuery("test").Sort(clover.SortOption{Field: "timestamp", Direction: -1}).Skip(20000).Limit(25))

On 100k documents, query time decreased from 2.5 secs to 1.2 secs

There's some fairly large functional changes in v2, but once I made the necessary updates I can (very) happily report that I'm now getting page loads in < 10ms (including the HTTP request / network latency)! Obviously this increases slowly as explained before when dealing with much larger offsets (which aren't an issue in my case, and even at their worst they are already significantly faster than what I had before), but this definitely solves my issues. Basically I'm down from 9 seconds to < 10ms! That is absolutely brilliant, thank you!

Obviously I have more refactoring to do, and I assume you'll be making more major changes in v2 before the release, so I'm not sure whether to leave this issue open for now until v2 is released as there may be other areas of database slowness I'll come across as I put together the various parts of the app.

Thanks again @ostafen for an awesome project!

@axllent, very happy to know that you solved your issue!

Obviously I have more refactoring to do, and I assume you'll be making more major changes in v2 before the release, so I'm not sure whether to leave this issue open for now until v2 is released as there may be other areas of database slowness I'll come across as I put together the various parts of the app.

Yes, let's leave this open until the v2 version will be released. In every case, if you experiment other problems, which are not directly related to this one, I suggest you to open new issues

@axllent, just to let you know, performance should have increased significantly with the latest commit :=)

@ostafen That's awesome! I can definitely confirm a speedup (though I am not sure my case/tests are necessarily best-suited because they also involve email parsing and saving to two collections for each "save" and "delete").

I have two different benchmarks: insert plain text emails, and insert MIME emails containing an inline image and an attachment (looping over the default Go test for i := 0; i < b.N; i++ {....). For each "save", the email is parsed (to, from, subject, etc) and the summary is stored in one collection, and then the raw email "as-is" is stored in another collection (same database). For each delete, all messages from both collections are deleted in a loop (db.Delete(clover.NewQuery(mailbox).Limit(2500)) & db.Delete(clover.NewQuery(mailbox+"_data").Limit(2500))).

Then I have two separate tests - to time how long it takes to insert and then delete 1000 emails for both the plain text email as well as the one with attachments.

This will however give you an indication, and as you'll note, definitely a speed-up:

Before the upgrade:

go test ./storage -bench=.
...
cpu: AMD FX(tm)-8350 Eight-Core Processor           
BenchmarkImportText-8             592           2033505 ns/op
BenchmarkImportMime-8             183           7183201 ns/op
PASS
go test ./storage -v
=== RUN   TestTextEmailInserts
    database_test.go:36: inserted 1,000 text emails in 2.032587771s
    database_test.go:52: deleted 1,000 text emails in 169.32953ms
--- PASS: TestTextEmailInserts (2.26s)
=== RUN   TestMimeEmailInserts
    database_test.go:76: inserted 1,000 emails with mime attachments in 6.387462882s
    database_test.go:92: deleted 1,000 emails with mime attachments in 214.877622ms
--- PASS: TestMimeEmailInserts (6.95s)
=== RUN   TestRetrieveMimeEmail
--- PASS: TestRetrieveMimeEmail (0.03s)
PASS

After the upgrade:

go test ./storage -bench=.
...
cpu: AMD FX(tm)-8350 Eight-Core Processor           
BenchmarkImportText-8             649           1785256 ns/op
BenchmarkImportMime-8             207           6637866 ns/op
PASS
ok      github.com/axllent/mailpit/storage      12.160s
go test ./storage -v
=== RUN   TestTextEmailInserts
    database_test.go:36: inserted 1,000 text emails in 1.739683013s
    database_test.go:52: deleted 1,000 text emails in 55.179396ms
--- PASS: TestTextEmailInserts (1.86s)
=== RUN   TestMimeEmailInserts
    database_test.go:76: inserted 1,000 emails with mime attachments in 6.089239978s
    database_test.go:92: deleted 1,000 mime emails in 81.383402ms
--- PASS: TestMimeEmailInserts (6.38s)
=== RUN   TestRetrieveMimeEmail
--- PASS: TestRetrieveMimeEmail (0.02s)
PASS

Great work! I'm slowly finishing off my project which I hope to release (open source of course) in the next week or so (there are still a few things I need to finish before it is "ready enough"). I don't mean to rush you though, but do you have any plans for the v2 release? If not then I can continue to link to the latest commit of the v2 branch, that's not a problem - I was more curious than anything ;-)

Hi, @axllent. I think I'll release the v2 by the end of next week. I'll release the v2 in a beta version, in order to allow the community to try it and discover potential bugs, before releasing a full stable version.

EDIT: I just released the v2 under the tag v2.0.0-alpha.1

@ostafen So did I (beta) in case you're interested what I'm using this for https://github.com/axllent/mailpit :)
I could not work out how to link the module to your alpha tag so I'm using the git commit hash instead.

Very nice project!
By the way, you should be able to link the library with the command:

go get github.com/ostafen/clover/v2@v2.0.0-alpha.1

@axllent, I'm closing this, since the issue is pretty much solved