Find X-largest values in a large file
Get the X-largest values in a large file. Each line in the file contains and .
Example:
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
To find top-X values, we can use a heap.
First, we will maintain a min-heap of size X.
Then, we can put elements into the heap.
If the size of the heap is bigger than X, we can remove the top element of the heap.
So the time complexity and space complexity depends on the size of the heap.
Complexity | Value |
---|---|
Time | O(NlogX) |
Space | O(X) |
If we have a huge file, we can not read all the lines into memory.
We can split the huge file into many smaller files by hash.
Then we can read each file and put the elements into the heap.
Now we have the top x elements of each small file.
Finally, we can merge the top x elements and get the top x elements of the huge file.
By the way, I prepared a tool to generate a huge file.
It will generate a unique identifier by snowflake and a random number into the huge file.
The tool will pick the line whose number >= 99900 and write it into a new file.
This file named gen_top_x.txt
can help us to test more conveniently.
Goroutine can use the cores of the machine.
So we can calculate the number of files by the number of cores.
Assume a core can take 16
times the memory of the original dataset.
And we can use 1GB of memory.
So our formula is:
tmpFileSize := MaxMemorySize / (coreCount * 16)
processNum := fileSize / tmpFileSize
e.g. we have 1 GB of the huge file, and we have 4 cores.
temp file size = 1 GB / (16 * 4) = 0.015625 GB = 16 MB
process num = 1 GB / 16 MB = 64
First, make sure you installed the golang: golang install
Second, you can run the following command to run:
make build
You will get two bin files in the dist
folder.
- generate.bin can generate a huge file to help you test.
- main.bin is a cli tool to get the top x elements.
- run
./dist/main.bin -h
to get help for this command. - run
./dist/main.bin -file=<file path> -x=<top xx>
to get the top x elements of the file. - run
./dist/main.bin -x=<top x>
to get the top x elements of the stdin.
- main.bin will print the top x elements to stdout.
To test the program, you can run the following command:
make test
The test will generate a huge file and then get the top x elements of the huge file.
It will take a long time to generate a huge file and split it, just relax ☕️.
When everything is done, you can get the coverage of the program.
The test report is:
PASS
coverage: 66.8% of statements
total (statements) 87.8%
The test program can cover the main process of the program.
Integration test is main_test.go
.
Unit tests are topx/*_test.go
.
GitHub action is a great tool to test the program.
On my personal computer, I got the following performance using 1GB of huge file:
./dist/main.bin -file=tmp/gen_records.txt -x=394
Step | Time |
---|---|
Split 165 Files | 4m1.5s |
Top X | 21.75s |
My computer info:
- OS: macOS 12.2.1 21D62 arm64
- CPU: Apple M1 Pro
- Memory: 32768MiB
Due to limited time, I didn't do much optimization on splitting and test coverage.