Topn implemented by Go.
Topn-go
is used to find the top n
most frequently occurring urls from a large file. It can process huge files with
limited memory efficiently.
To handle a large file, we can firstly split it into small files, then deal with them one by one easily. So the pipeline is like this:
- Partition a large file into small files that can be placed in memory. (Map phase)
- Merge and sort urls in small files, and find the top n candidates from each file. Then sort all candidates and write to file. (Reduce phase)
The Reduce
phase is a classic parallel problem. Since the sort operation on small file are independent of each other,
we run them in parallel naturally. I leverage min-heap
to store the top n candidates from each file, then merge
them in another min-heap and write it to file.
So the biggest problem is how to partition file efficiently in Map
phase. In the preliminary implementation, i think
of the relation between file reader and writers like producer and consumers. The producer reads file line by line, then
send the url to corresponding chan
in Go-lang. The consumers use select
to listen its chan, receive url, and append
it to file. It looks a reasonable pattern in this scenario. However, it is inefficient due to the feature of Hard Disk
Drive(HDD).
For each read and write to the disk, the latency highly depends on the track location. In the Producer-Consumer
pattern,
consumers are writing url simultaneously. Every time the data is received, a write operation is triggered. Lots of time
is wasted in seeking tracks. So the efficiency is very low.
Inspired by Combiner
in map-reduce model, we can apply Combiner function that does partial merging of url before it
write to the disk. I maintain a buffer map to to keep url pairs, and combine the url pair if they have the same address.
When map size reached a certain value, pairs in map will be flush to corresponding file.
Experimental results show that partial combining significantly reduces redundant disk IO and speeds up the MapReduce
operation.
How to run topn-go:
make test_topn
How to clean up all test data:
make clean
How to generate test data:
make gendata
NOTE: go 1.12 is required
MIT License