Proposal: Dragonfly supports P2P based on Streaming to reduce disk IO
lowzj opened this issue · 6 comments
Background
Currently the Dragonfly client randomly reads and writes the disk multiple times during the download process.

When using `dfget` directly to download a file:

- `dfget` randomly writes each piece to disk after downloading it
- `dfget server` randomly reads the piece from disk to share it
- `dfget` sequentially reads the whole file from disk after downloading to compute its checksum

And when using `dfdaemon` to pull images, Dragonfly adds extra disk IO:

- `dfdaemon` sequentially reads the file from disk to send it to `dockerd`

This is not a problem when the host has a local disk. But it becomes a potential bottleneck when the Dragonfly client runs on a virtual machine with a cloud disk: all the disk IO turns into network IO, whose performance degrades badly when reads and writes happen at the same time.

So a solution is needed to reduce the IO generated by Dragonfly.
Idea
P2P Streaming is a streaming-based P2P mode: it delivers the data downloaded via the p2p pattern directly to the user, in order to read from and write to disk as little as possible.
P2P Streaming Data Flow
This diagram describes the p2p streaming data flow.
- `Piece Data Cache` stores pieces' data in memory so they can be shared with other peers. A piece's data should be put into this cache after downloading, and be evicted according to the `LRU` strategy when the cache is full.
- `StreamIO` sends pieces' data to callers in ascending order of `piece's number`.
- In the scenario of using `dfdaemon` to pull images (and similar cases), `dfdaemon` and `dfget` should be merged into one process, which saves the cost of starting a separate `dfget` process.
- `dfget` can also run as a standalone process to download files directly.
P2P Streaming Sliding Window
The `P2P Streaming Sliding Window` is designed to control how many pieces of a file can be scheduled and downloaded at once, to avoid unbounded memory usage. The idea comes from the TCP sliding window, but its minimal transmission unit is a `piece`, not a `byte`.

`Memory Cache` is the `Piece Data Cache` used to share pieces in the p2p network. The larger the cache, the higher the p2p transmission efficiency.
Great job! I am willing to help.
- start a long-lived peer server in dfdaemon
- move the download logic into DownloadContext, change the result type to http.Response
- use a library to download pieces instead of the dfget binary
- supernode supports generating the taskID with the http header range
- extract a new Writer interface for ClientWriter
- add a new ClientStreamWriter that implements Writer to write downloaded pieces to a stream instead of a file
- ClientStreamWriter implements both io.ReadCloser and io.Writer
- when registering a task to the supernode, return the file metadata, e.g. file length; dfdaemon should request the source directly
Hello @lowzj! I'd like to pick this up as part of GSoC.
Hi, @lowzj I would like to work on this as a part of ASoC2020. Could you guide with what is the current state of development since it appears it has been worked on by others as well?
It has multiple tasks. We provide one task of it for ASoC2020: optimizing the scheduling algorithm of supernode for p2p-streaming. You can work on it.