quantcast/qfs

Where is the function 'Evicting Chunks' implemented in code?


Hi, I have a question.

Where is the function 'Evicting Chunks' implemented in the code? I could not find it.
'Evicting Chunks' is described in Section 2.3.4, "Evicting Chunks", of the paper "The Quantcast File System".

Thanks.

Hi,

You can find the related functions in src/cc/chunk/ChunkManager.cc. For example, the following function schedules an evacuation operation on a chunkdir: https://github.com/quantcast/qfs/blob/master/src/cc/chunk/ChunkManager.cc#L6739

Evacuation of a chunkdir is triggered by creating an "evacuate" file under that chunkdir path. A chunkserver periodically checks whether the "evacuate" file exists, and if it does, the chunkserver starts the evacuation. See https://github.com/quantcast/qfs/wiki/Administrator's-Guide#evacuation for how evacuation of a chunk directory is triggered.
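To make the trigger concrete, here is a minimal sketch of the marker-file mechanism described above. It is not the code from ChunkManager.cc; the function names `request_evacuation` and `dirs_pending_evacuation` and the directory names are hypothetical, and only the "evacuate" file name comes from this thread.

```python
import os
import tempfile

EVACUATE_MARKER = "evacuate"  # marker file name mentioned in this thread


def request_evacuation(chunk_dir):
    """Operator side: create an empty marker file in the chunk directory."""
    marker = os.path.join(chunk_dir, EVACUATE_MARKER)
    with open(marker, "w"):
        pass  # the file's existence is the signal; its content is unused here
    return marker


def dirs_pending_evacuation(chunk_dirs):
    """Chunkserver side (hypothetical): one pass of the periodic check.

    Returns the chunk directories that currently contain the marker file.
    """
    return [d for d in chunk_dirs
            if os.path.isfile(os.path.join(d, EVACUATE_MARKER))]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, name) for name in ("chunkdir0", "chunkdir1")]
        for d in dirs:
            os.mkdir(d)
        request_evacuation(dirs[1])
        # Only the directory holding the marker is reported.
        print(dirs_pending_evacuation(dirs))
```

In a real chunkserver the check would run on a timer and hand each reported directory to the evacuation logic; the sketch only shows the detection step.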

Best,

Mehmet

Hi Mehmet,

Thanks for your reply! I have another related question.

How does QFS handle a chunk server failure? Is there a specific repair protocol for a failed chunk server?
My guess is that QFS could trigger a chunk-server-failure repair using the evacuation operation. By "chunk-server-failure repair" I mean that all the chunks of the failed chunk server are recovered on a new chunk server, using either replication or RS recovery. Is that right?

Thank you!

BTW, our goal is to evaluate the repair time for a chunk server failure in QFS.

Yours sincerely,
Xiaoyang

Hello again,

Actually, evacuation is used for planned downtime such as repairs or upgrades, as indicated in the paper. For chunkserver recovery, you're right. Once the metaserver discovers that a chunkserver is gone and probably will not come back, it recovers the chunks of the failed chunkserver on another chunkserver, using either RS recovery or replication.

That's interesting. Are you guys planning to make a comparison with other filesystems? In any case, we'd be very interested in hearing back on your progress & results.

I also encourage you to sign up to QFS JIRA: https://quantcast.atlassian.net
You can post any questions/feature requests/bugs there as well.

Let me know if you have any further questions.

Best,

Mehmet

Dear Mehmet,

Nice to meet you! I am on the same team as xiaoyangzhang1993.

Thank you for your reply. It really helps! We will sign up to QFS JIRA soon.

Yes, we are trying to compare QFS with HDFS-RAID (Hadoop 20) and to implement more coding schemes (such as LRC, RC, etc.) apart from erasure coding. We think the implementation of QFS is pretty neat.

And we have a further question: could you tell us who (metaserver or the new chunk server) does the RS recovery when recovering the lost chunks?

Thanks.

Yours sincerely,
Yuchong

Hi @YuchongHu,

updated:

The metaserver initiates recovery for a disconnected/failed chunkserver after a timeout period. The timeout gives the chunkserver a chance to reconnect, so recovery starts only once it is unlikely to come back. For each chunk of the failed chunkserver, the metaserver picks one of the remaining chunkservers, and the chosen chunkserver performs the RS recovery for that chunk. In other words, the metaserver orchestrates the recovery, but the work itself is a collaborative effort by the remaining chunkservers.
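The division of labor can be sketched as a small simulation. This is an illustration under stated assumptions, not the metaserver's code: `plan_recovery`, its parameters, and the least-loaded selection rule are hypothetical, and a real placement policy would also have to consider things this sketch ignores, such as which servers already hold other chunks of the same stripe.

```python
def plan_recovery(chunks_by_server, last_seen, now, timeout):
    """Return {chunk_id: recovering_server} for chunks on timed-out servers.

    chunks_by_server: server name -> list of chunk ids it holds
    last_seen:        server name -> time of its last contact
    A server counts as failed once now - last_seen exceeds the timeout.
    """
    dead = {s for s, t in last_seen.items() if now - t > timeout}
    alive = sorted(s for s in last_seen if s not in dead)
    if not alive:
        return {}  # nobody left to perform the recovery

    # Assumed policy: give each chunk to the currently least-loaded survivor.
    load = {s: len(chunks_by_server.get(s, [])) for s in alive}
    plan = {}
    for server in sorted(dead):
        for chunk in chunks_by_server.get(server, []):
            target = min(alive, key=lambda s: (load[s], s))
            plan[chunk] = target  # the target chunkserver rebuilds this chunk
            load[target] += 1
    return plan


if __name__ == "__main__":
    chunks = {"cs1": ["c1", "c2", "c3"], "cs2": ["c4"], "cs3": []}
    heartbeats = {"cs1": 0, "cs2": 95, "cs3": 98}
    # cs1 was last seen 100 time units ago, which exceeds the timeout of 30.
    print(plan_recovery(chunks, heartbeats, now=100, timeout=30))
    # prints {'c1': 'cs3', 'c2': 'cs2', 'c3': 'cs3'}
```

The planner only decides who recovers what; the actual reconstruction happens on the chosen chunkservers, which is why recovery work spreads across the cluster.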

For a more detailed discussion, you can check the following group post: https://groups.google.com/forum/#!searchin/qfs-devel/recovery/qfs-devel/AXbgqXkMY7s/ZwLf7J2SeRoJ

@xiaoyangzhang1993 after reading the posts again: yes, I believe you can trigger the repair behavior with evacuation, since evacuation uses recovery. Disconnecting the chunkserver from the metaserver should also work fine.

Hope this helps,

Mehmet

Dear Mehmet,

We have got a lot of useful information from you :)
Really appreciate your help!

Yours sincerely,
Yuchong

Sure. Looking forward to seeing your results.
Mehmet