A Tutorial for Research on the Testing and Analysis of Distributed Software Systems

Last Updated: 2021-10-15

Document Maintenance Logs:

Date | Author | Content
2021/09/25 | Yang Feng | Initialized the document: added the required courses and reading list.
2021/10/04 | Zheyuan Lin | Added the GFS and Raft papers to the reading list.
2021/10/08 | Zheyuan Lin | Added the Failify project and its tests.
2021/10/15 | Zheyuan Lin | Edited the name of the tutorial. Modified the project location.

1. Courses

Required Courses

MIT 6.824: Distributed Systems

Homepage Link: https://pdos.csail.mit.edu/6.824/

Bilibili Link: https://www.bilibili.com/video/BV1CU4y1P7PE?share_source=copy_web

UIUC CS 425: Distributed Systems

Homepage Link: https://courses.engr.illinois.edu/cs425/fa2020/index.html

Cornell CS 5414 (Fall 2012): Distributed Computing Principles

Homepage Link: https://www.cs.cornell.edu/courses/cs5414/2012fa/02.outline.html

Brown CSCI-1380, Spring 2021: Distributed Computer Systems

Homepage Link: http://cs.brown.edu/courses/csci1380/

Bilibili Link: https://www.bilibili.com/video/BV1ds411T7pp/

2. Reading List

[Book-A] Coulouris, George, Jean Dollimore, Tim Kindberg, and Gordon Blair. "Distributed Systems: Concepts and Design" (Chinese edition). Computer Education 10 (2013).

[Chang2008] Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26, no. 2 (2008): 1-26.

[Lamport2019] Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." In Concurrency: the Works of Leslie Lamport, pp. 179-196. 2019.

[Abadi2012] Abadi, Daniel. "Consistency tradeoffs in modern distributed database system design: CAP is only part of the story." Computer 45, no. 2 (2012): 37-42.

[Aguilera2009] Aguilera, Marcos K., Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis. "Sinfonia: A new paradigm for building scalable distributed systems." ACM Transactions on Computer Systems (TOCS) 27, no. 3 (2009): 1-48.

[Burrows2006] Burrows, Mike. "The Chubby lock service for loosely-coupled distributed systems." In Proceedings of the 7th symposium on Operating systems design and implementation, pp. 335-350. 2006.

[Corbett2013] Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 1-22.

[Dean2008] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51, no. 1 (2008): 107-113.

[Ghemawat2003] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." In Proceedings of the nineteenth ACM symposium on Operating systems principles, pp. 29-43. 2003.

[Ongaro2014] Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 305-319. 2014.

[DeCandia2007] DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. "Dynamo: Amazon's highly available key-value store." ACM SIGOPS operating systems review 41, no. 6 (2007): 205-220.

[Freiling2011] Freiling, Felix C., Rachid Guerraoui, and Petr Kuznetsov. "The failure detector abstraction." ACM Computing Surveys (CSUR) 43, no. 2 (2011): 1-40.

[Du2017] Du, Min, Feifei Li, Guineng Zheng, and Vivek Srikumar. "Deeplog: Anomaly detection and diagnosis from system logs through deep learning." In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1285-1298. 2017.

[Nagaraj2012] Nagaraj, Karthik, Charles Killian, and Jennifer Neville. "Structured comparative analysis of systems logs to diagnose performance problems." In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 353-366. 2012.

[He2021] He, Shilin, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R. Lyu. "A survey on automated log analysis for reliability engineering." ACM Computing Surveys (CSUR) 54, no. 6 (2021): 1-37.

[Fu2009] Fu, Qiang, Jian-Guang Lou, Yi Wang, and Jiang Li. "Execution anomaly detection in distributed systems through unstructured log analysis." In 2009 ninth IEEE international conference on data mining, pp. 149-158. IEEE, 2009.

[Yuan2020] Yuan, Xinhao, and Junfeng Yang. "Effective Concurrency Testing for Distributed Systems." In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1141-1156. 2020.

[Dogga2021] Dogga, Pradeep, Karthik Narasimhan, Anirudh Sivaraman, Shiv Kumar Saini, George Varghese, and Ravi Netravali. "Revelio: ML-Generated Debugging Queries for Distributed Systems." arXiv preprint arXiv:2106.14347 (2021).

3. Projects

Failify: https://github.com/failify/failify

Example-hdfs: https://github.com/MartyLinZY/example-hdfs

4. Datasets

To be added.