/dedup_simulation

A deduplication simulator which was used for several research papers.

Primary LanguageGo

dedup_simulation

This deduplication simulator was used for two research papers [1,2]. An earlier, java-based version of this simulator was used in an other paper [5]. It uses deduplication traces as generated by Meister's fs-c tool (https://github.com/dmeister/fs-c) as input.

The implemented deduplication approaches are:

  • container caching The basic deduplication approach introduced by Zhu et. al [3]
  • Sparse Indexing The basic deduplication approach introduced by Lillibridge et. al [4]
  • Block Locality Caching The basic deduplication approach introduced by Meister et. al [5]
  • Sorted Chunk Indexing The basic deduplication approach introduced by Kaiser et. al [2]

As discussed in [1], the first three approaches are possible in a version performing exact deduplication as well as a version performing approximate deduplication. Both versions are included. The approximate versions of container caching and Block Locality Caching are called sampling. There is no extra sampling version of Sparse Indexing as the design for its approximate and exact version is too small to implement them in two files. In fact, only the exact version is simulated and it's possible to extract the results for the approximate version based on the result files. Please see [1] for a discussion about the design.

Implementation

The simulator is written in Go (https://golang.org/). It depends on two third-party libraries:

  • protocol buffers: Meister's fs-c tool encodes the deduplication traces as protobuf messages. Therefore, this simulator must decode this messages to perform its task. It uses the gogoprotobuf library (https://github.com/gogo/protobuf) for this tasks because of performance reasons. The .proto message definitions are in the de_pc2_dedup_fschunk directory.
  • seelog : A simple logging library for cmd-line outputs. (https://github.com/cihub/seelog)

References

[1] Jürgen Kaiser, Dirk Meister, André Brinkmann, and Tim Süß. Deriving and Comparing Deduplication Techniques Using a Model-Based Classification. In Proceedings of the European Conference on Computer Systems (EuroSys). ACM, 2015.

[2] Jürgen Kaiser, Tim Süß, Lars Nagel, and André Brinkmann. Sorted Deduplication: How to Process Thousands of Backup Streams. In Proceedings of the 32th IEEE Conference on Massive Storage Systems and Technologies (MSST). IEEE, May 2016.

[3] Benjamin Zhu, Kai Li, and R. Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST)

[4] Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezis, and Peter Camble. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST)

[5] Dirk Meister, Jürgen Kaiser, André Brinkmann. Block Locality Caching for Data Deduplication. In Proceedings of the 6th International Systems and Storage Conference. ACM, 2013