publication . Preprint . 2017

Optimal Repair Layering for Erasure-Coded Data Centers: From Theory to Practice

Hu, Yuchong; Li, Xiaolu; Zhang, Mi; Lee, Patrick P. C.; Zhang, Xiaoyang; Zhou, Pan; Feng, Dan
Open Access English
  • Published: 12 Apr 2017
Abstract
Repair performance in hierarchical data centers is often bottlenecked by cross-rack network transfer. Recent theoretical results show that cross-rack repair traffic can be minimized through repair layering, which partitions a repair operation into inner-rack and cross-rack layers. However, how repair layering should be implemented and deployed in practice remains an open issue. In this paper, we address this issue by proposing a practical repair layering framework called DoubleR. We design two families of practical double regenerating codes (DRC), which not only minimize the cross-rack repair traffic, but also have several practical properties that...
Subjects
free text keywords: Computer Science - Distributed, Parallel, and Cluster Computing