Author affiliations:
[Chen, Dan; Wang, Lizhe] China Univ Geosci, Sch Comp, Wuhan 430074, Peoples R China.;[Wang, Lizhe] Chinese Acad Sci, Ctr Earth Observat & Digital Earth, Beijing 100864, Peoples R China.;[Streit, Achim; Tao, Jie; Marten, Holger] Karlsruhe Inst Technol, Steinbuch Ctr Comp, D-76021 Karlsruhe, Germany.;[Ranjan, Rajiv] CSIRO, ICT Ctr, Informat Engn Lab, Canberra, ACT, Australia.;[Chen, Jingying] Cent China Normal Univ, Natl Engn Ctr E Learning, Beijing, Peoples R China.
Corresponding affiliation:
[Wang, Lizhe] Chinese Acad Sci, Ctr Earth Observat & Digital Earth, Beijing 100864, Peoples R China.
Keywords:
Cloud computing;Data-intensive computing;Hadoop;MapReduce;Massive data processing
Abstract:
Recently, the computational requirements for large-scale data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are designed to operate in single-cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. Workflow systems, on the other hand, are used for distributed data processing across data centers, but the workflow paradigm has been reported to have limitations for this purpose, notably in reliability and efficiency. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters.