Abstract:
MapReduce is widely used in high-performance computing for large-scale data processing. However, as clusters grow, handling the huge amounts of intermediate data that pass through the shuffle and reduce phases (the shuffle being the middle step of MapReduce) heavily degrades performance. Local aggregation, using either combiners or in-mapper combining, reduces the amount of data that must be shuffled, which alleviates the reduce straggler problem. The proposed modified B+ tree based indexing algorithm is applied to reduce the amount of intermediate data, providing fast output retrieval as well as scalable data storage.
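
As an illustration of how local aggregation shrinks the shuffled data, the following minimal sketch contrasts a plain map function with in-mapper combining on a word-count task. It is not the proposed B+ tree algorithm. The word-count task and the names `plain_map`, `in_mapper_combining_map`, and `reduce_counts` are assumptions made for this example only, and the code simulates a single mapper in plain Python rather than using a real MapReduce framework.

```python
from collections import defaultdict


def plain_map(records):
    """Emit one (word, 1) pair per token, with no local aggregation."""
    return [(word, 1) for line in records for word in line.split()]


def in_mapper_combining_map(records):
    """Accumulate counts in a per-mapper dictionary and emit one pair per key."""
    counts = defaultdict(int)
    for line in records:
        for word in line.split():
            counts[word] += 1
    # Emit only after the whole input split has been processed.
    return sorted(counts.items())


def reduce_counts(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input split handled by one mapper.
split = ["a b a", "b a c"]

plain = plain_map(split)                    # 6 intermediate pairs
combined = in_mapper_combining_map(split)   # 3 intermediate pairs

print(len(plain), len(combined))
print(reduce_counts(plain) == reduce_counts(combined))
```

Both variants give the reducer the same final counts, but the in-mapper version emits one pair per distinct key instead of one per token, so fewer pairs have to cross the network during the shuffle. The trade-off is that the mapper must hold the dictionary in memory until its input split has been processed.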