Seminar — Parallel and Distributed Architectures

Martin Land
Department of Computer Science
Hadassah Academic College
  h0000000.gif  
 
Contents
 
Seminar Home
Lecture Schedule
Bulletin Board
Course Syllabus
Office Hours
 
Suggested Research Papers
A Novel Parallel Deadlock Detection Algorithm and Hardware for Multiprocessor System-on-a-Chip
Xiang Xiao and Jaehwan John Lee
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
Abstract — Given the projected dramatic increase in the number of processors and resources in a system-on-a-chip, a quadratic increase in the likelihood of deadlock is predicted due to complex system behavior. To deal with this issue, we here present a novel parallel hardware-oriented deadlock detection algorithm with O(1) deadlock detection and O(min(m,n)) preparation, where m and n are the numbers of processes and resources, respectively. Our contributions are (i) the first O(1) deadlock detection hardware implementation and (ii) a new algorithmic method of achieving O(min(m,n)) overall run-time complexity. We implement our algorithm in Verilog HDL and demonstrate that deadlock detection always takes only two clock cycles regardless of the size of a system (i.e., m and n).
A Parallel Computational Model for Heterogeneous Clusters
Jose Luis Bosque and Luis Pastor
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 12, DECEMBER 2006
Abstract — Heterogeneous clusters claim for new models and algorithms. In this paper, a new parallel computational model is presented. The model, based on the LogGP model, has been extended to be able to deal with heterogeneous parallel systems. For that purpose, the LogGP’s scalar parameters have been replaced by vector and matrix parameters to take into account the different nodes’ features. The work presented here includes the parametrization of a real cluster, which illustrates the impact of node heterogeneity over the model’s parameters. Finally, the paper presents some experiments that can be used for assessing the method’s validity, together with the main conclusions and future work.
An Analysis of EDF Schedulability on a Multiprocessor
Theodore P. Baker, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 8, AUGUST 2005
Abstract — A new schedulability test is derived for preemptive deadline scheduling of periodic or sporadic real-time tasks on a singlequeue m-server system. The new test allows the task deadline to be more or less than the task period, and is based on a new analysis concept, called a μ-busy interval. This generalizes a result of Goossens et al. [11] that a system of periodic tasks with maximum individual task utilization umax is EDF-schedulable on m processors if the total utilization does not exceed m(1 - umax) + max. The new test allows the analysis of hybrid EDF-US [x] scheduling, and the conclusion that EDF-US[1/2] is optimal, with a guaranteed worstcase schedulable utilization of (m + 1)/2.
An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence Search on a Massively Parallel System
Karl Jiang, Oystein Thorsen, Amanda Peters, Brian Smith, Member, IEEE, and Carlos P. Sosa, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
Abstract — Bioinformatics databases used for sequence comparison and sequence alignment are growing exponentially. This has popularized programs that carry out database searches. Current implementations of sequence alignment methods based on hidden Markov models (HMM) have proven to be computationally intensive and, hence, amenable to architectures with multiple processors. In this paper, we describe a modified version of the original parallel implementation of HMMs on a massively parallel system. This is part of the HMMER bioinformatics code. HMMER 2.3.2 uses profile HMMs for sensitive database searching based on statistical descriptions of a sequence family’s consensus [1]. Two of the nine programs were further parallelized to take advantage of the large number of processors, namely, hmmsearch and hmmpfam. For our study, we start by porting the parallel virtual machine (PVM) versions of these two programs currently available as part of the HMMER suite of programs. We report the performance of these nonoptimized versions as baselines. Our work also includes the introduction of an alternate sequence file indexing, multiple-master configuration, dynamic data collection and, finally, load balancing via the indexed sequence files. This set of optimizations constitutes our modified version for massively parallel systems. Our results show parallel performance improvements of more than one order of magnitude (16 times) for hmmsearch and hmmpfam.
An Efficient, Practical Parallelization Methodology for Multicore Architecture Simulation
James Donald and Margaret Martonosi
IEEE Computer Architecture Letters Vol. 5, 2006
Abstract — Multiple core designs have become commonplace in the processor market, and are hence a major focus in modern computer architecture research. Thus, for both product development and research, multiple core processor simulation environments are necessary. A well-known positive feedback property of computer design is that we use today’s computers to design tomorrow’s. Thus, with the emergence of chip multiprocessors, it is natural to re-examine simulation environments written to exploit parallelism. In this paper we present a programming methodology for directly converting existing uniprocessor simulators into parallelized multiple-core simulators. Our method not only takes significantly less development effort compared to some prior used programming techniques, but also possesses advantages by retaining a modular and comprehensible programming structure. We demonstrate our case with actual developed products after applying this method to two different simulators, one developed from IBM Turandot and the other from the SimpleScalar tool set. Our SimpleScalar-based framework achieves a parallel speedup of 2.2X on a dual-CPU dual-core (4-way) Opteron server.
An Optical Interconnection Network and a Modified Snooping Protocol for the Design of Large-Scale Symmetric Multiprocessors (SMPs)
Ahmed Louri, Senior Member, IEEE, and Avinash Karanth Kodi, Student Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 12, DECEMBER 2004
Abstract — In Symmetric Multiprocessors (SMPs), the cache coherence overhead and the speed of the shared buses limit the address/ snoop bandwidth needed to broadcast transactions to all processors. As a solution, a scalable address subnetwork called Symmetric Multiprocessor Network (SYMNET) is proposed in which address requests and snoop responses of SMPs are implemented optically. SYMNET not only uses passive optical interconnects that increases the speed of the proposed network, but also pipelines address requests at a much faster rate than electronics. This increases the address bandwidth for snooping, but the preservation of cache coherence can no longer be maintained with the usual snooping protocols. A modified coherence protocol, Coherence in SYMNET (COSYM), is introduced to solve the coherence problem. COSYM was evaluated with a subset of Splash-2 benchmarks and compared with the electrical bus-based MOESI protocol. The simulation studies have shown a 5-66 percent improvement in execution time for COSYM as compared to MOESI for various applications. Simulations have also shown that the average latency for a transaction to complete using COSYM protocol was 5-78 percent better than the MOESI protocol. It is also seen that SYMNET can scale up to hundreds of processors while still using fast snooping-based cache coherence protocols, and additional performance gains may be attained with further improvement in optical device technology.
An XML-Based ADL Framework for Automatic Generation of Multithreaded Computer Architecture Simulators
Christopher Barnes, Pranav Vaidya, and Jaehwan John Lee
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
Abstract — Computer architecture simulation has always played a pivotal role in continuous innovation of computers. However, constructing or modifying a high quality simulator is time consuming and error-prone. Thus, often Architecture Description Languages (ADLs) are used to provide an abstraction layer for describing the computer architecture and automatically generating corresponding simulators. Along the line of such research, we present a novel XML-based ADL, its compiler, and a generation methodology to automatically generate multithreaded simulators for computer architecture. We utilize the industry-standard extensible markup language XML to describe the functionality and architecture of a modeled processor. Our ADL framework allows users to easily and quickly modify the structure, register set, and execution of a modeled processor. To prove its validity, we have generated several multithreaded simulators with different configurations based on the MIPS five-stage processor, and successfully tested with two programs.
Branch Misprediction Prediction: Complementary Branch Predictors
Resit Sendag, Joshua J. Yi, Peng-fei Chuang
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
Abstract — In this paper, we propose a new class of branch predictors, complementary branch predictors, which can be easily added to any branch predictor to improve the overall prediction accuracy. This mechanism differs from conventional branch predictors in that it focuses only on mispredicted branches. As a result, this mechanism has the advantages of scalability and flexibility (can be implemented with any branch predictor), but is not on the critical path. More specifically, this mechanism improves the branch prediction accuracy by predicting which future branch will be mispredicted next and when that will occur, and then it changes the predicted direction at the predicted time. Our results show that a branch predictor with the branch misprediction predictor achieves the same prediction accuracy as a conventional branch predictor that is 4 to 16 times larger, but without significantly increasing the overall complexity or lengthening the critical path.
Characterization of Problem Stores
Allison L. Holloway and Gurindar S. Sohi
IEEE Computer Architecture Letters Vol. 3, 2004
Abstract — This paper introduces the concept of problem stores: static stores whose dependent loads often miss in the cache. Accurately identifying problem stores allows the early determination of addresses likely to cause later misses, potentially allowing for the development of novel, proactive prefetching and memory hierarchy management schemes. We present a detailed empirical characterization of problem stores using the SPEC2000 CPU benchmarks. The data suggests several key observations about problem stores. First, we find that the number of important problem stores is typically quite small; the worst 100 problem stores write the values that will lead to about 90% of non-cold misses for a variety of cache configurations. We also find that problem stores only account for 1 in 8 dynamic stores, though they result in 9 of 10 misses. Additionally, the problem stores’ dependent loads miss in the L2 cache a larger fraction of the time than loads not dependent on problem stores. We also observe the set of problem stores is stable across a variety of cache configurations. Finally, we found that the instruction distance from problem store to miss and problem store to evict is often greater than one million instructions, but the value is often needed within 100,000 instructions of the eviction.
Communication Contention in Task Scheduling
Oliver Sinnen and Leonel A. Sousa, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 6, JUNE 2005
Abstract — Task scheduling is an essential aspect of parallel programming. Most heuristics for this NP-hard problem are based on a simple system model that assumes fully connected processors and concurrent interprocessor communication. Hence, contention for communication resources is not considered in task scheduling, yet it has a strong influence on the execution time of a parallel program. This paper investigates the incorporation of contention awareness into task scheduling. A new system model for task scheduling is proposed, allowing us to capture both end-point and network contention. To achieve this, the communication network is reflected by a topology graph for the representation of arbitrary static and dynamic networks. The contention awareness is accomplished by scheduling the communications, represented by the edges in the task graph, onto the links of the topology graph. Edge scheduling is theoretically analyzed, including aspects like heterogeneity, routing, and causality. The proposed contention-aware scheduling preserves the theoretical basis of task scheduling. It is shown how classic list scheduling is easily extended to this more accurate system model. Experimental results show the significantly improved accuracy and efficiency of the produced schedules.
Corollaries to Amdahl's Law for Energy
Sangyeun Cho, Member, IEEE Rami G. Melhem, Fellow, IEEE
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 7, NO. 1, JANUARY-JUNE 2008
Abstract — This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of serial and parallel portion in an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions in the application to minimize the total energy consumption, while the execution time is preserved (i.e., speedup = 1). We show that dynamic energy improvement due to parallelization has a function rising faster with the increasing number of processors than the speed improvement function given by the well-known Amdahl’s Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be directly utilized in energy aware processor resource management. Our results form a basis for several interesting research directions in the area of power and energy aware parallel processing.
CPU Accounting in CMP Processors
Carlos Luque, Miquel Moreto1, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, Mateo Valero
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
Abstract — Chip-MultiProcessors (CMP) introduce complexities when accounting CPU utilization to processes because the progress done by a process during an interval of time highly depends on the activity of the other processes it is co-scheduled with. We propose a new hardware accounting mechanism to improve the accuracy when measuring the CPU utilization in CMPs and compare it with the previous accounting mechanisms. Our results show that currently known mechanisms could lead to a 12% average error when it comes to CPU utilization accounting. Our proposal reduces this error to less than 1% in a modeled 4-core processor system.
Data Parallel Address Architecture
Jung Ho Ahn and William J. Dally
IEEE Computer Architecture Letters Vol. 5, 2006
Abstract — Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns due to the time required to precharge and activate banks and to switch between read and write access.

To achieve memory reference parallelism a system may simultaneously issue references from multiple reference threads. Alternatively multiple references from a single thread can be issued in parallel. In this paper we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.
Design Space Exploration of a Software Speculative Parallelization Scheme
Marcelo Cintra, Member, IEEE, and Diego R. Llanos, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 6, JUNE 2005
Abstract — With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved with the speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors.
Dynamic Predication of Indirect Jumps
José A. Joao, Onur Mutlu, Hyesoon Kim, and Yale N. Patt
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
Abstract — Indirect jumps are used to implement increasinglycommon programming language constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. Unfortunately, the prediction accuracy of indirect jumps has remained low because many indirect jumps have multiple targets that are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The microarchitecture predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hardto- predict at run time. We describe the new indirect jump predication architecture, provide code examples showing why it could reduce the performance impact of jumps, derive an analytical cost-benefit model for deciding which jumps and targets to predicate, and present preliminary evaluation results.
Explaining Dynamic Cache Partitioning Speed Ups
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez and Mateo Valero
IEEE Computer Architecture Letters Vol. 6, 2007
Abstract — Cache Partitioning has been proposed as an interesting alternative to traditional eviction policies of shared cache levels in modern CMP architectures: throughput is improved at the expense of a reasonable cost. However, these new policies present different behaviors depending on the applications that are running in the architecture. In this paper, we introduce some metrics that characterize applications and allow us to give a clear and simple model to explain final throughput speed ups.
Game-Theoretic Approach for Load Balancing in Computational Grids
Riky Subrata, Member, IEEE, Albert Y. Zomaya, Fellow, IEEE, and Bjorn Landfeldt, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
Abstract — Load balancing is a very important and complex problem in computational grids. A computational grid differs from traditional high-performance computing systems in the heterogeneity of the computing nodes, as well as the communication links that connect the different nodes together. There is a need to develop algorithms that can capture this complexity yet can be easily implemented and used to solve a wide range of load-balancing scenarios. In this paper, we propose a game-theoretic solution to the grid load-balancing problem. The algorithm developed combines the inherent efficiency of the centralized approach and the faulttolerant nature of the distributed, decentralized approach. We model the grid load-balancing problem as a noncooperative game, whereby the objective is to reach the Nash equilibrium. Experiments were conducted to show the applicability of the proposed approaches. One advantage of our scheme is the relatively low overhead and robust performance against inaccuracies in performance prediction information.
Hierarchical Scheduling for Symmetric Multiprocessors
Abhishek Chandra, Member, IEEE, and Prashant Shenoy, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
Abstract — Hierarchical scheduling has been proposed as a scheduling technique to achieve aggregate resource partitioning among related groups of threads and applications in uniprocessor and packet scheduling environments. Existing hierarchical schedulers are not easily extensible to multiprocessor environments because 1) they do not incorporate the inherent parallelism of a multiprocessor system while resource partitioning and 2) they can result in unbounded unfairness or starvation if applied to a multiprocessor system in a naive manner. In this paper, we present hierarchical multiprocessor scheduling (H-SMP), a novel hierarchical CPU scheduling algorithm designed for a symmetric multiprocessor (SMP) platform. The novelty of this algorithm lies in its combination of space and time multiplexing to achieve the desired bandwidth partition among the nodes of the hierarchical scheduling tree. This algorithm is also characterized by its ability to incorporate existing proportional-share algorithms as auxiliary schedulers to achieve efficient hierarchical CPU partitioning. In addition, we present a generalized weight feasibility constraint that specifies the limit on the achievable CPU bandwidth partitioning in a multiprocessor hierarchical framework and propose a hierarchical weight readjustment algorithm designed to transparently satisfy this feasibility constraint. We evaluate the properties of H-SMP using hierarchical surplus fair scheduling (H-SFS), an instantiation of H-SMP that employs surplus fair scheduling (SFS) as an auxiliary algorithm. This evaluation is carried out through a simulation study that shows that H-SFS provides better fairness properties in multiprocessor environments as compared to existing algorithms and their naive extensions.
In-Network Cache Coherence
Noel Eisley, Li-Shiuan Peh, and Li Shang
IEEE Computer Architecture Letters Vol. 5, 2006
Abstract — We propose implementing cache coherence protocols within the network, demonstrating how an in-network implementation of the MSI directory-based protocol allows for in-transit optimizations of read and write delay. Our results show 15% and 24% savings on average in memory access latency for SPLASH-2 parallel benchmarks running on a 4x4 and a 16x16 multiprocessor respectively.
Many-Core vs. Many-Thread Machines: Stay Away From the Valley
Zvika Guz, Evgeny Bolotin, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
Abstract — We study the tradeoffs between Many-Core machines like Intel’s Larrabee and Many-Thread machines like Nvidia and AMD GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines deliver inferior performance. We study the shape of this “performance valley” and provide insights on how it can be avoided.
Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices
Carlos Madriles, Carlos García-Quiñones, Jesús Sánchez, Member, IEEE, Pedro Marcuello, Member, IEEE Computer Society, Antonio González, Member, IEEE, Dean M. Tullsen, Senior Member, IEEE, Hong Wang, Member, IEEE, and John P. Shen, Fellow, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 7, JULY 2008
Abstract — This paper presents the Mitosis framework, which is a combined hardware-software approach to speculative multithreading, even in the presence of frequent dependences among threads. Speculative multithreading increases single-threaded application performance by exploiting thread-level parallelism speculatively, that is, executing code in parallel, even when the compiler or runtime system cannot guarantee that the parallelism exists. The proposed approach is based on predicting/computing thread input values via software through a piece of code that is added at the beginning of each thread (the precomputation slice). A precomputation slice is expected to compute the correct thread input values most of the time but not necessarily always. This allows aggressive optimization techniques to be applied to the slice to make it very short. This paper focuses on the microarchitecture that supports this execution model. The primary novelty of the microarchitecture is the hardware support for the execution and validation of precomputation slices. Additionally, this paper presents new architectures for the register file and the cache memory in order to support multiple versions of each variable and allow for efficient rollback in case of misspeculation. We show that the proposed microarchitecture, together with the compiler support, achieves an average speedup of 2.2 for applications that conventional nonspeculative approaches are not able to parallelize at all.
Multitoroidal Interconnects for Tightly Coupled Supercomputers
Yariv Aridor, Tamar Domany, Oleg Goldshmidt, Member, IEEE, Yevgeny Kliteynik, Edi Shmueli, and José E. Moreira, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
Abstract — The processing elements of many modern tightly coupled multicomputers are connected via mesh or toroidal networks. Such interconnects are simple and highly scalable, but suffer from high fragmentation, low utilization, and insufficient fault tolerance when the resources allocated to each job are dedicated. High-dimensional interconnects may be more efficient in certain cases, but are based on complex and expensive components and scale poorly. We present a novel hardware/software architectural approach that detaches the processing elements of the system from the interconnect and augments the traditional toroidal topology to provide additional connectivity options and additional link redundancy. We explore the properties of the new “multitoroidal” topology and the improvements it offers in resource utilization and failure tolerance. We present the results of extensive simulation studies to show that for practically important types of workloads, the resource utilization may be increased by 50 percent and, in certain cases, as much as 100 percent compared to toroidal machines and is, in fact, close to the theoretically optimal case of a full crossbar interconnect. The combined hardware/software architectural innovation is a major significant improvement in resource utilization on top of the state of the art in scheduling algorithm research. Also, multitoroidal multicomputers are able to work under link failure rates of 0.002 failures per week that would shut down toroidal machines. A variant of the multitoroidal architecture is implemented in the Blue Gene/L supercomputer.
Nahalal: Cache Organization for Chip Multiprocessors
Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser
IEEE Computer Architecture Letters Vol. 6, 2007
Abstract — This paper addresses cache organization in Chip Multiprocessors (CMPs). We show that in CMP systems it is valuable to distinguish between shared data, which is accessed by multiple cores, and private data accessed by a single core. We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. Nahalal exhibits significant improvements in cache access latency compared to a traditional cache design.
Overhead Analysis of Scientific Workflows in Grid Environments
Radu Prodan, Member, IEEE, and Thomas Fahringer, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
Abstract — Scientific workflows are a topic of great interest in the Grid community that sees in the workflow model an attractive paradigm for programming distributed wide-area Grid infrastructures. Traditionally, the Grid workflow execution is approached as a pure best effort scheduling problem that maps the activities onto the Grid processors based on appropriate optimization or local matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results, the execution in dynamic and unpredictable Grid environments is prone to severe performance losses that must be understood for minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur when executing scientific workflows in dynamic Grid environments. We introduce an ideal model for the lowest execution time that can be achieved by a workflow and explain the difference to the real measured Grid execution time based on a hierarchy of performance overheads for Grid computing. We describe how to systematically measure and compute the overheads from individual activities to larger workflow regions and adjust well-known parallel processing metrics to the scope of Grid computing, including speedup and efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance analysis of two real-world workflow applications executed in the Austrian Grid environment.
Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Member, IEEE Computer Society, Javier Setoain, Manuel Prieto, Member, IEEE, Luis Piñuel, Member, IEEE, and Francisco Tirado, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
Abstract — The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
Performance Modeling Using Monte Carlo Simulation
Ram Srinivasan, Jeanine Cook, Olaf Lubeck
IEEE Computer Architecture Letters Vol. 5, 2006
Abstract — Cycle accurate simulation has long been the primary tool for micro-architecture design and evaluation. Though accurate, the slow speed often imposes constraints on the extent of design exploration. In this work, we propose a fast, accurate Monte-Carlo based model for predicting processor performance. We apply this technique to predict the CPI of in-order architectures and validate it against the Itanium-2. The Monte Carlo model uses micro-architecture independent application characteristics, and cache, branch predictor statistics to predict CPI with an average error of less than 7%. Since prediction is achieved in a few seconds, the model can be used for fast design space exploration that can efficiently cull the space for cycle-accurate simulations. Besides accurately predicting CPI, the model also breaks down CPI into various components, where each component quantifies the effect of a particular stall condition (branch mis-prediction, cache miss, etc.) on overall CPI. Such a CPI decomposition can help processor designers quickly identify and resolve critical performance bottlenecks.
Performance Models for Network Processor Design
Tilman Wolf, Member, IEEE, and Mark A. Franklin, Fellow, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 6, JUNE 2006
Abstract — To provide a variety of new and advanced communications services, computer networks are required to perform increasingly complex packet processing. This processing typically takes place on network routers and their associated components. An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as “network processor” or NP. In addition to multiple processors, NPs have multiple forms of on-chip memory, various network and off-chip memory interfaces, and other specialized logic components such as CAMs (Content Addressable Memories). The design space for NPs (e.g., number of processors, caches, cache sizes, etc.) is large due to the diverse workload, application requirements, and system characteristics. System design constraints relate to the maximum chip area and the power consumption that are permissible while achieving defined line rates and executing required packet functions. In this paper, an analytic performance model that captures the processing performance, chip area, and power consumption for a prototypical NP is developed and used to provide quantitative insights into system design trade offs. The model, parameterized with a networking application benchmark, provides the basis for the design of a scalable, high-performance network processor and presents insights into how best to configure the numerous design elements associated with NPs.
Reducing Queue Oscillation at a Congested Link
Jong-hwan Kim and Ikjun Yeom, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
Abstract — Queue length oscillation at a congested link causes many undesirable properties such as large delay jitter, underutilization of the link and packet drops in burst. The main reason of this oscillation is that most queue management schemes determine the drop probability based on the current traffic without consideration on the impact of that drop probability on the future traffic. In this paper, we propose a new active queue (AQM) scheme to reduce queue oscillation and realize stable queue length. The proposed scheme measures the current arrival and drop rates, and uses them to estimate the next arrival rate. Based on this estimation, the scheme calculates the drop probability which is expected to realize stable queue length. We present extensive simulation with various topologies and offered traffic to evaluate performance of the proposed scheme. The results show that the proposed scheme remarkably reduces queue length oscillation compared to other well-known AQMs. It is also shown that the proposed scheme improves fairness among TCP flows due to the stable drop probability, and maintains high utilization with small queue length.
The Complexity of Verifying Memory Coherence and Consistency
Jason F. Cantin, Student Member, IEEE, Mikko H. Lipasti, Member, IEEE, and James E. Smith, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 7, JULY 2005
Abstract — The problem of testing shared memories for memory coherence and consistency is studied. First, it is proved that detecting violations of coherence in an execution is NP-Complete, and it remains NP-Complete for a number of restricted instances. This result leads to a proof that all known consistency models are NP-Hard to verify. The complexity of verifying consistency models is not a mere consequence of coherence, and verifying sequential consistency remains NP-Complete even after coherence has been verified.
The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture
Resit Sendag, Member, IEEE, Ying Chen, Student Member, IEEE, and David J. Lilja, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 3, MARCH 2005
Abstract — The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they actually will be needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths impact the memory system performance. We find that incorrect speculation (wrong execution) on the instruction and thread-level provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative Wrong Execution Cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses.
The Server Reassignment Problem for Load Balancing in Structured P2P Systems
Chyouhwa Chen and Kun-Cheng Tsai
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2008
Abstract — Application-layer peer-to-peer (P2P) networks are considered to be the most important development for next-generation Internet infrastructure. For these systems to be effective, load balancing among the peers is critical. Most structured P2P systems rely on ID-space partitioning schemes to solve the load imbalance problem and have been known to result in an imbalance factor of Θ(logN) in the zone sizes. This paper makes two contributions. First, we propose addressing the virtual-server-based load balancing problem systematically using an optimization-based approach and derive an effective algorithm to rearrange loads among the peers. We demonstrate the superior performance of our proposal in general and its advantages over previous strategies in particular. We also explore other important issues vital to the performance in the virtual server framework, such as the effect of the number of directories employed in the system and the performance ramification of user registration strategies. Second, and perhaps more significantly, we systematically characterize the effect of heterogeneity on load balancing algorithm performance and the conditions in which heterogeneity may be easy or hard to deal with based on an extensive study of a wide spectrum of load and capacity scenarios.
The UCSC Kestrel Parallel Processor
Andrea Di Blas, Member, IEEE, David M. Dahle, Mark Diekhans, Leslie Grate, Jeffrey Hirschberg, Kevin Karplus, Senior Member, IEEE, Hansjörg Keller, Mark Kendrick, Francisco J. Mesa-Martinez, David Pease, Eric Rice, Angela Schultz, Member, IEEE, Don Speck, and Richard Hughey, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 1, JANUARY 2005
Abstract — The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multipledata stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This paper presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.
A NUCA Substrate for Flexible CMP Cache Sharing
Jaehyuk Huh, Changkyu Kim, Student Member, IEEE, Hazim Shafi, Lixin Zhang, Doug Burger, Senior Member, IEEE, and Stephen W. Keckler, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Wepropose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
CMP Support for Large and Dependent Speculative Threads
Christopher B. Colohan, Anastasia Ailamaki, Member, IEEE Computer Society, J. Gregory Steffan, Member, IEEE, and Todd C. Mowry
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present a chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for the TLS hardware. We also present a support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip-multiprocessor) speed up by a factor of 1.9 to 2.9.
Understanding the Thermal Implications of Multicore Architectures
Pedro Chaparro, José González, Member, IEEE Computer Society, Grigorios Magklis, Member, IEEE Computer Society, Qiong Cai, and Antonio González, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, today’s higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.
Power-Efficient Approaches to Redundant Multithreading
Niti Madan, Student Member, IEEE, and Rajeev Balasubramonian, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread.
Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery
Yi Ma, Hongliang Gao, Martin Dimitrov, and Huiyang Zhou, Member, IEEE Computer Society
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Dual-core execution (DCE) is an execution paradigm proposed to utilize chip multiprocessors to improve the performance of single-threaded applications. Previous research has shown that DCE provides a complexity-effective approach to building a highly scalable instruction window and achieves significant latency-hiding capabilities. In this paper, we propose to optimize DCE for power efficiency and/or transient-fault recovery. In DCE, a program is first processed (speculatively) in the front processor and then reexecuted by the back processor. Such reexecution is the key to eliminating the centralized structures that are normally associated with very large instruction windows. In this paper, we exploit the computational redundancy in DCE to improve its reliability and its power efficiency. The main contributions include: 1) DCE-based redundancy checking for transient-fault tolerance and a complexity-effective approach to achieving full redundancy coverage and 2) novel techniques to improve the power/energy efficiency of DCE-based execution paradigms. Our experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can effectively achieve transientfault tolerance or significant performance enhancement in a power/energy-efficient way. Compared to the original DCE, the optimized DCE has similar speedups (34 percent on average) over single-core processors while reducing the energy overhead from 93 percent to 31 percent.
Accelerating Sequential Applications on CMPs Using Core Spilling
Jason Cong, Fellow, IEEE, Guoling Han, Ashok Jagannathan, Glenn Reinman, Member, IEEE, and Krzysztof Rutkowski
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
Abstract — Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level parallelism for multitasking or multithreaded applications. However, single-threaded applications will have difficulty dynamically leveraging the statically partitioned resources in a CMP. Such sequential applications may be difficult to statically decompose into threads or may simply be a legacy code where recompilation is not possible or cost-effective. We present a novel approach to dynamically accelerate the performance of sequential application(s) on multiple cores. Execution is allowed to spill from one core to another when resources on one core have been exhausted. We propose two techniques to enable low-overhead migration between cores: prespilling and locality-based filtering. We develop and analyze an arbitration mechanism to intelligently allocate cores among a set of sequential applications on a CMP. On average, core spilling on an eight-coreCMP can accelerate single-threaded performance by 35 percent.Wefurther explore an eight-core CMP running a multiple application workload composed of the entire SPEC 2000 benchmark suite in various combinations and arrival times. Using core spilling to accelerate the current set of running applications in cases where there are idle cores, we achieve up to a 40 percent improvement in performance.
Compiler Techniques for Efficient Communications in Circuit Switched Networks for Multiprocessor Systems
Shuyi Shao, Student Member, IEEE, Alex K. Jones, Senior Member, IEEE, and Rami Melhem, Fellow, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 3, MARCH 2009
Abstract — In this paper, we explore compiler techniques for achieving efficient communications on circuit switching interconnection networks. We propose a compilation framework for identifying communication patterns and compiling these patterns as network configuration directives. This has the potential of providing significant performance benefits when connections can be established in the network prior to the actual communications. The framework includes a flexible and powerful communication pattern representation scheme that captures the property of communication patterns and allows manipulation of these patterns. In this way, communication phases can be identified within the application. Additionally, we extend the classification of static and dynamic communications to include persistent communications. Persistent communications are a subclass of dynamic communications that remain unchanged for large segments of the application execution. An experimental compiler has been developed to implement the framework. This compiler is capable of detecting both static and persistent communications within an application. We show that for the NAS Parallel Benchmarks, 100 percent of the point-to-point communications can be classified as either static or persistent, and 100 percent of the collectives are either static or persistent with the exception of IS. Simulation-based performance analysis demonstrates the benefit of using our compiler techniques for achieving efficient communications in multiprocessor systems.
Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures
Guangming Tan, Member, IEEE, Ninghui Sun, Member, IEEE, and Guang R. Gao, Fellow, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2009
Abstract — Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality do benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-cores Intel Clovertown.
The Design of OpenMP Tasks
Eduard Ayguadé, Nawal Copty, Member, IEEE Computer Society, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Federico Massaioli, Member, IEEE, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 3, MARCH 2009
Abstract — OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP 3.0 was to define a standard dialect to express and to exploit unstructured parallelism efficiently. This paper presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking subcommittee which was formed for this purpose. This paper summarizes the efforts of the subcommittee (spanning over two years) in designing, evaluating, and seamlessly integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability.
Garbage Collector Scheduling in Dynamic, Multiprocessor Real-Time Systems
Hyeonjoong Cho, Member, IEEE, Binoy Ravindran, Senior Member, IEEE, and Chewoo Na, Student Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 6, JUNE 2009
Abstract — We consider garbage collection (GC) in dynamic, multiprocessor real-time systems. We consider the time-based, concurrent GC approach and focus on real-time scheduling to obtain mutator timing assurances, despite memory allocation and garbage collection. We present a scheduling algorithm called GCMUA. The algorithm considers mutator activities that are subject to time/utility function time constraints, stochastic execution-time and memory demands, and overloads. We establish that GCMUA probabilistically lower bounds each mutator activity’s accrued utility, lower bounds the system-wide total accrued utility, and upper bounds the timing assurances’ sensitivity to variations in mutator execution-time and memory demand estimates. Our simulation experiments validate our analytical results and confirm GCMUA’s effectiveness.
On Simulation and Design of Parallel-Systems Schedulers: Are We Doing the Right Thing?
Edi Shmueli and Dror G. Feitelson, Senior Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 7, JULY 2009
Abstract — It is customary to use open-system trace-driven simulations to evaluate the performance of parallel-system schedulers. As a consequence, all schedulers have evolved to optimize the packing of jobs in the schedule, as a means to improve a number of performance metrics that are conjectured to be correlated with user satisfaction, with the premise that this will result in a higher productivity in reality. We argue that these simulations suffer from severe limitations that lead to suboptimal scheduler designs and to even dismissing potentially good design alternatives. We propose an alternative simulation methodology called site-level simulation, in which the workload for the evaluation is generated dynamically by user models that interact with the system. We present a novel scheduler called CREASY that exploits knowledge on user behavior to directly improve user satisfaction and compare its performance to the original packing-based EASY scheduler. We show that user productivity improves by up to 50 percent under the user-aware design, while according to the conventional metrics, performance may actually degrade.
Replication-Based Fault Tolerance for MPI Applications
John Paul Walters and Vipin Chaudhary, Member, IEEE
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 7, JULY 2009
Abstract — As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
 
     
Lecture Schedule
h0000000.gif
   
 
 
     
Bulletin Board
h0000000.gif
   
22 Nov 2009 The seminar will meet on Thursday 26.11.09 from 10:00 to 10:50.