| |
|
|
|
Suggested Research Papers
|
|
|
A Novel Parallel Deadlock Detection Algorithm and
Hardware for Multiprocessor System-on-a-Chip
|
|
|
Xiang Xiao and Jaehwan John Lee
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
|
|
|
Abstract — Given the projected dramatic increase in the number
of processors and resources in a system-on-a-chip, a quadratic
increase in the likelihood of deadlock is predicted due to complex
system behavior. To deal with this issue, we here present a
novel parallel hardware-oriented deadlock detection algorithm
with O(1) deadlock detection and O(min(m,n)) preparation,
where m and n are the numbers of processes and resources,
respectively. Our contributions are (i) the first O(1) deadlock
detection hardware implementation and (ii) a new algorithmic
method of achieving O(min(m,n)) overall run-time complexity.
We implement our algorithm in Verilog HDL and demonstrate
that deadlock detection always takes only two clock cycles
regardless of the size of a system (i.e., m and n).
|
|
|
A Parallel Computational Model for
Heterogeneous Clusters
|
|
|
Jose Luis Bosque and Luis Pastor
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 12, DECEMBER 2006
|
|
|
Abstract — Heterogeneous clusters claim for new models and algorithms. In this paper, a new parallel computational model is
presented. The model, based on the LogGP model, has been extended to be able to deal with heterogeneous parallel systems. For
that purpose, the LogGP’s scalar parameters have been replaced by vector and matrix parameters to take into account the different
nodes’ features. The work presented here includes the parametrization of a real cluster, which illustrates the impact of node
heterogeneity over the model’s parameters. Finally, the paper presents some experiments that can be used for assessing the method’s
validity, together with the main conclusions and future work.
|
|
|
An Analysis of EDF Schedulability
on a Multiprocessor
|
|
|
Theodore P. Baker, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 8, AUGUST 2005
|
|
|
Abstract — A new schedulability test is derived for preemptive deadline scheduling of periodic or sporadic real-time tasks on a singlequeue
m-server system. The new test allows the task deadline to be more or less than the task period, and is based on a new analysis
concept, called a μ-busy interval. This generalizes a result of Goossens et al. [11] that a system of periodic tasks with maximum
individual task utilization umax is EDF-schedulable on m processors if the total utilization does not exceed
m(1 - umax) + max. The new
test allows the analysis of hybrid EDF-US [x] scheduling, and the conclusion that EDF-US[1/2] is optimal, with a guaranteed worstcase
schedulable utilization of (m + 1)/2.
|
|
|
An Efficient Parallel Implementation of the
Hidden Markov Methods for Genomic Sequence
Search on a Massively Parallel System
|
|
|
Karl Jiang, Oystein Thorsen, Amanda Peters,
Brian Smith, Member, IEEE, and Carlos P. Sosa, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
|
|
|
Abstract — Bioinformatics databases used for sequence comparison and sequence alignment are growing exponentially. This has
popularized programs that carry out database searches. Current implementations of sequence alignment methods based on hidden
Markov models (HMM) have proven to be computationally intensive and, hence, amenable to architectures with multiple processors. In
this paper, we describe a modified version of the original parallel implementation of HMMs on a massively parallel system. This is part
of the HMMER bioinformatics code. HMMER 2.3.2 uses profile HMMs for sensitive database searching based on statistical
descriptions of a sequence family’s consensus [1]. Two of the nine programs were further parallelized to take advantage of the large
number of processors, namely, hmmsearch and hmmpfam. For our study, we start by porting the parallel virtual machine (PVM)
versions of these two programs currently available as part of the HMMER suite of programs. We report the performance of these
nonoptimized versions as baselines. Our work also includes the introduction of an alternate sequence file indexing, multiple-master
configuration, dynamic data collection and, finally, load balancing via the indexed sequence files. This set of optimizations constitutes
our modified version for massively parallel systems. Our results show parallel performance improvements of more than one order of
magnitude (16 times) for hmmsearch and hmmpfam.
|
|
|
An Efficient, Practical Parallelization Methodology
for Multicore Architecture Simulation
|
|
|
James Donald and Margaret Martonosi
|
|
|
IEEE Computer Architecture Letters Vol. 5, 2006
|
|
|
Abstract — Multiple core designs have become commonplace
in the processor market, and are hence a major focus in
modern computer architecture research. Thus, for both product
development and research, multiple core processor simulation
environments are necessary. A well-known positive feedback
property of computer design is that we use today’s computers
to design tomorrow’s. Thus, with the emergence of chip multiprocessors,
it is natural to re-examine simulation environments
written to exploit parallelism.
In this paper we present a programming methodology for
directly converting existing uniprocessor simulators into parallelized
multiple-core simulators. Our method not only takes
significantly less development effort compared to some prior
used programming techniques, but also possesses advantages by
retaining a modular and comprehensible programming structure.
We demonstrate our case with actual developed products after
applying this method to two different simulators, one developed
from IBM Turandot and the other from the SimpleScalar tool set.
Our SimpleScalar-based framework achieves a parallel speedup
of 2.2X on a dual-CPU dual-core (4-way) Opteron server.
|
|
|
An Optical Interconnection Network and a
Modified Snooping Protocol for the Design of
Large-Scale Symmetric Multiprocessors (SMPs)
|
|
|
Ahmed Louri, Senior Member, IEEE, and Avinash Karanth Kodi, Student Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 12, DECEMBER 2004
|
|
|
Abstract — In Symmetric Multiprocessors (SMPs), the cache coherence overhead and the speed of the shared buses limit the address/
snoop bandwidth needed to broadcast transactions to all processors. As a solution, a scalable address subnetwork called Symmetric
Multiprocessor Network (SYMNET) is proposed in which address requests and snoop responses of SMPs are implemented optically.
SYMNET not only uses passive optical interconnects that increases the speed of the proposed network, but also pipelines address
requests at a much faster rate than electronics. This increases the address bandwidth for snooping, but the preservation of cache
coherence can no longer be maintained with the usual snooping protocols. A modified coherence protocol, Coherence in SYMNET
(COSYM), is introduced to solve the coherence problem. COSYM was evaluated with a subset of Splash-2 benchmarks and compared
with the electrical bus-based MOESI protocol. The simulation studies have shown a 5-66 percent improvement in execution time for
COSYM as compared to MOESI for various applications. Simulations have also shown that the average latency for a transaction to
complete using COSYM protocol was 5-78 percent better than the MOESI protocol. It is also seen that SYMNET can scale up to
hundreds of processors while still using fast snooping-based cache coherence protocols, and additional performance gains may be
attained with further improvement in optical device technology.
|
|
|
An XML-Based ADL Framework for Automatic
Generation of Multithreaded Computer Architecture
Simulators
|
|
|
Christopher Barnes, Pranav Vaidya, and Jaehwan John Lee
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
|
|
|
Abstract — Computer architecture simulation has always played a pivotal role in continuous innovation of computers.
However, constructing or modifying a high quality simulator is time consuming and error-prone. Thus, often Architecture
Description Languages (ADLs) are used to provide an abstraction layer for describing the computer architecture and
automatically generating corresponding simulators. Along the line of such research, we present a novel XML-based ADL, its
compiler, and a generation methodology to automatically generate multithreaded simulators for computer architecture. We
utilize the industry-standard extensible markup language XML to describe the functionality and architecture of a modeled
processor. Our ADL framework allows users to easily and quickly modify the structure, register set, and execution of a
modeled processor. To prove its validity, we have generated several multithreaded simulators with different configurations
based on the MIPS five-stage processor, and successfully tested with two programs.
|
|
|
Branch Misprediction Prediction: Complementary Branch Predictors
|
|
|
Resit Sendag, Joshua J. Yi, Peng-fei Chuang
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
|
|
|
Abstract — In this paper, we propose a new class of branch
predictors, complementary branch predictors, which can be
easily added to any branch predictor to improve the overall
prediction accuracy. This mechanism differs from
conventional branch predictors in that it focuses only on
mispredicted branches. As a result, this mechanism has the
advantages of scalability and flexibility (can be implemented
with any branch predictor), but is not on the critical path.
More specifically, this mechanism improves the branch
prediction accuracy by predicting which future branch will
be mispredicted next and when that will occur, and then it
changes the predicted direction at the predicted time. Our
results show that a branch predictor with the branch
misprediction predictor achieves the same prediction
accuracy as a conventional branch predictor that is 4 to 16
times larger, but without significantly increasing the overall
complexity or lengthening the critical path.
|
|
|
Characterization of Problem Stores
|
|
|
Allison L. Holloway and Gurindar S. Sohi
|
|
|
IEEE Computer Architecture Letters Vol. 3, 2004
|
|
|
Abstract — This paper introduces the concept of problem
stores: static stores whose dependent loads often miss in the
cache. Accurately identifying problem stores allows the early
determination of addresses likely to cause later misses, potentially
allowing for the development of novel, proactive prefetching and
memory hierarchy management schemes.
We present a detailed empirical characterization of problem
stores using the SPEC2000 CPU benchmarks. The data suggests
several key observations about problem stores. First, we find
that the number of important problem stores is typically quite
small; the worst 100 problem stores write the values that will
lead to about 90% of non-cold misses for a variety of cache
configurations. We also find that problem stores only account
for 1 in 8 dynamic stores, though they result in 9 of 10 misses.
Additionally, the problem stores’ dependent loads miss in the L2
cache a larger fraction of the time than loads not dependent on
problem stores. We also observe the set of problem stores is stable
across a variety of cache configurations. Finally, we found that
the instruction distance from problem store to miss and problem
store to evict is often greater than one million instructions, but the
value is often needed within 100,000 instructions of the eviction.
|
|
|
Communication Contention in Task Scheduling
|
|
|
Oliver Sinnen and Leonel A. Sousa, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 6, JUNE 2005
|
|
|
Abstract — Task scheduling is an essential aspect of parallel programming. Most heuristics for this NP-hard problem are based on a
simple system model that assumes fully connected processors and concurrent interprocessor communication. Hence, contention for
communication resources is not considered in task scheduling, yet it has a strong influence on the execution time of a parallel program.
This paper investigates the incorporation of contention awareness into task scheduling. A new system model for task scheduling is
proposed, allowing us to capture both end-point and network contention. To achieve this, the communication network is reflected by a
topology graph for the representation of arbitrary static and dynamic networks. The contention awareness is accomplished by
scheduling the communications, represented by the edges in the task graph, onto the links of the topology graph. Edge scheduling is
theoretically analyzed, including aspects like heterogeneity, routing, and causality. The proposed contention-aware scheduling
preserves the theoretical basis of task scheduling. It is shown how classic list scheduling is easily extended to this more accurate
system model. Experimental results show the significantly improved accuracy and efficiency of the produced schedules.
|
|
|
Corollaries to Amdahl's Law for Energy
|
|
|
Sangyeun Cho, Member, IEEE Rami G. Melhem, Fellow, IEEE
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 7, NO. 1, JANUARY-JUNE 2008
|
|
|
Abstract — This paper studies the important interaction between parallelization and energy consumption in a parallelizable application.
Given the ratio of serial and parallel portion in an application and the number of processors, we first derive the optimal frequencies
allocated to the serial and parallel regions in the application to minimize the total energy consumption, while the execution time is
preserved (i.e., speedup = 1). We show that dynamic energy improvement due to parallelization has a function rising faster with
the increasing number of processors than the speed improvement function given by the well-known Amdahl’s Law. Furthermore, we
determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement.
The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be
directly utilized in energy aware processor resource management. Our results form a basis for several interesting research directions
in the area of power and energy aware parallel processing.
|
|
|
CPU Accounting in CMP Processors
|
|
|
Carlos Luque, Miquel Moreto1, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, Mateo Valero
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
|
|
|
Abstract — Chip-MultiProcessors (CMP) introduce complexities when accounting CPU utilization to processes because the progress done by a
process during an interval of time highly depends on the activity of the other processes it is co-scheduled with. We propose a new hardware
accounting mechanism to improve the accuracy when measuring the CPU utilization in CMPs and compare it with the previous accounting
mechanisms. Our results show that currently known mechanisms could lead to a 12% average error when it comes to CPU utilization accounting.
Our proposal reduces this error to less than 1% in a modeled 4-core processor system.
|
|
|
Data Parallel Address Architecture
|
|
|
Jung Ho Ahn and William J. Dally
|
|
|
IEEE Computer Architecture Letters Vol. 5, 2006
|
|
Abstract — Data parallel memory systems must maintain a
large number of outstanding memory references to fully use
increasing DRAM bandwidth in the presence of increasing
latency. At the same time, the throughput of modern DRAMs
is very sensitive to access patterns due to the time required to
precharge and activate banks and to switch between read and
write access.
To achieve memory reference parallelism a system may simultaneously
issue references from multiple reference threads.
Alternatively multiple references from a single thread can be
issued in parallel. In this paper we examine this tradeoff and show
that allowing only a single thread to access DRAM at any given
time significantly improves performance by increasing the locality
of the reference stream and hence reducing precharge/activate operations
and read/write turnaround. Simulations of scientific and
multimedia applications show that generating multiple references
from a single thread gives, on average, 17% better performance
than generating references from two parallel threads.
|
|
|
Design Space Exploration of a
Software Speculative Parallelization Scheme
|
|
|
Marcelo Cintra, Member, IEEE, and Diego R. Llanos, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 6, JUNE 2005
|
|
|
Abstract — With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in
parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software
schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved
with the speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the
design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently
proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software
schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results
show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time,
leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also,
scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally,
experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and
without RAW data dependences, for 4 to 32 processors.
|
|
|
Dynamic Predication of Indirect Jumps
|
|
|
José A. Joao, Onur Mutlu, Hyesoon Kim, and Yale N. Patt
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 6, NO. 2, JULY-DECEMBER 2007
|
|
|
Abstract — Indirect jumps are used to implement increasinglycommon
programming language constructs such as virtual function
calls, switch-case statements, jump tables, and interface calls.
Unfortunately, the prediction accuracy of indirect jumps has
remained low because many indirect jumps have multiple targets
that are difficult to predict even with specialized hardware.
This paper proposes a new way of handling hard-to-predict
indirect jumps: dynamically predicating them. The compiler
identifies indirect jumps that are suitable for predication along
with their control-flow merge (CFM) points. The microarchitecture
predicates the instructions between different targets of the
jump and its CFM point if the jump turns out to be hardto-
predict at run time. We describe the new indirect jump
predication architecture, provide code examples showing why
it could reduce the performance impact of jumps, derive an
analytical cost-benefit model for deciding which jumps and
targets to predicate, and present preliminary evaluation results.
|
|
|
Explaining Dynamic Cache Partitioning Speed Ups
|
|
|
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez and Mateo Valero
|
|
|
IEEE Computer Architecture Letters Vol. 6, 2007
|
|
|
Abstract — Cache Partitioning has been proposed as an interesting
alternative to traditional eviction policies of shared cache
levels in modern CMP architectures: throughput is improved at
the expense of a reasonable cost. However, these new policies
present different behaviors depending on the applications that
are running in the architecture. In this paper, we introduce some
metrics that characterize applications and allow us to give a clear
and simple model to explain final throughput speed ups.
|
|
|
Game-Theoretic Approach for Load
Balancing in Computational Grids
|
|
|
Riky Subrata, Member, IEEE, Albert Y. Zomaya, Fellow, IEEE, and
Bjorn Landfeldt, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
|
|
|
Abstract — Load balancing is a very important and complex problem in computational grids. A computational grid differs from
traditional high-performance computing systems in the heterogeneity of the computing nodes, as well as the communication links that
connect the different nodes together. There is a need to develop algorithms that can capture this complexity yet can be easily
implemented and used to solve a wide range of load-balancing scenarios. In this paper, we propose a game-theoretic solution to the
grid load-balancing problem. The algorithm developed combines the inherent efficiency of the centralized approach and the faulttolerant
nature of the distributed, decentralized approach. We model the grid load-balancing problem as a noncooperative game,
whereby the objective is to reach the Nash equilibrium. Experiments were conducted to show the applicability of the proposed
approaches. One advantage of our scheme is the relatively low overhead and robust performance against inaccuracies in performance
prediction information.
|
|
|
Hierarchical Scheduling for
Symmetric Multiprocessors
|
|
|
Abhishek Chandra, Member, IEEE, and Prashant Shenoy, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
|
|
|
Abstract — Hierarchical scheduling has been proposed as a scheduling technique to achieve aggregate resource partitioning among
related groups of threads and applications in uniprocessor and packet scheduling environments. Existing hierarchical schedulers are
not easily extensible to multiprocessor environments because 1) they do not incorporate the inherent parallelism of a multiprocessor
system while resource partitioning and 2) they can result in unbounded unfairness or starvation if applied to a multiprocessor system in
a naive manner. In this paper, we present hierarchical multiprocessor scheduling (H-SMP), a novel hierarchical CPU scheduling
algorithm designed for a symmetric multiprocessor (SMP) platform. The novelty of this algorithm lies in its combination of space and
time multiplexing to achieve the desired bandwidth partition among the nodes of the hierarchical scheduling tree. This algorithm is also
characterized by its ability to incorporate existing proportional-share algorithms as auxiliary schedulers to achieve efficient hierarchical
CPU partitioning. In addition, we present a generalized weight feasibility constraint that specifies the limit on the achievable CPU
bandwidth partitioning in a multiprocessor hierarchical framework and propose a hierarchical weight readjustment algorithm designed
to transparently satisfy this feasibility constraint. We evaluate the properties of H-SMP using hierarchical surplus fair scheduling
(H-SFS), an instantiation of H-SMP that employs surplus fair scheduling (SFS) as an auxiliary algorithm. This evaluation is carried out
through a simulation study that shows that H-SFS provides better fairness properties in multiprocessor environments as compared to
existing algorithms and their naive extensions.
|
|
|
In-Network Cache Coherence
|
|
|
Noel Eisley, Li-Shiuan Peh, and Li Shang
|
|
|
IEEE Computer Architecture Letters Vol. 5, 2006
|
|
|
Abstract — We propose implementing cache coherence protocols
within the network, demonstrating how an in-network
implementation of the MSI directory-based protocol allows
for in-transit optimizations of read and write delay. Our
results show 15% and 24% savings on average in memory
access latency for SPLASH-2 parallel benchmarks running
on a 4x4 and a 16x16 multiprocessor respectively.
|
|
|
Many-Core vs. Many-Thread Machines:
Stay Away From the Valley
|
|
|
Zvika Guz, Evgeny Bolotin, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser
|
|
|
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, JANUARY-JUNE 2009
|
|
|
Abstract — We study the tradeoffs between Many-Core machines like Intel’s Larrabee and Many-Thread machines like Nvidia
and AMD GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify
operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines
deliver inferior performance. We study the shape of this “performance valley” and provide insights on how it can be avoided.
|
|
|
Mitosis: A Speculative Multithreaded Processor
Based on Precomputation Slices
|
|
|
Carlos Madriles, Carlos García-Quiñones, Jesús Sánchez, Member, IEEE,
Pedro Marcuello, Member, IEEE Computer Society, Antonio González, Member, IEEE,
Dean M. Tullsen, Senior Member, IEEE, Hong Wang, Member, IEEE, and John P. Shen, Fellow, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 7, JULY 2008
|
|
|
Abstract — This paper presents the Mitosis framework, which is a combined hardware-software approach to speculative
multithreading, even in the presence of frequent dependences among threads. Speculative multithreading increases single-threaded
application performance by exploiting thread-level parallelism speculatively, that is, executing code in parallel, even when the compiler
or runtime system cannot guarantee that the parallelism exists. The proposed approach is based on predicting/computing thread input
values via software through a piece of code that is added at the beginning of each thread (the precomputation slice). A precomputation
slice is expected to compute the correct thread input values most of the time but not necessarily always. This allows aggressive
optimization techniques to be applied to the slice to make it very short. This paper focuses on the microarchitecture that supports
this execution model. The primary novelty of the microarchitecture is the hardware support for the execution and validation of
precomputation slices. Additionally, this paper presents new architectures for the register file and the cache memory in order to
support multiple versions of each variable and allow for efficient rollback in case of misspeculation. We show that the proposed
microarchitecture, together with the compiler support, achieves an average speedup of 2.2 for applications that conventional
nonspeculative approaches are not able to parallelize at all.
|
|
|
Multitoroidal Interconnects for
Tightly Coupled Supercomputers
|
|
|
Yariv Aridor, Tamar Domany, Oleg Goldshmidt, Member, IEEE,
Yevgeny Kliteynik, Edi Shmueli, and José E. Moreira, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 1, JANUARY 2008
|
|
|
Abstract — The processing elements of many modern tightly coupled multicomputers are connected via mesh or toroidal networks. Such
interconnects are simple and highly scalable, but suffer from high fragmentation, low utilization, and insufficient fault tolerance when the
resources allocated to each job are dedicated. High-dimensional interconnects may be more efficient in certain cases, but are based on
complex and expensive components and scale poorly. We present a novel hardware/software architectural approach that detaches the
processing elements of the system from the interconnect and augments the traditional toroidal topology to provide additional connectivity
options and additional link redundancy. We explore the properties of the new “multitoroidal” topology and the improvements it offers in
resource utilization and failure tolerance. We present the results of extensive simulation studies to show that for practically important
types of workloads, the resource utilization may be increased by 50 percent and, in certain cases, as much as 100 percent compared to
toroidal machines and is, in fact, close to the theoretically optimal case of a full crossbar interconnect. The combined hardware/software
architectural innovation is a major significant improvement in resource utilization on top of the state of the art in scheduling algorithm
research. Also, multitoroidal multicomputers are able to work under link failure rates of 0.002 failures per week that would shut down
toroidal machines. A variant of the multitoroidal architecture is implemented in the Blue Gene/L supercomputer.
|
|
|
Nahalal: Cache Organization for Chip
Multiprocessors
|
|
|
Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser
|
|
|
IEEE Computer Architecture Letters Vol. 6, 2007
|
|
|
Abstract — This paper addresses cache organization in Chip
Multiprocessors (CMPs). We show that in CMP systems it is
valuable to distinguish between shared data, which is accessed by
multiple cores, and private data accessed by a single core. We
introduce Nahalal, an architecture whose novel floorplan
topology partitions cached data according to its usage (shared
versus private data), and thus enables fast access to shared data
for all processors while preserving the vicinity of private data to
each processor. Nahalal exhibits significant improvements in
cache access latency compared to a traditional cache design.
|
|
|
Overhead Analysis of Scientific Workflows in
Grid Environments
|
|
|
Radu Prodan, Member, IEEE, and Thomas Fahringer, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
|
|
|
Abstract — Scientific workflows are a topic of great interest in the Grid community that sees in the workflow model an attractive
paradigm for programming distributed wide-area Grid infrastructures. Traditionally, the Grid workflow execution is approached as a
pure best effort scheduling problem that maps the activities onto the Grid processors based on appropriate optimization or local
matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results,
the execution in dynamic and unpredictable Grid environments is prone to severe performance losses that must be understood for
minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic
approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur
when executing scientific workflows in dynamic Grid environments. We introduce an ideal model for the lowest execution time that can
be achieved by a workflow and explain the difference to the real measured Grid execution time based on a hierarchy of performance
overheads for Grid computing. We describe how to systematically measure and compute the overheads from individual activities to
larger workflow regions and adjust well-known parallel processing metrics to the scope of Grid computing, including speedup and
efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event
correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the
workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance
analysis of two real-world workflow applications executed in the Austrian Grid environment.
|
|
|
Parallel Implementation of the 2D Discrete
Wavelet Transform on Graphics
Processing Units: Filter Bank versus Lifting
|
|
|
Christian Tenllado, Member, IEEE Computer Society, Javier Setoain, Manuel Prieto, Member, IEEE,
Luis Piñuel, Member, IEEE, and Francisco Tirado, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
|
|
|
Abstract — The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and
their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as
Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no
such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these
devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The
opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new
performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our
experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of
the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
|
|
|
Performance Modeling Using Monte Carlo Simulation
|
|
|
Ram Srinivasan, Jeanine Cook, Olaf Lubeck
|
|
|
IEEE Computer Architecture Letters Vol. 5, 2006
|
|
|
Abstract — Cycle accurate simulation has long been the primary
tool for micro-architecture design and evaluation. Though
accurate, the slow speed often imposes constraints on the extent
of design exploration. In this work, we propose a fast, accurate
Monte-Carlo based model for predicting processor performance.
We apply this technique to predict the CPI of in-order architectures
and validate it against the Itanium-2. The Monte
Carlo model uses micro-architecture independent application
characteristics, and cache, branch predictor statistics to predict
CPI with an average error of less than 7%. Since prediction
is achieved in a few seconds, the model can be used for fast
design space exploration that can efficiently cull the space for
cycle-accurate simulations. Besides accurately predicting CPI, the
model also breaks down CPI into various components, where each
component quantifies the effect of a particular stall condition
(branch mis-prediction, cache miss, etc.) on overall CPI. Such a
CPI decomposition can help processor designers quickly identify
and resolve critical performance bottlenecks.
|
|
|
Performance Models for
Network Processor Design
|
|
|
Tilman Wolf, Member, IEEE, and Mark A. Franklin, Fellow, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 6, JUNE 2006
|
|
|
Abstract — To provide a variety of new and advanced communications services, computer networks are required to perform
increasingly complex packet processing. This processing typically takes place on network routers and their associated components.
An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as “network processor” or NP. In
addition to multiple processors, NPs have multiple forms of on-chip memory, various network and off-chip memory interfaces, and
other specialized logic components such as CAMs (Content Addressable Memories). The design space for NPs (e.g., number of
processors, caches, cache sizes, etc.) is large due to the diverse workload, application requirements, and system characteristics.
System design constraints relate to the maximum chip area and the power consumption that are permissible while achieving defined
line rates and executing required packet functions. In this paper, an analytic performance model that captures the processing
performance, chip area, and power consumption for a prototypical NP is developed and used to provide quantitative insights into
system design trade offs. The model, parameterized with a networking application benchmark, provides the basis for the design of a
scalable, high-performance network processor and presents insights into how best to configure the numerous design elements
associated with NPs.
|
|
|
Reducing Queue Oscillation at a
Congested Link
|
|
|
Jong-hwan Kim and Ikjun Yeom, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 3, MARCH 2008
|
|
|
Abstract — Queue length oscillation at a congested link causes many undesirable properties such as large delay jitter, underutilization
of the link and packet drops in burst. The main reason of this oscillation is that most queue management schemes determine the drop
probability based on the current traffic without consideration on the impact of that drop probability on the future traffic. In this paper, we
propose a new active queue (AQM) scheme to reduce queue oscillation and realize stable queue length. The proposed scheme
measures the current arrival and drop rates, and uses them to estimate the next arrival rate. Based on this estimation, the scheme
calculates the drop probability which is expected to realize stable queue length. We present extensive simulation with various
topologies and offered traffic to evaluate performance of the proposed scheme. The results show that the proposed scheme
remarkably reduces queue length oscillation compared to other well-known AQMs. It is also shown that the proposed scheme
improves fairness among TCP flows due to the stable drop probability, and maintains high utilization with small queue length.
|
|
|
The Complexity of Verifying
Memory Coherence and Consistency
|
|
|
Jason F. Cantin, Student Member, IEEE, Mikko H. Lipasti, Member, IEEE, and
James E. Smith, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 7, JULY 2005
|
|
|
Abstract — The problem of testing shared memories for memory coherence and consistency is studied. First, it is proved that detecting
violations of coherence in an execution is NP-Complete, and it remains NP-Complete for a number of restricted instances. This result
leads to a proof that all known consistency models are NP-Hard to verify. The complexity of verifying consistency models is not a mere
consequence of coherence, and verifying sequential consistency remains NP-Complete even after coherence has been verified.
|
|
|
The Impact of Incorrectly Speculated Memory
Operations in a Multithreaded Architecture
|
|
|
Resit Sendag, Member, IEEE, Ying Chen, Student Member, IEEE, and
David J. Lilja, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 3, MARCH 2005
|
|
|
Abstract — The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread
execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they actually will be needed by
the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths
impact the memory system performance. We find that incorrect speculation (wrong execution) on the instruction and thread-level
provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load
instructions even after the instruction or thread-level control speculation is known to be incorrect, the cache misses observed on the
correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that
these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative Wrong
Execution Cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions.
Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture up to 18.5 percent
on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses.
|
|
|
The Server Reassignment Problem for Load
Balancing in Structured P2P Systems
|
|
|
Chyouhwa Chen and Kun-Cheng Tsai
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2008
|
|
|
Abstract — Application-layer peer-to-peer (P2P) networks are considered to be the most important development for next-generation
Internet infrastructure. For these systems to be effective, load balancing among the peers is critical. Most structured P2P systems rely
on ID-space partitioning schemes to solve the load imbalance problem and have been known to result in an imbalance factor of
Θ(logN) in the zone sizes. This paper makes two contributions. First, we propose addressing the virtual-server-based load balancing
problem systematically using an optimization-based approach and derive an effective algorithm to rearrange loads among the peers.
We demonstrate the superior performance of our proposal in general and its advantages over previous strategies in particular. We also
explore other important issues vital to the performance in the virtual server framework, such as the effect of the number of directories
employed in the system and the performance ramification of user registration strategies. Second, and perhaps more significantly, we
systematically characterize the effect of heterogeneity on load balancing algorithm performance and the conditions in which
heterogeneity may be easy or hard to deal with based on an extensive study of a wide spectrum of load and capacity scenarios.
|
|
|
The UCSC Kestrel Parallel Processor
|
|
|
Andrea Di Blas, Member, IEEE, David M. Dahle, Mark Diekhans, Leslie Grate, Jeffrey Hirschberg,
Kevin Karplus, Senior Member, IEEE, Hansjörg Keller, Mark Kendrick, Francisco J. Mesa-Martinez,
David Pease, Eric Rice, Angela Schultz, Member, IEEE, Don Speck, and
Richard Hughey, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 1, JANUARY 2005
|
|
|
Abstract — The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel
systems to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and
uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a
wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor
strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multipledata
stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on
efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational
chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility
due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD
architectures can excel in both programmability and performance. This paper presents the architecture, implementation, applications,
and observations of the Kestrel project at the University of California at Santa Cruz.
|
|
|
A NUCA Substrate for
Flexible CMP Cache Sharing
|
|
|
Jaehyuk Huh, Changkyu Kim, Student Member, IEEE, Hazim Shafi, Lixin Zhang,
Doug Burger, Senior Member, IEEE, and Stephen W. Keckler, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Wepropose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte
pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network
embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which
each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the
entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank
mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with
sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory
dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity,
especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in
terms of network traffic, bank accesses, and external memory accesses.
|
|
|
CMP Support for Large and Dependent
Speculative Threads
|
|
|
Christopher B. Colohan, Anastasia Ailamaki, Member, IEEE Computer Society,
J. Gregory Steffan, Member, IEEE, and Todd C. Mowry
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific
workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have
minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when
targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences
between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative
state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we
present a chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for the TLS
hardware. We also present a support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing
speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the
transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip-multiprocessor) speed up by
a factor of 1.9 to 2.9.
|
|
|
Understanding the Thermal Implications of
Multicore Architectures
|
|
|
Pedro Chaparro, José González, Member, IEEE Computer Society,
Grigorios Magklis, Member, IEEE Computer Society, Qiong Cai, and Antonio González, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that
multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level
parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can
be integrated on a chip. On the other hand, today’s higher than ever power densities have made temperature one of the main
limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have
addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives
for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are
explored and improvements for thread migration schemes are proposed.
|
|
|
Power-Efficient Approaches to
Redundant Multithreading
|
|
|
Niti Madan, Student Member, IEEE, and Rajeev Balasubramonian, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few
years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection
and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute
redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same
program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically
check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a
high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms
that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread
benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different
strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order
execution, and parallelization of the trailing thread.
|
|
|
Optimizing Dual-Core Execution for
Power Efficiency and Transient-Fault Recovery
|
|
|
Yi Ma, Hongliang Gao, Martin Dimitrov, and Huiyang Zhou, Member, IEEE Computer Society
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Dual-core execution (DCE) is an execution paradigm proposed to utilize chip multiprocessors to improve the performance of
single-threaded applications. Previous research has shown that DCE provides a complexity-effective approach to building a highly
scalable instruction window and achieves significant latency-hiding capabilities. In this paper, we propose to optimize DCE for power
efficiency and/or transient-fault recovery. In DCE, a program is first processed (speculatively) in the front processor and then reexecuted
by the back processor. Such reexecution is the key to eliminating the centralized structures that are normally associated with very large
instruction windows. In this paper, we exploit the computational redundancy in DCE to improve its reliability and its power efficiency. The
main contributions include: 1) DCE-based redundancy checking for transient-fault tolerance and a complexity-effective approach to
achieving full redundancy coverage and 2) novel techniques to improve the power/energy efficiency of DCE-based execution paradigms.
Our experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can effectively achieve transientfault
tolerance or significant performance enhancement in a power/energy-efficient way. Compared to the original DCE, the optimized
DCE has similar speedups (34 percent on average) over single-core processors while reducing the energy overhead from 93 percent to
31 percent.
|
|
|
Accelerating Sequential Applications
on CMPs Using Core Spilling
|
|
|
Jason Cong, Fellow, IEEE, Guoling Han, Ashok Jagannathan,
Glenn Reinman, Member, IEEE, and Krzysztof Rutkowski
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 8, AUGUST 2007
|
|
|
Abstract — Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level parallelism for multitasking or
multithreaded applications. However, single-threaded applications will have difficulty dynamically leveraging the statically partitioned
resources in a CMP. Such sequential applications may be difficult to statically decompose into threads or may simply be a legacy code
where recompilation is not possible or cost-effective. We present a novel approach to dynamically accelerate the performance of
sequential application(s) on multiple cores. Execution is allowed to spill from one core to another when resources on one core have been
exhausted. We propose two techniques to enable low-overhead migration between cores: prespilling and locality-based filtering. We
develop and analyze an arbitration mechanism to intelligently allocate cores among a set of sequential applications on a CMP. On
average, core spilling on an eight-coreCMP can accelerate single-threaded performance by 35 percent.Wefurther explore an eight-core
CMP running a multiple application workload composed of the entire SPEC 2000 benchmark suite in various combinations and arrival
times. Using core spilling to accelerate the current set of running applications in cases where there are idle cores, we achieve up to a
40 percent improvement in performance.
|
|
|
Compiler Techniques for Efficient
Communications in Circuit Switched Networks
for Multiprocessor Systems
|
|
|
Shuyi Shao, Student Member, IEEE, Alex K. Jones, Senior Member, IEEE, and
Rami Melhem, Fellow, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 3, MARCH 2009
|
|
|
Abstract — In this paper, we explore compiler techniques for achieving efficient communications on circuit switching interconnection
networks. We propose a compilation framework for identifying communication patterns and compiling these patterns as network
configuration directives. This has the potential of providing significant performance benefits when connections can be established in
the network prior to the actual communications. The framework includes a flexible and powerful communication pattern representation
scheme that captures the property of communication patterns and allows manipulation of these patterns. In this way, communication
phases can be identified within the application. Additionally, we extend the classification of static and dynamic communications to
include persistent communications. Persistent communications are a subclass of dynamic communications that remain unchanged for
large segments of the application execution. An experimental compiler has been developed to implement the framework. This compiler
is capable of detecting both static and persistent communications within an application. We show that for the NAS Parallel
Benchmarks, 100 percent of the point-to-point communications can be classified as either static or persistent, and 100 percent of the
collectives are either static or persistent with the exception of IS. Simulation-based performance analysis demonstrates the benefit of
using our compiler techniques for achieving efficient communications in multiprocessor systems.
|
|
|
Improving Performance of Dynamic
Programming via Parallelism and Locality on
Multicore Architectures
|
|
|
Guangming Tan, Member, IEEE, Ninghui Sun, Member, IEEE, and Guang R. Gao, Fellow, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2009
|
|
|
Abstract — Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems.
This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data
dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with
small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain
parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for
programming on multicore architectures. On an algorithmic level, both parallelism and locality do benefit from a specific data
dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and
percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an
analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the
160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-cores Intel Clovertown.
|
|
|
The Design of OpenMP Tasks
|
|
|
Eduard Ayguadé, Nawal Copty, Member, IEEE Computer Society, Alejandro Duran, Jay Hoeflinger,
Yuan Lin, Federico Massaioli, Member, IEEE, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 3, MARCH 2009
|
|
|
Abstract — OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application
complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is
evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the
primary goals of OpenMP 3.0 was to define a standard dialect to express and to exploit unstructured parallelism efficiently. This paper
presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking subcommittee which was formed for this
purpose. This paper summarizes the efforts of the subcommittee (spanning over two years) in designing, evaluating, and seamlessly
integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking
model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a
prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison
shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability.
|
|
|
Garbage Collector Scheduling in Dynamic,
Multiprocessor Real-Time Systems
|
|
|
Hyeonjoong Cho, Member, IEEE, Binoy Ravindran, Senior Member, IEEE, and
Chewoo Na, Student Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 6, JUNE 2009
|
|
|
Abstract — We consider garbage collection (GC) in dynamic, multiprocessor real-time systems. We consider the time-based,
concurrent GC approach and focus on real-time scheduling to obtain mutator timing assurances, despite memory allocation and
garbage collection. We present a scheduling algorithm called GCMUA. The algorithm considers mutator activities that are subject to
time/utility function time constraints, stochastic execution-time and memory demands, and overloads. We establish that GCMUA
probabilistically lower bounds each mutator activity’s accrued utility, lower bounds the system-wide total accrued utility, and upper
bounds the timing assurances’ sensitivity to variations in mutator execution-time and memory demand estimates. Our simulation
experiments validate our analytical results and confirm GCMUA’s effectiveness.
|
|
|
On Simulation and Design of Parallel-Systems
Schedulers: Are We Doing the Right Thing?
|
|
|
Edi Shmueli and Dror G. Feitelson, Senior Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 7, JULY 2009
|
|
|
Abstract — It is customary to use open-system trace-driven simulations to evaluate the performance of parallel-system schedulers. As
a consequence, all schedulers have evolved to optimize the packing of jobs in the schedule, as a means to improve a number of
performance metrics that are conjectured to be correlated with user satisfaction, with the premise that this will result in a higher
productivity in reality. We argue that these simulations suffer from severe limitations that lead to suboptimal scheduler designs and to
even dismissing potentially good design alternatives. We propose an alternative simulation methodology called site-level simulation, in
which the workload for the evaluation is generated dynamically by user models that interact with the system. We present a novel
scheduler called CREASY that exploits knowledge on user behavior to directly improve user satisfaction and compare its performance
to the original packing-based EASY scheduler. We show that user productivity improves by up to 50 percent under the user-aware
design, while according to the conventional metrics, performance may actually degrade.
|
|
|
Replication-Based Fault Tolerance
for MPI Applications
|
|
|
John Paul Walters and Vipin Chaudhary, Member, IEEE
|
|
|
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 7, JULY 2009
|
|
|
Abstract — As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to
minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This
results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated
checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference
implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of
fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage,
a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they
are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the
NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that
checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show
that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
|
|
|
|
|