Skip navigation
Brigham Young University
Department of Electrical & Computer Engineering

Dr. Penry's Publications

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

If you have institutional or personal access to the ACM Digital Library, IEEE Xplore, and/or SpringerLink, the DOI links will give you the official versions of papers.

(hide abstracts)

Improving the Interface Performance of Synthesized Structural FAME Simulators through Scheduling [abstract] (PDF)
David A. Penry
Proceedings of the 2015 IEEE International Conference on Computer Design (ICCD), October 2015.

Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs (also known as FAME simulators) can overcome the speed limitations of software-only simulators. However such simulators must be automatically synthesized or the time to design them becomes prohibitive. Previous work has shown that synthesized simulators should use a latency-insensitive design style in the hardware and a concurrent interface with the software.

We show that the performance of the interface in such a simulator can be improved significantly by scheduling all communication between hardware and software. Scheduling reduces the amount of hardware/software communication and reduces software overhead. Scheduling is made possible by exploiting the properties of the latency-insensitive design technique recommended in previous work. We observe speedups of up to 1.54 versus the former interface for a multi-core simulator.

ADL-Based Specification of Implementation Styles for Functional Simulators [abstract] (DOI, PDF)
David A. Penry and Kurtis D. Cahill
The International Journal of Parallel Programming (IJPP), Volume 41, Number 2, April 2013. Invited.

Functional simulators find widespread use as subsystems within microarchitectural simulators. The speed of a functional simulator is strongly influenced by its implementation style, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. Such a tradeoff has become particularly unfortunate as multicore processor designs proliferate and multi-threaded benchmarks must be simulated.

We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.

Interface Design for Synthesized Structural Hybrid Microarchitectural Simulators [abstract] (DOI, PDF)
Zhuo Ruan and David A. Penry
Proceedings of the 2012 IEEE International Conference on Computer Design (ICCD), October 2012.

Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs overcome the speed limitations of software-only simulators as systems become more complex, however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. The performance of a hybrid simulator is significantly affected by how the interface between software and hardware is constructed. We characterize the design space of interfaces for synthesized structural hybrid microarchitectural simulators, provide implementations for several such interfaces, and determine the tradeoffs involved in choosing an efficient design candidate.

Techniques for LI-BDN Synthesis for Hybrid Microarchitectural Simulation [abstract] (DOI, PDF)
Tyler S. Harris, Zhuo Ruan, and David A. Penry
Proceedings of the 2011 IEEE International Conference on Computer Design (ICCD), October 2011.

Computer designers rely upon near-cycle-accurate microarchitectural simulation to explore the design space of new systems. Unfortunately, such simulators are becoming increasingly slow as systems become more complex. Hybrid simulators which offload some of the simulation work onto FPGAs can increase the speed; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. Furthermore, FPGA implementations of simulators may require multiple FPGA clock cycles to implement behavior that takes place within one simulated clock cycle, making correct arbitrary composition of simulator components impossible and limiting the amount of hardware concurrency which can be achieved.

Latency-Insensitive Bounded Dataflow Networks (LI-BDNs) have been suggested as a means to permit composition of simulator components in FPGAs. However, previous work has required that LI-BDNs be created manually. This paper introduces techniques for automated synthesis of LI-BDNs from the processes of a System-C microarchitectural model. We demonstrate that LI-BDNs can be successfully synthesized. We also introduce a technique for reducing the overhead of LI-BDNs when the latency-insensitive property is unnecessary, resulting in up to a 60% reduction in FPGA resource requirements.

ADL-Based Specification of Implementation Styles for Functional Simulators [abstract] (DOI, PDF)
David A. Penry and Kurtis Cahill
Proceedings of the 11th International Conference on Embedded Computer Systems: Archietctures, Modeling, and Simulation (SAMOS), July 2011.

Functional simulators find widespread use as subsystems within microarchitectural simulators. The speed of functional simulators is strongly influenced by the implementation style of the functional simulator, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time.

We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.

Liberty Simulation Environment, Version 2.0
David A. Penry, Manish Vachharajani, Neil Vachharajani, Jason A. Blome, and David I. August
Available at http://bardd.ee.byu.edu/Software/LSE, July 2011.

A Single-Specification Principle for Functional-to-Timing Simulator Interface Design [abstract] (DOI, PDF)
David A. Penry
Proceedings of the 2011 International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2011.

Microarchitectural simulators are often partitioned into separate, but interacting, functional and timing simulators. These simulators interact through some interface whose level of detail depends upon the needs of the timing simulator. The level of detail supported by the interface profoundly affects the speed of the functional simulator, therefore, it is desirable to provide only the detail that is actually required. However, as the microarchitectural design space is explored, these needs may change, requiring corresponding time-consuming and error-prone changes to the interface. Thus simulator developers are tempted to include extra detail in the interface "just in case" it is needed later, trading off simulator speed for development time.

We show that this tradeoff is unnecessary if a single-specification design principle is practiced: write the simulator \emph{once} with an extremely detailed interface and then derive less-detailed interfaces from this detailed simulator. We further show that the use of an Architectural Description Language (ADL) with constructs for interface specification makes it possible to synthesize simulators with less-detailed interfaces from a highly-detailed specification with only a few lines of code and minimal effort. The speed of the resulting low-detail simulators is up to 14.4 times the speed of high-detail simulators.

Elaboration-time Synthesis of High-level Language Constructs in SystemC-based Microarchitectural Simulators [abstract] (DOI, PDF)
Zhuo Ruan, Kurtis Cahill, and David A. Penry
Proceedings of the 2010 IEEE International Conference on Computer Design (ICCD), October 2010.

Structural modeling serves as an efficient method for creating detailed microarchitectural models of complex microprocessors. High-level language constructs such as templates and object polymorphism are used to achieve a high degree of code reuse, thereby reducing development time. However, these modeling frameworks are currently too slow to evaluate future design of multicore microprocessors. The synthesis of portions of these models into hardware to form hybrid simulators promises to improve their speed substantially. Unfortunately, the high-level language constructs used in structural simulation frameworks are not typically synthesizable. One factor which limits their synthesis is that it is very difficult to determine statically what exactly the code and data to synthesize are. We propose an \emph{elaboration-time synthesis} method for SystemC-based microarchitectural simulators. As part of the runtime environment of our infrastructure, the synthesis tool extracts architectural information after elaboration, binds dynamic information to a low-level intermediate representation (IR), and synthesizes the IR to VHDL. We show that this approach permits the synthesis of high-level language constructs which could not be easily synthesized before.

Partitioning and Synthesis for Hybrid Architecture Simulators [abstract] (DOI, PDF)
Zhuo Ruan and David A. Penry
Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), June 2010.
Finalist for Best Student Paper Award.

Pure software simulators are too slow to simulate modern complex computer architectures and systems. Hybrid software/hardware simulators have been proposed to accelerate architecture simulation. However, the design of the hardware portions and hardware/software interface of the simulator is time-consuming, making it difficult to modify and improve these simulators. We here describe the Simulation Partitioning Research Infrastructure (SPRI), an infrastructure which partitions the software architectural model under user guidance and automatically synthesizes hybrid simulators. We also present a case study using SPRI to investigate the performance limitations and bottlenecks of the generated hybrid simulators.

Exposing Parallelism and Locality in a Runtime Parallel Optimization Framework [abstract] (DOI, PDF)
David A. Penry, Daniel J. Richins, Tyler S. Harris, David Greenland, and Koy D. Rehme
Proceedings of the 2010 ACM International Conference on Computing Frontiers (CF), May 2010.

The widespread use of tens to hundreds of processor cores in commodity systems will require widespread deployment of parallel applications. Despite advances in parallel programming models, it seems unlikely that the average programmer will be able to negotiate the twin shoals of understanding how to map parallelism well on a particular architecture and the likelihood that the particular architecture will not even be known at development time. Furthermore, for many important applications, a good mapping depends upon data or application characteristics not known until runtime.

Runtime parallel optimization has been suggested as a means to overcome these difficulties. For runtime parallel optimization to be effective, parallelism and locality which are expressed in the programming model need to be communicated to the runtime system. We suggest that the compiler should expose this information to the runtime using a representation which is independent of the programming model. We term such a representation an exposed parallelism and locality (EPL) representation. An EPL representation allows a single runtime environment to support many different models and architectures and to perform automatic parallelization optimization.

In order to accomplish these goals, an EPL representation needs to be task-based, multi-relational, hierarchical, and concise. This paper describes these four properties. It also presents an optimizing runtime, ADOPAR, which uses an EPL representation.

Issues in Hybrid Simulator Synthesis [abstract] (PDF)
Zhuo Ruan, Koy Rehme, and David A. Penry
Proceedings of the 4th Workshop on Architectural Research Prototyping (WARP), June 2009.

The Simulator Partitioning Research Infrastructure (SPRI) is a project to automate the generation of hybrid architectural simulators. In this paper, we examine the interesting issues and challenges in hybrid simulator synthesis.

Multicore Diversity: A Software Developer's Nightmare [abstract] (DOI, PDF)
David A. Penry
ACM SIGOPS Operating Systems Review (OSR), April 2009.

Commodity microprocessors with tens to hundreds of processor cores will require the widespread deployment of parallel programs. This deployment will be hindered by the architectural and environmental diversity introduced by multicore processors. To overcome diversity, the operating system must change its interactions with the program runtime and parallel runtime systems must be developed that can automatically adapt programs to the architecture and usage environment.

SPRI: Simulator Partitioning Research Infrastructure [abstract] (PDF)
Zhuo Ruan, Koy Rehme, and David A. Penry
Proceedings of the 3rd Workshop on Architectural Research Prototyping (WARP), June 2008.

Using FPGAs as architectural simulation accelerators has been widely discussed in the computer architecture design community. We previously proposed a hybrid SW/HW simulation infrastructure named SPRI (Simulator Partitioning Research Infrastructure) which automatically partitions the general timing model into the software and hardware portions for simulation speedup, conforming to the set-based partitioning specification. The SPRI platform takes two main inputs—partitioning specification and the architectural model; it then produces a modified SW architectural binary and a HW-accelerated RTL description which can communicate with each other, called hybrid SW/HW co-simulator—the final output of SPRI. Various experiment cases have been also run through the SPRI infrastructure to test its partitioning functionality and API wrapper generation.

UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development [abstract] (DOI, PDF, PostScript)
David I. August, Jonathan Chang, Sylvain Girbal, Daniel Gracia-Perez, Gilles Mouchard, David Penry, Olivier Temam, and Neil Vachharajani
IEEE Computer Architecture Letters (CAL), September 2007.

Simulator development is already a huge burden for many academic and industry research groups; future complex or heterogeneous multi-cores, as well as the multiplicity of performance metrics and required functionality, will make matters worse. We present a new simulation environment, called UNISIM, which is designed to rationalize simulator development by making it possible and efficient to distribute the overall effort over multiple research groups, even without direct cooperation. UNISIM achieves this goal with a combination of modular software development, distributed communication protocols, multi-level abstract modeling, interoperability capabilities, a set of simulator services APIs, and an open library/repository for providing a consistent set of simulator modules.

An Infrastructure for HW/SW Partitioning and Synthesis of Architectural Simulators [abstract] (PDF)
David A. Penry, Zhuo Ruan, and Koy Rehme
Proceedings of the 2nd Workshop on Architectural Research Prototyping (WARP), June 2007.

Many researchers are interested in using FPGAs to accelerate architectural simulation. Partitioning of the simulator between hardware and software is an important problem which has not been explored because of the enormous effort required to develop different RTL and communication infrastructure for each potential partition. We are developing a hybrid HW/SW simulation infrastructure which will provide tools for partitioning architectural simulators and synthesizing RTL for the hardware portions. This infrastructure will allow the community to explore and understand the partitioning problem and will eventually lead to automated partitioning algorithms.

You Can't Parallelize Just Once: Managing Manycore Diversity [abstract] (PDF)
David A. Penry
Position paper for the Workshop on Manycore Computing at ICS'07, June 2007.

One of the greatest challenges for the use of manycore architectures will be the growing diversity of manycore systems. This diversity will come in many forms: architecture, goals, programming languages, pre-parallelization, and dynamicisim. We argue that the most managable approach to such diversity is to delay optimization and parallelization until runtime.

The Acceleration of Structural Microarchitectural Simulation via Scheduling [abstract] (PDF, PostScript)
David A. Penry
Ph.D. Thesis, Department of Computer Science, Princeton University, November 2006.

Microarchitects rely upon simulation to evaluate design alternatives, yet constructing an accurate simulator by hand is a difficult and time-consuming process because simulators are usually written in sequential languages while the system being modeled is concurrent. Structural modeling can mitigate this difficulty by allowing the microarchitect to specify the simulation model in a concurrent, structural form; a simulator compiler then generates a simulator from the model. However, the resulting simulators are generally slower than those produced by hand. The thesis of this dissertation is that simulation speed improvements can be obtained by careful scheduling of the work to be performed by the simulator onto single or multiple processors.

For scheduling onto single processors, this dissertation presents an evaluation of previously proposed scheduling mechanisms in the context of a structural microarchitectural simulation framework which uses a particular model of computation, the Heterogeneous Synchronous Reactive (HSR) model, and improvements to these mechanisms which make them more effective or more feasible for microarchitectural models. A static scheduling technique known as partitioned scheduling is shown to offer the most performance improvement: up to 2.08 speedup. This work furthermore proves that the the Discrete Event model of computation can be statically scheduled using partitioned scheduling when restricted in ways that are commonly assumed in microarchitectural simulation.

For scheduling onto multiple processors, this dissertation presents the first automatic parallelization of simulators using the HSR model of computation. It shows that effective parallelization requires techniques to avoid waiting due to locks and to improve cache locality. Two novel heuristics for lock mitigation and two for cache locality improvement are introduced and evaluated on three different parallel systems. The combination of lock mitigation and locality improvement is shown to allow superlinear speedup for some models: up to 7.56 for four processors.

The Liberty Simulation Environment: A Deliberate Approach to High-Level System Modeling [abstract] (DOI, PDF)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
ACM Transactions on Computer Systems (TOCS), Volume 24, Number 3, August 2006.

In digital hardware system design, the quality of the product is directly related to the number of meaningful design alternatives properly considered. Unfortunately, existing modeling methodologies and tools have properties which make them less than ideal for rapid and accurate design-space exploration. This article identifies and evaluates the shortcomings of existing methods to motivate the Liberty Simulation Environment (LSE). LSE is a high-level modeling tool engineered to address these limitations, allowing for the rapid construction of accurate high-level simulation models. LSE simplifies model specification with low-overhead component-based reuse techniques and an abstraction for timing control. As part of a detailed description of LSE, this article presents these features, their impact on model specification effort, their implementation, and optimizations created to mitigate their otherwise deleterious impact on simulator execution performance.

Exploiting Parallelism and Structure to Accelerate the Simulation of Chip Multi-processors [abstract] (DOI, PDF, PostScript)
David A. Penry, Daniel Fay, David Hodgdon, Ryan Wells, Graham Schelle, David I. August, and Daniel A. Connors
Proceedings of the Twelfth International Symposium on High-Performance Computer Architecture (HPCA), February 2006.

Simulation is an important means of evaluating new microarchitectures. Current trends toward chip multi-processors (CMPs) try the ability of designers to develop efficient simulators. CMP simulation speed can be improved by exploiting parallelism in the CMP simulation model. This may be done by either running the simulation on multiple processors or by integrating multiple processors into the simulation to replace simulated processors. Doing so usually requires tedious manual parallelization or re-design to encapsulate processors.

Both problems can be avoided by generating the simulator from a concurrent, structural model of the CMP. Such a model not only resembles hardware, making it easy to understand and use, but also provides sufficient information to automatically parallelize the simulator without requiring manual model changes. Furthermore, individual components of the model such as processors may be replaced with equivalent hardware without requiring repartitioning.

This paper presents techniques to perform automated simulator parallelization and hardware integration for CMP structural models. We show that automated parallelization can achieve an 7.60 speedup for a 16-processor CMP model on a conventional 4-processor shared-memory multiprocessor. We demonstrate the power of hardware integration by integrating eight hardware PowerPC cores into a CMP model, achieving a speedup of up to 5.82.

Hardware-Modulated Parallelism in Chip Multiprocessors [abstract] (DOI, PDF)
Julia Chen, Philo Juang, Kevin Ko, Gilberto Contreras, David Penry, Ram Rangan, Adam Stoler, Li-Shiuan Peh, and Margaret Martonosi
2005 Workshop on Design, Architecture and Simulation of Chip Multi-Processors (dasCMP), November 2005.

Chip multi-processors(CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highly-parallel CMPs?

This paper presents and evaluates a new approach to highly-parallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of multi-granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling and context-switching, designed to facilitate effective hardware run-time mapping of threads to cores at low overheads.

Dynamic modulation of parallelism provides the ability to respond to run-time variability that arises from dataset changes, memory system effects and power spikes and lulls, to name a few. It also naturally provides a long-term CMP platform with performance portability and tolerance to frequency and reliability variations across multiple CMP generations. Our simulations of a range of applications posessing do-all, streaming and recursive parallelism show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on average, on a 16-core vs. a 1-core system. This is achieved with modest amounts of hardware support that allows for low overheads in thread creation, scheduling, and context-switching. In particular, our simulations motivated the need for hardware support, showing that the large thread management overheads of current run-time software systems can lead to up to 6.5X slowdown. The difficulties faced in static scheduling were shown in our simulations with a static scheduling algorithm, fed with oracle profiled inputs suffering up to 107% slowdown compared to NDP's hardware scheduler, due to its inability to handle memory system variabilities. More broadly, we feel that the ideas presented here show promise for scaling to the systems expected in ten years, where the advantages of high transistor counts may be dampened by difficulties in circuit variations.

Rapid Development of a Flexible Validated Processor Model [abstract] (PDF, PostScript)
David A. Penry, Manish Vachharajani, and David I. August
Proceedings of the Workshop on Modeling, Benchmarking, and Simulation (MoBS), June 2005.

Given the central role of simulation in procesor design and research, an accurate, validated, and easily modified simulation model is extremely desirable. Prior work proposed a modeling methodology with the claim that it allows rapid construction of flexible validated models. In this paper, we present our experience using this methodology to construct a flexible validated model of Intel's Itanium 2 processor, lending support to their claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction(CPI) to within 7.9% on average for the entire SPEC CINT2000 benchmark suite. We find that aggregate accuracy for a metric like CPI is not sufficient; aggregate measures like CPI may conceal remaining internal "offsetting errors" which can adversely affect conclusions drawn from the model. We then modified the model to reduce error in specific performance constituents. In 2 1/2 person-weeks, overall constituent error was reduced from 3.1% to 2.1%, while simultaneously reducing average aggregate CPI error to 5.4%, demonstrating that model flexibility allows rapid improvements to accuracy. Flexibility is further shown by making significant changes to the model in under eight person-weeks to explore two novel microarchitectural techniques.

Rapid Development of Flexible Validated Processor Models [abstract] (PDF, PostScript)
David A. Penry, Manish Vachharajani, and David I. August
Liberty Research Group Technical Report 04-03, November 2004.

For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly-obtainable variations of the baseline. Unfortunately, these two goals are generally considered to be at odds; the set of validated models is considered exclusive of the set of easily malleable models. Vachharajani et al. challenge this belief and propose a modeling methodology they claim allows rapid construction of flexible validated models. Unfortunately, they only present anecdotal and secondary evidence to support their claims.

In this paper, we present our experience using this methodology to construct a validated flexible model of Intel's Itanium 2 processor. Our practical experience lends support to the above claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction (CPI) to within 7.9% on average for the entire SPEC CINT2000 benchmark suite. Our experience with this model showed us that aggregate accuracy for a metric like CPI is not sufficient. Aggregate measures like CPI may conceal remaining internal ``offsetting errors'' which can adversely affect conclusions drawn from the model. Using this as our motivation, we explore the flexibility of the model by modifying it to target specific error constituents, such as front-end stall errors. In 2 1/2 person-weeks, average CPI error was reduced to 5.4%. The targeted error constituents were reduced more dramatically; front-end stall errors were reduced from 5.6% to 1.6%. The swift implementation of significant new architectural features on this model further demonstrated its flexibility.

The Liberty Simulation Environment, Version 1.0 [abstract] (DOI, PDF, PostScript)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason Blome, and David I. August
Performance Evaluation Review: Special Issue on Tools for Architecture Research (PER), Volume 31, Number 4, March 2004. Invited.

High-level hardware modeling via simulation is an essential step in hardware systems design and research. Despite the importance of simulation, current model creation methods are error prone and are unnecessarily time consuming. To address these problems, we have publicly released the Liberty Simulation Environment (LSE), Version 1.0, consisting of a simulator builder and automatic visualizer based on a shared hardware description language. LSE's design was motivated by a careful analysis of the strengths and weaknesses of existing systems. This has resulted in a system in which models are easier to understand, faster to develop, and have performance on par with other systems. LSE is capable of modeling any synchronous hardware system. To date, LSE has been used to simulate and convey ideas about a diverse set of complex systems including a chip multiprocessor out-of-order IA-64 machine and a multiprocessor system with detailed device models.

The Liberty Simulation Environment: A Deliberate Approach to High-Level System Modeling [abstract] (PDF, PostScript)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, Sharad Malik, and David I. August
Liberty Research Group Technical Report 04-02, March 2004.

In digital hardware system design, the quality of the product is directly related to the number of meaningful design alternatives properly considered. Unfortunately, existing modeling methodologies and tools have properties which make them less than ideal for rapid and accurate design-space exploration. This article identifies and evaluates the shortcomings of existing methods to motivate the Liberty Simulation Environment (LSE). LSE is a high-level modeling tool engineered to address these limitations, allowing for the rapid construction of accurate high-level simulation models. LSE simplifies model specification with low-overhead component-based reuse techniques and an abstraction for timing control. As part of a detailed description of LSE, this article presents these features, their impact on model specification effort, their implementation, and optimizations created to mitigate their otherwise deleterious impact on simulator execution performance.

Liberty Simulation Environment, Version 1.0
Manish Vachharajani, David A. Penry, Neil Vachharajani, Jason A. Blome, and David I. August
Available at http://bardd.ee.byu.edu/Software/LSE, December 2003.

Optimizations for a Simulator Construction System Supporting Reusable Components [abstract] (DOI, PDF, PostScript)
David A. Penry and David I. August
Proceedings of the 40th Design Automation Conference (DAC), June 2003.

Exploring a large portion of the microprocessor design space requires the rapid development of efficient simulators. While some systems support rapid model development through the structural composition of reusable concurrent components, the Liberty Simulation Environment (LSE) provides additional reuse-enhancing features. This paper evaluates the cost of these features and presents optimizations to reduce their impact. With these optimizations, an LSE model using reusable components outperforms a SystemC model using custom components by 6%.

Microarchitectural Exploration with Liberty [abstract] (DOI, PDF, PostScript)
Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, and David I. August
Proceedings of the 35th International Symposium on Microarchitecture (MICRO), November 2002.
Winner Best Student Paper Award.

To find the best designs, architects must rapidly simulate many design alternatives and have confidence in the results. Unfortunately, the most prevalent simulator construction methodology, hand-writing monolithic simulators in sequential programming languages, yields simulators that are hard to retarget, limiting the number of designs explored, and hard to understand, instilling little confidence in the model. Simulator construction tools have been developed to address these problems, but analysis reveals that they do not address the root cause, the error-prone mapping between the concurrent, structural hardware domain and the sequential, functional software domain. This paper presents an analysis of these problems and their solution, the Liberty Simulation Environment (LSE). LSE automatically constructs a simulator from a machine description that closely resembles the hardware, ensuring fidelity in the model. Furthermore, through a strict but general component communication contract, LSE enables the creation of highly reusable component libraries, easing the task of rapidly exploring ever more exotic designs.

Coverage of Bridging Faults by Random Testing in IDDQ Test Environment (DOI)
Rochit Rajsuman and David A. Penry
Proceedings of the 6th International Conference on VLSI Design (VLSI), January 1993.

IDDQ Fault Coverage by Random Testing
David A. Penry
Masters Thesis, Department of Computer Engineering, Case Western Reserve University, April 1992.