Brigham Young University
Department of Electrical & Computer Engineering

BARDD Publications

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

If you have institutional or personal access to the ACM Digital Library, IEEE Xplore, and/or SpringerLink, the DOI links will give you the official versions of papers.


Improving the Interface Performance of Synthesized Structural FAME Simulators through Scheduling [abstract] (PDF)
David A. Penry
Proceedings of the 2015 IEEE International Conference on Computer Design (ICCD), October 2015.

Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs (also known as FAME simulators) can overcome the speed limitations of software-only simulators. However, such simulators must be automatically synthesized or the time to design them becomes prohibitive. Previous work has shown that synthesized simulators should use a latency-insensitive design style in the hardware and a concurrent interface with the software.

We show that the performance of the interface in such a simulator can be improved significantly by scheduling all communication between hardware and software. Scheduling reduces the amount of hardware/software communication and reduces software overhead. It is made possible by exploiting the properties of the latency-insensitive design technique recommended in previous work. We observe speedups of up to 1.54× over the previous interface for a multi-core simulator.
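As a rough illustration of why scheduling communication helps (not the paper's actual implementation), a static schedule lets the software pack every signal due in a simulated cycle into one buffer at fixed offsets, so a cycle costs a single transfer instead of one transaction per signal. All names, offsets, and signal choices below are invented for the sketch.

```python
import struct

# Hypothetical precomputed schedule: (signal name, byte offset, struct format).
# In a real scheduled interface these offsets would be derived at synthesis time.
SCHEDULE = [("pc", 0, "<Q"), ("branch_taken", 8, "<B"), ("mem_addr", 9, "<Q")]
FRAME_SIZE = 17  # 8 + 1 + 8 bytes

def pack_frame(values):
    """Pack all software-to-hardware signals for one simulated cycle into a
    single buffer, so one batched transfer replaces per-signal transactions."""
    buf = bytearray(FRAME_SIZE)
    for name, off, fmt in SCHEDULE:
        struct.pack_into(fmt, buf, off, values[name])
    return bytes(buf)

def unpack_frame(buf):
    """Hardware-to-software direction: recover each signal from its slot."""
    return {name: struct.unpack_from(fmt, buf, off)[0]
            for name, off, fmt in SCHEDULE}

frame = pack_frame({"pc": 0x1000, "branch_taken": 1, "mem_addr": 0xdeadbeef})
assert unpack_frame(frame) == {"pc": 0x1000, "branch_taken": 1,
                               "mem_addr": 0xdeadbeef}
```

The design point being illustrated: the per-transfer overhead (driver call, DMA setup) is paid once per cycle rather than once per signal.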

Address Space Translation for FPGA Accelerated Simulators [abstract]
Michael Chamberlain
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, June 2015.

Microarchitectural simulation is needed to help explore the large design space of new computer systems. These simulations take increasingly long to run due to the increasing complexity of modern processors. Co-simulation and high-level synthesis are promising approaches for improving the overall time required for microarchitectural simulation; they can contribute to low design times and fast simulation speeds, permitting a larger range of design space exploration. While promising, co-simulation techniques must find effective ways to map the host memory address space to the FPGA memory address space in order to correctly transfer simulation data between the host and FPGA.

Load relations mapping is a new technique that builds upon existing techniques to provide support for the discovery and translation of runtime memory addresses to their equivalent FPGA memory addresses. This is accomplished by storing object reachability information discovered during a memory profiling run and later using it to recreate an object reachability mapping at runtime. This mapping can be traversed to discover needed memory addresses. We demonstrate how this technique can be used by incorporating it into the FAMEbuilder tool flow. Results show that simulation speed is not reduced and that only a small overhead is required to perform the additional memory initialization at the start of simulation. Area increases are also shown and are limited to roughly a 10% increase on small single-core models.
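A minimal sketch of the reachability idea described above, not the thesis's implementation: a profiling run records how each synthesized object is reached from a known root (a chain of field names here standing in for pointer offsets), and at runtime the same chain is replayed to find the object's current host address and pair it with its FPGA-side address. Every class, path, and address below is invented for illustration.

```python
# Hypothetical model objects: a core containing a re-order buffer.
class ROB: ...

class Core:
    def __init__(self):
        self.rob = ROB()

# Recorded during a profiling run: access paths from the root, not raw
# addresses, so they remain valid across runs with different layouts.
ACCESS_PATHS = {"core.rob": ["rob"]}
FPGA_ADDRS = {"core.rob": 0x4000}  # assumed FPGA-side addresses

def build_translation(root):
    """Recreate the host-address -> FPGA-address mapping at runtime by
    walking each recorded reachability path from the root object."""
    table = {}
    for name, path in ACCESS_PATHS.items():
        obj = root
        for field in path:
            obj = getattr(obj, field)
        table[id(obj)] = FPGA_ADDRS[name]
    return table

core = Core()
table = build_translation(core)
assert table[id(core.rob)] == 0x4000
```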

Interface Design and Synthesis for Structural Hybrid Microarchitectural Simulators [abstract] (PDF)
Zhuo Ruan
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, December 2013.

Computer architects have discovered the potential of using FPGAs to accelerate software microarchitectural simulators. One type of FPGA-accelerated microarchitectural simulator, the hybrid structural microarchitectural simulator, is very promising: it combines structural software and hardware, and this organization provides both modeling flexibility and fast simulation speed. The performance of a hybrid simulator is significantly affected by how the interface between software and hardware is constructed. The work of this thesis creates an infrastructure, named the Simulator Partitioning Research Infrastructure (SPRI), to implement the synthesis of hybrid structural microarchitectural simulators, including simulator partitioning, simulator-to-hardware synthesis, and interface synthesis. With the support of SPRI, this thesis characterizes the design space of interfaces for synthesized hybrid structural microarchitectural simulators and provides implementations for several such interfaces. The evaluation thoroughly studies the important design tradeoffs and performance factors (e.g., hardware capacity, design scalability, and interface latency) involved in choosing an efficient interface. This work is valuable to the computer architecture research community: it not only contributes a complete synthesis infrastructure, but also provides guidelines to architects on how to organize software microarchitectural models and choose a proper software/hardware interface so that the hybrid simulators synthesized from these models can achieve desirable speedup.

Techniques for LI-BDN Synthesis for Hybrid Microarchitectural Simulation [abstract] (PDF)
Tyler S. Harris
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, May 2013.

Microarchitects use simulators to explore the microprocessor design space. As multiprocessor chips become more common, simulators will become both more essential and more complex. The added complexity will require new simulation techniques in order to reduce the run time of the simulators to the point where they are useful for design space exploration.

Hybrid simulation is one technique for creating faster simulations, but it comes at the cost of vastly increased compiler complexity. CAD tools can be built to automatically produce hybrid simulators, and hence neutralize the threat of compiler complexity. One troublesome area for hybrid simulators is the creation of associative data structures, such as re-order buffers and translation lookaside buffers. The naive synthesis technique is to create a large matrix of flip-flops and LUTs; it is a greater challenge to create a CAD tool that recognizes these structures and synthesizes them efficiently into hardware. This work presents a technique for accomplishing this task.

ADL-Based Specification of Implementation Styles for Functional Simulators [abstract] (DOI, PDF)
David A. Penry and Kurtis D. Cahill
The International Journal of Parallel Programming (IJPP), Volume 41, Number 2, April 2013. Invited.

Functional simulators find widespread use as subsystems within microarchitectural simulators. The speed of a functional simulator is strongly influenced by its implementation style, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. Such a tradeoff has become particularly unfortunate as multicore processor designs proliferate and multi-threaded benchmarks must be simulated.

We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.

Interface Design for Synthesized Structural Hybrid Microarchitectural Simulators [abstract] (DOI, PDF)
Zhuo Ruan and David A. Penry
Proceedings of the 2012 IEEE International Conference on Computer Design (ICCD), October 2012.

Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs overcome the speed limitations of software-only simulators as systems become more complex; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. The performance of a hybrid simulator is significantly affected by how the interface between software and hardware is constructed. We characterize the design space of interfaces for synthesized structural hybrid microarchitectural simulators, provide implementations for several such interfaces, and determine the tradeoffs involved in choosing an efficient design candidate.

Automatic Discovery and Exposition of Parallelism in Serial Applications for Compiler-Inserted Runtime Adaptation [abstract] (PDF)
David Greenland
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, June 2012.

Compiler-Inserted Runtime Adaptation (CIRA) is a compilation and runtime adaptation strategy which has great potential for increasing performance in multicore systems. In this strategy, the compiler inserts directives into the application which will adapt the application at runtime. Its ability to overcome the obstacles of architectural and environmental diversity coupled with its flexibility to work with many programming languages and styles of applications make it a very powerful tool. However, it is not complete. In fact, there are many pieces still needed to accomplish these lofty goals.

This work describes the automatic discovery of parallelism inherent in an application and the generation of an intermediate representation to expose that parallelism. This work shows on six benchmark applications that a significant amount of parallelism which was not initially apparent can be automatically discovered. This work also shows that the parallelism can then be exposed in a representation which is also automatically generated. This is accomplished by a series of analysis and transformation passes with only minimal programmer-inserted directives. This series of passes forms a necessary part of the CIRA toolchain called the concurrency compiler. This concurrency compiler proves that a representation with exposed parallelism and locality can be generated by a compiler. It also lays the groundwork for future, more powerful concurrency compilers.

This work also describes the extension of the intermediate representation to support hierarchy, a prerequisite characteristic to the creation of the concurrency compiler. This extension makes it capable of representing many more applications in a much more effective way. This extension to support hierarchy allows much more of the parallelism discovered by the concurrency compiler to be stored in the representation.

The Pulled-Macro-Dataflow Model: An Execution Model for Multicore Shared-Memory Computers [abstract] (PDF)
Daniel J. Richins
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, December 2011.

The macro-dataflow model of execution has been used in scheduling heuristics for directed acyclic graphs. Since this model was developed for the scheduling of parallel applications on distributed computing systems, it is inadequate when applied to the multicore shared-memory computers prevalent in the market today.

The pulled-macro-dataflow model is put forth as an alternative to the macro-dataflow model, having been designed specifically to accurately describe the memory bandwidth limitations and request-driven nature of communication characteristic of today's machines. The performance of the common scheduling heuristics DSC and CASS-II is evaluated under the pulled-macro-dataflow model, and it is shown that their poor performance motivates the development of a new scheduling heuristic. The Concurrent Tournament Reducer (ConTouR) is developed as a scheduling heuristic which operates well with the pulled-macro-dataflow model.

ConTouR is compared to the existing heuristics Load Balancing and Communication Minimization in scheduling two programs. For both programs, the other reducers are shown to outperform ConTouR.

Techniques for LI-BDN Synthesis for Hybrid Microarchitectural Simulation [abstract] (DOI, PDF)
Tyler S. Harris, Zhuo Ruan, and David A. Penry
Proceedings of the 2011 IEEE International Conference on Computer Design (ICCD), October 2011.

Computer designers rely upon near-cycle-accurate microarchitectural simulation to explore the design space of new systems. Unfortunately, such simulators are becoming increasingly slow as systems become more complex. Hybrid simulators which offload some of the simulation work onto FPGAs can increase the speed; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. Furthermore, FPGA implementations of simulators may require multiple FPGA clock cycles to implement behavior that takes place within one simulated clock cycle, making correct arbitrary composition of simulator components impossible and limiting the amount of hardware concurrency which can be achieved.

Latency-Insensitive Bounded Dataflow Networks (LI-BDNs) have been suggested as a means to permit composition of simulator components in FPGAs. However, previous work has required that LI-BDNs be created manually. This paper introduces techniques for automated synthesis of LI-BDNs from the processes of a System-C microarchitectural model. We demonstrate that LI-BDNs can be successfully synthesized. We also introduce a technique for reducing the overhead of LI-BDNs when the latency-insensitive property is unnecessary, resulting in up to a 60% reduction in FPGA resource requirements.
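As a toy illustration of the latency-insensitive idea (a deliberately simplified sketch, not the paper's synthesis technique): each component is wrapped so that it fires only when every input FIFO holds a token and every output FIFO has space, so composed components stay correct no matter how many FPGA cycles each takes internally. All class and method names below are invented, and real LI-BDNs carry additional rules this sketch omits.

```python
from collections import deque

class LIBDNNode:
    """Toy latency-insensitive node with bounded FIFOs on its ports.
    Firing is gated on input availability and output space, which is what
    makes arbitrary composition safe despite varying internal latencies."""
    def __init__(self, fn, n_inputs, depth=2):
        self.fn = fn  # combinational function the node implements
        self.inputs = [deque(maxlen=depth) for _ in range(n_inputs)]
        self.output = deque(maxlen=depth)

    def try_fire(self):
        # Fire only when every input has a token and the output has room.
        if all(q for q in self.inputs) and len(self.output) < self.output.maxlen:
            args = [q.popleft() for q in self.inputs]
            self.output.append(self.fn(*args))
            return True
        return False

adder = LIBDNNode(lambda a, b: a + b, n_inputs=2)
adder.inputs[0].append(3)
assert not adder.try_fire()   # second operand not yet available: stall safely
adder.inputs[1].append(4)
assert adder.try_fire()
assert adder.output.popleft() == 7
```

The overhead-reduction result in the abstract corresponds to stripping this wrapper logic when a component's environment never actually stalls it.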

ADL-Based Specification of Implementation Styles for Functional Simulators [abstract] (DOI, PDF)
David A. Penry and Kurtis Cahill
Proceedings of the 11th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), July 2011.

Functional simulators find widespread use as subsystems within microarchitectural simulators. The speed of functional simulators is strongly influenced by the implementation style of the functional simulator, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time.

We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.

Liberty Simulation Environment, Version 2.0
David A. Penry, Manish Vachharajani, Neil Vachharajani, Jason A. Blome, and David I. August
Available at http://bardd.ee.byu.edu/Software/LSE, July 2011.

A Single-Specification Principle for Functional-to-Timing Simulator Interface Design [abstract] (DOI, PDF)
David A. Penry
Proceedings of the 2011 International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2011.

Microarchitectural simulators are often partitioned into separate, but interacting, functional and timing simulators. These simulators interact through some interface whose level of detail depends upon the needs of the timing simulator. The level of detail supported by the interface profoundly affects the speed of the functional simulator; therefore, it is desirable to provide only the detail that is actually required. However, as the microarchitectural design space is explored, these needs may change, requiring corresponding time-consuming and error-prone changes to the interface. Thus simulator developers are tempted to include extra detail in the interface "just in case" it is needed later, trading off simulator speed for development time.

We show that this tradeoff is unnecessary if a single-specification design principle is practiced: write the simulator once with an extremely detailed interface and then derive less-detailed interfaces from this detailed simulator. We further show that the use of an Architectural Description Language (ADL) with constructs for interface specification makes it possible to synthesize simulators with less-detailed interfaces from a highly-detailed specification with only a few lines of code and minimal effort. The speed of the resulting low-detail simulators is up to 14.4 times the speed of high-detail simulators.
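A very rough sketch of the derivation idea, under stated assumptions: the simulator is written once against a maximally detailed step interface, and a coarser interface is a projection of it. The sketch below does the projection with a runtime wrapper purely for illustration; the paper's point is that the projection is done by ADL-driven synthesis, so the unneeded detail is never computed at all. Field names and the `make_interface` helper are invented.

```python
# Hypothetical full set of per-instruction fields a detailed functional
# simulator step could report to a timing simulator.
DETAILED_FIELDS = ("pc", "opcode", "srcs", "dests", "mem_addr", "branch_dir")

def make_interface(needed):
    """Derive a less-detailed step() that reports only the `needed` fields.
    (Illustrative wrapper; the real speedup comes from synthesizing a
    simulator that never computes the discarded fields.)"""
    def step(detailed_step):
        full = detailed_step()  # dict containing all DETAILED_FIELDS
        return {k: full[k] for k in needed}
    return step

# A timing model that only tracks control flow asks for a coarse interface:
coarse_step = make_interface(("pc", "branch_dir"))
result = coarse_step(lambda: {"pc": 0x400, "opcode": "beq", "srcs": (1, 2),
                              "dests": (), "mem_addr": None,
                              "branch_dir": True})
assert result == {"pc": 0x400, "branch_dir": True}
```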

Elaboration-time Synthesis of High-level Language Constructs in SystemC-based Microarchitectural Simulators [abstract] (DOI, PDF)
Zhuo Ruan, Kurtis Cahill, and David A. Penry
Proceedings of the 2010 IEEE International Conference on Computer Design (ICCD), October 2010.

Structural modeling serves as an efficient method for creating detailed microarchitectural models of complex microprocessors. High-level language constructs such as templates and object polymorphism are used to achieve a high degree of code reuse, thereby reducing development time. However, these modeling frameworks are currently too slow to evaluate future designs of multicore microprocessors. The synthesis of portions of these models into hardware to form hybrid simulators promises to improve their speed substantially. Unfortunately, the high-level language constructs used in structural simulation frameworks are not typically synthesizable. One factor which limits their synthesis is that it is very difficult to determine statically what exactly the code and data to synthesize are. We propose an elaboration-time synthesis method for SystemC-based microarchitectural simulators. As part of the runtime environment of our infrastructure, the synthesis tool extracts architectural information after elaboration, binds dynamic information to a low-level intermediate representation (IR), and synthesizes the IR to VHDL. We show that this approach permits the synthesis of high-level language constructs which could not be easily synthesized before.
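To illustrate why elaboration-time extraction sidesteps the static-analysis problem (a simplified sketch, not the paper's SystemC tool): the concrete structure of a model, such as how many cores were instantiated from a runtime parameter, only exists after the constructors have run, so the tool can walk the elaborated object tree instead of trying to deduce the structure from source. All names below are invented.

```python
class Module:
    """Stand-in for an elaborated structural component."""
    def __init__(self, name):
        self.name = name
        self.children = []

def build_model(n_cores):
    # n_cores might come from a config file: invisible to static analysis,
    # but fully concrete once elaboration (construction) has finished.
    top = Module("top")
    top.children = [Module(f"core{i}") for i in range(n_cores)]
    return top

def extract(module, out=None):
    """Post-elaboration walk: record the now-concrete hierarchy so a
    synthesis back end can bind it to an intermediate representation."""
    out = [] if out is None else out
    out.append(module.name)
    for child in module.children:
        extract(child, out)
    return out

assert extract(build_model(2)) == ["top", "core0", "core1"]
```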

Partitioning and Synthesis for Hybrid Architecture Simulators [abstract] (DOI, PDF)
Zhuo Ruan and David A. Penry
Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), June 2010.
Finalist for Best Student Paper Award.

Pure software simulators are too slow to simulate modern complex computer architectures and systems. Hybrid software/hardware simulators have been proposed to accelerate architecture simulation. However, the design of the hardware portions and hardware/software interface of the simulator is time-consuming, making it difficult to modify and improve these simulators. We here describe the Simulation Partitioning Research Infrastructure (SPRI), an infrastructure which partitions the software architectural model under user guidance and automatically synthesizes hybrid simulators. We also present a case study using SPRI to investigate the performance limitations and bottlenecks of the generated hybrid simulators.

Exposing Parallelism and Locality in a Runtime Parallel Optimization Framework [abstract] (DOI, PDF)
David A. Penry, Daniel J. Richins, Tyler S. Harris, David Greenland, and Koy D. Rehme
Proceedings of the 2010 ACM International Conference on Computing Frontiers (CF), May 2010.

The widespread use of tens to hundreds of processor cores in commodity systems will require widespread deployment of parallel applications. Despite advances in parallel programming models, it seems unlikely that the average programmer will be able to negotiate the twin shoals of understanding how to map parallelism well on a particular architecture and the likelihood that the particular architecture will not even be known at development time. Furthermore, for many important applications, a good mapping depends upon data or application characteristics not known until runtime.

Runtime parallel optimization has been suggested as a means to overcome these difficulties. For runtime parallel optimization to be effective, parallelism and locality which are expressed in the programming model need to be communicated to the runtime system. We suggest that the compiler should expose this information to the runtime using a representation which is independent of the programming model. We term such a representation an exposed parallelism and locality (EPL) representation. An EPL representation allows a single runtime environment to support many different models and architectures and to perform automatic parallelization optimization.

In order to accomplish these goals, an EPL representation needs to be task-based, multi-relational, hierarchical, and concise. This paper describes these four properties. It also presents an optimizing runtime, ADOPAR, which uses an EPL representation.

An Internal Representation for Adaptive Online Parallelization [abstract] (PDF)
Koy D. Rehme
Masters Thesis, Department of Electrical and Computer Engineering, Brigham Young University, August 2009.

Future computer processors may have tens or hundreds of cores, increasing the need for efficient parallel programming models. The nature of multicore processors will present applications with the challenge of diversity: a variety of operating environments, architectures, and data will be available and the compiler will have no foreknowledge of the environment until run time. ADOPAR is a unifying framework that attempts to overcome diversity by separating discovery and packaging of parallelism. Scheduling for execution may then occur at run time when diversity may best be resolved.

This work presents a compact representation of parallelism based on the task graph programming model, tailored especially for ADOPAR and for regular and irregular parallel computations. Task graphs can be unmanageably large for fine-grained parallelism. Rather than representing each task individually, similar tasks are grouped into task descriptors. From these, a task descriptor graph, with relationship descriptors forming the edges of the graph, may be represented. While even highly irregular computations often have structure, previous representations have chosen to restrict what can be easily represented, thus limiting full exploitation by the back end. Therefore, in this work, task and relationship descriptors have been endowed with instantiation functions (methods of descriptors that act as factories) so the front end may have a full range of expression when describing the task graph. The representation uses descriptors to express a full range of regular and irregular computations in a very flexible and compact manner.

The representation also allows for dynamic optimization and transformation, which assists ADOPAR in its goal of overcoming various forms of diversity. We have successfully implemented this representation using new compiler intrinsics, allowing ADOPAR schedulers to operate on the described task graph for parallel execution, and we demonstrate the low code size overhead and the necessity for native schedulers.
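The descriptor idea in the abstract above can be sketched roughly as follows (an illustrative toy, not the thesis's compiler-intrinsic representation): similar tasks collapse into one descriptor whose instantiation function expands concrete tasks on demand, and relationship descriptors expand per-instance dependences the same way. All class and field names here are invented.

```python
class TaskDescriptor:
    """One descriptor stands for many similar tasks; the instantiation
    function is a factory mapping an index to a concrete task."""
    def __init__(self, name, count, instantiate):
        self.name = name
        self.count = count              # number of task instances represented
        self.instantiate = instantiate  # factory: index -> concrete task

class RelationshipDescriptor:
    """One edge between descriptors; `dep` expands it into per-instance
    dependences, keeping irregular structure expressible yet compact."""
    def __init__(self, src, dst, dep):
        self.src, self.dst, self.dep = src, dst, dep  # dep: dst idx -> src idxs

# A 1000-wide map feeding a pairwise reduction, described in three objects
# instead of 1500 task nodes and 1000 edges:
work = TaskDescriptor("work", 1000, lambda i: ("work", i))
merge = TaskDescriptor("merge", 500, lambda i: ("merge", i))
edge = RelationshipDescriptor(work, merge, lambda i: [2 * i, 2 * i + 1])

# A runtime scheduler expands only what it needs, when it needs it:
assert edge.dep(3) == [6, 7]
assert merge.instantiate(3) == ("merge", 3)
```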

Issues in Hybrid Simulator Synthesis [abstract] (PDF)
Zhuo Ruan, Koy Rehme, and David A. Penry
Proceedings of the 4th Workshop on Architectural Research Prototyping (WARP), June 2009.

The Simulator Partitioning Research Infrastructure (SPRI) is a project to automate the generation of hybrid architectural simulators. In this paper, we examine the interesting issues and challenges in hybrid simulator synthesis.

Multicore Diversity: A Software Developer's Nightmare [abstract] (DOI, PDF)
David A. Penry
ACM SIGOPS Operating Systems Review (OSR), April 2009.

Commodity microprocessors with tens to hundreds of processor cores will require the widespread deployment of parallel programs. This deployment will be hindered by the architectural and environmental diversity introduced by multicore processors. To overcome diversity, the operating system must change its interactions with the program runtime and parallel runtime systems must be developed that can automatically adapt programs to the architecture and usage environment.

SPRI: Simulator Partitioning Research Infrastructure [abstract] (PDF)
Zhuo Ruan, Koy Rehme, and David A. Penry
Proceedings of the 3rd Workshop on Architectural Research Prototyping (WARP), June 2008.

Using FPGAs as architectural simulation accelerators has been widely discussed in the computer architecture community. We previously proposed a hybrid SW/HW simulation infrastructure named SPRI (Simulator Partitioning Research Infrastructure) which automatically partitions a general timing model into software and hardware portions for simulation speedup, conforming to a set-based partitioning specification. SPRI takes two main inputs, the partitioning specification and the architectural model, and produces a modified software architectural binary and a hardware-accelerated RTL description which communicate with each other: a hybrid SW/HW co-simulator, the final output of SPRI. Various experimental cases have also been run through the SPRI infrastructure to test its partitioning functionality and API wrapper generation.

An Infrastructure for HW/SW Partitioning and Synthesis of Architectural Simulators [abstract] (PDF)
David A. Penry, Zhuo Ruan, and Koy Rehme
Proceedings of the 2nd Workshop on Architectural Research Prototyping (WARP), June 2007.

Many researchers are interested in using FPGAs to accelerate architectural simulation. Partitioning of the simulator between hardware and software is an important problem which has not been explored because of the enormous effort required to develop different RTL and communication infrastructure for each potential partition. We are developing a hybrid HW/SW simulation infrastructure which will provide tools for partitioning architectural simulators and synthesizing RTL for the hardware portions. This infrastructure will allow the community to explore and understand the partitioning problem and will eventually lead to automated partitioning algorithms.

You Can't Parallelize Just Once: Managing Manycore Diversity [abstract] (PDF)
David A. Penry
Position paper for the Workshop on Manycore Computing at ICS'07, June 2007.

One of the greatest challenges for the use of manycore architectures will be the growing diversity of manycore systems. This diversity will come in many forms: architecture, goals, programming languages, pre-parallelization, and dynamism. We argue that the most manageable approach to such diversity is to delay optimization and parallelization until runtime.