Conference Paper/Proceeding/Abstract

Using Runahead Execution to Hide Memory Latency in High Level Synthesis / Shane Fleming, David B. Thomas

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Swansea University Author: Shane Fleming

DOI (Published version): 10.1109/fccm.2017.33


Published in: 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
ISBN: 978-1-5386-4038-8, 978-1-5386-4037-1
Published: IEEE 2017
URI: https://cronfa.swan.ac.uk/Record/cronfa57993
Abstract: Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open source HLS tool. We create a theoretical model showing that speedup must be between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
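The 1x–2x bound the abstract mentions can be illustrated with a simple overlap model (this sketch is an assumption about the form of the bound, not the paper's exact derivation): without runahead, each iteration pays compute and memory cycles serially; with a perfect prefetcher, the memory latency overlaps the compute, so the iteration cost is whichever phase is longer.

```python
def speedup(compute_cycles: float, memory_cycles: float) -> float:
    """Hypothetical overlap model: serial baseline vs. perfect prefetch.

    Baseline executes compute and memory phases back to back; runahead
    hides the memory phase behind compute, so the per-iteration cost
    becomes the maximum of the two phases.
    """
    baseline = compute_cycles + memory_cycles
    runahead = max(compute_cycles, memory_cycles)
    return baseline / runahead

# The ratio (c + m) / max(c, m) is 1 when one phase dominates entirely
# and peaks at 2 when the two phases are perfectly balanced.
print(speedup(100, 0))    # 1.0 -> no memory stalls to hide
print(speedup(100, 100))  # 2.0 -> equal phases fully overlapped
print(speedup(100, 38))   # 1.38 (coincidentally the paper's average)
```

Under this model the reported 1.38x average sits comfortably inside the theoretical range, consistent with memory stalls that are a substantial but not dominant fraction of each iteration.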
College: College of Science
Funders: EPSRC