
Conference Paper/Proceeding/Abstract

Using Runahead Execution to Hide Memory Latency in High Level Synthesis

Shane Fleming, David B. Thomas

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Swansea University Author: Shane Fleming

DOI (Published version): 10.1109/fccm.2017.33

Abstract

Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. We also create a theoretical model showing that the speedup must lie between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.

Published in: 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
ISBN: 978-1-5386-4038-8 (print), 978-1-5386-4037-1 (electronic)
Published: IEEE, 3 July 2017
URI: https://cronfa.swan.ac.uk/Record/cronfa57993
Department: Science and Engineering - Faculty (College of Science, Computer Science)
Funders: EPSRC
Open access: Another institution paid the OA fee
Document: relish_fccm2017.pdf (Accepted Manuscript, PDF, English)
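The abstract's central idea, extracting the address computation of an irregular load as a program slice and running that slice ahead of the main loop so the data is already on hand when needed, can be sketched as a toy Python model. This is a sketch under stated assumptions, not LegUp's implementation: the names `kernel`, `address_slice`, `serial_cycles` and `runahead_cycles` are hypothetical, each iteration is assumed to incur exactly one miss of `mem` cycles on `data[idx[i]]` (reads of `idx` are treated as free), and off-chip memory is modeled as a single non-pipelined port.

```python
from typing import Iterator, Sequence


def kernel(data: Sequence[int], idx: Sequence[int]) -> int:
    """Original loop: an irregular, data-dependent read of data[idx[i]]."""
    total = 0
    for i in range(len(idx)):
        total += data[idx[i]]
    return total


def address_slice(idx: Sequence[int]) -> Iterator[int]:
    """Backward slice of the load data[idx[i]] with respect to its address.

    Only the loop counter and the index lookup survive; the accumulation
    into `total` never influences an address, so it is sliced away.
    """
    for i in range(len(idx)):
        yield idx[i]


def serial_cycles(n: int, mem: int, comp: int) -> int:
    """Baseline (assumed): every iteration blocks on its miss, then computes."""
    return n * (mem + comp)


def runahead_cycles(n: int, mem: int, comp: int) -> int:
    """Decoupled model (assumed): the slice issues one request per iteration
    through a single non-pipelined memory port, and the main loop consumes
    the returned data in order."""
    main_done = 0
    for k in range(n):
        data_ready = (k + 1) * mem          # k-th prefetch has returned
        start = max(main_done, data_ready)  # needs previous iteration and data
        main_done = start + comp
    return main_done


if __name__ == "__main__":
    data = [10, 20, 30, 40, 50]
    idx = [4, 0, 3, 1]
    # The slice produces exactly the addresses the kernel reads, in order.
    assert list(address_slice(idx)) == idx
    print("kernel result:", kernel(data, idx))
    n = 1000
    for mem, comp in [(10, 1), (10, 5), (10, 10), (5, 10), (1, 10)]:
        speedup = serial_cycles(n, mem, comp) / runahead_cycles(n, mem, comp)
        print(f"mem={mem:2d} comp={comp:2d} speedup={speedup:.3f}")
```

Under these assumptions the runahead time works out to n·max(mem, comp) + min(mem, comp), so the simulated speedup stays at or above 1x and below 2x, peaking when memory time and compute time are balanced.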
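One simple way to see where a speedup bound between 1x and 2x can come from is sketched below. This is an illustrative model assumed here, not necessarily the authors' model: suppose each loop iteration spends $t_m$ cycles waiting on memory and $t_c$ cycles computing; without prefetching the two serialize, and with an ideal runahead prefetcher they overlap completely.

```latex
S = \frac{t_m + t_c}{\max(t_m, t_c)}, \qquad
\max(t_m, t_c) \le t_m + t_c \le 2\max(t_m, t_c)
\;\Longrightarrow\; 1 \le S \le 2
```

In this model $S = 2$ only when $t_m = t_c$, and $S$ approaches 1 when either memory time or compute time dominates, since then there is little left to overlap.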