Using Runahead Execution to Hide Memory Latency in High Level Synthesis

Fleming, Shane; Thomas, David B.

doi:10.1109/fccm.2017.33

Conference Paper/Proceeding/Abstract

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Swansea University Author: Shane Fleming

DOI (Published version): 10.1109/fccm.2017.33

Abstract

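Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it's required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. In this work we create a theoretical model showing that speedup must be between 1x and 2x. We also evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.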

Published in:	2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
ISBN:	978-1-5386-4038-8 978-1-5386-4037-1
Published:	IEEE 2017
URI:	https://cronfa.swan.ac.uk/Record/cronfa57993

first_indexed	2021-09-20T20:15:45Z
last_indexed	2021-11-25T04:16:50Z
id	cronfa57993
recordtype	SURis
fullrecord	<?xml version="1.0"?><rfc1807><datestamp>2021-11-24T16:38:09.7105760</datestamp><bib-version>v2</bib-version><id>57993</id><entry>2021-09-20</entry><title>Using Runahead Execution to Hide Memory Latency in High Level Synthesis</title><swanseaauthors><author><sid>fe23ad3ebacc194b4f4c480fdde55b95</sid><firstname>Shane</firstname><surname>Fleming</surname><name>Shane Fleming</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2021-09-20</date><deptcode>MACS</deptcode><abstract>Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it's required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open source HLS tool. 
In this work we create a theoretical model showing that speedup must be between 1x and 2x, we also evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)</journal><volume/><journalNumber/><paginationStart/><paginationEnd/><publisher>IEEE</publisher><placeOfPublication/><isbnPrint>978-1-5386-4038-8</isbnPrint><isbnElectronic>978-1-5386-4037-1</isbnElectronic><issnPrint/><issnElectronic/><keywords/><publishedDay>3</publishedDay><publishedMonth>7</publishedMonth><publishedYear>2017</publishedYear><publishedDate>2017-07-03</publishedDate><doi>10.1109/fccm.2017.33</doi><url/><notes/><college>COLLEGE NANME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Another institution paid the OA fee</apcterm><funders>EPSRC</funders><lastEdited>2021-11-24T16:38:09.7105760</lastEdited><Created>2021-09-20T20:55:23.6695589</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Shane</firstname><surname>Fleming</surname><order>1</order></author><author><firstname>David B.</firstname><surname>Thomas</surname><order>2</order></author></authors><documents><document><filename>57993__20949__dd2996edcd07412fb4de12d9a8f41c31.pdf</filename><originalFilename>relish_fccm2017.pdf</originalFilename><uploaded>2021-09-20T21:14:33.0074104</uploaded><type>Output</type><contentLength>2847092</contentLength><contentType>application/pdf</contentType><version>Accepted 
Manuscript</version><cronfaStatus>true</cronfaStatus><copyrightCorrect>true</copyrightCorrect><language>eng</language></document></documents><OutputDurs/></rfc1807>
title	Using Runahead Execution to Hide Memory Latency in High Level Synthesis
author_id_str_mv	fe23ad3ebacc194b4f4c480fdde55b95
author_id_fullname_str_mv	fe23ad3ebacc194b4f4c480fdde55b95_***_Shane Fleming
author	Shane Fleming
author2	Shane Fleming David B. Thomas
format	Conference Paper/Proceeding/Abstract
container_title	2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
publishDate	2017
institution	Swansea University
isbn	978-1-5386-4038-8 978-1-5386-4037-1
doi_str_mv	10.1109/fccm.2017.33
publisher	IEEE
college_str	Faculty of Science and Engineering
hierarchy_top_id	facultyofscienceandengineering
hierarchy_top_title	Faculty of Science and Engineering
hierarchy_parent_id	facultyofscienceandengineering
hierarchy_parent_title	Faculty of Science and Engineering
department_str	School of Mathematics and Computer Science - Computer Science
document_store_str	1
active_str	0
description	Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it's required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. In this work we create a theoretical model showing that speedup must be between 1x and 2x. We also evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
published_date	2017-07-03T08:13:51Z

Similar Items