Pupillometry and the vigilance decrement: Task‐evoked but not baseline pupil measures reflect declining performance in visual vigilance tasks

Abstract Baseline and task‐evoked pupil measures are known to reflect the activity of the nervous system's central arousal mechanisms. With the increasing availability, affordability and flexibility of video‐based eye tracking hardware, these measures may one day find practical application in real‐time biobehavioural monitoring systems to assess performance or fitness for duty in tasks requiring vigilant attention. But real‐world vigilance tasks are predominantly visual in their nature and most research in this area has taken place in the auditory domain. Here, we explore the relationship between pupil size—both baseline and task‐evoked—and behavioural performance measures in two novel vigilance tasks requiring visual target detection: (1) a traditional vigilance task involving prolonged, continuous and uninterrupted performance (n = 28) and (2) a psychomotor vigilance task (n = 25). In both tasks, behavioural performance and task‐evoked pupil responses declined as time spent on task increased, corroborating previous reports in the literature of a vigilance decrement with a corresponding reduction in task‐evoked pupil measures. Also in line with previous findings, baseline pupil size did not show a consistent relationship with performance measures. Our data offer novel insights into the complex interplay of brain systems involved in vigilant attention and question the validity of the assumption that baseline (prestimulus) pupil size and task‐evoked (poststimulus) pupil measures reflect the tonic and phasic firing modes of the locus coeruleus.

paragraph there is considerable variability in how performance was measured, with some focusing primarily on RT measures, such as mean RTs (e.g., Smallwood et al., 2011, 2012

Task demands also vary considerably across experiments, with some requiring only simple target detection (e.g., Massar et al., 2016) and others requiring simultaneous (Beatty, 1982; van den Brink et al.,

The distinction between successive and simultaneous discrimination tasks was first made by Parasuraman (1979). Successive tasks are absolute judgement tasks where observers must compare the current sensory input with a template in working memory in order to determine whether a particular stimulus is, or is not, a critical signal. Simultaneous tasks, on the other hand, are comparative judgement tasks, where each stimulus contains all of the information required to determine whether it is (or is not) a signal. Due to the involvement of working memory, successive tasks are thought to be more resource demanding than simultaneous tasks.

2012), differences in visual attributes such as luminance, color and contrast may have contributed to pupillometric and behavioral variance (Barbur, Harlow, & Sahraie, 1992; Goldwater, 1972).

The increasing availability, affordability, and flexibility of video-based eye tracking hardware means that pupils' predictive power for vigilant attention may one day find practical application in passive, real-time biobehavioral monitoring systems to assess performance or fitness for duty. Such

Since Mackworth (1948, 1950), experimental vigilance tasks have generally aimed to simulate the conditions of real-world scenarios where monotonous repetitive tasks have become commonplace due to automation and industrial mechanisation. Though vigilance tasks can vary in many ways, the defining characteristic is that observers must remain alert and respond to critical signals presented against a background of noncritical signals over prolonged, unbroken stretches of time, usually at least 30 min (Frankmann & Adams, 1962; Parasuraman & Davies, 1976). Key differences between tasks known to influence performance are the sensory modality of stimulus presentation (e.g., auditory, visual), the psychophysical dimensions used to define critical signals (e.g., brightness, loudness), and whether the detection of targets requires successive or simultaneous discrimination (Parasuraman, 1979; 2008); but performance ultimately depends on complex interactions between factors relating to the task, the environment, and the individual (Ballard, 1996). To date, most vigilance tasks conducted with

been considerable focus on event-related pupil measurements. The present experiment therefore aimed to examine event-related pupil responses in a novel vigilance task with visual stimuli, whilst controlling appropriately for the effects of eye movements and luminance confounds.

Our task required participants to continuously monitor four centrally presented equiluminant visual stimuli for 30 min in order to detect and respond to brief targets occurring with temporal and spatial uncertainty against a high background event rate. A relatively high number of targets (6 per min) was used to ensure that a suitable amount of event-related data would be generated for the analysis (e.g., Mackie, 1987). We predicted that performance measures (e.g., RT, accuracy, d') across successive 10 min task blocks would show a classic vigilance decrement, as has been widely reported in the literature (Frankmann & Adams, 1962; Mackie, 1987; Mackworth, 1948, 1950; Wiener, 1987

(Baddeley & Colquhoun, 1969; Mackworth, 1948, 1950; Parasuraman & Davies, 1976).

Participants were asked to monitor four low-contrast gratings arranged squarely around a central fixation circle. The gratings rotated synchronously in a clockwise ticking motion at a rate of 120 ticks per minute (30° rotation per tick) and targets were defined as instances where one became briefly out of sync with the others (i.e., it missed a tick: see Figure 1). The task lasted for 30 min, during which time continuous monitoring was required. Six targets were presented every minute (180 overall) at pseudorandom intervals, subject to the following constraints: (1) the time between targets was at least 6 s and at most 30 s, and (2) targets did not occur within 2 s of the beginning or end of the task. Targets occurred equally often at all of the four locations, although this was randomized across the whole experiment so that spatial uncertainty as to the location of the target would contribute to task difficulty (Broadbent, 1958; Mackie,

Procedure.
On arrival at the lab, participants were told that for the next 30 min they would be required to complete a vigilance task that involved monitoring four circular patches rotating with a ticking motion at the center of the screen. It was explained that, from time to time, one of the patches would briefly become out of phase with the others, and that this was a target to which they had to respond. Participants were not given any further information about the frequency or temporal and spatial uncertainty of the targets. Once comfortable with the definition of a target, they were instructed that their task was to press the space bar every time they noticed such an event. Participants were forewarned that the task was monotonous, but were asked to try and respond as quickly and accurately as possible. They were also instructed to maintain central fixation on the screen. A 5-point calibration and validation routine was performed prior to starting the experiment.
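The timing and location constraints stated for the targets (180 targets in 30 min, gaps of 6 to 30 s, no target within 2 s of the start or end, equal use of the four locations) could be met in several ways. The sketch below is one possibility and is not taken from the original experiment code: the function names, the slack-sharing method and the rejection step are all assumptions made for illustration. It enforces only the overall target count, not exactly six targets within every individual minute.

```python
import random


def make_target_schedule(n_targets=180, task_s=1800.0, min_gap=6.0,
                         max_gap=30.0, edge_s=2.0, seed=None):
    """Return sorted target onset times (s) that satisfy the timing constraints.

    Hypothetical helper: every gap first receives its minimum length, then the
    remaining time (slack) is shared out at random.
    """
    rng = random.Random(seed)
    n_gaps = n_targets - 1
    # Time left over once every gap has been given its minimum length.
    slack = (task_s - 2 * edge_s) - n_gaps * min_gap
    if slack < 0:
        raise ValueError("constraints cannot be satisfied")
    while True:
        # Random shares of the slack: lead-in, each gap, and tail.
        weights = [rng.expovariate(1.0) for _ in range(n_gaps + 2)]
        total = sum(weights)
        extras = [slack * w / total for w in weights]
        gaps = [min_gap + e for e in extras[1:-1]]
        # Reject any draw that contains a gap longer than the maximum.
        if all(g <= max_gap for g in gaps):
            break
    times = [edge_s + extras[0]]
    for g in gaps:
        times.append(times[-1] + g)
    return times


def make_target_locations(n_targets=180, n_locations=4, seed=None):
    """Return a shuffled list of location indices, each used equally often."""
    if n_targets % n_locations:
        raise ValueError("n_targets must be a multiple of n_locations")
    rng = random.Random(seed)
    locations = list(range(n_locations)) * (n_targets // n_locations)
    rng.shuffle(locations)
    return locations
```

Calling `make_target_schedule(seed=1)` yields 180 onset times in seconds, and `make_target_locations(seed=1)` yields a matching list of grating indices (0 to 3). Fixing the seed makes a schedule reproducible across participants if that is desired.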

The average RT for all hits was 666 ms (SD = 156 ms). Average sensitivity and response bias across the whole experiment were 2.93 (SD = 0.67) and 1.07 (SD = 0.26), respectively, indicating that perceptual sensitivity to targets was good, but also that participants were generally biased to withhold responses to targets.

between Watch Period 2 and Watch Period 3 (p = .937). This suggests that participants became more conservative as the task progressed and were therefore more reluctant to report that a target was present.
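For readers unfamiliar with the signal detection indices reported here, sensitivity (d') and response bias (c) are conventionally computed from the z-transformed hit and false alarm rates, with positive c indicating a conservative criterion. The sketch below illustrates the standard formulas only. The function name is hypothetical, and the log-linear correction for extreme rates is an assumption, since the text does not state how rates of 0 or 1 were handled.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion_c) from raw outcome counts."""
    # Log-linear correction (an assumption, not stated in the paper): keeps
    # both rates away from 0 and 1, where the z-transform is infinite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    # Positive c means a conservative bias (reluctance to report a target).
    criterion_c = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion_c
```

With this convention, an observer who rarely false-alarms but also misses many targets obtains a positive c, which matches the interpretation of a bias toward withholding responses.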

As predicted, these performance data are consistent with the classic vigilance decrement.

Event-related pupil data were time-locked to button events for hits and false alarms and to stimulus events for misses and correct rejections. This was to ensure the comparability of pupil data that

beginning up to 500 ms before the motor act and peaking shortly afterwards. Permutation tests revealed significant modulation from baseline for hits and false alarms, as well as a significant difference between these two outcomes (lower-right panel of Figure 4). The differences between the two traces can be summarized as follows. For hits, there was an average pupil modulation of 2.04% and a peak modulation of 5.62% with a latency of 400 ms from the button press, whereas for false alarms these values were

Overall, these patterns in the pupil data are consistent with the prediction that the magnitude of task-evoked responses would mirror behavioral performance and decline as time-on-task increased.

Broadbent, 1953; Broadbent & Gregory, 1965; Buck, 1966; Mackworth, 1948, 1950; Davies, 1976, 1982; Warm et al., 2008). The signal detection measures, sensitivity (d') and response bias (c), were calculated to gain further insight into the cause of the declining percentage of hits. Given that the task involved making trivially easy judgements about suprathreshold stimuli, it is not surprising that sensitivity remained at ceiling throughout. However, there was a conservative shift in response bias, suggesting that the decline in accuracy was linked to the participants becoming less willing to report a detection, rather than a diminishing ability to discriminate targets from nontargets (Green & Swets, 1974). This is consistent with previous reports that the vigilance decrement in tasks with high event rates is more closely related to changes in the strictness of the decision criterion over time than to changes in perceptual sensitivity (e.g., Baddeley & Colquhoun, 1969; Broadbent, 1971; Colquhoun, 1961;

However, the larger pupil dilation for false alarms may also be associated with the higher degree of uncertainty that accompanies these events compared to correct detections (Yu & Dayan, 2005).

To examine the effects of time-on-task on pupil dynamics, scalar values of baseline and task-evoked pupil size were calculated for all stimulus and button events in each third of the task (Figure 5).
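As an illustration of how such scalar values might be derived for a single event, the sketch below takes the baseline as the mean pupil size in a short pre-event window and expresses the post-event trace as percent change from that baseline, from which mean modulation, peak modulation and peak latency follow. The function name, the window lengths (`base_s`, `post_s`) and the percent-change definition are assumptions for illustration; the exact parameters used in the study are not given in this section.

```python
def pupil_scalars(samples, event_idx, rate_hz, base_s=0.5, post_s=1.5):
    """Return (baseline, mean_pct, peak_pct, peak_latency_s) for one event.

    samples   : sequence of pupil sizes (arbitrary units), evenly sampled
    event_idx : index of the stimulus or button event within `samples`
    rate_hz   : sampling rate of the recording
    """
    n_base = int(round(base_s * rate_hz))
    n_post = int(round(post_s * rate_hz))
    if event_idx < n_base or event_idx + n_post > len(samples):
        raise ValueError("event too close to the edge of the recording")
    pre = samples[event_idx - n_base:event_idx]
    post = samples[event_idx:event_idx + n_post]
    # Baseline: mean pupil size in the pre-event window (assumed definition).
    baseline = sum(pre) / len(pre)
    # Task-evoked trace: percent change relative to the baseline.
    pct = [100.0 * (s - baseline) / baseline for s in post]
    mean_pct = sum(pct) / len(pct)
    peak_pct = max(pct)
    # Latency of the first sample that reaches the peak, in seconds.
    peak_latency_s = pct.index(peak_pct) / rate_hz
    return baseline, mean_pct, peak_pct, peak_latency_s
```

Averaging these per-event values within each third of the task, separately for each outcome type, would give summary values of the kind plotted against time-on-task.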

As with Beatty (1982), baseline pupil size for all outcomes was relatively unchanged across the duration

Key indicators would be whether participants experienced the task as being effortful and the extent to which they engaged in task-unrelated thought, but we did not obtain these data as it would have required

In sum, the present study replicated the well-known vigilance decrement: the reduction in detection performance that takes place during conditions of prolonged and continuous monitoring.

Mirroring this behavioral effect, task-evoked pupil responses declined across the duration of the task, but baseline pupil size was mostly unchanged, suggesting that the vigilance decrement may have been linked to gradual disengagement of attention as the task progressed, rather than a change in organismic arousal

displaying a millisecond counter set to '000'. The subject held the device and quickly pressed a button every time the counter began to increment, which happened at intervals ranging from 1 to 10 s. Upon detection of a response, the timer froze for 1.5 s, and the RT was saved before the timer reset to '000'. A variety of performance metrics can be derived from the data produced by this task, but analysis commonly focuses on mean and median RT, the fastest and slowest 10% of trials, and the proportion of 'lapses', which are usually defined as RTs greater than 500 ms.
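The PVT metrics listed above can be computed directly from a list of reaction times. The following is a minimal sketch, assuming RTs in milliseconds and the conventional 500 ms lapse threshold; the function name, the dictionary keys and the choice to summarize the fastest and slowest 10% of trials by their means are assumptions. The mean reciprocal RT is included because 1/RT is used as a performance measure in Experiment 2; expressing it in responses per second is also an assumption.

```python
from statistics import mean, median


def pvt_metrics(rts_ms, lapse_ms=500.0):
    """Summarize a list of PVT reaction times (ms) into common metrics."""
    rts = sorted(rts_ms)
    # Number of trials in each 10% tail (at least one trial).
    n_tail = max(1, round(0.10 * len(rts)))
    return {
        "mean_rt": mean(rts),
        "median_rt": median(rts),
        "fastest_10pct_mean": mean(rts[:n_tail]),
        "slowest_10pct_mean": mean(rts[-n_tail:]),
        # Proportion of trials slower than the lapse threshold.
        "lapse_proportion": sum(rt > lapse_ms for rt in rts) / len(rts),
        # Reciprocal RT in responses per second (assumed unit).
        "mean_reciprocal_rt": mean(1000.0 / rt for rt in rts),
    }
```

Passing a different `lapse_ms` allows the same function to be reused when a task calls for a non-standard lapse threshold.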

In Experiment 2 we sought to examine how pupil measures relate to PVT performance, but with a novel stimulus approach optimized for pupillometry. Most PVTs utilize the prototypical stimulus of a running millisecond timer that counts up from zero, but as noted by Thorne

To assess task performance we focused on 1/RT and lapse frequency, which are among the most sensitive measures of alertness in PVTs. Lapses in PVTs are traditionally defined as RT greater than 500 ms but due to our novel take on the task we defined lapses as RT greater

Block 3: F = 3.40, p < .012), which is consistent with the prediction that performance would decline as time-on-task increased. Post hoc analysis with Bonferroni adjustment showed that, in Trial Groups 1-3, 1/RT was significantly greater in Block 1 compared to Blocks 2 and 3 (all ps < .05) and that 1/RT in Trial Group 4 was significantly greater for Block 1 compared to Block 3 (p < .05). No other comparisons were significant (p > .05). Therefore, as indexed by 1/RT, performance was best overall in Block 1 compared to Block 2 and Block 3, but the magnitude of this effect decreased across Trial Groups.

Pupil data.
Grand-average button-locked pupil traces for each Block are shown in Figure 8. The pupil began to dilate slowly following the stimulus event and then rapidly after the button-press. In the 1500 ms following the button-press there was an average modulation of 5.22% and a peak latency of 880 ms. A conspicuous trough in the pupil traces after the button-press coincides with a transient but marked increase in the percentage of interpolated data. This artifact resembles the blink-induced pupillary response (e.g., Knapen et al., 2016) and is therefore indicative of task-correlated blinking (i.e., participants tended to blink after button presses). We did not correct this artifact with linear interpolation as it would involve altering too much data and excluding more trials.

rotation and responded with a button press as quickly as possible after the event. We adopted an atypical stimulus approach to avoid confounds associated with the canonical running counter stimulus (namely its variable intensity and the performance feedback that it provides), which could potentially contribute to variance in behavioral and pupillometric measures (Thorne et al., 2005). Participants completed three successive blocks of the task taking only a 1-min break in between, and changes in performance and pupil measures were explored both within and between blocks. We predicted that performance and pupil size would decrease as time-on-task increased, and that worse performance would be associated with smaller pupils at baseline.

The initial point to note is that our novel stimulus approach led to longer RTs than are typically observed in PVTs that use the canonical running counter stimulus. In these tasks, average RT for

Wilkinson & Houghton, 1982), whereas in the current PVT, also with subjectively alert participants, average RT was 420 ms. We attribute this to differences in stimulus intensity. The running counter stimulus is dynamic and constantly changing, providing a constantly refreshed cue for the participant to respond, whereas a change in the orientation of a low-contrast grating is more subtle and discrete, and issues no refreshing cue to respond.

As regards the pupil data, the pattern of within-block declining baseline pupil size broadly reflected the decline in task performance, corroborating findings from previous PVT studies (Massar et

across Trial Groups in Block 3, where performance was at its worst. In a similar fashion, the task-evoked pupil responses were largest at the beginning of Block 1, where performance was best, but were less consistent with respect to the performance data at other times. These patterns in the pupil data are in line with the general prediction that pupil size would decrease as time-on-task increased, but they run contrary to the prediction that worse performance would be reflected in smaller pupils at baseline.

Previous experiments offer conflicting evidence as to whether optimal task performance is associated with larger or smaller pupils at baseline (e.g., Kristjansson

Finally, we note that our novel take on the PVT limits the extent to which it can be directly compared to a more traditional PVT. The use of an alternative stimulus was desirable to avoid certain confounds, but the experiment also differed in terms of block length and ISI. In their general

Jones & Cohen, 2005), task-evoked pupil responses were generally more pronounced when performance was best. This trend was most consistent in Experiment 1, where the decline in detection performance was mirrored by a decline in the magnitude of task-evoked responses associated with hits, misses, and false alarms. In Experiment 2, the relationship between task-evoked responses and performance measures was less consistent, although the largest responses did occur when performance was best (i.e., at the beginning of Block 1). In general, these findings suggest that changes in task-evoked pupil responses may serve as an accurate indication of general task engagement, with a decline in their magnitude over time reflecting cognitive disengagement from the task and an increased likelihood of suboptimal performance.

Our baseline pupil measures did not show a consistent relationship with performance. In Experiment 1, baseline pupil size was mostly unchanged across three successive periods of watch, despite a marked decrement in performance. In Experiment 2, baseline pupil size showed an overall decline within each Block, although the slope became less pronounced with each successive Block. Interestingly, baseline pupil size was largest overall at the beginning of each Block, where task performance was best, suggesting that it reflects heightened arousal, alertness, and focused attention. But, by this account, our baseline measures in the PVT reflect combinations of autonomic tone as well as task-related factors, which means that they are not serving uniquely as a window of insight into the "tonic" mode of LC activation, as is often explicitly or implicitly assumed (see below). The lack of consistency in our baseline

1908) of optimum arousal, whereby the relationship between task performance and arousal is described by an inverted-U function, such that poor performance is associated with both under- and over-arousal, and optimum performance occurs at a "sweet spot" on the arousal curve.

We refrained from using the words "tonic" and "phasic" to describe our pupil measures because we are aware of numerous caveats to the assumption that baseline and task-evoked measures map neatly onto the different modes of LC output. Joshi and Gold (2020) discuss this issue in detail and emphasize that, in the context of LC activation, the terms "tonic" and "phasic" differentiate between distinct modes of activation, and not simply between baseline and transient activity (Aston-Jones & Cohen, 2005).

Further, the operational definition of "tonic" and "phasic" pupil measures varies substantially between publications. Also, the precise neural mechanisms of the relationship between pupil measures and LC activation are presently unclear and it is possible that a third variable, as yet not understood, may account for the observed pupil-LC link (Costa & Rudebeck, 2016).

In conclusion, the results of our two vigilance experiments support the general notion that changes in task-evoked pupil measures can be used to gain insight into monitoring performance in long and demanding tasks where the emphasis is on additive effects over a series of trials. But there is clearly a need for further research to determine the practical feasibility of utilizing pupil size as a psychophysiological marker of attentional lapses in real-time monitoring systems. Characterizing the precise relationship between different measures of behavioral performance, task-related factors and patterns of pupil behavior will be a crucial next step in this regard.