Critical

We tested a commercial in-car navigation system prototype against the NHTSA criteria for acceptance testing of in-vehicle electronic devices in order to see what types of in-car tasks fail the acceptance test and why. In addition, we studied the visual demands of the driving scenario recommended by NHTSA for task acceptance testing. In light of the results, the NHTSA guidelines and acceptance criteria need to be developed further. In particular, the visual demands of the driving scenario need to be standardized across different simulators in order to enable fair testing and comparable test results. We suggest the visual occlusion method both for finding a driving scenario whose visual demands correspond better to real-life driving and for standardizing the visual demands of the scenario when it is applied in different driving simulators. Furthermore, the acceptance criteria need to be re-evaluated. In particular, the applicability of the Total Eyes Off Road Time (TEORT) limit to a variety of test tasks needs to be validated, and exceptions for certain task types should be considered. The utility of the average glance duration criterion should be reconsidered.


INTRODUCTION
In 2013 the National Highway Traffic Safety Administration (NHTSA) in the US published the Visual-Manual NHTSA Driver Distraction Guidelines for In-Vehicle Electronic Devices [11] with the aim of decreasing the number of car crashes caused by drivers distracted by interacting with in-car electronic devices while driving. The guidelines consist of general interface recommendations and a collection of secondary task features that are considered to impede the safe operation of a car and thus should not be performable while driving. They further outline two test methods for determining the level of visual distraction induced by interactions with an in-car system, ultimately drawing a line on whether an in-car task is safe to perform while driving. The suggested methods for task acceptance testing are 1) Occlusion Testing or 2) Eye Glance Measurement Using Driving Simulator Testing [11]. The latter method, on which we focus here, uses metrics of eye glance durations away from the road while a participant executes a task with the in-car system and drives in a simulator. Test participants are asked to follow a lead vehicle on a highway and maintain a headway of around 70 m to it. The in-car test tasks are performed while on the move. The following criteria need to be fulfilled by an in-car task to pass the acceptance test ([11], p. 272-273):
1. "For at least 21 of the 24 test participants, no more than 15 percent (rounded up) of the total number of eye glances away from the forward road scene have duration of greater than 2.0 seconds while performing the testable task one time.
2. For at least 21 of the 24 test participants, the mean duration of all eye glances away from the forward road scene is less than or equal to 2.0 seconds while performing the testable task one time.
3. For at least 21 of the 24 test participants, the sum of the durations of each individual participant's eye glances away from the forward road scene is less or equal to 12.0 seconds while performing the testable task one time."
These criteria have been applied in research studies to assess distraction effects induced by in-car infotainment systems (e.g., [2], [7]). However, so far there have been only a few analytical evaluations of how appropriate the NHTSA acceptance criteria are, or of the factors that influence whether a task passes the criteria. Furthermore, there is no solid theoretical background for the suggested driving scenario for task acceptance tests, and no evidence that failing the proposed testing criteria would in fact correspond to risky visual behaviors in this type of scenario. In other words, the visual demands of the suggested driving scenario are unknown. According to Wierwille's visual sampling model [17], drivers try to keep off-road glance durations between 500 and 1600 milliseconds in almost all driving situations in real traffic. There is no evidence to date that this also holds for the NHTSA driving scenario.
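As an illustration only, the three quoted acceptance criteria can be sketched as a small check over per-participant off-road glance durations. The function names are hypothetical, and the reading of "15 percent (rounded up)" as an allowed count of ceil(0.15 x number of glances) is our assumption, not a statement of how NHTSA or the authors implemented the test.

```python
def participant_passes(glances_s):
    """Per-criterion pass flags for one participant on one task.

    glances_s: durations (seconds) of all glances away from the forward road scene.
    Returns (criterion1, criterion2, criterion3).
    """
    n = len(glances_s)
    if n == 0:
        # No off-road glances at all: nothing can violate the limits.
        return (True, True, True)
    long_glances = sum(1 for g in glances_s if g > 2.0)
    # Assumption: "15 percent (rounded up)" means ceil(0.15 * n) glances are allowed.
    allowed_long = (15 * n + 99) // 100  # integer ceiling of 15n/100
    c1 = long_glances <= allowed_long
    c2 = (sum(glances_s) / n) <= 2.0     # mean glance duration
    c3 = sum(glances_s) <= 12.0          # total eyes-off-road time
    return (c1, c2, c3)


def task_passes(glances_by_participant, required=21):
    """Per-criterion pass flags for a task over a whole sample.

    glances_by_participant: one list of glance durations per participant.
    required: participants that must meet each criterion (21 of 24 in the quote).
    """
    flags = [participant_passes(g) for g in glances_by_participant]
    return tuple(sum(f[i] for f in flags) >= required for i in range(3))
```

A task is accepted only if all three returned flags are true; keeping them separate makes it visible which criterion caused a failure.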
Visual occlusion is an established method for assessing the visual demands of driving [14] [16]. Following [14] and [20], the visual demand of driving can be defined as the frequency at which the driver has to update the focus of visual attention in order to reduce uncertainty about task-critical event states (e.g., speed, lane position) in the immediate field of view to a preferred level. In the visual occlusion technique the driver's visual field (the driving scene) is intermittently occluded by means of a visor, goggles or blanked screens at system- or driver-paced intervals in order to estimate the visual demands of driving. In driver-paced occlusion the occlusion time (OT) can be taken as equal to the longest time that a driver would choose to drive comfortably with eyes off the road when fully concentrating on the driving task. The 85th percentile OT of a driver sample can then be interpreted as a limit of acceptable behavior, following the logic of a common design standard in traffic engineering.
This paper focuses on the following research questions:
- Do the studied in-car navigation system functionalities pass the NHTSA recommended criteria?
- What are the reasons in-car tasks do not meet the criteria?
- What are the individually preferred limits for off-road glance durations in the NHTSA recommended driving scenario, measured as the 85th percentile occlusion times when concentrating fully on the driving task?
- Are these in line with the NHTSA recommended in-car glance duration distributions and acceptance criteria?
- Is the NHTSA driving scenario sensitive to small changes in the driving task?
In order to answer these questions, we evaluated a commercial navigation system against the NHTSA criteria to see which in-car tasks would pass, and we studied the visual demands of the NHTSA driving scenario. To study the sensitivity of the NHTSA scenario's visual demands, we compared them to those of a slightly modified scenario with no lead car.
Regarding the analyses of visual demands of the NHTSA scenario and the acceptance criteria 1 and 2, we tested the following hypotheses:
H1. The 85th percentile of the 85th percentile OTs is near 2.0 seconds (criterion 1).
H2. The 85th percentile of the ratio of over 2-second OTs is near 15 percent (criterion 1).
H3. The 85th percentile of the mean OTs is near 2.0 seconds (criterion 2).
H4. The OT distributions differ significantly between the NHTSA scenarios with and without the lead car.
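The three sample-level quantities named in H1 to H3 can be sketched as follows. This is a minimal illustration with hypothetical function names. The linear-interpolation percentile is an assumption, because the text does not state which percentile definition was used, and other definitions give slightly different values on small samples.

```python
import math


def percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between order statistics.

    Assumption: the paper does not say which percentile definition it used.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty data")
    k = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def occlusion_summary(ots_by_driver, limit_s=2.0):
    """Sample-level statistics corresponding to H1, H2 and H3.

    ots_by_driver: mapping from driver id to that driver's occlusion times (s).
    """
    p85s, ratios, means = [], [], []
    for ots in ots_by_driver.values():
        p85s.append(percentile(ots, 85))                            # per-driver 85th percentile OT
        ratios.append(sum(1 for t in ots if t > limit_s) / len(ots))  # share of OTs over the limit
        means.append(sum(ots) / len(ots))                           # per-driver mean OT
    return {
        "p85_of_p85_ot": percentile(p85s, 85),        # H1: compare with 2.0 s
        "p85_of_ratio_over_limit": percentile(ratios, 85),  # H2: compare with 0.15
        "p85_of_mean_ot": percentile(means, 85),      # H3: compare with 2.0 s
    }
```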

METHOD
Design
The study followed the NHTSA (2013) task acceptance testing guidelines [11] as accurately as possible within the technical limitations caused by the driving simulator in use. The acceptance testing of the in-car tasks (10) followed a within-subject 10 x 1 design. For the analyses of the visual demands of the NHTSA scenario a within-subject 2 x 1 design was used (driving with and without a lead car).

Participants
A total of 26 participants took part in the study. Thirteen of the participants were female and thirteen male. They were all in good general health. Participants were between 20 and 62 years old, with an average age of 39.9 years (SD=15.2 years), and all of them had previous experience of cell phone use while driving. Six test participants were 18 to 24 years old, six 25 to 39 years old, six 40 to 54 years old, and eight were 55 years old or older. Two participants over 55 years experienced simulator sickness, and their data were removed from the test data. One had to cancel at the beginning of the tests, whereas the other was able to complete the occlusion trials before interrupting the test. All the participants had a valid driving license and drove more than 5,000 kilometers (over 3,000 miles) per year. They were all unfamiliar with the prototype navigation system under testing. Participants were recruited by sending invitations through public university and company e-mail lists. Each participant was rewarded with a movie ticket for taking part in the study.

Apparatus
The tests were conducted at the Driving Simulator Laboratory of the Department of Computer Science and Information Systems at the University of Jyväskylä, Finland. The fixed-base medium-fidelity driving simulator consists of parts of a real vehicle cab (Figure 1).

Figure 1. Experimental setup (navigation system UI blurred because of confidentiality issues)
The driving environment is projected on three screens. Because of the GPS-simulation we needed to use a simulation of a real traffic environment. The part of the highway used for testing, E12 Hämeenlinnanväylä, is not perfectly straight but has a very mild curvature typical to a highway.
The prototype of the commercial navigation system software was displayed as a 10-inch head unit screen in the upper left corner of a 22-inch 3M capacitive multi-touch display. A Dikablis head-mounted eye-tracking device with a 50 Hz sampling rate was used for gathering eye movement data. Two Dell laptops were needed for capturing eye-tracking data and for controlling the experimental setup.

Procedure
Before the experiment started, each participant had to sign a non-disclosure agreement and was informed about the purpose and setup of the study. They were then taken to the driving simulator, where the steering wheel and pedals were adjusted for the participant. Participants practiced driving in the simulator using a city environment. Once they felt comfortable driving, the eye-tracking system was put on and calibrated.
Participants were introduced to the navigation system and shown how to perform each of the tasks to be tested. They could then practice the tasks until they felt comfortable doing them, both without driving and while driving. An overview of the tested tasks is given in Table 1. Tasks were grouped into 5 trials, with each trial consisting of 1-3 tasks. Note: Before participants started driving they were asked to set the destination to "Klaukkalantie 10" and add the shop "Suomussalmen Kiriakauppa" as a waypoint to the route.
[Table 1 (overview of the tested tasks): table body not recovered. Surviving fragment: Add the hotel "Easydays/Lomamökit" as a waypoint to the current route and start guidance. Tap on Start navigation]
Furthermore, the participants received instructions on the driving task, which was to be prioritized over the secondary task. They were asked to drive in the right-hand lane and keep a constant headway distance of 70 meters to the lead vehicle, which drove at a speed of 80 km/h (50 mph). Participants could then practice driving the highway scenario.
The order of the trials was mixed and counterbalanced. Each trial took about 3.5 minutes. A task was not tested with a further participant if it had already failed a criterion or if there were already enough successful performances (21/24) to pass all the criteria.
After the NHTSA task acceptance trials, participants went on to two visual occlusion trials, driving occluded in the NHTSA scenario as well as in a modified NHTSA scenario without the lead car. The order of the trials was counterbalanced across the sample. In the occlusion trials, the screens were occluded (black) by default, and the participant could unocclude the driving scene for 500 ms with each press of the right-side paddle shifter on the steering wheel. The goals in the two trials were to stay in lane and to keep either a constant 70-meter headway distance to a lead car driving at a constant 80 km/h (50 mph), in the NHTSA scenario, or a constant speed of 80 km/h (50 mph), in the scenario without the lead car.
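The driver-paced occlusion procedure described above suggests a simple way of deriving occlusion times from logged paddle presses. The following is a sketch under stated assumptions, not the script used in the study: the function name and log format are hypothetical, a press made while the scene is still visible is assumed to restart the 500 ms window, and the black period before the first press is ignored.

```python
def occlusion_times(press_times_s, visible_s=0.5):
    """Derive occlusion times (seconds) from paddle-press timestamps.

    press_times_s: times (s) at which the driver requested a view of the scene.
    visible_s: length of each unoccluded window (500 ms in the described setup).

    Assumptions (not from the paper): a press during a visible window restarts
    the window, and the occluded period before the first press is not counted.
    """
    ots = []
    visible_until = None  # time at which the current visible window ends
    for t in sorted(press_times_s):
        if visible_until is not None and t > visible_until:
            # The screen was black from visible_until until this press.
            ots.append(t - visible_until)
        visible_until = t + visible_s
    return ots
```

For example, presses at 0.0, 2.0, 2.25 and 5.0 s give two occluded periods: one from 0.5 to 2.0 s and one from 2.75 to 5.0 s.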

Analysis
Due to tracking inaccuracies, and because the pupil was often lost during eye movements from the driving scene to the in-car display and back, the automatically scored glance durations appeared to be systematically shorter than manually scored durations. Thus, a human data reducer determined the off-road glance durations from the overlaid gaze video data, following the SAE-J2396 standard [15].
Occlusion time (OT), which is the time that the driver drove with vision occluded, was used to estimate the visual demand. The OTs were calculated with a script from the driving simulator log data. A longer occlusion time clearly corresponds to a lower perceived visual demand, even though the exact functional form is not known. Even without knowing the exact form, it is possible to compare the OT distributions of the two scenarios to determine which scenario has the higher visual demand. OT distributions are typically heavy-tailed, and thus Gaussian statistics cannot be properly used. In these cases statistical analyses were performed with a Wilcoxon rank sum test with continuity correction for the median OTs and a Welch two-sample t-test on the logarithm of the OT. An alpha level of .01 was used in the statistical testing.
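To make the second of these tests concrete, the sketch below computes the Welch two-sample t statistic and its Welch-Satterthwaite degrees of freedom on log-transformed OTs using only the Python standard library. The function name is hypothetical and this is not the authors' analysis script. It stops short of a p-value, which in practice would come from a statistics library (for example SciPy's `ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t_on_log(ots_a, ots_b):
    """Welch two-sample t statistic and degrees of freedom on log(OT).

    ots_a, ots_b: occlusion times (s) from the two scenarios; all values must
    be positive and each sample needs at least two observations.
    Returns (t, df).
    """
    la = [math.log(x) for x in ots_a]
    lb = [math.log(x) for x in ots_b]
    # Squared standard errors of the two sample means (unequal variances allowed).
    se2_a = variance(la) / len(la)
    se2_b = variance(lb) / len(lb)
    t = (mean(la) - mean(lb)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(la) - 1) + se2_b ** 2 / (len(lb) - 1)
    )
    return t, df
```

The log transform pulls in the heavy right tail of the OT distribution, which is why a t-test on log(OT) is more defensible than one on the raw values.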

Acceptance test results for the in-car navigation system
Mean individual glance duration, mean percentage of over 2-second glances and mean Total Eyes Off Road Time (TEORT) are shown in Table 2. No off-road glances were directed anywhere other than the head unit (e.g., there were no glances at mirrors), so in-car glances equal eyes-off-road glances here. Table 3 gives an overview of the number of participants failing the criteria. The mean individual glance duration criterion was passed for all the tasks. The percentage of over 2-second glances was above the 15% limit for 1 out of 24 participants for 4 tasks.
More than 3 out of 24 participants exceeded the 12 seconds TEORT limit for the tasks "Find road number" (5), "Add a hotel to your route" (14) and "Get guidance to a restaurant without using toll roads" (8), thus failing the acceptance test.

Visual Demands of the NHTSA Scenario
As expected, the OT distributions were found to be heavy-tailed (Figure 2), and thus Gaussian statistics cannot be properly used. Robust estimators (median, 85th percentile) can, however, be determined (Table 5).
Considering H1, the data indicate that the 85th percentile of the 85th percentile OTs in the NHTSA scenario is 2.54 seconds, just over 500 milliseconds above the bar of 2.0 seconds (Table 5 and Figure 3). This difference is small, but the 85th percentile of the ratio of over 2-second OTs in the NHTSA scenario is unacceptably high (46%) compared to the bar of 15 percent (H2, Table 5, Figure 4). One participant's value looks like an outlier, but there is no clear reason to exclude it. Several other participants also have values over 15%, as can be seen in Figure 4. H3 gets direct support from the occlusion data: the 85th percentile of the mean OTs for the NHTSA scenario was exactly 2.0 seconds (see Figure 5).
For the NHTSA scenario with the lead car the mean speed was 79.6 km/h (SD=4.7), whereas for the scenario with no lead car it was 79.9 km/h (SD=4.0). The car speeds are symmetrically distributed, and Gaussian statistics can be used directly (Figure 4). The means differ in a statistically significant way, but in practice the effect is too small to be meaningful: a two-sample

Figure 7. Vehicle speed distributions in the NHTSA scenario (top) and in the modified NHTSA scenario without the lead car (bottom), m/s (22.22 m/s ~ 80 km/h ~ 50 mph)
The higher variance of speed with the lead car could be a result of the task instructions. In the NHTSA scenario the goal is to keep a steady 70 m distance to the lead car driving at 80 km/h, whereas in our modified scenario without the lead car the goal was to keep a steady speed of 80 km/h (in addition to lane-keeping in both scenarios). In the distance-keeping task the goal of a 70 m headway is vaguer, and thus more uncertainty could probably be tolerated than in the speed maintenance task, in which the speedometer needle told the driver the exact speed. The greater visual demands of the scenario without the lead car could be explained by the requirement to shift gaze between the vanishing point of the road, for anticipatory look-ahead fixations [8], and the speedometer, which is necessary for speed control in the fixed-base simulator. Anticipatory information gathering and speed control (in the form of headway distance control) are visually easier to manage when the driver is able to keep the fixations on a lead car moving at a constant speed. Uncertainty about the speed and about the path of travel can be reduced simultaneously by keeping the fixations on the tail of a lead car.

DISCUSSION
In this experiment, we evaluated ten in-car tasks on a commercial in-car navigation system prototype following the NHTSA task acceptance testing of eye glance measurements using a driving simulator [11], in order to see what types of in-car tasks fail the test and why. We further studied the visual demands of the driving scenario recommended by NHTSA for acceptance testing.
Out of the ten tested tasks, seven easily passed the acceptance criteria, but three exceeded the Total Eyes Off Road Time (TEORT) limit of 12 seconds. These included the tasks "Add a hotel to your route" and "Get guidance to a restaurant without using toll roads", which indeed require a high number of interaction steps. We recommend reducing the number of steps needed to perform these tasks in order to reduce the TEORT. Surprisingly, the task "Find road number" also failed, although it did not require any manual interaction with the system; the driver only needed to identify the road number from the map. However, the road number was not continuously shown on the map, so it was a matter of chance whether a participant happened to look at the display at the moment the road number was present. The test result could be taken to imply that this information should be presented continuously. However, this would result in visually cluttered maps, considering that there is other relevant information in addition to the road number that would need to be presented permanently. Instead, it seems that the TEORT limit simply does not work very well for certain types of in-car tasks.
Our test tasks 4.1 and 5 had the search targets listed early in the search results. However, there are search tasks, such as looking for a nearby restaurant or a gas station, that can produce a long list of search results. From the TEORT perspective it matters a great deal where the target is located in the result list. If the search target is put on the first page of the list, the 12-second TEORT criterion could easily be passed. However, how likely is it that the target is always on the first page? Does this mean we should test all possible target positions against TEORT? Similarly, there are real-world tasks for which the end of the task cannot be defined exactly, e.g., exploring the content of a music player. Already today a driver can easily browse internet radio stations for more than 12 or 20 seconds just to find a preferred one among the numerous possibilities available. These types of in-car services, which are available in many infotainment systems today, need to be tested for distraction as well, and should have user interfaces optimized to minimize the associated risky overlong glances.
Given these limitations, it can be argued that the types of in-car tasks to which TEORT is applicable need to be clearly defined, as it is obviously not suitable for every task. If applied to the example browsing tasks above, following the TEORT recommendations would ultimately mean hiding a lot of information, e.g., about nearby services, that is much valued by drivers. In addition, it is unclear how well TEORT applies to the multi-step and extended interactions that characterize activities involving voice-command interfaces with visual feedback elements [13]. The good point of TEORT for search tasks is that it forces designers to optimize the search interfaces (shortcuts, sorting, etc.) so that the search target will not be at the bottom of the list, unless there is no target (e.g., when browsing music).
From the risk association and visual demand perspective, the percentage of over 2-second glances criterion seems more valuable than the TEORT limit, and should be decisive for these kinds of tasks. Yet this criterion is only meaningful if the visual demands of the driving scenario are adjusted to correspond to the real visual demands of driving. This is why we analyzed the visual demands of the NHTSA driving scenario by testing four hypotheses.
Regarding hypotheses H1 and H2, our analyses of the visual demands of the NHTSA scenario (at least as realized in our simulator) indicated that the driving task is very low in its visual demands. 85 percent of the participants felt comfortable driving blind for more than 2 seconds at a time in about 46% of the occluded periods. Compared to earlier studies on the visual demands of driving [14][17], the NHTSA scenario does not imitate the visual demands of real driving very well. After all, who would try to keep a constant 70-meter distance to a lead car while staying in lane in real life without observing anything else in the environment? The 85th percentile of the 85th percentile occlusion times was as high as 2.54 seconds. Does this mean the acceptance limit should be 2.5 seconds instead of 2 seconds in this driving scenario?
From the safety point of view it is of course rational to keep the criterion somewhat under the driver-accepted OTs, but here the 85th percentiles of the over 2-second OT ratios were certainly too high compared to real driving. One reason for the low visual demands could be that there is no risk of collision or of unexpected events as in more realistic driving scenarios. Keeping the gaze on the lead car is all that is required in order to keep the distance and the lane position for experienced drivers [18], who are to be recruited according to the NHTSA guidelines.
The data give direct support for H3. The 85th percentile of the mean OTs for the NHTSA scenario was exactly 2.0 seconds. However, in light of our test results, the results of [4], and Wierwille's visual sampling model for real driving [17], this acceptance criterion does not make much sense. The average in-car glance durations for the test tasks were all below 1.0 seconds and in line with Wierwille's model. It is hard to imagine an in-car task that would fail this limit, and thus the criterion can be considered useless. The fact that the criterion was supported by the visual demand data actually tells us again that the visual demands of the NHTSA driving scenario are too low.
According to the literature, there is a rationale behind the NHTSA 2-second limit for a risky off-road glance. Field studies have indicated a statistical relationship between individual 2-second off-road glance durations and the risk of safety-critical incidents [9]. According to Wierwille [17], drivers try to keep in-car glance durations between 500 and 1600 milliseconds in almost all driving situations. But should we not make sure that the driving scenario in any given driving simulator used for task acceptance testing corresponds to a real-world scenario in which 2 seconds is near the 85th percentile for OT?
H4 was supported by the data. The modified NHTSA scenario without the lead car seemed to have somewhat greater visual demands. The demands of this scenario were closer to the NHTSA criterion (85th percentile of the over 2 s OT ratio: 38%; 85th percentile of the 85th percentile OTs: 2.14 s). One reason for the lower visual demands of the NHTSA scenario might be that keeping the gaze on the lead car is all that is required in order to control the headway distance and the lane position. Without the lead car the driver has to alternate fixations between the road ahead, for anticipatory information gathering [8], and the speedometer, for speed control in a fixed-base simulator. However, the visual demands of the scenario without the lead car would be much more sensitive to simulator-dependent factors such as the location of the speedometer and the motion feedback of the simulator. With a motion platform, the need to observe the speedometer in order to keep a constant speed can be significantly lower than in a fixed-base simulator. A question for further research is whether the visual demands would have been reversed if the drivers had simply been told that the speed limit on the simulated highway is 80 km/h, with no explicit instructions to monitor the speed and attempt to maintain it accurately.
Based on these findings, one can ask whether the visual demands of the NHTSA driving scenario, and the test results, would be comparable if the tests were run in different simulators. There were small differences between our driving scenario and the NHTSA scenario: most importantly, the lane markings were white, not yellow, and the road had a slight curvature instead of being perfectly straight. However, it seems unlikely that these differences could have a major impact on the visual demands of the driving scenario. If anything, these differences should raise the demands, not lower them. In any case, we strongly recommend the use of the visual occlusion method for controlling the visual demands of the NHTSA testing scenario across different simulator platforms.
A careful look at the visual demand data reveals significant individual differences in the experienced levels of visual demand. These could be due to personal factors such as uncertainty tolerance [4] or the skill level of the driver [18]. The use of the common 85th percentile in the acceptance criteria shows a good sense of realism about the data. Medians and percentiles are very robust estimators whenever the data are heavy-tailed and noisy and sample sizes are small, which is the case here. Again, the use of the mean in-car glance duration as an acceptance criterion can be seriously questioned.
The NHTSA guidelines are not perfect, but they are the best we have at the moment for controlling the explosion of bad, even lethal, in-car user interfaces. Without this kind of guidance, even if imperfect, we could really talk about "killer apps". In the current study, the criteria were able to differentiate the most visually demanding tasks from the tasks with low visual demands (with one false exception). Thus the testing procedure and criteria allow in-car system designers to iteratively test and validate their designs and ultimately to create less demanding interfaces and interactions. Changes to the UI have already been made based on the results.
Given the visual demands of the driving scenario in our simulator, the criterion was strict regarding the percentage of long in-car glances, which was a good thing. Interestingly, however, all the in-car tasks passed this criterion although there were individual percentages exceeding 15 percent. In some simulator implementations, however, the visual demands of the NHTSA driving scenario could be much higher. In the worst case this would mean that visually highly demanding in-car tasks could pass the percentage of over 2-second glances criterion due to unrealistically high visual demands of driving. In general, drivers try to adapt their in-car glance durations to the prevailing visual demands of the driving task [17].
AAM [1] as well as NHTSA [11] have used a manual radio tuning task as the baseline reference task for determining a societally acceptable level of distraction. This type of task could help to calibrate the test results for different simulators. However, NHTSA does not specify exactly what radio and what kind of task should be used in the baseline tests. Even if the radio and the baseline reference task were accurately specified, this would not solve the problem of possibly varying visual demands of the driving scenario in different driving simulators. The radio tuning task does not give a reliable baseline for the visual demands of the driving scenario against which inappropriate off-road glance durations could be judged. A more theoretically sound basis for the criteria is needed. Drivers' self-accepted 85th percentile OTs when fully focusing on driving could be used as a limit for acceptable off-road glancing behavior. In-car glance durations exceeding the self-accepted OTs would then indicate a real lapse of control. This approach would provide a means of accounting for individual differences between test samples. Significant differences in test participants' visual sampling skills are highly probable, especially in samples of elderly drivers [19].
In our study we also noticed that many eye-tracking systems give unreliable automatic scores for off-road glance durations when Area-Of-Interest analyses are used. Due to inaccuracy, and because the pupil is often lost during eye movements from the driving scene to the in-car display and back, the automatically scored glance durations seemed to be systematically shorter than the manually scored durations. Despite the increased work effort, we suggest using manual scoring of the off-road glances following the SAE-J2396 [15] definitions in order to get reliable and comparable glance durations (see also [13]). Automatically scored glance duration data should in any case be carefully inspected manually. As an important end note, the visual occlusion acceptance test alternative outlined in the NHTSA guidelines has little to do with the visual occlusion method suggested here, and can be strongly criticized for the informativeness and validity of its metrics [6][10][12].

CONCLUSION
In light of the findings, the NHTSA guidelines and acceptance criteria should be developed further. In particular, the visual demands of the driving scenario need to be standardized across different simulators in order to enable fair testing and comparable test results. We suggest the visual occlusion method [14] both for finding a driving scenario whose visual demands correspond better to real-life driving and for standardizing the visual demands of the scenario when it is applied in different driving simulators. In addition, alternative, less expensive scenarios with fewer irrelevant details could be used in acceptance testing as long as their visual demands correspond to those of the NHTSA scenario.
Furthermore, the acceptance criteria need to be re-evaluated, taking into account all the possible modern in-car activities. In particular, the applicability of the TEORT limit to a variety of test tasks needs to be validated, and exceptions for certain task types should be considered. The utility of the average glance duration criterion should be reconsidered. Specifications on test task design for acceptance testing would help to prevent test results from being steered in a desired direction through task design.