Usability Heuristics

My reading this week included Nielsen’s Enhancing the Explanatory Power of Usability Heuristics. As usual, I’ll get my trivial beef out of the way up front.

First, the paper is downright painful to read. The English-as-a-second-language rule buys Nielsen back a few points here, but seriously:

Note that it would be insufficient to hand different groups of usability specialists different lists of heuristics and let them have a go at a sample interface: it would be impossible for the evaluators to wipe their minds of the additional usability knowledge they hopefully had, so each evaluator would in reality apply certain heuristics from the sets he or she was supposed not to use.

Sure, I’m nitpicking, but that sentence makes my inner ear bleed.

Before going any further, some orientation with respect to the aim of the paper is in order. Amid the multiple self-citations Nielsen makes right out of the gate (before the third word of the paper), he defines heuristic evaluation as

a ‘discount usability engineering’ method for evaluating user interfaces to find their usability problems. Basically, a set of evaluators inspects the interface with respect to a small set of fairly broad usability principles, which are referred to as ‘heuristics.’

(I’ll forgo my opinion that usability should be concerned with issues beyond just those in the interface itself…) A number of batteries of these usability heuristics have been developed by different authors, and in this paper Nielsen’s aim is to synthesize ‘a new set of usability heuristics that is as good as possible at explaining the usability problems that occur in real systems.’ In short, Nielsen compiles a master list of 101 heuristics from seven lists found in the literature. Armed with this master list, he examines 249 usability problems spanning different stages of development and types of interfaces. Each heuristic was graded on how well it explained each of the 249 problems, and a principal components analysis (PCA) of these grades revealed that no heuristic accounts for a large portion of the variability in the problems he examined.
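To make the setup concrete, here is a minimal sketch of the kind of analysis Nielsen describes, assuming a 249-by-101 matrix of explanation scores (one row per usability problem, one column per heuristic). The scores below are random stand-ins on an assumed 0-to-5 scale, not Nielsen’s data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Stand-in for Nielsen's grades: 249 problems scored against 101 heuristics.
    # The 0-5 scale is an assumption; the paper's quotes only pin down that a 3
    # means "explains a major part of the problem."
    scores = rng.integers(0, 6, size=(249, 101)).astype(float)

    pca = PCA().fit(scores)

    # Fraction of the variability in the problems explained by each component.
    print(pca.explained_variance_ratio_[:7])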

After his PCA, Nielsen groups individual heuristics into larger factors (essentially heuristic categories). In his opinion, seven of these categories warrant presentation; they are given here in decreasing order of PCA loading, as calculated by Nielsen:

  • Visibility of system status
  • Match between system and real world
  • User control and freedom
  • Consistency and standards
  • Error prevention
  • Recognition rather than recall
  • Flexibility and efficiency of use

His presentation of these factors and their component heuristics is troubling and confusing. First, the highest PCA loading of any of these factors is 6.1%. Not only is this an exceedingly small amount of explanatory power, it represents the aggregated contribution of 12 individual heuristics! Furthermore, the individual heuristic loadings themselves seem to be at odds. As an example, the heuristic ‘speak the user’s language’ taken from one source in the literature and the identically phrased heuristic taken from another source are given respective loadings of 0.78 and 0.67. Why do two identically phrased heuristics have different loadings? For that matter, why are two identically phrased heuristics present in the master list at all? This should, at the very least, be addressed by the author. Without some sort of explanation, I am wary of taking Nielsen’s PCA results seriously. Nielsen sweeps this under the rug, stating that ‘it was not possible to account for a reasonably large part of the variability in the usability problems with a small, manageable set of usability factors.’ (That, or some data preprocessing or an upgraded PCA gizmo was in order…)
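For what it’s worth, the duplicate-loading complaint can be sanity-checked: in a PCA, two exactly identical columns must receive identical loadings on every component with nonzero variance. So if two identically phrased heuristics load at 0.78 and 0.67, either their underlying score columns differed (the evaluators graded the ‘same’ heuristic differently depending on its source list) or something else went sideways. A toy demonstration on random stand-in data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    scores = rng.integers(0, 6, size=(249, 100)).astype(float)

    # Duplicate one column, as if the same heuristic appeared on two lists
    # and received identical grades both times.
    scores = np.column_stack([scores, scores[:, 0]])

    pca = PCA(n_components=5).fit(scores)

    # The original column and its duplicate load identically on every component.
    print(np.allclose(pca.components_[:, 0], pca.components_[:, -1]))  # True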

Nielsen states that 53 factors are needed to account for 90% of the variance in the usability problems in the dataset. I’m lost. The factors for which Nielsen did show the component heuristics had an average of 10 heuristics each. With only 101 total heuristics, how does one arrive at 53 factors (in addition to the others that account for the remaining 10% of variability)? Is Nielsen shuffling heuristics around into different factors to try and force something to work? To make matters worse, Nielsen states that ‘we have seen that perfection is impossible with a reasonably small set of heuristics’. No, you’re missing the point, Nielsen. Perfection is impossible even with a very large set of heuristics. At this point, I’m beginning to lose faith that this paper is going anywhere meaningful…
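The 90% figure itself is mechanical to reproduce; what’s opaque is how 53 orthogonal components map back onto named, heuristic-stuffed factors. For reference, the standard calculation (again on random stand-in data, so the number printed won’t match Nielsen’s 53):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    scores = rng.integers(0, 6, size=(249, 101)).astype(float)

    evr = PCA().fit(scores).explained_variance_ratio_

    # Smallest number of components whose cumulative explained variance hits 90%.
    n_needed = int(np.searchsorted(np.cumsum(evr), 0.90)) + 1
    print(n_needed)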

So, since perfection is impossible, Nielsen pivots to a new lens for the data: a head-to-head match of the individual lists of heuristics he gathered. Here, he ‘consider[s] a usability problem to be “explained” by a set of heuristics if it has achieved an explanation score of at least 3 (“explains a major part of the problem, but there are some aspects of the problem that are not explained”) from at least one of the heuristics in the set.’ Strange; I guess we are now ignoring Nielsen’s earlier statement that ‘the relative merits of the various lists can only be determined by a shoot-out type comparative test, which is beyond the scope of the present study’… Nevertheless, based on this approach, Nielsen gives the ten heuristics that explain all usability problems in the dataset and the ten that explain the serious usability problems in the dataset. With this new analysis in hand, and after jumping through several hoops (I’m not entirely clear on how Nielsen’s data were rearranged to make this new analysis work), he concludes that ‘it would seem that [these lists of ten heuristics indicate] the potential for the seven usability factors to form the backbone of an improved set of heuristics.’ Nielsen then states that two important factors are missing: error handling and aesthetic integrity… so we’ll add those to the list, too. In other words: even though my data don’t bear this out, I’m adding them because they’re important to me, dammit.
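At least the coverage criterion is precise enough to operationalize: a problem counts as explained by a set of heuristics if at least one heuristic in the set scores a 3 or better on it. Nielsen doesn’t spell out how the lists of ten were assembled; a greedy set-cover pass is one plausible reconstruction (my assumption, not his stated method), sketched on random stand-in data:

    import numpy as np

    rng = np.random.default_rng(3)
    scores = rng.integers(0, 6, size=(249, 101))  # problems x heuristics

    explained = scores >= 3  # the threshold Nielsen quotes for "explaining" a problem

    # Greedily pick heuristics, each time taking the one that explains the most
    # still-uncovered problems -- one plausible way to arrive at a top-ten list.
    chosen, covered = [], np.zeros(scores.shape[0], dtype=bool)
    for _ in range(10):
        gains = explained[~covered].sum(axis=0)
        best = int(gains.argmax())
        chosen.append(best)
        covered |= explained[:, best]

    print(chosen, f"{covered.mean():.0%} of problems covered")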

I’m utterly confused. How is it that one can take real data, slice and dice the analysis several ways, never really get the data to shape up and prove one’s point, and then act as if they do? Add to this the necessary hubris to come out and say, ‘Hey, even without the data to prove it, I’m stating that these are equally important factors’, and I’m left wholly unimpressed with this paper.