Playability Heuristics

Next up in the reading stack is Playability Heuristics for Mobile Games.

Stemming from the literature on usability heuristics, the authors (Korhonen and Koivisto) develop a set of playability heuristics for mobile games. In the paper, they present their motivations for developing these heuristics, the heuristics themselves, and the ‘results’ of their ‘validation’ of these characteristics.

Their heuristics are grouped into three categories: gameplay, mobility, and game usability. Their initial list of heuristics was made up of the following:

  • Don’t waste the player’s time
  • Prepare for interruptions
  • Take other persons into account
  • Follow standard conventions
  • Provide gameplay help
  • Differentiation between device UI and the game UI should be evident
  • Use terms that are familiar to the player
  • Status of the characters and the game should be clearly visible
  • The player should have clear goals
  • Support a wide range of players and playing styles
  • Don’t encourage repetitive and boring tasks

In order to validate these heuristics, six evaluators applied them to a selected application and noted every playability problem they found, whether or not it was covered by the list of heuristics. The evaluators found 61 playability problems, but 16 of these were not adequately described by any of their heuristics. Thus, the authors expanded their initial set into three sublists (one for each ‘category’):

  • Game Usability
    • Audio-visual representation supports the game
    • Screen layout is efficient and visually pleasing
    • Device UI and game UI are used for their own purposes
    • Indicators are visible
    • The player understands the terminology
    • Navigation is consistent, logical, and minimalist
    • Control keys are consistent and follow standard conventions
    • Game controls are convenient and flexible
    • The game gives feedback on the player’s actions
    • The player cannot make irreversible errors
    • The player does not have to memorize things unnecessarily
    • The game contains help
  • Mobility
    • The game and play sessions can be started quickly
    • The game accommodates with the surroundings
    • Interruptions are handled reasonably
  • Gameplay
    • The game provides clear goals or supports player-created goals
    • The player sees the progress in the game and can compare the results
    • The players are rewarded and rewards are meaningful
    • The player is in control
    • Challenge, strategy, and pace are in balance
    • The first-time experience is encouraging
    • The game story supports the gameplay and is meaningful
    • There are no repetitive or boring tasks
    • The players can express themselves
    • The game supports different playing styles
    • The game does not stagnate
    • The game is consistent
    • The game uses orthogonal unit differentiation
    • The player does not lose any hard-won possessions

This expanded set of heuristics was validated using the same process, only now with five different games. Based on this process, the authors draw the following conclusions:

  • Usability problems were the easiest to identify with their heuristics, and also the easiest violations for designers to make.
  • More mobility problems were found than expected.
  • Gameplay is the most difficult aspect of playability to evaluate.
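Their validation is essentially coverage bookkeeping: tally how many observed problems map to some heuristic and how many fall through. A minimal sketch of that tally, using invented problem/heuristic assignments rather than the authors’ data:

```python
# Hypothetical evaluation log: each playability problem maps to the heuristic
# that covers it, or None if no heuristic applies.
# (Illustrative entries only -- not the authors' actual findings.)

def coverage(problems):
    """Return (covered, uncovered, fraction covered) for an evaluation log."""
    covered = sum(1 for h in problems.values() if h is not None)
    uncovered = len(problems) - covered
    return covered, uncovered, covered / len(problems)

problems = {
    "menu text unreadable in sunlight": "Follow standard conventions",
    "no pause on incoming call": "Prepare for interruptions",
    "grinding required to level up": "Don't encourage repetitive and boring tasks",
    "no way to mute in public": None,  # uncovered -> motivates a new heuristic
}

covered, uncovered, frac = coverage(problems)
print(covered, uncovered, frac)  # 3 1 0.75
```

In the paper’s terms, the first round came out 45 covered to 16 uncovered out of 61, and the uncovered residue drove the expanded sublists.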

Yowza–talk about a scattered paper. I mean, this bad boy is all over the place. It seems as though the authors’ thoughts simply haven’t gelled at all. Nevertheless, they do present what seem to be reasonable heuristics for the evaluation of playability. I have two primary problems with this paper. First, the world of smartphones and mobile games has changed dramatically in the last decade. I would imagine a more recent look at playability is both available and more useful. Second, while their heuristics seem reasonable, and they claim to have validated these heuristics, I can’t find any evidence of this. Do Korhonen and Koivisto not understand that merely using a set of heuristics doesn’t imply that they are valid? This leads to the bigger question of what it means for a set of heuristics to be valid. Do valid heuristics completely describe all possible problems? Is the ‘most’ valid set of heuristics the one that completely describes all possible problems with the fewest heuristics? I’m not sure. I am sure, however, that writing a list of heuristics and then applying them absolutely does not make them valid. The analysis necessary to do so just isn’t present in this paper. Even if the authors claim to have begun to validate their framework of heuristics, they certainly haven’t presented any such results here. While the work shows (showed) promise, I find this both misleading and frustrating.

Usability Heuristics

My reading this week included Nielsen’s Enhancing the Explanatory Power of Usability Heuristics. As usual, I’ll get my trivial beef out of the way up front.

First, the paper is downright painful to read. The English-as-a-second-language rule buys back a few points for Nielsen here, but seriously?:

Note that it would be insufficient to hand different groups of usability specialists different lists of heuristics and let them have a go at a sample interface: it would be impossible for the evaluators to wipe their minds of the additional usability knowledge they hopefully had, so each evaluator would in reality apply certain heuristics from the sets he or she was supposed not to use.

Sure, I’m nitpicking, but that sentence makes my inner ear bleed.

Before going any further, some orientation with respect to the aim of the paper is in order. Surrounding the multiple self-citations Nielsen makes right out of the gate (before the third word of the paper), he defines heuristic evaluation as

a ‘discount usability engineering’ method for evaluating user interfaces to find their usability problems. Basically, a set of evaluators inspects the interface with respect to a small set of fairly broad usability principles, which are referred to as ‘heuristics.’

(I’ll forgo my opinion that usability should be concerned with issues beyond just those in the interface itself…) A number of batteries of these usability heuristics have been developed by different authors, and in this paper Nielsen’s aim is to synthesize ‘a new set of usability heuristics that is as good as possible at explaining the usability problems that occur in real systems.’ In short, Nielsen compiles a master list of 101 heuristics from seven lists found in the literature. Armed with this master list, he examines 249 usability problems drawn from different stages of development and different types of interfaces. Each heuristic was given a grade for how well it explained each of the 249 problems. A principal components analysis (PCA) of these grades revealed that no single component accounts for a large portion of the variability in the problems he examined.
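Nielsen’s setup–grade every heuristic against every problem, then run PCA on the grade matrix–can be sketched in miniature. This uses random grades and an SVD-based PCA in NumPy; none of it is Nielsen’s actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for Nielsen's grade matrix: rows are usability problems,
# columns are heuristics, entries are explanation grades (0-5 scale).
# Random data -- the point is the mechanics, not the result.
n_problems, n_heuristics = 40, 12
grades = rng.integers(0, 6, size=(n_problems, n_heuristics)).astype(float)

# PCA via SVD of the column-centered matrix.
centered = grades - grades.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# How many components does 90% of the variance take?
k90 = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
print(explained[:3], k90)
```

With uncorrelated grades the variance spreads thinly across many components, which is loosely the shape of Nielsen’s own result: no small set of factors dominates.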

After his PCA, Nielsen groups individual heuristics into larger factors–essentially heuristic categories. In his opinion, seven of these categories warrant presentation; they are given here in decreasing order of PCA loading, as calculated by Nielsen:

  • Visibility of system status
  • Match between system and real world
  • User control and freedom
  • Consistency and standards
  • Error prevention
  • Recognition rather than recall
  • Flexibility and efficiency of use

His presentation of these factors and their component heuristics is troubling and confusing. First, the highest PCA loading of any of these factors is 6.1%. Not only is this an exceedingly small amount of explanatory power, it represents the aggregated contribution of 12 individual heuristics! Furthermore, the individual heuristic loadings themselves seem to be at odds. As an example, the heuristic speak the user’s language taken from one source in the literature and another speak the user’s language taken from another source give respective loadings of 0.78 and 0.67. Why do two identically phrased heuristics have different loadings? And why are two identically phrased heuristics even present in the master list at all? This should, at the very least, be addressed by the author. Without some sort of explanation, I am wary of taking Nielsen’s PCA results seriously. Nielsen sweeps this under the rug, stating that ‘it was not possible to account for a reasonably large part of the variability in the usability problems with a small, manageable set of usability factors.’ (That, or some data preprocessing or an upgraded PCA gizmo was in order…)

Nielsen states that 53 factors are needed to account for 90% of the variance in the usability problems in the dataset. I’m lost. The factors for which Nielsen did show the component heuristics had an average of 10 heuristics each. With only 101 total heuristics, how does one arrive at 53 factors (in addition to the others that account for the remaining 10% of variability)? Is Nielsen shuffling heuristics around into different factors to try to force something to work? To make matters worse, Nielsen states that ‘we have seen that perfection is impossible with a reasonably small set of heuristics’. No, you’re missing the point, Nielsen. Perfection is impossible even with a very large set of heuristics. At this point, I’m beginning to lose faith that this paper is going anywhere meaningful…

So, since perfection is impossible, Nielsen pivots to using a new lens for the data. Now, it’s a head-to-head match of the individual lists of heuristics gathered by Nielsen. Here, he ‘consider[s] a usability problem to be “explained” by a set of heuristics if it has achieved an explanation score of at least 3 (“explains a major part of the problem, but there are some aspects of the problem that are not explained”) from at least one of the heuristics in the set.’ Strange, I guess we are now ignoring Nielsen’s previous statement that ‘the relative merits of the various lists can only be determined by a shoot-out type comparative test, which is beyond the scope of the present study’… Nevertheless, based on this approach, Nielsen gives the ten heuristics that explain all usability problems in the dataset and the ten that explain the serious usability problems in the dataset. With this new analysis in hand, and after jumping through several hoops (I’m not entirely clear on how Nielsen’s data were rearranged to make this new analysis work), Nielsen concludes that ‘it would seem that [these lists of ten heuristics indicate] the potential for the seven usability factors to form the backbone of an improved set of heuristics.’ Going on, Nielsen then states that two important factors are missing: error handling and aesthetic integrity…so we’ll add those to the list, too. In other words, even though my data don’t bear this out, I’m adding them because they’re important to me, dammit.
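For what it’s worth, the ‘explained’ criterion itself is mechanical enough to write down. A sketch with invented scores (the heuristic names and values here are placeholders, not Nielsen’s data):

```python
# Nielsen's criterion: a problem counts as "explained" by a set of heuristics
# if at least one heuristic in the set scored it 3 or higher.
# All scores below are invented for illustration.

def explained(problem_scores, heuristic_set, threshold=3):
    """True if any heuristic in the set grades this problem >= threshold."""
    return any(problem_scores.get(h, 0) >= threshold for h in heuristic_set)

scores = {
    "visibility of system status": 4,
    "speak the user's language": 2,
    "error prevention": 1,
}

print(explained(scores, {"visibility of system status", "error prevention"}))  # True
print(explained(scores, {"speak the user's language", "error prevention"}))    # False
```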

I’m utterly confused. How is it that one can take real data, slice and dice the analysis several ways, never really get the data to shape up and prove your point, and then act like it does? Add to this the necessary hubris to come out and say, ‘Hey, even without the data to prove it, I’m stating that these are equally important factors’, and I’m left wholly unimpressed with this paper.

Pay Attention!

Physically and virtually embodied agents offer great potential due to their capacity to afford interaction using the full range of human communicative behavior. To know when to best utilize these behaviors, these agents must be able to perceive subtle shifts in users’ emotional and mental states.

Contributing to the development of agents with such capabilities, Dan Szafir and Bilge Mutlu presented their work in implementing a robotic agent capable of sensing and responding to decreasing engagement in humans in their recent paper, Pay Attention! Designing Adaptive Agents that Monitor and Improve User Engagement.

In any learning situation, teachers communicate most effectively when they sense that learners are beginning to lose attention and focus, and re-engage their attention through verbal and non-verbal immediacy cues. These cues–for example, changes in speech patterns, gaze, and gestures–create a greater sense of immediacy in the relationship between teacher and learner, drawing the learner’s focus back to the topic at hand. Szafir and Mutlu posit that by equipping robotic agents with the ability to monitor learners’ brain-wave patterns through electroencephalography (EEG), recognize declining attention, and respond with such immediacy cues, the efficacy of learning can be improved. They also argue that such agents will promote a stronger sense of rapport between learner and agent, as well as greater motivation to learn.

EEG Correlates of Task Engagement and Mental Workload in Vigilance, Learning, and Memory Tasks


In 2007, Berka et al. published their article, EEG Correlates of Task Engagement and Mental Workload in Vigilance, Learning, and Memory Tasks. With the aim of improving our ‘capability to continuously monitor an individual’s level of fatigue, attention, task engagement, and mental workload in operational environments using physiological parameters’, they present the following:

  • A new EEG metric for task engagement
  • A new EEG metric for mental workload
  • A hardware and software solution for real-time acquisition and analysis of EEG using these metrics
  • The results of a study of the use of these systems and metrics

The article focuses primarily on two related concepts: task engagement and mental workload. As they put it:

Both measures increase as a function of increasing task demands but the engagement measure tracks demands for sensory processing and attention resources while the mental workload index was developed as a measure of the level of cognitive processes generally considered more the domain of executive function.

Using features derived from signals acquired with a wireless, twelve-channel EEG headset, Berka et al. trained a model using linear and quadratic discriminant function analysis to identify and quantify cognitive state changes. For engagement, the model gives probabilities for each of high engagement, low engagement, relaxed wakefulness, and sleep onset. For workload, the model gives probabilities for both low and high mental workload. (They appear to consider cognitive states as unlabeled combinations of the probabilities of each of these classes.) The aim of their simplified model was generalizability across subjects and scenarios, as well as the ability to implement the model in wireless, real-time systems.
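As a rough sketch of the discriminant-function idea (not their actual features, channels, or model), here is a two-class linear discriminant over synthetic ‘EEG feature’ vectors, returning class probabilities the way their engagement model does:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for per-epoch EEG features (e.g., band-power ratios).
# Two classes -- low vs. high engagement. Purely illustrative data.
low  = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
high = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(100, 2))

# Fit a linear discriminant: per-class means plus a pooled covariance.
mu = np.array([low.mean(axis=0), high.mean(axis=0)])
pooled = (np.cov(low.T) + np.cov(high.T)) / 2
prec = np.linalg.inv(pooled)

def engagement_probs(x):
    """Posterior P(class | x) under equal priors and a shared covariance."""
    # Gaussian log-likelihood of each class, up to a shared constant.
    ll = np.array([-0.5 * (x - m) @ prec @ (x - m) for m in mu])
    p = np.exp(ll - ll.max())
    return p / p.sum()  # [P(low), P(high)]

print(engagement_probs(np.array([2.0, 1.5])))  # heavily favors "high"
```

Their actual system does this with more classes (including relaxed wakefulness and sleep onset) and richer features, but the probabilistic output has this shape.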

They trained the model on 13 subjects performing a battery of tasks, and cross-validated it with 67 additional subjects performing a similar battery. Task order was not randomized in either training or cross-validation. The batteries encompass a range of task types and difficulties. Unfortunately, the authors struggle to present these batteries as a cohesive whole and to argue for a relationship between the tasks.

In general, Berka et al found that for the indexes they developed:

[T]he EEG engagement index is related to processes involving information-gathering, visual scanning, and sustained attention. The EEG-workload index increases with working memory load and with increasing difficulty level of mental arithmetic and other problem-solving tasks.

My primary issue with this article revolves around the authors’ statement:

During [some] multi-level tasks, EEG-engagement showed a pattern of change that was variable across tasks, levels, and participants.

Indeed, these tasks represented a large portion of the task battery. The authors argue for the effectiveness of their engagement index, but never thoroughly address why this index is inconsistent across tasks, levels, and participants. At the very least, this might have been included in the authors’ suggestions for future work.

Open Questions

  • The authors gave very few details on the specifics of their wireless EEG system. Many recent products in this area have been of questionable usefulness, at best…
  • Why did the authors not control for ordering effects?
  • Why the different protocols for training and cross-validation? More than this, why modify tasks that were common across both protocols? Finally, if the authors were going to modify common tasks, why not modify those that seemed particularly problematic–at least as presented in the paper (e.g., “Trails”)?

I thought we were over ‘synergy’…

Hey Matthew Bietz, Toni Ferro, and Charlotte Lee, 2004 called–it wants its terrible buzzwords back. No really, people have been vocal about their hate for ‘synergy’ for over a decade now–find a less grating way to describe cooperative interaction. Here’s a brilliant suggestion: ‘cooperative interaction’.

Now that that’s out of the way, I’ve just finished reading Sustaining the Development of Cyberinfrastructure: An Organization Adapting to Change by Bietz et al. (Yes, at least they left it out of the title.) This was a 2012 study of how to create cyberinfrastructure sustainability through ‘synergizing’ (an unholy, Frankensteinian abomination of a made-up word).

Paper Mindmap

According to the authors, a ‘cyberinfrastructure’ (CI) is a virtual organization composed of people working with large-scale scientific computational and networking infrastructures. This seems an overly limiting definition of a CI, but a suitable one for the purposes of the paper. Within this definition, the authors consider how the people who work on and within CIs grapple with growing amounts of data and with the increasing size and complexity of computational problems. In particular, the authors are interested in exploring the sustainability of CIs. They do so through a large case study of one particular CI: the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) out of UCSD. The authors spent an extended period of time, over two observation periods separated by two years, interviewing project participants, working amongst them, and observing general trends in the microbial research community.


Overall, the sustainability of a CI boils down to how well relationships are managed and how open the developers of CIs are to change. The authors present several observations from their work with CAMERA that demonstrate how innate constant change is to the environments in which CIs are situated, and how CIs are, fundamentally, an intricate set of relationships between people, organizations, and technologies. Over the course of the study’s observation of the CAMERA project, the authors observed a number of changes in the structure of the project. These changes, due to the multi-layered relationships comprising the CI, had far-reaching effects across many different pieces of the CI. The only successful way to navigate such changes is to understand their potential impact throughout the CI.


At the risk of sounding overly reductionist, it seems to me that the overwhelming majority of what the authors present in this paper is basic common sense. Take any business, stir the pot, and watch how the business responds. I assume that most intelligent people would be able to surmise that any significant changes would have far-reaching effects within the business, and that sensitivity to such changes and their effects on relationships would be important in determining how well the organization copes. Certainly, the situation becomes more complex given a more complex relationship structure, but the principle remains the same. Furthermore, none of this does anything to soften my general cynicism toward practice-based research. While the paper is well structured and written, I find it hard to identify any genuine contribution beyond a decent articulation of what most people should already know.