14 Evaluation studies: from controlled to natural settings

14.1 Introduction

14.2 Usability testing

14.3 Conducting experiments

14.4 Field studies

chapter focuses on
i- usability testing which takes place in usability labs
ii- experiments which take place in research labs
iii- field studies which take place in natural settings such as people's homes, schools, work and leisure environments

Intro:

the usability of products has traditionally been tested in controlled laboratory settings
doing usability testing in a laboratory, or a temporarily assigned controlled environment, enables evaluators to control what users do and to control environmental and social influences that might impact the users' performance
the goal is to test whether the product being developed is usable by the intended user population to achieve the tasks for which it was designed

14.2.1 Methods, Tasks and Users

collecting data about users' performance on predefined tasks is a central component of usability testing
Time and number are the two main performance measures used: the time it takes typical users to complete a task, such as finding a website, and the number of errors that participants make, such as selecting wrong menu options when creating a spreadsheet
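a minimal sketch of how the two measures might be summarised for one task. the session records and field names (`seconds`, `errors`) are made up for illustration and do not come from any real study:

```python
from statistics import mean

# Hypothetical logged sessions: one record per participant for a single task.
sessions = [
    {"participant": "P1", "seconds": 42.0, "errors": 1},
    {"participant": "P2", "seconds": 55.5, "errors": 3},
    {"participant": "P3", "seconds": 38.5, "errors": 0},
]


def summarise(records):
    """Return the mean completion time plus total and mean error counts."""
    times = [r["seconds"] for r in records]
    errors = [r["errors"] for r in records]
    return {
        "mean_seconds": mean(times),
        "total_errors": sum(errors),
        "mean_errors": mean(errors),
    }


summary = summarise(sessions)
print(summary)
```

with so few participants these figures are descriptive only; they show where users struggle rather than proving a difference.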
a key concern is the number of users that should be involved in a usability study: 5-12 is considered an acceptable number but sometimes it is possible to use fewer when there are budget and schedule constraints

14.2.2 Labs and Equipment

Many companies, such as Microsoft and IBM, used to test their products in custom-built usability labs. These facilities comprise a main testing laboratory, with recording equipment and the product being tested, and an observation room where the evaluators watch what is going on and analyse the data
there may also be a reception area for testers, a storage area and a viewing room for observers
typically there are two to three wall-mounted video cameras that record the user's behaviour, such as hand movements, facial expression, and general body language
the observation room is usually separated from the main laboratory or work room by a one-way mirror so that evaluators can watch participants being tested but the participants cannot see the evaluators
usability labs can be very expensive and labor-intensive to run and maintain. a less expensive alternative, which started to become more popular in the early and mid-90s, is the use of mobile usability testing equipment
another advantage is that equipment can be taken into work settings, enabling testing to be done on site, making it less artificial and more convenient for the participants
there is an increasing number of products that are specifically designed for mobile evaluation. some are referred to as lab-in-a-box or lab-in-a-suitcase because they pack away neatly into a convenient carrying case
another trend has been to conduct remote usability testing, where users perform a set of tasks with a product in their own setting and their interactions with the software are logged remotely. An advantage of this approach is that many users can be tested at the same time and the logged data can be automatically compiled into statistical packages for data analysis

14.2.3 An example of usability testing: the iPad

the evaluators wanted to understand how interactions with the device affected people, and to get feedback to their clients and developers as well as to those who were eager to know if the iPad lived up to the hype being reported at the time it came to market. they used two methods:
i- usability testing with think-aloud in which users said what they were doing and thinking as they did it
ii- expert review
the test sessions were similar; the aim of both was to understand the typical usability issues that people encounter when using applications and accessing websites on the iPad

The test

the equipment

usability problems

the main findings from the study showed that the participants were able to interact with websites on the iPad but that it was not optimal
getting lost in an application is an old but important problem for designers of digital products. some participants got lost because they tapped the iPad too much, could not find a back button, and could not get themselves back to the home page

Interpreting and presenting the data

recommendations include supporting standard navigation
while revealing how usable websites and apps are on the iPad, the usability testing was not able to reveal how the device will be used in people's everyday lives. this would require an in the wild study where observations are made of how people use it in their own homes and when traveling

intro

an example of a hypothesis to test is: context menus (ie menus that provide options related to the context determined by the users' previous choices) are easier to select options from than cascading menus

i. hypothesis testing

typically, a hypothesis involves examining a relationship between two things, called variables.
-variables can be independent or dependent.
an independent variable is what the investigator manipulates (ie selects) eg different menu types
the other variable is called the dependent variable, eg the time taken to select an option. it is a measure of the user performance and, if our hypothesis is correct, will vary depending on the different types of menu.
when setting up a hypothesis to test the effect of the independent variable(s) on the dependent variable, it is usual to derive a null hypothesis and an alternative one
the null hypothesis in our example would state that there is no difference in the time it takes users to find items (ie selection time) between context and cascading menus
the alternative hypothesis would state that there is a difference between the two on selection time
when a difference is specified but not what it will be, it is called a two-tailed hypothesis. this is because it can be interpreted in two ways: either the context menu or the cascading menu is faster to select options from.
alternatively, the hypothesis can be stated in terms of one effect. this is called a one-tailed hypothesis and would state that context menus are faster to select items from, or vice versa. a one-tailed hypothesis would be made if there was a strong reason to believe it to be the case.
-a two-tailed hypothesis would be chosen if there was no reason or theory that could be used to support the case that the predicted effect would go one way or the other
the null hypothesis is put forward so that the data can reject a statement without necessarily supporting the opposite statement
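the hypotheses for the menu example can be written compactly. the symbols are an assumed notation, not taken from the notes: \mu_{\text{context}} and \mu_{\text{cascading}} stand for the mean selection time with each menu type, and the one-tailed line shows the direction "context menus are faster":

```latex
\begin{align*}
H_0 &: \mu_{\text{context}} = \mu_{\text{cascading}} \\
H_1 \text{ (two-tailed)} &: \mu_{\text{context}} \neq \mu_{\text{cascading}} \\
H_1 \text{ (one-tailed)} &: \mu_{\text{context}} < \mu_{\text{cascading}}
\end{align*}
```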
in order to test a hypothesis, the experimenter has to set up the conditions and find ways to keep other variables constant, to prevent them from influencing the findings. this is called the experimental design
hypothesis testing can also be extended to include even more variables, but it makes the experimental design more complex.

ii experimental design

a concern in experimental design is to determine which participants to involve for which conditions in an experiment
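taking part in one condition can affect how a participant performs in a later one (a training effect). in some experimental designs, however, it is possible to use the same participants for all conditions without letting such training effects bias the results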
the names given for the different designs are:
different-participant design, same-participant design and matched-pairs design
(i) in different-participant design, a single group of participants is allocated randomly to each of the experimental conditions, so that different participants perform in different conditions.
another term used for this experimental design is between-subjects design
an advantage is that there are no ordering or training effects caused by the influence of participants' experience of one set of tasks on their performance in the next, as each participant only ever performs in one condition.
a disadvantage is that large numbers of participants are needed so that the effect of any individual differences among participants, such as differences in experience and expertise, is minimised.
randomly allocating the participants and pre-testing to identify any participants that differ strongly from the others can help
(ii) in the same-participant design (also called within-subjects design), all participants perform in all conditions, so with two conditions only half the number of participants is needed compared with a different-participant design
the main reason for this design is to lessen the impact of individual differences and to see how performance varies across conditions for each participant
it is important to ensure that the order in which participants perform tasks for this set up does not bias the results.
for example, if there are two tasks, A and B, half the participants should do task A followed by task B and the other half should do task B followed by task A. This is known as counterbalancing
counterbalancing neutralises possible unfair effects of learning from the first task, known as the order effect
(iii) in matched-participant design (also known as pair-wise design), participants are matched in pairs based on certain user characteristics such as expertise and gender. The members of each pair are then randomly allocated, one to each experimental condition
a problem with this arrangement is that other important variables that have not been taken into account may influence the results. for example, experience in using the web could influence the results of tests to evaluate the navigability of a website. so web expertise would be a good criterion for matching participants
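the three designs above can be sketched as allocation routines. this is an illustration under stated assumptions, not a prescribed procedure: the function names, the participant ids and the `web_years` scores are all hypothetical, and the matched design here assumes exactly two conditions:

```python
import random


def random_order(items, rng):
    """Return a shuffled copy, leaving the original list untouched."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def different_participants(participants, conditions, rng):
    """Between-subjects: each participant performs in exactly one condition."""
    groups = {c: [] for c in conditions}
    for i, p in enumerate(random_order(participants, rng)):
        groups[conditions[i % len(conditions)]].append(p)
    return groups


def same_participants(participants, orders, rng):
    """Within-subjects: everyone does every task; task order is counterbalanced."""
    return {p: orders[i % len(orders)]
            for i, p in enumerate(random_order(participants, rng))}


def matched_participants(participants, score, conditions, rng):
    """Matched pairs: pair neighbours on `score`, then split each pair at random.

    With an odd number of participants the last one is left unallocated.
    """
    ranked = sorted(participants, key=score)
    groups = {c: [] for c in conditions}
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        for member, condition in zip(pair, conditions):
            groups[condition].append(member)
    return groups


rng = random.Random(0)  # fixed seed so the allocation can be reproduced
people = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
# Hypothetical years of web experience, used as the matching characteristic.
web_years = {"P1": 1, "P2": 10, "P3": 2, "P4": 9, "P5": 5, "P6": 6, "P7": 3, "P8": 8}
conditions = ["context", "cascading"]

between = different_participants(people, conditions, rng)
within = same_participants(people, [("A", "B"), ("B", "A")], rng)
matched = matched_participants(people, web_years.get, conditions, rng)
```

dealing shuffled participants out in turn keeps group sizes equal, and the same trick gives half the participants each task order in the counterbalanced case.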
statistical tests are then used, such as t-tests that statistically compare the differences between the conditions, to reveal if these are significant. For example, a t-test will reveal whether context or cascading menus are faster to select options from

iii. statistics: t-test

there are many types of statistics that can be used to test the probability of a result occurring by chance but t-tests are the most widely used statistical test in HCI and related fields, such as psychology.
the t-test uses a simple equation to test the significance of the difference between the means for the two conditions.
if they are significantly different from each other, we can reject the null hypothesis and in so doing infer that the alternative hypothesis holds.
the degrees of freedom (df) are calculated by summing the number of participants in one condition minus one and the number of participants in the other condition minus one.
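the notes do not give the equation itself. assuming the independent-samples form with pooled variance, which matches the df rule above, it can be written as follows, where \bar{x}_1, \bar{x}_2 are the condition means, s_1^2, s_2^2 the sample variances and n_1, n_2 the number of participants in each condition (all symbols are assumed notation):

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
df = (n_1 - 1) + (n_2 - 1)
\]
```

a same-participant design would normally use the paired form of the test instead, with a different df.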
p is the probability of finding an effect at least this large by chance alone, ie if the null hypothesis were true
so when p<0.05, there is less than a 5% chance that an effect of this size would arise by chance alone, and the effect found is therefore probably not due to chance. in other words there most likely is a difference between the two conditions
typically, a value of p<0.05 is considered good enough to reject the null hypothesis, although lower levels of p are more convincing, eg p<0.01 where the effect found is even less likely to be due to chance, there being less than a 1% chance of that being the case
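a minimal sketch of the test for a different-participant design, using only the standard library. the selection times are invented, the function name is hypothetical, and the pooled-variance form of the t-test is an assumption. instead of computing p directly, the sketch compares |t| with the two-tailed critical value for df = 8 at the 0.05 level taken from a standard t table:

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Student's t for two independent samples, using pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = (n_a - 1) + (n_b - 1)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical selection times in seconds, five participants per condition.
context = [1.0, 2.0, 3.0, 4.0, 5.0]
cascading = [4.0, 5.0, 6.0, 7.0, 8.0]

t, df = independent_t(context, cascading)

# Two-tailed critical value of t for df = 8 at p = 0.05 (standard t table).
CRITICAL_T_DF8_P05 = 2.306
reject_null = abs(t) > CRITICAL_T_DF8_P05

print(f"t = {t:.2f}, df = {df}, reject null at p<0.05: {reject_null}")
```

with these made-up numbers t is -3.0 with df = 8, so |t| exceeds 2.306 and the null hypothesis would be rejected at p<0.05. a statistics package would report the exact p value rather than relying on a table lookup.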

intro:

the trade-off is that we cannot test specific hypotheses about an interface, nor account with the same degree of certainty for how people react to or use a product, as we can do in controlled settings like laboratories
Experience Sampling method (ESM), which is often used in healthcare

14.4.1 In the wild studies

the term 'in the wild' reflects the context of the study, where the new technologies are created, deployed, and evaluated in situ.
the evaluation takes place in a natural setting in which the researchers release their prototype technology and observe from a distance how it is approached and used by people in their own homes, and elsewhere.
instead of developing solutions that fit in with existing practices, researchers often explore new technological possibilities that can change and even disrupt behavior

i. An in the wild study: The UbiFit Garden

ii. Data collection and participants

the goals of the in the wild study were to identify usability problems and to see how this technology fitted into the everyday lives of the participants

iii. data analysis and presentation

14.4.2 Other perspectives

field studies may be conducted where a behaviour of interest to the researchers only reveals itself after a long time of using a tool
sometimes, a particular conceptual or theoretical framework is adopted to guide how the evaluation is performed or how the data collected from the evaluation is analysed
this enables the data to be explained at a more general level in terms of specific cognitive processes, or social practices such as learning, or conversational or linguistic interactions.
for example, Activity Theory was used as a framework to analyse how a family learned to use a new TV and video system in their own home.
another example of theory use is semiotic engineering, which is based on semiotic theory. In semiotic engineering, the study of signs aids designers in analysing communication between designers and users
in field studies, interventions by evaluators are limited, other than the placement of the prototype or product in the setting and questions and/or probes to discover how the system is learned, used and adopted. In contrast, evaluations in laboratories tend to focus on usability and how users perform on predefined tasks
deciding on how many people to involve in a usability study is partly a logistical issue that depends on schedules, budgets, representative users, and the facilities available.
many professionals recommend that 5-12 testers is enough for many types of studies such as those conducted in controlled or partially controlled settings, although a handful of users can provide useful feedback at early stages of a design
for field studies the number of people being studied will vary, depending on what is of interest
the problem with field studies is that the findings may not be representative of how other groups would act. However, the detailed findings gleaned from these studies about how participants learn to use a technology and appropriate it over time can be very revealing