
Using Observational Pilot Studies to Test and Improve Lab Packages

Manoel Mendonça, Daniela Cruzes, Josemeire Dias


Computer Networks Research Group (NUPERC), Salvador University (UNIFACS), Rua Ponciano de Oliveira, 126, Salvador, BA, 41950-275, Brazil. +55 71 3330 4627. {daniela, mgmn, meire}@unifacs.br

Maria Cristina Ferreira de Oliveira


Instituto de Ciências Matemáticas e de Computação, Departamento de Ciências da Computação e Estatística, Av. Trabalhador São-Carlense, 400, Centro, Caixa Postal 668, 13560-970, São Carlos, SP, Brazil (cristina@icmc.usp.br)

ABSTRACT
Controlled experiments are a key approach to evaluating and evolving our understanding of software engineering technologies. However, defining and running a controlled experiment is a difficult and error-prone task. This paper argues that one can significantly reduce the risks associated with defining a new controlled experiment by running a set of well-planned observational pilot studies aimed at improving the experimental material. It presents the steps of such an approach and uses a case study to illustrate it. The case study shows the definition of an experiment to evaluate inspection techniques for information visualization tools through a set of four observational studies and one experimental trial. Based on the lessons learned, we present some guidelines on how to test and improve experimental material in this way.


Categories and Subject Descriptors


D.2.0 [Software Engineering]: General

General Terms
Measurement, Experimentation.

Keywords
Empirical Studies, Lab Package, Visual Data Mining, Inspection Techniques.

1. INTRODUCTION
No science can advance without experimentation and measurement [12]. Progress in any discipline involves building models that can be evaluated through experimental studies, to check whether the current understanding of the field is correct [1]. Software engineering is no different, but experimentation in this field is a complex and time-consuming endeavor. Software engineering is subject to many intervening variables and usually produces information that is diverse in nature [14].

Controlled experiments are a key approach to gathering empirical data in software engineering. Well-defined and well-run experiments can produce information under controlled settings and be replicated by others to confirm or refute their findings and the current understanding. Many questions lie behind defining a good controlled experiment, such as: (1) Does the experimental design reduce the threats to experimental validity? (2) Is the subject selection adequate for the problem at hand? (3) Are the experimental artifacts mature enough for running the experiment? (4) Is the time allocated to each activity satisfactory? Some of these questions, like questions 1 and 2, can be answered by careful analysis and planning. Others, like questions 3 and 4, have to be tested in practice; there are too many intervening factors affecting an experiment to answer them analytically. Moreover, the conformity of the replication process is also an issue that should be considered in this context.

In our own work, we have reasoned that better replications and complementary studies can be encouraged by the availability of laboratory packages that document an experiment. A laboratory package describes the experiment in specific terms and provides materials for replication, highlights opportunities for variation, and builds a context for combining the results of different types of experimental treatments. Laboratory packages build an experimental infrastructure for supporting future replications. They establish a basis for confirming or denying the original results, complementing the original experiment, and tailoring the object of study to a specific experimental context [18].

This paper argues that one can significantly reduce the risks associated with defining a new controlled experiment by running a set of well-planned observational pilot studies. These studies are aimed at sanity checking the experimental design, improving the experimental material, and checking the timing of the experimental activities. The paper defines an approach for establishing such pilot studies and uses a case study to illustrate it. The case study shows the definition of an experiment to evaluate inspection techniques for visual data exploration tools through a set of four observational studies and one experimental trial. Based on the lessons learned, we present guidelines on how to test and improve experimental material in this way.

Figure 1 - Experimentation Process (phases: Definition, Planning, Testing, Operation, Analysis and Interpretation, and Packaging; the process feeds an Experiment Base and a Knowledge Base)

2. ADDING A TESTING PHASE TO THE EXPERIMENTATION PROCESS


This section defines our approach. Its basis is the experimental process for software engineering described by Wohlin et al. [20], which is composed of five phases: Definition, Planning, Operation, Analysis, and Packaging. We propose the inclusion of a Testing phase between the planning and operation phases, see Figure 1.

The definition phase states the problem and its context. The main artifact produced by the end of the definition and planning phases is the experimental plan [19]. It contains all the information necessary to run and analyze the experimental study, and some instructions about how the study should be replicated. The planning phase also includes the definition of hypotheses, the selection of variables and subjects, instrumentation, and the assessment of the validity of the results. Usually, these definitions are based on some sort of document model (template), which guides the experimenter in filling it in throughout the experimentation process activities. The resulting documents will contain problems that have to be solved before the operation phase.

The added Testing phase is geared towards sanity checking the experimental design, improving the experimental material, and testing the timing of the experimental activities. It could be viewed as part of the planning phase, but we pulled it apart to highlight its importance. Its main purpose is not to prove that the experimental material is free of errors or problems, but rather to detect the presence of as many problems as possible. The goal is to make sure that the material contains no major problems that could invalidate the results of an experiment or a replication at the operation phase. The Testing phase is successful when one can progress through the operation phase iterations with no need to change essential information in the experimental material.

As shown in Figure 2, the testing phase is composed of four activities: (1) select a simplified experimental project and run it; (2) collect data and observations; (3) interview the subjects; and (4) analyze and improve the experimental material. One should iterate through these steps until the experimental material is ready for the operation phase. The testing phase has similarities with the operation phase, but it has a different focus: instead of focusing on collecting experimental data about the technology being evaluated (the experiment goal), it focuses on collecting data about the experimental material and the experiment design.

The first step of the testing phase is to select a simplified experimental project. This step defines the scope of the pilot study. One usually wants to reduce the scope of the planned experiment for two reasons. The first is that a simplified version can be geared towards specific questions about the experiment and its artifacts. The second is that running the whole experiment may be too costly or too complex. There are at least three ways in which the experimental project can be simplified, as sketched in the example after this list:
- Time reduction: the time the subjects will use to complete the experimental activities is reduced;
- Subject selection: a small number of subjects is selected, usually handpicking subjects that are knowledgeable about experimentation or about the domain being evaluated;
- Artifact limitation: only part of an experimental artifact is evaluated, or only a subset of the experimental artifacts is used.
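To make these options concrete, here is a minimal sketch, under our own assumptions, of how a simplified pilot-study plan could be recorded; the class and field names are illustrative only and are not artifacts of the approach.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PilotStudyPlan:
    """One simplified experimental project for a testing-phase iteration."""
    # Time reduction: shorten the time allotted to the experimental activities.
    evaluation_hours: float
    # Subject selection: a small, handpicked group with known profiles.
    subjects: List[str] = field(default_factory=list)
    # Artifact limitation: evaluate only a subset of the experimental artifacts.
    artifacts: List[str] = field(default_factory=list)

# Example: a first pilot with three trusted, knowledgeable subjects,
# a shortened evaluation session, and only part of the artifacts.
first_pilot = PilotStudyPlan(
    evaluation_hours=1.5,
    subjects=["experimentation expert", "data domain expert", "interface expert"],
    artifacts=["training slides", "problem collection form"],
)
```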

The simplifications may be applied separately or in combination. As a criterion, we suggest starting with subject selection, picking a small group of people that one trusts and knows for the first pilot study. These subjects should be easy to interview, and it is useful to know their profile upfront. It is even better if one can get very knowledgeable subjects; they should be able to give good feedback on the experimental procedures. One should try to get subjects that are knowledgeable both about experimentation and about the domain being evaluated, not necessarily at the same time. If these subjects are expensive resources, the experimenter may adopt one of the other experimental simplifications in order to reduce the time demanded from the subjects. The first pilot study should help to sanity check the experimental design. The designer must return to the planning phase if a major flaw is detected in the design. If the feedback provided is good, one should run a few more studies with subjects with different profiles. As a coverage criterion, most of the profiles of the intended subject audience should be covered. One may also run specific pilot studies to evaluate trouble spots in the artifacts or the techniques. For each observational study, the four steps of the testing phase should be executed. During the study, every result collected should be packaged to create a Knowledge Base and an Experimental Base. These are the basis for the execution, and replication, of the actual experiments in the future [2].

Figure 2 - The Testing Phase (activities: select a simplified experimental project and run it; collection of data and observations; interview of the subjects; analysis and improvement of the material. Outputs are stored in a Knowledge Base of lessons learned, insights, and perceptions, and in an Experimental Base holding the experimental design, experimental procedure, software artifacts, glossary, experiment artifacts, course training notes, assignment descriptions, and raw and refined experimental results. Roles: Experimental Designer, Aggregator, Master Librarian.)

In all the steps of the observational study, the designer has to collect feedback, insights, lessons learned, and subject perceptions for the improvement of the experiment material. The analysis and improvement step should use all the collected data and insights to solve the problems detected and improve the experimental material [5]. The exit criteria for the testing phase can be: (1) a pilot study with few differences from the complete experimental design was run successfully; (2) each key activity of the experimental process (training, execution, interviewing, data collection, and data analysis) was executed at least once in a pilot study; (3) the timing of these activities was sanity checked; (4) the key experimental artifacts (documents, forms, tools, and checklists) were used; (5) the set of subjects that participated in the pilot studies covers the expected profiles of the experiment subjects; and (6) the main threats to experimental validity were examined in the light of the pilot studies. The designer should progress to the operation phase only if there are no visible problems with the design, the activities, the timing, the artifacts, or the subjects that could invalidate the external or internal validity of the experiment.
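A minimal sketch of the exit criteria above as an explicit checklist is shown below; the representation is our own and assumes the designer records each criterion as a boolean.

```python
EXIT_CRITERIA = [
    "a pilot study close to the complete design was run successfully",
    "each key activity was executed at least once in a pilot study",
    "the timing of the activities was sanity checked",
    "the key experimental artifacts were used",
    "the pilot subjects cover the expected subject profiles",
    "the main threats to validity were examined in light of the pilots",
]

def ready_for_operation_phase(status):
    """status: dict mapping each criterion to True/False.
    Returns (ready, list of unmet criteria)."""
    unmet = [c for c in EXIT_CRITERIA if not status.get(c, False)]
    return not unmet, unmet
```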

3. EXPERIMENTAL PROCESS - A CASE STUDY


This section describes the experiment that served as the basis for our case study and laid the groundwork for the observational studies discussed in Section 4.

3.1 Definition of the Problem

The proposed experiment aims to evaluate techniques to inspect visual data exploration tools. Exploratory data analysis is the discovery-driven, interactive exploration of a dataset. Its focus is on discovery, as opposed to hypothesis testing or model building. Its main goal is to identify previously unknown, useful patterns in the explored data. Visual data representations can be used to facilitate exploratory data analysis due to the natural ability of human beings to interpret visual scenes [7][11]. Card et al. [4] consider visualization to be the mapping of data to a visual form that a human being can perceive. Visual data exploration tools offer several mechanisms that allow experts to control data transformations and produce views that address a particular data exploration task on a computer screen. Using these tools, experts can map data tables to visual structures that they can easily interpret. Visual data exploration tools can become very complex to implement due to the following characteristics: (a) their interaction mechanisms manipulate voluminous data sets; (b) they usually implement complex geometric algorithms; (c) they handle multidimensional data [11]; (d) their canvas has to create a visual metaphor that faithfully and usefully represents complex data sets, facilitating their interpretation; and (e) their user interface has to be highly interactive and user friendly.

Luzzardi et al. [8] define criteria contemplating the essential characteristics of the visual representations and of the interactive mechanisms supplied by information visualization interfaces. These principles address two distinct aspects: one to evaluate the visual representations and the other to evaluate the interactive mechanisms. However, the definition of these criteria is not enough to guarantee quality evaluations; it is also necessary to define an evaluation process based on techniques and methodologies. We have proposed a technique to evaluate visual data exploration tools, in particular tools that use hierarchical visualization techniques [11]. We then decided to design an experiment to evaluate this technique and to create a lab package that can be used to replicate our studies. Our technique is based on inspections [1][21]. It guides users in inspecting a data exploration tool for possible problems. Problems are found through the inspector's experience and the technique used. The technique allows inspectors to work individually or as a team. The context of the proposed evaluation is shown in Figure 3. We argue that, besides functionality and usability problems, visual data exploration tools have domain-specific problems that can be better detected by specific criteria and techniques geared to finding them.

Figure 3 - Evaluation Context

The proposed technique is perspective-based [1][21]. Instead of looking for all possible problems at the same time, each inspector focuses on a subgroup of questions covered by a perspective that he assumes during the inspection. Each perspective supplies the inspectors with a point of view, a list of verification questions, and a specific procedure for conducting the inspection. It is assumed that each inspection session can detect a greater percentage of the problems related to the perspective used, and that the combination of different perspectives can discover more problems than the same number of inspectors using one general-purpose inspection technique. This assumption is supported by previous studies on perspective-based reading [1][21]. We propose three possible perspectives to be assumed during the inspections: data analysis expert, data domain expert, and interface evaluation expert. A list of criteria and questions is supplied for each scenario. The criteria list the questions to be verified, with a description of the inspection process. Each inspector works in a distinct scenario. At the end of each session, the inspector has to fill in an execution form where he describes the detected problems, their location, and their severity. In our technique, different perspectives emphasize the different kinds of knowledge required to evaluate a visual data exploration tool or to execute a visual exploration task. When using a computer to visually explore data, a user will need to assume one or more of the following perspectives:
- Data Analysis Experts: the user worries about how the system helps with his data analysis expertise and whether this expertise alone tells him how to use the system to achieve the data exploration goal; he does not worry about usability evaluation or about the meaning of the data.
- Data Domain Experts: the user knows a lot about the data domain used in the experiment and worries about whether the system fits the nature of the data at hand; he does not worry about usability evaluation or about data analysis issues.
- Interface Evaluation Experts: the user focuses on usability evaluation and on how to use the system to achieve the goal; he does not need to worry about the nature of the data or about how to analyze it.

3.2 Definition of the Experimental Goal


We use the Goal Question Metric paradigm (GQM) [14] to provide the framework that defines the empirical study. The GQM requires explicit identification of an object of study as well as a focus for the study.

Our GQM goal is: Analyze the perspective-based and heuristic evaluation techniques for the purpose of evaluating and comparing them with respect to problem detection effectiveness, from the viewpoint of data analysis experts, data domain experts, and interface evaluation experts, in the context of an inspection team.
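For readers less familiar with GQM, the sketch below decomposes this goal into the standard template facets; the dictionary keys follow the usual GQM wording and are only an illustration.

```python
# The GQM goal above, split into the standard template facets.
gqm_goal = {
    "analyze": "perspective-based and heuristic evaluation techniques",
    "for the purpose of": "evaluating and comparing them",
    "with respect to": "problem detection effectiveness",
    "from the viewpoint of": ["data analysis experts",
                              "data domain experts",
                              "interface evaluation experts"],
    "in the context of": "an inspection team",
}
```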
We intend to answer several questions that characterize the object of study and the quality aspects of the data exploration tool. These questions are described below:
- What types of problems are found by each technique?
- Does an inspection team using the perspective-based technique detect more types of problems than a team using the heuristics technique [9][10]?
- Does an inspection team using the perspective-based technique detect more problems than a team using the heuristics technique?
- Is the effectiveness of the team using the perspective-based technique greater than the effectiveness of the team using the heuristics technique?
- Are the number and severity of the problems found by an inspector using a given perspective consistent with the number and severity of the problems found by other inspectors using the same perspective?

To answer these questions, some metrics are defined. The most important of them are described below:
- The number, the severity, and the type of the problems found by each participant and by the team.


- Problem Effectiveness: the percentage of problems that each participant found, weighted by level of severity:

Problem Effectiveness = $\frac{100 \sum_{i=1}^{5} x_i \, p_i}{y}$

where $x_i$ is the number of problems observed by the subject at severity level $i$, $p_i$ is the weight assigned to that level (0 - false positive; 1 - cosmetic problem; 2 - simple problem; 3 - significant problem; 4 - critical problem), and $y$ is the total number of problems in the tool.

The factors considered in the experiment are: the participants' profiles; the adopted technique; the chosen tool; and the data domain to be explored in the experiment. The experiment is limited to testing two hypotheses on the efficiency and effectiveness of the participants:

HA0: Inspection teams using perspective-based inspection have better effectiveness than teams using heuristic evaluation.
HB0: The problems found by teams using perspective-based inspection have a larger average severity than those found by teams using heuristic evaluation.

3.3 Experimental Design

Table 1 shows the experimental design. The experiment is divided into two days: the inspection criteria training day, and the technique training and application day. During the first day, subjects receive the Analyst Survey and the Consent Form. They fill them in and return them to the experimenter. After that, an instructor starts the training session, talking about the evaluation criteria for data exploration tools. This training provides an overview of the inspection criteria and, based on these criteria, shows some examples of problems that can appear in data exploration tools. The second day is divided into three parts: training, evaluation of the tool, and feedback. The second-day training is divided into two sessions. The first one is about the tool: the instructor gives an overview of the visualization paradigm and the functionalities of the tool. The subjects are separated into two groups before the start of the second training session, in which each group receives specific instructions about the technique it will apply next. The evaluation of the tool takes up to two hours; the subjects can stop before that, but they cannot take longer. Lastly, the subjects fill in the feedback form to answer questions on: the training sessions, the techniques used, the experimental design, the forms used, and the tool itself.

Table 1 - Experimental Project
First Day
  Part A (0:30): Analyst Survey and Consent Form
  Part B (2:30): Training: Criteria Evaluation Overview
Second Day
  Part A (1:00): Training: Evaluation Techniques Overview; Training: Tool Overview
  Part B (2:00): Tool Evaluation - 9 subjects for Heuristics; 3 subjects in the Interface Evaluation Experts perspective; 3 subjects in the Data Analysis Experts perspective; 3 subjects in the Data Domain Experts perspective
  Part C (0:30): Feedback Form
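As an illustration only, here is a minimal sketch of how the Problem Effectiveness metric defined in Section 3.2 could be computed for one subject; the function name and input format are our own assumptions and are not part of the lab package.

```python
# Severity levels i = 1..5 and their weights p_i, as defined in Section 3.2.
SEVERITY_WEIGHTS = {
    1: 0,  # false positive
    2: 1,  # cosmetic problem
    3: 2,  # simple problem
    4: 3,  # significant problem
    5: 4,  # critical problem
}

def problem_effectiveness(observed_by_level, total_problems_in_tool):
    """Weighted percentage of problems found by one subject.

    observed_by_level: dict mapping severity level i (1..5) to x_i, the number
    of problems the subject reported at that level.
    total_problems_in_tool: y, the total number of known problems in the tool.
    """
    weighted = sum(SEVERITY_WEIGHTS[i] * x for i, x in observed_by_level.items())
    return 100.0 * weighted / total_problems_in_tool

# Example: 1 false positive, 2 simple problems and 1 critical problem
# reported against a tool with 20 known problems -> 40.0
print(problem_effectiveness({1: 1, 3: 2, 5: 1}, 20))
```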

3.4 Instrumentation
3.4.1 Tools and Dataset
The experimenter can choose any visual data exploration tool for this experiment. In the observational studies and in the lab package we assembled, we selected a tool that uses treemaps as its visual paradigm. A treemap is a visualization structure that uses nested rectangles to represent large trees. It uses 100% of the available space for information visualization and can successfully represent very large hierarchical structures [15][16]. The tool used was developed at the Human-Computer Interaction Laboratory (HCIL) of the University of Maryland (http://www.cs.umd.edu/hcil/treemap/). The dataset used in the laboratory package is the 2004 Business Intelligence Cup data set, which is available at http://www.tis.cl/bicup2004/. The context of the data is that an important Latin American retail company is suffering from an increasing churn rate on its credit cards and has decided to improve its retention system. Two groups of 23 variables are available for each customer: socio-demographic and behavioral data. The data is available as an Excel file containing 14,814 records (customers), described by the mentioned variables.

3.4.2 Training Material


The training's main goals are: (1) to give an overview of the evaluation criteria; (2) to explain the inspection techniques; (3) to give an overview of the tool; and (4) to give instructions on how to fill in the forms during the experiment. The training was divided into three different sessions.

Session I - Slide Presentation:
1. Overview of the experiment;
2. Main goals of the evaluation;
3. Factors that determine the type of evaluation;
4. Different methods of evaluation;
5. Inspections overview;
6. Data mining overview;
7. Visualization techniques overview;
8. Criteria for data exploration tool evaluation.

Session II - Overview of the Tool

Session III - Filling in the Forms




3.4.3 Techniques
We used two techniques in this experiment: the perspective-based technique, described at the beginning of Section 3, and the heuristic evaluation technique. Heuristic evaluation, defined by Nielsen and Molich [9][10], is a usability evaluation in which the evaluator looks for usability problems in a user interface through the analysis and interpretation of a set of heuristic principles. Each element of the interface must be analyzed to verify its conformity with the heuristic criteria. We extended these criteria to include other criteria specifically geared to data exploration tools. The criteria used are the same for the two (perspective-based and heuristic) techniques. The material created for the use of the heuristic technique was:
- Criteria Description Table
- Severity Description Table
- Problems Collection Forms

The perspective-based technique is characterized by its scenarios. A scenario describes certain activities that should be performed by the subject while evaluating the tool. In addition to the activities, the scenario contains questions related to the subset of criteria that deals with the activity. The material created for the use of this technique was:
- Evaluation Scenarios for the three perspectives: data domain expert, data analysis expert, and interface evaluation expert
- Severity Description Table
- Problems Collection Forms
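A minimal sketch of the kind of record a Problems Collection Form gathers, using the severity scale from Section 3.2; the class layout, field names, and example values are our own illustration, not the actual form.

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    """Severity scale used in the problem collection forms (Section 3.2)."""
    FALSE_POSITIVE = 0
    COSMETIC = 1
    SIMPLE = 2
    SIGNIFICANT = 3
    CRITICAL = 4

@dataclass
class ReportedProblem:
    """One row of a (hypothetical) problems collection form."""
    description: str
    location: str          # where in the tool the problem was observed
    severity: Severity
    criterion: str         # the evaluation criterion the problem violates
    technique: str         # "heuristic" or "perspective-based"
    perspective: str = ""  # filled in only for perspective-based inspections

# Example entry, based on a usability problem reported in the trial (Section 4.6).
problem = ReportedProblem(
    description="selection color is not visible on elements painted with the same color",
    location="treemap canvas",
    severity=Severity.SIMPLE,
    criterion="visual representation",
    technique="perspective-based",
    perspective="Interface Evaluation Expert",
)
```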
3.4.4 Other Forms

Besides the technique forms, the other forms created for the experiment were: the Consent Form, the Analyst Survey Form, the Feedback Form, the Experimenter Characterization Form, and the Gathered Knowledge Forms.

The subjects must agree to participate in the experimental process. If a subject does not know, or is not sure about, the goals of the experiment, it is possible that he will not execute all the activities properly [19]. For this reason, a consent form was created based on the guidelines defined by Preece [13].

The main goal of the Analyst Survey Form is to learn the previous experience of the subjects and to evaluate their expertise. After determining the expertise of the subjects, we use an intentional, non-probabilistic technique to select the team for each perspective. The basis for this analysis is the Analyst Survey Form. As described in Figure 4, we defined a 100-point score for each perspective based on the questions contained in the survey form. The answers of the subject determine the perspective that best fits his or her profile.

Figure 4 - Analyst Survey Sections. The survey has seven sections: 1. General Experience (4 questions); 2. Computer Experience (1 question); 3. Software Development Experience (1 question); 4. Interface Development Experience (2 questions); 5. Interface Evaluation Experience (4 questions); 6. Data Exploration Experience (10 questions); 7. Data Domain Experience (3 questions). The answers map to a 100-point score for each of the three perspectives: Data Domain Expert, Data Analyst Expert, and Interface Evaluation Expert.

The Experimenter Characterization Form is similar to the Analyst Survey Form, but it is used to gather the experience of the people who will run the experiment (e.g., the instructor and the designer of the experiment). The reason for collecting this data is that we believe the level of experience of the experimenter may influence the rigor and quality with which the experiment is run.

The main goal of the Feedback Form is to receive the subjects' feedback about the experimental study. This form has 52 questions divided into seven sections. It contains questions about the training, the evaluation techniques, the experiment, the tool, and the forms used in the experimental study.

We developed four other forms to gather the observations of the experiment. We believe this information is useful for the improvement of the material and will be useful in the analysis. The forms are related to doubts, ideas, cases, and lessons learned. The forms contain fields related to their purpose and help the conductor in the collection of the information. These forms were the basis for improving the material during the observational studies described below.
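The sketch below shows, under our own assumptions, one way the 100-point-per-perspective scoring of the Analyst Survey (Figure 4) could be turned into an assignment rule: each survey section contributes points to one or more perspectives and the subject is allocated to the highest-scoring one. The weights are invented for illustration; the actual mapping is part of the lab package.

```python
# Hypothetical contribution of each Analyst Survey section (Figure 4) to the
# three 100-point perspective scores: (interface, data analysis, data domain).
SECTION_WEIGHTS = {
    "General Experience":               ( 0, 20,  0),
    "Computer Experience":              (15, 10, 10),
    "Software Development Experience":  (15, 10,  0),
    "Interface Development Experience": (30,  0,  0),
    "Interface Evaluation Experience":  (40,  0,  0),
    "Data Exploration Experience":      ( 0, 60, 20),
    "Data Domain Experience":           ( 0,  0, 70),
}
PERSPECTIVES = ("Interface Evaluation Expert", "Data Analysis Expert", "Data Domain Expert")

def assign_perspective(answers):
    """answers: dict mapping each survey section to a 0.0-1.0 self-assessed level."""
    totals = [0.0, 0.0, 0.0]
    for section, level in answers.items():
        for i, weight in enumerate(SECTION_WEIGHTS[section]):
            totals[i] += weight * level
    best = max(range(len(PERSPECTIVES)), key=lambda i: totals[i])
    return PERSPECTIVES[best], dict(zip(PERSPECTIVES, totals))
```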

4. TESTING
We ran observational studies to sanity test the experimental design, to test and improve the experimental material, and to check the timing of the experimental activities. In this case study, we ran a set of four observational studies. In each study we tried to validate the materials and update the list of problems found in the chosen tool. Following the guidelines given in Section 2, the studies followed simplified experimental projects centered mostly on subject selection. The first and the second observational studies involved people that we knew, with well-known profiles and background experience. We used three subjects in each study. The rationale was that these were experienced people who could sanity check the experiment and give extensive feedback on the material. We used a small number of people because we did not want to involve more than a handful of people until we had sanity checked the material as a whole. In the third study, we chose a group of five university lecturers. They had a more diverse profile than the first group, but a very solid background in computing. Lastly, the fourth study used 22 undergraduate students and an experimental setup very close to the real experimental design. We evolved the material between studies. The conditions under which this last observational study was run, and the results we obtained, convinced us that we finally had material ready to progress to the operation phase of our experiment. The number of subjects that took part in each study, as well as the distribution of subjects by technique, is shown in Table 2.



Table 2. Number of Subjects in each Perspective
Rows: Pilot Study (First, Second, Third, Fourth, TOTAL), each split by Technique (Perspective, Heuristic). Columns: Interface Evaluation Experts, Data Analysis Experts, Data Domain Experts.
1 1 2 1 2 1 2 2 7 3 6 2 6 4 2 5 5 8 5 1

The details of the observational studies are presented in the next sections. Table 3 summarizes the studies and highlights the simplified experimental designs used in them.

Table 3 - Experimental Design for Each Observational Study
First Observational Study (one day): 2 hours - Training; 30 minutes - Dataset Exploration Using Excel; 1:30 hours - Tool Evaluation.
Second Observational Study: First Day: 2 hours - Criteria Evaluation Overview. Second Day: 30 minutes - Evaluation Techniques Overview; 30 minutes - Tool Overview; 2:30 hours - Tool Evaluation.
Third Observational Study (one day): 1 hour - Criteria Evaluation Overview / Techniques Overview / Tool Overview; 2 hours - Tool Evaluation.
Fourth Observational Study: First Day: 1 hour - Consent Form, Analyst Survey Form. Second Day: 2:30 hours - Criteria Evaluation Overview. Third Day: 2:30 hours - Evaluation Techniques Overview, Tool Overview and Tool Evaluation.

4.1 First Observational Study


The main goal of this observational study was to sanity test the experiment, analyzing the training material, testing the forms, and detecting errors, failures, and inconsistencies in the experimental material. This was the first dry run of the material. We decided to use a simplified design with very few, trustworthy people applying only part of the techniques. Three subjects executed the experiment: Interface Evaluation Experts perspective (1 subject), Data Domain Experts perspective (1 subject), and Heuristics Evaluation (1 subject), see Table 2. All the activities were placed in one day, as shown in Table 3. All the subjects were motivated to help us improve the material and we did not need any additional motivational resource. As expected, many problems were highlighted in this observational study:


- Ambiguity in the filling boxes of the forms;
- We did not give an overview of the tool to the subjects, and they were confused about it;
- Having all activities in one day was stressful and boring for the subjects;
- The training had some gaps, and the sessions about the tool and the techniques had to be revised;
- Some tables were difficult to use, and they had to be applied every time a problem was found.

We analyzed the detected problems and improved the material for the next observational study.

4.2 Second Observational Study


Three subjects executed the experiment in this study. They were allocated as shown in Table 2: Interface Evaluation Experts perspective (2 subjects) and Heuristics Evaluation (1 subject). For the second observational study, we inserted an overview of the tool into the material, changed the layout of the forms, and divided the activities into two sessions on two different days, see Table 3. We observed that the subjects were more successful in detecting problems with the tool than they were in the first study; we surmise that this was due to the insertion of the tool overview. This study involved two people who had taken part in the first study; only one subject was new. They approved the changes and did not find other serious problems with the material. Their only complaint was about some technical terms used in the training material.


4.3 Third Observational Study


This observational study was run with five university lecturers, distributed as shown in Table 2. All of them were known to us, and thus we knew their profiles and background experience. As shown in Table 3, the experiment was executed in a single day, a reduction due to the time availability of the participants. Subjects were subdivided into: Interface Evaluation Experts perspective (2 subjects), Data Domain Experts perspective (2 subjects), and Heuristics Evaluation technique (1 subject). A good portion of the problems reported during the inspections were different in nature from those found in the previous studies. Some of the subjects were experts in user interfaces and had a very different profile from those of the first two studies. Many times they followed their own heuristics and criteria. We detected this problem only when they were applying the technique and evaluating the tool, and we traced it back to the training session. Besides that, some evaluation criteria were not understood, so the subjects did not use them. They later reported that they found the training material to be geared towards data mining experts and complained about the terminology used in it.

Analyzing the results of the previous studies, we saw that some questions in the forms were not being answered correctly because the subjects did not know the definitions of some of the terms we were using. An example of this type of problem is the question: "Have you ever used Heuristic Evaluation to evaluate a user interface?" Some subjects answered NO and we knew they had used it; others answered YES and we found out they had never used it. To solve this problem, we inserted definitions, examples, and hints in the forms, in order to help the subjects answer the questions correctly.

4.4 Fourth Observational Study


In all the previous observational studies, we knew the subjects and their background experience, so we could choose the perspective for everybody without any further analysis of their profiles. In the fourth study, however, we did not know the 22 subjects beforehand. We therefore decided to create a quantitative criterion to define in which perspective each subject should execute the experiment. We created this metric based on the Analyst Survey Form, as shown in Figure 4 and discussed in Section 3.4. The final distribution of the subjects for this study is shown in Table 2. The observational study was executed in three days. We first received the forms, applied the metric, and distributed the subjects among the techniques. The second day was dedicated to the training session on the evaluation criteria, and the third day to the other two training sessions and the evaluation session, as described in Table 3. To create an additional motivator for the participation of the subjects in the experiment, we issued a certificate of participation. In this study, the subject profiles differed from the previous three studies. The subjects were undergraduates and did not have the extensive background of the previous subjects. This helped to identify more problems with technical terms. They had some difficulty understanding the criteria used in the techniques, which we considered normal for their level of experience.

4.5 How the Pilot Studies Helped to Address the Threats to Validity

Threats to validity [3] are factors other than the independent variables that can affect the dependent variables. This section discusses how the pilot studies also helped to mitigate some of the threats to experimental validity. We divide the discussion into threats to the internal validity and threats to the external validity of the experiment.

The threats to the internal validity of the experiment are as follows:
- Maturation: In the experiment, the whole evaluation session takes two hours, without breaks. The likely effect would be that, towards the end of the inspection session, the inspector would tire and perform worse, or that this time would not be enough for the whole evaluation process. From the observation records of the pilot studies, there were no signs that the subjects looked tired or bored when they performed the evaluation for two hours.
- Testing: Getting familiar with the material and the technique may have an effect on subsequent results of an experiment. The pilot studies helped to confirm that this is not a threat to this particular experiment. Each subject applies only one technique. The techniques and forms are explained during the second-day training, but the subjects have hands-on contact with them only during the tool evaluation part.
- Instrumentation: The pilot studies helped to eliminate problems in the forms and other instruments used during the experiment. Also, after collecting data in these studies, the designers changed the data coding. This effort improved the material enough for it to go through the operation phase.
- Selection: Based on the pilot studies, we established guidelines to balance the subjects allocated to each treatment. For that we used the Analyst Survey questionnaire. We also provided guidelines to match the perspectives with the subjects' backgrounds, using the metric explained in Section 3.4.4.
- Treatment: We prompted the subjects not to discuss what they had done during the inspection with others before all of them had finished participating in the experiment. The pilot studies allowed us to confirm that this strategy works.
- Process conformance: Another threat is that people may follow the techniques poorly. Based on the pilot study experiences, we improved the training sessions about the technique and about the evaluation criteria to mitigate this. It remains, however, one of the main threats to the experiment.

The threats to the external validity of the experiment are as follows:
- Through the pilot studies, we found that a few subjects deviated to their own ad-hoc usability evaluation techniques instead of using the prescribed techniques. This represents a real threat to the internal and external validity of the experiment. We tried to remedy it with improved training.
- The lab environment keeps the subject concentrated on the inspection, without distraction or interruption. This may not replicate a work environment. Internally this is not a problem, because it applies to all techniques.
- Awareness of being observed by others may have some impact on the subjects' behavior. Internally this is not a problem, because it applies to all techniques.
- The sample does not represent the general population. After all, we are running a quasi-experiment in which the population selection is not random. We try to mitigate the natural distribution of perspectives by using an intentional, non-probabilistic technique to select the team for each perspective. This distribution technique was developed based on the pilot studies and is briefly discussed in Section 3.4.4.

4.6 Experimental Trial


By the end of the fourth pilot study, we considered the material stable enough to move on to our first experiment. This experiment was run at the end of May 2005 with a group of 12 graduate students enrolled in a data visualization course at another university. The subjects were distributed as shown in Table 4:

Table 4 - Experimental Design of the Experiment Trial
Experiment    Technique     Interface Evaluation Experts   Data Analysis Experts   Data Domain Experts
First Trial   Perspective   2                              2                       2
First Trial   Heuristic     2                              2                       2


The experimental design was the same as that used in the fourth observational study. Despite all the testing, we still faced small problems. The most important one was that the students had problems understanding the dataset domain. In this case, the experimenter was able to clear up the students' doubts.

Table 5 - Number of Problems Found
Category of Problem   Heuristics   Interface Evaluation Experts   Data Analysis Experts   Data Domain Experts
Usability             12           2                              1                       0
Dataset Problems      3            1                              0                       1
False positives       3            1                              2                       0
Problems              42           10                             14                      5
Total                 60           14                             17                      6

Table 5 shows the distribution of problems found per technique. The subjects reported 97 potential problems. Of those, five were dataset problems and six were false positives. Fifteen of them were clearly usability problems, such as "an option for other languages does not exist" and "the selection color is not visible for elements painted with the same color". The remaining 71 problems were classified as problems related to visualization tools and were further analyzed. The problems were grouped into 18 classes. Evaluating the results by technique, we observed that the heuristics technique was more effective in finding the problems: the subjects that used the heuristics technique covered 100% of the classes of problems found, while the perspective-based technique covered only 50% of the problem classes. The results were surprising, and further studies are needed in order to evaluate them.
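As a small illustration of the coverage comparison above, the sketch below computes the fraction of problem classes covered by each technique from a list of classified reports; the input format is assumed.

```python
def class_coverage(reports, all_classes):
    """reports: iterable of (technique, problem_class) pairs.
    Returns the fraction of the known problem classes covered by each technique."""
    covered = {}
    for technique, problem_class in reports:
        covered.setdefault(technique, set()).add(problem_class)
    return {t: len(classes & all_classes) / len(all_classes)
            for t, classes in covered.items()}

# Toy example with three problem classes (the real analysis used 18 classes).
reports = [("heuristic", "layout"), ("heuristic", "interaction"),
           ("heuristic", "data mapping"), ("perspective-based", "layout")]
print(class_coverage(reports, {"layout", "interaction", "data mapping"}))
# -> {'heuristic': 1.0, 'perspective-based': 0.3333333333333333}
```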

5. LESSONS LEARNED AND GUIDELINES


The observational studies helped us to sanity check the experimental design, improve the experimental material, and check the timing of the experimental activities. As discussed in Section 4.5, by improving the experimental material, the pilot studies also helped to mitigate some of the threats to experimental validity. Based on our experience, we suggest the following guidelines for running pilot studies to test and improve a lab package:

- It is important to recruit good subjects to run the pilot studies. Establish strategies to motivate and reward prospective subjects. Possible strategies are: grading, extra credit, or educational activities for students; paying subjects to take part in the experiment; contacting colleagues interested in the same problem domain; or simply appealing to the kindness of knowledgeable people.
- Start with a small pilot study involving a few subjects with good knowledge of the experiment domain.
- Start with subjects of known background, experience, and profile.
- After validation with a small group of known subjects, move towards a larger and more diverse group of subjects. This will help to find diverse and unexpected problems and flaws in the experimental material.
- After successfully running the experiment with a set of subjects that covers most of the expected profiles, the package should be ready for the operation phase.
- The testing phase is useful to create a live list of known artifact characteristics (e.g., problems and false positives). This list should be used as a reference in future replications of the experiment.
- Dividing training into more than one session can be useful to pinpoint problems in this very critical experimental procedure.

6. CONCLUSIONS

This paper suggests pulling experiment testing out of the planning phase of the experimental process. The main goal of this phase is to evaluate and improve the empirical material before the operation phase of a new experiment. Its purpose is not to prove that the experimental material does not have errors or problems, but to find as many problems as possible in it. If this phase is successful, one can progress through the operation phase iterations without having to change essential information in the experimental material. Changes at the operation phase will most probably invalidate the results of an experiment or a replication. By improving the material, the studies also help to mitigate some of the threats to the experiment's validity, reducing its risk of failure.

The results gathered during the pilot studies are not useful for analyzing the effectiveness of the techniques being evaluated. A pilot study is not an experiment, because the experimental material is still being modified and improved. Besides, the observational studies are not the full version of the experiment: the experimenter usually runs only a simplified version of the full experiment during an observational pilot study.

The case study discussed in this paper showed examples of: improvement of the material; improvement of the training; perceiving the necessity of, and implementing, a better subject classification; and better defining the timing of the activities for the experiment. After all the pilot studies, we were able to run, and are now replicating, the full version of the experiment.


7. REFERENCES
[1] Basili, V., Green, S., Laitenberger, O., Shull, F., Sorumgard, S., and Zelkowitz, M. (1996). The empirical investigation of perspective-based reading. Empirical Software Engineering, 1, 133-164.
[2] Basili, V., Shull, F., Lanubile, F. Building Knowledge through Families of Experiments. IEEE Transactions on Software Engineering, vol. 25, no. 4, 1999.
[3] Campbell, D. T. and Stanley, J. C. Experimental and Quasi-Experimental Designs for Research. Houghton Mifflin Company, 1966.
[4] Card, S.K., Mackinlay, J.D. and Shneiderman, B. (1999). Readings in Information Visualization - Using Visualization to Think. San Francisco, Morgan Kaufmann.
[5] Cruzes, D. S., Mendonça Neto, M. G., Maldonado, J. C., Jino, M. Using Visualization to Bring Context Information to Software Engineering Model Building. In: International Workshop on Visual Languages and Computing (VLC 2004), San Francisco. Proceedings of the 2004 International Conference on Distributed Multimedia Systems. Skokie, Illinois: Knowledge Systems Institute, 2004, v. 1, p. 219-224.
[6] Freitas, C. M. D. S., Luzzardi, P. R. G., Cava, R. A., Winckler, M. A., Pimenta, M., Nedel, L. P. Evaluating Usability of Information Visualization Techniques. In: Proceedings of the 5th Symposium on Human Factors in Computer Systems (IHC), Fortaleza, CE, 2002.
[7] Keim, D. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(1):1-8, January 2002.
[8] Luzzardi, P. R. G., Freitas, C. M. D. S., Cava, R. A., Duarte, G. D., Vasconcelos, M. H. An Extended Set of Ergonomic Criteria for Information Visualization Techniques. In: Proceedings of the Seventh IASTED International Conference on Computer Graphics and Imaging (CGIM 2004), Kauai, 2004, p. 236-241.
[9] Nielsen, J. (1993). Usability Engineering. Boston, MA: Academic Press.
[10] Nielsen, J. and Mack, R. L. (1994). Usability Inspection Methods. New York, John Wiley.
[11] Oliveira, M. C. F., Levkowitz, H. From Visualization to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics, v. 9, n. 3, p. 378-394, 2003.
[12] Pfleeger, S. L. Albert Einstein and Empirical Software Engineering. IEEE Computer, 1999.
[13] Preece, J. et al. Interaction Design: Beyond Human-Computer Interaction. John Wiley, 2002.
[14] Rombach, H.D. and Basili, V.R. Practical benefits of goal-oriented measurement. In: Proc. Annual Workshop of the Centre for Software Reliability: Reliability and Measurement. Garmisch-Partenkirchen, Germany: Elsevier, 1990.
[15] Shneiderman, B., Johnson, B. Treemaps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. Proc. of IEEE Information Visualization, pp. 275-282, 1991.
[16] Shneiderman, B. Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics, vol. 11, n. 1, pp. 92-99, 1992.
[17] Shull, F., Carver, J., Travassos, G.H. An Empirical Methodology for Introducing Software Processes. In: 8th European Software Engineering Conference (ESEC) and 9th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-9), Vienna. Proceedings of ESEC/FSE-9. ACM Press, 2001, p. 288-296.
[18] Shull, F., Mendonça, M., Basili, V., Carver, J., Maldonado, J., Fabbri, S., Travassos, G., Ferreira, M. Knowledge-Sharing Issues in Experimental Software Engineering. Empirical Software Engineering - An International Journal, 2004, 9(1): p. 111-137.
[19] Travassos, G.H., Gurov, D., Amaral, E.A.G.G. Introdução à Engenharia de Software Experimental. Relatório Técnico ES-590/02-Abril, Programa de Engenharia de Sistemas e Computação, COPPE/UFRJ.
[20] Wohlin, C., Runeson, P., Höst, M., Regnell, B., Wesslén, A. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Boston, MA, 2000.
[21] Zhang, Z., Basili, V., and Shneiderman, B. An empirical study of perspective-based usability inspection. Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, p. 1346-1350, Chicago, 1998.
