19.5 ITS EVALUATIONS

Building a tutor and not evaluating it is like building a boat and not taking it in the water. We find the evaluation as exciting as the process of developing the ITS. Often, the results are surprising, and sometimes they are humbling. With careful experimental design, they will always be informative. (Shute & Regian, 1993, p. 268).

Which systems instruct effectively? What makes them effective? One might think that increasing the personalization of instruction (e.g., model tracing) would enhance learning efficiency, improving both the rate and quality of knowledge and skill acquisition. But the findings reported in the learning literature on increased computer adaptivity are equivocal. In some cases, researchers have reported no advantage of error remediation for learning outcome (e.g., Bunderson & Olsen, 1983; Sleeman, Kelly, Martinak, Ward & Moore, 1989); in others, some advantage has been reported for more personalized remediation (e.g., Anderson, Conrad & Corbett, 1989; Shute, 1993a; Swan, 1983).

This issue would be easier to resolve if more researchers conducted controlled ITS evaluations. Not only have relatively few evaluations of ITS been reported, but there has also been little agreement on a standard approach for designing and assessing these systems (see 39.5). Results from six ITS evaluations are presented below.

19.5.1 Six ITS Evaluations

A few examples of systematic, controlled evaluations of ITS reported in the literature include: the LISP tutor (e.g., Anderson, Farrell & Sauers, 1984), instructing LISP programming skills; Smithtown (Shute & Glaser, 1990, 1991), a discovery world that teaches scientific inquiry skills in the context of microeconomics; Sherlock (Nichols, Pokorny, Jones, Gott & Alley, in preparation; Lesgold, Lajoie, Bunzo & Eggan, 1992), a tutor for avionics troubleshooting; Bridge (Bonar, Cunningham, Beatty & Weil, 1988; Shute, 1991), teaching Pascal programming skills; Stat Lady (Shute & Gawlick-Grendell, 1993), instructing statistical procedures; and the Geometry tutor (Anderson, Boyle & Yost, 1985), providing an environment in which students can prove geometry theorems. Results from these evaluations show that these tutors do accelerate learning with, at the very least, no degradation in outcome performance compared to appropriate control groups.

19.5.1.1. The LISP Tutor. Anderson and his colleagues at Carnegie-Mellon University (Anderson, Farrell & Sauers, 1984) developed a LISP tutor that presents students with a series of LISP programming exercises and provides tutorial assistance as needed during the solution process. In one evaluation study, Anderson, Boyle, and Reiser (1985) reported data from three groups of subjects: human-tutored, computer-tutored (LISP tutor), and traditional instruction (subjects solving problems on their own). The times to complete identical exercises were 11.4, 15.0, and 26.5 hours, respectively, and all groups performed equally well on the outcome tests of LISP knowledge. A second evaluation study (Anderson, Boyle & Reiser, 1985) compared two groups of subjects: students using the LISP tutor and students completing the exercises on their own. Both received the same lectures and reading materials. Findings showed that the traditional-instruction group took 30% longer to finish the exercises than the computer-tutored group; moreover, the computer-tutored group scored 43% higher on the final exam. So, in two different studies, compared to traditional instruction, the LISP tutor was apparently successful in promoting faster learning with no degradation in outcome performance.

In a third study using the LISP tutor to investigate individual differences in learning, Anderson (1990) found that when prior, related experience was held constant, two "meta-factors" emerged. These two meta-factors, or basic learning abilities, included an acquisition factor and a retention factor. Not only did these two factors explain variance underlying tutor performance, they also significantly predicted performance on a paper-and-pencil midterm and final examination.

A fourth study with the LISP tutor concerns the usefulness of productions for analyzing learning. In analyzing student performance on the first six problems in Chapter 3 of the LISP tutor, Anderson (1993, p. 32) discovered uneven, unsystematic trends in learning: one problem might be relatively easy and the next relatively difficult. However, by decomposing the problems into their constituent production rules, Anderson was able to convert these chaotic results into very systematic learning curves, for both time and accuracy. He analyzed performance on individual production rules across problems. Because some productions were reused, and others newly introduced, in each problem, he could plot performance in terms of the number of opportunities each production rule had to contribute an additional unit of LISP code. This simplifying transformation demonstrates that knowledge is acquired in terms of production rules, and that if we are to explain how cognitive skills are learned, our analyses of tasks and data ought to be conducted in terms of production rules.
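To make this transformation concrete, here is a minimal sketch of the analysis pattern in Python. The log records and production-rule names below are fabricated for illustration (the real analysis used student logs from the LISP tutor), and the power-law fit is one conventional way to summarize such practice curves:

```python
# Sketch: re-index tutor log records by per-rule "opportunity" number,
# then fit a learning curve. All data and rule names are hypothetical.
import numpy as np
from collections import defaultdict

# Each record: (problem, production_rule, coding_time_in_seconds).
# Problems reuse some rules and introduce others, so per-problem
# averages can look chaotic even when per-rule practice is systematic.
log = [
    ("p1", "define-function", 42.0), ("p1", "first-of-list", 55.0),
    ("p2", "define-function", 30.0), ("p2", "rest-of-list", 60.0),
    ("p3", "define-function", 24.0), ("p3", "first-of-list", 38.0),
    ("p4", "rest-of-list", 41.0),    ("p4", "first-of-list", 30.0),
]

# Group observations by each rule's opportunity number (1st use, 2nd use, ...).
times_by_opportunity = defaultdict(list)
uses = defaultdict(int)
for _problem, rule, seconds in log:
    uses[rule] += 1
    times_by_opportunity[uses[rule]].append(seconds)

n = np.array(sorted(times_by_opportunity))                   # opportunity numbers
t = np.array([np.mean(times_by_opportunity[k]) for k in n])  # mean time per opportunity

# Fit a power law of practice, T = a * n^b (b < 0), by log-log regression.
b, log_a = np.polyfit(np.log(n), np.log(t), 1)
print(f"T = {np.exp(log_a):.1f} * n^({b:.2f})")
```

Averaged this way, performance that looks erratic problem by problem becomes a smooth, monotonically improving curve across opportunities, which is the regularity Anderson reports.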

19.5.1.2. Smithtown. Shute and Glaser (1991) developed an ITS designed to improve an individual's scientific inquiry skills within a microworld environment for learning principles of basic microeconomics. In one study (Shute, Glaser & Raghavan, 1989), three groups of subjects were compared: a group interacting with Smithtown, an introductory economics classroom, and a control group. The curriculum was identical in both treatment groups (i.e., the laws of supply and demand). Results showed that while all three groups performed equivalently on the pretest battery (around 50% correct), the classroom and Smithtown groups showed equivalent gains from pretest to posttest (26.4% and 25.2%, respectively), and both significantly outperformed the control group. Although the classroom group received more than twice as much exposure to the subject matter as the Smithtown group (11 vs. 5 hours), the groups did not differ on their posttest scores. These findings are particularly interesting because the instructional focus of Smithtown was not on economic knowledge, per se, but rather on general scientific inquiry skills, such as hypothesis testing.

19.5.1.3. Sherlock. "Sherlock" is a tutor that provides a coached practice environment for an electronics troubleshooting task (Lesgold, Lajoie, Bunzo & Eggan, 1990). The tutor teaches troubleshooting procedures for problems associated with an F-15 manual avionics test station. The curriculum consists of 34 troubleshooting scenarios with associated hints. A study was conducted evaluating Sherlock's effectiveness using 32 trainees from two separate Air Force bases (Nichols, Pokorny, Jones, Gott & Alley, in preparation). Pre- and post-tutor assessment used verbal troubleshooting techniques as well as a paper-and-pencil test. Two groups of subjects per Air Force base were tested: (1) subjects receiving 20 hours of instruction on Sherlock, and (2) a control group receiving on-the-job training over the same period. Statistical analyses indicated no differences between the treatment and control groups on the pretest (means = 56.9 and 53.4, respectively). However, on the verbal posttest as well as the paper-and-pencil test, the treatment group (mean = 79.0) performed significantly better than the control group (mean = 58.9) and on a par with experienced technicians who had several years of on-the-job experience (mean = 82.2). The average gain score for the group using Sherlock was equivalent to almost four years of experience.

19.5.1.4. Pascal ITS ("Bridge"). An intelligent programming tutor was developed to assist novice programmers in designing, testing, and implementing Pascal code (Bonar, Cunningham, Beatty & Weil, 1988). The goal of this tutor is to promote conceptualization of programming constructs, or "plans," using intermediate solutions. A study was conducted with 260 subjects who spent up to 30 hours learning from the Pascal ITS (see Shute, 1991). Learning-efficiency rates were estimated from the time it took subjects to complete the curriculum; this measure involved both speed and accuracy, since subjects could not proceed to a subsequent problem until they had completely succeeded on the current one. To estimate learning outcome (i.e., the breadth and depth of knowledge and skills acquired), three criterion posttests were administered, measuring retention, application, and generalization of programming skills.

The Pascal curriculum embodied by the tutor was equivalent to about half a semester of introductory Pascal; that is, about 7 weeks, or 21 hours, of instruction time. Adding two hours per week of computer laboratory time (a conservative estimate) contributes another 14 hours, so the total time spent learning a half-semester of Pascal the traditional way would be at least 35 hours. In the study discussed above, subjects completed the tutor in considerably less time (mean = 12 hours, SD = 5 hours, normally distributed). So, on average, it would take about three times as long (35 vs. 12 hours) to learn the same Pascal material in a traditional classroom and laboratory environment as with this tutor.

While all subjects finished the Pascal ITS curriculum in less time than would be needed under traditional instructional methods, there were large individual differences in learning rates. For these subjects (who had no prior Pascal experience), the maximum and minimum completion times were 29.2 and 2.8 hours, a range of more than 10:1. In addition, while all 260 subjects successfully solved the various programming problems in the tutor's curriculum, their learning-outcome scores reflected differing degrees of achievement. The mean of the three criterion scores was 55.8% (SD = 19, normally distributed), and the range from the highest to the lowest score, 96.7% to 17.3%, represented large between-subject variation at the conclusion of the tutor. Accounting for these individual differences in outcome performance, Shute (1991) found that a measure of working-memory capacity, specific problem-solving abilities (i.e., problem identification and sequencing of elements), and some learning-style measures (i.e., asking for hints and running programs) together accounted for 68% of the outcome variance.
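A minimal sketch of this kind of variance-accounting analysis appears below. The data are simulated and every coefficient is invented; only the general pattern, an ordinary least-squares regression of outcome scores on working-memory, problem-solving, and learning-style measures with R-squared as the summary, mirrors the analysis Shute (1991) describes:

```python
# Sketch: regress learning-outcome scores on individual-difference
# measures and report variance accounted for. Data are simulated;
# the predictor names follow the study, the numbers do not.
import numpy as np

rng = np.random.default_rng(0)
n = 260  # sample size matching the Bridge study

wm = rng.normal(size=n)     # working-memory capacity
ps = rng.normal(size=n)     # problem identification / sequencing
hints = rng.normal(size=n)  # learning style: hint requests
runs = rng.normal(size=n)   # learning style: program runs

# Invented generating model, for illustration only.
outcome = 55.8 + 8 * wm + 6 * ps - 3 * hints + 2 * runs + rng.normal(scale=10, size=n)

X = np.column_stack([np.ones(n), wm, ps, hints, runs])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
resid = outcome - X @ beta
r2 = 1 - resid.var() / outcome.var()
print(f"R^2 = {r2:.2f}")  # Shute reported 68% of outcome variance
```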

19.5.1.5. Stat Lady. Two studies have been conducted to date with Stat Lady. One study (Shute, Gawlick-Grendell & Young, 1993) tested the efficacy of learning probability from Stat Lady in relation to a traditional Lecture group and a no-treatment Control group. Results showed that both treatment groups learned significantly more than the control group, yet there was no difference between the two treatment groups in pretest-to-posttest improvement after three hours of instruction. The results were viewed as very encouraging: not only was the lecture a more familiar learning environment for these subjects, but the professor administering the Lecture had more than 20 years' experience teaching this subject matter, whereas this was Stat Lady's first teaching assignment. When test items were separated into declarative and procedural categories, the researchers found that (a) students using Stat Lady acquired significantly more declarative knowledge than the other groups, but (b) when procedural skill acquisition was assessed, the Lecture group prevailed. Finally, a significant aptitude-treatment interaction was obtained, in which high-aptitude subjects learned significantly more from Stat Lady than from the Lecture environment, whereas for low-aptitude subjects there was no difference in learning outcome by condition. Together, these results suggest that a teacher-computer combination maximizes learning.
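An aptitude-treatment interaction of this kind is typically tested by including an aptitude-by-condition product term in a regression model. The following sketch uses simulated data with an invented effect structure; it shows only the general form of such a test, not the analysis actually reported:

```python
# Sketch: test for an aptitude-treatment interaction (ATI).
# All data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
aptitude = rng.normal(size=n)            # standardized aptitude score
stat_lady = rng.integers(0, 2, size=n)   # 1 = Stat Lady, 0 = Lecture

# Simulated pattern: the benefit of Stat Lady grows with aptitude.
gain = 25 + 5 * aptitude + 4 * stat_lady * aptitude + rng.normal(scale=6, size=n)

# Model: gain ~ aptitude + condition + aptitude*condition.
X = np.column_stack([np.ones(n), aptitude, stat_lady, aptitude * stat_lady])
beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
print(f"interaction coefficient = {beta[3]:.2f}")
# A reliably nonzero interaction coefficient is the ATI signature:
# the treatment effect depends on the learner's aptitude.
```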

The second study (Shute & Gawlick-Grendell, 1994) compared learning from Stat Lady with learning from a paper-and-pencil Workbook version of the identical curriculum, addressing the question: "What does the computer contribute to learning?" Findings showed that Stat Lady learners performed at least as well on the outcome tests as the Workbook group (and in some cases much better), again despite factors strongly favoring the traditional condition. Specifically, (a) Stat Lady was clearly the superior environment for high-aptitude subjects, (b) Stat Lady subjects acquired significantly more declarative knowledge than the Workbook subjects, and (c) regardless of aptitude, the majority of learners found the Stat Lady condition significantly more enjoyable and helpful than the Workbook condition.

19.5.1.6. Anderson's Geometry Tutor. The Geometry tutor (Anderson, Boyle & Yost, 1985) provides an environment in which students prove geometry theorems. The system monitors student performance and intervenes as soon as a mistake is made. The skill this system imparts is how to prove geometry theorems that someone else has supplied. Schofield and Evans-Rhodes (1989) conducted a large-scale evaluation of the tutor in place within an urban high school. Six geometry classes were instructed by the tutor (in conjunction with trained teachers), and three control classes were taught geometry in the traditional manner. The researchers closely observed the tutored and traditionally instructed classes for more than 100 hours. One of the most intriguing results of Schofield and Evans-Rhodes' (1989) evaluation was a counterintuitive side effect. Although the Geometry tutor was designed to individualize instruction, one of its pragmatic and unintended effects was to encourage students to share their experiences and cooperatively solve problems. Because students' experiences with the Geometry tutor were so carefully controlled by the immediate-feedback principles of its operation, the tutor guaranteed that those experiences were much more uniform than is the case in normal classrooms. As a result, students could more easily share and make use of one another's experiences and problem-solving strategies. The practical result was a great deal of cooperative problem solving.

19.5.2 Conclusions from the Six Evaluation Studies

These evaluation results all appear very positive regarding the efficacy of ITS; however, publication is always subject to a selection bias favoring unambiguous evidence of successful instructional interventions, and we are familiar with other (unpublished) tutor-evaluation studies that were conducted but "failed." Nevertheless, the general positive trend is encouraging, especially given the enormous differences among the six tutors in design structure as well as evaluation methods. The findings indicate that these systems do accelerate learning with no degradation in final outcome.

Obviously, principled approaches to both the design and evaluation of ITS are badly needed before we can definitively judge the merits of these systems. Some principled approaches are beginning to emerge. For example, Kyllonen and Shute (1989) outlined a taxonomy of learning skills that has implications for the systematic design of ITS. They hypothesized a multi-dimensional interaction predicting learning outcome as a function of: type of learning/instructional environment, type of knowledge/skill being instructed, subject matter, and characteristics of the learner (e.g., aptitude, learning style). With a few modifications to this taxonomy, Regian and colleagues at the Armstrong Laboratory are currently trying to fill in the cells in the matrix through systematic, empirical studies designed to assess performance across a range of these aforementioned dimensions. Their goal is to map instructional and knowledge-type variables to learning.
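As a rough illustration of what "filling in the cells" involves, the taxonomy can be thought of as a lookup table whose keys cross the four dimensions named above. In the sketch below, the levels listed for each dimension are placeholders invented for the example, not Kyllonen and Shute's actual categories:

```python
# Sketch: the learning-skills taxonomy as an empty outcome matrix.
# Dimension levels are illustrative placeholders only.
from itertools import product

environments = ["didactic", "guided practice", "discovery"]
knowledge_types = ["proposition", "rule", "skill"]
subjects = ["economics", "programming", "electronics"]
learner_aptitude = ["low", "high"]

# Each cell awaits an empirically estimated learning outcome.
outcome_matrix = {
    cell: None
    for cell in product(environments, knowledge_types, subjects, learner_aptitude)
}
print(len(outcome_matrix), "cells to fill")  # 3 * 3 * 3 * 2 = 54
```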

In terms of systematic approaches to evaluating ITS, Shute and Regian (1993) suggested seven steps for ITS evaluation: (1) Delineate goals of the tutor, (2) Define goals of the evaluation study, (3) Select the appropriate design to meet defined goals, (4) Instantiate the design with appropriate measures, number and type of subjects, and control groups, (5) Make careful logistical preparations for conducting the study, (6) Pilot test tutor and other aspects of the study, and (7) Plan primary data analysis concurrent with planning the study. These principles may also be employed as a framework for organizing, discussing, and comparing ITS evaluation studies.


Updated August 3, 2001
Copyright © 2001
The Association for Educational Communications and Technology

AECT
1800 North Stonelake Drive, Suite 2
Bloomington, IN 47404

877.677.AECT (toll-free)
812.335.7675

AECT Home Membership Information Conferences & Events AECT Publications Post and Search Job Listings