Workshops
2004 Workshop
Workshop on Perceptive Animated Interfaces and Virtual Humans
Ron Cole, CU
Javier Movellan, UCSD
Jonathan Gratch, USC
1. Vision and Purpose
We envision a new generation of human computer interfaces that engage users in natural face-to-face conversational interaction with intelligent animated characters. These perceptive animated interfaces will incorporate virtual humans that interact with people much as people interact with each other in face-to-face conversation. The interface will use language processing and machine perception technologies to locate, monitor and interpret the user’s speech, facial expressions, gaze and hand and body gestures. Lifelike computer characters, with personality and attitude, will orient to the user, provide real time feedback while the user is speaking (through head nods, facial expressions and other behaviors), and interpret the speaker’s auditory and visual behaviors to infer the user’s intentions and cognitive state. The animated agents will produce natural and expressive speech accompanied by contextually appropriate facial expressions and gestures consistent with the agent’s unique personality.
We propose work to establish a vital research community to stimulate and enable research and development of perceptive animated interfaces. Perceptive animated interfaces will be of great value to society, as they will revolutionize learning, training, interpersonal electronic communication, information access and retrieval and online transactions. The advent of intelligent animated agents will present unprecedented opportunities to engage and empower individuals to learn new skills, communicate more effectively, and increase their participation in the emerging information society. They can help people learn to read or to speak, and can liberate teachers from some routine teaching tasks and help them tailor the learning process to the specific needs of each student.
The invention of virtual humans within perceptive animated interfaces provides a new and exciting task domain for multidisciplinary research leading to development of converging technologies to improve human performance. For example, inventing virtual humans and assessing their effectiveness in different tasks (e.g., Web guides, science tutors, job counselors or therapists) will require the integration of new ideas and technologies about the realization of personalities through communication behaviors, and new architectures that can handle real-time interaction between individuals and agents across modalities operating at different time scales. The proposed workshop provides an opportunity for computer scientists, cognitive psychologists, social psychologists, personality researchers and other interested researchers to brainstorm with program managers from the NSF (and perhaps other agencies) about research challenges, infrastructure needs and program models that can focus the talents in cross-cutting efforts leading to perceptive animated interfaces. While perceptive animated interfaces are currently science fiction, tools and technologies exist today that could enable development and deployment of system prototypes in the next few years (Gratch et al., 2002; Cole et al., 2003). Indeed, initial efforts are currently underway to develop first generation systems incorporating perceptive animated agents in the context of intelligent tutoring systems designed to teach children to read and learn from text (Cole et al., 2003). To accelerate progress, it is crucial to establish a community of researchers from all relevant disciplines that will work together to initiate and undertake the many tasks required to make these interfaces a reality.
These activities include defining research goals and challenges, sharing knowledge about prevailing theories and methodologies in each discipline, proposing and designing system architectures for inventing and evaluating virtual humans, defining realistic and challenging task domains, building prototype systems as test beds for research, establishing evaluation criteria for measuring progress, and identifying and developing critical infrastructure that enables this work, and is accessible and available to all. A major goal of the proposed workshop is to assemble a team of interested colleagues who will work together to conceptualize and plan these tasks, and undertake some important first steps toward developing the knowledge base, research objectives and infrastructure needed to accelerate scientific research leading to initial systems incorporating virtual humans. The development of perceptive animated interfaces requires collaboration among researchers in many areas—psychologists, linguists, speech scientists, engineers and computer scientists with multidisciplinary expertise in human communication, interface design, speech and language technologies, dialogue modeling and management, computer vision and computer animation. While individual researchers, research labs and existing research communities represent knowledge and skills in each of these areas, no research community exists today that strives to focus the necessary multidisciplinary resources on research and development of perceptive animated interfaces incorporating virtual humans. We thus propose to begin efforts to establish this community by bringing together researchers to share their knowledge, expertise, tools and technologies and conceptualize and plan research and development activities leading to perceptive animated interfaces. 
We will seek researchers with complementary interests and demonstrated expertise in areas of human communication technology, human computer interaction, computer vision and animation, as well as researchers who study human communication, personality, emotions and gestures. The workshop organizers will also consider inviting individuals in other areas who may contribute to the workshop, such as researchers who study empathy and social resonance between therapists and patients.
2. The Challenge of Perceptive Animated Interfaces
Building systems that enable face-to-face communication with intelligent animated agents requires a deep understanding of the auditory and visual behaviors that individuals produce and respond to while communicating with each other. Face-to-face conversation is a virtual ballet of auditory and visual behaviors, with the speaker and listener simultaneously producing and reacting to each other’s sounds and movements. While talking, the speaker produces speech annotated by smiles, head nods and other gestures, while the listener provides simultaneous auditory and visual feedback to the speaker (e.g., “I agree,” “I’m puzzled,” “I want to speak.”). The listener may signal the speaker that she desires to speak; the speaker continues to talk, but acknowledges the nonverbal communication by raising his hand and smiling in a “wait just a moment” gesture. Face-to-face conversation is often characterized by such simultaneous auditory and visual exchanges, in which the sounds of our voices, the visible movements of our articulators, direction of gaze, facial expressions and head and body movements present linguistic information, paralinguistic information, emotions and backchannel cues, all at the same time. Inventing systems that engage users in accurate and graceful face-to-face conversational interaction is a challenging task. The system must simultaneously interpret and produce auditory and visual signals.
The system must interpret the user’s auditory and visible speech, eye movements, facial expressions and gestures, since these cues combine to signal the speaker’s intent; for example, a head nod can clarify reference, while a shift of gaze can indicate that a response is expected. Paralinguistic information is also critical, since the prosodic contour may signal that the user is being sarcastic. The animated agent must also produce accurate, natural, and expressive auditory and visible speech with facial expressions and gestures appropriate to the physical nature of language production, the context of the dialogue, and the goals of the task. Most important, the animated interface must combine perception and production to interact conversationally in real time: while the animated agent is speaking, the system must interpret the user’s auditory and visual behaviors to detect agreement, confusion, desire to interrupt, etc., and while the user is speaking, the system must both interpret the user’s speech and simultaneously provide auditory and/or visual feedback via the animated character. Developing such systems requires advances in speech recognition, natural language generation and synthesis, facial animation, recognition of facial expressions and gestures, dialogue interaction and imparting personalities to computer agents. In addition, realizing these scenarios requires a deeper understanding of the nature of human communication and human computer interaction. Most importantly, achieving these advances in knowledge and technology requires a community of researchers willing to work in an interdisciplinary manner and willing to go beyond the boundaries of well-established research communities. Speech researchers, for example, need to go beyond their traditional area of expertise and interact with computer vision researchers, psychologists, and computer animators. The rudiments of such a community are already established but are in dramatic need of consolidation.
3. Virtual Humans: An Emerging Field
What is the current state of research and development of virtual humans and how effective are these perceptive animated agents in improving human computer interaction? A vital and growing multidisciplinary community of scientists worldwide is addressing these questions, and significant efforts are underway to develop and evaluate virtual humans in various application scenarios. To date, researchers have generated powerful conceptual frameworks, architectures and systems for representing and controlling behaviors of animated characters to make them believable, personable and emotional (Albeck & Badler, 2002; Badler et al., 2002; Cassell et al., 2001; Gratch and Marsella, 2001; Loyall, 1997; Marsella & Gratch, 2001). Gratch et al. (2002) and Johnson et al. (2000) present excellent overviews of the scope of enquiry and the theoretical, cognitive and computational models underlying current research aimed at developing believable virtual humans capable of natural face-to-face conversations with people. Animated conversational agents have been deployed in a variety of application domains, including information kiosks, literacy tutors, and language training.
In pioneering work conducted over the past 10 years at KTH in Stockholm, Joakim Gustafson (2002) and his colleagues developed a series of multimodal dialogue systems of increasing complexity incorporating animated conversational agents: (1) Waxholm, a travel planning system for ferryboats in the Stockholm archipelago (Blomberg et al., 1993; Bertenstam et al., 1995); (2) August, an information system deployed for several months at the Culture Center in Stockholm (Gustafson et al., 1999; Lundeberg & Beskow, 1999), in which the animated character moved its head and eyes to track the movements of persons walking by the exhibit, and produced facial expressions such as listening gestures and thinking gestures during conversational interaction; and (3) AdApt, a mixed-initiative spoken dialogue system incorporating multimodal inputs and outputs, in which users conversed with a virtual real estate agent to locate apartments in Stockholm (Gustafson & Bell, 2002). AdApt produced accurate visible speech, used several facial expressions to signal different cognitive states and turn taking behaviors, and used direction of gaze to indicate turn taking and to direct the user to a map indicating apartment locations satisfying expressed constraints. These systems produced important insights into the challenges of developing and deploying multimodal spoken dialogue systems incorporating talking heads in public places. The poor quality of animated conversational agents developed to date is a major stumbling block to progress. Johnson et al. (2000) argue that it is premature to draw conclusions about the effectiveness of animated agents because they are still in their infancy, and “…nearly every major facet of their communicative abilities needs considerable research. For this reason, it is much too early in their development to conduct comprehensive, definitive empirical studies that demonstrate their effectiveness in learning environments. 
Because their communicative abilities are still very limited compared to what we expect they will be in the near future, the results of such studies will be skewed by the limitations of the technology.” While significant progress has been made in development and integration of core technologies into virtual humans since Johnson’s article, it is clearly the case that many of the technologies needed to enable virtual humans to behave like people are still in their infancy.
An Emerging Community: A number of research communities are emerging that focus on different aspects of the general problem of face-to-face communication with intelligent animated agents. For example, there are active (and largely independent) groups of researchers investigating audio-visual speech processing, face and gesture recognition, affective computing, and perceptive interfaces. These groups tend to get together via conference workshops or annual meetings that run in collaboration with larger, more established conferences. Examples of such meetings include the International Conference on Face and Gesture Recognition, the Multisensory Research Conference, the NIPS workshop on affective computing, the ICCV workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, the workshop on Lifelike Computer Characters, the AVSP (Audio Visual Speech Processing) workshop, and the UIST annual workshop on Perceptive User Interfaces. Despite initial efforts by various groups to address areas related to research and development of virtual humans, the situation today is that researchers in different fields are working more or less independently in separate areas within psychology, linguistics and computer science, studying problems related to human communication, expression of emotions and gestures during communication, spoken dialogue systems, computer vision, computer animation and gesture recognition.
Together, these fields provide a significant base of knowledge and methods that are critical to development of virtual humans. It is therefore critically important to hold workshops that bring researchers from these diverse fields together to share knowledge, help establish a coherent research community, and formulate a vision and plan for inventing virtual humans. A workshop organized by Co-PI Jonathan Gratch provides an important first step in this direction, and the journal article resulting from the workshop is an excellent starting point for conceptualizing this field.
To summarize, development of virtual humans is still in its infancy. In the past ten years, a small but emerging community of researchers has made great progress by identifying the scope of multidisciplinary research required and the key research challenges that need to be addressed, and by offering strong theoretical, conceptual and computational frameworks that provide a foundation for multidisciplinary research among computer scientists, cognitive scientists, psychologists and researchers in other disciplines. Although much innovative research has been conducted, we conclude that experiments investigating the efficacy of animated agents are limited today by constraints imposed by the state of the art of human communication technologies, including speech and language technologies, computer vision, and real time character animation. A grand challenge is to develop new architectures and technologies that will enable experimentation with perceptive animated agents that are more natural, believable and graceful than those available today. To meet this challenge, it is necessary to bring together a community of researchers that can work together to achieve a vision and research agenda for accelerating research and development of virtual humans.
4. Building on Prior Work
Establishing a vital research community should be guided by prior efforts that produced successful outcomes, and leverage existing infrastructure. In this section, we describe program models and critical infrastructure already in place that can “jump start” research and development efforts in perceptive animated interfaces. DARPA speech programs, which now span over three decades, provide important insights for establishing research programs and communities. These programs brought together researchers who worked together to define challenging tasks and conceptualize systems that could be developed in 3-4 years to achieve targeted levels of performance on the designated tasks. The research community then worked together to identify the infrastructure needed to develop the proposed systems, and developed rigorous evaluation methodologies and metrics to measure and compare performance of different systems, and measure progress over time. One of the key lessons learned from these programs is the critical importance of infrastructure, and the remarkable amount of work required to produce it. In the area of speech recognition (which is just one component of a perceptive animated interface) infrastructure includes annotated speech corpora, pronunciation dictionaries, lexicons, and tools for training and evaluating speech recognition systems. Development of infrastructure in each of these areas represents many thousands of hours of work! The good news is that significant infrastructure has been developed in recent years that can be used today to enable research and development of perceptive animated interfaces. This infrastructure includes annotated speech corpora, research tools and research systems that have been placed in the public domain for research use.
These include the Interactive Book Architecture (a Java-based extension of the Galaxy architecture developed at MIT) that supports natural, mixed-initiative, spoken dialogue interaction with animated characters, Interactive Book authoring tools, and the CU Animate system developed at the University of Colorado. We briefly describe these systems to show that powerful tools that can be shared with other researchers are now available to support research and development of perceptive animated interfaces.
Interactive Book Architecture & Runtime Environment. Under NSF ITR and IERI grants, researchers at CSLR have developed an authoring and runtime environment for developing Interactive Books that incorporate perceptive animated agents. The animated agents interact with students to help them learn to read and understand what they read. Interactive books enable students to converse with animated characters, read aloud with immediate feedback on the pronunciation of each spoken word, click on words to have them pronounced or defined, and interact with media objects (text, objects in illustrations, etc.) using speech, typing or mouse clicks. The interactive book is displayed on the student’s client machine, but interaction with animated characters and other media objects occurs via communication with and among technology servers in the Interactive Book architecture. These technology servers include:
• Audio Server – Receives signals from microphone or telephone and sends them to the recognizer. Also sends synthesized speech to PC speakers or telephone.
• Sonic Speech Recognizer – Takes signals from an audio server and produces a word lattice. Developed by Bryan Pellom at CSLR, it will be extended as part of the proposed work.
• Phoenix Natural Language Parser – Takes word lattice from recognizer and produces the “best” interpretation of the recognized utterance.
• Confidence Server – Takes hypothesis and semantic parse from the speech recognizer and parser as input and annotates the words and concepts with levels of confidence.
• Dialogue Manager – Resolves ambiguities; estimates confidence in the extracted information; clarifies with user if required; integrates with current dialogue context; builds database queries; sends data to NL generation for presentation to user; prompts user for information.
• Database / Backend Server – Receives SQL queries from Dialogue Manager; interfaces to SQL database; retrieves data from the web to enable learning tools to access online information.
• Natural Language (NL) Generator – Constructs strings of words to speak back to the user based on the current dialog action.
• Text-to-Speech (TTS) Synthesizer – Receives word strings from NL generation; synthesizes them to be sent to the audio server.
• CU Animate Character Animation Server – Receives a string of symbols (phonemes, animation control commands) with start and end times from the TTS server, and produces visible speech, facial expressions and other gestures in synchrony with the speech waveform. (Descriptions of each of these modules and publications providing more detail are available on the CSLR Web site at http://cslr.colorado.edu)
• MPL Face Tracker – Tracks faces in real time (at 30 frames per second) under arbitrary illumination conditions and backgrounds (which may include moving objects). The face detector communicates the location of the user’s face to the animation server, which, by triangulating between the user, camera and animated agent, allows the animated agent’s eyes to track the user.
• MPL Emotion Monitor – A prototype system that classifies facial expressions into seven emotion dimensions: neutral, angry, happy, disgusted, fearful, sad, and surprised. The system will be integrated into interactive books in the near future.
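The gaze tracking described for the MPL Face Tracker above can be illustrated with a small geometric sketch. This is not the CSLR or MPL implementation; it only shows one plausible way to turn a tracked face location into eye rotation angles. The coordinate frame, the pinhole camera model, the depth estimate and every name below (Point3D, face_position, gaze_angles) are assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Point3D:
    """Position in metres in an assumed screen-centred frame:
    x to the right as seen from the screen, y up, z out toward the user."""
    x: float
    y: float
    z: float


def face_position(u, v, depth_m, camera, focal_px, cx, cy):
    """Back-project a tracked face centre (u, v), in pixels, to 3D.

    Assumes a pinhole camera mounted on the display and looking along +z,
    with focal length focal_px (pixels) and principal point (cx, cy).
    depth_m is an assumed distance estimate (for example, from face size).
    """
    x = camera.x + (u - cx) * depth_m / focal_px
    y = camera.y - (v - cy) * depth_m / focal_px  # image v grows downward
    return Point3D(x, y, camera.z + depth_m)


def gaze_angles(eye, target):
    """Return (yaw, pitch) in degrees that point the agent's eye at target.

    Yaw is rotation toward +x, pitch is rotation toward +y; both are zero
    when the target lies straight ahead of the eye along +z.
    """
    dx = target.x - eye.x
    dy = target.y - eye.y
    dz = target.z - eye.z
    yaw = math.degrees(math.atan2(dx, dz))
    pitch = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return yaw, pitch


if __name__ == "__main__":
    # Hypothetical layout: camera 20 cm above screen centre,
    # agent's eyes drawn 5 cm above screen centre.
    camera = Point3D(0.0, 0.20, 0.0)
    agent_eye = Point3D(0.0, 0.05, 0.0)
    face = face_position(u=420.0, v=240.0, depth_m=0.6,
                         camera=camera, focal_px=500.0, cx=320.0, cy=240.0)
    yaw, pitch = gaze_angles(agent_eye, face)
    print(f"yaw={yaw:.1f} deg, pitch={pitch:.1f} deg")
```

In a running system the face tracker would supply (u, v) for each video frame and the animation server would apply the resulting angles to the character's eye joints; smoothing the angles over time would likely be needed to avoid jitter.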
Interactive Book Authoring Tools. Interactive books provide a test bed for research, development and evaluation of perceptive animated interfaces with virtual humans. (They are currently being used to teach children to read in schools in Boulder, Colorado.) To facilitate application development, authoring tools have been developed to enable designers to create interactive learning experiences. Designers can create content by typing in text, or scanning text and illustrations. Once text and illustrations have been input, designers can orchestrate interactions between students, animated characters and various media objects using any of the technology servers in the Interactive Book architecture. Developers can cause animated characters to narrate their parts in a story using synthetic or naturally recorded speech, mark up text to control the character’s facial expressions and gestures while speaking, and design interactive spoken dialogues with characters to talk about the story or to test comprehension.
CU Animate. CU Animate is a toolkit designed for research, development and control of 3D animated characters for use in perceptive animated interfaces. Nine characters developed in 3D Studio Max have been imported into CU Animate. Each character was designed with a full body and fully articulated skeletal structure, with sufficient polygon resolution to produce natural animation in regions where precise movements are required, such as lips, tongue and finger joints. CU Animate provides real time rendering of animated characters by controlling parameters and/or morphing between target states. In addition to providing a public domain platform for research, CU Animate provides a graphical user interface for designing arbitrary animation sequences. These sequences can then be tagged (or iconified) and inserted into text strings, so that characters will say the text while producing desired emotions and gestures.
UCSD Head Tracking System.
A head tracking system has been developed by Co-PI Javier Movellan at UCSD and integrated into the Galaxy architecture for distribution with CSLR toolkits. The head tracker, which accurately locates and tracks the user’s face in real time, represents an important first step towards integration of state of the art computer vision technology and computer animated interfaces. Once the location of the user’s face is known, further research can be undertaken to interpret visible speech, etc., and research advances can be integrated readily into working systems. The system has been placed in the public domain.
5. Proposed Work
The main goals of the workshop are (a) to understand prior work and research challenges required to develop perceptive animated interfaces and virtual humans; (b) to determine practical steps and activities required; and (c) to initiate a set of activities that will help establish a vital community of researchers who will work together to accelerate progress. The proposed workshop will be held in Boulder, Colorado. It will be hosted by Ron Cole at CSLR at a venue in Boulder, perhaps the Boulderado Hotel. The workshop organizers will form an organizing committee, and work with this committee to develop a detailed agenda, and to identify the key issues that will form the topics for breakout sessions. The workshop will address the following questions:
• What is the state of scientific knowledge about perception, production and interpretation of auditory and visual behaviors during face-to-face communication? How are these behaviors influenced by task domain, social influences, and other variables? What knowledge can be applied immediately to the design of perceptive animated interfaces? What scientific knowledge is missing, and what research is required to gain this knowledge?
• What are the capabilities and limitations of technologies and methodologies currently used in research, development and evaluation of advanced dialogue systems? What sorts of perceptive animated interfaces can these technologies support today? What key research breakthroughs are needed in speech and language technologies, what is required to achieve these breakthroughs, and how will these breakthroughs translate into more effective perceptive animated interfaces?
• What is the state of the art of computer vision technology relative to monitoring and interpreting visual behaviors to enable face-to-face communication with an animated character? What is the missing science? What key research breakthroughs are needed to enable perceptive interfaces? What research tools, corpora, and systems are currently available to enable research and development efforts? What new infrastructure is needed to conduct research? What effort and cost is required to develop this infrastructure?
• What is the state of the art of animation technology? What research and development activities are needed to produce natural and contextually appropriate facial expressions, eye movements, and hand and body movements in different tasks? What infrastructure is required to achieve key research breakthroughs?
• What architectures have been proposed or implemented to support real time dialogue interaction between users and virtual humans? Does the proposed system architecture and task domain enable researchers to achieve and measure key research objectives? How do we measure and evaluate the performance of these systems and system modules? How do we compare different systems? How do we measure progress over time?
• What systems could and should be developed to serve as test beds for research and development of perceptive animated interfaces? What task domain(s) should be selected?
• What resources (annotated corpora, research tools, etc.) are needed to study relevant communication behaviors between people or between people and machines, and to enable researchers to train and evaluate machine perception and generation algorithms? (In the appendix below, we explain the critical importance of developing corpora to enable research in perceptive animated interfaces, and provide examples of how development of corpora has accelerated progress in science.)
• What standards are required to assure interoperability of system components and real time interaction over communication channels?
• What metrics and methodologies are required to evaluate and compare systems and system components, and to measure progress within and between research and development sites over time?
• What concrete steps should the research community take to stimulate and sustain research, and to create a strong and enduring community that will realize the vision of perceptive animated interfaces?
A major goal of the workshop is to understand (and hopefully plan some of) the activities required to provide answers to the above questions. To this end, the workshop organizers will work with invitees before the workshop to develop literature reviews and position papers related to key issues, and afterwards to modify these position papers based on the results of the workshop. The workshop organizers will also work with the invitees to develop an agenda for the workshop that consists of focused breakout groups and other activities designed to optimize the outcomes of the workshop, which will be measured in terms of concrete proposals and plans to conduct activities that address the above questions. The workshop will produce a set of recommendations to the NSF on programs or initiatives that could be undertaken to support activities required to establish a vital community leading to perceptive animated interfaces and virtual humans. Given the new knowledge, technologies and systems, and the positive impact of these systems on society, we believe this is an important and time-critical task.
Although we have asked for funding for a single workshop, the authors of this proposal believe that the objective of quickly establishing a vital research community (one that can develop critical infrastructure and initiate research and development efforts leading to systems that can be evaluated on common tasks) will be better served through a series of workshops. A single workshop can bring scientists together to conceptualize, plan and recommend future steps toward perceptive animated interfaces. A series of workshops could establish a field of research and produce concrete research results and even initial systems through collaborative activities. For example, audio and visual recordings of children’s speech and emotions (described in the appendix below) could be annotated and shared by the community for research, development and comparison of kids’ speech recognition, face tracking and emotion classification algorithms. Research tools and systems (e.g., UCSD MPL’s face tracking system; CSLR’s Communicator system and animation toolkit) could likewise be shared, and research collaborations initiated.
6. Expected Outcomes
Outcomes of the proposed work will include published reports of the workshop reflecting the activities, plans and recommendations of the participants, and a final report summarizing the accomplishments and recommendations of the project. In addition, we expect to identify critical infrastructure needs, and to design and develop annotated corpora that can be used to enable research on auditory and visual recognition and synthesis. Finally, we expect to formulate and initiate a plan to conduct collaborative work at multiple sites to develop some initial systems that will serve as test beds for research and development of perceptive animated interfaces.
Success in the proposed work will be realized by the existence of a dedicated community of researchers who meet regularly, demonstrate new systems incorporating research breakthroughs, work with companies to develop products that benefit society, and who share advances in science and technology both within and outside the research community.
7. Effort Allocation
Ron Cole, Javier Movellan and Jonathan Gratch will organize the workshop and take primary responsibility for authoring the final report. We will also work together to organize smaller satellite meetings at conferences that bring researchers with complementary interests together; and to organize special sessions at conferences to describe workshop recommendations and to engage other researchers. The authors of this report have organized over twenty different workshops, including several workshops funded by the NSF focused on developing new research agendas and programs. Ms. Terry Durham will organize the workshop. Ms. Durham has over ten years of experience organizing and running workshops, initially as Center Administrator at the Center for Spoken Language Understanding at OGI, and more recently as Center Administrator at CSLR. Both during and after the workshop, the PI and co-PIs will work tirelessly to promote and facilitate collaborative research projects among members of the emerging perceptive animated interface community, and enlist colleagues to participate in development of critical infrastructure. To this end, both CSLR and UCSD will work hard to share expertise, tools, technologies and systems with other researchers.
8. Guest List
We will work hard to identify university researchers in the U.S., and one or two researchers from the EU (e.g., Bjorn Granstrom from KTH) who are leaders in areas related to research and development of perceptive animated interfaces and virtual humans. Leading researchers from the commercial sector may be invited at their own expense.
We will work to achieve a good balance between researchers who study human communication, including individuals who study expression and generation of emotions and gestures in different areas (e.g., there is a field of research that studies social resonance and empathy in therapeutic settings), and individuals who research and develop multimodal systems and component spoken dialogue, computer vision and computer animation technologies. Individuals invited to the workshop will be selected from areas of communication, cognitive science, computer science and engineering, counseling, linguistics, psychology and speech science.
Appendix
Critical Importance of Corpora for Research and Development of Perceptive Animated Interfaces
We believe that the lack of useful corpora stands as a main barrier to development of core visual recognition and synthesis technologies underlying perceptive animated interfaces. Annotated corpora are critically important for studying the visual dimensions of conversational behavior, for developing accurate and robust recognition and synthesis systems, and for evaluating research results. A main goal of the proposed work is to specify corpora that will optimize research and development of perceptive animated interfaces. The speech recognition community provides a good model for these activities, as it has been successful at developing and establishing a large number of publicly available speech corpora for a large number of tasks, including media broadcasts, telephone conversations (in many languages) and spoken dialogue systems. The importance of speech corpora is underscored by the efforts of NSA, DARPA and NSF in establishing the Linguistic Data Consortium, and an NSF DARPA initiative to develop language resources. Progress in speech recognition and understanding research over the years is tied directly to the development of language resources resulting from these and other initiatives. (We note that PI Cole, while director of the Center for Spoken Language Understanding at OGI, planned and supervised development of twenty speech corpora that were made available to university researchers free of charge.)
In those few cases in which effort has been devoted to developing corpora for computer vision research, the results have been dramatic. One example is the work of Jonathon Phillips at NIST on the FERET database of static images for testing and evaluating face recognition systems. Because of the FERET effort, companies have been founded and commercial products developed for computer vision-based person identification.
To enable development of perceptive animated interfaces, similar efforts must be undertaken to develop digital video databases of spontaneous facial behavior. Corpus development efforts in this area typically involve terabytes of data, at a cost which until very recently was prohibitive for the average research laboratory. For this reason, the application of machine perception research to recognition of video sequences is at an impasse. An initial stage of early proof-of-concept developments in the late eighties and early nineties has been followed by slow progress due to the lack of realistic databases to train and evaluate different approaches. For example, Eric Petajan’s dissertation in 1984 pioneered machine perception work on audio-visual speech recognition (combining information from the acoustic signal and movements of the lips and lower face region). Following this work, a variety of approaches were developed in the early nineties by different research labs, establishing the seeds for an emerging community (Mase and Pentland, 1989; Sejnowski et al., 1989; Stork et al., 1992; Bregler et al., 1993; Movellan, 1994; Luettin et al., 1997). Such systems were promising but, due to the lack of standard databases, research never progressed beyond the proof-of-concept stage.
Work on automatic expression recognition from video followed a similar path. Early systems developed in the nineties established proof of concept using small locally developed databases (Mase, 1991; Yacoob & Davis, 1994; Rosenblum, Yacoob, & Davis, 1996; Essa & Pentland, 1997; Terzopoulos & Waters, 1993; Li, Roivainen, & Forchheimer, 1993; Cottrell & Metcalfe, 1991; Bartlett et al., 2000). While these systems demonstrated that automatic recognition of expressions is feasible, progress beyond these initial demonstrations has been slow due to the lack of realistic standard databases.
In the absence of such databases it is almost impossible to compare different systems and to evaluate progress towards realistic applications. In order to make serious progress on machine perception applied to video sequences of the human face and to produce accurate and graceful animation of conversational behaviors by animated characters, it is crucial to develop corpora of spontaneous facial behavior. Collaboration between machine perception experts and behavioral scientists is needed to choose recording situations that produce rich and interesting spontaneous behaviors in a variety of tasks.
For both video and motion capture data, we will follow two independent paths. First, we will establish a special interest group within the emerging perceptive interface community to conceptualize and propose a set of corpora and associated corpus development activities for video data capture and motion capture. Activities and issues that will be addressed include identifying the behaviors to be measured, selecting tasks and designing protocols to elicit these behaviors, selecting subjects, placing cameras and motion sensors, transcribing data, representing and analyzing data, and estimating the efforts and costs associated with these activities. Second, we will design, collect, transcribe and analyze data to create some initial corpora. We believe it is crucial to conduct actual video and motion capture corpus development efforts so that we can understand firsthand the issues and costs involved for these and future efforts. Moreover, these efforts will produce useful public domain corpora for computer vision and animation research.
Video Data. As part of our collaborative efforts to develop perceptive animated interfaces to help children learn to read, Cole and Movellan are collecting audio and digital video data from 1000 children in first through fifth grade.
The protocol for this corpus contains prompted speech, read speech, and spontaneous speech from children in kindergarten through grade 7. A variety of sub-tasks are included, such as individual phonetic sounds, isolated letters and numbers, isolated words, common human-computer related commands, words related to mathematics and time, and grade-specific sentences for the prompted-speech section. Sections of the protocol are also designed to elicit a range of expressions including concentration, confusion and stress. This corpus, which is huge by current computer vision standards, will serve as an ideal test bed to help establish transcription procedures and conventions, and for training and evaluating facial recognition systems for head tracking, speech reading and facial expression recognition. The corpus will be offered to the research community as a possible starting point to develop transcription conventions and to train and compare face recognition and visible speech recognition systems.
Motion Capture. Motion capture techniques are commonly used today to create 3D animation sequences in video games and motion pictures. Sensors are placed on the body, and X, Y, Z coordinates are recorded while actors produce specific movement sequences. The sequence of coordinates can then be mapped directly to corresponding points of 3D models to produce animated sequences that are extremely close to the original human behaviors. Motion capture has also been applied to animation of speech movements and facial expressions with great accuracy. See, for example, the video clips of 3D characters producing speech at www.pyros.com produced using motion capture. A breakout group at the workshop will address issues related to describing tasks and procedures for collecting and analyzing motion capture data.