Boulder Language Technologies

Report to the Interactive Systems Grantees' Workshop
These slides summarize the presentations and discussions at the Language Resources Workshop, held 8/16/97. They were prepared for the attendees at the NSF ISGW and delivered as a 15-minute presentation on 8/18/97.

Along with the workshop background material and the Language Resources Primer referenced below, these are raw materials for the final workshop report that Mark Liberman and Ron Cole will write.

Language Resources Workshop

Skamania Lodge 8/16/97

Workshop goals:

  • Community feedback about language resource needs.
  • Basis for interagency cooperation on language resource funding.
For more information, follow the links on the ISGW page.

The results may matter to you!

Your comments are welcome.
 

What are "language resources"?

    The term language resources (LR) refers to sets of language data and descriptions in machine readable form, used specifically for building, improving or evaluating natural language and speech algorithms or systems, and in general, as core resources for the software localization and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users. Examples of linguistic resources are written and spoken corpora, computational lexicons, grammars, terminology databases, basic software tools for the acquisition, preparation, collection, management, customization and use of these and other resources.

    The relevance of evaluation in Language Engineering is increasingly recognized. This involves assessment of the state-of-the-art for a given technology, measuring the progress achieved within a program, comparing different approaches to a given problem and choosing the best solution, knowing its advantages and drawbacks, assessment of the availability of technologies for a given application, and finally product benchmarking. It accompanies research and development in Human Language Technologies, and has driven important advances in the recent past in various aspects of both written and spoken language processing. Although the evaluation paradigm has been studied and used in large national and international programs, including the US ARPA HLT program, EU Language Engineering projects, the Francophone Aupelf-Uref program and others, particularly in the localization industry (LISA and LRC), it is still subject to substantial unresolved basic research problems. 

             --preliminary call, First International Conference on Language Resources and Evaluation

The good news

Generally available language resources in 1986:
  • Text:        only the Brown corpus
  • Speech:     nothing
  • Lexicons:  nothing
After 10 years of effort to create shared resources, in 1997 we have:
  • Text:          billions of words of English; lots in ~15 other languages
  • Speech:     almost 2,000 hours of transcribed speech
  • Lexicons:   WordNet, COMLEX, pronouncing dictionaries, much else
  • New types of resource, e.g.
    • task-oriented dialogues
    • multiply-annotated corpora
    • large multilingual multimedia IR collections
Saturday's questions:
  • What are the resource needs from now to 2007?
  • How can they be satisfied most efficiently and effectively?


What we did on Saturday

Morning -- presentations:
  • Agency perspectives on language resources
  • Examples of current models for resource creation and distribution
    • Linguistic Data Consortium
    • OGI CSLU corpora and tools
    • WordNet
    • Text Encoding Initiative
    • Discourse Resource Initiative
Lunch and afternoon -- breakout groups and general discussion:
  • For-instance projects in three areas
    • Multimodal machine translation
    • Multimodal understanding
    • Universal access
  • Resource needs of each
   

Some recurrent themes

1. Improved hardware/software leads to
  • new opportunities for distributed data collection and distribution
  • new resource needs, e.g.
    • speech from wider population
    • multimedia archives
    • multimodal HCI data
2. Multilingual resources are increasingly important

3. Intellectual property rights will be a continuing problem, because

  • internet makes electronic rights more valuable
    • especially for large and realistic resources
  • broadcast material is strongly protected
  • webcrawler IPR cops will hunt you down
  • WIPO treaty and other initiatives loom
4. Resources for speech synthesis research are relatively cheap but much needed.

5. Still too many barriers to resource sharing among researchers!  

 

 

Resources for Future Needs

A brainstorming exercise.

Three general areas:

  • Multimodal machine translation
  • Multimodal language understanding and summarization
  • Universal access
In each area, specify 1-3 concrete research goals.
For each goal,
       what key resources would be needed to support a research effort?

Saturday's goal was a preliminary set of examples,
       not an exhaustive listing
       or a determination of overall national priorities.
Ongoing discussion will clarify and refine the results.

 

I. Multimodal machine translation

A. Sample task definitions

1. text, speech, OCR in multiple languages >> text in English

2. computer-mediated multilingual communication

3. talking head in >> talking head out

B. Resources needed

    Parallel text corpora
    Additional multilingual speech and text data (e.g. large VOA corpus)
    Multilingual lexical resources,
              including lists of fixed expressions, proper names, and idioms
    Annotated multilingual text (sense and structure)
    Multilingual speech synthesis corpora
    Resources for high-quality generation
 

II. Multimodal understanding

A. Sample task definitions

1. computer as moderator/facilitator of keyboard-based collaboration

            (email exchanges, IRC, and similar things)

2. meeting summarization based on video and audio monitoring

B. Resources needed

    Multimodal human-human interaction corpora
    Coding schemes for useful annotation
    Part of "American national multimedia corpus"?
     

III. Universal access

A. Sample task definitions

1. phone-based access to on-line information

          Obviously useful technology, but controversial as key to universal access

2. speech and language technology applied in primary education

          To teach language skills or in support of teaching other topics

B. Resources needed

    Basis for improved generation and synthesis
    Better corpora of human-human information access
    Speech-language-interaction data from school children
 
