Language technology and the linguistic sciences are confronted with a vast array of language resources, richly structured, large and diverse. Multiple communities depend on language resources, including linguists, engineers, teachers and actual speakers. Many individuals and institutions provide key pieces of the infrastructure, including archivists, software developers, and publishers. Today we have unprecedented opportunities to connect these communities to the language resources they need.
We can observe that the individuals who use and create language resources are looking for three things: data, tools, and advice. By data we mean any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of hand-written index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar. By tools we mean computational resources that facilitate creating, viewing, querying, or otherwise using language data. Tools include not just software programs, but also the digital resources that the programs depend on, such as fonts, stylesheets, and document type definitions. By advice we mean any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data, and so forth (e.g. the Corpora List archives [http://www.hit.uib.no/corpora/]). In the context of OLAC, the term language resource is broadly construed to include all three of these: data, tools and advice.
Unfortunately, today's user does not have ready access to the resources that are needed. Figure 1 offers a diagrammatic view of this reality. Some archives (e.g. Archive 1) do have a site on the internet which the user is able to find, so the resources of that archive are accessible. Other archives (e.g. Archive 2) are on the internet, so the user could access them in theory, but the user has no idea they exist so they are not accessible in practice. Still other archives (e.g. Archive 3) are not even on the internet. And there are potentially hundreds of archives (e.g. Archive
There are many other problems inherent in the current situation. For instance, the user may not be able to find all the existing data about a language of interest because different sites have called it by different names (low recall). The user may be swamped with irrelevant resources because search terms have important meanings in other domains (low precision). The user may not be able to use an accessible data file for lack of being able to match it with the right tools. The user may locate advice that seems relevant but have no basis for judging its merits.
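The recall and precision problems can be made concrete with a small calculation. The following is only an illustrative sketch: the catalogue titles are invented, and the search is a naive substring match rather than any real search engine. It uses the fact that one language may be catalogued under several names (here Fula, Fulfulde and Peul).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented catalogue entries (assumption, for illustration only).
catalogue = {
    1: "Fula word list",
    2: "Fulfulde grammar sketch",       # same language, different name
    3: "Peul folk tales, transcribed",  # same language, different name
    4: "Fula recordings",
    5: "Fula Street cafe menu",         # not a language resource at all
}
relevant = {1, 2, 3, 4}

# Naive search on a single name variant.
retrieved = {i for i, title in catalogue.items() if "fula" in title.lower()}
precision, recall = precision_recall(retrieved, relevant)
```

In this toy case the search misses the two records filed under other names (recall 0.5) and picks up one record that is unrelated to the language (precision 2/3).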
As web-indexing technologies improve, one might hope that a general-purpose search engine would be sufficient to bridge the gap between people and the resources they need. However, this is a vain hope. First, many language resources, such as audio files and software, are not text-based. Second, many language names have several variants, and these various strings regularly denote things other than languages. Third, much of the material is not, and will never be, documented in free prose on the web. In place of traditional web-indexing, two new initiatives provide the necessary infrastructure for language resource discovery.
The Dublin Core Metadata Initiative began in 1995 to develop conventions for resource discovery on the web [dublincore.org]. The Dublin Core metadata elements represent a broad, interdisciplinary consensus about the core set of elements that are likely to be widely useful to support resource discovery. The Dublin Core consists of 15 metadata elements, where each element is optional and repeatable: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights. This set can be used to describe resources that exist in digital or traditional formats.
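The "optional and repeatable" property of the 15 elements can be sketched as a simple data structure. This is a minimal illustration, not an official Dublin Core encoding: the function name `make_record` and the example values are hypothetical.

```python
# The 15 Dublin Core elements as listed in the text.
DUBLIN_CORE_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)

def make_record(**fields):
    """Build a record in which each element maps to a list of values.

    Optional: any element may simply be absent.
    Repeatable: any element may carry several values.
    Names outside the 15 elements are rejected.
    """
    record = {}
    for name, value in fields.items():
        if name not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        # Accept either a single string or an iterable of strings.
        record[name] = [value] if isinstance(value, str) else list(value)
    return record

# Hypothetical record, invented for illustration.
example = make_record(
    Title="A sketch grammar of an example language",
    Creator=["First Author", "Second Author"],  # repeated element
    Language="Example language",
)
```

Here `Creator` appears twice and elements such as `Publisher` are omitted, which is all that "optional and repeatable" requires.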
The Open Archives Initiative (OAI) was launched in October 1999 to provide a common framework across electronic preprint archives, and it has since been broadened to include digital repositories of scholarly materials regardless of their type. To implement the OAI infrastructure, an archive must comply with two standards: the OAI Shared Metadata Set (Dublin Core), which facilitates interoperability across all repositories participating in the OAI, and the OAI Metadata Harvesting Protocol, which allows software services to query a repository using HTTP requests.
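To give a sense of what querying a repository over HTTP looks like, the sketch below builds a harvesting request URL. Several things here are assumptions: the base URL is hypothetical, and the verb `ListRecords` and the parameter `metadataPrefix=oai_dc` follow the protocol as later published (OAI-PMH 2.0), which may differ in detail from the version described here. No request is actually sent.

```python
from urllib.parse import urlencode

def build_harvest_url(base_url, verb, **params):
    """Return a harvesting request URL: base URL plus an encoded query string."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"

# Hypothetical repository address (assumption, not a real archive).
BASE_URL = "http://archive.example.org/oai"

# Ask the repository for all of its records in Dublin Core form.
url = build_harvest_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A harvester would fetch this URL and parse the XML response. That step is omitted because it depends on the repository and the protocol version.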
OAI archives are called "data providers," and typically have a submission procedure, a long-term storage system, and a mechanism permitting users to obtain materials from the archive. An OAI "service provider" provides end-user services, such as search functions over union catalogs, based on metadata harvested from one or more data providers. Figure 2 illustrates a single service provider accessing three data providers using the OAI metadata harvesting protocol. End-users interact only with service providers.
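The division of labour between data providers and a service provider can be sketched in a few lines. This is an in-memory simulation under stated assumptions: the class and method names are hypothetical, the records are invented, and harvesting is a plain method call standing in for the HTTP-based protocol.

```python
class DataProvider:
    """An archive that exposes its metadata records for harvesting."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def list_records(self):
        # Stand-in for answering a harvesting request.
        return list(self._records)


class ServiceProvider:
    """Harvests metadata from data providers into a union catalog."""

    def __init__(self):
        self.union_catalog = []

    def harvest(self, provider):
        for record in provider.list_records():
            # Remember which archive each record came from.
            self.union_catalog.append({"archive": provider.name, **record})

    def search(self, term):
        term = term.lower()
        return [r for r in self.union_catalog
                if term in r.get("Title", "").lower()]


# Three data providers with invented records.
archive_1 = DataProvider("Archive 1", [{"Title": "Field recordings, set A"}])
archive_2 = DataProvider("Archive 2", [{"Title": "Annotated texts"},
                                       {"Title": "Field recordings, set B"}])
archive_3 = DataProvider("Archive 3", [{"Title": "Descriptive grammar"}])

service = ServiceProvider()
for provider in (archive_1, archive_2, archive_3):
    service.harvest(provider)

# The end-user queries only the service provider.
results = service.search("field recordings")
archives_hit = sorted({r["archive"] for r in results})
```

A single query against the union catalog returns records from more than one archive, without the end-user contacting any data provider directly.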
The OAI infrastructure has the bottom-up, distributed character of the web, while simultaneously having the efficient, structured nature of a centralized database. This combination is well-suited to the language resource community, where the available data is growing rapidly and where a large user-base is fairly consistent in how it describes its resource needs.
OAI data providers may support metadata standards in addition to the Dublin Core. Thus, a specialist community like the language resources community can define a metadata format tailored to its domain. Using the OAI infrastructure, the community's archives can be federated: a virtual meta-archive collects all the information into a single place and end-users can query multiple archives simultaneously. In the case of OLAC, the Linguistic Data Consortium has harvested the catalogs of ten participating archives and created a search interface which permits queries over all 9,000+ records. A single search typically returns records from multiple archives. The prototype can be accessed via [www.language-archives.org].