To identify and catalog a language, it is crucial to define what counts as a language and to distinguish languages from dialects. The editors of the Ethnologue have made thousands of such decisions using advice from hundreds of experts around the world. However, for many languages scholarship remains patchy, or there is scholarly disagreement. In such cases, the best that the Ethnologue can do is what it already does: represent incomplete knowledge and then produce periodic updates to reflect the results of new research.
As OLAC grows, Ethnologue codes will be deployed widely. Each new OLAC-conformant archive will be faced with a range of issues in seeking to associate language codes with language resources. For concreteness, we have chosen for our examples the Formosan languages, a group of Austronesian languages spoken in Taiwan. We put ourselves in the shoes of the field research group at Academia Sinica (Elizabeth Zeitoun, personal communication) and try to envisage the problems which they might encounter in assigning Ethnologue codes to their language resources.
We see three broad categories of problem: over-splitting, over-chunking and omission. Over-splitting occurs when a language variety is treated as a distinct language. For example, Nataoran is given its own language code (AIS) even though the scholars at Academia Sinica consider it to be a dialect of Amis (ALV). Over-chunking occurs when two or more distinct languages are treated as dialects of a single language (there does not appear to be an example of this in the Ethnologue's treatment of Formosan languages). Omission occurs when a language is not listed. For example, two extinct languages, Luilang and Quaquat, are not listed in the Ethnologue. Another kind of omission problem occurs when the language is actually listed, but the name by which the archivist knows it is not listed, whether as a primary name or an alternate name. In such a case the archivist cannot make the match to assign the proper code. For instance, the language listed as Taroko (TRV) in the Ethnologue is known as Seediq at Academia Sinica; several of the alternate names listed by the Ethnologue are similar, but none matches exactly.
Beyond these three problems with language identification, a further type of problem concerns scholarly disagreement over language family classification. The Ethnologue follows the Oxford International Encyclopedia of Linguistics [Bright1992] for most language families. For the Austronesian languages, including the Formosan languages, the Ethnologue follows the Comparative Austronesian Dictionary [Tryon1994]. Additionally, some changes have been entered in the light of more recent comparative studies. Academia Sinica has developed its own language family classification scheme for Formosan languages, and this differs from the Ethnologue/Tryon scheme. Moreover, languages typically have many variant names, and scholars may disagree on the choice of a canonical name for the language. For example, the Academia Sinica scholars believe Taroko to be a variant of Seediq, while the Ethnologue/Tryon scheme would presumably consider Seediq to be a variant of Taroko.
The consequences of these problems for classification and retrieval are obvious. In the case of over-splitting, as with AIS and ALV mentioned above, someone searching for Amis resources will need to know to search over both codes. An archivist cataloging a resource which is ambiguous with respect to the AIS/ALV distinction (perhaps because it was created by someone who did not believe in the distinction) may need to assign both codes. In the case of over-chunking, an archivist cannot specify the individual language but must use a code which designates two or more languages. Someone searching for resources in one of those languages will experience lower precision. In the case of omission, no language code can be assigned, and classification and search must fall back to using conventional string representations for language names (with the attendant precision and recall problems). In the case of differing language family classifications, the precision and recall of searches on language family names are reduced.
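The effect on retrieval can be made concrete with a toy catalog. The following Python sketch is purely illustrative: the record identifiers and the "true language" labels are invented, and XXX is a hypothetical over-chunked code standing in for two languages A and B. Only the codes AIS and ALV are taken from the discussion above.

```python
# Toy catalog: (record id, assigned code, language the searcher cares about).
# All records are invented for illustration.
CATALOG = [
    ("r1", "ALV", "Amis"),
    ("r2", "ALV", "Amis"),
    ("r3", "AIS", "Amis"),  # Nataoran, regarded as Amis by this searcher
    ("r4", "XXX", "A"),     # XXX: hypothetical code covering languages A and B
    ("r5", "XXX", "B"),
    ("r6", "XXX", "B"),
]


def search(codes):
    """Return the ids of records whose assigned code is in `codes`."""
    return {rid for rid, code, _ in CATALOG if code in codes}


def relevant(language):
    """Return the ids of records that are truly in `language`."""
    return {rid for rid, _, lang in CATALOG if lang == language}


def precision_recall(retrieved, wanted):
    """Compute (precision, recall) of a retrieved set against a wanted set."""
    hits = len(retrieved & wanted)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(wanted) if wanted else 0.0
    return precision, recall


# Over-splitting: searching on ALV alone misses the AIS record (lower recall).
amis_alv_only = precision_recall(search({"ALV"}), relevant("Amis"))
# Searching over both codes restores full recall.
amis_both = precision_recall(search({"ALV", "AIS"}), relevant("Amis"))
# Over-chunking: searching for language A also returns B records (lower precision).
lang_a = precision_recall(search({"XXX"}), relevant("A"))
```

In this toy setting, over-splitting costs recall unless the searcher knows to use both codes, while over-chunking costs precision that no choice of code can recover.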
All of these problems can be addressed through existing Ethnologue mechanisms. However, OLAC metadata and service providers could offer complementary remedies.
Controlling element content. The Language and Subject.language elements permit the language code to be specified in the code attribute, while the element content is unrestricted. A community of Formosan scholars could develop a controlled vocabulary for identifying speech varieties down to any level of detail they liked, and then use those terms as the content of the Language or Subject.language element. For example, the following are five varieties of the Bunun language:
<language code="x-sil-BNN">Northern/Takituduh</>
<language code="x-sil-BNN">Northern/Takibakha</>
<language code="x-sil-BNN">Central/Takbanuaz</>
<language code="x-sil-BNN">Central/Takivatan</>
<language code="x-sil-BNN">Southern/Isbukun</>
If no Ethnologue code corresponded to the group of languages in question, as in the Amis/Nataoran case, the code attribute could be omitted (though this would prevent recall on the Ethnologue code). This general approach could be formalized by permitting subcommunities to register an encoding scheme as a controlled vocabulary with a unique name. That name would be specified as the value of a new scheme attribute, and the element content would be constrained to be an item from the corresponding vocabulary. These approaches would address the problems of over-chunking and omission.
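A minimal sketch of how such a constraint might be checked is given below. It rests on several assumptions that are not part of OLAC: the attribute is taken to be named scheme, the vocabulary name formosan-bunun is invented, and the registry is a plain in-memory dictionary. The five terms are those of the Bunun example above. Elements are written with full closing tags because the standard Python XML parser requires well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry of controlled vocabularies, keyed by scheme name.
# The scheme name is invented; the terms come from the Bunun example.
VOCABULARIES = {
    "formosan-bunun": {
        "Northern/Takituduh",
        "Northern/Takibakha",
        "Central/Takbanuaz",
        "Central/Takivatan",
        "Southern/Isbukun",
    },
}


def check_language_element(xml_text):
    """Return (ok, message) for a single <language> element."""
    elem = ET.fromstring(xml_text)
    scheme = elem.get("scheme")  # assumed attribute name
    content = (elem.text or "").strip()
    if scheme is None:
        # Without a scheme, element content stays unrestricted.
        return True, "no scheme: content is unrestricted"
    vocab = VOCABULARIES.get(scheme)
    if vocab is None:
        return False, f"unregistered scheme: {scheme}"
    if content not in vocab:
        return False, f"{content!r} is not in vocabulary {scheme}"
    return True, "ok"


good = '<language code="x-sil-BNN" scheme="formosan-bunun">Southern/Isbukun</language>'
bad = '<language code="x-sil-BNN" scheme="formosan-bunun">Isbukun</language>'
free = '<language code="x-sil-BNN">Any free text</language>'

results = [check_language_element(x) for x in (good, bad, free)]
```

Here the first and third elements are accepted and the second is rejected, since Isbukun alone is not a term of the registered vocabulary. Note that the code attribute is never consulted, so the same check would apply to an element from which the code has been omitted.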
Registering language groups with an OLAC registration service. While the classification of a language is sometimes treated as metadata for resources in that language, we believe that a more appropriate location for this type of finding aid is in OLAC service providers. OLAC could maintain a language classification server which would house a comprehensive list of language family names and their extensional definitions (i.e. sets of Ethnologue codes). The server would permit users to define their own language group names or their own versions of existing group names. For instance, Academia Sinica could register a language group name AS:Amis with the extension {ALV, AIS}. Searching on their notion of "Amis" would return resources classified under both codes. Entire classification schemes with complex hierarchies could be represented in this fashion. OLAC service providers could index their harvested metadata using these names, allowing any user to perform searches using any classification scheme. Over time, the more respected and popular classifications could be identified and accorded due prominence. This mechanism would address the problems of over-splitting and differing classification.
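The extensional definition of group names can be sketched in a few lines of Python. This is an in-memory stand-in for the proposed server, not a description of any existing OLAC service: the class GroupRegistry, its methods, the record identifiers and the group demo:AmisBunun are all hypothetical, and the last of these is an arbitrary grouping chosen only to show nesting, not a linguistic claim. The codes ALV, AIS and BNN are those mentioned in the text.

```python
class GroupRegistry:
    """Hypothetical registry mapping group names to sets of codes or groups."""

    def __init__(self):
        self._groups = {}

    def register(self, name, members):
        """Define `name` as a set of language codes and/or other group names."""
        self._groups[name] = set(members)

    def expand(self, name, _path=()):
        """Resolve a name to its set of language codes, following nesting."""
        if name not in self._groups:
            return {name}  # not a registered group: treat as a language code
        if name in _path:
            raise ValueError(f"cyclic group definition: {name}")
        codes = set()
        for member in self._groups[name]:
            codes |= self.expand(member, _path + (name,))
        return codes


def search(records, registry, query):
    """Return ids of records whose code falls in the extension of `query`."""
    codes = registry.expand(query)
    return [rid for rid, code in records if code in codes]


registry = GroupRegistry()
registry.register("AS:Amis", {"ALV", "AIS"})
# A nested group, to illustrate hierarchies (arbitrary grouping).
registry.register("demo:AmisBunun", {"AS:Amis", "BNN"})

# Invented harvested metadata: (record id, language code).
records = [("r1", "ALV"), ("r2", "AIS"), ("r3", "BNN"), ("r4", "TRV")]

amis_hits = search(records, registry, "AS:Amis")
nested_hits = search(records, registry, "demo:AmisBunun")
```

A search on AS:Amis returns the records coded ALV and AIS alike, so the searcher no longer needs to know that the two codes exist. Because group names resolve to plain sets of codes, a service provider could apply any registered classification to metadata it has already harvested without changing the metadata itself.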