It is essential for a natural language processing (NLP) system to instantiate each object, process, attribute, and property correctly, so that all references to the same item are recognized as such and the inventory of distinct items remains accurate at all times. This problem is far from resolved, for both linguistic and computational reasons. First, there is no satisfactory microtheory of linguistic coreference. Second, and consequently, there is no satisfactory application of such a microtheory to NLP.

A microtheory of coreference in natural language includes in its scope all the phenomena that satisfy the following condition: an object/entity, an event, an attribute, a property or its value, an attitude, or any combination of the above is referred to more than once in a natural-language text, and the understanding of the text depends on the correct interpretation of the two or more referring expressions as designating the same object, event, etc. A linguistic microtheory of coreference for a language consists of the following elements:

There has been a considerable amount of work on a few selected types of coreference, focusing almost exclusively on object coreference. Thus, significant work has been done in theoretical linguistics on anaphora and cataphora, subsuming, for the most part, earlier work on deixis. A small minority of authors have tried to extend their studies of anaphora beyond mere syntax. In the cognitive-linguistics and philosophy-of-language traditions, interesting work has been done relating anaphora and deixis to ambiguity resolution and discourse structure. At the same time, an effort in comparative-contrastive linguistics has led some writers to examine the data of more than one language at a time, still emphasizing entity or object reference.

In computational linguistics, the problem of coreference early on took the form of pronoun antecedent resolution, and this particular task, somewhat broadened to include a few other types of anaphora, remains at the center of the problem. The most sustained effort in the computational treatment of coreference has been mounted within the Tipster/MUC-6 initiative. While it has been recognized from quite early on that coreference resolution is based in large part on world knowledge, most of the computational and theoretical work on the matter avoids world knowledge altogether. The MUC-6 initiative makes this orientation quite explicit: the work should be based on such simpler resources as part-of-speech tagging, simple noun phrase recognition, basic semantic category information such as gender and number, and (to a limited extent) full parse trees. Such an approach, which tries to explore and maximize everything that can be done simply and cheaply towards the resolution of a complex problem, is perfectly legitimate as long as it is realized that a considerable part of the problem remains unsolved, and this is indeed fully realized within the MUC-6 initiative.
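To make the knowledge-poor orientation concrete, the following is a minimal illustrative sketch of pronoun antecedent resolution that uses only gender and number agreement plus recency. It is not the MUC-6 procedure or any published system: the `Mention` record, the small `PRONOUNS` lexicon, and the "most recent compatible candidate" heuristic are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A candidate antecedent (hypothetical record, for illustration only)."""
    text: str
    position: int  # token index of the mention in the text
    gender: str    # "m", "f", "n", or "?" when unknown
    number: str    # "sg" or "pl"


# Hypothetical, deliberately tiny pronoun lexicon: pronoun -> (gender, number).
PRONOUNS = {
    "he": ("m", "sg"),
    "she": ("f", "sg"),
    "it": ("n", "sg"),
    "they": ("?", "pl"),
}


def gender_agrees(a: str, b: str) -> bool:
    """Genders agree if they are equal or if either one is unknown."""
    return a == b or "?" in (a, b)


def resolve_pronoun(
    pronoun: str, position: int, candidates: List[Mention]
) -> Optional[Mention]:
    """Return the most recent preceding mention that agrees with the pronoun
    in gender and number, or None if there is no such mention."""
    gender, number = PRONOUNS[pronoun.lower()]
    compatible = [
        m
        for m in candidates
        if m.position < position
        and gender_agrees(m.gender, gender)
        and m.number == number
    ]
    return max(compatible, key=lambda m: m.position, default=None)


# "Mary met John. She ..." : agreement alone picks the right antecedent.
people = [Mention("Mary", 0, "f", "sg"), Mention("John", 2, "m", "sg")]
print(resolve_pronoun("she", 3, people).text)

# "The trophy did not fit in the suitcase because it was too big."
# Both candidates agree with "it", so recency decides, and decides wrongly.
things = [Mention("trophy", 1, "n", "sg"), Mention("suitcase", 7, "n", "sg")]
print(resolve_pronoun("it", 9, things).text)
```

In the second example both candidates pass the agreement filter, so the recency heuristic selects "suitcase", whereas a reader relying on world knowledge about sizes and containers would select "trophy". This is the kind of case that shallow resources leave unsolved.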

One persistent problem throughout the existing computational ventures into coreference has been the lack of a consistent theoretical approach to it. The result is that coreference phenomena are treated as self-evident, and most of them are overlooked, especially if they are not explicit pronoun-antecedent cases or other equally evident cases of anaphora. What is needed for a full, accurate, and reliable approach to coreference can be summarized, somewhat schematically, as involving the following steps:

  1. understanding fully the range of the phenomenon and of the rules that govern it (theory);
  2. determining the extent of machine-tractable information in the rules;
  3. taking stock of all the rules that can be computed;
  4. developing the appropriate heuristics for the computable rules;
  5. computing the rules.

