The PEKING project

developing new technology for document processing

The PEKING project (People and Knowledge Information Gathering) was a 5th Framework project (IST-25338, January 2001 - December 2002) that addressed the problems of supervised and unsupervised classification and (cross-lingual) matching of documents in organizations.

The consortium consisted of the following partners:

industrial partners
- META4 R&D (a leading web-based provider of people and knowledge management software, coordinator), Spain
- Quinary (Technology Consulting Company), Italy
- Edmond (Software Company), The Netherlands
academic partners
- Gilcub, Univ. of Barcelona, Spain
- Univ. of Nijmegen (KUN), The Netherlands
user partners
- CRF-FIAT (the Italian automotive group)
- CINDOC (Spain's leading documentation institute)
- Fiscaal up to Date (Publishing Company), The Netherlands.

The project started in January 2001 and was successfully completed in December 2002.

In the course of the PEKING project, KUN and Edmond addressed the real-life situation of a Dutch user (Fiscaal), which is typical of many firms and institutions that provide access to a large collection of systematically gathered documents. The documents are currently classified manually according to a hierarchical thesaurus, which is hard to keep up to date and to modify. In addition, certain index terms have been added to the documents manually, and a conventional keyword-based search facility is available. Because manual classification and index term assignment are expensive, inflexible and rather subjective, there is a pressing need for an automatic disclosure mechanism to replace, or at least support, the manual classification process.

The following technical problems were addressed:

- learning reliable classifiers from unreliably classified documents
- exploiting the notion of uncertainty in improving classification results
- deriving normalized phrasal representations from documents
- using phrase representations in conjunction with statistical learning methods to increase precision in learning
- cross-lingual text categorization.

KUN has extended the LCS (Linguistic Classification System), developed as a prototype in the course of the earlier DORO project, into an industrial-quality system capable of classifying large streams of documents in many languages.

Publicly available documentation and publications:

from the current project:
- C.H.A. Koster, "From keywords to keyphrases", presentation at the 'ICT kenniscongres' in The Hague, 6-7 September 2001. ps.gz pdf
- C.H.A. Koster, M. Seutter and J.G. Beney (INSA Lyon), "Classifying Patent Applications with Winnow", Benelearn 2001, Antwerp, December 21. ps.gz pdf
- C. Peters and C.H.A. Koster (2002), "Uncertainty-based noise reduction and term selection in text categorization", ECIR 2002. ps.gz pdf
- C.H.A. Koster, P. Jones, M. Vogel and N. Gietema, "The Bootstrapping Problem", presented at the SIGIR 02 Workshop on Operational Text Categorization, Tampere, August 2002. ps.gz pdf
- C. Peters and C.H.A. Koster (2003), Uncertainty-based Noise Reduction and Term Selection in Text Categorisation, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS) Vol. 11, No. 1, pp 115-137. ps.gz pdf
- C.H.A. Koster and M. Seutter (2003), Taming Wild Phrases, Proceedings 25th European Conference on IR Research (ECIR 2003), Springer LNCS 2633, pp 161-176. ps.gz pdf
- (in Dutch) C.H.A. Koster, Automatische Document Klassificatie, presentation at the DOCUMENT 2003 trade fair, Nijkerk, 17 June 2003. pdf
- Nuria Bel, Cornelis H.A. Koster and Marta Villegas (2003), Cross-Lingual Text Categorization, to appear in Proceedings ECDL 2003, Trondheim, August 2003. ps.gz pdf
- C.H.A. Koster, M. Seutter and J.G. Beney (INSA Lyon), "Multi-Classification of Patent Applications with Winnow", to appear in Proceedings PSI 2003, Novosibirsk, July 2003. ps.gz pdf
from the preceding DORO project
- H. Ragas (CAP Gemini) and C.H.A. Koster, "Four text classification algorithms compared on a Dutch corpus", Proceedings SIGIR 1997 ps.gz pdf
- C.H.A. Koster, C. Derksen, D. van de Ende and J. Potjer, "Normalization and matching in the DORO project", Proceedings BCS IR conference 1999 ps.gz pdf
- Paula Santalla del Rio (USC), "An architecture for document routing in Spanish: two language components, pre-processor and parser" ps.gz pdf
related publications
- Avi Arampatzis, Jean Beney, C.H.A. Koster, Th.P. van der Weide, "Incrementality, Decay, and Threshold Optimization for Adaptive Filtering Systems", The Ninth Text REtrieval Conference (TREC-9), Gaithersburg, Maryland, November 13-16, 2000. ps.gz pdf
- Christiaan Rudolfs, E@SLAVE -- an incremental approach to automated, content-based email classification, Master's Thesis KU Nijmegen, August 2002. ps.gz pdf

Requests for information can be directed to

Cornelis H.A. Koster
Department of Computing Science
University of Nijmegen
6525ED Nijmegen, The Netherlands
tel: +31.24.3653411
fax: +31.24.3553450
email: kees@cs.kun.nl