The automated categorisation (or classification) of texts into
topical categories has a long history, dating back at least to the early '60s.
Until the late '80s, the most effective approach to the problem seemed to be
that of manually building automatic classifiers by means of
knowledge-engineering techniques, i.e. manually defining a set of rules encoding
expert knowledge on how to classify documents under a given set of categories.
In the '90s, with the booming production and availability of on-line documents,
automated text categorisation has witnessed an increased and renewed interest.
This has prompted the emergence of the machine learning approach to automatic
classifier construction, which has definitively superseded the
knowledge-engineering approach. Within the machine learning paradigm, a general
inductive process (called the learner) automatically builds a classifier (also
called the rule, or the hypothesis) by learning, from a set of previously
classified documents, the characteristics of one or more categories. The
advantages of this approach are very good effectiveness, considerable
savings in terms of expert manpower, and domain independence. In this tutorial
we look at the main approaches that have been taken towards automatic text
categorisation within the general machine learning paradigm. Issues pertaining
to document indexing, classifier construction, and classifier evaluation will
be discussed in detail. A final section will be devoted to the techniques that
have been devised specifically for an emerging application: the automatic
classification of Web pages into "Yahoo!-like" hierarchically structured sets of
categories.
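
The relationship between the learner and the classifier described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes learner with add-one smoothing, which is only one of many possible inductive processes and is not singled out by this abstract. The function names (`tokenize`, `learn`), the whitespace tokenisation, and the toy training set are likewise illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Crude indexing step (an assumption): lower-case and split on whitespace."""
    return text.lower().split()


def learn(training_set):
    """The 'learner': builds a classifier from (text, category) pairs.

    Illustrative only: multinomial Naive Bayes with add-one smoothing.
    """
    doc_counts = Counter()              # number of training documents per category
    term_counts = defaultdict(Counter)  # term frequencies per category
    vocabulary = set()

    for text, category in training_set:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)

    total_docs = sum(doc_counts.values())

    def classify(text):
        """The 'classifier' (rule, hypothesis): returns the single best category."""
        best_category, best_score = None, -math.inf
        for category in doc_counts:
            # Log prior of the category.
            score = math.log(doc_counts[category] / total_docs)
            total_terms = sum(term_counts[category].values())
            for term in tokenize(text):
                if term not in vocabulary:
                    continue  # terms never seen in training carry no evidence
                # Smoothed log likelihood of the term given the category.
                score += math.log(
                    (term_counts[category][term] + 1)
                    / (total_terms + len(vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    return classify


# Toy set of previously classified documents (hypothetical data).
training_set = [
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the match", "sport"),
    ("the striker scored a late goal", "sport"),
]

classifier = learn(training_set)
print(classifier("markets and interest rates"))
print(classifier("the striker won"))
```

Note that no category-specific rules are written by hand: changing the domain only requires supplying a different set of pre-classified documents to `learn`, which is the domain independence referred to above.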
This section details the contents of the tutorial,
including approximate timing information. A preliminary version of the slides on
which the tutorial will be based is available for download.
A definition of the text categorisation task
Single-label and multi-label categorisation
Category-pivoted and document-pivoted categorisation

Automatic indexing for Boolean information retrieval systems
Document organisation
Document filtering
Resolution of linguistic ambiguities
Yahoo!-style search space categorisation
Fabrizio Sebastiani (born 1960) graduated in Computer Science summa cum laude
at the University of Pisa, Italy, in 1986. From 1986 to 1988 he worked
as a researcher at the Department of Linguistics of the University of Pisa;
since 1988 he has been a member of the research staff of CNR-IEI.
In 1989/90 he was a Visiting Scientist at the Department of Computer
Science, University of Toronto, Canada, where he worked on non-monotonic
reasoning; in 1993/94 he was a Visiting Scientist at the Department of
Computing Science, University of Glasgow, UK, where he worked on the
application of logic and probability to information retrieval; in 1998 he
was a Visiting Scientist at the Department of Computing Science, University of
Dortmund, Germany, where he worked on automated text categorisation. He is
currently involved in the CEC-funded ESPRIT LTR Project EUROSEARCH, dealing with
the design of a European, multilingual federation of search engines. He has
published several papers in international journals and conferences in the areas
of natural language processing, logic-based knowledge representation, and
information retrieval. His main current interest is the application of
machine learning to automated text categorisation.
Other information of interest: