Machine Learning for Automated Text Categorization

A Tutorial

Fabrizio Sebastiani

Istituto di Scienza e Tecnologie dell'Informazione

Consiglio Nazionale delle Ricerche

56124 Pisa, Italy

The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early '60s. Until the late '80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed renewed and increased interest. This has prompted the emergence of the machine learning approach to automatic classifier construction, which has definitively superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by learning, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have been specifically devised for an emerging application: the automatic classification of Web pages into "Yahoo!-like" hierarchically structured sets of categories.
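As a rough illustration of the learner/classifier distinction described above, the following minimal Python sketch shows a learner that takes a set of previously classified documents and returns a classifier. It is a toy under stated assumptions, not the method of any particular system covered in the tutorial: the names `tokenize`, `learn` and `classify`, the four-document training set, and the choice of a simple per-category centroid compared by cosine similarity are all illustrative.

```python
from collections import Counter, defaultdict
import math


def tokenize(text):
    # Naive whitespace tokenisation; real indexing is far more elaborate.
    return text.lower().split()


def learn(training_set):
    """Learner: build a classifier from (text, category) pairs."""
    # Sum the term counts of each category's documents into one centroid.
    centroids = defaultdict(Counter)
    for text, category in training_set:
        centroids[category].update(tokenize(text))

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def classify(text):
        # Classifier: assign the category whose centroid is most similar.
        doc = Counter(tokenize(text))
        return max(centroids, key=lambda c: cosine(doc, centroids[c]))

    return classify


# Hypothetical, hand-made training set (previously classified documents).
training_set = [
    ("stocks fell as markets closed lower", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored two goals", "sport"),
]

classifier = learn(training_set)
print(classifier("interest rates and markets"))  # prints "finance"
```

No classification rule is written by hand here: changing the training set changes the classifier, which is where the savings in expert manpower and the domain independence come from.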

Duration: Full day (6 hours)
Intended audience: The tutorial is intended for representatives of either academia or industry who are active in neighbouring disciplines and are interested in either getting acquainted with, or investing effort in, the subject of automated text categorization by machine learning techniques.
Handouts to attendees: Copies of the slides + a copy of the review paper Machine Learning in Automated Text Categorization, by the author, appearing in ACM Computing Surveys 34(1), 1-47, 2002.

Detailed Contents of the Tutorial

This section details the contents of the tutorial, including approximate timing information. A preliminary version of the slides on which the tutorial will be based is available for download and inspection.

Introduction [15 min.]

A definition of the text categorisation task
Single-label and multi-label categorisation
Category-pivoted and document-pivoted categorisation

Applications of document categorisation [30 min.]

Automatic indexing for Boolean information retrieval systems
Document organisation
Document filtering
Resolution of linguistic ambiguities
Yahoo!-style search space categorisation

The machine learning approach to text categorisation [20 min.]

Training set and test set
Information retrieval techniques and text categorisation

Indexing and dimensionality reduction [40 min.]

Dimensionality reduction
Dimensionality reduction by term selection
- Document frequency
- Other information-theoretic term selection functions
Dimensionality reduction by term extraction
- Term clustering
- Latent semantic indexing

Methods for the inductive construction of a classifier [150 min.]

Probabilistic classifiers
Decision tree classifiers
- The AIR/X project
Decision rule classifiers
Regression models
On-line linear classifiers
The Rocchio classifier
- Enhancements to the basic Rocchio framework
Neural networks
Example-based classifiers
- Other example-based techniques
Building classifiers by support vector machines
Classifier committees
- Boosting
Other methods

Determining thresholds [20 min.]

Evaluation issues for text categorisation [40 min.]

Measures of categorisation effectiveness
- Precision and recall
- Other measures of categorisation effectiveness
- Measures alternative to effectiveness
- Combined effectiveness measures
Benchmark collections
Which classifier is the best?

Automatic categorisation of Web pages [30 min.]

Indexing and dimensionality reduction
Classifier induction
Evaluation

Conclusion [15 min.]

Biographical sketch of the tutor

Fabrizio Sebastiani (born 1960) graduated in Computer Science summa cum laude at the University of Pisa, Italy, in 1986. From 1986 to 1988 he worked as a researcher at the Department of Linguistics of the University of Pisa; since 1988 he has been a member of the research staff of CNR-IEI. In 1989/90 he was a Visiting Scientist at the Department of Computer Science, University of Toronto, Canada, where he worked on non-monotonic reasoning; in 1993/94 he was a Visiting Scientist at the Department of Computing Science, University of Glasgow, UK, where he worked on the application of logic and probability to information retrieval; in 1998 he was a Visiting Scientist at the Department of Computing Science, University of Dortmund, Germany, where he worked on automated text categorization. He is currently involved in the CEC-funded ESPRIT LTR Project EUROSEARCH, dealing with the design of a European, multilingual federation of search engines. He has published several papers in international journals and conferences in the areas of natural language processing, logic-based knowledge representation, and information retrieval. His main current interest is the application of machine learning to automated text categorization.

Other information of interest: