Abstract: This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to...
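The abstract names the χ²-test (CHI) among the term-goodness measures evaluated. As an illustrative sketch only (the function name and cell labels are ours, not the paper's code), the standard χ² statistic for a term/category pair can be computed from a 2×2 contingency table of term presence versus category membership:

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) from a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A term distributed independently of the category scores 0.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term occurring in every in-category document and no other document
# gets the maximum score for this 4-document collection.
print(chi_square(2, 0, 0, 2))  # → 4.0
```

Terms are then ranked by this score (per category, or averaged/maximized over categories) and low-scoring terms are removed.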
...selection for improved performance and robustness. The discriminative term selection is based on the criterion of Information Gain (IG) [5,6,7]. The discriminative power of a term is measured by the average entropy variation over the categories when the term is present or...
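The snippet above describes IG as the average entropy variation over the categories when a term is present or absent. A minimal sketch of that criterion, assuming a bag-of-words representation (the helper names and the toy corpus are illustrative, not taken from the cited papers):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t).

    `docs` is a list of token sets; `labels` holds the parallel categories.
    """
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(docs)
    ig = entropy(labels)
    for part in (with_t, without_t):
        if part:  # skip empty partitions (term in all or no documents)
            ig -= (len(part) / n) * entropy(part)
    return ig

docs = [{"goal", "match"}, {"goal", "team"}, {"stocks", "market"}, {"market", "trade"}]
labels = ["soccer", "soccer", "economy", "economy"]
print(round(information_gain(docs, labels, "goal"), 3))  # → 1.0
```

Here "goal" perfectly separates the two categories, so observing it removes the full 1 bit of category entropy; a term occurring evenly across categories would score near 0.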
...to distinguish between economy or soccer sub-subjects. These features are selected using the document frequency measure over a group of cases [17]. The document frequency #Tr(tk) of a term tk is the number of documents (textual cases in the same group) in which the term occurs....
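The document frequency #Tr(tk) defined above simply counts the documents in a group that contain the term. A sketch of that count and of threshold-based DF selection (the function names and threshold rule are our illustration of the measure, not code from the cited work):

```python
def document_frequency(docs, term):
    """#Tr(t_k): number of documents (cases in the group) containing the term."""
    return sum(1 for d in docs if term in d)

def select_by_df(docs, threshold):
    """Keep only terms whose document frequency meets the threshold."""
    vocab = set().union(*docs)
    return {t for t in vocab if document_frequency(docs, t) >= threshold}

docs = [{"goal", "match"}, {"goal", "team"}, {"stocks", "market"}]
print(document_frequency(docs, "goal"))   # → 2
print(select_by_df(docs, 2))              # → {'goal'}
```

DF is the cheapest of the measures, since it needs only one pass over the documents and no category labels.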
Yang, Y., Pedersen, J.O., A Comparative Study on Feature Selection in Text Categorization, Proc. of the 14th International Conference on Machine Learning (ICML '97), pp. 412-420, 1997. http://citeseer.nj.nec.com/yang97comparative.html
@inproceedings{yang97comparative,
  author    = "Yiming Yang and Jan O. Pedersen",
  title     = "A comparative study on feature selection in text categorization",
  booktitle = "Proceedings of {ICML}-97, 14th International Conference on Machine Learning",
  publisher = "Morgan Kaufmann Publishers, San Francisco, US",
  address   = "Nashville, US",
  editor    = "Douglas H. Fisher",
  pages     = "412--420",
  year      = "1997",
  url       = "http://citeseer.nj.nec.com/yang97comparative.html"
}