Abstract: This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to...
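The abstract names the χ²-test (CHI) among the term-goodness measures evaluated. As an illustrative sketch only (the function name and cell labels are ours, not the paper's code), the standard χ² statistic for a term/category pair can be computed from a 2×2 contingency table of term presence versus category membership:

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) from a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A term distributed independently of the category scores 0.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term occurring in every in-category document and no other document
# gets the maximum score for this 4-document collection.
print(chi_square(2, 0, 0, 2))  # → 4.0
```

Terms are then ranked by this score (per category, or averaged/maximized over categories) and low-scoring terms are removed.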
...selection for improved performance and robustness. The discriminative term selection is based on the criterion of Information Gain (IG) [5,6,7]. The discriminative power of a term is measured by the average entropy variation over the categories when the term is present or...
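The snippet above describes IG as the average entropy variation over the categories when a term is present or absent. A minimal sketch of that criterion, assuming a bag-of-words representation (the helper names and the toy corpus are illustrative, not taken from the cited papers):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t).

    `docs` is a list of token sets; `labels` holds the parallel categories.
    """
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(docs)
    ig = entropy(labels)
    for part in (with_t, without_t):
        if part:  # skip empty partitions (term in all or no documents)
            ig -= (len(part) / n) * entropy(part)
    return ig

docs = [{"goal", "match"}, {"goal", "team"}, {"stocks", "market"}, {"market", "trade"}]
labels = ["soccer", "soccer", "economy", "economy"]
print(round(information_gain(docs, labels, "goal"), 3))  # → 1.0
```

Here "goal" perfectly separates the two categories, so observing it removes the full 1 bit of category entropy; a term occurring evenly across categories would score near 0.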
...to distinguish between economy or soccer sub-subjects. These features are selected using the document frequency measure over a group of cases [17]. The document frequency #Tr(tk) of a term tk is the number of documents (textual cases in the same group) in which the term occurs....
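The document frequency #Tr(tk) defined above simply counts the documents in a group that contain the term. A sketch of that count and of threshold-based DF selection (the function names and threshold rule are our illustration of the measure, not code from the cited work):

```python
def document_frequency(docs, term):
    """#Tr(t_k): number of documents (cases in the group) containing the term."""
    return sum(1 for d in docs if term in d)

def select_by_df(docs, threshold):
    """Keep only terms whose document frequency meets the threshold."""
    vocab = set().union(*docs)
    return {t for t in vocab if document_frequency(docs, t) >= threshold}

docs = [{"goal", "match"}, {"goal", "team"}, {"stocks", "market"}]
print(document_frequency(docs, "goal"))   # → 2
print(select_by_df(docs, 2))              # → {'goal'}
```

DF is the cheapest of the measures, since it needs only one pass over the documents and no category labels.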
Yang, Y., Pedersen, J.O., A Comparative Study on Feature Selection in Text Categorization, Proc. of the 14th International Conference on Machine Learning (ICML '97), pp. 412-420, 1997. http://citeseer.nj.nec.com/yang97comparative.html
@inproceedings{yang97comparative,
  author    = "Yiming Yang and Jan O. Pedersen",
  title     = "A comparative study on feature selection in text categorization",
  booktitle = "Proceedings of {ICML}-97, 14th International Conference on Machine Learning",
  publisher = "Morgan Kaufmann Publishers, San Francisco, US",
  address   = "Nashville, US",
  editor    = "Douglas H. Fisher",
  pages     = "412--420",
  year      = "1997",
  url       = "http://citeseer.nj.nec.com/yang97comparative.html"
}