Text Categorization
General
Text Categorization with Learning Methods
Hierarchical Text Categorization
Text Categorization Using Knowledge Bases
Evaluation of Text Categorization
Topic Characterization
Term Specificity
Term Discrimination Value
General
- Rainbow <http://www.cs.cmu.edu/~mccallum/bow/rainbow/>
Rainbow is a program that performs statistical
text classification. It is based on the Bow library.
- 최동시, 정경택. 1995. "Design and Implementation of an Automatic Document
Classification System Based on Category-Keyword Association Information,"
Proceedings of the Korea Information Science Society '95 Fall Conference
22(2): 639-642.
[abstract] Wherever large numbers of documents are
handled, systematic classification management based on document content is
essential. This paper describes the design and implementation of a
vector-table generation system that automatically extracts keywords from an
encyclopedia used as base data and expresses, as vector values, the
within-class importance of the extracted keywords according to a manually
defined classification scheme, together with an automatic document
classification system that uses this table. The per-class keyword vector
table implemented in the system consists of 76 classes and some 90,000
keyword entries. In document classification using the vector table initially
constructed from sample data, documents containing representative proper
nouns were classified almost perfectly, while documents in which common and
abstract nouns account for a relatively large share were classified with a
success rate of about 80 percent.
- Apté, Chidanand, Fred Damerau, and Sholom M. Weiss. 1994.
"Automated learning of decision rules for text categorization," ACM
Transactions on Information Systems 12(3): 233-251.
[abstract] We describe the results of
extensive experiments using optimized rule-based induction methods on large
document collections. The goal of these methods is to discover automatically
classification patterns that can be used for general document categorization
or personalized filtering of free text. Previous reports indicate that
human-engineered rule-based systems, requiring many man-years of developmental
efforts, have been successfully built to "read" documents and assign topics to
them. We show that machine-generated decision rules appear comparable to human
performance, while using the identical rule-based representation. In
comparison with other machine-learning techniques, results on a key benchmark
from the Reuters collection show a large gain in performance, from a
previously reported 67% recall/precision breakeven point to 80.5%. In the
context of a very high-dimensional feature space, several methodological
alternatives are examined, including universal versus local dictionaries, and
binary versus frequency-related features.
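[code sketch] The Apté et al. results are reported as a recall/precision
breakeven point. As a rough illustration (not taken from the paper), the
breakeven point for one category can be estimated by ranking documents by a
hypothetical classifier score and finding the cut-off where precision and
recall cross:

    def breakeven_point(scores, labels):
        """Approximate recall/precision breakeven point for one category.

        scores: classifier confidence per document (higher = more relevant)
        labels: 1 if the document truly belongs to the category, else 0
        """
        ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
        total_relevant = sum(labels)
        tp = 0
        value = 0.0
        for k, (_, y) in enumerate(ranked, start=1):
            tp += y
            precision = tp / k
            recall = tp / total_relevant
            value = (precision + recall) / 2
            if precision <= recall:      # the curves have crossed
                break
        return value

    # toy example with made-up scores and relevance judgements
    print(breakeven_point([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 1, 0]))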
- Baker, Doug, and Andrew
McCallum. 1998. "Distributional Clustering of Words for Text
Classification", SIGIR-98. <http://www.cs.cmu.edu/~mccallum/papers/clustering-sigir98.ps.gz>.
[abstract] This paper applies
Distributional Clustering (Pereira et al. 1993) to document classification.
The approach clusters words into groups based on the distribution of class
labels associated with each word. Thus, unlike some other unsupervised
dimensionality-reduction techniques, such as Latent Semantic Indexing, we are
able to compress the feature space much more aggressively, while still
maintaining high document classification accuracy. Experimental results
obtained on three real-world data sets show that we can reduce the feature
dimensionality by three orders of magnitude and lose only 2%
accuracy---significantly better than Latent Semantic Indexing (Deerwester et
al. 1990), class-based clustering (Brown et al. 1992), feature selection by
mutual information (Yang and Pedersen 1997), or Markov-blanket-based feature
selection (Koller and
Sahami 1996). We also show that less aggressive clustering sometimes results
in improved classification accuracy over classification without clustering.
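[code sketch] The core idea described above, grouping words whose class-label
distributions are similar, can be sketched as a greedy agglomerative merge
using Jensen-Shannon divergence between the P(class | word) distributions.
This is only a toy illustration of the idea, not the Baker and McCallum
algorithm itself:

    import numpy as np

    def js_divergence(p, q):
        """Jensen-Shannon divergence between two discrete distributions."""
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def cluster_words(word_class_counts, n_clusters):
        """Greedily merge the pair of word clusters whose class-label
        distributions are closest, until n_clusters remain."""
        clusters = {w: c.astype(float) for w, c in word_class_counts.items()}
        while len(clusters) > n_clusters:
            words = list(clusters)
            dists = {w: clusters[w] / clusters[w].sum() for w in words}
            best_pair, best_d = None, float("inf")
            for i, a in enumerate(words):
                for b in words[i + 1:]:
                    d = js_divergence(dists[a], dists[b])
                    if d < best_d:
                        best_pair, best_d = (a, b), d
            a, b = best_pair
            clusters[a + "|" + b] = clusters.pop(a) + clusters.pop(b)
        return list(clusters)

    # class-label counts per word (hypothetical: class 0 = sports, 1 = politics)
    counts = {"goal": np.array([9.0, 1.0]), "score": np.array([8.0, 2.0]),
              "senate": np.array([1.0, 9.0]), "vote": np.array([2.0, 8.0])}
    print(cluster_words(counts, 2))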
- Basu, Chumki, Haym
Hirsh, and William W.
Cohen. 1998. "Recommendation as Classification: Using Social and
Content-Based Information in Recommendation", Proceedings of the
Fifteenth National Conference on Artificial Intelligence (AAAI98).
<ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/aaai1.ps>.
[abstract] Recommendation systems make
suggestions about artifacts to a user. For instance, they may predict whether
a user would be interested in seeing a particular movie. Social recommendation
methods collect ratings of artifacts from many individuals and use
nearest-neighbor techniques to make recommendations to a user concerning new
artifacts. However, these methods do not use the significant amount of other
information that is often available about the nature of each artifact ---
such as cast lists or movie reviews, for example. This paper presents an
inductive learning approach to recommendation that is able to use both ratings
information and other forms of information about each artifact in predicting
user preferences. We show that our method outperforms an existing
social-filtering method in the domain of movie recommendations on a dataset of
more than 45,000 movie ratings collected from a community of over 250 users.
- Chakrabarti, Soumen, Byron Dom, Rakesh Agrawal, and Prabhakar
Raghavan. 1997. "Keyword Detection, Navigation, and Annotation in Hierarchical
Text," Proceedings of 23rd
International Conference on Very Large Data Bases: 446-455.
Available in Postscript format from Internet: <http://sunsite.ust.hk/dblp/db/conf/vldb/ChakrabartiDAR97.html>.
[abstract] We explore how to organize a
text database hierarchically to aid better searching and browsing. We propose
to exploit the natural hierarchy of topics, or taxonomy, that many corpora,
such as internet directories, digital libraries, and patent databases enjoy.
In our system, the user navigates through the query response not as a flat
unstructured list, but embedded in the familiar taxonomy, and annotated with
document signatures computed dynamically with respect to where the user is
located at any time. We show how to update such databases with new documents
with high speed and accuracy. We use techniques from statistical pattern
recognition to efficiently separate the feature words or discriminants from the
noise words at each node of the taxonomy. Using these, we build a multi-level
classifier. At each node, this classifier can ignore the large number of noise
words in a document. Thus the classifier has a small model size and is very
fast. However, owing to the use of context-sensitive features, the classifier
is very accurate. We report on experiences with the Reuters newswire
benchmark, the US Patent database, and web document samples from Yahoo!.
- Chakrabarti, Soumen, Martin van den Berg, and Byron Dom. 1999.
"Focused crawling: A new approach to topic-specific web resource discovery,"
Proceedings of the WWW8
Conference. <http://www8.org/w8-papers/5a-search-query/crawling/index.html>.
[abstract] The rapid growth of the
world-wide web poses unprecedented scaling challenges for general-purpose
crawlers and search engines. In this paper we describe a new hypertext
resource discovery system called a Focused Crawler. The goal of a focused
crawler is to selectively seek out pages that are relevant to a pre-defined
set of topics. The topics are specified not using keywords, but using
exemplary documents. Rather than collecting and indexing all accessible web
documents to be able to answer all possible ad-hoc queries, a focused crawler
analyzes its crawl boundary to find the links that are likely to be most
relevant for the crawl, and avoids irrelevant regions of the web. This leads
to significant savings in hardware and network resources, and helps keep the
crawl more up-to-date. To achieve such goal-directed crawling, we design two
hypertext mining programs that guide our crawler: a classifier that evaluates
the relevance of a hypertext document with respect to the focus topics, and a
distiller that identifies hypertext nodes that are great access points to many
relevant pages within a few links. We report on extensive focused-crawling
experiments using several topics at different levels of specificity. Focused
crawling acquires relevant pages steadily while standard crawling quickly
loses its way, even though they are started from the same root set. Focused
crawling is robust against large perturbations in the starting set of URLs. It
discovers largely overlapping sets of resources in spite of these
perturbations. It is also capable of exploring out and discovering valuable
resources that are dozens of links away from the start set, while carefully
pruning the millions of pages that may lie within this same radius. Our
anecdotes suggest that focused crawling is very effective for building
high-quality collections of web documents on specific topics, using modest
desktop hardware.
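[code sketch] The classifier-guided crawl loop described above can be
sketched as a best-first search over a priority queue: pages judged relevant
by the classifier are kept, and their out-links inherit the parent's
relevance as crawl priority. The fetcher, link extractor, and relevance
classifier below are caller-supplied stand-ins, and the distiller component
of the paper is omitted:

    import heapq

    def focused_crawl(seeds, fetch_page, extract_links, relevance,
                      limit=100, threshold=0.5):
        """Best-first crawl guided by a relevance score in [0, 1]."""
        frontier = [(-1.0, url) for url in seeds]   # max-heap via negation
        heapq.heapify(frontier)
        visited, collected = set(), []
        while frontier and len(collected) < limit:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            text = fetch_page(url)
            score = relevance(text)
            if score >= threshold:
                collected.append((url, score))
                for link in extract_links(url, text):
                    if link not in visited:
                        # expand outward from relevant pages first
                        heapq.heappush(frontier, (-score, link))
        return collected

    # toy usage over an in-memory "web"
    pages = {"a": "soccer soccer", "b": "soccer politics", "c": "politics"}
    links = {"a": ["b"], "b": ["c"], "c": []}
    score = lambda t: t.split().count("soccer") / max(len(t.split()), 1)
    print(focused_crawl(["a"], pages.get, lambda u, t: links[u], score))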
- Chen, Hsinchun,
Chris Schuffels, and Rich Orwig. 1996. "Internet Categorization and Search: A
Self-Organizing Approach", Journal of Visual Communication and Image
Representation 7(1): 88-102. Available in Postscript format from
Internet: <http://ai.bpa.arizona.edu/papers/PS/som95.ps>.
[abstract] The problems of information
overload and vocabulary differences have become more pressing with the
emergence of the increasingly more popular Internet services. The main
information retrieval mechanisms provided by the prevailing Internet WWW
software are based on either keyword search (e.g., Lycos server at CMU, Yahoo
server at Stanford) or hypertext browsing (e.g., Mosaic and Netscape). This
research aims to provide an alternative concept-based categorization and
search capability for WWW servers based on selected machine learning
algorithms.
Our proposed approach, which is grounded on automatic textual
analysis of Internet documents (homepages), attempts to address the Internet
search problem by first categorizing the content of Internet documents. We
report results of our recent testing of a multi-layered neural network
clustering algorithm employing the Kohonen self-organizing feature map to
categorize (classify) Internet homepages according to their content. The
category hierarchies created could serve to partition the vast Internet
services into subject-specific categories and databases and improve Internet
keyword searching and/or browsing.
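[code sketch] A minimal Kohonen self-organizing feature map of the kind the
abstract refers to can be written in a few lines of numpy. The grid size,
learning-rate schedule, and toy term-frequency vectors below are arbitrary
choices for illustration, not the parameters used in the paper:

    import numpy as np

    def train_som(docs, grid=(4, 4), epochs=50, lr0=0.5, radius0=2.0, seed=0):
        """Train a small self-organizing map on document vectors."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        weights = rng.random((rows, cols, docs.shape[1]))
        coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                      indexing="ij"), axis=-1)
        for epoch in range(epochs):
            lr = lr0 * (1 - epoch / epochs)
            radius = max(radius0 * (1 - epoch / epochs), 0.5)
            for x in docs[rng.permutation(len(docs))]:
                # best-matching unit: node whose weight vector is closest to x
                dists = np.linalg.norm(weights - x, axis=-1)
                bmu = np.unravel_index(np.argmin(dists), dists.shape)
                # pull the BMU and its grid neighbours toward the document
                grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
                influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                weights += lr * influence[..., None] * (x - weights)
        return weights

    def assign(docs, weights):
        """Map each document to the index of its best-matching unit."""
        flat = weights.reshape(-1, weights.shape[-1])
        return np.argmin(np.linalg.norm(flat[None, :, :] - docs[:, None, :],
                                        axis=-1), axis=1)

    docs = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 1], [0, 2, 2]], dtype=float)
    print(assign(docs, train_som(docs, grid=(2, 2))))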
- Chen, Hsinchun,
and Robin Sewell. 1999. High-Performance Digital Library Classification
Systems: From Information Retrieval to Knowledge Management. Digital
Libraries Initiative-Phase 2 Project Summary. <http://www.dli2.nsf.gov/projects/chen.pdf>.
[abstract] The proposed research aims to
develop an architecture and the associated techniques needed to automatically
generate classification systems from large domain-specific textual collections
and to unify them with manually created classification systems to assist in
effective digital library retrieval and analysis. Both algorithmic
developments and user evaluation in several sample domains will be conducted.
Scalable automatic clustering methods including Ward's clustering,
multi-dimensional scaling, latent semantic indexing, and the self-organizing
map will be developed and compared. Most of these algorithms, which are
computationally intensive, will be optimized based on the sparseness of common
keywords in textual document representations. Using parallel, high-performance
platforms as a time machine for simulation, we plan to parallelize and
benchmark these clustering algorithms for large-scale collections (on the
order of millions of documents) in several domains. Results of automatic
classification systems will be represented using several novel hierarchical
display methods.
- Cheng, Patrick T. K., and Albert K. W. Wu. 1995. "ACS: An
Automatic Classification System," Journal of Information Science
21(4): 289-299.
[abstract] In this paper, we introduce
ACS: an automatic classification system for school libraries. First, various
approaches towards automatic classification, namely (i) rule-based, (ii) browse
and search, and (iii) partial match, are critically reviewed. The central
issues of scheme selection, text analysis and similarity measures are
discussed. A novel approach towards detecting book-class similarity with
Modified Overlap Coefficient (MOC) is also proposed. Finally, the design and
implementation of ACS is presented. The test result of over 80% correctness in
automatic classification and a cost reduction of 75% compared to manual
classification suggest that ACS is highly adoptable.
- Clack, Chris, Johnny Farringdon, Peter Lidwell, and Tina Yu. 1997.
"Autonomous Document Classification for Business", Proceedings of the
First International Conference on Autonomous Agents: 201-208.
[abstract] We have developed a general
purpose information classification agent architecture and are applying it to
the problem of document classification and routing. Collaboration with Friends
of the Earth allows us to test our ideas in a non-academic context involving
high volumes of documents. We use the technique of genetic programming (GP) to
evolve classifying agents. This is a novel approach for document
classification, where each agent evolves a parse-tree representation of a
user's particular information need. The other unusual features of our research
are the longevity of our agents and the fact that they undergo a continual
training process; feedback from the user enables the agent to adapt to the
user's long-term information requirements.
- Dash, M., and H. Liu. 1997. "Feature Selection for
Classification", Intelligent Data
Analysis, 1(3). <http://www-east.elsevier.com/ida/browse/0103/ida00013/article.htm>.
[abstract] Feature selection has been the focus of
interest for quite some time and much work has been done. With the creation of
huge databases and the consequent requirements for good machine learning
techniques, new problems arise and novel approaches to feature selection are
in demand. This survey is a comprehensive overview of many existing methods
from the 1970's to the present. It identifies four steps of a typical feature
selection method, and categorizes the different existing methods in terms of
generation procedures and evaluation functions, and reveals hitherto
unattempted combinations of generation procedures and evaluation functions.
Representative methods are chosen from each category for detailed explanation
and discussion via example. Benchmark datasets with different characteristics
are used for comparative study. The strengths and weaknesses of different
methods are explained. Guidelines for applying feature selection methods are
given based on data types and domain characteristics. This survey identifies
the future research areas in feature selection, introduces newcomers to this
field, and paves the way for practitioners who search for suitable methods for
solving domain-specific real-world applications.
- Dolin, R., D. Agrawal, A. El Abbadi, and J. Pearlman. 1998. "Using
Automated Classification for Summarizing and Selecting Heterogeneous
Information", D-Lib Magazine, January 1998. <http://www.dlib.org/dlib/january98/dolin/01dolin.html>.
[abstract] Information retrieval over the Internet
increasingly requires the filtering of thousands of heterogeneous information
sources. As the number and variability of sources increases, new ways of
automatically summarizing, discovering, and selecting collections relevant to
a user's query are needed. One such method involves the use of classification
schemes, such as the Library of Congress Classification (LCC), within which a
collection may be represented based on its content, irrespective of the
structure of the actual data or documents. We are currently experimenting with
newsgroups as collections. We have built an initial prototype which
automatically classifies and summarizes newsgroups within the LCC.
- Ellman,
Jeremy. 1998. "Using the Generic Document Profile to Cluster Similar
Texts", Proceedings of Computational Linguistics UK (CLUK 97),
University of Sunderland. <http://osiris.sunderland.ac.uk/~cs0jel/cluk98.ps.gz>.
[abstract] The World Wide Web contains a huge
quantity of text that is notoriously inefficient to use. This work aims to
apply a text processing technique based on thesaurally derived lexical chains
to improve Internet Information Retrieval, where a lexical chain is a set of
words in a text that are related both by proximity and by relations derived from an
external lexical knowledge source such as WordNet, Roget's Thesaurus, LDOCE,
and so on.
Finding Information on the Internet is notoriously hard, even
when users have a clear focus to their queries. This situation is exacerbated
when users only have vague notions about the topics they wish to explore. This
could be remedied using Exemplar Texts, where an Exemplar Text is the ideal
model result for Web searches. Our problem is now transformed into one of
identifying similar texts.
The Generic Document Profile is designed to
allow the comparison of document similarity whilst being independent of
terminology and document length. It is simply a set of semantic categories
derived from Roget's thesaurus with associated weights. These weights are
based on lexical chain length and strength. A Generic Document Profile can be
compared to another using a Case Based Reasoning approach.
Case Based
Reasoning (CBR) is a problem solving method that seeks to solve existing
problems by reference to previous successful solutions. Here our Exemplar
Texts count as previous solutions (and in these experiments, MS Encarta is
used as the source of Exemplar Texts).
In CBR a query and examples are
usually represented as attribute value pairs. Thus to apply CBR to document
comparison, both the text acting as a query, and the documents to be compared
against need to be represented in equivalent terms. If this representation is
based on simple terms (ie word stems), the problem becomes hugely complex,
since there are many words in a language.
This would also be a fragile
approach, since semantically equivalent words would not count as equal.
However, if we represent a document as Roget categories, the problem becomes
tractable. We use Roget's thesaurus since there are 1024 main thesaurus
categories as opposed to 50,000+ synsets in WordNet. This approach is providing
interesting results, although it is subject to the well known Word Sense
Disambiguation problem. The paper will summarise current findings, cover some
implementation details, and describe the approach taken to full-text Word Sense
Disambiguation.
Hesperus is a system designed to address this problem.
Using an electronic encyclopedia as a source of subject defining texts,
queries are made to MetaCrawler. This searches several search engines in
parallel returning the best 10-20 links. Hesperus retrieves these web pages,
and computes their conceptual similarity to the Exemplar Text using a method
based on thesaurally determined lexical chains.
Initial results show that
users prefer the Hesperus page order to MetaCrawler's statistical ordering.
The technique consequently shows promise as a way to improve the effectiveness
of Web searching.
- Feldman, Ronen, and Haym Hirsh. 1996. "Mining
Associations in Text in the Presence of Background Knowledge",
Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD96): 343-346. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1996/kdd.ps>.
[abstract] This paper describes the FACT system
for knowledge discovery from text. It discovers associations - patterns of
co-occurrence - amongst keywords labeling the items in a collection of textual
documents. In addition, FACT is able to use background knowledge about the
keywords labeling the documents in its discovery process. FACT takes a
query-centered view of knowledge discovery, in which a discovery request is
viewed as a query over the implicit set of possible results supported by a
collection of documents, and where background knowledge is used to specify
constraints on the desired results of this query process. Execution of a
knowledge-discovery query is structured so that these background-knowledge
constraints can be exploited in the search for possible results. Finally,
rather than requiring a user to specify an explicit query expression in the
knowledge-discovery query language, FACT presents the user with a
simple-to-use graphical interface to the query language, with the language
providing a well-defined semantics for the discovery actions performed by a
user through the interface.
- Finch, S. 1994. "Exploiting Sophisticated Representations for
Classification," Proceedings of the Fourth Conference on Applied Natural
Language Processing: 65-71.
- Finch, S. 1995. "Partial Orders for Document Representation: A New
Methodology for Combining Document Features," Proceedings of the 18th
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval: 264-272.
- Gustavsson, Jonas. 1996. Text Categorization Using
Acquaintance. Available in HTML format from Internet: <http://www.student.nada.kth.se/~f92-jgu/C-uppsats/cup.html>.
[abstract] Acquaintance, which is an
n-gram based method for topic and language determination of texts which
require no prepared database or preconditioned texts is examined. An
implementation of the method which is slightly different from the one proposed
by Marc Damashek is described. This implementation is then used to separate
English texts from Swedish texts in a collection and to cluster the English
texts according to their topics. The possibility to use Acquaintance when
having to select one translation out of several alternatives in machine
translation is discussed. The method is, however, found to have several severe
drawbacks, reducing its versatility.
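[code sketch] The Acquaintance-style representation described above (see also
Huffman and Damashek below) boils down to comparing character n-gram
frequency vectors, which needs no dictionaries or language-specific
preprocessing. A much simplified version using plain cosine similarity over
normalized n-gram counts:

    from collections import Counter
    from math import sqrt

    def ngram_vector(text, n=4):
        """L2-normalized character n-gram frequency vector of a raw text."""
        text = " ".join(text.lower().split())       # collapse whitespace only
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        norm = sqrt(sum(c * c for c in counts.values()))
        return {g: c / norm for g, c in counts.items()}

    def similarity(a, b):
        """Cosine similarity (dot product of the normalized vectors)."""
        return sum(w * b.get(g, 0.0) for g, w in a.items())

    en1 = ngram_vector("text categorization with character n-grams")
    en2 = ngram_vector("categorization of text by n-gram statistics")
    sv = ngram_vector("en helt annan text om fotboll och politik")
    print(similarity(en1, en2), similarity(en1, sv))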
- Haas, Stephanie W., Jeremy Sugarman, and Helen R. Tibbo. 1996. "A
Text Filter for the Automatic Identification of Empirical Articles,"
Journal of the American Society for Information Science 47(2):
167-169.
- Hayes, Philip J. and Gail Koerner. 1993. "Intelligent Text
Technologies and Their Successful Use by the Information Industry," in Martha
E. Williams (ed.) Proceedings of the 14th National Online Meeting
(Medford, NJ: Learned Information, Inc): 189-196.
[abstract] The information industry is in
considerable flux. Online vendors must add value to the information they sell
to survive and prosper. Carnegie Group has been working for several years on
prepackaged and custom computer systems that add value to large volumes of
textual information, using both artificial intelligence and natural language
technologies. To aid in the indexing of on-line data, Carnegie Group provides
an intelligent text categorization shell, better known as TCS. Applications
supported by TCS include the indexing of news stories for Reuters' product
Textline and the routing of government security analyst information by HRB
Systems. For the creation of company, people and advertiser name indexes,
Carnegie Group provides the NameFinder - software, which supports application
development for newspaper, magazine, or on-line database index creation. Going
beyond identifying categories and specific references of data within a body of
text, fact extraction technologies and techniques can be used to create
summaries for abstracts and database filling purposes. Carnegie Group's fact
extraction technology was developed from a system that assisted journalists in
preparing earnings reports. This paper provides an overview of these Carnegie
Group intelligent text technologies and their successful use by information
providers.
- Huffman, S., and M. Damashek. 1994. "Acquaintance: A Novel
Vector-space N-gram Technique for Document Categorization," Proceedings
of the TREC 3.
- Iwayama, Makoto, and Takenobu Tokunaga. 1994. "A Probabilistic
Model for Text Categorization: Based on a Single Random Variable with Multiple
Values," Proceedings of the 4th Conference on Applied Natural Language
Processing: 162-167. Also technical report 94TR-0008, Department of
Computer Science, Tokyo Institute of Technology. [online] <ftp://ftp.cs.titech.ac.jp/pub/TR/94/94TR0008.ps.gz>.
[abstract] Text categorization is the
classification of documents with respect to a set of predefined categories. In
this paper, we propose a new probabilistic model for text categorization, that
is based on a Single random Variable with Multiple Values (SVMV).
Compared to previous probabilistic models, our model has the following
advantages; 1) it considers within-document term frequencies, 2) considers
term weighting for target documents, and 3) is less affected by having
insufficient training cases. We verify our model's superiority over the others
in the task of categorizing news articles from the "Wall Street Journal".
- Kalt, T. 1998. A New Probabilistic Model of Text
Classification and Retrieval. Technical report UM-CS-1998-018, University of
Massachusetts, Amherst. <http://cs-tr.cs.cornell.edu/Dienst/UI/1.0/Display/ncstrl.umassa_cs/UM-CS-1998-018>.
[abstract] This paper introduces the
multinomial model of text classification and retrieval. One important feature
of the model is that the tf statistic, which usually appears in probabilistic
IR models as a heuristic, is an integral part of the model. Another is that
the variable length of documents is accounted for, without either making a
uniform length assumption or using length normalization. The multinomial model
employs independence assumptions which are similar to assumptions made in
previous probabilistic models, particularly the binary independence model and
the 2-Poisson model. The use of simulation to study the model is described.
Performance of the model is evaluated on the TREC-3 routing task. Results are
compared with the binary independence model and with the simulation studies.
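[code sketch] The multinomial view described above treats a document as a bag
of term draws, so the term frequency enters the class score directly rather
than as a separate heuristic weight. A minimal (Laplace-smoothed, uniform
prior) illustration of that kind of scoring, not the model or experiments of
the report:

    import math
    from collections import Counter, defaultdict

    def train_multinomial(docs, labels, alpha=1.0):
        """Estimate P(term | class) with Laplace smoothing."""
        vocab = {t for d in docs for t in d.split()}
        counts = defaultdict(Counter)
        for d, c in zip(docs, labels):
            counts[c].update(d.split())
        model = {}
        for c, tc in counts.items():
            total = sum(tc.values()) + alpha * len(vocab)
            model[c] = {t: (tc[t] + alpha) / total for t in vocab}
        return model

    def classify(doc, model):
        """Pick the class maximizing sum_t tf(t, doc) * log P(t | class)."""
        tf = Counter(doc.split())
        def score(c):
            probs = model[c]
            floor = min(probs.values())       # crude fallback for unseen terms
            return sum(n * math.log(probs.get(t, floor)) for t, n in tf.items())
        return max(model, key=score)

    model = train_multinomial(
        ["stocks fell sharply", "stocks and bonds rallied", "the team won the match"],
        ["finance", "finance", "sports"])
    print(classify("bonds fell", model))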
- Kargupta, Hillol, I. Hamzaoglu, B. Stafford, V. Hanagandi, and K.
Buescher. 1997. PADMA: PArallel Data Mining Agents For Scalable Text
Classification. Technical report LAUR-97-3491. X Division, Los Alamos
National Laboratory.
[abstract] This paper introduces PADMA
(PArallel Data Mining Agents), a parallel agent based system for scalable text
classification. PADMA contains modules for (1) parallel data accessing
operations, (2) parallel hierarchical clustering, and (3) web-based data
visualization. This paper introduces the general architecture of PADMA and
presents a detailed description of its different modules.
- Kessler, Brett, Geoffrey Nunberg, and Hinrich Schütze. 1997.
"Automatic Detection of Text Genre," Proceedings of the 35th Annual
Meeting of the ACL and 8th Conference of the EACL: 32-38. Available in
Postscript format from Internet: <ftp://parcftp.xerox.com/pub/qca/genre/paper.acl97.ps.Z>.
[abstract] As the text databases available
to users become larger and more heterogeneous, genre becomes increasingly
important for computational linguistics as a complement to topical and
structural principles of classification. We propose a theory of genres as
bundles of facets, which correlate with various surface cues, and argue that
genre detection based on surface cues is as successful as detection based on
deeper structural properties.
- Kudenko, Daniel, and Haym Hirsh. 1998. "Feature
Generation for Sequence Categorization", Proceedings of the Fifteenth
National Conference on Artificial Intelligence (AAAI98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/aaai2.ps>.
[abstract] The problem of sequence categorization
is to generalize from a corpus of labeled sequences procedures for accurately
labeling future unlabeled sequences. The choice of representation of sequences
can have a major impact on this task, and in the absence of background
knowledge a good representation is often not known and straightforward
representations are often far from optimal. We propose a feature generation
method (called FGEN) that creates Boolean features that check for the presence
or absence of heuristically selected collections of subsequences. We show
empirically that the representation computed by FGEN improves the accuracy of
two commonly used learning systems (C4.5 and Ripper) when the new features are
added to existing representations of sequence data. We show the superiority of
FGEN across a range of tasks selected from three domains: DNA sequences, Unix
command sequences, and English text.
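[code sketch] FGEN itself uses a heuristic search over subsequence features;
as a much simpler stand-in, the sketch below takes the most frequent
contiguous length-k windows in the training sequences as candidate
subsequences and encodes each sequence as Boolean presence/absence flags
that a learner such as C4.5 or Ripper could consume:

    from collections import Counter

    def select_subsequences(sequences, k=2, n_features=10):
        """Candidate subsequences: most frequent contiguous length-k windows."""
        counts = Counter()
        for seq in sequences:
            counts.update(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
        return [s for s, _ in counts.most_common(n_features)]

    def boolean_features(seq, subsequences):
        """1 if the subsequence occurs contiguously in seq, else 0."""
        def contains(sub):
            k = len(sub)
            return any(tuple(seq[i:i + k]) == sub for i in range(len(seq) - k + 1))
        return [int(contains(sub)) for sub in subsequences]

    # toy Unix command sequences
    train = [s.split() for s in
             ["cd ls cat ls cd", "cd ls cd ls cat", "vi make gcc make vi"]]
    feats = select_subsequences(train, k=2, n_features=5)
    print(feats)
    print([boolean_features(seq, feats) for seq in train])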
- Lam, Wai, and Chao Yang Ho. 1998. "Using A Generalized Instance
Set for Automatic Text Categorization", Proceedings of the 21st
International Conference on Research and Development in Information Retrieval
(SIGIR '98).
[abstract] We investigate several recent
approaches for text categorization under the framework of similarity-based
learning. They include two families of text categorization techniques, namely
the k-nearest neighbor (k-NN) algorithm and linear classifiers. After
identifying the weakness and strength of each technique, we propose a new
technique known as the generalized instance set (GIS) algorithm by unifying
the strengths of k-NN and linear classifiers and adapting to characteristics
of text categorization problems. We also explore some variants of our GIS
approach. We have implemented our GIS algorithm, the ExpNet algorithm, and
some linear classifiers. Extensive experiments have been conducted on two
common document corpora, namely the OHSUMED collection and the Reuters-21578
collection. The results show that our new approach outperforms the latest k-NN
approach and linear classifiers in all experiments.
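[code sketch] As background for the entry above, a bare-bones k-nearest-neighbour
text categorizer (one of the two families the paper starts from, not the GIS
algorithm itself) can be written directly over normalized term-frequency
vectors:

    from collections import Counter
    from math import sqrt

    def tf_vector(text):
        counts = Counter(text.lower().split())
        norm = sqrt(sum(v * v for v in counts.values()))
        return {t: v / norm for t, v in counts.items()}

    def cosine(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    def knn_classify(query, training, k=3):
        """Rank training documents by cosine similarity and let the
        top k vote, weighted by similarity."""
        q = tf_vector(query)
        neighbours = sorted(((cosine(q, tf_vector(d)), c) for d, c in training),
                            reverse=True)[:k]
        votes = Counter()
        for sim, c in neighbours:
            votes[c] += sim
        return votes.most_common(1)[0][0]

    training = [("stock market rally", "finance"), ("bond prices fall", "finance"),
                ("championship game tonight", "sports"), ("team wins the cup", "sports")]
    print(knn_classify("market prices fall", training, k=3))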
- Larkey, Leah, and W. Bruce Croft. 1996. "Combining Classifiers in
Text Categorization," Proceedings of the 19th International Conference
on Research and Development in Information Retrieval (SIGIR '96):
289-297. Available in Postscript format from Internet: <http://cobar.cs.umass.edu/info/psfiles/irpubs/combo.ps.gz>.
[abstract] Three different types of classifiers
were investigated in the context of a text categorization problem in the
medical domain: the automatic assignment of ICD9 codes to dictated inpatient
discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian
independence classifiers were applied individually and in combination. A
combination of different classifiers produced better results than any single
type of classifier. For this specific medical categorization problem, new
query formulation and weighting methods used in the k-nearest-neighbor
classifier improved performance.
- Lehnert, Wendy, S. Soderland, D. Aronow, F. Feng, and A. Shmueli.
199?. "Inductive Text Classification for Medical Applications,"
Proceedings of the 19th International Conference on Research and
Development in Information Retrieval (SIGIR '96): 289-297. Available in
HTML format from Internet: <http://cobar.cs.umass.edu/medical/wl_jetai.html>.
[abstract] Text classification poses a significant
challenge for knowledge-based technologies because it touches on all the
familiar demons of artificial intelligence: the knowledge engineering
bottleneck, problems of scale, easy portability across multiple applications,
and cost-effective system construction. Information retrieval (IR)
technologies traditionally avoid all of these issues by defining a document in
terms of a statistical profile of its lexical items. The IR community is
willing to exploit a superficial type of knowledge found in dictionaries and
thesauri, but anything that requires customization, application-specific
engineering, or any amount of manual tinkering is thought to be incompatible
with practical cost-effective system designs. In this paper we will challenge
those assumptions and show how machine learning techniques can operate as an
effective method for automated knowledge acquisition when it is applied to a
representative training corpus, and leveraged against a few hours of routine
work by a domain expert. We describe a fully implemented text classification
system operating on a medical testbed, and report experimental results based
on that testbed.
- Lewis, David
Dolan. 1992. "Feature Selection and Feature Extraction for Text
Categorization", Proceedings of Speech and Natural Language
Workshop: 212-217. Morgan Kaufmann. Available in Postscript format from
Internet: <http://www.research.att.com/~lewis/papers/lewis94b.ps>.
[abstract] The effect of selecting varying numbers
and kinds of features for use in predicting category membership was
investigated on the Reuters and MUC-3 text categorization data sets. Good
categorization performance was achieved using a statistical classifier and a
proportional assignment strategy. The optimal feature set size for word-based
indexing was found to be surprisingly low (10 to 15 features) despite the
large training sets. The extraction of new text features by syntactic analysis
and feature clustering was investigated on the Reuters data set. Syntactic
indexing phrases, clusters of these phrases, and clusters of words were all
found to provide less effective representations than individual words.
- Lewis, David
Dolan. 1996. "Challenges in machine learning for text classification", In
Proceedings of the Ninth Annual Conference on Computational Learning
Theory, page 1, New York, 1996. ACM. <http://www.research.att.com/~lewis/papers/lewis96f.ps>.
- Lewis, David
Dolan. 1997. "A sequential Algorithm for Training Text Classifiers:
Corrigendum and Additional Data," SIGIR Forum 29(2): 13-19.
<http://www.research.att.com/~lewis/papers/lewis95g.ps>.
- Lewis, David
Dolan. 1997. "Information Retrieval and the Statistics of Large Data
Sets," Proceedings of the NRC Massive Data Sets Workshop:
143-147. <http://www.research.att.com/~lewis/papers/lewis96c.ps>.
[abstract] The way in which text is represented
has a strong impact on the performance of text classification (retrieval and
categorization) systems. We discuss the operation of text classification
systems, introduce a theoretical model of how text representation impacts
their performance, and describe how the performance of text classification
systems is evaluated. We then present the results of an experiment on
improving text representation quality, as well as an analysis of the results
and the directions they suggest for future research.
- Lewis, David
Dolan, and Marc Ringuette. 1994. "A Comparison of Two Learning Algorithms
for Text Categorization," Proceedings of the Third Annual Symposium on
Document Analysis and Information Retrieval: 81-93. Available in
Postscript format from Internet: <http://www.research.att.com/~lewis/papers/lewis94b.ps>.
[abstract] Algorithms for training Bayesian
independence classifiers and decision trees were tested on two text
categorization data sets. Both algorithms allow adjustable tradeoffs between
recall and precision and have similar classification effectiveness. The
decision tree method, while slower, produces a classifier that is easier to
understand and in one case revealed an unsuspected chronological variation in
the category definitions. Maximum effectiveness is reached for both algorithms
when the initial set of features is pruned using collection frequency and
mutual information. This supports previous suggestions that the stepwise
feature selection in decision tree algorithms can be aided by prefiltering the
feature set.
- Lewis, David
Dolan, Robert E. Schapire, James P. Callan, and Ron Papka. 1996. "Training
Algorithms for Linear Text Classifiers", Proceedings of the 19th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval: 298-306. <http://www.research.att.com/~lewis/papers/lewis96d.ps>.
[abstract] Systems for text retrieval, routing,
categorization and other IR tasks rely heavily on linear classifiers. We
propose that two machine learning algorithms, the Widrow-Hoff and EG
algorithms, be used in training linear text classifiers. In contrast to most
IR methods, theoretical analysis provides performance guarantees and guidance
on parameter settings for these algorithms. Experimental data is presented
showing Widrow-Hoff and EG to be more effective than the widely used Rocchio
algorithm on several categorization and routing tasks.
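[code sketch] The Widrow-Hoff (least-mean-squares) update the abstract refers
to adjusts a weight vector after each training document in proportion to the
prediction error. A small numpy sketch with made-up data (the EG variant,
which uses multiplicative updates, is not shown):

    import numpy as np

    def widrow_hoff(X, y, lr=0.05, epochs=20, seed=0):
        """Train a linear classifier with the Widrow-Hoff (LMS) rule:
        w <- w - 2 * lr * (w.x - y) * x for each example."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                error = w @ X[i] - y[i]
                w -= 2 * lr * error * X[i]
        return w

    # toy term-count vectors; y = 1 for the "finance" category
    X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
    y = np.array([1, 1, 0, 0], dtype=float)
    w = widrow_hoff(X, y)
    print((X @ w > 0.5).astype(int))    # predicted category membership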
- Li, Hang, and Kenji Yamanishi. 1997. "Document Classification
Using a Finite Mixture Model," Proceedings of the ACL/EACL-97
Available in Postscript format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9705005>.
[abstract] We propose a new method of classifying
documents into categories. The simple method of conducting hypothesis testing
over word-based distributions in categories suffers from the data sparseness
problem. In order to address this difficulty, Guthrie et al. have developed a
method using distributions based on hard clustering of words, i.e., in which a
word is assigned to a single cluster and words in the same cluster are treated
uniformly. This method might, however, degrade classification results, since
the distributions it employs are not always precise enough for representing
the differences between categories. We propose here the use of soft clustering
of words, i.e., in which a word can be assigned to several different clusters
and each cluster is characterized by a specific word probability distribution.
We define for each document category a finite mixture model, which is a linear
combination of the probability distributions of the clusters. We thereby treat
the problem of classifying documents as that of conducting statistical
hypothesis testing over finite mixture models. In order to accomplish this
testing, we employ the EM algorithm which helps efficiently estimate
parameters in a finite mixture model. Experimental results indicate that our
method outperforms not only the method using distributions based on hard
clustering, but also the method using word-based distributions and the method
based on cosine-similarity.
- Macskassy, Sofus, Arunava Banerjee, Brian Davison, and Haym Hirsh. 1998. "Human
Performance on Clustering Web Pages: A Preliminary Study", Proceedings
of the Fourth International Conference on Knowledge Discovery and Data Mining
(KDD98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/kdd2.ps>.
[abstract] With the increase in information on the
World Wide Web it has become difficult to quickly find desired information
without using multiple queries or using a topic-specific search engine. One
way to help in the search is by grouping HTML pages together that appear in
some way to be related. In order to better understand this task, we performed
an initial study of human clustering of web pages, in the hope that it would
provide some insight into the difficulty of automating this task. Our results
show that subjects did not cluster identically; in fact, on average, any two
subjects had little similarity in their webpage clusters. We also found that
subjects generally created rather small clusters, and those with access only
to URLs created fewer clusters than those with access to the full text of each
web page. Generally the overlap of documents between clusters for any given
subject increased when given the full text, as did the percentage of documents
clustered. When analyzing individual subjects, we found that each had
different behavior across queries, both in terms of overlap, size of clusters,
and number of clusters. These results provide a sobering note on any quest for
a single clearly correct clustering method for web pages.
- McCallum, Andrew,
Kamal Nigam, Jason Rennie, and
Kristie Seymore. 1999. "Building Domain-Specific Search Engines with Machine Learning
Techniques", AAAI-99 Spring Symposium. <http://www.cs.cmu.edu/~mccallum/papers/cora-aaaiss99.ps.gz>.
[abstract] Domain-specific search engines are
growing in popularity because they offer increased accuracy and extra
functionality not possible with the general, Web-wide search engines. For
example, www.campsearch.com allows complex queries by age-group, size,
location and cost over summer camps. Unfortunately these domain-specific
search engines are difficult and time-consuming to maintain. This paper
proposes the use of machine learning techniques to greatly automate the
creation and maintenance of domain-specific search engines. We describe new
research in reinforcement learning, information extraction and text
classification that enables efficient spidering, identifying informative text
segments, and populating topic hierarchies. Using these techniques, we have
built a demonstration system: a search engine for computer science research
papers. It already contains over 50,000 papers and is publicly available at
www.com.justresearch.com.
- McCallum, Andrew,
and Kamal Nigam. 1998. "A
Comparison of Event Models for Naive Bayes Text Classification", AAAI-98
Workshop on Learning for Text Categorization. <http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps>.
- McCallum, Andrew,
and Kamal Nigam. 1998. "Employing
EM in Pool-Based Active Learning for Text Classification",
ICML-98. <http://www.cs.cmu.edu/~mccallum/papers/emactive-icml98.ps.gz>.
- Mostafa, J., L. M. Quiroga, M. Palakal. 1998. "Filtering Medical
Documents Using Automated and Human Classification Methods", Journal of
the American Society for Information Science, 49(14): 1304-1318.
[abstract] The goal of this research is to clarify
the role of document classification in information filtering. An important
function of classification, in managing computational complexity, is described
and illustrated in the context of an existing filtering system. A parameter
called classification homogeneity is presented for analyzing unsupervised
automated classification by employing human classification as a control. Two
significant components of the automated classification approach, vocabulary
discovery and classification scheme generation, are described in detail.
Results of classification performance revealed considerable variability in the
homogeneity of automatically produced classes. Based on the classification
performance, different types of interest profiles were created. Subsequently,
these profiles were used to perform filtering sessions. The filtering results
showed that with increasing homogeneity, filtering performance improves, and,
conversely, with decreasing homogeneity, filtering performance degrades.
- Moens, Marie-Francine, and Caroline Uyttendaele. 1997. "Automatic
Text Structuring and Categorization as a First Step in Summarizing Legal
Cases," Informatin Processing & Management 33(6): 727-737.
[abstract] The SALOMON system automatically
summarizes Belgian criminal cases in order to improve access to the large
number of existing and future court decisions. SALOMON extracts relevant text
units from the case text to form a case summary. Such a case profile
facilitates the rapid determination of the relevance of the case or may be
employed in text search. In a first important abstracting step SALOMON
performs an initial categorization of legal criminal cases and structures the
case text into separate legally relevant and irrelevant components. A text
grammar represented as a semantic network is used to automatically determine
the category of the case and its components. In this way, we are able to
extract from the case general data and to identify text portions relevant for
further abstracting. It is argued that prior knowledge of the text structure
and its indicative cues may support automatic abstracting. A text grammar is a
promising form for representing the knowledge involved.
- Ng, Hwee Tou, Wei Boon Goh, and Kok Leong Low. 1997. "Feature
Selection, Perceptron Learning, and a Usability Case Study for Text
Categorization," Proceedings of the 20th International Conference on
Research and Development in Information Retrieval (SIGIR '97): 67-73.
[abstract] In this paper, we describe an automated
learning approach to text categorization based on perceptron learning and a
new feature selection metric, called correlation coefficient. Our approach has
been tested on the standard Reuters text categorization collection. Empirical
results indicate that our approach outperforms the best published results on
this Reuters collection. In particular, our new feature selection method
yields considerable improvement. We also investigate the usability of our
automated learning approach by actually developing a system that categorizes
texts into a tree of categories. We compare the accuracy of our learning
approach to a rule-based, expert system approach that uses a text
categorization shell built by Carnegie Group. Although our automated learning
approach still gives a lower accuracy, by appropriately incorporating a set of
manually chosen words to use as features, the combined, semi-automated
approach yields accuracy close to the rule-based approach.
- Nigam, Kamal, John
Lafferty, and Andrew McCallum.
1999. "Using Maximum Entropy for Text Classification", To appear in
IJCAI-99 Workshop on Machine Learning for Information Filtering,
1999. <http://www.cs.cmu.edu/~knigam/papers/maxent-ijcaiws99.ps>.
[abstract] This paper proposes the use of maximum
entropy techniques for text classification. Maximum entropy is a probability
distribution estimation technique widely used for a variety of natural
language tasks, such as language modeling, part-of-speech tagging, and text
segmentation. The underlying principle of maximum entropy is that without
external knowledge, one should prefer distributions that are uniform.
Constraints on the distribution, derived from labeled training data, inform
the technique where to be minimally non-uniform. The maximum entropy
formulation has a unique solution which can be found by the improved iterative
scaling algorithm. In this paper, maximum entropy is used for text
classification by estimating the conditional distribution of the class
variable given the document. In experiments on several text datasets we
compare accuracy to naive Bayes and show that maximum entropy is sometimes
significantly better, but also sometimes worse. Much future work remains, but
the results indicate that maximum entropy is a promising technique for text
classification.
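[code sketch] Once the lambdas have been trained, classification under the
maximum entropy formulation described above is just the normalized
exponential of a weighted feature sum. The sketch below evaluates that
conditional distribution for hand-picked (hypothetical) features and weights;
the improved iterative scaling training step is not shown:

    import math

    def maxent_posterior(doc, classes, features, weights):
        """P(c | d) proportional to exp(sum_i lambda_i * f_i(d, c))."""
        scores = {c: math.exp(sum(l * f(doc, c)
                                  for l, f in zip(weights, features)))
                  for c in classes}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    # word-count features scaled by document length, each tied to one class
    def word_in_class(word, target):
        return lambda doc, c: (doc.split().count(word) / len(doc.split())
                               if c == target else 0.0)

    features = [word_in_class("ball", "sports"), word_in_class("market", "finance")]
    weights = [2.0, 2.0]          # hypothetical trained lambdas
    print(maxent_posterior("the ball flew over the net",
                           ["sports", "finance"], features, weights))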
- Nigam, Kamal, Andrew McCallum, Sebastian Thrun,
and Tom Mitchell. 1998. "Text
Classification from Labeled and Unlabeled Documents using EM", To appear in
Machine Learning, 1999. <http://www.cs.cmu.edu/~knigam/papers/emcat-mlj99.ps>.
[abstract] This paper shows that the accuracy of
learned text classifiers can be improved by augmenting a small number of
labeled training documents with a large pool of unlabeled documents. This is
important because in many text classification problems obtaining training
labels is expensive, while large quantities of unlabeled documents are
readily available. We introduce an algorithm for learning from labeled and
unlabeled documents based on the combination of Expectation-Maximization (EM)
and a naive Bayes classifier. The algorithm first trains a classifier using
the available labeled documents, and probabilistically labels the unlabeled
documents. It then trains a new classifier using the labels for all the
documents, and iterates to convergence. This base EM procedure works well when
the data conform to the generative assumptions of the model. However these
assumptions are often violated in practice, and poor performance can result.
We present two extensions to the algorithm that improve classification
accuracy under these conditions: (1) a weighting factor to modulate the
contribution of the unlabeled data, and (2) the use of multiple mixture
components per class. Experimental results, obtained using text from three
different real-world tasks, show that the use of unlabeled data reduces
classification error by up to 30%.
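[code sketch] The basic EM procedure described above (train on the labeled
documents, probabilistically label the unlabeled pool, retrain on everything,
repeat) can be sketched with a naive Bayes classifier that accepts fractional
class weights. This is a toy illustration with invented documents, without
the paper's extensions (unlabeled-data weighting and multiple mixture
components per class):

    import math
    from collections import Counter, defaultdict

    def nb_train(docs, soft_labels, classes, alpha=1.0):
        """Multinomial naive Bayes from fractional class weights per document."""
        vocab = {t for d in docs for t in d.split()}
        term = {c: defaultdict(float) for c in classes}
        prior = {c: alpha for c in classes}
        for d, weights in zip(docs, soft_labels):
            for c, w in weights.items():
                prior[c] += w
                for t in d.split():
                    term[c][t] += w
        total_prior = sum(prior.values())
        model = {}
        for c in classes:
            total = sum(term[c].values()) + alpha * len(vocab)
            model[c] = ({t: (term[c][t] + alpha) / total for t in vocab},
                        prior[c] / total_prior)
        return model

    def nb_posterior(doc, model):
        logp = {c: math.log(p) + sum(math.log(probs.get(t, 1e-9))
                                     for t in doc.split())
                for c, (probs, p) in model.items()}
        m = max(logp.values())
        exp = {c: math.exp(v - m) for c, v in logp.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    def em_text(labeled, unlabeled, classes, iters=10):
        docs = [d for d, _ in labeled] + list(unlabeled)
        hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
        model = nb_train([d for d, _ in labeled], hard, classes)
        for _ in range(iters):
            soft = hard + [nb_posterior(d, model) for d in unlabeled]   # E-step
            model = nb_train(docs, soft, classes)                       # M-step
        return model

    labeled = [("stocks fell", "finance"), ("the team won", "sports")]
    unlabeled = ["stocks and bonds rallied", "the team lost the match",
                 "bonds fell sharply"]
    print(nb_posterior("bonds rallied", em_text(labeled, unlabeled,
                                                ["finance", "sports"])))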
- Nigam, Kamal, Andrew McCallum, Sebastian Thrun,
and Tom Mitchell. 1998. "Learning to Classify Text from Labeled and Unlabeled
Documents", AAAI-98. <http://www.cs.cmu.edu/~mccallum/papers/emcat-aaai98.ps.gz>.
[abstract] In many important text classification
problems, acquiring class labels for training documents is costly, while
gathering large quantities of unlabeled data is cheap. This paper shows that
the accuracy of text classifiers trained with a small number of labeled
documents can be improved by augmenting this small training set with a large
pool of unlabeled documents. We present a theoretical argument showing that,
under common assumptions, unlabeled data contain information about the target
function. We then introduce an algorithm for learning from labeled and
unlabeled text based on the combination of Expectation-Maximization with a
naive Bayes classifier. The algorithm first trains a classifier using the
available labeled documents, and probabilistically labels the unlabeled
documents; it then trains a new classifier using the labels for all the
documents, and iterates to convergence. Experimental results, obtained using
text from three different real-world tasks, show that the use of unlabeled
data reduces classification error by up to 33%.
- Norton, Steven W., and Haym Hirsh. 1992. "Classifier
Learning from Noisy Data as Probabilistic Evidence Combination",
Proceedings of the Tenth National Conference on Artificial Intelligence
(AAAI92): 141-146. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1992/aaai3.ps>.
[abstract] This paper presents an approach to
learning from noisy data that views the problem as one of reasoning under
uncertainty, where prior knowledge of the noise process is applied to compute
a posteriori probabilities over the hypothesis space. In preliminary
experiments this maximum a posteriori (MAP) approach exhibits a learning rate
advantage over the C4.5 algorithm that is statistically significant.
- O'Neil, P. 1997. "An incremental approach to text representation,
categorization, and retrieval", Proceedings of the Fourth International
Conference on Document Analysis and Recognition, 714-717.
[abstract] Efficient and accurate information
retrieval is a goal of just about everyone. Whether you are looking for
information on the Internet, a book or article in the library, satellite
imagery of missile silos, or a recipe for dinner, finding exactly what you
want or need, even if you know exactly what you are looking for, can be an
imposing and most difficult task. Many current techniques require an intimate
understanding of the actual processes involved. The method presented in this
paper provides for an automatic representation of text data by vectors, which
can then be manipulated to categorize and organize the data. Information can
be retrieved without knowledge of the underlying process. The user can ask for
information using normal discourse. This technology can also be applied to
data mining and visualization.
- Papka, Ron. 1999. "Learning Query Bias for Improved On-Line
Document Classification", a poster presentation at Snowbird 99 conference,
Snowbird, Utah, April 5-10, 1999. Also technical report IR-160, CIIR, UMASS. <http://cobar.cs.umass.edu/pubfiles/ir-160.ps>.
- Papka, Ron, and James Allan. 1998. "Document Classification Using
Multiword Features", Proceedings of the 1998 ACM 7th International
Conference on Information and Knowledge Management: 124-131.
[abstract] We investigate the use of multiword
query features to improve the effectiveness of text-retrieval systems that
accept natural-language queries. A relevance feedback process is explained
that expands an initial query with single and multiword features. The
multiword features are modelled as a set of words appearing within windows of
varying sizes. Our experimental results suggest that windows of larger span
yield improvements in retrieval over windows of smaller span. This result
gives rise to a query contraction process that prunes 25% of the features in
an expanded query with no loss in retrieval effectiveness.
- Riloff, E. 1993.
"Using Cases to Represent Context for Text Classification," Proceedings
of the Second International Conference on Information and Knowledge Management
(CIKM-93) : 105-113.
[abstract] Research on text classification has
typically focused on keyword searches and statistical techniques. Keywords
alone cannot always distinguish the relevant from the irrelevant texts and
some relevant texts do not contain any reliable keywords at all. Our approach
to text classification uses case-based reasoning to represent natural language
contexts that can be used to classify texts with extremely high precision. The
case base of natural language contexts is acquired automatically during
sentence analysis using a training corpus of texts and their correct relevancy
classifications. A text is represented as a set of cases and we classify a
text as relevant if any of its cases is deemed to be relevant. We rely on the
statistical properties of the case base to determine whether similar cases are
highly correlated with relevance for the domain. Experiments with the MUC-4
corpus suggest that case-based text classification can achieve very high
levels of precision and outperforms our previous algorithms based on relevancy
signatures.
- Riloff, E. 1994.
Information Extraction as a Basis for Portable Text Classification
Systems. Ph.D. Thesis, Computer Science Department, University of
Massachusetts, Amherst. Available from <http://www.cs.utah.edu/~riloff/psfiles/single-thesis.ps>.
- Riloff, E. 1995.
"Little Words Can Make a Big Difference for Text Classification,"
Proceedings of the 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: 130-136. Also
available from <http://www.cs.utah.edu/~riloff/psfiles/sigir95.ps>.
- Riloff, E. 1996.
"Automatically Generating Extraction Patterns from Untagged Text,"
Proceedings of the Thirteenth National Conference on Artificial
Intelligence (AAAI-96): 1044-1049. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai96.ps>.
- Riloff, E. 1996.
"Using Learned Extraction Patterns for Text Classification," in S. Wermter, E.
Riloff, and G. Scheler (eds), Connectionist, Statistical, and Symbolic
Approaches to Learning for Natural Language Processing.
Springer-Verlag, Berlin. pp. 275-289. Also available from <http://www.cs.utah.edu/~riloff/psfiles/ijcai-book-chapter.ps>.
- Riloff, E., and
J. Lorenzen. 1997. "Extraction-based Text Categorization: Generating
Domain-specific Role Relationships Automatically," in Tomek Strzalkowski
(ed.), Natural Language Information Retrieval. Springer-Verlag,
Berlin. Also available from <http://www.cs.utah.edu/~riloff/psfiles/psfiles/nlp-ir-chapter.ps>.
- Riloff, E., and J.
Shoen. 1995. "Automatically Acquiring Conceptual Patterns Without an Annotated
Corpus," Proceedings of the 18th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval:
130-136. Also available from <http://www.cs.utah.edu/~riloff/psfiles/sigir95.ps>.
- Riloff, E., and J.
Shoen. 1995. "Automatically Acquiring Conceptual Patterns Without an Annotated
Corpus," Proceedings of the Third Workshop on Very Large Corpora:
148-161. Also available from <http://www.cs.utah.edu/~riloff/psfiles/vlc-paper.ps>.
- Riloff, E., and W.
Lehnert. 1992. "Classifying Texts Using Relevancy Signatures,"
Proceedings of the Tenth National Conference on Artificial
Intelligence: 329-334.
- Riloff, E., and W.
Lehnert. 1994. "Information Extraction as a Basis for High-Precision Text
Classification," ACM Transactions on Information Systems12(3):
296-333. Also available from <http://www.cs.utah.edu/~riloff/psfiles/single-acm.ps>.
- Shklar, Leon, and Haym
Hirsh. 1997. "Imposing Bounds on the Number of Categories for Incremental
Concept Formation", In R. Greiner, T. Petsche, and S.J. Hanson, editors,
Computational Learning Theory and Natural Learning Systems,
Volume IV, pages 36-49. MIT Press. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1997/clnl.ps>.
- Takkinen, Juha. 1995. "An Adaptive Approach to Text Categorization
and Understanding," Proceedings of the 5th Annual IDA Conference on
Computer and Information Science: 39-42. Available in Postscript format
from Internet: <http://www.ida.liu.se/~juhta/publications/IDAconf95_paper.ps>.
Also in HTML format from Internet: <http://www.ida.liu.se/~juhta/publications/IDAconf95_paper.html>.
[abstract] Categorization of electronic documents
is discussed and the results of a preliminary study of techniques that people
use for categorizing electronic mail letters are presented. The results,
mainly descriptive, are used as a basis for a discussion of the feasibility of
an adaptive approach to text categorization and understanding. The
adaptability is supposed to come from a design of a system that uses knowledge
about documents in general, rather than any specific type of document, to
categorize documents. One implication of this approach is a text
categorization system which can easily be adapted to categorizing different
types of documents and perform automatic text categorization.
- Tokunaga, Takenobu, and Makoto Iwayama. 1994. Text
categorization based on weighted inverse document frequency. Technical
report 94TR-0001, Department of Computer Science, Tokyo Institute of
Technology. Available in Postscript format from Internet: <ftp://ftp.cs.titech.ac.jp/pub/TR/94/94TR0001.ps.gz>.
[abstract] This paper proposes a new term
weighting method called weighted inverse document frequency (WIDF). As its
name indicates, WIDF is an extension of IDF (inverse document frequency) to
incorporate the term frequency over the collection of texts. WIDF of a term in
a text is given by dividing the frequency of the term in the text by the sum
of the frequency of the term over the collection of texts. WIDF is applied to
the text categorization task and proved to be superior to the other methods.
The improvement of accuracy on IDF is 7.4% at the maximum.
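[example] The WIDF weighting defined in the abstract above can be read directly as a small function: the frequency of a term in one text divided by the term's total frequency over the whole collection. The sketch below is an illustrative rendering of that definition in Python, not the authors' implementation; the function name and the tokenized input format are assumptions.

    def widf(term, doc_tokens, collection_token_lists):
        """WIDF as described above: term frequency in this text divided by
        the summed frequency of the term over the whole collection."""
        tf_in_doc = doc_tokens.count(term)
        tf_in_collection = sum(tokens.count(term) for tokens in collection_token_lists)
        return tf_in_doc / tf_in_collection if tf_in_collection else 0.0

    # Usage: widf("bank", doc, [doc1, doc2, doc3]) where each doc is a list of tokens.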
- Wiebe, Janyce, Rebecca Bruce, and Lei Duan. 1997. "Probabilistic
Event Categorization," Proceedings of the Recent Advances in Natural
Language Processing (RANLP-97): 163-170. Available in Postscript format
from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9710008>.
[abstract] This paper describes the automation of
a new text categorization task. The categories assigned in this task are more
syntactically, semantically, and contextually complex than those typically
assigned by fully automatic systems that process unseen test data. Our system
for assigning these categories is a probabilistic classifier, developed with a
recent method for formulating a probabilistic model from a predefined set of
potential features. This paper focuses on feature selection. It presents a
number of fully automatic features. It identifies and evaluates various
approaches to organizing collocational properties into features, and presents
the results of experiments covarying type of organization and type of
property. We find that one organization is not best for all kinds of
properties, so this is an experimental parameter worth investigating in NLP
systems. In addition, the results suggest a way to take advantage of
properties that are low frequency but strongly indicative of a class. The
problems of recognizing and organizing the various kinds of contextual
information required to perform a linguistically complex categorization task
have rarely been systematically investigated in NLP.
- Yaari, Yaakov. 1997. "Segmentation of Expository Texts by
Hierarchical Agglomerative Clustering," Proceedings of the
RANLP'97. <ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996/UM-CS-1996-087.ps>.
[abstract] We propose a method for segmentation of
expository texts based on hierarchical agglomerative clustering. The method
uses paragraphs as the basic segments for identifying hierarchical discourse
structure in the text, applying lexical similarity between them as the
proximity test. Linear segmentation can be induced from the identified
structure through application of two simple rules. However the hierarchy can
be used also for intelligent exploration of the text. The proposed
segmentation algorithm is evaluated against an accepted linear segmentation
method and shows comparable results.
- Yang, Yiming. 1996.
"Sampling Strategies and Learning Efficiency in Text Categorization", AAAI
Spring Symposium on Machine Learning in Information Access 1996: 88-95. <http://www.cs.cmu.edu/~yiming/papers.yy/aaai_mlia96.ps>.
[abstract] This paper studies training set
sampling strategies in the context of statistical learning for text
categorization. It is argued sampling strategies favoring common categories is
superior to uniform coverage or mistake-driven approaches, if performance is
measured by globally assessed precision and recall. The hypothesis is
empirically validated by examining the performance of a nearest neighbor
classifier on training samples drawn from a pool of 235,401 training texts
with 29,741 distinct categories. The learning curves of the classifier are
analyzed with respect to the choice of training resources, the sampling
methods, the size, vocabulary and category coverage of a sample, and the
category distribution over the texts in the sample. A nearly-optimal
categorization performance of the classifier is achieved using a relatively
small training sample, showing that statistical learning can be successfully
applied to very large text categorization problems with affordable
computation.
- Yang, Yiming. 1995.
"Noise Reduction in a Statistical Approach to Text Categorization", ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR'95)
1995: 256-263. <http://www.cs.cmu.edu/~yiming/papers.yy/sigir95.ps>.
[abstract] This paper studies noise reduction for
computational efficiency improvements in a statistical learning method for
text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple
noise reduction strategies are proposed and evaluated, including: an aggressive
removal of "non-informative words" from texts before training; the use of a
truncated singular value decomposition to cut off noisy "latent semantic
structures" during training; the elimination of non-influential components in
the LLSF solution (a word-concept association matrix) after training. Text
collections in different domains were used for evaluation. Significant
improvements in computational efficiency without losing categorization
accuracy were evident in the testing results.
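[example] The second noise-reduction strategy above, truncating the singular value decomposition, can be illustrated generically: keep only the k largest singular values of the matrix and rebuild a rank-k approximation, discarding the noisier "latent semantic structures". This is a minimal NumPy sketch of that step under assumed names, not Yang's LLSF code.

    import numpy as np

    def truncated_svd(matrix, k):
        """Rank-k approximation of a term-document (or word-concept) matrix;
        the smaller singular values are cut off as noise."""
        u, s, vt = np.linalg.svd(matrix, full_matrices=False)
        return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]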
- Yang, Yiming, and
John Wilbur. 1996. "Using Corpus Statistics to Remove Redundant Words in Text
Categorization," Journal of the American Society for Information
Science 47(5): 357-369.
[abstract] This article studies aggressive word
removal in text categorization to reduce the noise in free texts and to
enhance the computational efficiency of categorization. We use a novel stop
word identification method to automatically generate domain specific stoplists
which are much larger than a conventional domain-independent stoplist. In our
tests with three categorization methods on text collections from different
domains/applications, significant numbers of words were removed without
sacrificing categorization effectiveness. In the test of the Expert Network
method on CACM documents, for example, an 87% removal of unique words reduced
the vocabulary of documents from 8,002 distinct words to 1,045 words, which
resulted in a 63% time savings and a 74% memory savings in the computations of
category ranking, with a 10% precision improvement on average over not using
word removal. It is evident in this study that automated word removal based on
corpus statistics has a practical and significant impact on the computational
tractability of categorization methods in large databases.
Text Categorization with Learning Methods
- Benkhalifa, M.; Bensaid, A.; Mouradi, A. 1999. "Text
categorization using the semi-supervised fuzzy c-means algorithm",
Proceedings of the 18th International Conference of the North American
Fuzzy Information (NAFIPS'99), 561-565.
[abstract] Text categorization (TC) is the
automated assignment of text documents to predefined categories based on
document contents. TC has become very important in the information retrieval
area, where information needs have tremendously increased with the rapid
growth of textual information sources such as the Internet. We compare, for
text categorization, two partially supervised (or semi-supervised) clustering
algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC)
algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM)
algorithm (M. Amine et al., 1996). This (semi-supervised) learning paradigm
falls somewhere between the fully supervised and the fully unsupervised
learning schemes, in the sense that it exploits both class information
contained in labeled data (training documents) and structure information
possessed by unlabeled data (test documents) in order to produce better
partitions for test documents. Our experiments make use of the Reuters-21578
database of documents and consist of a binary classification for each of the
ten most populous categories of the Reuters database. To convert the documents
into vector form, we experiment with different numbers of features, which we
select, based on an information gain criterion. We verify experimentally that
ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM)
algorithm. With a smaller number of features, ssFCM's performance is also
superior to that of ssAHC's. Finally ssFCM results in improved performance and
faster execution time as more weight is given to training documents.
- Berger, A. 1999.
"Error-correcting output coding for text classification", IJCAI'99:
Workshop on machine learning for information filtering. Stockholm,
Sweden. <http://www.cs.cmu.edu/~aberger/ps/ecoc.ps>.
[abstract] This paper applies error-correcting
output coding (ECOC) to the task of document categorization. ECOC, of recent
vintage in the AI literature, is a method for decomposing a multiway
classification problem into many binary classification tasks, and then
combining the results of the subtasks into a hypothesized solution to the
original problem. There has been much recent interest in the machine learning
community about algorithms which integrate "advice" from many subordinate
predictors into a single classifier, and error-correcting output coding is one
such technique. We provide experimental results on several real-world
datasets, extracted from the Internet, which demonstrate that ECOC can offer
significant improvements in accuracy over conventional classification
algorithms.
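[example] The ECOC decomposition described above can be summarized as: assign each category a binary codeword, train one binary classifier per codeword bit, and label a test document with the category whose codeword is closest in Hamming distance to the predicted bit string. The sketch below is a schematic illustration under assumed names (the code matrix and the scikit-learn base learner are placeholders, and each binary task is assumed to contain both classes); it is not Berger's system.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_ecoc(X, y, code_matrix):
        """Train one binary classifier per column (bit) of the code matrix;
        code_matrix[c, b] is the target bit for class c on binary task b."""
        return [LogisticRegression(max_iter=1000).fit(X, code_matrix[y, b])
                for b in range(code_matrix.shape[1])]

    def predict_ecoc(classifiers, X, code_matrix):
        """Assign the class whose codeword is nearest (Hamming) to the predicted bits."""
        bits = np.column_stack([clf.predict(X) for clf in classifiers])
        dists = np.array([[np.sum(row != cw) for cw in code_matrix] for row in bits])
        return dists.argmin(axis=1)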
- Boley, Daniel, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle
Hastings, George Karypis, Vipin
Kumar, Bamshad Mobasher, and Jerome Moore. 1999. "Document Categorization
and Query Generation on the World Wide Web Using WebACE", To appear in
AI Review. <ftp://ftp.cs.umn.edu/dept/users/kumar/aireview-agent.ps>.
[abstract] Text categorization is the task of
deciding whether a document belongs to a set of prespecified classes of
documents. Automatic classification schemes can greatly facilitate the process
of categorization. Categorization of documents is challenging, as the number
of discriminating words can be very large. Many existing algorithms simply
would not work with these many number of features. k-nearest neighbor (k-NN)
classification is an instance-based learning algorithm that has shown to be
very effective for a variety of problem domains including documents. The key
element of this scheme is the availability of a similarity measure that is
capable of identifying neighbors of a particular document. A major drawback of
the similarity measure used in k-NN is that it uses all features in computing
distances. In many document data sets, only smaller number of the total
vocabulary may be useful in categorizing documents. A possible approach to
overcome this problem is to learn weights for different features (or words in
document data sets). In this paper, we propose the Weight Adjusted k-Nearest
Neighbor (WAKNN) classification algorithm that is based on the k-NN
classification paradigm. In WAKNN, the weights of features are learned using
an iterative algorithm. In the weight adjustment step, the weight of each
feature is perturbed in small steps to see if the change improves the
classification objective function. The feature with the most improvement in
the objective function is identified and the corresponding weight is updated.
The feature weights are used in the similarity measure computation such that
important features contribute more in the similarity measure. Experiments on
several real life document data sets show the promise of WAKNN, as it
outperforms the state of the art classification algorithms such as C4.5,
RIPPER, Rainbow, PEBLS, and VSM.
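[example] The weight-adjustment loop sketched in the abstract above (perturb each feature weight in small steps and keep the single change that most improves the classification objective) can be reduced to a short iteration. The code below is an illustrative reduction that assumes cosine similarity and leave-one-out training accuracy as the objective; it is not the authors' WAKNN implementation.

    import numpy as np

    def loo_knn_accuracy(X, y, w, k=3):
        """Leave-one-out accuracy of weighted-cosine k-NN, used here as the
        objective the weight adjustment tries to improve."""
        Xw = X * w
        norms = np.linalg.norm(Xw, axis=1) + 1e-12
        sim = (Xw @ Xw.T) / np.outer(norms, norms)
        np.fill_diagonal(sim, -np.inf)
        correct = 0
        for i in range(len(y)):
            neighbors = np.argsort(sim[i])[-k:]
            votes = np.bincount(y[neighbors], minlength=y.max() + 1)
            correct += int(votes.argmax() == y[i])
        return correct / len(y)

    def adjust_weights(X, y, iterations=20, delta=0.5):
        """Per iteration, perturb every feature weight up and down by delta and
        keep only the single perturbation that improves the objective the most."""
        w = np.ones(X.shape[1])
        for _ in range(iterations):
            base = loo_knn_accuracy(X, y, w)
            best_gain, best_w = 0.0, None
            for j in range(X.shape[1]):
                for d in (delta, -delta):
                    trial = w.copy()
                    trial[j] = max(trial[j] + d, 0.0)
                    gain = loo_knn_accuracy(X, y, trial) - base
                    if gain > best_gain:
                        best_gain, best_w = gain, trial
            if best_w is None:
                break
            w = best_w
        return w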
- Cohen, William
W. 1995. "Learning to Classify English Text with ILP Methods", In L. De
Raedt (Ed.), Advances in Inductive Logic Programming, IOS Press,
1995. <http://www.research.att.com/~wcohen/postscript/ilp.ps>.
[abstract] Text categorization is the task
of classifying text into one of several predefined categories. In this paper
we will evaluate the effectiveness of several ILP methods for text
categorization, and also compare them to their propositional analogs. The
methods considered are FOIL, the propositional rule-learning system RIPPER,
and a first-order version of RIPPER called FLIPPER. We show that the benefit
of using a first-order representation in this domain is relatively modest; in
particular, the performance difference between FLIPPER and FOIL and their
propositional counterparts is quite small, compared to the differences between
FOIL and FLIPPER. However, a first-order representation seems to be
advantageous when high-precision classifiers are desirable.
- Cohen,
William W. 1995. "Text Categorization and Relational Learning,"
Proceedings of the Twelfth International Conference on Machine
Learning. Available from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/ml-95-ir.ps>.
- Cohen,
William W. 1996. "Learning Rules that Classify E-Mail," Proceedings
of the 1996 AAAI Spring Symposium on Machine Learning in Information
Access. Available from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/aaai-ss-96.ps>.
- Cohen,
William W., and Yoram Singer. 1996. "Context-Sensitive Learning Methods
for Text Categorization," Proceedings of the SIGIR-96. Available
from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/sigir-96.ps>.
- Cohen,
William W., and Haym
Hirsh. 1998. "Joins that Generalize: Text Classification Using WHIRL",
Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining (KDD98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/kdd1.ps>.
[abstract] WHIRL is an extension of relational
databases that can perform "soft joins" based on the similarity of textual
identifiers; these soft joins extend the traditional operation of joining
tables based on the equivalence of atomic values. This paper evaluates WHIRL
on a number of inductive classification tasks using data from the World Wide
Web. We show that although WHIRL is designed for more general similarity-based
reasoning tasks, it is competitive with mature inductive classification
systems on these classification tasks. In particular, WHIRL generally achieves
lower generalization error than C4.5, RIPPER, and several nearest-neighbor
methods. WHIRL is also fast---up to 500 times faster than C4.5 on some
benchmark problems. We also show that WHIRL can be efficiently used to select
from a large pool of unlabeled items those that can be classified correctly
with high confidence.
- Cohen,
William W., and Yoram
Singer. 1999. "Context-sensitive learning methods for text categorization",
ACM Transactions on Information Systems, 17(2), 141-173.
[abstract] Two recently implemented
machine-learning algorithms, RIPPER and sleeping-experts for phrases, are
evaluated on a number of large text categorization problems. These algorithms
both construct classifiers that allow the "context" of a word w to affect how
(or even whether) the presence or absence of w will contribute to a
classification. However, RIPPER and sleeping-experts differ radically in many
other respects: differences include different notions as to what constitutes a
context, different ways of combining contexts to construct a classifier,
different methods to search for a combination of contexts, and different
criteria as to what contexts should be included in such a combination. In
spite of these differences, both RIPPER and sleeping-experts perform extremely
well across a wide variety of categorization problems, generally outperforming
previously applied learning methods. We view this result as a confirmation of
the usefulness of classifiers that represent contextual information.
- Dagan, Ido, Yael Karov, and Dan Roth. 1997. "Mistake-Driven Learning
in Text Categorization," Proceedings of the Second
Conference on Empirical Methods in NLP (EMNLP-2). Available in
Postscript format from <http://xxx.lanl.gov/abs/cmp-lg/9706006>.
[abstract] Learning problems in the text
processing domain often map the text to a space whose dimensions are the
measured features of the text, e.g., its words. Three characteristic
properties of this domain are (a) very high dimensionality, (b) both the
learned concepts and the instances reside very sparsely in the feature space,
and (c) a high variation in the number of active features in an instance. In
this work we study three mistake-driven learning algorithms for a typical task
of this nature -- text categorization. We argue that these algorithms -- which
categorize documents by learning a linear separator in the feature space --
have a few properties that make them ideal for this domain. We then show that
a quantum leap in performance is achieved when we further modify the
algorithms to better address some of the specific characteristics of the
domain. In particular, we demonstrate (1) how variation in document length can
be tolerated by either normalizing feature weights or by using negative
weights, (2) the positive effect of applying a threshold range in training,
(3) alternatives in considering feature frequency, and (4) the benefits of
discarding features while training. Overall, we present an algorithm, a
variation of Littlestone's Winnow, which performs significantly better than
any other algorithm tested on this task using a similar feature set.
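[example] The mistake-driven, linear-separator setting the abstract describes can be illustrated with the plain positive-weight Winnow rule: predict with a weighted sum against a fixed threshold and update multiplicatively only when a mistake is made. The sketch below assumes binary features given as index sets and shows the textbook algorithm, not the authors' modified variant.

    def train_winnow(examples, n_features, promotion=2.0, demotion=0.5, epochs=5):
        """Winnow: multiplicative updates made only on mistakes.
        `examples` is a list of (active_feature_indices, label) with label in {0, 1}."""
        weights = [1.0] * n_features
        theta = float(n_features)          # fixed threshold
        for _ in range(epochs):
            for active, label in examples:
                score = sum(weights[i] for i in active)
                predicted = int(score >= theta)
                if predicted != label:     # mistake-driven: touch weights only on errors
                    factor = promotion if label == 1 else demotion
                    for i in active:
                        weights[i] *= factor
        return weights, theta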
- Goldberg, J.L. 1995. "CDM: an approach to learning in text
categorization", Proceedings of the Seventh International Conference on
Tools with Artificial Intelligence, 258-265.
[abstract] The category discrimination method
(CDM) is a new learning algorithm designed for text categorization. The
motivation is that there are statistical problems associated with natural
language text when it is applied as input to existing machine learning
algorithms (too much noise, too many features, skewed distribution). The bases
of the CDM are research results about the way that humans learn categories and
concepts vis-a-vis contrasting concepts. The essential formula is cue validity
borrowed from cognitive psychology, and used to select from all possible
single word-based features the 'best' predictors of a given category. The
hypothesis that CDM's performance exceeds two non-domain specific algorithms,
Bayesian classification and decision tree learners, is empirically tested.
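[example] Cue validity, the measure the abstract borrows from cognitive psychology, is usually read as the conditional probability of a category given a word, estimated here from document frequencies; the highest-scoring words per category are kept as predictors. The sketch below is one plausible reading of that idea, not necessarily the exact CDM formulation.

    from collections import Counter

    def cue_validity(docs):
        """docs: list of (set_of_words, category). Returns a dict mapping
        (word, category) to the estimate P(category | word)."""
        word_count, word_cat_count = Counter(), Counter()
        for words, cat in docs:
            for w in words:
                word_count[w] += 1
                word_cat_count[(w, cat)] += 1
        return {wc: n / word_count[wc[0]] for wc, n in word_cat_count.items()}

    def best_predictors(docs, category, top_n=10):
        """Select the single-word features with the highest cue validity for a category."""
        cv = cue_validity(docs)
        scored = [(w, v) for (w, c), v in cv.items() if c == category]
        return sorted(scored, key=lambda x: -x[1])[:top_n]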
- Esposito, F., D. Malerba, L. Di Pace, and P. Leo. 1999. "A
Learning Intermediary for the Automated Classification of Web Pages", ICML-99
Workshop: Machine Learning in Text Data Analysis. <http://www-ai.ijs.si/DunjaMladenic/ICML99/DonatoFinal.ps>.
[abstract] The development of adaptive tools for
computer-supported collaborative work is a challenging application for machine
learning. In the paper, some problems concerning the implementation of an
adaptive intermediary performing content-based classification of web pages are
presented, and the adopted solutions are described. Information considered by
the intermediary is both the textual contents of web pages and the layout
structure defined by HTML tags. The representation language adopted for web
pages is the bag-of-words, where words are selected from training documents by
means of a novel scoring measure. Three different classification models are
empirically compared on a classification task: Decision trees, centroids, and
x-nearest-neighbor. Experimental results are reported and conclusions are
drawn on the relevance of the HTML layout structure for classification
purposes, on the significance of words selected by the scoring measure, as
well as on the performance of the different classifiers.
- Han, Eui-Hong
(Sam), and Vipin Kumar.
1999. Text Categorization Using Weight Adjusted k-Nearest Neighbor
Classification. Technical Report #99-019, Department of Computer
Science and Engineering, University of Minnesota. <ftp://ftp.cs.umn.edu/dept/users/kumar/waknn.ps>.
[abstract] Text categorization is the task of
deciding whether a document belongs to a set of prespecified classes of
documents. Automatic classification schemes can greatly facilitate the process
of categorization. Categorization of documents is challenging, as the number
of discriminating words can be very large. Many existing algorithms simply
would not work with these many number of features. k-nearest neighbor (k-NN)
classification is an instance-based learning algorithm that has shown to be
very effective for a variety of problem domains including documents. The key
element of this scheme is the availability of a similarity measure that is
capable of identifying neighbors of a particular document. A major drawback of
the similarity measure used in k-NN is that it uses all features in computing
distances. In many document data sets, only smaller number of the total
vocabulary may be useful in categorizing documents. A possible approach to
overcome this problem is to learn weights for different features (or words in
document data sets). In this paper, we propose the Weight Adjusted k-Nearest
Neighbor (WAKNN) classification algorithm that is based on the k-NN
classification paradigm. In WAKNN, the weights of features are learned using
an iterative algorithm. In the weight adjustment step, the weight of each
feature is perturbed in small steps to see if the change improves the
classification objective function. The feature with the most improvement in
the objective function is identified and the corresponding weight is updated.
The feature weights are used in the similarity measure computation such that
important features contribute more in the similarity measure. Experiments on
several real life document data sets show the promise of WAKNN, as it
outperforms the state of the art classification algorithms such as C4.5,
RIPPER, Rainbow, PEBLS, and VSM.
- Hearst, M.A., Dumais, S.T., Osman, E., Platt, J., and Scholkopf,
B. 1998. "Support vector machines", IEEE Intelligent Systems,
13(4), 18-28.
[abstract] My first exposure to Support Vector
Machines came this spring when I heard Sue Dumais present impressive results on
text categorization using this analysis technique. This issue's collection of
essays should help familiarize our readers with this interesting new racehorse
in the Machine Learning stable. Bernhard Scholkopf, in an introductory
overview, points out that a particular advantage of SVMs over other learning
algorithms is that it can be analyzed theoretically using concepts from
computational learning theory, and at the same time can achieve good
performance when applied to real problems. Examples of these real-world
applications are provided by Sue Dumais, who describes the aforementioned
text-categorization problem, yielding the best results to date on the Reuters
collection, and Edgar Osuna, who presents strong results on application to
face detection. Our fourth author, John Platt, gives us a practical guide and
a new technique for implementing the algorithm efficiently.
- Joachims,
T. 1998. "Text Categorization with Support Vector Machines: Learning with
Many Relevant Features", European Conference on Machine Learning
(ECML). <http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_98a.pdf>.
[abstract] This paper explores the use of
Support Vector Machines (SVMs) for learning text classifiers from examples. It
analyzes the particular properties of learning with text data and identifies
why SVMs are appropriate for this task. Empirical results support the
theoretical findings. SVMs achieve substantial improvements over the currently
best performing methods and behave robustly over a variety of different
learning tasks. Furthermore, they are fully automatic, eliminating the need
for manual parameter tuning.
- Jonsson,
Per. 1999. Evaluation of information loss when using WEBSOM for text
categorization. HS-IDA-EA-99-112. <http://www.ida.his.se/ida/htbin/exjobb/1999/HS-IDA-EA-99-112>.
[abstract] Today, we live in an information
technological society, where the Internet is an example of a valuable source
of textual information. These sources are rapidly growing and so are needs for
processing the information. Text classification brings the ability to present
information to the user of a system in a more comprehensible and structural
manner. In this report, the information loss when using WEBSOM - an
application that organizes documents on a two-dimensional map - is evaluated.
The hypothesis is that information is lost when high-dimensional vectors that
represent the documents are projected onto this two-dimensional map by WEBSOM.
This is empirically tested, wherein WEBSOM is implemented and criteria for
comparing distances are defined and used in the measuring, where we compare
distances on the map with distances measured directly between the vectors that
represent the documents. Results show that information is lost and we suggest
that the high-dimensional vectors should be used directly for similarity
measuring, so that the information loss is avoided.
- Joachims,
T. 1999. "Transductive Inference for Text Classification using Support
Vector Machines", Proceedings of the Sixteenth International Conference
on Machine Learning (ICML'99).
<http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99c.pdf>.
[abstract] This paper introduces Transductive
Support Vector Machines (TSVMs) for text classification. While regular Support
Vector Machines (SVMs) try to induce a general decision function for a
learning task, Transductive Support Vector Machines take into account a
particular test set and try to minimize misclassifications of just those
particular examples. The paper presents an analysis of why TSVMs are well
suited for text classification. These theoretical findings are supported by
experiments on three test collections. The experiments show substantial
improvements over inductive methods, especially for small training sets,
cutting the number of labeled training examples down to a twentieth on some
tasks. This work also proposes an algorithm for training TSVMs efficiently,
handling 10,000 examples and more.
- Mitchell, T. 1999.
"Supervised Text Learning from Unlabeled Data", ICML-99
Workshop: Machine Learning in Text Data Analysis. <http://www.cs.cmu.edu/~tom/iccs99.ps>.
[abstract] Most computational models of supervised
learning rely only on labeled training examples, and ignore the possible role
of unlabeled data. This is true for much research in machine learning,
including work on learning over text. This talk will explore the potential
role of unlabeled data in supervised learning over text. We present an
algorithm and experimental results demonstrating that unlabeled data can
significantly improve learning accuracy in problems such as learning to
classify web pages. We then identify the abstract problem structure that
enables the algorithm to successfully utilize this unlabeled data, and prove
that unlabeled data will boost learning accuracy for problems in this class.
The problem class we identify includes problems where the features describing
the examples are redundantly sufficient for classifying the example; a notion
we make precise in the paper. This problem class includes many learning
problems involving text, such as learning a semantic lexicon over noun
phrases, learning to classify web pages, and learning word sense
disambiguation rules. We conclude that current research on text learning
should consider more strongly the potential role of unlabeled data.
- Lam, S.L.Y., and Dik Lun Lee. 1999. "Feature reduction for
neural network based text categorization", Proceedings of the 6th
International Conference on Database Systems for Advanced Applications,
195-202.
[abstract] In a text categorization model using an
artificial neural network as the text classifier scalability is poor if the
neural network is trained using the raw feature space since textual data has
a very high-dimension feature space. We proposed and compared four
dimensionality reduction techniques to reduce the feature space into an input
space of much lower dimension for the neural network classifier. To test the
effectiveness of the proposed model, experiments were conducted using a subset
of the Reuters-22173 test collection for text categorization. The results
showed that the proposed model was able to achieve high categorization
effectiveness as measured by precision and recall. Among the four
dimensionality reduction techniques proposed, principal component analysis was
found to be the most effective in reducing the dimensionality of the feature
space.
- Mladenic,
Dunja. 1998. "Turning Yahoo into an Automatic Web-Page Classifier",
Proceedings of the 13th European Conference on Artificial Intelligence
ECAI'98: 473-474. <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwECAI98yr.ps.gz>.
[abstract] The paper describes an approach to
automatic Web-page classification based on the Yahoo hierarchy. Machine
learning techniques developed for learning on text data are used here on the
hierarchical classification structure. The high number of features is reduced
by taking into account the hierarchical structure and using feature subset
selection based on the method known from information retrieval. Documents are
represented as feature-vectors that include n-grams instead of including only
single words (unigrams) as commonly used when learning on text data. Based on
the hierarchical structure the problem is divided into subproblems, each
representing one of the categories included in the Yahoo hierarchy. The result
of learning is a set of independent classifiers, each used to predict the
probability that a new example is a member of the corresponding category.
Experimental evaluation on real-world data shows that the proposed approach
gives good results. For more than a half of testing examples a correct
category is among the 3 categories with the highest predicted probability.
- Mladenic,
Dunja. 1998. "Feature Subset Selection in Text-Learning", 10th
European Conference on Machine Learning (ECML98). <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwECML98.ps.gz>.
[abstract] This paper describes several known and
some new methods for feature subset selection on large text data. Experimental
comparison given on real-world data collected from Web users shows that
characteristics of the problem domain and machine learning algorithm should be
considered when feature scoring measure is selected. Our problem domain
consists of hyperlinks given in a form of small-documents represented with
word vectors. In our learning experiments naive Bayesian classifier was used
on text data. The best performance was achieved by the feature selection
methods based on the feature scoring measure called Odds ratio that is known
from information retrieval.
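[example] The Odds ratio measure referred to above is commonly written as log[P(w|pos)(1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg))], with the probabilities estimated from document frequencies in the positive and negative class. The sketch below uses that standard information-retrieval form with add-one smoothing; the exact estimates used in the experiments may differ.

    import math

    def odds_ratio(pos_docs_with_w, n_pos, neg_docs_with_w, n_neg):
        """Odds-ratio feature score with add-one smoothing, computed from the
        number of positive/negative documents containing word w and the class sizes."""
        p_pos = (pos_docs_with_w + 1) / (n_pos + 2)
        p_neg = (neg_docs_with_w + 1) / (n_neg + 2)
        return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))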
- Mladenic,
Dunja. 1998. "Feature Selection for Clasification Based on Text
Hierarchy", Working notes of Learning from Text and the Web, Conference
on Automated Learning and Discovery CONALD-98. <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwCONALD98.ps.gz>.
[abstract] This paper describes automatic document
categorization based on large text hierarchy. We handle the large number of
features and training examples by taking into account hierarchical structure
of examples and using feature selection for large text data. We experimentally
evaluate feature subset selection on real-world text data collected from the
existing Web hierarchy named Yahoo. In our learning experiments naive Bayesian
classifier was used on text data using feature-vector document representation
that includes n-grams instead of just single words (unigrams). Experimental
evaluation on real-world data collected from the Web shows that our approach
gives promising results and can potentially be used for document
categorization on the Web. Additionally the best result on our data is
achieved for relatively small feature subset, while for larger subset the
performance substantially drops. The best performance among six tested feature
scoring measures was achieved by the feature scoring measure called Odds ratio
that is known from information retrieval.
- Mladenic,
Dunja. 1998. Machine Learning on Non-homogeneous, Distributed Text
Data. PhD thesis, University of Ljubljana, Slovenia. <http://www.cs.cmu.edu/~TextLearning/pww/PhD.html>.
[abstract] This dissertation
proposes new machine learning methods where the corresponding learning problem
is characterized by a high number of features, unbalanced class distribution
and asymmetric misclassification costs. The input is given as a set of text
documents or their Web addresses (URLs). The induced target concept is
appropriate for the classification of new documents including shortened
documents describing individual hyperlinks. The proposed methods are based on
several new solutions.
Proposed is a new, enriched document
representation that extends the bag-of-words representation by adding word
sequences and document topic categories. Features that represent word
sequences are generated using a new efficient procedure. Features giving topic
categories are obtained from background knowledge constructed using the new
machine learning method for learning from class hierarchies. When learning
from class hierarchy, a high number of class values, examples and features,
are handled by (1) dividing a problem into subproblems based on the
hierarchical structure of class values and examples, (2) by applying feature
subset selection and (3) by pruning unpromising class values during
classification.
Several new feature scoring measures are
proposed as a result of comparison and analysis of different feature scoring
measures used in feature subset selection on text data. The new measures are
appropriate for text domains with several tens or hundreds of thousands of
features, and can handle unbalanced class distribution and asymmetric
misclassification costs.
Developed methods are suitable for
the classification of documents including shortened documents. We build
descriptions of hyperlinks, and treat these as shortened documents. Since each
hyperlink on the Web is pointing to some document, the classification of
hyperlinks (corresponding shortened documents) could be potentially improved
by using this information. We give the results of preliminary experiments for
learning in domains with mutually dependent class attributes. Training
examples are used for learning 'a next state function on the Web', where
document content (class attributes) is predicted from the hyperlink
(feature-vector) that points to the document. Document content we are
predicting is represented as a feature-vector, each feature being one of the
mutually dependent class attributes.
The proposed methods
and solutions are implemented and experimentally evaluated on real-world data
collected from the Web in three independent projects. It is shown that
document classification, categorization and prediction using the proposed
methods perform well on large, real-world domains.
The
experimental findings further indicate that the developed methods can
efficiently be used to support the analysis of large amounts of text data,
automatic document categorization and abstraction, document content prediction
based on the hyperlink content, classification of shortened documents,
development of user customized text-based systems, and user customized Web
browsing. As such, the proposed machine learning methods contribute to machine
learning and to related fields of text-learning, data mining, intelligent data
analysis, information retrieval, intelligent user interfaces, and intelligent
agents. Within machine learning, this thesis contributes an approach to
learning on large, distributed text data, learning on hypertext, and learning
from class hierarchies. Within computer science, it contributes to better
design of Web browsers and software assistants for people using the Web.
- Mooney, Raymond J., and Loriene Roy. 1999. "Content-Based Book
Recommending Using Learning for Text Categorization", Submission to
Fourth ACM Conference on Digital Libraries. <http://xxx.lanl.gov/abs/cs.DL/9902011>.
[abstract] Recommender systems improve access to
relevant products and information by making personalized suggestions based on
previous examples of a user's likes and dislikes. Most existing recommender
systems use social filtering methods that base recommendations on other users'
preferences. By contrast, content-based methods use information about an item
itself to make suggestions. This approach has the advantage of being able to
recommend previously unrated items to users with unique interests and to
provide explanations for its recommendations. We describe a content-based book
recommending system that utilizes information extraction and a
machine-learning algorithm for text categorization. Initial experimental
results demonstrate that this approach can produce accurate recommendations.
- Joachims,
T. 1997. Text Categorization with Support Vector Machines: Learning
with Many Relevant Features. LS8-Report, 23, Universitat Dortmund, LS
VIII-Report. <http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_97b.pdf>.
[abstract] This paper explores the use of
Support Vector Machines (SVMs) for learning text classifiers from examples. It
analyzes the particular properties of learning with text data and identifies
why SVMs are appropriate for this task. Empirical results support the
theoretical findings. SVMs achieve substantial improvements over the currently
best performing methods and they behave robustly over a variety of different
learning tasks. Furthermore, they are fully automatic, eliminating the need
for manual parameter tuning.
Topic Characterization
- Fisher, David E. 1994. Topic Characterization of Full Length
Texts Using Direct and Indirect Term Evidence. Technical report
CSD-94-809, Computer Science Department, University of California, Berkeley. Available in
Postscript format from Internet: <http://sunsite.berkeley.edu/Dienst/Repository/2.0/Body/ncstrl.ucb/CSD-94-809/postscript>.
[abstract] This project evaluates two families of algorithms that
can be used to automatically classify general texts within a set of conceptual
categories. The first family uses indirect evidence in the form of
term-category co-occurrence data. The second uses direct evidence based on the
senses of the terms, where a term's senses are designated by the categories
that it is a member of in a thesaurus. The direct evidence algorithms
incorporate varying degrees of indirect evidence as well. For these
experiments a set of 3864 conceptual categories were derived from the noun
hierarchy of WordNet, an on-line thesaurus. The co-occurrence data for the
associational and disambiguation algorithms was collected from a corpus of
3,711 AP newswire articles, comprising approximately 1.7 million words of
text. Each of the algorithms was applied to all of the articles in the AP
corpus, with their performance evaluated both qualitatively and
quantitatively. The results of these experiments show that both classes of
algorithms have potential as fully automatic text classifiers. The direct
methods produce qualitatively better classifications than the indirect ones
when applied to AP newswire texts. The direct methods also achieve both a
higher precision, 86.75% correctly classified (best case) versus 72.34%, and a
higher approximate recall. The experiments identify limiting factors on the
performance of the algorithms. The primary limitations stem from the quality
of the thesaural categories, which were derived automatically, and from the
performance of the term sense disambiguation algorithm. The former can be
addressed with human intervention, the latter with a larger training set for
the statistical database.
Hierarchical Text Categorization
- D'Alessio, Stephen, Aaron Kershenbaum, Keitha Murray, and Robert
Schiaffino. 1998. Hierarchical Text Categorization. <http://pride-i2.poly.edu/textmining/>.
[abstract] The problem of assigning documents to
categories in a hierarchically organized taxonomy is considered. Given a
training corpus of documents already placed in one or more categories,
vocabulary is extracted. The vocabulary, words that appear with high frequency
within a given category, characterizes each subject area by being associated
with nodes in the hierarchy. Each node's vocabulary is filtered and its words
assigned weights with respect to the specific category. Then, test documents
are scanned for the presence of this vocabulary and categories are ranked with
respect to the document based on the presence of terms from this vocabulary.
Finally, documents are assigned to categories based on these rankings.
Precision and recall are measured.
We present an algorithm for associating
words with individual categories within the hierarchy and demonstrate that
precision and recall can be significantly improved by solving the
categorization problem taking the hierarchy into account. We also show that
these results can be improved even further by intelligently selecting
intermediate categories in the hierarchy. Solving the problem iteratively,
moving downward from the root of the taxonomy to the leaf nodes, we improve
precision from 82% to 89% and recall from 82% to 87% on the much-studied
Reuters-21578 corpus with 135 categories organized in a three-level hierarchy
of categories.
- Iwayama, Makoto, and Takenobu Tokunaga. 1995. "Hierarchical
Bayesian Clustering for Automatic Text Classification," Proceedings of
IJCAI '95. Also technical report 95TR-0015, Department of Computer
Science, Tokyo Institute of Technology. <ftp://ftp.cs.titech.ac.jp/pub/TR/95/95TR0015.ps.gz>.
[abstract] Text classification, the
grouping of texts into several clusters, has been used as a means of improving
both the efficiency and the effectiveness of text retrieval/categorization. In
this paper we propose a hierarchical clustering algorithm that constructs a
set of clusters having the maximum Bayesian posterior probability, the
probability that the given texts are classified into clusters. We call the
algorithm Hierarchical Bayesian Clustering (HBC). The advantages of HBC are
experimentally verified from several viewpoints. (1) HBC can re-construct the
original clusters more accurately than do other non probabilistic algorithms.
(2) When a probabilistic text categorization is extended to a cluster-based
one, the use of HBC offers better performance than does the use of non
probabilistic algorithms.
[contents] 1. Introduction; 2.
Hierarchical Bayesian Clustering; 3. Comparison with Two Other Clustering
Algorithms; 4. Evaluation in Text Categorization; 5. Discussion
- Iwayama, Makoto, and Takenobu Tokunaga. 1995a. "Cluster-based Text
Categorization: A Comparison of Category Search Strategies," Proceedings
of the 18th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval: 273-280. Also technical report
95TR-0016, Department of Computer Science, Tokyo Institute of Technology.
[online] <ftp://ftp.cs.titech.ac.jp/pub/TR/95/95TR0016.ps.gz>.
[abstract] Text categorization can be
viewed as a process of category search, in which one or more categories for a
test document are searched for by using given training documents with known
categories. In this paper a cluster-based search with a probabilistic
clustering algorithm is proposed and evaluated on two data sets. The
efficiency, effectiveness, and noise tolerance of this search were confirmed
to be better than those of a full search, a category-based search, and a
cluster-based search with nonprobabilistic clustering.
[contents]
1. Introduction; 2. Category Search Strategies; 3. Hierarchical Bayesian
Clustering; 4. Experiments; 5. Results and Discussions; 6. Summary
- Kita, Kenji, Minoru Sasaki, Tai Xiao Ying. 1998. "Rule-Based
Hierarchical Document Categorization for the World Wide Web", Asia
Pacific Web Conference (APWeb98): 269-273. <http://www-a2k.is.tokushima-u.ac.jp/member/kita/EPrint/Kita/008.doc>.
[abstract] Document categorization, which
is defined as the classification of text documents into one of several fixed
classes or categories, has become important with the explosive growth of the
World Wide Web. The goal of the work described here is to automatically
categorize Web documents in order to enable effective retrieval of Web
information. Based on the rule learning algorithm RIPPER (Repeated Incremental
Pruning to Produce Error Reduction), we propose an efficient method for
hierarchical document categorization.
[contents] 1 Introduction; 2
Rule-Based Text Categorization; 3 Introducing Hierarchical Document
Categories; 4 Experimental Results; 5 Conclusions
- Koller, D.,
and Mehran Sahami. 1997. "Hierarchically Classifying Documents Using Very Few
Words", Proceedings of the 14th International Conference on
Machine Learning (ML), 170-178. <http://robotics.stanford.edu/~koller/papers/ml97.html>.
[abstract] The proliferation of topic
hierarchies for text documents has resulted in a need for tools that
automatically classify new documents within such hierarchies. Existing
classification schemes which ignore the hierarchical structure and treat the
topics as separate classes are often inadequate in text classification where
there is a large number of classes and a huge number of relevant features
needed to distinguish between them. We propose an approach that utilizes the
hierarchical topic structure to decompose the classification task into a set
of simpler problems, one at each node in the classification tree. As we show,
each of these smaller problems can be solved accurately by focusing only on a
very small set of features, those relevant to the task at hand. This set of
relevant features varies widely throughout the hierarchy, so that, while the
overall relevant feature set may be large, each classifier only examines a
small subset. The use of reduced feature sets allows us to utilize more
complex (probabilistic) models, without encountering many of the standard
computational and robustness difficulties.
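[example] The divide-and-conquer scheme outlined above (a small classifier at each internal node of the topic tree, each looking only at its own reduced feature set) can be sketched generically as a routing loop. The toy keyword classifiers below are placeholders; this is an illustration, not the Koller-Sahami system.

    def classify_hierarchically(doc_words, node):
        """Walk down the topic tree: at each internal node a local classifier,
        trained on that node's own small feature set, picks the child branch."""
        while node.get("children"):
            branch = node["classify"](doc_words)
            node = node["children"][branch]
        return node["topic"]

    # Illustrative two-level hierarchy with toy keyword-based local classifiers.
    tree = {
        "classify": lambda ws: "metals" if {"gold", "copper"} & ws else "grains",
        "children": {
            "metals": {"classify": lambda ws: "gold" if "gold" in ws else "copper",
                       "children": {"gold": {"topic": "gold"},
                                    "copper": {"topic": "copper"}}},
            "grains": {"topic": "grain"},
        },
    }
    print(classify_hierarchically({"gold", "mine"}, tree))   # -> gold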
- Lin, Shian-Hua, Chi-Sheng Shih, Meng Chang Chen, Jan-Ming Ho,
Ming-Tat Ko, and Yueh-Ming Huang. 1998. "Extracting Classification Knowledge
of Internet Documents with Mining Term Associations: a Semantic Approach",
Proceedings of the 21st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: 241-249.
[abstract] In this paper, we present a
system that extracts and generalizes terms from Internet documents to
represent classification knowledge of a given class hierarchy. We propose a
measurement to evaluate the importance of a term with respect to a class in
the class hierarchy, and denote it as support. With a given threshold, terms
with high supports are sifted as keywords of a class, and terms with low
supports are filtered out. To further enhance the recall of this approach,
Mining Association Rules technique is applied to mine the association between
terms. An inference model is composed of these association relations and the
previously computed supports of the terms in the class. To increase the recall
rate of the keyword selection process. we then present a polynomial-time
inference algorithm to promote a term, strongly associated to a known keyword,
to a keyword. According to our experiment results on the collected Internet
documents from the Yam search engine, we show that the proposed methods in the
paper contribute to refine the classification knowledge and increase the
recall of keyword selection.
- McCallum, Andrew,
Ronald Rosenfeld, Tom Mitchell, and Andrew Ng. 1998. "Improving Text
Classification by Shrinkage in a Hierarchy of Classes", ICML-98.
<http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz>.
- Sasaki, M., and Kita, K. 1998. "Rule-based text categorization
using hierarchical categories", 1998 IEEE International Conference on
Systems, Man, and Cybernetics, 2827 - 2830.
[abstract] Document categorization, which
is defined as the classification of text documents into one of several fixed
classes or categories, has become important with the explosive growth of the
World Wide Web. The goal of the work described here is to automatically
categorize Web documents in order to enable effective retrieval of Web
information. In this paper, based on the rule learning algorithm RIPPER (for
Repeated Incremental Pruning to Produce Error Reduction), we propose an
efficient method for hierarchical document categorization.
- Tanaka, Hideki. 1997. "An Efficient Document Clustering Algorithm
Based on the Topic Binder Hypothesis", Proceedings of the Natural
Language Processing Pacific Rim Symposium 1997 (NLPRS'97). <ftp://www.links.nectec.or.th/pub/NLPRS/paper/dana7r.ps.gz>.
[abstract] We present an efficient
document clustering algorithm that uses a term frequency vector for each
document instead of using a huge proximity matrix. The algorithm has the
following features:
1) it consumes relatively little memory space and runs
fast,
2) it produces a hierarchy in the form of a document classification
tree, and
3) the hierarchy obtained by the algorithm explicitly reveals the
collection structure.
We confirm these features and thus show the
algorithm's feasibility with clustering experiments in which we use two
collections of Japanese documents, the sizes of which are 83,099 and 14,701.
We also introduce two applications of this algorithm.
- Weigend, A.S.,
Wiener, E.D., and Pedersen, Jan O. 1999. "Exploiting Hierarchy in Text
Categorization", Information Retrieval, 1(3): 193-216. <http://www.stern.nyu.edu/~aweigend/Research/Papers/TextCategorization/hierarchy.ps>.
[abstract] With the recent dramatic
increase in electronic access to documents, text categorization--the task of
assigning topic(s) to a given document--has moved to the center of the
information sciences and knowledge management. Unfortunately, empirical
comparisons in the last few years have indicated that it is very hard to make
significant progress by focusing on the input space, e.g., by varying the
representation of the document. However, structure in the output space has
largely been neglected: Topics can be grouped together well according to their
meaning (e.g., gold, silver, copper are all metals). This paper presents a
hierarchical learning system that matches the hierarchical structure of the
topics. The key idea of this divide-and-conquer approach is to mimic the
semantic relations of knowledge, as opposed to trying to build one flat
inference model. We propose an architecture that is designed to accommodate
multiple topic assignments for each document. It has a clean probabilistic
interpretation that allows the topic prediction to be combined in a principled
way with information from other sources in decision support systems.
Evaluating the performance of a two-level implementation on the Reuters-22173
testbed of newswire articles shows significant improvement particularly for
rare classes. Since the first level of the architecture predicts the
probabilities of the meta-topic groups, the individual models for each topic
of the second level can focus on finer discriminations within the group. The
different parts of the hierarchical architecture can be trained independently;
this solves the otherwise very serious problem of scaling up to very large and
heterogeneous text collections such as the Web.
Text Categorization Using Knowledge Bases
- Buenaga, M., Gomez Hidalgo,
J., Diaz Agudo, B. 1997. "Using WordNet to Complement Training Information
in Text Categorization", 2nd International Conference on Recent Advances
in Natural Language Processing (RANLP), Tzigov Chark, Bulgaria,
September, 11th to 13th, 1997. <http://xxx.unizar.es/ps/cmp-lg/9709007>.
[abstract] Automatic Text Categorization (TC) is a complex and
useful task for many natural language applications, and is usually performed
through the use of a set of manually classified documents, a training
collection. We suggest the utilization of additional resources like lexical
databases to increase the amount of information that TC systems make use of,
and thus, to improve their performance. Our approach integrates WordNet
information with two training approaches through the Vector Space Model. The
training approaches we test are the Rocchio (relevance feedback) and the
Widrow-Hoff (machine learning) algorithms. Results obtained from evaluation
show that the integration of WordNet clearly outperforms training approaches,
and that an integrated technique can effectively address the classification of
low frequency categories.
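[example] The Rocchio (relevance feedback) training the abstract pairs with WordNet builds one prototype vector per category: a weighted centroid of the positive training documents minus a down-weighted centroid of the negatives, with new documents assigned to the category whose prototype is most cosine-similar. The sketch below is the textbook form under assumed parameter names and omits the WordNet expansion step described in the paper.

    import numpy as np

    def rocchio_prototypes(X, y, beta=16.0, gamma=4.0):
        """One prototype per category: beta * centroid(positives)
        minus gamma * centroid(negatives), clipped at zero."""
        prototypes = {}
        for c in np.unique(y):
            pos, neg = X[y == c], X[y != c]
            proto = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
            prototypes[c] = np.clip(proto, 0.0, None)
        return prototypes

    def rocchio_classify(x, prototypes):
        """Assign the category whose prototype has the highest cosine similarity."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(prototypes, key=lambda c: cos(x, prototypes[c]))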
- Diaz, A., M. Buenaga, L. Urena, and M. Garcia-Vega. 1998.
"Integrating Linguistic Resources in an Uniform Way for Text Classification
Tasks", 1st International Conference on Language Resources and
Evaluation, Granada, Spain, May 1998. <http://www.esi.uem.es/laboratorios/sinai/postscripts/lrec98.ps>.
[abstract] Applications based on automatic text classification
tasks, like text categorization (TC), word sense disambiguation (WSD), text
filtering or routing, monolingual or multilingual information retrieval, and
text summarization could obtain serious improvements by integrating linguistic
resources in the current methods. We present an approach using the Vector
Space Model to integrate two different kinds of resources: a lexical database
and training collections, in text content analysis tasks. The training
approaches we test are the Rocchio (relevance feedback) and the Widrow-Hoff
(machine learning) algorithms and WordNet as the lexical database. We have
developed experimental systems for TC and WSD. Results obtained from
evaluation show that the integration of WordNet can outperform approaches
based only on training.
- Fukumoto, Fumiyo, and Yoshimi Suzuki. 1996. "An Automatic
Clustering of Articles Using Dictionary Definitions," Proceedings of the
16th International Conference on Computational Linguistics: 406-411.
- Gomez
Hidalgo, Jose Maria, and Manuel de Buenaga Rodriguez. 1997. "Integrating a
Lexical Database and a Training Collection for Text Categorization,"
Proceedings of the ACL/EACL Workshop on Automatic Extraction and
Building of Lexical Semantic Resources for Natural Language
Applications. Available in Postscript format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9709004>.
[abstract] Automatic text categorization is a complex and useful
task for many natural language processing applications. Recent approaches to
text categorization focus more on algorithms than on resources involved in
this operation. In contrast to this trend, we present an approach based on the
integration of widely available resources as lexical databases and training
collections to overcome current limitations of the task. Our approach makes
use of WordNet synonymy information to increase evidence for poorly trained
categories. When testing a direct categorization, a WordNet based one, a
training algorithm, and our integrated approach, the latter exhibits a better
performance than any of the others. Incidentally, the WordNet-based approach's
performance is comparable with that of the training approach.
- Liddy, Elizabeth D. 1994. "Text Categorization for Multiple Users
based on Semantic Features from a Machine-Readable Dictionary," ACM
Transactions on Information Systems 12(3): 278-295.
[abstract] The text categorization module described here provides a
front-end filtering function for the larger DR-LINK text retrieval system
[Liddy and Myaeng 1993]. The model evaluates a large incoming stream of
documents to determine which documents are sufficiently similar to a profile
at the broad subject level to warrant more refined representation and
matching. To accomplish this task, each substantive word in a text is first
categorized using a feature set based on the semantic Subject Field Codes
(SFCs) assigned to individual word senses in a machine-readable dictionary.
When tested on 50 user profiles and 550 megabytes of documents, results
indicate that the feature set that is the basis of the text categorization
module and the algorithm that establishes the boundary of categories of
potentially relevant documents accomplish their tasks with a high level of
performance. This means that the category of potentially relevant documents
for most profiles would contain at least 80% of all documents later determined
to be relevant to the profile. The number of documents in this set would be
uniquely determined by the system's category-boundary predictor, and this set
is likely to contain less than 5% of the incoming stream of documents.
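[note] The Subject Field Code representation can be illustrated as follows;
SFC_LEXICON is a hypothetical stand-in for the machine-readable dictionary's
sense-to-code mapping, and the codes shown are illustrative only:

    import math
    from collections import Counter

    SFC_LEXICON = {              # hypothetical word -> Subject Field Codes
        'bank':  ['ECONOMICS', 'GEOGRAPHY'],
        'loan':  ['ECONOMICS'],
        'river': ['GEOGRAPHY'],
    }

    def sfc_vector(words):
        # Represent a text by the frequencies of its words' Subject Field Codes.
        counts = Counter()
        for word in words:
            for code in SFC_LEXICON.get(word, []):
                counts[code] += 1
        return counts

    def cosine(a, b):
        dot = sum(v * b.get(k, 0) for k, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

Documents whose SFC vector is sufficiently close to a profile's SFC vector
would then be passed on for the more refined representation and matching
described in the abstract.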
- Riloff, E. 1993.
"Automatically Constructing a Dictionary for Information Extraction Tasks,"
Proceedings of the Eleventh National Conference on Artificial
Intelligence: 811-816. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai93.ps>.
- Riloff, E. 1995.
"Dictionary Requirements for Text Classification: A Comparison of Three
Domains," Working Notes of the AAAI Spring Symposium on Representation
and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and
Generativity: 123-128. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai-spr-symp95.ps>.
- Riloff, E. 1996. "An
Empirical Study of Automated Dictionary Construction for Information
Extraction in Three Domains," AI Journal 85(August 1996): ?-?.
Also available from <http://www.cs.utah.edu/~riloff/psfiles/aij.ps>.
- Rodriguez, Manuel de Buenaga, Jose Maria Gomez Hidalgo, and Belen
Diaz Agudo. 1997. "Using WordNet to Complement Training Information in Text
Categorization," Proceedings of the Second International Conference on
Recent Advances in Natural Language Processing. Available in Postscript
format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9709007>.
[abstract] Automatic Text Categorization (TC) is a complex and
useful task for many natural language applications, and is usually performed
through the use of a set of manually classified documents, a training
collection. We suggest the utilization of additional resources like lexical
databases to increase the amount of information that TC systems make use of,
and thus, to improve their performance. Our approach integrates WordNet
information with two training approaches through the Vector Space Model. The
training approaches we test are the Rocchio (relevance feedback) and the
Widrow-Hoff (machine learning) algorithms. Results obtained from evaluation
show that the integration of WordNet clearly outperforms training approaches,
and that an integrated technique can effectively address the classification of
low frequency categories.
- Scott, S., and
Matwin, S. 1999. "Feature Engineering for Text Classification",
Proceedings of the Sixteenth International Conference
on Machine Learning (ICML-99). <http://wabakimi.carleton.ca/~sscott2/sam/ICML99_Camera.zip>.
[abstract] Most research in text classification to
date has used a "bag of words" representation in which each feature
corresponds to a single word. This paper examines some alternative ways to
represent text based on syntactic and semantic relationships between words
(phrases, synonyms and hypernyms). We describe the new representations and try
to justify our hypothesis that they could improve the performance of a
rule-based learner. The representations are evaluated using the RIPPER
learning algorithm on the Reuters-21578 and DigiTrad test corpora. On their
own the new representations are not found to produce significant performance
improvements. We also try combining classifiers based on different
representations using a majority voting technique, and this improves
performance on both test collections. In our opinion, more sophisticated
Natural Language Processing techniques need to be developed before better text
representations can be produced for classification.
- Scott, Sam, and
Stan Matwin. 1998. "Text Classification Using WordNet Hypernyms," Proceedings
of the COLING-ACL 98 Workshop on the Usage of WordNet in Natural Language
Processing Systems. <http://www.ai.sri.com/~harabagi/coling-acl98/acl_work/sscot.ps.gz>.
[abstract] This paper describes experiments in Machine Learning for
text classification using a new representation of text based on WordNet
hypernyms. Six binary classification tasks of varying difficulty are defined,
and the Ripper system is used to produce discrimination rules for each task
using the new hypernym density representation. Rules are also produced with
the commonly used bag-of-words representation, incorporating no knowledge from
WordNet. Experiments show that for some of the more difficult tasks the
hypernym density representation leads to significantly more accurate and more
comprehensible rules.
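[note] A rough sketch of a hypernym-density-style representation, using
NLTK's WordNet interface; the paper's exact density definition, windowing,
and height cut-off may differ:

    from collections import Counter
    from nltk.corpus import wordnet as wn

    def hypernym_density(tokens):
        # For every synset on a hypernym path of any noun sense of any token,
        # count how often it is activated and normalise by document length.
        counts = Counter()
        for token in tokens:
            for synset in wn.synsets(token, pos=wn.NOUN):
                for path in synset.hypernym_paths():
                    for ancestor in path:
                        counts[ancestor.name()] += 1
        length = max(len(tokens), 1)
        return {name: count / length for name, count in counts.items()}

Each hypernym then becomes a numeric feature whose value is its density, and
a rule learner such as Ripper can form conditions over those densities.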
- Yamazaki, Takefumi, and Ido Dagan. 1997. "Mistake-driven Learning
with Thesaurus for Text Categorization", Proceedings of the Natural
Language Processing Pacific Rim Symposium 1997 (NLPRS'97). <ftp://www.links.nectec.or.th/pub/NLPRS/paper/dana4r.ps.gz>.
[abstract] This paper extends the
mistake-driven learner WINNOW to better utilize thesauri for text
categorization.
In our method not only words but also semantic categories
given by the thesaurus are used as features in a classifier. New filtering and
disambiguation methods are used as pre-processing to solve the problems caused
by the use of the thesaurus. To verify our methods, we test them on a large
body of tagged Japanese newspaper articles. Experimental results show that
WINNOW with thesauri attains high accuracy and that the proposed filtering and
disambiguation methods also contribute to the improved accuracy.
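[note] WINNOW itself is a standard mistake-driven, multiplicative-update
learner; the sketch below shows the basic version over binary features (words
or thesaurus categories). The promotion factor and threshold are generic
defaults, not the paper's settings:

    def winnow_train(examples, n_features, alpha=2.0, theta=None):
        # Weights start at 1; predict positive when the summed weight of the
        # active features exceeds theta; on a mistake, multiplicatively
        # promote (false negative) or demote (false positive) those weights.
        theta = n_features / 2.0 if theta is None else theta
        w = [1.0] * n_features
        for active, label in examples:   # active: indices of present features
            predicted = sum(w[i] for i in active) > theta
            if predicted != label:
                factor = alpha if label else 1.0 / alpha
                for i in active:
                    w[i] *= factor
        return w

Adding thesaurus categories as extra features (after the filtering and
disambiguation steps described in the abstract) simply enlarges the feature
index space seen by this update.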
Evaluation of Text Categorization
- Lewis, David
Dolan. 1991. Representation and Learning in Information
Retrieval. Technical report CSD-94-809, UM-CS-1991-093, Computer
Science Department, University of Massachusetts, Amherst. Available in
Postscript format from Internet: <ftp://ftp.cs.umass.edu/pub/techrept/techreport/1991/UM-CS-1991-093.ps>.
[abstract] This dissertation introduces a new
theoretical model for text classification systems, including systems for
document retrieval, automated indexing, electronic mail filtering, and similar
tasks. The Concept Learning model emphasizes the role of manual and automated
feature selection and classifier formation in text classification. It enables
drawing on results from statistics and machine learning in explaining the
effectiveness of alternate representations of text, and specifies desirable
characteristics of text representations. The use of syntactic parsing to
produce indexing phrases has been widely investigated as a possible route to
better text representations. Experiments with syntactic phrase indexing,
however, have never yielded significant improvements in text retrieval
performance. The Concept Learning model suggests that the poor statistical
characteristics of a syntactic indexing phrase representation negate its
desirable semantic characteristics. The application of term clustering to this
representation to improve its statistical properties while retaining its
desirable meaning properties is proposed. Standard term clustering strategies
from information retrieval (IR), based on cooccurrence of indexing terms in
documents or groups of documents, were tested on a syntactic indexing phrase
representation. In experiments using a standard text retrieval test
collection, small effectiveness improvements were obtained. As a means of
evaluating representation quality, a text retrieval test collection introduces
a number of confounding factors. In contrast, the text categorization task
allows much cleaner determination of text representation properties. In
preparation for the use of text categorization to study text representation, a
more effective and theoretically well-founded probabilistic text categorization
algorithm was developed, building on work by Maron, Fuhr, and others. Text
categorization experiments supported a number of predictions of the Concept
Learning model about properties of phrasal representations, including
dimensionality properties not previously measured for text representations.
However, in carefully controlled experiments using syntactic phrases produced
by Church's stochastic bracketer, in conjunction with reciprocal nearest
neighbor clustering, term clustering was found to produce essentially no
improvement in the properties of the phrasal representation. New cluster
analysis approaches are proposed to remedy the problems found in traditional
term clustering methods.
- Lewis, David
Dolan. 1990. "Representation Quality in Text Classification: An
Introduction and Experiment," Proceedings of the Third Annual Symposium
on Document Analysis and Information Retrieval: 81-93. <http://www.research.att.com/~lewis/papers/lewis90g.ps>.
[abstract] The way in which text is represented
has a strong impact on the performance of text classification (retrieval and
categorization) systems. We discuss the operation of text classification
systems, introduce a theoretical model of how text representation impacts
their performance, and describe how the performance of text classification
systems is evaluated. We then present the results of an experiment on
improving text representation quality, as well as an analysis of the results
and the directions they suggest for future research.
- Lewis, David
Dolan. 1991. "Evaluating Text Categorization," Proceedings of Speech
and Natural Language Workshop: 312-318. Morgan Kaufmann. <http://www.research.att.com/~lewis/papers/lewis91c.ps>.
[abstract] While certain standard procedures are
widely used for evaluating text retrieval systems and algorithms, the same is
not true for text categorization. Omission of important data from reports is
common and methods of measuring effectiveness vary widely. This has made
judging the relative merits of techniques for text categorization difficult
and has disguised important research issues. In this paper I discuss a variety
of ways of evaluating the effectiveness of text categorization systems,
drawing both on reported categorization experiments and on methods used in
evaluating query-driven retrieval. I also consider the extent to which the
same evaluation methods may be used with systems for text extraction, a more
complex task. In evaluating either kind of system, the purpose for which the
output is to be used is crucial in choosing appropriate evaluation methods.
- Lewis, David
Dolan. 1995. "Evaluating and Optimizing Autonomous Text Classification
Systems," Proceedings of the 18th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval:
246-254. <http://www.research.att.com/~lewis/papers/lewis95b.ps>.
- Yang, Yiming. 1997.
An Evaluation of Statistical Approaches to Text Categorization. Computer
science technical report CMU-CS-97-127, School of Computer Science, Carnegie
Mellon University. <http://reports-archive.adm.cs.cmu.edu/anon/1997/abstracts/97-127.html>.
[abstract] This paper is a comparative study of
text categorization methods. Fourteen methods are investigated, based on
previously published results and newly obtained results from additional
experiments. Corpus biases in commonly used document collections are examined
using the performance of three classifiers. Problems in previously published
experiments are analyzed, and the results of flawed experiments are excluded
from the cross-method evaluation. As a result, eleven of the fourteen
methods remained. A k-nearest neighbor (kNN) classifier was chosen as the
performance baseline on several collections; on each collection, the
performance scores of other methods were normalized using the score of kNN.
This provides a common basis for a global observation on methods whose results
are only available on individual collections. Widrow-Hoff, k-nearest neighbor,
neural networks and the Linear Least Squares Fit mapping are the
top-performing classifiers, while the Rocchio approaches had relatively poor
results compared to the other learning methods. kNN is the only learning
method that has scaled to the full domain of MEDLINE categories, showing a
graceful behavior when the target space grows from the level of one hundred
categories to a level of tens of thousands.
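[note] The kNN-based normalization amounts to dividing each method's score on
a collection by the kNN score on that same collection, so that results
reported on different collections become loosely comparable. A trivial sketch
(the numbers in the comment are illustrative, not from the report):

    def normalize_by_baseline(scores, baseline='kNN'):
        # scores: mapping from method name to its performance on one collection.
        base = scores[baseline]
        return {method: value / base for method, value in scores.items()}

    # normalize_by_baseline({'kNN': 0.85, 'NB': 0.72})
    #   -> {'kNN': 1.0, 'NB': 0.847...}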
- Yang, Yiming. 1999.
"An Evaluation of Statistical Approaches to Text Categorization,"
Information Retrieval Journal, 1999 (to appear). <http://www.cs.cmu.edu/~yiming/papers.yy/irj99.ps>.
[abstract] This paper focuses on a comparative
evaluation of a wide range of text categorization methods, including
previously published results on the Reuters corpus and new results of
additional experiments. A controlled study using three classifiers, kNN, LLSF
and WORD, was conducted to examine the impact of configuration variations in
five versions of Reuters on the observed performance of classifiers. Analysis
and empirical evidence suggest that the evaluation results on some versions of
Reuters were significantly affected by the inclusion of a large portion of
unlabelled documents, making those results difficult to interpret and leading
to considerable confusion in the literature. Using the results evaluated on
the other versions of Reuters, which exclude the unlabelled documents, the
performance of twelve methods is compared directly or indirectly. For
indirect comparisons, kNN, LLSF and WORD were used as baselines, since they
were evaluated on all versions of Reuters that exclude the unlabelled
documents. As a global observation, kNN, LLSF and a neural network method had
the best performance; except for a Naive Bayes approach, the other learning
algorithms also performed relatively well.
- Yang, Yiming, and Xin
Liu. 1999. "A Re-examination of Text Categorization Methods," Proceedings
of the 22nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval: 42-49. <http://www.cs.cmu.edu/~yiming/papers.yy/sigir99.ps>.
[abstract] This paper reports a controlled study
with statistical significance tests on five text categorization methods: the
Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural
network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a
Naive Bayes (NB) classifier. We focus on the robustness of these methods in
dealing with a skewed category distribution, and their performance as a
function of the training-set category frequency. Our results show that SVM,
kNN and LLSF significantly outperform NNet and NB when the number of positive
training instances per category is small (less than ten), and that all the methods
perform comparably when the categories are sufficiently common (over 300
instances).
- Yang, Yiming, and
J.P. Pedersen. 1997. "A Comparative Study on Feature Selection in Text
Categorization," Proceedings of the Fourteenth International Conference
on Machine Learning (ICML'97), 1997.
[abstract] This paper is a comparative study of
feature selection methods in statistical learning of text categorization. The
focus is on aggressive dimensionality reduction. Five methods were evaluated,
including term selection based on document frequency (DF), information gain
(IG), mutual information (MI), a chi-square test (CHI), and term strength
(TS). We found IG and CHI most effective in our experiments. Using IG
thresholding with a k-nearest neighbor classifier on the Reuters corpus,
removal of up to 98% of the unique terms actually yielded an improved
classification accuracy (measured by average precision). DF thresholding
performed similarly. Indeed we found strong correlations between the DF, IG
and CHI values of a term. This suggests that DF thresholding, the simplest
method with the lowest cost in computation, can be reliably used instead of IG
or CHI when the computation of these measures is too expensive. TS compares
favorably with the other methods with up to 50% vocabulary reduction but is
not competitive at higher vocabulary reduction levels. In contrast, MI had
relatively poor performance due to its bias towards favoring rare terms, and
its sensitivity to probability estimation errors.
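[note] Two of the selection criteria above can be sketched directly from a
2x2 term/category contingency table; the chi-square form below is the usual
one for this setting:

    def chi_square(A, B, C, D):
        # A: docs in category c containing term t;  B: docs outside c with t;
        # C: docs in c without t;                    D: docs outside c without t.
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    def df_threshold(doc_freq, threshold):
        # DF thresholding: keep terms occurring in at least `threshold` documents.
        return {term for term, df in doc_freq.items() if df >= threshold}

Terms are then ranked by the chosen criterion (maximised or averaged over
categories) and the vocabulary is cut at the desired level of aggressiveness.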
Term Specificity
- Wong, S.K.M., and Y.Y. Yao. 1992. "An Information-Theoretic
Measure of Term Specificity," Journal of the American Society for
Information Science 43(1): 54-61.
[abstract] The inverse
document frequency (IDF) and signal-noise ratio (S/N) approaches are two well
known term weighting schemes based on term specificity. However, the existing
justifications for these methods are still somewhat inconclusive and sometimes
even based on incompatible assumptions. Although both methods are related to
term specificity, their relationship has not been thoroughly investigated. An
information-theoretic measure for term specificity is introduced in this
study. It is explicitly shown that the IDF weighting scheme can be derived
from the proposed approach by assuming that the frequency of occurrence of
each index term is uniform within the set of documents containing the term.
The information-theoretic interpretation of term specificity also establishes
the relationship between the IDF and S/N methods.
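[note] The IDF scheme discussed above is, in its usual form, idf(t) =
log(N / n_t) for a term t occurring in n_t of N documents; the log base only
changes a constant factor. A one-line sketch:

    import math

    def idf(num_docs, doc_freq):
        # High weight for terms confined to few documents (high specificity).
        return math.log(num_docs / doc_freq)

The paper's contribution is to show that this form follows from an
information-theoretic specificity measure under the assumption that a term's
frequency is uniform across the documents containing it.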
Term Discrimination Value
- El-Hamdouchi, A., and P. Willett. 1988. "An Improved Algorithm for
the Calculation of Exact Term Discrimination Values," Information
Processing & Management 24(1): 17-22.
[abstract] The
term discrimination model provides a means of evaluating indexing terms in
automatic document retrieval systems. This article describes an efficient
algorithm for the calculation of term discrimination values that may be used
when the interdocument similarity measure used is the cosine coefficient and
when the document representatives have been weighted using one particular
term-weighting scheme. The algorithm has an expected running time proportional
to Nn^2 for a collection of N documents, each of which has been assigned an
average of n terms.
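[note] For orientation, a naive computation of term discrimination values via
centroid-based space density is sketched below; it recomputes the density
once per term and is therefore far slower than the exact algorithm described
in the article:

    import numpy as np

    def space_density(doc_vectors):
        # Average cosine similarity between each document and the centroid.
        centroid = doc_vectors.mean(axis=0)
        norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
        sims = doc_vectors @ centroid / np.where(norms == 0.0, 1.0, norms)
        return sims.mean()

    def discrimination_values(doc_vectors):
        # DV of term k = density with term k removed minus the baseline
        # density; good discriminators spread documents apart, so removing
        # them increases density and yields a positive DV.
        baseline = space_density(doc_vectors)
        return np.array([
            space_density(np.delete(doc_vectors, k, axis=1)) - baseline
            for k in range(doc_vectors.shape[1])
        ])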