Text Categorization
General
Text Categorization with Learning Methods
Hierarchical Text Categorization
Text Categorization Using Knowledge Bases
Evaluation of Text Categorization
Topic Characterization
Term Specificity
Term Discrimination Value
General
- Rainbow <http://www.cs.cmu.edu/~mccallum/bow/rainbow/>
Rainbow is a program that performs statistical
text classification. It is based on the Bow library.
- 최동시, 정경택. 1995. "Design and Implementation of an Automatic Document
Classification System Based on Category-Keyword Association Information,"
Proceedings of the Korea Information Science Society '95 Fall Conference
22(2): 639-642.
[abstract] Wherever large numbers of documents are
handled, systematic classification management based on document content is
essential. This paper describes the design and implementation of a
vector-table generation system that automatically extracts keywords from an
encyclopedia used as base data and expresses, as vector values, the
within-class importance of the extracted keywords according to a manually
defined classification scheme, together with an automatic document
classification system that uses this table. The per-class keyword vector
table implemented in the system consists of 76 classes and some 90,000
keyword entries. In document classification using the vector table initially
constructed from sample data, documents containing representative proper
nouns were classified almost perfectly, while documents in which common and
abstract nouns account for a relatively large share were classified with a
success rate of about 80 percent.
- Apté, Chidanand, Fred Damerau, and Sholom M. Weiss. 1994.
"Automated learning of decision rules for text categorization," ACM
Transactions on Information Systems 12(3): 233-251.
[abstract] We describe the results of
extensive experiments using optimized rule-based induction methods on large
document collections. The goal of these methods is to discover automatically
classification patterns that can be used for general document categorization
or personalized filtering of free text. Previous reports indicate that
human-engineered rule-based systems, requiring many man-years of developmental
efforts, have been successfully built to "read" documents and assign topics to
them. We show that machine-generated decision rules appear comparable to human
performance, while using the identical rule-based representation. In
comparison with other machine-learning techniques, results on a key benchmark
from the Reuters collection show a large gain in performance, from a
previously reported 67% recall/precision breakeven point to 80.5%. In the
context of a very high-dimensional feature space, several methodological
alternatives are examined, including universal versus local dictionaries, and
binary versus frequency-related features.
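[code sketch] The Apté et al. results are reported as a recall/precision
breakeven point. As a rough illustration (not taken from the paper), the
breakeven point for one category can be estimated by ranking documents by a
hypothetical classifier score and finding the cut-off where precision and
recall cross:

    def breakeven_point(scores, labels):
        """Approximate recall/precision breakeven point for one category.

        scores: classifier confidence per document (higher = more relevant)
        labels: 1 if the document truly belongs to the category, else 0
        """
        ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
        total_relevant = sum(labels)
        tp = 0
        value = 0.0
        for k, (_, y) in enumerate(ranked, start=1):
            tp += y
            precision = tp / k
            recall = tp / total_relevant
            value = (precision + recall) / 2
            if precision <= recall:      # the curves have crossed
                break
        return value

    # toy example with made-up scores and relevance judgements
    print(breakeven_point([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 1, 0]))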
- Baker, Doug, and Andrew
McCallum. 1998. "Distributional Clustering of Words for Text
Classification", SIGIR-98. <http://www.cs.cmu.edu/~mccallum/papers/clustering-sigir98.ps.gz>.
[abstract] This paper applies
Distributional Clustering (Pereira et al. 1993) to document classification.
The approach clusters words into groups based on the distribution of class
labels associated with each word. Thus, unlike some other unsupervised
dimensionality-reduction techniques, such as Latent Semantic Indexing, we are
able to compress the feature space much more aggressively, while still
maintaining high document classification accuracy. Experimental results
obtained on three real-world data sets show that we can reduce the feature
dimensionality by three orders of magnitude and lose only 2%
accuracy---significantly better than Latent Semantic Indexing (Deerwester et
al. 1990), class-based clustering (Brown et al. 1992), feature selection by
mutual information (Yang and Pedersen 1997), or Markov-blanket-based feature
selection (Koller and
Sahami 1996). We also show that less aggressive clustering sometimes results
in improved classification accuracy over classification without clustering.
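[code sketch] The core idea described above, grouping words whose class-label
distributions are similar, can be sketched as a greedy agglomerative merge
using Jensen-Shannon divergence between the P(class | word) distributions.
This is only a toy illustration of the idea, not the Baker and McCallum
algorithm itself:

    import numpy as np

    def js_divergence(p, q):
        """Jensen-Shannon divergence between two discrete distributions."""
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def cluster_words(word_class_counts, n_clusters):
        """Greedily merge the pair of word clusters whose class-label
        distributions are closest, until n_clusters remain."""
        clusters = {w: c.astype(float) for w, c in word_class_counts.items()}
        while len(clusters) > n_clusters:
            words = list(clusters)
            dists = {w: clusters[w] / clusters[w].sum() for w in words}
            best_pair, best_d = None, float("inf")
            for i, a in enumerate(words):
                for b in words[i + 1:]:
                    d = js_divergence(dists[a], dists[b])
                    if d < best_d:
                        best_pair, best_d = (a, b), d
            a, b = best_pair
            clusters[a + "|" + b] = clusters.pop(a) + clusters.pop(b)
        return list(clusters)

    # class-label counts per word (hypothetical: class 0 = sports, 1 = politics)
    counts = {"goal": np.array([9.0, 1.0]), "score": np.array([8.0, 2.0]),
              "senate": np.array([1.0, 9.0]), "vote": np.array([2.0, 8.0])}
    print(cluster_words(counts, 2))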
- Basu, Chumki, Haym
Hirsh, and William W.
Cohen. 1998. "Recommendation as Classification: Using Social and
Content-Based Information in Recommendation", Proceedings of the
Fifteenth National Conference on Artificial Intelligence (AAAI98).
<ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/aaai1.ps>.
[abstract] Recommendation systems make
suggestions about artifacts to a user. For instance, they may predict whether
a user would be interested in seeing a particular movie. Social recommendation
methods collect ratings of artifacts from many individuals and use
nearest-neighbor techniques to make recommendations to a user concerning new
artifacts. However, these methods do not use the significant amount of other
information that is often available about the nature of each artifact ---
such as cast lists or movie reviews, for example. This paper presents an
inductive learning approach to recommendation that is able to use both ratings
information and other forms of information about each artifact in predicting
user preferences. We show that our method outperforms an existing
social-filtering method in the domain of movie recommendations on a dataset of
more than 45,000 movie ratings collected from a community of over 250 users.
- Chakrabarti, Soumen, Byron Dom, Rakesh Agrawal, and Prabhakar
Raghavan. 1997. "Keyword Detection, Navigation, and Annotation in Hierarchical
Text," Proceedings of 23rd
International Conference on Very Large Data Bases: 446-455.
Available in Postscript format from Internet: <http://sunsite.ust.hk/dblp/db/conf/vldb/ChakrabartiDAR97.html>.
[abstract] We explore how to organize a
text database hierarchically to aid better searching and browsing. We propose
to exploit the natural hierarchy of topics, or taxonomy, that many corpora,
such as internet directories, digital libraries, and patent databases enjoy.
In our system, the user navigates through the query response not as a flat
unstructured list, but embedded in the familiar taxonomy, and annotated with
document signatures computed dynamically with respect to where the user is
located at any time. We show how to update such databases with new documents
with high speed and accuracy. We use techniques from statistical pattern
recognition to efficiently separate the feature words or discriminants from the
noise words at each node of the taxonomy. Using these, we build a multi-level
classifier. At each node, this classifier can ignore the large number of noise
words in a document. Thus the classifier has a small model size and is very
fast. However, owing to the use of context-sensitive features, the classifier
is very accurate. We report on experiences with the Reuters newswire
benchmark, the US Patent database, and web document samples from Yahoo!.
- Chakrabarti, Soumen, Martin van den Berg, and Byron Dom. 1999.
"Focused crawling: A new approach to topic-specific web resource discovery,"
Proceedings of the WWW8
Conference. <http://www8.org/w8-papers/5a-search-query/crawling/index.html>.
[abstract] The rapid growth of the
world-wide web poses unprecedented scaling challenges for general-purpose
crawlers and search engines. In this paper we describe a new hypertext
resource discovery system called a Focused Crawler. The goal of a focused
crawler is to selectively seek out pages that are relevant to a pre-defined
set of topics. The topics are specified not using keywords, but using
exemplary documents. Rather than collecting and indexing all accessible web
documents to be able to answer all possible ad-hoc queries, a focused crawler
analyzes its crawl boundary to find the links that are likely to be most
relevant for the crawl, and avoids irrelevant regions of the web. This leads
to significant savings in hardware and network resources, and helps keep the
crawl more up-to-date. To achieve such goal-directed crawling, we design two
hypertext mining programs that guide our crawler: a classifier that evaluates
the relevance of a hypertext document with respect to the focus topics, and a
distiller that identifies hypertext nodes that are great access points to many
relevant pages within a few links. We report on extensive focused-crawling
experiments using several topics at different levels of specificity. Focused
crawling acquires relevant pages steadily while standard crawling quickly
loses its way, even though they are started from the same root set. Focused
crawling is robust against large perturbations in the starting set of URLs. It
discovers largely overlapping sets of resources in spite of these
perturbations. It is also capable of exploring out and discovering valuable
resources that are dozens of links away from the start set, while carefully
pruning the millions of pages that may lie within this same radius. Our
anecdotes suggest that focused crawling is very effective for building
high-quality collections of web documents on specific topics, using modest
desktop hardware.
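[code sketch] The classifier-guided crawl loop described above can be
sketched as a best-first search over a priority queue: pages judged relevant
by the classifier are kept, and their out-links inherit the parent's
relevance as crawl priority. The fetcher, link extractor, and relevance
classifier below are caller-supplied stand-ins, and the distiller component
of the paper is omitted:

    import heapq

    def focused_crawl(seeds, fetch_page, extract_links, relevance,
                      limit=100, threshold=0.5):
        """Best-first crawl guided by a relevance score in [0, 1]."""
        frontier = [(-1.0, url) for url in seeds]   # max-heap via negation
        heapq.heapify(frontier)
        visited, collected = set(), []
        while frontier and len(collected) < limit:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            text = fetch_page(url)
            score = relevance(text)
            if score >= threshold:
                collected.append((url, score))
                for link in extract_links(url, text):
                    if link not in visited:
                        # expand outward from relevant pages first
                        heapq.heappush(frontier, (-score, link))
        return collected

    # toy usage over an in-memory "web"
    pages = {"a": "soccer soccer", "b": "soccer politics", "c": "politics"}
    links = {"a": ["b"], "b": ["c"], "c": []}
    score = lambda t: t.split().count("soccer") / max(len(t.split()), 1)
    print(focused_crawl(["a"], pages.get, lambda u, t: links[u], score))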
- Chen, Hsinchun,
Chris Schuffels, and Rich Orwig. 1996. "Internet Categorization and Search: A
Self-Organizing Approach", Journal of Visual Communication and Image
Representation 7(1): 88-102. Available in Postscript format from
Internet: <http://ai.bpa.arizona.edu/papers/PS/som95.ps>.
[abstract] The problems of information
overload and vocabulary differences have become more pressing with the
emergence of the increasingly more popular Internet services. The main
information retrieval mechanisms provided by the prevailing Internet WWW
software are based on either keyword search (e.g., Lycos server at CMU, Yahoo
server at Stanford) or hypertext browsing (e.g., Mosaic and Netscape). This
research aims to provide an alternative concept-based categorization and
search capability for WWW servers based on selected machine learning
algorithms.
Our proposed approach, which is grounded on automatic textual
analysis of Internet documents (homepages), attempts to address the Internet
search problem by first categorizing the content of Internet documents. We
report results of our recent testing of a multi-layered neural network
clustering algorithm employing the Kohonen self-organizing feature map to
categorize (classify) Internet homepages according to their content. The
category hierarchies created could serve to partition the vast Internet
services into subject-specific categories and databases and improve Internet
keyword searching and/or browsing.
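[code sketch] A minimal Kohonen self-organizing feature map of the kind the
abstract refers to can be written in a few lines of numpy. The grid size,
learning-rate schedule, and toy term-frequency vectors below are arbitrary
choices for illustration, not the parameters used in the paper:

    import numpy as np

    def train_som(docs, grid=(4, 4), epochs=50, lr0=0.5, radius0=2.0, seed=0):
        """Train a small self-organizing map on document vectors."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        weights = rng.random((rows, cols, docs.shape[1]))
        coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                      indexing="ij"), axis=-1)
        for epoch in range(epochs):
            lr = lr0 * (1 - epoch / epochs)
            radius = max(radius0 * (1 - epoch / epochs), 0.5)
            for x in docs[rng.permutation(len(docs))]:
                # best-matching unit: node whose weight vector is closest to x
                dists = np.linalg.norm(weights - x, axis=-1)
                bmu = np.unravel_index(np.argmin(dists), dists.shape)
                # pull the BMU and its grid neighbours toward the document
                grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
                influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                weights += lr * influence[..., None] * (x - weights)
        return weights

    def assign(docs, weights):
        """Map each document to the index of its best-matching unit."""
        flat = weights.reshape(-1, weights.shape[-1])
        return np.argmin(np.linalg.norm(flat[None, :, :] - docs[:, None, :],
                                        axis=-1), axis=1)

    docs = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 1], [0, 2, 2]], dtype=float)
    print(assign(docs, train_som(docs, grid=(2, 2))))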
- Chen, Hsinchun,
and Robin Sewell. 1999. High-Performance Digital Library Classification
Systems: From Information Retrieval to Knowledge Management. Digital
Libraries Initiative-Phase 2 Project Summary. <http://www.dli2.nsf.gov/projects/chen.pdf>.
[abstract] The proposed research aims to
develop an architecture and the associated techniques needed to automatically
generate classification systems from large domain-specific textual collections
and to unify them with manually created classification systems to assist in
effective digital library retrieval and analysis. Both algorithmic
developments and user evaluation in several sample domains will be conducted.
Scalable automatic clustering methods including Ward's clustering,
multi-dimensional scaling, latent semantic indexing, and the self-organizing
map will be developed and compared. Most of these algorithms, which are
computationally intensive, will be optimized based on the sparseness of common
keywords in textual document representations. Using parallel, high-performance
platforms as a time machine for simulation, we plan to parallelize and
benchmark these clustering algorithms for large-scale collections (on the
order of millions of documents) in several domains. Results of automatic
classification systems will be represented using several novel hierarchical
display methods.
- Cheng, Patrick T. K., and Albert K. W. Wu. 1995. "ACS: An
Automatic Classification System," Journal of Information Science
21(4): 289-299.
[abstract] In this paper, we introduce
ACS: an automatic classification system for school libraries. First, various
approaches towards automatic classification, namely (i) rule-based, (ii) browse
and search, and (iii) partial match, are critically reviewed. The central
issues of scheme selection, text analysis and similarity measures are
discussed. A novel approach towards detecting book-class similarity with
Modified Overlap Coefficient (MOC) is also proposed. Finally, the design and
implementation of ACS is presented. The test result of over 80% correctness in
automatic classification and a cost reduction of 75% compared to manual
classification suggest that ACS is highly adoptable.
- Clack, Chris, Johnny Farringdon, Peter Lidwell, and Tina Yu. 1997.
"Autonomous Document Classification for Business", Proceedings of the
First International Conference on Autonomous Agents: 201-208.
[abstract] We have developed a general
purpose information classification agent architecture and are applying it to
the problem of document classification and routing. Collaboration with Friends
of the Earth allows us to test our ideas in a non-academic context involving
high volumes of documents. We use the technique of genetic programming (GP) to
evolve classifying agents. This is a novel approach for document
classification, where each agent evolves a parse-tree representation of a
user's particular information need. The other unusual features of our research
are the longevity of our agents and the fact that they undergo a continual
training process; feedback from the user enables the agent to adapt to the
user's long-term information requirements.
- Dash, M., and H. Liu. 1997. "Feature Selection for
Classification", Intelligent Data
Analysis, 1(3). <http://www-east.elsevier.com/ida/browse/0103/ida00013/article.htm>.
[abstract] Feature selection has been the focus of
interest for quite some time and much work has been done. With the creation of
huge databases and the consequent requirements for good machine learning
techniques, new problems arise and novel approaches to feature selection are
in demand. This survey is a comprehensive overview of many existing methods
from the 1970's to the present. It identifies four steps of a typical feature
selection method, and categorizes the different existing methods in terms of
generation procedures and evaluation functions, and reveals hitherto
unattempted combinations of generation procedures and evaluation functions.
Representative methods are chosen from each category for detailed explanation
and discussion via example. Benchmark datasets with different characteristics
are used for comparative study. The strengths and weaknesses of different
methods are explained. Guidelines for applying feature selection methods are
given based on data types and domain characteristics. This survey identifies
the future research areas in feature selection, introduces newcomers to this
field, and paves the way for practitioners who search for suitable methods for
solving domain-specific real-world applications.
- Dolin, R., D. Agrawal, A. El Abbadi, and J. Pearlman. 1998. "Using
Automated Classification for Summarizing and Selecting Heterogeneous
Information", D-Lib Magazine, January 1998. <http://www.dlib.org/dlib/january98/dolin/01dolin.html>.
[abstract] Information retrieval over the Internet
increasingly requires the filtering of thousands of heterogeneous information
sources. As the number and variability of sources increases, new ways of
automatically summarizing, discovering, and selecting collections relevant to
a user's query are needed. One such method involves the use of classification
schemes, such as the Library of Congress Classification (LCC), within which a
collection may be represented based on its content, irrespective of the
structure of the actual data or documents. We are currently experimenting with
newsgroups as collections. We have built an initial prototype which
automatically classifies and summarizes newsgroups within the LCC.
- Ellman,
Jeremy. 1998. "Using the Generic Document Profile to Cluster Similar
Texts", Proceedings of Computational Linguistics UK (CLUK 97),
University of Sunderland. <http://osiris.sunderland.ac.uk/~cs0jel/cluk98.ps.gz>.
[abstract] The World Wide Web contains a huge
quantity of text that is notoriously inefficient to use. This work aims to
apply a text processing technique based on thesaurally derived lexical chains
to improve Internet Information Retrieval, where a lexical chain is a set of
words in a text that are related both by proximity and by relations derived from an
external lexical knowledge source such as WordNet, Roget's Thesaurus, LDOCE,
and so on.
Finding Information on the Internet is notoriously hard, even
when users have a clear focus to their queries. This situation is exacerbated
when users only have vague notions about the topics they wish to explore. This
could be remedied using Exemplar Texts, where an Exemplar Text is the ideal
model result for Web searches. Our problem is now transformed into one of
identifying similar texts.
The Generic Document Profile is designed to
allow the comparison of document similarity whilst being independent of
terminology and document length. It is simply a set of semantic categories
derived from Roget's thesaurus with associated weights. These weights are
based on lexical chain length and strength. A Generic Document Profile can be
compared to another using a Case Based Reasoning approach.
Case Based
Reasoning (CBR) is a problem solving method that seeks to solve existing
problems by reference to previous successful solutions. Here our Exemplar
Texts count as previous solutions (and in these experiments, MS Encarta is
used as the source of Exemplar Texts).
In CBR a query and examples are
usually represented as attribute value pairs. Thus to apply CBR to document
comparison, both the text acting as a query, and the documents to be compared
against need to be represented in equivalent terms. If this representation is
based on simple terms (ie word stems), the problem becomes hugely complex,
since there are many words in a language.
This would also be a fragile
approach, since semantically equivalent words would not count as equal.
However, if we represent a document as Roget categories, the problem becomes
tractable. We use Roget's thesaurus since there are 1024 main thesaurus
categories as opposed to 50,000+ synsets in WordNet. This approach is providing
interesting results, although it is subject to the well known Word Sense
Disambiguation problem. The paper will summarise current findings, cover some
implementation details, and describe the approach taken to full-text Word Sense
Disambiguation.
Hesperus is a system designed to address this problem.
Using an electronic encyclopedia as a source of subject defining texts,
queries are made to MetaCrawler. This searches several search engines in
parallel returning the best 10-20 links. Hesperus retrieves these web pages,
and computes their conceptual similarity to the Exemplar Text using a method
based on thesaurally determined lexical chains.
Initial results show that
users prefer the Hesperus page order to MetaCrawler's statistical ordering.
The technique consequently shows promise as a way to improve the effectiveness
of Web searching.
- Feldman, Ronen, and Haym Hirsh. 1996. "Mining
Associations in Text in the Presence of Background Knowledge",
Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD96): 343-346. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1996/kdd.ps>.
[abstract] This paper describes the FACT system
for knowledge discovery from text. It discovers associations - patterns of
co-occurrence - amongst keywords labeling the items in a collection of textual
documents. In addition, FACT is able to use background knowledge about the
keywords labeling the documents in its discovery process. FACT takes a
query-centered view of knowledge discovery, in which a discovery request is
viewed as a query over the implicit set of possible results supported by a
collection of documents, and where background knowledge is used to specify
constraints on the desired results of this query process. Execution of a
knowledge-discovery query is structured so that these background-knowledge
constraints can be exploited in the search for possible results. Finally,
rather than requiring a user to specify an explicit query expression in the
knowledge-discovery query language, FACT presents the user with a
simple-to-use graphical interface to the query language, with the language
providing a well-defined semantics for the discovery actions performed by a
user through the interface.
- Finch, S. 1994. "Exploiting Sophisticated Representations for
Classification," Proceedings of the Fourth Conference on Applied Natural
Language Processing: 65-71.
- Finch, S. 1995. "Partial Orders for Document Representation: A New
Methodology for Combining Document Features," Proceedings of the 18th
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval: 264-272.
- Gustavsson, Jonas. 1996. Text Categorization Using
Acquaintance. Available in HTML format from Internet: <http://www.student.nada.kth.se/~f92-jgu/C-uppsats/cup.html>.
[abstract] Acquaintance, which is an
n-gram based method for topic and language determination of texts which
require no prepared database or preconditioned texts is examined. An
implementation of the method which is slightly different from the one proposed
by Marc Damashek is described. This implementation is then used to separate
English texts from Swedish texts in a collection and to cluster the English
texts according to their topics. The possibility to use Acquaintance when
having to select one translation out of several alternatives in machine
translation is discussed. The method is, however, found to have several severe
drawbacks, reducing its versatility.
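[code sketch] The Acquaintance-style representation described above (see also
Huffman and Damashek below) boils down to comparing character n-gram
frequency vectors, which needs no dictionaries or language-specific
preprocessing. A much simplified version using plain cosine similarity over
normalized n-gram counts:

    from collections import Counter
    from math import sqrt

    def ngram_vector(text, n=4):
        """L2-normalized character n-gram frequency vector of a raw text."""
        text = " ".join(text.lower().split())       # collapse whitespace only
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        norm = sqrt(sum(c * c for c in counts.values()))
        return {g: c / norm for g, c in counts.items()}

    def similarity(a, b):
        """Cosine similarity (dot product of the normalized vectors)."""
        return sum(w * b.get(g, 0.0) for g, w in a.items())

    en1 = ngram_vector("text categorization with character n-grams")
    en2 = ngram_vector("categorization of text by n-gram statistics")
    sv = ngram_vector("en helt annan text om fotboll och politik")
    print(similarity(en1, en2), similarity(en1, sv))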
- Haas, Stephanie W., Jeremy Sugarman, and Helen R. Tibbo. 1996. "A
Text Filter for the Automatic Identification of Empirical Articles,"
Journal of the American Society for Information Science 47(2):
167-169.
- Hayes, Philip J. and Gail Koerner. 1993. "Intelligent Text
Technologies and Their Successful Use by the Information Industry," in Martha
E. Williams (ed.) Proceedings of the 14th National Online Meeting
(Medford, NJ: Learned Information, Inc): 189-196.
[abstract] The information industry is in
considerable flux. Online vendors must add value to the information they sell
to survive and prosper. Carnegie Group has been working for several years on
prepackaged and custom computer systems that add value to large volumes of
textual information, using both artificial intelligence and natural language
technologies. To aid in the indexing of on-line data, Carnegie Group provides
an intelligent text categorization shell, better known as TCS. Applications
supported by TCS include the indexing of news stories for Reuters' product
Textline and the routing of government security analyst information by HRB
Systems. For the creation of company, people and advertiser name indexes,
Carnegie Group provides the NameFinder - software, which supports application
development for newspaper, magazine, or on-line database index creation. Going
beyond identifying categories and specific references of data within a body of
text, fact extraction technologies and techniques can be used to create
summaries for abstracts and database filling purposes. Carnegie Group's fact
extraction technology was developed from a system that assisted journalists in
preparing earnings reports. This paper provides an overview of these Carnegie
Group intelligent text technologies and their successful use by information
providers.
- Huffman, S., and M. Damashek. 1994. "Acquaintance: A Novel
Vector-space N-gram Technique for Document Categorization," Proceedings
of the TREC 3.
- Iwayama, Makoto, and Takenobu Tokunaga. 1994. "A Probabilistic
Model for Text Categorization: Based on a Single Random Variable with Multiple
Values," Proceedings of the 4th Conference on Applied Natural Language
Processing: 162-167. Also technical report 94TR-0008, Department of
Computer Science, Tokyo Institute of Technology. [online] <ftp://ftp.cs.titech.ac.jp/pub/TR/94/94TR0008.ps.gz>.
[abstract] Text categorization is the
classification of documents with respect to a set of predefined categories. In
this paper, we propose a new probabilistic model for text categorization, that
is based on a Single random Variable with Multiple Values (SVMV).
Compared to previous probabilistic models, our model has the following
advantages; 1) it considers within-document term frequencies, 2) considers
term weighting for target documents, and 3) is less affected by having
insufficient training cases. We verify our model's superiority over the others
in the task of categorizing news articles from the "Wall Street Journal".
- Kalt, T. 1998. A New Probabilistic Model of Text
Classification and Retrieval. Technical report UM-CS-1998-018, University of
Massachusetts, Amherst. <http://cs-tr.cs.cornell.edu/Dienst/UI/1.0/Display/ncstrl.umassa_cs/UM-CS-1998-018>.
[abstract] This paper introduces the
multinomial model of text classification and retrieval. One important feature
of the model is that the tf statistic, which usually appears in probabilistic
IR models as a heuristic, is an integral part of the model. Another is that
the variable length of documents is accounted for, without either making a
uniform length assumption or using length normalization. The multinomial model
employs independence assumptions which are similar to assumptions made in
previous probabilistic models, particularly the binary independence model and
the 2-Poisson model. The use of simulation to study the model is described.
Performance of the model is evaluated on the TREC-3 routing task. Results are
compared with the binary independence model and with the simulation studies.
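[code sketch] The multinomial view described above treats a document as a bag
of term draws, so the term frequency enters the class score directly rather
than as a separate heuristic weight. A minimal (Laplace-smoothed, uniform
prior) illustration of that kind of scoring, not the model or experiments of
the report:

    import math
    from collections import Counter, defaultdict

    def train_multinomial(docs, labels, alpha=1.0):
        """Estimate P(term | class) with Laplace smoothing."""
        vocab = {t for d in docs for t in d.split()}
        counts = defaultdict(Counter)
        for d, c in zip(docs, labels):
            counts[c].update(d.split())
        model = {}
        for c, tc in counts.items():
            total = sum(tc.values()) + alpha * len(vocab)
            model[c] = {t: (tc[t] + alpha) / total for t in vocab}
        return model

    def classify(doc, model):
        """Pick the class maximizing sum_t tf(t, doc) * log P(t | class)."""
        tf = Counter(doc.split())
        def score(c):
            probs = model[c]
            floor = min(probs.values())       # crude fallback for unseen terms
            return sum(n * math.log(probs.get(t, floor)) for t, n in tf.items())
        return max(model, key=score)

    model = train_multinomial(
        ["stocks fell sharply", "stocks and bonds rallied", "the team won the match"],
        ["finance", "finance", "sports"])
    print(classify("bonds fell", model))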
- Kargupta, Hillol, I. Hamzaoglu, B. Stafford, V. Hanagandi, and K.
Buescher. 1997. PADMA: PArallel Data Mining Agents For Scalable Text
Classification. Technical report LAUR-97-3491. X Division, Los Alamos
National Laboratory.
[abstract] This paper introduces PADMA
(PArallel Data Mining Agents), a parallel agent based system for scalable text
classification. PADMA contains modules for (1) parallel data accessing
operations, (2) parallel hierarchical clustering, and (3) web-based data
visualization. This paper introduces the general architecture of PADMA and
presents a detailed description of its different modules.
- Kessler, Brett, Geoffrey Nunberg, and Hinrich Schütze. 1997.
"Automatic Detection of Text Genre," Proceedings of the 35th Annual
Meeting of the ACL and 8th Conference of the EACL: 32-38. Available in
Postscript format from Internet: <ftp://parcftp.xerox.com/pub/qca/genre/paper.acl97.ps.Z>.
[abstract] As the text databases available
to users become larger and more heterogeneous, genre becomes increasingly
important for computational linguistics as a complement to topical and
structural principles of classification. We propose a theory of genres as
bundles of facets, which correlate with various surface cues, and argue that
genre detection based on surface cues is as successful as detection based on
deeper structural properties.
- Kudenko, Daniel, and Haym Hirsh. 1998. "Feature
Generation for Sequence Categorization", Proceedings of the Fifteenth
National Conference on Artificial Intelligence (AAAI98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/aaai2.ps>.
[abstract] The problem of sequence categorization
is to generalize from a corpus of labeled sequences procedures for accurately
labeling future unlabeled sequences. The choice of representation of sequences
can have a major impact on this task, and in the absence of background
knowledge a good representation is often not known and straightforward
representations are often far from optimal. We propose a feature generation
method (called FGEN) that creates Boolean features that check for the presence
or absence of heuristically selected collections of subsequences. We show
empirically that the representation computed by FGEN improves the accuracy of
two commonly used learning systems (C4.5 and Ripper) when the new features are
added to existing representations of sequence data. We show the superiority of
FGEN across a range of tasks selected from three domains: DNA sequences, Unix
command sequences, and English text.
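[code sketch] FGEN itself uses a heuristic search over subsequence features;
as a much simpler stand-in, the sketch below takes the most frequent
contiguous length-k windows in the training sequences as candidate
subsequences and encodes each sequence as Boolean presence/absence flags
that a learner such as C4.5 or Ripper could consume:

    from collections import Counter

    def select_subsequences(sequences, k=2, n_features=10):
        """Candidate subsequences: most frequent contiguous length-k windows."""
        counts = Counter()
        for seq in sequences:
            counts.update(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
        return [s for s, _ in counts.most_common(n_features)]

    def boolean_features(seq, subsequences):
        """1 if the subsequence occurs contiguously in seq, else 0."""
        def contains(sub):
            k = len(sub)
            return any(tuple(seq[i:i + k]) == sub for i in range(len(seq) - k + 1))
        return [int(contains(sub)) for sub in subsequences]

    # toy Unix command sequences
    train = [s.split() for s in
             ["cd ls cat ls cd", "cd ls cd ls cat", "vi make gcc make vi"]]
    feats = select_subsequences(train, k=2, n_features=5)
    print(feats)
    print([boolean_features(seq, feats) for seq in train])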
- Lam, Wai, and Chao Yang Ho. 1998. "Using A Generalized Instance
Set for Automatic Text Categorization", Proceedings of the 21st
International Conference on Research and Development in Information Retrieval
(SIGIR '98).
[abstract] We investigate several recent
approaches for text categorization under the framework of similarity-based
learning. They include two families of text categorization techniques, namely
the k-nearest neighbor (k-NN) algorithm and linear classifiers. After
identifying the weakness and strength of each technique, we propose a new
technique known as the generalized instance set (GIS) algorithm by unifying
the strengths of k-NN and linear classifiers and adapting to characteristics
of text categorization problems. We also explore some variants of our GIS
approach. We have implemented our GIS algorithm, the ExpNet algorithm, and
some linear classifiers. Extensive experiments have been conducted on two
common document corpora, namely the OHSUMED collection and the Reuters-21578
collection. The results show that our new approach outperforms the latest k-NN
approach and linear classifiers in all experiments.
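[code sketch] As background for the entry above, a bare-bones k-nearest-neighbour
text categorizer (one of the two families the paper starts from, not the GIS
algorithm itself) can be written directly over normalized term-frequency
vectors:

    from collections import Counter
    from math import sqrt

    def tf_vector(text):
        counts = Counter(text.lower().split())
        norm = sqrt(sum(v * v for v in counts.values()))
        return {t: v / norm for t, v in counts.items()}

    def cosine(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    def knn_classify(query, training, k=3):
        """Rank training documents by cosine similarity and let the
        top k vote, weighted by similarity."""
        q = tf_vector(query)
        neighbours = sorted(((cosine(q, tf_vector(d)), c) for d, c in training),
                            reverse=True)[:k]
        votes = Counter()
        for sim, c in neighbours:
            votes[c] += sim
        return votes.most_common(1)[0][0]

    training = [("stock market rally", "finance"), ("bond prices fall", "finance"),
                ("championship game tonight", "sports"), ("team wins the cup", "sports")]
    print(knn_classify("market prices fall", training, k=3))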
- Larkey, Leah, and W. Bruce Croft. 1996. "Combining Classifiers in
Text Categorization," Proceedings of the 19th International Conference
on Research and Development in Information Retrieval (SIGIR '96):
289-297. Available in Postscript format from Internet: <http://cobar.cs.umass.edu/info/psfiles/irpubs/combo.ps.gz>.
[abstract] Three different types of classifiers
were investigated in the context of a text categorization problem in the
medical domain: the automatic assignment of ICD9 codes to dictated inpatient
discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian
independence classifiers were applied individually and in combination. A
combination of different classifiers produced better results than any single
type of classifier. For this specific medical categorization problem, new
query formulation and weighting methods used in the k-nearest-neighbor
classifier improved performance.
- Lehnert, Wendy, S. Soderland, D. Aronow, F. Feng, and A. Shmueli.
199?. "Inductive Text Classification for Medical Applications,"
Proceedings of the 19th International Conference on Research and
Development in Information Retrieval (SIGIR '96): 289-297. Available in
HTML format from Internet: <http://cobar.cs.umass.edu/medical/wl_jetai.html>.
[abstract] Text classification poses a significant
challenge for knowledge-based technologies because it touches on all the
familiar demons of artificial intelligence: the knowledge engineering
bottleneck, problems of scale, easy portability across multiple applications,
and cost-effective system construction. Information retrieval (IR)
technologies traditionally avoid all of these issues by defining a document in
terms of a statistical profile of its lexical items. The IR community is
willing to exploit a superficial type of knowledge found in dictionaries and
thesauri, but anything that requires customization, application-specific
engineering, or any amount of manual tinkering is thought to be incompatible
with practical cost-effective system designs. In this paper we will challenge
those assumptions and show how machine learning techniques can operate as an
effective method for automated knowledge acquisition when it is applied to a
representative training corpus, and leveraged against a few hours of routine
work by a domain expert. We describe a fully implemented text classification
system operating on a medical testbed, and report experimental results based
on that testbed.
- Lewis, David
Dolan. 1992. "Feature Selection and Feature Extraction for Text
Categorization", Proceedings of Speech and Natural Language
Workshop: 212-217. Morgan Kaufmann. Available in Postscript format from
Internet: <http://www.research.att.com/~lewis/papers/lewis94b.ps>.
[abstract] The effect of selecting varying numbers
and kinds of features for use in predicting category membership was
investigated on the Reuters and MUC-3 text categorization data sets. Good
categorization performance was achieved using a statistical classifier and a
proportional assignment strategy. The optimal feature set size for word-based
indexing was found to be surprisingly low (10 to 15 features) despite the
large training sets. The extraction of new text features by syntactic analysis
and feature clustering was investigated on the Reuters data set. Syntactic
indexing phrases, clusters of these phrases, and clusters of words were all
found to provide less effective representations than individual words.
- Lewis, David
Dolan. 1996. "Challenges in machine learning for text classification", In
Proceedings of the Ninth Annual Conference on Computational Learning
Theory, page 1, New York, 1996. ACM. <http://www.research.att.com/~lewis/papers/lewis96f.ps>.
- Lewis, David
Dolan. 1997. "A sequential Algorithm for Training Text Classifiers:
Corrigendum and Additional Data," SIGIR Forum 29(2): 13-19.
<http://www.research.att.com/~lewis/papers/lewis95g.ps>.
- Lewis, David
Dolan. 1997. "Information Retrieval and the Statistics of Large Data
Sets," Proceedings of the NRC Massive Data Sets Workshop:
143-147. <http://www.research.att.com/~lewis/papers/lewis96c.ps>.
[abstract] The way in which text is represented
has a strong impact on the performance of text classification (retrieval and
categorization) systems. We discuss the operation of text classification
systems, introduce a theoretical model of how text representation impacts
their performance, and describe how the performance of text classification
systems is evaluated. We then present the results of an experiment on
improving text representation quality, as well as an analysis of the results
and the directions they suggest for future research.
- Lewis, David
Dolan, and Marc Ringuette. 1994. "A Comparison of Two Learning Algorithms
for Text Categorization," Proceedings of the Third Annual Symposium on
Document Analysis and Information Retrieval: 81-93. Available in
Postscript format from Internet: <http://www.research.att.com/~lewis/papers/lewis94b.ps>.
[abstract] Algorithms for training Bayesian
independence classifiers and decision trees were tested on two text
categorization data sets. Both algorithms allow adjustable tradeoffs between
recall and precision and have similar classification effectiveness. The
decision tree method, while slower, produces a classifier that is easier to
understand and in one case revealed an unsuspected chronological variation in
the category definitions. Maximum effectiveness is reached for both algorithms
when the initial set of features is pruned using collection frequency and
mutual information. This supports previous suggestions that the stepwise
feature selection in decision tree algorithms can be aided by prefiltering the
feature set.
- Lewis, David
Dolan, Robert E. Schapire, James P. Callan, and Ron Papka. 1996. "Training
Algorithms for Linear Text Classifiers", Proceedings of the 19th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval: 298-306. <http://www.research.att.com/~lewis/papers/lewis96d.ps>.
[abstract] Systems for text retrieval, routing,
categorization and other IR tasks rely heavily on linear classifiers. We
propose that two machine learning algorithms, the Widrow-Hoff and EG
algorithms, be used in training linear text classifiers. In contrast to most
IR methods, theoretical analysis provides performance guarantees and guidance
on parameter settings for these algorithms. Experimental data is presented
showing Widrow-Hoff and EG to be more effective than the widely used Rocchio
algorithm on several categorization and routing tasks.
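[code sketch] The Widrow-Hoff (least-mean-squares) update the abstract refers
to adjusts a weight vector after each training document in proportion to the
prediction error. A small numpy sketch with made-up data (the EG variant,
which uses multiplicative updates, is not shown):

    import numpy as np

    def widrow_hoff(X, y, lr=0.05, epochs=20, seed=0):
        """Train a linear classifier with the Widrow-Hoff (LMS) rule:
        w <- w - 2 * lr * (w.x - y) * x for each example."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                error = w @ X[i] - y[i]
                w -= 2 * lr * error * X[i]
        return w

    # toy term-count vectors; y = 1 for the "finance" category
    X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
    y = np.array([1, 1, 0, 0], dtype=float)
    w = widrow_hoff(X, y)
    print((X @ w > 0.5).astype(int))    # predicted category membership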
- Li, Hang, and Kenji Yamanishi. 1997. "Document Classification
Using a Finite Mixture Model," Proceedings of the ACL/EACL-97
Available in Postscript format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9705005>.
[abstract] We propose a new method of classifying
documents into categories. The simple method of conducting hypothesis testing
over word-based distributions in categories suffers from the data sparseness
problem. In order to address this difficulty, Guthrie et al. have developed a
method using distributions based on hard clustering of words, i.e., in which a
word is assigned to a single cluster and words in the same cluster are treated
uniformly. This method might, however, degrade classification results, since
the distributions it employs are not always precise enough for representing
the differences between categories. We propose here the use of soft clustering
of words, i.e., in which a word can be assigned to several different clusters
and each cluster is characterized by a specific word probability distribution.
We define for each document category a finite mixture model, which is a linear
combination of the probability distributions of the clusters. We thereby treat
the problem of classifying documents as that of conducting statistical
hypothesis testing over finite mixture models. In order to accomplish this
testing, we employ the EM algorithm which helps efficiently estimate
parameters in a finite mixture model. Experimental results indicate that our
method outperforms not only the method using distributions based on hard
clustering, but also the method using word-based distributions and the method
based on cosine-similarity.
- Macskassy, Sofus, Arunava Banerjee, Brian Davison, and Haym Hirsh. 1998. "Human
Performance on Clustering Web Pages: A Preliminary Study", Proceedings
of the Fourth International Conference on Knowledge Discovery and Data Mining
(KDD98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/kdd2.ps>.
[abstract] With the increase in information on the
World Wide Web it has become difficult to quickly find desired information
without using multiple queries or using a topic-specific search engine. One
way to help in the search is by grouping HTML pages together that appear in
some way to be related. In order to better understand this task, we performed
an initial study of human clustering of web pages, in the hope that it would
provide some insight into the difficulty of automating this task. Our results
show that subjects did not cluster identically; in fact, on average, any two
subjects had little similarity in their webpage clusters. We also found that
subjects generally created rather small clusters, and those with access only
to URLs created fewer clusters than those with access to the full text of each
web page. Generally the overlap of documents between clusters for any given
subject increased when given the full text, as did the percentage of documents
clustered. When analyzing individual subjects, we found that each had
different behavior across queries, both in terms of overlap, size of clusters,
and number of clusters. These results provide a sobering note on any quest for
a single clearly correct clustering method for web pages.
- McCallum, Andrew,
Kamal Nigam, Jason Rennie, and
Kristie Seymore. 1999. "Building Domain-Specific Search Engines with Machine Learning
Techniques", AAAI-99 Spring Symposium. <http://www.cs.cmu.edu/~mccallum/papers/cora-aaaiss99.ps.gz>.
[abstract] Domain-specific search engines are
growing in popularity because they offer increased accuracy and extra
functionality not possible with the general, Web-wide search engines. For
example, www.campsearch.com allows complex queries by age-group, size,
location and cost over summer camps. Unfortunately these domain-specific
search engines are difficult and time-consuming to maintain. This paper
proposes the use of machine learning techniques to greatly automate the
creation and maintenance of domain-specific search engines. We describe new
research in reinforcement learning, information extraction and text
classification that enables efficient spidering, identifying informative text
segments, and populating topic hierarchies. Using these techniques, we have
built a demonstration system: a search engine for computer science research
papers. It already contains over 50,000 papers and is publicly available at
www.com.justresearch.com.
- McCallum, Andrew,
and Kamal Nigam. 1998. "A
Comparison of Event Models for Naive Bayes Text Classification", AAAI-98
Workshop on Learning for Text Categorization. <http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps>.
- McCallum, Andrew,
and Kamal Nigam. 1998. "Employing
EM in Pool-Based Active Learning for Text Classification",
ICML-98. <http://www.cs.cmu.edu/~mccallum/papers/emactive-icml98.ps.gz>.
- Mostafa, J., L. M. Quiroga, M. Palakal. 1998. "Filtering Medical
Documents Using Automated and Human Classification Methods", Journal of
the American Society for Information Science, 49(14): 1304-1318.
[abstract] The goal of this research is to clarify
the role of document classification in information filtering. An important
function of classification, in managing computational complexity, is described
and illustrated in the context of an existing filtering system. A parameter
called classification homogeneity is presented for analyzing unsupervised
automated classification by employing human classification as a control. Two
significant components of the automated classification approach, vocabulary
discovery and classification scheme generation, are described in detail.
Results of classification performance revealed considerable variability in the
homogeneity of automatically produced classes. Based on the classification
performance, different types of interest profiles were created. Subsequently,
these profiles were used to perform filtering sessions. The filtering results
showed that with increasing homogeneity, filtering performance improves, and,
conversely, with decreasing homogeneity, filtering performance degrades.
- Moens, Marie-Francine, and Caroline Uyttendaele. 1997. "Automatic
Text Structuring and Categorization as a First Step in Summarizing Legal
Cases," Informatin Processing & Management 33(6): 727-737.
[abstract] The SALOMON system automatically
summarizes Belgian criminal cases in order to improve access to the large
number of existing and future court decisions. SALOMON extracts relevant text
units from the case text to form a case summary. Such a case profile
facilitates the rapid determination of the relevance of the case or may be
employed in text search. In a first important abstracting step SALOMON
performs an initial categorization of legal criminal cases and structures the
case text into separate legally relevant and irrelevant components. A text
grammar represented as a semantic network is used to automatically determine
the category of the case and its components. In this way, we are able to
extract from the case general data and to identify text portions relevant for
further abstracting. It is argued that prior knowledge of the text structure
and its indicative cues may support automatic abstracting. A text grammar is a
promising form for representing the knowledge involved.
- Ng, Hwee Tou, Wei Boon Goh, and Kok Leong Low. 1997. "Feature
Selection, Perceptron Learning, and a Usability Case Study for Text
Categorization," Proceedings of the 20th International Conference on
Research and Development in Information Retrieval (SIGIR '97): 67-73.
[abstract] In this paper, we describe an automated
learning approach to text categorization based on perceptron learning and a
new feature selection metric, called correlation coefficient. Our approach has
been tested on the standard Reuters text categorization collection. Empirical
results indicate that our approach outperforms the best published results on
this Reuters collection. In particular, our new feature selection method
yields considerable improvement. We also investigate the usability of our
automated learning approach by actually developing a system that categorizes
texts into a tree of categories. We compare the accuracy of our learning
approach to a rule-based, expert system approach that uses a text
categorization shell built by Carnegie Group. Although our automated learning
approach still gives a lower accuracy, by appropriately incorporating a set of
manually chosen words to use as features, the combined, semi-automated
approach yields accuracy close to the rule-based approach.
- Nigam, Kamal, John
Lafferty, and Andrew McCallum.
1999. "Using Maximum Entropy for Text Classification", To appear in
IJCAI-99 Workshop on Machine Learning for Information Filtering,
1999. <http://www.cs.cmu.edu/~knigam/papers/maxent-ijcaiws99.ps>.
[abstract] This paper proposes the use of maximum
entropy techniques for text classification. Maximum entropy is a probability
distribution estimation technique widely used for a variety of natural
language tasks, such as language modeling, part-of-speech tagging, and text
segmentation. The underlying principle of maximum entropy is that without
external knowledge, one should prefer distributions that are uniform.
Constraints on the distribution, derived from labeled training data, inform
the technique where to be minimally non-uniform. The maximum entropy
formulation has a unique solution which can be found by the improved iterative
scaling algorithm. In this paper, maximum entropy is used for text
classification by estimating the conditional distribution of the class
variable given the document. In experiments on several text datasets we
compare accuracy to naive Bayes and show that maximum entropy is sometimes
significantly better, but also sometimes worse. Much future work remains, but
the results indicate that maximum entropy is a promising technique for text
classification.
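[code sketch] Once the lambdas have been trained, classification under the
maximum entropy formulation described above is just the normalized
exponential of a weighted feature sum. The sketch below evaluates that
conditional distribution for hand-picked (hypothetical) features and weights;
the improved iterative scaling training step is not shown:

    import math

    def maxent_posterior(doc, classes, features, weights):
        """P(c | d) proportional to exp(sum_i lambda_i * f_i(d, c))."""
        scores = {c: math.exp(sum(l * f(doc, c)
                                  for l, f in zip(weights, features)))
                  for c in classes}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    # word-count features scaled by document length, each tied to one class
    def word_in_class(word, target):
        return lambda doc, c: (doc.split().count(word) / len(doc.split())
                               if c == target else 0.0)

    features = [word_in_class("ball", "sports"), word_in_class("market", "finance")]
    weights = [2.0, 2.0]          # hypothetical trained lambdas
    print(maxent_posterior("the ball flew over the net",
                           ["sports", "finance"], features, weights))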
- Nigam, Kamal, Andrew McCallum, Sebastian Thrun,
and Tom Mitchell. 1998. "Text
Classification from Labeled and Unlabeled Documents using EM", To appear in
Machine Learning, 1999. <http://www.cs.cmu.edu/~knigam/papers/emcat-mlj99.ps>.
[abstract] This paper shows that the accuracy of
learned text classifiers can be improved by augmenting a small number of
labeled training documents with a large pool of unlabeled documents. This is
important because in many text classification problems obtaining training
labels is expensive, while large quantities of unlabeled documents are
readily available. We introduce an algorithm for learning from labeled and
unlabeled documents based on the combination of Expectation-Maximization (EM)
and a naive Bayes classifier. The algorithm first trains a classifier using
the available labeled documents, and probabilistically labels the unlabeled
documents. It then trains a new classifier using the labels for all the
documents, and iterates to convergence. This base EM procedure works well when
the data conform to the generative assumptions of the model. However these
assumptions are often violated in practice, and poor performance can result.
We present two extensions to the algorithm that improve classification
accuracy under these conditions: (1) a weighting factor to modulate the
contribution of the unlabeled data, and (2) the use of multiple mixture
components per class. Experimental results, obtained using text from three
different real-world tasks, show that the use of unlabeled data reduces
classification error by up to 30%.
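[code sketch] The basic EM procedure described above (train on the labeled
documents, probabilistically label the unlabeled pool, retrain on everything,
repeat) can be sketched with a naive Bayes classifier that accepts fractional
class weights. This is a toy illustration with invented documents, without
the paper's extensions (unlabeled-data weighting and multiple mixture
components per class):

    import math
    from collections import Counter, defaultdict

    def nb_train(docs, soft_labels, classes, alpha=1.0):
        """Multinomial naive Bayes from fractional class weights per document."""
        vocab = {t for d in docs for t in d.split()}
        term = {c: defaultdict(float) for c in classes}
        prior = {c: alpha for c in classes}
        for d, weights in zip(docs, soft_labels):
            for c, w in weights.items():
                prior[c] += w
                for t in d.split():
                    term[c][t] += w
        total_prior = sum(prior.values())
        model = {}
        for c in classes:
            total = sum(term[c].values()) + alpha * len(vocab)
            model[c] = ({t: (term[c][t] + alpha) / total for t in vocab},
                        prior[c] / total_prior)
        return model

    def nb_posterior(doc, model):
        logp = {c: math.log(p) + sum(math.log(probs.get(t, 1e-9))
                                     for t in doc.split())
                for c, (probs, p) in model.items()}
        m = max(logp.values())
        exp = {c: math.exp(v - m) for c, v in logp.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    def em_text(labeled, unlabeled, classes, iters=10):
        docs = [d for d, _ in labeled] + list(unlabeled)
        hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
        model = nb_train([d for d, _ in labeled], hard, classes)
        for _ in range(iters):
            soft = hard + [nb_posterior(d, model) for d in unlabeled]   # E-step
            model = nb_train(docs, soft, classes)                       # M-step
        return model

    labeled = [("stocks fell", "finance"), ("the team won", "sports")]
    unlabeled = ["stocks and bonds rallied", "the team lost the match",
                 "bonds fell sharply"]
    print(nb_posterior("bonds rallied", em_text(labeled, unlabeled,
                                                ["finance", "sports"])))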
- Nigam, Kamal, Andrew McCallum, Sebastian Thrun,
and Tom Mitchell. 1998. "Learning to Classify Text from Labeled and Unlabeled
Documents", AAAI-98. <http://www.cs.cmu.edu/~mccallum/papers/emcat-aaai98.ps.gz>.
[abstract] In many important text classification
problems, acquiring class labels for training documents is costly, while
gathering large quantities of unlabeled data is cheap. This paper shows that
the accuracy of text classifiers trained with a small number of labeled
documents can be improved by augmenting this small training set with a large
pool of unlabeled documents. We present a theoretical argument showing that,
under common assumptions, unlabeled data contain information about the target
function. We then introduce an algorithm for learning from labeled and
unlabeled text based on the combination of Expectation-Maximization with a
naive Bayes classifier. The algorithm first trains a classifier using the
available labeled documents, and probabilistically labels the unlabeled
documents; it then trains a new classifier using the labels for all the
documents, and iterates to convergence. Experimental results, obtained using
text from three different real-world tasks, show that the use of unlabeled
data reduces classification error by up to 33%.
- Norton, Steven W., and Haym Hirsh. 1992. "Classifier
Learning from Noisy Data as Probabilistic Evidence Combination",
Proceedings of the Tenth National Conference on Artificial Intelligence
(AAAI92): 141-146. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1992/aaai3.ps>.
[abstract] This paper presents an approach to
learning from noisy data that views the problem as one of reasoning under
uncertainty, where prior knowledge of the noise process is applied to compute
a posteriori probabilities over the hypothesis space. In preliminary
experiments this maximum a posteriori (MAP) approach exhibits a learning rate
advantage over the C4.5 algorithm that is statistically significant.
- O'Neil, P. 1997. "An incremental approach to text representation,
categorization, and retrieval", Proceedings of the Fourth International
Conference on Document Analysis and Recognition, 714-717.
[abstract] Efficient and accurate information
retrieval is a goal of just about everyone. Whether you are looking for
information on the Internet, a book or article in the library, satellite
imagery of missile silos, or a recipe for dinner, finding exactly what you
want or need, even if you know exactly what you are looking for, can be an
imposing and most difficult task. Many current techniques require an intimate
understanding of the actual processes involved. The method presented in this
paper provides for an automatic representation of text data by vectors, which
can then be manipulated to categorize and organize the data. Information can
be retrieved without knowledge of the underlying process. The user can ask for
information using normal discourse. This technology can also be applied to
data mining and visualization.
- Papka, Ron. 1999. "Learning Query Bias for Improved On-Line
Document Classification", a poster presentation at Snowbird 99 conference,
Snowbird, Utah, April 5-10, 1999. Also technical report IR-160, CIIR, UMASS. <http://cobar.cs.umass.edu/pubfiles/ir-160.ps>.
- Papka, Ron, and James Allan. 1998. "Document Classification Using
Multiword Features", Proceedings of the 1998 ACM 7th International
Conference on Information and Knowledge Management: 124-131.
[abstract] We investigate the use of multiword
query features to improve the effectiveness of text-retrieval systems that
accept natural-language queries. A relevance feedback process is explained
that expands an initial query with single and multiword features. The
multiword features are modelled as a set of words appearing within windows of
varying sizes. Our experimental results suggest that windows of larger span
yield improvements in retrieval over windows of smaller span. This result
gives rise to a query contraction process that prunes 25% of the features in
an expanded query with no loss in retrieval effectiveness.
- Riloff, E. 1993.
"Using Cases to Represent Context for Text Classification," Proceedings
of the Second International Conference on Information and Knowledge Management
(CIKM-93) : 105-113.
[abstract] Research on text classification has
typically focused on keyword searches and statistical techniques. Keywords
alone cannot always distinguish the relevant from the irrelevant texts and
some relevant texts do not contain any reliable keywords at all. Our approach
to text classification uses case-based reasoning to represent natural language
contexts that can be used to classify texts with extremely high precision. The
case base of natural language contexts is acquired automatically during
sentence analysis using a training corpus of texts and their correct relevancy
classifications. A text is represented as a set of cases and we classify a
text as relevant if any of its cases is deemed to be relevant. We rely on the
statistical properties of the case base to determine whether similar cases are
highly correlated with relevance for the domain. Experiments with the MUC-4
corpus suggest that case-based text classification can achieve very high
levels of precision and outperforms our previous algorithms based on relevancy
signatures.
- Riloff, E. 1994.
Information Extraction as a Basis for Portable Text Classification
Systems. Ph.D. Thesis, Computer Science Department, University of
Massachusetts, Amherst. Available from <http://www.cs.utah.edu/~riloff/psfiles/single-thesis.ps>.
- Riloff, E. 1995.
"Little Words Can Make a Big Difference for Text Classification,"
Proceedings of the 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: 130-136. Also
available from <http://www.cs.utah.edu/~riloff/psfiles/sigir95.ps>.
- Riloff, E. 1996.
"Automatically Generating Extraction Patterns from Untagged Text,"
Proceedings of the Thirteenth National Conference on Artificial
Intelligence (AAAI-96): 1044-1049. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai96.ps>.
- Riloff, E. 1996.
"Using Learned Extraction Patterns for Text Classification," in S. Wermter, E.
Riloff, and G. Scheler (eds), Connectionist, Statistical, and Symbolic
Approaches to Learning for Natural Language Processing.
Springer-Verlag, Berlin. pp. 275-289. Also available from <http://www.cs.utah.edu/~riloff/psfiles/ijcai-book-chapter.ps>.
- Riloff, E., and
J. Lorenzen. 1997. "Extraction-based Text Categorization: Generating
Domain-specific Role Relationships Automatically," in Tomek Strzalkowski
(ed.), Natural Language Information Retrieval. Springer-Verlag,
Berlin. Also available from <http://www.cs.utah.edu/~riloff/psfiles/psfiles/nlp-ir-chapter.ps>.
- Riloff, E., and J.
Shoen. 1995. "Automatically Acquiring Conceptual Patterns Without an Annotated
Corpus," Proceedings of the 18th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval:
130-136. Also available from <http://www.cs.utah.edu/~riloff/psfiles/sigir95.ps>.
- Riloff, E., and J.
Shoen. 1995. "Automatically Acquiring Conceptual Patterns Without an Annotated
Corpus," Proceedings of the Third Workshop on Very Large Corpora:
148-161. Also available from <http://www.cs.utah.edu/~riloff/psfiles/vlc-paper.ps>.
- Riloff, E., and W.
Lehnert. 1992. "Classifying Texts Using Relevancy Signatures,"
Proceedings of the Tenth National Conference on Artificial
Intelligence: 329-334.
- Riloff, E., and W.
Lehnert. 1994. "Information Extraction as a Basis for High-Precision Text
Classification," ACM Transactions on Information Systems12(3):
296-333. Also available from <http://www.cs.utah.edu/~riloff/psfiles/single-acm.ps>.
- Shklar, Leon, and Haym
Hirsh. 1997. "Imposing Bounds on the Number of Categories for Incremental
Concept Formation", In R. Greiner, T. Petsche, and S.J. Hanson, editors,
Computational Learning Theory and Natural Learning Systems,
Volume IV, pages 36-49. MIT Press. <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1997/clnl.ps>.
- Takkinen, Juha. 1995. "An Adaptive Approach to Text Categorization
and Understanding," Proceedings of the 5th Annual IDA Conference on
Computer and Information Science: 39-42. Available in Postscript format
from Internet: <http://www.ida.liu.se/~juhta/publications/IDAconf95_paper.ps>.
Also in HTML format from Internet: <http://www.ida.liu.se/~juhta/publications/IDAconf95_paper.html>.
[abstract] Categorization of electronic documents
is discussed and the results of a preliminary study of techniques that people
use for categorizing electronic mail letters are presented. The results,
mainly descriptive, are used as a basis for a discussion of the feasibility of
an adaptive approach to text categorization and understanding. The
adaptability is supposed to come from a design of a system that uses knowledge
about documents in general, rather than any specific type of document, to
categorize documents. One implication of this approach is a text
categorization system which can easily be adapted to categorizing different
types of documents and perform automatic text categorization.
- Tokunaga, Takenobu, and Makoto Iwayama. 1994. Text
categorization based on weighted inverse document frequency. Technical
report 94TR-0001, Department of Computer Science, Tokyo Institute of
Technology. Available in Postscript format from Internet: <ftp://ftp.cs.titech.ac.jp/pub/TR/94/94TR0001.ps.gz>.
[abstract] This paper proposes a new term
weighting method called weighted inverse document frequency (WIDF). As its
name indicates, WIDF is an extension of IDF (inverse document frequency) to
incorporate the term frequency over the collection of texts. WIDF of a term in
a text is given by dividing the frequency of the term in the text by the sum
of the frequency of the term over the collection of texts. WIDF is applied to
the text categorization task and proved to be superior to the other methods.
The improvement of accuracy on IDF is 7.4% at the maximum.
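[example] The WIDF weighting defined in the abstract above can be read directly as a small function: the frequency of a term in one text divided by the term's total frequency over the whole collection. The sketch below is an illustrative rendering of that definition in Python, not the authors' implementation; the function name and the tokenized input format are assumptions.

    def widf(term, doc_tokens, collection_token_lists):
        """WIDF as described above: term frequency in this text divided by
        the summed frequency of the term over the whole collection."""
        tf_in_doc = doc_tokens.count(term)
        tf_in_collection = sum(tokens.count(term) for tokens in collection_token_lists)
        return tf_in_doc / tf_in_collection if tf_in_collection else 0.0

    # Usage: widf("bank", doc, [doc1, doc2, doc3]) where each doc is a list of tokens.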
- Wiebe, Janyce, Rebecca Bruce, and Lei Duan. 1997. "Probabilistic
Event Categorization," Proceedings of the Recent Advances in Natural
Language Processing (RANLP-97): 163-170. Available in Postscript format
from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9710008>.
[abstract] This paper describes the automation of
a new text categorization task. The categories assigned in this task are more
syntactically, semantically, and contextually complex than those typically
assigned by fully automatic systems that process unseen test data. Our system
for assigning these categories is a probabilistic classifier, developed with a
recent method for formulating a probabilistic model from a predefined set of
potential features. This paper focuses on feature selection. It presents a
number of fully automatic features. It identifies and evaluates various
approaches to organizing collocational properties into features, and presents
the results of experiments covarying type of organization and type of
property. We find that one organization is not best for all kinds of
properties, so this is an experimental parameter worth investigating in NLP
systems. In addition, the results suggest a way to take advantage of
properties that are low frequency but strongly indicative of a class. The
problems of recognizing and organizing the various kinds of contextual
information required to perform a linguistically complex categorization task
have rarely been systematically investigated in NLP.
- Yaari, Yaakov. 1997. "Segmentation of Expository Texts by
Hierarchical Agglomerative Clustering," Proceedings of the
RANLP'97. <ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996/UM-CS-1996-087.ps>.
[abstract] We propose a method for segmentation of
expository texts based on hierarchical agglomerative clustering. The method
uses paragraphs as the basic segments for identifying hierarchical discourse
structure in the text, applying lexical similarity between them as the
proximity test. Linear segmentation can be induced from the identified
structure through application of two simple rules. However the hierarchy can
be used also for intelligent exploration of the text. The proposed
segmentation algorithm is evaluated against an accepted linear segmentation
method and shows comparable results.
- Yang, Yiming. 1996.
"Sampling Strategies and Learning Efficiency in Text Categorization", AAAI
Spring Symposium on Machine Learning in Information Access 1996: 88-95. <http://www.cs.cmu.edu/~yiming/papers.yy/aaai_mlia96.ps>.
[abstract] This paper studies training set
sampling strategies in the context of statistical learning for text
categorization. It is argued sampling strategies favoring common categories is
superior to uniform coverage or mistake-driven approaches, if performance is
measured by globally assessed precision and recall. The hypothesis is
empirically validated by examining the performance of a nearest neighbor
classifier on training samples drawn from a pool of 235,401 training texts
with 29,741 distinct categories. The learning curves of the classifier are
analyzed with respect to the choice of training resources, the sampling
methods, the size, vocabulary and category coverage of a sample, and the
category distribution over the texts in the sample. A nearly-optimal
categorization performance of the classifier is achieved using a relatively
small training sample, showing that statistical learning can be successfully
applied to very large text categorization problems with affordable
computation.
- Yang, Yiming. 1995.
"Noise Reduction in a Statistical Approach to Text Categorization", ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR'95)
1995: 256-263. <http://www.cs.cmu.edu/~yiming/papers.yy/sigir95.ps>.
[abstract] This paper studies noise reduction for
computational efficiency improvements in a statistical learning method for
text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple
noise reduction strategies are proposed and evaluated, including: an aggressive
removal of "non-informative words" from texts before training; the use of a
truncated singular value decomposition to cut off noisy "latent semantic
structures" during training; the elimination of non-influential components in
the LLSF solution (a word-concept association matrix) after training. Text
collections in different domains were used for evaluation. Significant
improvements in computational efficiency without losing categorization
accuracy were evident in the testing results.
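[example] The second noise-reduction strategy above, truncating the singular value decomposition, can be illustrated generically: keep only the k largest singular values of the matrix and rebuild a rank-k approximation, discarding the noisier "latent semantic structures". This is a minimal NumPy sketch of that step under assumed names, not Yang's LLSF code.

    import numpy as np

    def truncated_svd(matrix, k):
        """Rank-k approximation of a term-document (or word-concept) matrix;
        the smaller singular values are cut off as noise."""
        u, s, vt = np.linalg.svd(matrix, full_matrices=False)
        return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]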
- Yang, Yiming, and
John Wilbur. 1996. "Using Corpus Statistics to Remove Redundant Words in Text
Categorization," Journal of the American Society for Information
Science 47(5): 357-369.
[abstract] This article studies aggressive word
removal in text categorization to reduce the noise in free texts and to
enhance the computational efficiency of categorization. We use a novel stop
word identification method to automatically generate domain specific stoplists
which are much larger than a conventional domain-independent stoplist. In our
tests with three categorization methods on text collections from different
domains/applications, significant numbers of words were removed without
sacrificing categorization effectiveness. In the test of the Expert Network
method on CACM documents, for example, an 87% removal of unique words reduced
the vocabulary of documents from 8,002 distinct words to 1,045 words, which
resulted in a 63% time savings and a 74% memory savings in the computations of
category ranking, with a 10% precision improvement on average over not using
word removal. It is evident in this study that automated word removal based on
corpus statistics has a practical and significant impact on the computational
tractability of categorization methods in large databases.
Text Categorization with Learning Methods
- Benkhalifa, M.; Bensaid, A.; Mouradi, A. 1999. "Text
categorization using the semi-supervised fuzzy c-means algorithm",
Proceedings of the 18th International Conference of the North American
Fuzzy Information (NAFIPS'99), 561-565.
[abstract] Text categorization (TC) is the
automated assignment of text documents to predefined categories based on
document contents. TC has become very important in the information retrieval
area, where information needs have tremendously increased with the rapid
growth of textual information sources such as the Internet. We compare, for
text categorization, two partially supervised (or semi-supervised) clustering
algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC)
algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM)
algorithm (M. Amine et al., 1996). This (semi-supervised) learning paradigm
falls somewhere between the fully supervised and the fully unsupervised
learning schemes, in the sense that it exploits both class information
contained in labeled data (training documents) and structure information
possessed by unlabeled data (test documents) in order to produce better
partitions for test documents. Our experiments make use of the Reuters-21578
database of documents and consist of a binary classification for each of the
ten most populous categories of the Reuters database. To convert the documents
into vector form, we experiment with different numbers of features, which we
select, based on an information gain criterion. We verify experimentally that
ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM)
algorithm. With a smaller number of features, ssFCM's performance is also
superior to that of ssAHC's. Finally ssFCM results in improved performance and
faster execution time as more weight is given to training documents.
- Berger, A. 1999.
"Error-correcting output coding for text classification", IJCAI'99:
Workshop on machine learning for information filtering. Stockholm,
Sweden. <http://www.cs.cmu.edu/~aberger/ps/ecoc.ps>.
[abstract] This paper applies error-correcting
output coding (ECOC) to the task of document categorization. ECOC, of recent
vintage in the AI literature, is a method for decomposing a multiway
classification problem into many binary classification tasks, and then
combining the results of the subtasks into a hypothesized solution to the
original problem. There has been much recent interest in the machine learning
community about algorithms which integrate "advice" from many subordinate
predictors into a single classifier, and error-correcting output coding is one
such technique. We provide experimental results on several real-world
datasets, extracted from the Internet, which demonstrate that ECOC can offer
significant improvements in accuracy over conventional classification
algorithms.
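[example] The ECOC decomposition described above can be summarized as: assign each category a binary codeword, train one binary classifier per codeword bit, and label a test document with the category whose codeword is closest in Hamming distance to the predicted bit string. The sketch below is a schematic illustration under assumed names (the code matrix and the scikit-learn base learner are placeholders, and each binary task is assumed to contain both classes); it is not Berger's system.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_ecoc(X, y, code_matrix):
        """Train one binary classifier per column (bit) of the code matrix;
        code_matrix[c, b] is the target bit for class c on binary task b."""
        return [LogisticRegression(max_iter=1000).fit(X, code_matrix[y, b])
                for b in range(code_matrix.shape[1])]

    def predict_ecoc(classifiers, X, code_matrix):
        """Assign the class whose codeword is nearest (Hamming) to the predicted bits."""
        bits = np.column_stack([clf.predict(X) for clf in classifiers])
        dists = np.array([[np.sum(row != cw) for cw in code_matrix] for row in bits])
        return dists.argmin(axis=1)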
- Boley, Daniel, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle
Hastings, George Karypis, Vipin
Kumar, Bamshad Mobasher, and Jerome Moore. 1999. "Document Categorization
and Query Generation on the World Wide Web Using WebACE", To appear in
AI Review. <ftp://ftp.cs.umn.edu/dept/users/kumar/aireview-agent.ps>.
[abstract] Text categorization is the task of
deciding whether a document belongs to a set of prespecified classes of
documents. Automatic classification schemes can greatly facilitate the process
of categorization. Categorization of documents is challenging, as the number
of discriminating words can be very large. Many existing algorithms simply
would not work with these many number of features. k-nearest neighbor (k-NN)
classification is an instance-based learning algorithm that has shown to be
very effective for a variety of problem domains including documents. The key
element of this scheme is the availability of a similarity measure that is
capable of identifying neighbors of a particular document. A major drawback of
the similarity measure used in k-NN is that it uses all features in computing
distances. In many document data sets, only smaller number of the total
vocabulary may be useful in categorizing documents. A possible approach to
overcome this problem is to learn weights for different features (or words in
document data sets). In this paper, we propose the Weight Adjusted k-Nearest
Neighbor (WAKNN) classification algorithm that is based on the k-NN
classification paradigm. In WAKNN, the weights of features are learned using
an iterative algorithm. In the weight adjustment step, the weight of each
feature is perturbed in small steps to see if the change improves the
classification objective function. The feature with the most improvement in
the objective function is identified and the corresponding weight is updated.
The feature weights are used in the similarity measure computation such that
important features contribute more in the similarity measure. Experiments on
several real life document data sets show the promise of WAKNN, as it
outperforms the state of the art classification algorithms such as C4.5,
RIPPER, Rainbow, PEBLS, and VSM.
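[example] The weight-adjustment loop sketched in the abstract above (perturb each feature weight in small steps and keep the single change that most improves the classification objective) can be reduced to a short iteration. The code below is an illustrative reduction that assumes cosine similarity and leave-one-out training accuracy as the objective; it is not the authors' WAKNN implementation.

    import numpy as np

    def loo_knn_accuracy(X, y, w, k=3):
        """Leave-one-out accuracy of weighted-cosine k-NN, used here as the
        objective the weight adjustment tries to improve."""
        Xw = X * w
        norms = np.linalg.norm(Xw, axis=1) + 1e-12
        sim = (Xw @ Xw.T) / np.outer(norms, norms)
        np.fill_diagonal(sim, -np.inf)
        correct = 0
        for i in range(len(y)):
            neighbors = np.argsort(sim[i])[-k:]
            votes = np.bincount(y[neighbors], minlength=y.max() + 1)
            correct += int(votes.argmax() == y[i])
        return correct / len(y)

    def adjust_weights(X, y, iterations=20, delta=0.5):
        """Per iteration, perturb every feature weight up and down by delta and
        keep only the single perturbation that improves the objective the most."""
        w = np.ones(X.shape[1])
        for _ in range(iterations):
            base = loo_knn_accuracy(X, y, w)
            best_gain, best_w = 0.0, None
            for j in range(X.shape[1]):
                for d in (delta, -delta):
                    trial = w.copy()
                    trial[j] = max(trial[j] + d, 0.0)
                    gain = loo_knn_accuracy(X, y, trial) - base
                    if gain > best_gain:
                        best_gain, best_w = gain, trial
            if best_w is None:
                break
            w = best_w
        return w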
- Cohen, William
W. 1995. "Learning to Classify English Text with ILP Methods", In L. De
Raedt (Ed.), Advances in Inductive Logic Programming, IOS Press,
1995. <http://www.research.att.com/~wcohen/postscript/ilp.ps>.
[abstract] Text categorization is the task
of classifying text into one of several predefined categories. In this paper
we will evaluate the effectiveness of several ILP methods for text
categorization, and also compare them to their propositional analogs. The
methods considered are FOIL, the propositional rule-learning system RIPPER,
and a first-order version of RIPPER called FLIPPER. We show that the benefit
of using a first-order representation in this domain is relatively modest; in
particular, the performance difference between FLIPPER and FOIL and their
propositional counterparts is quite small, compared to the differences between
FOIL and FLIPPER. However, a first-order representation seems to be
advantageous when high-precision classifiers are desirable.
- Cohen,
William W. 1995. "Text Categorization and Relational Learning,"
Proceedings of the Twelfth International Conference on Machine
Learning. Available from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/ml-95-ir.ps>.
- Cohen,
William W. 1996. "Learning Rules that Classify E-Mail," Proceedings
of the 1996 AAAI Spring Symposium on Machine Learning in Information
Access. Available from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/aaai-ss-96.ps>.
- Cohen,
William W., and Yoram Singer. 1996. "Context-Sensitive Learning Methods
for Text Categorization," Proceedings of the SIGIR-96. Available
from <http://portal.research.bell-labs.com/orgs/ssr/people/wcohen/postscript/sigir-96.ps>.
- Cohen,
William W., and Haym
Hirsh. 1998. "Joins that Generalize: Text Classification Using WHIRL",
Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining (KDD98). <ftp://ftp.cs.rutgers.edu/pub/hirsh/papers/1998/kdd1.ps>.
[abstract] WHIRL is an extension of relational
databases that can perform "soft joins" based on the similarity of textual
identifiers; these soft joins extend the traditional operation of joining
tables based on the equivalence of atomic values. This paper evaluates WHIRL
on a number of inductive classification tasks using data from the World Wide
Web. We show that although WHIRL is designed for more general similarity-based
reasoning tasks, it is competitive with mature inductive classification
systems on these classification tasks. In particular, WHIRL generally achieves
lower generalization error than C4.5, RIPPER, and several nearest-neighbor
methods. WHIRL is also fast---up to 500 times faster than C4.5 on some
benchmark problems. We also show that WHIRL can be efficiently used to select
from a large pool of unlabeled items those that can be classified correctly
with high confidence.
- Cohen,
William W., and Yoram
Singer. 1999. "Context-sensitive learning methods for text categorization",
ACM Transactions on Information Systems, 17(2), 141-173.
[abstract] Two recently implemented
machine-learning algorithms, RIPPER and sleeping-experts for phrases, are
evaluated on a number of large text categorization problems. These algorithms
both construct classifiers that allow the "context" of a word w to affect how
(or even whether) the presence or absence of w will contribute to a
classification. However, RIPPER and sleeping-experts differ radically in many
other respects: differences include different notions as to what constitutes a
context, different ways of combining contexts to construct a classifier,
different methods to search for a combination of contexts, and different
criteria as to what contexts should be included in such a combination. In
spite of these differences, both RIPPER and sleeping-experts perform extremely
well across a wide variety of categorization problems, generally outperforming
previously applied learning methods. We view this result as a confirmation of
the usefulness of classifiers that represent contextual information.
- Dagan, Ido, Yael Karov, and Dan Roth. 1997. "Mistake-Driven Learning
in Text Categorization," Proceedings of the Second
Conference on Empirical Methods in NLP (EMNLP-2). Available in
Postscript format from <http://xxx.lanl.gov/abs/cmp-lg/9706006>.
[abstract] Learning problems in the text
processing domain often map the text to a space whose dimensions are the
measured features of the text, e.g., its words. Three characteristic
properties of this domain are (a) very high dimensionality, (b) both the
learned concepts and the instances reside very sparsely in the feature space,
and (c) a high variation in the number of active features in an instance. In
this work we study three mistake-driven learning algorithms for a typical task
of this nature -- text categorization. We argue that these algorithms -- which
categorize documents by learning a linear separator in the feature space --
have a few properties that make them ideal for this domain. We then show that
a quantum leap in performance is achieved when we further modify the
algorithms to better address some of the specific characteristics of the
domain. In particular, we demonstrate (1) how variation in document length can
be tolerated by either normalizing feature weights or by using negative
weights, (2) the positive effect of applying a threshold range in training,
(3) alternatives in considering feature frequency, and (4) the benefits of
discarding features while training. Overall, we present an algorithm, a
variation of Littlestone's Winnow, which performs significantly better than
any other algorithm tested on this task using a similar feature set.
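[example] The mistake-driven, linear-separator setting the abstract describes can be illustrated with the plain positive-weight Winnow rule: predict with a weighted sum against a fixed threshold and update multiplicatively only when a mistake is made. The sketch below assumes binary features given as index sets and shows the textbook algorithm, not the authors' modified variant.

    def train_winnow(examples, n_features, promotion=2.0, demotion=0.5, epochs=5):
        """Winnow: multiplicative updates made only on mistakes.
        `examples` is a list of (active_feature_indices, label) with label in {0, 1}."""
        weights = [1.0] * n_features
        theta = float(n_features)          # fixed threshold
        for _ in range(epochs):
            for active, label in examples:
                score = sum(weights[i] for i in active)
                predicted = int(score >= theta)
                if predicted != label:     # mistake-driven: touch weights only on errors
                    factor = promotion if label == 1 else demotion
                    for i in active:
                        weights[i] *= factor
        return weights, theta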
- Goldberg, J.L. 1995. "CDM: an approach to learning in text
categorization", Proceedings of the Seventh International Conference on
Tools with Artificial Intelligence, 258-265.
[abstract] The category discrimination method
(CDM) is a new learning algorithm designed for text categorization. The
motivation is that there are statistical problems associated with natural
language text when it is applied as input to existing machine learning
algorithms (too much noise, too many features, skewed distribution). The bases
of the CDM are research results about the way that humans learn categories and
concepts vis-a-vis contrasting concepts. The essential formula is cue validity
borrowed from cognitive psychology, and used to select from all possible
single word-based features the 'best' predictors of a given category. The
hypothesis that CDM's performance exceeds two non-domain specific algorithms,
Bayesian classification and decision tree learners, is empirically tested.
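[example] Cue validity, the measure the abstract borrows from cognitive psychology, is usually read as the conditional probability of a category given a word, estimated here from document frequencies; the highest-scoring words per category are kept as predictors. The sketch below is one plausible reading of that idea, not necessarily the exact CDM formulation.

    from collections import Counter

    def cue_validity(docs):
        """docs: list of (set_of_words, category). Returns a dict mapping
        (word, category) to the estimate P(category | word)."""
        word_count, word_cat_count = Counter(), Counter()
        for words, cat in docs:
            for w in words:
                word_count[w] += 1
                word_cat_count[(w, cat)] += 1
        return {wc: n / word_count[wc[0]] for wc, n in word_cat_count.items()}

    def best_predictors(docs, category, top_n=10):
        """Select the single-word features with the highest cue validity for a category."""
        cv = cue_validity(docs)
        scored = [(w, v) for (w, c), v in cv.items() if c == category]
        return sorted(scored, key=lambda x: -x[1])[:top_n]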
- Esposito, F., D. Malerba, L. Di Pace, and P. Leo. 1999. "A
Learning Intermediary for the Automated Classification of Web Pages", ICML-99
Workshop: Machine Learning in Text Data Analysis. <http://www-ai.ijs.si/DunjaMladenic/ICML99/DonatoFinal.ps>.
[abstract] The development of adaptive tools for
computer-supported collaborative work is a challenging application for machine
learning. In the paper, some problems concerning the implementation of an
adaptive intermediary performing content-based classification of web pages are
presented, and the adopted solutions are described. Information considered by
the intermediary is both the textual contents of web pages and the layout
structure defined by HTML tags. The representation language adopted for web
pages is the bag-of-words, where words are selected from training documents by
means of a novel scoring measure. Three different classification models are
empirically compared on a classification task: Decision trees, centroids, and
x-nearest-neighbor. Experimental results are reported and conclusions are
drawn on the relevance of the HTML layout structure for classification
purposes, on the significance of words selected by the scoring measure, as
well as on the performance of the different classifiers.
- Han, Eui-Hong
(Sam), and Vipin Kumar.
1999. Text Categorization Using Weight Adjusted k-Nearest Neighbor
Classification. Technical Report #99-019, Department of Computer
Science and Engineering, University of Minnesota. <ftp://ftp.cs.umn.edu/dept/users/kumar/waknn.ps>.
[abstract] Text categorization is the task of
deciding whether a document belongs to a set of prespecified classes of
documents. Automatic classification schemes can greatly facilitate the process
of categorization. Categorization of documents is challenging, as the number
of discriminating words can be very large. Many existing algorithms simply
would not work with these many number of features. k-nearest neighbor (k-NN)
classification is an instance-based learning algorithm that has shown to be
very effective for a variety of problem domains including documents. The key
element of this scheme is the availability of a similarity measure that is
capable of identifying neighbors of a particular document. A major drawback of
the similarity measure used in k-NN is that it uses all features in computing
distances. In many document data sets, only smaller number of the total
vocabulary may be useful in categorizing documents. A possible approach to
overcome this problem is to learn weights for different features (or words in
document data sets). In this paper, we propose the Weight Adjusted k-Nearest
Neighbor (WAKNN) classification algorithm that is based on the k-NN
classification paradigm. In WAKNN, the weights of features are learned using
an iterative algorithm. In the weight adjustment step, the weight of each
feature is perturbed in small steps to see if the change improves the
classification objective function. The feature with the most improvement in
the objective function is identified and the corresponding weight is updated.
The feature weights are used in the similarity measure computation such that
important features contribute more in the similarity measure. Experiments on
several real life document data sets show the promise of WAKNN, as it
outperforms the state of the art classification algorithms such as C4.5,
RIPPER, Rainbow, PEBLS, and VSM.
- Hearst, M.A., Dumais, S.T., Osman, E., Platt, J., and Scholkopf,
B. 1998. "Support vector machines", IEEE Intelligent Systems,
13(4), 18-28.
[abstract] My first exposure to Support Vector
Machines came this spring when I heard Sue Dumais present impressive results on
text categorization using this analysis technique. This issue's collection of
essays should help familiarize our readers with this interesting new racehorse
in the Machine Learning stable. Bernhard Scholkopf, in an introductory
overview, points out that a particular advantage of SVMs over other learning
algorithms is that it can be analyzed theoretically using concepts from
computational learning theory, and at the same time can achieve good
performance when applied to real problems. Examples of these real-world
applications are provided by Sue Dumais, who describes the aforementioned
text-categorization problem, yielding the best results to date on the Reuters
collection, and Edgar Osuna, who presents strong results on application to
face detection. Our fourth author, John Platt, gives us a practical guide and
a new technique for implementing the algorithm efficiently.
- Joachims,
T. 1998. "Text Categorization with Support Vector Machines: Learning with
Many Relevant Features", European Conference on Machine Learning
(ECML). <http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_98a.pdf>.
[abstract] This paper explores the use of
Support Vector Machines (SVMs) for learning text classifiers from examples. It
analyzes the particular properties of learning with text data and identifies
why SVMs are appropriate for this task. Empirical results support the
theoretical findings. SVMs achieve substantial improvements over the currently
best performing methods and behave robustly over a variety of different
learning tasks. Furthermore, they are fully automatic, eliminating the need
for manual parameter tuning.
- Jonsson,
Per. 1999. Evaluation of information loss when using WEBSOM for text
categorization. HS-IDA-EA-99-112. <http://www.ida.his.se/ida/htbin/exjobb/1999/HS-IDA-EA-99-112>.
[abstract] Today, we live in an information
technological society, where the Internet is an example of a valuable source
of textual information. These sources are rapidly growing and so are needs for
processing the information. Text classification brings the ability to present
information to the user of a system in a more comprehensible and structural
manner. In this report, the information loss when using WEBSOM - an
application that organizes documents on a two-dimensional map - is evaluated.
The hypothesis is that information is lost when high-dimensional vectors that
represent the documents are projected onto this two-dimensional map by WEBSOM.
This is empirically tested, wherein WEBSOM is implemented and criteria for
comparing distances are defined and used in the measuring, where we compare
distances on the map with distances measured directly between the vectors that
represent the documents. Results show that information is lost and we suggest
that the high-dimensional vectors should be used directly for similarity
measuring, so that the information loss is avoided.
- Joachims,
T. 1999. "Transductive Inference for Text Classification using Support
Vector Machines", Proceedings of the Sixteenth International Conference
on Machine Learning (ICML'99).
<http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99c.pdf>.
[abstract] This paper introduces Transductive
Support Vector Machines (TSVMs) for text classification. While regular Support
Vector Machines (SVMs) try to induce a general decision function for a
learning task, Transductive Support Vector Machines take into account a
particular test set and try to minimize misclassifications of just those
particular examples. The paper presents an analysis of why TSVMs are well
suited for text classification. These theoretical findings are supported by
experiments on three test collections. The experiments show substantial
improvements over inductive methods, especially for small training sets,
cutting the number of labeled training examples down to a twentieth on some
tasks. This work also proposes an algorithm for training TSVMs efficiently,
handling 10,000 examples and more.
- Mitchell, T. 1999.
"Supervised Text Learning from Unlabeled Data", ICML-99
Workshop: Machine Learning in Text Data Analysis. <http://www.cs.cmu.edu/~tom/iccs99.ps>.
[abstract] Most computational models of supervised
learning rely only on labeled training examples, and ignore the possible role
of unlabeled data. This is true for much research in machine learning,
including work on learning over text. This talk will explore the potential
role of unlabeled data in supervised learning over text. We present an
algorithm and experimental results demonstrating that unlabeled data can
significantly improve learning accuracy in problems such as learning to
classify web pages. We then identify the abstract problem structure that
enables the algorithm to successfully utilize this unlabeled data, and prove
that unlabeled data will boost learning accuracy for problems in this class.
The problem class we identify includes problems where the features describing
the examples are redundantly sufficient for classifying the example; a notion
we make precise in the paper. This problem class includes many learning
problems involving text, such as learning a semantic lexicon over noun
phrases, learning to classify web pages, and learning word sense
disambiguation rules. We conclude that current research on text learning
should consider more strongly the potential role of unlabeled data.
- Lam, S.L.Y., and Dik Lun Lee. 1999. "Feature reduction for
neural network based text categorization", Proceedings of the 6th
International Conference on Database Systems for Advanced Applications,
195-202.
[abstract] In a text categorization model using an
artificial neural network as the text classifier scalability is poor if the
neural network is trained using the raw feature space since textual data has
a very high-dimension feature space. We proposed and compared four
dimensionality reduction techniques to reduce the feature space into an input
space of much lower dimension for the neural network classifier. To test the
effectiveness of the proposed model, experiments were conducted using a subset
of the Reuters-22173 test collection for text categorization. The results
showed that the proposed model was able to achieve high categorization
effectiveness as measured by precision and recall. Among the four
dimensionality reduction techniques proposed, principal component analysis was
found to be the most effective in reducing the dimensionality of the feature
space.
- Mladenic,
Dunja. 1998. "Turning Yahoo into an Automatic Web-Page Classifier",
Proceedings of the 13th European Conference on Artificial Intelligence
ECAI'98: 473-474. <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwECAI98yr.ps.gz>.
[abstract] The paper describes an approach to
automatic Web-page classification based on the Yahoo hierarchy. Machine
learning techniques developed for learning on text data are used here on the
hierarchical classification structure. The high number of features is reduced
by taking into account the hierarchical structure and using feature subset
selection based on the method known from information retrieval. Documents are
represented as feature-vectors that include n-grams instead of including only
single words (unigrams) as commonly used when learning on text data. Based on
the hierarchical structure the problem is divided into subproblems, each
representing one of the categories included in the Yahoo hierarchy. The result
of learning is a set of independent classifiers, each used to predict the
probability that a new example is a member of the corresponding category.
Experimental evaluation on real-world data shows that the proposed approach
gives good results. For more than a half of testing examples a correct
category is among the 3 categories with the highest predicted probability.
- Mladenic,
Dunja. 1998. "Feature Subset Selection in Text-Learning", 10th
European Conference on Machine Learning (ECML98). <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwECML98.ps.gz>.
[abstract] This paper describes several known and
some new methods for feature subset selection on large text data. Experimental
comparison given on real-world data collected from Web users shows that
characteristics of the problem domain and machine learning algorithm should be
considered when feature scoring measure is selected. Our problem domain
consists of hyperlinks given in a form of small-documents represented with
word vectors. In our learning experiments naive Bayesian classifier was used
on text data. The best performance was achieved by the feature selection
methods based on the feature scoring measure called Odds ratio that is known
from information retrieval.
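[example] The Odds ratio measure referred to above is commonly written as log[P(w|pos)(1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg))], with the probabilities estimated from document frequencies in the positive and negative class. The sketch below uses that standard information-retrieval form with add-one smoothing; the exact estimates used in the experiments may differ.

    import math

    def odds_ratio(pos_docs_with_w, n_pos, neg_docs_with_w, n_neg):
        """Odds-ratio feature score with add-one smoothing, computed from the
        number of positive/negative documents containing word w and the class sizes."""
        p_pos = (pos_docs_with_w + 1) / (n_pos + 2)
        p_neg = (neg_docs_with_w + 1) / (n_neg + 2)
        return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))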
- Mladenic,
Dunja. 1998. "Feature Selection for Clasification Based on Text
Hierarchy", Working notes of Learning from Text and the Web, Conference
on Automated Learning and Discovery CONALD-98. <http://www-ai.ijs.si/DunjaMladenic/papers/PWW/pwwCONALD98.ps.gz>.
[abstract] This paper describes automatic document
categorization based on large text hierarchy. We handle the large number of
features and training examples by taking into account hierarchical structure
of examples and using feature selection for large text data. We experimentally
evaluate feature subset selection on real-world text data collected from the
existing Web hierarchy named Yahoo. In our learning experiments naive Bayesian
classifier was used on text data using feature-vector document representation
that includes n-grams instead of just single words (unigrams). Experimental
evaluation on real-world data collected from the Web shows that our approach
gives promising results and can potentially be used for document
categorization on the Web. Additionally the best result on our data is
achieved for relatively small feature subset, while for larger subset the
performance substantially drops. The best performance among six tested feature
scoring measures was achieved by the feature scoring measure called Odds ratio
that is known from information retrieval.
- Mladenic,
Dunja. 1998. Machine Learning on Non-homogeneous, Distributed Text
Data. PhD thesis, University of Ljubljana, Slovenia. <http://www.cs.cmu.edu/~TextLearning/pww/PhD.html>.
[abstract] This dissertation
proposes new machine learning methods where the corresponding learning problem
is characterized by a high number of features, unbalanced class distribution
and asymmetric misclassification costs. The input is given as a set of text
documents or their Web addresses (URLs). The induced target concept is
appropriate for the classification of new documents including shortened
documents describing individual hyperlinks. The proposed methods are based on
several new solutions.
Proposed is a new, enriched document
representation that extends the bag-of-words representation by adding word
sequences and document topic categories. Features that represent word
sequences are generated using a new efficient procedure. Features giving topic
categories are obtained from background knowledge constructed using the new
machine learning method for learning from class hierarchies. When learning
from class hierarchy, a high number of class values, examples and features,
are handled by (1) dividing a problem into subproblems based on the
hierarchical structure of class values and examples, (2) by applying feature
subset selection and (3) by pruning unpromising class values during
classification.
Several new feature scoring measures are
proposed as a result of comparison and analysis of different feature scoring
measures used in feature subset selection on text data. The new measures are
appropriate for text domains with several tens or hundreds of thousands of
features, and can handle unbalanced class distribution and asymmetric
misclassification costs.
Developed methods are suitable for
the classification of documents including shortened documents. We build
descriptions of hyperlinks, and treat these as shortened documents. Since each
hyperlink on the Web is pointing to some document, the classification of
hyperlinks (corresponding shortened documents) could be potentially improved
by using this information. We give the results of preliminary experiments for
learning in domains with mutually dependent class attributes. Training
examples are used for learning 'a next state function on the Web', where
document content (class attributes) is predicted from the hyperlink
(feature-vector) that points to the document. Document content we are
predicting is represented as a feature-vector, each feature being one of the
mutually dependent class attributes.
The proposed methods
and solutions are implemented and experimentally evaluated on real-world data
collected from the Web in three independent projects. It is shown that
document classification, categorization and prediction using the proposed
methods perform well on large, real-world domains.
The
experimental findings further indicate that the developed methods can
efficiently be used to support the analysis of large amounts of text data,
automatic document categorization and abstraction, document content prediction
based on the hyperlink content, classification of shortened documents,
development of user customized text-based systems, and user customized Web
browsing. As such, the proposed machine learning methods contribute to machine
learning and to related fields of text-learning, data mining, intelligent data
analysis, information retrieval, intelligent user interfaces, and intelligent
agents. Within machine learning, this thesis contributes an approach to
learning on large, distributed text data, learning on hypertext, and learning
from class hierarchies. Within computer science, it contributes to better
design of Web browsers and software assistants for people using the Web.
- Mooney, Raymond J., and Loriene Roy. 1999. "Content-Based Book
Recommending Using Learning for Text Categorization", Submission to
Fourth ACM Conference on Digital Libraries. <http://xxx.lanl.gov/abs/cs.DL/9902011>.
[abstract] Recommender systems improve access to
relevant products and information by making personalized suggestions based on
previous examples of a user's likes and dislikes. Most existing recommender
systems use social filtering methods that base recommendations on other users'
preferences. By contrast, content-based methods use information about an item
itself to make suggestions. This approach has the advantage of being able to
recommend previously unrated items to users with unique interests and to
provide explanations for its recommendations. We describe a content-based book
recommending system that utilizes information extraction and a
machine-learning algorithm for text categorization. Initial experimental
results demonstrate that this approach can produce accurate recommendations.
- Joachims,
T. 1997. Text Categorization with Support Vector Machines: Learning
with Many Relevant Features. LS8-Report, 23, Universitat Dortmund, LS
VIII-Report. <http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_97b.pdf>.
[abstract] This paper explores the use of
Support Vector Machines (SVMs) for learning text classifiers from examples. It
analyzes the particular properties of learning with text data and identifies
why SVMs are appropriate for this task. Empirical results support the
theoretical findings. SVMs achieve substantial improvements over the currently
best performing methods and they behave robustly over a variety of different
learning tasks. Furthermore, they are fully automatic, eliminating the need
for manual parameter tuning.
Topic Characterization
- Fisher, David E. 1994. Topic Characterization of Full Length
Texts Using Direct and Indirect Term Evidence. Technical report
CSD-94-809, Computer Science Department, University of California, Berkeley. Available in
Postscript format from Internet: <http://sunsite.berkeley.edu/Dienst/Repository/2.0/Body/ncstrl.ucb/CSD-94-809/postscript>.
[abstract] This project evaluates two families of algorithms that
can be used to automatically classify general texts within a set of conceptual
categories. The first family uses indirect evidence in the form of
term-category co-occurrence data. The second uses direct evidence based on the
senses of the terms, where a term's senses are designated by the categories
that it is a member of in a thesaurus. The direct evidence algorithms
incorporate varying degrees of indirect evidence as well. For these
experiments a set of 3864 conceptual categories were derived from the noun
hierarchy of WordNet, an on-line thesaurus. The co-occurrence data for the
associational and disambiguation algorithms was collected from a corpus of
3,711 AP newswire articles, comprising approximately 1.7 million words of
text. Each of the algorithms was applied to all of the articles in the AP
corpus, with their performance evaluated both qualitatively and
quantitatively. The results of these experiments show that both classes of
algorithms have potential as fully automatic text classifiers. The direct
methods produce qualitatively better classifications than the indirect ones
when applied to AP newswire texts. The direct methods also achieve both a
higher precision, 86.75% correctly classified (best case) versus 72.34%, and a
higher approximate recall. The experiments identify limiting factors on the
performance of the algorithms. The primary limitations stem from the quality
of the thesaural categories, which were derived automatically, and from the
performance of the term sense disambiguation algorithm. The former can be
addressed with human intervention, the latter with a larger training set for
the statistical database.
Hierarchical Text Categorization
- D'Alessio, Stephen, Aaron Kershenbaum, Keitha Murray, and Robert
Schiaffino. 1998. Hierarchical Text Categorization. <http://pride-i2.poly.edu/textmining/>.
[abstract] The problem of assigning documents to
categories in a hierarchically organized taxonomy is considered. Given a
training corpus of documents already placed in one or more categories,
vocabulary is extracted. The vocabulary, words that appear with high frequency
within a given category, characterizes each subject area by being associated
with nodes in the hierarchy. Each node's vocabulary is filtered and its words
assigned weights with respect to the specific category. Then, test documents
are scanned for the presence of this vocabulary and categories are ranked with
respect to the document based on the presence of terms from this vocabulary.
Finally, documents are assigned to categories based on these rankings.
Precision and recall are measured.
We present an algorithm for associating
words with individual categories within the hierarchy and demonstrate that
precision and recall can be significantly improved by solving the
categorization problem taking the hierarchy into account. We also show that
these results can be improved even further by intelligently selecting
intermediate categories in the hierarchy. Solving the problem iteratively,
moving downward from the root of the taxonomy to the leaf nodes, we improve
precision from 82% to 89% and recall from 82% to 87% on the much-studied
Reuters-21578 corpus with 135 categories organized in a three-level hierarchy
of categories.
- Iwayama, Makoto, and Takenobu Tokunaga. 1995. "Hierarchical
Bayesian Clustering for Automatic Text Classification," Proceedings of
IJCAI '95. Also technical report 95TR-0015, Department of Computer
Science, Tokyo Institute of Technology. <ftp://ftp.cs.titech.ac.jp/pub/TR/95/95TR0015.ps.gz>.
[abstract] Text classification, the
grouping of texts into several clusters, has been used as a means of improving
both the efficiency and the effectiveness of text retrieval/categorization. In
this paper we propose a hierarchical clustering algorithm that constructs a
set of clusters having the maximum Bayesian posterior probability, the
probability that the given texts are classified into clusters. We call the
algorithm Hierarchical Bayesian Clustering (HBC). The advantages of HBC are
experimentally verified from several viewpoints. (1) HBC can re-construct the
original clusters more accurately than do other non probabilistic algorithms.
(2) When a probabilistic text categorization is extended to a cluster-based
one, the use of HBC offers better performance than does the use of non
probabilistic algorithms.
[contents] 1. Introduction; 2.
Hierarchical Bayesian Clustering; 3. Comparison with Two Other Clustering
Algorithms; 4. Evaluation in Text Categorization; 5. Discussion
- Iwayama, Makoto, and Takenobu Tokunaga. 1995a. "Cluster-based Text
Categorization: A Comparison of Category Search Strategies," Proceedings
of the 18th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval: 273-280. Also technical report
95TR-0016, Department of Computer Science, Tokyo Institute of Technology.
[online] <ftp://ftp.cs.titech.ac.jp/pub/TR/95/95TR0016.ps.gz>.
[abstract] Text categorization can be
viewed as a process of category search, in which one or more categories for a
test document are searched for by using given training documents with known
categories. In this paper a cluster-based search with a probabilistic
clustering algorithm is proposed and evaluated on two data sets. The
efficiency, effectiveness, and noise tolerance of this search were confirmed
to be better than those of a full search, a category-based search, and a
cluster-based search with nonprobabilistic clustering.
[contents]
1. Introduction; 2. Category Search Strategies; 3. Hierarchical Bayesian
Clustering; 4. Experiments; 5. Results and Discussions; 6. Summary
- Kita, Kenji, Minoru Sasaki, Tai Xiao Ying. 1998. "Rule-Based
Hierarchical Document Categorization for the World Wide Web", Asia
Pacific Web Conference (APWeb98): 269-273. <http://www-a2k.is.tokushima-u.ac.jp/member/kita/EPrint/Kita/008.doc>.
[abstract] Document categorization, which
is defined as the classification of text documents into one of several fixed
classes or categories, has become important with the explosive growth of the
World Wide Web. The goal of the work described here is to automatically
categorize Web documents in order to enable effective retrieval of Web
information. Based on the rule learning algorithm RIPPER (Repeated Incremental
Pruning to Produce Error Reduction), we propose an efficient method for
hierarchical document categorization.
[contents] 1 Introduction; 2
Rule-Based Text Categorization; 3 Introducing Hierarchical Document
Categories; 4 Experimental Results; 5 Conclusions
- Koller, D.,
and Mehran Sahami. 1997. "Hierarchically Classifying Documents Using Very Few
Words", Proceedings of the 14th International Conference on
Machine Learning (ML), 170-178. <http://robotics.stanford.edu/~koller/papers/ml97.html>.
[abstract] The proliferation of topic
hierarchies for text documents has resulted in a need for tools that
automatically classify new documents within such hierarchies. Existing
classification schemes which ignore the hierarchical structure and treat the
topics as separate classes are often inadequate in text classification where
there is a large number of classes and a huge number of relevant features
needed to distinguish between them. We propose an approach that utilizes the
hierarchical topic structure to decompose the classification task into a set
of simpler problems, one at each node in the classification tree. As we show,
each of these smaller problems can be solved accurately by focusing only on a
very small set of features, those relevant to the task at hand. This set of
relevant features varies widely throughout the hierarchy, so that, while the
overall relevant feature set may be large, each classifier only examines a
small subset. The use of reduced feature sets allows us to utilize more
complex (probabilistic) models, without encountering many of the standard
computational and robustness difficulties.
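[example] The divide-and-conquer scheme outlined above (a small classifier at each internal node of the topic tree, each looking only at its own reduced feature set) can be sketched generically as a routing loop. The toy keyword classifiers below are placeholders; this is an illustration, not the Koller-Sahami system.

    def classify_hierarchically(doc_words, node):
        """Walk down the topic tree: at each internal node a local classifier,
        trained on that node's own small feature set, picks the child branch."""
        while node.get("children"):
            branch = node["classify"](doc_words)
            node = node["children"][branch]
        return node["topic"]

    # Illustrative two-level hierarchy with toy keyword-based local classifiers.
    tree = {
        "classify": lambda ws: "metals" if {"gold", "copper"} & ws else "grains",
        "children": {
            "metals": {"classify": lambda ws: "gold" if "gold" in ws else "copper",
                       "children": {"gold": {"topic": "gold"},
                                    "copper": {"topic": "copper"}}},
            "grains": {"topic": "grain"},
        },
    }
    print(classify_hierarchically({"gold", "mine"}, tree))   # -> gold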
- Lin, Shian-Hua, Chi-Sheng Shih, Meng Chang Chen, Jan-Ming Ho,
Ming-Tat Ko, and Yueh-Ming Huang. 1998. "Extracting Classification Knowledge
of Internet Documents with Mining Term Associations: a Semantic Approach",
Proceedings of the 21st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: 241-249.
[abstract] In this paper, we present a
system that extracts and generalizes terms from Internet documents to
represent classification knowledge of a given class hierarchy. We propose a
measurement to evaluate the importance of a term with respect to a class in
the class hierarchy, and denote it as support. With a given threshold, terms
with high supports are sifted as keywords of a class, and terms with low
supports are filtered out. To further enhance the recall of this approach,
Mining Association Rules technique is applied to mine the association between
terms. An inference model is composed of these association relations and the
previously computed supports of the terms in the class. To increase the recall
rate of the keyword selection process. we then present a polynomial-time
inference algorithm to promote a term, strongly associated to a known keyword,
to a keyword. According to our experiment results on the collected Internet
documents from the Yam search engine, we show that the proposed methods in the
paper contribute to refine the classification knowledge and increase the
recall of keyword selection.
- McCallum, Andrew,
Ronald Rosenfeld, Tom Mitchell, and Andrew Ng. 1998. "Improving Text
Classification by Shrinkage in a Hierarchy of Classes", ICML-98.
<http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz>.
- Sasaki, M., and Kita, K. 1998. "Rule-based text categorization
using hierarchical categories", 1998 IEEE International Conference on
Systems, Man, and Cybernetics, 2827 - 2830.
[abstract] Document categorization, which
is defined as the classification of text documents into one of several fixed
classes or categories, has become important with the explosive growth of the
World Wide Web. The goal of the work described here is to automatically
categorize Web documents in order to enable effective retrieval of Web
information. In this paper, based on the rule learning algorithm RIPPER (for
Repeated Incremental Pruning to Produce Error Reduction), we propose an
efficient method for hierarchical document categorization.
- Tanaka, Hideki. 1997. "An Efficient Document Clustering Algorithm
Based on the Topic Binder Hypothesis", Proceedings of the Natural
Language Processing Pacific Rim Symposium 1997 (NLPRS'97). <ftp://www.links.nectec.or.th/pub/NLPRS/paper/dana7r.ps.gz>.
[abstract] We present an efficient
document clustering algorithm that uses a term frequency vector for each
document instead of using a huge proximity matrix. The algorithm has the
following features:
1) it consumes relatively little memory space and runs
fast,
2) it produces a hierarchy in the form of a document classification
tree, and
3) the hierarchy obtained by the algorithm explicitly reveals the
collection structure.
We confirm these features and thus show the
algorithm's feasibility with clustering experiments in which we use two
collections of Japanese documents, the sizes of which are 83,099 and 14,701.
We also introduce two applications of this algorithm.
- Weigend, A.S.,
Wiener, E.D., and Pedersen, Jan O. 1999. "Exploiting Hierarchy in Text
Categorization", Information Retrieval, 1(3): 193-216. <http://www.stern.nyu.edu/~aweigend/Research/Papers/TextCategorization/hierarchy.ps>.
[abstract] With the recent dramatic
increase in electronic access to documents, text categorization--the task of
assigning topic(s) to a given document--has moved to the center of the
information sciences and knowledge management. Unfortunately, empirical
comparisons in the last few years have indicated that it is very hard to make
significant progress by focusing on the input space, e.g., by varying the
representation of the document. However, structure in the output space has
largely been neglected: Topics can be grouped together well according to their
meaning (e.g., gold, silver, copper are all metals). This paper presents a
hierarchical learning system that matches the hierarchical structure of the
topics. The key idea of this divide-and-conquer approach is to mimic the
semantic relations of knowledge, as opposed to trying to build one flat
inference model. We propose an architecture that is designed to accommodate
multiple topic assignments for each document. It has a clean probabilistic
interpretation that allows the topic prediction to be combined in a principled
way with information from other sources in decision support systems.
Evaluating the performance of a two-level implementation on the Reuters-22173
testbed of newswire articles shows significant improvement particularly for
rare classes. Since the first level of the architecture predicts the
probabilities of the meta-topic groups, the individual models for each topic
of the second level can focus on finer discriminations within the group. The
different parts of the hierarchical architecture can be trained independently;
this solves the otherwise very serious problem of scaling up to very large and
heterogeneous text collections such as the Web.
Text Categorization Using Knowledge Bases
- Buenaga, M., Gomez Hidalgo,
J., Diaz Agudo, B. 1997. "Using WordNet to Complement Training Information
in Text Categorization", 2nd International Conference on Recent Advances
in Natural Language Processing (RANLP), Tzigov Chark, Bulgaria,
September, 11th to 13th, 1997. <http://xxx.unizar.es/ps/cmp-lg/9709007>.
[abstract] Automatic Text Categorization (TC) is a complex and
useful task for many natural language applications, and is usually performed
through the use of a set of manually classified documents, a training
collection. We suggest the utilization of additional resources like lexical
databases to increase the amount of information that TC systems make use of,
and thus, to improve their performance. Our approach integrates WordNet
information with two training approaches through the Vector Space Model. The
training approaches we test are the Rocchio (relevance feedback) and the
Widrow-Hoff (machine learning) algorithms. Results obtained from evaluation
show that the integration of WordNet clearly outperforms training approaches,
and that an integrated technique can effectively address the classification of
low frequency categories.
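[example] The Rocchio (relevance feedback) training the abstract pairs with WordNet builds one prototype vector per category: a weighted centroid of the positive training documents minus a down-weighted centroid of the negatives, with new documents assigned to the category whose prototype is most cosine-similar. The sketch below is the textbook form under assumed parameter names and omits the WordNet expansion step described in the paper.

    import numpy as np

    def rocchio_prototypes(X, y, beta=16.0, gamma=4.0):
        """One prototype per category: beta * centroid(positives)
        minus gamma * centroid(negatives), clipped at zero."""
        prototypes = {}
        for c in np.unique(y):
            pos, neg = X[y == c], X[y != c]
            proto = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
            prototypes[c] = np.clip(proto, 0.0, None)
        return prototypes

    def rocchio_classify(x, prototypes):
        """Assign the category whose prototype has the highest cosine similarity."""
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(prototypes, key=lambda c: cos(x, prototypes[c]))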
- Diaz, A., M. Buenaga, L. Urena, and M. Garcia-Vega. 1998.
"Integrating Linguistic Resources in an Uniform Way for Text Classification
Tasks", 1st International Conference on Language Resources and
Evaluation, Granada, Spain, May 1998. <http://www.esi.uem.es/laboratorios/sinai/postscripts/lrec98.ps>.
[abstract] Applications based on automatic text classification
tasks, like text categorization (TC), word sense disambiguation (WSD), text
filtering or routing, monolingual or multilingual information retrieval, and
text summarization could obtain serious improvements by integrating linguistic
resources in the current methods. We present an approach using the Vector
Space Model to integrate two different kinds of resources: a lexical database
and training collections, in text content analysis tasks. The training
approaches we test are the Rocchio (relevance feedback) and the Widrow-Hoff
(machine learning) algorithms and WordNet as the lexical database. We have
developed experimental systems for TC and WSD. Results obtained from
evaluation show that the integration of WordNet can outperform approaches
based only on training.
- Fukumoto, Fumiyo, and Yoshimi Suzuki. 1996. "An Automatic
Clustering of Articles Using Dictionary Definitions," Proceedings of the
16th International Conference on Computational Linguistics: 406-411.
- Gomez
Hidalgo, Jose Maria, and Manuel de Buenaga Rodriguez. 1997. "Integrating a
Lexical Database and a Training Collection for Text Categorization,"
Proceedings of the ACL/EACL Workshop on Automatic Extraction and
Building of Lexical Semantic Resources for Natural Language
Applications. Available in Postscript format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9709004>.
[abstract] Automatic text categorization is a complex and useful
task for many natural language processing applications. Recent approaches to
text categorization focus more on algorithms than on resources involved in
this operation. In contrast to this trend, we present an approach based on the
integration of widely available resources as lexical databases and training
collections to overcome current limitations of the task. Our approach makes
use of WordNet synonymy information to increase evidence for poorly trained
categories. When testing a direct categorization, a WordNet based one, a
training algorithm, and our integrated approach, the latter exhibits a better
performance than any of the others. Incidentally, the WordNet-based approach's
performance is comparable with that of the training approach.
- Liddy, Elizabeth D. 1994. "Text Categorization for Multiple Users
based on Semantic Features from a Machine-Readable Dictionary," ACM
Transactions on Information Systems 12(3): 278-295.
[abstract] The text categorization module described here provides a
front-end filtering function for the larger DR-LINK text retrieval system
[Liddy and Myaeng 1993]. The model evaluates a large incoming stream of
documents to determine which documents are sufficiently similar to a profile
at the broad subject level to warrant more refined representation and
matching. To accomplish this task, each substantive word in a text is first
categorized using a feature set based on the semantic Subject Field Codes
(SFCs) assigned to individual word senses in a machine-readable dictionary.
When tested on 50 user profiles and 550 megabytes of documents, results
indicate that the feature set that is the basis of the text categorization
module and the algorithm that establishes the boundary of categories of
potentially relevant documents accomplish their tasks with a high level of
performance. This means that the category of potentially relevant documents
for most profiles would contain at least 80% of all documents later determined
to be relevant to the profile. The number of documents in this set would be
uniquely determined by the system's category-boundary predictor, and this set
is likely to contain less than 5% of the incoming stream of documents.
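[note] The Subject Field Code representation can be illustrated as follows;
SFC_LEXICON is a hypothetical stand-in for the machine-readable dictionary's
sense-to-code mapping, and the codes shown are illustrative only:

    import math
    from collections import Counter

    SFC_LEXICON = {              # hypothetical word -> Subject Field Codes
        'bank':  ['ECONOMICS', 'GEOGRAPHY'],
        'loan':  ['ECONOMICS'],
        'river': ['GEOGRAPHY'],
    }

    def sfc_vector(words):
        # Represent a text by the frequencies of its words' Subject Field Codes.
        counts = Counter()
        for word in words:
            for code in SFC_LEXICON.get(word, []):
                counts[code] += 1
        return counts

    def cosine(a, b):
        dot = sum(v * b.get(k, 0) for k, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

Documents whose SFC vector is sufficiently close to a profile's SFC vector
would then be passed on for the more refined representation and matching
described in the abstract.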
- Riloff, E. 1993.
"Automatically Constructing a Dictionary for Information Extraction Tasks,"
Proceedings of the Eleventh National Conference on Artificial
Intelligence: 811-816. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai93.ps>.
- Riloff, E. 1995.
"Dictionary Requirements for Text Classification: A Comparison of Three
Domains," Working Notes of the AAAI Spring Symposium on Representation
and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and
Generativity: 123-128. Also available from <http://www.cs.utah.edu/~riloff/psfiles/aaai-spr-symp95.ps>.
- Riloff, E. 1996. "An
Empirical Study of Automated Dictionary Construction for Information
Extraction in Three Domains," AI Journal 85(August 1996): ?-?.
Also available from <http://www.cs.utah.edu/~riloff/psfiles/aij.ps>.
- Rodriguez, Manuel de Buenaga, Jose Maria Gomez Hidalgo, and Belen
Diaz Agudo. 1997. "Using WordNet to Complement Training Information in Text
Categorization," Proceedings of the Second International Conference on
Recent Advances in Natural Language Processing. Available in Postscript
format from Internet: <http://xxx.lanl.gov/abs/cmp-lg/9709007>.
[abstract] Automatic Text Categorization (TC) is a complex and
useful task for many natural language applications, and is usually performed
through the use of a set of manually classified documents, a training
collection. We suggest the utilization of additional resources like lexical
databases to increase the amount of information that TC systems make use of,
and thus, to improve their performance. Our approach integrates WordNet
information with two training approaches through the Vector Space Model. The
training approaches we test are the Rocchio (relevance feedback) and the
Widrow-Hoff (machine learning) algorithms. Results obtained from evaluation
show that the integration of WordNet clearly outperforms training approaches,
and that an integrated technique can effectively address the classification of
low frequency categories.
- Scott, S., and
Matwin, S. 1999. "Feature Engineering for Text Classification",
Proceedings of the Sixteenth International Conference
on Machine Learning (ICML-99). <http://wabakimi.carleton.ca/~sscott2/sam/ICML99_Camera.zip>.
[abstract] Most research in text classification to
date has used a "bag of words" representation in which each feature
corresponds to a single word. This paper examines some alternative ways to
represent text based on syntactic and semantic relationships between words
(phrases, synonyms and hypernyms). We describe the new representations and try
to justify our hypothesis that they could improve the performance of a
rule-based learner. The representations are evaluated using the RIPPER
learning algorithm on the Reuters-21578 and DigiTrad test corpora. On their
own the new representations are not found to produce significant performance
improvements. We also try combining classifiers based on different
representations using a majority voting technique, and this improves
performance on both test collections. In our opinion, more sophisticated
Natural Language Processing techniques need to be developed before better text
representations can be produced for classification.
- Scott, Sam, and
Stan Matwin. 1998. "Text Classification Using WordNet Hypernyms," Proceedings
of the COLING-ACL 98 Workshop on the Usage of WordNet in Natural Language
Processing Systems. <http://www.ai.sri.com/~harabagi/coling-acl98/acl_work/sscot.ps.gz>.
[abstract] This paper describes experiments in Machine Learning for
text classification using a new representation of text based on WordNet
hypernyms. Six binary classification tasks of varying difficulty are defined,
and the Ripper system is used to produce discrimination rules for each task
using the new hypernym density representation. Rules are also produced with
the commonly used bag-of-words representation, incorporating no knowledge from
WordNet. Experiments show that for some of the more difficult tasks the
hypernym density representation leads to significantly more accurate and more
comprehensible rules.
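[note] A rough sketch of a hypernym-density-style representation, using
NLTK's WordNet interface; the paper's exact density definition, windowing,
and height cut-off may differ:

    from collections import Counter
    from nltk.corpus import wordnet as wn

    def hypernym_density(tokens):
        # For every synset on a hypernym path of any noun sense of any token,
        # count how often it is activated and normalise by document length.
        counts = Counter()
        for token in tokens:
            for synset in wn.synsets(token, pos=wn.NOUN):
                for path in synset.hypernym_paths():
                    for ancestor in path:
                        counts[ancestor.name()] += 1
        length = max(len(tokens), 1)
        return {name: count / length for name, count in counts.items()}

Each hypernym then becomes a numeric feature whose value is its density, and
a rule learner such as Ripper can form conditions over those densities.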
- Yamazaki, Takefumi, and Ido Dagan. 1997. "Mistake-driven Learning
with Thesaurus for Text Categorization", Proceedings of the Natural
Language Processing Pacific Rim Symposium 1997 (NLPRS'97). <ftp://www.links.nectec.or.th/pub/NLPRS/paper/dana4r.ps.gz>.
[abstract] This paper extends the
mistake-driven learner WINNOW to better utilize thesauri for text
categorization.
In our method not only words but also semantic categories
given by the thesaurus are used as features in a classifier. New filtering and
disambiguation methods are used as pre-processing to solve the problems caused
by the use of the thesaurus. To verify our methods, we test them on a large
body of tagged Japanese newspaper articles. Experimental results show that
WINNOW with thesauri attains high accuracy and that the proposed filtering and
disambiguation methods also contribute to the improved accuracy.
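[note] WINNOW itself is a standard mistake-driven, multiplicative-update
learner; the sketch below shows the basic version over binary features (words
or thesaurus categories). The promotion factor and threshold are generic
defaults, not the paper's settings:

    def winnow_train(examples, n_features, alpha=2.0, theta=None):
        # Weights start at 1; predict positive when the summed weight of the
        # active features exceeds theta; on a mistake, multiplicatively
        # promote (false negative) or demote (false positive) those weights.
        theta = n_features / 2.0 if theta is None else theta
        w = [1.0] * n_features
        for active, label in examples:   # active: indices of present features
            predicted = sum(w[i] for i in active) > theta
            if predicted != label:
                factor = alpha if label else 1.0 / alpha
                for i in active:
                    w[i] *= factor
        return w

Adding thesaurus categories as extra features (after the filtering and
disambiguation steps described in the abstract) simply enlarges the feature
index space seen by this update.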
Evaluation of Text Categorization
- Lewis, David
Dolan. 1991. Representation and Learning in Information
Retrieval. Technical report CSD-94-809, UM-CS-1991-093, Computer
Science Department, University of Massachusetts, Amherst. Available in
Postscript format from Internet: <ftp://ftp.cs.umass.edu/pub/techrept/techreport/1991/UM-CS-1991-093.ps>.
[abstract] This dissertation introduces a new
theoretical model for text classification systems, including systems for
document retrieval, automated indexing, electronic mail filtering, and similar
tasks. The Concept Learning model emphasizes the role of manual and automated
feature selection and classifier formation in text classification. It enables
drawing on results from statistics and machine learning in explaining the
effectiveness of alternate representations of text, and specifies desirable
characteristics of text representations. The use of syntactic parsing to
produce indexing phrases has been widely investigated as a possible route to
better text representations. Experiments with syntactic phrase indexing,
however, have never yielded significant improvements in text retrieval
performance. The Concept Learning model suggests that the poor statistical
characteristics of a syntactic indexing phrase representation negate its
desirable semantic characteristics. The application of term clustering to this
representation to improve its statistical properties while retaining its
desirable meaning properties is proposed. Standard term clustering strategies
from information retrieval (IR), based on cooccurrence of indexing terms in
documents or groups of documents, were tested on a syntactic indexing phrase
representation. In experiments using a standard text retrieval test
collection, small effectiveness improvements were obtained. As a means of
evaluating representation quality, a text retrieval test collection introduces
a number of confounding factors. In contrast, the text categorization task
allows much cleaner determination of text representation properties. In
preparation for the use of text categorization to study text representation, a
more effective and theoretically well-founded probabilistic text categorization
algorithm was developed, building on work by Maron, Fuhr, and others. Text
categorization experiments supported a number of predictions of the Concept
Learning model about properties of phrasal representations, including
dimensionality properties not previously measured for text representations.
However, in carefully controlled experiments using syntactic phrases produced
by Church's stochastic bracketer, in conjunction with reciprocal nearest
neighbor clustering, term clustering was found to produce essentially no
improvement in the properties of the phrasal representation. New cluster
analysis approaches are proposed to remedy the problems found in traditional
term clustering methods.
- Lewis, David
Dolan. 1990. "Representation Quality in Text Classification: An
Introduction and Experiment," Proceedings of the Third Annual Symposium
on Document Analysis and Information Retrieval: 81-93. <http://www.research.att.com/~lewis/papers/lewis90g.ps>.
[abstract] The way in which text is represented
has a strong impact on the performance of text classification (retrieval and
categorization) systems. We discuss the operation of text classification
systems, introduce a theoretical model of how text representation impacts
their performance, and describe how the performance of text classification
systems is evaluated. We then present the results of an experiment on
improving text representation quality, as well as an analysis of the results
and the directions they suggest for future research.
- Lewis, David
Dolan. 1991. "Evaluating Text Categorization," Proceedings of Speech
and Natural Language Workshop: 312-318. Morgan Kaufmann. <http://www.research.att.com/~lewis/papers/lewis91c.ps>.
[abstract] While certain standard procedures are
widely used for evaluating text retrieval systems and algorithms, the same is
not true for text categorization. Omission of important data from reports is
common and methods of measuring effectiveness vary widely. This has made
judging the relative merits of techniques for text categorization difficult
and has disguised important research issues. In this paper I discuss a variety
of ways of evaluating the effectiveness of text categorization systems,
drawing both on reported categorization experiments and on methods used in
evaluating query-driven retrieval. I also consider the extent to which the
same evaluation methods may be used with systems for text extraction, a more
complex task. In evaluating either kind of system, the purpose for which the
output is to be used is crucial in choosing appropriate evaluation methods.
- Lewis, David
Dolan. 1995. "Evaluating and Optimizing Autonomous Text Classification
Systems," Proceedings of the 18th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval:
246-254. <http://www.research.att.com/~lewis/papers/lewis95b.ps>.
- Yang, Yiming. 1997.
An Evaluation of Statistical Approaches to Text Categorization. Computer
science technical report CMU-CS-97-127, School of Computer Science, Carnegie
Mellon University. <http://reports-archive.adm.cs.cmu.edu/anon/1997/abstracts/97-127.html>.
[abstract] This paper is a comparative study of
text categorization methods. Fourteen methods are investigated, based on
previously published results and newly obtained results from additional
experiments. Corpus biases in commonly used document collections are examined
using the performance of three classifiers. Problems in previously published
experiments are analyzed, and the results of flawed experiments are excluded
from the cross-method evaluation. As a result, eleven of the fourteen
methods remained. A k-nearest neighbor (kNN) classifier was chosen as the
performance baseline on several collections; on each collection, the
performance scores of other methods were normalized using the score of kNN.
This provides a common basis for a global observation on methods whose results
are only available on individual collections. Widrow-Hoff, k-nearest neighbor,
neural networks and the Linear Least Squares Fit mapping are the
top-performing classifiers, while the Rocchio approaches had relatively poor
results compared to the other learning methods. kNN is the only learning
method that has scaled to the full domain of MEDLINE categories, showing a
graceful behavior when the target space grows from the level of one hundred
categories to a level of tens of thousands.
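[note] The kNN-based normalization amounts to dividing each method's score on
a collection by the kNN score on that same collection, so that results
reported on different collections become loosely comparable. A trivial sketch
(the numbers in the comment are illustrative, not from the report):

    def normalize_by_baseline(scores, baseline='kNN'):
        # scores: mapping from method name to its performance on one collection.
        base = scores[baseline]
        return {method: value / base for method, value in scores.items()}

    # normalize_by_baseline({'kNN': 0.85, 'NB': 0.72})
    #   -> {'kNN': 1.0, 'NB': 0.847...}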
- Yang, Yiming. 1999.
"An Evaluation of Statistical Approaches to Text Categorization,"
Information Retrieval Journal, 1999 (to appear). <http://www.cs.cmu.edu/~yiming/papers.yy/irj99.ps>.
[abstract] This paper focuses on a comparative
evaluation of a wide range of text categorization methods, including
previously published results on the Reuters corpus and new results of
additional experiments. A controlled study using three classifiers, kNN, LLSF
and WORD, was conducted to examine the impact of configuration variations in
five versions of Reuters on the observed performance of classifiers. Analysis
and empirical evidence suggest that the evaluation results on some versions of
Reuters were significantly affected by the inclusion of a large portion of
unlabelled documents, making those results difficult to interpret and leading
to considerable confusion in the literature. Using the results evaluated on
the other versions of Reuters, which exclude the unlabelled documents, the
performance of twelve methods is compared directly or indirectly. For
indirect comparisons, kNN, LLSF and WORD were used as baselines, since they
were evaluated on all versions of Reuters that exclude the unlabelled
documents. As a global observation, kNN, LLSF and a neural network method had
the best performance; except for a Naive Bayes approach, the other learning
algorithms also performed relatively well.
- Yang, Yiming, and Xin
Liu. 1999. "A Re-examination of Text Categorization Methods," Proceedings
of the 22nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval: 42-49. <http://www.cs.cmu.edu/~yiming/papers.yy/sigir99.ps>.
[abstract] This paper reports a controlled study
with statistical significance tests on five text categorization methods: the
Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural
network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a
Naive Bayes (NB) classifier. We focus on the robustness of these methods in
dealing with a skewed category distribution, and their performance as a
function of the training-set category frequency. Our results show that SVM,
kNN and LLSF significantly outperform NNet and NB when the number of positive
training instances per category is small (less than ten), and that all the methods
perform comparably when the categories are sufficiently common (over 300
instances).
- Yang, Yiming, and
J.P. Pedersen. 1997. "A Comparative Study on Feature Selection in Text
Categorization," Proceedings of the Fourteenth International Conference
on Machine Learning (ICML'97), 1997.
[abstract] This paper is a comparative study of
feature selection methods in statistical learning of text categorization. The
focus is on aggressive dimensionality reduction. Five methods were evaluated,
including term selection based on document frequency (DF), information gain
(IG), mutual information (MI), a chi-square test (CHI), and term strength
(TS). We found IG and CHI most effective in our experiments. Using IG
thresholding with a k-nearest neighbor classifier on the Reuters corpus,
removal of up to 98% of the unique terms actually yielded an improved
classification accuracy (measured by average precision). DF thresholding
performed similarly. Indeed we found strong correlations between the DF, IG
and CHI values of a term. This suggests that DF thresholding, the simplest
method with the lowest cost in computation, can be reliably used instead of IG
or CHI when the computation of these measures is too expensive. TS compares
favorably with the other methods with up to 50% vocabulary reduction but is
not competitive at higher vocabulary reduction levels. In contrast, MI had
relatively poor performance due to its bias towards favoring rare terms, and
its sensitivity to probability estimation errors.
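[note] Two of the selection criteria above can be sketched directly from a
2x2 term/category contingency table; the chi-square form below is the usual
one for this setting:

    def chi_square(A, B, C, D):
        # A: docs in category c containing term t;  B: docs outside c with t;
        # C: docs in c without t;                    D: docs outside c without t.
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    def df_threshold(doc_freq, threshold):
        # DF thresholding: keep terms occurring in at least `threshold` documents.
        return {term for term, df in doc_freq.items() if df >= threshold}

Terms are then ranked by the chosen criterion (maximised or averaged over
categories) and the vocabulary is cut at the desired level of aggressiveness.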
Term Specificity
- Wong, S.K.M., and Y.Y. Yao. 1992. "An Information-Theoretic
Measure of Term Specificity," Journal of the American Society for
Information Science 43(1): 54-61.
[abstract] The inverse
document frequency (IDF) and signal-noise ratio (S/N) approaches are two well
known term weighting schemes based on term specificity. However, the existing
justifications for these methods are still somewhat inconclusive and sometimes
even based on incompatible assumptions. Although both methods are related to
term specificity, their relationship has not been thoroughly investigated. An
information-theoretic measure for term specificity is introduced in this
study. It is explicitly shown that the IDF weighting scheme can be derived
from the proposed approach by assuming that the frequency of occurrence of
each index term is uniform within the set of documents containing the term.
The information-theoretic interpretation of term specificity also establishes
the relationship between the IDF and S/N methods.
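[note] The IDF scheme discussed above is, in its usual form, idf(t) =
log(N / n_t) for a term t occurring in n_t of N documents; the log base only
changes a constant factor. A one-line sketch:

    import math

    def idf(num_docs, doc_freq):
        # High weight for terms confined to few documents (high specificity).
        return math.log(num_docs / doc_freq)

The paper's contribution is to show that this form follows from an
information-theoretic specificity measure under the assumption that a term's
frequency is uniform across the documents containing it.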
Term Discrimination Value
- El-Hamdouchi, A., and P. Willett. 1988. "An Improved Algorithm for
the Calculation of Exact Term Discrimination Values," Information
Processing & Management 24(1): 17-22.
[abstract] The
term discrimination model provides a means of evaluating indexing terms in
automatic document retrieval systems. This article describes an efficient
algorithm for the calculation of term discrimination values that may be used
when the interdocument similarity measure used is the cosine coefficient and
when the document representatives have been weighted using one particular
term-weighting scheme. The algorithm has an expected running time proportional
to Nn^2 for a collection of N documents, each of which has been assigned an
average of n terms.
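[note] For orientation, a naive computation of term discrimination values via
centroid-based space density is sketched below; it recomputes the density
once per term and is therefore far slower than the exact algorithm described
in the article:

    import numpy as np

    def space_density(doc_vectors):
        # Average cosine similarity between each document and the centroid.
        centroid = doc_vectors.mean(axis=0)
        norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
        sims = doc_vectors @ centroid / np.where(norms == 0.0, 1.0, norms)
        return sims.mean()

    def discrimination_values(doc_vectors):
        # DV of term k = density with term k removed minus the baseline
        # density; good discriminators spread documents apart, so removing
        # them increases density and yields a positive DV.
        baseline = space_density(doc_vectors)
        return np.array([
            space_density(np.delete(doc_vectors, k, axis=1)) - baseline
            for k in range(doc_vectors.shape[1])
        ])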