AUTOMATIC CATEGORIZATION OF MAGAZINE ARTICLES

MARIE-FRANCINE MOENS and JOS DUMORTIER

Katholieke Universiteit Leuven, Belgium

Interdisciplinary Centre for Law & IT (ICRI)

Tiensestraat 41

B-3000 Leuven

Belgium

Tel: xx/32/16/325383

Fax: xx/32/16/325438

e-mail: {marie-france.moens,jos.dumortier}@law.kuleuven.ac.be

Abstract

Automatic text categorization is an important research area and has potential for many text-based applications, including text routing and filtering. Typical text classifiers learn from example texts that are manually categorized. In this paper we discuss the categorization of magazine articles with broad subject descriptors. We focus especially on the following aspects of text classification: effective selection of feature words and proper names that reflect the main topics of the text, and training of text classifiers. The χ² test, which is sometimes used for selecting terms that are highly related to a text class, is applied in a novel way when constructing a category weight vector. Despite a limited number of training examples, combining an effective feature selection with the χ² learning algorithm for training the text classifier results in a satisfactory categorization of new magazine articles.

1. Introduction

An important Belgian publisher offers its magazine articles on the Internet for on-line purchase. Controlled subject descriptors are assigned to the articles. The descriptors are mainly used to route the articles to magazine subscribers who are interested in electronic articles on specific topics. Fast routing of articles immediately after their publication is important, hence the interest in automating the descriptor assignment, or text categorization.

Automatic text categorization is an important research area and has potential for many text-based applications, including text routing and filtering. The purpose of our research is to develop adequate techniques for classifying a variety of articles from different magazines and columns and to test the techniques on a large corpus of articles. The descriptors concern the broad subjects of the stories (e.g., music, film, investments). Our research comprises experiments with different text categorization algorithms. The algorithms learn the classification patterns from example texts that are manually classified. One algorithm uses the χ² test for pattern recognition, with satisfactory results. We put a strong focus on effective selection of content terms and proper name phrases. A detailed overview of the research is given in Moens and Dumortier (in press).

2. Text corpus and output of the demonstrator

The articles of the text corpus were published in 1998 in magazines such as “Knack”, “Weekend Knack”, “Trends”, and “Cash!”. They are written in Dutch and are very heterogeneous in content and structure. The articles cover a variety of subjects in domains, such as politics, economy, finance, life style, arts, sports, and many others, and often interweave different subject domains. The articles belong to different columns of the magazines. This is reflected in their structure. Most of them follow the schema of the written news story consisting of a headline, lead paragraph, attribution section, and body of the story. Other schemata are also present, such as a list of film titles with explanatory sentences, or a simple address of a restaurant. A few articles are so-called satellite articles: They are small texts that elaborate on a subtopic of another large article. The article texts are of varying length ranging from a few paragraphs to multiple pages (most articles fall in the range of 6000 to 25000 bytes, some very short ones of less than 1000 bytes exceptionally occur).

The articles have descriptors attached that were assigned by the professional indexers of the publisher. Usually, one, sometimes two, and in very rare cases three subject descriptors are ascribed per article. These broad subject descriptors are used in matching article profiles with users' profiles in a routing task.

We used articles of the text classes CAR, INVESTMENTS, STOCK MARKET, CULINARY, FILM, COMPUTER SCIENCE, INTERNATIONAL, LITERATURE, MARKETING, MUSIC, POLITICS, SPORTS, TOURISM, and REAL ESTATE. The publisher has defined other classes, but the number of their members in the text corpus is too small to consider them in the experiments. For each class, a sample of articles was manually analyzed, yielding the following characteristics of the texts of the classes.

1. The texts in the class CAR often describe new car and motor models and give their technical characteristics. They often exhibit a technical vocabulary. Sometimes, a text has the form of an index card that contains the technical details of the car.
2. The texts of the class INVESTMENTS bear upon different forms of investments (e.g., bonds, stocks, art, and real estate). They often overlap with texts in the classes STOCK MARKET and REAL ESTATE.
3. The texts of the class STOCK MARKET describe stock exchanges. They sometimes describe products of companies that offer stocks, which might result in a rich vocabulary.
4. CULINARY describes culinary books, recipes, wines, restaurants, and cafés. The texts often exhibit a rich vocabulary due to descriptions of historical settings and locations of the places or to the variety of the ingredients of recipes.
5. The texts of the class FILM describe new films. Part of this description is rather technical, but another part of it gives a summary of the story of the film. The stories enrich the vocabulary of this class. Sometimes, an article contains a list of film titles and a short description of each film.
6. Texts of the class COMPUTER SCIENCE are technical in nature. They contain descriptions of companies and products.
7. The texts of the class INTERNATIONAL cover events outside Belgium. The main linguistic expressions that cue this class are often the names of foreign countries and important foreign personalities. The vocabulary in this text class is very rich. There is an overlap with the class POLITICS.
8. The texts of the class LITERATURE are commonly reviews of newly published books. As in the class FILM, the content is briefly described. However, there are some specific, technical cues such as references to the type of work, ISBN number, number of pages, and publisher.
9. The class MARKETING mostly contains detailed descriptions of products that are marketed. The products can be almost anything (e.g., animal food, insurance, and clothing). Sometimes, an article discusses multiple products. The vocabulary in these texts is very heterogeneous.
10. The class MUSIC contains texts about classical and modern music. The texts contain technical details and references to known composers.
11. The texts of the class POLITICS contain political events. They are rich in names of political personalities, parties, and organizations.
12. The texts of the class SPORTS describe sport events. Often, the technical vocabulary of a specific sport is present.
13. The class TOURISM contains travel stories and promotions of foreign places, cultures, and hotels. The texts are usually long with a rich vocabulary. However, an article in the form of an information scheme is possible.
14. The texts of the class REAL ESTATE describe the location, area, rent, price, and other characteristics of the real properties.

A corpus of more than 2650 articles was split: two thirds of the articles were used for training and one third for testing. A classifier was learned for the 14 classes. The training corpus contains the following distribution of classes. There are about 300 examples of the class MARKETING and about 200 examples of the class CULINARY. There are about 150 examples of the classes INVESTMENTS, STOCK MARKET, TOURISM, and SPORTS, about 100 examples of the classes FILM, CAR, COMPUTER SCIENCE, MUSIC, and LITERATURE, and about 50 examples of the classes INTERNATIONAL, POLITICS, and REAL ESTATE. In the test corpus, the classes are present with about equal proportions.

A demonstrator was built in the programming language C on a Sun™ SPARC station 5 under Solaris® 2.5.1. It learns a text classifier from the training set and automatically assigns descriptors to new article texts with the learned classifier (Figure 1). It was agreed with the publisher that the system must simulate the manual process of assigning one or two descriptors to the articles, which reflect the main topics of the article.

Figure 1: Main components of the demonstrator.

3. Methods

Humans reliably identify relevant texts for a certain subject or classification domain by skimming the texts for specific word patterns and their contexts. Knowledge bases that describe the word patterns and their relation with the text classes have been successfully applied in automatic text categorization (e.g., Hayes, 1992; Moens & Uyttendaele, 1997). The text is skimmed for cue patterns defined in a rule or frame base, possibly followed by an evaluation of the logical constraints or of a minimum frequency of occurrence in the text imposed on the patterns. This indicates that surface text features can be identified that successfully discriminate the subject and classification codes linked to a text. Direct representation of this knowledge is a time-consuming and laborious task, which is only justified when the knowledge is restricted. In other circumstances, machine learning methods provide an interesting alternative. Techniques of supervised learning are common in text categorization (for an overview of automatic text categorization, see Moens, in press, p. 101 ff.). In general, they involve the construction of a classification function from a large set of example texts for which the true classes are known. The function agrees with the training instances, i.e., for a given class it classifies the positive examples as members of the class and discards the negative examples, and it is hoped that the function also correctly classifies new, previously unseen texts.

Because of the large and heterogeneous subject domain, we use machine learning techniques to acquire the textual patterns that imply the text classes from an example or training set. An example text is represented as a set of features (words and phrases). A new text is equally represented as a set of features. The methods for classifying the magazine articles comprise an initial feature selection to identify important content terms in the texts, learning algorithms, and assignment of subject descriptors.

Words and proper names are the salient features involved in classifying magazine articles. The number of different features in a corpus of magazine articles is enormous. Because the text classes regard the main topics of the texts, it is important to identify content terms that relate to the main topics and to discard terms that do not bear upon content or treat only marginal topics in training and test corpus. Proper names are identified with heuristic rules that take into account patterns of capitalized words and reoccurrence of the names in the texts. Other content words are selected after elimination of stopwords. A stoplist of 879 non-content words is built based upon their syntactic classes. The stoplist contains function words such as articles, prepositions, auxiliary verbs, and others. Numbers are not accounted for. Currently, no form of stemming is used, except for the use of conjugated forms of auxiliary verbs in the stoplist. After removal of stopwords and numbers, we consider two different approaches for selecting important topic terms. In a first approach, words and proper names with a high weight are selected. Terms are weighted by their frequency of occurrence in the text divided by the maximum frequency a content term occurs in the text (length normalization factor). In a second approach, words and proper names are selected from the beginning of the article, which usually includes the discourse segments of the headline, lead, and the attribution of the article.
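The first selection approach (frequency divided by the maximum term frequency in the text, then keeping the high-weight terms) can be illustrated with a minimal sketch. The tiny stoplist, the threshold value, the tokenizer, and the function names are assumptions made for illustration only; in particular, the sketch lowercases all tokens and does not attempt the proper-name heuristics described above.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (assumption); the stoplist in the text has 879 entries.
STOPLIST = {"de", "het", "een", "van", "en", "in", "is", "op"}

def content_terms(text, stoplist=STOPLIST):
    """Return lowercase word tokens with stopwords and numbers removed."""
    # The pattern matches runs of letters only, so numbers are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in stoplist]

def term_weights(text, stoplist=STOPLIST):
    """Weight each term by its frequency divided by the maximum term frequency."""
    counts = Counter(content_terms(text, stoplist))
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

def select_terms(text, threshold=0.5, stoplist=STOPLIST):
    """Keep terms whose normalized weight reaches the threshold (assumed value)."""
    weights = term_weights(text, stoplist)
    return {t: w for t, w in weights.items() if w >= threshold}

if __name__ == "__main__":
    print(select_terms("De film is een film van de regisseur 1998", threshold=0.6))
```

Dividing by the maximum frequency keeps the weights of a long, verbose text on the same 0 to 1 scale as those of a short text, so a single threshold can be applied to both.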

For recognizing the classification patterns and assignment of subject descriptors, we implemented several statistical algorithms.

In Bayesian independence classification (Maron, 1961; Fuhr, 1989; Lewis, 1995) the posterior probability that a new, previously unseen text belongs to a certain text class given its features (here words and proper names) is computed based on the probabilities that the individual features are related to the class. Probability estimates of the individual features are based on the co-occurrence of the text class and selected features in the training corpus, and on the assumption of their linkage. The computation of the probability that a document text belongs to a specific class given its features is simplified by applying the theorem of Bayes together with the assumption that the features occur independently of one another.
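As an illustration of the independence assumption, the following sketch scores a class by adding the log prior and the log probabilities of the features present in the text. It is a simplified reading, not the exact estimator used in the experiments: features are treated as binary (present or absent), only present features contribute, and the add-one smoothing is an assumption introduced to avoid zero probabilities.

```python
import math
from collections import defaultdict

def train_bayes(examples):
    """examples: list of (set_of_features, class_label) pairs."""
    class_docs = defaultdict(int)                       # documents per class
    feat_docs = defaultdict(lambda: defaultdict(int))   # documents per class containing a feature
    vocab = set()
    for feats, label in examples:
        class_docs[label] += 1
        for f in feats:
            feat_docs[label][f] += 1
            vocab.add(f)
    return class_docs, feat_docs, vocab

def log_posterior(feats, label, model):
    """Unnormalized log posterior of `label`, assuming independent features."""
    class_docs, feat_docs, vocab = model
    total = sum(class_docs.values())
    score = math.log(class_docs[label] / total)         # log prior
    for f in feats:
        if f in vocab:
            # Add-one smoothing (assumption) for the per-class feature probability.
            p = (feat_docs[label][f] + 1) / (class_docs[label] + 2)
            score += math.log(p)
    return score

def classify(feats, model):
    """Return the most probable class label."""
    return max(model[0], key=lambda c: log_posterior(feats, c, model))

if __name__ == "__main__":
    model = train_bayes([
        ({"goal", "match"}, "SPORTS"),
        ({"match", "coach"}, "SPORTS"),
        ({"actor", "scene"}, "FILM"),
        ({"scene", "director"}, "FILM"),
    ])
    print(classify({"match"}, model))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when a text has many features.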

The Rocchio (Rocchio, 1971; Lewis, Schapire, Callan, & Papka, 1996) and the χ² algorithms generalize the positive and negative examples of each class into a category weight vector. The components of this vector are the text features (words and proper names) of the example texts. The weights indicate the strength of their relationship with the subject class.

For a new article to be classified by the Bayesian classifier, the probability of class membership is computed for each subject descriptor. The most probable descriptor is assigned. When a new article is classified with the Rocchio or χ² classifier, a scoring function computes the similarity between the feature vector of the new text to be classified and the weight vector of each class or category. We use the inner product of the vectors for computing this similarity (Jones & Furnas, 1987). The subject descriptor of the category weight vector with highest similarity is assigned to the new article. In a variant implementation, a second descriptor is assigned when the probability with the second best class or the similarity with the vector of this second best class is less than 10% lower than the probability or similarity of the best class.
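The scoring and assignment step for the vector-based classifiers can be sketched as follows. The function names and the example vectors are hypothetical, and reading "less than 10% lower" as a relative margin on the best score is an assumption.

```python
def inner_product(features, weights):
    """Similarity between a text's feature vector and a category weight vector."""
    return sum(value * weights.get(term, 0.0) for term, value in features.items())

def assign_descriptors(features, category_vectors, margin=0.10):
    """Assign the best category, plus the runner-up if it is within `margin` of the best."""
    scores = {c: inner_product(features, w) for c, w in category_vectors.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    assigned = [best]
    if len(ranked) > 1 and scores[best] > 0:
        second = ranked[1]
        # Second descriptor only when its score is less than 10% below the best score.
        if scores[second] > (1.0 - margin) * scores[best]:
            assigned.append(second)
    return assigned

if __name__ == "__main__":
    vectors = {  # hypothetical category weight vectors
        "FILM": {"film": 4.0, "regisseur": 2.0},
        "MUSIC": {"concert": 5.0},
        "LITERATURE": {"film": 3.8, "roman": 6.0},
    }
    print(assign_descriptors({"film": 1.0, "regisseur": 0.5}, vectors))
    print(assign_descriptors({"film": 1.0}, vectors))
```

In the first call only FILM is assigned, because the runner-up scores well below 90% of the best score; in the second call the two scores are close enough for both descriptors to be assigned.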

The Bayesian and Rocchio classifiers are classical tools for pattern recognition in texts and are useful for comparisons. The Rocchio classifier differs from the χ² classifier in the computation of the category weight vector.

The Rocchio algorithm developed for relevance feedback in information retrieval learns a better weight for each term of the query based upon the average weight of the term in the set of relevant and non-relevant texts. In text categorization the algorithm is used in a like manner. The weight of a feature (word or proper name) in a category weight vector is computed as the weighted difference of the mean weight of the feature in positive and negative training examples of the text class.
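A minimal sketch of such a category weight vector is given below. The coefficients `beta` and `gamma` that weight the positive and negative means are assumed parameters with placeholder values; the values used in the experiments are not stated here.

```python
def rocchio_vector(positives, negatives, beta=1.0, gamma=1.0):
    """Build a category weight vector from positive and negative example texts.

    positives, negatives: lists of {term: weight} dictionaries.
    beta, gamma: assumed coefficients of the weighted difference.
    """
    terms = set().union(*positives, *negatives)

    def mean_weight(docs, term):
        # A term that is absent from a text contributes a weight of zero.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    return {
        t: beta * mean_weight(positives, t) - gamma * mean_weight(negatives, t)
        for t in terms
    }

if __name__ == "__main__":
    pos = [{"goal": 1.0, "match": 0.5}, {"goal": 0.5}]
    neg = [{"match": 1.0}, {"film": 1.0}]
    for term, weight in sorted(rocchio_vector(pos, neg).items()):
        print(term, weight)
```

Terms that are on average more prominent in the negative examples receive a negative weight in this sketch and therefore lower the inner-product score of a text that contains them.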

The χ² test computes how closely an observed probability distribution corresponds to an expected probability distribution. In our task, the observed probability distribution is formed by the observed frequencies of the number of texts relevant or non-relevant for the text class that contain the text feature (word or proper name) or do not contain the feature. A useful expected probability distribution is that all expected frequencies of the presence of the feature (or of the absence of the feature) will be equal in texts relevant for the class and texts non-relevant for the class. The hypothesis is tested whether the observed and the expected frequencies are close enough to conclude that they come from the same probability distribution (goodness-of-fit test). When the χ² variable of a term feature is low, the fit of the observed and expected frequencies is good and hence the feature has no influence upon the text class. When the value is high, there is an association between the feature and the class. In text categorization this association is used to select features that are highly related to the text class (Schütze, Hull, & Pedersen, 1995).
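For a single feature and a single class, the four observed frequencies form a 2 x 2 contingency table. The sketch below computes the statistic as the sum of (observed - expected)² / expected over the four cells. It assumes the standard Pearson formulation, in which the expected frequencies are derived from the row and column totals; the function name and argument order are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2 x 2 contingency table.

    a: relevant texts that contain the feature
    b: non-relevant texts that contain the feature
    c: relevant texts that do not contain the feature
    d: non-relevant texts that do not contain the feature
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency if feature and class were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

if __name__ == "__main__":
    print(chi_square_2x2(10, 10, 40, 40))  # feature equally common in both groups
    print(chi_square_2x2(30, 10, 20, 40))  # feature concentrated in relevant texts
```

The first call yields zero because the feature occurs in the same proportion of relevant and non-relevant texts; the second yields a high value because the feature is concentrated in the relevant texts.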

We use the χ² variable in a different way. The relationship of a feature with a class is computed by applying the χ² test. Instead of selecting features with a high χ² value, the raw χ² values are used in the category weight vector. The contingency table of relevant and non-relevant texts containing or not containing the text feature has 1 degree of freedom. Using the raw χ² values in similarity computation with the feature vector of a new text implies that a term of the new text (word or proper name) that is related to the text class with a probability of 68% or more based on the training corpus has a positive effect upon class assignment (χ² value of 1 or more in the category weight vector used in the inner product). High χ² values of 9 or more indicate a probability of close to 100% that the term is related to the text class.
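The correspondence between χ² values and the quoted probabilities follows from the cumulative distribution of a χ² variable with 1 degree of freedom, which equals erf(sqrt(x / 2)). The sketch below checks the values 1 and 9 and builds a category weight vector from raw χ² values. The function names and the toy data are hypothetical, and the closed-form 2 x 2 formula is the standard one rather than a formula given in the text.

```python
import math

def chi2_cdf_1df(x):
    """P(X <= x) for a chi-square variable with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))

def chi2_weight(a, b, c, d):
    """Closed-form chi-square for a 2 x 2 table: n(ad - bc)^2 / product of marginals."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def category_vector(docs, labels, category):
    """Raw chi-square weight per term for one category.

    docs: list of feature sets; labels: list of sets of class labels.
    Note: a raw chi-square value is also high for a term concentrated in the
    non-relevant texts; how that case is treated is not specified here.
    """
    vocab = set().union(*docs)
    n_rel = sum(1 for l in labels if category in l)
    n_non = len(docs) - n_rel
    vector = {}
    for term in vocab:
        a = sum(1 for d, l in zip(docs, labels) if term in d and category in l)
        b = sum(1 for d, l in zip(docs, labels) if term in d and category not in l)
        vector[term] = chi2_weight(a, b, n_rel - a, n_non - b)
    return vector

if __name__ == "__main__":
    print(round(chi2_cdf_1df(1.0), 4))
    print(round(chi2_cdf_1df(9.0), 4))
```

A χ² value of 1 corresponds to roughly 68% and a value of 9 to more than 99.7%, which matches the thresholds mentioned above.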

4. Results

We conducted a number of experiments aiming at comparing the initial feature selection methods and at comparing the different algorithms for text categorization.

The methods were tested upon a set of more than 930 new, previously unseen magazine articles. The effectiveness of automatic assignment of subject descriptors to the articles is obtained by comparing the results with descriptor assignment to these texts by the indexers of the publisher and by computing recall, precision, and F-measure values, which are common metrics for evaluating text categorization (Lewis, 1995). Text categorization is seen as a binary decision. An article belongs or does not belong to a specific class or category. Recall is the proportion of class members that the system assigns to the class. Precision is the proportion of members assigned to the class that really are class members. The F-measure combines recall and precision in one single measure. Recall, precision, and F-measure are ideally close to one.
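The three measures can be computed per class from the assigned and the manually attributed descriptors, as in the sketch below. The F-measure is taken here to be the balanced harmonic mean of recall and precision; this is an assumption, although the per-class values reported in Table 1 are consistent with it (for example, the CAR row). The function names and toy data are illustrative.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision (assumed form)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def evaluate(assigned, truth, category):
    """Return (recall, precision, F-measure) for one category.

    assigned, truth: lists of descriptor sets, one pair per article.
    """
    tp = sum(1 for a, t in zip(assigned, truth) if category in a and category in t)
    fp = sum(1 for a, t in zip(assigned, truth) if category in a and category not in t)
    fn = sum(1 for a, t in zip(assigned, truth) if category not in a and category in t)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision, f_measure(recall, precision)

if __name__ == "__main__":
    assigned = [{"FILM"}, {"FILM", "MUSIC"}, {"MUSIC"}, {"SPORTS"}]
    truth = [{"FILM"}, {"MUSIC"}, {"MUSIC"}, {"FILM"}]
    print(evaluate(assigned, truth, "FILM"))
    print(evaluate(assigned, truth, "MUSIC"))
```

Because each class is evaluated as a separate binary decision, an article with two descriptors contributes to the counts of both classes.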

A magazine article often contains many marginal topics that are of no interest when learning the class concepts or when assigning subject descriptors to new texts. Weighting a term by its frequency divided by the maximum frequency with which a term occurs in the text, and selecting terms with a high weight, proved to be effective. The approach is useful to detect important topic terms and to undo the effect of long texts that are the result of verbosity. This form of term weighting also proved to be practical for identifying important proper names in the articles. Categorization results are usually better when feature selection is based on the length-normalized term frequency with elimination of low term weights than when features are selected only from the beginning of the article. When content terms are selected from topically important segments of the article such as the heading, lead, and attribution part, the set of terms still contains noise terms. Discourse structure can be useful when selecting content terms. However, the segments in which important content terms are located may differ from one text class to another or from one text type to another. For instance, selecting content terms from the beginning segments was advantageous for identifying features of the class MARKETING. In this class, the noisy terms of product descriptions usually appear further on in the texts.

The χ² classifier scored much better than the other training methods and especially better than the Rocchio algorithm under the same circumstances. The classifier resulted in an average recall of 73%, average precision of 64%, and an average F-measure of 66%, when one or two descriptors were assigned (Table 1). When one descriptor was assigned, the classifier produced an average recall of 69%, average precision of 68%, and an average F-measure of 66%. The percentages are the result of an initial feature selection by considering terms with a high frequency after length normalization. The average F-measure was 6 and 12 percentage points higher than the resulting F-measure when applying the Bayesian independence classifier and the Rocchio classifier respectively under the same circumstances (Table 2). The χ² classifier is less sensitive to noise terms. In the category weight vectors, noise terms have a very low weight compared to good cue terms. The χ² test that measures the fit between the observed and the expected frequencies of the content terms in the training corpus is effective for identifying terms that are related to the class. We use the raw χ² values in the category weight vector. The benefit of this approach is supported by our experiments. The χ² values strongly distinguish terms that are highly related to a class from the ones that are related to a lesser degree. Good results are obtained despite a limited number of positive training examples. A limited number of positive examples is common in routing tasks. At any time new topics may be introduced in the document stream.

Table 1. Results of categorization with the χ² algorithm. Features are selected by considering the term frequency with length normalization and elimination of terms with low weights. 1 or 2 descriptors are assigned.

Category            Recall      Precision   F-measure
CAR                 0.827586    0.558140    0.666667
INVESTMENTS         0.789773    0.785311    0.787535
STOCK MARKET        0.481203    0.646465    0.551724
CULINARY            0.805970    0.613636    0.696774
FILM                1.000000    0.576471    0.731343
COMPUTER SCIENCE    0.673913    0.620000    0.645833
INTERNATIONAL       0.444444    0.689655    0.540541
LITERATURE          0.909091    0.526316    0.666667
MARKETING           0.457895    0.906250    0.608392
MUSIC               0.937500    0.759494    0.839161
POLITICS            0.607143    0.377778    0.465753
SPORTS              0.888889    0.623377    0.732824
TOURISM             0.743590    0.707317    0.725000
REAL ESTATE         0.653846    0.500000    0.566667
Average             0.730060    0.635015    0.658920
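The "Average" row of Table 1 appears to be the unweighted mean of the per-class values, so that every class counts equally regardless of its number of test articles. The sketch below recomputes the three means from the figures in the table; the variable and function names are illustrative.

```python
# Per-class (recall, precision, F-measure) values copied from Table 1.
TABLE_1 = {
    "CAR": (0.827586, 0.558140, 0.666667),
    "INVESTMENTS": (0.789773, 0.785311, 0.787535),
    "STOCK MARKET": (0.481203, 0.646465, 0.551724),
    "CULINARY": (0.805970, 0.613636, 0.696774),
    "FILM": (1.000000, 0.576471, 0.731343),
    "COMPUTER SCIENCE": (0.673913, 0.620000, 0.645833),
    "INTERNATIONAL": (0.444444, 0.689655, 0.540541),
    "LITERATURE": (0.909091, 0.526316, 0.666667),
    "MARKETING": (0.457895, 0.906250, 0.608392),
    "MUSIC": (0.937500, 0.759494, 0.839161),
    "POLITICS": (0.607143, 0.377778, 0.465753),
    "SPORTS": (0.888889, 0.623377, 0.732824),
    "TOURISM": (0.743590, 0.707317, 0.725000),
    "REAL ESTATE": (0.653846, 0.500000, 0.566667),
}

def macro_average(table):
    """Unweighted mean of each column over all classes."""
    n = len(table)
    return tuple(sum(row[i] for row in table.values()) / n for i in range(3))

if __name__ == "__main__":
    recall, precision, f_measure = macro_average(TABLE_1)
    print(f"{recall:.6f} {precision:.6f} {f_measure:.6f}")
```

Note that the average F-measure is the mean of the per-class F-measures, which differs from the F-measure computed from the average recall and average precision.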

Table 2. Average values of recall, precision, and F-measure when assigning 1 or 2 descriptors to the articles. Features are selected by considering the term frequency with length normalization and elimination of terms with low weights.

Classifier               Recall   Precision   F-measure
Bayesian independence    62%      62%         60%
Rocchio                  63%      57%         54%
χ²                       73%      64%         66%

In the appendix we include four examples of articles that were categorized with the χ² classifier (see below). Examples 1, 2, and 3 were correctly categorized by the system as LITERATURE, REAL ESTATE, and FILM respectively. The system correctly assigned the descriptor INTERNATIONAL in example 4, but missed the second descriptor POLITICS. The texts of the examples are the original texts written in Dutch.

The Belgian publisher has recently implemented the techniques of selection of content terms and of training a χ² classifier as part of its document management system. Initial results with a partly different training and test corpus and partly different text classes (e.g., economy, taxes, etc.) show very good average F-measures and confirm the usefulness of our approach.

5. Conclusions

We can conclude the following. Successful systems that classify texts and assign subject or classification codes to texts rely upon the words and phrase patterns that signal the text class. In many text categorization situations their number is too large to acquire manually. In this case, the classifier is trained upon example texts. Because the subject descriptors concern the broad text topics, an initial feature selection that identifies important content terms and proper names based upon the term frequency, normalized by the maximum frequency with which a content term occurs in the text, is effective. Adding knowledge of the discourse structure in the term selection process is useful for certain text classes. Given the limited number of positive examples and the high number of text features in the articles that belong to a variety of magazines, columns, and subject domains, the results of training a text classifier with the χ² algorithm are very satisfactory.

References

Fuhr, N. (1989). Models for retrieval with probabilistic indexing. Information Processing & Management, 25 (1), 55-72.

Hayes, P.J. (1992). Intelligent high-volume text processing using shallow, domain-specific techniques. In P.S. Jacobs (Ed.), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval (pp. 227-241). Hillsdale: Lawrence Erlbaum.

Jones, W.P., & Furnas, G.W. (1987). Pictures of relevance: a geometric analysis of similarity measures. Journal of the American Society for Information Science, 38 (6), 420-442.

Lewis, D.D. (1995). Evaluating and optimizing autonomous text classification systems. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 246-254). New York: ACM.

Lewis, D.D., Schapire, R.E., Callan, J.P., & Papka, R. (1996). Training algorithms for linear text classifiers. In H.-P. Frei, D. Harman, P. Schäuble, & R. Wilkinson (Eds.), Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 298-306). New York: ACM.

Maron, M. (1961). Automatic indexing: an experimental inquiry. Journal of the ACM, 8, 404-417.

Moens, M.-F. (in press). Automatic Indexing and Abstracting of Document Texts (270 pp.). Boston: Kluwer Academic Publishers.

Moens, M.-F., & Dumortier, J. (in press). Automatic categorization of magazine articles. Information Processing & Management.

Moens, M.-F., & Uyttendaele, C. (1997). Automatic structuring and categorization as a first step in summarizing legal cases. Information Processing & Management, 33(6), 727-737.

Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system: Experiments in automatic document processing (pp. 313-323). Englewood Cliffs, NJ: Prentice Hall.

Schütze, H., Hull, D. A., & Pedersen, J. O. (1995). A comparison of classifiers and document representations for the routing problem. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 229-237). New York: ACM.

APPENDIX

Example 1

HET GROTE PLAN

Als conservatieve factor kan het Ierse katholicisme tellen. Aanvaard je lot, is de algemene teneur, en wanneer je er onderdoor dreigt te gaan, kun je altijd troost vinden in de gedachte dat alle lijden in Gods grote plan een beter doel dient. Ook al ben je dan te kortzichtig om dat te snappen. In Niall Williams' debuutroman "Vier liefdesbrieven" spelen dit befaamde lot en de eruit resulterende schuld de hoofdrollen.

Nicholas Coughlan is twaalf wanneer zijn vader William van God de opdracht krijgt om zijn ambtenarenjob te laten voor wat hij is en kunstschilder te worden. Dat hij zijn gezin daarmee de armoede en zijn vrouw de dood injaagt, is voor William slechts een irrelevante opmerking. Wanneer Nicholas volwassen is, vecht hij een innerlijke strijd uit: zijn vader kan niet gelogen hebben over zijn roeping en er bestaat dus wel degelijk een groot goddelijk plan waarvan ook hij deel uitmaakt. Maar waarom blijft de openbaring zo lang uit?

In een tweede verhaallijn voert Williams ons naar een eiland buiten de Ierse westkust. Daar, in een bijna sprookjesachtig Keltisch aura van dichters en feeën, leven de kinderen Isabel en Sean Gore. Wanneer Sean tijdens het spelen verongelukt, neemt zus Isabel de morele schuld op haar schouders, waarna ze zichzelf als boetedoening opoffert in een hels huwelijk. Ook zij is dus getekend. De twee verhalen komen samen wanneer de zoektocht naar het laatste schilderij van zijn vader, Nicholas op het eiland brengt en het goddelijke plan, anders dan hij verwachtte, hem duidelijk wordt.

Williams heeft zijn boek in een licht magisch-realistische stijl geschreven, wat vooral in de eilandpassages goed aansluit bij de Ierse setting. Verwacht dus geen Roddy Doyle of John Banville, maar wel een hedendaagse prozaversie van William Butler Yeats.

Niall Williams, "Vier liefdesbrieven", Contact, Amsterdam, 336 blz., 995 fr.

Marnix Verplancke

Example 2

Lintbewinkeling

Hoe zijn baanwinkels in ons land gegroeid, wie zijn de eigenaars en welke producten worden er aangeboden? Een studie van makelaar Healey & Baker brengt de Belgische baanwinkelsituatie in kaart.

Dat het winkelaanbod in België op zijn zachtst gezegd 'versnipperd' is, mag voor een groot deel worden toegeschreven aan de baanwinkels, winkels die door hun omvang en activiteit de periferie van de steden opzoeken. Het bekendste voorbeeld is ongetwijfeld de 'Boomsesteenweg', op het grondgebied van Aartselaar bij Antwerpen.

Een studie van makelaar Healey & Baker schetst de geschiedenis van de baanwinkels in België en trekt enkele verrassende conclusies uit de samenstelling en de oppervlakte van het baanwinkelgebeuren.

De wildgroei in het winkelaanbod in perifere zones leidde in 1975 tot de wet op de handelsvestigingen. Het land werd ingedeeld in zones. Buiten de centra mochten nog winkels worden gebouwd zonder socio-economische vergunning met een netto-oppervlakte van 1500 vierkante meter (bruto 3000 vierkante meter), binnen de kernen van 750 vierkante meter netto (1000 vierkante meter bruto). Voor grotere verkoopoppervlakten was een dergelijke vergunning wel verplicht. Maar de tijdrovende procedures om ze te verkrijgen (in totaal mag op een wachttijd worden gerekend van zes tot acht maanden) en de subjectieve criteria die worden gehanteerd, zetten winkelbouwers aan om achterpoortjes te zoeken. De oplossing bleek panden te bouwen met een netto-oppervlakte van iets minder dan de limiet (1499 vierkante meter buiten en 749 vierkante vierkante meter binnen de kernen).

De reactie van de overheid liet lang op zich wachten, maar kwam er toch in 1994 met een nieuwe wet op de handelsvestigingen. Die beperkte de netto-oppervlakte van winkels zonder socio-economische vergunning tot 1000 vierkante meter netto buiten en 400 vierkante meter netto binnen bewoningskernen.

De studie van Healey & Baker stelt de vraag hoezeer het baanwinkelaanbod versnipperd is, welke eigenaars dit marktsegment beheersen en welke producten en diensten er vooral worden aangeboden. Van 'baanwinkelketens' spreekt de studie zodra een onderneming meer dan vijf vestigingen heeft in ons land (met uitzondering van Ikea met vier vestigingen, maar een belangrijke verkoopoppervlakte) en twee derde van de vestigingen zich in de periferie bevinden.

KLEDING EN SCHOENEN

101 ketens voldeden aan deze voorwaarde (voedingswinkels kwamen niet in aanmerking), goed voor 1975 vestigingen en 2.100.000 vierkante meter brutoverkoopoppervlakte. Een groot gedeelte ervan (39 van de 101 ketens om precies te zijn) heeft minder dan tien vestigingen. Slechts drie hebben er meer dan honderd. Van versnippering gesproken. Uitgedrukt in vierkante meter verkoopoppervlakte beschikken twee ketens over meer dan 100.000 vierkante meter (Brico en Euroshoe). De helft bezet minder dan 10.000 vierkante meter.

Opvallend is het overzicht van de winkelactiviteiten van de baanwinkels. Meer dan 600 (27% van de oppervlakte) van de 1975 vestigingen bieden schoenen en kleding aan. Nochtans zijn dat precies de activiteiten die men via de vestigingswet in de stadskernen wilde houden. De wet van 1975 bereikte met andere woorden niet het gewenste doel.

Van de 101 ketens zijn 63 Belgisch van nationaliteit (goed voor 1370 van de 1975 vestigingen). Verder tellen we 14 Franse (145 winkels) en 19 Nederlandse (411 winkels) ketens (en vijf 'andere' waaronder Ikea en Toys 'R' Us).

In tegenstelling tot de situatie in het buitenland - waar baanwinkels gegroepeerd liggen in daartoe bestemde zones - gaat het in België om veel kleine parken. Meer dan de helft van de baanwinkellocaties telt minder dan tien winkels. Slechts in vier gevallen groepeerden meer dan dertig winkels zich op eenzelfde plaats.

Het rendement van baanwinkels daalt intussen fors. Voor enkele jaren moest een dergelijke zaak nog een rendement bieden dat 3,5 tot 4% boven dat van een A1-winkel (in een winkelstraat) lag. Nu vragen investeerders nog slechts een premie (het verschil in rendement) van goed 1%. 62% van de baanwinkels behoort toe aan privé-investeerders, 14% is in handen van de ketens zelf, 18% is eigendom van ontwikkelaars en 6% van institutionelen (maar het aandeel van deze laatste stijgt snel). De gemiddelde huurprijzen voor baanwinkels stegen volgens de studie van Healey & Baker met 60% tussen 1987 en nu. En de experts van het huis verwachten een verdere opmars met nog eens 40% tegen het jaar 2001.

Example 3

De dromen waren bitter

De Hollywoodstudio Warner Brothers is vijfenzeventig jaar oud. Een terugblik op het imperium van de ruziënde broers Warner.

Bij het kijken naar oude Hollywoodprenten begint het plezier al bij het logo van de producerende studio: de brullende leeuw van MGM, de berg met wolkenaureool van Paramount, de rondtollende wereldbol van Universal, de radiomast van RKO, de zwaailichten rond de monumentale 20th Century Fox letterblokken, de dame met toorts van Columbia Pictures. Maar geen bedrijfssymbool is de filmfan liever dan het robuuste zilveren schild waarin de twee eerste letters van Warner Brothers gebeiteld lijken.

Dit vertrouwd embleem, tijdens de gouden jaren van Hollywood steevast begeleid door de bruisende fanfare van Max Steiner, roept meteen een wereld op van taaie films, liefst in zwart-wit, vol misdaad, vervaarlijk avontuur en maatschappijkritische ondertonen. "The Maltese Falcon" en "Casablanca" met Humphrey Bogart, "Jezebel" en "The Letter" met Bette Davis, "The Roaring Twenties" en "White Heat" met James Cagney, "Mildred Pierce" met Joan Crawford - om maar enkele van de honderden prenten te noemen.

Meer dan bij welke andere studio is de historiek van Warner Bros. ook het verhaal van een aantal kleurrijke pioniers: de vier gebroeders Warner, dankzij wie het begrip nepotisme onlosmakelijk met Hollywood verbonden is. Jack L. (1892-1978), de jongste en bekendste van de vier, superviseerde samen met de vroeggestorven Sam (1888-1927) de productie; Jack aan de westkust en Sam in New York. Harry (1882-1958) en Albert (1884-1967) bestuurden de financiële en commerciële afdeling. Het viertal stamde uit een kroostrijk gezin van Pools-joodse afkomst. Hoe zij Hollywood veroverden, lijkt wel een filmscenario over de American Dream.

Vader Benjamin had in de vroege jaren 1880 zijn gezin achtergelaten in Kraznashiltz in Polen, om verwanten te volgen die in Amerika hun geluk gingen beproeven. Na twee jaar schoenen lappen in Baltimore had hij genoeg gespaard om zijn familie te laten overkomen, die in het beloofde land gestaag bleef groeien.

De vier broers probeerden aan de bak te komen in de meest diverse beroepen en handel (schoenen, slachterij, roomijs, kermis, zeep, fietsen) en verzeilden per toeval in de nieuwe industrie die moeizaam uit de grond werd gestampt. Met hun spaarcenten kochten ze samen een kapotte filmprojector die ze oplapten en waarmee ze enige tijd (1905-1907) een bioscoop in Pennsylvania exploiteerden. Daarna stapten ze over op filmdistributie - een weinig gereglementeerde activiteit die nog in zijn kinderschoenen stond - en gingen vervolgens hun eigen films produceren. Hun eerste succes was minder een kwestie van talent dan van doorzettingsvermogen. In 1923 richtten ze hun eigen filmmaatschappij Warner Brothers op.

EEN PROVOCATEUR

De reputatie van hun studio werd gemaakt doordat Warner als eerste de stap naar de geluidsfilm waagde met "The Jazz Singer", een mijlpaal. De eerste experimenten met geluid waren vooral uitgevoerd door Sam, die echter de triomf van deze revolutionaire techniek zelf niet zou meemaken. De ironie van het lot wil dat hij in 1927 aan de vooravond van de première van "The Jazz Singer" overleed aan de gevolgen van een slecht verzorgd sinusabces. Voortaan had Jack de leiding over de filmproductie. Meer dan dertig jaar stond hij aan het hoofd van Warner Brothers en kwam daarbij voortdurend in conflict met zijn oudste broer. Jack en Harry, de twee pijlers van het bedrijf, konden elkaar niet uitstaan. Hun vijandschap was zo intens dat ze het vertikten om samen in het studiorestaurant te eten; tegen het eind van hun leven spraken ze gewoon niet meer tegen elkaar.

Jack, een snelpratende provocateur, gedroeg zich als een gefrustreerde Broadwaykomiek. Volgens getuigen deed hij niets liever dan met luide stem vulgaire moppen vertellen. Toen Albert Einstein de studio in Burbank bezocht, zou hij tegen de grondlegger van de relativiteitstheorie gezegd hebben: "You know, I have a theory of relatives, too - don't hire them." Jack was onbeschoft, vulgair, opzichtig gekleed en hield ervan de anderen in verlegenheid te brengen. Vooral dan zijn broer Harry, in alles zijn tegengestelde.

De sobere en conservatieve Harry maakte weinig indruk op zijn omgeving. Hij was een toegewijde echtgenoot en vader. Net als rivaliserende "moguls", Adolph Zukor van Paramount en Louis B. Mayer van MGM, was hij een strenge moralist. Van zijn vader, een devote jood, had hij geleerd raciale en religieuze verdraagzaamheid te propageren.

Harry wist hoe hij de bankiers in Wall Street moest aanpakken en dankzij geweldige kapitaalsinvesteringen kon hij Warner ombouwen tot de eerste grote geluidsstudio. In de jaren dertig nam WB ook honderden filmzalen over, samen met platenfirma's en radiostations. Ook financierde hij Broadwayshows. In volle economische crisis had alleen MGM evenveel troeven in handen, waardoor beide studio's het best de moeilijke tijden doorspartelden.

Terwijl de studio uitgroeide tot een van de best uitgeruste filmproductiefabrieken, met naast hangars voor de geluidsstudio's en administratiegebouwen ook een aantal permanente decors (het archetypische westernstadje, straten in New York), werden de Warners ook getroffen door persoonlijke tragedies en verdeeldheid. Tijdens een bezoek aan Cuba liep Harry's 22-jarige zoon Lewis een fatale bloedvergiftiging op. De studio was daarmee een monarchie zonder kroonprins. Na de dood van Lewis groeide de tweedracht tussen Harry en Jack. Die werd nog op de spits gedreven toen Jack verliefd werd op would-be actrice Ann Page Alvarado en met zijn maîtresse ging samenwonen nog voor hij wettelijk was gescheiden. Nu ook hun vader Benjamin was overleden, zag Harry het als zijn heilige taak om de eendracht binnen de familie te bewaren. "As long as you stand together, you will be strong", had de oude Benjamin nog gewaarschuwd.

BOERENKINKELS

Zelfs naar Hollywoodnormen werden de Warner broers als onderontwikkelde boerenkinkels beschouwd. Geen van de Warners was gecultiveerd. Ze lazen bijvoorbeeld nooit een boek, zelfs niet een gevierde roman die in aanmerking kwam om door hun studio te worden verfilmd. Toen regisseur Mervyn LeRoy tijdens zijn wittebroodsweken (na zijn huwelijk met Harry's dochter) merkte dat iedereen "Anthony Adverse" las, stuurde hij Jack een telegram met de aanmaning om het boek te lezen. "Read it?" telegrafeerde Jack terug, "I can't even lift it." En toch was het inzicht van deze cultuurbarbaar, alsook zijn instinctief aanvoelen van wat het publiek bezighoudt, van kapitaal belang bij het leiden van dit strak georganiseerde productiesysteem.

In zijn grondig gedocumenteerde studie over de joodse gemeenschap in Hollywood, "An Empire of Their Own - How the Jews Invented Hollywood", beschrijft Neal Gabler een werkdag van Jack in de jaren dertig. Elke ochtend stond hij om 9 uur op, greep naar de telefoon om met zijn productiemanager zijn dagtaak te bespreken. Daarna nam hij met zijn assistente de post en de Hollywoodvakbladen door, waarin de passages die hem aanbelangden al waren aangestreept en samengevat. Bij het ontbijt las hij, met het oog op een mogelijke verfilming, synopsissen van scenario's en boeken. Daarna douchte hij. Gewoonlijk arriveerde hij pas rond de middag op het studioterrein waar hij nogmaals checkte met de productiemanager en occasioneel ook met de juridische dienst aangaande een deal voor een ster of een boek.

Rond één uur dertig ging hij lunchen in de directie-eetzaal waar hij een Zwitserse chef en een Duitse hoofdkelner had aangesteld. Tijdens de lunch praatte hij alleen maar over koetjes en kalfjes - gewoonlijk roddels en tips voor de paardenrace, zijn grote passie. Na de lunch ging Jack Warner in een van de filmzaaltjes de nog niet gemonteerde opnamen van de dag bekijken - dailies zoals ze in het jargon worden genoemd. Dit nam het grootste deel van de namiddag in beslag, twee tot drie uur. Terug in zijn kantoor ontving hij bezoekers en wisselde hij informatie uit met zijn productiechef, wiens bureau aan het zijne paalde.

Daarna was het tijd voor zijn dagelijkse scheerbeurt bij de studiobarbier, gevolgd door een bezoekje aan de studiosauna, waarna hij met hernieuwde krachten aan nog meer vergaderingen en conferenties begon. Toen de avond viel, ging hij nog niet naar huis, maar woonde hij samen met de andere top executives previews bij van nieuwe Warner-films, gewoonlijk in de buitenwijken van Los Angeles, soms meer dan een uur rijden, maar altijd op een "geheime" locatie. Af en toe nam hij zijn dochtertje Barbara mee ("Climbing into those black cars, we were like gangsters going to rob a bank", herinnert ze zich).

Tijdens de screening zat Jack altijd naast de cutter, aan wie hij zijn instructies doorgaf voor het aanbrengen van wijzigingen bij de montage. Wat daarbij opviel, was zijn feilloos geheugen voor dailies die hij drie tot vier maanden tevoren had gezien. Hij had altijd het laatste woord. Na die proefvertoningen keerde hij laat terug naar huis.

Alle studiobazen volgden in grote lijnen dezelfde routine. Hun leven speelde zich af in en rond de studio die hen in staat stelde een compleet fictieve wereld te scheppen waarover zij heer en meester waren.

AGRESSIEF EN GEDURFD

Tijdens de hoogdagen van het Amerikaanse studiosysteem had elke major zijn persoonlijke stijl - iets wat door heel wat factoren werd bepaald. Maar vooral de eigenaar van de studio drukte zijn stempel op de pellicule. Zo straalde de persoonlijkheid van Jack af op de producten die de Warner fabriek uitrolden, hoe groot ook de inbreng mag geweest zijn van de twee uitzonderlijk begaafde productieleiders die de studio inhuurde (Darryl Zanuck van 1929 tot 1933, Hal Wallis van 1933 tot 1944). Zoals zijn baas, was WB niet meteen een studio die je associeerde met klasse en prestige: het was integendeel de meest agressieve, realistische en gedurfde studio. MGM zweerde bij glamour en luxe, waardoor alle films er altijd onecht uitzagen; gepolijste musicals waren het paradepaardje van de studio. Paramount, waar voornamelijk emigranten werkten, was gespecialiseerd in gesofistikeerd, "continentaal" amusement. Bij Columbia heerste dankzij de populistische fabels van huisregisseur Frank Capra een spirit van sociaal optimisme.

Warner Brothers stond bekend voor zijn schraapzucht. Voor hij naar huis ging, liep Harry nog even langs de toiletten om alle lichten te doven. De studio zat voortdurend in de schulden en Harry snoeide dan altijd drastisch in de budgetten. "Luister eens, een film is niet meer dan een dure droom", zei hij tegen een journalist. "Het is even makkelijk om voor 700.000 dollar te dromen dan voor 1.500.000 dollar."

Terwijl er bij MGM met geld werd gesmeten, werd bij Warner op alle mogelijke manieren bespaard. Die zuinigheid ging uiteindelijk de stijl van de films bepalen. De Warner-films waren bot, taai en snelden in grote vaart over het doek. Een stijl die perfect paste bij de actuele onderwerpen die de taal spraken van de grote stad, een wreed, onverschillig en antagonistisch environment, vaak in zware schaduwen gehuld. De verhalen waren energiek en kordaat verteld, maar werden overheerst door gevoelens van woede, bitterheid en doem. Denk maar aan klassiekers als "I am a Fugitive From a Chain Gang", "Wild Boys of the Road" en "High Sierra". De typische Warner-musical "Forty-Second Street" vertelde het vreugdeloos verhaal van een gevallen impresario, die een laatste wanhopige gok doet naar het succes.

MGM moest het van zijn sterren hebben ("More stars than in heaven!" pochte de studio). Warner beschikte ook over zijn sterrencontingent, alleen beantwoordden hun acteurs niet aan het traditionele Hollywoodideaal van glamour en zeemzoete romantiek. De mannen - Bogey, Cagney, Edward G. Robinson, Paul Muni, John Garfield - waren ruw, ongeschoren en klein van gestalte; de dames - Bette Davis, Joan Blondell - waren hard, leep en niet op hun mondje gevallen. Zelfs de cartoonkarakters, zoals Bugs Bunny, waren cynisch, bliksemsnel en onsentimenteel.

ONTEVREDENHEID EN MUITERIJ

Warner projecteerde in zijn sterren een ideaalbeeld van zichzelf: de rebel die van leer trekt tegen het establishment waartoe hij nooit zou behoren.

Jack Warner stond bekend als een keihard zakenman en een slavendrijver met zijn personeel. "Het lijkt wel alsof de Warner-bazen hun acteurs verwarren met hun renpaarden", sneerde Cagney. Niet alleen de acteurs werden afgebeuld, ook voor de regisseurs was het arbeidsritme hels: Michael Curtiz, het werkpaard van de studio, regisseerde in de jaren dertig niet minder dan 44 films.

Precies omdat de studio teerde op koppige, compromisloze individuen, heerste er grote ontevredenheid en hing er voortdurend muiterij in de lucht. De bekendste sterren kwamen in opstand tegen hun langetermijncontracten, zowel Bette Davis als Olivia de Havilland sleepten hun werkgever voor de rechter en verloren het proces.

Volgens Gabler liet die vijandigheid ook zijn sporen na op de films, die vaak een opstandig en anti-autoritair toontje bezaten. Zelfs in de piratenprenten met huisster Errol Flynn zat klassenhaat verwerkt. Alle verworpenen van de Grote Depressie defileerden op het doek: werklozen, beroepsboksers, vuistvechters, vleesverwerkers, mijnwerkers, broodkaarters, oplichters en detectives. De studio werd dan ook het favoriete doelwit van zedenprekers die beweerden dat de studio niet alleen anti-sociaal gedrag afbeeldde, maar ook vergoelijkte.

De kritiek van moraalridders ten spijt, konden de Warner-films zeker niet antisociaal genoemd worden, alleen toonden ze een veel grotere ambivalentie jegens de traditionele Amerikaanse waarden dan de concurrerende productie.

Niet dat de Warner-broers zelf radicalen waren. Zoals alle andere studiobazen stemden ze republikeins. Behalve in 1932, toen machtige industriëlen en lobbyisten de hulp van de Warners inriepen om voor Franklin Roosevelt de kandidatuur en het presidentschap te winnen. Eens aan de macht, ontbood Roosevelt Jack met het verzoek of hij een film wilde maken waarin de Russen - toen bondgenoten van de Amerikanen - in een goed daglicht werden gesteld. Het resultaat, "Mission to Moscow" (1943), zou Jack zuur opbreken tijdens de daaropvolgende heksenjacht op communisten.

Toen het House Un-American Activities Committee (HUAC) op het einde van de jaren veertig naar Hollywood trok om te onderzoeken of de filmindustrie zich bezondigde aan communistische propaganda, was de rabiate anticommunist Jack Warner er als de kippen bij om namen te noemen van verdachte radicalen. Hij en zijn broers zouden maar al te graag alle communisten naar Rusland verbannen en geld inzamelen voor een pesticide om dit ongedierte uit te roeien. Zijn hatelijke tirade was deels ook een reactie tegen de brutale staking die de studio's had lamgelegd.

VADER EN ZOON

Het einde van het oude Hollywood liep parallel met het definitieve uiteenvallen van de Warner-dynastie. Jack raakte vervreemd van zijn zoon Jack Jr., die zelfs verbannen werd naar het kantoor in Londen. Het was opvallend en pijnlijk in welke mate de relatie tussen Jack en zijn broer Harry, zijn vaderfiguur, later zou worden weerspiegeld in de relatie tussen Jack en zijn zoon. Vanaf de late jaren veertig werd het voortbestaan van het oude studiosysteem langs alle kanten bedreigd: door het toepassen van de antitrustwetten (studio's zagen zich genoodzaakt hun bioscoopketens af te stoten), de competitie van de commerciële televisie en de opkomst van de onafhankelijke productie. Geleidelijk geraakten de studio's opgezogen in de nieuwe Hollywoodconglomeraten. In de jaren vijftig verkochten de broers hun aandelen in Warner Brothers aan de First National Bank van Boston.

Achter de rug van zijn broer Harry, sloot Jack echter nog een deal om zijn titel te behouden. "Dit verraad werd Harry fataal", zou diens schoonzoon later zeggen. Als in een Warner Bros-melodrama kreeg Harry kort na de verkoop een beroerte waarvan hij nooit volledig herstelde. Toen hij twee jaar later in 1958 stierf, keerde Jack voor de begrafenis niet eens terug uit Frankrijk. Zelf gereduceerd tot een levend anachronisme, verkocht de laatste oorspronkelijke oprichter van Warner Brothers in 1966 voor 32 miljoen dollar zijn eigen aandelen aan de holdingmaatschappij Seven Arts Limited. Na een mislukte fin de carrière als Broadwayproducer hield hij zich onledig met gokken en tennis. Tijdens een van die partijtjes tennis maakte hij een kwalijke val waarvan hij nooit helemaal herstelde. Vier jaar later, in 1978, stierf Jack net als Harry aan een beroerte. Zijn zoon die hij geweigerd had aan zijn ziekbed, werd gedoogd op de begrafenis. Het leven van de broers Warner werd nooit verfilmd, wat even jammer als onbegrijpelijk is.

Patrick Duynslaegher

Example 4

VINGER OP DE WOND

België reageert "ontzet" op het wapenincident in Congo, maar geeft hierdoor feitelijk (nogmaals) te kennen dat onze ambassade in Kinshasa geen terreinkennis heeft. Niemand betwist het ridicule van de Congolese aantijgingen, noch dat het gaat om een groteske manipulatie door anti-Belgische fracties in de Kabila-regering. Wie Congo/Zaïre (of een ander ontwikkelingsland) kent, geeft echter geen aanleiding tot dit soort voorspelbare incidenten. De ambassade trapte nodeloos in een val - de Belgische bedrijven in Congo dreigen eens te meer de rekening te betalen.

Buitenlandse Zaken zal dit tegenspreken, maar bevestigt hiermee onze stelling. Het is immers een kwestie van elementaire basiskennis dat Belgische diplomaten niet met een krat wapens door Congo zeulen, indien zo'n transport niet wordt gewaarborgd door een vrijbrief van de president in hoogsteigen persoon (het kleinste expat-kind in Congo weet dat het niet met een waterpistool over straat loopt).

Het handvol wapens waarover het gaat, waren al maanden in Lubumbashi - er was niet de minste hoogdringendheid om deze hoognodig naar België te repatriëren. Vooral niet in een sfeer van onderkoelde relaties. Buitenlandse Zaken verwijst zelf naar een toenemende spanning, onder meer na de binnenlandse verbanning van de vroegere held van de Belgische media, oppositieleider Etienne Tshisekedi, en de opheffing door het Kabila-bewind van de mensenrechtenorganisatie Azadho. Ook in dit geval reikte België de stok aan waarmee het geslagen wordt, want kan men zich voorstellen (en dulden) dat Congo de Witte Comités in België zou financieren?

Dat Brussel dit soort gevoeligheden niet onderkent, geeft aan dat het niet over de diplomatieke bekwaamheden beschikt om op een constructieve manier met moeilijke regimes om te gaan. In die omstandigheden is zelfs "stille diplomatie" tot mislukken gedoemd.

Category	Recall	Precision	F-measure
CAR	0.827586	0.558140	0.666667
INVESTMENTS	0.789773	0.785311	0.787535
STOCK MARKET	0.481203	0.646465	0.551724
CULINARY	0.805970	0.613636	0.696774
FILM	1.000000	0.576471	0.731343
COMPUTER SCIENCE	0.673913	0.620000	0.645833
INTERNATIONAL	0.444444	0.689655	0.540541
LITERATURE	0.909091	0.526316	0.666667
MARKETING	0.457895	0.906250	0.608392
MUSIC	0.937500	0.759494	0.839161
POLITICS	0.607143	0.377778	0.465753
SPORTS	0.888889	0.623377	0.732824
TOURISM	0.743590	0.707317	0.725000
REAL ESTATE	0.653846	0.500000	0.566667
Average	0.730060	0.635015	0.658920
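
The relation between the three columns can be checked with a short sketch. It assumes that the F-measure is the balanced harmonic mean of recall and precision (beta = 1) and that the Average row is the unweighted mean over the 14 categories. Both assumptions are inferred from the figures in the table, not stated in this section, and the function and variable names are illustrative only.

```python
# Recall and precision per category, copied from the results table above.
RESULTS = {
    "CAR": (0.827586, 0.558140),
    "INVESTMENTS": (0.789773, 0.785311),
    "STOCK MARKET": (0.481203, 0.646465),
    "CULINARY": (0.805970, 0.613636),
    "FILM": (1.000000, 0.576471),
    "COMPUTER SCIENCE": (0.673913, 0.620000),
    "INTERNATIONAL": (0.444444, 0.689655),
    "LITERATURE": (0.909091, 0.526316),
    "MARKETING": (0.457895, 0.906250),
    "MUSIC": (0.937500, 0.759494),
    "POLITICS": (0.607143, 0.377778),
    "SPORTS": (0.888889, 0.623377),
    "TOURISM": (0.743590, 0.707317),
    "REAL ESTATE": (0.653846, 0.500000),
}


def f_measure(recall, precision):
    """Balanced F-measure (assumed beta = 1): harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def mean(values):
    """Unweighted mean, giving every category equal weight."""
    values = list(values)
    return sum(values) / len(values)


# Per-category F-measure and the three column averages.
f_scores = {cat: f_measure(r, p) for cat, (r, p) in RESULTS.items()}
avg_recall = mean(r for r, _ in RESULTS.values())
avg_precision = mean(p for _, p in RESULTS.values())
avg_f = mean(f_scores.values())

if __name__ == "__main__":
    for cat, f in f_scores.items():
        print(f"{cat:<18}{f:.6f}")
    print(f"{'Average':<18}{avg_recall:.6f} {avg_precision:.6f} {avg_f:.6f}")
```

Under these assumptions the recomputed values agree with the table to within rounding. Note that the average F-measure is the mean of the per-category F-measures, not the F-measure of the average recall and average precision.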