ABSTRACT: The research presented in this paper investigates the use of a text mining approach for automatic taxonomy generation and text categorisation for the content management system of Alergoclínica, a private dermatology and allergy clinic in Brazil.
Keywords: Text mining, Classification systems, Taxonomies, Knowledge management, Content management
1. Introduction
One would have expected that when technologies move out of the controlled environment of labs into the competitive software market, they would be ready for adoption. Nonetheless, this does not seem to be the case for the text mining tools that have been launched and gained prominence since the end of the 1990s. Text mining, defined as the process of extracting information from unstructured text, has emerged as a hybrid discipline which draws on information retrieval, statistics, computational linguistics and, more often than not, the topic area to which text mining is applied. The results of tests conducted to evaluate text mining techniques at Alergoclínica - a Brazilian dermatology and allergies clinic - were hardly encouraging.
Alavi & Leidner (2001) state that the basis for achieving competitive advantage through knowledge is more related to the ability of a company to effectively employ existing knowledge than to the knowledge itself. However, different views on the meaning and nature of knowledge and knowledge management (KM) entail the choice and adoption of different knowledge management systems and, therefore, different technologies.
In this paper we assume the perspective of knowledge as access to information. From this standpoint, KM focuses on securing access to and retrieval of information, while the role of IT is to provide effective search and retrieval mechanisms (Alavi & Leidner, 2001). Among them, Content Management Systems (CMS) offer a powerful software solution that benefits users by making it easier to manage learning content and digital assets in an organisational environment (Martis, 2005).
At the same time, a successful KM system should be aligned with the organisational culture. In the case of Alergoclínica, where information sharing is already established practice, the human aspect is very important, and so is the use of the internal common jargon. Any system that intends to help with the management of knowledge in this environment should first be concerned with delivering access to information. It is worth mentioning that Alergoclínica already has a shared knowledge base, built upon the current mode of operation of the clinic, but requires an improved system along with the appropriate technology to facilitate the exchange.
Alavi & Leidner (2001) propose a framework of four processes in knowledge management in organisations - (a) creation, (b) storage/retrieval, (c) transfer and (d) application – and they then discuss the role of IT in each of these. Regarding the second process, which is our current focus, they state that advanced storage and retrieval technologies can be used to improve organisational memory. Furthermore, any repository that aims to help an organisation to remember should be indexed in a way that facilitates knowledge transfer between the knowledge base and the individual and promotes appropriation. It is thus reasonable to say that the vocabulary employed to retrieve content is a major concern.
This is the point at which text mining techniques become useful. By working through vast amounts of text with powerful algorithms, they open up many possibilities. One of these is to derive a common vocabulary from the organisation's own texts.
Based on the above, the aim of this paper is to investigate the use of a text mining approach for automatic taxonomy generation and text categorisation in the design of an intranet-based CMS.
The rest of the paper is structured as follows: section 2 provides the background and rationale for the research; section 3 investigates the suitability of the prevalent methodologies, evaluates the available software tools and selects appropriate text mining software; section 4 presents the text analysis, clustering and automatic categorisation processes and assesses the results; finally, section 5 draws conclusions and outlines future work.
2. Background
The main function of a CMS is to organise the documents of an organisation in a way that makes them easily accessible to users. However, according to a survey conducted by Forrester Research (Tilak, 2005), “only 44 per cent of users surveyed feel that it is easy to find what they’re looking for on the intranet”, whereas 61 per cent of respondents rank improved search capabilities as the area that needs the most improvement.
Taxonomies provide a framework for the categorisation of content in a system (i.e. a controlled vocabulary) and are believed to improve retrieval by allowing users to broaden or narrow a search within the relevant subject category or topic, instead of relying on the user’s ability to build effective search queries (Cisco & Jackson, 2005). However, building and maintaining a human-generated thesaurus as the controlled vocabulary tool is time-consuming, expensive and intellectually demanding (Shearer, 2004). Automatic text categorisation has therefore been a topic of interest in advanced information retrieval since the early 1960s and, more recently, in text mining. It is seen as essential when a great volume of documents and scarcity of time make the manual approach impractical, and as a way of improving productivity in cases where human judgement is necessary (Sebastiani, 2002).
The value of text mining is also becoming evident in another related area – ontologies – especially since their role in the Semantic Web infrastructure has proven indispensable (Doan et al., 2004). Ontologies impose machine-readable constraints on the hierarchical relationships defined by a taxonomy, which enables scientific knowledge to be interpreted and represented from unstructured text (Stevens et al., 2000). They are particularly suited to any type of concept-oriented search (Dotsika & Watkins, 2004) and as such are potentially crucial for the future of text mining.
Moreover, the acknowledged fact that at least 80% of the information in a company is in the form of text (Dörre, 1999; Tan, 1999) has served as both motivation and advertisement for text mining. The following are popular text mining techniques:
• Information or feature extraction identifies key phrases and relationships by looking for predefined sequences (pattern matching).
• Clustering groups similar documents on the fly rather than through predefined topics, and documents can appear in multiple topic lists. A basic clustering algorithm creates a vector of topics for each document and measures the weights indicating how well the document fits into each cluster (see the sketch after this list).
• Categorisation detects the main topics of a document by placing it into a pre-defined set of topics, which have been identified by counting the words that appear in the text.
• Topic tracking works by storing user profiles and, based on the documents the user views, predicting other documents of interest to the user.
• Text summarisation helps the user identify whether lengthy documents meet the user’s needs.
• Concept linkage connects related documents by identifying their commonly shared concepts.
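To make the clustering description above concrete, the sketch below builds a weighted term vector for each document and assigns every document to the cluster it fits best. This is an illustration only: the corpus, the cluster count and the use of Python with scikit-learn are our assumptions, not the setup evaluated in this study.

```python
# A minimal sketch of vector-based document clustering, assuming scikit-learn.
# The documents and cluster count are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "patient complained about long waiting times at reception",
    "customer praised the quick and efficient telephone service",
    "suggestion to offer coffee and water in the waiting room",
    "complaint about parking and access for wheelchair users",
]

vectors = TfidfVectorizer().fit_transform(docs)  # one weighted term vector per document
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for doc, label in zip(docs, model.labels_):
    print(label, doc)
```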
The performance of the major text categorisation algorithms is currently around 80% effectiveness (Sebastiani, 2002; Yang & Liu, 1999). According to Sebastiani (2002), this is comparable to the effectiveness of trained human coders. In discussing the issues of a bottom-up, automatic approach, Cisco & Jackson (2005) mention:
• little control over the meaning of high-level concepts
• refinement required before it makes sense to users
• high cost
• human intervention required to add and delete categories as appropriate and/or to judge whether the final taxonomy corresponds with human understanding.
By and large, current research in the field focuses on the lab-based technical aspects of the technology, whereas the business literature on the topic expresses high expectations along with issues and concerns. These anxieties are rooted in open human-versus-machine questions, mainly whether the software performs as well as humans and the extent to which human intervention is required. Different authors have different views on this matter and no definite answer seems to have been found. Thus, there is still a huge gap between research and practice, which shows the need for studies to unite these worlds.
3. Methodology And Data Collection
Most studies of text mining techniques adopt traditional experimental methods. This type of approach usually requires experimental and control groups, in addition to heavy quantitative analysis, in order to suggest causal relationships between variables. However, it was not our concern to investigate the performance of any particular algorithm or setup. Therefore, a purely experimental design did not seem appropriate.
Ultimately, the adequacy of the taxonomy and the effectiveness of categorisation can be better determined by people with expertise relating to the contents of the documents, which in the case of a corporate intranet-based system are the producers and users of information.
Moreover, the use of the commercially available technology does not yet appear to be well mapped from an academic perspective. Therefore, an in-depth qualitative inquiry seemed more appropriate to elicit effective practices as well as limitations of the technique. Evaluation research includes the application of scientific procedures to the collection and analysis of information about the content, structure and outcomes of programmes, projects and planned interventions (Clarke, 1999). Although it has been commonly used in the social sciences, it was particularly suitable in this case because it (a) provides the rigour required to test the technology and to replicate the particular procedures undertaken and (b) facilitates the quest for the value and meaning of the outcomes.
The aim of the study was to investigate the use of a text mining approach for automatic taxonomy generation and text categorisation in the design of an intranet-based Content Management System (CMS). The selected organisation, Alergoclínica, was a dermatology and allergies clinic with six branches in Brazil.
Thus, a selection of evaluation methods was employed to select the tools, collect the data and run the text mining analysis. The criteria used to select the text mining tools were as follows.
Essential:
• availability – either as a free or a demonstration version
• language – support for Brazilian Portuguese, because the clinic operates in Brazil
• specific features – taxonomy generation and automatic categorisation of documents.
Desirable:
• support for different document formats, including .doc, .pdf and .htm
• ease of use – friendly GUI, documentation, etc.
• visual aids – e.g. a cognitive map for viewing main concepts.
In order to find out which text mining tools were available at the time of the initiation of this project, in June 2005, the following independent web sites were consulted:
• KDnuggets Text Analysis, Text Mining, and Information Retrieval Software [http://www.kdnuggets.com/software/text.html]
• Text Analysis Info Page [http://www.textanalysis.info]
• Text-mining.org [http://www.text-mining.org]
These sites provided a list of more than ten different text mining packages that would be suitable for this evaluation. However, only five of them had a free or demo version available. Since none of the tools evaluated completely satisfied the pre-defined criteria, two tools were used in the evaluation: Megaputer TextAnalyst 2.3, particularly for the taxonomy creation, and the Provalis Research QDA Miner/WordStat 5.0 suite, for the automatic categorisation tests.
A purposeful stratified sample of Alergoclínica’s documents was collected in order to ensure that all relevant areas and themes were represented for each department and appeared appropriate to a team of content experts. The sample from each area was split into two sets of similar documents, the odd and even sets, in order to run reliability tests at the analysis phase.
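A minimal sketch of such an odd/even split, assuming the documents are simply assigned alternately after sorting; the file names below are hypothetical:

```python
# Hypothetical file names; alternate assignment yields two sets of similar composition.
docs = sorted(["mkt_001.txt", "mkt_002.txt", "mkt_003.txt", "mkt_004.txt", "mkt_005.txt"])

odd_set = docs[0::2]   # 1st, 3rd, 5th ... documents
even_set = docs[1::2]  # 2nd, 4th, 6th ... documents

print("odd:", odd_set)
print("even:", even_set)
```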
A team of three key members of staff, the content experts, was formed to help in the selection of documents according to the sampling strategy and to provide overall assistance in the contact with the rest of the organisation. These were the Head of the Marketing Department, the Head of the Human Resources Department and the Executive Director. The number of documents gathered from each department was as follows:
• Marketing – 120 documents
• Human Resources – 42 documents
• Scientific – 16 documents
4. The Tools At Work
The text mining tests included three main tasks: text analysis, clustering and categorisation.
4.1. Text Analysis
TextAnalyst was used for the taxonomy generation. Before any analysis could be conducted, a pre-processing phase was required to exclude common words. The analysis was run separately on the two sets of sample documents – odd and even – and, surprisingly, produced only two terms in common. This substantial difference in results may indicate that the vocabulary varies too much across the documents, or that the selection of terms is not stable enough, thus affecting the reliability of the procedure. The extracted terms do not seem appropriate for discriminating between the documents, though they might be a sign of high occurrence. Further manipulation of the stoplist could improve, and thus modify, the entire results. However, in this case, more contact with the human content experts would be necessary to establish criteria for such a selection.
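As a rough illustration of the pre-processing and reliability check described above, the sketch below strips stoplist words, extracts the most frequent terms from each document set and measures the overlap between the two term lists. The stoplist, the texts and the frequency-based selection are stand-in assumptions; TextAnalyst's actual term-weighting algorithm is proprietary and considerably more elaborate.

```python
# A stand-in for the stoplist filtering and term extraction; not TextAnalyst's algorithm.
from collections import Counter
import re

STOPLIST = {"a", "o", "e", "de", "da", "do", "em", "que", "para"}  # placeholder stoplist

def top_terms(texts, n=20):
    """Return the n most frequent non-stoplist words across the given texts."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"\w+", text.lower()) if w not in STOPLIST]
    return {term for term, _ in Counter(words).most_common(n)}

odd_terms = top_terms(["texto de exemplo sobre recepção e satisfação"])
even_terms = top_terms(["outro texto de exemplo sobre espera e reclamação"])
print("terms in common:", odd_terms & even_terms)
```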
In view of this, a supplementary exercise with the Marketing content expert was conducted to help evaluate the validity of the results. Twelve documents (10% of the Marketing sample) were randomly selected and sent back to their creator. First, he was asked to underline the most relevant words in each document and to provide any words outside the document that named a more appropriate topic or category, if any. Then the topic structure was sent to him, and he was asked whether any of the words in the list could be used to classify the same documents. Finally, a general opinion of the taxonomy was obtained.
Table 1 below compares the topics selected in the text by the content expert with the ones extracted by the tool for the same documents: only 2 topics coincide – satisfação (“satisfaction”) and recepção (“reception”).
Doc. | Human topics in the text (not in the text) | Machine topics
1 | atraso, reclamação | none
2 | não foi atendida, demorou (espera) | situações
3 | agradeço, elogio, satisfação | objetivo, satisfação, padrão
4 | indico, idoso, estacionamento | convênio, função, apresentação, participação
5 | ligo, ocupado, esperando, sem resposta, não tenho tempo, melhor atendida (espera) | atendimento telefônico da Unidade, alterações
6 | sugiro, café, chá, água (sugestão) | objetivo, satisfação, contato, participação
7 | esperar (espera) | situações, relação, contato, participação
8 | atendido primeiro, esperar, esperei muito, não fui atendida, reclamam, favor (espera, reclamação) | recepção, sugestão, satisfação
9 | cadeira de rodas, indignada, deficiente, queixa (reclamação) | sugestão
10 | organização, estacionamento, horário correto | situações
11 | rápido, eficiente, recepção, atendimento médico | recepção, objetivo, satisfação, padrão
12 | decepcionada, esperando, atraso, recomendarei (espera) | recepção, atendimento da recepção, situações
Table 1. Human v machine topic extraction
Most of the terms were obtained from a single document, the Nursing Manual. On further investigation, it was found that the Nursing and Integration manuals were much longer than the other sources (in fact, ten times longer than the average of the others). Although this may mean that longer documents contribute disproportionately to the terms in the topic structure, it does not necessarily follow that these documents need more terms for classification. Even if it were indeed proven beneficial to use more terms in order to cater for the sub-topics in these larger documents, it would be more sensible to place all those terms under a single major category such as “Manuals”, as suggested by the content expert. However, the opaqueness and inflexibility of the algorithm offer no aid in dealing with this issue.
The overall taxonomy generated by the software could be used as a starting point for the final version to be generated manually by the steering group, but certainly much human intervention would be needed to provide a consistent and comprehensive terminology.
4.2. Clustering
Another way to try to understand the similarity of documents, and therefore the possible categories within a domain, is to use clustering. This was performed using WordStat. The output is a dendrogram graph which shows the file names grouped according to the degree of similarity between the words used in each document. The number of clusters must be defined by the user, and the documents are automatically split into the specified number of groups. There is no control over the number of documents included in each cluster, nor any possibility of moving a particular document from one cluster to another.
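The sketch below shows the same idea with open-source components, assuming SciPy's hierarchical clustering in place of WordStat's proprietary routine: documents are vectorised, linked by similarity into a dendrogram, and cut into a user-chosen number of groups. The texts and the cluster count are illustrative only.

```python
# Hierarchical (dendrogram-based) clustering sketch, assuming scikit-learn and SciPy.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "complaint about waiting time at reception",
    "complaint about waiting on the telephone",
    "praise for the reception staff",
    "suggestion to serve coffee and water",
]

X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")  # builds the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")    # cut into at most 3 clusters
print(labels)
```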
Clustering was first performed on the Human Resources set of 42 documents. In this case, after several trials, a 3-group solution was chosen because it seemed to produce the most consistent groupings. Human intervention was required to examine the documents allocated to each group and to manually assign labels to the groups, since the software does not indicate the textual reason for each cluster.
Clustering was also performed on the Marketing Department documents, particularly on the subset of 116 letters answering customer queries. In this case, after trials, 10 clusters were chosen. Since these letters had been named according to an internal Alergoclínica classification system, as indicated by the content expert interviewed, it was possible to run further analysis comparing the clustering generated by the software with the human categorisation. This scheme was used to code the set of documents and to cross-tabulate the clusters given by the tool against the human categories.
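A minimal sketch of this cross-tabulation, assuming pandas; the cluster numbers and type labels below are invented placeholders, not the study's data:

```python
# Cross-tab of machine clusters against human-assigned letter types (invented data).
import pandas as pd

clusters = [1, 1, 3, 6, 1, 3, 6]                          # machine cluster per letter
types = ["Reclamação", "Reclamação", "Sugestão",
         "Elogio", "Reclamação", "Reclamação", "Elogio"]  # human coding per letter

print(pd.crosstab(pd.Series(clusters, name="cluster"),
                  pd.Series(types, name="type")))
```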
The bar chart in Figure 1 shows how each software-generated cluster is divided in terms of the type assigned by the content expert.
Figure 1. Human (type) vs. Machine Cluster
Cluster 1 is the largest group, incorporating the majority of the overall complaints (56 out of 82). Groups with only one document occurred even when fewer clusters were chosen. This suggests that these documents are considered very different from all others and would not belong to any other group. Although the software offers the option to eliminate clusters with only one item, this is not really useful, considering that every document must belong to a category, even if it is something as general as “others”.
Cluster 3 incorporates the majority of suggestions, but it is not very uniform in terms of type as it also contains complaints and one praise. Cluster 6 incorporates the majority of praises (13 out of 17).
Based on this comparison it was possible to verify that the groupings were fairly reasonable and could be labelled. However, some problems were encountered during the process, such as “outlier” documents falling into one-item clusters and documents with more than one topic being unable to belong to more than one group.
Based on the results of the cluster analysis and input from the content experts, a secondary taxonomy outline was devised, which could be used as the basis for Yahoo-like directories in the CMS (Figure 2 below).
Figure 2. Taxonomy for Yahoo-like navigation
4.3. Automatic Categorisation
Automatic categorisation requires a previously categorised set in order to train and test the model; only then can it be used with uncategorised documents. Therefore, the choice was made to employ the clinic's classification system of customer letters once more to test this technique. The software used was again WordStat, and the sample used to train and test the model consisted of the customer letters of the year 2004 (a subset of the Marketing sample). A categorical variable was created, manually assigned as either Complaint, Praise or Suggestion, according to the content expert's file nomenclature (R – Reclamação, E – Elogio, S – Sugestão), and used as the independent variable. The predictors were the keywords in the text.
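The sketch below illustrates the general shape of such a test, assuming a naive Bayes model over word counts as a stand-in for WordStat's undocumented internals; the training letters and labels are invented placeholders.

```python
# Train on pre-categorised letters, then predict the category of a new one.
# Naive Bayes over word counts is an assumption, not WordStat's actual algorithm.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["esperei muito na recepção",
               "agradeço o excelente atendimento",
               "sugiro café na sala de espera"]
train_labels = ["Reclamação", "Elogio", "Sugestão"]  # from the R/E/S file nomenclature

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["esperei demais para ser atendida"]))
```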
The classification model generated was very successful according to the levels indicated by the literature, making some concession for comparability issues. In total, 48 of the 59 documents were assigned correctly and 11 were missed, giving an overall accuracy of 81%. When the model was applied to documents not used in its creation, namely the customer letters of the year 2003, performance reached about the same level. Additional analysis of the text categorisation technique was hindered by the lack of other pre-categorised sets and the infeasibility of running real-time tests at this stage.
The predicted v actual precision and predicted v actual recall for each individual category are shown in Tables 2a and 2b below.
Actual \ Predicted | Elogio (Praise) | Reclamação (Complaint) | Sugestão (Suggestion) | TOTAL
Elogio (Praise) | 8 (100%) | 0 (0%) | 1 (6.25%) | 9
Reclamação (Complaint) | 0 (0%) | 30 (85.71%) | 5 (31.25%) | 35
Sugestão (Suggestion) | 0 (0%) | 5 (14.29%) | 10 (62.5%) | 15
TOTAL | 8 | 35 | 16 | 59
Table 2a. Predicted v actual, showing precision (frequency with column percentage in parentheses)
Actual \ Predicted | Elogio (Praise) | Reclamação (Complaint) | Sugestão (Suggestion) | TOTAL
Elogio (Praise) | 8 (88.89%) | 0 (0%) | 1 (11.11%) | 9
Reclamação (Complaint) | 0 (0%) | 30 (85.71%) | 5 (14.29%) | 35
Sugestão (Suggestion) | 0 (0%) | 5 (33.33%) | 10 (66.67%) | 15
TOTAL | 8 | 35 | 16 | 59
Table 2b. Predicted v actual, showing recall (frequency with row percentage in parentheses)
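As a worked check of the figures in Tables 2a and 2b: precision for each category is the diagonal count divided by its column total, recall is the diagonal count divided by its row total, and overall accuracy is the diagonal sum over the grand total (48/59 ≈ 81%). A short verification, assuming NumPy:

```python
# Confusion matrix from Tables 2a/2b: rows = actual, columns = predicted
# (order: Elogio, Reclamação, Sugestão).
import numpy as np

cm = np.array([[8,  0,  1],
               [0, 30,  5],
               [0,  5, 10]])

precision = cm.diagonal() / cm.sum(axis=0)  # [1.00, 0.857, 0.625] -> Table 2a
recall = cm.diagonal() / cm.sum(axis=1)     # [0.889, 0.857, 0.667] -> Table 2b
accuracy = cm.diagonal().sum() / cm.sum()   # 48/59 ≈ 0.81

print(precision, recall, accuracy)
```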
It is noteworthy that while the software performed excellently with the Elogio documents, and reasonably well with the Reclamação documents, it was much less adequate in allocating the more conceptually difficult Sugestão documents to the appropriate category.
It appears that the automatic categorisation technique was more successful than the automatic generation of the taxonomy, even if fewer tests were possible. It is thus reasonable to believe that it would be applicable as a stand-alone technology in other scenarios in which a taxonomy already exists, or where some documents are already categorised and another set must be categorised in the same way. It is a “do-it-like-this” kind of approach which can only fit specific purposes. Nonetheless, if the particular fit is found, it might be successful.
4.4. Manual Versus Automatic
Table 3 summarises some perceptions of how the automatic approach evaluated in this study compares with the manual one. Human input in the automatic approach is marked with asterisks for emphasis. Overall, it seems that there are not enough advantages in the automatic process to justify substitution. In fact, it might be more problematic, as additional factors need to be considered. However, in some specific circumstances, the nature of the application or of the organisation may justify adoption of the automatic approach.
Aspect | Manual | Automatic
Method | Top-down and bottom-up | Bottom-up
Taxonomy generation process | All terms and their relationships are determined by people. Tools can be used to store and facilitate maintenance of the structure. | Tools are used to extract the relevant terms from the text and to generate the taxonomy and its relations. *Structure needs to be revised by content experts to include/exclude terms and relationships as appropriate.*
Categorisation process | Each document is reviewed and manually tagged before publishing. Inconsistency might arise when many different coders work on the same documents. | Documents are automatically categorised based on a pre-categorised sample set. *Revision might be necessary, or the machine re-trained, if categories change.* Requires integration of the generated model into the CMS.
Time consumption | High, because it requires intense involvement of human resources. | Also high, because the analysis phase is long and *also requires human intervention*.
Staff requirements | Content analyst with librarianship skills and knowledge of the topic area; content experts to help create a suitable model; editors to manually tag content. | Text mining analyst with some librarianship skills and trained in text mining activities (data mining experience might help); *content experts to create higher-level topics and to validate categories*; programmer, if components are to be integrated into the CMS.
Organisation profile | Best suited to small companies or small quantities of documents. | Best suited to companies in which the categorisation of large quantities of documents is strategic, or where the cost of an existing manual approach has become a serious concern.
Cost | Internal, staff-related. | Cost of the tool, probably the cost of third-party consultants, and staff costs.
Table 3. Comparison of manual vs. automatic approaches
5. Conclusions And Future Work
Our evaluation showed that the tools were not flexible enough and provided little help for the automatic generation of a taxonomy. Human intervention was required, as anticipated by the literature, but to a much greater extent than expected. The text categorisation was more successful in terms of the performance of the algorithm, but several issues were found; for instance, the need for a pre-coded set of documents was an obstacle to further tests. Overall, the approach was found to be only conditionally viable, and consequently the technology was not recommended to Alergoclínica at this moment. Full details of the investigation may be found in Nara Pais's MSc dissertation (Pais, 2006).
Likewise, we believe that the applicability of these techniques for the automatic creation of ontologies is also limited. Apart from the demonstrated weakness of the taxonomy component, further studies would be necessary to determine whether the technology will ever be capable of aiding other aspects, such as the different types of relationships and axioms that are essential for a fully fledged ontology.
This study was important in helping to bridge the gap between technical experiments with algorithms and the real business application of the technology. Allowing for limitations of scope, it shows that the technology neither delivers the benefits advertised by vendors and supporters nor addresses the issues and concerns voiced in the literature. It is undeniable that the field has advanced since its start more than 40 years ago, yet commercial products are less than 10 years old and still need much development before they can be valuable in real practice.
Text mining is bound to gain prominence as tools evolve and text repositories grow. Therefore, further investigation is needed not only into taxonomy (or, possibly, ontology) generation and text categorisation, but also into the more exploratory aspects of knowledge discovery. In the meantime, it is advisable to adjust one's level of expectation, bearing in mind the emergent state of the tools and the limitations of their effectiveness at this stage.
6. References
Alavi, M. & Leidner, D., 2001. Knowledge management and knowledge management systems: conceptual foundations and research issues. MIS Quarterly, vol. 25, no. 1, pp. 107-136.
Cisco, S. & Jackson, W., 2005. Creating order out of chaos with taxonomies. Information Management Journal, May/June, vol. 39, issue 3, pp. 44-50.
Clarke, A., 1999. Evaluation research: an introduction to principles, methods and practice. London: Sage.
Doan, A., Madhavan, J. & Domingos, P., 2004. Learning to match ontologies on the semantic web. VLDB Journal, vol. 12, pp. 303-319.
Dotsika, F. & Watkins, A., 2004. Can conceptual modelling save the day: a unified approach for modelling information systems, ontologies and knowledge bases. In: Khosrow-Pour, M. (ed.), Innovation through information technology: proceedings of the 2004 IRMA conference.
Dörre, J., 1999. Text mining: finding nuggets in mountains of textual data. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining.
Martis, M.S., 2005. Content management system as an effective knowledge management enabler. International Journal of Applied Knowledge Management, vol. 1, issue 2.
Pais, N., 2006. An investigation of the text mining approach. MSc dissertation.
Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), vol. 34, issue 1, pp. 1-47.
Shearer, J., 2004. A practical exercise in building a thesaurus. Cataloging & Classification Quarterly, vol. 37, issue 3/4, pp. 35-56.
Stevens, R., Goble, C.A. & Bechhofer, S., 2000. Ontology-based knowledge representation for bioinformatics. Briefings in Bioinformatics, vol. 1, issue 4, pp. 398-414.
Tan, A.H., 1999. Text mining: the state of the art and the challenges. In: Proceedings of the PAKDD 1999 workshop on knowledge discovery from advanced databases.
Tilak, J., 2005. Desktop technologies most important in corporate IT – survey. http://www.dmeurope.com/default.asp?ArticleID=7994 (accessed 16/3/2006).
Yang, Y. & Liu, X., 1999. A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM international conference on research and development in information retrieval.
Contact the Authors:
Nara Naomi Nishitani Pais is Solutions Architect at SPSS MR Latin America Inc. and can be reached at: R. Nova York, 871 ap 61, Brooklin Paulista, São Paulo - SP, BRAZIL; Phone: +5511 55321251; E-mail: narapais@uol.com.br
Fefie Dotsika is a Senior Lecturer in the Business School of the University of Westminster and can be reached at: University of Westminster, 35 Marylebone Road, London NW1 5LS, UK; Phone: +44 (0)20 79115000 ext. 3027; E-mail: F.E.Dotsika@westminster.ac.uk
James Shearer is a Senior Lecturer in the