Docco: Document Retrieval with Formal
Concept Analysis
Peter Becker¹
School of Information Technology and Electrical Engineering (ITEE)
The University of Queensland
QLD 4072, Australia
peter@peterbecker.de
Abstract. A large amount of information available in digital form is
not stored in a standardized structure such as a relational database but
is available only in the form of semi-structured documents in a file system.
Typically these documents span a number of different file formats and do
not follow standard schemas in their structure. With Docco we present
a tool that allows retrieving documents from a heterogeneous collection
using Formal Concept Analysis as a method of structuring query results.
1 Introduction
With the rise of desktop computing and modern office applications such as word
processors, spreadsheets and presentation software, the number of digital
documents created has increased to the point where managing collections of such
documents has become a major issue in work environments, and the current
approach of organizing documents in folders in a file system has become too
limited.
As an example, imagine a paper presented at a workshop held at a conference.
Was it classified as belonging to the conference or the workshop or both? Were
acronyms or full names used for the conference or workshop? Was the paper
also classified against author(s), year or other facets such as the topic of the
paper? How is a paper that has multiple authors classified? What does the
person retrieving the document do if the folder structure requires them to
choose the conference first, but they know only the author and year?
As a new approach to this problem we implemented Docco¹. We used a
standard indexing engine and indexed not only the content of the documents, but
also embedded metadata (where it can be extracted) and the classification
implied by each document's position in the file system. Formal Concept Analysis
(FCA) is used to visualize the results in a way that allows easy handling of
overspecified queries, that is, queries whose terms no single document matches
in full.
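
The way FCA helps with overspecified queries can be illustrated with a minimal sketch. It is not Docco's implementation: the index contents, file names and the function names `extent`, `intent` and `concepts` are all hypothetical, and the brute-force enumeration over subsets of query terms is chosen for brevity and is only practical for short queries. Each document is represented by the set of terms found in its content, metadata and path; the query terms form the attributes of a formal context, and the resulting concepts group documents by which query terms they share.

```python
from itertools import combinations

# Hypothetical index: document -> terms taken from content, metadata and path.
INDEX = {
    "conf2004/workshop/paper_a.pdf": {"smith", "2004", "workshop", "fca"},
    "conf2003/paper_b.pdf": {"smith", "2003", "fca"},
    "talks/slides_c.ppt": {"2004", "lattice"},
}


def extent(terms, index):
    """Documents that contain every term in `terms`."""
    return frozenset(d for d, ts in index.items() if terms <= ts)


def intent(docs, query, index):
    """Query terms shared by all documents in `docs`."""
    return frozenset(t for t in query if all(t in index[d] for d in docs))


def concepts(query, index):
    """All formal concepts (extent, intent) of the context documents x query terms.

    Brute force: close every subset of the query terms. Fine for a handful
    of terms, exponential in the query length.
    """
    query = frozenset(query)
    found = set()
    for size in range(len(query) + 1):
        for subset in combinations(sorted(query), size):
            docs = extent(frozenset(subset), index)
            found.add((docs, intent(docs, query, index)))
    return found


if __name__ == "__main__":
    query = {"smith", "2004", "lattice"}
    # List concepts from the most general (largest extent) to the most specific.
    for docs, terms in sorted(concepts(query, INDEX), key=lambda c: -len(c[0])):
        print(sorted(terms), "->", sorted(docs))
```

In this toy index no document contains all three query terms, so a plain conjunctive search would return nothing. The concept lattice still exposes the partial matches, for example the one document that matches both `smith` and `2004`, which is the kind of structure a line diagram of the query results can present to the user.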
2 Connecting Information Retrieval and Formal Conce