Topical Crawling for Business Intelligence
Gautam Pant and Filippo Menczer
Department of Management Sciences
The University of Iowa, Iowa City IA 52242, USA
Abstract. The Web provides us with a vast resource for business intelligence.
However, the large size of the Web and its dynamic nature make the task of
foraging for appropriate information challenging. General-purpose search engines
and business portals may be used to gather some basic intelligence. Topical
crawlers, driven by richer contexts, can then leverage this basic intelligence
to facilitate in-depth and up-to-date research. In this paper we investigate the
use of topical crawlers in creating a small document collection that helps
locate relevant business entities. The problem of locating business entities is
encountered when an organization looks for competitors, partners or
acquisitions. We formalize the problem, create a test bed, introduce metrics to
measure the performance of crawlers, and compare the results of four different
crawlers. Our results underscore the importance of identifying good hubs and
exploiting link contexts based on tag trees for accelerating the crawl and
improving the overall results.
A large number of business entities, both start-up companies and established
corporations, have a Web presence. This makes the Web a lucrative source for
locating business information of interest. A company that is planning to
diversify or invest in a start-up would want to locate a number of players in
that area of business. The intelligence gathering process may involve manual
effort using search engines, business portals or personal contacts. Topical
crawlers can help by extracting a small but focused document collection from the
Web that can then be thoroughly mined for appropriate information using
off-the-shelf text mining, indexing and ranking tools.
Topical crawlers, also called focused crawlers, have been studied extensively
in the past [6, 7, 3, 10, 2]. In our pre