Document Categorisation - A key challenge


Let's be honest! Finding documents of interest on edocr.com is not quite as bad as the above image (hopefully, you will agree with us on this). Whilst we have in-built search capability and tagging, we want to introduce easier ways to find documents as we attract more and more users and documents each day.

As part of the on-going development of edocr.com, we have started to categorise documents. Anyone who has ever tried to categorise information will understand how hard it is to get it right first time. There are many ways of categorising information, from standards such as the Standard Industrial Classification (SIC) codes used in the UK to the North American Industry Classification System (NAICS) used in the United States. In addition, many business websites have adopted their own schemes. One classic example is Kompass Directory. After initial research, we came to the conclusion that we have no choice but to create our own.

When we built edocr in 2007, our initial thoughts were to build categorisation by combining relevant tags. Having seen how some of our community members tag, we can only conclude that not everyone understands the importance of tagging or how to tag effectively. Perhaps we are guilty of not describing this process well (see previous blog post on tagging here).

It is our belief that the subject of any document can be covered with four levels of categorisation. As primary categories, we have already introduced the following:

- Business Solutions
- Business Support Services
- Chemicals
- Companies
- Consumer
- Construction
- Countries
- Engineering
- Finance
- Food & Beverage
- Government
- Hospitality & Tourism
- Insurance
- Manufacturing
- Technology
- Transportation
- Utilities

More primary categories will be added over the coming days and weeks. Adding 2nd, 3rd and 4th levels of categories to the above will continue well into the year. Please do engage with us if we have not introduced categories that are relevant to your documents.
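To make the four-level idea concrete, here is a minimal sketch in Python of how such a hierarchy could be stored and checked. It is an illustration only, not how edocr.com is built: the `Category` class, the `add_path` and `depth` helpers, and the sub-categories under "Finance" are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 4  # the post proposes at most four levels of categorisation


@dataclass
class Category:
    """One node in a hypothetical category tree."""
    name: str
    children: dict = field(default_factory=dict)

    def add_path(self, path):
        """Insert a path such as ["Finance", "Banking"] below this node."""
        if len(path) > MAX_DEPTH:
            raise ValueError(f"at most {MAX_DEPTH} levels are allowed")
        node = self
        for name in path:
            # Reuse an existing child with this name, or create a new one.
            node = node.children.setdefault(name, Category(name))
        return node


def depth(node):
    """Number of category levels below this node (the node itself is level 0)."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children.values())


# "Finance" is a primary category from the list above; the lower levels
# are invented purely for illustration.
root = Category("root")
root.add_path(["Finance", "Banking", "Retail Banking", "Mortgages"])
root.add_path(["Finance", "Accounting"])
```

Capping the path length in `add_path` is one simple way to enforce the four-level limit; a document would then be filed under any node of the tree, from a primary category down to a fourth-level one.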

A couple of weeks ago Bing had a small search summit for analysts, bloggers, SEO experts, entrepreneurs and advertisers. It was held in Bellevue; they put us up in the hotel and fed us. While there we received demos from Bing project teams. I was able to snag an interview with Sanaz Ahari, Lead PM on Bing. She led the team that developed the categories you see on a Bing web search. The interview was based on the slides from her presentation at the event. I have posted the significant images from her slides. The first portion of the interview focuses on how the Bing team handles query-level categorization and some of the problems they face. The second portion (up shortly) focuses on the systems used to generate the categorization.

Hi, this is Brady Forrest with O'Reilly Radar, and I'm here with Sanaz Ahari, Lead PM on Bing Search. And she's going to lead us through the categorization process that you see on every page. Hey, Sanaz.

Sanaz Ahari: Hey, Brady. So I'm going to walk you through basically kind of just the journey that we went through for coming up with our categorized experience. And so the categorized experience is basically the left rail experience that you see on Bing today. It doesn't show up for every single query today, but when it does show up, it's really about helping the users complete their task essentially. So just to take a step back, when we started on the project, we had done a lot of analysis on queries just in a vacuum. And queries are always a part of users completing a task. And in a lot of the analysis we did, we noticed that a lot of the tasks are common. And it's really just common sense. When you're looking for a car, you're either researching it; you already own it; you want to buy one. When you're looking for a musician, you want to see if they're on tour; you want lyrics, songs, albums, et cetera.

And so our challenge was can we apply some of that essentially structured aspect to queries. And this is really similar to what you see on sites like Amazon, IMDB, et cetera. They do just a really kick ass job of categorizing their content. The challenge is that A, those sites are really about one domain. And then B, those sites are really operating on top of already structured data. And so the challenge that we have with search is that A, we are a general purpose search engine, and then B, the data that we have is not structured. So the goal that we started out with was we wanted to start very simple. And categorization and clustering, et cetera are nothing really new in the search space. There are a lot of people for years that have been working around the space in the research and computer science space.

So what we started out with was one of the key things that we wanted were two principles. One of them was A, can we achieve aspects and categories that were really, really user intuitive. And B, can we achieve this across a query class. One of the things that we really wanted was in order for us to build a habit for our users, we needed to deliver a predictable consistent experience across a query class. So if I went and told my dad, "Hey, Dad, try any car," I really want him to get a categorized experience for any car. So those are the two kind of constraints that we really set for ourselves. We said, "Unless we meet these two criteria, it's not really successful." And so we started out with a lot of prototyping around, "Hey, can we actually extract intent from queries?" So we started from the intent aspect. And I'll walk you through an example just to show you a simplistic view and how it gets very easily complicated.
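The "consistent experience across a query class" principle can be pictured with a small Python sketch. This is a toy illustration and makes no claim about how Bing works: the lookup tables, the `categories_for` function, and the example entities are assumptions, and only the car and musician categories paraphrase the interview.

```python
# Hypothetical categories per query class. The "car" and "musician" entries
# paraphrase the examples in the interview; the labels themselves are assumed.
CATEGORIES_BY_CLASS = {
    "car": ["Research", "Buy", "Own"],
    "musician": ["Tour dates", "Lyrics", "Songs", "Albums"],
}

# Toy mapping from a known entity to its query class. This hand-written table
# is an assumption that stands in for whatever classifier a real system uses.
ENTITY_CLASS = {
    "honda civic": "car",
    "ford focus": "car",
    "radiohead": "musician",
}


def categories_for(query):
    """Return the category list for a query, or [] if no class is recognised."""
    query_class = ENTITY_CLASS.get(query.strip().lower())
    return CATEGORIES_BY_CLASS.get(query_class, [])
```

Because categories hang off the query class rather than the individual query, `categories_for("Honda Civic")` and `categories_for("Ford Focus")` return the same list, which is the predictability described above. An unrecognised query returns an empty list, mirroring the remark that the categorized experience does not show up for every query.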