Fascinating Facets!

Wednesday, September 29th, 2010

What is Facetting?

“Facetting” is a word which has a special meaning in search-engine-world. It could be defined as the generalization of “Tagging” which I assume you’re familiar with (from Twitter, Flickr, et al).

So instead of having just one kind of tag we could be creative and have two kinds: “Tags” and “Jags”. To help you organize your stuff, your system displays statistics about your Tags and Jags, with counts of how many matching items you have. For example:


Tags

  • pony (3)
  • kitten (27)

Jags

  • ninja (5)
  • samurai (68)

If I now search across my system for anything containing the word “ramen”, these statistics narrow down to show the counts for the search results only. For example:

Searching for “ramen”
Tags

  • kitten (3)

Jags

  • ninja (5)
  • samurai (2)

At any point in my journey I could click on a particular Tag or Jag and narrow my search results down to only the items with that particular attribute. For example, clicking on the “ninja” Jag:

Searching for “ramen”, restricted to the “ninja” Jag
Tags

  • kitten (1)

Jags

  • ninja (5) X
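The three steps above (count everything, count within a search, count within a search restricted to one Jag) can be sketched in a few lines of Python. This is only a toy illustration over an in-memory list, not how a real search engine does it; the `ITEMS` data, the field names `tags` and `jags`, and the helper functions are all made up for the example, and the counts do not match the numbers shown above.

```python
from collections import Counter

# Hypothetical toy collection; "tags" and "jags" are the two facet fields.
ITEMS = [
    {"text": "ramen with a kitten", "tags": ["kitten"], "jags": ["ninja"]},
    {"text": "spicy ramen", "tags": ["kitten"], "jags": ["samurai"]},
    {"text": "pony ride", "tags": ["pony"], "jags": ["samurai"]},
    {"text": "ramen shop review", "tags": [], "jags": ["ninja"]},
]

FACET_FIELDS = ("tags", "jags")


def search(items, word=None, filters=None):
    """Return items containing `word` that carry every (field, value) in `filters`."""
    filters = filters or []
    hits = []
    for item in items:
        # Naive whole-word match; a real engine would use an inverted index.
        if word is not None and word not in item["text"].split():
            continue
        if all(value in item[field] for field, value in filters):
            hits.append(item)
    return hits


def facet_counts(hits):
    """Count, per facet field, how many hits carry each value."""
    counts = {field: Counter() for field in FACET_FIELDS}
    for item in hits:
        for field in FACET_FIELDS:
            # set() so an item listing a value twice is still counted once.
            counts[field].update(set(item[field]))
    return counts


everything = facet_counts(search(ITEMS))
ramen = facet_counts(search(ITEMS, "ramen"))
ramen_ninja = facet_counts(search(ITEMS, "ramen", [("jags", "ninja")]))
```

The important detail is that the counts are always computed over the full set of hits for the current query and filters, which is exactly the part that gets expensive on a big index.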

In real life we don’t deal as much with Tags and Jags, so consider that you could stuff any old metadata attribute in there instead. When searching a library catalogue, some very useful facets would be:

Example facets for a library system:
  • Author
  • Title
  • Publisher
  • Year

To be honest, these facets were not exactly plucked out of thin air :-) I highly encourage you to go play with the real deal on the homepage of the State Library of Denmark.

Technical Aspects of Facetting

If you take indexing libraries like Lucene or Xapian out of the box, you have to do quite a lot of work to get correct facetting. By correct I mean always getting the counts exactly right and always calculating the complete facet sets for the active query.

A common way to give the illusion of facets is to simply calculate them for only the first 100 hits (or 1000, or whatever) returned by the search engine. This leads to a slow and resource-hungry solution that doesn’t provide the right results for large result sets (those with more hits than the cutoff).
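To see why the cutoff gives wrong answers, here is a minimal sketch with made-up data. The `facet_counts` helper, the `jag` field, and the 150-hit result list are assumptions for illustration only; they are not taken from Lucene, Xapian, or any other library.

```python
from collections import Counter


def facet_counts(hits, field, limit=None):
    """Count values of `field` over all hits, or only over the first `limit` hits."""
    considered = hits if limit is None else hits[:limit]
    return Counter(hit[field] for hit in considered)


# Hypothetical ranked result list: 150 hits, with the "samurai" items ranked first.
hits = [{"jag": "samurai"}] * 100 + [{"jag": "ninja"}] * 50

exact = facet_counts(hits, "jag")                 # looks at all 150 hits
truncated = facet_counts(hits, "jag", limit=100)  # looks at the first 100 only
```

With the cutoff, the 50 “ninja” hits never make it into the counts at all, so the “ninja” facet silently disappears from the interface even though clicking it would have returned results.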

Fear not! There are shrink-wrapped products like Summa or Solr that can give you correct facets pretty much out of the box. However, it’s still not exactly something you are going to run on low-end servers (unless you have a very small index).

This is where the true inspiration behind this blog post is revealed! Toke Eskildsen (my awesome former coworker) has been hacking away, trying to get the facetting system from Summa upstreamed into Lucene. Along the way he has been optimizing the internals of Lucene with facetting in mind and providing hooks to make facetting more efficient. Toke’s latest status update certainly heralds a brighter future! :-D

It’s my hope that Toke’s work can help bring facetting more into the mainstream, because it’s truly an awesome way to browse huge datasets.

Facetting on the Open Source Infrastructure

Dreaming on into a world where facetting is ubiquitous, I can certainly see Bugzilla, Launchpad, translation sites, wikis, and whatnot making life a lot easier for everyone, from passers-by to professional developers, if they could do facetting across their metadata.

Facetting on the Desktop

Even though Toke’s work is all sorts of awesome, my gut instinct tells me that general facetting would still be too heavy a task for a normal desktop.

That said, it may not be impossible. At the very least I want it to be possible! :-) By really polishing the low-level data structures, and maybe cheating just a wee bit, we might get something which is good enough.

A while back I actually configured Summa to harvest my desktop (wiring it up with Tika), setting it up to create facets for document titles, URIs, and MIME types. Stuff like that. And when I started browsing my files in Summa I immediately had one of those Eureka moments:

Files are meant to be browsed through facets!

It just felt so right :-)

(Bonus: Facetting and Zeitgeist?)

Sorry I don’t have a cool demo to show here :-) Just a pipe dream to share.

In theory, it is possible to define a Timeline facet where each entry corresponds to a certain time range (the histogram in GNOME Activity Journal is actually more or less doing this).
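A Timeline facet is really just ordinary facet counting where the facet value is derived from a timestamp. Here is a rough sketch of the idea; the `timeline_facet` helper and the event times are invented for the example and have nothing to do with the actual Zeitgeist or GNOME Activity Journal code.

```python
from collections import Counter
from datetime import datetime


def timeline_facet(timestamps, fmt="%Y-%m"):
    """Bucket timestamps into ranges named by `fmt` (one bucket per month by default)."""
    return Counter(ts.strftime(fmt) for ts in timestamps)


# Hypothetical event times, e.g. when some files were last opened.
events = [
    datetime(2010, 8, 30, 9, 15),
    datetime(2010, 9, 1, 14, 0),
    datetime(2010, 9, 29, 18, 45),
]

by_month = timeline_facet(events)
by_day = timeline_facet(events, fmt="%Y-%m-%d")
```

Changing the format string changes the resolution of the buckets, so the same counting could back a per-month overview as well as a per-day histogram.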

Couple this with the zeitgeist-fts-extension to give you a full text search interface and you have the foundations. Now you “just” need to intersect the searches with some facetting info on the logged metadata and do a heckuva lot of counting, and presto: a magical interface to replace the aging hierarchical file system metaphor :-)

OK, I may have made that last part sound easier than it’s likely to be… To be honest it’s gonna be darn friggin hard to implement in an efficient and light way. So don’t hold your breath… I’m not.