Posts Tagged ‘desktop search’

Fascinating Facets!

Wednesday, September 29th, 2010

What is Facetting?

“Facetting” is a word which has a special meaning in search-engine-world. It could be defined as the generalization of “Tagging” which I assume you’re familiar with (from Twitter, Flickr, et al).

So instead of having just one kind of tag we could be creative and have two kinds: “Tags” and “Jags”. To help you organize your stuff, your system displays statistics about your Tags and Jags, with counts of how many matching items you have. For example:


Tags

  • pony (3)
  • kitten (27)

Jags

  • ninja (5)
  • samurai (68)

If I now search across my system for anything containing the word “ramen”, these statistics would narrow down to show the counts for the search results only. For example:

Searching for “ramen”
Tags

  • kitten (3)

Jags

  • ninja (5)
  • samurai (2)

At any point in my journey I could click on a particular Tag or Jag and narrow my search results down to only the items with that particular attribute. For example, clicking on the “ninja” Jag:

Searching for “ramen”, restricted to the “ninja” Jag
Tags

  • kitten (1)

Jags

  • ninja (5) X
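The counting behind those lists can be sketched in a few lines of Python. This is only a toy illustration: the items, the field names "tags" and "jags", and the helper functions are all made up for the example, and the numbers will not match the lists above.

```python
from collections import Counter

# Hypothetical toy items: free text plus two facet fields, "tags" and "jags".
ITEMS = [
    {"text": "ramen with a kitten", "tags": ["kitten"], "jags": ["ninja"]},
    {"text": "ramen for lunch", "tags": ["kitten"], "jags": ["samurai"]},
    {"text": "pony ride", "tags": ["pony"], "jags": ["samurai"]},
    {"text": "ninja ramen", "tags": [], "jags": ["ninja"]},
]

def search(items, word=None, filters=None):
    """Return items containing `word` that also match every (facet, value) filter."""
    filters = filters or []
    hits = []
    for item in items:
        if word is not None and word not in item["text"].split():
            continue
        if all(value in item[facet] for facet, value in filters):
            hits.append(item)
    return hits

def facet_counts(hits, facets=("tags", "jags")):
    """Count, per facet, how many hits carry each value."""
    return {facet: Counter(v for hit in hits for v in hit[facet]) for facet in facets}

# No query: counts over the whole collection.
everything = facet_counts(search(ITEMS))
# Searching for "ramen": counts shrink to the matching items.
ramen = facet_counts(search(ITEMS, word="ramen"))
# Same search, restricted to the "ninja" Jag.
ramen_ninja = facet_counts(search(ITEMS, word="ramen", filters=[("jags", "ninja")]))
```

The point of the sketch is that the facet counts are always computed over the full set of hits for the active query, whatever restrictions are in place.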

In real life we don’t deal much with Tags and Jags, so consider that you could stuff any old metadata attribute in there instead. When searching a library catalogue, some very useful facets would be:

Example facets for a library system:

  • Author
  • Title
  • Publisher
  • Year

To be honest these facets were not exactly plucked out of thin air :-) I highly encourage you to go play with the real deal on the homepage of the State Library of Denmark.

Technical Aspects of Facetting

If you take indexing libraries like Lucene or Xapian out of the box – you have to do quite a lot of work to get correct facetting. And by correct I mean always getting the counts exactly right and always calculating the entire facet sets for the active query.

A common solution to give the illusion of facets is to simply calculate them on the search engine for the first 100 hits (or 1000, or whatever). This leads to a slow and resource-hungry solution that doesn’t provide the right counts for large result sets (those with more than 100 hits).
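To see why the first-100-hits shortcut gives wrong counts, here is a small Python sketch. The hit list and the count_facet helper are invented for the illustration; no real search engine is involved.

```python
from collections import Counter

# Hypothetical ranked result list with 150 hits, each carrying one "jag" value.
# The first 100 hits are all "samurai", the remaining 50 are all "ninja".
hits = [{"jag": "samurai"}] * 100 + [{"jag": "ninja"}] * 50

def count_facet(hits, field, limit=None):
    """Count values of `field`, optionally looking only at the first `limit` hits."""
    considered = hits if limit is None else hits[:limit]
    return Counter(hit[field] for hit in considered)

# The shortcut: only the top 100 hits are inspected.
approximate = count_facet(hits, "jag", limit=100)
# The correct version: every hit for the query is inspected.
exact = count_facet(hits, "jag")
```

With this data the shortcut never sees a single "ninja" hit, so that facet value would not even show up in the interface, while the exact count finds all 50.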

Fear not! There are shrink wrapped products like Summa or Solr that can give you correct facets pretty much out of the box. However it’s still not exactly something you are going to run on low end servers (unless you have a very small index).

This is where my true inspiration behind this blog post is revealed! Toke Eskildsen (my awesome former coworker) has been hacking away, trying to get the facetting system from Summa upstreamed into Lucene. Along the way he is optimizing the internals of Lucene with facetting in mind and providing hooks to make facetting more efficient. Toke’s latest status update certainly heralds a brighter future! :-D

It’s my hope that Toke’s work can help bring facetting more into the mainstream – because it’s truly an awesome way to browse huge datasets.

Facetting on the Open Source Infrastructure

Dreaming on into a world where facetting is ubiquitous I can certainly see Bugzilla, Launchpad, translations sites, wikis, and what not making lives a lot easier for everyone from passers-by to professional developers if they could do facetting across their metadata.

Facetting on the Desktop

Even though Toke’s work is all sorts of awesome, my gut instinct tells me that general facetting still would be too heavy a task for a normal desktop.

That said, it may not be impossible. At the very least I want it to be possible! :-) By really polishing the low level data structures, and maybe cheating just a wee bit, we can get something which is good enough.

A while back I actually configured Summa to harvest my desktop (wiring it up with Tika), setting it up to create facets for document titles, URIs, and mimetypes. Stuff like that. And when I started browsing my files in Summa I just immediately had one of those Eureka moments:

Files are meant to be browsed through facets!

It just felt so right :-)

(Bonus: Facetting and Zeitgeist?)

Sorry I don’t have a cool demo to show here :-) Just a pipe dream to share.

In theory, it is possible to define a Timeline facet where each entry would correspond to a certain time range (the histogram for Gnome Activity Journal is actually more or less doing this).
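A rough sketch of what such a Timeline facet boils down to, in Python. The event list and the timeline_facet helper are hypothetical and have nothing to do with how Zeitgeist actually stores its log; the idea is only that each facet entry is a time bucket with a count.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: (timestamp, uri) pairs, as an activity logger might record.
EVENTS = [
    (datetime(2010, 9, 27, 9, 15), "file:///home/me/notes.txt"),
    (datetime(2010, 9, 27, 14, 2), "file:///home/me/report.odt"),
    (datetime(2010, 9, 28, 11, 40), "file:///home/me/notes.txt"),
    (datetime(2010, 9, 29, 8, 5), "file:///home/me/photo.png"),
]

def timeline_facet(events, bucket_format="%Y-%m-%d"):
    """Count events per time bucket; the strftime format decides the bucket size."""
    return Counter(ts.strftime(bucket_format) for ts, _uri in events)

per_day = timeline_facet(EVENTS)               # one facet entry per day
per_month = timeline_facet(EVENTS, "%Y-%m")    # coarser buckets, one per month
```

Clicking a bucket would then restrict the search to that time range, just like clicking any other facet value.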

Couple this with the zeitgeist-fts-extension to give you a full text search interface and you have the foundations. Now you “just” need to intersect the searches with some facetting info on the logged metadata and do a heckuwa lot of counting, and presto – magical interface to replace the aging hierarchical file system metaphor :-)

Ok – I may have made that last part sound easier than it’s likely to be… To be honest it’s gonna be darn friggin hard to implement in an efficient and light way. So don’t hold your breath… I’m not.

Xesam Tools 0.7.0

Monday, May 11th, 2009

Another one of those blog posts where the author shamelessly abuses his blog for posting boring release announcements. Next time he will reveal his deep desires and darkest fantasies – or not…

Just a quickie before I am off for bed – I just pushed Xesam Tools 0.7.0 to the galacto-webs.

The changelog is really short, but since you love bullet point lists so much:

  • Xesam 1.0 compliance

Refresh my memory: What is Xesam Tools?

From the website:

Xesam Tools is a set of tools implemented in Python/PyGobject to help development and adoption of xesam technologies.

It consists of three things.

  • A client side implementation of the xesam search api abstracting away the dbus-isms and exposing a nice clean api
  • A bunch of tools and classes to introspect and test xesam implementations
  • A set of unit tests to exercise the full feature set of a xesam search server

Party like it’s 1999: Xesam 1.0

Saturday, May 2nd, 2009

Xesam 1.0

After many delays and flamewars I am very proud to announce the first stable version of the Xesam specification. You can read the specification online at: http://xesam.org (direct link: http://xesam.org/main/Xesam100). Unfortunately there are no PDFs up yet; I am working on this.

This release brings some changes compared to the two pre-releases. They aim to amend some (and only some) of the issues that were unearthed during the Desktop Search hackfest in Berlin, Sept. 2008. For some background on this please see: http://xesam.org/main/Drafts/LiveAPIChanges. The tools, libraries, and server backends have not caught up with this latest release yet. With some luck and lots of hard work these things should start trickling in soonish. If you are interested in following this progress, you probably need to follow the Xesam mailing list or IRC channel (xesam@fdo and #xesam on FreeNode respectively).

What is Xesam?

I realize that it has been a long time since I blogged about Xesam, so here’s a recap. Xesam is short for eXtEnsible Search And Metadata and is an umbrella project with the purpose of providing unified APIs and specs for desktop search- and metadata services. We are collaborating with several projects such as Tracker, Strigi, Beagle, Pinot, Recoll, and Nepomuk-KDE.

Currently Xesam consists of four components:

  • The DBus Search API
  • An XML query language
  • A search language targeted at end users
  • A collection of ontologies for describing the data objects on a modern computer

In the future we’d also like to provide:

  • An API for querying more detailed statistics from the search engine’s index (such as term- and field counts)
  • An API for storing and retrieving metadata
  • A more inspiring website :-)

Xesam 2.0 and Compatibility

The first version of a standard is very rarely perfect and Xesam is no exception. There is already active work on a version 2.0 of Xesam, and we can already say now that version 2.0 will not be compatible with Xesam 1.0 (although it will be possible to write a conversion layer, if one can accept the performance hit). The changes that are being considered are:

  • Switch to Nepomuk ontologies
  • Change query language to one that can better leverage the power of a semantic, graph-based metadata storage. SPARQL is currently the primary contender
  • Switch the Search API to be stateless and optimized as to reduce bus roundtrips, see some very rough drafts at: http://xesam.org/main/Drafts/NewSearchApi

It is my expectation that we can design the index- and storage- APIs (mentioned as future tasks above) in a way so that they will work both on Xesam 1.0 and Xesam 2.0.

Thanks!

I would like to thank everyone who has commented, criticized, helped, punked, and otherwise been awesome for bringing us this far! Here’s to the future.

The Personal Note

Now I need to bring Xesam Tools and Xesam GLib up to scratch. I am totally gonna love rewriting my code to be compliant with the changes in Xesam…

And – I am sorry Xesam 1.0 took so long! I know you love me anyway.

Off to Desktop Search Hackfest

Thursday, September 18th, 2008

Off to Berlin for the Desktop Search Hackfest tomorrow morning. It is going to be fun to finally meet up with some of the people I usually flame and troll with online when we are trying to agree on something for Xesam :-). I really must thank Nokia for giving us this great opportunity! So – thank you Nokia!

I have signed up for a lot of talking as there are really a lot of issues we need to reach consensus on before we finalize the Xesam Search Specification 1.0. The items we have lined up for 1.1 are also ambitious and could do with a lot of discussion so we can get a good start on them.

I fear that my brain will be toast after a day of BOFs and that it might not be up for a lot of coding, but let’s hope. Thankfully I have a backup plan – poking the search engine devs to complete their Xesam support. So if you are a search engine developer in Berlin this weekend, prepare yourself for some serious poking!

The downside of all of this is that I have to leave my wonderful family behind, I already miss the kids (well, also the wife, but don’t tell her I said that). It seems that the kids get more and more hilarious each day. Liv surprises me several times per day with her rapidly increasing vocabulary and often leaves me standing baffled – did she just say that? Lauge has also just started trying to express himself. It mostly comes down to “Hellooo!” and “Au au!” (when he pretends to touch something he thinks is hot (such as an old half-chewed cucumber lying on the floor)).

Xesam GLib 0.4.1

Monday, August 11th, 2008

Boring new release of Xesam GLib. NEWS:

 * Fix a bug in the UserSearchParser where a family of search strings mixing selectors and boolean operations caused parser errors (Kamstrup)

 * Fix a memory leak in the UserSearchParser when internally resetting the parser (Kamstrup)

Download: xesam-glib-0.4.1.tar.gz

API docs: the usual place

Two Xesam News. Hackfest and Emacs

Sunday, August 3rd, 2008

Ok. Two great pieces of news on the Xesam front. It is really hard to rate which one is the most important so I labeled both of them 1).

1) Nokia sponsors desktop search hackfest! I am going. Party! If words like “metadata”, “index”, “search”, “query” and friends give you rashes, stay out of Berlin a few days before and after Sept 21st. If you are developing a proprietary search engine – start looking for a new job.

Expect lots more blogging on the hackfest when a clearer plan emerges.

1) Michael Albinus (dude, cool guys like you ought to have a blog) is working on Xesam integration in Emacs. No shit. Of course no one is surprised that Emacs has dbus support, but Xesam… Booyacacha![1]

[1]: I always knew that Vim totally sucked and Emacs was far superior in every way. But why am I writing that? This is an undisputed universal truth.

Xesam’s Got The Sound That Make Your Booty Go *!*

Monday, July 21st, 2008

Xesam GLib 0.4.0

It’s been a long time under way. Longer than I had hoped. But here it is. Feature packed and with tonnes of new code under the hood. I am fairly confident that it is not too unstable though. Hurrah for unit tests! Excerpt of NEWS file:

* Strip some public symbols that should have been private (Biebl)

* Add .spec generation for Fedora (and Suse, untested) (Colin)

* Make Hits an abstract class (derivations SequentialHits and SparseHits are implemented but not part of the public api) (Kamstrup)

* Virtualize all methods on the Hit class (Kamstrup)

* Remove the xesam_g_hit_get_data method as it exposed implementation details, for fast hit access you can now look up the field ids with xesam_g_hit_get_field_id() and use those to retrieve field values (Kamstrup)

* Remove xesam_g_{session,search}_get_field_map(). Clients no longer have direct access to the field maps. This was an implementation detail. (Kamstrup)

* Add 'auto-continue' property on XesamGSearch (Kamstrup)

* Add support for user-data in Search and Hits objects (Kamstrup)

* Enforce single header includes. From now on it is only allowed to include xesam-glib.h, not all the object headers (Kamstrup)

* Removed the last traces of dbus-glib generated GTypes from the public headers (Kamstrup)

* Add an example of how to use XesamGDBusSearcherStub to examples/. This can be used to proxy dbus Xesam Search interfaces to other names and paths (Kamstrup)

* Add an easy to use iterator interface to XesamGHits (Kamstrup)

* Implement xesam_g_search_request_extended_data(). This is an experimental feature, so don't make your life depend on it! (Kamstrup)

Download: xesam-glib-0.4.0.tar.gz

API docs: Xesam GLib 0.4.0 API docs

Anecdotal Rant

The ChangeLog of xesam-glib-0.4.0 contains this line:

By coincidence this also fixes the last blocker bug for 0.4 which was that "hits-removed" and "hits-modified" signals on XesamGSearcher where not included in the gtk-doc output.

Please note the words “blocker bug”. Yes – I consider missing documentation for libraries blocker bugs. This is evidently not the case for all library writers. Not that Xesam GLib has perfect documentation, but I do think that a library without proper documentation is just a smidgeon better than a binary blob.

Other Rocking Xesam-News

Pinot with Xesam support anyone?

Strigi continuing work on their (already working) Xesam interface

Relax!

Kick back and let the good stuff come to you! Crank up the volume on your stereo and let the beats rip those ears! The usual Xesam heroes are working for you today!

xesam-glib 0.3.0

Tuesday, June 24th, 2008

I just released Xesam GLib 0.3.0 into the wild. You can grab xesam-glib here.

News in this release:

  • By far the biggest item to hit xesam-glib has been the query handling API. It contains tools to parse queries and user search strings as well as a generic QueryBuilder interface you can plug in to convert Xesam queries to whatever you like (SQL, Lucene, SPARQL, GObjects)
  • A security issue related to XesamGSearch (see next point)
  • Take care to treat all interfacing with a XesamGSearcher as dealing with an untrusted source. Search engine hackers are notoriously evil people, we all know that ;-P
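To give a feel for the pluggable builder idea, here is a toy sketch in Python. It is not the xesam-glib API (which is C/GObject): the query tree format, the two builder classes, and the walk function are all hypothetical names invented for this illustration. The parser walks a query once and the plugged-in builder decides what comes out.

```python
# Hypothetical parsed query: nested tuples of ("and" | "or", left, right)
# or ("equals", field, value). This is NOT the Xesam query format.
QUERY = ("and",
         ("equals", "title", "ramen"),
         ("or", ("equals", "mime", "text/plain"), ("equals", "mime", "text/html")))

class SqlBuilder:
    """Builds a parameterised SQL WHERE clause; values go into `params`."""
    def __init__(self):
        self.params = []

    def equals(self, field, value):
        # Field names are interpolated directly, so they must come from a
        # trusted list of known fields; only values are passed as parameters.
        self.params.append(value)
        return f"{field} = ?"

    def combine(self, op, left, right):
        return f"({left} {op.upper()} {right})"

class LuceneBuilder:
    """Builds a Lucene-style query string (no escaping, illustration only)."""
    def equals(self, field, value):
        return f'{field}:"{value}"'

    def combine(self, op, left, right):
        return f"({left} {op.upper()} {right})"

def walk(node, builder):
    """Traverse the query tree once, delegating all output to the builder."""
    kind = node[0]
    if kind == "equals":
        return builder.equals(node[1], node[2])
    return builder.combine(kind, walk(node[1], builder), walk(node[2], builder))

sql_builder = SqlBuilder()
sql = walk(QUERY, sql_builder)
lucene = walk(QUERY, LuceneBuilder())
```

The design choice worth noticing is that the traversal knows nothing about the target language, so adding a new output format only means writing another small builder.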

Tarball: xesam-glib-0.3.0.tar.gz

API docs: documentation for xesam-glib-0.3.0