Zeitgeist API Ramblings

Saturday, August 1st, 2009

I’ve been spending a lot of my brain cycles lately thinking about how to design the Zeitgeist DBus API properly. Let me tell you that it is a tough nut to crack – that, or I am not very good at this stuff :-)

Please note that this is merely a personal brain dump. I have not discussed this with the other Zeitgeist developers yet.

Zeitgeist Object Model

Before I can get to the real problem we have to be on the same page regarding the object model employed by Zeitgeist.

Here’s the deal. First chant with me: “Zeitgeist is an event logging framework”. For Zeitgeist to provide a useful API it needs to know about more things than just Events. It needs to know some minimal stuff about the Items that make up your desktop (files, emails, contacts, online services, etc.). It also needs to consider that these Items can have Annotations such as tags, ratings, comments, bookmarks, and so on. Behold my graphical superiority:

[Figure: Zeitgeist event, item, and annotation relations]

As it stands both Annotations and Events are considered subclasses of Item in an Object Oriented sense. This means that we can have Annotations on Annotations and Events on Annotations (or Events on Events!).
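To make the hierarchy concrete, here is a minimal sketch in Python. The class and field names (`uri`, `subject`, `kind`, `action`) and the URIs are made up for illustration; they are not the actual Zeitgeist schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Anything on the desktop: a file, an email, a contact, ..."""
    uri: str  # hypothetical identifier field


@dataclass
class Annotation(Item):
    """A tag, rating, comment, bookmark, ... attached to some Item."""
    subject: Optional[Item] = None
    kind: str = "tag"
    value: str = ""


@dataclass
class Event(Item):
    """Something that happened to an Item at some point in time."""
    subject: Optional[Item] = None
    timestamp: int = 0
    action: str = "opened"


# Because everything is an Item, the subject of an Event can be an
# Annotation, and the subject of another Event can be that Event.
doc = Item(uri="file:///home/me/report.odt")
tag = Annotation(uri="zg:annotation/1", subject=doc, kind="tag", value="work")
tagged = Event(uri="zg:event/1", subject=tag, timestamp=1249084800, action="created")
meta = Event(uri="zg:event/2", subject=tagged, timestamp=1249084801, action="viewed")
```

If Event did not subclass Item, the last line would no longer be expressible, which is the flexibility being traded away.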

I am definitely of the opinion that Annotations should be first class Items in their own right, but I am not sure about Events. Whether Events should be Items or not is mostly a technical question. From an ideological POV I think it is great if everything inherits from Item (more flexibility – yay!). The catch is that we can cut down on the DB size if Events don’t subclass Item – hence the question marks on the label.

You may be able to grasp the Zeitgeist data model better if you read the database design spec.

The Problem: Querying

So we want to utilize this rich event log to query for interesting relations between Items (and by Item I will also mean any subclass thereof). Listing the most recently used tags would be a simple use case. The same application would probably also be interested in listing all the existing tags – or, more advanced, all tags on some specific subset of Items such as all files (and probably a lot of more complex things too).

The current Zeitgeist can do this, no problem. There is however the subtle problem that Zeitgeist is not meant to manage your tags. Zeitgeist is an Event Log – remember?

So assume that Item metadata and Annotations are stored elsewhere, in Tracker, CouchDB, Midgard, Soprano, Ikea or wherever you want. Where they should be. Let me call this sacred silo of user data the Repository. Applications will generally be interested in doing what amounts to an SQL JOIN over the Repository and the Zeitgeist event log. That is, cross referencing both and selecting subsets of data from both based on the relations between them.

If the Repository and Zeitgeist can’t cooperate in some way (e.g. by being in the same process with access to the same resources), these “JOINs” can only be accomplished by fetching broad selections of data from the Repository and having Zeitgeist filter out the relevant parts before it returns the data to you. This will perform like crap.
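A toy illustration of that fetch-everything-then-filter pattern, using the "most recently used tags" use case. The two lists are hypothetical stand-ins for the Repository and the event log living in separate processes; the data and the function name are invented for this sketch.

```python
# Hypothetical Repository contents: (item_uri, tag) pairs.
REPOSITORY_TAGS = [
    ("file:///a.odt", "work"),
    ("file:///b.png", "holiday"),
    ("file:///c.txt", "work"),
    ("file:///d.mp3", "music"),
]

# Hypothetical event log: (timestamp, item_uri) pairs.
EVENT_LOG = [
    (100, "file:///a.odt"),
    (200, "file:///b.png"),
    (300, "file:///d.mp3"),
    (150, "file:///c.txt"),
]


def recently_used_tags(limit):
    """Return up to `limit` distinct tags, most recently used first."""
    # Step 1: pull *every* tag on *every* item out of the Repository,
    # because the Repository has no idea which items were used recently.
    tags_by_item = {}
    for uri, tag in REPOSITORY_TAGS:
        tags_by_item.setdefault(uri, []).append(tag)

    # Step 2: walk the log newest first and keep only what is relevant.
    result = []
    for _timestamp, uri in sorted(EVENT_LOG, reverse=True):
        for tag in tags_by_item.get(uri, []):
            if tag not in result:
                result.append(tag)
                if len(result) == limit:
                    return result
    return result


print(recently_used_tags(2))
```

Even when only two tags are wanted, step 1 drags the whole tag table across the process boundary. With four rows nobody cares; with a real desktop's worth of metadata it hurts.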

Raising the Questions

  • Separation – How should we separate the Repository and the event log? If this makes sense to do at all, that is. To me they are two very different things.
  • API – Should Zeitgeist expose what it knows about the Repository as some “weak Repository API”, essentially acting as a proxy? This could make sense, since you might then be able to run without a full blown repository, using only the limited Repository functionality that Zeitgeist can provide natively.
  • Query Language vs. API – How does one query the Repository+Log? Currently we have a nifty API utilising some “filters” you pass to the methods to limit what types of data you query. My playing around with this says that it is simply not powerful enough. It will power the ideas we have now, but what about next year? One really needs the power of a query language to reap the fruits of the event log. That obviously raises the question of which query language to expose:
    • SPARQL – Powerful, but a very hard (read: impossible) requirement if we want alternative backends.
    • MQL – The JSON-based query language of the Freebase project. The json-glib package should make this easier.
    • Xesam Query Language – Simple (the question is if it is too simple?). Has libs for Python, C/GObject, C++/Qt.
    • SQL – A simple subset of SQL should be fairly easy to shoehorn on top of most backends. Should be easy to parse too.

    For most of these languages we’d only need to support a subset of the full language.
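To give a feel for what the SQL option buys, here is a small sketch using Python's `sqlite3` module. The two-table layout (`event`, `annotation`) and the sample rows are hypothetical and chosen only for this example; they are not the real Zeitgeist database design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (timestamp INTEGER, item_uri TEXT);
    CREATE TABLE annotation (item_uri TEXT, kind TEXT, value TEXT);
""")
conn.executemany("INSERT INTO event VALUES (?, ?)", [
    (100, "file:///a.odt"),
    (200, "file:///b.png"),
    (300, "file:///d.mp3"),
    (150, "file:///c.txt"),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
    ("file:///a.odt", "tag", "work"),
    ("file:///b.png", "tag", "holiday"),
    ("file:///c.txt", "tag", "work"),
    ("file:///d.mp3", "tag", "music"),
])

# "Most recently used tags" as a single statement over log + annotations.
QUERY = """
    SELECT a.value, MAX(e.timestamp) AS last_used
    FROM annotation AS a
    JOIN event AS e ON e.item_uri = a.item_uri
    WHERE a.kind = 'tag'
    GROUP BY a.value
    ORDER BY last_used DESC
    LIMIT ?
"""
rows = conn.execute(QUERY, (2,)).fetchall()
print(rows)
```

Nothing here goes beyond SELECT, JOIN, WHERE, GROUP BY, ORDER BY and LIMIT, which is roughly the kind of subset a non-SQL backend would have to emulate.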

To be honest I hope someone comes up with a brilliant idea that is flexible enough to not make us need a full query language. I hope I am just an Architecture Astronaut.

The Other Problem: Different Backends

There are lots of interested parties in the Zeitgeist universe. This is wonderful, but it also complicates matters a bit because there are lots of different agendas. I already mentioned Tracker, CouchDB, Midgard, and Soprano (Nepomuk). Of course we also have our own backend based on SQLite.

Tracker and Soprano can be queried with SPARQL using the Nepomuk ontologies, but that is also where the similarities between all of the above end. At least from my limited reading.

  • CouchDB uses predefined “views” that must be known a priori, and it is not very good at accommodating widely varying queries. Maybe some Couch wizards can elaborate on what kind of querying one can do. And don’t get me wrong – I think the way data is queried in Couch is extremely elegant!
  • Midgard appears to expose a query building framework without exposing the actual query language. I like this approach because it gives flexibility to both the client and the server.

Ideas

I have a small collection of ideas that address many of the issues I’ve raised. I am not really happy about any of them, but I give them here anyway.

  • Monolithic Log – The idea here is to store all relevant item data right inside each log statement. That way we will not have to do the “JOIN” with the Repository I lamented earlier. Everything will be right here in the log. The problems this approach drags along are an increased size of the log file and that the application submitting the log entry would have to know all about the subject of the logged event (that, or the Zeitgeist engine needs to look it up before it commits to the log).
  • DBus Query Builder – A way to remove the need for a full query language would be to expose some query building framework in the API. This would not be a problem if the query building were done locally by an in-process library, but we are exposing a DBus interface here. Building queries will result in lots of DBus round trips and this is really bad (especially on lighter devices). To remedy this one could provide a “prepared statement”-like interface where the apps build query templates before they need them and then submit the parameters to these templates when they perform the actual query. This will bring some bookkeeping for both clients and server. The client side could probably be handled fairly gracefully by a library though.
  • Log File With Index – The idea here is simply to build up a semantic log file without really having a database or anything like it to store it in. Use a gzip-compressed text file, appending JSON or Turtle, much inspired by the Metadata on Removable Devices Spec. This will of course not result in something we can query efficiently (updating such a structure will be very fast and light though). Alternatively this log can be stored in a CouchDB backend and you can have it replicated among all your workstations. To make it “queryable” this log would be indexed by Tracker, Strigi, whatever, or some Zeitgeist native SQLite.
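The Log File With Index idea is the easiest of the three to sketch. Below is a rough Python version, assuming one JSON object per line in a gzip file and a throwaway SQLite index rebuilt from it. The file name, the event fields (`timestamp`, `uri`, `action`) and both function names are invented for the example.

```python
import gzip
import json
import os
import sqlite3
import tempfile


def append_event(log_path, event):
    """Append one event as a JSON line ('ab' adds a new gzip member)."""
    with gzip.open(log_path, "ab") as log:
        log.write((json.dumps(event) + "\n").encode("utf-8"))


def build_index(log_path, conn):
    """Rebuild a disposable SQLite index from the whole log file."""
    conn.execute("DROP TABLE IF EXISTS event")
    conn.execute("CREATE TABLE event (timestamp INTEGER, uri TEXT, action TEXT)")
    with gzip.open(log_path, "rt", encoding="utf-8") as log:
        rows = [(e["timestamp"], e["uri"], e["action"])
                for e in map(json.loads, log)]
    conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX event_by_time ON event (timestamp)")


log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl.gz")
append_event(log_path, {"timestamp": 100, "uri": "file:///a.odt", "action": "opened"})
append_event(log_path, {"timestamp": 200, "uri": "file:///b.png", "action": "opened"})

index = sqlite3.connect(":memory:")
build_index(log_path, index)
latest = index.execute(
    "SELECT uri FROM event ORDER BY timestamp DESC LIMIT 1").fetchone()
print(latest)
```

The log stays the source of truth and the index can be deleted and rebuilt at any time. Writing one gzip member per event compresses poorly, so a real implementation would presumably batch its appends.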

These ideas can actually all be combined, and together they solve almost all of the problems I’ve discussed. However, there are almost certainly better solutions lurking out there on the internet – in the back of the minds of young aspiring hackers (or veterans!). Perhaps you, dear reader?

Or maybe we can actually do fine with what we already have and I just pulled you through a long and elaborate blog post wasting your time?

Fin!

Sorry if this seemed like a long incomprehensible rambling. There are almost certainly loose ends and incomplete explanations. Do ask me to elaborate – I will respond :-)