Zeitgeist API Ramblings
I’ve been spending a lot of my brain cycles lately thinking about how to design the Zeitgeist DBus API properly. Let me tell you that it is a tough nut to crack – that, or I am not very good at this stuff.
Please note that this is merely a personal brain dump. I have not discussed this with the other Zeitgeist developers yet.
Zeitgeist Object Model
Before I can get to the real problem we have to be on the same page regarding the object model employed by Zeitgeist.
Here’s the deal. First chant with me: “Zeitgeist is an event logging framework”. For Zeitgeist to provide a useful API it needs to know about more things than just Events. It needs to know some minimal stuff about the Items that make up your desktop (files, emails, contacts, online services, etc.). It also needs to consider that these Items can have Annotations such as tags, ratings, comments, bookmarks, and so on. Behold my graphical superiority:
As it stands both Annotations and Events are considered subclasses of Item in an Object Oriented sense. This means that we can have Annotations on Annotations and Events on Annotations (or Events on Events!).
I am definitely of the opinion that Annotations should be first class Items in their own right; however, I am not sure about Events. Whether Events should be Items or not is mostly a technical question. From an ideological POV I think it is great if everything inherits from Item (more flexibility – yay!). The catch is that we can cut down on the DB size if Events don’t subclass Item – hence the question marks on the label.
You may be able to grasp the Zeitgeist data model better if you read the database design spec.
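To make the inheritance idea above concrete, here is a minimal Python sketch of the object model. It is not the actual Zeitgeist code: the class layout follows the description above, but the field names (`uri`, `subjects`, `timestamp`) and the example URIs are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Anything on the desktop: a file, an email, a contact, ..."""
    uri: str  # hypothetical identifier field


@dataclass
class Annotation(Item):
    """A tag, rating, comment or bookmark: an Item that points at other Items."""
    subjects: List[Item] = field(default_factory=list)


@dataclass
class Event(Item):
    """Something that happened to one or more Items at some point in time."""
    timestamp: int = 0
    subjects: List[Item] = field(default_factory=list)


report = Item(uri="file:///home/me/report.odt")
tag = Annotation(uri="tag://work", subjects=[report])
# Because an Annotation is itself an Item, it can be annotated in turn...
comment = Annotation(uri="comment://1", subjects=[tag])
# ...and an Event can have an Annotation (or another Event) as its subject.
tagged = Event(uri="event://1", timestamp=1249134120, subjects=[tag])
```

If Events did not subclass Item, the last line would need its own identifier scheme and Events could no longer appear in anyone’s `subjects` list, which is the flexibility being traded against DB size.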
The Problem: Querying
So we want to utilize this rich event log to query for interesting relations between Items (and by Item I will also mean any subclass thereof). Listing the most recently used tags would be a simple use case. The same application would probably also be interested in listing all the existing tags – or, slightly more advanced, all tags on some specific subset of Items such as all files (and probably a lot of more complex things too).
The current Zeitgeist can do this, no problem. There is however the subtle problem that Zeitgeist is not meant to manage your tags. Zeitgeist is an Event Log – recall?
So assume that Item metadata and Annotations are stored elsewhere, in Tracker, CouchDB, Midgard, Soprano, Ikea or wherever you want. Where they should be. Let me call this sacred silo of user data the Repository. Applications will generally be interested in doing what amounts to an SQL JOIN over the Repository and the Zeitgeist event log. That is, cross referencing both and selecting subsets of data from both based on the relations between them.
If the Repository and Zeitgeist can’t cooperate in some way (e.g. by being in the same process with access to the same resources), these “JOINs” can only be accomplished by fetching broad selections of data from the Repository and having Zeitgeist filter out the relevant parts before it returns the data to you. This will perform like crap.
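The difference between the two situations can be sketched with SQLite from the Python standard library. Everything here is an assumption for illustration: the two tables merely stand in for the Repository and the event log, and the table names, columns and sample rows are made up. The use case is the “most recently used tags” one from above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Stand-in for the Repository: which tag sits on which item.
    CREATE TABLE tags (tag TEXT, item_uri TEXT);
    -- Stand-in for the Zeitgeist event log.
    CREATE TABLE events (item_uri TEXT, timestamp INTEGER);
    INSERT INTO tags VALUES
        ('work', 'file:///report.odt'),
        ('fun', 'file:///holiday.jpg'),
        ('old', 'file:///taxes-2007.ods');
    INSERT INTO events VALUES
        ('file:///report.odt', 300),
        ('file:///holiday.jpg', 200),
        ('file:///report.odt', 100);
""")


def recent_tags_joined(limit):
    """Both stores in one place: a single JOIN does the cross referencing."""
    rows = db.execute(
        """SELECT t.tag, MAX(e.timestamp) AS last_used
           FROM tags t JOIN events e ON e.item_uri = t.item_uri
           GROUP BY t.tag ORDER BY last_used DESC LIMIT ?""",
        (limit,),
    )
    return [tag for tag, _ in rows]


def recent_tags_separate(limit):
    """Separate stores: fetch every tag, then ask the log about each item."""
    all_tags = db.execute("SELECT tag, item_uri FROM tags").fetchall()
    last_used = {}
    for tag, uri in all_tags:
        row = db.execute(
            "SELECT MAX(timestamp) FROM events WHERE item_uri = ?", (uri,)
        ).fetchone()
        if row[0] is not None:  # items that never appear in the log are dropped
            last_used[tag] = max(last_used.get(tag, 0), row[0])
    return sorted(last_used, key=last_used.get, reverse=True)[:limit]
```

Both functions return the same answer, but the second one pulls the whole tag table across the boundary and issues one log lookup per row. With a real Repository on the other side of DBus, each of those steps would be a round trip.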
Raising the Questions
- Separation – How should we separate the Repository and the event log? If this even makes sense to do. To me they are two very different things.
- API – Should Zeitgeist expose what it knows about the Repository as some “weak Repository API”, essentially acting as a proxy? This could make sense so that you might be able to run without a full blown Repository, using only the limited Repository functionality that Zeitgeist can provide natively.
- Query Language vs. API – How does one query the Repository+Log? Currently we have a nifty API utilising some “filters” you pass to the methods to limit what types of data you query. My playing around with this says that it is simply not powerful enough. It will power the ideas we have now, but what about next year? One really needs the power of a query language to reap the fruits of the event log. That obviously raises the question of which query language to expose:
- SPARQL – Powerful, but a very hard (read: impossible) requirement if we want alternative backends.
- MQL – JSON based query language of the Freebase project. The json-glib package should make this easier.
- Xesam Query Language – Simple (the question is if it is too simple?). Has libs for Python, C/GObject, C++/Qt.
- SQL – A simple subset of SQL should be fairly easy to shoehorn on top of most backends. Should be easy to parse too.
For most of these languages we’d only need to support a subset of the full language.
To be honest I hope someone comes up with a brilliant idea that is flexible enough to not make us need a full query language. I hope I am just an Architecture Astronaut.
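To illustrate what a filter-style API can and cannot express, here is a small Python sketch. It is not the real Zeitgeist API: the function names, the event fields and the “dict of allowed values” filter format are all hypothetical.

```python
def matches(event, flt):
    """True if every field named in the filter has one of the allowed values."""
    return all(event.get(key) in allowed for key, allowed in flt.items())


def find_events(events, filters):
    """Return the events matching ANY of the filters, newest first."""
    hits = [e for e in events if any(matches(e, f) for f in filters)]
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)


events = [
    {"uri": "file:///a.txt", "mimetype": "text/plain",
     "action": "modified", "timestamp": 30},
    {"uri": "file:///b.png", "mimetype": "image/png",
     "action": "created", "timestamp": 20},
    {"uri": "file:///a.txt", "mimetype": "text/plain",
     "action": "created", "timestamp": 10},
]

# "All text files that were created": one filter, two constrained fields.
text_created = find_events(
    events, [{"mimetype": {"text/plain"}, "action": {"created"}}]
)
```

Each filter only ever looks at one event at a time. A question that relates events to each other, such as “which items were used shortly after this one”, cannot be phrased as such a filter, and that is roughly where a query language starts to look tempting.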
The Other Problem: Different Backends
There are lots of interested parties in the Zeitgeist universe. This is wonderful, but it also complicates matters a bit because there are lots of different agendas. I already mentioned Tracker, CouchDB, Midgard, and Soprano (Nepomuk). Of course we also have our own backend based on SQLite.
Tracker and Soprano can be queried with SPARQL using the Nepomuk ontologies, but that is also where the similarities between all of the above end. At least from my limited reading.
- CouchDB uses some predefined “views” that must be known a priori, and it is not very good at accommodating widely varying queries. Maybe some Couch wizards can elaborate on what kind of querying one can do. And don’t get me wrong – I think the way data is queried in Couch is extremely elegant!
- Midgard appears to expose a query building framework without exposing the actual query language. I like this approach because it gives flexibility to both the client and the server.
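The point about CouchDB views having to be known a priori can be sketched in plain Python. This only emulates the idea (a map function run over every document, with the emitted rows kept sorted by key); it does not talk to CouchDB, where map functions are normally written in JavaScript. The document layout and function names are assumptions.

```python
from bisect import bisect_left, bisect_right


def map_by_tag(doc):
    """Map function: emit one (key, value) row per tag on the document."""
    for tag in doc.get("tags", []):
        yield tag, doc["uri"]


def build_view(docs, map_fn):
    """Run the map function over every document and sort the rows by key."""
    return sorted(row for doc in docs for row in map_fn(doc))


def query_view(view, key):
    """Look up all rows with exactly this key in the sorted view."""
    keys = [k for k, _ in view]
    return [v for _, v in view[bisect_left(keys, key):bisect_right(keys, key)]]


docs = [
    {"uri": "file:///report.odt", "tags": ["work", "draft"]},
    {"uri": "file:///holiday.jpg", "tags": ["fun"]},
    {"uri": "file:///budget.ods", "tags": ["work"]},
]

by_tag = build_view(docs, map_by_tag)
work_items = query_view(by_tag, "work")
```

Lookups by tag are cheap because the view is already sorted on that key. A question the view was not built for, say “everything with a given mimetype”, needs a new map function and a new pass over all documents.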
Ideas
I have a small collection of ideas that address many of the issues I’ve raised. I am not really happy about any of them, but I give them here anyway.
- Monolithic Log – The idea here is to store all relevant item data right inside each log statement. That way we will not have to do the “JOIN” with the Repository I lamented about earlier. Everything will be right here in the log. The problems this approach drags along are an increased size of the log file and that the application submitting the log entry would have to know all about the subject of the logged event (that, or the Zeitgeist engine needs to look it up before it commits to the log).
- DBus Query Builder – A way to remove the need for a full query language would be to expose some query building framework in the API. This would not be a problem if the query building were done locally by an in-process library, but we are exposing a DBus interface here. Building queries will result in lots of DBus round trips and this is really bad (especially on lighter devices). To remedy this one could provide a “prepared statement”-like interface where the apps build query templates before they need them and then submit the parameters to these templates when they perform the actual query. This will bring some bookkeeping for both clients and server. The client side could probably be handled fairly gracefully by a library though.
- Log File With Index – The idea here is simply to build up a semantic log file without really having a database or similar to store it in. Use a GZip compressed text file and append JSON or Turtle to it, much inspired by the Metadata on Removable Devices Spec. This will of course not result in something we can query efficiently (updating such a structure will be very fast and light though). Alternatively this log can be stored in a CouchDB backend and you can have it replicated among all your workstations. To make it “queryable” this log would be indexed by Tracker, Strigi, whatever, or some Zeitgeist native SQLite.
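The “Log File With Index” idea is easy to try out with the Python standard library. The sketch below is only one possible reading of it: one JSON object per line in a gzip file, replayed into a throwaway SQLite index. The file name, the event fields and the function names are assumptions, and nothing here follows the Metadata on Removable Devices Spec.

```python
import gzip
import json
import os
import sqlite3
import tempfile


def append_event(log_path, event):
    """Append one event as a JSON line (each append adds a new gzip member)."""
    with gzip.open(log_path, "at", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")


def build_index(log_path):
    """Replay the whole log into an in-memory SQLite index for querying."""
    index = sqlite3.connect(":memory:")
    index.execute(
        "CREATE TABLE events (uri TEXT, action TEXT, timestamp INTEGER)"
    )
    with gzip.open(log_path, "rt", encoding="utf-8") as log:
        for line in log:
            e = json.loads(line)
            index.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (e["uri"], e["action"], e["timestamp"]),
            )
    index.execute("CREATE INDEX by_uri ON events (uri, timestamp)")
    return index


log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl.gz")
append_event(log_path, {"uri": "file:///a.txt", "action": "created", "timestamp": 10})
append_event(log_path, {"uri": "file:///b.png", "action": "created", "timestamp": 20})
append_event(log_path, {"uri": "file:///a.txt", "action": "modified", "timestamp": 30})

index = build_index(log_path)
history = index.execute(
    "SELECT action FROM events WHERE uri = ? ORDER BY timestamp",
    ("file:///a.txt",),
).fetchall()
```

Writing stays cheap because the log is append-only, and the index can be thrown away and rebuilt at any time. Opening and closing the file for every single event compresses poorly, so a long-running daemon would more likely keep the file open and write events in batches.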
These ideas can actually all be combined, and together they would solve almost all of the problems I’ve discussed. However there are almost certainly better solutions lurking out there on the internet – in the back of the minds of young aspiring hackers (or veterans!). Perhaps you, dear reader?
Or maybe we can actually do fine with what we already have and I just pulled you through a long and elaborate blog post wasting your time?
Fin!
Sorry if this seemed like a long incomprehensible rambling. There are almost certainly loose ends and incomplete explanations. Do ask me to elaborate – I will respond.

August 1st, 2009 at 3:42 pm
Well, that’s what we have ontologies for – you know how to query it at design time.
The whole events thingy looks to be too generic to be useful – indeed it subverts an ontology. Different apps could use different names for events, further destroying its usefulness and making it almost impossible to query.
E.g. take a file attachment to an email: for Zeitgeist this might be an event, but for an ontology based system like Tracker it is not – it’s a semantic relationship defined in the onto.
The same could be said for file history, web history etc. These are all semantic relationships that belong in a structured onto and not loosely defined in a table
What Zeitgeist should do IMO is copy what Dashboard did for Beagle, but also store the events fired off in Tracker for a limited time period (say 6 months) so it does not bloat up, and use Tracker to query them. When Tracker goes into GNOME, as it should, Zeitgeist won’t need to worry about other backends.
August 1st, 2009 at 3:52 pm
Pick one backend and stick to it. I would have thought that the one-backend thing was common knowledge by now. But in case it’s not, feel free to figure out why Phonon or wxWidgets are such jokes.
And keep in mind that you will throw away and redesign 3 more times anyway when you try to actually make useful applications from that data. (You might also want to tell that to the Nepomuk and Tracker guys.)
If I were you, I’d go Tracker, because you’ll likely get a bunch of people (read: Tracker hackers) that will implement features just for you, because you’re the only real user of their awesome library. And also because Tracker has a proper idea what “Items” are and how to query them effectively. As opposed to SQL or some such.
August 2nd, 2009 at 2:50 am
I don’t see the structural difference between events and annotations. I think of it as data and metadata, no matter whether it is locations, tags, dates or whatever. I can understand that you separate them if it makes the queries too complex or your backend doesn’t hold date information.
Like Benjamin, I agree with choosing one backend instead of trying to make it generic with an abstraction layer. Abstraction will slow down the development, narrow the possibilities and make everything much more complex than needed.
Tracker also gets my vote since it’s active, specialized for files/program (meta)data and its ontology, and is not like a content repository which used to have the content itself. It’s also close to GNOME and they’re hopefully open to optimizing their API for Zeitgeist.
August 2nd, 2009 at 5:44 am
@Jamie: Zeitgeist is based on a semantic platform – rest assured. Events will adhere to some given Zeitgeist Event Ontology that we will define as the Zeitgeist project grows and we understand the problem scope better.
We will also use Nepomuk for describing all the non-Event objects.
And taking your attachment example – an attachment in itself will not be an event in Zeitgeist. You would have an event for when you sent or received the parent email. I am not sure that Zeitgeist events would go to the granularity of specific attachments, but since we are Nepomuk based this is really something that depends on the apps submitting the events.
August 2nd, 2009 at 5:47 am
@Benjamin: We are already at something like the 3rd rewrite.
They are relatively cheap because everything is in Python (for now).
It is not impossible that we settle on one backend only. My brainstorming is specifically to figure out if there exists a common abstraction that is powerful enough. If there isn’t, then we’ll most likely marry ourselves to whatever fits best.
August 2nd, 2009 at 5:54 am
@Cas: Events and Annotations are similar in the sense that they both have a many-to-many relationship with Items. But that is also where the similarities end.
Of course if we go to the abstraction level of RDF graphs they are alike, but in a more structured data model they are very different.
Zeitgeist Items have metadata such as content type, mimetype, source type (where they are from, e.g. filesystem, web page, email account, …), URL.
Events have such metadata as timestamp, the app emitting the event, what type of event (user action, system notification), what happened (created, modified, deleted, etc.), and some more.
Annotations are really identical to Items just with the extra property that they have a set of subjects.
August 2nd, 2009 at 7:05 am
I have to admit I haven’t looked at Tracker yet (so maybe once I see it I may change my opinion), but at the moment my idea is that once Tracker supports events (so that we can store everything in there and don’t need to keep events in our own database, which wouldn’t work performance-wise given the sort of queries we have) we should switch over to it.
With that approach I’d see Zeitgeist kind of as an “extension” to Tracker, where its functions would be, on one side, the logging of events, and on the other, providing a high-level D-Bus API to easily access events and to get relationships between them (Seif’s magic). This way we get to keep a very simple to use API, but applications needing more advanced stuff (or maybe requiring more fine-grained control over what to fetch for performance reasons) have the chance to query Tracker directly. What are your thoughts on this?
August 2nd, 2009 at 9:36 am
@kamstrup You managed to get me to start reading the CouchDB book. It is an interesting read. I like the views. Instead of a filesystem, a CouchDB might be used as a versioned filesystem with automatic indexing and event logging. Linux is stuck with the POSIX API and CouchDB is a refreshingly simple alternative storage system.
As to where to store all the data, I’d say everything is an event. If I write a file, that is an event, so the file contents should go into the event log. When I receive an email, that’s an event, the content of the mail should be in the same storage that records the event.
Any new event stored should trigger the appropriate analysis calls. These can be map/reduce functions, SPARQL queries and metadata extractors. The results of these all wander in temporary indexes/caches/views that provide the user interface with nice views on the data.
August 2nd, 2009 at 12:10 pm
@Jos – There is something very elegant about CouchDB indeed!
About events – what you are describing is really a version controlled triple store. Store all objects with full history and the event log will be completely redundant. As it stands I am not sure that this is feasible and/or a good idea.
You may have events that are far less tangible than files and emails etc. Consider mouse gestures, system notifications, unlocking the screen after screen lock, approaching the computer with your Bluetooth enabled device, etc.
August 2nd, 2009 at 4:47 pm
We can (and will) help Zeitgeist with their SPARQL query requirements. We can for example implement certain custom FILTER functions for Zeitgeist (if necessary, because if functionality can be done using SPARQL’s standard functions, then those should of course be used).
As many people already voiced here, I don’t think you should go for ‘multiple backends’: it would require some sort of abstraction layer for the query language, or implementing a function that translates SPARQL to the storage’s native query language.
I think that for metadata SPARQL is ~ the language that we, desktop and mobile developers, should standardize on. So just do SPARQL UPDATE to insert your data into the store, use SPARQL to query it, and write custom ontologies if necessary. All three are supported by Tracker’s master, by the way.