S4 development

August 16, 2010

The End

Filed under: GSoC — Sivert Berg @ 19:01

GSoC is coming to an end, and it is time to wrap it all up and show the results of a summer of code. Since the last post I have implemented the Collections 2.0 query concept with some minor changes. I have also written a simple command-line tool for S4 to query, add to and delete from S4 databases without using XMMS2. You can read more about Collections 2.0 here and S4 and the S4 cli tool here.

If you want to try out the new code you can check out git://git.xmms.se/xmms2/xmms2-cippo.git (the master branch). On the first run it should convert your old sqlite database into a new S4 database automatically. As the code has not seen extensive testing, I would recommend that you back up your sqlite database before using the new code.

For client developers interested in the new query system I recommend you take a look at this. There is also a simple client using the new query API in the s4coll2 branch in the previously mentioned git repository. You can also take a look at how xmmsc_coll_query_ids and xmmsc_coll_query_infos are implemented, as they use the new xmmsc_coll_query function underneath.

Before logging off and starting to think about other things than GSoC again, I would just like to say thank you to the good folks at XMMS2 and Google. It has been a really interesting and enjoyable summer of code!

July 14, 2010

Halfway There

Filed under: GSoC — Sivert Berg @ 19:58

Summer is at its peak and GSoCers all over the world are submitting mid-term evaluations. What better time to give a quick status update on S4 and Coll 2.0? Since the last post S4 has seen some big changes, most of them to make implementing Coll 2.0 possible or easier. It started with midb, which I mentioned in my previous post. midb made indexes memory-only, so only the data is saved to disk. This helps keep the on-disk format simple. Together with a log, this provides a pretty reliable database (something the old S4 was not). The other big change was adding source preferences. Source preferences give sources priorities, and the property (or properties) with the highest-priority source is chosen to be matched or fetched. On top of these changes a new query system was fitted, removing confusing functions like s4_entry_contains, s4_entry_contained and similar, and adding just s4_query. All in all, those changes morphed S4 from whatever it was into a set of entries, where each entry can have a different number of properties. Matching is done on an entry-by-entry basis, and if an entry matches, data is fetched from it. Here’s an example:

Key      Value                Source
song_id  1
artist   Foo                  plugin/id3v2
title    Bar                  plugin/id3v2
rating   4                    client/playa

Key      Value                Source
song_id  2
artist   Crazy Jazz Band      plugin/id3v2
title    2 hour jam session   plugin/id3v2
rating   5                    client/playa

Key      Value                Source
song_id  3
artist   Generic Artist       plugin/id3v2
title    Incorrect Title      plugin/id3v2
title    Correct Title        client/playa
rating   3                    client/playa

Above we see three entries with some properties set. Each property has a key, a value and a source. The first property, song_id, is special: it has no source. This is because song_id is the parent property of the entry. It can be thought of as the key of the entry; no two entries have the same parent property. Now say that the user wants to find the artist of every song with rating >= 4. He would then do something like this (in pseudo-code):

s4_query (fetch = 'artist', condition = 'rating >= 4')

s4_query would then visit each entry, check whether rating exists and is >= 4, and if so return the artist. The above query would therefore produce the result set {“Foo”, “Crazy Jazz Band”}. Now say the user wanted to query the title of every song in his library. He would then do something like this:
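To make the entry-by-entry matching concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not the S4 C API: the names ENTRIES, query and rating_at_least are made up for illustration, and each entry is simply a list of (key, value, source) tuples.

```python
# Toy model of the three example entries; this is NOT the S4 C API.
# Each entry is a list of (key, value, source) properties.
ENTRIES = [
    [("song_id", 1, None), ("artist", "Foo", "plugin/id3v2"),
     ("title", "Bar", "plugin/id3v2"), ("rating", 4, "client/playa")],
    [("song_id", 2, None), ("artist", "Crazy Jazz Band", "plugin/id3v2"),
     ("title", "2 hour jam session", "plugin/id3v2"),
     ("rating", 5, "client/playa")],
    [("song_id", 3, None), ("artist", "Generic Artist", "plugin/id3v2"),
     ("title", "Incorrect Title", "plugin/id3v2"),
     ("title", "Correct Title", "client/playa"),
     ("rating", 3, "client/playa")],
]


def rating_at_least(minimum):
    """Build a condition: the entry has a rating property >= minimum."""
    def check(entry):
        return any(k == "rating" and v >= minimum for k, v, _ in entry)
    return check


def query(entries, fetch, condition):
    """Visit every entry; if the condition matches, fetch the requested key."""
    results = []
    for entry in entries:
        if condition(entry):
            results.extend(v for k, v, _ in entry if k == fetch)
    return results


artists = query(ENTRIES, "artist", rating_at_least(4))
print(artists)
```

The third entry has rating 3, so it fails the condition and its artist is never fetched.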

s4_query (fetch = 'title', condition = 'everything', sourcepref = 'client/*:plugin/*:*')

We see that the user this time provided a source preference too. This sourcepref will be used when choosing what data to fetch. s4_query would again visit every entry. The condition matches all of them, so it would fetch the title from each. For the first two it’s simple, but the last one has two properties with the key ‘title’. Which one to choose? If we look at the sourcepref we see that sources matching ‘client/*’ come before ‘plugin/*’, so it would pick the one set by ‘client/playa’, and s4_query would return the result set {“Bar”, “2 hour jam session”, “Correct Title”}.
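The source preference step can be sketched the same way. This is a hedged Python model, not S4's implementation: it assumes the sourcepref is a colon-separated list of glob patterns ranked left to right, uses Python's fnmatch for the matching, and the names source_rank and fetch_preferred are invented here.

```python
from fnmatch import fnmatchcase

# Toy entries as lists of (key, value, source); NOT the S4 C API.
ENTRIES = [
    [("song_id", 1, None), ("title", "Bar", "plugin/id3v2")],
    [("song_id", 2, None), ("title", "2 hour jam session", "plugin/id3v2")],
    [("song_id", 3, None), ("title", "Incorrect Title", "plugin/id3v2"),
     ("title", "Correct Title", "client/playa")],
]

SOURCEPREF = "client/*:plugin/*:*"


def source_rank(source, sourcepref):
    """Position of the first pattern matching the source (lower is better)."""
    patterns = sourcepref.split(":")
    for rank, pattern in enumerate(patterns):
        if fnmatchcase(source or "", pattern):
            return rank
    return len(patterns)  # matched nothing: lowest priority


def fetch_preferred(entry, key, sourcepref):
    """Return the value(s) of `key` set by the highest-priority source."""
    ranked = [(source_rank(s, sourcepref), v) for k, v, s in entry if k == key]
    if not ranked:
        return []
    best = min(rank for rank, _ in ranked)
    return [v for rank, v in ranked if rank == best]


titles = [v for entry in ENTRIES
          for v in fetch_preferred(entry, "title", SOURCEPREF)]
print(titles)
```

For the third entry, ‘client/playa’ matches the first pattern while ‘plugin/id3v2’ only matches the second, so only “Correct Title” is fetched.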

Matching by visiting every entry is of course slower than using an index, but it turns out in most cases it is Fast Enough (TM). There are however some cases where it is too slow, for example when searching for entries matching a property many times per second, and S4 therefore supports creating indexes on specific property-keys. By default it creates an index on parent properties, so in the example above searching on song_id would be fast, but the user can also specify other properties to make an index on. XMMS2 creates indexes on ‘url’ and ‘status’, because they are searched on a lot.
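The difference between scanning and using an index on a property key can be illustrated with a small Python sketch. It is only a model of the idea, not S4's actual index code: the entries, the url values, the "server" source and the function names are all invented for the example.

```python
from collections import defaultdict

# Toy entries as lists of (key, value, source); the url values and the
# "server" source are made up for illustration.
ENTRIES = [
    [("song_id", 1, None), ("url", "file:///music/a.mp3", "server")],
    [("song_id", 2, None), ("url", "file:///music/b.mp3", "server")],
    [("song_id", 3, None), ("url", "file:///music/c.mp3", "server")],
]


def scan(entries, key, value):
    """Slow path: visit every entry and check its properties."""
    return [e for e in entries
            if any(k == key and v == value for k, v, _ in e)]


def build_index(entries, key):
    """Map each value of `key` to the entries that carry it."""
    index = defaultdict(list)
    for entry in entries:
        # A set avoids listing an entry twice when two sources set the
        # same value for the key.
        for value in {v for k, v, _ in entry if k == key}:
            index[value].append(entry)
    return index


URL_INDEX = build_index(ENTRIES, "url")

by_scan = scan(ENTRIES, "url", "file:///music/b.mp3")
by_index = URL_INDEX.get("file:///music/b.mp3", [])
print(by_scan == by_index)
```

Both paths return the same entries; the index pays its cost once when it is built, after which each lookup is a single dictionary access instead of a pass over every entry.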

We can compare it to the benchmark we ran earlier to show how this new design affects performance:

                                                     S4 old                  S4 new
Query                                                Avg (µs)    S (µs)      Avg (µs)    S (µs)      Result size
“one”                                                127.7       ~3.7        14812.3     ~369.5      5
“*”                                                  148165.8    ~3209.5     50732.8     ~1476.6     9410
“artist:metallica”                                   3336.5      ~219.5      10192.7     ~272.7      192
“tracknr>20”                                         1871.7      ~132.7      9448.8      ~464.1      107
“tracknr>30”                                         236.1       ~7.5        8913.5      ~143.1      13
“+comment”                                           58894.9     ~2003.7     21995.1     ~589.6      3297
“tracknr:4” OR “artist~foo” AND NOT “artist:weezer”  18785.8     ~638.8      20717.0     ~388.8      776

As we can see, small queries (small result size) have a big slowdown, while large queries have a speedup. This is because fetching is faster with the new code, while the checking is slower. Once the result size reaches about 1/10th of the total size, fetching starts to outweigh checking and the new code becomes faster.
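The 1/10th rule of thumb can be checked against the benchmark numbers with a few lines of Python. The averages and result sizes are copied from the table above; the one assumption is that “*” matches every entry, so its result size (9410) is taken as the total library size.

```python
# (query, old avg in µs, new avg in µs, result size), copied from the table.
ROWS = [
    ("one", 127.7, 14812.3, 5),
    ("*", 148165.8, 50732.8, 9410),
    ("artist:metallica", 3336.5, 10192.7, 192),
    ("tracknr>20", 1871.7, 9448.8, 107),
    ("tracknr>30", 236.1, 8913.5, 13),
    ("+comment", 58894.9, 21995.1, 3297),
    ("tracknr:4 OR artist~foo AND NOT artist:weezer", 18785.8, 20717.0, 776),
]

# Assumption: "*" matches every entry, so its result size is the library size.
TOTAL = 9410


def summarize(rows, total):
    """Return {query: (fraction of library returned, old time / new time)}."""
    return {q: (size / total, old / new) for q, old, new, size in rows}


SUMMARY = summarize(ROWS, TOTAL)
for q, (fraction, speedup) in SUMMARY.items():
    print(f"{q:<46} {fraction:6.1%} of entries  old/new = {speedup:5.2f}")
```

Under that assumption, only the two queries that return more than a tenth of the library have an old/new ratio above 1. The 776-result query, at roughly 8% of the library, sits close to break-even.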

With the new functionality in place S4 is ready to be used for Coll 2.0. Two weeks ago I started implementing the Coll2 operators, and I’m now just about done with the server code. The collection parser has also been updated to produce the new operators, and nycli has been hacked to compile. The different language bindings still need to be updated to use the new operators. The next thing to do is fix the language bindings (this could possibly break quite a lot of clients, not good) and start to look at the new Coll2 query concept. I had also planned to write a standalone S4 client, like the ‘sqlite3’ command-line application, but it looks like time might run out.
