Summer is at its peak and GSoCers all over the world are submitting mid-term evaluations. What better time to give a quick status update on S4 and Coll 2.0?

Since the last post S4 has seen some big changes, most of them to make implementing Coll 2.0 possible, or at least easier. It started with midb, which I mentioned in my previous post. midb made indexes memory-only; just the data is saved to disk, which keeps the on-disk format simple. Together with a log this provides a pretty reliable database (something the old S4 was not). The other big change was adding source preferences. Source preferences give sources priorities, and the property (or properties) with the highest-priority source is chosen to be matched or fetched. On top of these changes a new query system was fitted, removing confusing functions like s4_entry_contains and s4_entry_contained and adding just s4_query.

All in all, these changes morphed S4 from whatever it was before into a set of entries, where each entry can have a different number of properties. Matching is done on an entry-by-entry basis, and if an entry matches, data is fetched from it. Here's an example:
| Key     | Value | Source       |
|---------|-------|--------------|
| song_id | 1     |              |
| artist  | Foo   | plugin/id3v2 |
| title   | Bar   | plugin/id3v2 |
| rating  | 4     | client/playa |

| Key     | Value              | Source       |
|---------|--------------------|--------------|
| song_id | 2                  |              |
| artist  | Crazy Jazz Band    | plugin/id3v2 |
| title   | 2 hour jam session | plugin/id3v2 |
| rating  | 5                  | client/playa |

| Key     | Value           | Source       |
|---------|-----------------|--------------|
| song_id | 3               |              |
| artist  | Generic Artist  | plugin/id3v2 |
| title   | Incorrect Title | plugin/id3v2 |
| title   | Correct Title   | client/playa |
| rating  | 3               | client/playa |
Above we see three entries with some properties set. Properties have a key, a value and a source. The first property, song_id, is special: it has no source. This is because song_id is the parent property of the entry. It can be thought of as the key of the entry; no two entries have the same parent property. Now say the user wants to find the artist of every song with rating >= 4. He would then do something like this (in pseudo-code):
s4_query (fetch = 'artist', condition = 'rating >= 4')
s4_query would then visit each entry, check whether rating exists and is >= 4, and if so return the artist. The above query would thus produce the result set {"Foo", "Crazy Jazz Band"}. Now say the user wanted to query the title of every song in his library. He would then do something like this:
s4_query (fetch = 'title', condition = 'everything', sourcepref = 'client/*:plugin/*:*')
This time the user also provided a source preference, which is used when choosing what data to fetch. s4_query would again visit every entry; the condition matches all of them, so the title is fetched from each. For the first two entries this is simple, but the last one has two properties with the key 'title'. Which one to choose? Looking at the sourcepref we see that sources matching 'client/*' come before 'plugin/*', so it picks the one set by 'client/playa', and s4_query returns the result set {"Bar", "2 hour jam session", "Correct Title"}.
Matching by visiting every entry is of course slower than using an index, but it turns out that in most cases it is Fast Enough (TM). There are however some cases where it is too slow, for example when searching for entries matching a property many times per second, so S4 supports creating indexes on specific property keys. By default it creates an index on parent properties, so in the example above searching on song_id would be fast, but the user can also ask for other properties to be indexed. XMMS2 creates indexes on 'url' and 'status', because they are searched on a lot.
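A property index can be pictured as something like the following sketch (again a toy of my own, not S4's actual index code): an array of (value, entry id) pairs kept sorted by value, so a lookup on an indexed key like 'url' is a binary search instead of a scan over every entry.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy index on one property key: (value, entry id) pairs, sorted by
 * value.  Names and layout are illustrative only. */
typedef struct {
    const char *value;
    int entry_id;
} index_entry_t;

static int index_cmp (const void *a, const void *b)
{
    return strcmp (((const index_entry_t *) a)->value,
                   ((const index_entry_t *) b)->value);
}

/* Sort the pairs once.  Since the indexes are memory-only (as with
 * midb) they can simply be rebuilt when the database is loaded. */
static void index_build (index_entry_t *idx, int n)
{
    qsort (idx, n, sizeof *idx, index_cmp);
}

/* O(log n) lookup; returns the entry id, or -1 if the value is not
 * in the index. */
static int index_lookup (const index_entry_t *idx, int n, const char *value)
{
    index_entry_t key = { value, 0 };
    const index_entry_t *hit = bsearch (&key, idx, n, sizeof *idx, index_cmp);
    return hit != NULL ? hit->entry_id : -1;
}
```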
To show how the new design affects performance, we can compare against the benchmark we ran earlier:
| Query | S4 old avg (µs) | S4 old S (µs) | S4 new avg (µs) | S4 new S (µs) | Result size |
|-------|-----------------|---------------|-----------------|---------------|-------------|
| "one" | 127.7 | ~3.7 | 14812.3 | ~369.5 | 5 |
| "*" | 148165.8 | ~3209.5 | 50732.8 | ~1476.6 | 9410 |
| "artist:metallica" | 3336.5 | ~219.5 | 10192.7 | ~272.7 | 192 |
| "tracknr>20" | 1871.7 | ~132.7 | 9448.8 | ~464.1 | 107 |
| "tracknr>30" | 236.1 | ~7.5 | 8913.5 | ~143.1 | 13 |
| "+comment" | 58894.9 | ~2003.7 | 21995.1 | ~589.6 | 3297 |
| "tracknr:4" OR "artist~foo" AND NOT "artist:weezer" | 18785.8 | ~638.8 | 20717.0 | ~388.8 | 776 |
As we can see, small queries (small result size) suffer a big slowdown, while large queries see a speedup. This is because fetching is faster with the new code, while checking is slower. When the result size is around 1/10th of the total size, the fetching starts to outweigh the checking and the new code comes out faster.
With the new functionality in place, S4 is ready to be used for Coll 2.0. Two weeks ago I started implementing the Coll2 operators, and I'm now just about done with the server code. The collection parser has also been updated to produce the new operators, and nycli has been hacked to compile. The different language bindings still need to be updated to use the new operators. The next thing to do is to fix the language bindings (this could possibly break quite a lot of clients, which is not good) and then start looking at the new Coll2 query concept. I had also planned to write a standalone S4 client, like the 'sqlite3' command-line application, but it looks like time might run out.