S4 development

August 16, 2010

The End

Filed under: GSoC — Sivert Berg @ 19:01

GSoC is coming to an end, and it is time to wrap it all up and show the results of a summer of code. Since the last post I have implemented the Collections 2.0 query concept, with some minor changes. I have also written a simple command-line tool for S4 to query, add to, and delete from S4 databases without using XMMS2. You can read more about Collections 2.0 here, and about S4 and the S4 CLI tool here.

If you want to try out the new code you can check out git://git.xmms.se/xmms2/xmms2-cippo.git (the master branch). On the first run it should convert your old SQLite database into a new S4 database automatically. As the code has not seen extensive testing, I would recommend that you back up your SQLite database before using the new code.

For client developers interested in the new query system I recommend you take a look at this. There is also a simple client using the new query API in the s4coll2 branch in the previously mentioned git repository. You can also take a look at how xmmsc_coll_query_ids and xmmsc_coll_query_infos are implemented, as they use the new xmmsc_coll_query function underneath.

Before logging off and starting to think about other things than GSoC again, I would just like to say thank you to the good folks at XMMS2 and Google. It has been a really interesting and enjoyable summer of code!


July 14, 2010

Halfway There

Filed under: GSoC — Sivert Berg @ 19:58

Summer is at its peak and GSoCers all over the world are submitting mid-term evaluations. What better time to give a quick status update on S4 and Coll 2.0?

Since the last post S4 has seen some big changes, most of them to make implementing Coll 2.0 possible, or at least easier. It started with midb, which I mentioned in my previous post. midb made the indexes memory only, with only the data saved to disk; this helps to keep the on-disk format simple. Together with a log this provides a pretty reliable database (something the old S4 was not). The other big change was adding source preferences. Source preferences give sources priorities, and the property (or properties) with the highest-priority source is chosen to be matched or fetched. On top of these changes a new query system was fitted, removing confusing functions like s4_entry_contains, s4_entry_contained and similar, and adding just s4_query.

All in all these changes morphed S4 from whatever it was before into a set of entries, where each entry can have a different number of properties. Matching is done on an entry-by-entry basis, and if an entry matches, data is fetched from it. Here’s an example:

Entry 1:
  Key      Value  Source
  song_id  1
  artist   Foo    plugin/id3v2
  title    Bar    plugin/id3v2
  rating   4      client/playa

Entry 2:
  Key      Value               Source
  song_id  2
  artist   Crazy Jazz Band     plugin/id3v2
  title    2 hour jam session  plugin/id3v2
  rating   5                   client/playa

Entry 3:
  Key      Value            Source
  song_id  3
  artist   Generic Artist   plugin/id3v2
  title    Incorrect Title  plugin/id3v2
  title    Correct Title    client/playa
  rating   3                client/playa

Above we see three entries with some properties set. Properties have a key, a value and a source. We see that the first property, song_id, is special; it has no source. This is because song_id is the parent property of the entry. It can be thought of as the key of the entry: no two entries have the same parent property. Now say that the user wants to find the artist of every song with rating >= 4. He would then do something like this (in pseudo-code):

s4_query (fetch = 'artist', condition = 'rating >= 4')

s4_query would then visit each entry, check whether rating exists and is >= 4, and if so return the artist. So the above query would give the result set {“Foo”, “Crazy Jazz Band”}. Now say the user wanted to query the title of every song in his library. He would then do something like this:

s4_query (fetch = 'title', condition = 'everything', sourcepref = 'client/*:plugin/*:*')

We see that this time the user provided a source preference too. This sourcepref is used when choosing which data to fetch. s4_query would again visit every entry; the condition matches all of them, so it fetches the title of each. For the first two entries this is simple, but the last one has two properties with the key ‘title’. Which one to choose? Looking at the sourcepref we see that sources matching ‘client/*’ come before ‘plugin/*’, so s4_query picks the one set by ‘client/playa’ and returns the result set {“Bar”, “2 hour jam session”, “Correct Title”}.
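To make the walkthrough above concrete, here is a small Python sketch of the same idea: an in-memory list of entries, a condition checked entry by entry, and a source preference used to pick between properties sharing a key. This is only an illustration of the concept; the real S4 is a C library and its API looks nothing like this.

```python
from fnmatch import fnmatch

# The three example entries above, as key -> [(value, source), ...].
# The parent property (song_id) has no source, modelled here as None.
entries = [
    {"song_id": [("1", None)],
     "artist": [("Foo", "plugin/id3v2")],
     "title": [("Bar", "plugin/id3v2")],
     "rating": [("4", "client/playa")]},
    {"song_id": [("2", None)],
     "artist": [("Crazy Jazz Band", "plugin/id3v2")],
     "title": [("2 hour jam session", "plugin/id3v2")],
     "rating": [("5", "client/playa")]},
    {"song_id": [("3", None)],
     "artist": [("Generic Artist", "plugin/id3v2")],
     "title": [("Incorrect Title", "plugin/id3v2"),
               ("Correct Title", "client/playa")],
     "rating": [("3", "client/playa")]},
]

def pick(props, sourcepref):
    # Return the value whose source matches the earliest pattern in the
    # colon-separated preference list.
    for pattern in sourcepref.split(":"):
        for value, source in props:
            if source is None or fnmatch(source, pattern):
                return value
    return None

def s4_query(fetch, condition, sourcepref="*"):
    # Visit every entry, keep those matching the condition, fetch `fetch`.
    return [pick(entry[fetch], sourcepref)
            for entry in entries if condition(entry) and fetch in entry]

# fetch = 'artist', condition = 'rating >= 4'
artists = s4_query("artist",
                   lambda e: "rating" in e and int(e["rating"][0][0]) >= 4)
# fetch = 'title', condition = 'everything', preferring client-set values
titles = s4_query("title", lambda e: True, "client/*:plugin/*:*")
```

Running it reproduces both result sets from the text: {“Foo”, “Crazy Jazz Band”} for the rating query, and {“Bar”, “2 hour jam session”, “Correct Title”} for the title query.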

Matching by visiting every entry is of course slower than using an index, but it turns out that in most cases it is Fast Enough (TM). There are however some cases where it is too slow, for example when searching for entries matching a property many times per second, so S4 supports creating indexes on specific property keys. By default it creates an index on parent properties, so in the example above searching on song_id would be fast, but the user can also ask for indexes on other properties. XMMS2 creates indexes on ‘url’ and ‘status’, because they are searched on a lot.
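Conceptually, an index on a property key is just a map from value to the entries carrying that value, turning the per-entry scan into a single lookup. A minimal sketch (all names invented; this is not the S4 API):

```python
# Hypothetical per-key index: value -> list of matching entries.
entries = [
    {"song_id": "1", "url": "file:///a.mp3", "status": "ok"},
    {"song_id": "2", "url": "file:///b.mp3", "status": "missing"},
]

def build_index(entries, key):
    index = {}
    for entry in entries:
        index.setdefault(entry[key], []).append(entry)
    return index

# Like XMMS2's indexes on 'url' and 'status'.
url_index = build_index(entries, "url")
hits = url_index.get("file:///b.mp3", [])   # one lookup, no full scan
```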

We can compare it to the benchmark we ran earlier to show how this new design affects performance:

Query                                                    S4 old                S4 new                Result size
                                                         Avg (µs)   S (µs)     Avg (µs)   S (µs)
“one”                                                    127.7      ~3.7       14812.3    ~369.5     5
“*”                                                      148165.8   ~3209.5    50732.8    ~1476.6    9410
“artist:metallica”                                       3336.5     ~219.5     10192.7    ~272.7     192
“tracknr>20”                                             1871.7     ~132.7     9448.8     ~464.1     107
“tracknr>30”                                             236.1      ~7.5       8913.5     ~143.1     13
“+comment”                                               58894.9    ~2003.7    21995.1    ~589.6     3297
“tracknr:4” OR “artist~foo” AND NOT “artist:weezer”      18785.8    ~638.8     20717.0    ~388.8     776

As we can see, small queries (small result size) suffer a big slowdown, while large queries see a speedup. This is because fetching is faster with the new code, while checking is slower. Once the result size reaches about 1/10th of the total size, the faster fetching starts to outweigh the slower checking, and the new code comes out ahead.
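The crossover can be illustrated with a toy cost model. The constants below are invented, chosen only so the break-even lands at the 1/10 ratio described above:

```python
# Total query time = per-entry check cost + per-result fetch cost.
def total_time(n_entries, n_results, check_cost, fetch_cost):
    return check_cost * n_entries + fetch_cost * n_results

def new_is_faster(n_entries, n_results):
    # Old code: cheap checks, expensive fetches.  New code: the reverse.
    old = total_time(n_entries, n_results, check_cost=1.0, fetch_cost=45.0)
    new = total_time(n_entries, n_results, check_cost=5.0, fetch_cost=5.0)
    return new < old

# With these constants: 5n + 5r < n + 45r  <=>  r > n / 10,
# i.e. the new code wins once the result set exceeds a tenth of the library.
```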

With the new functionality in place S4 is ready to be used for Coll 2.0. Two weeks ago I started implementing the Coll2 operators, and I’m now just about done with the server code. The collection parser has also been updated to produce the new operators, and nycli has been hacked to compile. The different language bindings still need to be updated to use the new operators. The next thing to do is fix the language bindings (this could possibly break quite a lot of clients, not good) and start looking at the new Coll2 query concept. I had also planned to write a standalone S4 client, like the ‘sqlite3’ command-line application, but it looks like time might run out.

May 31, 2010

Benchmark update

Filed under: Uncategorized — Sivert Berg @ 12:35

It turns out Erik Massop (nesciens) had some code to speed up SQL queries in his repository. Here it is compared to the old SQL code and S4 midb.

UPDATE: Erik pushed some new code, making some queries faster. The queries that saw an improvement are marked in red; the rest performed the same as the old code. One change in the new code is that it keeps the SQLite connection alive instead of creating a new one for every query. The difference is best seen on a short query like “one”:

The old code:

query_infos took: 878
session took:     827
select took:      386

Compared to the new code:

query_infos took: 432
session took:     370
select took:      365

We see that the select takes about the same time, but the whole session takes about twice as long with the old code. For large queries like “*” the difference is less visible:

The old code:

query_infos took: 228048
session took:     228024
select took:      227569

Compared to the new code:

query_infos took: 224097
session took:     224062
select took:      224052

Query                                                    SQL Old               SQL Experiment        S4 midb
                                                         Avg (µs)   S (µs)     Avg (µs)   S (µs)     Avg (µs)   S (µs)
“one”                                                    180115.3   ~5521.6    326.0      ~8.9       127.7      ~3.7
“*”                                                      365540.5   ~7049.7    225482.8   ~1011.7    148165.8   ~3209.5
“artist:metallica”                                       6864.2     ~214.3     3755.7     ~16.7      3336.5     ~219.5
“tracknr>20”                                             30378.6    ~789.9     14682.2    ~214.8     1871.7     ~132.7
“tracknr>30”                                             27624.5    ~1372.5    10996.5    ~23.3      236.1      ~7.5
“+comment”                                               238729.9   ~7028.3    73566.2    ~1553.5    58894.9    ~2003.7
“tracknr:4” OR “artist~foo” AND NOT “artist:weezer”      188794.7   ~4773.8    49245.6    ~1210.9    18785.8    ~638.8

As we can see most queries get a nice speedup. Most notably “one”, which is about 500 (!) times faster. But also “+comment” (3.2 times faster) and that long last query (3.8 times faster) get a nice boost. The only one still slow is “*”, at over 200 milliseconds. S4 is still faster than SQL, but all the SQL queries (with the exception of “*”) now finish in under 100 milliseconds on a moderately sized media library (~10,000 songs).
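The connection reuse mentioned in the update can be sketched with Python's sqlite3 module. The actual XMMS2 code is C and uses the SQLite C API; this only shows the shape of the change, with a trivial stand-in query:

```python
import sqlite3

def query_new_connection_each_time(db_path, sql, n):
    # Old behaviour: open and close a connection for every query.
    results = []
    for _ in range(n):
        conn = sqlite3.connect(db_path)
        results.append(conn.execute(sql).fetchall())
        conn.close()
    return results

def query_shared_connection(conn, sql, n):
    # New behaviour: one long-lived connection, reused for every query.
    return [conn.execute(sql).fetchall() for _ in range(n)]

shared = sqlite3.connect(":memory:")
a = query_new_connection_each_time(":memory:", "SELECT 1", 3)
b = query_shared_connection(shared, "SELECT 1", 3)
```

Both return the same rows; the per-query setup and teardown is what the old code paid on every call, which is why short queries like “one” saw the biggest improvement.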

May 29, 2010

Benchmarks

Filed under: Uncategorized — Sivert Berg @ 18:44

I finally had time to run a few benchmarks. I wanted to measure how the different back-ends compared to each other in query performance, memory consumption and disk usage. The back-ends being compared are SQLite (the current back-end used in xmms2-devel), S4 mmap (the “old” S4 back-end) and S4 midb (a new experimental S4 back-end).

Just to cover my back: I do not have much experience with benchmarks, and I am in no way an objective observer (slight bias towards S4, me?). If you see something that seems wrong, let me know; I probably made a mistake.

Query performance

First we will look at query performance. This is perhaps the factor most visible to the end user. The time measured is the time spent inside xmms_collection_client_query_infos. It was measured by calling gettimeofday at the beginning and end of the function and printing the difference in microseconds. I also used a slightly modified nyxmms2 that uses xmms_coll_query_infos instead of xmms_coll_query_ids.

So, to the results. The table shows the average time used by the query (calculated as \overline{X}=\frac{1}{n}\sum_{i=1}^n X_i, where X_i is the time used on the i-th run) and the sample standard deviation (calculated as S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2, then taking the square root to get S). n was 10 for all benchmarks. The last column is the average time SQLite used divided by the average time S4 midb used.
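The two formulas translate directly into code. A small sketch with invented timings (the real benchmarks used n = 10 runs per query):

```python
import math

def mean(xs):
    # \overline{X} = (1/n) * sum of X_i
    return sum(xs) / len(xs)

def sample_stddev(xs):
    # S^2 = (1/(n-1)) * sum of (X_i - mean)^2, then take the square root
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

times_us = [125.0, 127.0, 131.0]   # hypothetical per-run timings in µs
avg = mean(times_us)
s = sample_stddev(times_us)
```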

Query                                                    SQLite                S4 mmap               S4 midb               SQLite / midb
                                                         Avg (µs)   S (µs)     Avg (µs)   S (µs)     Avg (µs)   S (µs)
“one”                                                    180115.3   ~5521.6    203.2      ~5.2       127.7      ~3.7       ~1410
“*”                                                      365540.5   ~7049.7    286972.0   ~5489.7    148165.8   ~3209.5    ~2.47
“artist:metallica”                                       6864.2     ~214.3     5377.6     ~175.7     3336.5     ~219.5     ~2.06
“tracknr>20”                                             30378.6    ~789.9     2885.1     ~89.4      1871.7     ~132.7     ~16.23
“tracknr>30”                                             27624.5    ~1372.5    454.7      ~77.0      236.1      ~7.5      ~117.0
“+comment”                                               238729.9   ~7028.3    106268.0   ~3277.4    58894.9    ~2003.7    ~4.05
“tracknr:4” OR “artist~foo” AND NOT “artist:weezer”      188794.7   ~4773.8    26407.7    ~793.9     18785.8    ~638.8    ~10.05

As we can see S4 is faster in all cases, ranging from about 2 to over 1,000 times faster. Why “one” is so much slower on SQLite I don’t know; it could be an inefficient SQL query, or something else. Also worth noting is that S4 is mostly bound by the size of the result set: queries with small result sets (“one”, for example) give short query times. I’ll get back to why later.
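The last column can be recomputed from the table itself; a quick check on a few of the rows:

```python
# (SQLite average, S4 midb average) in µs, copied from the table above.
avg_times = {
    "one": (180115.3, 127.7),
    "*": (365540.5, 148165.8),
    "tracknr>20": (30378.6, 1871.7),
}

ratios = {query: sqlite / midb for query, (sqlite, midb) in avg_times.items()}
```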

Memory consumption

To get good performance S4 trades memory for speed. The big question is whether the trade-off is worth it. For now we will settle for finding out exactly how much memory the different back-ends use. Memory consumption was measured with massif, a Valgrind tool. Each back-end was run twice, once just starting up and shutting down, and once with a query for “*” before shutting down. That way I hope to show idle memory usage versus usage when searching. All numbers are in MiB.

SQLite               S4 mmap              S4 midb
Idle       Search    Idle       Search    Idle       Search
10.43      17.77     1.13       10.59     23.79      32.56

As we can see both S4 implementations add about 9 MiB when searching, compared to the 7 MiB SQLite adds. It may also seem like S4 with mmap is the big winner here, but what massif does not show is shared memory, the kind mmap uses. If that had been counted we would have added about 22 MiB to S4 mmap, bringing it up to about the same memory usage as S4 midb. So with a media library of 9,361 entries S4 uses a little over twice the memory SQLite uses at idle.

Disk usage

Finally, the one most people probably will not care about given today’s abundance of disk space: disk usage. It is simply the size of the data file.

SQLite      S4 mmap     S4 midb
41.5 MiB    21.7 MiB    5.4 MiB

Summary

In terms of performance and disk space S4 (especially the midb version) is the clear winner, but it eats up about twice as much memory as SQLite. A fair trade-off? Maybe for a 10,000-entry media library, but what about one with 100,000 entries? We would probably see memory usage around 200 MiB.

A note on S4 query performance

As I said earlier S4’s query speed is mostly bound by the size of the result set. To see why we have to dig a little. Running XMMS2 in Callgrind (another Valgrind tool) is a nice way to find bottlenecks and hotspots. Opening the generated callgrind.out file in KCachegrind reveals a few interesting things:

As we can see xmms_medialib_query_ids (the function calling s4_query) contributes only 3.64% of the running time of the query. The rest of the time is spent fetching the properties (artist, album, …) of the entries found by the query and inserting them into the dictionary we return. We see that _entry_property_get_ is called 28,000 times. The media library this was run on has about 500,000 relationships (a relationship is an entry-to-entry mapping, for example (“songid”, 1) -> (“title”, “One”)). Using B+ leaves with room for 101 entries and a fill rate of about 53% (measured; a bit off the 67% normally assumed for B+ trees), this gives about 10,000 B+ leaves. A simple traversal of the 10,000 leaves would probably be faster than calling _entry_property_get_ 28,000 times. That is an observation well worth taking into account when we design the S4 query layer.
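The leaf-count estimate above is simple arithmetic:

```python
# Numbers from the paragraph above.
relationships = 500_000   # entry-to-entry mappings in the media library
leaf_capacity = 101       # entries per B+ leaf
fill_rate = 0.53          # measured fill rate (67% is the usual assumption)

leaves = relationships / (leaf_capacity * fill_rate)
# roughly 9,300 leaves, i.e. "about 10,000"
```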

May 24, 2010

Hey!

Filed under: Uncategorized — Sivert Berg @ 18:33

Summer is here, and we all know what that means: GSoC and hack time! This year I will try to document the progress here. Who am I? I am a 22-year-old computer science student from Norway attending GSoC for the second time. Once again I was given the chance to work on XMMS2 and S4. It is off to a slow start because of some exams, but after June 4th it is full speed ahead. As you may know, my project this year involves developing a new database backend called S4 for XMMS2. The project will be mentored by Sébastien Cevey. Some of the goals of this year’s project are:

  • Proper support for source preferences
  • A standalone S4 query layer
  • Coll 2.0 support

To keep track of time we have set up some milestones with associated deadlines. The first one, coming up on June 6th, is to have a benchmark set up so we can track performance against the old database, but also against S4 itself as we make changes, to see if they are worth it.

I cheated a little and started early. Over the last week I have been trying a new approach in S4: instead of storing the index data structures on disk as before, we build them on startup. Startup takes a little longer (about 0.5 seconds on a media library with 10,000 entries), but otherwise it seems to perform better than or on par with the old mmap-based code. With this new approach we will hopefully get a more reliable database, a smaller on-disk file and some performance improvements. A more thorough analysis will follow when I have time to run some proper benchmarks.

Happy hacking everyone!
