Sandboxing your data in Tracker

For a while now, I’ve wanted to keep my personal and test data separate, and to have the stable and unstable versions of the Tracker project installed in parallel for maintenance reasons. Tracker is a search engine and metadata database: it indexes your data and provides several powerful APIs to query and update it. Sadly, if you want to maintain different data stores, i.e. discrete data sets, you currently have to use a different login username or spend time setting up environment variables and a separate IPC (D-Bus) session. Parallel installations work, but running parallel Tracker binaries causes problems because they share the same IPC session.
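
To give an idea of the busywork involved, this is roughly what a manual setup looks like. Treat it as a sketch: the exact variables Tracker honours depend on the version, but its databases live under the XDG base directories, so redirecting those and starting a private D-Bus session is the core of it:

# Point Tracker's storage at a throw-away location.
export XDG_DATA_HOME=/tmp/tracker-test/data
export XDG_CACHE_HOME=/tmp/tracker-test/cache
export XDG_CONFIG_HOME=/tmp/tracker-test/config

# Start a private D-Bus session so the sandboxed daemons don't
# clash with the ones in your real login session.
eval $(dbus-launch --sh-syntax)

# ...start the Tracker daemons and run queries here...

# Tear the private session down when done.
kill $DBUS_SESSION_BUS_PID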

Not long ago, one of the Tracker project maintainers (Sam Thursfield) started a shell script to sandbox Tracker. For those unfamiliar with sandboxing, it provides a mechanism for separating running programs from the rest of the system (i.e. what I’ve wanted to do for some time with Tracker).

I decided to work on a Python script to sandbox Tracker’s data and runtime operation at about the same time Sam’s work was publicized, so I held off initially in order to incorporate much of the work he had done and to extend it.

What can you do?

The idea here was to be able to use the script to create the entire IPC session, start the binaries and perform an operation before finishing and returning to the prompt. An operation could be ‘updating the index’, ‘running a query’ or ‘listing files in the index’ (which is technically just running a query too). The script was written so that it could be used inside other scripts and be non-persistent; a sketch of that follows the option summary below. What became clear early on was that I wanted to be able to drop to a shell too.

This is what you can do so far:

$ tracker-sandbox.py --help
Usage: 
  tracker-sandbox.py -i <DIR> -c <DIR> [OPTION...] - Localized Tracker sandbox for content indexing and search

Options:
  -h, --help            show this help message and exit
  -v, --version         show version information
  -d, --debug           show additional debugging
  -p PATH, --prefix=PATH
                        use a non-standard prefix (default="/usr")
  -i DIR, --index=DIR   directory storing the index
  -c DIR, --content=DIR
                        directory storing the content which is indexed
  -u, --update          update index/database from content
  -l, --list-files      list files indexed
  -s, --shell           start a shell with the environment set up
  -q CRITERIA, --query=CRITERIA
                        what content to look for in files
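
Because the script sets everything up and tears it all down on exit, it works fine inside other scripts. A hypothetical example (the wrapper itself is mine, but it only uses the options listed above):

#!/bin/sh
# Rebuild a scratch index from ~/Downloads, then list what it contains.
tracker-sandbox.py -p /usr/local -i /tmp/tracker -c ~/Downloads -u
tracker-sandbox.py -p /usr/local -i /tmp/tracker -l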

Creating your index

Runtime

The classic case is that you’ve got a system-installed version of the project in /usr and want to use a /usr/local (or other prefix) based installation which you just built.
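
Under the hood, switching prefixes amounts to adjusting the usual lookup paths before anything is spawned. The exact set of variables the script exports is an implementation detail, but it boils down to something like this:

# Prefer binaries, libraries and data from the /usr/local build.
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export XDG_DATA_DIRS=/usr/local/share:$XDG_DATA_DIRS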

Data

Before you can do anything, you need to create an index based on the content you want indexed. For now, let’s use the XDG location for downloads, which tends to be $HOME/Downloads on most distributions.
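
If you’re not sure where that is on your system, the xdg-user-dirs tool will tell you:

$ xdg-user-dir DOWNLOAD
/home/martyn/Downloads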

This is what we get (note you can see a lot more output with the -d or --debug option):

$ tracker-sandbox.py -p /usr/local -i /tmp/tracker -c ~/Downloads -u
Index now up to date!
Using query to check index has data in it...
  Currently 79 file(s) exist in our index

Using your index

As mentioned above, there are usually two approaches you want: either run a standalone command to query your data, or start a shell to do more complex operations. You can do both of these just by changing the last command line argument:

In this example, we start a shell and count how many files we have in our index. We then exit and run the same query outside the sandbox to show that we’re querying two entirely different databases:

$ tracker-sandbox.py -p /usr/local -i /tmp/tracker -s
Starting shell... (type "exit" to finish)

$ tracker-sparql -q 'select count(?f) { ?f a nfo:FileDataObject }'
Results:
  79

$ exit
exit

$ tracker-sparql -q 'select count(?f) { ?f a nfo:FileDataObject }'
Results:
  687

In this example we show how you can run a query without needing a shell first:

$ tracker-sandbox.py -p /usr/local -i /tmp/tracker -q foo
Found:
  file:///home/martyn/Downloads/foo.jpg
$

What’s next?

For now, this solves a testing problem for the maintainers, but I hope it also lets others make use of discrete data sets more independently. It’s entirely possible to have separate databases for your source code repositories, pictures, movies and music.
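
As a sketch of that idea (the index locations and search term here are just examples):

$ tracker-sandbox.py -p /usr/local -i ~/indexes/music -c ~/Music -u
$ tracker-sandbox.py -p /usr/local -i ~/indexes/pictures -c ~/Pictures -u
$ tracker-sandbox.py -p /usr/local -i ~/indexes/music -q mozart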

If you like anything you’ve seen here and have any comments or suggestions on how we can improve this script, feel free to comment here or drop us a line on the Tracker mailing list.

You can find the sandbox in Tracker’s master branch here, and if you need more Tracker expertise, don’t hesitate to contact us.

Thanks for reading! 🙂

4 comments on “Sandboxing your data in Tracker”
  1. Hmm, although it’s a very neat and interesting hack, it’s my opinion that this shouldn’t be done with scripts and environment variables.

    Instead it could or should be built into libtracker-sparql (of which I think tracker-store should be a libexec helper and an implementation detail of the standalone library).

    I also think such a libtracker-sparql should allow custom ontologies per consumer of libtracker-sparql. Meaning that what the Tracker project is today would have its equivalent in a combination of [nepomuk-desktop-ontologies + (libtracker-sparql + tracker-store)] + [tracker’s FS miner having a dependency on nepomuk-desktop-ontologies and libtracker-sparql + (libtracker-extract + tracker-extract)].

    • Interestingly, I remember you not favouring environment variables over the years. I kind of like them; they allow a lot of flexibility. Each to their own.

      The store is currently in libexec, so I’m not sure what you mean to change there. It is also only used by extension of calls through libtracker-sparql, so what more did you mean to add here?

      I do think that having components depend on an ontology is an interesting idea – e.g. we need components X, Y and Z for MP3 extractors and A, B and C for the tracker-miner-fs. Right now, we have companies/people using Tracker and customising its ontology without knowing exactly what needs to be kept. Making this link stronger would indeed help a lot of people who want a lighter-weight database schema.

  2. Sam Thursfield says:

    Philip … I agree. But this way is a lot easier as a starting point 🙂

    Supporting multiple stores in libtracker-sparql might also create the expectation that there is some way of executing queries across multiple stores, which of course there isn’t.

    • I think you have a good point too – I prefer these sorts of details to be part of what’s possible rather than in the spotlight (no pun intended) 🙂 otherwise people start asking all sorts of questions about other data stores, security, combined searches, etc., as you say, Sam.

      Thanks for the comments guys!
