Filesystem monitoring in the Linux kernel

Evolution

At Lanedo we’ve been working on filesystem monitoring in many contexts, such as Gvfs/GIO development and Tracker, and we are often asked which interfaces the Linux kernel offers for it…

The history behind filesystem monitoring interfaces in Linux can be summarized as follows:

dnotify ⊂ inotify ⊈ fanotify

Let’s look in a bit more detail at what all this is about…

dnotify

dnotify is a directory monitoring system officially released in Linux 2.4.0, which provides a very limited way of interacting with the kernel to get notifications of changes in files inside a given directory.

The dnotify filesystem monitoring tool is implemented as a new F_NOTIFY operation available in the fcntl() system call, and thus it is based on the manipulation of standard file descriptors retrieved with open() calls on existing directories.

dnotify allows registering for different types of events, which are specified as a bit mask passed to the fcntl() call. The types of events that dnotify supports include file accesses, file creations, file deletions, file content modifications, file attribute modifications and file renames within the directory.

Events are notified to user-space via signals raised on the process. By default, dnotify uses the SIGIO signal for that purpose, although it is recommended to use a real-time signal in the [SIGRTMIN, SIGRTMAX] range instead (configurable with the F_SETSIG operation in fcntl()), since real-time signals are queued rather than merged.
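
As a rough sketch of what such a registration looks like, the following uses Python’s fcntl module, which exposes F_NOTIFY, F_SETSIG and the DN_* flags on Linux. The watch_directory helper and its on_change callback are names made up for this illustration. Note also that a Python signal handler does not receive the file descriptor that a C handler can read from siginfo_t, so the sketch assumes one signal number per monitored directory.

```python
import fcntl
import os
import signal

# Events to ask for. DN_MULTISHOT keeps the registration alive after the
# first notification; without it dnotify is one-shot.
DNOTIFY_MASK = (
    fcntl.DN_CREATE | fcntl.DN_DELETE | fcntl.DN_MODIFY
    | fcntl.DN_RENAME | fcntl.DN_ATTRIB | fcntl.DN_MULTISHOT
)


def watch_directory(path, on_change, signum=signal.SIGRTMIN):
    """Register dnotify on `path` and return the directory descriptor.

    `on_change(path)` is called from the signal handler; it only learns
    that *something* changed in the directory, not what or where.
    """
    # Must be called from the main thread (a signal.signal() requirement).
    signal.signal(signum, lambda _signum, _frame: on_change(path))
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    fcntl.fcntl(fd, fcntl.F_SETSIG, int(signum))   # replace the default SIGIO
    fcntl.fcntl(fd, fcntl.F_NOTIFY, DNOTIFY_MASK)  # start monitoring
    return fd
```

The returned descriptor has to stay open for as long as monitoring is wanted; closing it drops the registration.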

The drawbacks and limitations of dnotify are given mainly by how its interface was designed. To list just a few of the most obvious ones:

  • Cannot monitor single files: dnotify can monitor a directory and its contents, but not single files. If only a single file needs to be monitored, its whole parent directory needs to be monitored.
  • Prevents unmount of partitions: If the path being monitored is within a partition that may get unmounted, as long as the file descriptor is open, the unmount operation won’t be allowed. In order to unmount the partition, your process will need to close all file descriptors corresponding to paths inside the mount point of the partition. This makes dnotify especially problematic when working with removable media.
  • Limited event information: Signals are definitely a poor interface between kernel and user-space. No additional interesting data is provided in the signal handler, besides the file descriptor number, and therefore the process receiving the signal will need to stat() all files in the given directory, and compare the results with previously cached results, in order to know what event exactly happened and in which file.
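
To give an idea of the bookkeeping that the last point forces on user-space, here is a minimal sketch of the "stat everything and compare" step. The snapshot and diff_snapshots helpers are hypothetical names for this illustration, not part of any dnotify API.

```python
import os


def snapshot(directory):
    """Map each entry name to (mtime_ns, size, mode), without following symlinks."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            result[entry.name] = (st.st_mtime_ns, st.st_size, st.st_mode)
    return result


def diff_snapshots(old, new):
    """Return (created, deleted, changed) entry names between two snapshots."""
    created = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    changed = sorted(
        name for name in old.keys() & new.keys() if old[name] != new[name]
    )
    return created, deleted, changed
```

A process would take one snapshot when it registers the directory and another each time the signal arrives. With this approach a rename simply shows up as one deletion plus one creation.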

inotify

inotify is an inode monitoring system introduced in Linux 2.6.13. This API provides mechanisms to monitor filesystem events in single files or directories. When monitoring directories, inotify will return events both for the directory itself and for files inside it. As such, inotify is a full replacement of dnotify, and avoids most of its issues.

Instead of dnotify’s signal-based interface with user-space, inotify delivers events through a single file descriptor which can be polled and read. inotify comes with its own system calls: inotify_init() to create a new monitoring instance with its own file descriptor, inotify_add_watch() to tell the instance to monitor a given file or directory (which returns a watch descriptor), and inotify_rm_watch() to stop monitoring that file or directory. The single inotify file descriptor, therefore, can be used to monitor multiple paths (i.e. a single instance can manage multiple watch descriptors).
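
Python’s standard library has no inotify binding, so the following sketch reaches the three calls through ctypes. It assumes a Linux system whose C library exports them; the thin wrapper functions and the _check helper are our own, and the three mask values are copied from <sys/inotify.h>.

```python
import ctypes
import os

# A few event bits from <sys/inotify.h>.
IN_MODIFY = 0x00000002
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200

# dlopen(NULL): resolves symbols from the C library already loaded in-process.
_libc = ctypes.CDLL(None, use_errno=True)


def _check(result):
    """Turn a -1 return value into an OSError carrying errno."""
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def inotify_init():
    """Create an inotify instance and return its file descriptor."""
    return _check(_libc.inotify_init())


def inotify_add_watch(fd, path, mask):
    """Watch `path` for the events in `mask`; returns a watch descriptor."""
    return _check(_libc.inotify_add_watch(fd, os.fsencode(path), mask))


def inotify_rm_watch(fd, wd):
    """Stop monitoring the path behind watch descriptor `wd`."""
    _check(_libc.inotify_rm_watch(fd, wd))
```

One instance created with inotify_init() can then take any number of inotify_add_watch() calls, each returning its own watch descriptor.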

The user-space application then just needs to poll() for POLLIN events in the single inotify file descriptor, and read() information about what event happened in an inotify_event struct. This struct includes several things, like the specific watch descriptor which triggered the event (so the user can map it to the actual path monitored), a mask of events that happened in the watch, a filename (in case the watch was for a directory), and last but not least a cookie to synchronize events.
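
The read side can be sketched as follows. The header layout (wd, mask, cookie, len, followed by len bytes of NUL-padded name) is the one of struct inotify_event; the InotifyEvent tuple and the parse_events and read_events helpers are names invented for this example.

```python
import os
import select
import struct
from typing import NamedTuple

# Fixed-size header of struct inotify_event: int wd; uint32 mask, cookie, len.
_HEADER = struct.Struct("iIII")


class InotifyEvent(NamedTuple):
    wd: int
    mask: int
    cookie: int
    name: str  # empty unless the watch is on a directory


def parse_events(buffer):
    """Split the bytes returned by read() into individual events."""
    events = []
    offset = 0
    while offset + _HEADER.size <= len(buffer):
        wd, mask, cookie, name_len = _HEADER.unpack_from(buffer, offset)
        offset += _HEADER.size
        if offset + name_len > len(buffer):
            break  # truncated buffer, stop here
        raw_name = buffer[offset:offset + name_len].split(b"\0", 1)[0]
        offset += name_len
        events.append(InotifyEvent(wd, mask, cookie, os.fsdecode(raw_name)))
    return events


def read_events(fd, timeout_ms=None):
    """poll() for POLLIN on `fd`, then read and parse whatever is queued."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    if not poller.poll(timeout_ms):
        return []  # timed out, nothing to read
    return parse_events(os.read(fd, 4096))
```

The 4096-byte read size is an arbitrary choice here; it only needs to be large enough to hold at least one event with its name.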

Thanks to this cookie value, inotify is capable not only of supporting all of the event kinds that dnotify supported, but also of monitoring file or directory rename and move events across different directories. For example, a move of one file to another directory in the same mount point will trigger an IN_MOVED_FROM event on the source directory with cookie A, and an IN_MOVED_TO event on the target directory with the same cookie A (assuming that both source and target are monitored, of course). This kind of event matching is especially useful to e.g. file indexer applications like Tracker, as the indexer can just assume that the file was moved (path changed) but not its contents (so no need to re-index the file).
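
The matching itself is simple enough to sketch in a few lines. Events are assumed to arrive as (wd, mask, cookie, name) tuples and wd_to_path is the application’s own map from watch descriptors to directories; pair_moves is a hypothetical helper name.

```python
import os

IN_MOVED_FROM = 0x00000040
IN_MOVED_TO = 0x00000080


def pair_moves(events, wd_to_path):
    """Match IN_MOVED_FROM/IN_MOVED_TO events that share a cookie.

    Returns (moves, moved_away): a list of (source, destination) paths, and
    the sources whose destination was never seen (moved out of the watched
    area). IN_MOVED_TO events without a known source are ignored here.
    """
    pending = {}  # cookie -> source path
    moves = []
    for wd, mask, cookie, name in events:
        path = os.path.join(wd_to_path[wd], name)
        if mask & IN_MOVED_FROM:
            pending[cookie] = path
        elif mask & IN_MOVED_TO and cookie in pending:
            moves.append((pending.pop(cookie), path))
    return moves, sorted(pending.values())
```

An indexer could update the stored path for every pair in moves and treat the leftover sources as deletions.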

And because open file descriptors on the monitored paths are no longer used as the basis, monitoring a given file or directory doesn’t prevent the mount point where it resides from being unmounted. Actually, inotify itself will notify via IN_UNMOUNT events when that happens to one of your monitored paths.

Still, inotify is not perfect. Some of the strongest criticisms are:

  • Maximum number of inotify instances and watches: The kernel imposes some (configurable) limits on the number of inotify instances a user can create (/proc/sys/fs/inotify/max_user_instances, e.g. 128) and on the total number of watches a given user can set (/proc/sys/fs/inotify/max_user_watches, e.g. 8192). This effectively limits the number of paths a user can monitor.
  • No recursive monitoring: There is no way to ask the kernel to monitor a given directory and all its subdirectories.
  • Map between watch descriptor and real path: The user needs to maintain the map of which watch descriptor corresponds to which path. This is not really a limitation, just a bit of a burden for users who monitor lots of paths.
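
In practice the three points show up together: an application walks the tree itself, adds one watch per directory, and keeps the resulting map. A minimal sketch, where add_watch stands for whatever function wraps inotify_add_watch() in your program (an assumption of this example) and read_limit and watch_tree are illustrative names:

```python
import os


def read_limit(name):
    """Read one inotify limit, e.g. "max_user_watches", from /proc."""
    with open(f"/proc/sys/fs/inotify/{name}") as limit_file:
        return int(limit_file.read())


def watch_tree(root, add_watch):
    """Add one watch per directory under `root`.

    `add_watch(path)` must return a watch descriptor. The returned dict maps
    each watch descriptor back to the directory it was created for.
    """
    wd_to_path = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        wd_to_path[add_watch(dirpath)] = dirpath
    return wd_to_path
```

Directories created after the walk are not covered until the application reacts to their creation event and adds a watch for them, and every directory counts against max_user_watches.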

fanotify… FTW!

fanotify is the latest filesystem monitoring interface, officially available in a stable manner since Linux 2.6.37 (early 2011), but which has been around for much longer (since 2009).

The API to set up monitoring using fanotify is similar to that used in inotify; that is, we have a fanotify_init() syscall which will give the user a new fanotify file descriptor, and a fanotify_mark() syscall to add or remove marks (watches). The user-space application then just needs to poll() for POLLIN events in the single fanotify file descriptor, and read() information about what event happened in a fanotify_event_metadata struct.

Since the very beginning, the most advertised feature of fanotify was that it allowed recursive monitoring, which can be accomplished by using the special FAN_MARK_MOUNT flag when adding a new mark. Once such a mark is set for a mount point path, the user will get events for any file or directory in the same mount.
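
A mount-wide mark could be set up roughly like this. The sketch goes through ctypes because Python has no fanotify binding, assumes a Linux C library that exports fanotify_init() and fanotify_mark(), and needs sufficient privileges at runtime; the flag values are copied from <linux/fanotify.h> and watch_mount is a made-up helper name.

```python
import ctypes
import os

FAN_CLOEXEC = 0x01
FAN_CLASS_NOTIF = 0x00
FAN_MARK_ADD = 0x01
FAN_MARK_MOUNT = 0x10
FAN_CLOSE_WRITE = 0x08
FAN_OPEN = 0x20
AT_FDCWD = -100

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fanotify_init.argtypes = [ctypes.c_uint, ctypes.c_uint]
_libc.fanotify_init.restype = ctypes.c_int
_libc.fanotify_mark.argtypes = [
    ctypes.c_int, ctypes.c_uint, ctypes.c_uint64, ctypes.c_int, ctypes.c_char_p,
]
_libc.fanotify_mark.restype = ctypes.c_int


def watch_mount(path):
    """Return a fanotify descriptor reporting opens and write-closes
    for every file in the mount that contains `path`."""
    fd = _libc.fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF, os.O_RDONLY)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    rc = _libc.fanotify_mark(
        fd, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_OPEN | FAN_CLOSE_WRITE,
        AT_FDCWD, os.fsencode(path),
    )
    if rc < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, os.strerror(err))
    return fd
```

Without the required privileges the call is expected to fail with a permission error, which the helper surfaces as an OSError.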

Unlike inotify, the fanotify_event_metadata struct will not tell you on which file, path or watch an event happened. Instead, it will give you an open file descriptor to the exact file or directory where the event happened. Having a file descriptor lets the user obtain the full path of the file by readlink()-ing the corresponding /proc/self/fd/<fd> symbolic link, so in some way it helps user-space as there is no longer the need to keep a map of watches vs paths.
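
The path lookup is a one-liner; the sketch below only assumes that /proc is mounted, and path_of_fd is an illustrative name.

```python
import os


def path_of_fd(fd):
    """Resolve an open file descriptor of this process to its current path."""
    return os.readlink(f"/proc/self/fd/{fd}")
```

The descriptor received in each event still has to be closed once the path has been read, otherwise a busy monitor runs out of descriptors quickly.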

Giving an open file descriptor is the basis for the other big new feature that this monitoring system provides: file access control. A process using fanotify can not only ask the kernel to be notified about events, but can also tell the kernel to allow or forbid a given process to open a given file. Just think of the most obvious use case for this feature: antivirus software. An antivirus monitor needs to be able to analyze a file being opened before the user gets to open it, and then allow or disallow the open operation. Before this system was in place, antivirus programs usually relied on the out-of-tree Dazuko kernel driver to provide the same level of file-access control. But now, fanotify provides a mechanism for such programs to decide whether a given user-space process will be able to access a file.
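
For permission events (for example FAN_OPEN_PERM, which requires a listener created with the FAN_CLASS_CONTENT or FAN_CLASS_PRE_CONTENT class), the decision is sent back by writing a small fanotify_response struct to the fanotify descriptor. A minimal sketch, with build_response and answer as hypothetical helper names:

```python
import os
import struct

FAN_ALLOW = 0x01
FAN_DENY = 0x02

# struct fanotify_response: s32 fd; u32 response.
_RESPONSE = struct.Struct("=iI")


def build_response(event_fd, allow):
    """Encode the verdict for the permission event that carried `event_fd`."""
    return _RESPONSE.pack(event_fd, FAN_ALLOW if allow else FAN_DENY)


def answer(fanotify_fd, event_fd, allow):
    """Send the verdict to the kernel, then release the event's descriptor."""
    os.write(fanotify_fd, build_response(event_fd, allow))
    os.close(event_fd)
```

The process that triggered the event stays blocked until the response is written, so a scanner has to answer every permission event it receives.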

The last big improvement coming with fanotify is that the kernel will not only give the file descriptor of the file where the event happened, but also the PID of the program which caused the event to happen. This is very useful for different programs, if, for example, they want to ignore events created by themselves (e.g. Tracker’s writeback).
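
Filtering out self-generated events then becomes a simple comparison against the listener’s own PID. In this sketch events are assumed to be (fd, pid, mask) tuples already extracted from the metadata, and drop_own_events is a made-up name; the one detail worth noting is that the descriptors of discarded events still need closing.

```python
import os


def drop_own_events(events, own_pid=None):
    """Keep only events caused by other processes.

    `events` is an iterable of (fd, pid, mask) tuples. Descriptors that
    belong to discarded events are closed so they do not leak.
    """
    own = os.getpid() if own_pid is None else own_pid
    kept = []
    for fd, pid, mask in events:
        if pid == own:
            if fd >= 0:
                os.close(fd)
        else:
            kept.append((fd, pid, mask))
    return kept
```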

fanotify… WTF?

What not everyone seemed to understand, though, was that fanotify is not an inotify replacement. Let me restate the same thing in other words: fanotify doesn’t provide the same set of events that inotify provides. The root of the issue is again the open file descriptor in the fanotify_event_metadata struct which we talked about before. Using an open file descriptor to tell where an event happened directly rules out notifying events like file deletions or renames/moves… So, fanotify does NOT notify file deletions, file renames or file moves. Of course, it also doesn’t provide cookies to match source vs destination move events, as inotify did. To be fair, let’s say that fanotify covers just a subset of the use cases you could have with inotify, but also adds some new ones.

The other big drawback of fanotify is that it currently is root-only (CAP_SYS_ADMIN-only to be more specific). This means that only the root user can request to use the monitoring capabilities provided by fanotify, and therefore non-root use cases (like the Tracker file indexer mentioned earlier) are unable to use it.

Examples!

I’ve prepared some simple examples showing all the previous technologies, so if you want to play more with them, take a look at:

So who uses fanotify nowadays?

There are lots of programs out there using inotify, but the same cannot be said for fanotify, even several years after it became publicly available in the upstream Linux kernel.

The following list of Free Software projects using fanotify is, from what I can gather, close to complete:

  • fatrace is a monitor of system-wide file access events, basically equivalent to the mount monitoring example provided in the previous section, but more polished.
  • systemd, the init system, uses fanotify’s mount monitoring capabilities in its readahead implementation.
  • FirefoxOS uses fanotify in their disk space watcher implementation.

That list does not include any project using fanotify’s file access control feature. Of course, several proprietary antivirus programs do make use of that feature.

Do you know of other projects using fanotify, or are you planning to migrate one? Do not hesitate to leave a comment in the post!

Aleksander Morgado is a Spanish Telecommunications Engineer with experience in several areas, such as Antivirus & Network Security, Satellite Orbit Determination systems, VoIP servers and Mobile Broadband Communications. You can contact Aleksander and his team for professional consulting on our contact page.

2 comments on “Filesystem monitoring in the Linux kernel”
  1. Very interesting read Aleksander. Thanks a lot!

  2. Adam says:

    Excellent post! Have been looking at getting an event when diskspace gets low for our databases and was researching inotify and fanotify as a potential interface.
