We have discussed the work Lanedo has done in collaboration with Xamarin in earlier blog posts: one of these posts discussed the improvements Lanedo has made to the GTK+ stack on Mac OS X, another one discusses the implementation of a GTK+ widget capable of embedding native Mac OS X GUI controls in detail. In this blog post, we will discuss some work we have done on the optimization of Pango on Xamarin’s request.
Pango is a library that is used for laying out text. Basically, you provide Pango with Unicode text you want to render and Pango will determine which glyphs from which font should be drawn. These glyphs can then be drawn with (for instance) Cairo. Although this sounds simple, there is a lot involved, such as being able to deal with all kinds of different scripts (Hindi, Arabic, Chinese, Japanese, Korean, Georgian, etc.), different ways to handle line breaks, font substitution, etc.
Because Pango is such a fundamental part of the GTK+ stack, you can be assured that it has seen quite some optimization already. So you know beforehand that figuring out a new problem will be interesting. The problem we will be looking at in this blog post is that text handling was dramatically slow in a specific case. Let’s dive in.
Reducing the test case
All performance investigation starts with working towards a reproducible test of the problem that is as small as possible. For this particular issue a text file was provided with which the slow text handling was experienced. This is a good thing as it gives you a kick start for the analysis. The problem was reported in the context of MonoDevelop, so it concerned the MonoDevelop text editor, which uses the Pango library for text layout. The first thing to determine for these kinds of bugs is where in the stack the problem is. Is the problem in the MonoDevelop text editor itself? GTK+ (the GUI toolkit)? Pango? Cairo (the 2D graphics library used by GTK+)?
When you are dealing with a large application like MonoDevelop, a good first step is to reduce the test case to something as small as possible. Since the problem concerns loading a text file, it is useful to try to reproduce the problem with the testtext test in GTK+. It turned out that line wrapping had to be disabled in testtext, something the MonoDevelop text editor also does, in order to reproduce the problem. This already gives one good hint: the problem only turns up with unwrapped lines.
Now that the "code part" of the problem has been reproduced, you want to look at whether you can reduce the input file as well. Removing short lines from this file didn’t eliminate the problem, giving further hints that the problem only turns up with very long, unwrapped, lines. While the file was open in a text editor, it stood out that many of the long lines in this text for the majority consisted of spaces and tabs. It seemed hard to believe that a line with many tabs and spaces, instead of (complex to render) characters, would cause a performance problem. Either way, in order to verify, the tabs and spaces were replaced with ordinary characters. Interestingly, the performance problem vanished.
With the observation on tabs and spaces in mind, some further experiments were done with the test case. It turned out that the performance problem also went away if all tabs were replaced with spaces (instead of characters). If most, but not all, of the tabs were replaced no performance problem turned up either. Preliminary conclusion: large quantities of tabs are the culprit. But also: the problem is not in MonoDevelop, but in a lower layer of the stack because it could be reproduced with a minimal test case in GTK+ itself.
Time to get the profiler out
With a reduced test case both in terms of code and input file, you are in a good position to start profiling to find the problem. In this case, because most of the work was done on Mac OS X, use was made of the Time Profiler tool of the Instruments performance analysis utility that is shipped with XCode. There also are many open source tools that you can use to do a similar analysis, such as oprofile and callgrind as included with valgrind. If you use callgrind, be sure to have a look at KCachegrind as well, which is a great GUI tool to analyze callgrind output.
A first analysis of the output from the profiler indicated that a lot of time was spent in Pango when trying to edit these very long text lines. This made Pango the area of focus. With the profiler, three "hot spots" were found, all in the part of Pango that deals with laying out text. In a nutshell, when laying out text a string of Unicode characters is taken and a series of glyphs to be rendered (and the exact position for each glyph) is determined.
Let’s discuss these three hot spots and solutions for these hot spots in turn:
- The first hot spot is in reorder_runs_recurse(). In order to understand what this function is doing, it is important to know that a line is divided into runs. Each run contains characters with similar properties, as in the same font, text direction (left-to-right, right-to-left) and style. Runs belonging to a line of text are stored in a linked list. Pivotal for this discussion: every tab is stored as a separate run. So, lines with many tabs yield a long linked list of runs.
In reorder_runs_recurse(), this list of runs is unconditionally reordered based on the "analysis level" of the runs, which basically encodes the text direction. During this reordering, the entire linked list is being rebuilt. For long linked lists this is expensive and this is exactly what is happening for lines with many tabs. In case a line of text contains runs with only odd or only even analysis levels, it is not necessary to perform a full rebuild of the list of runs because all runs have the same text direction. So, we can first check whether the reordering is actually needed at all. This can be combined with the computation of the list length which was already done, so at no additional cost. Since in a code editor generally only left-to-right text is used, this solves one part of the performance problems we saw.
- Once the first hot spot was fixed, two other hot spots became more apparent. Both of these are located in shape_tab(), which is called every time a tab is to be shaped. The first problem concerns the computation of the current width of the line. Every time a tab is encountered when processing a line, the current width of the line up to that tab is computed. This involves re-visiting all runs that have been processed in this line before. When there are many runs, which is the case when there are many tabs, this is costly. This re-computation can be avoided by maintaining the current width of the line and updating this width every time a run is inserted.
- The second hot spot in shape_tab() is caused by code that computes the correct position for each tab. This computation is carried out by computing the tab position for all tabs in the line, starting at the first tab, until a tab position is found beyond the current line width. When a line has many tabs, this computation will take longer as more tabs are added to a line, but also the tab positions for the first tabs added to the line are re-computed repeatedly.
We have written a patch that reworks this code to compute a suitable position from where to start looking for the correct tab position, instead of always starting from the start of the line.
It is clear that these problems didn’t turn up before, because lines with large numbers of tabs are in general not a very common use case. Only when the number of tabs is significantly increased, problems turn up due to the choice of data structures and algorithms. You could actually say that storing runs in a linked list is not a good idea, knowing that lines can be constructed that give rise to large numbers of runs. So, shouldn’t the data structure actually be replaced? This is not a trivial task of course and could actually take a lot of time, in particular in testing the code. The Pango maintainer has, however, indicated that he is planning to rewrite PangoLayout at some point.
You can find the upstream bug with patches and discussion here: GNOME Bugzilla Bug 702389. The patch for the first hot spot has been committed to upstream Pango. The other patches are pending discussion about whether it is worth complicating the code base to fix these problems.
So, was that all?
An interesting aspect of performance optimization is that when one hot spot is eliminated, other hot spots may turn up. In this particular case, after the above hot spots were eliminated, Pango seemed to spend a lot of time in attribute processing. The main question is then: where do these attributes come from? Another interesting optimization process ensued: with testtext the problem could not be produced, so we had to turn to the MonoDevelop text editor again. It turned out that the problem only showed up after syntax color highlighting was enabled, otherwise things were just fine. Syntax color highlighting seemed to trigger many more attributes to be created than necessary, resulting in lines with large numbers of runs. After these details were reported to the MonoDevelop maintainers, this problem was fixed quickly.
In this blog post, we described how we found and fixed performance problems in the GUI toolkit and rendering stack. If you are developing an application and are experiencing performance (or other) problems, you can count on Lanedo’s experience to fix these issues for you. This way, your engineers can continue to focus on development of your application, while Lanedo deals with fixing the problems that you experience in the software stack below your application. Sounds interesting? Contact us through our contact page!