« Back to blog

Review of "Bell: Bit-Encoding Online Memory Leak Detection"

Yet another review. This is the original paper: http://dl.acm.org/citation.cfm?id=1168866

Generally, this paper looks at the problem of detecting memory leaks in programs and tries to solve it with a dynamic program analysis approach that is partially rooted in statistics. The paper is concerned with memory leaks caused by useless objects, in other words, objects that are still referenced by in-use components of the program but will never be used again. This is especially relevant for managed languages such as Java and C#: while they free unreachable objects through garbage collection (GC), detecting whether a still-reachable object will ever be used again is harder, because you are in effect trying to predict how (or if) the object will be used in future states of the execution. The paper also makes one very important assumption about memory leaks: they have no immediate symptoms. This resonates with the idea that predicting the future use of an object is difficult; it is analogous to believing something only after having watched it for some time. You believe that an object is useless from now on only if you have seen it go unused over a reasonably long period.

The paper's primary contribution is the Bit-Encoding Leak Location, or Bell, technique. It is an elegant way of encoding per-object information in a single bit. In this paper it is used to associate an object's allocation and last-use information with a site, i.e. a location within the source code, at near-zero space overhead. While the technique is presented in the context of leak detection, it is a general-purpose scheme for encoding per-object information and can potentially be used for other purposes as well. The paper's end goal, however, is to pinpoint the location, or site, of a memory leak so that a programmer can examine that location and fix the leak itself. This is achieved by the paper's secondary contribution: a tool called Sleigh, which employs Bell to detect memory leaks.

Bell encodes an object's site in a single bit, which can be one of the free, unused bits in the object's header. Using a simple encoding function f(object, site), which takes identifiers for the object and the site, a 0 or 1 is stored in this encoding bit (also called the site bit, since it carries information about the object's site). An object matches a site if the object's encoding bit equals the value computed by f(object, site). The encoding bit itself can be set when the object is allocated or when it is used, depending on the requirement. The encoding function should have two important properties. First, it should be cheap to compute, since it is evaluated at runtime and therefore adds performance overhead. Second, it should be unbiased, so that an object matches a given site with 50% probability if it was not actually encoded with that site; if it was encoded with the site, it will always match.
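
A minimal sketch of what such an encoding function could look like (the hash mixing below is a hypothetical choice of mine, not the paper's actual function; Bell only requires that f be cheap to compute and unbiased):

```java
final class Bell {
    // f(object, site): returns 0 or 1. For a fixed site, roughly half of all
    // object identifiers should map to each value (the "unbiased" property).
    static int encodingBit(int objectId, int siteId) {
        int h = (objectId * 0x9E3779B1) ^ (siteId * 0x85EBCA77); // cheap integer mixing
        h ^= h >>> 16;
        return h & 1; // keep only the lowest bit
    }
}
```

At allocation (or at last use), the instrumented program would compute encodingBit(objectId, currentSiteId) and store the result in the object's header bit.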

Decoding a site means estimating how many objects were encoded with that site from the number of objects that match it. Decoding is done over a particular subset of objects. For each object in the subset, f(object, site) is computed and compared with the object's site bit; if they are equal, the object matches the site and the match count for that site, m_site, is incremented. This gives the number of matches for the site. If n_site is the number of objects that were actually encoded with the site, then the expected match count is m_site = n_site + (n - n_site)/2, where n is the number of objects in the subset. This is simply because objects encoded with the site always match, while the remaining objects match with 50% probability, owing to the second property of the encoding function mentioned above. From here, one can estimate n_site = 2*m_site - n.
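
Continuing the sketch above (still assuming the hypothetical encodingBit function rather than the paper's implementation), decoding a candidate site over a set of objects might look like this:

```java
// Count how many of the given objects match a candidate site, then invert
// the expectation m_site = n_site + (n - n_site)/2 to estimate n_site.
static int decodeSite(int siteId, int[] objectIds, int[] siteBits) {
    int n = objectIds.length;
    int matches = 0;                         // m_site
    for (int i = 0; i < n; i++) {
        if (Bell.encodingBit(objectIds[i], siteId) == siteBits[i]) {
            matches++;                       // encoded objects always match; others ~50% of the time
        }
    }
    return 2 * matches - n;                  // estimated n_site (near zero for sites that encoded nothing)
}
```

For example, with n = 1000 stale objects and 600 matches for a site, the estimate is 2*600 - 1000 = 200 objects encoded with that site.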

Such a decoding scheme, for a given site, does not give you a detailed list of the objects encoded with that site. It does, however, give the site a statistical strength: an estimate of the number of objects encoded there. Now, if the common property of all the objects considered during decoding is that they are stale, i.e. have not been used for a very long time and are therefore candidates for being useless, then a site with high statistical strength, which is nothing but a large estimated number of stale objects encoded there, can be reported as a strong source of a memory leak. This is exactly what Sleigh does: its candidate set consists of stale objects. Staleness is tracked with per-object staleness bits, which are incremented logarithmically based on the number of times the GC has run, as explained in the paper. Since only 2 bits are used for the counter, it saturates at 3, which represents a highly stale object; 0, 1 and 2 represent zero, slight and moderate staleness. Sleigh also maintains 2 bits in the object's header for encoding purposes: one encodes the object's allocation site, the other its last-use site. The encoding bits are maintained via instrumentation in the application, while the decoding step is maintained and run as a separate unit.
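
A rough sketch of such a 2-bit saturating staleness counter (the exact update schedule below is my assumption for illustration; the paper defines its own logarithmic scheme):

```java
// Hypothetical 2-bit staleness counter, updated at each GC. Reaching the next
// staleness level requires exponentially more collections without any use of
// the object, so the counter grows roughly logarithmically in GC count.
static int updateStaleness(int staleness, long gcsSinceLastUse) {
    if (gcsSinceLastUse == 0) return 0;                 // any use resets staleness
    if (staleness >= 3) return 3;                       // saturate at "highly stale"
    if (gcsSinceLastUse >= (1L << (staleness + 1))) {   // wait 2, 4, 8 GCs for levels 1, 2, 3
        return staleness + 1;
    }
    return staleness;
}
```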

The evaluation with SPECjbb2000 and Eclipse shows promise, especially because the memory leaks were pinpointed very accurately and with high (99.9%) confidence, thanks to a false-positive threshold for weeding out false positives and to n_min, the minimum number of objects that must be encoded with a site before it can be reported as leaking with that confidence. However, there were cases where memory leaks were missed, and the testing of Sleigh was not extensive: it was based on just two candidate programs, and only specific leaks were identified. That being said, the biggest problem with a Bell-based technique is the high value of n_min required to accurately predict memory leak locations, which can only be reached after a reasonable amount of runtime. This goes back to the problem of believing only after seeing it for some time. It is also another reason why a large number of candidate programs will not be available for testing this technique: a candidate has to keep running for a long period of time. This raises an important question: if situations conducive to this technique detecting leaks are not plentiful, is it a valid solution? Nonetheless, the idea of encoding per-object information in a single bit is novel and should find uses in other applications.

As an afterthought, a good way to reduce the dependence on n_min would be to present a ranked list of sites, ordered by the probability of each site being a source of the memory leak. While n_min ensures an accurate result, for shorter-running programs it is probably better to show candidate locations with lower confidence but with multiple options.