When portability, robustness, and performance are important
Oliver is a researcher for the National Research Council of Canada doing R&D in simulation systems for engineering applications in virtual reality. He can be contacted at [email protected].
A resource is something that is useful and in limited supply: everything from computer memory and disk space to file handles, sockets, threads, mutexes, and semaphores. It is therefore important that resources be returned to the system when they are no longer in use. Failure to do so eventually results in poor performance and ultimately in "starvation": insufficient memory for the next operation, insufficient disk space, inability to create new threads, and the like. More often, however, it leads to nasty bugs, such as lost data and deadlocks.
Python does a good job of making resource management almost trivial. However, there are some important subtleties that can have a serious effect on the portability, robustness, performance, or even correctness of your Python programs. In this article, I discuss those subtleties and some of the modules and techniques you can use to get around them.
In Python, resources are not available directly but are wrapped in higher level Python objects that you instantiate and use in your own Python objects and functions. Resource management consists of the tasks that you and/or the Python interpreter must carry out to ensure that a resource that you have acquired is returned to the system.
There are actually three issues that may not be immediately apparent to you in Python's resource management model:
- Nondeterministic destruction (NDD).
- Circular references.
- Uncaught exceptions.
NDD
Resource management in Python is trivial 90 percent of the time because it is automated: Once you no longer need an object (say after you return from a function), you can just forget about it, and the interpreter does its best to eventually release it.
The "eventually" part of this statement refers to the fact that the Python Language Reference Manual (http://docs.python.org/ref/ref.html) guarantees that objects stay alive as long as they are in use. This means that if you create an object foo in a function A and from there call another function B that creates a global reference to foo, foo stays alive past the return from function A, at least until that global reference is discarded. There is, therefore, no way for you to know when your object will no longer be in use.
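Here is a minimal sketch of that scenario; the names Foo, A, B, and _stash are invented for the example:

    class Foo:
        pass

    _stash = []              # a module-level (global) container

    def B(obj):
        _stash.append(obj)   # creates a global reference to obj

    def A():
        foo = Foo()
        B(foo)               # foo is now also reachable via _stash
        # the local name foo disappears when A returns, but the object
        # survives because _stash still refers to it

    A()
    print len(_stash)        # 1: the Foo created inside A is still alive
    del _stash[:]            # only now can that Foo become garbage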
The "its best" refers to the fact that the Python Language Reference (PLR) does not prescribe what the interpreter must do with foo once no other objects refer to it; that is, once foo has become unreachable, or "garbage." Actually, the PLR does not even require the interpreter to notice. Many languages take the same approach; Java and C# are two examples. Therefore, "its best" depends on the interpreter implementation, the best known being CPython, Jython, and IronPython.

This indeterminism is a trap that is difficult for newcomers to Python to discern because of the del operator and the __del__ class method. The del operator simply removes a variable name from the local or global namespace; it says nothing about object destruction. The __del__ method, on the other hand, is called automatically by the interpreter just before the object is destroyed. But as explained earlier, object destruction is not guaranteed to occur: The call to your object's __del__ is completely out of your control and may never happen, and using the del operator will not help. The only reliable use for the del operator is to make sure you (or a user of your module) can't mistakenly use a name that shouldn't be used.
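The distinction is easy to demonstrate; the class name Res is arbitrary, and the comments describe CPython's typical behavior, which other implementations need not share:

    class Res:
        def __del__(self):
            print "releasing"   # may run late, or never

    r = Res()
    other = r     # a second name for the same object
    del r         # removes only the name 'r'; the object is untouched
    print other   # still alive and usable: nothing was "destroyed"
    del other     # object now unreachable; CPython happens to call
                  # __del__ at this point, but that is not guaranteed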
This can be even harder to accept for Pythoneers who only have experience with the CPython implementation: CPython appears deterministic because, in trivial examples such as a = foo(); del a, the foo instance is destroyed immediately. Consequently, you may not realize that your code will not work with other interpreter implementations.
This indeterminism is also a trap for experienced programmers coming from object-oriented languages with deterministic destruction (such as C++), where create-release pairs (the "resource acquisition is initialization," or RAII, idiom) are used heavily and to great effect. It is tempting to see del as the equivalent of delete, and __del__ as the equivalent of a destructor. Each analogy is true on its own, but the difference from C++ is that Python's "delete" does not call the "destructor."
In fact, __del__ is not only effectively useless for lifetime management; it should also be avoided because of how it interacts with reference cycles.
Circular References
In Python, variable names are references to objects, not objects in and of themselves. A Python interpreter tracks whether an object is in use by counting how many references point to that object: its "reference count" (which, in the CPython implementation, corresponds to the value returned by sys.getrefcount(obj)). Unfortunately, even if the PLR prescribed that objects be destroyed as soon as they are no longer referenced by any variable name, you wouldn't be much further ahead, because of reference cycles. A reference cycle occurs when an object directly or indirectly refers to itself. For instance, in Listing One an A instance (aa) refers to a B instance (aa.b), which in turn refers back to aa via aa.b.a; the chain aa -> aa.b -> aa forms a reference cycle.
Reference cycles are more common than you might think. They arise naturally in GUI libraries, for example, where information must flow up and down the tree of GUI components.
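The typical parent/child structure sketched below creates such a cycle; Widget is an invented stand-in for a real toolkit class, not an actual GUI API:

    class Widget:
        def __init__(self, parent=None):
            self.parent = parent          # child -> parent reference
            self.children = []            # parent -> child references
            if parent is not None:
                parent.children.append(self)

    root = Widget()
    button = Widget(root)   # root -> button -> root: a reference cycle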
While reference cycles are not a problem per se, they do prevent the reference counts of all objects in the cycle from going to zero, along with those of any objects reachable only from the cycle. Those objects are therefore never destroyed, and the associated resources are never released. Hence, reference cycles are a problem only when there are critical resources to release (mutex locks, opened files, and the like), or when cycles are created continuously (as in a loop), which leads to ever-increasing memory consumption by your program (Listing Two).
Even if you are careful not to create reference cycles, third-party modules that create a cycle that refers to your object, even indirectly, can trap it in the cycle. For instance, the hypothetical object yourObj in Figure 1 gets involved in a cycle, c1->c2->c3->c1, unbeknownst to you. Its reference count can't go to zero until you break the cycle.
Uncaught Exceptions
Even if you could be sure that reference cycles don't occur in your program, the uncaught-exceptions problem remains. An uncaught exception causes the named references at all levels of the function call stack to be retained until the Python interpreter exits, in case you need to explore the data for debugging purposes. Yet the PLR does not guarantee that objects left over from uncaught exceptions are ever destroyed. See Listing Three for a trick that can help clean up resources (but again, the PLR offers no guarantees).
It should be clear by now that you can't rely on __del__ for resource management; in particular, RAII does not work in Python, and the del operator is no help in the matter.
Fighting Indeterminism
Listing Four shows C++ code that uses the RAII idiom, along with apparently equivalent Python. The C++ code guarantees that the critical mutex lock resource acquired when instantiating a Lock is released when the function returns, because Lock has a destructor (not shown) that releases it, and the language standard guarantees the destructor is called upon scope exit (that is, function return). The Python code is only equivalent in appearance: lock is merely a named reference to a Lock instance, so upon scope exit the reference count of the created Lock is decreased, but:
- Python doesn't guarantee that the reference count will be zero, since do_stuff() might have increased it.
- Even if do_stuff() didn't affect the reference count, the named reference lock is not discarded if the scope is being exited because of an exception raised by do_stuff().
- Even if no exception is raised, Python doesn't guarantee that any special function (__del__, for instance) will be called.
For now, the only solution is to make judicious use of the try-finally clause and manually call the release method (Listing Five). This works well, except that the author of Lock must remember to document which method must be called to release the resource (not too bad), and the user must read the documentation and notice it (far less likely). In addition, users must remember to use a try-finally clause, which is easy to forget or to get wrong (Listing Six). Also, you can't mix except and finally clauses in a single try statement; rather, a try-except must wrap a try-finally, or vice versa (Listing Seven). This is an unfortunate obfuscation of code that begs for refactoring into a separate function.
A proposed addition to Python, PEP 310 (http://www.python.org/peps/pep-0310.html), introduces a new type of statement built around the keyword with, which lets you write Listing Seven as Listing Eight. This does the right thing as long as the developer of Lock has defined an __exit__ method in Lock. Outside of the with block, the object that lock refers to should not be used. This new syntax cleans things up somewhat but still leaves it up to you to remember that Lock requires proper manual resource management, and legacy code will not be able to take advantage of the feature until an __exit__ method is added (though that is easy to do manually). Also, PEP 310 doesn't allow for multiple variables on the with line, though that is likely to be a rare requirement.
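For reference, a Lock that cooperates with the proposed with statement might look like the following sketch. The wrapping of threading.Lock, the acquire-on-construction policy, and the permissive __exit__ signature are assumptions made for illustration; they are not taken from the PEP or from a real library.

    import threading

    class Lock:
        """Sketch of a scope-guardable lock for the proposed with statement."""
        def __init__(self):
            self._lock = threading.Lock()
            self._lock.acquire()          # acquire as soon as constructed

        def release(self):
            self._lock.release()

        def __exit__(self, *exc_info):    # called when the with block is left,
            self.release()                # normally or because of an exception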
In cases where a with block is still not adequate (for instance, if you have more than one object to guard and don't want to nest two with blocks), your only options are to:
- Continue with try-finally and try-except (a two-lock sketch follows this list).
- Go with something like detscope.py, which provides a means of automating the try-finally mechanics (see Listing Nine for a functional prototype that is, however, not thread safe).
- Develop your own technique.
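Under the first option, guarding two locks requires nested try-finally blocks so that the first lock is released even if acquiring the second one fails; Lock and do_stuff() are assumed to behave as in the earlier listings:

    def func():
        lock1 = Lock()
        try:
            lock2 = Lock()
            try:
                do_stuff()
            finally:
                lock2.release()   # release in reverse order of acquisition
        finally:
            lock1.release()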
Breaking the Cycle
Since Python 2.1, the standard weakref module supports the concept of weak references. A weakref.ref(yourObject) is an object that does not affect the reference count of yourObject. It has other nice properties, such as being testable for nullness and letting you specify a callback that gets called when yourObject is destroyed; see Listing Ten. A weak reference can be used to break a reference cycle because it tells the interpreter, "Don't keep this object alive on my account." Listing Eleven, a variation on Listing One, does not create any cycle.
There is a catch: Reference cycles can be hard to find. Since Python 2.1, the interpreter also includes a garbage collector that detects cycles and frees as many trapped objects as possible. However, the garbage collector can destroy an object involved in a cycle only if that object does not have a __del__ method. This is simply because a cycle has no beginning and no end, so there is no way of knowing which object's __del__ should be called first. Cycles that cannot be destroyed are put in a special list that you can manipulate via the gc module.
The gc module gives access to the garbage collector. The most useful members of gc are:
- garbage, a list containing all objects with reference count > 1 that your code can no longer reach and which have a __del__ method.
- collect(), forces a cleanup to update gc.garbage.
- get_referrers(obj), gets a list of objects that refer to obj.
- get_referents(obj), gets a list of objects referred to by obj (see the short sketch after this list).
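Here is a small, self-contained demonstration of get_referrers() and get_referents(); the Node class is invented for the example:

    import gc

    class Node:
        pass

    a = Node()
    b = Node()
    a.other = b    # a refers to b through its instance dictionary

    # get_referents: objects directly reachable from a's __dict__
    print b in gc.get_referents(a.__dict__)    # True

    # get_referrers: objects that refer to b; useful for finding out
    # what is keeping an object alive
    print a.__dict__ in gc.get_referrers(b)    # True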
Listing Twelve provides code useful for exploring the concept of cycles, and Listing Thirteen is a sample session using it and showing how the cycle is broken. Remember that gc.garbage is a standard Python list, so it holds "hard" references to your objects: If you manually break a cycle, you must also remove your object from this list for it to be labeled as "unused" (reference count = 0). (See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65333/ for a recipe to dump unreachable objects using gc.garbage.)
If you're working with versions of Python prior to 2.1, you can use Cyclops (http://www.python.org/ftp/). This package, not in the standard library, contains the cyclops module, which helps identify reference cycles. Unlike Python's garbage collector, Cyclops seems to require that you tell it which objects to inspect for cycles.
Once you have identified that a cycle is being created, you must figure out how (and where) to break it or, if that is not possible, find some other way to properly free the critical resources being held. Most often, this involves wrapping an object with a weakref.ref or weakref.proxy.
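A weakref.proxy behaves like the object it wraps, with no need to call it to retrieve the referent the way a weakref.ref requires. As a sketch, Listing Eleven's B could instead be written as follows:

    import weakref

    class B:
        def __init__(self, a):
            # the proxy can be used as if it were a itself, but it does not
            # keep a alive; using it after a has died raises
            # weakref.ReferenceError
            self.a = weakref.proxy(a)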
Conclusion
If you use Python for its portability, keep both platform and interpreter portability in mind: You must be careful not to rely on your objects ever being destroyed, so you can't use the RAII idiom. The del operator affects only whether you can access (reach) an object, not the object's existence. Therefore, make sure you read the documentation of a class to see whether a special disposal method must be called when you no longer need the object, and call such disposal methods inside a try-finally clause. Once the proposed with statement is available, it will help, but it still requires work on your part; detscope.py and other techniques may be appropriate as well. Circular references may prevent such disposal from taking place, in which case you must hunt down the cycles, manually or with pdb, gc, or Cyclops, and fix them using the weak references available via the weakref module.
Acknowledgments
Thanks to Holger Dürer, Todd MacCulloch, Francis Moore, and Pierre Rouleau for their helpful reviews of the drafts of this article.
DDJ
Listing One

class A:
    def __init__(self):
        self.b = B(self)
    def __del__(self):
        print "goodbye"

class B:
    def __init__(self, a):
        self.a = a

aa = A()
del aa
Listing Two
while 1:
    aa = A()
Listing Three
# someScript.py
def run_application():
    ...
def handle_exception():
    ...

try:
    run_application()
except:  # catch all
    handle_exception()
    # *attempt* to free as many remaining as possible
    import sys
    sys.exc_clear()
    sys.exc_traceback = None
    sys.last_traceback = None
Listing Four
// C++ code:
void func()
{
    Lock lock;
    do_stuff();
}

# "equivalent" Python code:
def func():
    lock = Lock()
    do_stuff()
Listing Five
def func():
    lock = Lock()
    try:
        do_stuff()
    finally:
        lock.release()
Listing Six
# extreme danger:
open('somefile.txt','w').write(contents)

# runtime error in exception handler:
try:
    ff = open('somefile.txt','w')
    ff.write(contents)
finally:
    ff.close()  # bang!! ff undefined

# multithreaded:
ff = open('somefile.txt','w')
# if exception raised before getting
# into the try-finally clause: bang!
try:
    ff.write(contents)
finally:
    ff.close()
Listing Seven
def func():
    lock = Lock()
    # unfortunately not allowed:
    # try:
    #     do_stuff()
    # except MyExcept:
    #     undo_stuff()
    # finally:
    #     lock.release()

    # instead, nesting is necessary:
    try:
        try:
            do_stuff()
        except MyExcept:
            undo_stuff()
    finally:
        lock.release()
Listing Eight
def func():
    with lock = Lock():
        do_stuff()
Listing Nine
# detscope.py
"""
Example use: a function, funcWCriticalRes(), creates two critical
resources, of type CritRes1 and CritRes2, and you want those resources
to be released regardless of control flow in the function:

    import detscope

    def funcWCriticalRes():
        critres1 = CritRes1()
        critres2 = CritRes2()
        use_res(critres1, critres2)
        if something:
            return  # early return
        ...
    funcWCriticalRes = ScopeGuarded(funcWCriticalRes)

    class CritRes1(NeedsFinalization):
        def __init__(self, ...): ...
        def _finalize(self): ...
    class CritRes2(NeedsFinalization):
        def __init__(self, ...): ...
        def _finalize(self): ...
"""
import sys

def ScopeGuarded(func):
    return lambda *args, **kwargs: ScopeGuardian(func, *args, **kwargs)

_funcStack = []

class NeedsFinalization:
    def __init__(self):
        print '\n%s: being created' % repr(self)
        self.__finalized = False
        try:
            _funcStack[-1].append(self)
        except IndexError:
            raise RuntimeError, "Forgot to scope-guard function? "

    def finalizeMaster(self):
        """Derived classes MUST define a self._finalize() method,
        where they do their finalization for scope exit."""
        print '%s: Finalize() being called' % repr(self)
        self._finalize()
        self.__finalized = True

    def __del__(self):
        """This just does some error checking; probably want to remove in
        production in case derived objects are involved in cycles."""
        try:
            problem = not self.__finalized
        except AttributeError:
            msg = '%s: NeedsFinalization.__init__ not called for %s' \
                  % (repr(self), self.__class__)
            raise RuntimeError, msg
        if not problem:
            print '%s: Finalized properly' % repr(self)
        else:
            print 'Forgot to scope-guard func?'

def ScopeGuardian(func, *args, **kwargs):
    try:
        scopedObjs = []
        _funcStack.append(scopedObjs)
        func(*args, **kwargs)
    finally:
        _funcStack.pop()
        if scopedObjs != []:
            scopedObjs.reverse()  # destroy in reverse order from creation
            for obj in scopedObjs:
                obj.finalizeMaster()
Listing Ten
import weakref

class Foo:
    def __str__(self):
        return "I'm a Foo and I'm ok"
    def __del__(self):
        print "obj %s: I was a Foo and now I'm dead" % id(self)

def noticeDeath(wr):
    print "weakref %s: weakly ref'd obj has died" % id(wr)

yourObj = Foo()
wr = weakref.ref(yourObj, noticeDeath)
print 'weakref %s -> obj %s: %s' % (id(wr), id(wr()), wr())

del yourObj
assert wr() is None

# output:
# weakref 17797504 -> obj 17794632: I'm a Foo and I'm ok
# weakref 17797504: weakly ref'd obj has died
# obj 17794632: I was a Foo and now I'm dead
Listing Eleven
import weakref

class A:
    def __init__(self):
        self.b = B(self)
    def __del__(self):
        print "goodbye"

class B:
    def __init__(self, a):
        self.a = weakref.ref(a)

aa = A()
del aa
Listing Twelve
# testCycle.py
from sys import getrefcount
import gc

class CycleContainer:
    def __init__(self, instName):
        self.instName = instName
        self.cycle = Cycle(self)
        print "Constructed a CycleContainer named '%s'" % instName
    def refs(self):
        """Get number of references to self. The 3 was determined
        experimentally, so method returns expected number of references."""
        return getrefcount(self) - 3
    def __del__(self):
        """Will prevent CycleContainer instance from being destroyed by gc"""
        print "CycleContainer '%s' being finalized" % self.instName

class Cycle:
    def __init__(self, containerOfSelf):
        self.container = containerOfSelf

def checkgc():
    gc.collect()
    return gc.garbage
Listing Thirteen
>>> from testCycle import CycleContainer, Cycle, checkgc
>>> aa = CycleContainer('one')
Constructed a CycleContainer named 'one'
>>> aa.refs()
1
>>> aa.cycle = Cycle(aa)
>>> aa.refs()
2
>>> checkgc()
[]
>>> del aa
>>> checkgc()
[<testCycle.CycleContainer instance at 0x00984CB0>]
>>> checkgc()[0].refs()
2
>>> bad = checkgc()[0]
>>> del bad.cycle
>>> bad.refs()
2
>>> checkgc()
[<testCycle.CycleContainer instance at 0x00984CB0>]
>>> del checkgc()[:]
>>> checkgc()
[]
>>> bad.refs()
1
>>> del bad
CycleContainer 'one' being finalized