Python garbage collection and memory layout

Abstract

The main garbage collection algorithm used by CPython is reference counting. The basic idea is that CPython counts how many different places hold a reference to an object. Such a place could be another object, a global (or static) C variable, or a local variable in some C function. When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn if this decrement makes their reference count become zero, and so on. The reference count field can be examined using the sys.getrefcount function (note that the value returned is always one higher than the number of references you hold, because the function call itself also holds a reference to the object):

>>> import sys
>>> x = object()
>>> sys.getrefcount(x)
2
>>> y = x
>>> sys.getrefcount(x)
3
>>> del y
>>> sys.getrefcount(x)
2

The main problem with the reference counting scheme is that it does not handle reference cycles. For instance, consider this code:

>>> container = []
>>> container.append(container)
>>> sys.getrefcount(container)
3
>>> del container

In this example, container holds a reference to itself, so even when we remove our reference to it (the variable “container”) the reference count never falls to 0, because the object still has its own internal reference. Simple reference counting would therefore never clean it up. For this reason some additional machinery is needed to clean up reference cycles between objects once they become unreachable. This is the cyclic garbage collector, usually called just the Garbage Collector (GC), even though reference counting is also a form of garbage collection.
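The behaviour described above can be observed with a small sketch, assuming CPython. The Node class and the probe variable are hypothetical names introduced only for this illustration. A weak reference serves as the probe because it does not keep the object alive, and automatic collection is switched off so that only the explicit gc.collect() call can reclaim the cycle.

```python
import gc
import weakref


class Node:
    """Hypothetical container whose only attribute points back at itself."""

    def __init__(self):
        self.me = self  # self-reference: the smallest possible cycle


gc.disable()  # keep an automatic collection from running in between
try:
    node = Node()
    probe = weakref.ref(node)  # does not add to the reference count

    del node  # drop our only external reference
    alive_after_del = probe() is not None  # refcount is still 1 (self.me)

    gc.collect()  # the cyclic collector finds and frees the cycle
    alive_after_collect = probe() is not None
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)
```

On CPython the first flag is expected to be true and the second false: reference counting alone leaves the object alive, and the cyclic collector removes it.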

Memory layout and object structure

Normally the C structure supporting a regular Python object looks as follows:

object -----> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
              |                    ob_refcnt                  | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
              |                    *ob_type                   | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              |                      ...                      |

In order to support the garbage collector, the memory layout of objects is altered to accommodate extra information before the normal layout:

              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
              |                    *_gc_next                  | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyGC_Head
              |                    *_gc_prev                  | |
object -----> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              |                    ob_refcnt                  | \
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
              |                    *ob_type                   | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              |                      ...                      |
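Only objects allocated with this extra header can be linked into the collector's lists. As a rough illustration, gc.is_tracked reports whether a given object is currently tracked by the collector. This is a sketch, not a way to inspect the header itself, and the dictionary name tracked is introduced here for illustration.

```python
import gc

# A list can hold references to other objects, so it is tracked by the GC.
# Integers and strings cannot refer to other objects and are never tracked.
tracked = {
    "list": gc.is_tracked([]),
    "int": gc.is_tracked(42),
    "str": gc.is_tracked("text"),
}

print(tracked)
```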

Doubly linked lists are used because they efficiently support the most frequently required operations. In general, the collection of all objects tracked by GC is partitioned into disjoint sets, each in its own doubly linked list. Between collections, objects are partitioned into “generations”, reflecting how often they’ve survived collection attempts. During collections, the generation(s) being collected are further partitioned into, e.g., sets of reachable and unreachable objects. Doubly linked lists support moving an object from one partition to another, adding a new object, removing an object entirely (objects tracked by GC are most often reclaimed by the refcounting system when GC isn’t running at all!), and merging partitions, all with a small constant number of pointer updates. With care, they also support iterating over a partition while objects are being added to it and removed from it, which is frequently required while GC is running.
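To make the "small constant number of pointer updates" concrete, here is a minimal Python sketch of a circular doubly linked list with a sentinel head. It is an illustration only, not CPython's C implementation; the names Node, append, unlink, move, merge and names are all hypothetical.

```python
class Node:
    """A list node; a fresh node links to itself (an empty circular list)."""

    def __init__(self, name=None):
        self.name = name
        self.prev = self
        self.next = self


def append(head, node):
    """Link node in just before the sentinel head: four pointer updates."""
    last = head.prev
    last.next = node
    node.prev = last
    node.next = head
    head.prev = node


def unlink(node):
    """Remove node from whatever list it is in, without searching."""
    node.prev.next = node.next
    node.next.prev = node.prev
    node.prev = node.next = node


def move(node, head):
    """Move node from its current partition to the end of another one."""
    unlink(node)
    append(head, node)


def merge(src, dst):
    """Splice every node of src onto the end of dst, leaving src empty."""
    if src.next is src:
        return
    first, last, tail = src.next, src.prev, dst.prev
    tail.next = first
    first.prev = tail
    last.next = dst
    dst.prev = last
    src.next = src.prev = src


def names(head):
    """Walk the list once and collect the node names, for inspection."""
    out, node = [], head.next
    while node is not head:
        out.append(node.name)
        node = node.next
    return out


young, old = Node(), Node()  # two sentinel heads, i.e. two partitions
a, b, c = Node("a"), Node("b"), Node("c")
for n in (a, b, c):
    append(young, n)

move(b, old)       # b changes partition without any traversal
merge(young, old)  # the remaining nodes follow in one splice
print(names(old), names(young))
```

None of the operations except the inspection helper depends on the length of either list, which is the property the collector relies on.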

Identifying reference cycles

The algorithm that CPython uses to detect those reference cycles is implemented in the gc module. The garbage collector only focuses on cleaning container objects (i.e. objects that can contain a reference to one or more objects). These can be arrays, dictionaries, lists, custom class instances, classes in extension modules, etc. One could think that cycles are uncommon but the truth is that many internal references needed by the interpreter create cycles everywhere. Some notable examples:

    Exceptions contain traceback objects that contain a list of frames that contain the exception itself.

    Module-level functions reference the module’s dict (which is needed to resolve globals), which in turn contains entries for the module-level functions.

    Instances have references to their class which itself references its module, and the module contains references to everything that is inside (and maybe other modules) and this can lead back to the original instance.

    When representing data structures like graphs, it is very typical for them to have internal links to themselves.
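The first of these cycles is easy to reproduce. The sketch below assumes CPython; fail_and_keep, kept and cycle_closed are hypothetical names. Storing a caught exception in a local variable closes the loop exception -> traceback -> frame -> local variable -> exception.

```python
def fail_and_keep():
    try:
        raise ValueError("boom")
    except ValueError as exc:
        kept = exc  # a local of this frame now refers to the exception
    return kept


error = fail_and_keep()

# The exception refers to its traceback, the traceback to the frame in
# which it was raised, and that frame's locals refer to the exception.
frame = error.__traceback__.tb_frame
cycle_closed = frame.f_locals["kept"] is error

print(cycle_closed)
```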

To correctly dispose of these objects once they become unreachable, they need to be identified first. Inside the function that identifies cycles, two doubly linked lists are maintained: one list contains all objects to be scanned, and the other will contain all objects “tentatively” unreachable.

To understand how the algorithm works, let’s take the case of a circular linked list which has one link referenced by a variable A, and one self-referencing object which is completely unreachable:

>>> import gc

>>> class Link:
...    def __init__(self, next_link=None):
...        self.next_link = next_link

>>> link_3 = Link()
>>> link_2 = Link(link_3)
>>> link_1 = Link(link_2)
>>> link_3.next_link = link_1
>>> A = link_1
>>> del link_1, link_2, link_3

>>> link_4 = Link()
>>> link_4.next_link = link_4
>>> del link_4

# Collect the unreachable Link object (and its .__dict__ dict).
>>> gc.collect()
2

When the GC starts, it has all the container objects it wants to scan on the first linked list. The objective is to move all the unreachable objects. Since most objects turn out to be reachable, it is much more efficient to move the unreachable as this involves fewer pointer updates.

Every object that supports garbage collection has an extra reference count field (gc_refs), which is initialized to the reference count of that object when the algorithm starts. The algorithm needs to modify this count to do its computations, and working on a copy means the real reference count field is left untouched.
[Figure: the example objects with gc_refs initialized to their reference counts]

The GC then iterates over all containers in the first list and decrements by one the gc_ref field of any other object that container is referencing. Doing this makes use of the tp_traverse slot in the container class (implemented using the C API or inherited by a superclass) to know what objects are referenced by each container. After all the objects have been scanned, only the objects that have references from outside the “objects to scan” list will have gc_refs > 0.
[Figure: gc_refs after the references between the scanned objects have been subtracted]
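From Python, the result of this traversal can be inspected with gc.get_referents, which returns the objects directly referred to by its arguments, as visited by the C-level tp_traverse methods. A minimal sketch, where inner, outer and points_to_inner are names chosen for illustration:

```python
import gc

inner = []
outer = [inner, "label"]

# The objects that the list's tp_traverse reports, i.e. the objects whose
# gc_refs the collector would decrement while scanning `outer`.
referents = gc.get_referents(outer)
points_to_inner = any(obj is inner for obj in referents)

print(points_to_inner)
```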

Notice that having gc_refs == 0 does not imply that the object is unreachable. This is because another object that is reachable from the outside (gc_refs > 0) can still have references to it. For instance, the link_2 object in our example ended up with gc_refs == 0 but is still referenced by the link_1 object, which is reachable from the outside. To obtain the set of objects that are really unreachable, the garbage collector re-scans the container objects using the tp_traverse slot, this time with a different traverse function that marks objects with gc_refs == 0 as “tentatively unreachable” and then moves them to the tentatively unreachable list. The following image depicts the state of the lists at a moment when the GC has processed the link_3 and link_4 objects but has not yet processed link_1 and link_2.
[Figure: the two lists after link_3 and link_4 have been processed]

Then the GC scans the next object, link_1. Because it has gc_refs == 1, the GC does not do anything special: it knows the object has to be reachable (and it is already in what will become the reachable list):
[Figure: the two lists after link_1 has been scanned]

When the GC encounters an object which is reachable (gc_refs > 0), it traverses its references using the tp_traverse slot to find all the objects that are reachable from it, moving them to the end of the list of reachable objects (where they started originally) and setting their gc_refs field to 1. This is what happens to link_2 and link_3 below, as they are reachable from link_1. From the state in the previous image, and after examining the objects referred to by link_1, the GC knows that link_3 is reachable after all, so it is moved back to the original list and its gc_refs field is set to 1 so that if the GC visits it again, it will know that it’s reachable. To avoid visiting an object twice, the GC marks all objects that have already been visited once (by unsetting the PREV_MASK_COLLECTING flag), so that an object that has already been processed is not processed again when some other object references it.
[Figure: the two lists after the objects reachable from link_1 have been processed]

Notice that an object that was marked as “tentatively unreachable” and was later moved back to the reachable list will be visited again by the garbage collector, because all the references that object holds now need to be processed as well. This process is really a breadth-first search over the object graph. Once all the objects are scanned, the GC knows that all container objects in the tentatively unreachable list are really unreachable and can thus be garbage collected.

Pragmatically, it’s important to note that no recursion is required by any of this, and neither does it in any other way require additional memory proportional to the number of objects, number of pointers, or the lengths of pointer chains. Apart from O(1) storage for internal C needs, the objects themselves contain all the storage the GC algorithms require.

Summary

The idea is to keep track of all container objects. There are several ways this can be done, but one of the best is to use doubly linked lists with the link fields inside the object's structure. This allows objects to be quickly inserted into and removed from the set without requiring extra memory allocations. When a container is created it is inserted into this set, and when it is deleted it is removed.

Now that we have access to all the container objects, how do we find reference cycles? First we add another field to container objects in addition to the two link pointers. We will call this field gc_refs. Here are the steps to find reference cycles:

1. For each container object, set gc_refs equal to the object's reference count.
2. For each container object, find which container objects it references and decrement the referenced container's gc_refs field.
3. After all the objects have been scanned, only the objects that are referenced from outside the “objects to scan” list have gc_refs > 0.
4. Re-scan the container objects, separating those that are reachable from the outside (gc_refs > 0) from those with gc_refs == 0, which are marked “tentatively unreachable”.
5. Move every object with gc_refs == 0 to the unreachable list. Note that some of them are not really unreachable, because a reachable object may still reference them.
6. Whenever the re-scan meets an object that is reachable from the outside (gc_refs > 0), traverse the objects it references and move any that are in the unreachable list back to the end of the reachable list, setting their gc_refs to 1.
7. All container objects that now have a gc_refs field greater than zero are referenced from outside the set of container objects, either directly or through another reachable object. We cannot free these objects, so they stay in the reachable set.
8. The objects left in the unreachable list can now be freed.
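The steps above can be sketched as a small simulation. This is a simplified model, not CPython's implementation: it uses plain dicts and sets instead of moving objects between linked lists, it ignores the instance __dict__ objects of the earlier Link example, and the function name find_unreachable and its two arguments are hypothetical.

```python
from collections import deque


def find_unreachable(refcounts, edges):
    """Return the names of the container objects that form garbage.

    refcounts maps each object name to its total reference count;
    edges maps each object name to the names it references.
    """
    # Step 1: copy the real reference counts into gc_refs.
    gc_refs = dict(refcounts)

    # Step 2: subtract every reference that comes from a tracked container.
    for source in edges:
        for target in edges[source]:
            gc_refs[target] -= 1

    # Step 3: whatever is still positive is referenced from the outside.
    reachable = {name for name, count in gc_refs.items() if count > 0}

    # Steps 4-7: everything else is only tentatively unreachable; a
    # breadth-first walk from the reachable objects rescues what they
    # reference, directly or indirectly.
    queue = deque(reachable)
    while queue:
        for target in edges[queue.popleft()]:
            if target not in reachable:
                reachable.add(target)
                queue.append(target)

    # Step 8: what was never rescued can be freed.
    return set(refcounts) - reachable


# The example from the text: link_1 is referenced by link_3 and by A,
# link_2 by link_1, link_3 by link_2, and link_4 only by itself.
refcounts = {"link_1": 2, "link_2": 1, "link_3": 1, "link_4": 1}
edges = {
    "link_1": ["link_2"],
    "link_2": ["link_3"],
    "link_3": ["link_1"],
    "link_4": ["link_4"],
}

garbage = find_unreachable(refcounts, edges)
print(garbage)
```

After step 2 only link_1 keeps a positive gc_refs, yet link_2 and link_3 are rescued by the walk that starts from it, so link_4 is the only object reported as garbage.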

Why not use “traditional” Garbage Collection?

Traditional garbage collection (e.g. mark and sweep or stop and copy) usually works as follows:

1. Find the root objects of the system. These are things like the global environment (like the __main__ module in Python) and objects on the stack.
2. Search from these objects and find all objects reachable from them. These objects are all "alive".
3. Free all other objects.

Unfortunately this approach cannot be used in the current version of Python. Because of the way extension modules work, Python can never fully determine the root set. If the root set cannot be determined accurately, we risk freeing objects that are still referenced from somewhere. Even if extension modules were designed differently, there is no portable way of finding what objects are currently on the C stack. Also, reference counting provides some nice benefits in terms of locality of memory reference and finalizer semantics that Python programmers have come to expect. What would be best is a way of still using reference counting while also freeing reference cycles.

References:
[1] https://devguide.python.org/internals/garbage-collector/index.html#destroying-unreachable-objects
[2] http://arctrix.com/nas/python/gc/