你懂python吗(13)

Python垃圾回收机制

Python GC主要使用引用计数（reference counting）来跟踪和回收垃圾。在引用计数的基础上，通过“标记-清除”（mark and sweep）解决容器对象可能产生的循环引用问题，通过“分代回收”（generation collection）以空间换时间的方法提高垃圾回收效率。简单来说就是以引用计数为主，标记-清除和分代收集两种机制为辅

引用计数机制

引用计数是啥？

python里每一个东西都是对象，它们的核心就是一个结构体：PyObject

typedef struct_object {
 int ob_refcnt;
 struct_typeobject *ob_type;
} PyObject;

PyObject是每个对象必有的内容，其中ob_refcnt就是做为引用计数。当一个对象有新的引用时，它的ob_refcnt就会增加，当引用它的对象被删除，它的ob_refcnt就会减少

#define Py_INCREF(op)   ((op)->ob_refcnt++) //增加计数
#define Py_DECREF(op) \ //减少计数
    if (--(op)->ob_refcnt != 0) \
        ; \
    else \
        __Py_Dealloc((PyObject *)(op))

当引用计数为0时，该对象生命就结束了。

导致引用计数+1的情况

对象被创建，例如a=23
对象被引用，例如b=a
对象被作为参数，传入到一个函数中，例如func(a)
对象作为一个元素，存储在容器中，例如list1=[a,a]

导致引用计数-1的情况

对象的别名被显式销毁，例如del a

对象的别名被赋予新的对象，例如a=24

一个对象离开它的作用域，例如f函数执行完毕时，func函数中的局部变量

对象所在的容器被销毁，或从容器中删除对象

import sys
def func(c):
    print 'in func function', sys.getrefcount(c) - 1
if __name__ == '__main__':
    print 'init', sys.getrefcount(11) - 1  # 11初始化的引用值
    a = 11                                 # 11被a引用
    print 'after a=11', sys.getrefcount(11) - 1  
    b = a
    print 'after b=a', sys.getrefcount(11) - 1  #a被b引用
    func(11)
    print 'after func(a)', sys.getrefcount(11) - 1
    list1 = [a, 12, 14]
    print 'after list1=[a,12,14]', sys.getrefcount(11) - 1
    a=12
    print 'after a=12', sys.getrefcount(11) - 1
    del a
    print 'after del a', sys.getrefcount(11) - 1
    del b
    print 'after del b', sys.getrefcount(11) - 1
    list1.pop(0)
    print 'after pop list1',sys.getrefcount(11)-1
    del list1
    print 'after del list1', sys.getrefcount(11) - 1
#output
init 30
after a=11 31
after b=a 32
in func function 34
after func(a) 32
after list1=[a,12,14] 33
after a=12 32
after del a 32
after del b 31
after pop list1 30
after del list1 30

引用计数机制的优点：

1、简单

2、实时性：一旦没有引用，内存就直接释放了。不用像其他机制等到特定时机。实时性还带来一个好处：处理回收内存的时间分摊到了平时。

引用计数机制的缺点：

1、维护引用计数消耗资源

2、循环引用

循环引用如何内存泄漏？

有 del() 函数的对象间的循环引用是导致内存泄漏的主凶。但没有del()函数的对象间的循环引用是可以被垃圾回收器回收掉的。

首先我们要看一下没有内存泄漏的例子：

import gc
import sys
class cycleLeak(object):
    def __init__(self):
        self._text = '#' * 10
    def __del__(self):
        pass
def make_cycle_ref():
    _cycleLeak = cycleLeak()
    print '_cycleLeak ref count0: %d' % (sys.getrefcount(_cycleLeak))
    del _cycleLeak
    try:
        print '_cycleLeak ref count1: %d' % (sys.getrefcount(_cycleLeak))
    except UnboundLocalError:
        print '_cycleLeak is invalid !'
def test_gcLeak():
    gc.enable()
    # gc.set_debug(gc.DEBUG_LEAK)
    gc.set_debug(gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE |
                 gc.DEBUG_INSTANCES | gc.DEBUG_OBJECTS)
    print 'begin gcleak test ....'
    make_cycle_ref()
    print 'gc begin collecting ....'
    _unreachable = gc.collect()
    print 'unreachable object num : %d ' % (_unreachable)
    '''
    gc.garbage是一个list对象，列表项是垃圾收集器发现的不可达（即垃圾对象）、但又不能释放(不可回收)的对	 象，通常gc.garbage中的对象是引用对象还中的对象。因Python不知用什么顺序来调用对象的__del__函数，导     致对象始终存活在gc.garbage中，造成内存泄露
    '''
    print 'garbage object num : %d' % (len(gc.garbage))
if __name__ == '__main__':
    test_gcLeak()
#output
begin gcleak test ....
_cycleLeak ref count0: 2  #对象_gcleak的引用计数为2
_cycleLeak is invalid !   #因为执行了del函数，_gcleak变为了不可达的对象
gc begin collecting ....  #开始垃圾回收
unreachable object num : 0  #本次垃圾回收发现的不可达的对象个数为0
garbage object num : 0      #整个解释器中垃圾对象的个数为0

我们通过自我引用引起内存泄露，在上述程序中将 make_cycle_ref()添加如下：

def make_cycle_ref():
    _cycleLeak = cycleLeak()
    _cycleLeak._self=_cycleLeak   #自我引用
    print '_cycleLeak ref count0: %d' % (sys.getrefcount(_cycleLeak))
    del _cycleLeak
    try:
        print '_cycleLeak ref count1: %d' % (sys.getrefcount(_cycleLeak))
    except UnboundLocalError:
        print '_cycleLeak is invalid !'
#output
begin gcleak test ....
_cycleLeak ref count0: 3
_cycleLeak is invalid !
gc begin collecting ....
gc: uncollectable <cycleLeak 0x7f4f50779350>
gc: uncollectable <dict 0x7f4f5077a6e0>
unreachable object num : 2 #本次回收不可达的对象个数为2
garbage object num : 1     #整个解释器中垃圾个数为1

我们互相引用导致内存泄漏,代码如下：

import gc
import sys
class cycleLeakA(object):
    def __init__(self):
        self._text = '#' * 10
    def __del__(self):
        pass
class cycleLeakB(object):
    def __init__(self):
        self._text = '#' * 10
    def __del__(self):
        pass
def make_cycle_ref():
    _cycleLeakA = cycleLeakA()
    _cycleLeakB = cycleLeakB()
    _cycleLeakA._ref=_cycleLeakB
    _cycleLeakB._ref = _cycleLeakA
    print '_cycleLeak ref count0: _cycleLeakA %d ,_cycleLeakB %d' % (sys.getrefcount(_cycleLeakA),sys.getrefcount(_cycleLeakB))
    del _cycleLeakA
    del _cycleLeakB
    try:
        print '_cycleLeak ref count1: _cycleLeakA %d ,_cycleLeakB %d' % (sys.getrefcount(_cycleLeakA),sys.getrefcount(_cycleLeakB))
    except UnboundLocalError:
        print '_cycleLeak is invalid !'
def test_gcLeak():
    gc.enable()
    gc.set_debug(gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE |
                 gc.DEBUG_INSTANCES | gc.DEBUG_OBJECTS ) 
    print 'begin gcleak test ....'
    make_cycle_ref()
    print 'gc begin collecting ....'
    _unreachable = gc.collect()
    print 'unreachable object num : %d ' % (_unreachable)
    print 'garbage object num : %d' % (len(gc.garbage))
if __name__ == '__main__':
    test_gcLeak()
#output
begin gcleak test ....
_cycleLeak ref count0: _cycleLeakA 3 ,_cycleLeakB 3
_cycleLeak is invalid !
gc begin collecting ....
unreachable object num : 4 
garbage object num : 2
gc: uncollectable <cycleLeakA 0x7f59f3a04710>
gc: uncollectable <cycleLeakB 0x7f59f3a04750>
gc: uncollectable <dict 0x7f59f3a05a28>
gc: uncollectable <dict 0x7f59f3a056e0>

注意：如果我们将class cycleLeak(object)的__del__属性删除，garbage object num将会是0，为什么呢？因为如果循环引用中，两个对象都定义了__del__方法，gc模块不会销毁这些不可达对象，因为gc模块不知道应该先调用哪个对象的__del__方法，所以为了安全起见，gc模块会把对象放到gc.garbage中，但是不会销毁对象。

上述的例子使用了gc模块，我们简单的介绍一下gc模块

常用函数：

gc.set_debug(flags) 设置gc的debug日志，一般设置为gc.DEBUG_LEAK ，也可以是上述事例
gc.collect([generation]) 显式进行垃圾回收，可以输入参数，0代表只检查第一代的对象，1代表检查一，二代的对象，2代表检查一，二，三代的对象，如果不传参数，执行一个full collection，也就是等于传2。返回不可达（unreachable objects）对象的数目
gc.get_count() 获取当前自动执行垃圾回收的计数器，返回一个长度为3的列表

解释如下：

import gc
class A(object):
    pass
if __name__ == '__main__':
    print 'gc.get_count()' ,gc.get_count()
    a=A()
    print 'gc.get_count()', gc.get_count()
    del a
    print 'gc.get_count()', gc.get_count()
#output
gc.get_count() (581, 8, 0)
gc.get_count() (582, 8, 0)
gc.get_count() (581, 8, 0)

（581，8，0）其中

581指距离上一次一代垃圾检查Python分配内存的数目减去释放内存的数目

8指距离上一次二代垃圾检查，一代垃圾检查的次数

0是指距离上一次三代垃圾检查，二代垃圾检查的次数

gc.set_threshold(threshold0[, threshold1[, threshold2]) 设置自动执行垃圾回收的频率。

解释如下：

gc模快有一个自动垃圾回收的阀值，即通过gc.get_threshold函数获取到的长度为3的元组，例如(700,10,10)每一次计数器的增加，gc模块就会检查增加后的计数是否达到阀值的数目，如果是，就会执行对应的代数的垃圾检查，然后重置计数器。

例如，假设阀值是(700,10,10)：

当计数器从(699,3,0)增加到(700,3,0)，gc模块就会执行gc.collect(0),即检查一代对象的垃圾，并重置计数器为(0,4,0)
当计数器从(699,9,0)增加到(700,9,0)，gc模块就会执行gc.collect(1),即检查一、二代对象的垃圾，并重置计数器为(0,0,1)
当计数器从(699,9,9)增加到(700,9,9)，gc模块就会执行gc.collect(2),即检查一、二、三代对象的垃圾，并重置计数器为(0,0,0)

使用方法：

必须要import gc模块，并且is_enable()=True才会启动自动垃圾回收。
这个机制的主要作用就是发现并处理不可达的垃圾对象。
垃圾回收=垃圾检查+垃圾回收
在Python中，采用分代收集的方法。把对象分为三代，一开始，对象在创建的时候，放在一代中，如果在一次一代的垃圾检查中，改对象存活下来，就会被放到二代中，同理在一次二代的垃圾检查中，该对象存活下来，就会被放到三代中。

标记-清除？

标记-清除机制，顾名思义，首先标记对象（垃圾检测），然后清除垃圾（垃圾回收）。基本思路是先按需分配，等到没有空闲内存的时候从寄存器和程序栈上的引用出发，遍历以对象为节点、以引用为边构成的图，把所有可以访问到的对象打上标记，然后清扫一遍内存空间，把所有没标记的对象释放。

首先初始所有对象标记为白色，并确定根节点对象（这些对象是不会被删除），标记它们为黑色（表示对象有效）。将有效对象引用的对象标记为灰色（表示对象可达，但它们所引用的对象还没检查），检查完灰色对象引用的对象后，将灰色标记为黑色。重复直到不存在灰色节点为止。最后白色结点都是需要清除的对象。

这里所采用的高级机制作为引用计数的辅助机制，用于解决产生的循环引用问题。而循环引用只会出现在“内部存在可以对其他对象引用的对象”，比如：list，class等。为了要将这些回收对象组织起来，需要建立一个链表。自然，每个被收集的对象内就需要多提供一些信息，下面代码是回收对象里必然出现的。

/* GC information is stored BEFORE the object structure. */
typedef union _gc_head {
    struct {
        union _gc_head *gc_next;
        union _gc_head *gc_prev;
        Py_ssize_t gc_refs;
    } gc;
    long double dummy;  /* force worst-case alignment */
} PyGC_Head;

一个对象的实际结构如图所示：

通过PyGC_Head的指针将每个回收对象连接起来，形成了一个链表，也就是在1里提到的初始化的所有对象。

分代收集？

分代技术是一种典型的以空间换时间的技术，这也正是java里的关键技术。这种思想简单点说就是：对象存在时间越长，越可能不是垃圾，应该越少去收集。分代回收的整体思想是：将系统中的所有内存块根据其存活时间划分为不同的集合，每个集合就成为一个“代”，垃圾收集频率随着“代”的存活时间的增大而减小，存活时间通常利用经过几次垃圾回收来度量。

Python默认定义了三代对象集合，索引数越大，对象存活时间越长。

举例：当某些内存块M经过了3次垃圾收集的清洗之后还存活时，我们就将内存块M划到一个集合A中去，而新分配的内存都划分到集合B中去。当垃圾收集开始工作时，大多数情况都只对集合B进行垃圾回收，而对集合A进行垃圾回收要隔相当长一段时间后才进行，这就使得垃圾收集机制需要处理的内存少了，效率自然就提高了。在这个过程中，集合B中的某些内存块由于存活时间长而会被转移到集合A中，当然，集合A中实际上也存在一些垃圾，这些垃圾的回收会因为这种分代的机制而被延迟。

这样的思想，可以减少标记-清除机制所带来的额外操作。分代就是将回收对象分成数个代，每个代就是一个链表（集合），代进行标记-清除的时间与代内对象存活时间成正比例关系。

/*** Global GC state ***/ 
struct gc_generation {
      PyGC_Head head;
      int threshold; /* collection threshold */
      int count; /* count of allocations or collections of younger
                    generations */
  };//每个代的结构 
#define NUM_GENERATIONS 3//代的个数
#define GEN_HEAD(n) (&generations[n].head)
 
/* linked lists of container objects */
static struct gc_generation generations[NUM_GENERATIONS] = {
     /* PyGC_Head,                               threshold,      count */
     {{{GEN_HEAD(0), GEN_HEAD(0), 0}},           700,            0},
     {{{GEN_HEAD(1), GEN_HEAD(1), 0}},           10,             0},
     {{{GEN_HEAD(2), GEN_HEAD(2), 0}},           10,             0},
 };
 
PyGC_Head *_PyGC_generation0 = GEN_HEAD(0);

从上面代码可以看出python里一共有三代，每个代的threshold值表示该代最多容纳对象的个数。默认情况下，当0代超过700,或1，2代超过10，垃圾回收机制将触发。0代触发将清理所有三代，1代触发会清理1,2代，2代触发后只会清理自己。

整个过程包括链表建立，确定根节点，垃圾标记，垃圾回收。

链表建立

0代触发将清理所有三代，1代触发会清理1,2代，2代触发后只会清理自己。在清理0代时，会将三个链表（代）链接起来，清理1代的时，会链接1,2两代。在后面三步，都是针对的这个建立之后的链表。

确定根节点

如下图的例子，ist1与list2循环引用，list3与list4循环引用，a是一个外部引用。

对于这样一个链表，我们如何得出根节点呢。python里是在引用计数的基础上又提出一个有效引用计数的概念。顾名思义，有效引用计数就是去除循环引用后的计数。下面是计算有效引用计数的相关代码：

/* Set all gc_refs = ob_refcnt.  After this, gc_refs is > 0 for all objects
 * in containers, and is GC_REACHABLE for all tracked gc objects not in
 * containers.
 */
static void
update_refs(PyGC_Head *containers)
{
    PyGC_Head *gc = containers->gc.gc_next;
    for (; gc != containers; gc = gc->gc.gc_next) {
        assert(gc->gc.gc_refs == GC_REACHABLE);
        gc->gc.gc_refs = Py_REFCNT(FROM_GC(gc));
        assert(gc->gc.gc_refs != 0);
    }
}
/* A traversal callback for subtract_refs. */
static int
visit_decref(PyObject *op, void *data)
{
    assert(op != NULL);
    if (PyObject_IS_GC(op)) {
        PyGC_Head *gc = AS_GC(op);
        /* We're only interested in gc_refs for objects in the
         * generation being collected, which can be recognized
         * because only they have positive gc_refs.
         */
        assert(gc->gc.gc_refs != 0); /* else refcount was too small */
        if (gc->gc.gc_refs > 0)
            gc->gc.gc_refs--;
    }
    return 0;
}
/* Subtract internal references from gc_refs.  After this, gc_refs is >= 0
 * for all objects in containers, and is GC_REACHABLE for all tracked gc
 * objects not in containers.  The ones with gc_refs > 0 are directly
 * reachable from outside containers, and so can't be collected.
 */
static void
subtract_refs(PyGC_Head *containers)
{
    traverseproc traverse;
    PyGC_Head *gc = containers->gc.gc_next;
    for (; gc != containers; gc=gc->gc.gc_next) {
        traverse = Py_TYPE(FROM_GC(gc))->tp_traverse;
        (void) traverse(FROM_GC(gc),
                       (visitproc)visit_decref,
                       NULL);
    }
}

update_refs函数里建立了一个引用的副本。

visit_decref函数对引用的副本减1，subtract_refs函数里traverse的作用是遍历对象里的每一个引用，执行visit_decref操作。

最后，链表内引用计数副本非0的对象，就是根节点了。

垃圾标记

接下来，python建立两条链表，一条存放根节点，以及根节点的引用对象。另外一条存放unreachable对象。

代码如下：

/* A traversal callback for move_unreachable. */
static int
visit_reachable(PyObject *op, PyGC_Head *reachable)
{
    if (PyObject_IS_GC(op)) {
        PyGC_Head *gc = AS_GC(op);
        const Py_ssize_t gc_refs = gc->gc.gc_refs;
        if (gc_refs == 0) {
            /* This is in move_unreachable's 'young' list, but
             * the traversal hasn't yet gotten to it.  All
             * we need to do is tell move_unreachable that it's
             * reachable.
             */
            gc->gc.gc_refs = 1;
        }
        else if (gc_refs == GC_TENTATIVELY_UNREACHABLE) {
            /* This had gc_refs = 0 when move_unreachable got
             * to it, but turns out it's reachable after all.
             * Move it back to move_unreachable's 'young' list,
             * and move_unreachable will eventually get to it
             * again.
             */
            gc_list_move(gc, reachable);
            gc->gc.gc_refs = 1;
        }
        /* Else there's nothing to do.
         * If gc_refs > 0, it must be in move_unreachable's 'young'
         * list, and move_unreachable will eventually get to it.
         * If gc_refs == GC_REACHABLE, it's either in some other
         * generation so we don't care about it, or move_unreachable
         * already dealt with it.
         * If gc_refs == GC_UNTRACKED, it must be ignored.
         */
         else {
            assert(gc_refs > 0
                   || gc_refs == GC_REACHABLE
                   || gc_refs == GC_UNTRACKED);
         }
    }
    return 0;
}
/* Move the unreachable objects from young to unreachable.  After this,
 * all objects in young have gc_refs = GC_REACHABLE, and all objects in
 * unreachable have gc_refs = GC_TENTATIVELY_UNREACHABLE.  All tracked
 * gc objects not in young or unreachable still have gc_refs = GC_REACHABLE.
 * All objects in young after this are directly or indirectly reachable
 * from outside the original young; and all objects in unreachable are
 * not.
 */
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;
    /* Invariants:  all objects "to the left" of us in young have gc_refs
     * = GC_REACHABLE, and are indeed reachable (directly or indirectly)
     * from outside the young list as it was at entry.  All other objects
     * from the original young "to the left" of us are in unreachable now,
     * and have gc_refs = GC_TENTATIVELY_UNREACHABLE.  All objects to the
     * left of us in 'young' now have been scanned, and no objects here
     * or to the right have been scanned yet.
     */
    while (gc != young) {
        PyGC_Head *next;
        if (gc->gc.gc_refs) {
            /* gc is definitely reachable from outside the
             * original 'young'.  Mark it as such, and traverse
             * its pointers to find any other objects that may
             * be directly reachable from it.  Note that the
             * call to tp_traverse may append objects to young,
             * so we have to wait until it returns to determine
             * the next object to visit.
             */
            PyObject *op = FROM_GC(gc);
            traverseproc traverse = Py_TYPE(op)->tp_traverse;
            assert(gc->gc.gc_refs > 0);
            gc->gc.gc_refs = GC_REACHABLE;
            (void) traverse(op,
                            (visitproc)visit_reachable,
                            (void *)young);
            next = gc->gc.gc_next;
        }
        else {
            /* This *may* be unreachable.  To make progress,
             * assume it is.  gc isn't directly reachable from
             * any object we've already traversed, but may be
             * reachable from an object we haven't gotten to yet.
             * visit_reachable will eventually move gc back into
             * young if that's so, and we'll see it again. 
             */
            next = gc->gc.gc_next;
            gc_list_move(gc, unreachable);
            gc->gc.gc_refs = GC_TENTATIVELY_UNREACHABLE;
        }
        gc = next;
    }
}

标记之后，链表如下图：

垃圾回收

回收的过程，就是销毁不可达链表内对象。下面代码就是list的清除方法：

/* Methods */
static void
list_dealloc(PyListObject *op)
{
    Py_ssize_t i;
    PyObject_GC_UnTrack(op);
    Py_TRASHCAN_SAFE_BEGIN(op)
    if (op->ob_item != NULL) {
        /* Do it backwards, for Christian Tismer.
           There's a simple test case where somehow this reduces
           thrashing when a *very* large list is created and
           immediately deleted. */
        i = Py_SIZE(op);
        while (--i >= 0) {
            Py_XDECREF(op->ob_item[i]);
        }
        PyMem_FREE(op->ob_item);
    }
    if (numfree < PyList_MAXFREELIST && PyList_CheckExact(op))
        free_list[numfree++] = op;
    else
        Py_TYPE(op)->tp_free((PyObject *)op);
    Py_TRASHCAN_SAFE_END(op)
}

参考：

http://www.cnblogs.com/hackerl/p/5901553.html

http://www.jianshu.com/p/1e375fb40506

http://www.cnblogs.com/Xjng/p/5128269.html

http://www.cnblogs.com/kaituorensheng/p/4449457.html