json: Why does a dictionary use so much RAM in Python?

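I have written a Python script that reads the contents of two files. The first is a relatively small file (~30 KB) and the second is a larger file (~270 MB). The contents of both files are loaded into a dictionary data structure. When the second file is loaded, I expected the amount of RAM required to be roughly equal to the size of the file on disk, perhaps with some overhead. Watching the RAM usage on my PC, however, it consistently takes about 2 GB (around 8 times the size of the file). The relevant source code is below (the pauses are there only so that I can see the RAM usage at each stage). The line that consumes the large amount of memory is "tweets = map(json.loads, tweet_file)":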

# Python 2
import json
import sys

scores = {}  # maps each term to its score

def get_scores(term_file):
    global scores
    for line in term_file:
        term, score = line.split("\t")  # tab character
        scores[term] = int(score)

def pause():
    raw_input('press any key to continue: ')

def main():
    # get terms and their scores..
    print 'open word list file ...'
    term_file = open(sys.argv[1])
    pause()
    print 'create dictionary from word list file ...'
    get_scores(term_file)
    pause()
    print 'close word list file ...'
    term_file.close()
    pause()

    # get tweets from file...
    print 'open tweets file ...'
    tweet_file = open(sys.argv[2])
    pause()
    print 'create list of dictionaries from tweets file ...'
    tweets = map(json.loads, tweet_file)  # creates a list of dictionaries (one per tweet)
    pause()
    print 'close tweets file ...'
    tweet_file.close()
    pause()

if __name__ == '__main__':
    main()

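Does anyone know why this is? My concern is that I would like to extend my research to larger files, but I will quickly run out of memory. Interestingly, memory usage does not seem to increase noticeably after opening the file (I think this just creates a pointer).

I have an idea: loop through the file one line at a time, process what I can, and store only the minimum I need for future reference, rather than loading everything into a list of dictionaries. But I was interested to see whether the roughly 8x multiplier from file size to memory when creating the dictionaries is in line with other people's experience?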
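
The line-by-line idea above could be sketched roughly as follows. This is only an illustration, written for Python 3 rather than the Python 2 of the original script. The helper name `iter_tweet_texts` and the assumption that each line is a JSON object with an optional `text` field are hypothetical, not taken from the original code.

```python
import io
import json


def iter_tweet_texts(lines):
    """Yield only the 'text' field of each JSON line, one line at a time.

    Assumption: every non-empty line is a JSON object; 'text' may be missing.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)  # only this one tweet is decoded at a time
        yield tweet.get("text", "")


# Tiny in-memory stand-in for the tweets file (made-up data).
sample = io.StringIO('{"text": "hello", "id": 1}\n{"id": 2}\n\n{"text": "bye"}\n')
texts = list(iter_tweet_texts(sample))
print(texts)
```

Because the generator keeps only the extracted field, the full decoded dictionary for each tweet can be freed before the next line is read. Passing a real file object instead of `sample` works the same way, since file objects iterate line by line.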

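My guess is that you have multiple copies of your dictionary stored in memory at the same time (in various formats). For example, the line:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)

will create a new copy (+400~1000 MB, including the dictionary overhead). But your original tweet_file stays in memory. Why such big numbers? If you work with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most characters use only 1 byte. If you are working with plain strings in Python 2, the size of a string in memory should be almost the same as its size on disk, so in that case you will have to find another explanation.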

EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:

>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42

As you can see, it appears that each character is encoded as one byte. But:

>>> sys.getsizeof("à")
42

That does not hold for "French" characters. And ...

>>> sys.getsizeof("世")
43
>>> sys.getsizeof("世界")
46

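For Japanese, each character takes 3 bytes.

The above results are site dependent, and are explained by the fact that my system uses "UTF-8" as the default encoding. The "size of the string" calculated just above is in fact the "size of the byte string" representing the given text.

If json.loads uses "unicode" strings, the results are somewhat different: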

>>> sys.getsizeof(u"")
52
>>> sys.getsizeof(u"a")
56
>>> sys.getsizeof(u"ab")
60
>>> sys.getsizeof(u"世")
56
>>> sys.getsizeof(u"世界")
60

In that case, as you can see, each extra character adds 4 extra bytes.
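
The arithmetic behind these numbers can be checked by comparing encodings. The sketch below (Python 3 syntax) uses UTF-8 as a stand-in for the file on disk and UTF-32 as a stand-in for a 4-bytes-per-character in-memory representation. It does not measure real object sizes, and the helper name `encoded_sizes` is made up for illustration.

```python
def encoded_sizes(text):
    """Return (bytes as UTF-8, bytes as UTF-32 without BOM) for text."""
    utf8_len = len(text.encode("utf-8"))
    utf32_len = len(text.encode("utf-32-le"))  # fixed 4 bytes per code point
    return utf8_len, utf32_len


# "\u00e0" is the accented letter a-grave; "\u4e16\u754c" is the two-character
# word used in the interpreter sessions above.
for s in ["ab", "\u00e0", "\u4e16\u754c"]:
    print(repr(s), encoded_sizes(s))
```

For ASCII-heavy text, which most tweet JSON is likely to be, this is the 1-byte versus 4-byte ratio described above.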

Maybe the file object caches some data? If you want to trigger explicit deallocation of an object, try setting its reference to None:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
[...]
tweet_file.close()
tweet_file = None

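When an object no longer has any references, Python deallocates it and frees the corresponding memory (within the Python heap; I don't think the memory is returned to the system).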


I wrote a quick test script to confirm your results...

# Python 2
import sys
import os
import json
import resource

def get_rss():
    # Peak resident set size; ru_maxrss is in kilobytes on Linux,
    # hence the conversion to bytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

def getsizeof_r(obj):
    # Recursively sum the sizes of the leaf objects only; the list and
    # dict containers themselves are not counted.
    total = 0
    if isinstance(obj, list):
        for i in obj:
            total += getsizeof_r(i)
    elif isinstance(obj, dict):
        for k, v in obj.iteritems():
            total += getsizeof_r(k) + getsizeof_r(v)
    else:
        total += sys.getsizeof(obj)
    return total

def main():
    start_rss = get_rss()
    filename = 'foo'
    f = open(filename, 'r')
    l = map(json.loads, f)
    f.close()
    end_rss = get_rss()

    print 'File size is: %d' % os.path.getsize(filename)
    print 'Data size is: %d' % getsizeof_r(l)
    print 'RSS delta is: %d' % (end_rss - start_rss)

if __name__ == '__main__':
    main()

...which prints...

File size is: 1060864
Data size is: 4313088
RSS delta is: 4722688

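...so I only get a four-fold increase, which could be explained by each Unicode character taking up four bytes of RAM.

Perhaps you could test your input file with this script, since I can't explain why your script shows an eight-fold increase.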
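
One thing worth noting about `getsizeof_r` is that it adds up only the leaf objects and skips the size of the list and dict containers themselves, so it may undercount. A rough variant that also counts containers could look like the sketch below. It is written for Python 3, `deep_sizeof` is a hypothetical name, and it still ignores allocator overhead, so treat its result as an estimate.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Estimate the size of obj including list/dict containers.

    Objects are tracked by id() so shared objects are counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    total = sys.getsizeof(obj)  # the container (or leaf) itself
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            total += deep_sizeof(item, seen)
    return total


# Made-up sample data: three small nested tweet-like objects.
data = [json.loads('{"user": {"name": "a"}, "text": "hello"}') for _ in range(3)]
leaves_only = sum(sys.getsizeof(s) for s in ("user", "name", "a", "text", "hello"))
print(deep_sizeof(data), leaves_only)
```

Whether container overhead accounts for the remaining gap in the question is not established here; running a measurement like this on the real input file would be one way to check.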