Python Performance Tuning: JSON to CSV, big file

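A colleague asked me to convert 6 huge files from the "Yelp Dataset Challenge" from somewhat "flat" regular JSON to CSV (he thinks they look like fun teaching data).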

I thought I could do it with:

# With thanks to http://www.diveintopython3.net/files.html and https://www.reddit.com/r/MachineLearning/comments/33eglq/python_help_jsoncsv_pandas/cqkwyu8/

import os
import pandas

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'
if not os.path.exists(csvdir): os.makedirs(csvdir)

for file in os.listdir(jsondir):
    # Reads every line of the file into memory at once
    with open(jsondir+file, 'r', encoding='utf-8') as f: data = f.readlines()
    # Joins the one-object-per-line input into a single JSON array string
    df = pandas.read_json('[' + ','.join(map(lambda x: x.rstrip(), data)) + ']')
    df.to_csv(csvdir+os.path.splitext(file)[0]+'.csv',index=0,quoting=1)

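Unfortunately, my computer's memory isn't up to the task at this file size. (Even without the loop, although it cranks out a 50MB file in under a minute, it struggles to avoid freezing my computer or crashing on files of 100MB or more, and the largest file is 3.25GB.)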

Is there something else simple but performant that I could use instead?

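It would be nice to do this in a loop, but I can also run it 6 times with separate file names if that makes a difference for memory (there are only 6 files).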

Here is an example of the contents of a ".json" file. Note that each file actually holds many JSON objects, one per line.

{"business_id":"xyzzy","name":"Business A","neighborhood":"","address":"XX YY ZZ","city":"Tempe","state":"AZ","postal_code":"85283","latitude":33.32823894longitude":-111.28948,"stars":3,"review_count":3,"is_open":0,"attributes":["BikeParking: True","BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","DogsAllowed: False","RestaurantsPriceRange2: 2","WheelchairAccessible: True"],"categories":["Tobacco Shops","Nightlife","Vape Shops","Shopping"],"hours":["Monday 11:0-21:0","Tuesday 11:0-21:0","Wednesday 11:0-21:0","Thursday 11:0-21:0","Friday 11:0-22:0","Saturday 10:0-22:0","Sunday 11:0-18:0"],"type":"business"}
{"business_id":"dsfiuweio2f","name":"Some Place","neighborhood":"","address":"Strip or something","city":"Las Vegas","state":"NV","postal_code":"89106","latitude":36.189134,"longitude":-115.92094,"stars":1.5,"review_count":2,"is_open":1,"attributes":["BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True"],"categories":["Caterers","Grocery","Food","Event Planning & Services","Party & Event Planning","Specialty Food"],"hours":["Monday 0:0-0:0","Tuesday 0:0-0:0","Wednesday 0:0-0:0","Thursday 0:0-0:0","Friday 0:0-0:0","Saturday 0:0-0:0","Sunday 0:0-0:0"],"type":"business"}
{"business_id":"abccb","name":"La la la","neighborhood":"Blah blah","address":"Yay that","city":"Toronto","state":"ON","postal_code":"M6H 1L5","latitude":43.283984,"longitude":-79.28284,"stars":2,"review_count":6,"is_open":1,"attributes":["Alcohol: none","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}","BikeParking: True","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}","Caters: True","GoodForKids: True","GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False}","HasTV: True","NoiseLevel: quiet","OutdoorSeating: False","RestaurantsAttire: casual","RestaurantsDelivery: True","RestaurantsGoodForGroups: True","RestaurantsPriceRange2: 1","RestaurantsReservations: False","RestaurantsTableService: False","RestaurantsTakeOut: True","WiFi: free"],"categories":["Restaurants","Pizza","Chicken Wings","Italian"],"hours":["Monday 11:0-2:0","Tuesday 11:0-2:0","Wednesday 11:0-2:0","Thursday 11:0-3:0","Friday 11:0-3:0","Saturday 11:0-3:0","Sunday 11:0-2:0"],"type":"business"}

The nested JSON data can simply be kept as a string literal representing it; I only want to convert the top-level keys into CSV file headers.


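The problem is that your code reads the entire file into memory and then builds a near-copy of it in memory. I suspect it also creates a third copy, but I haven't verified that. The solution, as NEOX suggested, is to read the file line by line and process each line as you go. Here is a replacement for the for loop: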

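for file in os.listdir(jsondir):
    csv_file = csvdir + os.path.splitext(file)[0] + '.csv'
    # newline='' keeps the CSV writer from emitting extra blank rows on Windows
    with open(jsondir+file, 'r', encoding='utf-8') as f, open(csv_file, 'w', encoding='utf-8', newline='') as csv:
        header = True  # write the column names only once
        for line in f:
            # Parse one JSON object at a time so only a single row is held in memory
            df = pandas.read_json(''.join(('[', line.rstrip(), ']')))
            df.to_csv(csv, header=header, index=0, quoting=1)
            header = False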

I've tested this with Python 3.5 on a Mac; it should work on Windows, but I haven't tested it there.
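Separately from the tested loop above, the same line-by-line idea can be sketched without pandas, using only the standard library. This is a minimal sketch, not something the answer tested: the helper name `jsonl_to_csv` is hypothetical, it assumes every line is one complete JSON object, and it assumes the first object's top-level keys are the CSV header (keys that only appear on later lines are dropped). Nested lists and dicts are kept as their JSON text, as the question asks.

```python
import csv
import json


def jsonl_to_csv(src_path, dst_path):
    """Stream a one-JSON-object-per-line file into a CSV file (hypothetical helper)."""
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = None
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep nested lists/dicts as the JSON text that represents them.
            row = {key: json.dumps(value) if isinstance(value, (list, dict)) else value
                   for key, value in record.items()}
            if writer is None:
                # Assumption: the first object's top-level keys define the header.
                writer = csv.DictWriter(dst, fieldnames=list(row),
                                        quoting=csv.QUOTE_ALL,
                                        restval='', extrasaction='ignore')
                writer.writeheader()
            writer.writerow(row)
```

Only one line is held in memory at a time, so memory use should not grow with the file size. It could be called once per file inside the existing `os.listdir(jsondir)` loop.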

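Notes:

  • I've tweaked your JSON data; the latitude/longitude on the first line appears to contain an error.

  • This has only been tested with a small file; I'm not sure where to get the 3.5GB file.

  • I'm assuming this is a one-off for your friend. If this were production code, you would need to verify that the exception handling of the "with" statement is correct. See "How can I open multiple files using 'with open' in Python?" for details.

  • This should be fairly performant, but again, I'm not sure where to get your big files.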
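On the performance point, building one DataFrame per line carries some overhead. A possible middle ground, sketched here and not benchmarked against the real files, is to let `pandas.read_json` read the file in chunks through its `lines=True` and `chunksize` parameters. The function name `convert_in_chunks` and the default of 10000 rows per chunk are hypothetical, and the sketch assumes every chunk has the same top-level keys in the same order.

```python
import pandas


def convert_in_chunks(json_path, csv_path, chunksize=10000):
    """Convert a one-object-per-line JSON file to CSV, a chunk of rows at a time."""
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames
    # instead of loading the whole file.
    reader = pandas.read_json(json_path, lines=True, chunksize=chunksize)
    with open(csv_path, 'w', encoding='utf-8', newline='') as out:
        for i, chunk in enumerate(reader):
            # Write the header with the first chunk only.
            chunk.to_csv(out, header=(i == 0), index=False, quoting=1)
```

Memory use is then bounded by the chunk size, not the file size, so `chunksize` can be lowered if the machine still struggles.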