Combine two Python data processing scripts into a single workflow
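I am working on a data processing task.
I have two Python scripts, each implementing a separate function. They operate on the same data, so I think they could be combined into a single workflow, but I can't work out the most sensible way to do it.
The data file is here. It is JSON, but it has two distinct components.
The first part looks like this: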
```json
{
  "links": {
    "self": "http://localhost:2510/api/v2/jobs?skills=data%20science"
  },
  "data": [
    {
      "id": 121,
      "type": "job",
      "attributes": {
        "title": "Data Scientist",
        "date": "2014-01-22T15:25:00.000Z",
        "description": "Data scientists are in increasingly high demand amongst tech companies in London. Generally a combination of business acumen and technical skills are sought. Big data experience ..."
      },
      "relationships": {
        "location": {
          "links": {
            "self": "http://localhost:2510/api/v2/jobs/121/location"
          },
          "data": {
            "type": "location",
            "id": 3
          }
        },
        "country": {
          "links": {
            "self": "http://localhost:2510/api/v2/jobs/121/country"
          },
          "data": {
            "type": "country",
            "id": 1
          }
        },
```
It is processed by the first Python script, here:
```python
import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

locations = defaultdict(int)

for item in data['data']:
    location = item['relationships']['location']['data']['id']
    locations[location] += 1

pprint(locations)
```
which presents the data in this form:
```
1: 6,
2: 20,
3: 2673,
4: 126,
5: 459,
6: 346,
8: 11,
9: 68,
10: 82,
```
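These are the location ids, each with the number of records that reference it.
The other part of the JSON object looks like this: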
```json
"included": [
  {
    "id": 3,
    "type": "location",
    "attributes": {
      "name": "Victoria",
      "coord": [
        51.503378,
        -0.139134
      ]
    }
  },
```
and is processed by the following Python file:
```python
import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

locations = defaultdict(int)

for record in data['included']:
    id = record.get('id', None)
    name = record.get('attributes', {}).get('name', None)
    coord = record.get('attributes', {}).get('coord', None)
    print(id, name, coord)
```
It outputs the data in this format:
```
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
71 data science None
32 None None
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
1 data mining None
22 data analysis None
33 sdlc None
38 artificial intelligence None
39 machine learning None
40 software development None
71 data science None
93 devops None
63 None None
52 Cubitt Town [51.505199, -0.018848]
```
What I actually want is final output like this:
```
3, Victoria, [51.503378, -0.139134], 2673
```
where the last value, 2673, is the count for location id 3 from the first script's output.
If it does not have any coordinates, for example
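I believe it is possible to combine these scripts and get this output, but I am not a big-picture thinker and I can't work out how to do it.
All the real project files are here.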
Use functions:
```python
import json
from collections import defaultdict


def process_locations_data(data):
    # Processes the 'data' block: count how many jobs reference each location id.
    locations = defaultdict(int)
    for item in data['data']:
        location = item['relationships']['location']['data']['id']
        locations[location] += 1
    return locations


def process_locations_included(data):
    # Processes the 'included' block: collect (id, name, coord) for each record.
    return_list = []
    for record in data['included']:
        id = record.get('id', None)
        name = record.get('attributes', {}).get('name', None)
        coord = record.get('attributes', {}).get('coord', None)
        return_list.append((id, name, coord))
    return return_list  # list of tuples


# Load the data from file once.
with open('data-science.txt') as data_file:
    data = json.load(data_file)

# Use the two functions on the same data.
locations = process_locations_data(data)
records = process_locations_included(data)

# Combine the data for printing.
for record in records:
    id, name, coord = record
    references = locations[id]  # look up the reference count in the dict
    print(id, name, coord, references)
```
The functions could have better names, but this should achieve the unification you are looking for.
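One thing to watch: the sample output in the question lists id 1 for both "United Kingdom" and "data mining", and id 3 for "Victoria" twice, so a lookup on id alone may attach a location count to a non-location record or print the same row more than once. The following is only a sketch of one way to guard against that. It assumes every location record in `included` carries `"type": "location"` as in the excerpt; the function name `location_report` and the `sample` data are made up for illustration.

```python
from collections import Counter


def location_report(data):
    """Return (id, name, coord, count) for each distinct location record."""
    # Count how many jobs reference each location id.
    counts = Counter(
        item['relationships']['location']['data']['id']
        for item in data['data']
    )
    rows = []
    seen = set()
    for record in data['included']:
        # Ids can repeat across types, so keep only 'location' records
        # (assumption: locations are always tagged with this type).
        if record.get('type') != 'location':
            continue
        loc_id = record.get('id')
        if loc_id in seen:  # the same record may be listed more than once
            continue
        seen.add(loc_id)
        attributes = record.get('attributes', {})
        rows.append(
            (loc_id, attributes.get('name'), attributes.get('coord'), counts[loc_id])
        )
    return rows


# Tiny hand-made sample in the same shape as the file (values are illustrative).
sample = {
    'data': [
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
    ],
    'included': [
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
        {'id': 1, 'type': 'country', 'attributes': {'name': 'United Kingdom'}},
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
    ],
}

for row in location_report(sample):
    print(*row, sep=', ')  # prints: 3, Victoria, [51.503378, -0.139134], 2
```

With the real file you would pass the result of `json.load` instead of `sample`. Printing with `sep=', '` gives the comma-separated layout asked for in the question.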