Find the median from a CSV File using Python
I have a CSV file named "saleses.csv" with the following contents:
City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
Tokyo,Doctors,900
Tokyo,Lawyers,800
Tokyo,Plumbers,400
Lawyers,Doctors,300
Lawyers,Lawyers,400
Lawyers,Plumbers,500
Hong Kong,Doctors,1800
Hong Kong,Lawyers,1100
Hong Kong,Plumbers,1000
Moscow,Doctors,300
Moscow,Lawyers,200
Moscow,Plumbers,100
Berlin,Doctors,800
Berlin,Plumbers,900
Paris,Doctors,900
Paris,Lawyers,800
Paris,Plumbers,500
Paris,Dog catchers,400
I need to print the median salary for each job. I tried some code, but it raises an error.
My code is:
    from StringIO import StringIO
    import sqlite3
    import csv
    import operator
    #from operator import itemgetter, attrgetter

    data = open('sal.csv', 'r').read()
    string = ''.join(data)
    f = StringIO(string)
    reader = csv.reader(f)

    conn = sqlite3.connect(':memory:')
    c = conn.cursor()
    c.execute('''create table data (City text, Job text, Salary real)''')
    conn.commit()
    count = 0

    for e in reader:
        if count==0:
            print""
        else:
            e[0]=str(e[0])
            e[1]=str(e[1])
            e[2] = float(e[2])
            c.execute("""insert into data values (?,?,?)""", e)
        count=count+1
    conn.commit()

    labels = []
    counts = []
    count = 0
    c.execute('''select count(Salary),Job from data group by Job''')
    for row in c:
        for i in row:
            if count==0:
                counts.append(i)
                count=count+1
            else:
                count=0
                labels.append(i)

    c.execute('''select Salary,Job from data order by Job''')
    count = 1
    count1 = 1
    temp = 0
    pri = 0
    lis = []
    for row in c:
        lis.append(row)
    for cons in counts:
        if cons%2 == 0:
            pri = cons/2
        else:
            pri = (cons+1)/2
        if count1 == 1:
            for li in lis:
                if count == pri:
                    print"Median is",li
                count = count + 1
            count = 0
            temp = pri+cons
        else:
            for li in lis:
                if count == temp:
                    print"Median is",li
                count = count+1
            count = 0
            temp = temp + pri
        count1 = count1 + 1
However, it fails with this error:

    IndentationError('expected an indented block', ('', 28, 2, 'if count==0: '))
How do I fix the error?
You can use a defaultdict to collect all the salaries for each job, and then take the median of each list.
    import csv
    from collections import defaultdict

    with open("C:/Users/jimenez/Desktop/a.csv", "r") as f:
        d = defaultdict(list)
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # row[1] is the job, row[2] is the salary
            d[row[1]].append(float(row[2]))

    for k, v in d.items():
        print("{} median is {}".format(k, sorted(v)[len(v) // 2]))
        print("{} average is {}".format(k, sum(v) / len(v)))
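Output (as originally posted from Python 2; on Python 3 the jobs appear in file order and floats print at full precision):

    Plumbers median is 500.0
    Plumbers average is 475.0
    Lawyers median is 700.0
    Lawyers average is 628.571428571
    Dog catchers median is 400.0
    Dog catchers average is 400.0
    Doctors median is 800.0
    Doctors average is 787.5

Note that `sorted(v)[len(v) // 2]` takes the upper of the two middle values when a job has an even number of salaries, so Plumbers is reported as 500.0 rather than 450.0, the average of the two middle salaries (400 and 500).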
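If a conventional median is wanted, one that averages the two middle values for an even-sized group, the standard library's `statistics.median` can stand in for the manual indexing. The following is only a sketch: the function name `median_by_job` and the inline sample rows (a subset of the question's data) are illustrative assumptions, not part of the original answer.

```python
import csv
import io
import statistics
from collections import defaultdict


def median_by_job(lines):
    """Return {job: median salary} from an iterable of CSV lines with a header row."""
    salaries = defaultdict(list)
    reader = csv.reader(lines)
    next(reader)  # skip the "City,Job,Salary" header
    for city, job, salary in reader:
        salaries[job].append(float(salary))
    # statistics.median averages the two middle values for even-sized groups
    return {job: statistics.median(values) for job, values in salaries.items()}


# Hypothetical sample: a few rows taken from the question's data
sample = io.StringIO(
    "City,Job,Salary\n"
    "Delhi,Plumbers,100\n"
    "London,Plumbers,300\n"
    "Tokyo,Plumbers,400\n"
    "Paris,Plumbers,500\n"
    "Delhi,Doctors,500\n"
    "London,Doctors,800\n"
    "Tokyo,Doctors,900\n"
)
result = median_by_job(sample)
print(result)
```

With a real file, the same function should accept the open file object directly, for example `with open("sal.csv", newline="") as f: median_by_job(f)`.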
Use pandas:
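    import pandas as pd

    # names= supplies the column names and treats every row as data, so it
    # suits a file without a header line; for a file that starts with
    # "City,Job,Salary", drop names= (or also pass header=0).
    df = pd.read_csv('test.csv', names=['City', 'Job', 'Salary'])

    # Select the numeric column explicitly so the text column City is left out
    df.groupby('Job')[['Salary']].median()

    #               Salary
    # Job
    # Doctors          800
    # Dog catchers     400
    # Lawyers          700
    # Plumbers         450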
If you want the mean instead of the median:
    df.groupby('Job')[['Salary']].mean()

    #                   Salary
    # Job
    # Doctors       787.500000
    # Dog catchers  400.000000
    # Lawyers       628.571429
    # Plumbers      475.000000
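Both statistics can also be requested in a single call with `agg`. This is a minimal sketch, assuming pandas is installed; the inline `csv_text` sample (a few rows from the question's file, header included) and the variable name `stats` are illustrative and not from the original answer.

```python
import io

import pandas as pd

# Hypothetical sample: a few rows from the question's file, header included
csv_text = """City,Job,Salary
Delhi,Doctors,500
London,Doctors,800
Tokyo,Doctors,900
Delhi,Plumbers,100
London,Plumbers,300
"""

# The header row supplies the column names, so names= is not needed here
df = pd.read_csv(io.StringIO(csv_text))

# One row per job, one column per statistic
stats = df.groupby("Job")["Salary"].agg(["median", "mean"])
print(stats)
```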
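If your problem is just computing the median, then instead of inserting everything into an SQL database and juggling it there, simply read all the rows, group the salaries into one list per profession, and take the median from each list. That reduces your script of roughly a hundred lines to: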
    import csv

    professions = {}

    with open("sal.csv") as data:
        reader = csv.reader(data)
        next(reader)  # skip the "City,Job,Salary" header row
        for city, profession, salary in reader:
            professions.setdefault(profession.strip(), []).append(int(salary.strip()))

    for profession, salaries in sorted(professions.items()):
        # middle element of the sorted list (upper middle for an even count)
        print("{}: {}".format(profession, sorted(salaries)[len(salaries) // 2]))
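(Give or take 1 on the index to get the proper median from the sorted salaries; for an even number of salaries the proper median lies between the two middle values.)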
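The index adjustment mentioned above can be written out by hand. The sketch below is one way to do it, not the answerer's code: the helper name `median` is an assumption, and the two salary lists are copied from the question's data for Plumbers and Lawyers.

```python
def median(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:  # odd count: a single middle element
        return ordered[mid]
    # even count: average the two elements around the midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2


# Plumbers' salaries from the question's data (eight values)
plumbers = [100, 300, 400, 500, 1000, 100, 900, 500]
print(median(plumbers))  # 450.0

# Lawyers' salaries from the question's data (seven values)
lawyers = [400, 700, 800, 400, 1100, 200, 800]
print(median(lawyers))  # 700
```

In the script above, replacing `sorted(salaries)[len(salaries) // 2]` with `median(salaries)` would give 450.0 for Plumbers, which matches the pandas result.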