Find the median from a CSV file using Python

I have a CSV file named "saleses.csv" with the following contents:

City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
Tokyo,Doctors,900
Tokyo,Lawyers,800
Tokyo,Plumbers,400
Lawyers,Doctors,300
Lawyers,Lawyers,400
Lawyers,Plumbers,500
Hong Kong,Doctors,1800
Hong Kong,Lawyers,1100
Hong Kong,Plumbers,1000
Moscow,Doctors,300
Moscow,Lawyers,200
Moscow,Plumbers,100
Berlin,Doctors,800
Berlin,Plumbers,900
Paris,Doctors,900
Paris,Lawyers,800
Paris,Plumbers,500
Paris,Dog catchers,400

I need to print the median salary for each job. I tried some code, but it raises an error.

My code is:

from StringIO import StringIO
import sqlite3
import csv
import operator #from operator import itemgetter, attrgetter

data = open('sal.csv', 'r').read()
string = ''.join(data)
f = StringIO(string)
reader = csv.reader(f)
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table data (City text, Job text, Salary real)''')
conn.commit()
count = 0

for e in reader:
    if count==0:
        print""
    else:
        e[0]=str(e[0])
        e[1]=str(e[1])
        e[2] = float(e[2])
        c.execute("""insert into data values (?,?,?)""", e)
        count=count+1
        conn.commit()

labels = []
counts = []
count = 0
c.execute('''select count(Salary),Job from data group by Job''')

for row in c:
      for i in row:
            if count==0:
               counts.append(i)
               count=count+1
           else:
                count=0
      labels.append(i)

c.execute('''select Salary,Job from data order by Job''')

count = 1
count1 = 1
temp = 0
pri = 0
lis = []

for row in c:
      lis.append(row)
for cons in counts:
      if cons%2 == 0:
         pri = cons/2
     else:
         pri = (cons+1)/2
     if count1 == 1:
        for li in lis:
              if count == pri:
                  print"Median is",li
        count = count + 1
        count = 0
        temp = pri+cons
     else:
        for li in lis:
              if count == temp:
                  print"Median is",li
              count = count+1
              count = 0
              temp = temp + pri
       count1 = count1 + 1

However, it raises an error:

IndentationError('expected an indented block', ('', 28, 2, 'if count==0:\n'))

How can I fix this error?


You can use a defaultdict to collect all the salaries for each job, then take the median of each list.

import csv
from collections import defaultdict
from statistics import median  # averages the two middle values when the count is even

with open("C:/Users/jimenez/Desktop/a.csv", "r") as f:
    d = defaultdict(list)
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # group salaries by job (column 1); the salary is column 2
        d[row[1]].append(float(row[2]))

for k, v in d.items():
    print("{} median is {}".format(k, median(v)))
    print("{} average is {}".format(k, sum(v) / len(v)))

Output

Doctors median is 800.0
Doctors average is 787.5
Lawyers median is 700.0
Lawyers average is 628.5714285714286
Plumbers median is 450.0
Plumbers average is 475.0
Dog catchers median is 400.0
Dog catchers average is 400.0


This is easy with pandas (http://pandas.pydata.org):

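import pandas as pd
# header=0 treats the first row (City,Job,Salary) as a header and replaces it with `names`
df = pd.read_csv('test.csv', names=['City', 'Job', 'Salary'], header=0)
# numeric_only=True leaves out the non-numeric City column
df.groupby('Job').median(numeric_only=True)

#               Salary
# Job
# Doctors        800.0
# Dog catchers   400.0
# Lawyers        700.0
# Plumbers       450.0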

If you want the mean instead of the median:

df.groupby('Job').mean(numeric_only=True)

#                   Salary
# Job                    
# Doctors       787.500000
# Dog catchers  400.000000
# Lawyers       628.571429
# Plumbers      475.000000


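If your problem is just calculating the median, then rather than inserting everything into an SQL database and juggling it there, simply read all the rows, group the salaries into one list per profession, and take the median from each list. That shrinks your roughly hundred-line script to: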

import csv
professions = {}

with open("sal.csv") as data:
    reader = csv.reader(data)
    next(reader)  # skip the "City,Job,Salary" header row
    for city, profession, salary in reader:
        professions.setdefault(profession.strip(), []).append(int(salary.strip()))

for profession, salaries in sorted(professions.items()):
    # take the middle element of the sorted list (the upper middle one for even counts)
    print("{}: {}".format(profession, sorted(salaries)[len(salaries) // 2]))

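(Give or take 1 on the index into the sorted salaries to get the proper median: for an even number of salaries, the exact median is the average of the two middle values.)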
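
The "give or take 1" adjustment can be sketched as a small helper that averages the two middle values when the count is even. This is only an illustration under stated assumptions: the `median` function name and the in-memory `sample` data (a handful of rows standing in for the real CSV file) are hypothetical and not part of the answer above. On Python 3.4 and later, `statistics.median` from the standard library behaves the same way for non-empty lists.

```python
import csv
import io


def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        # odd count: there is a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample standing in for the CSV file (only a few rows)
sample = io.StringIO(
    "City,Job,Salary\n"
    "Delhi,Plumbers,100\n"
    "London,Plumbers,300\n"
    "Tokyo,Plumbers,400\n"
    "Lawyers,Plumbers,500\n"
    "Delhi,Lawyers,400\n"
    "London,Lawyers,700\n"
    "Tokyo,Lawyers,800\n"
)

professions = {}
reader = csv.reader(sample)
next(reader)  # skip the header row
for city, profession, salary in reader:
    professions.setdefault(profession.strip(), []).append(int(salary.strip()))

# one median per profession
medians = {job: median(salaries) for job, salaries in professions.items()}
for job, value in sorted(medians.items()):
    print("{}: {}".format(job, value))
```

With this sample, Plumbers has four salaries, so the helper averages 300 and 400; Lawyers has three, so it returns the middle value 700 unchanged.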