Python: How to iterate over rows in a DataFrame in Pandas?

I have a pandas DataFrame:

import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

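Now I want to iterate over the rows of this frame. For every row, I want to be able to access its elements (the values in its cells) by column name. For example:

for row in df.rows:
    print(row['c1'], row['c2'])

Is it possible to do that in pandas?

I found a similar question, but it does not give me the answer I need. For example, it suggests using:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.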

Related discussion

DataFrame.iterrows() is a generator that yields both the index and the row (as a Series):

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120
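Since the question asks what the row object actually is, here is a minimal sketch that inspects the first pair yielded by iterrows(). The names first_index and first_row are illustrative only and do not come from the answer above.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# iterrows() yields (index label, row) pairs; take only the first pair
first_index, first_row = next(df.iterrows())

print(type(first_row))                    # the row is a pandas Series
print(first_row.index.tolist())           # its index holds the column names
print(first_row['c1'], first_row['c2'])   # so cells are accessed by column name
```

Because each row is a Series keyed by the column names, row['c1'] works the same way as it does on any other Series.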

Related discussion

To iterate over the rows of a DataFrame in pandas, you can use:

DataFrame.iterrows()

for index, row in df.iterrows():
    print(row["c1"], row["c2"])

DataFrame.itertuples()

for row in df.itertuples(index=True, name='Pandas'):
    print(getattr(row, "c1"), getattr(row, "c2"))

itertuples() is supposed to be faster than iterrows().

But be aware, according to the docs (pandas 0.21.1 at the time of writing):

iterrows: dtype might not match from row to row

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).

iterrows: do not modify rows

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Use DataFrame.apply() instead:

new_df = df.apply(lambda x: x * 2)

itertuples:

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
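The two caveats quoted from the docs can be seen on a small frame. The sketch below is an illustration under assumptions: the frames df and odd and their column names are made up for the example and are not part of the answer.

```python
import pandas as pd

# One integer column and one float column
df = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})

_, series_row = next(df.iterrows())
tuple_row = next(df.itertuples(index=False))

# iterrows packs the row into a single Series, so the integer is upcast to float
print(type(series_row['i']), series_row['i'])
# itertuples keeps each value in its own column's type
print(type(tuple_row.i), tuple_row.i)

# A column name that is not a valid Python identifier gets a positional name
odd = pd.DataFrame({'a': [1], 'my col': [2]})
print(next(odd.itertuples(index=False))._fields)
```

If the renamed fields are inconvenient, renaming the columns before iterating is one way around it.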

Related discussion

While iterrows() is a good option, sometimes itertuples() can be much faster:

import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})

# %timeit is an IPython/Jupyter magic command
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 μs per loop
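The %timeit magic only works inside IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The helper names with_iterrows and with_itertuples are hypothetical, and the timings will differ from the figures quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(1000), 'b': rng.standard_normal(1000)})

def with_iterrows(frame):
    # Each row is materialised as a Series, which is the expensive part
    return [row['a'] * 2 for _, row in frame.iterrows()]

def with_itertuples(frame):
    # Each row is a lightweight namedtuple
    return [row.a * 2 for row in frame.itertuples(index=False)]

t_rows = timeit.timeit(lambda: with_iterrows(df), number=3)
t_tuples = timeit.timeit(lambda: with_itertuples(df), number=3)
print(f"iterrows:   {t_rows:.4f} s for 3 runs")
print(f"itertuples: {t_tuples:.4f} s for 3 runs")
```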

Related discussion

You can also use df.apply() to iterate over rows and pass multiple columns to a function.

Docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
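Because valuation_formula only uses arithmetic, it can also be called on whole columns at once. The sketch below compares both forms on a small made-up frame; the sample values and the column names price_apply and price_vectorized are assumptions for illustration.

```python
import pandas as pd

def valuation_formula(x, y):
    return x * y * 0.5

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})

# Row by row, as in the answer above
df['price_apply'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# Whole columns at once: the same function receives two Series
df['price_vectorized'] = valuation_formula(df['x'], df['y'])

print(df)
```

This only works when the function body consists of operations that pandas or NumPy can apply element-wise; a function with Python-level branching on scalar values would still need apply or a list comprehension.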

Related discussion

You can use df.iloc, as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

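Related discussion

I was looking for how to iterate over both rows and columns and ended up here, so:

for i, row in df.iterrows():
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for j, column in row.items():
        print(column)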

Related discussion

Use itertuples(). It is faster than iterrows():

for row in df.itertuples():
    print("c1 :", row.c1, "c2 :", row.c2)

Related discussion

You can write your own iterator that yields namedtuples.

from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. My aim is to perform the same task more efficiently.

For the given dataframe with my function:

list(myiter(df))

[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

list(df.itertuples(index=False))

[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test

We test making all columns available and subsetting the columns.

import numpy as np
import pandas as pd
from timeit import timeit

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

# Groups the columns into the "full" and "sub" cases and plots each group.
# Note: groupby(..., axis=1) is deprecated in recent pandas versions.
res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);

[Figure: log-log plots of the timing results for the full and subset cases]

Related discussion

To loop over all rows in a dataframe, you can use:

for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])

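Related discussion

Q: How to iterate over rows in a DataFrame in Pandas?

Don't!

Iteration in pandas is an anti-pattern, and is something you should only do when you have exhausted every other option. You should not consider using any function with "iter" in its name for anything more than a few thousand rows, or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List comprehensions (for loop)
4. DataFrame.apply(): i) reductions that can be performed in Cython, ii) iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are good at.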

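Appeal to authority

The docs page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

Faster than looping: vectorization, Cython

A good number of basic operations and computations are "vectorized" by pandas (either through NumPy or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on essential basic functionality to find a suitable vectorized method for your problem.

If none exists, feel free to write your own using custom Cython extensions.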
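As an illustration of the kinds of operation listed above, the sketch below applies arithmetic, a comparison, a reduction and a groupby to the question's example frame without any explicit loop. The derived column names (total, c1_is_even) and the variables c2_sum and sum_by_parity are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Arithmetic on whole columns
df['total'] = df['c1'] + df['c2']

# Comparison, producing a boolean column
df['c1_is_even'] = df['c1'] % 2 == 0

# Reduction over a column
c2_sum = df['c2'].sum()

# Groupby followed by a reduction within each group
sum_by_parity = df.groupby('c1_is_even')['c2'].sum()

print(df)
print(c2_sum)
print(sum_by_parity)
```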

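Next best thing: list comprehensions

If you are iterating because no vectorized solution is available, and performance is important (but not important enough to go through the hassle of cythonizing your code), use a list comprehension as the next best/simplest option.

To iterate over rows using a single column, use

result = [f(x) for x in df['col']]

To iterate over rows using multiple columns, you can use

# two column format
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]

# many column format
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]

If you need an integer row index while iterating, use enumerate:

result = [f(...) for i, row in enumerate(df[...].values)]

(where df.index[i] gives you the index label.)

If you can turn it into a function, you can use a list comprehension. You can make arbitrarily complex things work through the simplicity and speed of raw Python.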
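The templates above can be filled in as follows. This is only a sketch: the function f, the sample data and the variable labelled are hypothetical and stand in for whatever row-wise logic cannot be vectorized.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]}, index=['a', 'b', 'c'])

def f(x, y):
    # Hypothetical row-wise rule with a Python-level branch
    return x + y if x % 2 else y - x

# Two-column format
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]

# With an integer position i, used to look up the index label
labelled = [
    (df.index[i], f(*row))
    for i, row in enumerate(df[['col1', 'col2']].to_numpy())
]

print(result)
print(labelled)
```

to_numpy() is used here in place of .values; both return the underlying array for a frame like this one.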

Related discussion

IMHO, the simplest solution:

for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])

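Related discussion

To loop over all rows in a dataframe and use the values of each row conveniently, namedtuples can be converted to ndarrays. For example:

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))

results in:

[ 1. 0.1]
[ 2. 0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.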

Sometimes a useful pattern is:

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

which results in:

{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}

Why complicate things?

Simple.

import pandas as pd
import numpy as np

# Here is an example dataframe
df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

for idx, row in df_existing.iterrows():
    print(row['A'], row['B'], row['C'], row['D'])

Related discussion

There are many ways to iterate over the rows of a pandas dataframe. One very simple and intuitive way is:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # Print the value in the second column
    print(df.iloc[i, 1])
    # Print more than one column
    print(df.iloc[i, [0, 2]])

This example uses iloc to isolate each value in the dataframe.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a': a, 'b': b})

size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

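You can also use NumPy-style indexing for even greater speed-ups. It is not really iterating, but it works much better than iteration for certain applications.

subset = row['c1'][0:5]
all = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast:

np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) #resize every image in an hdf5 file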
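The fragments above depend on objects that are not defined in the answer (row, imgs, cv2). A self-contained sketch of the same idea on the question's example frame could look like this; the variable names c1, first_two and doubled are assumptions, and to_numpy() is used here as one way of getting the array that the answer obtains with np.asarray.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Pull one column out as a plain NumPy array
c1 = df['c1'].to_numpy()

# Slice it like any other array instead of looping over rows
first_two = c1[0:2]

# Operate on every element at once
doubled = c1 * 2

print(first_two)
print(doubled)
```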

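For both viewing and modifying values, I would use iterrows(). In the for loop, using tuple unpacking (see the example: i, row), I use row only to view the value, and use i with the loc method when I want to modify values. As stated in previous answers, you should not modify something you are iterating over.

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_value'

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should not write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.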
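To make the copy-versus-view point concrete, the sketch below first writes to row and shows that the frame is unchanged, then performs the same replacement with a boolean mask and loc, which avoids the loop entirely. The sample frame and the variable unchanged are made up for the example, and the "no effect" behaviour is shown for this mixed-dtype frame only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Old_Value', 'keep', 'Old_Value'], 'B': [1, 2, 3]})

# Writing to the row object changes only the temporary Series
for i, row in df.iterrows():
    row['A'] = 'ignored'
unchanged = df['A'].tolist()
print(unchanged)

# Writing through df.loc with a boolean mask updates the frame itself
df.loc[df['A'] == 'Old_Value', 'A'] = 'New_value'
print(df['A'].tolist())
```

The mask form gives the same result as the iterrows() loop with df.loc[i, 'A'] above, in a single assignment.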