Python: How do I get the row count of a pandas DataFrame?

I am trying to get the number of rows of a DataFrame df with pandas, and here is my code.

Method 1:

total_rows = df.count
print total_rows +1

Method 2:

total_rows = df['First_columnn_label'].count
print total_rows +1

Both code snippets give me this error:

TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'

What am I doing wrong?

You can use the .shape property, or just len(DataFrame.index). However, there are notable performance differences (len(DataFrame.index) is the fastest):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.arange(12).reshape(4,3))

In [4]: df
Out[4]:
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [5]: df.shape
Out[5]: (4, 3)

In [6]: timeit df.shape
2.77 μs ± 644 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: timeit df[0].count()
348 μs ± 1.31 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: len(df.index)
Out[8]: 4

In [9]: timeit len(df.index)
990 ns ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

EDIT: As @Dan Allen noted in the comments, len(df.index) and df[0].count() are not interchangeable, because count excludes NaNs.
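To make that caveat concrete, here is a minimal sketch on a small made-up frame (the name demo and its columns are illustrative, not from the question). It also shows that count is a method: without the parentheses you get the method object itself, which is what the TypeError in the question is complaining about.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "x" has one missing value.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", "c"]})

n_rows = len(demo.index)          # counts every row, NaN or not
n_non_null = demo["x"].count()    # counts only the non-null entries of "x"

# Without parentheses, count is a bound method, not a number.
count_is_method = callable(demo["x"].count)

print(n_rows, n_non_null, count_is_method)
```

So if the goal is "how many rows", prefer len(); reach for count() only when you really want the non-null count.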

Suppose df is your DataFrame, then:

count_row = df.shape[0]  # number of rows
count_col = df.shape[1]  # number of columns
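Since shape is an ordinary tuple, you can also unpack both numbers in one go. A small sketch, using a made-up frame named frame:

```python
import pandas as pd

# Hypothetical 4 x 3 frame for illustration.
frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})

# shape is a plain (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = frame.shape

# An empty slice keeps its columns but reports zero rows.
empty = frame.iloc[0:0]

print(n_rows, n_cols, empty.shape)
```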

Use len(df). This works as of pandas 0.11 or maybe even earlier.

__len__() is currently (0.12) documented with "Returns length of index". Timing info, set up the same way as in root's answer:

In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop

In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop

Because of one additional function call it is a bit slower than calling len(df.index) directly, but this should not matter in most use cases.
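If you want to reproduce this kind of comparison outside IPython, something along these lines should work with the standard timeit module. This is only a sketch: the names df_demo, candidates and timings are mine, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(4, 3))

# The three row-count spellings discussed in this thread.
candidates = {
    "len(df.index)": lambda: len(df_demo.index),
    "len(df)": lambda: len(df_demo),
    "df.shape[0]": lambda: df_demo.shape[0],
}

n_calls = 100_000
timings = {}
for label, fn in candidates.items():
    # Total seconds for n_calls invocations of each spelling.
    timings[label] = timeit.timeit(fn, number=n_calls)

for label, seconds in timings.items():
    print(f"{label:15s} {seconds / n_calls * 1e9:8.0f} ns per call")
```

The lambda wrapper adds a little overhead to every candidate equally, so only the relative ordering is meaningful here.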

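len() is your friend; the short answer for the row count is len(df).

Alternatively, you can access all rows through df.index and all columns through df.columns. Since len(anyList) gives you the length of a list, you can use len(df.index) for the number of rows and len(df.columns) for the number of columns.

Or you can use df.shape, which returns the number of rows and columns together. If you only want the number of rows, use df.shape[0]; for the number of columns, use df.shape[1].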

Apart from the answers above, you can use df.axes to get a list holding the row index and the column index, and then apply the len() function:

total_rows = len(df.axes[0])
total_cols = len(df.axes[1])
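For illustration, the two axes can also be unpacked by name first. A minimal sketch on a made-up frame (frame, row_index and col_index are my own names):

```python
import pandas as pd

# Hypothetical 2 x 3 frame for illustration.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axes holds the row index first and the column index second.
row_index, col_index = frame.axes

total_rows = len(row_index)
total_cols = len(col_index)

print(total_rows, total_cols)
```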

How do I get the row count of a Pandas DataFrame?

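Below is a table summarising the different situations in which you would want to count something, along with the recommended method(s).

[Image: summary table of counting methods]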

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aaabbccd'), 'B': ['x', 'x', np.nan, np.nan, 'x', 'x', 'x', np.nan]})
s = df['B'].copy()

df

   A    B
0  a    x
1  a    x
2  a  NaN
3  b  NaN
4  b    x
5  c    x
6  c    x
7  d  NaN

s

0      x
1      x
2    NaN
3    NaN
4      x
5      x
6      x
7    NaN
Name: B, dtype: object

Row count of a DataFrame: len(df), df.shape[0], or len(df.index).

len(df)
# 8

df.shape[0]
# 8

len(df.index)
# 8

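It seems silly to compare the performance of constant-time operations, especially when the difference is at the "seriously, don't worry about it" level. But this seems to be a trend in the other answers, so I am doing the same for completeness.

Of the 3 methods above, len(df.index) (as mentioned in other answers) is the fastest.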

Note

All the methods above are constant time operations as they are simple attribute lookups.

df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols). For example, df.shape returns (8, 2) for the example here.

Row count of a Series: len(s), s.size, len(s.index).

len(s)
# 8

s.size
# 8

len(s.index)
# 8

s.size and len(s.index) are about the same in terms of speed. But I recommend len(df).

Note
size is an attribute, and it returns the number of elements (=count
of rows for any Series). DataFrames also define a size attribute which
returns the same result as df.shape[0] * df.shape[1].
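A quick sketch of that note, rebuilding the same example data under the names df_demo and s_demo (my names, chosen so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": list("aaabbccd"),
    "B": ["x", "x", np.nan, np.nan, "x", "x", "x", np.nan],
})
s_demo = df_demo["B"]

series_size = s_demo.size         # number of elements, NaNs included
frame_size = df_demo.size         # rows * columns
series_non_null = s_demo.count()  # non-null elements only

print(series_size, frame_size, series_non_null)
```

Note that size counts NaN entries too, so on a Series it matches len(s) rather than s.count().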

Column count of a DataFrame: df.shape[1], len(df.columns).

df.shape[1]
# 2

len(df.columns)
# 2

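Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).

Non-NaN row count: DataFrame.count.

This is a slightly different kind of count: it does not count every row, only the non-null values.

For a Series, you can use Series.count():

s.count()  # 5

Calling DataFrame.count() returns the non-NaN count for each column: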

df.count()

A    8
B    5
dtype: int64
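If what you actually need is the number of rows without any missing value, rather than a per-column count, one possible approach is sketched below. The names df_demo, non_null_per_row, complete_rows and incomplete_rows are mine; the data is the same as in the setup.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": list("aaabbccd"),
    "B": ["x", "x", np.nan, np.nan, "x", "x", "x", np.nan],
})

# Non-null values per row (axis=1), rather than per column.
non_null_per_row = df_demo.count(axis=1)

# Rows in which every column is non-null.
complete_rows = int(df_demo.notna().all(axis=1).sum())

# Rows with at least one missing value.
incomplete_rows = len(df_demo) - complete_rows

print(complete_rows, incomplete_rows)
```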

Count all rows per group (Series/DataFrame): GroupBy.size.

For Series, use SeriesGroupBy.size().

s.groupby(df.A).size()

A
a    3
b    2
c    2
d    1
Name: B, dtype: int64

For DataFrames, use DataFrameGroupBy.size().

df.groupby('A').size()

A
a    3
b    2
c    2
d    1
dtype: int64

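Count only non-NaN rows per group (Series/DataFrame): GroupBy.count.

Similar to the above, but use count(), not size(). Note that size() returns a Series (with the default as_index=True), whereas count() returns a Series or a DataFrame depending on how it is called.

The following two statements return the same thing: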

df.groupby('A')['B'].size()
df.groupby('A').size()

A
a    3
b    2
c    2
d    1
Name: B, dtype: int64

Meanwhile, for count, we have

df.groupby('A').count()

   B
A
a  2
b  1
c  2
d  0

...called on the entire GroupBy object, versus

df.groupby('A')['B'].count()

A
a    2
b    1
c    2
d    0
Name: B, dtype: int64

...called on a specific column.
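To see size() and count() side by side, the two per-group results can be placed in one frame. This is just a sketch; df_demo, grouped, summary and the column labels n_rows / n_non_null are names I made up.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "A": list("aaabbccd"),
    "B": ["x", "x", np.nan, np.nan, "x", "x", "x", np.nan],
})

grouped = df_demo.groupby("A")

# size() counts every row in the group; count() skips NaN in column B.
summary = pd.DataFrame({
    "n_rows": grouped["B"].size(),
    "n_non_null": grouped["B"].count(),
})

print(summary)
```

Group d has one row but zero non-null values in B, which is exactly the difference between the two methods.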

Row count (use either one):

df.shape[0]
len(df)

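I come to pandas from an R background, and I find pandas more complicated when it comes to selecting rows or columns. I had to wrestle with it for a while before I found some ways to deal with it:

Getting the number of columns: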

len(df.columns)
# Here:
# df is your DataFrame.
# df.columns returns an Index holding the column labels of df.
# len() then gets its length.

Getting the number of rows:

len(df.index)  # It's similar.

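...building on Jan-Philip Gehrcke's answer.

The reason why len(df) or len(df.index) is faster than df.shape[0] becomes clear when you look at the code. df.shape is a @property that runs a DataFrame method calling len twice.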

df.shape??
Type:        property
String form: <property object at 0x1127b33c0>
Source:
# df.shape.fget
@property
def shape(self):
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)

And under the hood of len(df):

df.__len__??
Signature: df.__len__()
Source:
    def __len__(self):
        """Returns length of info axis, but here we use the index"""
        return len(self.index)
File:      ~/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py
Type:      instancemethod

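len(df.index) is slightly faster than len(df) because it involves one less function call, but both are faster than df.shape[0].

df.shape returns the shape of the DataFrame as a tuple (number of rows, number of columns).

You can simply access the number of rows or columns with df.shape[0] or df.shape[1] respectively, which is the same as accessing the values of a tuple.

If you want to get the row count in the middle of a chained operation, you can use:

df.pipe(len)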

Example:

row_count = (
    pd.DataFrame(np.random.rand(3,4))
    .reset_index()
    .pipe(len)
)

This can be useful if you do not want to put a long statement inside a len() call.

You could use __len__() instead, but __len__() looks a bit weird.
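Here is a slightly longer chain where the filtering steps actually change the row count. It is only a sketch: the frame raw and its columns group and value are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one row we want to filter out.
raw = pd.DataFrame({
    "group": ["a", "b", "a", "c"],
    "value": [1.0, np.nan, 3.0, 4.0],
})

row_count = (
    raw
    .dropna(subset=["value"])              # drops the row with the NaN value
    .loc[lambda d: d["group"] != "c"]      # drops the row in group "c"
    .pipe(len)                             # row count of whatever is left
)

print(row_count)
```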

For a DataFrame df, a helper that prints the row count with comma separators, handy while exploring data:

def nrow(df):
    print("{:,}".format(df.shape[0]))

Example:

nrow(my_df)
12,456,789
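A possible variant, if you would rather get the formatted string back than print it (for logging, say). The function name format_nrow and the frame big are hypothetical, and this uses an f-string, so it assumes Python 3.6 or later.

```python
import pandas as pd


def format_nrow(frame):
    """Return the row count of frame as a comma-separated string."""
    return f"{len(frame):,}"


# Hypothetical frame with a bit over a million rows.
big = pd.DataFrame({"x": range(1_234_567)})

formatted = format_nrow(big)
print(formatted)
```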