How to drop rows of Pandas DataFrame whose value in certain columns is NaN
I have a DataFrame:
    >>> df
                     STK_ID  EPS  cash
    STK_ID RPT_Date
    601166 20111231  601166  NaN   NaN
    600036 20111231  600036  NaN    12
    600016 20111231  600016  4.3   NaN
    601009 20111231  601009  NaN   NaN
    601939 20111231  601939  2.5   NaN
    000001 20111231  000001  NaN   NaN
I only want the records whose EPS is not NaN, so the result should be:
                     STK_ID  EPS  cash
    STK_ID RPT_Date
    600016 20111231  600016  4.3   NaN
    601939 20111231  601939  2.5   NaN
How do I do that?
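This question is already resolved, but it is also worth considering the solution Wouter suggested in his original comment. The ability to handle missing data, including dropna(), is built into pandas: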
    In [24]: df = pd.DataFrame(np.random.randn(10,3))

    In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

    In [26]: df
    Out[26]:
              0         1         2
    0       NaN       NaN       NaN
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    4       NaN       NaN  0.050742
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    8       NaN       NaN  0.637482
    9 -0.310130  0.078891       NaN
    In [27]: df.dropna()     # drop all rows that have any NaN values
    Out[27]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    5 -1.250970  0.030561 -2.678622
    7  0.049896 -0.308003  0.823295
    In [28]: df.dropna(how='all')     # drop only if ALL columns are NaN
    Out[28]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    4       NaN       NaN  0.050742
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    8       NaN       NaN  0.637482
    9 -0.310130  0.078891       NaN
    In [29]: df.dropna(thresh=2)   # drop row if it does not have at least two values that are not NaN
    Out[29]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    5 -1.250970  0.030561 -2.678622
    7  0.049896 -0.308003  0.823295
    9 -0.310130  0.078891       NaN
    In [30]: df.dropna(subset=[1])   # drop only if NaN in specific column (as asked in the question)
    Out[30]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    9 -0.310130  0.078891       NaN
There are other options as well (see the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.
Pretty handy!
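Because the session above uses random numbers, its output cannot be reproduced exactly. The following is a minimal sketch of the same dropna options on a small hand-written frame; the column names a, b, c and the values are made up for illustration and are not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "c" is entirely NaN, row 1 is entirely NaN.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [np.nan, np.nan, 3.0, 5.0],
        "c": [np.nan, np.nan, np.nan, np.nan],
    }
)

any_nan = df.dropna()                         # drop rows containing any NaN
all_nan = df.dropna(how="all")                # drop rows that are entirely NaN
at_least_two = df.dropna(thresh=2)            # keep rows with >= 2 non-NaN values
by_column = df.dropna(subset=["a"])           # only look at column "a"
no_empty_cols = df.dropna(axis=1, how="all")  # drop columns instead of rows

print(len(any_nan))                 # every row has a NaN in "c"
print(list(all_nan.index))
print(list(at_least_two.index))
print(list(by_column.index))
print(list(no_empty_cols.columns))
```

None of these calls modifies df; each returns a new DataFrame.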
Don't drop; just take the rows where EPS is finite:

    import numpy as np

    df = df[np.isfinite(df['EPS'])]
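I know this has already been answered, but for the sake of a solution to this specific question, as opposed to the general description from Aman (which was wonderful), and in case anyone else happens upon this:

    import pandas as pd

    df = df[pd.notnull(df['EPS'])]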
You can use this:

    df.dropna(subset=['EPS'], how='all', inplace=True)
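One detail worth knowing about the inplace form: it modifies the frame and returns None, so its result should not be assigned back. With a single column in subset, how='all' and the default how='any' also behave identically. A small sketch, using made-up values rather than the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5], "cash": [12.0, np.nan, np.nan]})

# Non-mutating form: returns a new frame and leaves df untouched.
kept = df.dropna(subset=["EPS"])
rows_before = len(df)

# In-place form: mutates df and returns None.
result = df.dropna(subset=["EPS"], how="all", inplace=True)

print(result)           # None
print(list(df.index))   # rows that survived in df itself
```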
Simplest of all solutions:

    filtered_df = df[df['EPS'].notnull()]
The above solution is way better than using np.isfinite()
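Two differences may explain that preference: np.isfinite also rejects infinite values, and it raises on non-numeric data, whereas notnull only tests for missing values. A minimal sketch with made-up Series (the names eps and labels are hypothetical):

```python
import numpy as np
import pandas as pd

eps = pd.Series([4.3, np.nan, np.inf, 2.5])

finite_mask = np.isfinite(eps)   # False for NaN and for +/-inf
notnull_mask = eps.notnull()     # False for NaN only

# An object-dtype column, e.g. strings with a missing entry.
labels = pd.Series(["a", None, "c"], dtype=object)
label_mask = labels.notnull()    # works on non-numeric data

try:
    np.isfinite(labels)
    isfinite_ok = True
except TypeError:
    # np.isfinite is not defined for object arrays.
    isfinite_ok = False

print(list(finite_mask), list(notnull_mask), list(label_mask), isfinite_ok)
```

If infinite values should be dropped as well, np.isfinite is the stricter test; otherwise notnull expresses the intent more directly.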
You could use the DataFrame method notnull, the inverse of isnull, or the inverse of numpy.isnan:
    In [332]: df[df.EPS.notnull()]
    Out[332]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN

    In [334]: df[~df.EPS.isnull()]
    Out[334]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN

    In [347]: df[~np.isnan(df.EPS)]
    Out[347]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN
Yet another solution uses the fact that NaN is not equal to itself:
    In [149]: df.query("EPS == EPS")
    Out[149]:
                     STK_ID  EPS  cash
    STK_ID RPT_Date
    600016 20111231  600016  4.3   NaN
    601939 20111231  601939  2.5   NaN
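The query trick works because, under IEEE 754 floating-point rules, NaN compares unequal to everything, including itself. A short sketch on made-up data, checking that the self-comparison selects the same rows as notnull:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5], "cash": [12.0, np.nan, np.nan]})

nan_equals_itself = np.nan == np.nan   # False

self_equal = df["EPS"] == df["EPS"]    # False wherever EPS is NaN
via_query = df.query("EPS == EPS")
via_notnull = df[df["EPS"].notnull()]

print(nan_equals_itself, list(self_equal), via_query.equals(via_notnull))
```

The notnull form states the intent more plainly; the self-comparison is mainly useful inside query strings.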
You can use dropna.
Example
Drop the rows where at least one element is missing.

    df = df.dropna()
Define in which columns to look for missing values.

    df = df.dropna(subset=['column1', 'column2'])
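When subset lists more than one column, the how argument decides whether a row is dropped when any of those columns is missing (the default) or only when all of them are. A sketch with hypothetical column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "other" is deliberately outside the subset.
df = pd.DataFrame(
    {
        "column1": [1.0, np.nan, np.nan],
        "column2": [5.0, 2.0, np.nan],
        "other": [np.nan, np.nan, 9.0],
    }
)

cols = ["column1", "column2"]
either_missing = df.dropna(subset=cols)            # how="any" is the default
both_missing = df.dropna(subset=cols, how="all")   # drop only if both are NaN

print(list(either_missing.index), list(both_missing.index))
```

NaN values in columns outside subset, such as "other" here, do not affect which rows are kept.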
See this for more examples.
Note: passing a tuple or list to the axis parameter of dropna is deprecated since version 0.23.0; only a single axis is allowed.
Or (check for NaNs with isnull, then use ~ to invert the mask):

    df = df[~df['EPS'].isnull()]
Now:

    print(df)

prints:

                     STK_ID  EPS  cash
    STK_ID RPT_Date
    600016 20111231  600016  4.3   NaN
    601939 20111231  601939  2.5   NaN
This answer is much simpler than all of the answers above.

    df = df[df['EPS'].notnull()]

Simple and easy way.
Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
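You can use "&" here to add additional conditions, e.g.

    df = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

Note that pandas needs the parentheses around each condition when evaluating the statement.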
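The parentheses are needed because Python's & operator binds more tightly than > and <, so without them the expression is parsed as df.EPS > (2.0 & df.EPS) < 4.0. A minimal sketch on made-up values, showing the parenthesized form and confirming that the unparenthesized one raises:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5, 1.0]})

# Each comparison yields a boolean Series; & combines them element-wise.
in_range = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

try:
    # Parsed as: df.EPS > (2.0 & df.EPS) < 4.0
    df[df.EPS > 2.0 & df.EPS < 4.0]
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

print(list(in_range.index), unparenthesized_failed)
```

The NaN row is excluded as well, since both comparisons evaluate to False for NaN.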
For some reason none of the previously submitted answers worked for me. This basic solution did:

    df = df[df.EPS >= 0]
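Though of course that will also drop rows with negative numbers. So if you want to keep those, combine the two comparisons in a single mask (applying df[df.EPS <= 0] as a second step afterwards would leave only the rows equal to zero):

    df = df[(df.EPS >= 0) | (df.EPS <= 0)]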
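Comparison filters like these remove NaN rows because any ordered comparison with NaN evaluates to False. A minimal sketch with made-up values, showing that filtering in two successive steps keeps only the zeros, whereas joining the two comparisons with | keeps every non-NaN row:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"EPS": [np.nan, -1.5, 0.0, 2.5]})

non_negative = df[df.EPS >= 0]                   # drops NaN and negatives
chained = non_negative[non_negative.EPS <= 0]    # second step leaves only 0.0
either_sign = df[(df.EPS >= 0) | (df.EPS <= 0)]  # drops NaN only

print(list(non_negative.index), list(chained.index), list(either_sign.index))
```

The combined mask selects the same rows as df[df.EPS.notnull()], which is the more direct way to write it.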