NumPy or Pandas: Keeping array type as integer while having a NaN value

Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

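In particular, I am converting an in-house data structure to a pandas DataFrame. In our structure, we have integer-type columns that still contain NaN values (but the dtype of the column is int). When we turn this into a DataFrame, everything seems to be recast as float, but we would really like it to stay int.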

Thoughts?

Things tried:

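I tried the from_records() function under pandas.DataFrame with coerce_float=False, but this did not help. I also tried NumPy masked arrays with a NaN fill_value, which did not work either. In every case the column data type became float.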
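The behaviour described in the question can be reproduced with a minimal sketch like the one below. The record values and the column names `count` and `label` are made up for illustration; they are not from the original data structure.

```python
import numpy as np
import pandas as pd

# A plain NumPy integer array cannot hold NaN at all.
arr = np.array([1, 2, 3], dtype=np.int64)
try:
    arr[0] = np.nan
    assignment_failed = False
except ValueError:
    # NumPy refuses to convert float NaN to an integer.
    assignment_failed = True

# Hypothetical records: the second row is missing its integer value.
records = [(1, "a"), (np.nan, "b"), (3, "c")]
df = pd.DataFrame.from_records(
    records, columns=["count", "label"], coerce_float=False
)

# The integer column is upcast to float so that it can store NaN.
print(df["count"].dtype)
```

Note that coerce_float controls the conversion of non-float objects such as decimal.Decimal, so it has no bearing on the NaN upcast.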

Related discussion

If performance is not the main concern, you can store strings instead.

df.col = df.col.dropna().apply(lambda x: str(int(x)))

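Then you can mix in NaN as much as you like. If you really want integers, then depending on your application you can use -1, 0, 1234567890, or some other dedicated value to represent NaN.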

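You can also temporarily duplicate the column: one copy as you have it now, with floats; the other experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two stay in sync. After enough testing you can let go of the floats.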
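A rough sketch of that duplicate-column idea follows. The column names `pos` and `pos_str` and the helper `check_in_sync` are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# Float column as pandas produced it (NaN forced the upcast).
df = pd.DataFrame({"pos": [10.0, np.nan, 30.0]})

# Experimental shadow column: integers rendered as strings, NaN left in place.
# Assignment aligns on the index, so the dropped row stays missing.
df["pos_str"] = df["pos"].dropna().apply(lambda x: str(int(x)))


def check_in_sync(frame):
    """Assert that the float column and its string shadow agree row by row."""
    # A value must be missing in both columns or in neither.
    assert (frame["pos"].isna() == frame["pos_str"].isna()).all()
    present = frame["pos"].notna()
    # Where present, the string must parse back to the same number.
    parsed = frame.loc[present, "pos_str"].astype(float)
    assert (parsed == frame.loc[present, "pos"]).all()


check_in_sync(df)
```

Calling `check_in_sync` after each transformation step is one way to gain confidence before dropping the float column.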

This is not a solution for all cases, but in mine (genomic coordinates) I resorted to using 0 as NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper "native" column type to be used, so operations like subtraction and comparison work as expected.
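For illustration, here is a self-contained version of the sentinel approach. The sample values and the name `SENTINEL` are assumptions; the sketch only works if the sentinel can never occur as a real value in the column.

```python
import numpy as np
import pandas as pd

SENTINEL = 0  # assumed never to occur as a real value in this column

# Made-up coordinates with one missing entry.
a3 = pd.DataFrame({"MapInfo": [1500.0, np.nan, 2750.0]})
a3["MapInfo"] = a3["MapInfo"].fillna(SENTINEL).astype(int)

# Integer arithmetic and comparisons now work on a native integer column.
shifted = a3["MapInfo"] - 500

# Rows that were originally missing are found by testing for the sentinel.
was_missing = a3["MapInfo"] == SENTINEL
```

The sentinel takes part in arithmetic like any other number (the missing row becomes -500 in `shifted`), so such rows usually need to be masked out with `was_missing` before results are interpreted.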

Pandas v0.24+

Support for NaN in integer series is available from v0.24 onwards. There is information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
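As a minimal sketch of the nullable integer type, assuming pandas 0.24 or later: the dtype is requested with the string "Int64" (capital I), which is distinct from NumPy's "int64". The column name `count` is made up for the example.

```python
import numpy as np
import pandas as pd

# Build a nullable integer Series directly; NaN becomes a missing value.
s = pd.Series([1, 2, np.nan], dtype="Int64")
print(s.dtype)

# Convert an existing float column whose values are all whole numbers or NaN.
df = pd.DataFrame({"count": [1.0, np.nan, 3.0]})
df["count"] = df["count"].astype("Int64")

# Missing values are skipped in reductions, as with float columns.
total = df["count"].sum()
```

The values stay integers and the missing entry is still reported by `isna()`, which addresses the original question without sentinels or strings.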

Pandas v0.23 and earlier

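In general, it is best to work with float series where possible, even when the series has been upcast from int to float because it contains NaN values. This keeps calculations vectorised in NumPy; otherwise they fall back to Python-level loops.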

The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])

print(s.astype(object))

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs (as of v0.23) explain why integer series are upcast to float:

In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.

This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be "numeric".

The docs also give the rules for upcasting when NaN values are introduced:

Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
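The promotion rules in the table can be observed with a short sketch such as the following. It uses reindexing to a non-existent label as one convenient way to introduce a missing value; the sample data is made up, and the default NumPy-backed dtypes are assumed (not the nullable extension types).

```python
import pandas as pd

# One small Series per typeclass from the table above.
examples = {
    "floating": pd.Series([1.5, 2.5]),
    "object": pd.Series(["a", "b"], dtype=object),
    "integer": pd.Series([1, 2]),
    "boolean": pd.Series([True, False]),
}

# Label 2 does not exist, so reindexing inserts a missing value there
# and pandas must pick a dtype that is able to store it.
promoted = {name: s.reindex([0, 1, 2]).dtype for name, s in examples.items()}

for name, dtype in promoted.items():
    print(f"{name:10s} {examples[name].dtype} -> {dtype}")
```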

This is now possible, as of pandas v0.24.0.

Quoting the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."