Convert pandas dataframe to NumPy array
I would like to know how to convert a pandas DataFrame into a NumPy array.

DataFrame:

    import numpy as np
    import pandas as pd

    index = [1, 2, 3, 4, 5, 6, 7]
    a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
    b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
    c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
    df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
    df = df.rename_axis('ID')

gives

    label   A    B    C
    ID
    1     NaN  0.2  NaN
    2     NaN  NaN  0.5
    3     NaN  0.2  0.5
    4     0.1  0.2  NaN
    5     0.1  0.2  0.5
    6     0.1  NaN  0.5
    7     0.1  NaN  NaN

I would like to convert this to a NumPy array, like so:

    array([[ nan,  0.2,  nan],
           [ nan,  nan,  0.5],
           [ nan,  0.2,  0.5],
           [ 0.1,  0.2,  nan],
           [ 0.1,  0.2,  0.5],
           [ 0.1,  nan,  0.5],
           [ 0.1,  nan,  nan]])

How can I do this?
As a bonus, is it possible to preserve the dtypes, like this?

    array([[ 1, nan,  0.2,  nan],
           [ 2, nan,  nan,  0.5],
           [ 3, nan,  0.2,  0.5],
           [ 4, 0.1,  0.2,  nan],
           [ 5, 0.1,  0.2,  0.5],
           [ 6, 0.1,  nan,  0.5],
           [ 7, 0.1,  nan,  nan]],
          dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

or similar?

Any suggestions on how to accomplish this?
To convert a pandas DataFrame (df) to a NumPy ndarray, use this code:

    df.values

    array([[nan, 0.2, nan],
           [nan, nan, 0.5],
           [nan, 0.2, 0.5],
           [0.1, 0.2, nan],
           [0.1, 0.2, 0.5],
           [0.1, nan, 0.5],
           [0.1, nan, nan]])
Note: the .as_matrix() method used in this answer is deprecated. pandas warns:

    Method .as_matrix will be removed in a future version. Use .values instead.
Pandas has something built in for this:

    numpy_matrix = df.as_matrix()

gives

    array([[nan, 0.2, nan],
           [nan, nan, 0.5],
           [nan, 0.2, 0.5],
           [0.1, 0.2, nan],
           [0.1, 0.2, 0.5],
           [0.1, nan, 0.5],
           [0.1, nan, nan]])
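On current pandas releases as_matrix() is gone (as far as I know it was removed in pandas 1.0), so the call above fails there. A minimal sketch of the equivalent calls, assuming a cut-down version of the question's frame; selecting columns first stands in for the old columns= argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan], "C": [np.nan, 0.5]},
    index=pd.Index([1, 2], name="ID"),
)

# Stand-in for df.as_matrix(): every column as one 2-D float array.
all_values = df.to_numpy()

# Stand-in for df.as_matrix(columns=["A", "C"]): select first, then convert.
subset = df[["A", "C"]].to_numpy()

print(all_values.shape)  # (2, 3)
print(subset.shape)      # (2, 2)
```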
I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the NumPy representation of the DataFrame, including the index:

    In [8]: df
    Out[8]:
              A         B         C
    0 -0.982726  0.150726  0.691625
    1  0.617297 -0.471879  0.505547
    2  0.417123 -1.356803 -1.013499
    3 -0.166363 -0.957758  1.178659
    4 -0.164103  0.074516 -0.674325
    5 -0.340169 -0.293698  1.231791
    6 -1.062825  0.556273  1.508058
    7  0.959610  0.247539  0.091333

    [8 rows x 3 columns]

    In [9]: df.reset_index().values
    Out[9]:
    array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
           [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
           [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
           [ 3.        , -0.16636303, -0.95775849,  1.17865945],
           [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
           [ 5.        , -0.34016865, -0.29369841,  1.23179064],
           [ 6.        , -1.06282542,  0.55627285,  1.50805754],
           [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])
To get the dtypes, we need to transform this ndarray into a structured array using view:

    In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
    Out[10]:
    array([( 0, -0.98272574,  0.150726  ,  0.69162512),
           ( 1,  0.61729734, -0.47187926,  0.50554728),
           ( 2,  0.4171228 , -1.35680324, -1.01349922),
           ( 3, -0.16636303, -0.95775849,  1.17865945),
           ( 4, -0.16410334,  0.0745164 , -0.67432474),
           ( 5, -0.34016865, -0.29369841,  1.23179064),
           ( 6, -1.06282542,  0.55627285,  1.50805754),
           ( 7,  0.95961001,  0.24753911,  0.09133339)],
          dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Caveat: view reinterprets the underlying bytes without converting any values. Because df.reset_index().values is a float64 array here, the integer 'index' field is only trustworthy if you check it; nonzero floats do not read back as the same integers.
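To see why the view step needs care, here is a small sketch on a hypothetical two-row frame. It compares the view route with to_records(), which builds each field from its own column. The explicit '<i8' and '<f8' field types are my choice for the illustration, not something the answer specifies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.5, 1.5], "B": [2.5, 3.5]}, index=[1, 2])

flat = df.reset_index().to_numpy()  # ints and floats are upcast to float64
print(flat.dtype)                   # float64

# view() keeps the bytes and only changes how they are interpreted, so the
# float 1.0 in the first column does not read back as the integer 1.
viewed = flat.ravel().view([("index", "<i8"), ("A", "<f8"), ("B", "<f8")])
print(viewed["index"])              # bit-pattern integers, not [1 2]

# to_records() converts column by column, so the index stays an integer.
rec = df.to_records()
print(rec["index"])                 # [1 2]
```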
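Deprecate your usage of values and as_matrix().

From v0.24.0 onwards, pandas offers two new, preferred ways of obtaining NumPy arrays from pandas objects: the to_numpy() method and the array attribute.

If you look up .values in the v0.24 documentation, you will find a warning that says:

    Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer, for more information.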
Towards better consistency: in the spirit of better consistency throughout the API, a new method, to_numpy(), has been introduced to extract the underlying NumPy array from a DataFrame.

    # Setup.
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
    df

       A  B
    a  1  4
    b  2  5
    c  3  6

    df.to_numpy()
    array([[1, 4],
           [2, 5],
           [3, 6]])

This method is also defined on Index and Series objects:

    df.index.to_numpy()
    # array(['a', 'b', 'c'], dtype=object)

    df['A'].to_numpy()
    # array([1, 2, 3])
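One thing worth knowing before relying on to_numpy(): a plain ndarray has a single dtype, so columns of different types are reconciled to a common one. A small sketch with made-up frames (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

ints = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
mixed = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})
with_text = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

print(ints.to_numpy().dtype)       # an integer dtype, nothing to reconcile
print(mixed.to_numpy().dtype)      # float64: the ints are upcast to floats
print(with_text.to_numpy().dtype)  # object: no common numeric type exists

# A target dtype can also be requested explicitly.
print(ints.to_numpy(dtype="float32").dtype)  # float32
```

This is why the question's "bonus" (an integer ID next to float columns) needs a record or structured array rather than a plain 2-D array.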
By default, a view is returned, so any modifications made will affect the original.

    v = df.to_numpy()
    v[0, 0] = -1

    df
       A  B
    a -1  4
    b  2  5
    c  3  6

If you need a copy instead, use to_numpy(copy=True):

    v = df.to_numpy(copy=True)
    v[0, 0] = -123

    df
       A  B
    a  1  4
    b  2  5
    c  3  6
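Whether the default call really hands back a view depends on the frame's dtypes and, as far as I can tell, on the pandas version, so it may be safer to inspect it than to assume. A minimal sketch using the same small frame; np.shares_memory is NumPy's own check, and nothing here is specific to the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

maybe_view = df.to_numpy()
independent = df.to_numpy(copy=True)

# Reports whether the default result shares memory with column A's data.
print(np.shares_memory(maybe_view, df["A"].to_numpy()))

# A copy is always safe to modify; the frame is left untouched.
independent[0, 0] = -123
print(df.loc["a", "A"])  # 1
```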
If you need to preserve the dtypes, use DataFrame.to_records():

    df.to_records()
    # rec.array([('a', -1, 4), ('b', 2, 5), ('c', 3, 6)],
    #           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8')])

Unfortunately, this cannot be done with to_numpy(). As an alternative, you can use np.rec.fromrecords:

    v = df.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())
    # rec.array([('a', -1, 4), ('b', 2, 5), ('c', 3, 6)],
    #           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8')])

Performance-wise, the two are nearly the same (in the timings below, np.rec.fromrecords is actually a little faster).

    df2 = pd.concat([df] * 10000)

    %timeit df2.to_records()
    %%timeit
    v = df2.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())

    11.1 ms ± 557 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    9.67 ms ± 126 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rationale for adding a new method

to_numpy() was added as a result of discussions under two GitHub issues, GH19954 and GH23623.

Specifically, the docs mention the rationale:

    [...] with .values it was unclear whether the returned value would be the
    actual array, some transformation of it, or one of pandas custom
    arrays (like Categorical). For example, with PeriodIndex, .values
    generates a new ndarray of period objects each time. [...]
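The ambiguity described in that quote is easiest to see on a categorical Series. A small sketch (the Series contents are made up): the array attribute returns whatever pandas stores internally, while to_numpy() always returns a NumPy ndarray.

```python
import numpy as np
import pandas as pd

s = pd.Series(["low", "high", "low"], dtype="category")

backing = s.array        # the pandas extension array that stores the data
as_numpy = s.to_numpy()  # always a NumPy ndarray

print(type(backing).__name__)   # Categorical
print(type(as_numpy).__name__)  # ndarray
```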
As mentioned earlier, .values behaves inconsistently, and as_matrix() is now deprecated.
You can use the to_records() method, but you may have to adjust the dtypes if they are not what you want from the start. In my case the index ended up with object dtype:

    In [102]: df
    Out[102]:
    label    A    B    C
    ID
    1      NaN  0.2  NaN
    2      NaN  NaN  0.5
    3      NaN  0.2  0.5
    4      0.1  0.2  NaN
    5      0.1  0.2  0.5
    6      0.1  NaN  0.5
    7      0.1  NaN  NaN

    In [103]: df.index.dtype
    Out[103]: dtype('object')

    In [104]: df.to_records()
    Out[104]:
    rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

    In [106]: df.to_records().dtype
    Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Converting the recarray dtype did not work for me, but the conversion can be done in pandas first:

    In [109]: df.index = df.index.astype('i8')

    In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
    Out[111]:
    rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Note that pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so the view above also corrects the field name.

At the time of writing, pandas only had 8-byte integers (i8).
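Newer pandas releases (0.24 and later, as far as I can tell) let to_records() take index_dtypes and column_dtypes arguments, which may get closer to the '<i4' ID field the question asks for. A minimal sketch on a cut-down version of the question's frame, not a claim about what this answer's author had available:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan], "C": [np.nan, 0.5]},
    index=pd.Index([1, 2], name="ID"),
)

# index_dtypes sets the field type of the index; columns keep their own dtypes.
rec = df.to_records(index_dtypes="<i4")

print(rec.dtype.names)  # ('ID', 'A', 'B', 'C')
print(rec.dtype["ID"])  # int32
print(rec.dtype["A"])   # float64
```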
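It seems that df.to_records() will work for you. The exact feature you are looking for has been requested before.

I tried this locally using your example, and that call yields something very similar to the output you were looking for:

    rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

Note that this is a recarray rather than a plain array.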
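If a plain structured ndarray is preferred over a recarray, one option is np.asarray, which returns the same data as the base ndarray class. A short sketch on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2]}, index=pd.Index([1, 2], name="ID"))

rec = df.to_records()
print(type(rec).__name__)    # recarray
print(rec.A)                 # fields are reachable as attributes

plain = np.asarray(rec)      # same fields, base ndarray class
print(type(plain).__name__)  # ndarray
print(plain["A"])            # fields are reachable by name only
```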
Here is my approach to making a structured array from a pandas DataFrame.

Create the data frame

    import pandas as pd
    import numpy as np
    import six

    NaN = float('nan')
    ID = [1, 2, 3, 4, 5, 6, 7]
    A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
    B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
    C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
    columns = {'A':A, 'B':B, 'C':C}
    df = pd.DataFrame(columns, index=ID)
    df.index.name = 'ID'
    print(df)

          A    B    C
    ID
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN
Define a function that makes a NumPy structured array (not a record array) from a pandas DataFrame.

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        This is functionally equivalent to but more efficient than
        np.array(df.to_records(index=False))

        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """

        v = df.values
        cols = df.columns

        if six.PY2:  # python 2 needs .encode() but 3 does not
            types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
        else:
            types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
        dtype = np.dtype(types)
        z = np.zeros(v.shape[0], dtype)
        # copy column i of the 2-D values array into the matching named field
        for (i, k) in enumerate(z.dtype.names):
            z[k] = v[:, i]
        return z
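Use reset_index() to make a new data frame that includes the index as part of its data, then convert that data frame to a structured array:

    sa = df_to_sarray(df.reset_index())
    sa

    array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
           (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
           (7L, 0.1, nan, nan)],
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

EDIT: Updated df_to_sarray to avoid an error when calling .encode() with Python 3. Thanks to Joseph Garvin and Halcyon for their comment and solution.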
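For Python 3 only code, the same idea can be written without six. The sketch below is a hypothetical variant (df_to_sarray_py3 is my name, not the author's) that fills the structured array one column at a time, so no column passes through a shared 2-D values array. It assumes every column has a NumPy-native dtype:

```python
import numpy as np
import pandas as pd


def df_to_sarray_py3(df):
    """Hypothetical Python 3 variant: build the structured array column by column."""
    # One (name, dtype) pair per column; this fails for pandas extension dtypes.
    dtype = np.dtype([(str(name), df[name].dtype) for name in df.columns])
    out = np.empty(len(df), dtype=dtype)
    for name in df.columns:
        # Copy each column on its own so it is never upcast to a common dtype.
        out[str(name)] = df[name].to_numpy()
    return out


df = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan]},
    index=pd.Index([1, 2], name="ID"),
)
sa = df_to_sarray_py3(df.reset_index())
print(sa.dtype.names)  # ('ID', 'A', 'B')
print(sa["ID"])        # [1 2]
```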
A simpler way, for an example DataFrame:

    df

             gbm       nnet        reg
    0  12.097439  12.047437  12.100953
    1  12.109811  12.070209  12.095288
    2  11.720734  11.622139  11.740523
    3  11.824557  11.926414  11.926527
    4  11.800868  11.727730  11.729737
    5  12.490984  12.502440  12.530894

Use:

    np.array(df.to_records().view(type=np.matrix))

Get:

    array([[(0, 12.097439  , 12.047437, 12.10095324),
            (1, 12.10981081, 12.070209, 12.09528824),
            (2, 11.72073428, 11.622139, 11.74052253),
            (3, 11.82455653, 11.926414, 11.92652727),
            (4, 11.80086775, 11.72773 , 11.72973699),
            (5, 12.49098389, 12.50244 , 12.53089367)]],
          dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
                                ('reg', '<f8')]))
Two ways to convert the DataFrame to its NumPy array representation:

    mah_np_array = df.as_matrix(columns=None)
    mah_np_array = df.values

Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.dataframe.as_matrix.html
I just had a similar problem when exporting from a DataFrame to an ArcGIS table and stumbled on a solution from USGS (https://my.usgs.gov/confluence/display/cdi/pandas.dataframe+to+arcgis+table). In short, your problem has a similar solution:

    df

          A    B    C
    ID
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN

    np_data = np.array(np.rec.fromrecords(df.values))
    np_names = df.dtypes.index.tolist()
    # field names must be str on Python 3 (Python 2 needed name.encode('UTF8') here)
    np_data.dtype.names = tuple(str(name) for name in np_names)

    np_data

    array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
           ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
           ( 0.1,  nan,  nan)],
          dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))
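A slightly shorter route to the same result is to hand the field names to np.rec.fromrecords directly, which avoids renaming the fields afterwards. A sketch on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 0.1], "B": [0.2, np.nan]})

# Passing names up front means the fields are labelled as they are created.
rec = np.rec.fromrecords(df.to_numpy(), names=list(df.columns))
plain = np.asarray(rec)

print(plain.dtype.names)  # ('A', 'B')
print(plain["B"])
```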
Further to Meteore's answer, I found that the code

    df.index = df.index.astype('i8')

doesn't work for me. So I am putting my code here for the convenience of others stuck with this issue.

    # text_filepath is the path to the CSV file
    city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
    # the field 'city_en' is a string; when converted to a NumPy array, it will be an object
    city_cluster_arr = city_cluster_df[['city_en', 'lat', 'lon', 'cluster', 'cluster_filtered']].to_records()
    descr = city_cluster_arr.dtype.descr
    # change the field 'city_en' to string type (its position is 1 because
    # the DataFrame's row index comes first)
    descr[1] = (descr[1][0], "S20")
    newArr = city_cluster_arr.astype(np.dtype(descr))
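Another way to get a fixed-width text field is to convert the column before assembling the record array, rather than patching the dtype description afterwards. The sketch below uses made-up data in place of the CSV, and the 'U20' width is an arbitrary choice for the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV file used above.
cities = pd.DataFrame(
    {"city_en": ["Paris", "Oslo"], "lat": [48.86, 59.91], "lon": [2.35, 10.75]}
)

# Convert the text column to a fixed-width unicode dtype first.
names = np.asarray(cities["city_en"].to_numpy(), dtype="U20")
rec = np.rec.fromarrays(
    [names, cities["lat"].to_numpy(), cities["lon"].to_numpy()],
    names=["city_en", "lat", "lon"],
)

print(rec.dtype)
print(rec.city_en)
```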