Python Pandas - Changing some column types to categories

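I have fed the following CSV file into an IPython Notebook:

public = pd.read_csv("categories.csv")
public

I have also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types exist (the below is a summary; there are about 100 columns):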

In [36]: public.dtypes
Out[37]:
parks          object
playgrounds    object
sports         object
roading        object
resident        int64
children        int64

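I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories. They hold Likert scale responses, and each column uses a different set of responses (e.g. one has "strongly agree", "agree", etc., another has "very important", "important", etc.). The remaining columns should stay as int64.

I was able to create a separate dataframe, public1, and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')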

However, when I tried to change several columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

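Notwithstanding this, I don't want to create a separate dataframe containing only the category columns. I would like them changed in the original dataframe.

I tried numerous ways to achieve this, then tried the code here: Pandas: change data type of columns…

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

NotImplementedError: > 1 ndim Categorical are not supported at this time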

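Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert scale responses can then be analysed), while leaving 'resident' and 'children' (and the 94 other columns that are string, int and float) untouched? Or is there a better way to do this? If anyone has any suggestions and/or feedback I would be most grateful... I am slowly going bald ripping my hair out!

Many thanks in advance.

Edited to add - I am using Python 2.7.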

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
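
As a minimal, self-contained sketch of the loop above: the tiny DataFrame below is made-up stand-in data (not the asker's CSV), and the names `likert_cols`, `looped` and `mapped` are only illustrative. It converts just the listed columns and leaves the integer column alone. In pandas 0.19.0 and later, `DataFrame.astype` also accepts a column-to-dtype dict, shown here as an alternative to the loop.

```python
import pandas as pd

# Made-up stand-in for the asker's data (hypothetical values)
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree'],
    'playgrounds': ['agree', 'agree', 'disagree'],
    'resident': [1, 2, 3],
})

likert_cols = ['parks', 'playgrounds']

# Option 1: the for-loop from the answer above
looped = public.copy()
for col in likert_cols:
    looped[col] = looped[col].astype('category')

# Option 2: pass a {column: dtype} dict to DataFrame.astype
mapped = public.astype({col: 'category' for col in likert_cols})

print(looped.dtypes)
print(mapped.dtypes)
```

Both variants leave `resident` as an integer column; only the columns named in `likert_cols` change dtype.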

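As of pandas 0.19.0, the What's New notes describe that read_csv supports parsing Categorical columns directly. This answer applies only if you are starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records: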

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N = 10000
categories = pd.DataFrame({
    'parks': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'playgrounds': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'sports': np.random.choice(['important', 'very important', 'not important'], size=N),
    'roading': np.random.choice(['important', 'very important', 'not important'], size=N),
    'resident': np.random.choice([1, 2, 3], size=N),
    'children': np.random.choice([0, 1, 2, 3], size=N)
})
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children int64
parks object
playgrounds object
resident int64
roading object
sports object
dtype: object

>=0.19.0

For mixed dtypes, parsing as Categorical can be done by passing a dictionary dtype={'colname': 'category', ...} to read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes

children int64
parks category
playgrounds category
resident int64
roading category
sports category
dtype: object
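
Since the goal is to analyse Likert responses, it may also help to declare the scale order explicitly. The following is a sketch under stated assumptions: the CSV text is made up, the ordering of the responses is my assumption about the scale, and passing a `CategoricalDtype` instance in the `dtype` dict requires pandas 0.21 or later (newer than the 0.19.0 feature described above).

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up CSV text standing in for categories.csv (hypothetical values)
csv_text = """parks,sports,resident
agree,important,1
strongly agree,very important,2
disagree,not important,3
"""

# Explicit, ordered Likert scale from lowest to highest (assumed ordering)
agreement = CategoricalDtype(
    categories=['disagree', 'agree', 'strongly agree'], ordered=True)

df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'parks': agreement, 'sports': 'category'})

print(df.dtypes)
print(df['parks'].min())       # lowest response present on the ordered scale
print(df['parks'] >= 'agree')  # comparisons follow the declared order
```

With an ordered dtype, `min`, `max`, sorting and comparisons respect the scale rather than alphabetical order. Values in the file that are not listed in `categories` are read as missing.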

Performance

A slight speed-up (local Jupyter notebook), as mentioned in the release notes.

# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')

10 loops, best of 3: 20.1 ms per loop

# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)

100 loops, best of 3: 14.3 ms per loop
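
For readers outside Jupyter, where the `%%timeit` cell magic is unavailable, a rough equivalent can be sketched with the standard-library `timeit` module. The file name, the data and the helper names `convert_after_read` and `convert_during_read` are made up for illustration, and the timings will differ by machine and pandas version, so no expected numbers are given.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

cols = ['parks', 'playgrounds']

# Write a small made-up CSV to a temporary directory
rng = np.random.RandomState(0)
data = pd.DataFrame({c: rng.choice(['agree', 'disagree'], size=1000) for c in cols})
data['resident'] = rng.choice([1, 2, 3], size=1000)
path = os.path.join(tempfile.mkdtemp(), 'categories_small.csv')
data.to_csv(path, index=False)


def convert_after_read():
    """Read with default dtypes, then convert column by column."""
    df = pd.read_csv(path)
    for col in cols:
        df[col] = df[col].astype('category')
    return df


def convert_during_read():
    """Let read_csv parse the columns as categories directly."""
    return pd.read_csv(path, dtype={col: 'category' for col in cols})


# Total seconds for 20 runs of each approach
t_after = timeit.timeit(convert_after_read, number=20)
t_during = timeit.timeit(convert_during_read, number=20)
print('astype after read: %.4f s' % t_after)
print('dtype in read_csv: %.4f s' % t_during)
```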

Jupyter Notebook

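In my case, I had a big dataframe with many object columns that I wanted to convert to categories.

So what I did was select the object columns, fill any missing values with a placeholder, and then save the result back into the original dataframe, as in: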

# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()

I hope this is a helpful reference.

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2']:  # ...and so on for the other columns
    dataframe_name[col] = dataframe_name[col].astype(float)
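
A compact, runnable sketch of the same idea on made-up data follows. The columns are created with an explicit `dtype=object` (an assumption made here so that `select_dtypes(include=['object'])` picks them up regardless of how a given pandas version stores strings by default), and `obj_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in the text columns (hypothetical values)
df = pd.DataFrame({
    'parks': pd.Series(['agree', np.nan, 'disagree'], dtype=object),
    'sports': pd.Series(['important', 'important', np.nan], dtype=object),
    'resident': [1, 2, 3],
})

# Names of all object-typed columns
obj_cols = df.select_dtypes(include=['object']).columns

# Fill gaps with a placeholder label, then convert each column in place
for col in obj_cols:
    df[col] = df[col].fillna('Missing').astype('category')

print(df.dtypes)
print(df['parks'].cat.categories)
```

Filling before converting makes 'Missing' an ordinary category, so it shows up in counts and group-bys instead of being dropped as NaN.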