Label encoding across multiple columns in scikit-learn
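I am trying to use scikit-learn's LabelEncoder to encode an entire pandas DataFrame of string labels at once. Passing the whole DataFrame to LabelEncoder raises the error below.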
```python
import pandas
from sklearn import preprocessing

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York']
})

le = preprocessing.LabelEncoder()
le.fit(df)
```
```
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
```
Any thoughts on how to get around this problem?
You can easily do this:
```python
from sklearn.preprocessing import LabelEncoder

df.apply(LabelEncoder().fit_transform)
```
EDIT2:
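In scikit-learn 0.20, the recommended way is
```python
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(df)
```
because OneHotEncoder now supports string input.
A ColumnTransformer can apply OneHotEncoder to only certain columns.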
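As a minimal sketch of that last point, the snippet below one-hot encodes a single string column and passes a numeric column through unchanged. The toy frame, the `age` column and the step name `'onehot'` are assumptions made for illustration and are not part of the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed): one string column and one numeric column.
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat'], 'age': [3, 5, 2]})

# Apply OneHotEncoder to 'pets' only; every other column is passed through.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['pets'])],
    remainder='passthrough',
)
result = ct.fit_transform(df)

# Two one-hot columns (cat, dog) plus the untouched 'age' column.
print(result.shape)
```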
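EDIT:
Since this answer is over a year old and has collected many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to do a little bit of a hack.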
```python
from collections import defaultdict

d = defaultdict(LabelEncoder)
```
With this, you now retain every column's LabelEncoder in a dictionary:
```python
# Encode each column with its own LabelEncoder
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Invert the encoding
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Use the stored encoders to label future data
df.apply(lambda x: d[x.name].transform(x))
```
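To make the round trip concrete, here is a self-contained sketch of the same defaultdict idea on a cut-down version of the question's data. The names `encoders`, `encoded`, `decoded` and `new_rows` are mine, and the new rows are assumed to contain only labels that were seen during fitting.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
})

# One LabelEncoder per column, created lazily on first access.
encoders = defaultdict(LabelEncoder)

# Fit and encode every column, then map the codes back to the labels.
encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Encode new rows with the already fitted encoders (no refitting).
new_rows = pd.DataFrame({'pets': ['monkey', 'cat'], 'owner': ['Ron', 'Brick']})
new_encoded = new_rows.apply(lambda col: encoders[col.name].transform(col))
```

Because each encoder is stored under its column name, the same dictionary serves both transform and inverse_transform later on.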
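As mentioned by larsmans, LabelEncoder() only takes a 1-D array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing and returns a transformed dataframe. My code is based in part on an excellent blog post by Zac Stewart.
Creating a custom encoder simply involves creating a class that responds to the fit(), transform(), and fit_transform() methods.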
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4]
})

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # items() replaces iteritems(), which was removed in pandas 2.0
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```
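Suppose we want to encode our two categorical attributes (fruit and color) while leaving the numeric attribute weight alone. We could do this as follows:
```python
MultiColumnLabelEncoder(columns=['fruit', 'color']).fit_transform(fruit_data)
```
This transforms fruit_data so that fruit and color hold integer codes while weight is left unchanged.
Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter results in every column being encoded:
```python
MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight', axis=1))
```
This encodes both remaining columns, fruit and color.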
Note that it will probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).
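One possible way to handle that, sketched under the assumption that "categorical" simply means "not numeric", is to pick the columns by dtype before encoding. The variable names here are my own and this is not part of the class above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# Assumption: every non-numeric column should be label-encoded.
categorical_columns = fruit_data.select_dtypes(exclude='number').columns

encoded = fruit_data.copy()
for column in categorical_columns:
    encoded[column] = LabelEncoder().fit_transform(encoded[column])

# 'weight' is numeric, so it is left exactly as it was.
print(encoded)
```

The resulting column list could equally be passed as the columns argument of the custom encoder.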
Another nice feature is that we can use this custom transformer in a pipeline:
```python
encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder(columns=['fruit', 'color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)
```
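We don't need a LabelEncoder.
You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.
```python
>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1
```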
To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:
```python
>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
...  for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
```
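If the same codes have to be applied to data that arrives later, one option (a sketch, not something the answer itself shows) is to keep the categories learned from the first frame and pass them to pd.Categorical. The frames `train` and `new` are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey']})
new = pd.DataFrame({'pets': ['monkey', 'dog', 'hamster']})

# Remember the categories learned from the first dataframe.
categories = {col: train[col].astype('category').cat.categories for col in train}

# Re-use them on new data; values that were never seen get the code -1.
new_codes = pd.DataFrame(
    {col: pd.Categorical(new[col], categories=categories[col]).codes for col in new},
    index=new.index,
)
print(new_codes)
```

Unlike LabelEncoder.transform, this does not raise on unseen values, so the -1 codes need to be checked for explicitly.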
Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder.
If you only have categorical variables, use OneHotEncoder directly:
```python
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown='ignore').fit_transform(df)
```
If you have heterogeneously typed features:
```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weight', 'height']
# Current scikit-learn expects (transformer, columns) tuples;
# the first 0.20 release used the opposite order, (columns, transformer).
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    (RobustScaler(), numerical_columns)
)
column_trans.fit_transform(df)
```
More options are in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
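Assuming you simply want a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:
```python
le.fit(df.columns)
```
In the above code you will have a unique number corresponding to each column.
More precisely, you will have a 1:1 mapping from df.columns to the values returned by:
```python
le.transform(df.columns.to_numpy())
```
Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels, you can do the following:
```python
le.fit([y for x in df.to_numpy() for y in x])
```
In this case, you most likely have non-unique row labels (as shown in your question). To see which classes the encoder created, inspect le.classes_. To encode a single label, for example the value in the first row of the first column, you can do:
```python
le.transform([df.at[0, df.columns[0]]])
```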
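The question you had in your comment is a bit more complicated, but it can still be accomplished:
```python
le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
```
The above code builds the set of all unique (column, value) pairs, converts each pair to the string representation of its tuple, and fits the LabelEncoder on those strings.
Using this new model is a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
```python
le.transform([str((df.columns[0], df.at[0, df.columns[0]]))])
```
Remember that each lookup is now a string representation of a tuple that contains the (column, row).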
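This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).
However, for the purpose of a few classification tasks, you could use
```python
pandas.get_dummies(input_df)
```
This takes a dataframe with categorical data and returns a dataframe with binary values. The variable values are encoded into the column names of the resulting dataframe.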
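A small sketch of what that looks like; `input_df` below is a made-up two-column frame. Each resulting column is named after its source column and value, joined by an underscore.

```python
import pandas as pd

input_df = pd.DataFrame({
    'pets': ['cat', 'dog', 'monkey'],
    'location': ['San_Diego', 'New_York', 'New_York'],
})

# One indicator column per (column, value) pair.
dummies = pd.get_dummies(input_df)
print(list(dummies.columns))
```

Note that this is one-hot encoding rather than label encoding, so the number of columns grows with the number of distinct values.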
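This is a year and a half after the fact, but I also needed to be able to transform multiple pandas dataframe columns at once (and to invert the transformation as well). The class below does this: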
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.
    """

    def __init__(self, columns=None):
        # `columns` must be a pandas Index or NumPy array (it needs `.shape`)
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container must exist in this branch too
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing
        `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`

        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column; assigning the whole
                # column (not `.loc[:, column]`) lets its dtype become integer
                labels = le.fit_transform(dframe.loc[:, column].values)
                dframe[column] = labels
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
                # store the encoded labels (the original stored `le` here)
                self.all_labels_[idx] = labels
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe[column] = le.fit_transform(
                    dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return dframe

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.

        Note: modifies `dframe` in place.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe[column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe[column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.

        Note: modifies `dframe` in place.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe[column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe[column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe
```
Example:
If df and df_copy are pandas dataframes with some object-dtype columns, you can apply MultiColumnLabelEncoder() to those columns in the following way:
```python
# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

# returns output like below
# array([[1, 0, 0, ..., 1, 1, 0],
#        [0, 5, 1, ..., 1, 1, 2],
#        [1, 1, 1, ..., 1, 1, 2],
#        ...,
#        [3, 5, 1, ..., 1, 1, 2],

# transform `df_copy` data
mcle.transform(df_copy)

# returns output like below (assuming the respective columns
# of `df_copy` contain the same unique values as that particular
# column in `df`)
# array([[1, 0, 0, ..., 1, 1, 0],
#        [0, 5, 1, ..., 1, 1, 2],
#        [1, 1, 1, ..., 1, 1, 2],
#        ...,
#        [3, 5, 1, ..., 1, 1, 2],

# inverse `df` data
mcle.inverse_transform(df)

# outputs data like below
# array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
#        ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
#        ...,
#        ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

# inverse `df_copy` data
mcle.inverse_transform(df_copy)

# outputs data like below
# array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
#        ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
#        ...,
#        ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
#        ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
```
You can access the individual column classes, column labels, and column encoders used to fit each column by indexing all_classes_, all_labels_, and all_encoders_.
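No, LabelEncoder does not do this.
I checked the source code of LabelEncoder (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py). It is built on a set of NumPy operations, one of which is np.unique(), and it first validates its input as a 1-D array with column_or_1d, which is the call that raises the error in the question's traceback. (Please correct me if I am wrong.)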
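To illustrate the np.unique() idea on a single column, here is a rough sketch of how sorted unique values and integer codes relate. It is meant as an illustration only and is not scikit-learn's actual implementation.

```python
import numpy as np

labels = np.array(['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'])

# classes: sorted unique labels; codes: position of each label in classes.
classes, codes = np.unique(labels, return_inverse=True)

# Decoding is plain indexing into the classes array.
decoded = classes[codes]

print(classes, codes)
```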
A very rough idea:
First, identify which columns need a LabelEncoder, then loop through each column.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def cat_var(df):
    """Identify categorical features.

    Parameters
    ----------
    df: original df after missing operations

    Returns
    -------
    cat_var_df: summary df with col index and col name for all categorical vars
    """
    col_type = df.dtypes
    col_names = list(df)

    cat_var_index = [i for i, x in enumerate(col_type) if x == 'object']
    cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]

    cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
                               'cat_name': cat_var_name})

    return cat_var_df


def column_encoder(df, cat_var_list):
    """Encoding categorical feature in the dataframe

    Parameters
    ----------
    df: input dataframe
    cat_var_list: categorical feature index and name, from cat_var function
        (currently unused; the list is recomputed from df below)

    Return
    ------
    df: new dataframe where categorical features are encoded
    label_list: classes_ attribute for all encoded features
    """

    label_list = []
    cat_var_df = cat_var(df)
    cat_list = cat_var_df.loc[:, 'cat_name']

    for index, cat_feature in enumerate(cat_list):

        le = LabelEncoder()

        le.fit(df.loc[:, cat_feature])
        label_list.append(list(le.classes_))

        # assign the whole column so that its dtype becomes integer
        df[cat_feature] = le.transform(df.loc[:, cat_feature])

    return df, label_list
```
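The returned df is the encoded dataframe, and label_list shows what all those values mean in the corresponding column.
This is a snippet from a data processing script I wrote for work. Let me know if you think it could be improved further.
EDIT:
I just want to mention that the methods above work best on a dataframe with no missing values. I am not sure how they behave when the dataframe contains missing data. (I dealt with missing values before running the methods above.)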
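One common way to deal with missing values beforehand is to turn them into a category of their own. The sketch below assumes that is acceptable for your data; the placeholder string `'__missing__'` is a hypothetical choice, not something from the script above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'pets': ['cat', np.nan, 'dog', 'cat']})

# Assumption: missing values become their own category via a
# hypothetical placeholder string.
filled = df['pets'].fillna('__missing__')

le = LabelEncoder()
codes = le.fit_transform(filled)

print(list(le.classes_), codes)
```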
After a lot of searching and experimenting with some of the answers here and elsewhere, I think your answer is here:
```python
pd.DataFrame(columns=df.columns,
             data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))
```
This keeps the category codes consistent across columns:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']],
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

pd.DataFrame(columns=df.columns,
             data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

   A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9  10    8
1   1   5   8   6   7   9  10   0   0    0
2   1   3   9   6   8   7   0   0   0    0
```
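The one-liner above throws the fitted encoder away, so the codes cannot be mapped back afterwards. A small variation, sketched here with a tiny made-up frame, keeps the encoder so that the same flatten-and-reshape trick also works for the inverse.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A', 'B'], ['B', 'C']], columns=['A1', 'A2'])

# One encoder fitted on every cell, so a label gets the same code in any column.
le = LabelEncoder()
codes = le.fit_transform(df.to_numpy().ravel())
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# Invert by flattening the codes again and restoring the original shape.
decoded = pd.DataFrame(
    le.inverse_transform(encoded.to_numpy().ravel()).reshape(df.shape),
    index=df.index, columns=df.columns,
)
```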
It is possible to do all of this directly in pandas, and it is well suited to a unique ability of the replace method.
First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.
```python
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d

transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
 'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
 'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}
```
Since this will always be a one-to-one mapping, we can invert the inner dictionaries to get a mapping from the new values back to the originals.
```python
inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v: k for k, v in d.items()}

inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
```
Now we can use the unique ability of the replace method to take a nested dictionary, using the outer keys as the columns and the inner keys as the values we would like to replace:
```python
df.replace(transform_dict)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1
```
We can easily get back to the original by chaining the replace method again:
```python
df.replace(transform_dict).replace(inverse_transform_dict)
    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego     Champ  monkey
4  San_Diego  Veronica     dog
5   New_York       Ron     dog
```
Following up on the comments raised on @PriceHardman's solution, I would propose the following version of the class:
```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder

# Note: `pdu` appears to be a helper module of the answer's author that
# validates the column arguments; it is not defined in this answer.


class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Encode ``cols`` of ``df`` using the fitted encoders

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self
```
This class fits the encoders on the training set and uses the fitted versions when transforming. An initial version of the code can be found here.
If we have a single column, label encoding and its inverse transform are easy; here is how to do it in Python when there are multiple columns:
```python
from sklearn import preprocessing


def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label encodes the object type columns and gives the
         label encoded data and the inverse transform of the label encoded data
    @param dataset dataframe on whose columns the label encoding has to be done
    @return label encoded data and inverse transform of the label encoded data.
    '''
    # work on real copies so the input dataframe is not modified
    data_original = dataset.copy()
    data_transformed = dataset.copy()
    for y in dataset.columns:
        # check the dtype of the column; object type contains strings or chars
        if dataset[y].dtype == object:
            print("The string type features are: " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_transformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_transformed[y])
    return data_transformed, data_original
```
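If your dataframe contains both numerical and categorical data, you can use the following. Here X is my dataframe, which has both categorical and numerical variables:
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for column in X.columns:
    # only object (string) columns are encoded; numeric columns are skipped
    if X[column].dtype == 'object':
        X[column] = le.fit_transform(X[column])
```
Note: this technique is fine if you are not interested in converting the values back.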
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('.../train.csv')

# X = train.loc[:, ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class']].values

def MultiLabelEncoder(columnlist, dataframe):
    # Label-encode each listed column of the dataframe in place
    for i in columnlist:
        labelencoder_X = LabelEncoder()
        dataframe[i] = labelencoder_X.fit_transform(dataframe[i])

columnlist = ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class', 'source_type']
MultiLabelEncoder(columnlist, train)
```
Here I am reading a CSV from a location, and I pass the function the list of columns I want to label-encode and the dataframe I want to apply it to.
I mainly used @Alexander's answer but had to make some changes:
```python
cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
          for col in df[cols_need_mapped]}

for c in cols_need_mapped:
    df[c] = df[c].map(mapper[c])
```
Then, to re-use it in the future, you can save the output to a JSON document and, when you need it, read it back in and use the .map() function as above.
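A sketch of that save-and-reload step follows. The file name `label_mapper.json`, the temporary directory and the toy frames are assumptions made for the example. Keep in mind that JSON object keys are always strings, so this round trip is only lossless when the categories are strings.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'a'], 'col2': ['x', 'x', 'y']})
cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
          for col in cols_need_mapped}

# Hypothetical file location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), 'label_mapper.json')
with open(path, 'w') as f:
    json.dump(mapper, f)

# Later: read the mapping back and apply it to new data.
with open(path) as f:
    loaded = json.load(f)

new_df = pd.DataFrame({'col1': ['b', 'a'], 'col2': ['y', 'x']})
for c in cols_need_mapped:
    new_df[c] = new_df[c].map(loaded[c])
```

Values that are missing from the saved mapping come out of .map() as NaN, so it is worth checking for them after loading.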
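The problem is the shape of the data (a pandas dataframe) that you are passing to the fit function.
You have to pass a 1-D list.
Use a dictionary of encoders, one per column:
```python
from sklearn.preprocessing import LabelEncoder

# `columns` is the list of column names to encode
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    # the encoded values were discarded in the original, so only fit here
    le_dict[col].fit(df[col])
```
You can then use this le_dict to label-encode the columns of any other dataframe:
```python
le_dict[col].transform(df_another[col])
```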
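One thing to watch for when re-using fitted encoders on another dataframe: transform raises a ValueError for labels that were not present during fitting. A small sketch with made-up frames:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'pets': ['cat', 'dog'], 'owner': ['Ron', 'Champ']})
df_another = pd.DataFrame({'pets': ['dog', 'cat'], 'owner': ['Champ', 'Brick']})
columns = ['pets', 'owner']

# Fit one encoder per column on the first dataframe.
le_dict = {col: LabelEncoder().fit(df[col]) for col in columns}

# Every label in df_another['pets'] was seen during fitting, so this works.
pets_codes = le_dict['pets'].transform(df_another['pets'])

# 'Brick' never appeared in df['owner'], so transform raises ValueError.
try:
    le_dict['owner'].transform(df_another['owner'])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc

print(pets_codes, unseen_error)
```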