Split (explode) pandas dataframe string entry to separate rows
我有一个
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | In [7]: a Out[7]: var1 var2 0 a,b,c 1 1 d,e,f 2 In [8]: b Out[8]: var1 var2 0 a 1 1 b 1 2 c 1 3 d 2 4 e 2 5 f 2 |
到目前为止,我已经尝试了各种简单的函数,但是
示例数据:
1 2 3 4 5 6 7 8 9 10 | from pandas import DataFrame import numpy as np a = DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}]) b = DataFrame([{'var1': 'a', 'var2': 1}, {'var1': 'b', 'var2': 1}, {'var1': 'c', 'var2': 1}, {'var1': 'd', 'var2': 2}, {'var1': 'e', 'var2': 2}, {'var1': 'f', 'var2': 2}]) |
我知道这不起作用,因为我们通过numpy丢失DataFrame元数据,但它应该让你了解我尝试做的事情:
1 2 3 4 5 6 7 8 | def fun(row): letters = row['var1'] letters = letters.split(',') out = np.array([row] * len(letters)) out['var1'] = letters a['idx'] = range(a.shape[0]) z = a.groupby('idx') z.transform(fun) |
UPDATE2:更通用的矢量化函数,适用于多个
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | def explode(df, lst_cols, fill_value='', preserve_index=False): # make sure `lst_cols` is list-alike if (lst_cols is not None and len(lst_cols) > 0 and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))): lst_cols = [lst_cols] # all columns except `lst_cols` idx_cols = df.columns.difference(lst_cols) # calculate lengths of lists lens = df[lst_cols[0]].str.len() # preserve original index values idx = np.repeat(df.index.values, lens) # create"exploded" DF res = (pd.DataFrame({ col:np.repeat(df[col].values, lens) for col in idx_cols}, index=idx) .assign(**{col:np.concatenate(df.loc[lens>0, col].values) for col in lst_cols})) # append those rows that have empty lists if (lens == 0).any(): # at least one list in cells is empty res = (res.append(df.loc[lens==0, idx_cols], sort=False) .fillna(fill_value)) # revert the original index order res = res.sort_index() # reset index if requested if not preserve_index: res = res.reset_index(drop=True) return res |
演示:
多个
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | In [134]: df Out[134]: aaa myid num text 0 10 1 [1, 2, 3] [aa, bb, cc] 1 11 2 [] [] 2 12 3 [1, 2] [cc, dd] 3 13 4 [] [] In [135]: explode(df, ['num','text'], fill_value='') Out[135]: aaa myid num text 0 10 1 1 aa 1 10 1 2 bb 2 10 1 3 cc 3 11 2 4 12 3 1 cc 5 12 3 2 dd 6 13 4 |
保留原始索引值:
1 2 3 4 5 6 7 8 9 10 | In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True) Out[136]: aaa myid num text 0 10 1 1 aa 0 10 1 2 bb 0 10 1 3 cc 1 11 2 2 12 3 1 cc 2 12 3 2 dd 3 13 4 |
建立:
1 2 3 4 5 6 | df = pd.DataFrame({ 'aaa': {0: 10, 1: 11, 2: 12, 3: 13}, 'myid': {0: 1, 1: 2, 2: 3, 3: 4}, 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []}, 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []} }) |
CSV列:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | In [46]: df Out[46]: var1 var2 var3 0 a,b,c 1 XX 1 d,e,f,x,y 2 ZZ In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1') Out[47]: var1 var2 var3 0 a 1 XX 1 b 1 XX 2 c 1 XX 3 d 2 ZZ 4 e 2 ZZ 5 f 2 ZZ 6 x 2 ZZ 7 y 2 ZZ |
使用这个小技巧,我们可以将类似CSV的列转换为
1 2 3 4 5 | In [48]: df.assign(var1=df.var1.str.split(',')) Out[48]: var1 var2 var3 0 [a, b, c] 1 XX 1 [d, e, f, x, y] 2 ZZ |
更新:通用矢量化方法(也适用于多列):
原DF:
1 2 3 4 5 | In [177]: df Out[177]: var1 var2 var3 0 a,b,c 1 XX 1 d,e,f,x,y 2 ZZ |
解:
首先让我们将CSV字符串转换为列表:
1 2 3 4 5 6 7 8 9 | In [178]: lst_col = 'var1' In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')}) In [180]: x Out[180]: var1 var2 var3 0 [a, b, c] 1 XX 1 [d, e, f, x, y] 2 ZZ |
现在我们可以这样做:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | In [181]: pd.DataFrame({ ...: col:np.repeat(x[col].values, x[lst_col].str.len()) ...: for col in x.columns.difference([lst_col]) ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()] ...: Out[181]: var1 var2 var3 0 a 1 XX 1 b 1 XX 2 c 1 XX 3 d 2 ZZ 4 e 2 ZZ 5 f 2 ZZ 6 x 2 ZZ 7 y 2 ZZ |
老答案:
受@AFinkelstein解决方案的启发,我想让它更加通用化,可以应用于具有两列以上的DF,并且速度快,几乎和AFinkelstein的解决方案一样快:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | In [2]: df = pd.DataFrame( ...: [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'}, ...: {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}] ...: ) In [3]: df Out[3]: var1 var2 var3 0 a,b,c 1 XX 1 d,e,f,x,y 2 ZZ In [4]: (df.set_index(df.columns.drop('var1',1).tolist()) ...: .var1.str.split(',', expand=True) ...: .stack() ...: .reset_index() ...: .rename(columns={0:'var1'}) ...: .loc[:, df.columns] ...: ) Out[4]: var1 var2 var3 0 a 1 XX 1 b 1 XX 2 c 1 XX 3 d 2 ZZ 4 e 2 ZZ 5 f 2 ZZ 6 x 2 ZZ 7 y 2 ZZ |
经过痛苦的实验,找到比接受的答案更快的东西,我得到了这个工作。它在我试用的数据集上运行速度快了大约100倍。
如果有人知道如何使这更优雅,请务必修改我的代码。我找不到一种方法可以在不设置你想保留的其他列作为索引,然后重置索引并重新命名列,但我想有其他的东西可行。
1 2 3 | b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack() b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0 b.columns = ['var1', 'var2'] # renaming var1 |
这样的事情怎么样:
1 2 3 4 5 6 7 8 9 10 | In [55]: pd.concat([Series(row['var2'], row['var1'].split(',')) for _, row in a.iterrows()]).reset_index() Out[55]: index 0 0 a 1 1 b 1 2 c 1 3 d 2 4 e 2 5 f 2 |
然后你只需要重命名列
这是我为这个常见任务编写的函数。它比
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | def tidy_split(df, column, sep='|', keep=False): """ Split the values of a column and expand so the new DataFrame has one split value per row. Filters rows where the column is missing. Params ------ df : pandas.DataFrame dataframe with the column to split and expand column : str the column to split and expand sep : str the string used to split the column's values keep : bool whether to retain the presplit value as it's own row Returns ------- pandas.DataFrame Returns a dataframe with the same columns as `df`. """ indexes = list() new_values = list() df = df.dropna(subset=[column]) for i, presplit in enumerate(df[column].astype(str)): values = presplit.split(sep) if keep and len(values) > 1: indexes.append(i) new_values.append(presplit) for value in values: indexes.append(i) new_values.append(value) new_df = df.iloc[indexes, :].copy() new_df[column] = new_values return new_df |
使用此功能,原始问题很简单:
1 | tidy_split(a, 'var1', sep=',') |
类似的问题:pandas:如何将列中的文本拆分成多行?
你可以这样做:
1 2 3 4 5 6 7 8 9 10 11 12 | >> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]}) >> s = a.var1.str.split(",").apply(pd.Series, 1).stack() >> s.index = s.index.droplevel(-1) >> del a['var1'] >> a.join(s) var2 var1 0 1 a 0 1 b 0 1 c 1 2 d 1 2 e 1 2 f |
TL; DR
1 2 3 4 5 6 7 8 9 10 11 12 | import pandas as pd import numpy as np def explode_str(df, col, sep): s = df[col] i = np.arange(len(s)).repeat(s.str.count(sep) + 1) return df.iloc[i].assign(**{col: sep.join(s).split(sep)}) def explode_list(df, col): s = df[col] i = np.arange(len(s)).repeat(s.str.len()) return df.iloc[i].assign(**{col: np.concatenate(s)}) |
示范
1 2 3 4 5 6 7 8 9 | explode_str(a, 'var1', ',') var1 var2 0 a 1 0 b 1 0 c 1 1 d 2 1 e 2 1 f 2 |
让我们创建一个包含列表的新数据框
1 2 3 4 5 6 7 8 9 10 11 | d = a.assign(var1=lambda d: d.var1.str.split(',')) explode_list(d, 'var1') var1 var2 0 a 1 0 b 1 0 c 1 1 d 2 1 e 2 1 f 2 |
普通的留言
我将
常问问题
为什么我不使用
因为索引可能不是唯一的,并且使用
为什么不使用
当调用
你为什么用
当我使用
为什么索引值重复?
通过在重复位置上使用
可以使用
对于弦乐
我不想过早分裂弦乐。因此,我计算
然后我使用
1 2 3 4 | def explode_str(df, col, sep): s = df[col] i = np.arange(len(s)).repeat(s.str.count(sep) + 1) return df.iloc[i].assign(**{col: sep.join(s).split(sep)}) |
对于列表
与字符串类似,但我不需要计算
我使用Numpy的
1 2 3 4 5 6 7 | import pandas as pd import numpy as np def explode_list(df, col): s = df[col] i = np.arange(len(s)).repeat(s.str.len()) return df.iloc[i].assign(**{col: np.concatenate(s)}) |
我想出了一个具有任意列数的数据帧的解决方案(同时仍然只分离一列的条目)。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | def splitDataFrameList(df,target_column,separator): ''' df = dataframe to split, target_column = the column containing the values to split separator = the symbol used to perform the split returns: a dataframe with each entry for the target column separated, with each element moved into a new row. The values in the other columns are duplicated across the newly divided rows. ''' def splitListToRows(row,row_accumulator,target_column,separator): split_row = row[target_column].split(separator) for s in split_row: new_row = row.to_dict() new_row[target_column] = s row_accumulator.append(new_row) new_rows = [] df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator)) new_df = pandas.DataFrame(new_rows) return new_df |
这是一条相当简单的消息,它使用pandas
通过使用
1 2 3 4 5 6 7 8 9 10 11 12 13 | var1 = df.var1.str.split(',', expand=True).values.ravel() var2 = np.repeat(df.var2.values, len(var1) / len(df)) pd.DataFrame({'var1': var1, 'var2': var2}) var1 var2 0 a 1 1 b 1 2 c 1 3 d 2 4 e 2 5 f 2 |
熊猫> = 0.25
Series和DataFrame方法定义了一个
由于您有逗号分隔字符串列表,因此在逗号上拆分字符串以获取元素列表,然后在该列上调用
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]}) df var1 var2 0 a,b,c 1 1 d,e,f 2 df.assign(var1=df['var1'].str.split(',')).explode('var1') var1 var2 0 a 1 0 b 1 0 c 1 1 d 2 1 e 2 1 f 2 |
请注意,
NaNs和空名单得到他们应得的待遇,而不必为了正确而跳过篮球。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]}) df var1 var2 0 d,e,f 1 1 2 2 NaN 3 df['var1'].str.split(',') 0 [d, e, f] 1 [] 2 NaN df.assign(var1=df['var1'].str.split(',')).explode('var1') var1 var2 0 d 1 0 e 1 0 f 1 1 2 # empty list entry becomes empty string after exploding 2 NaN 3 # NaN left un-touched |
与基于
字符串函数split可以选择boolean参数'expand'。
以下是使用此参数的解决方案:
1 2 3 4 5 6 7 | (a.var1 .str.split(",",expand=True) .set_index(a.var2) .stack() .reset_index(level=1, drop=True) .reset_index() .rename(columns={0:"var1"})) |
我一直在努力使用各种方式来爆炸我的列表,因此我准备了一些基准来帮助我决定upvote的哪些答案。我测试了五个场景,列表长度的比例不同,列表的数量。分享以下结果:
时间:(越少越好,点击查看大图)
峰值内存使用量:(越少越好)
结论:
- @ MaxU的答案(更新2),代号连接几乎在每种情况下提供最佳速度,同时保持较低的查看内存使用率,
- 如果您需要使用相对较小的列表处理大量行并且可以提供增加的峰值内存,请参阅@ DMulligan的答案(代号堆栈),
- 接受的@ Chang的答案适用于具有几行但非常大的列表的数据帧。
完整的细节(功能和基准代码)都在这个GitHub的要点中。请注意,基准测试问题已经简化,并且不包括将字符串拆分到列表中 - 大多数解决方案都以类似的方式执行。
基于优秀的@ DMulligan解决方案,这里是一个通用的矢量化(无循环)函数,它将数据帧的一列拆分成多行,并将其合并回原始数据帧。它还使用了一个很好的通用
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | def change_column_order(df, col_name, index): cols = df.columns.tolist() cols.remove(col_name) cols.insert(index, col_name) return df[cols] def split_df(dataframe, col_name, sep): orig_col_index = dataframe.columns.tolist().index(col_name) orig_index_name = dataframe.index.name orig_columns = dataframe.columns dataframe = dataframe.reset_index() # we need a natural 0-based index for proper merge index_col_name = (set(dataframe.columns) - set(orig_columns)).pop() df_split = pd.DataFrame( pd.DataFrame(dataframe[col_name].str.split(sep).tolist()) .stack().reset_index(level=1, drop=1), columns=[col_name]) df = dataframe.drop(col_name, axis=1) df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner') df = df.set_index(index_col_name) df.index.name = orig_index_name # merge adds the column to the last place, so we need to move it back return change_column_order(df, col_name, orig_col_index) |
例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], columns=['Name', 'A', 'B'], index=[10, 12, 13]) df Name A B 10 a:b 1 4 12 c:d 2 5 13 e:f:g:h 3 6 split_df(df, 'Name', ':') Name A B 10 a 1 4 10 b 1 4 12 c 2 5 12 d 2 5 13 e 3 6 13 f 3 6 13 g 3 6 13 h 3 6 |
请注意,它保留了原始索引和列的顺序。它也适用于具有非顺序索引的数据帧。
有可能在不改变数据帧结构的情况下拆分和分解数据帧
输入:
1 2 3 4 5 6 7 8 9 10 | var1 var2 0 a,b,c 1 1 d,e,f 2 #Get the indexes which are repetative with the split df = df.reindex(df.index.repeat(df.var1.str.split(',').apply(len))) #Assign the split values to dataframe column df['var1'] = np.hstack(df['var1'].drop_duplicates().str.split(',')) |
日期:
1 2 3 4 5 6 7 | var1 var2 0 a 1 0 b 1 0 c 1 1 d 2 1 e 2 1 f 2 |
通过MultiIndex支持升级MaxU的答案
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | def explode(df, lst_cols, fill_value='', preserve_index=False): """ usage: In [134]: df Out[134]: aaa myid num text 0 10 1 [1, 2, 3] [aa, bb, cc] 1 11 2 [] [] 2 12 3 [1, 2] [cc, dd] 3 13 4 [] [] In [135]: explode(df, ['num','text'], fill_value='') Out[135]: aaa myid num text 0 10 1 1 aa 1 10 1 2 bb 2 10 1 3 cc 3 11 2 4 12 3 1 cc 5 12 3 2 dd 6 13 4 """ # make sure `lst_cols` is list-alike if (lst_cols is not None and len(lst_cols) > 0 and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))): lst_cols = [lst_cols] # all columns except `lst_cols` idx_cols = df.columns.difference(lst_cols) # calculate lengths of lists lens = df[lst_cols[0]].str.len() # preserve original index values idx = np.repeat(df.index.values, lens) res = (pd.DataFrame({ col:np.repeat(df[col].values, lens) for col in idx_cols}, index=idx) .assign(**{col:np.concatenate(df.loc[lens>0, col].values) for col in lst_cols})) # append those rows that have empty lists if (lens == 0).any(): # at least one list in cells is empty res = (res.append(df.loc[lens==0, idx_cols], sort=False) .fillna(fill_value)) # revert the original index order res = res.sort_index() # reset index if requested if not preserve_index: res = res.reset_index(drop=True) # if original index is MultiIndex build the dataframe from the multiindex # create"exploded" DF if isinstance(df.index, pd.MultiIndex): res = res.reindex( index=pd.MultiIndex.from_tuples( res.index, names=['number', 'color'] ) ) return res |
刚从上面使用了jiln的优秀答案,但需要扩展分割多个列。以为我会分享。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | def splitDataFrameList(df,target_column,separator): ''' df = dataframe to split, target_column = the column containing the values to split separator = the symbol used to perform the split returns: a dataframe with each entry for the target column separated, with each element moved into a new row. The values in the other columns are duplicated across the newly divided rows. ''' def splitListToRows(row, row_accumulator, target_columns, separator): split_rows = [] for target_column in target_columns: split_rows.append(row[target_column].split(separator)) # Seperate for multiple columns for i in range(len(split_rows[0])): new_row = row.to_dict() for j in range(len(split_rows)): new_row[target_columns[j]] = split_rows[j][i] row_accumulator.append(new_row) new_rows = [] df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator)) new_df = pd.DataFrame(new_rows) return new_df |
我已经提出了以下解决这个问题的方法:
1 2 3 4 5 6 7 | def iter_var1(d): for _, row in d.iterrows(): for v in row["var1"].split(","): yield (v, row["var2"]) new_a = DataFrame.from_records([i for i in iter_var1(a)], columns=["var1","var2"]) |
另一个使用python copy包的解决方案
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | import copy new_observations = list() def pandas_explode(df, column_to_explode): new_observations = list() for row in df.to_dict(orient='records'): explode_values = row[column_to_explode] del row[column_to_explode] if type(explode_values) is list or type(explode_values) is tuple: for explode_value in explode_values: new_observation = copy.deepcopy(row) new_observation[column_to_explode] = explode_value new_observations.append(new_observation) else: new_observation = copy.deepcopy(row) new_observation[column_to_explode] = explode_values new_observations.append(new_observation) return_df = pd.DataFrame(new_observations) return return_df df = pandas_explode(df, column_name) |