Find a value of a dictionary in dataframe column and modify it
I'm currently working with DataFrames and dictionaries, and I have a question.
I have a dictionary, "Fruits":

    {BN:'Banana', LM:'Lemon', AP:'Apple' ..... etc}

and a DataFrame, "Stock":

                 Fruit  Price
    0      Sweet Mango      1
    1      Green Apple      2
    2  Few blue Banana      0
    3     Black Banana      5

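I want to do the following:
replace the values in Stock['Fruit'] with values from Fruits in this way:
if a value from Fruits appears in a Stock['Fruit'] row, that row is replaced by the value:

    Few blue Banana ---> Banana
    Black Banana ---> Banana

The DataFrame Stock would then look like this:

             Fruit  Price
    0  Sweet Mango      1
    1  Green Apple      2
    2       Banana      0
    3       Banana      5
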
I found various pieces of code that replace values, or that check whether values from the dictionary appear in the DataFrame:

    Stock['Fruit'] = Stock.Fruit.map(Fruits)

    if (Fruits.values() in Stock['Fruit'] for item in Stock)

    any('Mango' in Stock['Fruit'] for index,item in Stock.iterrows())

But I could not find anything that updates the rows of the DataFrame.
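Use string methods to build the condition and extract the required values:

    # d is the dictionary of fruit names; df is the DataFrame
    pat = r'({})'.format('|'.join(d.values()))               # e.g. '(Banana|Lemon|Apple)'
    cond = df['Fruit'].str.contains('|'.join(d.values()))    # rows that contain any value
    df.loc[cond, 'Fruit'] = df['Fruit'].str.extract(pat, expand=False)

                 Fruit  Price
    0      Sweet Mango      1
    1            Apple      2
    2           Banana      0
    3           Banana      5
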
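Edit: As @user3483203 suggested, you can extract the pattern and then fill the missing values with the original ones.

    # expand=False returns a Series; in current pandas the default returns a DataFrame
    df['Fruit'] = df['Fruit'].str.extract(pat, expand=False).fillna(df.Fruit)
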
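The extract-then-fill approach above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: the sample data is copied from the question, and wrapping each value in `re.escape` is an added assumption, so that a dictionary value containing regex metacharacters (such as `.` or `+`) is matched literally.

```python
import re

import pandas as pd

fruits = {'BN': 'Banana', 'LM': 'Lemon', 'AP': 'Apple'}
df = pd.DataFrame({
    'Fruit': ['Sweet Mango', 'Green Apple', 'Few blue Banana', 'Black Banana'],
    'Price': [1, 2, 0, 5],
})

# One capture group with every dictionary value as an alternative.
# re.escape is an addition here; it keeps special characters literal.
pat = '({})'.format('|'.join(re.escape(v) for v in fruits.values()))

# expand=False gives a Series with NaN where nothing matched,
# and fillna puts the original text back in those rows.
extracted = df['Fruit'].str.extract(pat, expand=False)
df['Fruit'] = extracted.fillna(df['Fruit'])
```

Rows without a match ('Sweet Mango' here, since 'Mango' is not in this dictionary) keep their original text.
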
IIUC, you can use apply with a custom function:

    import pandas as pd

    df = pd.DataFrame([['Sweet Mango', 1], ['Green Apple', 2], ['Few blue Banana', 0], ['Black Banana', 5]],
                      columns=['Fruit', 'Price'])

    fruits = {'BN': 'Banana', 'LM': 'Lemon', 'AP': 'Apple', 'MG': 'Mango'}

    def find_category(x):
        # First dictionary value contained in x; raises IndexError if none matches
        return [k for k in fruits.values() if k in x][0]

    df['Fruit'] = df['Fruit'].apply(find_category)

Output:

        Fruit  Price
    0   Mango      1
    1   Apple      2
    2  Banana      0
    3  Banana      5

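Note that `find_category` indexes `[0]` into the list of matches, so it raises `IndexError` for any row that contains none of the dictionary values. A possible variant, sketched here with the hypothetical name `find_category_or_keep`, returns the original string in that case. It is shown on a plain list so the per-row logic is easy to check.

```python
fruits = {'BN': 'Banana', 'LM': 'Lemon', 'AP': 'Apple'}  # no 'Mango' entry here


def find_category_or_keep(name, values=tuple(fruits.values())):
    """Return the first dictionary value contained in name, else name itself."""
    # next() with a default avoids the IndexError of indexing an empty list
    return next((v for v in values if v in name), name)


names = ['Sweet Mango', 'Green Apple', 'Few blue Banana', 'Black Banana']
result = [find_category_or_keep(n) for n in names]
```

With a DataFrame, the same function can be passed to `df['Fruit'].apply(find_category_or_keep)`.
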
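Using the result of the answer here, we create a subclass of defaultdict that overrides __missing__ so that the missing key is passed to default_factory:

    from collections import defaultdict

    class keydefaultdict(defaultdict):
        def __missing__(self, key):
            if self.default_factory is None:
                raise KeyError(key)
            else:
                # Call the factory with the key, cache the result, and return it
                ret = self[key] = self.default_factory(key)
                return ret
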
We create an initial dictionary that maps the values in the 'Fruit' column that we want to replace:

    fruit_dict = {'Few blue Banana': 'Banana', 'Black Banana': 'Banana'}

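The mapping above is written by hand and matches whole strings only. If it should instead be derived from the question's dictionary, one possible sketch is below. The names `unique_names` and `derived_map` are hypothetical, and `unique_names` stands in for what `df['Fruit'].unique()` would return.

```python
fruits = {'BN': 'Banana', 'LM': 'Lemon', 'AP': 'Apple'}
unique_names = ['Sweet Mango', 'Green Apple', 'Few blue Banana', 'Black Banana']

derived_map = {}
for name in unique_names:
    for value in fruits.values():
        if value in name:
            # Keep the first dictionary value found inside the name
            derived_map[name] = value
            break
```

Unlike the hand-written `fruit_dict`, this also maps 'Green Apple' to 'Apple', because 'Apple' is one of the dictionary values. Names with no match are left out of the mapping.
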
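Then we create an instance of the class with lambda x: x as the default factory, so that a key that is not found maps to itself, and update it with fruit_dict:

    fruit_col_map = keydefaultdict(lambda x: x)
    fruit_col_map.update(**fruit_dict)
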
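The lookup behaviour of this mapping can be checked in isolation, without a DataFrame. The sketch below repeats the class so that it runs on its own. The variable names `known`, `unknown`, and `cached` are illustrative only.

```python
from collections import defaultdict


class keydefaultdict(defaultdict):
    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        # Pass the key to the factory, cache the result, and return it
        value = self[key] = self.default_factory(key)
        return value


fruit_col_map = keydefaultdict(lambda x: x)
fruit_col_map.update({'Few blue Banana': 'Banana', 'Black Banana': 'Banana'})

known = fruit_col_map['Black Banana']     # explicit entry
unknown = fruit_col_map['Sweet Mango']    # falls through to the factory
cached = 'Sweet Mango' in fruit_col_map   # the fallback was stored
```

This matters for the next step because `Series.map` uses a dict subclass's `__missing__` for absent keys instead of producing NaN.
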
Finally, update the column:

    df['Fruit'] = df['Fruit'].map(fruit_col_map)
    df

Output:

             Fruit  Price
    0  Sweet Mango      1
    1  Green Apple      2
    2       Banana      0
    3       Banana      5

This is more than 6 times faster than the accepted answer:

    df = pd.DataFrame({
        'Fruit': ['Sweet Mango', 'Green Apple', 'Few blue Banana', 'Black Banana']*1000,
        'Price': [1, 2, 0, 5]*1000
    })
    %timeit df['Fruit'].map(fruit_col_map)

Result:

    1.03 ms ± 48.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Accepted answer:

    pat = r'({})'.format('|'.join(fruit_dict.values()))
    %timeit df['Fruit'].str.extract(pat).fillna(df['Fruit'])

Result:

    6.85 ms ± 223 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)