Pandas strings: replacing multiple words without a for loop

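I have about 1.3 million strings in a pandas df (they represent what users ask for when they message the IT help desk). I also have a series of 29,813 names that I want to remove from these strings, so that only the words describing the problem remain. Here is a small example of the data. It works, but it takes far too long. I am looking for a more efficient way to achieve this result: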

Input:

import pandas as pd

List1 = ["George Lucas has a problem logging in",
        "George Clooney is trying to download data into a spreadsheet",
        "Bart Graham needs to logon to CRM urgently",
        "Lucy Anne George needs to pull management reports"]
List2 = ["Access Team","Microsoft Team","Access Team","Reporting Team"]

df = pd.DataFrame({"Team":List2,"Text":List1})

xwords = pd.Series(["George","Lucas","Clooney","Lucy","Anne","Bart","Graham"])

for word in range(len(xwords)):
    df["Text"] = df["Text"].str.replace(xwords[word],"!")

# Just using ! in the example so one can clearly see the result

Output:

Team                Text
0   Access Team     ! ! has a problem logging in
1   Microsoft Team  ! ! is trying to download data into a spreadsheet
2   Access Team     ! ! needs to logon to CRM urgently
3   Reporting Team  ! ! ! needs to pull management reports

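I have been trying to find an answer for a while now. If I have missed it somewhere through lack of experience, please be gentle and let me know!

Many thanks :)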


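Thanks to Ciprian Tomiagă for pointing me to the post "Speed up millions of regex replacements in Python 3". The option provided there by Eric Duminil (see "Use this method (with set lookup) if you want the fastest solution") works just as well in a pandas context, using a Series instead of a list. The sample code for this question is repeated below. On my large dataset the whole process finished in 2.54 seconds!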

Input:

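import re

# Lowercased set of names: membership tests on a set are O(1) on average
banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    # Called once per matched word: return "" to delete it, or the word itself to keep it
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = df["Text"]

# Raw string so that \w reaches the regex engine unchanged
word_pattern = re.compile(r'\w+')

df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
print(df)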

Output:

Team              Text
Access Team       has a problem logging in
Microsoft Team    is trying to download data into a spreadsheet
Access Team       needs to logon to CRM urgently
Reporting Team    needs to pull management reports
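
One detail worth noting about the set-lookup approach above: replacing a name with an empty string leaves the surrounding spaces in place, so the cleaned text can start with or contain runs of spaces. The following is a minimal sketch, not part of the original answer, that adds a whitespace clean-up step. The helper name strip_names and the small sample data are assumptions made for illustration, and plain lists are used so the snippet runs without pandas.

```python
import re

# Sample subset of names, lowercased for case-insensitive lookup (illustrative data)
banned_words = {"george", "lucas", "lucy", "anne"}

word_pattern = re.compile(r"\w+")

def delete_banned_words(matchobj):
    # Same callback idea as above: drop the word if it is a banned name
    word = matchobj.group(0)
    return "" if word.lower() in banned_words else word

def strip_names(sentence):
    # Hypothetical helper: remove names, then collapse leftover whitespace
    cleaned = word_pattern.sub(delete_banned_words, sentence)
    return " ".join(cleaned.split())

sentences = [
    "George Lucas has a problem logging in",
    "Lucy Anne George needs to pull management reports",
]
cleaned_sentences = [strip_names(s) for s in sentences]
print(cleaned_sentences)
```

With a DataFrame, the same helper could presumably be applied in the list comprehension shown above, for example `[strip_names(s) for s in df["Text"]]`.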


pandas.Series.str.replace can take a compiled regex as the pattern:

import re
# Pass regex=True explicitly: newer pandas versions default to regex=False
patt = re.compile(r'|'.join(xwords))
df["Text"] = df["Text"].str.replace(patt, "!", regex=True)

Maybe this will help? I have no experience with regexes this long, though.
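
A caveat with a bare alternation like the one above is that it also matches names inside longer words, and any regex metacharacters in a name would be interpreted as pattern syntax. The sketch below is an assumption about how one might tighten the pattern, not something the answer itself proposes: it escapes each name with re.escape and wraps the alternation in \b word boundaries. The sample sentence is made up for illustration, and the pattern is exercised with re directly so the snippet runs without pandas.

```python
import re

xwords = ["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"]

# Bare alternation, as in the answer above
naive_patt = re.compile("|".join(xwords))

# Escaped names wrapped in word boundaries, so only whole words match
bounded_patt = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in xwords) + r")\b")

text = "Bart Graham needs the Annex report"

naive_result = naive_patt.sub("!", text)      # also hits the "Anne" inside "Annex"
bounded_result = bounded_patt.sub("!", text)  # leaves "Annex" untouched

print(naive_result)
print(bounded_result)
```

The bounded pattern can be passed to `df["Text"].str.replace(bounded_patt, "!", regex=True)` in the same way as the original one. Whether an alternation of roughly 30,000 names performs acceptably is not something this sketch measures.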


I suggest tokenizing the text and using a set of names:

xwords = set(["George","Lucas", ...])
# Split each string on spaces, drop the tokens found in xwords, then rejoin
df["Text"] = df["Text"].apply(lambda text: ' '.join(w for w in text.split(' ') if w not in xwords))

Depending on the strings, the tokenization may need to be more sophisticated than just splitting on spaces.
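
To illustrate why splitting on spaces can fall short, here is a rough sketch comparing it with a simple regex tokenizer. The function names and the sample sentence are hypothetical, and the \w+ tokenizer is only one possible choice: it discards punctuation entirely, which is assumed to be acceptable when only the words describing the problem matter.

```python
import re

names = {"George", "Lucas", "Lucy"}  # illustrative subset of the names to remove

def filter_by_space_split(text):
    # Tokens keep attached punctuation, so "George," is not found in the set
    return " ".join(w for w in text.split(" ") if w not in names)

def filter_by_word_tokens(text):
    # \w+ extracts bare words and drops punctuation before the set lookup
    tokens = re.findall(r"\w+", text)
    return " ".join(t for t in tokens if t not in names)

text = "George, please reset the password for Lucy."

space_split_result = filter_by_space_split(text)
word_token_result = filter_by_word_tokens(text)

print(space_split_result)
print(word_token_result)
```

In this example the space-based version leaves both names in place because of the trailing comma and full stop, while the word-token version removes them.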

There may be a pandas-specific way to do this, but I have almost no experience with it ;)