Pandas strings, replacing multiple words without a for loop
I have roughly 1.3M strings in a pandas df (representing requests users send to an IT help desk). I want to remove a list of 29,813 names from these strings so that only the words describing the problem remain. Here is a small example with some data — it works, but it takes far too long. I am looking for a more efficient way to achieve this result:
Input:

    import pandas as pd

    List1 = ["George Lucas has a problem logging in",
             "George Clooney is trying to download data into a spreadsheet",
             "Bart Graham needs to logon to CRM urgently",
             "Lucy Anne George needs to pull management reports"]
    List2 = ["Access Team", "Microsoft Team", "Access Team", "Reporting Team"]

    df = pd.DataFrame({"Team": List2, "Text": List1})
    xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

    for word in range(len(xwords)):
        df["Text"] = df["Text"].str.replace(xwords[word], "!")  # Just using ! in the example so one can clearly see the result
Output:
                 Team                                                Text
    0     Access Team                        ! ! has a problem logging in
    1  Microsoft Team  ! ! is trying to download data into a spreadsheet
    2     Access Team                  ! ! needs to logon to CRM urgently
    3  Reporting Team             ! ! ! needs to pull management reports
I have been trying to find an answer for a while now: if I have missed it somewhere due to my inexperience, please be gentle and point me to it!

Many thanks :)
Thanks to Ciprian Tomiagă for the post Speed up millions of regex replacements in Python 3. The option provided there by Eric Duminil — see "Use this method (with set lookup) if you want the fastest solution" — works just as well in a pandas context, using a Series instead of a list. The sample code for this question is repeated below; on my large dataset the whole thing completed in 2.54 seconds!
Input:
    import re

    banned_words = set(word.strip().lower() for word in xwords)

    def delete_banned_words(matchobj):
        word = matchobj.group(0)
        if word.lower() in banned_words:
            return ""
        else:
            return word

    sentences = df["Text"]
    word_pattern = re.compile(r'\w+')

    df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
    print(df)
Output:
                 Team                                            Text
    0     Access Team                        has a problem logging in
    1  Microsoft Team   is trying to download data into a spreadsheet
    2     Access Team                  needs to logon to CRM urgently
    3  Reporting Team               needs to pull management reports
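For anyone adapting this, here is a minimal sketch (my own variant, not part of the accepted answer) of the same set-lookup idea wired up against a pandas Series of names; names_series below is a stand-in for the real 29,813-name Series mentioned in the question.

    import re
    import pandas as pd

    # names_series stands in for the question's Series of ~29,813 names (assumption).
    names_series = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

    # Build the banned set once; each lookup against a set is O(1).
    banned_words = set(name.strip().lower() for name in names_series)
    word_pattern = re.compile(r'\w+')

    def delete_banned_words(matchobj):
        # Drop the matched word if it is a banned name, otherwise keep it unchanged.
        word = matchobj.group(0)
        return "" if word.lower() in banned_words else word

    # Same substitution as above, expressed with Series.apply instead of a list comprehension.
    df["Text"] = df["Text"].apply(lambda s: word_pattern.sub(delete_banned_words, s))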
pandas.Series.str.replace can take a compiled regex as its pattern:
    import re

    patt = re.compile(r'|'.join(xwords))
    # Newer pandas versions require regex=True when the pattern is a compiled regex.
    df["Text"] = df["Text"].str.replace(patt, "!", regex=True)
Maybe this helps? I have no experience with regexes this long, though.
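One caveat worth flagging (my addition, not part of this answer): joining the raw names with '|' lets them match inside longer words, and it breaks if a name contains regex metacharacters. A hedged sketch that escapes each name and anchors on word boundaries:

    import re

    # Escape each name and wrap the alternation in \b word boundaries so that
    # only whole words match (e.g. "Anne" does not hit "Annette").
    patt = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in xwords) + r')\b')

    # regex=True is needed in newer pandas versions when passing a compiled pattern.
    df["Text"] = df["Text"].str.replace(patt, "!", regex=True)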
I suggest tokenizing the text and using a set of names:
    # xwords as a set for O(1) membership tests; the names here are the ones from the question.
    xwords = set(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])
    # Filter out the banned tokens row by row, then re-join each row's remaining words.
    df["Text"] = df["Text"].apply(lambda s: ' '.join(w for w in s.split(' ') if w not in xwords))
Depending on your strings, the tokenization may need to be more sophisticated than simply splitting on spaces.
There may be a pandas-specific way to do this, but I have little experience with pandas ;)
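For what it's worth, one guess at a more pandas-native phrasing of the same token filter (my sketch, under the assumption that splitting on single spaces is good enough) uses str.split, explode and a groupby on the original index:

    # Split each row into tokens, one token per row, keeping the original row index.
    tokens = df["Text"].str.split(' ').explode()

    # Drop the banned names, then stitch each row's surviving tokens back together.
    # Note: a row whose tokens are all banned would vanish here and come back as NaN.
    kept = tokens[~tokens.isin(xwords)]
    df["Text"] = kept.groupby(level=0).agg(' '.join)

Whether this actually beats the plain .apply version would need timing on the real 1.3M rows.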