Python: Split strings into words with multiple word boundary delimiters

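I think what I want to do is a fairly common task, but I have found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so after splitting on whitespace all the words still carry their punctuation. Any ideas?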

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
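
Applied to the sentence from the question, the documented behaviour could be used roughly as follows. This is only a sketch for Python 3; the names raw_parts and words are made up for illustration, and the lowercasing is added because the expected output in the question is lowercase.

```python
import re

text = "Hey, you - what are you doing here!?"

# Split on runs of non-word characters. Because the text ends with
# punctuation, the last element of the result is an empty string.
raw_parts = re.split(r"\W+", text)

# Drop empty strings and lowercase each word.
words = [part.lower() for part in raw_parts if part]
print(words)
```

The filtering step matters because re.split() keeps the empty string that trailing (or leading) delimiters produce.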

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

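Comments:

Thanks. Still interested, though: how can I implement the algorithm used in this module? And why does it not appear in the string module?
I don't know why the string module doesn't have a multi-character split. Maybe it is considered complex enough to belong to the domain of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module, just use it.
No, I mean: how does this module work? It's not simple at all.
Regular expressions can be daunting at first, but they are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There is a HOWTO on Python regular expressions here: amk.ca/python/howto/regex
I get it. I didn't mean how to use the re module (it's pretty complicated in itself) but how it is implemented. split() is rather straightforward to program manually; this is much more difficult...
You want to know how the re module itself works? I can't help you with that, I'm afraid. I've never looked at its innards, and my computer science degree was a long, long time ago. 8-)
I'm doing CS1, so I have a long way to go... At first glance it seems pretty difficult, in fact more so than TSP and the like.
@ooboo: if you're into CS, then you should want to master regex the way a samurai wants to master a sharp sword.
The new approach would accept "words" consisting only of the ' character.
This also doesn't handle Unicode well: the apostrophe used above is U+0027, which is the apostrophe on an en-US keyboard. There is also U+2019, which Unicode says is the preferred representation of the apostrophe. I often see this character in text pasted from other sources. One could write a regex that looks for punctuation adjacent to whitespace or to the beginning or end of a line. I may do that when I have time :)
This isn't the answer to the question. It is an answer to a different question that happens to work for this particular situation. It's as if someone asked "how do I make a left turn" and the top-voted answer was "take the next three right turns". It works for certain intersections, but it doesn't give the needed answer. Ironically, the answer is in re, just not findall. The answer below that gives re.split() is superior.
This doesn't work for words that contain a hyphen (-).
@JesseDhillon "take all substrings consisting of a sequence of word characters" and "split on all substrings consisting of a sequence of non-word characters" are just different ways of expressing the same operation; I'm not sure why you would call either answer "superior".
This is an old post, but it helped me today. Why was ' added in the edit? I tried it with and without, and it made no difference on my Windows 7 machine with Python 2.7. I also don't see that character mentioned in the regex cheat sheets I am practicing with. What does it do?
@TMWP: The apostrophe means that a word like don't is treated as a single word, rather than being split into don and t.
That explains it. My test sample didn't include any contractions, so I had nothing that would show why the ' was there. Thanks for the clarification. Going to change my code now. :-)
If you want to split on non-whitespace characters, this solution doesn't work.
print(re.findall(r"[\w\-\_']+", DATA)) is more appropriate, since it also keeps hyphenated words together (\w already matches the underscore, so the \_ is redundant).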
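
Several of the comments above (apostrophe-only "words", the U+2019 apostrophe, hyphenated words) can be addressed with a slightly stricter pattern. The following is only a sketch under the assumption that apostrophes and hyphens should count as part of a word only when they sit between word characters; WORD_RE and find_words are hypothetical names, not part of the original answer.

```python
import re

# One or more word characters, optionally followed by groups of
# (apostrophe or hyphen) + word characters. Both the ASCII apostrophe
# U+0027 and the typographic apostrophe U+2019 are accepted.
WORD_RE = re.compile(r"\w+(?:['\u2019-]\w+)*")

def find_words(text):
    """Return the word-like runs found in text."""
    return WORD_RE.findall(text)

sample = "Don\u2019t split well-known words, don't - ok!"
print(find_words(sample))
```

Because the joining characters must be followed by word characters, a free-standing " - " or a lone quote is never returned as a word.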

Another quick way to do this without a regexp is to replace the characters first, as below:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']

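So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for (splitting on multiple possible separators; instead, many answers remove anything that is not a word, which is different). So here is an answer to the question in the title, which relies on Python's standard and efficient re module:

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

the […] matches one of the separators listed inside,
the \- in the regular expression prevents the special interpretation of - as a character range indicator (as in A-Z),
the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value); list(…) turns the result into a list in Python 3 (in Python 2, filter() already returns a list).

This re.split() precisely "splits with multiple separators", as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment on ghostdog74's answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!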

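Comments:

"I can't find any solution that efficiently does what the title of the question literally asks for": the second answer, posted 5 years ago: stackoverflow.com/a/1059601/2642204.
That answer does not split at delimiters (from a set of multiple delimiters): it instead splits at anything that is not alphanumeric. That said, I agree that the intent of the original poster is probably to keep only the words, rather than to remove some punctuation marks.
EOL: I think this answer does split on a set of multiple delimiters. If you add non-alphanumeric characters that are not specified, such as an underscore, to the string, they are not split on, as expected.
I am not sure I understand: can you give a concrete example?
@EOL: I just realized I was confused by your comment "This answer does not split...". I thought "this" referred to your re.split answer, but I now realize you meant gimel's answer. I think THIS answer (the one I am commenting on) is the best answer :)
+1 for showing how to treat multiple consecutive delimiters as one. Thanks!
The irony here is that this answer is not getting the most votes... There are technically correct answers, and then there is what the original requester was looking for (what they mean rather than what they say). This is a great answer and I've copied it for when I need it. And yet, for me, the top-rated answer solves a problem that is very much like what the poster was working on, quickly, cleanly and with minimal code. If a single answer had posted both solutions, I would have voted for that. Which one is better depends on what you are actually trying to do (not on the "how-to" question being asked). :-)
Thanks for posting it, though. I want both in my growing repertoire.
@EOL I am trying to split on > or < or =, whichever comes first in the string. I am using filter(None, re.split("><", feature_name)) but my output is … Any suggestions on how to actually get the strings?
You must be using Python 3, where filter() constructs an iterator rather than a list. You can reproduce the Python 2 behavior by wrapping the expression in list().
Plays nicely with the pandas str.split method too. Lovely.
@EricLeBigot, what if a delimiter consists of a sequence of characters, e.g. "--" (2 dashes) or ":="?
…then you can simply list the strings to match, separated by a "pipe": "--|:=|[…]+".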
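
The "pipe" suggestion from the last comment can be generalised to an arbitrary list of delimiter strings. The helper below is a sketch only: split_on is a hypothetical name, the delimiter list is assumed to be non-empty, and re.escape() is used so that characters such as ? or - never need manual escaping.

```python
import re

def split_on(text, delimiters):
    """Split text on any of the given delimiter strings, dropping empty pieces."""
    # Try longer delimiters first so that "--" wins over "-".
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape() makes every delimiter a literal; the trailing + merges
    # runs of consecutive delimiters into a single split point.
    pattern = "(?:" + "|".join(re.escape(d) for d in ordered) + ")+"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on("a--b:=c - d", ["--", ":=", " ", "-"]))
print(split_on("Hey, you - what are you doing here!?", [",", " ", "-", "!", "?"]))
```

The non-capturing group (?:…) is deliberate: with a capturing group, re.split() would also return the delimiters themselves.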

Another way, without regex

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if o not in punc]).split()

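Pro tip: use string.translate for the fastest string operations Python has.

Some proof...

First, the slow way (sorry, pprzemek):

>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
...
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552

Next, we use re.findall() (as given by the suggested answer). MUCH faster:

>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094

Finally, we use translate:

>>> # Python 2 only; in Python 3 use str.maketrans and str.translate instead
>>> from string import translate,maketrans,punctuation
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934

Explanation:

string.translate is implemented in C and, unlike many string manipulation functions in Python, it does not build intermediate strings: the result is produced in a single pass over the input. So it is about as fast as you can get for character substitution.

It is a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces, a one-for-one substitution. Again, no intermediate data is produced. So this is fast!

Next, we use good old split(). By default, split() operates on all whitespace characters, grouping them together for the split. The result is the list of words that you want. In the timings above, this approach is more than 3 times faster than re.findall()!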
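
The timings above were taken on Python 2. A comparable measurement on Python 3 could be set up as sketched below; the function names are made up for illustration, no timing figures are claimed here, and results will depend on the machine and interpreter version.

```python
import re
import string
import timeit

S = "Hey, you - what are you doing here!?"

# Python 3 equivalent of the translation table: punctuation -> spaces.
TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))
WORD_RE = re.compile(r"\w+")

def via_translate(s):
    """Replace punctuation with spaces, then split on whitespace."""
    return s.translate(TABLE).split()

def via_findall(s):
    """Collect runs of word characters with a precompiled regex."""
    return WORD_RE.findall(s)

if __name__ == "__main__":
    for func in (via_translate, via_findall):
        seconds = timeit.timeit(lambda: func(S), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```

Note that the two functions are not equivalent for every input: string.punctuation contains the underscore, while \w treats it as a word character.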

A bit of a late answer :), but I had a similar dilemma and didn't want to use the 're' module.

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']

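join = lambda x: sum(x,[])  # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

fragments = [text]
for token in tokens:
    fragments = join(f.split(token) for f in fragments)

Explanation

This is what is known in Haskell as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example, in Haskell, say you map the Python range(n) -> [0,1,...,n-1] function over a list. If the result is a list, it will be appended to the list in place, so you would get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. This is known as map-append (or mappend, or maybe something like that). The idea here is that you have this operation you are applying (splitting on a token), and whenever you do that, you join the result into the list.

You can abstract this into a function and have tokens=string.punctuation by default.

Advantages of this approach:

This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
You are not restricted to mere tokens; you could have arbitrary logic in place of each token. For example, one of the "tokens" could be a function that splits according to how nested the parentheses are.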
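
The abstraction mentioned above (a function with tokens=string.punctuation as the default) might look like the following sketch. The name split_on_tokens is hypothetical; the flattening is written as a nested comprehension, which is the second join variant shown earlier.

```python
import string

def split_on_tokens(text, tokens=string.punctuation):
    """Split text on every token in turn; tokens may be multi-character strings."""
    fragments = [text]
    for token in tokens:
        # Split each fragment on the current token and flatten the result.
        fragments = [piece for f in fragments for piece in f.split(token)]
    return fragments

print(split_on_tokens("x.y!z"))
print(split_on_tokens("1FROM2INTO3", ["FROM", "INTO"]))
print(split_on_tokens("a;b,,c", [";", ","]))
```

As with plain str.split(sep), adjacent tokens produce empty strings in the result, and whitespace is not split on unless it is listed as a token.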

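First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add some ideas that I considered with that in mind.

My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting an explicit subset of punctuation characters versus whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

Option 1 - re.sub

I was surprised to see that no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

import re

my_str = "Hey, you - what are you doing here!?"

words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution, I nested the call to re.sub(...) inside re.split(...), but if performance is critical, compiling the regex outside could be beneficial. For my use case the difference wasn't significant, so I prefer simplicity and readability.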
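
If the patterns are to be compiled once outside the call, as suggested above, Option 1 could be restructured along these lines. This is a sketch; PUNCT_RE, SPACE_RE and split_words are illustrative names, and the guard for empty input is an addition not present in the one-liner.

```python
import re

# Compiled once at import time and reused on every call.
PUNCT_RE = re.compile(r"[,\-!?]")
SPACE_RE = re.compile(r"\s+")

def split_words(text):
    """Replace the listed punctuation with spaces, then split on whitespace."""
    cleaned = PUNCT_RE.sub(" ", text).strip()
    # Splitting an empty string with a regex yields [''], so handle it explicitly.
    return SPACE_RE.split(cleaned) if cleaned else []

print(split_words("Hey, you - what are you doing here!?"))
```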

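Option 2 - str.replace

This is a few more lines, but it has the benefit of being expandable without having to check whether a certain character needs to be escaped in a regex.

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
for r in replacements:
    my_str = my_str.replace(r, ' ')

words = my_str.split()

It would have been nice to be able to map str.replace onto the string instead, but I don't think that can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: see the next option for a functional example.)

Option 3 - functools.reduce

(In Python 2, reduce is available in the global namespace without importing it from functools.)

import functools

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()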

Try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print(matches)

This will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here'].

Use replace twice:

a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

Result:

['11223', '33344', '33222', '3344']

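I like re, but here is my solution without it:

from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method used by the 'in' operator. Basically it is the same as

lambda ch: ch in sep

but is more convenient here.

groupby takes our string and a function. It splits the string into groups using that function: whenever the value of the function changes, a new group is generated. So sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. Using 'if not k' we filter out the groups of separators (because the result of sep.__contains__ is True on separators). Well, that's all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn't create intermediate strings/lists (you can remove the join and the expression will become lazy, since each group is an iterator).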
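
To illustrate the "any condition" and laziness points, the same idea can be wrapped in a generator that takes the separator test as a parameter. This is a sketch; iter_words and its is_separator parameter are hypothetical names chosen for the example.

```python
from itertools import groupby

def iter_words(text, is_separator):
    """Lazily yield maximal runs of characters for which is_separator is false."""
    for is_sep, group in groupby(text, key=is_separator):
        if not is_sep:
            yield "".join(group)

# Same separators as in the answer above.
print(list(iter_words("Hey, you - what are you doing here!?", " ,-!?".__contains__)))

# Any predicate works, for example splitting on digits.
print(list(iter_words("abc123def45", str.isdigit)))
```

Words are produced one at a time, so a caller that only needs the first few words does not pay for scanning the rest of the text.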

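Instead of using the re module function re.split, you can achieve the same result using the Series.str.split method of pandas.

First, create a Series with the above string and then apply the method to the Series.

import pandas as pd

thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')

The pat parameter takes the delimiters, and the split string is returned as a list. Here the two delimiters are passed using | (the regex "or" operator). The output is as follows: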

[Hey, you , what are you doing here!?]

I'm getting familiar with Python and needed the same thing. The findall solution may be better, but I came up with this:

tokens = [x.strip() for x in data.split(',')]

In Python 3, you can use the method from PY4E - Python for Everybody.

We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.

You can see what "punctuation" contains:

In [10]: import string

In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

For example:

In [12]: your_str = "Hey, you - what are you doing here!?"

In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))

In [14]: line = line.lower()

In [15]: words = line.split()

In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
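
One thing to keep in mind with the deletestr form is that deleting punctuation joins the parts of hyphenated words and contractions, whereas mapping punctuation to spaces (the fromstr/tostr form quoted above) splits them. The sketch below contrasts the two on a made-up sentence; the variable names are illustrative only.

```python
import string

text = "A well-known trick, isn't it?"

# Third argument of str.maketrans: delete every punctuation character.
deleted = text.translate(str.maketrans("", "", string.punctuation)).lower().split()

# First two arguments: replace every punctuation character with a space.
spaced = text.translate(
    str.maketrans(string.punctuation, " " * len(string.punctuation))
).lower().split()

print(deleted)
print(spaced)
```

Which of the two is preferable depends on whether "well-known" should count as one word or two in your data.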

For more information, you can refer to:

PY4E - Python for Everybody
str.translate
str.maketrans
Python String maketrans() Method

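Using maketrans and translate, you can do it easily and neatly:

specials = ',.!?:;"()<>[]#$=-/'
# Python 3 spelling; in Python 2 this was string.maketrans (after import string)
trans = str.maketrans(specials, ' ' * len(specials))
body = "Hey, you - what are you doing here!?"  # example input
body = body.translate(trans)
words = body.strip().split()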

First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resulting strings.

I come across this pretty frequently, and my usual solution doesn't require re.

One-liner lambda function with a list comprehension:

(requires import string):

split_without_punc = lambda text: [word.strip(string.punctuation) for word in
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General function without lambda or list comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.
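
A generalized lambda of the kind mentioned in the last sentence might look like this. It is only a sketch; the name split_without_chars and the choice of string.punctuation as the default are assumptions made for the example.

```python
import string

# ignore defaults to all ASCII punctuation but can be any string of characters.
split_without_chars = lambda text, ignore=string.punctuation: [
    word.strip(ignore) for word in text.split() if word.strip(ignore)
]

print(split_without_chars("Hey, you -- what are you doing?!"))
print(split_without_chars("(a) [b] c.", "()[]"))
```

Since str.strip() only removes characters from the two ends of each word, punctuation inside a word (as in "don't") is left alone.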

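First of all, precompile the pattern with re.compile() before performing any regex operation in a loop, because reusing a compiled pattern is faster than passing the pattern string on every call.

So for your problem, first compile the pattern and then perform the operation on it.

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print(reg_tok.findall(DATA))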

Here is the answer with some explanation.

st = "Hey, you - what are you doing here!?"

# replace every non-alphanumeric character with a space and then join.
new_string = ''.join([' ' if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you   what are you doing here  '

# str.split() drops all the empty strings if no separator is provided
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non-alphanumeric character
' '.join(new_list)
# output
'Hey you what are you doing here'

Or in one line, we can do it like this:

(''.join([' ' if not x.isalnum() else x for x in st])).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Updated answer

Create a function that takes two strings as input (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else:
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output

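Another way to achieve this is to use the Natural Language Toolkit (NLTK).

import nltk
data = "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here'].

The biggest drawback of this method is that you need to install the NLTK package.

The benefit is that you can do a lot of fun stuff with the rest of the NLTK package once you have your tokens.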

def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            # a delimiter ends the current word, if there is one
            if w != '':
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '':
        l.append(w)
    return l

Here is the usage:

>>> s = "Hey, you - what are you doing here!?"
>>> print(get_words(s))
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

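I like the replace() way the best. The following procedure changes all separators defined in the string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist happens to be an empty string. It returns a list of words, with no empty strings in it.

def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    # list() is needed in Python 3, where filter() returns an iterator
    return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]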

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

filter(None, re.compile(r'[ \|,\|\-\|!\|?]').split("Hey, you - what are you doing here!?"))

Here is my go at a split with multiple delimiters:

def msplit(text, delims):
    w = ''
    for z in text:
        if z not in delims:
            w += z
        else:
            # a delimiter ends the current word, if there is one
            if len(w) > 0:
                yield w
            w = ''
    if len(w) > 0:
        yield w

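I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful.

str1 = 'adj:sg:nom:m1.m2.m3:pos'
splitat = ':.'
''.join([s if s not in splitat else ' ' for s in str1]).split()

If you do not want to split at spaces, put some other character in place of the space and split on that same character.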

If you want a reversible operation (one that preserves the delimiters), you can use this function:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for ind, token in enumerate(listOfTokens):
            # re-insert the delimiter in front of every piece except the first
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            listOfTokens = [item for sublist in ll for item in sublist]  # flattens.
            listOfTokens = filter(None, listOfTokens)  # Removes empty tokens: ''
            newListOfTokens.extend(listOfTokens)

        listOfTokens = newListOfTokens

    return listOfTokens
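
A shorter route to the same kind of reversible tokenization is the capturing-group behaviour of re.split() quoted in the documentation excerpt near the top of this thread. The sketch below assumes the same set of single-character delimiters as the function above; DELIMS_RE and tokenize_reversible are illustrative names.

```python
import re

# The parentheses make re.split() return the delimiters as well.
DELIMS_RE = re.compile(r"([. ,*;!])")

def tokenize_reversible(sentence):
    """Split on the delimiters but keep them, dropping only empty strings."""
    return [tok for tok in DELIMS_RE.split(sentence) if tok]

tokens = tokenize_reversible("Hey, you!")
print(tokens)
print("".join(tokens) == "Hey, you!")
```

Because nothing but empty strings is discarded, joining the tokens reproduces the original sentence exactly.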

Here is my take on it...

def split_string(source, splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            # a delimiter ends the current word, if there is one
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

You want the findall() method of Python's regex module:

http://www.regular-expressions.info/python.html

Example
