Split a string into its sentences using Python
I have the following string:
```python
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
```
Now I want to split it into two sentences.
However, when I do:
```python
string.split('.')
```
I get:
```python
['This is one sentence ${w_{1},', '', ',w_{i}}$', ' This is another sentence', ' ']
```
Does anyone know how to improve this so that it does not split on the periods inside the `$...$`?
Also, how would you handle this one:
```python
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
```
Edit 1:
The expected output is:
For string 1:
```python
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
```
For string 2:
```python
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']
```
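You can use `re.findall` with an alternation pattern. The `\$.*?\$` alternative consumes a whole `$...$` span, so the periods inside it are not treated as sentence terminators; the lookahead and lookbehind make each match start and end on a non-whitespace character:
```python
import re

re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
```
For the first string, this returns:
```python
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
```
And for the second string:
```python
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
```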
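To make the pieces of that kind of pattern easier to read, here is a minimal sketch that lays it out with `re.VERBOSE`, assuming both dollar signs of the `$...$` span are escaped. The names `SENTENCE_RE` and `split_sentences` are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical name: a commented layout of a findall-style sentence pattern.
SENTENCE_RE = re.compile(r"""
    (                   # capture the sentence text
      (?=[^.!?\s])      # it must start on a non-space, non-terminator character
      (?: \$.*?\$       # consume a whole dollar-delimited span, dots included
        | [^.!?]        # or any single character that is not a terminator
      )*
      (?<=[^.!?\s])     # it must end on a non-space, non-terminator character
    )
    \s*[.!?]            # optional whitespace, then the terminator itself
""", re.VERBOSE)


def split_sentences(text):
    """Return the sentences in text without their terminating punctuation."""
    return SENTENCE_RE.findall(text)


string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
print(split_sentences(string2))
```

Note that the trailing `'Maybe ! '` comes back as `'Maybe'`, because the capture group is forced to end on a non-whitespace character and the terminator sits outside the group.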
For a more general case, you can use `re.split`:
```python
import re

mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
```
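The characters in the brackets are the ones you choose as sentence-ending punctuation, and the `\s{1,}` at the end requires at least one whitespace character after the punctuation, so periods with no space after them (such as the ones inside the `$...$`) are ignored.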
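Splitting on punctuation followed by whitespace leaves an empty string at the end whenever the input ends with a terminator and a space. A small sketch of one way to drop it, assuming you also want surrounding whitespace trimmed (the function name `split_on_terminators` is hypothetical):

```python
import re


def split_on_terminators(text):
    """Split on ., ! or ? followed by whitespace, dropping empty pieces."""
    parts = re.split(r"[.!?]\s+", text)
    # Trim each piece and keep only the ones that still contain text.
    return [p.strip() for p in parts if p.strip()]


str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
print(split_on_terminators(str2))
```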
Here is a (somewhat hacky) way to get the punctuation back:
```python
punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '! ']

# Re-attach each separator to the piece that precedes it.
sent = [x + y for x, y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
```
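If keeping the punctuation is the goal, another option is to split only on the whitespace and use a lookbehind to require a terminator just before it. This is a sketch of a different technique from the zip approach, not a rewrite of it; `split_keep_punct` is a hypothetical name, and stripping the input first is an assumption made to avoid a trailing empty piece.

```python
import re


def split_keep_punct(text):
    """Split on whitespace that directly follows ., ! or ?, keeping the punctuation."""
    # The lookbehind is zero-width, so only the whitespace itself is removed.
    return re.split(r"(?<=[.!?])\s+", text.strip())


str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
print(split_keep_punct(str2))
```

Unlike the zip version, the pieces here do not carry the trailing space, and the space inside `'Maybe !'` is not a split point because it is not preceded by a terminator.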
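Split on `'. '` (a period followed by a space), since in this string that sequence appears only at the end of a sentence, not in the middle of one.
```python
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
```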
This returns:
```python
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
```
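One caveat that is easy to check: `'. '` only covers periods, so the second string from the question, which also ends sentences with `!` and `?`, is split in just one place. A quick sketch (the variable name `parts` is illustrative):

```python
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '

# Only the single '. ' in the string acts as a separator.
parts = string2.split('. ')
print(len(parts))
print(parts)
```

This yields two pieces rather than four, which is why the regex-based answers use a character class such as `[.!?]` instead.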