Split a string into its sentences using Python

I have the following string:

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

Now I want to split it into two sentences.

However, when I do this:

string.split('.')

I get:

['This is one sentence  ${w_{1},',
 '',
 ',w_{i}}$',
 ' This is another sentence',
 ' ']

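Does anyone know how to improve this so that a "." inside $ $ is not treated as a sentence boundary?

Also, how would you handle this string:

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '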

Edit 1:

The expected output is:

For string 1:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence']

For string 2:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe !  ']


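You can use re.findall with an alternation pattern. To make sure each sentence starts and ends with a non-whitespace character, use a positive lookahead at the start and a positive lookbehind at the end. The first alternative, \$.*?\$, consumes a whole $...$ span, so punctuation inside it is never treated as the end of a sentence:

re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)

For the first string, this returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']

For the second string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']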
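To check the pattern against both strings from the question, here is a minimal sketch. It assumes the opening dollar sign is escaped as \$ (an unescaped $ is an end-of-string anchor in Python's re and would not match a literal dollar sign). The names SENTENCE_RE and split_sentences are made up for illustration.

```python
import re

# Hypothetical names; the pattern is the findall pattern with both dollar signs escaped.
SENTENCE_RE = re.compile(
    r'((?=[^.!?\s])'            # sentence must start with a non-space, non-punctuation char
    r'(?:\$.*?\$|[^.!?])*'      # either a whole $...$ span, or any non-terminator char
    r'(?<=[^.!?\s]))'           # sentence must end with a non-space, non-punctuation char
    r'\s*[.!?]'                 # optional whitespace, then the terminator (not captured)
)

def split_sentences(text):
    """Return the sentences of text without their terminating punctuation."""
    return SENTENCE_RE.findall(text)

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '
string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

print(split_sentences(string))
print(split_sentences(string2))
```

Note that the last item for string2 comes out as 'Maybe', not 'Maybe !  ', because the lookbehind trims whitespace before the terminator.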


For the more general case, you can use re.split, like this:

import re

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

# Raw string so that \s reaches the regex engine unchanged
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']

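The characters in the brackets are the ones you choose to treat as sentence-ending punctuation. The \s{1,} at the end requires at least one whitespace character after the punctuation, so any "." that is not followed by whitespace (such as those inside $ $) is ignored. This also handles your exclamation mark case.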
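The trailing '' in both results appears because the string ends with a delimiter. A minimal sketch for dropping empty pieces and surrounding whitespace follows; split_on_punct is a hypothetical helper name, not part of the re module.

```python
import re

def split_on_punct(text):
    """Split on [.!?] followed by whitespace, dropping empty or blank pieces."""
    parts = re.split(r"[.!?]\s+", text)
    # strip() only trims the ends, so spacing inside a sentence is preserved
    return [p.strip() for p in parts if p.strip()]

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '
str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

print(split_on_punct(mystr))
print(split_on_punct(str2))
```

Keep in mind that this approach does not look at the $ $ delimiters at all: a "." followed by a space inside a math span would still cause a split.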

Here is a (somewhat hacky) way to get the punctuation back.

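# Collect the delimiters that re.split would throw away
punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '!  ']

# zip stops at the shorter list, which also drops the trailing '' from re.split
sent = [x + y for x, y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence  ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe !  ']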
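As a possible alternative to re-attaching the punctuation with zip, you could split on the whitespace only and use a lookbehind to require punctuation just before it. This is a different technique from the one above, shown as a sketch; split_keep_punct is a hypothetical name.

```python
import re

def split_keep_punct(text):
    """Split at whitespace that follows [.!?], keeping the punctuation on each sentence."""
    # The lookbehind checks for punctuation without consuming it
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]  # drop the empty piece left by trailing whitespace

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

print(split_keep_punct(str2))
# ['This is one sentence  ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```

Unlike the zip version, the whitespace after each punctuation mark is consumed by the split, so the sentences come back without trailing spaces.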


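Split on '. ' (a period followed by a space), because in this string that only occurs at the end of a sentence, not in the middle of one.

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

string.split('. ')

This returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']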
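One limitation worth checking: splitting on '. ' only knows about periods, so the second string from the question is not fully separated. A quick sketch:

```python
string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

# Only the single '. ' after "another sentence" is a split point;
# '! ' and '? ' are left alone.
parts = string2.split('. ')
print(parts)
# ['This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence', 'Is this a sentence? Maybe !  ']
```

So this works for the first string, but for sentences ending in "!" or "?" one of the regex-based approaches is probably needed.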