.NET: Regex to ignore a trailing dot if there is one

I have the following regex (it loosely matches things that look like URLs):

(https?://\S*)

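However, this is meant to extract URLs from sentences, so a trailing dot may be the end of the sentence rather than a legitimate part of the URL.

What is the magic incantation that makes the capture group ignore a trailing full stop, comma, colon, semicolon, and so on?

(I know that matching URLs is a nightmare; this only needs to match them loosely, hence the very simple regex.)

Here is my test string: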

lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.

This should match every instance of example.com.

(I am testing it with Expresso and .NET.)

Test result with a trailing dot and a new line:

  Expected string length 62 but was 64. Strings differ at index 31.
  Expected: "<a href="http://www.example.com">http://www.example.com</a>.

"
  But was:  "<a href="http://www.example.com.
">http://www.example.com.
</a>
"
  ------------------------------------------^

Sample code

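using System.Text.RegularExpressions;

public class HyperlinkParser
{
    // Captures an http(s) URL; [^\.] is meant to keep a trailing dot out of the capture.
    private readonly Regex _regex =
        new Regex(
            @"(https?://\S*[^\.])");

    // Wraps every captured URL in an anchor tag.
    public string Parse(string original)
    {
        var parsed = _regex.Replace(original, "<a href=\"$1\">$1</a>");
        return parsed;
    }
}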

Example tests

using NUnit.Framework;

[TestFixture]
public class HyperlinkParserTests
{
    private readonly HyperlinkParser _parser = new HyperlinkParser();
    private const string NO_HYPERLINKS = "dummy-text";
    private const string FULL_URL = "http://www.example.com";
    private const string FULL_URL_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>";
    private const string FULL_URL_TRAILING_DOT = FULL_URL + ".";
    // The trailing dot must stay outside the anchor.
    private const string FULL_URL_TRAILING_DOT_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>.";
    private const string TRAILING_DOT_AND_NEW_LINE = FULL_URL_TRAILING_DOT + "\r\n";
    private const string TRAILING_DOT_AND_NEW_LINE_PARSED = FULL_URL_TRAILING_DOT_PARSED + "\r\n";

    private const string COMPLEX_TEXT = "Leading stuff http://www.example.com.  Other stuff.";
    private const string COMPLEX_TEXT_PARSED = "Leading stuff <a href=\"http://www.example.com\">http://www.example.com</a>.  Other stuff.";

    [TestCase(NO_HYPERLINKS, NO_HYPERLINKS)]
    [TestCase(FULL_URL, FULL_URL_PARSED)]
    [TestCase(FULL_URL_TRAILING_DOT, FULL_URL_TRAILING_DOT_PARSED)]
    [TestCase(TRAILING_DOT_AND_NEW_LINE, TRAILING_DOT_AND_NEW_LINE_PARSED)]
    [TestCase(COMPLEX_TEXT, COMPLEX_TEXT_PARSED)]
    public void Parsing(string original, string expected)
    {
        var actual = _parser.Parse(original);

        Assert.That(actual, Is.EqualTo(expected));
    }
}


Try this; it disallows a dot as the last character:

(https?://\S*[^.])

For example, under Cygwin, with egrep:

$ cat ~/tmp.txt
lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.
$ cat ~/tmp.txt | egrep -o 'https?://\S*[^.]'
http://www.example.com
https://example.com
http://www.example.com
https://example.com

(The -o option tells egrep to print only the matches.)
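
One caveat when carrying this over to .NET: `[^.]` accepts any character that is not a dot, including a space or `\r`, so when text follows the dot the match can swallow the dot plus one whitespace character. That appears to be what the failing test in the question shows. The sketch below compares the pattern above with a stricter variant whose final character may be neither whitespace nor one of `. , : ;`. The class name `TrailingPunctuationDemo`, the helper `Extract`, and the stricter pattern itself are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TrailingPunctuationDemo
{
    // Pattern from the answer: the last character may be anything except a dot,
    // which still allows whitespace such as ' ' or '\r'.
    public static readonly Regex NotDot = new Regex(@"https?://\S*[^.]");

    // Hypothetical stricter variant: the last character must be neither
    // whitespace nor one of the punctuation marks . , : ;
    public static readonly Regex NotPunctuation = new Regex(@"https?://\S*[^\s.,:;]");

    // Returns the text of every match of the given regex in the input.
    public static string[] Extract(Regex regex, string text) =>
        regex.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        const string text = "Leading stuff http://www.example.com.  Other stuff.";

        // Brackets make trailing whitespace in the match visible.
        Console.WriteLine("[" + Extract(NotDot, text)[0] + "]");         // [http://www.example.com. ]
        Console.WriteLine("[" + Extract(NotPunctuation, text)[0] + "]"); // [http://www.example.com]
    }
}
```

Wrapped in a capture group and dropped into the question's `HyperlinkParser`, the stricter pattern would be expected to leave the trailing dot and the line break outside the link. The trade-off is that a URL that legitimately ends in one of those characters gets truncated, which seems acceptable for the loose matching the question asks for.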