Regex to ignore trailing dot if there is one
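I have the following regex (which roughly matches URL-like things):

    (https?://\S*)

However, this is for extracting URLs from sentences, so a trailing dot may be the end of the sentence rather than a legitimate part of the URL.
What is the magic incantation to make the capture group ignore trailing periods, commas, colons, semicolons, and so on?
(I know matching URLs is a nightmare; this only needs to match them loosely, hence the very simple regex.)
Here is my test string:

    lorem http://www.example.com lorem https://example.com lorem http://www.example.com. lorem https://example.com.

This should match all of the example.com instances.
(I am testing it with Expresso and .NET.)
Test result with a trailing dot and a new line: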

    Expected string length 62 but was 64. Strings differ at index 31.
    Expected: "<a href="http://www.example.com">http://www.example.com</a>.\r\n"
    But was:  "<a href="http://www.example.com.\r">http://www.example.com.\r</a>\n"
    ------------------------------------------^

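Sample code

    using System.Text.RegularExpressions;

    public class HyperlinkParser
    {
        // Note: [^\.] accepts any character except a dot, including whitespace.
        private readonly Regex _regex =
            new Regex(
                @"(https?://\S*[^\.])");

        public string Parse(string original)
        {
            // Wrap each captured URL in an anchor tag.
            var parsed = _regex.Replace(original, "<a href=\"$1\">$1</a>");
            return parsed;
        }
    }
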
Sample tests

    using NUnit.Framework;

    [TestFixture]
    public class HyperlinkParserTests
    {
        private readonly HyperlinkParser _parser = new HyperlinkParser();

        private const string NO_HYPERLINKS = "dummy-text";
        private const string FULL_URL = "http://www.example.com";
        private const string FULL_URL_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>";
        private const string FULL_URL_TRAILING_DOT = FULL_URL + ".";
        private const string FULL_URL_TRAILING_DOT_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>.";
        private const string TRAILING_DOT_AND_NEW_LINE = FULL_URL_TRAILING_DOT + "\r\n";
        private const string TRAILING_DOT_AND_NEW_LINE_PARSED = FULL_URL_TRAILING_DOT_PARSED + "\r\n";
        private const string COMPLEX_TEXT = "Leading stuff http://www.example.com. Other stuff.";
        private const string COMPLEX_TEXT_PARSED = "Leading stuff <a href=\"http://www.example.com\">http://www.example.com</a>. Other stuff.";

        [TestCase(NO_HYPERLINKS, NO_HYPERLINKS)]
        [TestCase(FULL_URL, FULL_URL_PARSED)]
        [TestCase(FULL_URL_TRAILING_DOT, FULL_URL_TRAILING_DOT_PARSED)]
        [TestCase(TRAILING_DOT_AND_NEW_LINE, TRAILING_DOT_AND_NEW_LINE_PARSED)]
        [TestCase(COMPLEX_TEXT, COMPLEX_TEXT_PARSED)]
        public void Parsing(string original, string expected)
        {
            var actual = _parser.Parse(original);
            Assert.That(actual, Is.EqualTo(expected));
        }
    }

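One way to see why the trailing-dot-and-new-line case fails: `[^\.]` means "any character other than a dot", and that includes whitespace. The sketch below probes the pattern from the sample code on its own. It assumes the line break in the failing input is `\r\n`; the `Probe` class and its `Capture` method are made-up names for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Probe
{
    // Same pattern as in the sample code above.
    private static readonly Regex Url = new Regex(@"(https?://\S*[^\.])");

    // Returns the text of the first match (empty string if there is none).
    public static string Capture(string input)
    {
        return Url.Match(input).Value;
    }

    public static void Main()
    {
        // Assumption: the input ends with a Windows line break.
        var captured = Capture("http://www.example.com.\r\n");

        // \S* greedily consumes "www.example.com." and [^\.] then accepts '\r',
        // so the dot and the carriage return both end up inside the match.
        Console.WriteLine(captured.Length);                                   // 24
        Console.WriteLine(captured.EndsWith("\r", StringComparison.Ordinal)); // True
    }
}
```

With the URL itself being 22 characters long, the two extra characters in the match are the dot and the carriage return.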
Try this; it disallows a dot as the last character:

    (https?://\S*[^.])

For example, under Cygwin, with egrep:

    $ cat ~/tmp.txt
    lorem http://www.example.com lorem https://example.com lorem http://www.example.com. lorem https://example.com.

    $ cat ~/tmp.txt | egrep -o 'https?://\S*[^.]'
    http://www.example.com
    https://example.com
    http://www.example.com
    https://example.com

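The same idea can be widened to the other punctuation mentioned in the question, and to whitespace, so that a line break after the dot is not pulled into the match. What follows is a minimal sketch for .NET rather than a definitive URL matcher. The class name `TrailingPunctuationDemo`, the `Extract` helper, and the exact set of excluded characters (`.`, `,`, `;`, `:`) are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TrailingPunctuationDemo
{
    // The last matched character must be neither whitespace nor one of . , ; :
    private static readonly Regex Url = new Regex(@"https?://\S*[^\s.,;:]");

    // Returns every loosely matched URL in the text, in order of appearance.
    public static string[] Extract(string text)
    {
        return Url.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        var text = "lorem http://www.example.com lorem https://example.com, " +
                   "lorem http://www.example.com.\r\nlorem https://example.com.";

        // Prints the four example.com URLs without trailing punctuation.
        foreach (var url in Extract(text))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because `\S*` is greedy and the final class forces backtracking, a run of several trailing punctuation marks is dropped as a whole. The trade-off is that a URL which legitimately ends in one of the excluded characters loses that character.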
(