How to match and highlight all terms in any order from an array of strings?
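The requirements are:
- Find strings from an array (from here on called options) that include ALL the terms in ANY order
- Correctly highlight the matching terms, i.e. insert a string before and after each matching term. I'm using <u> and </u> here
- Both the search query and the options can be "anything"
For the sake of simplicity, answers can concentrate on case-insensitive search through a list containing only ASCII characters, and assume that the term delimiter is a plain space, i.e. a query entered as "foo bar baz" means the search terms are ['foo', 'bar', 'baz'].
To clarify:
- All terms must exist separately from each other in matched options, i.e. shorter terms should not exist only as substrings of longer terms, and no two terms should overlap.
- Repeated search terms must exist in the option at least as many times as in the query.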
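To make the requirements concrete, here is a minimal reference sketch. The function name `highlightAllTerms` is hypothetical and not part of any answer in this thread; it assumes a plain-space delimiter and ASCII input, and favors clarity over speed.

```javascript
// Hypothetical reference helper (an illustration, not one of the posted algorithms).
// Returns the option with every term wrapped in <u>…</u>, or null if it does not match.
function highlightAllTerms(query, option) {
  // Split on plain spaces, drop empty strings, lowercase for case-insensitive search
  const terms = query.toLowerCase().split(' ').filter(Boolean);
  const haystack = option.toLowerCase();

  // Try to place terms[k..] without touching any range already in `used`
  function place(k, used) {
    if (k === terms.length) return used;
    const term = terms[k];
    let from = 0;
    let start;
    while ((start = haystack.indexOf(term, from)) !== -1) {
      const end = start + term.length;
      const free = used.every(r => end <= r.start || start >= r.end);
      if (free) {
        const result = place(k + 1, used.concat({ start, end }));
        if (result) return result;
      }
      from = start + 1; // also consider overlapping occurrences
    }
    return null; // no position works: backtrack
  }

  const ranges = place(0, []);
  if (!ranges) return null;

  // Insert the markup from the end so that earlier offsets stay valid
  let out = option;
  ranges
    .slice()
    .sort((a, b) => b.start - a.start)
    .forEach(r => {
      out = out.slice(0, r.start) + '<u>' + out.slice(r.start, r.end) + '</u>' + out.slice(r.end);
    });
  return out;
}

console.log(highlightAllTerms('foo bar', 'The food bar')); // The <u>foo</u>d <u>bar</u>
console.log(highlightAllTerms('foo foo', 'The food bar')); // null (only one "foo")
console.log(highlightAllTerms('ab ba', 'aba'));            // null (terms would overlap)
```

The second and third calls correspond to the two clarifications: repeated terms need repeated occurrences, and no two terms may share characters.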
The final application is (perhaps not surprisingly) an autocomplete of sorts.
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)
jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlight
At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer", it does outperform the others consistently in any realistic scenario I try, though.
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
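Viral Permutations
In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms, separated by a capturing group that catches anything between terms. A search for "foo bar" results in (foo)(.*?)(bar)|(bar)(.*?)(foo).
The highlighting is then done in a single pass with String.prototype.replace(); if the replacement changed the option, we have a match.
Demo: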
    var options = [
        'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa',
        'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
        'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
        'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba',
        'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory',
        'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
        'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island',
        'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The',
        'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
        'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
        'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji',
        'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia',
        'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
        'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands',
        'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia',
        'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan',
        'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of',
        'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon',
        'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
        'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
        'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico',
        'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat',
        'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia',
        'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands',
        'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea',
        'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion',
        'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha',
        'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon',
        'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
        'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia',
        'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands',
        'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden',
        'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan',
        'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago',
        'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
        'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay',
        'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British',
        'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'
    ];
    var terms, terms_esc;

    function viral_permutations() {
        var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
        meta_elm = document.getElementById('viral_permutations_meta');
        res_elm = document.getElementById('viral_permutations_result');
        res_elm.innerHTML = meta_elm.innerHTML = '';
        t0 = performance.now();

        //Split query in terms at delimiter and lowercase them
        terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
            return n;
        }).map(function(term) {
            return term.toLowerCase();
        });
        meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms);

        //Escape terms
        terms_esc = terms.map(function(term) {
            return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
        });

        //Wrap the escaped terms in individual capturing groups
        //(mapping over terms_esc, so the escaping above is not thrown away)
        terms_esc = terms_esc.map(function(term) {
            return '(' + term + ')';
        });

        //Find all permutations
        permuted = permutate_array(terms_esc);

        //Construct a group for each permutation
        match_groups = [];
        for (i = 0; i < permuted.length; i++) {
            match_groups.push(permuted[i].join('(.*?)'));
        }

        try {
            //Construct the final regex
            regex_string = match_groups.join('|');

            //Display it
            document.getElementById('viral_permutations_regex').innerHTML = regex_string;
            meta_elm.innerHTML += 'RegExp length: ' + regex_string.length;
            regex = new RegExp(regex_string, 'i');

            //Loop through each option
            for (i = 0; i < options.length; i++) {
                option = options[i];

                //Replace the terms with highlighted terms
                highlighted = option.replace(regex, viral_permutations_replace);

                //If anything was changed (or the query is empty) we have a match
                if (highlighted !== option || terms.length === 0) {
                    //Append it to the result list
                    li = document.createElement('li');
                    li.innerHTML = highlighted;
                    res_elm.appendChild(li);
                }
            }

            //Print some meta
            t1 = performance.now();
            meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
        } catch (e) {
            meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
        }
    }

    //The replacement function
    function viral_permutations_replace() {
        var i, m, j, retval, m_cased, unmatched;
        retval = '';

        //Make a clone of the terms array (that we can modify without destroying the original)
        unmatched = terms.slice(0);

        //Loop arguments between the first (which is the full match) and
        //the last 2 (which are the offset and the full option)
        for (i = 1; i < arguments.length - 1; i++) {
            m = arguments[i];

            //Check that we have a string - most of the arguments will be undefined
            if (typeof m !== 'string') continue;

            //Lowercase the match
            m_cased = m.toLowerCase();

            //Append it to the return value - highlighted if it is among our terms
            j = unmatched.indexOf(m_cased);
            if (j >= 0) {
                //Remove it from the unmatched terms array
                unmatched.splice(j, 1);
                retval += '<u>' + m + '</u>';
            } else {
                retval += m;
            }
        }
        return retval;
    }

    //Helper function to return all possible permutations of an array
    function permutate_array(arr) {
        var perm, perm_intern;
        perm_intern = function(perm, pre, post, n) {
            var elem, i, rest;
            if (n > 0) {
                for (i = 0; i < post.length; i++) {
                    rest = post.slice(0);
                    elem = rest.splice(i, 1);
                    perm_intern(perm, pre.concat(elem), rest, n - 1);
                }
            } else {
                perm.push(pre);
            }
        };
        perm = [];
        perm_intern(perm, [], arr, arr.length);
        return perm;
    }

    viral_permutations();
    <input type="text" id="viral_permutations" onkeyup="viral_permutations()">
    <p id="viral_permutations_meta"></p>
    <pre id="viral_permutations_regex"></pre>
    <ul id="viral_permutations_result"></ul>
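Thanks to trincot for pointing out that my original version would occasionally highlight a recurring term twice, which is fixed in this snippet.
Fails because:
- The regex grows factorially as terms are added (n terms produce n! alternatives). At 7 terms (even single letters) it is past 250kb and my browser gives up with
Error: regexp too big …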
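The size claim can be checked with a little arithmetic. The sketch below assumes one-letter terms and the construction from the demo: each alternative is n groups such as `(a)` joined by `(.*?)`, and the alternatives are joined by `|`. The helper names are made up for this illustration.

```javascript
// n! grows very quickly: 7! is already 5040
function factorial(n) {
  let result = 1;
  for (let i = 2; i <= n; i++) result *= i;
  return result;
}

// Length of the permutation regex for n one-letter terms
function permutationRegexLength(n) {
  if (n === 0) return 0;
  const groupLength = 3; // "(a)"
  const gapLength = 5;   // "(.*?)"
  const perAlternative = n * groupLength + (n - 1) * gapLength;
  const alternatives = factorial(n);
  // The alternatives are joined with "|"
  return alternatives * perAlternative + (alternatives - 1);
}

for (let n = 1; n <= 7; n++) {
  console.log(n + ' terms: ' + factorial(n) + ' alternatives, ' + permutationRegexLength(n) + ' characters');
}
```

For 7 terms this gives 5040 alternatives of 51 characters each, 262079 characters in total, which agrees with the "past 250kb" observation.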
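A few other noteworthy strategies that didn't work:
Capturing group with all terms in each group:
    (foo|bar)(.*)(foo|bar)
Fails because:
- It will match options that contain repeated terms, e.g. The food in the food court would match though it obviously shouldn't.
- If we "double check" that all terms were in fact found, it will fail to find valid matches, e.g. The food in the food bar would find foo twice and never get to bar.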
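Both failure modes can be reproduced in a few lines. The first check uses the pattern exactly as given. The second uses a lazy gap `(.*?)`, which is an assumption on my part (it is the form used in the permutation demo), because that is the variant that picks up `foo` twice.

```javascript
const greedy = /(foo|bar)(.*)(foo|bar)/i;
const lazy = /(foo|bar)(.*?)(foo|bar)/i;

// False positive: "bar" never occurs, yet the pattern matches on two "foo"s
console.log(greedy.test('The food in the food court')); // true

// With a lazy gap the second group stops at the nearest term, which is "foo" again,
// so a follow-up "were all terms found?" check would reject this valid option
const m = lazy.exec('The food in the food bar');
console.log(m[1], m[3]); // foo foo
```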
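Negative lookaheads and backreferences:
    (foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))
Fails because:
- When terms are repeated in the query it reaches impossible conditions like "find me a foo, bar or bar which is neither foo nor bar".
- I am fairly certain it has other issues, but I stopped pursuing it when I realized it was logically flawed.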
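The impossible condition is easy to demonstrate. The second pattern below is the same scheme written out for the query "foo bar bar"; that expansion is my own, following the pattern shown above.

```javascript
// Distinct terms: works as intended
const distinct = /(foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))/i;
console.log(distinct.test('baz foo bar')); // true

// Repeated term: the last group must be foo or bar, but may be neither
// the first capture (foo or bar) nor the third (the other one)
const repeated = /(foo|bar|bar)(.*?)((?!\1)(?:foo|bar|bar))(.*?)((?!\1|\3)(?:foo|bar|bar))/i;
console.log(repeated.test('foo bar bar')); // false, although the option is a valid match
```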
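Positive lookaheads:
    (?=.*foo)(?=.*bar)(?=.*baz)
Fails because:
- It is very difficult (if not impossible) to reliably highlight the found matches.
- I have not found any way to ensure that all terms actually exist separately, i.e. they may all be present individually in the option, but shorter terms may exist only as substrings of longer terms, or terms may overlap.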
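A short sketch of the substring and overlap problems with lookaheads. The helper `lookaheadRegex` is hypothetical and assumes the terms contain no regex metacharacters.

```javascript
// Build (?=.*term1)(?=.*term2)… for a list of plain terms
function lookaheadRegex(terms) {
  return new RegExp(terms.map(t => '(?=.*' + t + ')').join(''), 'i');
}

// "foo" only exists inside "food", yet both lookaheads succeed
console.log(lookaheadRegex(['foo', 'food']).test('fast food')); // true

// "ab" and "ba" share the middle character of "aba", yet both lookaheads succeed
console.log(lookaheadRegex(['ab', 'ba']).test('aba')); // true
```

Each lookahead is evaluated independently from the same position, so nothing records which characters an earlier term has already claimed.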
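I would suggest a slight variant on the divide and conquer idea: instead of splitting the string into chunks (bites), you can "wipe out" the characters that have been matched and perform further searches on that one string. The character to wipe with is the separator, since it is guaranteed not to occur in any of the terms.
Here it is: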
    function trincotWipeSearch(query, options, separator) {
        // Split query in terms at delimiter
        const terms = query.split(separator).filter(Boolean);
        if (!terms.length) return options;
        // Sort terms by descending size
        terms.sort((a, b) => b.length - a.length);

        // Escape terms, and enrich with size of original term
        // and a string of the same length filled with the separator char
        const items = terms.map(term => ({
            size: term.length,
            wipe: separator.repeat(term.length),
            regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
        }));

        function getOffsets(termIndex, text) {
            // All terms found?
            if (termIndex >= terms.length) return [];
            let match;
            const { regex, size, wipe } = items[termIndex];
            regex.lastIndex = 0;
            while (match = regex.exec(text)) {
                let index = match.index;
                // Wipe characters and recurse to find other terms
                let offsets = getOffsets(termIndex + 1,
                    text.substr(0, index) + wipe + text.substr(index + size));
                if (offsets !== undefined) {
                    // Solution found, backtrack all the way
                    return offsets.concat([index, index + size]);
                }
                regex.lastIndex = match.index + 1;
            }
        }

        // Loop through each option
        return options.map(option => {
            // Get the offsets of the matches
            let offsets = getOffsets(0, option);
            if (offsets) {
                // Apply the offsets to add the markup
                offsets
                    .sort((a, b) => b - a)
                    .forEach((index, i) => {
                        option = option.substr(0, index)
                            + (i % 2 ? "<u>" : "</u>")
                            + option.substr(index);
                    });
                return option;
            }
        }).filter(Boolean); // get only the non-empty results
    }

    var options = [
        'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa',
        'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
        'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
        'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba',
        'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory',
        'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
        'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island',
        'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The',
        'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
        'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
        'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji',
        'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia',
        'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
        'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands',
        'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia',
        'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan',
        'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of',
        'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon',
        'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
        'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
        'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico',
        'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat',
        'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia',
        'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands',
        'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea',
        'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion',
        'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha',
        'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon',
        'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
        'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia',
        'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands',
        'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden',
        'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan',
        'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago',
        'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
        'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay',
        'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British',
        'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'
    ];

    /*
     * I/O and performance measurements
     */
    function processInput() {
        var query = this.value.toLowerCase();
        const t0 = performance.now();
        const matches = trincotWipeSearch(query, options, ' ');
        const spentTime = performance.now() - t0;
        // Output the time spent
        time.textContent = spentTime.toFixed(2);
        // Output the matches
        result.innerHTML = '';
        for (var match of matches) {
            // Append it to the result list
            var li = document.createElement('li');
            li.innerHTML = match;
            result.appendChild(li);
        }
    }

    findTerms.addEventListener('keyup', processInput);
    processInput.call(findTerms);
    ul {
        height: 300px;
        font-size: smaller;
        overflow: auto;
    }

    Input terms: <input type="text" id="findTerms">
    Trincot's Wipe Search
    Time: <span id="time"></span>ms
    <ul id="result">
    </ul>
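I have excluded the DOM I/O from the time measurement.
Here is a jsfiddle comparing the two algorithms side by side. It should not be difficult to add a third algorithm for comparison with the others.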
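The wiping step on its own can be illustrated in isolation. The helper name `wipe` is made up for this sketch; it does what the recursive call inside `getOffsets` does inline.

```javascript
// Replace the matched span with separator characters so that later searches
// cannot reuse those characters, while every other offset stays unchanged
function wipe(text, index, size, separator) {
  return text.slice(0, index) + separator.repeat(size) + text.slice(index + size);
}

const option = 'United States';
const wiped = wipe(option, 7, 6, ' '); // "States" starts at index 7

console.log(JSON.stringify(wiped)); // "United       "
console.log(wiped.indexOf('ted'));  // 3  (still findable, at its original offset)
console.log(wiped.indexOf('ate'));  // -1 (it only existed inside the wiped "States")
```

Because the string keeps its length, the offsets found in the wiped string can be applied directly to the original option when inserting the markup.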
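When the separator can be any regular expression, the above function cannot be used. One way to overcome this is to introduce a "shadow" string, just as long as the option string, but containing only two different characters (such as . and x):
One of the two indicates that the corresponding character in the option string (i.e. at the same position) has already been matched by a term, and is therefore no longer available for another term's match.
The other character indicates that the corresponding character in the option string is still available for inclusion in a term's match.
Obviously this makes the function a bit slower, since some matches may need to be rejected after checking them against the shadow string: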
    function trincotShadowMarks(query, options, separator) {
        // Split query in terms at delimiter
        const terms = query.split(separator).filter(Boolean);
        if (!terms.length) return options;
        // Sort terms by descending size
        terms.sort((a, b) => b.length - a.length);

        // Escape terms, and enrich with size of original term
        // and "used" / "free" marker strings of the same length
        const items = terms.map(term => ({
            size: term.length,
            used: 'x'.repeat(term.length),
            free: '.'.repeat(term.length),
            regex: new RegExp(term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&"), 'gi')
        }));

        function getOffsets(termIndex, text, shadow) {
            // All terms found?
            if (termIndex >= terms.length) return [];
            let match;
            const { regex, size, used, free } = items[termIndex];
            regex.lastIndex = 0;
            while (regex.lastIndex > -1 && (match = regex.exec(text))) {
                let index = match.index;
                // Is this match not overlapping with another match?
                if (!shadow.substr(index, size).includes('x')) {
                    // Mark position as used and recurse to find other terms
                    let offsets = getOffsets(termIndex + 1, text,
                        shadow.substr(0, index) + used + shadow.substr(index + size));
                    if (offsets !== undefined) {
                        // Solution found, backtrack all the way
                        return offsets.concat([index, index + size]);
                    }
                }
                // Continue at the next stretch of free characters that is long enough
                regex.lastIndex = shadow.indexOf(free, match.index + 1);
            }
        }

        // Loop through each option
        return options.map(option => {
            // Get the offsets of the matches
            let offsets = getOffsets(0, option, '.'.repeat(option.length));
            if (offsets) {
                // Apply the offsets to add the markup
                offsets
                    .sort((a, b) => b - a)
                    .forEach((index, i) => {
                        option = option.substr(0, index)
                            + (i % 2 ? "<u>" : "</u>")
                            + option.substr(index);
                    });
                return option;
            }
        }).filter(Boolean); // get only the non-empty results
    }

    var options = [
        'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa',
        'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
        'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
        'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba',
        'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory',
        'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
        'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island',
        'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The',
        'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
        'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
        'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji',
        'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia',
        'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
        'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands',
        'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia',
        'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan',
        'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of',
        'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon',
        'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
        'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
        'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico',
        'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat',
        'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia',
        'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands',
        'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea',
        'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion',
        'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha',
        'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon',
        'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
        'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia',
        'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands',
        'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden',
        'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan',
        'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago',
        'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
        'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay',
        'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British',
        'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'
    ];

    /*
     * I/O and performance measurements
     */
    function processInput() {
        var query = this.value.toLowerCase();
        const t0 = performance.now();
        const matches = trincotShadowMarks(query, options, ' ');
        const spentTime = performance.now() - t0;
        // Output the time spent
        time.textContent = spentTime.toFixed(2);
        // Output the matches
        result.innerHTML = '';
        for (var match of matches) {
            // Append it to the result list
            var li = document.createElement('li');
            li.innerHTML = match;
            result.appendChild(li);
        }
    }

    findTerms.addEventListener('keyup', processInput);
    processInput.call(findTerms);
    ul {
        height: 300px;
        font-size: smaller;
        overflow: auto;
    }

    Input terms: <input type="text" id="findTerms">
    Trincot's Wipe Search
    Time: <span id="time"></span>ms
    <ul id="result">
    </ul>
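The shadow-string bookkeeping can be sketched on its own. The helper names `isFree` and `markUsed` are hypothetical; the snippet above performs the same two operations inline with `substr` and `includes`.

```javascript
// '.' = character still available, 'x' = character already claimed by a term
function isFree(shadow, index, size) {
  return !shadow.slice(index, index + size).includes('x');
}

function markUsed(shadow, index, size) {
  return shadow.slice(0, index) + 'x'.repeat(size) + shadow.slice(index + size);
}

const text = 'ababa';
let shadow = '.'.repeat(text.length); // '.....'

// Claim "aba" at index 0
shadow = markUsed(shadow, 0, 3);
console.log(shadow);                // 'xxx..'

// "ba" at index 1 overlaps the claimed span, "ba" at index 3 does not
console.log(isFree(shadow, 1, 2));  // false
console.log(isFree(shadow, 3, 2));  // true
```

Unlike wiping, the option text itself is never modified, which is why this variant does not depend on the separator being a single known character.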
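I gave it a go, but I'm not sure how much help this will be. My approach is similar to your Divide and Conquer technique.
Instead of biting off bits of the string, I search for each term ahead of time and store a collection of all the matches, recording their start and end positions. If there aren't enough matches for a particular search term, the algorithm immediately bails out for that "option".
Once it has gathered all of the possible matches, it recursively attempts to find a combination that doesn't overlap. There is a lot of copying of data structures going on in that recursion, and I suspect it could be optimized a lot better than I have it here. I can also only apologize for some of the variable names; I was struggling to come up with names that made sense.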
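The "collect everything first" step described above might look roughly like this. It is a simplified sketch with hypothetical helper names (`findAllMatches`, `overlaps`), not the code from the snippet in this answer.

```javascript
// Collect every occurrence of `term` in `option`, including overlapping ones
function findAllMatches(option, term) {
  const matches = [];
  if (!term) return matches; // guard: an empty term would never stop matching
  let index = option.indexOf(term);
  while (index !== -1) {
    matches.push({ start: index, end: index + term.length });
    index = option.indexOf(term, index + 1);
  }
  return matches;
}

// Two half-open ranges [start, end) overlap if each starts before the other ends
function overlaps(a, b) {
  return a.start < b.end && b.start < a.end;
}

const found = findAllMatches('ababababa', 'aba');
console.log(found.map(m => m.start));   // [0, 2, 4, 6]
console.log(overlaps(found[0], found[1])); // true
console.log(overlaps(found[0], found[2])); // false

// Early bail-out: a query containing "aba" five times cannot match this option
console.log(found.length >= 5); // false
```

The recursive part then only has to pick one entry per query term such that no two picked entries overlap.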
For some test searches, such as …
    function search() {
        var options = [
            'ababababa',
            'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa',
            'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
            'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
            'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba',
            'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory',
            'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
            'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island',
            'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The',
            'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
            'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
            'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji',
            'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia',
            'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
            'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands',
            'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia',
            'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan',
            'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of',
            'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon',
            'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
            'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
            'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico',
            'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat',
            'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia',
            'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands',
            'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea',
            'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion',
            'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha',
            'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon',
            'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
            'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia',
            'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands',
            'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden',
            'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan',
            'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago',
            'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine',
            'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay',
            'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British',
            'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'
        ];

        var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);
        if (!terms[0]) {
            terms = [];
        }
        document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);

        var startTime = performance.now();

        // Term counts is a map storing how many times each search term appears in the query
        var termCounts = {};
        terms.forEach(function(term) {
            termCounts[term] = (termCounts[term] || 0) + 1;
        });

        // An array of search terms with the duplicates removed
        var uniqueTerms = Object.keys(termCounts);

        // Loop through each option and map to either a highlight version or null
        options = options.map(function(optionText) {
            var matches = {},
                lastMatchIndex = {},
                option = optionText.toLowerCase();

            uniqueTerms.forEach(function(term) {
                // This array will be populated with start/end position of each match for this term
                matches[term] = [];
                // The index of the last match... which could be deduced from the matches but this is slightly easier
                lastMatchIndex[term] = -1;
            });

            var incompleteMatchTerms = uniqueTerms.slice(),
                nextMatchTerm;

            // This is probably a premature optimization but doing it this
            // way ensures we check that each search term occurs at least
            // once as quickly as possible.
            while (nextMatchTerm = incompleteMatchTerms.shift()) {
                var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);

                if (nextMatchIndex === -1) {
                    // Haven't found enough matches for this term, so the option doesn't match
                    if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
                        return null;
                    }
                } else {
                    // Found another match, put the term back on the queue
                    // for another round
                    incompleteMatchTerms.push(nextMatchTerm);
                    lastMatchIndex[nextMatchTerm] = nextMatchIndex;
                    matches[nextMatchTerm].push({
                        start: nextMatchIndex,
                        end: nextMatchIndex + nextMatchTerm.length
                    });
                }
            }

            // Pass in the original array of terms... we attempt to highlight in the order of the original query
            var highlights = performHighlight(terms, matches);

            if (!highlights) {
                return null;
            }

            // We need the highlights sorted so that we do the replacing from the end of the string
            highlights.sort(function(h1, h2) {
                return h2.start - h1.start;
            });

            highlights.forEach(function(highlight) {
                optionText = optionText.slice(0, highlight.start)
                    + '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
                    + optionText.slice(highlight.end);
            });

            return optionText;

            function performHighlight(terms, allMatches) {
                // If there are no terms left to match we've got a hit
                if (terms.length === 0) {
                    return [];
                }

                var nextTerms = terms.slice(),
                    term = nextTerms.shift(),
                    matches = allMatches[term].slice(),
                    match;

                while (match = matches.shift()) {
                    var nextMatches = {};

                    // We need to purge any entries from nextMatches that overlap the current match
                    uniqueTerms.forEach(function(nextTerm) {
                        var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];

                        nextMatches[nextTerm] = nextMatch.filter(function(match2) {
                            return match.start >= match2.end || match.end <= match2.start;
                        });
                    });

                    var highlights = performHighlight(nextTerms, nextMatches);

                    if (highlights) {
                        highlights.push(match);
                        return highlights;
                    }
                }

                return null;
            }
        });

        document.getElementById('results').innerHTML = options.map(function(option) {
            if (option) {
                return '<li>' + option + '</li>';
            }
            return '';
        }).join('');

        document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
    }
Permutations
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
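Update:
Based on feedback from Mikk3lRo in the comments, I've done a bit of performance tuning and come up with this:
https://jsfiddle.net/skirtle/ndeuqn02/1/
The core algorithm is the same, but I've made it much harder to understand, all in the name of performance. Most of the changes relate to avoiding the creation of new objects wherever possible.
Because the algorithm searches upfront for a lot of things it may never need, there will always be opportunities for other algorithms to be quicker, especially in simple cases. Many of those cases could be handled separately, but I haven't attempted that kind of optimisation.
In Chrome it now outperforms the other implementations in many different scenarios, though that is an unfair comparison because they haven't yet been tuned in the same way. For simple searches the other implementations tend to be slightly faster in Firefox, but the times are all in the same ballpark.
Some particularly interesting searches: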
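- "a ab ba baba". I've added a new "option" and adjusted the CSS to demonstrate this. The algorithms differ in how they choose which matches to highlight. My algorithm favours the order of the terms in the query rather than the length of the terms. Further optimisations are available if I don't worry about the ordering, but I think they would only help in extreme cases of overlaps.
- "t r i s t a n d a c u n h a". Note the spaces between the letters: these are 14 separate search terms. If you type this one term at a time, Divide and Conquer quickly starts to struggle, but it does recover towards the end. Wipe and Shadow cope for longer, but when you type the letter "c" they fall off a cliff. I think it's an exponential explosion in the backtracking, but I haven't confirmed that. I'm sure that with a bit of work it could be addressed in simple cases, but it might be trickier to fix when the backtracking is caused by an unresolvable overlap.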
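I'm sure all the implementations could be sped up with more tuning and a few carefully crafted bits of special-case handling. Which one is actually "best" for real scenarios I'm not sure, but my current feeling is that my algorithm probably has only a narrow sweet spot where it would outperform the others in a truly fair test. An algorithm that doesn't do all that searching upfront seems hard to beat for real searches.
Update 2
I've tried a different implementation of my earlier approach:
https://jsfiddle.net/skirtle/ndeuqn02/9/
Note that I've only updated my own implementation; the others remain out of date.
I thought I'd try to avoid unnecessary searches by performing them lazily rather than doing them all upfront. It still caches them so that they can be reused if the algorithm needs to backtrack. I don't know whether this makes a significant difference, as performing a small number of extra searches on short strings probably wasn't adding much overhead.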
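A minimal sketch of the lazy-with-cache idea, not the code from the fiddle: the helper name `createLazyMatcher` is made up for illustration, and it assumes a non-empty, already lower-cased term. Each match position is searched for only when it is first requested, and then kept so that backtracking can ask for it again without another `indexOf` call.

```javascript
// Hypothetical helper: finds the n-th occurrence of `term` in `option` on demand.
function createLazyMatcher(option, term) {
  if (!term) {
    // Assumption: empty terms are filtered out earlier; never match them here.
    return function () { return -1; };
  }
  var found = [];          // cached start offsets, in increasing order
  var exhausted = false;   // true once indexOf has returned -1

  // Returns the start offset of the n-th match (0-based), or -1 if there is none.
  return function matchAt(n) {
    while (found.length <= n && !exhausted) {
      var from = found.length ? found[found.length - 1] + 1 : 0;
      var index = option.indexOf(term, from);
      if (index === -1) {
        exhausted = true;
      } else {
        found.push(index);
      }
    }
    return n < found.length ? found[n] : -1;
  };
}
```

A backtracking search could hold one such matcher per unique term and call `matchAt(0)`, `matchAt(1)`, and so on as it explores; repeated requests for the same position are answered from the cache.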
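I also experimented with cutting out the function recursion. While it does seem to work, I feel there's a high risk of bugs (it would need a lot of unit tests to be sure it really does work). I'm not convinced this part was a success, because the data structures involved make it really difficult to follow. It does seem to be fast, but not by enough to justify the complexity.
I also experimented with alternative ways to build up the final highlights. All that sorting and slicing looked like a performance drain but, again, the code gets more complicated when trying to avoid it. Some of these gains might be applicable to the other algorithms, though.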
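One possible way to cut down on the slicing, sketched here under the assumption that the match ranges are `{start, end}` objects that do not overlap (the function name `applyHighlights` is made up for illustration): sort the ranges once in ascending order, collect the pieces in an array, and join at the end, rather than re-slicing the growing string for every highlight.

```javascript
// Hypothetical helper: wraps each {start, end} range of `text` in open/close markers.
// Assumes the ranges do not overlap; they may be passed in any order.
function applyHighlights(text, ranges, open, close) {
  open = open === undefined ? '<u>' : open;
  close = close === undefined ? '</u>' : close;

  var sorted = ranges.slice().sort(function (a, b) {
    return a.start - b.start;
  });

  var parts = [];
  var cursor = 0; // end of the last piece already copied
  sorted.forEach(function (range) {
    parts.push(text.slice(cursor, range.start), open, text.slice(range.start, range.end), close);
    cursor = range.end;
  });
  parts.push(text.slice(cursor));

  return parts.join('');
}
```

Each character of the option is copied once, whatever the number of highlights, at the cost of one extra array.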
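Another idea I've introduced here is a pre-search analysis of the query terms (dependent only on the query, not on the options). It checks whether terms can overlap, and for any terms where an overlap is impossible (such as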
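How the fiddle performs this analysis is not shown here, so the following is only a sketch of one way such a check could work; `canOverlap` and `queryHasOverlapRisk` are hypothetical names. It relies on the observation that matches of two terms can only overlap in an option if one term contains the other, or if the end of one term equals the start of the other.

```javascript
// True if some suffix of `left` (shorter than both terms) equals a prefix of `right`.
function suffixMeetsPrefix(left, right) {
  var max = Math.min(left.length, right.length) - 1;
  for (var n = 1; n <= max; n++) {
    if (left.slice(-n) === right.slice(0, n)) return true;
  }
  return false;
}

// True if a match of `a` and a match of `b` could ever share characters.
function canOverlap(a, b) {
  if (a.indexOf(b) !== -1 || b.indexOf(a) !== -1) return true; // containment (or equality)
  return suffixMeetsPrefix(a, b) || suffixMeetsPrefix(b, a);
}

// True if any pair of terms in the query could overlap.
function queryHasOverlapRisk(terms) {
  for (var i = 0; i < terms.length; i++) {
    for (var j = i + 1; j < terms.length; j++) {
      if (canOverlap(terms[i], terms[j])) return true;
    }
  }
  return false;
}
```

When `queryHasOverlapRisk` returns false, no backtracking between terms is needed, so a simpler per-term search would be enough for that query.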
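As has been mentioned in the comments, running some sort of pre-search analysis of the options is also possible, but I haven't actually implemented that here. It is difficult to know what kind of search index would be most beneficial, as it depends on things like memory usage and the specifics of the options. However, it might be more practical to try carrying a small amount of information over from one search to the next.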
For example, if someone searches
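As an illustration of carrying information from one search to the next (my own assumption about what could be carried, not something taken from the fiddles): when the new query merely extends the previous one, every old term is a prefix of a new term, so only options that matched last time can still match. The names `createIncrementalSearch` and `matchOption` are hypothetical; `matchOption(query, option)` stands in for any of the per-option algorithms and is assumed to return the highlighted string or null.

```javascript
// Hypothetical wrapper that narrows the candidate set between keystrokes.
function createIncrementalSearch(options, matchOption) {
  var lastQuery = null;
  var lastHits = []; // indices of the options that matched lastQuery

  return function search(query) {
    var q = query.toLowerCase();
    // Only reuse the previous hits when the previous query is a prefix of this one.
    var narrowing = lastQuery !== null && q.indexOf(lastQuery) === 0;
    var candidates = narrowing ? lastHits : options.map(function (_, i) { return i; });

    var hits = [];
    var results = [];
    candidates.forEach(function (i) {
      var highlighted = matchOption(q, options[i]);
      if (highlighted !== null) {
        hits.push(i);
        results.push(highlighted);
      }
    });

    lastQuery = q;
    lastHits = hits;
    return results;
  };
}
```

Deleting characters or editing the middle of the query falls back to a full scan, so the results should be the same as without the wrapper; only the amount of work changes.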
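Divide and Conquer
Somewhat more complicated than the single-regex Viral Permutations strategy, this recursive algorithm searches for each term one after the other, starting with the longest term.
Each time a match is found, it divides that "bite" into three (unless the match is at the beginning or end), marks the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".
When it is unable to find the longest unmatched term, it backtracks and attempts to match the previous term at a different position (even in a different "bite").
If it comes back to the longest term and cannot find another position to match it at, it returns null, which marks the option as a non-match.
This means that in most cases it can return non-matches quickly, simply because they don't even contain the longest term.
Of course, if it runs out of terms (i.e. it successfully matches the shortest one), it returns the highlighted match by joining all the "bites" back together.
Demo:
Updated for improved performance: the base algorithm is exactly the same, but there were some pretty expensive calls that could be avoided entirely.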
function divide_and_conquer_replace(query, options, separator) {
    var terms, terms_esc;

    //The inner replacement function
    function divide_and_conquer_inner(bites, depth) {
        var this_term, i, bite, match, new_bites, found_all_others;

        depth = depth ? depth : 1;

        //Get the longest remaining term
        this_term = terms_esc[terms_esc.length - depth];

        //Loop all the bites
        for (i = 0; i < bites.length; i++) {
            bite = bites[i];

            //Reset the lastIndex since we're reusing the RegExp objects
            this_term.lastIndex = 0;

            //Check that we have a string (ie. do not attempt to match bites
            //that are already consumed)
            if (typeof bite === 'string') {
                //Find the next matching position (if any)
                while (match = this_term.exec(bite)) {
                    new_bites = (i > 0) ? bites.slice(0, i) : [];
                    if (match.index > 0) {
                        new_bites.push(bite.slice(0, match.index));
                    }
                    new_bites.push(['<u>' + match[0] + '</u>']);
                    if (this_term.lastIndex < bite.length) {
                        new_bites.push(bite.slice(this_term.lastIndex));
                    }
                    if (i < bites.length - 1) {
                        new_bites = new_bites.concat(bites.slice(i + 1));
                    }

                    if (terms_esc.length > depth) {
                        //Attempt to find all other terms
                        found_all_others = divide_and_conquer_inner(new_bites, depth + 1);

                        //If we found all terms we'll pass the modified string all the
                        //way up to the original callee
                        if (found_all_others) {
                            return found_all_others;
                        }
                        //Otherwise try to match current term somewhere else
                        this_term.lastIndex = match.index + 1;
                    } else {
                        //If no terms remain we have a match
                        return new_bites.join('');
                    }
                }
            }
        }
        //If we reach this point at least one term was not found
        return null;
    };

    // Split query in terms at delimiter
    terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;

    //Sort terms according to length - longest term last
    terms.sort(function(a, b) {
        return a.length - b.length;
    });

    //Escape terms
    //And store RegExp's instead of strings
    terms_esc = terms.map(function (term) {
        return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g,"\\$&");
    }).map(function (term) {
        return new RegExp(term, 'gi');
    });

    //Loop through each option
    return options.map(function(option){
        return divide_and_conquer_inner([option]);
    }).filter(Boolean);
}

var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 
'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 
'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']; var separator = ' '; function divide_and_conquer(){ var query = document.getElementById('divide_and_conquer').value; var res_elm = document.getElementById('divide_and_conquer_result'); var t0 = performance.now(); var results = divide_and_conquer_replace(query, options, separator); var t1 = performance.now(); document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms'; res_elm.innerHTML = ''; for (var result of results) { res_elm.innerHTML += ' <li> ' + result + ' </li> '; } }; divide_and_conquer(); |
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
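This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all or mostly present in many of the options (such as
In realistic scenarios it currently outperforms the other proposed algorithms; see the jsperf link added to the question.
Here's an entirely different approach from the one taken in my previous answer. I could not add all of the below to that answer (size restriction), so it is posted as a separate answer.
Generalized Suffix Tree: Preprocessing the options
A generalized suffix tree is a structure that, in theory, allows searching for substrings in a set of strings in an efficient way. So I thought I would have a go at it.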
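To make the idea concrete before getting to the real thing, here is a deliberately naive stand-in, not a suffix tree and not Ukkonen's algorithm: a sorted list of every suffix of every (lower-cased) option, queried with a binary search. The names `buildSuffixIndex` and `findAllNaive` are made up for this sketch. It answers the same question as the tree (which options contain this term, and at which offsets), but building it is far more expensive in time and memory than a properly built tree.

```javascript
// Naive index: one entry per suffix of every option, sorted by suffix text.
function buildSuffixIndex(options) {
  var entries = [];
  options.forEach(function (option, optionId) {
    var text = option.toLowerCase();
    for (var offset = 0; offset < text.length; offset++) {
      entries.push({ suffix: text.slice(offset), optionId: optionId, offset: offset });
    }
  });
  entries.sort(function (a, b) {
    return a.suffix < b.suffix ? -1 : (a.suffix > b.suffix ? 1 : 0);
  });
  return entries;
}

// Returns a Map of optionId -> array of offsets where `term` (lower case) occurs.
function findAllNaive(index, term) {
  // Binary search for the first suffix that is >= term.
  var lo = 0, hi = index.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (index[mid].suffix < term) lo = mid + 1; else hi = mid;
  }
  // All suffixes that start with the term sit next to each other from there on.
  var matches = new Map();
  for (var i = lo; i < index.length && index[i].suffix.indexOf(term) === 0; i++) {
    var entry = index[i];
    if (!matches.has(entry.optionId)) matches.set(entry.optionId, []);
    matches.get(entry.optionId).push(entry.offset);
  }
  return matches;
}
```

The offsets come back in suffix order rather than position order, so a caller that needs them left to right has to sort them first.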
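Building such a tree in an efficient way is far from trivial, as can be seen from this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree for one phrase (option).
I have taken inspiration from the implementation found here, which needed some adaptation to:
- Apply a better coding style (e.g. get rid of global variables that are not explicitly declared)
- Make it work without needing to add a delimiter after texts. This was really tricky, and I hope I did not miss some border conditions.
- Make it work for multiple strings (i.e. generalized)
So here it is: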
"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also: https://stackoverflow.com/q/9452701/5459839
class Node {
    constructor() {
        this.edges = {};
        this.suffixLink = null;
    }
    addEdge(ch, textId, start, end, node) {
        this.edges[ch] = { textId, start, end, node };
    }
}

class Nikkonen extends Node {
    constructor() {
        super(); // root node of the tree
        this.texts = [];
    }
    findNode(s) {
        if (!s.length) return;
        let node = this,
            len,
            suffixSize = 0,
            edge;
        for (let i = 0; i < s.length; i += len) {
            edge = node.edges[s.charAt(i)];
            if (!edge) return;
            len = Math.min(edge.end - edge.start, s.length - i);
            if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
            node = edge.node;
        }
        return { edge, len };
    }
    findAll(term, termId = 1) {
        const { edge, len } = this.findNode(term) || {};
        if (!edge) return {}; // not found
        // Find all leaves
        const matches = new Map;
        (function recurse({ node, textId, start, end }, suffixLen) {
            suffixLen += end - start;
            const edges = Object.values(node.edges);
            if (!edges.length) { // leaf node: calculate the match
                if (!(matches.has(textId))) matches.set(textId, []);
                matches.get(textId).push({ offset: end - suffixLen, termId });
                return;
            }
            edges.forEach( edge => recurse(edge, suffixLen) );
        })(edge, term.length - len);
        return matches;
    }
    addText(text) {
        // Implements Ukkonen's algorithm for building the tree
        // Inspired by https://felix-halim.net/misc/suffix-tree/
        const root = this,
            active = {
                node: root,
                textId: this.texts.length,
                start: 0,
                end: 0,
            },
            texts = this.texts;

        // Private functions
        function getChar(textId, i) {
            return texts[textId].charAt(i) || '$' + textId;
        }

        function addEdge(fromNode, textId, start, end, node) {
            fromNode.addEdge(getChar(textId, start), textId, start, end, node);
        }

        function testAndSplit() {
            const ch = getChar(active.textId, active.end);
            if (active.start < active.end) {
                const edge = active.node.edges[getChar(active.textId, active.start)],
                    splitPoint = edge.start + active.end - active.start;
                if (ch === getChar(edge.textId, splitPoint)) return;
                const newNode = new Node();
                addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
                addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
                return newNode;
            }
            if (!(ch in active.node.edges)) return active.node;
        }

        function canonize() {
            while (active.start < active.end) {
                const edge = active.node.edges[getChar(active.textId, active.start)];
                if (edge.end - edge.start > active.end - active.start) break;
                active.start += edge.end - edge.start;
                active.node = edge.node;
            }
        }

        function update() {
            let prevNewNode = root,
                newNode;

            while (newNode = testAndSplit()) {
                addEdge(newNode, active.textId, active.end, text.length+1, new Node());
                // Rule 2: add suffix-link from previously inserted node
                if (prevNewNode !== root) {
                    prevNewNode.suffixLink = newNode;
                }
                prevNewNode = newNode;
                // Rule 3: follow suffixLink after split
                active.node = active.node.suffixLink || root;
                canonize(); // because active.node changed
            }
            if (prevNewNode !== root) {
                prevNewNode.suffixLink = active.node;
            }
        }

        texts.push(text);

        if (!root.suffixLink) root.suffixLink = new Node();
        for (let i = 0; i < text.length; i++) {
            addEdge(root.suffixLink, active.textId, i, i+1, root);
        }

        // Main Ukkonen loop: add each character from left to right to the tree
        while (active.end <= text.length) {
            update();
            active.end++;
            canonize(); // because active.end changed
        }
    }
}

function trincotSuffixTree(query, options, suffixTree, separator) {
    // Split query in terms at delimiter
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options;
    // Sort terms by descending size
    terms.sort( (a,b) => b.length - a.length );

    // create Map keyed by term with count info
    const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
    terms.forEach( (term) => termMap.get(term).count++ );

    function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
        // All terms found?
        if (!leftOver) return [];
        let giveUpAt = Infinity;
        // While still enough matches left over:
        while (offsetIndex + leftOver <= offsets.length) {
            const { termId, offset } = offsets[offsetIndex++];
            if (offset < lastIndex) continue; // overlap, try next
            if (offset >= giveUpAt) break; // Looking further makes no sense
            const termInfo = termMap.get(terms[termId]);
            //console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
            if (!termInfo.leftOver) continue; // too many of the same term, try next
            termInfo.leftOver--;
            const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
            // If success, then completely backtrack out of recursion.
            if (result) return result.concat([offset + termInfo.size, offset]);
            termInfo.leftOver++; // restore after failed recursive search and try next
            // If a term-match at a given offset could not lead to a solution (in recursion),
            // and if we keep those matched character positions all unmatched and only start matching after
            // the end of that location, it will certainly not lead to a solution either.
            giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
        }
    }

    let allTermsAllOptionsOffsets;
    // Loop through the unique terms:
    for (let [term, termInfo] of termMap) {
        // Get the offsets of the matches of this term in all options (in the preprocessed tree)
        const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
        //console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
        if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
        if (!allTermsAllOptionsOffsets) {
            allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
        } else {
            // Merge with all previously found offsets for other terms (intersection)
            for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
                let newOffsets = thisTermAllOptionsOffsets.get(optionId);
                if (!newOffsets || newOffsets.length < termInfo.count) {
                    // this option does not have enough occurrences of this term
                    allTermsAllOptionsOffsets.delete(optionId);
                } else {
                    allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
                }
            }
            if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
        }
    }
    // Per option, see if (and where) the offsets can serve non-overlapping matches for each term
    const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
            // Indicate how many of each term must (still) be matched:
            termMap.forEach( obj => obj.leftOver = obj.count );
            return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
        })
        // Remove options that could not provide non-overlapping offsets
        .filter( ([_, offsets]) => offsets )
        // Sort the remaining options in their original order
        // (compare the option ids of both entries)
        .sort( (a,b) => a[0] - b[0] )
        // Replace optionId, by the corresponding text and apply mark-up at the offsets
        .map( ([optionId, offsets]) => {
            let option = options[optionId];
            offsets.map((index, i) => {
                option = option.substr(0, index)
                    + (i%2 ? "<u>" : "</u>")
                    + option.substr(index);
            });
            return option;
        });
    //console.log(JSON.stringify(matches));
    return matches;
}

function trincotPreprocess(options) {
    const nikkonen = new Nikkonen();
    // Add all the options (lowercased) to the suffix tree
    options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
    return nikkonen;
}

const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 
'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 
'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements
 */

let preprocessed;

function processInput() {
    if (!preprocessed) { // Only first time
        const t0 = performance.now();
        preprocessed = trincotPreprocess(options);
        const spentTime = performance.now() - t0;
        // Output the time spent on preprocessing
        pretime.textContent = spentTime.toFixed(2);
    }
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotSuffixTree(query, options, preprocessed, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
    height:300px;
    font-size: smaller;
    overflow: auto;
}

Input terms:
<input type="text" id="findTerms">
Trincot's Suffix Tree Search
Preprocessing Time: <span id="pretime"></span>ms (only done once)
Time: <span id="time"></span>ms
<ul id="result">
</ul>
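There is quite a lot of code behind this method, so I suppose it might not show interesting performance for small data sets, while for larger data sets it will be memory-hungry: the tree takes much more memory than the original options array.
Update 2
Abandoned the concept of shrinking the set, due to issues with restoring the working strings in Vue.
Now, the method is simply as follows:
The code is commented.
Raw JavaScript (logging the filtered/manipulated options array): https://jsfiddle.net/pvlj9uxe/14/
New Vue implementation: https://jsfiddle.net/15prcpxn/30/
The calculation appears to be reasonably fast; it is the DOM update that kills it.
Added to the comparison*: https://jsfiddle.net/ektyx133/4/
*Caveat: pre-processing the options (which are treated as "static") is part of the strategy, so it is done outside the benchmark.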
var separator = /\s|\*|,/;

// this function enhances the raw options array
function enhanceOptions(options) {
  return options.map(option => ({
    working: option.toLowerCase(), // for use in filtering the set and matching
    display: option // for displaying
  }))
}

// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
  return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
    value: term.toLowerCase(),
    size: term.length,
    // the wipe string must be as long as the term so that later match offsets stay valid
    wipe: " ".repeat(term.length)
  })).sort((a, b) => b.size - a.size);
}

// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
  let options = enhancedOptions,
    l = terms.length;

  // filter the options - consider recursion instead
  options = options.filter(option => {
    let i = 0,
      working = option.working,
      term;
    while (i < l) {
      if (!~working.indexOf((term = terms[i]).value)) return false;
      working = working.replace(term.value, term.wipe);
      i++;
    }
    return true;
  })

  // generate the display string array
  let displayOptions = options.map(option => {
    let rangeSet = [],
      working = option.working,
      display = option.display;

    // find the match ranges
    terms.forEach(term => {
      // duplicate the wipe string replacement from the filter, but grab the offsets
      working = working.replace(term.value, (match, offset) => {
        rangeSet.push({
          start: offset,
          end: offset + term.size
        });
        return term.wipe;
      })
    })

    // sort the match ranges, last to first
    rangeSet.sort((a, b) => b.start - a.start);

    // insert the html tags within the string around each match range
    rangeSet.forEach(range => {
      display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
    })

    return display;
  })

  return displayOptions;
}
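Old attempt
https://jsfiddle.net/15prcpxn/25/
My attempt, using Vue for rendering (the methods are sequential, so you could probably put them all into one monolithic function without much effort: the inputs would be the terms and the full options set; the outputs would be the filtered options set and the highlighted ranges).
Performance is poor if there are many matches or unhelpful terms (e.g. when you input a single character). For end use I'd probably add a delay between input and calculation.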
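A delay like that is usually done with a debounce. The sketch below is a generic one, not code from the fiddles; the `timers` parameter is only there so the behaviour can be checked without real timeouts, and the 150 ms in the usage note is an arbitrary value.

```javascript
// Returns a wrapper that calls `fn` only after `delayMs` have passed without another call.
// `timers` is an optional, hypothetical hook for substituting setTimeout/clearTimeout.
function debounce(fn, delayMs, timers) {
  timers = timers || {
    set: function (cb, ms) { return setTimeout(cb, ms); },
    clear: function (handle) { clearTimeout(handle); }
  };
  var handle = null;

  return function debounced() {
    var self = this;
    var args = arguments;
    if (handle !== null) timers.clear(handle); // cancel the pending call, if any
    handle = timers.set(function () {
      handle = null;
      fn.apply(self, args); // run with the arguments of the latest call
    }, delayMs);
  };
}
```

It could be wired up along the lines of `input.addEventListener('keyup', debounce(runSearch, 150))`, where `runSearch` stands for whichever search function is in use, so that fast typing triggers a single search for the final text instead of one per keystroke.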
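I should be able to roll some of these steps up into fewer steps, which may improve performance. I'll revisit this tomorrow.
Vue presumably also takes care of some optimisations through its virtual DOM etc., so this won't necessarily reflect vanilla JavaScript / DOM rendering.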