Java: Most efficient data structure for storing an alphabetically ordered word list

My program will read in a paragraph of words (stored in a text file). It then needs to do the following:

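Print out a list of all the words (in alphabetical order). For each word, print the frequency count (how many times the word appears in the entire paragraph) and the line numbers on which the word appears (these do not need to be ordered). If a word appears on a line multiple times, the line number does not need to be stored twice (the frequency count of the word is still updated).
Display a list of the words ordered from most frequent to least frequent.
The user will input a specific word. If the word is found, print out its frequency count.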

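Limitations: I cannot use the Collections class, and I cannot store the data multiple times (e.g. reading in words from the paragraph and storing them in both a Set and an ArrayList).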

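Coding this won't be hard, but I can't figure out what the most efficient implementation would be, since the data size could be a few paragraphs from a Wikipedia article. Here is my idea for now: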

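Have a Word class. This Word class will contain methods that return the word's frequency count and the lines on which the word appears (and other relevant data).
The paragraph will be stored in a text file. The program will read the data line by line, split each line into an array, and read in the words one by one.
As words are read in from the text file, put them into some sort of structure. If the structure does not contain the word, create a new Word object.
If the structure already contains the word, update the frequency counter for that word.
- I will also have an int to record the line number. These line numbers will be updated accordingly.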

This is somewhat incomplete, but it is what I'm thinking for now. The whole 'Word' class may well be completely unnecessary, too.

Related discussion

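First, you can create a class that holds the data for the occurrences and the line numbers (along with the word). This class can implement the Comparable interface, providing easy comparisons based on word frequency: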

import java.util.Set;
import java.util.TreeSet;

public class WordOccurrence implements Comparable<WordOccurrence> {

    private final String word;
    private int totalCount = 0;
    // TreeSet keeps the line numbers sorted and ignores duplicates
    private final Set<Integer> lineNumbers = new TreeSet<>();

    public WordOccurrence(String word, int firstLineNumber) {
        this.word = word;
        addOccurrence(firstLineNumber);
    }

    public final void addOccurrence(int lineNumber) {
        totalCount++;
        lineNumbers.add(lineNumber);
    }

    // orders by frequency, lowest count first
    @Override
    public int compareTo(WordOccurrence o) {
        return Integer.compare(totalCount, o.totalCount);
    }

    @Override
    public String toString() {
        StringBuilder lineNumberInfo = new StringBuilder("[");
        for (int line : lineNumbers) {
            if (lineNumberInfo.length() > 1) {
                lineNumberInfo.append(", ");
            }
            lineNumberInfo.append(line);
        }
        lineNumberInfo.append("]");
        return word + ", occurrences: " + totalCount + ", on rows "
                + lineNumberInfo;
    }
}

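When reading the words from a file, it is useful to return the data in a Map that maps each word to its WordOccurrence. Using a TreeMap, you get alphabetical ordering "for free". In addition, you may want to remove punctuation from the lines (e.g. using a regular expression like \\p{P}) and ignore the case of the words: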

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.TreeMap;

public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
        throws IOException {
    TreeMap<String, WordOccurrence> words = new TreeMap<>();

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(filePath)))) {
        String line;
        int lineNumber = 0;

        while ((line = reader.readLine()) != null) {
            lineNumber++;
            // remove punctuation and normalize to lower-case
            line = line.replaceAll("\\p{P}", "").toLowerCase();
            for (String token : line.split("\\s+")) {
                // split() yields an empty token for blank or indented lines
                if (token.isEmpty()) {
                    continue;
                }
                WordOccurrence occurrence = words.get(token);
                if (occurrence == null) {
                    words.put(token, new WordOccurrence(token, lineNumber));
                } else {
                    occurrence.addOccurrence(lineNumber);
                }
            }
        }
    }

    return words;
}

Displaying the occurrences in alphabetical order using the above code is as simple as:

for (Map.Entry<String, WordOccurrence> entry :
        countOccurrences("path/to/file").entrySet()) {
    System.out.println(entry.getValue());
}

If you cannot use Collections.sort() (and a Comparator) to sort by occurrences, you need to write the sorting yourself. Something like this should do it:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public static void displayInOrderOfOccurrence(
        Map<String, WordOccurrence> words) {

    List<WordOccurrence> orderedByOccurrence = new ArrayList<>();

    // insertion sort, most frequent first
    for (WordOccurrence wo : words.values()) {
        int i = 0;
        // skip past every entry that is at least as frequent as wo
        while (i < orderedByOccurrence.size()
                && wo.compareTo(orderedByOccurrence.get(i)) <= 0) {
            i++;
        }
        orderedByOccurrence.add(i, wo);
    }

    // display
    for (WordOccurrence wo : orderedByOccurrence) {
        System.out.println(wo);
    }
}

Running the above code with the following test data:

Potato; orange.
Banana; apple, apple; potato.
Potato.

will produce this output:

apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
potato, occurrences: 3, on rows [1, 2, 3]

potato, occurrences: 3, on rows [1, 2, 3]
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]

You can use a simple TreeMap for the frequency lookups.

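Since the words are short (the kind you would find in normal text), each key comparison is cheap, and a TreeMap lookup needs only O(log n) of them, so lookups should be fast in practice. If you expect lots of unsuccessful lookups (lots of searches for words that don't exist), you could prefilter using a Bloom filter.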
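
To make the prefiltering idea concrete, here is a minimal sketch of a Bloom filter placed in front of the TreeMap. The class name WordBloomFilter, the helper frequencyOf, the filter size of 1024 bits and the three hash probes are all assumptions made for illustration; the hashing is a simple double-hashing scheme built on String.hashCode(), not a tuned implementation.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Minimal Bloom filter sketch; sizing and hash scheme are illustrative assumptions. */
final class WordBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    WordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit index from String.hashCode() via double hashing.
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd, hence never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the stored count, or 0 when the word is absent. */
    static int frequencyOf(String word, WordBloomFilter filter,
                           TreeMap<String, Integer> counts) {
        if (!filter.mightContain(word)) {
            return 0; // the map lookup is skipped entirely
        }
        // the filter can report false positives, so the map has the final say
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        WordBloomFilter filter = new WordBloomFilter(1024, 3);
        String[] words = {"potato", "orange", "banana", "apple",
                          "apple", "potato", "potato"};
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            filter.add(w);
        }
        System.out.println(frequencyOf("potato", filter, counts)); // prints 3
        System.out.println(frequencyOf("zebra", filter, counts));  // prints 0
    }
}
```

For a text of only a few paragraphs the plain TreeMap lookup is probably cheap enough on its own; the filter is more likely to pay off when the index is large and most queries miss.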

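I would start with a straightforward implementation and optimize further only if needed (parsing the stream directly, instead of splitting each line on a separator and iterating over the pieces again).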
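
As a rough sketch of what parsing the stream directly could look like, the code below reads one character at a time and builds each word in a StringBuilder, so no intermediate line or token arrays are created. The class name StreamWordCounter and its methods are hypothetical, and the tokenizing rule is an assumption: letters and digits form words, whitespace ends a word, and every other character is simply dropped.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.TreeMap;

final class StreamWordCounter {

    /** Counts lower-cased words while reading the input one character at a time. */
    static TreeMap<String, Integer> count(Reader reader) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append((char) Character.toLowerCase(c));
                } else if (Character.isWhitespace(c)) {
                    flush(current, counts); // whitespace ends the current word
                }
                // any other character (punctuation etc.) is skipped
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, counts); // the last word may not be followed by whitespace
        return counts;
    }

    // Records the word collected so far, if any, and resets the buffer.
    private static void flush(StringBuilder current, TreeMap<String, Integer> counts) {
        if (current.length() > 0) {
            counts.merge(current.toString(), 1, Integer::sum);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "Potato; orange.\nBanana; apple, apple; potato.\nPotato.";
        // prints {apple=2, banana=1, orange=1, potato=3}
        System.out.println(count(new StringReader(text)));
    }
}
```

When reading from a file, wrapping the reader in a BufferedReader keeps the per-character read() calls cheap.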

Related discussion

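You can use a TreeMap; it is well suited to keeping data in order. Use the word as the key and its frequency as the value. For example, let the following be your paragraph:

Java is good language Java is object oriented
I would then do the following to store each word along with its frequency: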

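import java.util.TreeMap;

String s = "Java is good language Java is object oriented";
// split on spaces; an empty separator would split into single characters
String[] strArr = s.split(" ");
TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
for (String str : strArr) {
    if (tm.get(str) == null) {
        tm.put(str, 1);
    } else {
        // store the incremented count back into the map
        tm.put(str, tm.get(str) + 1);
    }
}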
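
The get-then-put pattern above can also be written with Map.merge, available since Java 8. The following is a small sketch of that variant; the class name MergeCountSketch and the method countWords are made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class MergeCountSketch {

    /** Counts whitespace-separated words; keys end up in natural String order. */
    static TreeMap<String, Integer> countWords(String text) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // inserts 1 for a new key, otherwise adds 1 to the stored count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts =
                countWords("Java is good language Java is object oriented");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Note that "Java" is listed before the lower-case words, because upper-case letters sort before lower-case ones in the natural String ordering; lower-casing the words first gives a purely alphabetical listing.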

Hope this helps.

Related discussion

You could use a structure like this:
https://gist.github.com/jeorfevre/946ede55ad93cc811cf8

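import java.util.HashMap;
import java.util.Map;

/**
 *
 * @author Jean-Emmanuel
 *
 */
public class WordsIndex {
    private final Map<String, Word> words = new HashMap<String, Word>();

    // instance method, because it updates the instance field 'words'
    public void put(String word, int line, int paragraph) {
        word = word.toLowerCase();

        Word w = words.get(word);
        if (w != null) {
            // known word: only the count changes; line and paragraph
            // keep the values from the first occurrence
            w.count++;
        } else {
            // new word
            w = new Word();
            w.count = 1;
            w.line = line;
            w.paragraph = paragraph;
            w.word = word;
            words.put(word, w);
        }
    }
}

// package-private, so that both classes can live in one source file
class Word {
    String word;
    int count;
    int line;
    int paragraph;
}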
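
One thing to keep in mind with a HashMap-based index like this is that a HashMap has no defined iteration order, so the alphabetical listing needs an extra step. A possible sketch, assuming it is acceptable to copy the key references into an array and sort it with Arrays.sort (which is not part of the Collections class); the class SortedKeysSketch and the method sortedKeys are hypothetical names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SortedKeysSketch {

    /** Returns the map's keys in alphabetical order without java.util.Collections. */
    static String[] sortedKeys(Map<String, Integer> counts) {
        // copies only the references to the keys, not the strings themselves
        String[] keys = counts.keySet().toArray(new String[0]);
        Arrays.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("potato", 3);
        counts.put("apple", 2);
        counts.put("banana", 1);
        for (String word : sortedKeys(counts)) {
            System.out.println(word + " " + counts.get(word));
        }
    }
}
```

Whether the temporary array counts as storing the data twice depends on how strictly the assignment's limitation is read; a TreeMap avoids the question by keeping the keys ordered in the first place.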

Enjoy!
