How to tokenize an input file in Java
I am tokenizing a text file in Java. I want to read an input file, tokenize it, and write certain tokens to an output file. This is what I have done so far:
    package org.apache.lucene.analysis;

    import java.io.*;

    class StringProcessing {
        // Create BufferedReader class instance
        public static void main(String[] args) throws IOException {
            InputStreamReader input = new InputStreamReader(System.in);
            BufferedReader keyboardInput = new BufferedReader(input);
            System.out.print("Please enter a java file name:");
            String filename = keyboardInput.readLine();
            if (!filename.endsWith(".DAT")) {
                System.out.println("This is not a DAT file.");
                System.exit(0);
            }
            File File = new File(filename);
            if (File.exists()) {
                FileReader file = new FileReader(filename);
                StreamTokenizer streamTokenizer = new StreamTokenizer(file);
                int i = 0;
                int numberOfTokensGenerated = 0;
                while (i != StreamTokenizer.TT_EOF) {
                    i = streamTokenizer.nextToken();
                    numberOfTokensGenerated++;
                }
                // Output number of characters in the line
                System.out.println("Number of tokens =" + numberOfTokensGenerated);
                // Output tokens
                for (int counter = 0; counter < numberOfTokensGenerated; counter++) {
                    char character = file.toString().charAt(counter);
                    if (character == ' ')
                        System.out.println();
                    else
                        System.out.print(character);
                }
            } else {
                System.out.println("File does not exist!");
                System.exit(0);
            }

            System.out.println(" ");
        } // end main
    } // end class
When I run this code, this is what I get:
Number of tokens = 129
java.io.FileReader@19821fException in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 25
at java.lang.String.charAt(Unknown Source)
at org.apache.lucene.analysis.StringProcessing.main(StringProcessing.java:40)
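The input file will look like this:
-K1 Account
--OP1 Withdraw
PARAM1
Type Int
---PARAM2 Amount
Type Int
--OP2 Deposit
PARAM1
Type Int
---PARAM2 Amount
Type Int
--CA1 ACNO
--- Type Int
-K2 CheckAccount
SC Account
--CA1 Credit Limit
--- Type Int
-K3 Customer
--CA1 Name
--- Type String
-K4 Transaction
--CA1 Date
--- Type Date
--CA2 Time
--- Type Time
-K5 CheckBook
-K6 Check
-K7 Balance Account
--SC Account
I only want to read the ones starting with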
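If you want to tokenize an input file, the most obvious choice is a Scanner. The Scanner class reads a given input stream and can return either plain tokens or other scanned types (scanner.nextInt(), scanner.nextLine(), and so on).

    import java.io.File;
    import java.io.IOException;
    import java.util.Scanner;

    // The class name is arbitrary; main must live inside some class.
    public class TokenizeFile {
        public static void main(String[] args) throws IOException {
            // try-with-resources closes the Scanner (and the file) when done
            try (Scanner in = new Scanner(new File("filename.dat"))) {
                while (in.hasNext()) {
                    String s = in.next(); // get the next token in the file
                    // Now s contains a token from the file
                }
            }
        }
    }

See Oracle's documentation on the Scanner class for more information.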
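Since the goal is to write only some of the tokens to an output file, the Scanner loop can filter as it goes. The following is a minimal sketch, not a definitive solution: the class name `TokenFilter`, the method `writeMatchingTokens`, and the `"-K"` prefix are all assumptions made for illustration, because the question does not say exactly which prefixes are wanted.

```java
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Scanner;

final class TokenFilter {

    /**
     * Reads whitespace-separated tokens from source and writes those that
     * start with prefix to out, one per line. Returns how many were written.
     * (Hypothetical helper; the prefix is whatever the caller wants to keep.)
     */
    static int writeMatchingTokens(Readable source, Writer out, String prefix) {
        int written = 0;
        PrintWriter writer = new PrintWriter(out);
        try (Scanner in = new Scanner(source)) {
            while (in.hasNext()) {
                String token = in.next();
                if (token.startsWith(prefix)) {
                    writer.println(token);
                    written++;
                }
            }
        }
        writer.flush(); // push buffered output through to the underlying Writer
        return written;
    }

    public static void main(String[] args) {
        // In-memory sample standing in for the input file
        String sample = "-K1 Account\n--OP1 Withdraw\n-K2 CheckAccount\n";
        StringWriter out = new StringWriter();
        int count = writeMatchingTokens(new StringReader(sample), out, "-K");
        System.out.println(count + " matching tokens:");
        System.out.print(out);
    }
}
```

For real files, the same method should accept a `java.io.FileReader` as the source and a `java.io.FileWriter` as the destination; the caller is then responsible for closing the writer.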
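The problem is in this line:

    char character = file.toString().charAt(counter);

file.toString() does not return the contents of the file. FileReader inherits Object.toString(), which yields an identifier such as java.io.FileReader@19821f. That string is only 25 characters long, so charAt(25) throws StringIndexOutOfBoundsException as soon as the counter runs past its end. To read the file correctly, wrap the FileReader in a BufferedReader and append each line returned by readLine() to a StringBuilder.

    // Requires: import java.io.BufferedReader; import java.io.FileReader;
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes both readers when the block exits
    try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
        String s;
        while ((s = br.readLine()) != null) {
            // readLine() strips the line terminator, so add one back
            sb.append(s).append('\n');
        }
    }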
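If you would rather keep StreamTokenizer, the tokens have to be captured inside the nextToken() loop, because the tokenizer consumes the reader as it goes and nothing is stored for later. Below is a minimal sketch under the assumption that tokens are simply whitespace-separated; the class name `StreamTokens` and the method `collect` are hypothetical. It resets the syntax table because the default one treats `-` as the possible start of a number, `'` and `"` as quote characters, and `/` as a comment character, which would likely split or swallow entries such as `-K1`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class StreamTokens {

    /** Collects every whitespace-separated word the tokenizer produces. */
    static List<String> collect(Reader source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(source);
        st.resetSyntax();            // drop number, quote and comment handling
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        st.wordChars('!', 255);      // every other character is part of a word

        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                tokens.add(st.sval); // sval holds the text of the current word
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for the input file
        List<String> tokens = collect(new StringReader("-K1 Account\n--OP1 Withdraw\n"));
        System.out.println("Number of tokens = " + tokens.size());
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

Keeping the tokens in a list also gives an accurate count: the loop in the question increments its counter once more for the end-of-file marker.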