Number of lines in a file in Java
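I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open them and read them line by line until I reach the end of the file.
I was wondering if there is a smarter way to do that.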
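This is the fastest version I have found so far, about 6 times faster than readLines(). On a 150 MB log file it takes 0.35 seconds, versus 2.40 seconds with readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.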
```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        // A non-empty file without any newline still counts as one line
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
```
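Edit, 9 1/2 years later: I have practically no Java experience, but I have tried to benchmark this code against the LineNumberReader solution anyway, because it bothered me that nobody had done so. My solution appears to be faster, especially for large files, although it seems to take a few runs before the optimizer does a decent job. I have played with the code a bit and produced a new version that is consistently the fastest: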
```java
public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}
```
Benchmark results for a 1.3 GB text file, y-axis in seconds. I performed 100 runs with the same file and measured each run with ...
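Since the optimizer seems to need a few runs before timings settle, a benchmark along these lines would discard some warm-up runs first. The following is only a rough sketch, not the harness used for the results above: the class name LineCountBenchmark, the Counter interface, the use of System.nanoTime() and the choice of the median are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Arrays;

final class LineCountBenchmark {
    /** Hypothetical callback type so that counters throwing IOException can be timed. */
    @FunctionalInterface
    interface Counter {
        long count() throws IOException;
    }

    /** Results are written here so the JIT cannot discard the work as dead code. */
    static volatile long sink;

    /** Runs the counter warmups times untimed, then runs times timed; returns the median in nanoseconds. */
    static long medianNanos(Counter counter, int warmups, int runs) throws IOException {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            sink = counter.count(); // warm-up: give the optimizer a chance to compile the loop
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink = counter.count();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2]; // the median is less sensitive to outliers than the mean
    }
}
```

It could be called as LineCountBenchmark.medianNanos(() -> countLinesNew("big.log"), 10, 100), where big.log is a placeholder file name.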
I have implemented another solution to this problem, and I found it more efficient at counting lines:
```java
// result is assumed to be an int declared by the surrounding code
try (
    FileReader input = new FileReader("input.txt");
    LineNumberReader count = new LineNumberReader(input);
) {
    while (count.skip(Long.MAX_VALUE) > 0) {
        // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
    }
    result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
```
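The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file without a trailing newline returns 1, but a two-line file without a trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work on every read except the last one, but its cost should be trivial compared with the function as a whole.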
```java
public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            // Remember whether the most recent chunk ended in the middle of a line
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if (endsWithoutNewLine) {
            ++count;
        }
        return count;
    } finally {
        is.close();
    }
}
```
With Java 8, you can use streams:
```java
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
    long numOfLines = lines.count();
    ...
}
```
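The count() method from the answer above gave me wrong line counts when a file had no newline at the end: it failed to count the last line of the file.
This method works better for me: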
```java
public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    // Read to the end of the file; the reader tracks the line number itself
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
```
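I know this is an old question, but the accepted solution did not quite match what I needed. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):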
```java
public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}
```
This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
I tested the line-counting methods above, and here are my observations for the different methods on my system.
File size: 1.6 GB. Methods:
Moreover, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]
```java
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}
```
Tested on JDK 8u31. In practice, though, its performance is slow compared with this method:
```java
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                // The last line has no trailing newline but still counts
                count++;
            }
        }

        return count;
    }
}
```
Tested, and very fast.
A straightforward way using Scanner:
```java
static void lineCounter(String path) throws IOException {
    int lineCount = 0, commentsCount = 0;

    // try-with-resources closes the Scanner (and the underlying file) when done
    try (Scanner input = new Scanner(new File(path))) {
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }
    }

    System.out.println("Line Count:" + lineCount + "\t Comments Count:" + commentsCount);
}
```
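My conclusion is that ...
And the @er.vikas solution, which is based on LineNumberReader but adds one to the line count, returns unintuitive results on files whose last line does end with a newline.
Therefore I made an algorithm that handles lines as follows: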
```java
@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}
```
And it looks like this:
```java
static long countLines(InputStream is) throws IOException {
    try (LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        // Read will return at least one byte, no need to buffer more
        while ((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            // No data read at all, i.e. file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                // Ending with newline, deduct one
                return ln;
            }
        }
        // normal case, return line number + 1
        return ln + 1;
    }
}
```
If you want intuitive results, you can use this. If you only want compatibility with ...
```java
try (LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while (lnr.skip(Long.MAX_VALUE) > 0) {};
    return lnr.getLineNumber();
}
```
How about using the Process class from within Java code, and then reading the output of the command?
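```java
// Pass the arguments separately so the file name is not glued onto "-l"
Process p = Runtime.getRuntime().exec(new String[] {"wc", "-l", yourfilename});

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;

while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}
p.waitFor();
```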
I still need to try it, though. I will post the results.
This funny solution actually works really well!
```java
public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        // Starts at 1, so a file with no newline at all counts as one line
        int count = 1;
        for (int aChar = 0; aChar != -1; aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}
```
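If you do not have any index structure, you cannot get around reading the complete file. But you can optimize it by avoiding reading line by line and using a regex to match all line terminators instead.
Scanner with a regex: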
```java
// filename is assumed to be a field of the enclosing class
public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}
```
I haven't timed it, though.
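To make the idea of matching line terminators with a regex concrete, here is a minimal sketch that counts them with a Matcher instead of a Scanner. It is an illustration under assumptions rather than the code from the answer above: the class name RegexLineCount is made up, the text is assumed to be in memory already as a CharSequence, and only \r\n, \r and \n are treated as terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLineCount {
    // \r\n is listed first so a Windows line ending is matched as one terminator
    private static final Pattern LINE_END = Pattern.compile("\r\n|\r|\n");

    /** Counts lines in text; a final line without a terminator still counts. */
    static long countLines(CharSequence text) {
        if (text.length() == 0) {
            return 0;
        }
        long count = 0;
        int lastEnd = 0;
        Matcher m = LINE_END.matcher(text);
        while (m.find()) {
            count++;
            lastEnd = m.end();
        }
        if (lastEnd < text.length()) {
            count++; // trailing data after the last terminator
        }
        return count;
    }
}
```

For a large file the text would have to be fed in chunks, taking care of a \r\n pair split across two chunks, which this sketch does not attempt.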
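On Unix-based systems, use the wc command on the command line.
The best-optimized code for multi-line files that have no newline ('\n') character at EOF: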
```java
/**
 *
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    isLine = false;
                    ++count;
                } else if (!isLine && c[i] != '\n' && c[i] != '\r') {
                    // Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if (isLine) {
            ++count;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (is != null) {
            is.close();
        }
        if (fis != null) {
            fis.close();
        }
    }
    // LOG is assumed to be a logger field of the enclosing class
    LOG.info("count:" + count);
    return (count == 0 && !empty) ? 1 : count;
}
```
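The only way to know how many lines a file contains is to count them. You can of course derive a metric from your data that gives you the average length of a line, then take the file size and divide it by that average length, but the result will not be accurate.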
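As a rough illustration of that estimate, the sketch below samples the beginning of the file, derives an average line length from it, and divides the file size by that average. The class and method names (LineCountEstimate, estimateLines) and the sampleBytes parameter are assumptions made for this example, and the result is only as good as the assumption that the sample is representative of the whole file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineCountEstimate {
    /** Estimates the line count from the first sampleBytes bytes of the file. */
    static long estimateLines(Path file, int sampleBytes) throws IOException {
        if (sampleBytes <= 0) {
            throw new IllegalArgumentException("sampleBytes must be positive");
        }
        long size = Files.size(file);
        if (size == 0) {
            return 0;
        }
        byte[] sample = new byte[(int) Math.min(size, (long) sampleBytes)];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until the sample is full
            while (read < sample.length
                    && (n = in.read(sample, read, sample.length - read)) != -1) {
                read += n;
            }
        }
        int newlines = 0;
        for (int i = 0; i < read; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            return 1; // no terminator in the sample: treat the file as one long line
        }
        double averageLineLength = (double) read / newlines;
        return Math.round(size / averageLineLength);
    }
}
```

On files with uniform line lengths the estimate can be exact; on logs with highly variable lines it can be far off, which is why counting remains the only reliable method.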
If you use this
```java
public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
```
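you cannot count more lines than an int can hold, because reader.getLineNumber() returns an int. A file with 100K lines is well within range, but beyond Integer.MAX_VALUE (2,147,483,647) lines you need a long to hold the count.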
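A counter that keeps its own long total avoids the int limit. The following is a minimal sketch, assuming the file is UTF-8 text; the class name LongLineCount is made up for this example, and Files.newBufferedReader will throw a MalformedInputException on bytes that are not valid UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LongLineCount {
    /** Counts lines with a long counter; a last line without a newline is counted too. */
    static long countLines(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

This trades some speed for simplicity, since readLine() allocates a String per line.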