Number of lines in a file in Java

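I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open a file and read it line by line until I reach the end.

I was wondering whether there is a smarter way to do that.

This is the fastest version I have found so far, about 6 times faster than readLines(). On a 150 MB log file it takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.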

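import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            // count the line feeds in the chunk that was just read
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        // a non-empty file without any line feed still counts as one line
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}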

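Edit, 9 1/2 years later: I have practically no Java experience, but anyway I have tried to benchmark this code against the LineNumberReader solution below, since it bothered me that nobody had done it. Especially for large files, my solution seems to be faster, although it seems to take a few runs before the optimizer does a decent job. I have played with the code a bit and produced a new version that is consistently the fastest: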

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

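Benchmark results for a 1.3 GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none; it is only slightly faster, but the difference is statistically significant. LineNumberReader is clearly slower.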
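The measurement described above (repeated runs over the same file, each timed with System.nanoTime()) could be sketched roughly as follows. This is only an illustration, not the harness the author used: the class name LineCountTiming, the LineCounter interface and the timeRuns helper are hypothetical names, and a careful comparison would normally use a benchmarking framework that accounts for JIT warm-up.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class LineCountTiming {

    /** Hypothetical interface so that any counting method can be timed. */
    interface LineCounter {
        int count(String filename) throws IOException;
    }

    /** Counts '\n' bytes; the same idea as countLinesOld, without its special cases. */
    static int countNewlines(String filename) throws IOException {
        try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) {
            byte[] buf = new byte[1024];
            int count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    /** Runs the counter the given number of times; returns each duration in nanoseconds. */
    static long[] timeRuns(LineCounter counter, String filename, int runs) throws IOException {
        long[] nanos = new long[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            counter.count(filename);
            nanos[r] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the file to measure; 100 runs as in the description above
        for (long t : timeRuns(LineCountTiming::countNewlines, args[0], 100)) {
            System.out.println(t / 1e9); // seconds per run
        }
    }
}
```

The first few durations are typically dominated by JIT compilation and a cold file cache, which matches the remark that the optimizer needs a few runs.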

Benchmark Plot

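Comments

You are right, David, I thought the JVM would be good enough for this... I have updated the code; this one is faster.
BufferedInputStream should be doing the buffering for you, so I don't see how using an intermediate byte[] array makes it any faster. You are unlikely to do much better than calling readLine() repeatedly anyway (since that is what the API is optimized for).
I have benchmarked it with and without the BufferedInputStream, and it is faster when using it.
You are going to close that InputStream when you are done with it, aren't you?
If buffering helped, it would be because BufferedInputStream buffers 8K by default. Increase your byte[] to that size or larger and you can drop the BufferedInputStream. For example, try 1024*1024 bytes.
Works great until I used it on some Mac-format files, or on files where the last line has no "\n" character. The count is incorrect in those situations. It is fast, but I think I'll stick to the "fits all" readLine() approach.
@bendin Thanks, fixed.
Two things: (1) The definition of a line terminator in Java source is a carriage return, a line feed, or a carriage return followed by a line feed. Your solution won't work for CR used as a line terminator. Granted, the only OS I can think of that uses CR as the default line terminator is Mac OS prior to Mac OS X. (2) Your solution assumes a character encoding such as US-ASCII or UTF-8. The line count may be inaccurate for encodings such as UTF-16.
@Nathan_Ryan: I just got logs from a Java application that outputs responses from some TCP host service, and there were a number of CRs in them. The program using the snippet above failed gracefully.
Nice. I would make this method static and rename it countLines. Cheers.
For what it's worth, I already had my byte[], so I used the following: private int countLines(byte[] file) throws IOException { InputStream is = new ByteArrayInputStream(file); ...
This method reports one line too few... see my answer below.
It will fail on files that use something other than \n as the line terminator. It also miscounts by one on files without a final end-of-line. What I need counted is not lines as such, but the number of occurrences of the character sequences that terminate lines.
A better way to do this is try-with-resources: try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) { /* rest of the code as above, without the finally block */ }
Awesome code... for a 400 MB text file it took just a second. Thanks a lot, @martinus.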
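Several suggestions from the comments above (try-with-resources, a larger byte[] in place of BufferedInputStream, and counting a final line that has no trailing '\n') can be combined roughly as follows. This is a minimal sketch, not the accepted answer's code: the class name BigBufferLineCount is hypothetical, the 1 MiB buffer size is simply the value proposed in the comments, and only '\n' is treated as a terminator, so the CR-only and UTF-16 caveats still apply.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class BigBufferLineCount {

    /** Assumed buffer size, taken from the 1024*1024 suggestion in the comments. */
    private static final int BUFFER_SIZE = 1024 * 1024;

    static long countLines(String filename) throws IOException {
        // try-with-resources closes the stream even if read() throws
        try (InputStream is = new FileInputStream(filename)) {
            return countLines(is);
        }
    }

    static long countLines(InputStream is) throws IOException {
        byte[] buf = new byte[BUFFER_SIZE];
        long count = 0;
        boolean sawData = false;
        boolean endsWithNewline = false;
        int n;
        while ((n = is.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    count++;
                }
            }
            if (n > 0) {
                sawData = true;
                endsWithNewline = (buf[n - 1] == '\n');
            }
        }
        // Non-empty input that does not end in '\n' has one more, unterminated, line.
        if (sawData && !endsWithNewline) {
            count++;
        }
        return count;
    }
}
```

Returning a long rather than an int is a deliberate choice here so that the counter cannot overflow on extremely large inputs.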

I have implemented another solution to this problem, and I found it more efficient at counting lines:

int result;
try (
    FileReader input = new FileReader("input.txt");
    LineNumberReader count = new LineNumberReader(input)
) {
    while (count.skip(Long.MAX_VALUE) > 0) {
        // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
    }

    result = count.getLineNumber() + 1; // +1 because line index starts at 0
}

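The accepted answer has an off-by-one error for multi-line files that do not end with a newline. A single-line file without a trailing newline returns 1, but a two-line file without a trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work for everything but the final read, but its cost should be trivial compared to the function as a whole.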

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            // remember whether the most recent chunk ended without a line feed
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if (endsWithoutNewLine) {
            ++count;
        }
        return count;
    } finally {
        is.close();
    }
}

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
    long numOfLines = lines.count();
    ...
}
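For reference, a self-contained version of the stream approach might look like the sketch below. The class name StreamLineCount and the choice of UTF-8 in main are assumptions for illustration. Because Files.lines decodes the file with the given charset, bytes that are not valid in that charset surface as an UncheckedIOException while the stream is consumed.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class StreamLineCount {

    /** Counts lines as Files.lines sees them; a final line without a terminator is included. */
    static long countLines(Path path, Charset charset) throws IOException {
        // The stream holds an open file, so it is closed via try-with-resources.
        try (Stream<String> lines = Files.lines(path, charset)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(Paths.get(args[0]), StandardCharsets.UTF_8));
    }
}
```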

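The answer above that uses a count() method gave me wrong line counts when the file had no newline at the end: it failed to count the last line in the file.

This method works better for me: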

public int countLines(String filename) throws IOException {
    // try-with-resources closes the reader even if readLine() throws
    try (LineNumberReader reader = new LineNumberReader(new FileReader(filename))) {
        while (reader.readLine() != null) {
            // just consume the lines; the reader keeps track of the line number
        }
        return reader.getLineNumber();
    }
}

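I know this is an old question, but the accepted solution didn't quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All-in-one method (refactor as appropriate):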

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
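To see why decoding with the right character encoding matters, consider the small demonstration below. It is a hypothetical illustration (the class and method names are made up for this sketch): in UTF-16BE the character U+010A is encoded as the bytes 01 0A, so a byte-based scan for 0x0A finds a "line feed" that is not one.

```java
import java.nio.charset.StandardCharsets;

class Utf16Pitfall {

    /** Counts raw 0x0A bytes, as the byte-based solutions do. */
    static int countNewlineBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }

    /** Counts line feed characters in already decoded text. */
    static int countNewlineChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "\u010A\n"; // one line: U+010A followed by a line feed
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE); // bytes: 01 0A 00 0A
        System.out.println(countNewlineBytes(utf16)); // prints 2
        System.out.println(countNewlineChars(text));  // prints 1
    }
}
```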

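I tested the line-counting methods above; here are my observations for the different methods as tested on my system.

File size: 1.6 GB. Methods:

Using Scanner: approx. 35 s

Using BufferedReader: approx. 5 s

Using Java 8: approx. 5 s

Using LineNumberReader: approx. 5 s

Moreover, the Java 8 approach seems very handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]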

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}

Tested on JDK 8u31. But the performance is actually slow compared to this method:

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                // data after the last line feed: count the unterminated final line
                count++;
            }
        }

        return count;
    }
}

Tested and very fast.

The straightforward way, using Scanner:

static void lineCounter(String path) throws IOException {
    int lineCount = 0, commentsCount = 0;

    // try-with-resources so the Scanner (and the file behind it) is closed
    try (Scanner input = new Scanner(new File(path))) {
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }
    }

    System.out.println("Line Count:" + lineCount + "\t Comments Count:" + commentsCount);
}

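I have concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line does not end with a newline.

And @er.vikas's solution, which is based on LineNumberReader but adds one to the line count, returns non-intuitive results on files where the last line does end with a newline.

I therefore made an algorithm that handles the following cases: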

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try (LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        // Read will return at least one byte, no need to buffer more
        while ((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            // No data read at all, i.e. file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                // Ending with newline, deduct one
                return ln;
            }
        }
        // normal case, return line number + 1
        return ln + 1;
    }
}

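If you want intuitive results, you can use this. If you just want compatibility with wc -l, simply use @er.vikas's solution, but don't add one to the result, and retry the skip: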

try (LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while (lnr.skip(Long.MAX_VALUE) > 0) {};
    return lnr.getLineNumber();
}

How about using the Process class from within Java code, and then reading the output of the command?

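// Pass the arguments separately so that "wc", "-l" and the file name do not run together
Process p = Runtime.getRuntime().exec(new String[] {"wc", "-l", yourfilename});
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc -l prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}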

Need to try it, though. Will post the results.

This funny solution actually works really well!

public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1; aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}

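If you don't have any index structure, you won't get around reading the complete file. But you can optimize it by avoiding reading it line by line and using a regex to match all line terminators.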
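As a rough illustration of matching line terminators with a regex, the sketch below counts occurrences of \r\n, \r and \n in text that is already in memory. The class name RegexTerminatorCount is hypothetical, and note that it counts terminators rather than lines, so a final line without a terminator is not included. For very large files the text would have to be fed in chunks, with extra care for a \r\n pair split across two chunks, which this sketch does not attempt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTerminatorCount {

    /** \r\n is listed first so that a CR LF pair is matched as one terminator. */
    private static final Pattern LINE_TERMINATOR = Pattern.compile("\\r\\n|\\r|\\n");

    static int countTerminators(CharSequence text) {
        Matcher m = LINE_TERMINATOR.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTerminators("a\r\nb\rc\n")); // prints 3
    }
}
```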

Scanner with regex:

// filename is assumed to be a field of the enclosing class
public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

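Haven't timed it, though.

On Unix-based systems, use the wc command on the command line.

Best optimized code for multi-line files that have no newline ('\n') character at EOF.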

/**
 * Counts the lines in a file, including a last line that has no newline at EOF.
 *
 * @param filename path of the file to count
 * @return number of lines
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    isLine = false;
                    ++count;
                } else if (!isLine && c[i] != '\n' && c[i] != '\r') { // Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if (isLine) {
            ++count;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (is != null) {
            is.close();
        }
        if (fis != null) {
            fis.close();
        }
    }
    LOG.info("count:" + count);
    return (count == 0 && !empty) ? 1 : count;
}

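The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data that gives you the average length of a line, then take the file size and divide it by that average, but that won't be accurate.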
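The average-length estimate mentioned above could be sketched like this. Everything here is an assumption for illustration: the class name LineCountEstimate, the idea of sampling only the start of the file, and the convention of returning -1 when the sample contains no line feed. The result is exact only if the sampled lines are representative of the whole file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class LineCountEstimate {

    /**
     * Estimates the line count from the first sampleBytes bytes of the file.
     * Returns -1 if the sample contains no '\n', since no average can be formed.
     */
    static long estimateLines(Path path, int sampleBytes) throws IOException {
        long fileSize = Files.size(path);
        byte[] sample = new byte[sampleBytes];
        int filled = 0;
        try (InputStream is = Files.newInputStream(path)) {
            int n;
            while (filled < sampleBytes
                    && (n = is.read(sample, filled, sampleBytes - filled)) != -1) {
                filled += n;
            }
        }
        int newlines = 0;
        int lastNewline = -1;
        for (int i = 0; i < filled; i++) {
            if (sample[i] == '\n') {
                newlines++;
                lastNewline = i;
            }
        }
        if (newlines == 0) {
            return -1;
        }
        // Average bytes per line (terminator included) over the complete lines in the sample
        double averageLength = (lastNewline + 1) / (double) newlines;
        return Math.round(fileSize / averageLength);
    }
}
```

For a file whose lines all have the same length the estimate matches the true count; for logs with widely varying line lengths it can be far off, which is the inaccuracy the answer warns about.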

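If you use the LineNumberReader approach, keep in mind that reader.getLineNumber() returns an int, so it cannot report more than Integer.MAX_VALUE (2,147,483,647) lines. To handle files with more lines than that, you need a counter of type long.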
