Fastest way to find the number of lines in a text (C++)


Question

I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment the line_count variable at each iteration until i reach eof. It was not that fast in my case. I used both ifstream and fgets . They were both slow . Is there a hacky way to do this, which is also used by, for instance BSD, Linux kernel or berkeley db.(may be by using bitwise operations).

As I told before there are millions of lines in that file and it keeps get larger, each line has about 40 or 50 characters. I'm using Linux.

Note: I'm sure there will be people who might say use a DB idiot. But briefly in my case i can't use a db.

1
22
5/9/2009 11:28:46 AM

Accepted Answer

The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way tom do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.

As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks., say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.

Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:

#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
using namespace std;

unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1  ) {
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
17
5/9/2009 1:16:34 PM

You wrote that it keeps get larger. This sounds like it is a log file or something similar where new lines are appended but exisiting lines are not changed. If this is the case you could try an incremental approach.

Parse to the end of file. Remember the line count and the offset of EOF. When the file grows fseek to the offset, parse to EOF and update the line count and the offset.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon