Ever try to run a grep search on a large file, and wish there was a way to speed things up?

After some late night Googling, I ran across a proposed method of significantly speeding up a grep search from dogbane over on StackOverflow.

I went ahead and dug deeper with research and even set up a little test to try things out and understand what's going on.

As someone who's used grep for nearly a decade, I'm a bit embarrassed to say I'd never heard of this.

If you'd rather skip my extensive research and are just curious about the actual testing results, I won't get offended, much.

Locale and internationalisation variables

In a shell execution environment, you alter the environment's behaviour with variables.

There is a special subset of internationalisation variables that control how locale-aware applications behave, and grep is one of those applications.

You can easily view your server's current locale setting by running:

root@server [~] locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

LC_ALL variable

One variable you can adjust is called LC_ALL. It overrides all of the LC_ type variables at once with a specified locale.

If we simply prepend LC_ALL=C to our command, we change the locale used by that command.

When using the locale C, the command falls back to the basic POSIX locale, where every character is a single byte and the character set is plain ASCII.
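Here's a minimal sketch of what "used by that command" means in practice. It assumes a POSIX-style shell, and the variable names inside and after are just ones I made up for illustration. The LC_ALL=C prefix is only visible to the one command it sits in front of, and the shell itself is left alone:

```shell
# Start from a known state: no LC_ALL in the current shell.
# (Assumption: you don't mind LC_ALL being unset in this session.)
unset LC_ALL

# The assignment prefix applies to this single child command only.
inside=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')

# Back in the shell itself, LC_ALL is still unset.
after=${LC_ALL:-unset}

echo "inside the command: $inside"
echo "in the shell afterwards: $after"
```

That's why you can safely prefix a single grep without changing how the rest of your session behaves.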

root@server [~] LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"       
LC_ALL=C
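Notice in the output above that LANG still says en_US.UTF-8 while every LC_ category reports C. That's the POSIX lookup order at work: LC_ALL wins if it's set, then the specific LC_ variable, then LANG. The helper below is a hypothetical pure-shell sketch of that order (effective_locale is not a real command, and falling back to C when nothing is set is my assumption, since that default is left to the system):

```shell
# Hypothetical helper: work out which locale a category ends up with.
# Arguments: value of LC_ALL, value of the LC_<category> variable, value of LANG.
effective_locale() {
    lc_all=$1
    category=$2
    lang=$3
    # First non-empty value wins: LC_ALL, then the category, then LANG.
    # Assumption: fall back to "C" when all three are empty.
    printf '%s' "${lc_all:-${category:-${lang:-C}}}"
}

# Only LANG is set, like the first locale output in this article.
default_case=$(effective_locale "" "" "en_US.UTF-8")

# LC_ALL=C is set, so it overrides LANG for every category.
override_case=$(effective_locale "C" "" "en_US.UTF-8")

echo "without LC_ALL: $default_case"
echo "with LC_ALL=C:  $override_case"
```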

UTF-8 vs ASCII

This all might not make a whole lot of sense yet, but hang with me.

Basically when you grep something, by default your locale is going to be internationalised and set to UTF-8.

UTF-8 can represent every character in the Unicode character set to help display any of the world's writing systems, currently more than 110,000 unique characters.

So what's the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set comprises a whopping 128 unique characters.

Servers and computers these days can quickly process data thrown at them, but the more efficiently we hand them data, the faster they can accomplish the task, and with fewer resources.
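One concrete difference between the two is character width. In ASCII every character is exactly 1 byte, while UTF-8 uses anywhere from 1 to 4 bytes per character, so a UTF-8 aware tool has extra decoding work to do. Here's a small sketch that counts bytes for a plain ASCII word and for the same word with one accented letter (the sample words and variable names are just my own illustration):

```shell
# "cafe" is 4 ASCII characters, so 4 bytes.
ascii_bytes=$(printf 'cafe' | wc -c | tr -d ' ')

# "caf" plus e-acute, written as its two UTF-8 bytes in octal (\303\251).
# Still 4 characters to a human, but 5 bytes on disk.
utf8_bytes=$(printf 'caf\303\251' | wc -c | tr -d ' ')

echo "ASCII word: $ascii_bytes bytes"
echo "UTF-8 word: $utf8_bytes bytes"
```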

Using strace to see what's going on

I won't get too technical on it in this article, but strace is a utility to keep tabs on what a process is up to.

Below I'm displaying a file with 1 line with the cat command. The strace output is stored in a file called TRACE.

Then I call egrep to only show mentions of open and read operations:

root@server [~] strace -o TRACE cat TEST_FILE
This is a test

root@server [~] egrep "open|read" TRACE
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\332\1\0\0\0\0\0"..., 832) = 832
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
read(3, "# Locale name alias data base.\n#"..., 4096) = 2528
read(3, "", 4096)                       = 0
open("/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MEASUREMENT", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_ADDRESS", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_NAME", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_PAPER", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MONETARY", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY) = 3
open("TEST_FILE", O_RDONLY)             = 3
read(3, "This is a test\n", 4096)       = 15
read(3, "", 4096)

Now here is the same thing with our little LC_ALL=C trick:

root@server [~] LC_ALL=C strace -o TRACE cat TEST_FILE
This is a test

root@server [~] egrep "open|read" TRACE

open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\332\1\0\0\0\0\0"..., 832) = 832
open("TEST_FILE", O_RDONLY)             = 3
read(3, "This is a test\n", 4096)       = 15
read(3, "", 4096)

That was 19 opens and 5 reads for my first test, and 3 opens and 3 reads for the LC_ALL=C test.
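If you don't feel like counting lines by eye, grep -c will do it for you. This sketch doesn't run strace itself. It just copies the LC_ALL=C trace lines from above into a temporary file (a stand-in I'm using for the real TRACE file) and counts them:

```shell
# Temporary stand-in for the TRACE file, filled with the lines shown above.
trace=$(mktemp)
cat > "$trace" <<'EOF'
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\332\1\0\0\0\0\0"..., 832) = 832
open("TEST_FILE", O_RDONLY)             = 3
read(3, "This is a test\n", 4096)       = 15
read(3, "", 4096)
EOF

# Count the lines that start with each call name.
opens=$(grep -c '^open(' "$trace")
reads=$(grep -c '^read(' "$trace")
rm -f "$trace"

echo "opens: $opens, reads: $reads"
```

Against a real trace you would point the two grep -c commands at TRACE instead of the temporary file.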

You can see that in the default test we had to open multiple files in the /usr/lib/locale/en_US.utf8 directory.

The largest of these locale files are LC_COLLATE and LC_CTYPE.

I threw in a plain en_US locale for comparison's sake, which is much smaller than the utf8 version:

root@server [~] ls -lahSr /usr/lib/locale/en_US/LC_C* /usr/lib/locale/en_US.utf8/LC_C*
-rw-r--r--  43 root root  19K May 30 17:10 /usr/lib/locale/en_US/LC_COLLATE
-rw-r--r--  73 root root 203K May 30 17:10 /usr/lib/locale/en_US/LC_CTYPE

-rw-r--r-- 152 root root 233K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_CTYPE
-rw-r--r--  98 root root 860K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_COLLATE    

Sorting things out

If you're anything like me, I'm sure at some point in your life you've had to recite your ABCs to figure out the alphabetical sorting of something. Imagine having thousands and thousands of letters to keep track of and having to keep starting over. Doesn't sound too efficient, does it?

I bring up an alphabetical sorting example because it's important to note that you don't want to go using LC_ALL=C for everything. I won't go in depth here, but just know that the sort command gives you different orderings depending on the locale.

root@server [~] cat TEST_FILE
C
B
A
c
a
b

root@server [~] sort TEST_FILE
a
A
b
B
c
C

root@server [~] LC_ALL=C sort TEST_FILE
A
B
C
a
b
c
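The reason LC_ALL=C puts all the capitals first is that the C locale simply compares byte values, and in ASCII the uppercase letters have lower numbers than the lowercase ones. Here's a quick sketch to check that for yourself (the variable names are only for illustration):

```shell
# printf '%d' with a leading quote prints the numeric value of the character.
upper=$(printf '%d' "'A")
lower=$(printf '%d' "'a")

# Same six letters as TEST_FILE, sorted by raw byte value.
sorted=$(printf 'C\nB\nA\nc\na\nb\n' | LC_ALL=C sort | tr -d '\n')

echo "A is byte $upper, a is byte $lower"
echo "LC_ALL=C sort order: $sorted"
```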

Proof is in the pudding

Once I understood the basic principles of what was supposed to be happening, I was excited to start testing right away to see just how much of a boost in search speed I could get.

Avoiding filesystem caching

Now due to the way Linux caches things from the disk into memory, you might have noticed if you grep a file, the first time you do it could take 5-10 seconds. But if you do the exact same search a bit later, it's almost instant.

That's because the filesystem caches the file into memory which is way faster than your hard drive.

I knew this going into my tests, so I knew I couldn't just run a timed grep against my file and then do it again a few seconds later without severely skewed results due to the system caching.

So what I did was first build up a 5MB or so test file, by running the following command:

grep wp-login.php /usr/local/apache/domlogs/ -R > WP_LOGINS

So now my WP_LOGINS file had about 21,000 lines of attempted wp-login.php requests.

I wanted a ton more to really see the impact on large files, so I proceeded to duplicate the contents of my WP_LOGINS file 100 times into a new file called WP_LOGINS2 with this command:

for i in {1..100}; do cat WP_LOGINS >> WP_LOGINS2; done

Now I've got a 504MB file with 2,100,000 lines, which should provide a great testbed, at least for one test. To avoid filesystem caching between tests, I also duplicated this file multiple times so that each test reads its own copy (WP_LOGINS_001 through WP_LOGINS_006).
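I didn't record the exact command I used for the duplication, so treat this as a rough sketch of one way to do it. The make_copies function is a hypothetical helper, and its arguments (source file, name prefix, number of copies) are my own choice:

```shell
# Hypothetical helper: make numbered copies such as PREFIX_001, PREFIX_002, ...
# Arguments: source file, name prefix for the copies, how many copies to make.
make_copies() {
    src=$1
    prefix=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # %03d pads the counter to three digits: 001, 002, ...
        cp "$src" "$(printf '%s_%03d' "$prefix" "$i")"
        i=$((i + 1))
    done
}

# Example (not run here): make_copies WP_LOGINS2 WP_LOGINS 6
```

Keep in mind that a freshly written copy can itself sit in the cache for a while, so how well this trick works probably depends on how much memory the server has compared to the size of the files.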

Testing LC_ALL=C grep and fgrep performance

I ran 2 tests with the default grep command, looking for hits of wp-login.php and providing a count.

I also ran 2 with LC_ALL=C set first, and 2 using both LC_ALL=C and fgrep, which matches only fixed strings and is even more efficient for simple searches like this one.
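To show what "fixed strings" means, here's a tiny sketch on a made-up 4 line sample (the sample lines are invented, not from my logs). With plain grep the dot in wp-login.php is a regex wildcard that matches any character, while grep -F, which is the modern spelling of fgrep, treats the dot literally:

```shell
# Invented sample: two real wp-login.php hits, one near miss, one unrelated line.
sample=$(mktemp)
printf '%s\n' \
    'GET /wp-login.php HTTP/1.1' \
    'GET /index.php HTTP/1.1' \
    'POST /wp-login.php HTTP/1.1' \
    'GET /wp-login-php HTTP/1.1' > "$sample"

# Regex search: the "." also matches the "-" in wp-login-php.
regex_hits=$(LC_ALL=C grep -c wp-login.php "$sample")

# Fixed-string search: only the literal text wp-login.php counts.
fixed_hits=$(LC_ALL=C grep -F -c wp-login.php "$sample")
rm -f "$sample"

echo "regex hits: $regex_hits, fixed-string hits: $fixed_hits"
```

So besides being faster, the fixed-string search is also the stricter match when all you want is a literal string.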

Here is the series of tests I ran:

time grep wp-login.php WP_LOGINS_001 -c
time grep wp-login.php WP_LOGINS_002 -c
time LC_ALL=C grep wp-login.php WP_LOGINS_003 -c
time LC_ALL=C grep wp-login.php WP_LOGINS_004 -c
time LC_ALL=C fgrep wp-login.php WP_LOGINS_005 -c
time LC_ALL=C fgrep wp-login.php WP_LOGINS_006 -c

The time command breaks its results into 3 values. Funnily enough, the runs with LC_ALL=C printed their timings in a different layout (the one-line format used by the standalone GNU time program rather than the shell's built-in time), which is why those results look different.

Here are the meanings behind these values:

real - How much wall clock time the test took

user - CPU seconds consumed in user space

sys - CPU seconds consumed in system space

Now here are the results from the tests:

real    0m9.545s
user    0m9.416s
sys     0m0.126s

real    0m9.445s
user    0m9.316s
sys     0m0.130s

1.37user 0.13system 0:01.50elapsed
1.37user 0.11system 0:01.48elapsed

0.54user 0.12system 0:00.67elapsed
0.54user 0.12system 0:00.66elapsed

Here it is in a table:

Test        real (s)   user (s)   sys (s)
grep 1      9.55       9.42       0.13
grep 2      9.45       9.32       0.13
LC_ALL 1    1.50       1.37       0.13
LC_ALL 2    1.48       1.37       0.11
fgrep 1     0.67       0.54       0.12
fgrep 2     0.66       0.54       0.12

Conclusion

So there you have it: standard grep took about 9 1/2 seconds.

Using the LC_ALL=C locale ran the search at about 640% of the original speed, roughly 6.4 times as fast, and brought that time down to 1 1/2 seconds.

Adding fgrep ran it at about 1427% of the original speed, roughly 14.3 times as fast, and brought that time down to just over 1/2 second.
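If you want to double-check those ratios, here's a small sketch that divides the average real time of the 2 default runs by the average of each faster pair. The speedup function is a hypothetical helper, and the averages (9.495, 1.49 and 0.665 seconds) are worked out from the results listed earlier:

```shell
# Hypothetical helper: how many times faster is the second time than the first?
# LC_ALL=C keeps awk printing a "." as the decimal point in any locale.
speedup() {
    LC_ALL=C awk -v base="$1" -v after="$2" 'BEGIN { printf "%.1f", base / after }'
}

# Average real times in seconds: default grep, LC_ALL=C grep, LC_ALL=C fgrep.
lc_all_speedup=$(speedup 9.495 1.49)
fgrep_speedup=$(speedup 9.495 0.665)

echo "LC_ALL=C grep:  ${lc_all_speedup}x as fast"
echo "LC_ALL=C fgrep: ${fgrep_speedup}x as fast"
```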

If you skipped down here and were wondering why things got faster, check out my locale research above.

Needless to say, I'll be using this tactic in a ton of scripts and when doing manual grep searches going forward. Hopefully this information will help speed along your own searches as well.
