Ever try to run a grep search on a large file, and wish there was a way to speed things up?

After some late-night Googling, I ran across a method from dogbane over on StackOverflow for significantly speeding up a grep search.

I went ahead and dug deeper, and even set up a little test to try things out and understand what's going on.

As someone who's used grep for nearly a decade, I'm a bit embarrassed to say I'd never heard of this.

If you care to skip over my extensive research and are just curious about the actual testing results, I won't get offended, much.


Locale and internationalisation variables

In a shell execution environment, you can alter how programs behave with environment variables.

There is a special subset of these variables, the internationalisation (locale) variables, which control how internationalised applications behave; grep is one of those applications.

You can easily view your server's current locale setting by running:

root@server [~] locale

LC_ALL variable

One variable you can adjust is called LC_ALL. It sets all of the LC_* variables at once to a specified locale.

If we simply prepend LC_ALL=C to our command, we change the locale used by that command.

The C locale is the server's base Unix/Linux default, which treats text as plain single-byte ASCII.

root@server [~] LC_ALL=C locale
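To see the override in action, here's a quick sanity check (a sketch; the exact list of categories printed varies by system). With LC_ALL=C in front, every LC_* category the command sees reports the C locale:

```shell
# Filter a few representative categories out of the locale report:
LC_ALL=C locale | grep 'LC_CTYPE\|LC_COLLATE\|LC_ALL'
```

Each of the filtered lines should show the C locale rather than your usual en_US.utf8.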


This all might not make a whole lot of sense yet, but hang with me.

Basically, when you grep something, by default your locale is going to be internationalised and set to a UTF-8 locale such as en_US.utf8.

UTF-8 can represent every character in the Unicode character set, covering all of the world's writing systems: currently more than 110,000 unique characters.

So what's the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set comprises a whopping 128 unique characters.
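To make that gap concrete, here's a quick sketch: an ASCII character is always a single byte, while a UTF-8 character can take up to four bytes (the curly quote below takes three):

```shell
# An ASCII character occupies exactly one byte:
printf 'a' | wc -c

# A UTF-8 right curly quote (U+201D, bytes 0xE2 0x80 0x9D) occupies three:
printf '\342\200\235' | wc -c
```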

Servers and computers these days can quickly process data thrown at them, but the more efficiently we hand it data, the faster it will be able to accomplish the task and with fewer resources.

Using strace to see what's going on

I won't get too technical on it in this article, but strace is a utility to keep tabs on what a process is up to.

Below I'm displaying a one-line file with the cat command. The strace output is stored in a file called TRACE.

Then I call egrep to only show mentions of open and read operations:

root@server [~] strace -o TRACE cat TEST_FILE
This is a test

root@server [~] egrep "open|read" TRACE
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\332\1\0\0\0\0\0"..., 832) = 832
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
read(3, "# Locale name alias data base.\n#"..., 4096) = 2528
read(3, "", 4096)                       = 0
open("/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION", O_RDONLY) = 3
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MEASUREMENT", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_ADDRESS", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_NAME", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_PAPER", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_MONETARY", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY) = 3
open("/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY) = 3
open("TEST_FILE", O_RDONLY)             = 3
read(3, "This is a test\n", 4096)       = 15
read(3, "", 4096)                       = 0

Now here is the same thing with our little LC_ALL=C trick:

root@server [~] LC_ALL=C strace -o TRACE cat TEST_FILE
This is a test

root@server [~] egrep "open|read" TRACE

open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\332\1\0\0\0\0\0"..., 832) = 832
open("TEST_FILE", O_RDONLY)             = 3
read(3, "This is a test\n", 4096)       = 15
read(3, "", 4096)                       = 0

That was 19 opens and 5 reads for my first test, and 3 opens and 3 reads for the LC_ALL=C test.
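If you'd rather not eyeball the trace, you can count the calls directly with grep. The stand-in TRACE file below is just a three-line sample so the snippet is self-contained; run the same two counts against the real trace files above to reproduce the 19/5 and 3/3 figures:

```shell
# Three-line stand-in for a real strace log:
printf 'open("/etc/ld.so.cache", O_RDONLY) = 3\nread(3, "x", 4096) = 1\nread(3, "", 4096) = 0\n' > TRACE

# Count the open() and read() calls at the start of each line:
grep -c '^open(' TRACE
grep -c '^read(' TRACE
```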

You can see that in the default test we had to open multiple files in the /usr/lib/locale/en_US.utf8 directory.

The largest of these locale files are LC_COLLATE and LC_CTYPE.

I threw in a plain en_US locale for comparison's sake; it is much smaller than the utf8 version:

root@server [~] ls -lahSr /usr/lib/locale/en_US/LC_C* /usr/lib/locale/en_US.utf8/LC_C*
-rw-r--r--  43 root root  19K May 30 17:10 /usr/lib/locale/en_US/LC_COLLATE
-rw-r--r--  73 root root 203K May 30 17:10 /usr/lib/locale/en_US/LC_CTYPE

-rw-r--r-- 152 root root 233K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_CTYPE
-rw-r--r--  98 root root 860K May 30 17:10 /usr/lib/locale/en_US.utf8/LC_COLLATE    

Sorting things out

If you're anything like me, I'm sure at some point in your life you've had to recite your ABCs to figure out the alphabetical order of something. Now imagine having thousands and thousands of letters to keep track of, and having to keep starting over; doesn't sound too efficient, does it?

I bring up an alphabetical sorting example because it's important to note that you don't want to just always use LC_ALL=C for everything. I won't go in depth here, but just know that the sort command will give you different sort orders based on the locale.

root@server [~] cat TEST_FILE

root@server [~] sort TEST_FILE

root@server [~] LC_ALL=C sort TEST_FILE
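Since the contents of TEST_FILE aren't shown above, here's a tiny hypothetical file that makes the difference visible. C-locale collation is raw byte order, so every uppercase letter (bytes 65-90) sorts before any lowercase letter (97-122); a dictionary-style locale such as en_US.UTF-8 interleaves the cases instead:

```shell
printf 'banana\nApple\nCherry\n' > SORT_DEMO

# Byte-order collation puts both capitalised words first:
LC_ALL=C sort SORT_DEMO
# Apple
# Cherry
# banana

# With LC_ALL=en_US.UTF-8 (if that locale is installed), you'd get
# dictionary order instead: Apple, banana, Cherry
```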

Proof is in the pudding

Once I understood the basic principles of what was supposed to be happening, I was excited to start testing right away to see just how much of a boost in search speed I could get.

Avoiding filesystem caching

Now, due to the way Linux caches things from disk into memory, you might have noticed that the first time you grep a file it can take 5-10 seconds, but if you run the exact same search a bit later, it's almost instant.

That's because the filesystem caches the file into memory which is way faster than your hard drive.

I knew this going into my tests, so I knew I couldn't just run a timed grep against my file and then do it again a few seconds later without severely skewed results due to the system caching.

So what I did was first build up a 5MB or so test file, by running the following command:

grep wp-login.php /usr/local/apache/domlogs/ -R > WP_LOGINS

So now my WP_LOGINS file had about 21,000 lines of attempted wp-login.php requests.

I wanted a ton more to really see the impact on large files, so I proceeded to duplicate the contents of my WP_LOGINS file 100 times into a new file called WP_LOGINS2 with this command:

for i in {1..100}; do cat WP_LOGINS >> WP_LOGINS2; done

Now I've got a 504MB file with 2,100,000 lines, and that should provide a great testbed, at least for one of the tests. So I also duplicated this file multiple times to again avoid filesystem caching in-between tests.

Testing LC_ALL=C grep and fgrep performance

I ran 2 tests with the default grep command, looking for hits of wp-login.php and providing a count.

I also ran 2 with LC_ALL=C set first, and 2 using both LC_ALL=C and fgrep, which matches only fixed strings and is even more efficient for simple searches like this one.
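As a quick illustration of why fgrep (the same as grep -F) is both faster and safer for literal searches: to grep, the dot in wp-login.php is a regex metacharacter that matches any character, while fgrep treats the whole pattern as a fixed string. A hypothetical two-line file shows the difference:

```shell
printf 'wp-login.php\nwp-loginXphp\n' > PATTERN_DEMO

# As a regex, "." matches any character, so both lines count:
grep -c 'wp-login.php' PATTERN_DEMO

# As a fixed string, only the literal match counts:
grep -Fc 'wp-login.php' PATTERN_DEMO
```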

Here is the series of tests I ran:

time grep wp-login.php WP_LOGINS_001 -c
time grep wp-login.php WP_LOGINS_002 -c
time LC_ALL=C grep wp-login.php WP_LOGINS_003 -c
time LC_ALL=C grep wp-login.php WP_LOGINS_004 -c
time LC_ALL=C fgrep wp-login.php WP_LOGINS_005 -c
time LC_ALL=C fgrep wp-login.php WP_LOGINS_006 -c

The time command breaks the results into 3 values. Funny enough, the LC_ALL=C runs also produced their timing output in a different format, which is why the two sets of results below look different.

Here are the meanings behind these values:

real - How much wall clock time the test took

user - CPU seconds consumed in user space

sys - CPU seconds consumed in system space

Now here are the results from the tests:

real    0m9.545s
user    0m9.416s
sys     0m0.126s

real    0m9.445s
user    0m9.316s
sys     0m0.130s

1.37user 0.13system 0:01.50elapsed
1.37user 0.11system 0:01.48elapsed

0.54user 0.12system 0:00.67elapsed
0.54user 0.12system 0:00.66elapsed

Here it is in a table:

           real   user   sys
grep 1     9.55   9.42   0.13
grep 2     9.45   9.32   0.13
LC_ALL 1   1.50   1.37   0.13
LC_ALL 2   1.48   1.37   0.11
fgrep 1    0.67   0.54   0.12
fgrep 2    0.66   0.54   0.12


So there you have it, standard grep took 9 1/2 seconds.

Using the LC_ALL=C locale made the search roughly 6.4 times faster and brought that time down to about 1.5 seconds.

Adding fgrep made it roughly 14 times faster than standard grep and brought the time down to just over half a second.

If you skipped down here and were wondering why things got faster, check out my locale research above.

Needless to say, I'll be using this tactic in a ton of scripts and when doing manual grep searches going forward. Hopefully this information will help speed along your own searches as well.
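If you want to bake this tactic into your own workflow, one option is a small wrapper function in your shell profile (cgrep is a name I made up for this sketch, not a standard tool):

```shell
# Hypothetical helper: fixed-string grep in the C locale
cgrep() {
    LC_ALL=C grep -F "$@"
}

# Example usage against a throwaway log file:
printf 'wp-login.php hit\nsomething else\n' > demo.log
cgrep -c 'wp-login.php' demo.log
```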

Comments
2014-05-29 5:12 pm

You could also eliminate the Linux buffer cache from skewing your testing by dropping the caches before each test:

    echo 3 > /proc/sys/vm/drop_caches



2014-05-29 5:24 pm
Hello Noah, and thanks for the comment!

You are correct! That is another great way to make sure the pagecache isn't skewing results. However, be careful: your system may seem a bit sluggish as it rebuilds the pagecache after you totally clear it out.

Thanks again!

- Jacob
2014-09-12 3:07 pm

As you wrote, this may not give you what you were expecting:

LC_ALL=C sort moop.txt

But, this might:

LC_ALL=C sort -f moop.txt

2014-10-15 8:29 am

Hey, great article Jacob. This does affect more than meets the eye. For instance, download a file with UTF-8 characters in it, like many web pages, and then use an strace to see how grep's regex is affected:

$ export LANG=C LC_ALL=C
$ strace -f -q -e trace=write -o TRACE.CC 2>&1 grep -o '.\{1\}' t.html

When in ASCII mode grep will incorrectly count utf-8 characters:

write(1, "\342\n", 2)
write(1, "\200\n", 2)
write(1, "\235\n", 2)

vs

$ export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
$ strace -f -q -e trace=write -o TRACE.8 2>&1 grep -o '.\{1\}' t.html

When in UTF-8 mode grep will correctly count a utf-8 character as 1 character:

write(1, "\342\200\235\n", 4)
2015-02-20 1:50 pm

Tried this on a CentOS 5 system with no luck. Is it OS or distro specific?

2015-02-20 3:08 pm
Hello MadMan,

This tutorial was made on either a CentOS 5.6 or 6.0 server. It should work. Are you getting any errors?

Best Regards,
TJ Edens
2015-02-20 3:23 pm

No errors, just no improvement in time to run grep commands.

2015-03-02 3:14 pm
Thank you for sharing this great finding! This trick made my process go from 1.20 hours to a matter of seconds! Thanks again for making my life better =)
2016-11-03 4:54 pm

It does not have any effect on Ubuntu 14.04 and 16.04. It did work for me for sure the last time I tried it, back in 2008.


2017-05-30 3:51 pm

I imagine the reason it did not work for many people is that their default locale was already C (mine is). If you are unsure of what your default locale is, set LANG to en_US.UTF-8 or whatever before running the tests.
