Ever try to run a grep search on a large file, and wish there was a way to speed things up?
I dug deeper into it and even set up a little test to try things out and understand what’s going on.
As someone who’s used grep for nearly a decade, I’m a bit embarrassed to say I’d never heard of this.
If you’d rather skip my extensive research and are just curious about the actual test results, I won’t get offended, much.
Locale and internationalisation variables
In a shell execution environment, you alter the environment’s behaviour with variables.
There is a special subset of internationalisation variables that control how internationalised applications behave, and grep is one of these applications.
You can easily view your server’s current locale setting by running:
[email protected] [~] locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
One variable you can adjust is called LC_ALL. It overrides all of the LC_ type variables at once with a specified locale.
If we simply put LC_ALL=C in front of our command, we change the locale used by that command only.
The C locale is the minimal default Unix/Linux locale, in which text is handled as plain ASCII.
[email protected] [~] LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
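To see that the LC_ALL=C prefix only affects the one command it is attached to, here is a minimal sketch. The sample file and the variable names (sample_file, match, child_value, shell_value_before) are made up for illustration, and the charmap line assumes the locale utility is installed:

```shell
# Remember whatever LC_ALL the current shell has (usually empty).
shell_value_before="${LC_ALL:-}"

# Hypothetical three-line sample file, created only for this demo.
sample_file="$(mktemp)"
printf 'alpha\nbeta\ngamma\n' > "$sample_file"

# The assignment prefix applies to this single grep invocation.
match="$(LC_ALL=C grep 'beta' "$sample_file")"
echo "match: $match"

# A child process started with the prefix sees LC_ALL=C ...
child_value="$(LC_ALL=C sh -c 'printf %s "$LC_ALL"')"
echo "LC_ALL inside the child: $child_value"

# ... while the calling shell keeps its own value.
echo "LC_ALL in the shell: '${LC_ALL:-}'"

# If the locale utility exists, ask which character set the C locale uses.
# On glibc systems this typically prints ANSI_X3.4-1968, an alias for ASCII.
if command -v locale >/dev/null 2>&1; then
  LC_ALL=C locale charmap
fi

rm -f "$sample_file"
```

Because nothing is exported, there is nothing to undo afterwards, which makes the prefix form handy for one-off searches.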
UTF8 vs ASCII
This all might not make a whole lot of sense yet, but hang with me.
Basically, when you grep something, by default your locale is going to be internationalised and set to UTF-8.
UTF-8 can represent every character in the Unicode character set to help display any of the world’s writing systems, currently more than 110,000 unique characters.
So what’s the big deal? Well, typically you grep through files encoded in ASCII. The ASCII character set comprises a whopping 128 unique characters.
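Servers and computers these days can quickly process data thrown at them, but the more efficiently we hand them data, the faster they can accomplish the task, and with fewer resources. In a UTF-8 locale a single character can take up to 4 bytes, so grep has to do extra work to figure out where each character begins and ends; in the C locale every byte is simply one character.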
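A small sketch makes the byte-versus-character difference concrete. It feeds the word "café" (written with octal escapes so the bytes are explicit) to wc. The variable names are made up for illustration, and the last step assumes a locale named en_US.UTF-8 is installed, so it is guarded:

```shell
# 'é' is U+00E9; in UTF-8 it is the two bytes 0xC3 0xA9 (octal 303 251),
# so "café" is 4 characters but 5 bytes.
word_bytes="$(printf 'caf\303\251' | wc -c | tr -d ' ')"
echo "bytes: $word_bytes"

# In the C locale every byte counts as one character, so -m agrees with -c.
chars_in_c="$(printf 'caf\303\251' | LC_ALL=C wc -m | tr -d ' ')"
echo "characters in the C locale: $chars_in_c"

# Assumption: en_US.UTF-8 is installed (it shows up as en_US.utf8 on many
# Linux systems). If it is, wc -m decodes the two-byte sequence as a single
# character and is expected to report 4.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
  printf 'caf\303\251' | LC_ALL=en_US.UTF-8 wc -m
fi
```

The same decoding step that lets wc count 4 characters instead of 5 bytes is the kind of extra work a tool has to do in a UTF-8 locale.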
Using strace to see what’s going on
I won’t get too technical on it in this article, but strace is a utility that keeps tabs on what a process is up to by logging the system calls it makes.
Below I’m displaying a one-line file with the cat command. The strace output is stored in a file called TRACE.
Then I call egrep to only show mentions of open and read operations:
[email protected] [~] strace -o TRACE cat TEST_FILE
This is a test
[email protected] [~] egrep "open|read" TRACE
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "177ELF211 3 >