Tuesday, June 17, 2008

HeX 021: Learning PCRE and its performance

PCRE stands for Perl Compatible Regular Expressions, and it is mainly used for pattern matching. If you want to learn more about PCRE, take a good read of its manual pages -

shell>man pcre

shell>man pcrematching

shell>man pcrepartial

shell>man pcrepattern

shell>man pcreperform

So why do you need to learn regular expressions (regex)? Here's the answer -

http://geek00l.blogspot.com/2006/12/regex-magic-for-netsexcanalyst.html

Next, look at the tool that comes with pcre - pcretest. As the name implies, you can use pcretest to test your regex. Let's go -

shell>pcretest -help
Usage: pcretest [options] [input file [output file]]

Input and output default to stdin and stdout.
This version of pcretest is not linked with readline().

Options:
-b show compiled code (bytecode)
-C show PCRE compile-time options and exit
-d debug: show compiled code and information (-b and -i)
-dfa force DFA matching for all subjects
-help show usage information
-i show information about compiled patterns
-m output memory used information
-o <n> set size of offsets vector to <n>
-p use POSIX interface
-q quiet: do not output PCRE version number at start
-S <n> set stack size to <n> megabytes
-s output store (memory) used information
-t time compilation and execution
-t <n> time compilation and execution, repeating <n> times
-tm time execution (matching) only
-tm <n> time execution (matching) only, repeating <n> times

If you have already read the man pages above, you should be able to understand some of the options. I normally use the option -C to check the compile-time options first -

shell>pcretest -C
PCRE version 7.7 2008-05-07
Compiled with
UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
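
If you want to check a build from a script rather than by eye, the text printed by -C can simply be searched for the feature you care about. Below is a minimal sketch in Python, not part of pcretest itself: the helper names has_utf8_support and read_build_report are hypothetical, and the code assumes that a build without the feature prints "No UTF-8 support", which you should confirm against your own pcretest.

```python
import shutil
import subprocess

# Output copied from the `pcretest -C` run above (first lines only).
SAMPLE_REPORT = """\
PCRE version 7.7 2008-05-07
Compiled with
UTF-8 support
Unicode properties support
Newline sequence is LF
"""


def has_utf8_support(report):
    """Return True if the report lists UTF-8 support as compiled in."""
    for line in report.splitlines():
        line = line.strip()
        if line.endswith("UTF-8 support"):
            # Assumption: a build without it prints "No UTF-8 support".
            return not line.startswith("No ")
    return False


def read_build_report():
    """Return the output of `pcretest -C`, or the sample if pcretest is absent."""
    exe = shutil.which("pcretest")
    if exe is None:
        return SAMPLE_REPORT
    done = subprocess.run([exe, "-C"], capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    print("UTF-8 support:", has_utf8_support(read_build_report()))
```

The same loop can be adapted to any other line of the report, such as the default match limit.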

Another option I usually use is -t, which times the compilation and execution of a particular regex I write.

shell>pcretest -t
PCRE version 7.7 2008-05-07

re>

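As you can see, pcretest drops into interactive mode with the prompt re>, where you define your regex. Bear in mind that the regex must be wrapped in delimiters (forward slashes are the usual choice; pcretest accepts any non-alphanumeric character other than backslash), for example -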

re>/[a-z0-9]+/

This means your regex is [a-z0-9]+. Once you hit enter you will see this -

Compile time 0.0028 milliseconds
data>

You may notice the compile time for this regex is 0.0028 milliseconds. Now enter some data to see whether it matches the regex -

data>ABC

Once you hit enter, you will see this -

Execute time 0.0008 milliseconds
No match

The execution time is 0.0008 milliseconds and there's no match, because the regex only covers lowercase letters and digits. Let's change the data -

data> abc
Execute time 0.0004 milliseconds
0: abc

We can now see the execution time is 0.0004 milliseconds and the data matches the regex (the line starting with 0: shows the text matched by the whole pattern).
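
For readers without pcretest at hand, the same experiment can be roughly approximated in Python. This is only a sketch: it uses Python's built-in re module, which is not PCRE, so the timings are not comparable with the numbers above. The helper name time_match and the repeat count are my own choices for illustration.

```python
import re
import timeit

# Same pattern as in the pcretest session above.
PATTERN = re.compile(r"[a-z0-9]+")


def time_match(subject, repeat=100_000):
    """Return (matched text or None, average search time in milliseconds)."""
    match = PATTERN.search(subject)
    total_seconds = timeit.timeit(lambda: PATTERN.search(subject), number=repeat)
    avg_ms = total_seconds / repeat * 1000.0
    return (match.group(0) if match else None, avg_ms)


if __name__ == "__main__":
    for subject in ("ABC", "abc"):
        text, avg_ms = time_match(subject)
        result = "No match" if text is None else "0: " + text
        print(f"data>{subject}  {result}  ({avg_ms:.4f} ms per search)")
```

Averaging over many repetitions plays the same role as giving pcretest a repeat count with -t: a single run is too short to time reliably.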

You can also check the compile time of multiple regexes in one go by defining them in a file instead of using interactive mode. For example, I write the lines below to a file - pcre-testing.txt

/\d{,10000}/

/([a-z0-9]+)?/i

Do remember that if you want to test multiple regexes at once, you have to separate them with a blank line. You can't write them like this, because pcretest would read the second line as data for the first regex instead of as a new regex -

/\d{,10000}/
/([a-z0-9]+)?/i

Now we can run this -

shell>pcretest -t pcre-testing.txt
PCRE version 7.7 2008-05-07

/\d{,10000}/
Compile time 0.0032 milliseconds

/([a-z0-9]+)?/i
Compile time 0.0054 milliseconds
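
If you generate the test file from a script, the blank-line rule is easy to enforce automatically. Here is a minimal sketch in Python, assuming pcretest is on your PATH; the helper names write_pattern_file and run_pcretest are hypothetical, and only the -t option shown above is used.

```python
import shutil
import subprocess
from pathlib import Path


def write_pattern_file(path, patterns):
    """Write one delimited regex per block, each followed by a blank line."""
    # The blank line ends the (empty) list of data lines for a regex,
    # so pcretest reads the following line as a new regex.
    Path(path).write_text("".join(p + "\n\n" for p in patterns))


def run_pcretest(path):
    """Run `pcretest -t <path>` and return its output, or None if absent."""
    exe = shutil.which("pcretest")
    if exe is None:
        return None  # pcretest is not installed on this machine
    done = subprocess.run([exe, "-t", str(path)], capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    write_pattern_file("pcre-testing.txt", [r"/\d{,10000}/", r"/([a-z0-9]+)?/i"])
    print(run_pcretest("pcre-testing.txt") or "pcretest not found")
```

Keeping the patterns in a list makes it simple to add or remove a regex without worrying about the file layout.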

There are other options that you may want to try out, but I think I have given you enough guidance to carry on. You may be interested in reading some of my related posts here -

http://geek00l.blogspot.com/2007/11/regex-learning-tool-kregexpeditor.html

http://geek00l.blogspot.com/2007/07/visualregexp-nice-regex-learning-tool.html

I advocate pcretest because it comes with pcre, is available in HeX, and lets you evaluate the performance of a regex quickly.

Enjoy (;])

1 comment:

Nathaniel Richmond said...

Good post. It's definitely important to test regular expressions.