Perl/Addendum: Difference between revisions

From Citizendium
imported>Georg Heidenreich
m (changed some wording, added category lines)

Revision as of 23:11, 7 June 2007

healing slasheritis

In standard Unix tools such as sed, a regular expression is enclosed in a pair of slashes, as in '/pattern/'. A non-printing character is written using the backslash ("escape") character '\'; for example, '\n' in '/\n/' represents a newline character. Certain printing characters also need to be escaped, among them the delimiter '/' itself. So, to match a '/', the pattern is written as '/\//'. This is not uncommon, for example in file (path) names.

It gets confusing quickly if, for example, '\/' is to be replaced by its duplicate '\/\/'. Both the backslash and the slash need to be escaped: '/\\\//' represents the string '\/' inside a pattern definition. The "substitute" construct '$g =~ s/a/b/' (substitute 'a' by 'b') then explodes into so-called slasheritis: '$g =~ s/\\\//\\\/\\\//'. Such regular expression patterns quickly become unreadable.

Perl's solution is to allow pattern delimiters to be chosen "on the fly". After all, Perl knows that a pattern definition begins right after an operator such as 's' or 'm', so why not take the next character as the delimiter? The slasheritis above can be resolved by writing '$g =~ s#\\/#\\/\\/#' (the backslash still needs to be escaped), and everything is (somewhat) clearer again. It is customary to use non-alphanumeric characters such as '!', '#' or '|' as delimiters. Because Perl also knows about paired characters such as '<>' or '{}', some well-known Perl authors prefer the style '$g =~ s{\\/} {\\/\\/}', which they find even clearer.
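As a minimal sketch of the delimiter choices described here, the following script applies the same substitution three times, once per delimiter style. The sample string and the variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample input: a backslash followed by a slash, between two letters.
my $input = 'a\/b';

# The same substitution ('\/' becomes '\/\/') with three delimiter styles.
(my $slashes = $input) =~ s/\\\//\\\/\\\//;   # slash delimiters: slash and backslash escaped
(my $hashes  = $input) =~ s#\\/#\\/\\/#;      # hash delimiters: only the backslash escaped
(my $braces  = $input) =~ s{\\/}{\\/\\/};     # paired brace delimiters

print "$_\n" for $slashes, $hashes, $braces;
```

All three statements should leave the same string in their target variable; only the readability of the source differs.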

special symbols

Perl introduced a whole flock of shortcuts for classes of characters, usually paired with their (upper-case) complement. For example, '/\s/' matches any whitespace character (blank, tab, newline, and a few special ones), and '/\S/' (capital 'S') matches any non-whitespace character. Similarly, '/\d/' matches a numerical character ("digit") and '/\D/' a non-digit, while '/\w/' matches a "word" character (letter, digit, or underscore) and '/\W/' a non-word character. The whole list can be found in the "Camel" book [1].
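The shortcuts can be tried out with a short script such as the sketch below. The sample line and the variable names are hypothetical and chosen only to exercise each class on plain ASCII input.

```perl
use strict;
use warnings;

# Hypothetical sample line containing blanks, a tab, and a newline.
my $line = "order 42:\tshipped 2007-06-07\n";

my @digits = $line =~ /\d/g;                # every single digit character
my @words  = $line =~ /\w+/g;               # runs of word characters
(my $squeezed    = $line) =~ s/\s+/ /g;     # each whitespace run becomes one blank
(my $digits_only = $line) =~ s/\D//g;       # delete everything that is not a digit

print scalar(@digits), " digits, ", scalar(@words), " words\n";
print "[$squeezed]\n";
print "$digits_only\n";
```

Note that the lower-case and upper-case forms are exact complements: deleting all '\D' characters leaves precisely the characters that '\d' matches.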

inline comments

Since version 5.002 a regular expression can be written with inline comments if the closing delimiter is followed by the 'x' modifier. Here is a short program to eliminate comments from HTML code (by Perl author Tom Christiansen, with his original comments):

#!/usr/bin/perl -p0777
#
# htdecom -- remove html comments from a document
# tchrist@perl.com
# 
# taken from the larger striphtml program

require 5.002;

s{ <!                  # comments begin with a `<!'
                       # followed by 0 or more comments;

   (.*?)               # this is actually to eat up comments in non 
                       # random places

    (                  # not suppose to have any white space here

                       # just a quick start; 
     --                # each comment starts with a `--'
       .*?             # and includes all text up to and including
     --                # the *next* occurrence of `--'
       \s*             # and may have trailing while space
                       #   (albeit not leading white space XXX)
    )+                 # repetire ad libitum  XXX should be * not +
   (.*?)               # trailing non comment text
  >                    # up to a `>'
}{
   if ($1 || $3) {     # this silliness for embedded comments in tags
       "<!$1 $3>";
 } 
}gesx;                 # mutate into nada, nothing, and niente
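
For readers who want a smaller starting point, here is a minimal sketch of the same technique: a commented pattern with the 'x' modifier and paired delimiters. It is not part of the program above; the input string and the date layout are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical input; the YYYY-MM-DD layout is assumed for this sketch.
my $text = 'due 2001-02-03';

# Under the 'x' modifier, whitespace in the pattern is ignored and
# '#' starts a comment that runs to the end of the line.
my ($year, $month, $day) = $text =~ m{
    (\d{4})     # four-digit year
    -           # literal hyphen
    (\d{2})     # two-digit month
    -           # literal hyphen
    (\d{2})     # two-digit day
}x;

print "$year/$month/$day\n";
```

One point to keep in mind with this style: Perl locates the closing delimiter before it looks at the comments, so a comment inside the pattern should not contain an unbalanced delimiter character.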