Perl

From Citizendium
Revision as of 15:12, 8 November 2007 by imported>Michael G Schwern (→‎Data types)
Jump to navigation Jump to search

Perl is a dynamic, interpreted programming language created by Larry Wall and was first released in 1987. Wall's original intention was to combine the best features of a variety of other languages, including C, Unix shell scripting, lisp, awk, sed, and Unix tools such as the grep family, into a succinct and easy to understand language for system administration. Perl has since evolved into one of the most flexible and powerful scripting languages, with a huge following and professional support. The Perl interpreter has been ported to many operating systems, allowing to execute the same Perl code in different environments, usually without any changes.

Perl is widely adopted for its powerful string processing capability, which is mainly owed to Perl's benchmark regular expression engine [6]. In version 5, Perl introduced a set of important abstractions, which allow the creation, export, and import of objects and methods, leading to a vast public library of well-maintained modules and packages (CPAN). Due to its strengths in manipulating strings and the large amount of publicly available modules, Perl adopts itself easily as a "glue" language between different worlds, for example database access and web programming. Another main use is system administration. Many system scripts of today's Linux distributions are written in Perl, and even some commercial Unix systems install Perl by default.

At the same time Perl strives to be easy, elegant, and fun to use. Perl's motto is "There's more than one way to do it". This is of course true for many programming languages; but while most enforce as much structure as possible, Perl has taken the opposite route: leave as much choice as possible to the programmer and don't require what is not absolutely necessary. As an example, Perl does not enforce prior declaration of variables (but you can choose to do so) or exact data typing. Instead, the interpreter handles variables flexibly as required, and is mostly right about it. Purists may have difficulties with such a concept, but many a C (or C++) programmer never wrote another line in these languages once they migrated to Perl.

Examples

For short programs, a Perl script can be invoked directly from the command line, using the '-e' option:

Code

Result

$ perl -e 'print "Hello, world!\n"' Hello, world
$ perl -e '$g='Hello'; $g=~s/e/a/; print "$g, world!\n"' Hallo, world
$ perl -e 'print grep(s/^\d+(.*)/$1 /, sort(split(/ /,"8hacker, 4Perl 1Just 2another")));' Just another Perl hacker,

The second statement in the second example shows an aspect that is often discouraged for the sake of clarity, the option to write very compact (terse) code. In the middle statement, the 'e' in $g is replaced with an 'a'.

In usenet days it was customary to sign one's posting in a Perl thread by a one-liner that produced the string "Just another Perl hacker," (JAPH), the master of which was Randal L. Schwartz, author of several Perl books. The third line is one of his simpler examples, but already too involved to explain in the context of an introductory article. You can see the analysis here. Both examples 2 and 3 contain regular expression matches, which are introduced further down. Real world Perl programs are usually stored in files and passed as parameters to the Perl interpreter. Some mechanisms of this form of invocation depend largely on the host's operating system, see a separate example for more details.

Syntax highlights

It is impossible to give a complete overview of Perl's syntax here. The "Camel" book [1] has over 1000 pages in its 3rd edition. But a few highlights may give some idea of the character of Perl to the interested reader.

Variables

Data types

  • Scalars are the fundamental data type. A scalar stores a single, simple value, usually a string or a number, or a reference to another variable. A scalar is prepended by '$', e.g. $var = 1;
  • Arrays are ordered lists of scalars, where each element can be accessed by an index (integer). An array is prepended by a '@', e.g. @list. All indexing in Perl starts with 0, i.e. $list[0] (scalar) is the first element of @list.
  • Hashes are unordered sets of key/value pairs, where the value (scalar) is accessed using a key (string). A hash is prepended by '%', e.g. %colors. A value is addressed by using the key in braces, e.g to assign a value to %colors for the key 'ball':
    $colors{'ball'} = 'green';
  • Globs (or 'typeglobs') are symbol tables. They associate a reference to another variable with a global name. A glob is prepended by '*', e.g. *colors. The most common use of a glob is as a filehandle, such as open *FILE, $filename. Other uses include aliasing global variables, *colour = \$color.

All variables in Perl are of these four data types. Other data types are abstractions such as filehandles, subroutines, symbol table entries, etc. Perl keeps variables of each type separately, so that it is always clear which value you want to access. Example: @color, $color, %color, *color are different variables, so $color, $color[2] or $color{'ball'} hold different values.

Scope

Since version 5, the scope of a variable can be differentiated through the "lexical" new 'my' and 'our' declarators. Before, only global and local, declared by the 'local' operator, variables existed. A global variable is visible throughout its namespace, while a local variable will overlay an existing global with a temporary value. But in Perl, a local variable has some confusing properties; for example, it can be accessed from a subroutine that is called from within its scope. The reason for this is that the global variable is overlayed in symbol tables, which are hidden, but truly global. The new "lexical" 'my' declarator does not create symbol table entries, so anything outside its scope (e.g. a subroutine) cannot know its name. A lexical variable is created when declared, Perl keeps track of its usage. It ceases to exist after all references to it have ended, which is usually the case when the program exits the scope.

Operators

The list of operators is too numerous to fully reproduce here. Besides the common assignment operator '=' and the arithmetic operators "+ - / *", Perl has a large amount of complex operators, some of which may only work in conjunction with certain data types or constructs. As an example this piece of code:

open FILE, "/etc/passwd";  # open file
@line=<FILE>;              # read the complete file in one gulp
close FILE;                # finished

In this example a file is opened, the '<' and '>' around the file handle 'FILE' in the second line are one operator, telling Perl to read the file line by line and assigning each line to successive elements of the array @line. Thus, $line[0] will contain the complete first line of the file, $line[1] the second, etc.

Statements and Declarations

Statements are the parts of a script that are executed during runtime. They can be assignments, built-in functions, calls to subroutines, control structures, etc. Statements may be enclosed in blocks delimited by "curly" braces: '{' '}'. A block may stand by itself, but usually it is dependent on some controlling expression, such as an "if" statement, a loop, or the eval function. A block defines a new scope within a surrounding scope.

Perl statements are usually terminated by a semicolon, though there are other implicit means in which Perl knows a statement ends, such as the closing of a block.

The following statements are syntactically correct:


{ $a = 1 }       # legal because closing brace follows
$a = 1, $b = 2;
$a = 1; $b = 2;

Declarations are syntactically similar to statements but are evaluated during compile time, i.e. when the interpreter reads the script and creates its internal representation.

Unlike many programming languages, Perl does not require a variable to be declared. It will "spring into existence" (as a global variable) the first time it is used in a statement at runtime. But you can enforce the explicit declaration of variables by the "use strict" pragma (a compiler directive); any undeclared variable (in the scope of the pragma) will then produce a compile time error:


## preventing the accidental invocation of a variable is considered "good practise"

$server = 'Citizendium';  ## ok in non-strict surrounding scope
{                         ## start of new scope
  use strict 'vars';      ## pragma
  $page = 'Perl';         ## this will generate an error
  $main::page = 'Perl';   ## this is a legal global variable
  my $var = 1;            ## 'my' declaration makes it legal
}

A global variable must now be declared by using its full package name explicitly. But the main beneficiary is the "my" operator, and thus cleaner programming. By the gentle force of lazyness a great number of local variables get declared instead of sloppily created globals!

Subroutines

Subroutines in Perl are declared by the reserved word 'sub', an (optional) name, and a body enclosed by braces. A named subroutine is global in its namespace. Unnamed ("anonymous") subroutines can obviously not be called elsewhere, but they have benefits for special and rather complicated purposes. The calling statement may pass parameters to the subroutine and will receive a return value. If the subroutine was declared before its first use, it can be called by its name, just like a builtin function, otherwise it has to be called with a prepended '&'. Examples:


sub tst1 { print $_,"\n"; }  ## subroutine declared before the call

tst1('tst1');                ## output 'tst1'
tst2('tst2');                ## error, not declared
&tst2('tst2');               ## output 'tst2'

my $var = sub {'hello'};     ## anonymous subroutine

sub tst2 { print $_,"\n"; }  ## subroutine declared after the call

Perl allows to declare details of function arguments in subroutine prototypes, e.g. whether the subroutine expects a scalar or a reference. But the name seems to be badly chosen because they don't work as compile-time type checking of function arguments, as programmers may expect[1].

Regular expressions

Regular expressions are a well-defined syntax by which a specialized program (the so-called regular expression engine) within Perl is directed to process text strings and produce certain side effects based on the findings. This process is usually called pattern matching. Regular Expression engines were built into many of Unix' standard programs from the beginning, but especially in the early days they differed slightly in their features, or sometimes even not so slightly.

When Larry Wall created Perl, he unified these flavors and added several "shorthand" patterns that made it easier to define and apply them. In later versions it became possible to comment patterns inline, to choose "non-greedy" match strategies, and Unicode and Posix were integrated into the engine. Perl itself (as opposed to the Regular Expression engine) provides several standard variables which will hold certain parts of the latest pattern match, such as $& for the match itself, $` for the part ("left" of) before the match, $' the part after it, $1, $2, ... for cached parts of a match, etc. These, together with some enhancements for readability, make it very easy and efficient to manipulate strings. As a result, for many years Perl's Regular Expression engine was considered the most advanced to be found in any programming language.

In Perl, the application of regular expressions can be very casual. The match operator '=~' also allows to match a variable "in place", i.e. it can be used as an assignment operator. In the above code snippet "$g =~ s/e/a/;", the contents of variable $g gets substituted (the 's' operator) by the result of the expression on the right hand side, applied on its original value. If $g contains 'geek', the first 'e' will be replaced by an 'a'. If you want to replace all 'e', the statement must be written as $g =~ s/e/a/g; (appended 'g' for "globally"). Here is a more explicit example, for editing convenience it is called as a script file 'tst':

# file 'tst'
$string  = 'Hello, world';
$pattern = '[ae]';           ## [] defines a class of characters, 'e' OR 'a' will match
$string  =~ m/$pattern/;     ## 'm' for 'match only', nothing is assigned to $string
print "match: \'$&\' before: \'$`\' after: \'$'\' \n";

Run as

$ perl tst

it will produce

match: 'e' before: 'H' after: 'llo, world'

For those interested in the use of Regular Expressions, [6] is highly recommended.

Namespace and scope

By declaring a package, a separate symbol table is created for all globals (variables, subroutines, etc.). This namespace continues until a new package is declared or the file ends. All symbols which are created outside of an explicit package automatically belong to default namespace 'main'. A variable or subroutine of a different namespace can be addressed by the '<namespace>::' qualifier. A scope is delimited by curly braces. Inside a scope a symbol from outside of this scope may be overlayed by "local" or lexical variables. As soon as the program leaves this scope, the previous scope is valid again.
An example:

## program tst
$var = 'global';                      ## global: 'main' assumend
{
  package test;
  $var         = 'package global';
  local $local = 'local';
  {                            ## new scope
    my $var = 'private var';   ## will not exist outside
    print "scope: $var \n";
    {
      print "subscope: $var local: $local \n";
    }
  }
  print "test: $test::var main: $main::var var: $var local: $local \n";
}
print "test: $test::var main: $main::var var: $var local: $local \n";

Running this:

$ perl tst

Produces:

scope: private var
subscope: private var local: local
test: package global main: global var: package global local: local
test: package global main: global var: global local:

Modules, objects

Although the package declaration itself goes back to the earlier days of Perl, many of the features introduced in version 5 gave it a whole new significance. It had become much easier to isolate parts of a program from one another and even assign them object character by "blessing" them into the package, automatically changing the package into a class. All subroutines inside this package become "methods", even their call behavior changes significantly. Of course, purists are not entirely happy with the way Perl has implemented some of the classic features, but looking at the large number of modules implementing Perl's flavor of object oriented programming, it can't be so wrong.


External links

Literature

The literature on Perl is numerous and growing. This is the list that Larry Wall recommends on his web page:

  • [1] Larry Wall, Tom Christiansen, Jon Orwant: Programming Perl - (the Camel Book). O'Reilly Media, Inc.; 3 edition (July 14, 2000). ISBN 0596000278. The standard reference.
  • [2] Randal L. Schwartz, Tom Phoenix, brian d foy: Learning Perl - (the Llama Book). ISBN 1565922840. First trainig.
  • [3] Tom Christiansen, Nathan Torkington: Perl Cookbook - (the Ram Book). O'Reilly Media, Inc.; 2 edition (August 21, 2003). ISBN 0596003137. Perl solutions for standard problems, example applications of Perl features.
  • [4] Randal L. Schwartz, Erik Olson, Tom Christiansen: Learning Perl on Win32 Systems - (the Gecko Book). O'Reilly Media, Inc.; 1 edition (August 1, 1997). ISBN 1565923243.
  • [5] Simon Cozens: Advanced Perl Programming - (the Panther Book). O'Reilly Media, Inc.; 2 edition (June 1, 2005). ISBN 0596004567. High-level Perl tutorial.
  • [6] Jeffrey E. F. Friedl: Mastering Regular Expressions - (the Owls Book). O'Reilly Media, Inc.; 3 edition (August 8, 2006). ISBN 0596528124. All you ever need to know about Regular Expressions, not Perl specific

Also:

  • Johnson, Andrew L. Elements of Programming with Perl. Greenwich, CT: Manning Publications, 2000. ISBN 1884777805. An excellent Perl text that doubles as an introduction to computer programming--not that programmers widely recommend Perl as a first programming language to study.
  • Conway, Damian Object Oriented Perl. Manning Publications, 2000. ISBN 1884777791. A Comprehensive Guide to Concepts and Programming Techniques.

Notes and references

  1. raised by Tom Christiansen, discussed here: Perl's Sins)