In this subject we teach some sophisticated programming systems that scale to very large applications using state of the art techniques. But before we can appreciate the mountains, we need to understand a little something about the foothills.
GAWK is a good old-fashioned UNIX filtering tool invented in the 1970s. The language is simple and GAWK programs are generally very short. GAWK is useful when the overheads of more sophisticated approaches are not worth the bother. Over the years, I find myself GAWK-ing more and more as I learn to break up large problems into lots of little ones. Often my students report back to me that, years later, they have forgotten everything I ever taught them except how to write lots of little GAWK scripts.
Teaching GAWK is fast. For example, GAWK can be quickly taught to data mining students and there is still lots of time left over to explore many mining methods.
But aren't there better scripting languages? Faster? Well, maybe yes and maybe no.
And GAWK is old (mid-70s). Aren't modern languages more productive? Well again, maybe yes and maybe no. One measure of the productivity of a language is how many lines of code are required to code up one business-level `function point'. Compared to many popular languages, GAWK scores very well:
loc/fp  language
------  --------
     6  excel 5
    13  sql
    21  awk <================
    21  perl
    21  eiffel
    21  clos
    21  smalltalk
    29  delphi
    29  visual basic 5
    49  ada 95
    49  ai shells
    53  c++
    53  java
    64  lisp
    71  ada 83
    71  fortran 95
    80  3rd generation default
    91  ansi cobol 85
    91  pascal
   107  2nd generation default
   107  algol 68
   107  cobol
   107  fortran
   128  c
   320  1st generation default
   640  machine language
  3200  natural language
Anyway, there are other considerations. GAWK is real succinct, simple enough to teach, and easy enough to recode in C (if you want raw speed). For example, here's the complete listing of someone's AWK spell-checking program.
BEGIN {
while (getline < "Usr.Dict.Words")
dict[$0] = 1
}
{ if (!dict[$1]) print $1
}
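If you want to try it, here's a hypothetical run. The dictionary contents and input words below are made up just for this demo; the real Usr.Dict.Words lives wherever your system keeps it.

```shell
# Fabricate a tiny dictionary for the demo
cat > Usr.Dict.Words <<'EOF'
hello
world
EOF

# Feed some words through the spell-checker from the text
printf 'hello\nwrold\nworld\n' |
awk 'BEGIN { while (getline < "Usr.Dict.Words") dict[$0] = 1 }
     { if (!dict[$1]) print $1 }'
# prints only the unknown word: wrold
```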
Sure, there's about a gazillion enhancements you'd like to make on this one but you gotta say, this is real succinct.
For me, GAWK is like some Zen thing. If I don't know what I am doing, the code gets dirty++. But when I get it, the GAWK code is clean (IMHO).
GAWK (and Prolog) are my tools in my private war against late execution of software syndrome (a.k.a. LESS). The symptom of LESS is a huge time delay before a new idea is executable. In extreme cases, I can hack up in days prototypes that take my students months, years, or forever to replicate in so-called better languages like C and JAVA and ...
Sure, I drool over the language features offered by more advanced languages like pointers, generic iterators, continuations, etc etc. And GAWK's lack of data structures (except num, string, and array) is a real pest. So every year I take a break from GAWK and try the latest and greatest new language (Python, Ruby, etc etc).
But years of bitter experience have showed me that the cleverer I get, the smaller my audience gets. If it is possible for me to explain something succinctly in a simple language like GAWK, then it is also possible that more folks will read my code.
gawk -f x.awk -f y.awk -f z.awk InputFile
Note that multiple files can be run using multiple -f flags.
Another way, which involves less typing on the command line, is to put all your GAWK code in one file called, say, all, then add a "she-bang" to the first line; e.g.
#!/usr/bin/gawk -f
# /* vim: set filetype=awk : */ -*- awk -*-
.. rest of the awk code
Line one of this file tells the operating system to run this script using the interpreter /usr/bin/gawk. Note that if you move this code to another machine then the first line must be changed to point to the GAWK interpreter on that machine.
Line two of this file is optional and contains some editor-specific commands that tell VIM and EMACS to highlight this code as if it was GAWK syntax.
Once such an all file is made executable (with chmod +x all) then it can be run on the command line like any other GAWK script:
./all InputFile
It is standard to use GAWK scripts as workers in some other scripting language. For example, a Unix BASH script could be:
#!/bin/bash
# /* vim: set filetype=sh : */ -*- sh -*-
gawk -f x.awk -f y.awk -f z.awk Pass=1 $1 Pass=2 $1

(Note that many scripting languages like GAWK and BASH support the she-bang and the editor commands on lines one and two.) This script runs some data file through GAWK in two passes (perhaps pass one collects some statistics and pass two fills in missing values with the mean values).
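The two-pass idea can be sketched as follows. The data file, the "?" missing-value marker, and the mean-filling rule are all assumptions made up for this example:

```shell
# A made-up one-column data file with one missing value
cat > data <<'EOF'
1
?
3
EOF

# Pass 1 computes the mean of the known values;
# pass 2 replaces "?" with that mean and prints every line.
awk 'Pass==1 && $1 != "?" { Sum += $1; N++ }
     Pass==2              { if ($1 == "?") $1 = Sum/N; print }
    ' Pass=1 data Pass=2 data
# prints 1, 2, 3 -- the "?" is replaced by the mean of 1 and 3
```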
A useful debugging trick is to define an audit command (e.g. in your shell start-up file):

export Audit="pgawk --profile=$HOME/tmp/awkprof.out --dump-variables=$HOME/tmp/awkvars.out --lint "
Then, if you run GAWK programs as follows, you will get a lot of debugging information about your GAWK program:
$Audit -f x.awk -f y.awk -f z.awk InputFile
Specifically, the file $HOME/tmp/awkprof.out will show how many times each line of the program was run while it processed InputFile. This can be used to find the hot spots worth optimizing, as well as dead code that is never exercised.
Also, the file $HOME/tmp/awkvars.out will list all the global variables in your GAWK code. I read awkvars.out looking for bad globals; i.e. variables that I forgot to declare as local and so become globals. I've lost weeks of my life debugging functions that are failing because of bad globals. From bitter experience, I've learned to check awkvars.out for stray globals before trusting a new script.
Finally, running $Audit generates pages of lint warnings, most of which can be ignored. However, some deserve your attention, such as `function called but never defined'.
Imagine GAWK as a kind of a cut-down C language with four tricks: self-initializing variables, pattern-based programming, regular expressions, and associative arrays.
You don't need to define variables: they appear as you use them.
There are only three types: strings, numbers, and arrays.
To ensure a number is a number, add zero to it.
x=x+0
To ensure a string is a string, add an empty string to it.
x = x "" "the string you really want to add"
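A quick sketch of both coercion tricks (the sample values are made up):

```shell
awk 'BEGIN { x = "3 dollars"
             print x + 0        # "+ 0" coerces to the number 3
             y = 42
             print (y "") "!"   # "" coerces 42 to a string: 42!
           }'
```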
To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two PLUS the local variables:
function haslocals(passed1,passed2, local1,local2,local3) {
passed1=passed1+1 # changes externally
local1=7 # only changed locally
}
Note that it's good practice to add white space between the passed and local variables.
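Here's a sketch of what goes wrong without the extra parameters; the function names are made up for this demo:

```shell
awk '
function bad()     { t = "leaked" }      # t is NOT a parameter, so it is global
function good(  t) { t = "contained" }   # t is a local, invisible outside good()
BEGIN { bad(); good()
        if (t == "leaked") print "bad() leaked t"
        else               print "no leak" }
'
# prints: bad() leaked t
```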
GAWK programs can contain functions AND pattern/action pairs.
If the pattern is satisfied, the action is called.
/^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR;
p = 1;
}
/^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR;
p = 0;
}
END { if (p != 0) print "missing .P2 at end" }
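The checker above can be exercised on a small made-up document; the stray second .P2 gets flagged:

```shell
printf '.P1\nsome text\n.P2\n.P2\n' |
awk '/^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR; p = 1 }
     /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR; p = 0 }
     END     { if (p != 0) print "missing .P2 at end" }'
# prints: .P2 with no preceding .P1, line 4
```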
Two magic patterns are BEGIN and END. These are true before and after all the input files are read. Use END for end actions (e.g. final reports) and BEGIN for start-up actions such as initializing default variables, setting the field separator, or resetting the seed of the random number generator:
BEGIN {
while (getline < "Usr.Dict.Words") #slurp in dictionary
dict[$0] = 1
FS=","; #set field seperator
srand(); #reset random seed
Round=10; #always start globals with U.C.
}
The default action is {print $0}; i.e. print the whole line.
The default pattern is 1; i.e. true.
Patterns are checked, top to bottom, in source-code order.
Patterns can contain regular expressions. In the above example /^\.P1/
means "front of line followed by a full stop followed by P1".
Regular expressions are important enough for their own section.
Ok, so now we know enough to explain an example. How does hist.awk work in the following?
% cat /etc/passwd | grep -v \# | cut -d: -f 6 | sort | uniq -c | sort -r -n | gawk -f hist.awk
************************** 26 /var/empty
** 2 /var/virusmails
** 2 /var/root
* 1 /var/xgrid/controller
* 1 /var/xgrid/agent
* 1 /var/teamsserver
* 1 /var/spool/uucp
* 1 /var/spool/postfix
* 1 /var/spool/cups
* 1 /var/pcast/server
* 1 /var/pcast/agent
* 1 /var/imap
* 1 /Library/WebServer
This pipeline finds the most common entries in passwd as follows:
command          | action
---------------- | -------------------------------------------
cat /etc/passwd  | print the password file
grep -v \#       | remove comment lines
cut -d: -f 6     | grab just the sixth column
sort             | sort the lines (so repeats are adjacent)
uniq -c          | prints the frequencies of the unique lines
sort -r -n       | sort by frequencies in reverse order
gawk -f hist.awk | a little histogram printer
hist.awk reads the maximum width from line one (when NR==1), then scales it to some maximum width value. For each line, it then prints the line ($0) with some stars at front.
NR==1 { Width = Width ? Width : 40          # sets Width if it is missing
        Scale = $1 > Width ? $1 / Width : 1
      }
      { Stars = int($1 * Scale)
        print str(Width - Stars," ") str(Stars,"*") $0
      }

# note that, in the following, "tmp" is a local variable
function str(n,c,   tmp) { # returns a string, size "n", of all "c"
    while((n--) > 0) tmp = c tmp
    return tmp
}
Do you know what these mean?

/^[ \t\n]*/
/[ \t\n]*$/
/^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/
Well, the first two are leading and trailing blank spaces on a line and the last one is the definition of an IEEE-standard number written as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string:
function trim(s, t) {
t=s;
sub(/^[ \t\n]*/,"",t);
sub(/[ \t\n]*$/,"",t);
return t
}
or recognize something that isn't a number:
if ( $i !~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/ )
{print "ERROR: " $i " not a number}
Regular expressions are an astonishingly useful tool supported by many languages (e.g. Awk, Perl, Python, Java). The following notes review the basics. For full details, see http://www.gnu.org/manual/gawk-3.1.1/html_node/Regexp.html#Regexp.
Syntax: Here's the basic building blocks of regular expressions:
c
matches the character c (assuming c is a character with no special meaning in regexps).
\c
matches the literal character c; e.g. tabs and newlines are \t and \n respectively.
.
matches any character except newline.
^
matches the beginning of a line or a string.
$
matches the end of a line or a string.
[abc...]
matches any of the characters abc... (character class).
[^abc...]
matches any character except abc... and newline (negated character class).
r*
matches zero or more r's.
And that's enough to understand our trim function shown above. The regular expression /[ \t]*$/ means trailing whitespace; i.e. zero-or-more spaces or tabs followed by the end of line.
But that's only the start of regular expressions. There's lots more. For example:
r+
matches one or more r's.
r?
matches zero or one r's.
r1|r2
matches either r1 or r2 (alternation).
r1r2
matches r1, and then r2 (concatenation).
(r)
matches r (grouping).
Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:
^[+-]? ...
Numbers begin with zero or one plus or minus signs.
...[0-9]+...
Simple numbers are just one or more numbers.
...[.]?[0-9]*...
which may be followed by a decimal point and zero or more digits.
...|[.][0-9]+...
Alternatively, a number can have no leading digits and just start with a decimal point.
.... ([eE]...)?$
Also, there may be an exponent added
...[+-]?[0-9]+)?$
and that exponent is a positive or negative bunch of digits.
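Putting that all together, here's the number regex run over a few made-up sample strings:

```shell
printf '3.14\n-2\n.5\n1e10\nabc\n1.2.3\n' |
awk '{ if ($1 ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/)
         print $1, "is a number"
       else
         print $1, "is not a number" }'
# 3.14, -2, .5 and 1e10 pass; abc and 1.2.3 fail
```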
GAWK has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):
gawk '{for(i=1;i <=NF;i++) freq[$i]++ }' filename
The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? GAWK has a special "for" construct that loops over the values in an array. This script is longer than most command lines, so it will be expressed as an executable script:
#!/usr/bin/awk -f
{for(i=1;i <=NF;i++) freq[$i]++ }
END{for(word in freq) print word, freq[word] }
You can find out if an element exists in an array at a certain index with the expression:
index in array
This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present.
You can remove an individual element of an array using the delete statement:
delete array[index]
It is not an error to delete an element which does not exist.
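A small made-up sketch of both points, membership testing and deletion:

```shell
awk 'BEGIN { a["x"] = 1
             print ("x" in a), ("y" in a)   # 1 0 -- and a["y"] is NOT created
             delete a["x"]
             print ("x" in a) }'            # 0
```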
GAWK has a special kind of for statement for scanning an array:
for (var in array)
body
This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.
The order in which the array is scanned is not defined.
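For a predictable order, one workaround (sketched here with made-up data) is to use integer keys and a counting loop:

```shell
awk 'BEGIN { n = split("c a b", word, " ")   # word[1]="c", word[2]="a", word[3]="b"
             for (i = 1; i <= n; i++) print i, word[i] }'
# prints the entries in key order: 1 c, 2 a, 3 b
```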
To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long; then you can loop an index from 1 to N. Here are some useful array functions. We begin with the usual stack stuff. These stacks have items at 1,2,3,... and position 0 is reserved for the size of the stack:
function top(a) {return a[a[0]]}
function push(a,x, i) {i=++a[0]; a[i]=x; return i}
function pop(a, x,i) {
i=a[0]--;
if (!i) {return ""} else {x=a[i]; delete a[i]; return x}}
The pop function can be used in the usual way:
BEGIN {push(a,1); push(a,2); push(a,3);
       while(x=pop(a)) print x
}

which prints:

3
2
1
We can collect everything in an array into a string:
function a2s(a, i,s) {
s="";
for (i in a) {s=s " " i "= [" a[i]"]\n"};
return s}
BEGIN {push(L,1); push(L,2); push(L,3);
print a2s(L);}
0= [3]
1= [1]
2= [2]
3= [3]
And we can go the other way and convert a string into an array using the built-in split function. These pod files were built using a recursive include function that seeks patterns of the form:
^=include file
This function splits lines on space characters into the array `a', then looks for =include in a[1]. If found, it calls itself recursively on the file named in a[2]. Otherwise, it just prints the line:
function rinclude (line, x,a) {
split(line,a,/ /);
if ( a[1] ~ /^\=include/ ) {
while ( ( getline x < a[2] ) > 0) rinclude(x);
close(a[2])}
else {print line}
}
Note that the third argument of the split function can be any regular expression.
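For instance, here's a made-up example where the separator is the regular expression `one or more non-digits':

```shell
awk 'BEGIN { n = split("a1bb22ccc333", num, /[^0-9]+/)
             # num[1] is "" because the string starts with a separator
             for (i = 1; i <= n; i++) if (num[i] != "") print num[i] }'
# prints: 1, 22, 333
```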
By the way, here's a nice trick with arrays. To print the lines in a files in a random order:
BEGIN {srand()}
{Array[rand()]=$0}
END {for(I in Array) print Array[I]}
Short, eh? This is not a perfect solution. GAWK can only generate 1,000,000 different random numbers, so the birthday theorem cautions that there is a small chance that lines will be lost when different lines are written to the same randomly selected location. After some experiments, I can report that you lose around one item after 1,000 inserts and 10 to 12 items after 10,000 random inserts. Nothing to write home about really. But for larger item sets, the above three-liner is not what you want to use. For example, 10,000 to 12,000 items (more than 10%) are lost after 100,000 random inserts. Not good!
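One way to dodge the collisions (my suggestion, not from the text above) is to append the line number to the key, so every key is unique and no line can be lost:

```shell
printf 'a\nb\nc\nd\n' |
awk 'BEGIN { srand() }
     { Line[rand() "." NR] = $0 }   # NR makes each key unique
     END { for (I in Line) print Line[I] }'
# prints all four lines, in some shuffled order
```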
Nice lecture notes comparing different scripting languages: http://www.cs.utk.edu/~plank/plank/classes/cs494/notes.html
A shoot-em-up between N languages, including GAWK: http://dada.perl.it/shootout/craps.html
GAWK has some advantages over other scripting languages like (e.g.) Perl: for one thing, GAWK variables don't need a "$" in front of them :-)

Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples?'
Some of these are longer than they need to be since they don't exploit some (e.g.) command-line trick to wrap the code in `for each line do X'. And that is the point: for teachability, the preferred language is the one you need to know LESS about before you can be useful in it.
hello world
PERL:
print "hello world\n"
GAWK:
BEGIN { print "hello world" }
One plus one
PERL
$x= $x+1;
GAWK
x= x+1
Printing
PERL
print $x, $y, $z;
GAWK
print x,y,z
Printing the first field in a file
PERL
while (<>) {
split(/ /);
print "@_[0]\n"
}
GAWK
{ print $1 }
Printing lines, reversing fields
PERL
while (<>) {
split(/ /);
print "@_[1] @_[0]\n"
}
GAWK
{ print $2, $1 }
Concatenation of variables
PERL
command = "cat $fname1 $fname2 > $fname3"
GAWK
command = "cat " fname1 " " fname2 " > " fname3
Looping
PERL:
for (1..10) { print $_,"\n" }
GAWK:
BEGIN {
for (i=1; i<=10; i++) print i
}
Pairs of numbers
PERL:
for (1..10) { print "$_ ",$_-1 }
print "\n"
GAWK:
BEGIN {
for (i=1; i<=10; i++) printf i " " i-1
print ""
}
List of words into a hash
PERL
foreach $x ( split(/ /,"this is not stored linearly") )
{ print "$x\n" }
GAWK
BEGIN {
split("this is not stored linearly",temp)
for (i in temp) print temp[i]
}
Printing a hash in some key order
PERL
$n = split(/ /,"this is not stored linearly");
for $i (0..$n-1) { print "$i @_[$i]\n" }
print "\n";
for $i (@_) { print ++$j," ",$i,"\n" }
AWK
BEGIN {
n = split("this is not stored linearly",temp)
for (i=1; i<=n; i++) print i, temp[i]
print ""
for (i in temp) print i, temp[i]
}
Printing all lines in a file
PERL
open file,"/etc/passwd";
while (<file>) { print $_ }
GAWK
BEGIN {
while (getline < "/etc/passwd") print
}
Printing a string
PERL
$x = "this " . "that " . "\n";
print $x
GAWK
BEGIN {
x = "this " "that " "\n" ; printf x
}
Building and printing an array
PERL
$assoc{"this"} = 4;
$assoc{"that"} = 4;
$assoc{"the other thing"} = 15;
for $i (keys %assoc) { print "$i $assoc{$i}\n" }
GAWK
BEGIN {
assoc["this"] = 4
assoc["that"] = 4
assoc["the other thing"] = 15
for (i in assoc) print i,assoc[i]
}
Sorting an array
PERL
split(/ /,"this will be sorted once in an array");
foreach $i (sort @_) { print "$i\n" }
GAWK
BEGIN {
split("this will be sorted once in an array",temp," ")
for (i in temp) print temp[i] | "sort"
while ("sort" | getline) print
}
Sorting an array (#2)
GAWK
BEGIN {
split("this will be sorted once in an array",temp," ")
n=asort(temp)
for (i=1;i<=n;i++) print temp[i]
}
Print all lines, vowels changed to stars
PERL
while (<STDIN>) {
s/[aeiou]/*/g;
print $_
}
GAWK
{gsub(/[aeiou]/,"*"); print }
Report from file
PERL
#!/pkg/gnu/bin/perl
# this is a comment
#
open(stream1,"w | ");
while ($line = <stream1>) {
($user, $tty, $login, $junk) = split(/ +/, $line, 4);
print "$user $login ",substr($line,49)
}
GAWK
#!/pkg/gnu/bin/gawk -f
# this is a comment
#
BEGIN {
  while ("w" | getline) {
    user = $1; tty = $2; login = $3
    print user, login, substr($0,49)
  }
}
Web Slurping
PERL
open(stream1,"lynx -dump 'cs.wustl.edu/~loui' | ");
while ($line = <stream1>) {
if ($flag && $line =~ /[0-9]/) { print $line }
if ($line =~ /References/) { $flag = 1 }
}
GAWK
BEGIN {
com = "lynx -dump 'cs.wustl.edu/~loui' &> /dev/stdout"
while (com | getline line) {
if (flag && line ~ /[0-9]/) { print line }
if (line ~ /References/) { flag = 1 }
}
}
Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.
After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.
After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).
By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.
To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.
PERL and GAWK appear to have similar programming, development, and debugging cycle times.
Finally, there seems to be a small advantage for GAWK over PERL, after a year, for the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.
by R. Loui
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a `scripting language.' It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
There are some quick answers that have to do with the pragmatics of undergraduate programming. Then there are more instructive answers that might be valuable to those who debate programming paradigms or to those who study the history of AI languages. And there are some deep philosophical answers that expose the nature of reasoning and symbolic AI. I think the answers, especially the last ones, can be even more surprising than the observed effectiveness of GAWK for AI.
First it must be confessed that PERL programmers can cobble together AI projects well, too. Most of GAWK's attractiveness is reproduced in PERL, and the success of PERL forebodes some of the success of GAWK. Both are powerful string-processing languages that allow the programmer to exploit many of the features of a UNIX environment. Both provide powerful constructions for manipulating a wide variety of data in reasonably efficient ways. Both are interpreted, which can reduce development time. Both have short learning curves. The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as a spoonful of syntactic sugar. Some will argue that PERL has superior functionality, but for quick AI applications, the additional functionality is rarely missed. In fact, PERL's terse syntax is not friendly when regular expressions begin to proliferate and strings contain fragments of HTML, WWW addresses, or shell commands. PERL provides new ways of doing things, but not necessarily ways of doing new things.
In the end, despite minor differences, both PERL and GAWK minimize programmer time. Neither really provides the programmer the setting in which to worry about minimizing run-time.
There are further simple answers. Probably the best is the fact that
increasingly, undergraduate AI programming is involving the Web. Oren
Etzioni (University of Washington, Seattle) has for a while been arguing
that the "softbot" is replacing the mechanical engineers' robot as the
most glamorous AI test bed. If the artifact whose behavior needs to be
controlled in an intelligent way is the software agent, then a language
that is well-suited to controlling the software environment is the
appropriate language. That would imply a scripting language. If the robot
is KAREL, then the right language is `turn left; turn right'.
If the
robot is Netscape, then the right language is something that can generate
netscape -remote 'openURL(http://cs.wustl.edu/~loui)'
with elan.
Of course, there are deeper answers. Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays. GAWK asks the programmer to use the file system for data organization and the operating system for debugging tools and subroutine libraries. There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks. There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time. (footnote: I am reminded of my IBM colleague Ben Grosof's advice for Palo Alto: Don't worry about whether it's highway 101 or 280. Don't worry if you have to head south for an entrance to go north. Just get on the highway as quickly as possible.)
There are some similarities between GAWK and LISP that are illuminating. Both provided a powerful uniform data structure (the associative array implemented as a hash table for GAWK and the S-expression, or list of lists, for LISP). Both were well-supported in their environments (GAWK being a child of UNIX, and LISP being the heart of lisp machines). Both have trivial syntax and find their power in the programmer's willingness to use the simple blocks to build a complex approach.
Deeper still, is the nature of AI programming. AI is about functionality and exploratory programming. It is about bottom-up design and the building of ambitions as greater behaviors can be demonstrated. Woe be to the top-down AI programmer who finds that the bottom-level refinements, `this subroutine parses the sentence,' cannot actually be implemented. Woe be to the programmer who perfects the data structures for that heap sort when the whole approach to the high-level problem needs to be rethought, and the code is sent to the junk heap the next day.
AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor.
Now for the surprising philosophical answers. First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution. Chess, neural nets, and genetic programming show the limits of brute computation. The alternative is clever program organization. (footnote: One might add that the former are the AI approaches that work, but that is easily dismissed: those are the AI approaches that work in general, precisely because cleverness is problem-specific.) So AI programmers always want to maximize the content of their program, not optimize the efficiency of an approach. They want minds, not insects. Instead of enumerating large search spaces, they define ways of reducing search, ways of bringing different knowledge to the task. A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.
Second, inference is merely the expansion of notation. No matter whether
the logic that underlies an AI program is fuzzy, probabilistic, deontic,
defeasible, or deductive, the logic merely defines how strings can
be transformed into other strings. A language that provides the best
support for string processing in the end provides the best support
for logic, for the exploration of various logics, and for most forms of
symbolic processing that AI might choose to call "reasoning" instead of
"logic." The implication is that PROLOG, which saves the AI programmer
from having to write a unifier, saves perhaps two dozen lines of GAWK
code at the expense of strongly biasing the logic and representational
expressiveness of any approach.
I view these last two points as news not only to the programming language community, but also to much of the AI community that has not reflected on the past decade's lessons.
In the puny language, GAWK, which Aho, Weinberger, and Kernighan thought not much more important than grep or sed, I find lessons in AI's trends, AI's history, and the foundations of AI. What I have found not only surprising but also hopeful, is that when I have approached the AI people who still enjoy programming, some of them are not the least bit surprised.