Data-Driven Programming:
Pipe & Filter

This page comes from Kevin Heard's excellent UNIX tutorials.

The Enlightened Ones say that....

Summary

Filters take input and transform it: e.g. grep, cut, sort, and uniq, all described below.

And we can wire all these filters together using pipes.

Pipes

Redirection is useful for making a process use a file for input or output, but what if we want to send the output of one process directly to the input of another? To do that, we need a pipe. A pipe is a method of inter-process communication (IPC) which allows a one-way flow of information between two processes on the same machine. We use the "vertical bar" character ("|") to create a pipe between two processes.

Let's combine the actions we carried out in the exercises of the previous section into a single command using a pipe. To do this, we will issue the who command, followed by the pipe symbol, followed by the mail command.

% who | mail -s "let's try a pipe" username@domain.com

The result of this operation is that the output of the who command will be sent directly to the input of the mail command, without the intermediate step of storing the data in a file.
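To see that equivalence concretely, here is a minimal sketch with printf and sort standing in for who and mail (so it runs anywhere):

```shell
# A pipe connects the stdout of one process to the stdin of the next:
printf 'b\na\n' | sort

# The same effect, done the long way with an intermediate file:
printf 'b\na\n' > /tmp/unsorted.$$
sort < /tmp/unsorted.$$
rm /tmp/unsorted.$$
```

Both print the same sorted list; the pipe version just skips the file.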

The pipe construct is very powerful, because it allows the user to join several small programs to create a custom tool to solve the problem at hand. This often leads to uses that the developers never imagined while creating the individual programs. We'll use another example to illustrate this idea further.

Example: who is on the system?

Let's say you want a sorted list of the user names for all the users on the system. Since there might be a lot of user names, you'll also want to display the user names in several columns. The password database (usually stored in the file /etc/passwd) includes the user name along with other information about each user. Each line in the file contains the record for a given user. The individual fields are delimited with the colon character (':') as shown in the example below:

joe:x:60:60:Joe Quigley:/home/joe:/usr/bin/tcsh

On many UNIX systems, the password database is not stored locally on each machine. Instead, it is distributed via the Network Information Service (NIS). (This was formerly called Yellow Pages, or YP.) If your system uses NIS, you can list the password database by entering ypcat passwd. If your system uses a local password file, you would enter cat /etc/passwd instead.

What we need to solve our problem is a tool that will echo the contents of the password database, cut out the first field (the user name) for each entry, sort the resulting list, and then print the list in multiple columns. By combining a number of smaller tools, we can create the tool we need right on the spot. Table 11.1 summarizes the commands we'll use to build our pipeline.

  command              action
  ypcat passwd         lists the contents of the password database
                       (or cat /etc/passwd if you don't have Yellow Pages)
  grep -v \#           prints all lines that are not comments (-v inverts
                       the match). The \# needs a leading backslash;
                       otherwise the shell would treat # as the start of a
                       comment and ignore the rest of the pipeline.
  cut -f1 -d:          cuts out the first field from each line (using ':'
                       as the delimiter)
  sort                 sorts the list (duh!)
  pr -t -5             prints the list in 5 columns (the -t suppresses the
                       page headers and footers)

Now let's try out our pipeline:

% ypcat passwd | grep -v \# | cut -f1 -d: | sort | pr -t -5

aaron         cpresley      jam           lefty         pdank
adriano       cringely      jand          lucinda       pearl
agorman       ddale         joebob        ludmilla      pinky
.             .             .             .             .
.             .             .             .             .
.             .             .             .             .

If it's still not clear how you got the result you did, try removing each piece of the pipeline one at a time, going from right to left. In other words, remove "| pr -t -5" from the command above, then remove "| sort", and so on.
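For instance, the first few peels look like this (assuming a local /etc/passwd; substitute ypcat passwd if your system uses NIS):

```shell
# Peel the pipeline from right to left, one stage at a time:
cat /etc/passwd | grep -v \# | cut -f1 -d: | sort   # sorted, single column
cat /etc/passwd | grep -v \# | cut -f1 -d:          # unsorted user names
cat /etc/passwd | grep -v \#                        # whole passwd records
```

Each run shows what the removed stage was contributing.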

Example: Sorting on userid

Sorting on a particular column is so common that sort has special flags for just that purpose. In the following example, we will report the top five lines of passwd based on the numeric user ids (column 3).
  command                action
  cat /etc/passwd        lists the contents of the password file
  grep -v \#             removes comment lines
  sort -n -r -t: -k3     -n sorts numerically
                         -r sorts in reverse order (biggest first)
                         -t: assumes the column delimiter is ':'
                         -k3 sorts on column 3
  sed "s/:/\t/g"         substitutes ':' with '\t' ("g" means: repeat for
                         every occurrence on the line)
                         Note: when working on the command line, it may be
                         necessary to enter "\t" by typing "control-v TAB".
  head -5                prints the top 5 lines

And how does this all work?

% cat /etc/passwd | grep -v \# | sort -n -r -t: -k 3  | sed "s/:/\t/g"  | head -5 

_unknown        *       99      99      Unknown User    /var/empty      /usr/bin/false
_atsserver      *       97      97      ATS Server      /var/empty      /usr/bin/false
_installer      *       96      -2      Installer       /var/empty      /usr/bin/false
_update_sharing *       95      -2      Update Sharing  /var/empty      /usr/bin/false
_teamsserver    *       94      94      TeamsServer     /var/teamsserver        /usr/bin/false

So _unknown has the largest user id.
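The column flags are easy to see on a tiny stand-in for passwd (three made-up records):

```shell
# Sort numerically (-n), reversed (-r), on field 3 (-k3), with ':' as delimiter:
printf 'a:x:2\nb:x:10\nc:x:1\n' | sort -n -r -t: -k3
# prints: b:x:10, then a:x:2, then c:x:1
```

Without -n, the keys would sort as text and "10" would come before "2".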

Example: uniq entries

Another neat trick is uniq -c, which prints a count of unique lines. This filter is pretty fast: it assumes the input is already sorted, so it only has to compare adjacent lines.
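That assumption matters: without a sort in front, uniq -c only merges adjacent duplicates. A quick check on a made-up three-line input:

```shell
# Unsorted: the two a's are not adjacent, so uniq reports three groups of one.
printf 'a\nb\na\n' | uniq -c

# Sorted first: now the a's sit next to each other and get counted together.
printf 'a\nb\na\n' | sort | uniq -c
```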

For example, suppose data.dat contains:

/var/empty
/var/root
/var/root
/var/spool/uucp
/var/spool/cups
/var/spool/postfix
/var/empty
/var/pcast/agent
/var/pcast/server
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/Library/WebServer
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/imap
/var/empty
/var/empty
/var/virusmails
/var/virusmails
/var/empty
/var/xgrid/controller
/var/xgrid/agent
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/teamsserver
/var/empty
/var/empty
/var/empty
/var/empty

Then we can find a count of the unique lines. (The file data.dat above is just the sixth column of passwd -- each user's home directory -- i.e. the output of the first three stages below.)
  command            action
  cat /etc/passwd    lists the contents of the password file
  grep -v \#         removes comment lines
  cut -d: -f 6       grabs just the sixth column
  sort               sorts the list
  uniq -c            prints the frequencies of the unique lines

And how does that all work?

%  cat data.dat | sort | uniq -c

   1 /Library/WebServer
  26 /var/empty
   1 /var/imap
   1 /var/pcast/agent
   1 /var/pcast/server
   2 /var/root
   1 /var/spool/cups
   1 /var/spool/postfix
   1 /var/spool/uucp
   1 /var/teamsserver
   2 /var/virusmails
   1 /var/xgrid/agent
   1 /var/xgrid/controller

Example: Counting Unique entries

If we pipe the output of uniq -c to another sort, we can find the most common entries.

We can use this trick to find the most common entries in passwd as follows:
  command            action
  cat /etc/passwd    lists the contents of the password file
  grep -v \#         removes comment lines
  cut -d: -f 6       grabs just the sixth column
  sort               sorts the list
  uniq -c            prints the frequencies of the unique lines
  sort -r -n         sorts by frequency in reverse order
  gawk -f hist.awk   a little histogram printer

And how does this all work?

$ cat /etc/passwd | grep -v \# | cut -d: -f 6|sort |
                    uniq -c | sort -r -n | gawk -f hist.awk

              **************************  26 /var/empty
                                      **   2 /var/virusmails
                                      **   2 /var/root
                                       *   1 /var/xgrid/controller
                                       *   1 /var/xgrid/agent
                                       *   1 /var/teamsserver
                                       *   1 /var/spool/uucp
                                       *   1 /var/spool/postfix
                                       *   1 /var/spool/cups
                                       *   1 /var/pcast/server
                                       *   1 /var/pcast/agent
                                       *   1 /var/imap
                                       *   1 /Library/WebServer

So the most common root directory is /var/empty.

Oh, and to be complete, here's the code for hist.awk. It reads the maximum count from line one (the input arrives sorted biggest-first), computes a scale so that no bar exceeds some maximum width, then prints each line ($0) with some stars in front.

NR==1  { Width = Width ? Width : 40          # set Width if it is missing
         Scale = $1 > Width ? Width / $1 : 1 # shrink bars that would overflow
       }
       { Stars = int($1 * Scale)
         print str(Width - Stars," ") str(Stars,"*") $0
       }
function str(n,c,  tmp) { # returns a string, size "n", of all "c"
    while((n--) > 0) tmp = c tmp
    return tmp
}
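To see it run by hand, here is a sketch that saves the program above as hist.awk and feeds it two made-up uniq -c style lines (plain awk suffices; no gawk extensions are used):

```shell
# Save the histogram printer, then feed it "count value" lines by hand.
cat > hist.awk <<'EOF'
NR==1  { Width = Width ? Width : 40
         Scale = $1 > Width ? Width / $1 : 1 }
       { Stars = int($1 * Scale)
         print str(Width - Stars," ") str(Stars,"*") $0 }
function str(n,c,  tmp) { while((n--) > 0) tmp = c tmp; return tmp }
EOF
printf '  4 first\n  1 second\n' | awk -f hist.awk
```

The line with the larger count gets the longer bar of stars.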

Example: Writing a Test Engine

This example has no filtering- but it sets us up for the next example.

In the following, test1 is some code to test. When it runs, it generates a file that we store in $Tmp/test1.got.

test1 > $Tmp/test1.got 
if   diff -s $Tmp/test1.got $Testdir/test1.want > /dev/null
then echo PASSED test1 
else echo FAILED test1, got $Tmp/test1.got
fi

We assume that some expectation of the code output is stored in $Testdir/test1.want. The command diff -s reports when two files are the same; here we only use its exit status (the report itself goes to /dev/null). If our test output is the same as what we want, we print PASSED. Else, we print FAILED.
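The same check generalizes to a little shell function (a sketch; run_one is a made-up name, and $Tmp and $Testdir are assumed to be set to suitable directories as above):

```shell
# Sketch: wrap the pass/fail check so any test can reuse it.
# "$1" names a command to run; its output is compared against the
# stored expectation in $Testdir/$1.want.
run_one() {
    "$1" > "$Tmp/$1.got"
    if   diff "$Tmp/$1.got" "$Testdir/$1.want" > /dev/null
    then echo "PASSED $1"
    else echo "FAILED $1, got $Tmp/$1.got"
    fi
}
```

For example, run_one test1 reproduces the snippet above exactly.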

'nough said. Moving on...

Example: Scoring a test suite

Suppose we have lots of tests (not just test1) and they are all run by a single script called tests. We can run them all, and pipe the output to get a count of the number of PASSes and FAILs.

The following code uses all the constructs described above. The egrep filter is a smarter grep that lets us specify more complex patterns (in this case, PASSED or FAILED).

tests | cut -d\  -f 1 | egrep '(PASSED|FAILED)' | sort | uniq -c

Note that the code for this example is used in the test engine for this class.
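If you don't have a tests script handy, printf can fake its output (the three test names here are made up):

```shell
# Fake three test results, then tally them exactly as the pipeline above does.
printf 'PASSED test1\nFAILED test2\nPASSED test3\n' |
    cut -d' ' -f1 | egrep '(PASSED|FAILED)' | sort | uniq -c
# counts: 1 FAILED, 2 PASSED
```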

Example: Your own Spell Checker

(From Wikipedia.)

This example implements a spell checker for the web resource indicated by a URL. It is a little advanced, so don't fuss over the details. Just note that BIG functionality can be implemented via a little pipe and filtering.

curl "http://en.wikipedia.org/wiki/Pipeline_(Unix)" | 
sed 's/[^a-zA-Z ]/ /g' | 
tr 'A-Z ' 'a-z\n' | 
grep '[a-z]' | 
sort -u | 
comm -23 - /usr/dict/words
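The last stage, comm -23, prints lines that appear only in its first input; both inputs must be sorted (sort -u handles the word list, and the dictionary file is assumed to be sorted already). A toy version with a three-word "page" and a two-word "dictionary" (made-up data, since the dictionary's location varies by system):

```shell
# comm -23 keeps lines found only in the first (sorted) file --
# here, the one "word" missing from our toy dictionary.
printf 'cat\ndog\nzyzzyx\n' > seen.txt
printf 'cat\ndog\n'         > dict.txt
comm -23 seen.txt dict.txt        # prints: zyzzyx
rm seen.txt dict.txt
```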