This page comes from
Kevin Heard's
excellent UNIX tutorials.
The Enlightened Ones say that....
Filters take input and transform it; cut, sort, grep, and sed are all filters. And we can wire all these filters together using pipes.
Redirection is useful for making a process use a file for input or output, but what if we want to send the output of one process directly to the input of another? To do that, we need a pipe. A pipe is a method of inter-process communication (IPC) which allows a one-way flow of information between two processes on the same machine. We use the "vertical bar" character ("|") to create a pipe between two processes.
Let's combine the actions we carried out in the exercises of the previous section into a single command using a pipe. To do this, we will issue the who command, followed by the pipe symbol, followed by the mail command.
% who | mail -s "let's try a pipe" username@domain.com
The result of this operation is that the output of the who command will be sent directly to the input of the mail command, without the intermediate step of storing the data in a file.
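The same idea works between any two commands that read standard input and write standard output. Here is a tiny, self-contained sketch (printf stands in for a real data source such as who):

```shell
# three unsorted lines stand in for the output of a real command;
# sort orders them, head keeps only the first two
printf 'cherry\napple\nbanana\n' | sort | head -2
# prints: apple, then banana
```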
The pipe construct is very powerful, because it allows the user to join several small programs to create a custom tool to solve the problem at hand. This often leads to uses that the developers never imagined while creating the individual programs. We'll use another example to illustrate this idea further.
Let's say you want a sorted list of the user names for all the users on the system. Since there might be a lot of user names, you'll also want to display the user names in several columns. The password database (usually stored in the file /etc/passwd) includes the user name along with other information about each user. Each line in the file contains the record for a given user. The individual fields are delimited with the colon character (':') as shown in the example below:
joe:x:60:60:Joe Quigley:/home/joe:/usr/bin/tcsh
On many UNIX systems, the password database is not stored locally on each machine. Instead, it is distributed via the Network Information Service (NIS). (This was formerly called Yellow Pages, or YP.) If your system uses NIS, you can list the password database by entering ypcat passwd. If your system uses a local password file, you would enter cat /etc/passwd instead.
What we need to solve our problem is a tool that will echo the contents of the password database, cut out the first field (the user name) for each entry, sort the resulting list, and then print the list in multiple columns. By combining a number of smaller tools, we can create the tool we need right on the spot. Table 11.1 summarizes the commands we'll use to build our pipeline.
command | action
------- | ------
ypcat passwd | lists the contents of the password database
(or cat /etc/passwd) | (if you don't have Yellow Pages/NIS)
grep -v \# | prints all lines that are not comments (-v inverts the match). The # needs a leading backslash; otherwise the shell would treat it as the start of a comment and discard the rest of the pipeline.
cut -f1 -d: | cuts out the first field from each line (using ':' as the delimiter)
sort | sorts the list (duh!)
pr -t -5 | prints the list in 5 columns (the -t suppresses the page header and trailer)
Now let's try out our pipeline:
% ypcat passwd | grep -v \# | cut -f1 -d: | sort | pr -t -5
aaron     cpresley  jam       lefty     pdank
adriano   cringely  jand      lucinda   pearl
agorman   ddale     joebob    ludmilla  pinky
.         .         .         .         .
.         .         .         .         .
.         .         .         .         .
If it's still not clear how you got the result you did, try removing each piece of the pipeline one at a time, going from right to left. In other words, remove "| pr -t -5" from the command above, then remove "| sort", and so on.
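You can also try the heart of the pipeline on made-up data. The two passwd-style entries below are invented, so nothing here touches your real password database:

```shell
# two fake passwd entries; cut out field 1 (the user name), then sort
printf 'zoe:x:60:60:Zoe:/home/zoe:/bin/sh\nabe:x:61:61:Abe:/home/abe:/bin/sh\n' \
  | cut -f1 -d: | sort
# prints: abe, then zoe
```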
Sorting on a particular column is so common that sort has special flags for just that purpose. In the following example, we will report the top five lines of passwd based on the numeric user IDs (column 3).
command | action
------- | ------
cat /etc/passwd | lists the contents of the password file
grep -v \# | drops the comment lines
sort -n -r -t: -k3 | -n sorts numerically; -r sorts in reverse order (biggest first); -t: uses ':' as the column delimiter; -k3 sorts on column 3
sed "s/:/\t/g" | substitutes every ":" with a tab (the "g" means: repeat for all matches on the line). Note: when working on the command line, it may be necessary to enter "\t" by typing control-v TAB.
head -5 | prints the top 5 lines
And how does this all work?
% cat /etc/passwd | grep -v \# | sort -n -r -t: -k 3 | sed "s/:/\t/g" | head -5
_unknown         *  99  99  Unknown User    /var/empty        /usr/bin/false
_atsserver       *  97  97  ATS Server      /var/empty        /usr/bin/false
_installer       *  96  -2  Installer       /var/empty        /usr/bin/false
_update_sharing  *  95  -2  Update Sharing  /var/empty        /usr/bin/false
_teamsserver     *  94  94  TeamsServer     /var/teamsserver  /usr/bin/false
So _unknown has the largest user id.
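The same flags work on any colon-delimited data. Here is a minimal sketch with three invented rows, where field 3 holds the number we sort on:

```shell
# sort numerically (-n), biggest first (-r), on colon-delimited (-t:) column 3 (-k3)
printf 'a:x:1\nb:x:3\nc:x:2\n' | sort -n -r -t: -k3
# prints: b:x:3, then c:x:2, then a:x:1
```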
Another neat trick is uniq -c, which prints a count of unique lines. This filter is pretty fast: it assumes the input is already sorted, so it only has to compare adjacent pairs of lines.
For example, suppose data.dat contains:
/var/empty
/var/root
/var/root
/var/spool/uucp
/var/spool/cups
/var/spool/postfix
/var/empty
/var/pcast/agent
/var/pcast/server
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/Library/WebServer
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/imap
/var/empty
/var/empty
/var/virusmails
/var/virusmails
/var/empty
/var/xgrid/controller
/var/xgrid/agent
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/empty
/var/teamsserver
/var/empty
/var/empty
/var/empty
/var/empty
Then we can find a count of the unique lines as follows:
command | action
------- | ------
cat /etc/passwd | lists the contents of the password file
grep -v \# | drops the comment lines
cut -d: -f 6 | grabs just the sixth column
sort | sorts the lines so duplicates are adjacent
uniq -c | prints the frequencies of the unique lines
And how does this all work?
% cat data.dat | sort | uniq -c
   1 /Library/WebServer
  26 /var/empty
   1 /var/imap
   1 /var/pcast/agent
   1 /var/pcast/server
   2 /var/root
   1 /var/spool/cups
   1 /var/spool/postfix
   1 /var/spool/uucp
   1 /var/teamsserver
   2 /var/virusmails
   1 /var/xgrid/agent
   1 /var/xgrid/controller
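The sort step matters. Because uniq only compares adjacent lines, unsorted input produces a separate count for each run of the same value; a quick sketch:

```shell
printf 'a\na\nb\na\n' | uniq -c          # three runs: 2 a, 1 b, 1 a
printf 'a\na\nb\na\n' | sort | uniq -c   # two lines:  3 a, 1 b
```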
If we pipe the output of uniq -c to another sort, we can find the most common entries. We can use this trick to find the most common entries in passwd as follows:
command | action
------- | ------
cat /etc/passwd | lists the contents of the password file
grep -v \# | drops the comment lines
cut -d: -f 6 | grabs just the sixth column
sort | sorts the lines so duplicates are adjacent
uniq -c | prints the frequencies of the unique lines
sort -r -n | sorts by frequency in reverse order
gawk -f hist.awk | a little histogram printer
And how does this all work?
$ cat /etc/passwd | grep -v \# | cut -d: -f 6 | sort | uniq -c | sort -r -n | gawk -f hist.awk
              **************************  26 /var/empty
                                      **   2 /var/virusmails
                                      **   2 /var/root
                                       *   1 /var/xgrid/controller
                                       *   1 /var/xgrid/agent
                                       *   1 /var/teamsserver
                                       *   1 /var/spool/uucp
                                       *   1 /var/spool/postfix
                                       *   1 /var/spool/cups
                                       *   1 /var/pcast/server
                                       *   1 /var/pcast/agent
                                       *   1 /var/imap
                                       *   1 /Library/WebServer
So the most common root directory is /var/empty.
Oh, and to be complete, here's the code for hist.awk. It reads the biggest count from line one (the input arrives sorted, largest first), scales it to some maximum bar width, then prints each line ($0) with some stars in front.
NR==1 { Width = Width ? Width : 40             # set Width if it is missing
        Scale = $1 > Width ? Width / $1 : 1 }  # shrink the bars if counts exceed Width
      { Stars = int($1 * Scale)
        print str(Width - Stars, " ") str(Stars, "*") $0 }

# returns a string, size "n", of all "c"
function str(n, c,   tmp) {
    while ((n--) > 0) tmp = c tmp
    return tmp
}
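To sanity-check the histogram logic without the full passwd pipeline, here is an inline copy of the same program (with the scale computed as Width / $1, so bars longer than Width shrink to fit) run on two hand-made count lines:

```shell
# fake "uniq -c | sort -r -n" output: the first line carries the biggest count
printf '  4 /var/empty\n  2 /var/root\n' | awk '
  NR==1 { Width = Width ? Width : 40          # default bar width
          Scale = $1 > Width ? Width / $1 : 1 }
        { Stars = int($1 * Scale)
          print str(Width - Stars, " ") str(Stars, "*") $0 }
  function str(n, c,   tmp) {                 # a string of n copies of c
      while ((n--) > 0) tmp = c tmp
      return tmp
  }'
```

Each output line is a right-aligned bar of stars followed by the original count and path.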
This example has no filtering, but it sets us up for the next one.
In the following, test1 is some code to test. When it runs, it generates a file that we store in $Tmp/test1.got.
test1 > $Tmp/test1.got
if diff -s $Tmp/test1.got $Testdir/test1.want > /dev/null
then echo PASSED test1
else echo FAILED test1, got $Tmp/test1.got
fi
We assume that the expected output of the code is stored in $Testdir/test1.want. The command diff -s reports when two files are the same (its exit status is zero when they match). If our test output is the same as what we want, we print PASSED. Else, we print FAILED.
Suppose we have lots of tests (not just test1) and they are all run by a script called tests. We can run them all, and pipe the results to get a count of the number of PASSes and FAILs.
The following code uses all the constructs described above. The egrep filter is a smarter grep that lets us specify more complex patterns (in this case, PASSED or FAILED).
tests | cut -d' ' -f1 | egrep '(PASSED|FAILED)' | sort | uniq -c
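You can simulate a run of tests with printf (the PASSED/FAILED lines below are invented) to watch the counting pipeline work:

```shell
# fake test-engine output: two passes and one failure
printf 'PASSED test1\nFAILED test2\nPASSED test3\n' \
  | cut -d' ' -f1 | egrep '(PASSED|FAILED)' | sort | uniq -c
# one line counts the FAILEDs, another counts the PASSEDs
```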
Note that we use code like this in the test engine used in this class.
(From Wikipedia.)
This example implements a spell checker for the web resource indicated by a URL. It is a little advanced, so don't fuss over the details. Just note that BIG functionality can be implemented via a little piping and filtering.
curl "http://en.wikipedia.org/wiki/Pipeline_(Unix)" | sed 's/[^a-zA-Z ]/ /g' | tr 'A-Z ' 'a-z\n' | grep '[a-z]' | sort -u | comm -23 - /usr/dict/words
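You can watch the word-extraction half of that pipeline on a small sample string, with no network needed. The comm step is skipped here because the dictionary's location varies (on many modern systems it is /usr/share/dict/words rather than /usr/dict/words):

```shell
# tokenize a sample sentence the same way the spell checker does:
# punctuation -> spaces, upper -> lower, spaces -> newlines, unique sort
printf "It's a pipe, a UNIX pipe.\n" \
  | sed 's/[^a-zA-Z ]/ /g' \
  | tr 'A-Z ' 'a-z\n' \
  | grep '[a-z]' \
  | sort -u
# prints: a, it, pipe, s, unix (one word per line)
```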