# -*- markdown -*-
"""
The Other Side of Complexity
-----------------------------
_"Because easy is not wrong."_ - Anon.
_"Good design is as little design as possible."_
- [Dieter Rams](http://en.wikipedia.org/wiki/Dieter_Rams)
"I would not give a fig for the simplicity on this side of
complexity, but I would give my life for the simplicity on the
other side of complexity." -
[Oliver Wendell Holmes, Sr.](http://en.wikipedia.org/wiki/Oliver_Wendell_Holmes,_Sr)
## About AUK
AUK is a simple, succinct, interpreted C-like language. This is a
langauge with no complex features. That is, is has no pointers, no
closures, no macros, etc. Hence it is useful for succinctly
describing algorithms in their simplest terms. The two rules of AUK
are:
+ If you understand "it" then you can code it up, very simply, in AUK.
+ If your AUK code is not simple, they you have yet to understand "it".
So, to find simplicity:
1. Code it in AUK;
2. Code it simpler, in AUK;
3. Goto step 2.
From a technical perspective, AUK is a pre-processor that translates
to [GNU AWK (a.k.a. GAWK)](http://awk.info). With a few exceptions
(e.g. the dashed functions and dashed variables discussed below) AUK
looks like GAWK which looks a lot like a simple version of "_C_".
For example, the following function pushes `x` ono the end of the list
`a` then returns the length of the new list.
"""
function push(x,a, i) {
i = 1 + length(a)
a[i] = x
return i
}
"""
Note that `push` is called with two variables so the `i` variable
(defined in `push`'s arguments) is a local variable that is only
accessible inside `push` (and which is cleared from memory when `push`
quits).
## Dashing Through Auk
To read AUK code, look first for the dashes. These can be found:
+ In the _dashed variables_ of the `.h` header files;
+ In the _dashed functions_ found either in each source file
or in a helper demo file.
To be accurate, by _dash_ I really mean the underscore character.
But "dash" sounds much better than underscore so...
### Dashed Functions
Dashed functions start with an underscore. These are high-level calls that
test/demonstrate the rest of the code.
For example,
`_push` is the dashed function for `push`:
"""
function _push() {
team[1] = "mead"
team[2] = "einstein"
team[3] = "curie"
print push("carson",team) #<== returns "4"
}
"""
In AUK, every file should have one dashed function that is an underscore,
followed by the name of that file.
+ E.g the file `mycode` should have a demo function `_mycode`.
+ AUK has a comment-line tool, called `demo` that loads some
_file.auk_, then tries to run the demo function of that `_file`).
### Helper Files
Dash functions can also be found in a helper file. These helpers
are test scripts long enough to deserve their own file. When reading
AUK code, look for these helper files since they contain examples
of high-level calls to the code.
By convention, an AUK file _x.auk_ has the helper _xed.auk_
(where "ed" stands for something like "excellent demo").
For example:
+ Suppose there is a _reader.auk_ file that converts
comma-separated files into something called a `_Table` (and by the way,
`_Table` is one of those dashed variables discussed in the
next section).
+ This file has a helper called _readered.auk_.
That helper file contains the following code:
# readered.auk: demos of readcsv from reader.auk
@include "reader.awk"
function _readered( _Table) {
readcsv("data/weather1.csv",0,_Table)
o(dep[0],"dep")
o(indep[0],"indep")
o(order[0],"order")
o(name[0],"name")
tableprint(_Table[0],"%4.2f")
}
If we run the AUK demo script `demo readered`, then
the following will happen, automatically:
+ _readered.auk_ will load _reader.auk_;
+ To be technically correct, `demo` will compile
all the _.auk_ files to _.awk_ then run _readered.awk_-
which is why the file does `@include "reader.awk"` and not
`@include "reader.auk"`. But you get the idea.
+ The dashed function `_readered` (shown above) will be called.
+ This in-turn will call the `readcsv` function (defined in
_reader.auk_) which load the contents of
_data/weather1.csv_ into a `_Table` at index position 0.
+ Then, a whole bunch of print statements to show the internal
structures of `_Table[0]`.
The details of those prints do not concern us here, expect to say that
this is _demo_ code that can be safely separated from the core
functionality inside _reader.auk_.
### Dashed Variables
A dashed variable is a macro that expands to a set of other variables.
For example, the function `_readered` uses some variables that look
like they are strays from elsewhere in the code:
function _readered( _Table) {
readcsv("data/weather1.csv",0,_Table)
o(dep[0],"dep")
o(indep[0],"indep")
o(order[0],"order")
o(name[0],"name")
tableprint(_Table[0],"%4.2f")
}
Where do `dep,indep,order,name` come from? The answer is,
from with the `_Table` dashed variable:
+ When AUK expands this function into GAWK, it replaces `_Table` with
a set of named fields.
+ That expansion is shown below.
Note that, in the following, `_Table` has been replaced by a set of
variables that includes `dep,indep,order,name`.
# readered.auk: demos of readcsv from reader.auk
@include "reader.awk"
function _readered( data,name,order,nump,wordp,
indep,dep,less,more,klass,
term,num,n,count,mode,most,
hi,lo,mu,m2,sd) {
readcsv("data/weather1.csv",0,
data,name,order,nump,wordp,
indep,dep,less,more,klass,
term,num,n,count,mode,most,
hi,lo,mu,m2,sd)
o(dep[0],"dep")
o(indep[0],"indep")
o(order[0],"order")
o(name[0],"name")
tableprint(data[0],name[0],order[0],nump[0],wordp[0],
indep[0],dep[0],less[0],more[0],klass[0],
term[0],num[0],n[0],count[0],mode[0],most[0],
hi[0],lo[0],mu[0],m2[0],sd[0],
"%4.2f")
}
To find the expansion for the dashed variables,
AUK looks for any _.h_ header files in the same directory
as the _.auk_ files. For example, in the same directory
as _reader.auk_, AUK might find:
_Table
data # data[row][col] holds 1 cell of datum
name # name of i-th colum
#
# meta knowledge
#
order # order of the columns
nump # is i-th column numeric
wordp # is i-th column non-numeric?
indep # list of indep columns
dep # list of dep columns
less # numeric goal to be minimized
more # numeric goal to be maximized
klass # non-numeric goal
term # non-numeric non-goal
num # numeric non-goal
#
# about all cols
n # count of things about this col
#
# about wordp columns:
#
count # count of each word
mode # most common word
most # count of most common word
#
# about nump columns:
#
hi # upper bound
lo # lower bound
mu # mean
m2 # sum of all nums
sd # standard deviation
After removing comments and whitespace, AUK tells itself that any
string of the form `_Table[X]?` can be replaced with the following:
dataX.nameX,orderX,numpX, wordpX,
indepX,depX,lessX,moreX,klassX,
termX,numX,nX,countX,modeX,
mostX,hiX,loX,muX,m2X,sedX
The dashed variables let us reference complex lists of variables in a
succinct manner. For example, in the above code for `_readered`,
it we want to talk about (say) the third table, then `_Table[3]`
gives us easy access to:
data[3].name[3],order[3],nump[3], wordp[3],
indep[3],dep[3],less[3],more[3],klass[3],
term[3],num[3],n[3],count[3],mode[3],
most[3],hi[3],lo[3],mu[3],m2[3],sed[3]
### In Summary
That's it: the main difference between AUK and GAWK
are the dashes. To understand AUK code, go chase:
+ The dashed variables in the _*.h_ files,
+ The dashed functions in the source code _file.auk_ and the helper
functions _filed.auk_.
## The Rest of AUK
To understand AUK code,
there are several lower level AUK/GAWK details that
need explaining. The rest of this page discusses those details. In
the following, we will say _in AUK_ when more precisely we should say
_in the source code that is pre-processed by AUK and coverted into
GAWK_... But that gets a little clumsy so we'll say "AUK" all the
time.
## Comments
Comments on a single line are any text that follows a `#` symbol.
Multi-line comments start and end on lines that start with three
double quotes `"""` (AUK extension).
## Control Structures
To define combinations of `tests`, AUK uses a C-like syntax for ands
`&&`, ors `||`, and nots `!`. These combinations can be used in the
following control structures. In the following, if `test` returns the
empty string or zero, then `test` is false. Otherwise it is true.
+ `if` (test) `else` Y
+ `while` (test) body
+ `do` body `while` (test)
+ `for`(start ; until ; increment) body
For `while`, `do` and `for`, the following two keywords that can be
used within `body`:
+ `continue` jumps to the next iteration of the inner-most loop;
+ `break` terminates this loop.
AUK also supports the tenary operator `x ? y : z` which is a way of
returning a value from an if-then-else statement. The ternary operator
is used in, for example, the AUK absolute value function. This
following code negates the passed value if it is negative. Else it
returns the value unmodified:
"""
function abs(x) {
return x > 0 ? x : -1*x
}
"""
Note that if a `return x` command is called anywhere in a function,
then that function exits with a return value of `x`. Functions
without `return` values output empty strings empty strings.
## Assignment, Equality, and Match
For assignment, use one equals; e.g. `x=1`.
For testing equality, use two equals; e.g. `if (x==1) ...`.
For partial match, use the regular expression operator `~`. For
example, the following line tries to read the _header_ from a CSV
file:
+ In this file, a line can continue to the next line if it ends with a
`,` character.
+ In which case, we have to recurse to get the remainder of the line.
The fifth line of this function uses `~` to check for a comma,
followed by zero ot more white space characters `[ \t]*`, followed by
end of line `$`.`
"""
function line(f, str) {
if ((getline str < f) > 0) { # get "str"
gsub(/[ \t\r]*/,"",str) # kill whilespace
gsub(/#.*$/,"",str) # kill comments
if ( str ~ /,[ \t]*$/ ) # does str end with `,`...
return str line(f) # get rest,recursively
else
return str # output the string
} else # if getline finds nothing, output error code
return -1
}
"""
To illustrate this example, here is the top of the
file _data/weather1.csv_. Ignoring the comments,
observe how the first line ends with a `,`.
"""
outlook, # forecast
?+$temperature, # degrees Fahrenheit,
-$humidity, # percent of dew point
windy, # boolean
=play # goal
########################################
sunny ,85 ,90 ,FALSE ,no
sunny ,80 ,90 ,TRUE ,no
overcast ,83 ,86 ,FALSE ,yes
...
"""
For this file, the `line` function would return the following:
outlook,?+$temperature,_$humidity,windy,=play
That is, `line` studies the header and
returns everything up to the line that _does not_ end
with a comma.
## Variables
### Types of Variables
In AUK, variables do not need to be declared-
they appear as your use them.
There are only three types of variables: strings, numbers, and lists.
To ensure a number is a number, add zero to it.
"""
x=x+0
"""
To ensure a string is a string, add an empty string to it.
"""
x= x ""
"""
To ensure an list is an list, create it with some function
that generates list. For example, the `split` functions generates
a list by splitting some string on some separator. We can
(ab)use it to make lists as follows:
"""
split("",lst,"")
"""
This is such a common idiom that the AUK library includes the function
`new`, defined as follows:
"""
function new(lst) {
split("",lst,"")
}
"""
### Global Variables
AUK uses a set of special global variables:
+ `NF` is the number of fields read on the latest line of input.
+ `FS` is the field seperator used to seperate field inputs.
+ Defaults to `FS=" "`;
+ For comma-sperated files, use `FS=","`;
+ After reading a line, each field is `$1`, `$2`, ... `$NF`;
+ Also, the entire line is `$0`.
+ `OFS` is the field separator between output fields.
+ Defaults to `OFS=" "`.
+ `RS` is the record seperator for input fields.
+ Defaults to `RS="\n"`;
+ The magic setting `RS=""; FS="\n"` lets AUK read paragraphs
at a time with one field per line in the paragraph.
+ `CONVFMT` controls the printing of floats.
+ Defaults to `%.6g`,
which prints a value with at least six significant digits
+ `PROCINFO["sorted_in"]` controls the order of traversal of keys
in a list:
+ Note that usually this variable is an empty string so
the traversal order is undefined.
+ If you want more control than that:
+ `@ind_num_asc` sorts up numerically;
+ `@ind_str_asc` sorts up alphanumerically;
+ If you switch `asc` with `desc`, the above are reversed;
+ For other sort orders, GAWK offers
[a set of built-ins](http://goo.gl/QQPtV4).
+ If they do not suffice, you can `@f` to
control sort using your own function `f`.
### Local Variables
To ensure your variables aren't global, use them within a function and
add more variables to the call. For example if a function is passed
two variables, define it with two _as well as_ the local variables.
For example:
+ In the above `_push` function, there are no local variables.
+ In the above `push` function, the `i` variable is local.
Note that its good practice to add white space between passed and
local variables.
One interesting idiom for setting defaults for local variables
is as follows. The following function checks if some arguments
have been passed in. If not, it sets defaults:
function hasDefaults(mustHave1, mustHave2,
optional1, optional2,
local1, local2) {
optional1 = optional1 ? optional1 : 3.142
optional2 = optional2 ? optional2 : "tuesday"
...
}
In the above, if nothing is passed for `optional1` and `optional2`,
then they are set to `3.142` and `"tuesday"`, respectively.
This example uses the _ternary operator_ that is discussed below
(see _Control Structures_).
### Lists are "Associative Arrays"
Lists are accessed via strings. Such key strings
are hashed so all access is in constatnt time.
+ e.g. `lst["emp"]`
+ And `lst[1]` is treated by AUK as `lst["1"]`.
To test if a key exists in a list:
+ `if (key in lst) { ... } `
To create something like a hash table of booleans:
+ The idiom `flags["thing"]` creates a key `"thing"`
with a value of empty string.
+ After that idiom, the test `if( "thing" in flags) {...}`
executes the `if` block if `flags` has a
key `things`.
You can remove an individual element of an list using the delete
statement `delete lst[index]`. It is not an error to delete an
element which does not exist.
AUK has a special kind of for statement for scanning an list:
"""
for (var in list)
body
"""
This loop executes body once for each different key that your
program has previously used as an index in list, with the variable
var set to that index. Note that, in classic GAWK,
the order of traversal is
not defined (newer versions of GAWk can guide that
traversal using the magic list variable `PROCINFO["sorted_in"]`;
for an example of using that magic variable, see below).
For an example of using lists,
we can count the frequency of words in a document:
"""
{for(i=1;i <=NF;i++) freq[$i]++ }
"""
The list will hold an integer value for each word that occurred in
the file. This list can be printed using the
`for` construct:
"""
{for(i=1;i <=NF;i++) freq[$i]++ }
END { for(word in freq) print word, freq[word] }
""""
In GAWK4+, lists can be nested with the usual syntax.
E.g. `data[row][col]=23` stores a number at some row and column
position in a two-dimensional data matrix.
Nested lists are very useful for ordering named properties. For
example:
+ The following function _project_ runs over the `d` items in `data`
then projects them into a two-dimensional `x,y` co-ordinate system.
+ Technical aside: this is done by using the cosine
rule to project the points onto a line running from two
points `west` to `east`, then using Pythagorean to
find a point on an orthogonal dimension.
This new information is stored in the nested `at` list.
function project(east,west,_Table,
a,b,c,d,m,at) {
c = dist(east,west,_Table)
for(d in data) {
a = dist(d,east,_Table)
b = dist(d,west,_Table)
m++
at[m]["d"]= d
at[m]["x"]= x= (a^2 + c^2 - b^2) / (2*c)
at[m]["y"]= (a^2 - x^2)^0.5
}
print_in_x_order(at)
}
Once the list is generated, then we sort and print it, ordered on
`at`'s `"x"` field (using the `xsort` function).
function print_in_x_order(at, i,xs,m,key) {
m = asort(at,xs,"xsort")
for(i=1;i<=m;i++)
for(key in xs[i])
print ":" key " " at[i][key]
}
function xsort(r1,x1,r2,x2) {
return x1["x"] - x2["x"]
}
### Variables Passed to Functions
AUK passed numbers and strings by value- which is to say it copies the
value and passes in the copy. That means that the original value,
outside the function, is safe from modification. For example, the
following function returns a string of `n` number of strings
`c`. Inside the function, `n` is decremented (see `n--`) but outside
the function, that `n` is not modified.
"""
function nchars(n,c, out) {
while(n> 0) {
n--
out = out c
}
return out
}
"""
(As an aside, note how AUK concatenates strings on line four of the
function `out = out c` (this adds the character `c` to the growing
`out` string). Here, the concatenation "function" is the blank space
between two variables. That is, to concatenate two strings, place them
next to each other with a blank in-between them.)
Since copying values for a function call can be (a little) slow,
AUK passes lists to functions by reference. That is,
modifying a list in a function will change its contents outside
of that function.
Note that AUK functions can only return single values (strings or numbers).
To return mutliple values, pass in a list and write then return values
to different keys in that list.
## Commonly-used Functions
### print a,b,c
This is basic AUK function that prints `a` then `b` then `c` separated
by the OFS. This is followed by a newline character. Numbers are
printed using the `CONVFMT` variable.
### printf, sprintf
These are the more advanced printing functions that offer more
[control for displaying variables](http://goo.gl/OGpDv6). `Sprintf`
is just like `printf` except that it returns the string (rather than
to standard output):
sprintf("%10.6f",22/7) => " 3.142857"
### split(string,lst,[sep])
Generates a new `lst` comprising the `string` split on `sep`.
If `sep` is not supplied, it defaults to the `FS` field separator.
### sub(this,that,[string]), gsub(this,that,[string])
Replaces `this` for `that` in `string`. The `gsub` function
replaces all instances while `sub` just replaces the first instance.
If `string` is not supplied it defaults to `$0`.
### asort(lst1,[lst2],[how])
After the call to `asort()`, `lst2` contains the `lst1` data , indexed
from 1 to some number n (being the total number of elements in
`lst1`), sorted:
"""
lst1[1] <= lst1[2] <= lst1[3], and so on.
"""
This function returns the number of items in `lst1`.
If `lst2` is not supplied then `lst2` replaces `lst1`. If `how` is
not supplied, data is sorted in ascending order.
## Commonly Used-Idioms
In the following code, `n` starts as an empty string. The first `n++`
is a suffix addition (the old value is returned before we add one to
it).
function something( n) {
while (someTest) {
if (n++)
handle_first_iteration();
else
handle_other_iterations()
}
Hence, the _first_ time the loop hits `n++`, it returns zero
and the code calls `handle_first_iteration()`. After that,
for all other iterations, it calls `handle_other_iterations()`.
## Frequently Asked Questions
1. Q: Why do AUK files usually begin by loading GAWK files;
e.g. `@include "somefile.awk"`?
+ A: Because AUK is a pre-processor that translates `.auk`
files to `.awk`, then runs them using GAWK.
The command `@include` is the GAWK idiom for loading
those files at rungime.
## Appendix: Why AUK?
The above is the `what` of AUK. If you are interested in `why`
I built it, then read on.
GAWK is a great language. It has been my secret ninja power that let me
maintain a competitive academic career, while having fun
learning novel ways to build data miners,
while supervising numerous research
students, all the while
[publishing like a demon](http://goo.gl/Pq0knm).
You really should try GAWK- see
see _Classic Shell Scripting_ (chapter 9) and _Effective Awk
Programming_, both written by Arnold Robbins.
I love GAWK so much that for several years Arnold let me maintain the
[awk.info](http://awk.info) web site (which, by the way, is now
looking for another maintainer; any volunteers?). One disappointment
in that work was that I could not find shared and
widely-used GAWK libraries. This is a worry since it means
the GAWKers are more soloists than members of a large
cooperative. Looking deeper, I found that:
+ Standard GAWK uses pattern-directed programming where input is based
along a set of `pattern {action}` pairs.
+ That style of programming needs global variables to pass
information between the patterns. This discourages library
development (since one pattern can accidently overwrite a global
needed by another).
+ More fundamentally, this approach assumes every code file is a
top-level driver. If the goal is to build libraries of code,
that assumption is rarely true.
+ Also, the GAWK community has no standard formats for:
+ Headers explaining the variables used by different files;
+ Documentation;
+ Demonstration code, test suites, or regression suites
AUK was written to address the above. Just as an aside, the AUK to
GAWK pre-processor is written in GAWK (so the best way to extend GAWK
is to use GAWK). With AUK I can:
+ Implement data hiding and structures:
+ AUK lets me define _dashed variables_ such as `_3D` to be the
string `x,y,z`.
+ Once defined, then everywhere I use (say) `_3D[i]`, AUK expands
that to `x[i],y[i],z[i]`.
+ This means I can succinctly around complex structures to
sub-routines.
+ It also means that if I ever extend the definition of a 3d point
to include (say) its color and weight, then everywhere that uses
that color and weight will auto-expand (say) `_3D` to
`x,y,z,color,weight`.
+ Add documentation tools:
+ AUK supports multi-line comments in the
[Markdown syntax](http://daringfireball.net/projects/markdown/syntax).
+ AUK also support auto-generation of html pages from those
comments (e.g. this web page is automatically generated
from `idioms.auk`).
+ Add regression, testing, demonstration tools.
+ See _dashed functions_, described above.
+ Better package code
+ The dashed variables and the dashed functions and the documentation
tools lets me write shorter AUK code that is easy to share and
document and explain.
On top of the above, I wrote some emacs scripts and shell scripts
that let me edit and run AUK code in a specialized environment: