# -*- markdown -*- """ The Other Side of Complexity ----------------------------- _"Because easy is not wrong."_ - Anon. _"Good design is as little design as possible."_ - [Dieter Rams](http://en.wikipedia.org/wiki/Dieter_Rams) "I would not give a fig for the simplicity on this side of complexity, but I would give my life for the simplicity on the other side of complexity." - [Oliver Wendell Holmes, Sr.](http://en.wikipedia.org/wiki/Oliver_Wendell_Holmes,_Sr) ## About AUK AUK is a simple, succinct, interpreted C-like language. This is a langauge with no complex features. That is, is has no pointers, no closures, no macros, etc. Hence it is useful for succinctly describing algorithms in their simplest terms. The two rules of AUK are: + If you understand "it" then you can code it up, very simply, in AUK. + If your AUK code is not simple, they you have yet to understand "it". So, to find simplicity: 1. Code it in AUK; 2. Code it simpler, in AUK; 3. Goto step 2. From a technical perspective, AUK is a pre-processor that translates to [GNU AWK (a.k.a. GAWK)](http://awk.info). With a few exceptions (e.g. the dashed functions and dashed variables discussed below) AUK looks like GAWK which looks a lot like a simple version of "_C_". For example, the following function pushes `x` ono the end of the list `a` then returns the length of the new list. """ function push(x,a, i) { i = 1 + length(a) a[i] = x return i } """ Note that `push` is called with two variables so the `i` variable (defined in `push`'s arguments) is a local variable that is only accessible inside `push` (and which is cleared from memory when `push` quits). ## Dashing Through Auk To read AUK code, look first for the dashes. These can be found: + In the _dashed variables_ of the `.h` header files; + In the _dashed functions_ found either in each source file or in a helper demo file. To be accurate, by _dash_ I really mean the underscore character. But "dash" sounds much better than underscore so... ### Dashed Functions Dashed functions start with an underscore. These are high-level calls that test/demonstrate the rest of the code. For example, `_push` is the dashed function for `push`: """ function _push() { team[1] = "mead" team[2] = "einstein" team[3] = "curie" print push("carson",team) #<== returns "4" } """ In AUK, every file should have one dashed function that is an underscore, followed by the name of that file. + E.g the file `mycode` should have a demo function `_mycode`. + AUK has a comment-line tool, called `demo` that loads some _file.auk_, then tries to run the demo function of that `_file`). ### Helper Files Dash functions can also be found in a helper file. These helpers are test scripts long enough to deserve their own file. When reading AUK code, look for these helper files since they contain examples of high-level calls to the code. By convention, an AUK file _x.auk_ has the helper _xed.auk_ (where "ed" stands for something like "excellent demo"). For example: + Suppose there is a _reader.auk_ file that converts comma-separated files into something called a `_Table` (and by the way, `_Table` is one of those dashed variables discussed in the next section). + This file has a helper called _readered.auk_. That helper file contains the following code: # readered.auk: demos of readcsv from reader.auk @include "reader.awk" function _readered( _Table) { readcsv("data/weather1.csv",0,_Table) o(dep[0],"dep") o(indep[0],"indep") o(order[0],"order") o(name[0],"name") tableprint(_Table[0],"%4.2f") } If we run the AUK demo script `demo readered`, then the following will happen, automatically: + _readered.auk_ will load _reader.auk_; + To be technically correct, `demo` will compile all the _.auk_ files to _.awk_ then run _readered.awk_- which is why the file does `@include "reader.awk"` and not `@include "reader.auk"`. But you get the idea. + The dashed function `_readered` (shown above) will be called. + This in-turn will call the `readcsv` function (defined in _reader.auk_) which load the contents of _data/weather1.csv_ into a `_Table` at index position 0. + Then, a whole bunch of print statements to show the internal structures of `_Table[0]`. The details of those prints do not concern us here, expect to say that this is _demo_ code that can be safely separated from the core functionality inside _reader.auk_. ### Dashed Variables A dashed variable is a macro that expands to a set of other variables. For example, the function `_readered` uses some variables that look like they are strays from elsewhere in the code: function _readered( _Table) { readcsv("data/weather1.csv",0,_Table) o(dep[0],"dep") o(indep[0],"indep") o(order[0],"order") o(name[0],"name") tableprint(_Table[0],"%4.2f") } Where do `dep,indep,order,name` come from? The answer is, from with the `_Table` dashed variable: + When AUK expands this function into GAWK, it replaces `_Table` with a set of named fields. + That expansion is shown below. Note that, in the following, `_Table` has been replaced by a set of variables that includes `dep,indep,order,name`. # readered.auk: demos of readcsv from reader.auk @include "reader.awk" function _readered( data,name,order,nump,wordp, indep,dep,less,more,klass, term,num,n,count,mode,most, hi,lo,mu,m2,sd) { readcsv("data/weather1.csv",0, data,name,order,nump,wordp, indep,dep,less,more,klass, term,num,n,count,mode,most, hi,lo,mu,m2,sd) o(dep[0],"dep") o(indep[0],"indep") o(order[0],"order") o(name[0],"name") tableprint(data[0],name[0],order[0],nump[0],wordp[0], indep[0],dep[0],less[0],more[0],klass[0], term[0],num[0],n[0],count[0],mode[0],most[0], hi[0],lo[0],mu[0],m2[0],sd[0], "%4.2f") } To find the expansion for the dashed variables, AUK looks for any _.h_ header files in the same directory as the _.auk_ files. For example, in the same directory as _reader.auk_, AUK might find: _Table data # data[row][col] holds 1 cell of datum name # name of i-th colum # # meta knowledge # order # order of the columns nump # is i-th column numeric wordp # is i-th column non-numeric? indep # list of indep columns dep # list of dep columns less # numeric goal to be minimized more # numeric goal to be maximized klass # non-numeric goal term # non-numeric non-goal num # numeric non-goal # # about all cols n # count of things about this col # # about wordp columns: # count # count of each word mode # most common word most # count of most common word # # about nump columns: # hi # upper bound lo # lower bound mu # mean m2 # sum of all nums sd # standard deviation After removing comments and whitespace, AUK tells itself that any string of the form `_Table[X]?` can be replaced with the following: dataX.nameX,orderX,numpX, wordpX, indepX,depX,lessX,moreX,klassX, termX,numX,nX,countX,modeX, mostX,hiX,loX,muX,m2X,sedX The dashed variables let us reference complex lists of variables in a succinct manner. For example, in the above code for `_readered`, it we want to talk about (say) the third table, then `_Table[3]` gives us easy access to: data[3].name[3],order[3],nump[3], wordp[3], indep[3],dep[3],less[3],more[3],klass[3], term[3],num[3],n[3],count[3],mode[3], most[3],hi[3],lo[3],mu[3],m2[3],sed[3] ### In Summary That's it: the main difference between AUK and GAWK are the dashes. To understand AUK code, go chase: + The dashed variables in the _*.h_ files, + The dashed functions in the source code _file.auk_ and the helper functions _filed.auk_. ## The Rest of AUK To understand AUK code, there are several lower level AUK/GAWK details that need explaining. The rest of this page discusses those details. In the following, we will say _in AUK_ when more precisely we should say _in the source code that is pre-processed by AUK and coverted into GAWK_... But that gets a little clumsy so we'll say "AUK" all the time. ## Comments Comments on a single line are any text that follows a `#` symbol. Multi-line comments start and end on lines that start with three double quotes `"""` (AUK extension). ## Control Structures To define combinations of `tests`, AUK uses a C-like syntax for ands `&&`, ors `||`, and nots `!`. These combinations can be used in the following control structures. In the following, if `test` returns the empty string or zero, then `test` is false. Otherwise it is true. + `if` (test) `else` Y + `while` (test) body + `do` body `while` (test) + `for`(start ; until ; increment) body For `while`, `do` and `for`, the following two keywords that can be used within `body`: + `continue` jumps to the next iteration of the inner-most loop; + `break` terminates this loop. AUK also supports the tenary operator `x ? y : z` which is a way of returning a value from an if-then-else statement. The ternary operator is used in, for example, the AUK absolute value function. This following code negates the passed value if it is negative. Else it returns the value unmodified: """ function abs(x) { return x > 0 ? x : -1*x } """ Note that if a `return x` command is called anywhere in a function, then that function exits with a return value of `x`. Functions without `return` values output empty strings empty strings. ## Assignment, Equality, and Match For assignment, use one equals; e.g. `x=1`. For testing equality, use two equals; e.g. `if (x==1) ...`. For partial match, use the regular expression operator `~`. For example, the following line tries to read the _header_ from a CSV file: + In this file, a line can continue to the next line if it ends with a `,` character. + In which case, we have to recurse to get the remainder of the line. The fifth line of this function uses `~` to check for a comma, followed by zero ot more white space characters `[ \t]*`, followed by end of line `$`.` """ function line(f, str) { if ((getline str < f) > 0) { # get "str" gsub(/[ \t\r]*/,"",str) # kill whilespace gsub(/#.*$/,"",str) # kill comments if ( str ~ /,[ \t]*$/ ) # does str end with `,`... return str line(f) # get rest,recursively else return str # output the string } else # if getline finds nothing, output error code return -1 } """ To illustrate this example, here is the top of the file _data/weather1.csv_. Ignoring the comments, observe how the first line ends with a `,`. """ outlook, # forecast ?+$temperature, # degrees Fahrenheit, -$humidity, # percent of dew point windy, # boolean =play # goal ######################################## sunny ,85 ,90 ,FALSE ,no sunny ,80 ,90 ,TRUE ,no overcast ,83 ,86 ,FALSE ,yes ... """ For this file, the `line` function would return the following: outlook,?+$temperature,_$humidity,windy,=play That is, `line` studies the header and returns everything up to the line that _does not_ end with a comma. ## Variables ### Types of Variables In AUK, variables do not need to be declared- they appear as your use them. There are only three types of variables: strings, numbers, and lists. To ensure a number is a number, add zero to it. """ x=x+0 """ To ensure a string is a string, add an empty string to it. """ x= x "" """ To ensure an list is an list, create it with some function that generates list. For example, the `split` functions generates a list by splitting some string on some separator. We can (ab)use it to make lists as follows: """ split("",lst,"") """ This is such a common idiom that the AUK library includes the function `new`, defined as follows: """ function new(lst) { split("",lst,"") } """ ### Global Variables AUK uses a set of special global variables: + `NF` is the number of fields read on the latest line of input. + `FS` is the field seperator used to seperate field inputs. + Defaults to `FS=" "`; + For comma-sperated files, use `FS=","`; + After reading a line, each field is `$1`, `$2`, ... `$NF`; + Also, the entire line is `$0`. + `OFS` is the field separator between output fields. + Defaults to `OFS=" "`. + `RS` is the record seperator for input fields. + Defaults to `RS="\n"`; + The magic setting `RS=""; FS="\n"` lets AUK read paragraphs at a time with one field per line in the paragraph. + `CONVFMT` controls the printing of floats. + Defaults to `%.6g`, which prints a value with at least six significant digits + `PROCINFO["sorted_in"]` controls the order of traversal of keys in a list: + Note that usually this variable is an empty string so the traversal order is undefined. + If you want more control than that: + `@ind_num_asc` sorts up numerically; + `@ind_str_asc` sorts up alphanumerically; + If you switch `asc` with `desc`, the above are reversed; + For other sort orders, GAWK offers [a set of built-ins](http://goo.gl/QQPtV4). + If they do not suffice, you can `@f` to control sort using your own function `f`. ### Local Variables To ensure your variables aren't global, use them within a function and add more variables to the call. For example if a function is passed two variables, define it with two _as well as_ the local variables. For example: + In the above `_push` function, there are no local variables. + In the above `push` function, the `i` variable is local. Note that its good practice to add white space between passed and local variables. One interesting idiom for setting defaults for local variables is as follows. The following function checks if some arguments have been passed in. If not, it sets defaults: function hasDefaults(mustHave1, mustHave2, optional1, optional2, local1, local2) { optional1 = optional1 ? optional1 : 3.142 optional2 = optional2 ? optional2 : "tuesday" ... } In the above, if nothing is passed for `optional1` and `optional2`, then they are set to `3.142` and `"tuesday"`, respectively. This example uses the _ternary operator_ that is discussed below (see _Control Structures_). ### Lists are "Associative Arrays" Lists are accessed via strings. Such key strings are hashed so all access is in constatnt time. + e.g. `lst["emp"]` + And `lst[1]` is treated by AUK as `lst["1"]`. To test if a key exists in a list: + `if (key in lst) { ... } ` To create something like a hash table of booleans: + The idiom `flags["thing"]` creates a key `"thing"` with a value of empty string. + After that idiom, the test `if( "thing" in flags) {...}` executes the `if` block if `flags` has a key `things`. You can remove an individual element of an list using the delete statement `delete lst[index]`. It is not an error to delete an element which does not exist. AUK has a special kind of for statement for scanning an list: """ for (var in list) body """ This loop executes body once for each different key that your program has previously used as an index in list, with the variable var set to that index. Note that, in classic GAWK, the order of traversal is not defined (newer versions of GAWk can guide that traversal using the magic list variable `PROCINFO["sorted_in"]`; for an example of using that magic variable, see below). For an example of using lists, we can count the frequency of words in a document: """ {for(i=1;i <=NF;i++) freq[$i]++ } """ The list will hold an integer value for each word that occurred in the file. This list can be printed using the `for` construct: """ {for(i=1;i <=NF;i++) freq[$i]++ } END { for(word in freq) print word, freq[word] } """" In GAWK4+, lists can be nested with the usual syntax. E.g. `data[row][col]=23` stores a number at some row and column position in a two-dimensional data matrix. Nested lists are very useful for ordering named properties. For example: + The following function _project_ runs over the `d` items in `data` then projects them into a two-dimensional `x,y` co-ordinate system. + Technical aside: this is done by using the cosine rule to project the points onto a line running from two points `west` to `east`, then using Pythagorean to find a point on an orthogonal dimension. This new information is stored in the nested `at` list. function project(east,west,_Table, a,b,c,d,m,at) { c = dist(east,west,_Table) for(d in data) { a = dist(d,east,_Table) b = dist(d,west,_Table) m++ at[m]["d"]= d at[m]["x"]= x= (a^2 + c^2 - b^2) / (2*c) at[m]["y"]= (a^2 - x^2)^0.5 } print_in_x_order(at) } Once the list is generated, then we sort and print it, ordered on `at`'s `"x"` field (using the `xsort` function). function print_in_x_order(at, i,xs,m,key) { m = asort(at,xs,"xsort") for(i=1;i<=m;i++) for(key in xs[i]) print ":" key " " at[i][key] } function xsort(r1,x1,r2,x2) { return x1["x"] - x2["x"] } ### Variables Passed to Functions AUK passed numbers and strings by value- which is to say it copies the value and passes in the copy. That means that the original value, outside the function, is safe from modification. For example, the following function returns a string of `n` number of strings `c`. Inside the function, `n` is decremented (see `n--`) but outside the function, that `n` is not modified. """ function nchars(n,c, out) { while(n> 0) { n-- out = out c } return out } """ (As an aside, note how AUK concatenates strings on line four of the function `out = out c` (this adds the character `c` to the growing `out` string). Here, the concatenation "function" is the blank space between two variables. That is, to concatenate two strings, place them next to each other with a blank in-between them.) Since copying values for a function call can be (a little) slow, AUK passes lists to functions by reference. That is, modifying a list in a function will change its contents outside of that function. Note that AUK functions can only return single values (strings or numbers). To return mutliple values, pass in a list and write then return values to different keys in that list. ## Commonly-used Functions ### print a,b,c This is basic AUK function that prints `a` then `b` then `c` separated by the OFS. This is followed by a newline character. Numbers are printed using the `CONVFMT` variable. ### printf, sprintf These are the more advanced printing functions that offer more [control for displaying variables](http://goo.gl/OGpDv6). `Sprintf` is just like `printf` except that it returns the string (rather than to standard output): sprintf("%10.6f",22/7) => " 3.142857" ### split(string,lst,[sep]) Generates a new `lst` comprising the `string` split on `sep`. If `sep` is not supplied, it defaults to the `FS` field separator. ### sub(this,that,[string]), gsub(this,that,[string]) Replaces `this` for `that` in `string`. The `gsub` function replaces all instances while `sub` just replaces the first instance. If `string` is not supplied it defaults to `$0`. ### asort(lst1,[lst2],[how]) After the call to `asort()`, `lst2` contains the `lst1` data , indexed from 1 to some number n (being the total number of elements in `lst1`), sorted: """ lst1[1] <= lst1[2] <= lst1[3], and so on. """ This function returns the number of items in `lst1`. If `lst2` is not supplied then `lst2` replaces `lst1`. If `how` is not supplied, data is sorted in ascending order. ## Commonly Used-Idioms In the following code, `n` starts as an empty string. The first `n++` is a suffix addition (the old value is returned before we add one to it). function something( n) { while (someTest) { if (n++) handle_first_iteration(); else handle_other_iterations() } Hence, the _first_ time the loop hits `n++`, it returns zero and the code calls `handle_first_iteration()`. After that, for all other iterations, it calls `handle_other_iterations()`. ## Frequently Asked Questions 1. Q: Why do AUK files usually begin by loading GAWK files; e.g. `@include "somefile.awk"`? + A: Because AUK is a pre-processor that translates `.auk` files to `.awk`, then runs them using GAWK. The command `@include` is the GAWK idiom for loading those files at rungime. ## Appendix: Why AUK? The above is the `what` of AUK. If you are interested in `why` I built it, then read on. GAWK is a great language. It has been my secret ninja power that let me maintain a competitive academic career, while having fun learning novel ways to build data miners, while supervising numerous research students, all the while [publishing like a demon](http://goo.gl/Pq0knm). You really should try GAWK- see see _Classic Shell Scripting_ (chapter 9) and _Effective Awk Programming_, both written by Arnold Robbins. I love GAWK so much that for several years Arnold let me maintain the [awk.info](http://awk.info) web site (which, by the way, is now looking for another maintainer; any volunteers?). One disappointment in that work was that I could not find shared and widely-used GAWK libraries. This is a worry since it means the GAWKers are more soloists than members of a large cooperative. Looking deeper, I found that: + Standard GAWK uses pattern-directed programming where input is based along a set of `pattern {action}` pairs. + That style of programming needs global variables to pass information between the patterns. This discourages library development (since one pattern can accidently overwrite a global needed by another). + More fundamentally, this approach assumes every code file is a top-level driver. If the goal is to build libraries of code, that assumption is rarely true. + Also, the GAWK community has no standard formats for: + Headers explaining the variables used by different files; + Documentation; + Demonstration code, test suites, or regression suites AUK was written to address the above. Just as an aside, the AUK to GAWK pre-processor is written in GAWK (so the best way to extend GAWK is to use GAWK). With AUK I can: + Implement data hiding and structures: + AUK lets me define _dashed variables_ such as `_3D` to be the string `x,y,z`. + Once defined, then everywhere I use (say) `_3D[i]`, AUK expands that to `x[i],y[i],z[i]`. + This means I can succinctly around complex structures to sub-routines. + It also means that if I ever extend the definition of a 3d point to include (say) its color and weight, then everywhere that uses that color and weight will auto-expand (say) `_3D` to `x,y,z,color,weight`. + Add documentation tools: + AUK supports multi-line comments in the [Markdown syntax](http://daringfireball.net/projects/markdown/syntax). + AUK also support auto-generation of html pages from those comments (e.g. this web page is automatically generated from `idioms.auk`). + Add regression, testing, demonstration tools. + See _dashed functions_, described above. + Better package code + The dashed variables and the dashed functions and the documentation tools lets me write shorter AUK code that is easy to share and document and explain. On top of the above, I wrote some emacs scripts and shell scripts that let me edit and run AUK code in a specialized environment:
Note that all this is committed to the [AUK github repository](https://github.com/timm/auk) so you can easily use and improve AUK. At the time of this writing, there is no released tag but that is coming soon.