cs310: Programming Languages: 2.3 Grammars Hacking: Example

Generic approach to recongizing parts of an input stream, then performing specific actions for certain sections.

Gawk

Gawk & Gender Recognition

Scott Pakin. Regular Expressions and Gender Guessing. In Computer Language Magazine, 8(12):pp. 59-68, December 1991.

                                          { sex = "m" }  # Assume male.
/^.*[aeiy]$/                              { sex = "f" }  # Female names endng in a/e/i/y.
/^All?[iy]((ss?)|z)on$/                   { sex = "f" }  # Allison (and variations)
/^.*een$/                                 { sex = "f" }  # Cathleen, Eileen, Maureen,...
/^[^S].*r[rv]e?y?$/                       { sex = "m" }  # Barry, Larry, Perry,...
/^[^G].*v[ei]$/                           { sex = "m" }  # Clive, Dave, Steve,...
/^[^BD].*(b[iy]|y|via)nn?$/               { sex = "f" }  # Carolyn,Gwendolyn,Vivian,...
/^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$/  { sex = "m" }  # Dewey, Stanley, Wesley,...
/^[^GKSW].*(th|lv)(e[rt])?$/              { sex = "f" }  # Heather, Ruth, Velvet,...
/^[CGJWZ][^o][^dnt]*y$/                   { sex = "m" }  # Gregory, Jeremy, Zachary,...
/^.*[Rlr][abo]y$/                         { sex = "m" }  # Leroy, Murray, Roy,...
/^[AEHJL].*il.*$/                         { sex = "f" }  # Abigail, Jill, Lillian,...
/^.*[Jj](o|o?[ae]a?n.*)$/                 { sex = "f" }  # Janet, Jennifer, Joan,...
/^.*[GRguw][ae]y?ne$/                     { sex = "m" }  # Duane, Eugene, Rene,...
/^[FLM].*ur(.*[^eotuy])?$/                { sex = "f" }  # Fleur, Lauren, Muriel,...
/^[CLMQTV].*[^dl][in]c.*[ey]$/            { sex = "m" }  # Lance, Quincy, Vince,...
/^M[aei]r[^tv].*([^cklnos]|([^o]n))$/     { sex = "f" }  # Margaret, Marylou, Miriam,...
/^.*[ay][dl]e$/                           { sex = "m" }  # Clyde, Kyle, Pascale,...
/^[^o]*ke$/                               { sex = "m" }  # Blake, Luke, Mike,...
/^[CKS]h?(ar[^lst]|ry).+$/                { sex = "f" }  # Carol, Karen, Sharon,...
/^[PR]e?a([^dfju]|qu)*[lm]$/              { sex = "f" }  # Pam, Pearl, Rachel,...
/^.*[Aa]nn.*$/                            { sex = "f" }  # Annacarol, Leann, Ruthann,...
/^.*[^cio]ag?h$/                          { sex = "f" }  # Deborah, Leah, Sarah,...
/^[^EK].*[grsz]h?an(ces)?$/               { sex = "f" }  # Frances, Megan, Susan,...
/^[^P]*([Hh]e|[Ee][lt])[^s]*[ey].*[^t]$/  { sex = "f" }  # Ethel, Helen, Gretchen,...
/^[^EL].*o(rg?|sh?)?(e|ua)$/              { sex = "m" }  # George, Joshua, Theodore,..
/^[DP][eo]?[lr].*se$/                     { sex = "f" }  # Delores, Doris, Precious,...
/^[^JPSWZ].*[denor]n.*y$/                 { sex = "m" }  # Anthony, Henry, Rodney,...
/^K[^v]*i.*[mns]$/                        { sex = "f" }  # Karin, Kim, Kristin,...
/^Br[aou][cd].*[ey]$/                     { sex = "m" }  # Bradley, Brady, Bruce,...
/^[ACGK].*[deinx][^aor]s$/                { sex = "f" }  # Agnes, Alexis, Glynis,...
/^[ILW][aeg][^ir]*e$/                     { sex = "m" }  # Ignace, Lee, Wallace,...
/^[^AGW][iu][gl].*[drt]$/                 { sex = "f" }  # Juliet, Mildred, Millicent,...
/^[ABEIUY][euz]?[blr][aeiy]$/             { sex = "m" }  # Ari, Bela, Ira,...
/^[EGILP][^eu]*i[ds]$/                    { sex = "f" }  # Iris, Lois, Phyllis,...
/^[ART][^r]*[dhn]e?y$/                    { sex = "m" }  # Randy, Timothy, Tony,...
/^[BHL].*i.*[rtxz]$/                      { sex = "f" }  # Beatriz, Bridget, Harriet,...
/^.*oi?[mn]e$/                            { sex = "m" }  # Antoine, Jerome, Tyrone,...
/^D.*[mnw].*[iy]$/                        { sex = "m" }  # Danny, Demetri, Dondi,...
/^[^BG](e[rst]|ha)[^il]*e$/               { sex = "m" }  # Pete, Serge, Shane,...
/^[ADFGIM][^r]*([bg]e[lr]|il|wn)$/        { sex = "f" }  # Angel, Gail, Isabel,...
                                          { print sex }  # Output prediction

Gawk & Domain-Specific Language

The code is here. Note the use of guard action pairs, where the guards are regular expressions.

Fast and Slow Regular Expression Processing

Strange to say, many modern languages use a simplistic recursive descent test for regular expressions.

e.g. Lets build two finate state automoa for the patterhn abab|abbb and run it the string abbb.

A more efficient but more complicated way to simulate perfect guessing is to guess both options simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter.

Note that this second way was first implemented by Ken Thompson introduced the multiple-state simulation approach in his 1968 paper. The full source code (under 400 lines) and the benchmarking scripts are available online

Thompson introduced his regualr expression to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex.

Welcome to the 21st Century

Remember that OCaml language? That was strongly typed (at compile time, you know if a variable is a string, number, whatever) and supported type inferencing (so you can check if you are doing dumb things like adding a number to a string).

Well, OCaml supports recursive type definitions? So that OCaml’s types are a grammar.

The Ulist and Olist constructors take the first item followed by a (possible empty) list of items, to prevent empty lists — this way, there’s at least one element.

So eigenclass.org uses the above to define a cool generator for HTML pages using a set of short-cuts:

The above grammar is the core of a parser that generates HTML. And, interestingly, in this ultra-high level language, this parse runs very very fast.

2.3 Grammars Hacking: Example

Lex

Gawk

Gawk & Gender Recognition

Gawk & Domain-Specific Language

Fast and Slow Regular Expression Processing

Welcome to the 21st Century

Summary