2009-08-28

Playing with perl6 grammars

Since the grammar feature in perl6 is documented in rather sophistication-requiring ways (including Synopsis 5, Apocalypse 5 and Exegesis 5 (the latter being the most accessible one)), I decided to look for short intros to get quick cheap results.

The simplest description I found is the Wikipedia article about Perl6 rules. My point is (ab)using grammar as a tool for describing a grammar (in the Chomsky sense), not just as a named pair of braces to stick regexen into.

I won't repeat the Wikipedia stuff here, just mention a few facts:
  • The grammar entered has to be a PEG, i.e. it has some features that are not expressible with a CFG but requires grammar formulations (or manual transformations) similar to those required for a LL(1) parser (i.e. eliminating left-recursion using left-factorization, which makes the grammar much less human-readable)

  • The keywords "token" and "rule" are not what one would expect simple-mindedly. They are special cases of "regex" (which is what one should use unless one understands the details) with suppressed backtracking (and in addition, "rule" is whitespace-sensitive, which is unexpected in a grammar description language).

  • Actions can be included by having a class living in parallel to the grammar whose methods are called like the nonterminals of the grammar and by adding the string "{*}" after the production part (separately for each alternative production).

  • Alternatives can be tagged by adding something that looks like a comment, but consists of the string "#= " followed by the tag name.

  • Untagged productions call their action with one argument, tagged ones with two arguments (the second being the tag). The first argument is the matched text. Multi-methods in the class can be used to tell both cases apart.


More stuff will come once I've understood and tested it.

Here is an example which does not work (only the grammar, the rest of the program is identical to the working one). The language consists of all non-empty expressions containing balanced parentheses and the letter 'a'.

#!/usr/local/bin/perl6
# broken (left-recursive grammar)
grammar Foo {
regex TOP { ^<A>$ {*} #= top
};
regex A {
'' {*} #= epsilon
|<B> {*} #= simple
|'('<A>')' {*} #= paren
|<A><A> {*} #= double
};
regex B { 'a' };
};


Unfortunately, with the bad grammar the program does not signal an error at compile time but goes into an endless recursion, maybe eventually falling victim to exceeding recursion depth (or running forever).

The following version has been left-factored, thus becoming unreadable for humans and readable for perl6.


grammar Foo {
regex TOP { ^<A>$ {*} #= top
};
regex A {
<B><C> {*} #= bc
|'('<A>')'<C> {*} #= paren
};
regex C {
'' {*} #= empty
| <A><C> {*} #= ac
};
regex B { 'a' };
};

class Fooa {
method TOP($m,$t) { say "TOP m=$m t=$t"; };
method A($m,$t) { say "A m=$m t=$t"; };
};

my Str @s=( 'aa',
'a',
'(a(a)a)a',
'a(a(a)a)',
'a(a(a)a)a(a(a)a)'
);

for @s -> $x { say "look at $x";
say Foo.parse($x,:action(Fooa.new));
say '=================';
};

2009-08-21

Parse::Marpa - trying to read the manuals

The usage of Parse::Marpa is very different from all other parsers or parser generators I've seen so far. Nevertheless, it is a promising tool, since it is an Earley parser and thus able to parse all context free languages. It should be noted that perl 5.10 is required (I installed a fresh 5.11 just for testing the module).
The grammar is to be given in a language called "MDL" (Marpa Demonstration Language), which supposedly is not the only possibility (using low-level interface, the user should be able to write his own grammar specification language). I give only the most elementary stuff needed to write a grammar, omitting amost all the cool stuff - the Postscript version of the most relevant chapters (created using
perldoc -u "$1" | pod2man | groff -man -Tps
) is about 70 pages.

A minimal grammar specification looks like this:
start symbol is englishsentence.
semantics are perl5. version is 1.004000.
default lex prefix is /\s+|\A/.
concatenate lines is q{ (scalar @_) ? (join "-", (grep { $_ } @_)) : undef; }.
default action is concatenate lines.

englishsentence: subject, verb, conjunction, object.

englishsentence: subject, verb, object.

specializernoun: noun. q{ "specializernoun($_[0])" }.

ordinarynoun: noun. q{ "ordinarynoun($_[0])" }.

subject: specializernoun, ordinarynoun. q{ "spsub($_[0]+$_[1])" }.

subject: noun. q{ "subject($_[0])" }.

noun: nounlex.

nounlex matches /fruit|banana|time|arrow|flies/.

verb: verblex. q{ "verb($_[0])" }.

verblex matches /like|flies/.

object: preposition, noun. q{ "ob(prep($_[0])+n($_[1]))" }.

conjunction: /like/. q{ "conjunction($_[0])" }.

preposition: prepositionlex.

prepositionlex matches /a\b|an/.
Main principles:

A grammar specification consists of paragraphs (separated by empty lines).
A paragraph consists of sentences (terminated with a dot).

At the beginning of the grammar, a few sentences about start␣symbol, semantics and version are mandatory. My entry for default␣lex␣prefix is the simplest thing I was able to create (whitespace before each word unless we are at the beginning of the text). The concatenate␣lines/default␣action-stuff is copied from the manual (and not yet really understood by me). All those sentences contain the word "is" or "are", which are considered synonyms.

Production sentences have the form
nonterminalsymbol colon comma-separated-list-of-symbols dot

They can be followed by an action of the form

q{ dosomethingwith($_[0],$_[1]); }.

The manual says that one should not use "return" inside an action. The elements of @_ inside an action correspond to the items in comma-separated-list-of-symbols.
In order to specify alternatives, just give multiple rules with different right-hand-sides.

Literal elements inside the rhs of a rule can specified only as regexps (i.e. literal strings do not work despite my interpretation of perldoc Parse::Marpa::Doc::MDL).

Finally, terminal symbols can be defined using so-called terminal sentences the form

terminalsymbolname matches /regexpbody/.

Terminal paragraphs cannot have actions, their value is always the matched input (which is not really a problem, just write
foo: foolex. q{ here_goes_my_action($_[0]); }.

foolex matches /foo/.
).

Finally, the grammar is (in the simplest case) used as follows:
my @values1=Parse::Marpa::mdl(\$grammar_description,\$data1);
whereafter @values contains a list of the results for each interpretation of the (possibly ambiguous) grammar.

Parse::Marpa - first results

I just found the parser generator mentioned above on CPAN and played with it (some usage rules I've found so far will be mentioned in the next post). The following source is intended to decompose the well-known ambiguous english sentence "fruit flies like a banana", the output is appended below. Marpa does this, but finds 8 results instead of 2 (each of the interpretations is found 4 times). Since this is my first trial, I believe the fault is in my program.

#!/opt/bin/perl

use 5.011;
use strict;
use warnings;
use English qw(-no_match_vars);
use Parse::Marpa;

my $grammar_description='
start symbol is englishsentence.
semantics are perl5. version is 1.004000.
default lex prefix is /\s+|\A/.
concatenate lines is q{ (scalar @_) ? (join "-", (grep { $_ } @_)) : undef; }.
default action is concatenate lines.

englishsentence: subject, verb, conjunction, object.

englishsentence: subject, verb, object.

specializernoun: noun. q{ "specializernoun($_[0])" }.

ordinarynoun: noun. q{ "ordinarynoun($_[0])" }.

subject: specializernoun, ordinarynoun. q{ "spsub($_[0]+$_[1])" }.

subject: noun. q{ "subject($_[0])" }.

noun: nounlex.

nounlex matches /fruit|banana|time|arrow|flies/.

verb: verblex. q{ "verb($_[0])" }.

verblex matches /like|flies/.

object: preposition, noun. q{ "ob(prep($_[0])+n($_[1]))" }.

conjunction: /like/. q{ "conjunction($_[0])" }.

preposition: prepositionlex.

prepositionlex matches /a\b|an/.

';

my $data1='time like an arrow.';
my @values1=Parse::Marpa::mdl(\$grammar_description,\$data1);
for my $i(@values1) { say $$i; }

my $data2='time flies like an arrow.';
my @values2=Parse::Marpa::mdl(\$grammar_description,\$data2);
for my $i(@values2) { say $$i; }

Output:

subject(time)-verb(like)-ob(prep(an)+n(arrow))
subject(time)-verb(flies)-conjunction(like)-ob(prep(an)+n(arrow))
subject(time)-verb(flies)-conjunction(like)-ob(prep(an)+n(arrow))
subject(time)-verb(flies)-conjunction(like)-ob(prep(an)+n(arrow))
subject(time)-verb(flies)-conjunction(like)-ob(prep(an)+n(arrow))
spsub(specializernoun(time)+ordinarynoun(flies))-verb(like)-ob(prep(an)+n(arrow))
spsub(specializernoun(time)+ordinarynoun(flies))-verb(like)-ob(prep(an)+n(arrow))
spsub(specializernoun(time)+ordinarynoun(flies))-verb(like)-ob(prep(an)+n(arrow))
spsub(specializernoun(time)+ordinarynoun(flies))-verb(like)-ob(prep(an)+n(arrow))
PS: If I replace the conjunction-definition (conjunction: qr{like}. q{ "conjunction($_[0])" }.) with a terminal (conjunction matches /like/.), the quadruplication ceases and I get exactly one result for each of the two interpretations (but I lose the action on the conjunction).
Using

conjunction: conjunctionlex. q{ "conjunction($_[0])" }.

conjunctionlex matches /like/.

I get the unwanted quadruplication back.

Empfohlene Artikel aus Google-Reader

Es gibt eine kaum bekannte Erweiterung von Google-Reader, die es erlaubt, fremde Artikel einfach zu verlinken, anstatt selber welche schreiben zu müssen. Um einen interessanten Artikel zu empfehlen, klickt man unter dem Kurztext auf "Empfehlen", dann landen sie auf der Seite http://www.google.de/reader/shared/09410950957807438947 (bei mir, jemand anderes bekommt eine andere Nummer). Diesen Link erfährt man im Google-Reader in Einstellungen→Ordner␣und␣Tags als URL des mit "öffentliche␣Seite␣anzeigen" beschrifteten Links.
Jetzt geht man zum Blog, dort zu "Gadget einfügen", wählt "HTML/Javascript", denkt sich eine Überschrift aus (z.B. "Recommended Items from Google Reader"), einen Linktext (ich habe die URL selbst genommen) und markiert ihn als Link mit der richtigen URL.

2009-08-17

Manche Freunde sind so peinlich, dass es selbst die Union mitkriegt

Man betrachte das von den Jenaer Piraten veröffentlichte Video und lese die handgemalten Pappschilder der Fans (z.B. der Mordaufruf bei 0:27). Die sind den Offiziellen so peinlich, dass sie albern auf die Bänke hüpfen und ihre lächerlichen Bildchen davorhalten. Vermutlich wollen sie (beide) andeuten, dass die Verräterpartei das kleinere Übel darstellt.

2009-08-13

Suchmaschine kaputt

Wenn man auf der Webseite der Verräterpartei sucht, finden sie sich selbst nicht:




Hauptsache, der Nameserver (dns2.spd.de) geht ;-)

2009-08-07

Kuschelhaie

Endlich etwas für den Streichelzoo, ohne dass man danach immer die Finger voll Fusseln hat:
Sharks can be cuddled like dolphins, say scientists (Telegraph, UK)

Deutsche Beschreibung hier. (Sea life centers)