Parsing with Oslo’s MGrammar (Mg)

7. November 2008 09:49 by CarlosLoria in General  //  Tags:   //   Comments (0)


As profiled in a previous post we are building a tool for reasoning about CSS and related HTML styling tools using some logic framework (logic grammars in human readable style are usually highly ambiguous, consequently an interesting case study). So, we need a textual language, a DSL, for the tool, I was asked to try MGrammar (simply Mg), which a DSL generator which is part of the Microsoft Oslo SDK, recently presented in MS PDP.

Actually I am unaware of the Oslo project details , so I was a little bit skeptical about the thing and because it would entail learning another new language (M). But I fight against any personal preconception and decided just to take a look and try to have fun with it. I do not pretend here to blog about the whole Oslo project, of course. Just to tell a simple story about using Mg in its current state, no more.

In general, I like very much the idea that data specification (models) and data storing, and querying and the like can be separated from procedural languages in a declarative and yet more expressive way; hopefully, my own way to express my model (beyond standard diagrams and graphic perspectives). Oslo and M approach such the general vision on model oriented software development, Mg the latter on expressivity, as I understand the source material which is not much yet.

What I have seen so far of M is indeed very interesting, I really liked it; it extends LINQ which I consider a very interesting proposal MS’s; it also reminded me OCL in some ways. In fact, there exists an open source OCL library called Oslo, which has nothing to do with MS; it is just funny to mention.

Mg
I focused on tasting Mg; we do not consider M at all.  We do not present a tutorial here, there is one nice for instance here. If you are interested in my example and sources please mail me and I will send you it back.

The SDK also contains a nice demo for a musical language (Song grammar) and its player in C#; and moreover M and Mg themselves are specified in Mg, the corresponding sources are available in the SDK. Hence, you learn by reading grammars, because the documentation is not yet as rich as I would like.

Mg source files resemble, at least in principle, other parser generators (like ANTLR, Coco, JavaCC, YACC/Lex and relatives). I do not know details about the parsing technique behind Mg. However, according to the mg.exe reporting it would be LR-parsing.

Mg offers really powerful features. Just as an instance, following are tokenizing rules:

token Digit = "0".."9";
token Digits = Digit+;
token Sign = ("+"|"-");
token Number = Sign?Digits ("." Digits)?;

Or constraint the repetition factor (from at least 1 maximum 10) as in:

token ANotTooLongNumber = Digit#1..10;

But you may have parameters in scanner rules, thus you may write (taken from nice tutorial here):

token QuotedText(Q) = Q (^('\n' | '\r') - Q)* Q;
token SingleQuotedText = QuotedText('"');
token DoubleQuotedText = QuotedText("'");

For handling different types of quotations depending on parameter Q! Notice the subpattern ^('\n' | '\r') – Q. it means any but newline or return excluding character Q.

Oslo SDK provides an editor called Intellipad, which can be used for both M and Mg development. You need to experiment a little bit before getting used with it, but, after short time, it appears to you as a very nice tool. Actually, a command-line compiler for Mg is included in the SDK, so you may edit your grammar using any text editor and compile using a shell. But, the Intellipad is quite powerful, it allows editing and debugging your grammar simultaneously, quickly; Intellipad shows you several panes for working with, in one you write your grammar, in other you may enter input to your grammar which is immediately checked against your rules. A third (tree view) pane shows you the AST being projected by your grammar on the input you are providing.

Finally, a forth pane shows you the error messages that includes those eventual parsing ambiguities according to the given input and grammar. Interesting, only then, you notice any potential conflict. I did not see any form of static analysis for warning about such cases provided in the editor. Is there any such?

Besides that, actually I enjoyed using it. I a not sure to appreciate the whole functionality but it was quite easy to create my grammar. Mg is modular, as M is. So you can organize and reuse your grammar parts. That is very nice.
To start defining a language, you write something like:

language LogicFormulae{
   //rules go here inside
}

For defining the language you have "token", "syntax" mainly and other statements. In this case, the name "LogicFormulae" will be used later on when we access the corresponding parser programmatically.

Case Study
Back to our case study, our simple language contains logic formulae like "p&(p->q) <-> p&q" (called well-formed-formulae, or wff) where as usual a lot of ambiguity (conflicts) might occur. The result is frequently that you have to rewrite your grammar into a usually uglier one in order to recover determinism and so fun is out. What I liked of Mg is a very readable style for handling such cases. For instance, I have a rule like

syntax ComposedWff =  
                                 precedence 1: ParenthesisWff
                               | precedence 1: PossibilityWff  
                               | precedence 1: NecessityWff   
                               | precedence 1: QuantifiedWff
                               | precedence 2: NotWff
                               | precedence 3: AndWff
                               | precedence 4: ImpliesWff
                               | precedence 4: EquivWff
                               | precedence 5: OrWff
                               ;
Using the keyword precedence I can "reorder" the rule to avoid ambiguity. Thus, "all x.p & q" parses as "(all x.p) & q" under this rule because of the precedence I chose.

Another example shows you how to deal with associativity and operator precedence

syntax AndWff         =  Wff  left(4)  "&" Wff;
syntax ImpliesWff    =  Wff right(3) "->"  Wff;

Mg projects syntax in its D-Graphs using the name of the rule and token images automatically, thus it always generate an AST without specifying it which is very useful. Thus, the "AndWff" rule will produce a tree with three children (including the token "&"), labeled with "AndWff". You can use your own constructors to be produced, instead:

syntax AndWff         =  x:Wff left(4)  "&"  y:Wff  => And[x, y];

Such constructors like "And" need not to be classes they will be labels in nodes.

Conflicts

Conflicts were in appearance solved by this way because the engine behind Intellipad did not complain anymore about my test cases. I do not know whether there is a way to verify the grammar using Intellipad, so I just assumed it was conflicts free. However, when I compiled it using the mg.exe (using the option -reportConflicts) I  got a list of warnings as by any LR parser generator. That was disappointing. Are we seeing different parsing engines?

Using C#
The generated parser and the graphs it produces can be accessed programmatically, something that was of my special interest in my case beyond working with Intellipad. Because I was not able to create a parser image directly from Intellipad, I did the following: I compiled my grammar using the mg.exe command producing a so-called image, a binary file (other targets are possible, I think) called "mgx". This mgx-image can be loaded in C# project using some libraries of the SDK. (It can also be executed using mgx.exe).  You would load the image like this:

parser  = MGrammarCompiler.LoadParserFromMgx(stream, languageName);

where "stream" would be a stream referring to the mgx file I produced with the mg compiler. And "languageName" is a string indicating the language I want to use for parsing ("LogicFormulae" in my case). And I parse any file as indicated by string variable "input" as follows

rootNode = parser.ParseObject(input, ErrorReporter.Standard);

The variable rootNode (of type object, by the way!) refers to root of the parsing graph. A special kind of object of class GraphBuilder gives you access to the nodes, node labels and node children if any. Hence, using such an object you visit the graph as usual. For instance using a pattern like:

void VisitAst(GraphBuilder builder, object node){

foreach (object childNode in builder.GetSequenceElements(node))
       if (childNode is string)
                VisitValue(childNode);
      else VisitAst(builder, childNode);
         
}
For some reason, the library uses object as node type, as you are noticing, probably.

Conclusions
I found Mg very powerful and working with Intellipad for prototyping was actually fun. Well it was almost always. I only have some doubts concerning use and performance. For instance, using Intellipad you are able to develop and test very fast. But it seems that there is no static analysis tool inside Intellipad for warning about eventual conflicts in the grammar.
The other unclear thing I saw is the loading time of the engine in C#. It takes really too long before you get the object parser given the mgx file. In fact, the mg compiler seems to work too slowly, in particular with respect to Intellipad. It seems as though the mg.exe and Intellipad were not connected as tool chain or something like that.  But surely is just me, because it is just my first contact with this SDK.