Generic DSL Grammar and Parser

About a year ago I blogged about an idea of an extensible language; http://rogeralsing.com/2009/03/18/an-intentional-extensible-language/ 
Since then, I have been experimenting with this concept quite a bit.

I have now created a more complete grammar which contains hard-coded support for some constructs, e.g. assignments, lambdas, binary and unary operations.
What I ended up with is not a language, but rather a parser for a structured document.
You can think about it like XML, XML gives you structured documents which you can traverse and transform via code.

This is pretty much exactly the same, but the syntax looks like a C derivate rather than a mark up language.

So what’s the point of it?

The idea is to make it possible for developers to create their own textual DSL’s without having to care about any of the gory details of parsing and grammar construction.

A lot of people are currently using XML to define different types of rules or entire scripts, e.g. build scripts or custom business rules.
This works great since it is easy to deal with XML documents via code.

The downside of XML based DSLs is that they are extremely hard to read and edit!
(I know that a lot of people disagree with me here)

Let’s say for the sake of the argument that we want to define some custom business pricing rules.
Using XML that could look something like:

<rules>
   <rule product="50050" >
      <condition>
          <greaterthan>
              <leftoperand>
                    <property value="quantity" />
              </leftoperand>
              <rightoperand>
  <integerliteral value="200" />
              </rightoperand>
          </greaterthan>
      </condition>
      <price>
            <add>
  ... etc etc ...
            </add>
      </price>
   </rule>
</rules>

Such XML rule definition can easily bloat from something fairly simple to something completely horrible when more requirements are thrown at it.

Using the generic DSL grammar that I have created, the above rules would look something like this:

product 50050
{
    when (quantity > 200) then (20 - quantity / bonus);

    when ... //more pricing rules for same product
};

Or you could use it to define Google protobuffer messages:

message OrderPlaced
{
  1 Guid MessageId;
  2 Guid CustomerId;
  3 int OrderId;
  4 List(OrderDetail) Details;
};

Or what about defining some CQRS entity definitions?

public entity Customer
{
    private string Name;

   command Rename(string newName)
    {
       require (newName != null);
       Renamed (newName);
    };

    event Renamed(string newName)
    {
        this.Name = newName;
    };
};

How it works, in short:
The grammar is based on a few simple constructs;
Every line is an “statement” and statements are an expression terminated with “;”.

An expression can be things like an assignment,a lambda, a binary operation or a “Chain”.
A “Chain” consists of zero or more constructs that occurs in a whitespace separated sequence.
Members of a chain can be primitive constructs such as primitives, identifiers, bodies {a;b;c;} and tuples (a,b,c).
Since there is only a handful of constructs, it is very easy to traverse and analyze the parsed documents.

If we take the above “customer entity” definition, we would get a structure like:
A statement (chain) containing: public, entity, customer, body.
Where the first 3 elements are identifiers and the last item is a body.
A body is a structure containing zero or more statements.
So the body of the customer entity would consist of 3 statements, the member variable “name”, the command “Rename” and the event “Renamed”.
Each of those statements are chains containing their own sub items.

I am currently cleaning up the parser and I will also add some convenience methods to make it easier to match structures in the resulting document.
So code and more examples will hopefully be up in a few days.

//Roger

11 Comments

  1. Torkel says:

    interesting, show me more :)

  2. rogpeppe says:

    So the body of the customer entity would consist of 3 statements, the member variable “name” […]

    does that mean that “private”, “command”, etc are keywords? or is it just that in this particular DSL, the identifier “private” is treated as a keyword?

  3. Roger Alsing says:

    @rogpeppe

    There are no keywords in the grammar, they would be treated like “identifier” nodes by the document.
    They would be keywords in this specific DSL, that is, the developer would have to apply conditions and evaluate if the contents of a “chain” makes sense.

    Much like specific tags and attributes are present in a specific XML schema while XML itself does not treat them as keywords.
    My intention is to make it possible to define patterns that can be matched in a “chain”, but I’m not sure how I want the API for that to look yet.

    1. rogpeppe says:

      ah, that’s what i’d hoped.

      what about operator precedence & associativity? if you’d said, for example this.Name = x.Name, would that be parsed as ((this “.” name) “=” (x “.” Name))
      or (((this “.” name) “=” x) “.” Name) ?

      including binary operators seems potentially problematic.

      you say there’s hard coded support for assignments and lambdas. if there are assignments, presumably there are identifier declarations and scope rules; and if there are lambdas, then presumably you can invoke the lambdas too. all of that seems to me to be moving towards a fully fledged language rather than an XML equivalent…

  4. Roger Alsing says:

    @rogpeppe

    Precedence is based on the standard operators.

    Your example will result in:
    ((this “.” name) “=” (x “.” Name))

    So there is special support for binary and unary operations, they have their own node type in the AST.

    >>. if there are assignments, presumably there are identifier declarations and scope rules

    No, no such thing exist in the grammar or parser, I only mean that there is support for infix operations and precedence in the syntax.

    And the hard-coded support for lambdas does not imply any infrastructure for execution, simply that they have their own special syntax in the grammar:

    (a,b,c) -> expression
    or:
    (a,b,c) -> { bar(); }
    or arg-less:
    -> expression

    Although, I might remove that support since the same can be expressed as:

    (a,b,c) expression

    1. rogpeppe says:

      Precedence is based on the standard operators.

      the standard C operators?

      Although, I might remove that support since the same can be expressed as:

      (a,b,c) expression

      it looks to me like -> is just another binary operator. but what about
      (a) -> (b) -> expr ? would it associate right-to-left as in haskell?

  5. FallenGameR says:

    Have you read http://www.manning.com/rahien/ ?
    There is а language that allows to work with it’s AST in run time. No need to write a parser – it’s already exists.

  6. Roger Alsing says:

    @rogpeppe

    >>The standard C operators?

    Yes

    >>but what about
    (a) -> (b) -> expr ? would it associate right-to-left as in haskell?

    Currently, the -> is not a binop, the grammar requires a (..) then -> and an expression.

  7. Roger Alsing says:

    @FallenGameR

    I’m well aware of Boo and have used it many times.
    The thing is that in order to bend Boo to your will you have to have alot of knowledge of the inner workings of Boo and the compiler pipeline.

    Boo brings a lot of nifty stuff to the table, but it also comes with alot of baggage.
    You have to cut off the parts that you don’t need.

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s