Monday, September 3, 2007

Loyc design issues

It's been fun designing Loyc, but boy, I've got a lot left to think about.

Right now I'm trying to figure out how to allow extensions to activate and deactivate statements based on arbitrary contextual criteria. One unanswered question is whether statements should have access to their parent node (ICodeNode) during parsing. The main problem with allowing it is that the parent nodes are, in general, not yet fully parsed when the child nodes are parsed, and it may be tricky to design convenient, reasonable, non-cumbersome semantics for the incomplete parent nodes. I'm leaning toward requiring only that the type Symbol of parent nodes be made available. Probably some other kind of context than the parent node ought to be available, such as symbol tables. In some languages, notably C++, symbol tables are considered necessary for correct parsing, although there are usually ways around such problems; for example I think FOG can parse C++ without them. Still, even if symbol tables aren't needed to parse, it often makes sense to build symbol tables during parsing. But in Loyc I also want to separate concerns as much as possible in order to maximize code re-use. By separating out the code for building symbol tables,
  1. it should be easier to add artificial (aka synthetic) nodes to symbol tables
  2. people can parse code without building symbol tables, which is nice if, for whatever reason, the symbol tables are not needed.
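To make the separation concrete, here is a minimal sketch of parsing and symbol-table building as two independent passes. None of these names are real Loyc APIs, and Python is standing in for the eventual C# code; it just illustrates the two benefits above.

```python
# Hypothetical sketch: symbol-table building as a separate pass over an
# already-parsed tree. All names here are invented for illustration.

class Node:
    def __init__(self, kind, name=None, children=None):
        self.kind = kind          # e.g. 'class', 'method', 'var'
        self.name = name
        self.children = children or []

def build_symbol_table(node, table=None, scope=""):
    """Walk a parsed tree and record every named declaration."""
    if table is None:
        table = {}
    if node.name is not None:
        qualified = scope + node.name
        table[qualified] = node
        scope = qualified + "."
    for child in node.children:
        build_symbol_table(child, table, scope)
    return table

# The parser (stubbed here) produces a tree with no symbol table at all.
tree = Node('class', 'Foo', [Node('method', 'Bar'), Node('var', 'x')])

symbols = build_symbol_table(tree)
# Benefit 1: synthetic (artificial) nodes can be registered the same way.
symbols['Foo.Synthesized'] = Node('method', 'Synthesized')
# Benefit 2: callers that don't need symbols simply skip this pass.
```

Because the table is built after the fact, nothing stops a client from parsing without it, or from injecting nodes that have no source location.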
But I digress. There are lots of unresolved issues, and I'd just like to summarize the ones I can think of...
  • There may be a lot of statements allowed from a lot of different extensions, perhaps hundreds, and the set of available statements may vary with every new block that opens. I'm planning to give statements full control over parsing their contents, including nested statements, but there will be a conventional way that statements can give control back to the language style. So the questions are
    • How to efficiently modify the set of available statements (I decree the split infinitive to be perfect English :P).
    • How to allow statements to specify when they are available. Arbitrary criteria should be possible but the most common case(s) should be easy for the user (i.e. extension writer) to use and should perform well. Or maybe the problem should be reconsidered as follows: how can block statements (that contain other statements) specify what categories of other statements they can contain?
    • How to provide the language style with enough control over how parsing operates that the original language spec can be supported under the Loyc extensible parsing model.
  • Similar concerns apply to operators. There may be hundreds of operators available in a program, but not all at once. Availability may be moderated by the parent statements and parent expressions.
  • Note to self: I need to introduce a new kind of OneOperatorPart that represents the edge of the expression. This would be a prerequisite to custom-syntax function calls such as Line(from x, y to x+10, y+10).
  • What kind of context information should be available during statement and expression parsing? Certainly the type Symbol of parent and grandparent nodes... but some statements may only be available if a certain custom attribute was used on the statement or a parent statement, so I think the set of attributes for parent/grandparents should be available too. And maybe availability based on attributes should be a standard feature, a criterion upon which Loyc activates/deactivates the statements automatically. But as I've said, providing the parent ICodeNode seems like too much to ask. I suppose it could be provided optionally.
  • As I mentioned above, there are two ways to look at how statements are allowed to be nested inside other statements. You can either have the substatements specify what they can be located inside of; or, the parent statement can say what kind of substatements it can contain. Should Loyc support both approaches?
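One way to picture the "availability varies with every new block" problem from the list above: keep a stack of active-statement (or active-operator) sets, one per open block, with each block inheriting from its parent. This is not ONEP's or Loyc's actual design, just a sketch of the mechanism.

```python
# Hypothetical sketch: context-dependent activation of statements or
# operators, scoped to blocks. Not an actual Loyc/ONEP design.

class ActivationContext:
    def __init__(self):
        self._stack = [set()]     # one set of active items per open block

    def activate(self, item):
        self._stack[-1].add(item)

    def open_block(self):
        # A new block inherits everything active in the enclosing block.
        self._stack.append(set(self._stack[-1]))

    def close_block(self):
        self._stack.pop()

    def available(self, item):
        return item in self._stack[-1]

ctx = ActivationContext()
ctx.activate('+')
ctx.open_block()
ctx.activate('unless')             # active only inside this block...
inner = ctx.available('unless')    # True here
ctx.close_block()
outer = ctx.available('unless')    # ...and gone once the block closes
```

Copying the whole set on every block open is obviously not the efficient representation the first bullet asks for; a persistent (shared-structure) set would be one answer to that question.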

Now consider this. Suppose somebody writes an "unless-else statement" extension:
unless (x < 0)
    return new StringBuilder(x);
return new StringBuilder();
and somebody else writes an extension for "macro methods" which can be "instantiated" as normal methods:
macro(T) T Abs(T x) {
    unless (x < 0) return x;
    else return -x;
}

instantiate(int) Abs;    // create method int Abs(int)
instantiate(long) Abs;   // create method long Abs(long)
instantiate(float) Abs;  // create method float Abs(float)
instantiate(double) Abs; // create method double Abs(double)
You can see that the macro method statement should be able to parse all statements that belong inside a method, and the "unless" statement should be allowed in the same places an "if" statement is allowed. You can also see that if the "macro method" had to specify explicitly the kinds of statements it supports, or if the "unless" statement had to specify explicitly its allowable parent nodes, then there would be no way the two extensions could work together when neither author knew about the other's extension.

Therefore, I think statements should be grouped by "category", where categories are classes of statements like "method body statements", "class body statements", "loop statements", "block statements", "conditional statements", etc. I suspect categories will be important for extensibility because they can allow statements to work together that are not aware of each other.
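A toy sketch of how categories could resolve the unless-else/macro-method example (category names and the registry shape are invented; Python stands in for C#): each statement declares the categories it belongs to, each container declares the categories it accepts, and availability is just set intersection.

```python
# Hypothetical sketch of statement categories. Neither extension needs
# to know the other exists; they only agree on category names.

STATEMENT_CATEGORIES = {
    'if':     {'conditional statements', 'method body statements'},
    'unless': {'conditional statements', 'method body statements'},
    'class':  {'namespace statements'},
}

def allowed_inside(statement, accepted_categories):
    """True if the statement belongs to any category the container accepts."""
    return bool(STATEMENT_CATEGORIES.get(statement, set()) & accepted_categories)

# The "macro method" extension accepts whatever belongs in a method body;
# "unless" registered itself in that category, so it just works.
macro_body_accepts = {'method body statements'}
```

The "unless" author only had to say "I go wherever 'if' goes", and the "macro method" author only had to say "I contain method body statements".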

Hello, no one!

At this point I have no readership and no one has the foggiest idea what Loyc is, except maybe my non-programmer best friend, Ivan. (In case some random guy reads this, Loyc is the Language of Your Choice, a multi-language compiler that will allow anybody to add new features to existing programming languages.)

Before I try to get anybody on board the project, I need to work out enough design issues to present a reasonable description of its design. I've got some design documents written but not posted on the web yet--I need to re-learn how to upload to, and the usability of their services is truly awful, at least if one is not a Unix guy. For now I should probably make pages on my local crappy WEBrick server. Then I won't have to upload anything, just move stuff to another folder on my hard drive.

For now you can read the incomplete doc about my extensible expression parser called ONEP. If anyone clicks on that link I'll be shocked, shocked! I'm excited to announce (to my zero readers) that the C# code for BasicOneParser is complete and you can post a message if you want a copy.

Anyway, I bought the domain name a few months ago; in fact that's how I chose the name of the project: domains for most acronyms are already taken.

Saturday, August 11, 2007

Microsoft sure knows how to foil search engines

I have been trying to learn COM and .NET development lately, but every time I try to search for COM- and .NET-related stuff on Google, it seems like I get every website that has .com or .net in the domain name. Plus, search engines are hardly able to tell the difference between C, C++ and C#; it's as if pages were indexed by their first letter. For .NET I've gotten around this problem by searching for ".NET framework" or CLR, but for COM there seems to be no way to find information about it.

What TLA will they come up with next? THE? FOO?

P.S. My 80-gig secondary hard drive has died. It's only a matter of time before my 200-gig explodes...

Tuesday, August 7, 2007

Microsoft C++ doing what Loyc is doing?

I came across this MSDN article today, which says
...when the project is built, the compiler parses each C++ source file, producing an object file. However, when the compiler encounters an attribute, it is parsed and syntactically verified. The compiler then dynamically calls an attribute provider to insert code or make other modifications at compile time. The implementation of the provider differs depending on the type of attribute. For example, ATL-related attributes are implemented by Atlprov.dll.

So, boo and Loyc aren't the only compilers doing binary-compatible compiler extensions. I wonder who else is doing it.

Wednesday, July 18, 2007

C++/CLI is disgusting

Microsoft has found a truly awful set of syntax and semantics for their new C++/CLI language, formerly known as "Managed Extensions for C++". They decided that the old syntax was ugly because it used keywords that began with double underscores (which is a standard way to add compiler extensions in C++). Unfortunately, their solution was much worse than the problem they were trying to solve.

I had been using the first Managed C++ for a little while, but luckily I only made a single module in it (wrapper classes to allow C# to access some C++ classes). After a couple years I wanted to add a dialog box that accessed the C++ classes directly, but the forms designer only supported the "new" syntax; worse, Microsoft requires that the entire project only use one syntax or the other. So I learned the awfulness of the new design as I laboriously converted each line of the old code to the new syntax; the new syntax is so different that virtually every line of the module's header file had to be changed. And not just slightly. In many cases it was faster to retype the line than to try to adjust it. And they didn't just make new syntax, they invented new problematic semantics as well.

The changes include
  • The new "handles". A pointer to a managed class used to be called MyClass*, now it's MyClass^. Other than that they are still used like pointers (i.e. with the arrow notation).
  • "Tracking references". Instead of writing String^& and Int32& like you would expect, you have to write String^% and Int32%.
  • nullptr. Whereas you used to be able to initialize all pointers to NULL, including managed pointers, now you have to remember if it's a managed class and use "nullptr" if so.
  • Same with 'new'; now you have to write gcnew if the class is managed.
  • New finalizer syntax. Confusingly, whereas C# and Managed C++ use "~ClassName" for finalizers, Microsoft decided it was too predictable and renamed it to "!ClassName". Worse, they now require your Dispose() function to be called ~ClassName(), which causes a silent semantics change in old code. Or it would, except that you'll know something's up because your Dispose() method yields this odd error: "'Dispose' : this method is reserved within a managed class".
  • You can no longer use a managed enum like you do a normal enum; you have to qualify the names with "EnumName::EnumValue". This makes it impossible to share an enum between C# and standard C++ code, so you have to create a second enum (with the same items) and convert between them all the time.
  • In a managed class you must say if you're overriding a base class function or not, or you'll get a compiler error--whereas in pure standard C++ you can't. Argh! Even C# lets you off with a warning. And what a bizarre syntax they've picked too; instead of grouping the "override" keyword with "static", "virtual", etc., they make you put it at the end: virtual void foo() override {}. What's more, you have to specify both virtual and override.
  • Similarly, "sealed" and "abstract" go after the class name.
  • When making managed properties, you now have to group the setter with the getter in a single construct like in C#, but unlike in C#, you also have to repeat the data type three times (or twice if it's just a getter). How many times do you want to type Dictionary<string,SomeFreakyLongClassName>?
  • CLR enums are no longer implicitly convertible to arithmetic types.
  • They've switched to the standard 'typeid' syntax instead of __typeof(MyManagedType). Oh wait, no they haven't! The syntax is randomly different: MyManagedType::typeid versus typeid(UnmanagedType).
  • What the hell were they thinking here?
    virtual Object^ InterfaceClone() = ICloneable::Clone;

    The old syntax for 'explicit interface implementation' made much more sense:
    Object* ICloneable::Clone();
Tell me, how is it that when C# is supposedly modeled after C++, the C++ version of all these .NET features ends up looking so much longer and so different from C#?

Admittedly, there are a few things that don't suck, like
  • the support for normal C++-style operator overloading in managed classes.
  • implicit boxing (although if NULL is defined as 0, watch out for boxed zeros when converting old code)
  • default indexers (much like in C#)
  • trivial properties (but they're inflexible and so not usable in many scenarios)
And now some managed-style features work in unmanaged classes, such as properties. Personally I have no use for this. After all, using such features means you can't compile your unmanaged class in a non-.NET program, so their utility is limited. If I want to write a class that only works in .NET, I would almost always make it a "ref class" or "ref struct" so I can interoperate with other .NET languages.

There are two main problems I see with their design.

The first big problem is that they've forgotten the spirit of C++ and discarded longstanding rules of C++ such as implicit overriding. C++'s philosophy has long been that an object should be able to behave like a pointer, like a number, like a function. Smart pointers, iterators, fixed-point/matrix classes/bigints, functors. The ability of one thing to act like something else is the whole basis for the STL. But in Microsoft's new design, everything managed is completely segregated so you can no longer write code that doesn't care whether something is managed or not. It's not just reference types either; value types and even simple enums are segregated to an extent that they weren't before. You always have to think: Do I have to Qualify:: that enum or not? should I use gcnew or new here? NULL or nullptr? * or ^? & or %? I can only use one or the other in a given context, but the wretched compiler still makes me tell it what it wants to hear. Template code that before could have (theoretically) taken managed or unmanaged classes for arguments can now take only one or the other, because a separate syntax is needed for each.

The second big problem is that there is no longer anything I can share between C# and standard C++. I have a library that needs to be compiled into both C# programs and MFC programs (which must be Windows CE compatible, so mixing .NET and MFC is not an option). With the old syntax it was possible to share a small number of value types and enums between plain C++ and managed C++ (with the help of some #define macros); now I have to make two versions and convert between them.

If anything, Microsoft should have made the managed syntax more like standard C++, not less. It should have considered how to allow people to write classes that could be used directly from C# or (in another program) directly from standard C++. This would have made a much better bridge between unmanaged land and managed land. As it is, Microsoft has imposed a kind of syntax apartheid.

Bottom line: I loathe the new syntax. It makes me long for the hellish landscape of double underscores again.

Thursday, July 12, 2007

Goodbye, ANTLR

Three days ago, after finding workarounds for the ANTLR3 (C#) bugs detailed here, I immediately ran into even more bugs. For instance I had a rule that said


Somehow the generated code for this rule included a check (during the matching stage, if I remember correctly) that said, in essence, "if the comment contains a slash character, generate a syntax error". What the hell? And there was another bug besides that, which I've forgotten. My bug report on the first batch of bugs went mostly unacknowledged, so I didn't bother to try isolating this new problem.

Instead, I'm planning to try another approach: I'll make my own ANTLR. I bought the ANTLR book on May 26, and I've been unable to get the thing to work for me since then. I'm getting impatient. I know how an LL parser generator should behave, so I ought to be able to make one... right?

Of course, I would like a parser generator done the Loyc way - as an extension to Loyc. But it'll be a little bit tricky to do this, because Loyc does not actually exist yet. It's still in the planning stages! There are no AST classes, no ONEP. So what will I do?

Well, my initial goal will be a translator from boo to boo. I'll make some AST classes and give them the ability to print themselves out as source code. Then I'll create a lexer and tree parser by hand; as for the main parser, I'm not sure how to approach it. But after I've done those things, I'll write some routines for printing out AST nodes as text. So it will be able to read source code and spit it back out.
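A minimal sketch of the "AST classes that print themselves out as source code" idea (Python standing in for the planned C# classes; the node set is invented, but the indentation-based output matches boo's Python-like syntax):

```python
# Sketch: AST nodes that can print themselves back out as source code,
# the core of a source-to-source (boo to boo) translator.

class ReturnStmt:
    def __init__(self, expr):
        self.expr = expr
    def to_source(self, indent=''):
        return indent + 'return ' + self.expr

class IfStmt:
    def __init__(self, cond, body):
        self.cond = cond
        self.body = body          # list of child statements
    def to_source(self, indent=''):
        lines = [indent + 'if ' + self.cond + ':']
        for stmt in self.body:
            lines.append(stmt.to_source(indent + '    '))
        return '\n'.join(lines)

ast = IfStmt('x < 0', [ReturnStmt('-x')])
print(ast.to_source())
# if x < 0:
#     return -x
```

Once the lexer and parser can build such a tree from source text, reading code and spitting it back out is just parse followed by to_source.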

At this point I've already written a lot of the lexer by hand. I've taken it as an opportunity to figure out how a parser generator should work, by attempting to write the lexer the way a machine would do it. I started by writing the lexer grammar in a hypothetical boo-style syntax; then I translated that grammar--mechanically, by hand--to C# source code.
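To show the kind of mechanical translation involved (with a generic identifier rule, not boo's actual grammar, and Python rather than the C# I'm actually writing): a lexer rule like `ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;` turns into an if for the first character and a loop for the starred part.

```python
# Hand-translation of a lexer rule to code, the way a generator would do it:
#   ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

def match_id(text, pos=0):
    """Return the end position of an ID token starting at pos, or None."""
    if pos >= len(text):
        return None
    ch = text[pos]
    if not (ch.isalpha() or ch == '_'):
        return None               # the required first character didn't match
    pos += 1
    # The (...)* part becomes a loop that keeps consuming while it matches.
    while pos < len(text) and (text[pos].isalnum() or text[pos] == '_'):
        pos += 1
    return pos
```

Doing this by hand for every rule is exactly the drudgery a parser generator exists to automate, which is the point of the exercise.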

There is so much work I have to do before I start making the parser generator, though. I fear that by the time I'm done with the prerequisites, I will have forgotten the lessons I'm now learning about making a parser generator. We'll see.

What's wrong with Java

When I see the features added recently to Java, I'm sure glad I'm using .NET, C# and boo. Even though Java is a lot older than .NET, .NET seems to get the good features first. Case in point: generics. For a former C++ developer, it seems stupid to give up type-safe collections; I don't know how many years it took before Java got generics (10?), but .NET got them in much less time (less than 4 years, I do believe) and Java was left playing catch-up. In fact, most of the "new" features in Java 5.0 seem to be things that C# had from the beginning:
  • enhanced for loop (foreach in C#)
  • autoboxing/unboxing
  • enums
  • varargs (variable argument lists - "params" in C#)
  • annotations (much like .NET attributes)
Java generics aren't even supported by the JVM, so you get the same performance penalty from casting that you did before. I've always been unhappy with Java's performance (especially for GUI programs), whereas .NET just doesn't seem slow.

Look, Sun, if nothing short of competition from Microsoft can prompt you to improve Java, you must not care very much about it.

Let's see, what else...
  • Value types (structs). This is a big one for me because they can offer a big performance boost in many situations. You don't want to allocate a new object if that object contains nothing more than an integer and some methods, do you? A new object sucks up at least 16-20 bytes of memory even if it just contains one integer or reference; creating it requires multiple method calls and all those bytes have to be initialized. Useful value types include
    • A "Point" type that has X and Y coordinates
    • A "FixedInt" type that contains a fixed-point number (the language must support operator overloading to make it easy to use, of course.)
    • A "BigInt" type that contains a small integer normally, but allocates a memory block for a large integer if necessary.
    • A "Handle" type that contains an opaque reference to something else
    • A "Symbol" type that contains a numeric identifier that represents a string (symbols are a built-in feature of Ruby and are typically used like enums, except they are more flexible)
    • A Pair type that contains a pair of values A and B; often you get better performance by not allocating memory for this purpose.
  • Multi-language support. Well, the JVM can certainly host multiple languages, but only .NET is specifically designed for it. Admittedly, the design isn't that great, but at least Microsoft specifically considers the needs of other languages.
  • Delegates. The Java equivalent is using interfaces with one function in them, but this is relatively inflexible and certainly more annoying to use. Java provides inner (even anonymous) classes to help people use the pattern, but delegates are way better.
  • Closures (functions inside other functions, where the inner function can access local variables of the outer function). Java doesn't have that, does it? You can access "final" variables from a function-inside-a-class-inside-a-function, but that's all. By the way, .NET itself doesn't actually support closures, but C# fakes it well.
  • Iterators. Now this may be my favorite feature of C# 2.0; it would be hard to choose between iterators and generics. I love them not only because you can create enumerators easily (which is great) but also because you can approximate coroutines with them.
  • Swing. Ugh! It's ugly, it's slow, and the Windows "skin" isn't very convincing. There often seem to be glitches in Swing that you don't find in other programs, such as the failure to resize a window fluidly (i.e. the window doesn't redraw itself until you let go of your mouse button). Finally, and worst of all, developing Swing GUIs is a huge pain in the ass. I absolutely can't stand it. The .NET counterpart, Windows.Forms, doesn't seem all that well designed, but it looks good, it's relatively fast, and it's easy to write code for it. Plus, of course, a good Forms designer is a standard feature of any .NET IDE.
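For what it's worth, the closure idea from the list above is easy to demonstrate (in Python here, which has them natively): the inner function captures and updates a local variable of the outer function, which is exactly what Java's final-variables-only rule for inner classes disallows.

```python
# A closure: the inner function keeps live access to 'count', a local
# variable of the outer function, even after make_counter() has returned.

def make_counter():
    count = 0
    def next_value():
        nonlocal count            # captured by reference, not copied
        count += 1
        return count
    return next_value

counter = make_counter()
values = [counter(), counter(), counter()]   # each call sees updated state
```

C# anonymous delegates achieve the same effect by having the compiler hoist the captured variable into a hidden class, which is the "faking it" I mentioned.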
Right now I wish I could have the C# 3.0 "var" feature because I'm sick of typing

SomeJerkGaveThisClassALongName foo = new SomeJerkGaveThisClassALongName();

Obviously we should be able to write simply

var foo = new SomeJerkGaveThisClassALongName();

And there's a lot of other great stuff in C# 3.0 [.doc]:
  • Lambda expressions (syntactic sugar for anonymous inner functions) with type inference
  • Type inference for generic method calls
  • Extension methods (they are not well thought out, but I'd rather have them than not)
  • Object and collection initializers (to make code more brief)
  • Anonymous POD ("plain old data") classes, which work like tuples except that the fields have names.
  • And last but not least, the query thingie, LINQ.
Suddenly, C# is starting to seem a lot more like boo.

Having said all this, there are a couple of things from Java that I might like to have in C#:
  • The assert statement. Typing Debug.Assert() all the time is driving me nuts.
  • Inner classes that have an implicit link to the outer class
And let's see, if I could have some more features I think they would include
  • Traits
  • The ability to supply a default implementation for a member of an interface
  • Preconditions and postconditions on methods
  • Static and run-time unit checking (units as in metres, litres, bytes, pixels, etc.)

The Loyc Blog

This blog will be a place for me to report on the progress of Loyc and to comment on the programming field in general. Welcome.