Friday, August 10, 2007

Why should we be excited about RIA? What's the big deal? Some people have reductionist attitudes toward new technology, focusing more on how something is the "same ol' stuff" instead of focusing on the new aspects the technology brings to the table. Strides forward in software technology are usually evolutionary instead of revolutionary which makes it easy to succumb to this reductionist perspective. Software borrows from the successes of the past and improves upon the failures. The revolution, if it happens, is in how the technology is used. Arguably the largest world changing technology in our lifetimes is the web, and this is just an evolution of the Internet via a new application layer protocol (HTTP), a markup language (which evolved from SGML), and rendering engines to display it. Some dismissed it as "nothing really new" or thought it had nowhere to go. Why should we have gotten excited about the web, if it was just (apparently) a few minor steps ahead of where the technology was at the time? Others embraced it, and now look at where it is. The web is such a great example because we can see where things are now. We know that HTTP, HTML and a few web browsers enabled the world to change.

New software, such as what we're seeing in the RIA world, is all about enabling something new or making something easier. It doesn't matter if software is truly innovative (which, if we define too narrowly, is significantly harder to achieve than most people realize) or not. The focus should be on what we can do now that we couldn't do yesterday, or what we can do now that was hard to do yesterday. This perspective is far from original, but it's useful to bring to the front of our minds as we consider where RIA can go.

So, what really are we gaining? As I mentioned in my previous post, I see RIA as a shift in perspective. Website designers being able to develop desktop applications, and .NET developers being able to use their skill set for desktop and web user interfaces, both without massive retraining, enables people to contribute their expertise in new ways to the applications that are developed. We're bridging worlds that co-existed but weren't as blended together as they are now. The technology is enabling this blending. JavaScript and HTML and CSS are leaving the browser. XAML is spoken by both development (Visual Studio) and design tools (Expression suite). Some stuff is new (XAML), other stuff (HTML applications anyone?) is made easier. And this intersection is just the beginning (notice I'm falling into a trap of focusing on just the front end - there's exciting stuff happening in other layers too!)

Some people will dismiss what's happening as "nothing really new" or "just more technology" but the possibility is here, right now, to make this so much more. This is why we should get excited about RIA. Many people are already exploring it, figuring out what works and what doesn't. I'm not totally sure where everything is heading. I'm trying to chart a course like everyone else and people far smarter than I am are leading the way. But I can't get rid of the feeling that we're at the beginning of something significant. It seems we've come so far but that's nothing compared to the changes we'll see in the coming years. The first step, what we're seeing now, is a lot of exploration and playing with the technology to see what we can do, as we also figure out what we should do. People's approach to application design will change as high profile projects become successful and as the industry discovers and publishes best practices. Security will be a big deal. How well applications are developed will affect users' perception of the technology and of the companies that build it. There's a lot to think about and there will be more people discussing it. This can be a good thing and a bad thing, but it's definitely an exciting thing.

I'm unsure if I'm saying anything useful here or not. I do know that I'm a code junkie and am rather surprised I have two opinion pieces on RIA within a week. I love implementing the things I imagine in my head, whether I'm making money off the code or not. So far I haven't discussed code much with WPF/Silverlight/etc. I need to change that on here, but I want to contribute something useful, so stay tuned, I'll get there as soon as I can! Thanks for reading.

Friday, August 10, 2007 1:51:11 AM (Eastern Standard Time, UTC-05:00)
Wednesday, August 08, 2007

Ah yes, the perils of having co-authored a Java book and now working in a .NET shop :) Whenever a Java project arises, it comes my way. (Okay, I'm not REALLY complaining, but it's humorous to me) This project seemed easy enough: someone else wrote a web service in C#. I needed to write Java code to invoke the .NET web service and package it as a small SDK of sorts. So I hit the web service, pulled down the WSDL, ran it through Axis, and I got some code. However, I didn't want the additional dependency on Axis when delivering the Java code to our customer, so I tried using wsimport, a tool that comes with the JDK. Should work right away, right?

Sadly, no. I got an error about the s:schema being undefined, so I tried passing the address of XMLSchema.xsd from w3.org as a binding via the -b option.

Then I got a confusing error that left me scratching my head for longer than I care to admit. It is because of this error that I am going to the trouble of writing this post. I found exactly one hit on Google about this error and that search result simply had someone replying "I haven't seen this before....I think you solve it by....." which unfortunately was useless to me.

To cut to the point, the issue was that one of the properties on the C# object is a DataSet, and wsimport ain't very happy with DataSets since the types cannot be determined until runtime. The solution is straightforward: use a typed DataSet. There's a ton of documentation online describing how to create typed DataSets in Visual Studio. What I couldn't find, for whatever reason, was anything linking the following error to DataSet interoperability issues. It's likely that this is NOT the only case where wsimport will output the error message below. If this post doesn't help you and you're getting this error, check interoperability problems, check that the XMLSchema.xsd namespace/location is correct (specify -b if needed, perhaps), and if none of this helps - take a WSDL you know wsimport will work with, compare it to the WSDL that's failing, see where the differences are and start removing constructs that seem out of place. Once the problem WSDL works, look at the last construct you removed, and you'll have your problem narrowed down.

Here's the error I received from wsimport:

C:\javacode\ws>wsimport DotNetWebService.wsdl -b http://www.w3.org/2001/XMLSchema.xsd

XML reader error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[112,36]
Message: A '(' character or an element type is required in the declaration of element type "xs:schema".
XML reader error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[112,36]
Message: A '(' character or an element type is required in the declaration of element type "xs:schema".
at com.sun.xml.internal.ws.streaming.XMLStreamReaderUtil.wrapException(XMLStreamReaderUtil.java:249)

Once the DataSet was changed to a strongly typed DataSet, the additional binding option wasn't needed. You can execute wsimport DotNetWebService.wsdl and end up with your code.
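A quick way to check a WSDL for this pattern before running wsimport can be sketched in Python. This is only a rough sketch under assumptions: it assumes the untyped DataSet shows up as an element with ref="s:schema" (matching the s:schema complaint mentioned above), and the function name find_schema_refs and the sample WSDL fragment are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical WSDL fragment; a real one would contain messages, bindings, etc.
SAMPLE = """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:s="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <s:schema targetNamespace="http://example.com/">
      <s:element name="GetDataResult">
        <s:complexType>
          <s:sequence>
            <s:element ref="s:schema" />
            <s:any />
          </s:sequence>
        </s:complexType>
      </s:element>
    </s:schema>
  </wsdl:types>
</wsdl:definitions>"""


def find_schema_refs(wsdl_text):
    """Return the ref values of XSD element declarations that point at 'schema'."""
    root = ET.fromstring(wsdl_text)
    hits = []
    for element in root.iter("{%s}element" % XSD_NS):
        ref = element.get("ref", "")
        # Compare only the local part, since the prefix (s, xs, xsd) varies.
        if ref.split(":")[-1] == "schema":
            hits.append(ref)
    return hits


print(find_schema_refs(SAMPLE))
```

Any hit is a hint, not proof, that an untyped DataSet (or something similar whose shape is only known at runtime) is exposed by the service.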

I hope this helps someone out there.

Wednesday, August 08, 2007 10:47:26 PM (Eastern Standard Time, UTC-05:00)
Monday, August 06, 2007

Okay, so I went to write a post today and found I have some good longer posts but didn't feel like putting one of those together today. And unfortunately, I struggled to come up with a short topic :) To the few readers I have at the moment, I'll instead post a little puzzle that one of my smartest friends shared with me. It's the type of thing you either see right away or get stuck on for a while, I think.

In this equation: 101 - 102 = 1

Move one and only one digit to make this equation true. By "one digit" I mean literally one digit - not one "type" of digit - so you can't move more than one 1 for example.

Enjoy

Monday, August 06, 2007 10:22:38 PM (Eastern Standard Time, UTC-05:00)
Sunday, August 05, 2007

Many within the company I work for are happy with the deployment of Team Foundation Server (TFS), including people that tend toward the skeptical side with all of Microsoft's offerings. It's great to see such a powerful, integrated tool actually helping development and project management as promised. I am still the main developer supporting our deployment of TFS, so when our configuration management (CM) group requires a tool or any extensibility done to TFS (such as custom controls in our work item types) I am the one to do the implementation.

At the moment, CM is looking for some specialized handling of work item type files, details of which I'll possibly outline in a future post. Since this requires processing the work item type XML files, I went to retrieve the Visual Studio SDK which includes the schema files for the work item types.

The xsd program generates code based on the original schemas. However, passing these files through any validator, such as XMLSpy's, reveals problems with the XML. Since xsd generates code I can use, I could stop here, but I wanted to explore the XML and see if there's a way to fix it. Also, I don't think this is a blocking issue for many at all, explaining why I haven't seen much about this online.

After opening WorkItemTypesSchema.xsd in XMLSpy we get this warning message:

Some of "include" and/or "import" and/or "redefine" statements in the following files have no schemaLocation attribute and will be ignored!

Step 1 was to address this issue, so I added schemaLocation="typelib.xsd" to the import of typelib.xsd near the top of the file.

Then I saved and got the following error message:

This file is not valid! If you save the file in its current state, other XML processors may have a problem opening the file.

When something like this happens, I struggle to find out "why?" If you read Raymond Chen's blog regularly, there are many instances where software such as Windows seems to do something boneheaded but upon thinking about it for a few moments (or viewing it in light of backward compatibility) it makes sense. I struggled for a few days to come up with the best answer I could to explain this failure to validate and I only see one likely possibility, but we'll get to that shortly.

So, why is the schema file invalid? It's because of non-determinism. Searching for "XML" and "non-determinism" on Google brings us to this http://msdn2.microsoft.com/en-us/library/9bf3997x(... page at msdn, that states:

A deterministic schema is a schema that is not ambiguous, allowing the parser used by the Schema Object Model (SOM) to determine the sequence in which elements should occur in order for an XML document to be valid. It is possible for an XML Schema to be ambiguous, or non-deterministic. A schema is considered to be non-deterministic if the parser is unable to clearly determine the structure to validate with the schema. When validation is attempted on a non-deterministic schema, the parser used by the SOM generates an error.

I now have two conflicting pieces of information. I have XMLSpy telling me the schema isn't valid. Then I have msdn at microsoft telling me schemas can be ambiguous coupled with xsd successfully generating code. If we follow the cos-nonambig link in the error window of XMLSpy, it brings us to this page http://www.w3.org/TR/2004/REC-xmlschema-1-20041028... that states:

A content model must be formed such that during ·validation· of an element information item sequence, the particle component contained directly, indirectly or ·implicitly· therein with which to attempt to ·validate· each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

We expect the ultimate authority is w3.org. After more Google searches, it appears the existence of ambiguous schemas is (or was, hopefully) expected, even if these schemas don't validate. Rick Jelliffe, a former member of the XML Schema Working Group says:

I have received several very negative reports on the state of interoperability of tools using XML Schema ... The most common complaint is tools that generate ambiguous XML Schemas ... Ambiguous schemas effectively break everything downstream (http://www.w3.org/2005/05/25-schema/rick.html)

Okay, so the schema isn't valid, it apparently violates the specification at w3.org, and we have someone that should have some authority on this matter saying ambiguous schemas are bad. In spite of msdn acknowledging ambiguous schemas can exist (and corroborated by other sites I browsed), I think Microsoft should have made this schema validate.

XMLSpy indicates the <xs:complexType name="FieldDefinition"> tag is where the non-determinism exists. We see the non-determinism almost immediately (if we know how to identify it) in the following lines:

<xs:complexType name="FieldDefinition">
  <xs:sequence>
    <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="HELPTEXT" type="HelpTextRule" minOccurs="0"/>
    <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>

This specifies a sequence of PlainRules (0 or more), followed by zero or one occurrence of a HELPTEXT element, followed by 0 or more PlainRules again. The default value of maxOccurs in XML Schema is 1, so with minOccurs="0" the HELPTEXT element appears zero or one time, though the following discussion would hold just as well for "0 or more" HELPTEXTs. Let's simplify this for purposes of explanation and call PlainRules "A" and HELPTEXT "B". Using symbols from regular expressions (and formal languages and such, stretching back to university) we'll use * as "0 or more occurrences" and ? as "0 or 1 occurrence." This allows us to look at this XML fragment containing the non-determinism as:

A* B? A*

I went to the trouble of converting to these symbols in order to easily illustrate the non-determinism. By the XML schema specification, each element (the particle component) must unambiguously belong to a predictable part of the schema. The following sequences belong to the language A* B? A*

  • AAAAABAAAAA
  • ABA
  • AABAA
  • AB
  • BA

All the preceding sequences avoid the ambiguity issue. The A's that come BEFORE the B belong to the first A group, and all the A's that come AFTER the B belong to the second A group. The ambiguity is solved here by the presence of B, but the language states B is optional, so the following are also valid members of the language A* B? A*

  • A
  • AAAAAAAAA
  • AAA

This is where we encounter the ambiguity. When there is a single A, does that A belong to the first group of A's or the second? When there is a sequence of A's, such as "AAA", it's just as likely for any of the A's to belong to the A* before the B? as it is for them to belong to the A* after. There is no way to know which A* an A belongs to without mandating B in the middle.
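The ambiguity can be made concrete by counting how many ways a sequence can be divided among the three parts of A* B? A*. The following Python sketch does this by brute force; the function name parses is made up for illustration, and it models only the simplified A/B language used in this discussion, not a real XML Schema validator.

```python
def parses(seq):
    """Return every (first A*, B?, second A*) split of seq that matches A* B? A*."""
    found = []
    for i in range(len(seq) + 1):      # i = number of symbols given to the first A*
        for b in (0, 1):               # b = whether B? consumes one symbol
            first, middle, second = seq[:i], seq[i:i + b], seq[i + b:]
            if set(first) <= {"A"} and middle == "B" * b and set(second) <= {"A"}:
                found.append((first, middle, second))
    return found


for candidate in ("ABA", "AB", "BA", "A", "AAA"):
    print(candidate, len(parses(candidate)))
```

Sequences that contain a B have exactly one split, while a run of n A's with no B has n + 1 of them, since the boundary between the two A* groups can fall anywhere in the run.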

The only reason I can see for Microsoft to introduce this ambiguity is to make it easier for people modifying the work item type definitions. These people can place HELPTEXT anywhere as a child of the FIELD element, instead of mandating that HELPTEXT appear first. It seems a straightforward requirement to say "HELPTEXT" must appear in a specific position rather than anywhere in a big mess of PlainRules, since all other elements must be placed precisely.

In order to fix this issue, any tweaks done to the schema must not break the existing schema (such that a work item type validates against both the original and the fixed schemas). The most straightforward way I came up with is to mandate that HELPTEXT, if present, is the first child of the FIELD element. This means the language A* B? A* is rewritten as B? A*. Now the parser knows if there are any A's (PlainRules), they match the one and only A* specification. I changed the FieldDefinition type in the schema to

<xs:complexType name="FieldDefinition">
   <xs:sequence>
      <xs:element name="HELPTEXT" type="HelpTextRule" minOccurs="0"/>
      <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
   </xs:sequence>

For those a step ahead of me, you'll realize that if HELPTEXT appears after any PlainRules in a work item type, it no longer validates against this revised schema (while still validating against the original). Since I'm writing a tool to manipulate work item types, my fix is to execute code to rewrite the work item type, ensuring any occurrence of HELPTEXT appears as the first child of any FIELD elements. I'll be placing this code on this site shortly in case anyone wants to use it.
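As an illustration of what such a rewrite involves (this is not the tool mentioned above, just a minimal Python sketch), the reordering could look like the following. It assumes the FIELD and HELPTEXT elements carry no XML namespace; the function name move_helptext_first and the sample fragment, including the REQUIRED rule standing in for PlainRules, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical work item type fragment with HELPTEXT after another rule.
SAMPLE = """<FIELDS>
  <FIELD name="Title" refname="System.Title" type="String">
    <REQUIRED />
    <HELPTEXT>Short description of the work item</HELPTEXT>
  </FIELD>
</FIELDS>"""


def move_helptext_first(root):
    """Make HELPTEXT the first child of every FIELD element; return how many moved."""
    moved = 0
    # Copy the matches into a list so the tree is not mutated while iterating it.
    for field in list(root.iter("FIELD")):
        help_text = field.find("HELPTEXT")
        if help_text is not None and field[0] is not help_text:
            field.remove(help_text)
            field.insert(0, help_text)
            moved += 1
    return moved


root = ET.fromstring(SAMPLE)
moved_count = move_helptext_first(root)
child_tags = [child.tag for child in root.find("FIELD")]
print(moved_count, child_tags)
```

Running the function a second time moves nothing, so it is safe to apply to files that already have HELPTEXT in the first position.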

There's one more modification needed to the XML schema so it validates - the two regular expressions used at the bottom (in SizeType and PaddingType) have their commas escaped. Once the backslashes before the commas are removed, and the other changes done, the XML file validates fine.

Please note that I don't know every corner of the XML schema specification, nor have I explored the Orcas or Rosario TFS versions, so information here might be incorrect or out of date shortly. I imagine I'll update this topic after seeing how things change in Orcas/Rosario.

Sunday, August 05, 2007 11:11:00 PM (Eastern Standard Time, UTC-05:00)

There's much discussion of RIA in the blogosphere, most of which I've been reading at some Microsoft blogs and some Adobe blogs. I'm not an artist, even my stick figures reveal my lack of skill (and I don't have the patience to get good like Richard Feynman did), but clean user interfaces and strong user experiences have been a side interest of mine dating back to my first reading of Design of Everyday Things. I've been turning my attention more toward RIA and the supporting infrastructures with this next wave of technology that includes WPF and Silverlight and Adobe AIR and Flash and Flex and others, and eventually might include Seadragon and Surface, and whatever technology other companies come out with. My views align with Scott Barnes in general, but I wanted to hash out what I think in writing about RIA since it'll force me to organize my thoughts. I see RIA as more of a shift in philosophy, a fresh approach to application design, than as a specific technology, though the new technologies enable RIA development.

There are two significant perspectives in the RIA picture: that of the developer and that of the user.

(Note: I recognize I am speaking in generalities and not supporting everything I say with evidence, but bear with me. If you're reading this and disagree, chime in)

The Developer Perspective

As developers, we tend to lose sight of the bigger picture. We get mired in discussions of what technology is better, why X language sucks because it doesn't support Y feature and why Z company is superior. These are religious discussions in the technology world and I have very little patience for them. There is no one language, no one platform that will be the best answer in absolutely every case, and when we have choice, why do we waste time complaining about a certain technology? We figure out which technology best allows us to solve a particular problem and we move forward. Sometimes we deal with constraints (like working at a company that only uses Microsoft technology) but as long as the job can get done, we're fine with whatever we work with (or we quit to work for a company more in line with our personal preferences).

I also see some software developers constantly trying to shift perception of Microsoft technology by only focusing on the negatives and ignoring the positives. On one hand, this indicates people are holding Microsoft to a higher standard - they expect perfection and nail Microsoft when perfection is missed. But this view ignores the reality - Microsoft is like any other big software company. Product quality varies from one to another. Some features may make no sense outside the design meetings where an imperfect decision had to be made. Other decisions are actually the right ones from one angle but wrong from a different angle. And some decisions are made that are just wrong. I hate blanket statements I've seen online that say "everything from Microsoft sucks" because it ignores the many successes and reveals the commenter's ignorance. There are legitimate reasons to zing Microsoft (for example, no label viewer in TFS 1.0? I know they have deadlines and have to cut features, but still, that feature got cut? :) ) so let's shy away from dismissing Microsoft - or any company - out of hand.

I used to hate Microsoft over ten years ago. Even back then it was hip to hate Microsoft. But my views changed abruptly when I gave MS technology a serious chance. It might have been Internet Explorer surpassing Netscape that got me hooked, I'm not sure. I haven't let go for two main reasons: Microsoft technology, as a whole, is actually quite nice; and, I can get my job done fast. I've had to wrestle far less with Microsoft technology than other technology. I keep up with other technology for the times when MS doesn't have what I need, or when I have technology constraints I can do nothing about. I mention this background because the "let's hate Microsoft" and "Silverlight will fail" discussions are nothing new to me. People think Windows is dying or new technology from Microsoft is a failure and these people totally miss the point. Wishing Microsoft would go away didn't work 10+ years ago and it's not going to work now.

At the end of the day, software engineers are problem solvers. We implement solutions in whatever domain we live and work in. Can .NET help us do this? Yes. Can the Java platform? Yes. Can I roll out professional websites using IIS, ASP.NET, Windows Server 2003, etc.? Yes. Can I do the same with LAMP? Yes. This is why the religious discussions should stop in our industry. The people that hate Microsoft will continue to hate them, the people that don't like Java or open source will continue their avoidance, but the funny thing is, these factions will continue to solve problems and continue being productive (hopefully!)

I'm semi-ranting and meandering a bit, but what I'm getting at is Silverlight/WPF aren't going anywhere and we need a more open perspective as software engineers. There's much about Silverlight that should get recognized as "cool" and important:

  • Cross-platform CLR that does not require the .NET framework
  • DLR, the dynamic language runtime, extending the language support of .NET even further
  • Cross-platform support of a subset of XAML/WPF (which I imagine will grow closer to the full implementation over time)
  • XAML XAML XAML. I'm incredibly excited about XAML because I see it as "the new HTML." Once Silverlight has strong penetration, website designers can choose to develop sites in XAML and be confident people can view them. No more dealing with messy HTML/CSS and testing on every browser to make sure the site looks/works the same.

I also like that the technologies on the Adobe side (HTML, JS, Flash, Flex) are given a home on the desktop via AIR. This makes it easy for website designers to extend their skill set to the desktop. Adobe brought the design world to desktop applications and Microsoft brought developers to the world of rich application design, far surpassing the stodgy world of the past (MFC, WinForms, etc.). There's plenty of reason to get excited about both technologies. (For the record, yes, I'm aware of Sun's offering, but no comments at the moment.) We must be responsible software engineers moving forward as the RIA world evolves, and this means staying well informed about as much technology as we can.

Which technology will "win?" That's the wrong question to ask because there's no competition. Both will continue to exist, each caters to a different type of developer, and the people that really matter in the end are the users.

The User Perspective

I envision a spectrum of users, from those with the absolute bare minimum of knowledge required to use computers to those that are fairly sophisticated but don't do software development. The one thing that unites users is they want their software to work. This is a simple goal at its most basic for software developers, but also a tough goal because everyone uses software a little differently. Some people will love an application and others will hate it, either because certain features are hard to use or the user's sense of how a feature should work is different from how it actually works. The larger our user base is for a product, the more we have to first focus on the functionality that affects 90% of the users, and then going forward we can refine the product to work well with as many additional users as possible. This is reality again intruding on what we create - limited resources, limited time, etc. We can also never win 100% of the users - I doubt any product can. There are people that don't like iPods due to bad experiences, but it doesn't hinder the success of the iPod.

Let's start at the basic end of the spectrum. Basic users want things to "just work," whether it's their car or their TiVo or their operating system or some other software. They don't know how it works, they don't care how it works. If it breaks, they want it fixed. Take a car to a mechanic, call the Maytag repairman (okay, maybe call him to fix your cable), get the neighbor's kid to remove spyware. These are the users that don't care if software automatically updates itself - as long as the updates don't break anything. We can't disregard these users when we write software, or discount how many of them there are. These users are a significant part of the reason Microsoft stays as committed to backwards compatibility as they do - if users' existing software didn't work on a new platform, they'd refuse to upgrade, or worse, refuse to use Windows going forward. Whether a site is implemented in Adobe technology or Microsoft technology or whatever, the users don't know and don't really care. Why should they? They want sites they visit and links shared by friends to work, they want to read and respond to e-mail without a hassle.

As we move to the other end of the spectrum, we find users that have an increased knowledge of their software and how it works. They'll know where the advanced configuration dialogs are and will pretty much understand all the options. These users might turn off automatic updates in order to have more control, but the only reason they'd do this is if they've been burnt by automatic updates in the past. These users are more informed about technology and might have strong opinions. The more sophisticated users are, the more control they want over their world. It's probably why they are sophisticated to begin with - dissatisfaction with the default configuration, a yearning to understand all they can, or specific needs met only by the nether regions of a program (think about how many features Word offers that most users don't use).

The difficulty in developing software is knowing just how many options to expose and what sort of application design will appeal to the majority of users, and hopefully to all users. Most users won't explore configuration too deeply and will in fact be intimidated by too many options. Most users don't want a deep level of choice - again, they simply want software that does what they expect - though we must balance this with what the more sophisticated users want. When we design applications in the near future, we have to think deeply about users. It is a challenge, but the evolution that is occurring in the software industry can help elevate the nature of applications we design.

Conclusion

It appears I've wandered far away from RIA, but I haven't. I see Rich Interactive Application design (yes, I prefer this term, and no, not simply because I'm toeing the MS line) as a refocusing on the user via rethinking our user interfaces and application designs, whether applications are on the Internet or not. Using new technology, whether it's Flex or Silverlight or something else, opens more possibilities for us as software engineers. I think we're at the beginning of the next major wave of how people interact with computers and it's definitely an exciting time to contribute our vision and our expertise. It is important to raise our consciousness about this shift and move away from arguing about which technology is superior. We're all in this together, now let's get to the work of building awesome, useful technology using the tools given to us.

Sunday, August 05, 2007 2:51:56 AM (Eastern Standard Time, UTC-05:00)
Friday, August 03, 2007

Windows Presentation Foundation introduces a number of new concepts, such as XAML, dependency properties, data binding (in the WPF/XAML world), and type converters. Let's take a brief look at type converters and how they're used in WPF.

Since XAML is an XML dialect, parameters to objects are specified as strings. The parameters aren't usually actually strings, so we need a way for the XAML parser to convert the string to the correct object type. This is accomplished via type converters.

Using a type converter in XAML is easy, for example:

<Button Content="Accept" Background="Blue" />

The color specified as a string is converted to a Brush object (a SolidColorBrush, in this case) by the XAML parser, since the Button class's Background property is of type Brush.

A type converter is obtained by passing a type of object to the GetConverter method, like this:

System.ComponentModel.TypeDescriptor.GetConverter(typeof(type));

The TypeConverter class has a number of useful methods, some of which are:

  • CanConvertFrom: returns whether the converter can convert from a given source type
  • ConvertFrom: converts a value of another type to the converter's type
  • CanConvertTo: returns whether the converter can convert to a given destination type
  • ConvertTo: converts a value of the converter's type to another type
  • IsValid: returns whether a given value is valid for this type
  • GetCreateInstanceSupported: returns whether the converter can create an object from a set of property values
  • CreateInstance: re-creates an object based on an IDictionary of property values
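The idea behind type converters is not tied to any one platform. The following Python sketch is a language-neutral illustration of the concept, not the .NET API: a lookup from target type to a function that turns a markup string into an object of that type. Every name in it (SolidBrush, NAMED_COLORS, CONVERTERS, convert_from_string) is hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolidBrush:
    """Hypothetical stand-in for a brush object; not a WPF type."""
    red: int
    green: int
    blue: int


NAMED_COLORS = {"Blue": (0, 0, 255), "Red": (255, 0, 0)}


def brush_from_string(text):
    """Convert a color name such as "Blue" into a SolidBrush."""
    red, green, blue = NAMED_COLORS[text]
    return SolidBrush(red, green, blue)


# Registry: target type -> function that converts an attribute string to it.
CONVERTERS = {int: int, float: float, SolidBrush: brush_from_string}


def convert_from_string(target_type, text):
    """Look up the converter registered for target_type and apply it to text."""
    if target_type not in CONVERTERS:
        raise TypeError("no converter registered for " + target_type.__name__)
    return CONVERTERS[target_type](text)


# A markup parser knows each property's declared type and converts accordingly.
property_types = {"Background": SolidBrush, "Width": float}
attributes = {"Background": "Blue", "Width": "120"}
values = {name: convert_from_string(property_types[name], raw)
          for name, raw in attributes.items()}
print(values)
```

The parser itself never needs to know how a color name becomes a brush; it only needs the declared property type and a converter registered for it.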

One of the beautiful things about XAML and other bits used heavily by WPF is that they aren't WPF specific. XAML is an application markup language that mirrors .NET classes in markup, so it can be used outside WPF. Type converters are no different - the TypeConverter class lives in System.ComponentModel, part of the core .NET Framework, so you can add knowledge of these bits to your tool kit and use them where appropriate.

If you want to implement your own type converter, reference this link at MSDN: http://msdn2.microsoft.com/en-us/library/ayybcxe5....

Friday, August 03, 2007 9:27:07 PM (Eastern Standard Time, UTC-05:00)
Thursday, August 02, 2007

Scanning is the process of breaking input up into discrete tokens (such as one or more letters forming a word) and parsing is the process of applying structure and meaning to these tokens (such as multiple words strung together to form a sentence, following specific grammatical rules). Software developers come from a variety of backgrounds, and while some may remember using these tools to construct a compiler in their Computer Science undergrad, I have a feeling many are unfamiliar with (or forgot about) them. I'd wager that the most well known scanner and parser tools, in the general pool of software developers, are lex and yacc. The "lex" name is short for lexical analyzer, a program that recognizes lexical units of information, such as the words in an English sentence, and feeds these units to another program one at a time. The "lex" tool is a scanner generator since it creates code that scans input. The "yacc" name is short for "yet another compiler compiler," which is accurate since the parser it generates is commonly the core of a compiler, so you can say yacc compiles a compiler. In general, though, "yacc" is a parser generator.

Remember that a compiler is nothing more than a translator. When you write code in any .NET language, a compiler translates the high level code (such as C#) to a lower level code (Intermediate Language, or IL), which is what the Common Language Runtime (CLR) understands and executes. In a way, a decompiler is actually a compiler, translating the low level code, such as IL, to a higher level language, such as C#. We use the terms "compiler" and "decompiler" because they communicate the direction of the translation. Compilers must scan input and then parse it. The scanner must know that "public" is a token and "class" is a token and what an identifier looks like, but it's the parser that understands what a method is and what a while loop looks like and how they translate to IL.
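To make the scanner's half of that job concrete, here is a small hand-written sketch. It is not how any real C# compiler is built; the TinyScanner class, the Token type, and the three-word keyword list are all hypothetical. It only splits text into tokens and labels each one, leaving questions such as "is this a valid class declaration?" to a parser.

```csharp
using System;
using System.Collections.Generic;

enum TokenKind { Keyword, Identifier, Symbol }

class Token {
    public readonly TokenKind Kind;
    public readonly string Text;
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

static class TinyScanner {
    // Hypothetical, deliberately tiny keyword list.
    static readonly HashSet<string> Keywords =
        new HashSet<string> { "public", "class", "while" };

    public static List<Token> Scan(string input) {
        List<Token> tokens = new List<Token>();
        int i = 0;
        while (i < input.Length) {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }   // whitespace separates tokens

            if (char.IsLetter(c) || c == '_') {
                // Consume a run of letters, digits, and underscores as one word.
                int start = i;
                while (i < input.Length && (char.IsLetterOrDigit(input[i]) || input[i] == '_')) i++;
                string text = input.Substring(start, i - start);
                TokenKind kind = Keywords.Contains(text) ? TokenKind.Keyword : TokenKind.Identifier;
                tokens.Add(new Token(kind, text));
            } else {
                // Anything else becomes a one-character symbol token.
                tokens.Add(new Token(TokenKind.Symbol, c.ToString()));
                i++;
            }
        }
        return tokens;
    }
}

static class Program {
    static void Main() {
        foreach (Token t in TinyScanner.Scan("public class Foo { }"))
            Console.WriteLine(t.Kind + ": " + t.Text);
    }
}
```

Notice that the scanner happily tokenizes nonsense such as "class public { Foo": deciding whether the token sequence means anything is the parser's job.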

A scanner and parser can be used separately. I won't go into details of constructing a parser in this post, but I will discuss developing a simple scanner using a scanner configuration that is translated into C# code. Before delving into an example, why do we care about a tool that can create a scanner for us? What are the benefits? Much processing of input we do in the business world (at least in my experience) is easily constructed by hand. We don't usually need to construct a compiler, so isn't using a scanner overkill? Like any problem we're called to solve, it's important to know about as many tools as possible, so when we encounter a case where a certain tool would save us significant time, we can use it since we know it exists. There are also business problems, such as sophisticated data translation, where a scanner/parser might be the perfect set of tools. Here are a few benefits to using an auto-generated scanner:

  • A scanner generator allows us to focus on the syntactic elements and not worry about any other details
  • If syntactic elements change, it's easy to update the code file used as input to the scanner generator 
  • A scanner can feed anything: a parser can accept the tokens, or your own custom code can, so the syntactic analysis of input is separated from applying meaning to the input
  • Developing a scanner with a generator is much faster than rolling one by hand, unless the input is rather simple, and relying on a scanner generator reduces the chance of introducing errors into the scanner
  • A generated scanner is typically faster than one you roll on your own

A scanner generator for C# that I've used is C#Lex, available at the C#Lex site.

The syntax of the input files to C#Lex follows the syntax detailed on the JLex page, except where called out on the C#Lex site.

Let's look at an example: analyzing simple English. Words in English can take on multiple forms:

  • Words with an initial capital, such as at the beginning of a sentence or proper names
  • Contractions - words with an apostrophe (e.g., can't, don't, won't)
  • Abbreviations - words that are all capitals or all lowercase and optionally have a period after each letter (e.g., IL, CLR, e.g.)
  • Quoted words - single or double quotes on both sides (e.g., "compiler")

We'll stop there in the interest of keeping this simple.

An input file to the C#Lex program is formatted in three sections, each separated from the next by a double percent (%%) on a line of its own:

  1. User code. This section is copied directly to the output file without modification, so you can place implementation and 'using' statements here.
  2. Directives to control C#Lex. This is where you can control C#Lex, such as specifying scanning states, and also specify regular expressions to match input.
  3. Scanner rules. This is where you specify what to recognize, what to do with it, state transitions, etc. When a rule matches here, data can be returned to the code running the scanning loop via a special class called Yytoken (that you define).

The scanning states allow recognition of different syntactic units at different times, so if certain syntactic units can only appear in certain contexts (think about visibility keywords such as 'public' and 'private', which can't appear inside a method body) you can control this in the scanner generator code.

The program we'll write is dead simple for purposes of illustration: it'll accept tokens from the scanner and output each token, one per line.

I won't go into details of the regular expression language, instead keeping it simple and showing the regular expressions we need without much explanation.

A word is simple: [A-Z]?[a-z']+

This gets us an optional initial capital and a sequence of one or more lower case letters and apostrophes. This doesn't limit the number of apostrophes, so let's revise it.

[A-Z]?[a-z]+'?[a-z]*

We can continue refining to ensure the end of the contraction is one of just a few options (such as "t" or "s") but let's keep it simple.

A word can also have all capitals: [A-Z]+

optionally separated by periods: ([A-Z]\.?)+

but periods can also separate lowercased words: ([a-z]\.?)+

Combining these gives us: [A-Z]?[a-z]+'?[a-z]* | ([A-Z]\.?)+ | ([a-z]\.?)+
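If you'd like to experiment with these patterns before feeding them to a scanner generator, a quick sketch with .NET's own Regex class works. Treat it as an approximation only: the WordPattern class is hypothetical, the pattern uses the optional-period form of the abbreviation rules (as in the final file), the spaces around | are removed because .NET would treat them as literal characters, and ^ and $ are added so the whole string has to match. .NET also tries alternatives left to right rather than preferring the longest match the way a lex-style scanner does, so this checks a single candidate word and is not a substitute for the scanner.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordPattern {
    // The three alternatives from the post, anchored to the whole string.
    static readonly Regex Word = new Regex(
        @"^(?:[A-Z]?[a-z]+'?[a-z]*|([A-Z]\.?)+|([a-z]\.?)+)$");

    public static bool IsWord(string candidate) {
        return Word.IsMatch(candidate);
    }
}

static class Program {
    static void Main() {
        string[] samples = { "Lorem", "can't", "CLR", "e.g.", "can''t", "HeLLo" };
        foreach (string s in samples)
            Console.WriteLine(s + " -> " + WordPattern.IsWord(s));
    }
}
```

The first four samples satisfy one of the three alternatives; "can''t" (two apostrophes) and "HeLLo" (mixed case in the middle) do not.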

We continue like this until we end up with a set of regular expressions that fully describes the input. Since the dot matches any character except a newline, we'll add a final catch-all rule that uses it to pass back any input the other rules didn't match. You probably don't want this in a real application, but it illustrates how to handle unmatched input. Whitespace is skipped, along with punctuation we're not interested in (exclamation points, question marks, commas, periods). These regular expressions are far from perfect or comprehensive, but they illustrate the process of analyzing the nature of the input and constructing the expressions required to scan it.

I'm including the final file at the end of this post.

A Yytoken class must be defined. This is the communication mechanism between the scanner and the code you write and can hold any information you want (such as where in the input the scanner is, state information, etc). An instance of this class is what is returned by yylex() in the main scanning loop located in our code:

      Yytoken t;
      while ((t = yy.yylex()) != null)
      {
         System.Console.WriteLine(t.m_text);
      }

Now that we have an input file to the scanner generator, we first generate the C# code by executing C#Lex.exe on this file, then compile the generated C# file by invoking csc.

C#Lex english.lex

csc english.lex.cs

The input file (test.txt, the name hard-coded in Main) has this line: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut "labore" et dolore magna aliqua.

Running english.lex.exe gives us the following output:

Lorem
ipsum
dolor
sit
amet
consectetur
adipisicing
elit
sed
do
eiusmod
tempor
incididunt
ut
"labore"
et
dolore
magna
aliqua

Our final file looks like this:

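using System;
using System.IO;

class WordExample {
   public static void Main(string[] argv) {
      // Open the input file and hand it to the generated scanner class (Yylex).
      // The using block closes the file once scanning is done.
      using (StreamReader reader = new StreamReader(new FileStream("test.txt", FileMode.Open)))
      {
         Yylex yy = new Yylex(reader);

         // Print one token per line; the loop ends when yylex() returns null.
         Yytoken t;
         while ((t = yy.yylex()) != null)
         {
            Console.WriteLine(t.m_text);
         }
      }
      Console.WriteLine();
   }
}

// Carries the matched text from the scanner rules back to the scanning loop.
class Yytoken {
   public Yytoken(string token)
   {
      m_text = token;
   }
   public string m_text;
}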

%%

ALPHA=[A-Za-z]
WORD=[A-Z]?[a-z]+'?[a-z]* | ([A-Z]\.?)+ | ([a-z]\.?)+
WHITE_SPACE_CHAR=[\n\ \t\b\012\r]

%%

<YYINITIAL> {WORD} { return(new Yytoken(yytext())); }

<YYINITIAL> \"{WORD}\" { return(new Yytoken(yytext())); }

<YYINITIAL> {WHITE_SPACE_CHAR}+ { }

<YYINITIAL> [\.,\?!] { }

<YYINITIAL> . { return(new Yytoken(yytext())); }

Thursday, August 02, 2007 11:05:06 PM (Eastern Standard Time, UTC-05:00)
Wednesday, August 01, 2007

I’ve had my new Vista machine for a while and haven't gotten around to writing down my impressions until now. Initially I really liked Vista. Some of the detail work (which I’ll describe later) was well thought out and did save me time in standard operation. UAC was hardly an issue after I got everything configured, and now it appears only at expected times, such as when installing new programs. But as I continued to use my machine, a frustrating problem started: sometimes my machine would freeze. I repeatedly searched for solutions online, but none of the described scenarios were exactly like mine. The freezes started when I created a new directory through a file browse dialog. It wouldn’t happen every time, but eventually the freezes increased in frequency and had no obvious trigger. At its peak, my computer would barely run for ten minutes before freezing on its own. During each freeze the mouse cursor was fine, which tells me Windows was still there, but the task bar froze and all active windows would slowly change to non-responding windows (I could do some limited alt-tabbing around for a short while). I tried a variety of solutions, from updating drivers to various tweaks, but nothing helped. Eventually the freezes mysteriously disappeared. I’m really not sure why. My best guess is that Windows Update pulled down a fix, since problems don't just disappear magically. I was eagerly hoping this would happen, and if this is indeed the case, I’m pleased. I bought this machine to keep my skills up to date with new Microsoft technology (from Vista to the new Office to all manner of exciting stuff happening in the .NET world) and now that it's purring, I'm quite happy.

Some of the detail work I love:

  • Visual appearance overall is pleasant. I like Aero Glass and the general look/feel of Vista.
  • I love that user directories are now under C:\Users (so mine is C:\Users\Jeff). I always dodged the "My Documents" folder on Windows XP (as a user) but I find myself organizing much more under my user directory on Vista than I ever did on XP.
  • The file extension isn’t highlighted when renaming a file in Explorer. How many times did we have to de-select the extension in previous Windows versions?
  • The central window that appears when you hold down Alt while Alt-Tabbing lets you use the mouse to select a specific window. This is such a time saver when you want to go to one of many windows, and the way I work, I have many, many windows open.
  • Flip 3D is really nice and I love that the windows in the stack continue their updating, as opposed to a stack of screenshots. I don’t know how behavior changes on different systems, but this behavior on my computer is awesome.
  • Files/programs are indexed, so it's easy to use the search box in the Start menu to jump quickly to files/programs. I'm always going here to start Process Explorer.
  • I see potential for the sidebar but I haven't made much use of it. I figure loving this feature is a matter of installing the right gadgets, which I haven't searched for yet (I'm too busy watching videos of software I wish I was a part of developing, like SeaDragon and Photosynth)
  • One of the non-important drivers sometimes fails and Vista stops it without any ill effects on the system. I assume this driver is running in user mode which ensures it doesn't romp around memory and makes it easy to kill. I suppose I should disable the faulty driver.
  • I've been running Vista constantly for about three months and the only system crash I had was World of Warcraft overheating the CPU. Vista booted up fine after the abrupt shut off (and time to cool the computer) and I was back to grinding in the Ghostlands in no time.
  • It hooks up nicely to my Xbox 360, and watching video from my laptop on my television is a snap.

These comments are more from a user perspective than a Vista developer perspective. I haven't dug into using any of the Vista-specific programming bits yet, but I am looking forward to exploring. I'm most interested in experimenting with the security improvements exposed to application developers. I'm spending a lot of my time these days learning WPF/Silverlight and playing around with various other technologies at work and at home, so getting into the guts of Vista hasn't happened yet and probably won't for a while.

As an aside to the above comments, I wanted to briefly post the differences between versions of Vista.

  • Vista Home Basic: Stripped down Vista, no Aero Desktop, no media center, limited backup, no collaboration/media creation tools.
  • Vista Home Premium: Suitable for most regular users. Includes Aero, collaboration, media center, media creation (Windows DVD Maker, Windows Movie Maker)
  • Vista Business: Good version for businesses - doesn't include media center / media creation software, but does include the full PC Backup/Restore tool and Fax/Scan tools.
  • Vista Ultimate: This has everything.

Consult http://www.microsoft.com/windows/products/windowsv... for detailed information.

Wednesday, August 01, 2007 10:08:04 PM (Eastern Standard Time, UTC-05:00)