Many within the company I work for are happy with the deployment of Team Foundation Server (TFS), including people who tend to be skeptical of anything Microsoft offers. It's great to see such a powerful, integrated tool actually helping development and project management as promised. I am still the main developer supporting our deployment of TFS, so when our configuration management (CM) group needs a tool or an extension to TFS (such as custom controls in our work item types), I am the one who implements it.
At the moment, CM is looking for some specialized handling of work item type files, details of which I'll possibly outline in a future post. Since this requires processing the work item type XML files, I went to retrieve the Visual Studio SDK which includes the schema files for the work item types.
The xsd program generates code based on the original schemas. However, passing these files through a validator, such as XMLSpy's, reveals problems with the XML. Since xsd generates code I can use, I could stop here, but I wanted to explore the XML and see if there's a way to fix it. I also doubt this is a blocking issue for many people, which would explain why I haven't seen much about it online.
After opening WorkItemTypesSchema.xsd in XMLSpy we get this warning message:
Some of "include" and/or "import" and/or "redefine" statements in the following files have no schemaLocation attribute and will be ignored!
Step 1 was to address this issue, so I added schemaLocation="typelib.xsd" to the import of typelib.xsd near the top of the file.
Then I saved and got the following error message:
This file is not valid! If you save the file in its current state, other XML processors may have a problem opening the file.
When something like this happens, I struggle to find out "why?" If you read Raymond Chen's blog regularly, there are many instances where software such as Windows seems to do something boneheaded but upon thinking about it for a few moments (or viewing it in light of backward compatibility) it makes sense. I struggled for a few days to come up with the best answer I could to explain this failure to validate and I only see one likely possibility, but we'll get to that shortly.
So, why is the schema file invalid? It's because of non-determinism. Searching for "XML" and "non-determinism" on Google leads to this page on MSDN, http://msdn2.microsoft.com/en-us/library/9bf3997x(... which states:
A deterministic schema is a schema that is not ambiguous, allowing the parser used by the Schema Object Model (SOM) to determine the sequence in which elements should occur in order for an XML document to be valid. It is possible for an XML Schema to be ambiguous, or non-deterministic. A schema is considered to be non-deterministic if the parser is unable to clearly determine the structure to validate with the schema. When validation is attempted on a non-deterministic schema, the parser used by the SOM generates an error.
I now have two conflicting pieces of information. XMLSpy tells me the schema isn't valid. MSDN tells me schemas can be ambiguous, and xsd successfully generates code from this one. Following the cos-nonambig link in the error window of XMLSpy leads to this page, http://www.w3.org/TR/2004/REC-xmlschema-1-20041028... which states:
A content model must be formed such that during ·validation· of an element information item sequence, the particle component contained directly, indirectly or ·implicitly· therein with which to attempt to ·validate· each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
The ultimate authority here should be w3.org. After more Google searches, it appears the existence of ambiguous schemas is (or was, hopefully) expected, even if these schemas don't validate. Rick Jelliffe, a former member of the XML Schema Working Group, says:
I have received several very negative reports on the state of interoperability of tools using XML Schema ... The most common complaint is tools that generate ambiguous XML Schemas ... Ambiguous schemas effectively break everything downstream (http://www.w3.org/2005/05/25-schema/rick.html)
Okay, so the schema isn't valid, it apparently violates the specification at w3.org, and we have someone who should have some authority on this matter saying ambiguous schemas are bad. In spite of MSDN acknowledging that ambiguous schemas can exist (corroborated by other sites I browsed), I think Microsoft should have made this schema validate.
XMLSpy indicates the <xs:complexType name="FieldDefinition"> tag is where the non-determinism exists. We see the non-determinism almost immediately (if we know how to identify it) in the following lines:
<xs:complexType name="FieldDefinition">
  <xs:sequence>
    <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="HELPTEXT" type="HelpTextRule" minOccurs="0"/>
    <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
This specifies a sequence of PlainRules (0 or more), followed by zero or one occurrence of a HELPTEXT element, followed by 0 or more PlainRules again. (maxOccurs defaults to 1 when it is omitted, so HELPTEXT can appear at most once, although the following discussion would hold just as well for "0 or more" HELPTEXTs.) Let's simplify this for purposes of explanation and call PlainRules "A" and HELPTEXT "B". Using symbols from regular expressions (and formal languages and such, stretching back to university), we'll use * as "0 or more occurrences" and ? as "0 or 1 occurrence." This allows us to look at the XML fragment containing the non-determinism as:
A* B? A*
I went to the trouble of converting to these symbols in order to illustrate the non-determinism easily. By the XML schema specification, each element must unambiguously belong to one predictable part of the schema (the particle component). The following sequences belong to the language A* B? A*:
- AAAAABAAAAA
- ABA
- AABAA
- AB
- BA
All the preceding sequences avoid the ambiguity issue. The A's that come BEFORE the B belong to the first A group, and all the A's that come AFTER the B belong to the second A group. The ambiguity is resolved here by the presence of B, but the language states B is optional, so the following are also valid members of the language A* B? A*:
- A
- AAA
This is where we encounter the ambiguity. When there is a single A, does that A belong to the first group of A's or the second? When there is a sequence of A's, such as "AAA", any of the A's is just as likely to belong to the A* before the B? as to the A* after it. Without mandating a B in the middle, there is no way to know which A* an A belongs to.
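To make the ambiguity concrete, here is a small Python sketch (not part of any validator) that counts how many ways a sequence can be split across the two A* groups. The function name count_parses is made up for this illustration. Counting parses is only a rough stand-in for the rule in the specification, which is stricter because it also forbids looking ahead in the sequence.

```python
def count_parses(seq: str) -> int:
    """Count the distinct ways seq can be matched against A* B? A*."""
    parses = 0
    # Try every possible boundary between the first A* group and the rest.
    for split in range(len(seq) + 1):
        first, rest = seq[:split], seq[split:]
        if any(ch != "A" for ch in first):
            continue  # the first group may contain only A's
        # B? either matches nothing or consumes one leading B.
        tails = [rest, rest[1:]] if rest.startswith("B") else [rest]
        for tail in tails:
            if all(ch == "A" for ch in tail):
                parses += 1  # what is left fits the second A* group
    return parses


for sample in ["ABA", "AB", "BA", "A", "AAA"]:
    print(sample, count_parses(sample))
```

Sequences containing a B come back with a count of 1, while "A" and "AAA" come back with 2 and 4. Any count above 1 means at least one A could be assigned to either group.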
The only reason I can see for Microsoft to introduce this ambiguity is to make life easier for people modifying the work item type definitions. That way they can place HELPTEXT anywhere among the children of the FIELD element, rather than being required to put it first. Still, it seems a straightforward requirement to say HELPTEXT must appear in a specific position rather than anywhere in a big mess of PlainRules, since all other elements must be placed precisely.
In order to fix this issue, any tweaks to the schema must stay compatible with the existing schema (such that a work item type can validate against both the original and the fixed schemas). The most straightforward way I came up with is to mandate that HELPTEXT, if present, is the first child of the FIELD element. This means the language A* B? A* is rewritten as B? A*. Now the parser knows that if there are any A's (PlainRules), they match the one and only A* specification. I changed the FieldDefinition type in the schema to:
<xs:complexType name="FieldDefinition">
  <xs:sequence>
    <xs:element name="HELPTEXT" type="HelpTextRule" minOccurs="0"/>
    <xs:group ref="PlainRules" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
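One quick way to sanity-check the rewrite is to treat both content models as ordinary regular expressions and enumerate short sequences. The sketch below uses Python's re module. The names ORIGINAL, REVISED and accepted exist only for this illustration, and a regex engine is standing in for a real schema validator here.

```python
import re
from itertools import product

ORIGINAL = re.compile(r"A*B?A*")  # PlainRules*, HELPTEXT?, PlainRules*
REVISED = re.compile(r"B?A*")     # HELPTEXT?, PlainRules*


def accepted(pattern: re.Pattern, max_len: int) -> set:
    """Return every A/B sequence up to max_len that fully matches pattern."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product("AB", repeat=n)
        if pattern.fullmatch("".join(chars))
    }


original = accepted(ORIGINAL, 3)
revised = accepted(REVISED, 3)

# Everything the revised model accepts is also accepted by the original.
print(revised <= original)  # True
# Sequences only the original accepts: a B that is not in first position.
print(sorted(original - revised, key=lambda s: (len(s), s)))  # ['AB', 'AAB', 'ABA']
```

The revised model is strictly tighter: it never accepts something the original rejects, but it does reject sequences where B follows an A.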
For those a step ahead of me, you'll realize that if HELPTEXT appears after any PlainRules in a work item type, it no longer validates against this revised schema (while still validating against the original). Since I'm writing a tool to manipulate work item types, my fix is to execute code to rewrite the work item type, ensuring any occurrence of HELPTEXT appears as the first child of its FIELD element. I'll be placing this code on this site shortly in case anyone wants to use it.
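As a rough idea of what such a rewrite could look like (an illustrative Python sketch, not the tool mentioned above), the following uses xml.etree.ElementTree from the standard library. It assumes FIELD and HELPTEXT are un-namespaced element names, and the sample XML is a made-up fragment rather than a real work item type.

```python
import xml.etree.ElementTree as ET


def move_helptext_first(root: ET.Element) -> int:
    """Move each HELPTEXT child to the front of its FIELD parent.

    Returns the number of FIELD elements that were changed.
    """
    moved = 0
    # Materialize the list first so the tree is not mutated mid-iteration.
    for field in list(root.iter("FIELD")):
        helptext = field.find("HELPTEXT")
        if helptext is not None and field[0] is not helptext:
            field.remove(helptext)
            field.insert(0, helptext)
            moved += 1
    return moved


# Hypothetical fragment: HELPTEXT appears after a rule element.
SAMPLE = """<FIELDS>
  <FIELD name="Title" refname="System.Title" type="String">
    <REQUIRED />
    <HELPTEXT>Short summary of the work item</HELPTEXT>
  </FIELD>
</FIELDS>"""

root = ET.fromstring(SAMPLE)
changed = move_helptext_first(root)
print(changed, [child.tag for child in root.find("FIELD")])
```

Running it a second time changes nothing, which is the property a clean-up pass like this needs. Whitespace between elements travels with the moved element, so the indentation of the output may differ slightly from the input.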
There's one more modification needed to the XML schema so it validates: the two regular expressions used at the bottom (in SizeType and PaddingType) have their commas escaped. Once the backslashes before the commas are removed, and the other changes are made, the XML file validates fine.
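The comma fix is mechanical enough to script. The sketch below strips a backslash that directly precedes a comma, while leaving an escaped backslash followed by a comma alone. The function name and the sample pattern are hypothetical; the real SizeType and PaddingType patterns are not reproduced here.

```python
import re

# Matches backslash-comma preceded by an even number of backslashes,
# so the backslash in front of the comma is not itself escaped.
_ESCAPED_COMMA = re.compile(r"(?<!\\)((?:\\\\)*)\\,")


def unescape_commas(pattern: str) -> str:
    """Replace each escaped comma in a pattern string with a plain comma."""
    return _ESCAPED_COMMA.sub(r"\1,", pattern)


# Hypothetical pattern in the style of a "width,height" value.
print(unescape_commas(r"\(\d+\,\d+\)"))
```

For the hypothetical input above, the only change is that the backslash in front of the comma disappears; the other escapes are left as they were.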
Please note that I don't know every corner of the XML schema specification, nor have I explored the Orcas or Rosario TFS versions, so information here might be incorrect or out of date shortly. I imagine I'll update this topic after seeing how things change in Orcas/Rosario.