Rejecting XML/Schema Mistakes

XML, The Good

XML has brought a lot of good to inter-system communication. Before XML, every developer made their own serialization format, which was nearly always poorly designed, poorly implemented, and only half finished. Thousands of troublesome interfaces were the result. Although verbose, XML provides a strict, standard serialization with standardized formatters and parsers.

XML Character-sets

XML can be created in any character-set: the XML declaration on the first line specifies which set is used. When applications must pre-load the conversion tables for all existing character-sets (over 750), they get really large. When the tables are loaded dynamically per XML message, processing gets much slower.

Do not produce or accept any character-set other than UTF-8.
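A receiver can enforce this rule before the message ever reaches the parser. The following is only a minimal sketch: the function name is_utf8_xml is made up for this illustration, and it assumes the whole message is available as bytes.

```python
import re

# Matches an XML declaration and captures its encoding pseudo-attribute, if any.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def is_utf8_xml(raw: bytes) -> bool:
    """True only when the message declares (or defaults to) UTF-8 and decodes as UTF-8."""
    if raw.startswith(b'\xef\xbb\xbf'):       # tolerate a UTF-8 byte order mark
        raw = raw[3:]
    m = _DECL.match(raw)
    if m and m.group(1).lower() != b'utf-8':
        return False                          # any other declared character-set is refused
    try:
        raw.decode('utf-8')                   # the bytes must really be UTF-8
    except UnicodeDecodeError:
        return False
    return True
```

Messages without an encoding declaration are accepted here, because UTF-8 is the default when there is no declaration and no UTF-16 byte order mark.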

XML-Schemas

XML-Schemas formally specify how messages are structured. Since the first version in 1998, they have gradually been enhanced with new features. Although the namespace did not change after 2002, there have been extensions. Even the Schema for Schemas itself proves that namespaces as a concept for versioning are broken!

Do not upgrade namespace versions

The schema designers' idea was to put a version in the namespace to indicate the revision of elements, like this:

   <schema ... xmlns:mine="https://example.org/mine-v1.xsd">
     <element name="bike" type="mine:bikeType" />

When you redesign the bikeType, you can either

not change your namespace, and only extend your type with optional fields. Problems emerge only when you use the new fields while your receiver has not upgraded its schema; or
change your namespace. However, your clients must then upgrade their applications at the same time: awkward synchronization.

It becomes much worse when you have schemas (sometimes even maintained by different entities) which are used within one super-set schema cluster, and which share other, overlapping schemas. For instance, the huge GeoGML suite and the applications built on top of it. It suffers from mutual "version lock-in". This is seriously hindering progress.

Solution: the client tells the server which version of the schema it has loaded. The server then produces messages which match that version, leaving out newer (optional) elements. That's often quite easy to do. (The designed interface contains a deprecation mechanism for versions.)
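How a server might leave out newer elements can be sketched as follows. Everything here is hypothetical: the table INTRODUCED_IN, the function render_bike, and the integer version numbers only illustrate the idea and are not part of any designed interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: the schema version in which each element was introduced.
INTRODUCED_IN = {"frame": 1, "wheels": 1, "gears": 2, "e-motor": 3}

def render_bike(fields: dict, client_version: int) -> str:
    """Serialize a bike, leaving out elements newer than the client's schema version."""
    bike = ET.Element("bike")
    for name, value in fields.items():
        if INTRODUCED_IN[name] <= client_version:
            ET.SubElement(bike, name).text = str(value)
    return ET.tostring(bike, encoding="unicode")

# A client which loaded version 2 of the schema never sees the version 3 element.
message = render_bike({"frame": "steel", "gears": 7, "e-motor": "250W"}, 2)
```

This works only because the newer elements are optional: dropping them still leaves a message which is valid against the older schema.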

In case you have a major redesign for a type, give it a new name. It helps when those elements are in a substitutionGroup.

Namespaces as schema location

Most namespaces look like "http://mydomain/*/myschema-v1.xsd". The idea behind this format is not only to make the namespace unique and under your own control (by including your own domain name, which has some legal protection), but also to point to the location where the schema can be retrieved.

However: the versioning of schemas via namespaces is totally broken, as discussed above. You will end up with new versions of the schema written over the previous versions. Now, how does the client know there is a new version, what changed between versions, and when it changed?

It's better to make abstract namespaces, like "urn:example:mydomain:myschema". Update "schemaLocation" each time you move to a new minor version.
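In a message, this separation might look as sketched below: the namespace stays fixed, and only the location hint changes per release. The function make_root and the schema URL in the usage line are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:mydomain:myschema"               # abstract, version-free namespace
XSI = "http://www.w3.org/2001/XMLSchema-instance"  # standard schema-instance namespace

def make_root(schema_url: str) -> ET.Element:
    """Create a root element whose schemaLocation pairs the namespace with a URL."""
    root = ET.Element(f"{{{NS}}}bike")
    # schemaLocation is a list of "namespace location" pairs, separated by blanks.
    root.set(f"{{{XSI}}}schemaLocation", f"{NS} {schema_url}")
    return root

# Hypothetical location of minor version 1.3 of the schema.
root = make_root("https://example.org/schemas/myschema-1.3.xsd")
```

A later minor version changes only the URL passed in; every name in the message keeps the same namespace.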

See the discussion on the use of URNs and OASIS XRI at W3C. Some of the claimed disadvantages of URNs have proven not to be advantages of HTTP URIs either. We are not going to do automatic service discovery either!

"any" elements

One of the attempts by schema designers to prepare for future extension is to include an "any" element in places where you may expect additional elements in the future.

Clients are unable to handle those elements, but will usually try to process them (unless processContents is "lax"), which is impossible when those elements require a newer or external schema.

"anyType" types

XML is designed by librarians; they start with information without markup, and then grow to be more specific with their types. Start with an "anyType" which is restricted into a "decimal" (a number with as many digits as you like), restricted into a "long" (eight bytes), etc.

Programmers start the other way around: start with a "Bit", extended into eight bits in a "Byte", four bytes in a "Long", etc. Programmers cannot handle "anyType" things correctly.

The "xsi:type" attribute

Be frightened: in your message, you can override the type of any element with a restriction or extension type. This can be totally unexpected by the other party: you should never use this.

    <message>
      <bike xsi:type="mine:tandem">
        <seats>2</seats>  <!-- processor did not expect that element! -->
      </bike>
    </message>

The civilized way around this is to use substitutionGroups.
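A receiver which wants to refuse such overrides can look for the attribute before processing the message. This is a minimal sketch; the function find_xsi_type_overrides and the "urn:example:mine" namespace are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands prefixed attribute names into the "{namespace}local" form.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

def find_xsi_type_overrides(xml_text: str) -> list:
    """Return the tags of all elements which carry an xsi:type attribute."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if XSI_TYPE in el.attrib]

msg = ('<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
       'xmlns:mine="urn:example:mine">'
       '<bike xsi:type="mine:tandem"><seats>2</seats></bike>'
       '</message>')
overrides = find_xsi_type_overrides(msg)
```

An empty list means the message does not use xsi:type anywhere; anything else can be rejected with a clear error instead of failing somewhere deep inside the processing.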

unnamed choice, sequence, all

It is easy to construct types which are incredibly hard to parse, but are allowed (and seen in the wild). Especially in the case of substitutionGroups and xsi:type thingies. But also when they contain optional alternatives...

   <element name="bike">
     <complexType>
       <sequence>
         <element name="steer" />
         <choice minOccurs="0" maxOccurs="5">
           <element name="head-light" minOccurs="0" />
           <element name="tail-light" />
         </choice>
         <element name="seat" />
       </sequence>
     </complexType>
   </element>

But the real problem is that it does not map to standard data structures like arrays and associative arrays. It does not map to YAML or JSON either without destroying the structure.
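The loss of structure is easy to demonstrate. The sketch below uses an invented bike message in which the repeated choice produced three light elements, and compares two obvious data structures.

```python
import xml.etree.ElementTree as ET

# Invented message: the repeated choice produced three light elements.
bike = ET.fromstring(
    "<bike><steer/><tail-light/><head-light/><tail-light/><seat/></bike>")

# An ordered list of (name, value) pairs keeps both the order and the repetition...
as_pairs = [(child.tag, child.text) for child in bike]

# ...but an associative array keeps only one entry per element name.
as_dict = dict(as_pairs)
```

The list of pairs is faithful to the message but clumsy to work with; the associative array is convenient but silently drops the second "tail-light". Neither is what you would write by hand in YAML or JSON.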

No optional elements in choices. No unnamed choices unless producing a single element. Same for sequences. The all's are even more horrifying.

Namespace-less elements

You can produce schemas which put elements and types in the "no namespace" collection. Namespaces are a good thing: they tell you who is responsible for the definitions. So, when more than one party decides to neglect namespaces, you are in for name collisions and confusion.

A far more serious problem appears when you have a default namespace in your schema or message. In that case, you see types and references used without a prefix, which are indistinguishable from the no-namespace types and references. This will cause even more confusion.

Actually, it is a pity that schemas use a targetNamespace. More people would understand the use of prefixes better if top-level elements also had a prefix in their name.
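The effect is visible in any namespace-aware parser. In this sketch (the "urn:example:mine" namespace and the helper qualified_names are made up), the element names are written identically in both messages, yet they resolve to different names.

```python
import xml.etree.ElementTree as ET

# Same unprefixed element names, once under a default namespace and once without.
with_default = ET.fromstring('<bike xmlns="urn:example:mine"><seat/></bike>')
without_ns = ET.fromstring('<bike><seat/></bike>')

def qualified_names(root):
    """List the expanded "{namespace}local" name of every element in the tree."""
    return [el.tag for el in root.iter()]
```

qualified_names(with_default) reports names inside "urn:example:mine", while qualified_names(without_ns) reports bare names, although a reader who only looks at the element lines cannot tell the two apart.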

Namespace-less attributes

Where the default for elements is to include a namespace, the default for attributes is not to include one. This results in possible name collisions and confusion, especially when you use the extension mechanism on types and substitutionGroups.

For instance, the "id" attribute is used everywhere. Which solution would you prefer to see in your message:

   <product type="bike" id="42" invoice-id="2021007">

   <product product:type="bike" db:id="42" invoice:id="2021007">
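That unprefixed attributes do not even inherit a default namespace can be checked with a small sketch. The two "urn:example:..." namespaces are assumptions for the example.

```python
import xml.etree.ElementTree as ET

doc = ('<product xmlns="urn:example:product" xmlns:db="urn:example:db" '
       'type="bike" db:id="42"/>')
product = ET.fromstring(doc)

# The element picks up the default namespace, the unprefixed attribute does not.
element_name = product.tag
attribute_names = sorted(product.attrib)
```

Here element_name is "{urn:example:product}product", while the "type" attribute stays without any namespace; only the prefixed "db:id" is qualified.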

... more to come

mark@overmeer.net Web-pages generated on 2023-12-19