Schema element "b:unit"

The unit is the base of everything: an object which MAY be worth caching. It is the great unifying data entity, a generalization of many protocols and system features.


<element name="unit" type="b:Unit" />

<complexType name="Unit">
    <element name="name"     type="string"     [0..1] />
    <element name="role"     type="b:Constant" [0..1] />
    <group   ref="b:mediatype"  />
    <element ref="b:definitions"   [0..∞] />
    <element ref="b:life_cycle" />
    <element ref="b:payload"    />
    <element ref="b:documentation" [0..∞] />
  <attribute name="id" type="b:UnitID" use! />


A unit is a logically coherent set of data: bytes with some facts about those bytes. Units have various roles, like configuration, identities, collections of information, or simply a sequence of bytes. Units have many elements and attributes, many of which often go unused: implementations nevertheless benefit greatly from a strong, uniform object base.

Units usually have many relations (b:has) which tell you about the owner, the data used to create the unit, and much more. Which relations are present depends very much on the role you have designed for your unit.

One of the applications of units is the Meshy Space Concept. The concept assigns logical roles to units via the type element (in b:meta-data).

WARNING: unit ids are quite different from xsd:id identifiers: study the differences in b:UnitID!

attribute id

The unique identifier of this Unit is used to refer to it, either from within the same information structure or from other information structures. Applications will decide to cache units based on this b:id for performance.

element definitions

Rules which help to interpret fields with expressions. These can be grouped and reused, hence it is possible to encounter more than one block of definitions.

element documentation

The Unit may have different levels of documentation besides the name, like a short translated name, some tooltip text, extended HTML, and/or references to external web pages. All of this may be provided in different languages.

When the documentation is not included in the Unit, it may be found among the Unit's own b:has relations, or among those of the enclosing Subspace or its parent Namespaces.

element has

The unit has logical relations to other units, which are often packaged in a different data collection, maybe on a different server. See b:has.

element life_cycle

A number of fields are used to control the progress of the unit: when it was created, was deprecated, and is to be removed. The base is the revision of the unit. There is only one revision of a unit in your cache: the newest.

element name

A Unit always has a name which can be presented to users: it defaults to the Unit's ID. A generated Unit ID may not be the nicest way to present the Unit to a person; the name is probably better. Best is to present users with the content of the b:documentation structure, which is translatable.

Be warned that the name is (like everything else) in the UTF-8 charset. See the discussions below about NFKC normalization and the problems with using filenames here.

element payload

The payload is a substitutionGroup, which means that the element can be replaced by any other element in that group. See b:payload for a list of alternatives.

Structural elements may not contain any "direct" payload, only references to other units.

element role

The Unit explains why it is there. Not all Units know.

For instance, a Unit which represents a Constant will indicate to which constant set it belongs. This makes automatic verification possible during Unit creation.

group mediatype

A number of fields describe what the unit represents. If the unit contains a payload, it gives facts about the payload. All these facts are independent from the way the payload is being transported and stored.


demonstrating a simple unit

<b:unit b:id="Rules">
  <b:type="ms:Role/Rules" />
    <b:prefix b:name="cc" b:namespace="https://..." />
</b:unit>
<b:unit b:id="ASDJKDJO12314A">
  <b:has b:is="cc:Type/Warc/Request"  b:ref="ASDJKDJO12314A-req" />
  <b:has b:is="cc:Type/Warc/Response" b:ref="ASDJKDJO12314A-resp" />
  <b:has b:is="cc:Type/Warc/Links"    b:ref="ASDJKDJO12314A-links" />
</b:unit>
<b:unit b:id="ASDJKDJO12314A-req">
    <b:compressed b:by="ms:Compress/Gzip">
       <b:packaged b:by="cc:Pack/Warc/Archive">
         <b:fetch b:href="cc:/Repo/202102warc-00000" />
       </b:packaged>
    </b:compressed>
</b:unit>
<b:unit b:id="ASDJKDJO12314A-resp" />
<b:unit b:id="ASDJKDJO12314A-links" />
<b:unit b:id="Type/Warc/Result" />  <!-- etc -->

The units are separately storable and retrievable. They can have many kinds of relations, each expressed by the b:is attribute. The b:is value does not need to be the type of the unit it refers to, but it often is: the application which follows the relation should understand it.
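How an application might index separately stored units and follow their b:has relations can be sketched in Python. This is only an illustration: the namespace URI and the wrapping "units" element are placeholders invented for the sketch, and the helper names index_units and relations are hypothetical, not part of the schema.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder namespace URI, not the real Meshy Space Base URI.
B = "urn:example:meshy-space-base"

# ASSUMPTION: the wrapping <units> element only exists to make one parsable document.
DOC = f"""<units xmlns:b="{B}">
  <b:unit b:id="ASDJKDJO12314A">
    <b:has b:is="cc:Type/Warc/Request"  b:ref="ASDJKDJO12314A-req" />
    <b:has b:is="cc:Type/Warc/Response" b:ref="ASDJKDJO12314A-resp" />
    <b:has b:is="cc:Type/Warc/Links"    b:ref="ASDJKDJO12314A-links" />
  </b:unit>
  <b:unit b:id="ASDJKDJO12314A-req" />
  <b:unit b:id="ASDJKDJO12314A-resp" />
</units>"""


def index_units(root):
    """Map each unit id to its element, so units can be looked up separately."""
    return {u.get(f"{{{B}}}id"): u for u in root.iter(f"{{{B}}}unit")}


def relations(unit):
    """Return (b:is, b:ref) pairs for the direct b:has children of a unit."""
    return [(h.get(f"{{{B}}}is"), h.get(f"{{{B}}}ref"))
            for h in unit.findall(f"{{{B}}}has")]


units = index_units(ET.fromstring(DOC))
for kind, ref in relations(units["ASDJKDJO12314A"]):
    # A referenced unit may live in another collection, maybe on another server.
    status = "local" if ref in units else "stored elsewhere"
    print(f"{kind} -> {ref} ({status})")
```

The sketch treats b:is as an opaque string: whether the application understands the relation is up to the application.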


Caching units based on b:id

The IDs are unique because they are QName types: namespaced strings. Their uniqueness extends over the full data structure which is being processed. When you ask the "owner" of the id namespace about a unit, you can get one of three answers:

  • the same information;
  • a newer version of the same information (update); or
  • a denial of existence: removed (or never stored in the service)
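The three answers above map naturally onto a small cache revalidation routine. The following Python sketch is an assumption about how an application could handle them; the names Answer, Unit, UnitCache and the ask_owner callback are hypothetical and do not come from the schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Answer(Enum):
    SAME = "same"      # the cached information is still current
    UPDATE = "update"  # the owner has a newer version
    GONE = "gone"      # removed, or never stored in the service


@dataclass
class Unit:
    id: str
    payload: bytes


# Hypothetical owner lookup: takes a unit id, returns (answer, newer unit or None).
Owner = Callable[[str], Tuple[Answer, Optional[Unit]]]


class UnitCache:
    def __init__(self) -> None:
        self._units: Dict[str, Unit] = {}

    def put(self, unit: Unit) -> None:
        self._units[unit.id] = unit

    def revalidate(self, unit_id: str, ask_owner: Owner) -> Optional[Unit]:
        answer, fresh = ask_owner(unit_id)
        if answer is Answer.UPDATE and fresh is not None:
            self._units[unit_id] = fresh    # keep only the newest version
        elif answer is Answer.GONE:
            self._units.pop(unit_id, None)  # drop what no longer exists
        return self._units.get(unit_id)


cache = UnitCache()
cache.put(Unit("ex:A", b"old"))
same = cache.revalidate("ex:A", lambda uid: (Answer.SAME, None))
newer = cache.revalidate("ex:A", lambda uid: (Answer.UPDATE, Unit(uid, b"new")))
gone = cache.revalidate("ex:A", lambda uid: (Answer.GONE, None))
```

Replacing the entry on an update keeps a single version per id in the cache, in line with the rule that only the newest revision is kept.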

Unit names get NFKC normalized

When a Unit is created, the name will get "NFKC normalized" first. When the Unit's ID is automatically derived from the name, it will therefore also be in NFKC normalized form.

Operating systems and editors use different Unicode sequences to express exactly the same text. For instance, ö can be represented as a single code point, but also as two: o followed by a combining diaeresis (U+0308). These (and other kinds of) alternatives make comparison for the human idea of "equality" hard.

See Unicode TR15.
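The effect of the normalization can be seen with Python's standard unicodedata module. This is a minimal sketch: the helper name normalize_name is hypothetical, and the schema does not prescribe any particular library.

```python
import unicodedata

composed = "\u00f6"     # ö as a single code point
decomposed = "o\u0308"  # o followed by COMBINING DIAERESIS


def normalize_name(name: str) -> str:
    """Return the NFKC normalized form of a unit name."""
    return unicodedata.normalize("NFKC", name)


# The raw strings differ, although both are displayed as ö.
print(composed == decomposed)
# After normalization both forms are the same single code point.
print(normalize_name(composed) == normalize_name(decomposed))
# NFKC also folds compatibility characters, like the "fi" ligature (U+FB01).
print(normalize_name("\ufb01le"))
```

Because both spellings normalize to the same string, an ID derived from the normalized name is the same regardless of which editor or operating system produced the name.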

Using filenames as name

When you want to use a filename as name (which is a sensible use case), you will need to include an encoded version of the filename, because filenames are often stored as UTF-16 (NTFS) or as uninterpreted "bytes" (UNIX/Linux).

Also, the use of hidden files (leading dot on UNIX/Linux) and hidden meta-data directories (__MACOSX on Apple) will complicate the set-up. It hurts even more when file-systems are case-insensitive, or when filenames contain control characters or white-space. Some operating systems store ö as a single precomposed character, others as o followed by a combining diaeresis; these are not equivalent under simple comparison.

This Meshy Space Base does not solve any of these problems. Extensions, like the Meshy Space Concept, may offer a better solution. They start by using the filename's precise byte-sequence, percent-encoded, as unit id, and a readable UTF-8 version of it as name. Search on name can be made case-insensitive.
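The approach described above (byte-exact percent-encoded id, readable name, case-insensitive search) might look as follows in Python. This is a sketch under assumptions: the function names are hypothetical, and the exact encoding rules of the Meshy Space Concept are not given here, so every byte outside the unreserved set is encoded.

```python
import unicodedata
from urllib.parse import quote, unquote_to_bytes


def unit_id_from_filename(raw: bytes) -> str:
    """Percent-encode the exact byte sequence; safe="" also encodes "/"."""
    return quote(raw, safe="")


def name_from_filename(raw: bytes) -> str:
    """Readable name: decode as UTF-8 (bad bytes become U+FFFD), then NFKC."""
    return unicodedata.normalize("NFKC", raw.decode("utf-8", errors="replace"))


def name_matches(name: str, query: str) -> bool:
    """Case-insensitive comparison on the readable name."""
    return name.casefold() == query.casefold()


# The same visible filename, stored with precomposed and decomposed ö.
precomposed = "Bj\u00f6rn.TXT".encode("utf-8")
decomposed = "Bjo\u0308rn.TXT".encode("utf-8")

print(unit_id_from_filename(precomposed))
print(unit_id_from_filename(decomposed))
print(name_from_filename(precomposed) == name_from_filename(decomposed))
```

The two ids differ because they preserve the stored bytes, so the original filename can always be recovered with unquote_to_bytes, while the readable names are equal after normalization.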

mark@overmeer.net      Web-pages generated on 2023-04-13