Schema element "b:unit"
The unit is the base of everything: any object which MAY be
useful to cache. It is the great unifying data quantity, a
generalization of many protocols and system features.
Definitions
<element name="unit" type="b:Unit" />
<complexType name="Unit">
<sequence>
<element name="name" type="string" [0..1] />
<element name="type" type="b:UnitRef" default="ms:Type/Content" />
<element name="role" type="b:UnitRef" [0..1] />
<element ref="b:life_cycle" />
<element ref="b:payload" />
<element ref="b:definitions" [0..∞] />
<element ref="b:documentation" [0..∞] />
</sequence>
<attribute name="uid" type="b:UnitID" use! />
</complexType>
Description
A unit is a logically coherent set of data: bytes with some
facts about those bytes. Units have various roles, like configuration,
identities, collections of information, or simply a sequence of bytes.
Units have many elements and attributes, many of which often go unused.
Implementations nevertheless benefit greatly from a strong, uniform
object base.
Units usually have many relations (b:definitions
element b:has) which tell you about the owner, the data which
was used to create the unit, and much more. This depends very much on the
role you have designed for your unit.
One of the applications of units is the Meshy Space Concept.
The concept assigns logical roles to units via the type
element (in b:meta-data).
WARNING: unit ids are quite different
from xsd:id identifiers: study the differences in
b:UnitID!
attribute uid
The unique identifier of this information Unit is used to refer to
it, either from within the same information structure or from
other information structures. Applications will decide to cache
units based on this b:uid for performance.
element definitions
Rules which help to interpret fields with expressions. These
can be grouped and reused, hence it is possible to encounter
more than one block of definitions.
element documentation
The Unit may have different levels of documentation besides the
name, like a short translated name, some tooltip text,
extended HTML, and/or references to external web-pages. All of this may
be provided in different languages.
When the documentation is not included in the Unit, it may be found
among the Unit's own b:has relations, those of
the enclosing Collection, or those of its parent Namespaces.
element life_cycle
A number of fields are used to control the progress of the unit:
when it was created, was deprecated, and is to be removed. The base
is the revision of the unit. There is only one revision of a unit in
your cache: the newest.
element name
A Unit always has a name which can be presented to users: it defaults
to the Unit's ID. A generated Unit-id may not be the nicest way to
present the Unit to a person; an explicit name is probably better. Best is
to present users with the content of the b:documentation
structure, which is translatable.
Be warned that the name is (like everything else)
in the UTF-8 charset. See the discussions below about NFKC normalization
and the problems with using filenames here.
element payload
The payload is a substitutionGroup, which
means that the element can be filled in by any other element in that
group. See b:payload for a list of alternatives.
Structural elements may not contain any "direct" payload, but only
references to other units.
element role
The Unit explains why it is there. Not all Units know.
For instance, a Unit which represents a Constant will indicate
to which constant set it belongs. This makes automatic verification
possible during Unit creation.
element type
The Unit type: what can we expect in the payload from the point of
view of the Meshy Space framework. Everything not native to Meshy Space
will have type ms:Type/Content .
Examples
demonstrating a simple unit
<unit uid="MyRules">
<type>ms:Type/Rules</type>
<definitions>
<prefix name="cc" namespace="https://..." />
</definitions>
</unit>
<unit uid="ASDJKDJO12314A">
<type>cc:Type/Warc/Result</type>
<name>https://markov.solutions/index.html</name>
<definitions>
<has type="cc:Type/Warc/Request" unitref="ASDJKDJO12314A-req" />
<has type="cc:Type/Warc/Response" unitref="ASDJKDJO12314A-resp" />
<has type="cc:Type/Warc/Links" unitref="ASDJKDJO12314A-links" />
</definitions>
</unit>
<unit uid="ASDJKDJO12314A-req">
<type>cc:Type/Warc/Request</type>
<language>iets:Language/nl-NL</language>
<content>
<meditype>cc:Type/Warc/record</meditype>
<charset>iso-8859-1</charset>
<size>34516</size>
<compressed by="wp:Compress/Gzip">
<packaged by="cc:Pack/Warc/Archive">
<fetch href="cc:/Repo/202102warc-00000" />
<entry>#12314151+11515</entry>
</packaged>
</compressed>
</content>
</unit>
<unit uid="ASDJKDJO12314A-resp" />
<unit uid="ASDJKDJO12314A-links" />
<unit uid="Type/Warc/Result" />
The units can be stored and retrieved separately. They can have
many kinds of relations, which are expressed by the b:is
attribute. The b:is does not need to be the type of
the unit it refers to, but it often is: the application which
follows the relation should understand it.
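As an illustration of following such relations, here is a minimal Python sketch. It assumes a namespace-free rendering of the example unit above and reads the "type" and "unitref" attributes as they appear in that example; the function name "relations" is made up for this sketch and is not part of the schema.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, namespace-free copy of the example unit.
UNIT_XML = """
<unit uid="ASDJKDJO12314A">
  <type>cc:Type/Warc/Result</type>
  <name>https://markov.solutions/index.html</name>
  <definitions>
    <has type="cc:Type/Warc/Request" unitref="ASDJKDJO12314A-req" />
    <has type="cc:Type/Warc/Response" unitref="ASDJKDJO12314A-resp" />
    <has type="cc:Type/Warc/Links" unitref="ASDJKDJO12314A-links" />
  </definitions>
</unit>
"""

def relations(unit_xml):
    """Map each relation type to the uid of the unit it refers to."""
    unit = ET.fromstring(unit_xml)
    return {
        has.get("type"): has.get("unitref")
        for has in unit.findall("./definitions/has")
    }

rels = relations(UNIT_XML)
for rel_type, target_uid in rels.items():
    # Each target is a separate unit, to be fetched by its uid.
    print(rel_type, "->", target_uid)
```

An application would only follow the relation types it understands and ignore the others.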
Discussion
Caching units based on b:uid
The IDs are unique, because they are QName types:
namespaced strings. Their uniqueness extends across the full data structure
which is being processed. When you ask the "owner" of the id namespace
about a unit, you can get three answers:
- the same information;
- a newer version of the same information (update); or
- a denial of existence: removed (or never stored in the service).
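The three answers map naturally onto three cache actions. The following Python sketch shows one possible way to handle them; the "Unit" class, the answer constants and the "refresh" function are all hypothetical names for this illustration, not part of the Meshy Space Base.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Unit:
    uid: str
    revision: int
    payload: bytes

# Hypothetical labels for the three possible answers of the owner.
SAME, UPDATED, GONE = "same", "updated", "gone"

def refresh(cache, uid, answer, newer=None) -> Optional[Unit]:
    """Apply the owner's answer about `uid` to a dict-based cache."""
    if answer == SAME:
        # Cached copy is still current: keep it.
        return cache.get(uid)
    if answer == UPDATED:
        if newer is None or newer.uid != uid:
            raise ValueError("an update must carry the unit with this uid")
        # Only one revision lives in the cache: the newest.
        cache[uid] = newer
        return newer
    if answer == GONE:
        # Removed, or never stored: drop any cached copy.
        cache.pop(uid, None)
        return None
    raise ValueError(f"unknown answer: {answer!r}")

cache = {"MyRules": Unit("MyRules", 1, b"old")}
refresh(cache, "MyRules", UPDATED, Unit("MyRules", 2, b"new"))
```

Keying the cache on the uid alone matches the rule that only the newest revision of a unit is kept.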
Unit names get NFKC normalized
When a Unit is created, the name will get "NFKC normalized" first.
When the Unit's ID is automatically derived from the name, it will
therefore also be in NFKC normalized form.
Operating systems and editors use different Unicode sequences to
express exactly the same text. For instance, ö can
be represented as a single symbol, but also as two: o followed by a combining ¨.
These (and other kinds of) alternatives make comparison for the human
idea of "equality" hard.
See Unicode TR15.
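The effect of NFKC normalization can be shown with Python's standard unicodedata module. This is only a sketch of the normalization step itself; the helper name "normalize_name" is an assumption, and how an implementation wires this into Unit creation is not specified here.

```python
import unicodedata

def normalize_name(name):
    """Return the NFKC normalized form of a unit name."""
    return unicodedata.normalize("NFKC", name)

precomposed = "\u00f6"   # ö as one code point
decomposed = "o\u0308"   # o followed by a combining diaeresis

# The raw strings differ, the normalized strings do not.
print(precomposed == decomposed)
print(normalize_name(precomposed) == normalize_name(decomposed))
```

NFKC also folds compatibility characters, so a ligature like "ﬁ" (U+FB01) becomes the two letters "fi".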
Using filenames as name
When you want to use a filename as name (which makes a
good case), you will need to include an encoded version of the filename,
because filenames are often UTF-16 (NTFS) or plain "bytes" (UNIX/Linux).
Also, the use of hidden files (leading dot on UNIX/Linux) and hidden
meta-data directories (__MACOSX on Apple) will complicate
the set-up. It hurts even more when file-systems are
case-insensitive, or when filenames contain control characters or white-space.
Some operating systems store ö as a single character, others as o
followed by a combining ¨, and these are not equal under simple comparison.
This Meshy Space Base does not solve any of these problems. Extensions,
like the Meshy Space Concept, may offer a better solution. They start
by using the filename's precise byte-sequence, percent-encoded, as unit
id, and a readable UTF-8 version of it as name.
Search on name can be made case-insensitive.
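A minimal Python sketch of that approach follows, under stated assumptions: the filename is available as raw bytes, undecodable bytes in the readable name are replaced with U+FFFD, and the helper names are invented for this example. The actual rules of the Meshy Space Concept may differ.

```python
import unicodedata
from urllib.parse import quote_from_bytes, unquote_to_bytes

def unit_id_from_filename(raw):
    """Percent-encode the exact bytes of the filename (reversible)."""
    return quote_from_bytes(raw, safe="")

def name_from_filename(raw):
    """Readable name: decode as UTF-8 leniently, then NFKC normalize."""
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFKC", text)

def same_name(a, b):
    """Case-insensitive comparison of two (normalized) names."""
    return a.casefold() == b.casefold()

raw = b"caf\xc3\xa9 notes.txt"        # UTF-8 bytes, as on UNIX/Linux
uid = unit_id_from_filename(raw)
name = name_from_filename(raw)
print(uid, "|", name)
```

The id keeps the original bytes recoverable via unquote_to_bytes, while the name is free to be normalized for display and search.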
mark@overmeer.net
Web-pages generated on 2023-12-19