Example complex data archives

Meshy Space can be used to pass complex packaged data, where the client must actively download (or upload) data, preferably in bulk. This page shows how a real-life, very large data set can be denoted.

Be aware that this design only works well on bulk data when the definitions are heavily cached between calls and processes. It is RECOMMENDED to cache definitions aggressively, but mind the expiration!
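
To give an impression, here is a minimal client-side sketch of such a definition cache, in Python. The fetch_definition callback and the "expires" field are hypothetical stand-ins for whatever transport and representation your implementation uses:

import time

class DefinitionCache:
    # Cache unit definitions between calls, honoring their expiration.
    # fetch_definition is a hypothetical callback which retrieves one
    # unit definition (here: a dict) from the service.

    def __init__(self, fetch_definition, default_ttl=24 * 3600):
        self._fetch = fetch_definition
        self._ttl   = default_ttl        # fall-back: one day
        self._cache = {}                 # unit id -> (expires, definition)

    def get(self, unit_id):
        cached = self._cache.get(unit_id)
        if cached and cached[0] > time.time():
            return cached[1]             # still fresh: no round-trip
        definition = self._fetch(unit_id)
        # "expires" is assumed to be epoch seconds in this sketch
        expires = definition.get("expires", time.time() + self._ttl)
        self._cache[unit_id] = (expires, definition)
        return definition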

CommonCrawl publications

The open web-crawler organisation CommonCrawl crawls a part of the internet every month and publishes the results in huge WARC files. Besides those, it produces various derivative products. Together, this amounts to roughly 200,000 files with a total of 500TB of uncompressed information per month.

Access to these files is one of the first implementations of Meshy Space Interfaces, so it is a good example of how complex it can get!

Store

The publications manage large quantities of data. Often, as in this example, these are purely data files, kept on big disks. However, it may also be dynamically created data. Usually, one service has very few publications (otherwise, your service is doing too much).

CommonCrawl publishes three interfaces:

  • via S3 (using aws s3 cp ...),
  • via HTTP (you need to know the names), and
  • via a search page (to get separate crawl records).

For the first two, the presented structure is exactly the same. The third is very different (and not specified in this example).

The following publication definitions reflect the variety of interfaces. The configuration is part of the store definition:

<b:unit b:id="shop">
  <b:type>msc:Role/Store</b:type>
  <b:fetch b:base="s3:"    />
  <b:fetch b:base="http://aws"  />
  <b:has b:ref="cc:index" b:is="msc:Role/Collection" />
</b:unit>
<b:unit b:id="cc:s3">
  <b:type="msc:Role/Resource">
  <b:resource>
     <b:entry>s3://commoncrawl</b:entry>
  </b:resource>
  <b:has b:ref="cc:crawl-data" b:is="msc:Role/Collection" />
</b:unit>
<b:unit b:id="cc:http">
  <b:type="msc:Role/Resource">
  <b:resource>
     <b:entry>https://commoncrawl.s3.amazonaws.com</b:entry>
  </b:resource>
  <b:has b:ref="cc:crawl-data" b:is="msc:Role/Collection" />
</b:unit>
<b:unit b:id="cc:index">
  <b:entry>s3://commoncrawl/cc-index/table/cc-main/warc/</b:entry>
</b:unit>

The address URIs may contain access information, like a username and password. There are many URI schemes; see the list of URI schemes on Wikipedia.

The publication alternatives may have different restrictions: access time, price, speed, physical location. And, of course, the client may not support some of the offered protocols.
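
To illustrate, a client could select among the alternatives with a very simple policy. A Python sketch, assuming this client only speaks HTTP(S); the bases are the ones from the resource units above:

from urllib.parse import urlsplit

# The fetch alternatives, as listed in the resource units above.
alternatives = ["s3://commoncrawl", "https://commoncrawl.s3.amazonaws.com"]

def pick_base(alternatives, supported=("https", "http")):
    # Return the first alternative whose URI scheme this client supports.
    for base in alternatives:
        if urlsplit(base).scheme in supported:
            return base
    raise RuntimeError("no supported fetch alternative")

print(pick_base(alternatives))   # -> https://commoncrawl.s3.amazonaws.com

A real client would also weigh cost, speed, and location, but the principle stays the same.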

Collections

In a publication, you find collections. Each collection is a logically coherent set of products. You can compare them with directories (folders) and archives (tar, zip). Actually, they are implemented as directories and archives! The academic difference is that a "collection" is an abstract notion: a set of data products. The concept design is more important than the implementation details.

<!-- in namespace cc: -->
<b:unit b:id="">
  <b:type>ms:Type/Namespace</b:type>
  <b:use b:prefix="cc" />   <!-- refers to me -->
  <b:namespace>
    <b:manager>server-config</b:manager>
    <b:fetch b:base="s3://commoncrawl/" />
    <b:fetch b:base="https://commoncrawl.s3.amazonaws.com/" />
  </b:namespace>
</b:unit>

<b:unit b:id="server-config">
  <b:type>msc:Type/Shop</ms.type>
  <c:shop>
  </c:shop>
  <b:has my:is="msc:Type/Collection" b:ref="2020dec-crawl" />
  <b:has my:is="msc:Type/Collection" b:ref="2021jan-crawl" />
  <b:has my:is="msc:Type/Collection" b:ref="2021feb-crawl" />
  <!-- and hundreds more -->
</b:unit>

Collections can contain sub-collections, like a directory can contain sub-directories, or a zip-file may contain zip-files. In the example above, you can read that each month a separate crawl collection is produced as a new part of the full collection.

All crawl sets are "replicated" to both publication locations (actually, they are two interfaces to the same bytes). In our case, the data found in both places is exactly the same. The client can decide which one to take. They may differ in cost, or speed, or protocol support, or jurisdiction of the server, or...

It would be possible to have different structures for the same data in different collections, or even in the same collection: see the product as explained below.

The addresses are relative URIs: they are relative to the address of the parent collection, or in this case the publication base address. You may use any kind of relative addressing which your URI scheme supports, or use absolute addresses. Mind that you SHALL put a trailing slash on directories!
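
Standard URI resolution shows why that trailing slash matters. A quick demonstration with Python's urllib, using the HTTP base because urljoin only resolves schemes it knows about:

from urllib.parse import urljoin

# With the trailing slash, the relative address extends the path:
print(urljoin("https://commoncrawl.s3.amazonaws.com/crawl-data/",
              "CC-MAIN-2021-10/segments/"))
# https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-10/segments/

# Without it, the last path element gets replaced:
print(urljoin("https://commoncrawl.s3.amazonaws.com/crawl-data",
              "CC-MAIN-2021-10/segments/"))
# https://commoncrawl.s3.amazonaws.com/CC-MAIN-2021-10/segments/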

The collection refers to the publication, and the publication to the collection: you MAY NEED to dynamically ask the service for missing details, therefore the two are double-linked. The number of collections is usually small compared to the number of products included in them.
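
A sketch of how a client may follow these references lazily; download_unit is again a hypothetical transport call:

class UnitResolver:
    # Follow b:ref cross-references on demand: a definition is only
    # requested from the service when the reference is actually used.

    def __init__(self, download_unit):
        self._download = download_unit
        self._units    = {}           # unit id -> parsed definition

    def resolve(self, ref):
        if ref not in self._units:    # missing detail: ask the service
            self._units[ref] = self._download(ref)
        return self._units[ref]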

Sub-collections

In the case of the CommonCrawl publications, each month a number of different data fragments are produced. Logically, one web-page can be seen as one product. The logs, extracted text, and detected language can be seen as parts of that product. An object model might mirror this as:

$page = $collection->product($some_url);
print $page->part('log');
$words = $page->part('text')->word_count;

It is very well possible that you prefer another structure for products and parts. That is a choice you make for your data.

<b:unit b:id="2021feb">
  <b:type>ms:Type/Namespace</b:type>
  <b:namespace>
    <b:fetch b:base="CC-MAIN-2021-10/" />
  </b:namespace>
</b:unit>

<b:unit b:id="">
  <b:type>msc:Type/Collection</b:type>
  <b:collection>
  </b:collection>
  <b:has b:ref="2021feb-crawl" />
  <b:has b:ref="2021feb-metadata" />  <!-- WAT files -->
  <b:has b:ref="2021feb-texts" />     <!-- WET files -->
</b:unit>

The different parts all have their own sub-collection. They do not need to be published at the same time. As with any structural component, you can use version, expires, etc. to get the client to update its cached information about a collection.

The collection may also maintain various useful statistics, like the total size of the stored information. This MAY BE static information.

Directory containing archive files

At a certain nesting level of collections, you will shift from representing physical directories to representing packaged product parts (optionally).

<b:unit b:id="2021-10">
  <b:type>msc:Type/Namespace</b:type>
  <b:namespace>
    <b:fetch b:base="CC-MAIN-2021-10/segments/" />
  </b:namespace>
  <b:has b:is="warc:Type/Archive" b:ref="2021-10/00000" />
  <b:has b:is="warc:Type/Archive" b:ref="2021-10/00001" />
     <!-- and 78000 more -->
</b:unit>

The absolute address for this set is already getting quite long: s3://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/

Archive files with product parts

Finally, something which stores bytes. However, it is still all declarative and abstract.

<b:unit b:id="202102warc-00000">
  <b:type>msc:Type/Collection</b:type>
  <b:content>
    <b:type>warc:Type/Archive</b:type>
    <b:size>1063809923</b:size>
    <b:packaged b:by="b:Pack/GzipMultiStream">
      <b:fetch b:ref="2021feb-crawl" />
      <b:entry>1614178347293.1/warc/CC-MAIN-2021022-00000.warc.gz</b:entry>
    </b:packaged>
  </b:content>
</b:unit>

WARC files use the extension .warc.gz, but be warned that each "record" is separately compressed with gzip. Each crawled web-page produces three records: a metadata record, a request, and a response. Probably in the same WARC file.
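
This per-record compression makes random access cheap: a client can fetch one record with an HTTP Range request and decompress it on its own. A Python sketch; reading the b:entry syntax (shown in the record units below) as #offset+length is an assumption:

import gzip
import urllib.request

def fetch_record(archive_url, entry):
    # entry like "#12314151+11515": byte offset plus length in the archive
    offset, length = map(int, entry.lstrip("#").split("+"))
    request = urllib.request.Request(archive_url,
        headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)})
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    return gzip.decompress(compressed)  # one member of the multi-stream file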

You may be tempted to say that there is a meaning to cc-s3/crawl-data/2021feb/2021feb-crawl/202102warc-00000, but there is none: there are multiple paths from data items towards publication, and they are not equivalent. The sequence of IDs, and even the IDs themselves, should not be used anywhere. All IDs are unique, hence uniquely reference a unit of configuration.

Products

Only at the bottom of a collection structure do we find real bytes. But that is not important. Crucial is that the server has homogeneous units (items) prepared for the client. Those logical items collect their knowledge from the fastest and cheapest sources, and only when they need to.

<!-- in namespace cc: -->
<b:unit b:id="202102warc-00000">
  <b:type>warc:Type/Archive</b:type>
  <b:name>CommonCrawl output 2021-02 file 00000</b:name>
  <b:content>
    <b:type>warc:Type/Archive</b:type>
    <b:fetch b:href="202102warc-00000.warc.gz" />
  </b:content>
</b:unit>

<b:unit b:id="ASDJKDJO12314A">
  <b:type>cc:Type/CrawlResults</b:type>
  <b:name>https://markov.solutions/index.html</b:name>
  <b:created>2021-04-01T09:08:07Z</b:created>
  <b:expires>2021-12-30T12:00:00Z</b:expires>
  <b:content/>
  <b:has b:is="warc:Role/Request"  b:ref="ASDJKDJO12314A-req"   />
  <b:has b:is="warc:Role/Response" b:ref="ASDJKDJO12314A-resp"  />
  <b:has b:is="warc:Role/Text"     b:ref="ASDJKDJO12314A-text"  />
  <b:has b:is="warc:Role/Links"    b:ref="ASDJKDJO12314A-links" />
</b:unit>

<b:unit b:id="ASDJKDJO12314A-req">
  <b:type>warc:Type/record</b:type>
  <b:content>
    <b:size>34516</b:size>
    <b:packaged b:by="ms:Pack/GzipMultiStream">
      <b:resource>202102warc-00000</b:resource>
      <b:entry>#12314151+11515</b:entry>
    </b:packaged>
  </b:content>
</b:unit>

<b:unit b:id="ASDJKDJO12314A-req">
  <b:type>warc:Type/record</b:type>
  <b:content>
    <b:size>25257</b:size>
    <b:packaged b:by="ms:Pack/GzipMultiStream">
      <b:resource>202102warc-00000</b:resource>
      <b:entry>#12325666+23521</b:entry>
    </b:packaged>
  </b:content>
</b:unit>

<b:unit b:id="ASDJKDJO12314A-text">
  <b:type>warc:Type/record</b:type>
  <b:content>
    <b:size>516</b:size>
    <b:charset>iso-8859-1</b:charset>
    <b:languages>ms:Language/nl-NL</b:languages>
    <b:packaged b:by="ms:Pack/GzipMultiStream">
      <!-- from another resource -->
      <b:resource>202102wet-0368</b:resource>
      <b:entry>#3242+2345428</b:entry>
    </b:packaged>
  </b:content>
</b:unit>

<b:unit b:id="ASDJKDJO12314A-links">
   <b:type>iana:MediaType/text/uri-list</b:type>
   <b:content>
      <b:blob b:compressed="ms:Compress/Gzip"
               b:encoded="ms:Encode/Base64">
ewogICAiQ29udGFpbmVyIiA6IHsKICAgICAgIkNvbXByZXNzZWQiIDogdHJ1ZSw
ZnNldCIgOiAiNjAwOTk3NTgzIiwKICAgICAgIkZpbGVuYW1lIiA6ICJDQy1NQUl
bnQtTGVuZ3RoIiA6ICIyMDEiCiAgICAgIH0sCiAgICAgICJGb3JtYXQiIDogIld
fQo=
      </b:blob>
   </b:content>
</b:unit>
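
The b:blob attributes declare how to unpack the inline content: reverse the Base64 encoding first, then the gzip compression. A minimal sketch in Python (the blob above is abbreviated on this page, so it will not decode as printed):

import base64
import gzip

def decode_blob(blob_text):
    # Reverse the declared encodings: Base64 first, then gzip.
    compressed = base64.b64decode(blob_text)
    return gzip.decompress(compressed).decode("utf-8")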

With the URL (here used as name), one can get structured crawl information, extracted text (optional), extracted links, and more.

Placement

The publication and collection definitions are cached over client-server connections, typically for a day. They will usually be put statically in the service definition. However, when the service is very large (as in this example), they may also be included on a need-to-know basis.

The products and their parts are provided in response to queries. They are part of the payload, and as such may be distributed inline, or as bulk via publications.

