MS Concept
Example complex data archives

Meshy Space can be used to pass complex packaged data, where the client must actively download (or upload) data, preferably in bulk. This page shows how real-life, very large data-sets can be described. Be aware that this design only works well (on bulk data) when the definitions are heavily cached between calls and processes. It is RECOMMENDED to cache definitions heavily, but mind the expiration!

CommonCrawl publications

The open web-crawler organisation CommonCrawl crawls a part of the internet every month and publishes the results in huge WARC files. It also produces various derivative products. Together, these are in the order of 200,000 files with a total of 500TB of uncompressed information per month. Access to these files is one of the first implementations of Meshy Space Interfaces, so it is a good example of how complex it can get!

Store

The publications manage large quantities of data. Often (as in this example) these are purely data files, kept on big disks. However, the data may also be created dynamically. Usually, one service has very few publications (otherwise, your service is doing too much). CommonCrawl publishes three interfaces:
For the first two, the presented structure is exactly the same. The third is very different (and not specified in this example).

The following publication definitions reflect the variety of interfaces. The configuration is part of the store definition:

  <b:unit b:id="shop">
    <b:type>msc:Role/Store</b:type>
    <b:fetch b:base="s3:" />
    <b:fetch b:base="http://aws" />
    <b:has b:ref="cc:index" b:is="msc:Role/Collection" />
  </b:unit>

  <b:unit b:id="cc:s3">
    <b:type>msc:Role/Resource</b:type>
    <b:resource>
      <b:entry>s3://commoncrawl</b:entry>
    </b:resource>
    <b:has b:ref="cc:crawl-data" b:is="msc:Role/Collection" />
  </b:unit>

  <b:unit b:id="cc:http">
    <b:type>msc:Role/Resource</b:type>
    <b:resource>
      <b:entry>https://commoncrawl.s3.amazonaws.com</b:entry>
    </b:resource>
    <b:has b:ref="cc:crawl-data" b:is="msc:Role/Collection" />
  </b:unit>

  <b:unit b:id="cc:index">
    <b:entry>s3://commoncrawl/cc-index/table/cc-main/warc/</b:entry>
  </b:unit>

The publication alternatives may have different restrictions: access time, price, speed, physical location. And, of course, the client may not support some of the offered protocols.

Collections

In a publication, you find collections. Each collection is a logically coherent set of products. You can compare them with directories (folders) and archives (tar, zip). Actually: they are implemented as directories and archives! The academic difference is that a "collection" is an abstract notion: a set of data products.
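The choice between the publication alternatives described above can be sketched in a few lines of Python. This is only an illustration, not part of Meshy Space: the function name pick_fetch_base is hypothetical, and selecting purely on URL scheme is an assumption. A real client would probably also weigh price, speed and physical location.

```python
from urllib.parse import urlsplit

# Resource entries as they appear in the definitions on this page.
RESOURCE_ENTRIES = [
    "s3://commoncrawl",
    "https://commoncrawl.s3.amazonaws.com",
]


def pick_fetch_base(entries, supported_schemes):
    """Return the first entry whose URL scheme the client supports.

    Hypothetical helper: returns None when no offered protocol is usable.
    """
    for entry in entries:
        if urlsplit(entry).scheme in supported_schemes:
            return entry
    return None
```

A client that only speaks HTTPS would call pick_fetch_base(RESOURCE_ENTRIES, {"https"}) and get the HTTP interface; a client that also supports S3 gets the first listed alternative.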
The concept design is more important than the implementation details.

  <!-- in namespace cc: -->
  <b:unit b:id="">
    <b:type>ms:Type/Namespace</b:type>
    <b:use b:prefix="cc" />   <!-- refers to me -->
    <b:namespace>
      <b:manager>server-config</b:manager>
      <b:fetch b:base="s3://commoncrawl/" />
      <b:fetch b:base="https://commoncrawl.s3.amazonaws.com/" />
    </b:namespace>
  </b:unit>

  <b:unit b:id="server-config">
    <b:type>msc:Type/Shop</b:type>
    <c:shop>
    </c:shop>
    <b:has my:is="msc:Type/Collection" b:ref="2020dec-crawl" />
    <b:has my:is="msc:Type/Collection" b:ref="2021jan-crawl" />
    <b:has my:is="msc:Type/Collection" b:ref="2021feb-crawl" />
    <!-- and hundreds more -->
  </b:unit>

Collections can contain sub-collections, like a directory can contain sub-directories or a zip-file may contain zip-files. In the example above, you can read that each month a separate crawl collection is produced as a new part of the full collection.

All crawl sets are "replicated" to both publication locations (actually they are two interfaces to the same bytes). In our case, the data found in both places is exactly the same. The client can decide which one to take. They may differ in cost, or speed, or protocol support, or jurisdiction of the server, or ... It would be possible to have different structures for the same data in different collections, or even in the same collection: see the product as explained below.

The collection refers to the publication, and the publication to the collection: you MAY NEED to dynamically ask the service for missing details, therefore this is a doubly-linked list. The number of collections is usually small compared to the number of products included in them.

Sub-collections

In the case of CommonCrawl publications, each month a bunch of different data fragments are produced. Logically, one web-page can be seen as one product. The logs, extracted text and detected language can be seen as parts of that product.
Following any object model, this should mimic:

  $page = $collection->product($some_url);
  print $page->part('log');
  $words = $page->part('text')->word_count;

It is very well possible that you prefer another structure for products and parts. That is a choice you make for your data.

  <b:unit b:id="2021feb">
    <b:type>ms:Type/Namespace</b:type>
    <b:namespace>
      <b:fetch b:base="CC-MAIN-2021-10/" />
    </b:namespace>
  </b:unit>

  <b:unit b:id=
    <b:type>msc:Type/Collection</b:type>
    <b:collection>
      </b:resource>
    </b:collection>
    <b:has b:ref="2021feb-crawl" />
    <b:has b:ref="2021feb-metadata" />   <!-- WAT files -->
    <b:has b:ref="2021feb-texts" />      <!-- WET files -->
  </b:unit>

The different parts all have their own sub-collection. They do
not need to be published at the same time. As any structural component,
you can use

The collection may also maintain various useful statistics, like the
total

Directory containing archive files

At a certain nesting level of collections, you shift from representing physical directories to representing packaged product parts (optional).

  <b:unit b:id="2021-10">
    <b:type>msc:Type/Namespace</b:type>
    <b:namespace>
      <b:fetch b:base="CC-MAIN-2021-10/segments/" />
    </b:namespace>
    <b:has b:is="warc:Type/Archive" b:ref="2021-10/00000" />
    <b:has b:is="warc:Type/Archive" b:ref="2021-10/00001" />
    <!-- and 78000 more -->
  </b:unit>

The absolute address for this set is already getting quite long
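To see how quickly such an address grows, the sketch below joins a root fetch base from this page with the relative base of the namespace above. It assumes that a relative base is simply appended to the base of the enclosing definition, which this page does not state explicitly. The helper absolute_address is hypothetical, and the archive entry itself is left out.

```python
def absolute_address(*parts):
    """Join a root fetch base with relative bases, using single slashes.

    Assumption: relative bases are appended to the enclosing base as-is.
    """
    address = ""
    for part in parts:
        # Make sure exactly one slash separates consecutive parts.
        if address and not address.endswith("/"):
            address += "/"
        address += part.lstrip("/") if address else part
    return address


# Values taken from the definitions on this page.
S3_ADDRESS = absolute_address("s3://commoncrawl/", "CC-MAIN-2021-10/segments/")
HTTP_ADDRESS = absolute_address(
    "https://commoncrawl.s3.amazonaws.com", "CC-MAIN-2021-10/segments/"
)
```

Under that assumption, both publication alternatives resolve to the same relative structure and differ only in the root.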
Archive files with product parts

Finally something which stores bytes. However, it's still all declarative and abstract.

  <b:unit b:id="202102warc-00000">
    <b:type>msc:Type/Collection</b:type>
    <b:content>
      <b:type>warc:Type/Archive</b:type>
      <b:size>1063809923</b:size>
      <b:packaged b:by="b:Pack/GzipMultiStream">
        <b:fetch b:ref="2021feb-crawl" />
        <b:entry>1614178347293.1/warc/CC-MAIN-2021022-00000.warc.gz</b:entry>
      </b:packaged>
    </b:content>
  </b:unit>

WARC files use extension

You may be tempted to say that there is a meaning to
Products

Only at the bottom of a collection structure do we find real bytes.
But that's not important. What is crucial is that the server has homogeneous
units (items) prepared for the client. Those logic
  <!-- in namespace cc: -->
  <b:unit b:id="202102warc-00000">
    <b:type>warc:Type/Archive</b:type>
    <b:name>CommonCrawl output 2021-02 file 00000</b:name>
    <b:content>
      <b:type>warc:Type/Archive</b:type>
      <b:fetch b:href="202102warc-00000.warc.gz" />
    </b:content>
  </b:unit>

  <b:unit b:id="ASDJKDJO12314A">
    <b:type>cc:Type/CrawlResults</b:type>
    <b:name>https://markov.solutions/index.html</b:name>
    <b:created>2021-04-01T09:08:07Z</b:created>
    <b:expires>2021-12-30T12:00:00Z</b:expires>
    <b:content/>
    <b:has b:is="warc:Role/Request" b:ref="ASDJKDJO12314A-req" />
    <b:has b:is="warc:Role/Response" b:ref="ASDJKDJO12314A-resp" />
    <b:has b:is="warc:Role/Text" b:ref="ASDJKDJO12314A-text" />
    <b:has b:is="warc:Role/Links" b:ref="ASDJKDJO12314A-links" />
  </b:unit>

  <b:unit b:id="ASDJKDJO12314A-req">
    <b:type>warc:Type/record</b:type>
    <b:content>
      <b:size>34516</b:size>
      <b:packaged b:by="ms:Pack/GzipMultiStream">
        <b:resource>202102warc-00000</b:resource>
        <b:entry>#12314151+11515</b:entry>
      </b:packaged>
    </b:content>
  </b:unit>

  <b:unit b:id="ASDJKDJO12314A-resp">
    <b:type>warc:Type/record</b:type>
    <b:content>
      <b:size>25257</b:size>
      <b:packaged b:by="ms:Pack/GzipMultiStream">
        <b:resource>202102warc-00000</b:resource>
        <b:entry>#12325666+23521</b:entry>
      </b:packaged>
    </b:content>
  </b:unit>

  <b:unit b:id="ASDJKDJO12314A-text">
    <b:type>warc:Type/record</b:type>
    <b:content>
      <b:size>516</b:size>
      <b:charset>iso-8859-1</b:charset>
      <b:languages>ms:/Language/nl-NL</b:languages>
      <b:packaged b:by="ms:Pack/GzipMultiStream">
        <!-- from another resource -->
        <b:resource>202102wet-0368</b:resource>
        <b:entry>#3242+2345428</b:entry>
      </b:packaged>
    </b:content>
  </b:unit>

  <b:unit b:id="ASDJKDJO12314A-links">
    <b:type>iana:MediaType/text/uri-list</b:type>
    <b:content>
      <b:blob b:compressed="ms:Compress/Gzip" b:encoded="ms:Encode/Base64">
        ewogICAiQ29udGFpbmVyIiA6IHsKICAgICAgIkNvbXByZXNzZWQiIDogdHJ1ZSw
        ZnNldCIgOiAiNjAwOTk3NTgzIiwKICAgICAgIkZpbGVuYW1lIiA6ICJDQy1NQUl
        bnQtTGVuZ3RoIiA6ICIyMDEiCiAgICAgIH0sCiAgICAgICJGb3JtYXQiIDogIld
        fQo=
      </b:blob>
    </b:content>
  </b:unit>

With url (here used as

Placement

The publication and collection definitions are cached over client-server connections, typically for a day. They will usually be put statically in the service definition. However, when the service is very large (as in this example), they may also be included on a need-to-know basis.

The products and their parts are provided in response to queries. They are part of the payload, and as such may be distributed inline, or in bulk via publications.

mark@overmeer.net Web-pages generated on 2023-12-19