.. Copyright (c) 2010, Robert Escriva, Joe Werther, Ben Boeckel All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of CHASM nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Cryptographic-Hash-Algorithm-Secured Mirroring ============================================== Technical Design Document ------------------------- .. contents:: Table of Contents :depth: 2 The Cryptographic-Hash-Algorithm-Secured Mirroring project (CHASM) creates a mirroring network in which the network may be thought of as a true peer-to-peer network as opposed to the traditional set of point-to-point links typically used in mirroring software. Traditional tools (e.g. `rsync `_) are well understood and have been used in practice. CHASM is a relatively new idea, and this document provides insight into key design decisions that have been made. Key Definitions =============== This section defines terms that will be used throughout this document to describe CHASM's behavior and implementation. blob A unit of data that logically corresponds to a file structure. Blobs are uniquely identified by a cryptographic hash of the content. This term is based on the term used in the popular `git-scm `_. A blob is considered active if it belongs to one or more collections handled by CHASM. collection A set of directory structures published over the CHASM protocol. Examples of collections include the Fedora Linux distribution, or the Kernel.org public ftp archive. The collection is independent of updates over time. downstream A node which is functioning as the client in a peer-to-peer transaction, pulling blobs from upstream. manifest A published document explicitly describing the state of a collection at a specific point in time. master The node that generates information about updates to the collection. It does so by creating new manifests. node A server or client that wishes to partake in the process of mirroring a collection. pool A set of blobs belonging to one or more collections. public directory root The pathname referring to the top-most directory in the directory hierarchy that makes the collection. tracker A third party with knowledge of all nodes in the network. upstream A node which is functioning as the server in a peer-to-peer transaction, pushing blobs downstream. Components ========== This section describes the components that are composed to form CHASM. Each of these components is more thoroughly described in its own section later in this document. Components should be implemented as isolated daemons that interact with each other over well-defined protocols. Master Control Program A program which is tasked with monitoring other components and restarting them with the correct permissions should it become necessary. Pool Manager A program which handles the active blobs for one or more collections. Manifest Manager A program which handles non-obsoleted manifests. Pool-Manifest Oracle A program which maintains a cross-reference between blobs in the pool and manifests maintained by the manifest manager. The cross-reference is stored in the form of an ordered list of booleans corresponding to the sorted order of blobs in the manifest. Cache Daemon A program which is responsible for monitoring which blobs are most likely to be in cache at any particular point in time. It listens for announcements from other components for declarations of using blobs. At any point in time it may be provided a list of blob identifiers and be asked to return a list of which are likely to be in cache. Peer-to-Peer Downstream Daemon A program which negotiates the transfer of blobs from other peers. The peer-to-peer daemon queries the tracker directly for a list of peers. It negotiates with the pool-manifest oracle to determine which blobs to request from the upstream nodes. Peer-to-Peer Upstream Daemon A program which negotiates the transfer of blobs to other peers. It functions as a server to service requests from the Peer-to-Peer Downstream Daemon. It negotiates with the pool-manifest oracle to determine which blobs it may provide to downstream nodes. Tracker Daemon A program which coordinates pairing of peers. It performs a role similar to that of a `bittorrent tracker `_ in that it is aware of all peers in a network. This tracker should take into account the status of nodes (e.g. "official" nodes vs. "private" nodes) when pairing peers. It also should consider the network location of nodes to decrease latency. Production Manager A program which translates the manifest and pool into the exact form specified by the master. The Manifest ============ The manifest is a text file that explicitly defines the state of the collection at a given point in time; it describes directories, links, and files. Each object is described on its own LF-terminated line and TABs are used to separate metasyntactic variables. Directories have the format:: TAB LF Symbolic links have the format:: TAB LF Files have the format:: TAB TAB TAB LF <{directory|file|link|target} pathname> Each pathname may contain any character valid for a traditional Unix path with two exceptions, LF and TAB. These characters represented as \n and \t, respectively, are reserved as they have special meaning within the context of the manifest. While paths are described as absolute paths they are relative to the public directory root. The leading '/' cleans up corner cases when referring to the root directory. An ASCII-encoded octal number that specifies the user, group, and other read, write, and execute permissions. Set user and group ID bits are ignored. An ASCII-encoded decimal number specifying the size (in bytes) of the file. This number is bound by the size of a 64-bit signed integer. It is incorrect for this value to be negative. Cryptographic hash of $(basename ) with the hash-algorithm defined by the collection. The absolute minimum manifest:: / 755 A more typical (abbreviated) manifest:: / 755 /linux/ 755 /linux/core/ 755 /linux/core/1/ 755 /linux/core/2/ 755 /linux/core/3/ 755 /linux/core/4/ 755 /linux/development/13/i386/os/RPM-GPG-KEY-fedora /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary /linux/development/13/i386/os/RPM-GPG-KEY-fedora-i386 /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary /linux/development/13/i386/os/RPM-GPG-KEY-fedora-ppc /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary 0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/i386/gallery2-gd-2.3-14.fc11.noarch.rpm 0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/ppc/gallery2-gd-2.3-14.fc11.noarch.rpm 0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/ppc64/gallery2-gd-2.3-14.fc11.noarch.rpm 0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/x86_64/gallery2-gd-2.3-14.fc11.noarch.rpm Properties of the Manifest -------------------------- Using the above manifest format provides some useful qualities that may or may not be immediately apparent. Compatible with Command-Line Tools ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The line-oriented format makes for easy analysis with command-line tools. Statistics about the expected size of a complete collection can easily be collected from the manifest. Two manifests may be compared easily tools like ``diff`` and reconstructed using ``patch``. Easily Compressible ~~~~~~~~~~~~~~~~~~~ A manifest generated from a snapshot of the Fedora archive takes ~100M of disk space when using the SHA384 message digest. Using the SHA256 digest the same snapshot only consumes ~80M of disk space. The snapshots compress to ~30M and ~20M, respectively. Human Readable ~~~~~~~~~~~~~~ The manifest is human-readable assuming all pathnames contained within are human-readable as well. Network Protocols ================= Unless otherwise specified, all values are in network byte order. Protocol Version Negotiation ---------------------------- In order to ensure interoperability with future versions of CHASM, the protocol has explicit support for negotiating the version of the protocol used. When establishing a remote connection, clients will perform the negotiation described in this section to establish the latest version of the protocol supported by both clients. Message Formats ~~~~~~~~~~~~~~~ :: 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | op (1) | version (1) | +---------------+---------------+ +---------+--------+----------------------------------------------------------+ | Field | Octets | Description | +=========+========+==========================================================+ | op | 1 | | Message op code / message type. | | | | | 0 = VERSION, 1 = ACK, 2 = RST | +---------+--------+----------------------------------------------------------+ | version | 1 | The version of the protocol referred to in the message. | +---------+--------+----------------------------------------------------------+ VERSION Client/server declaration of the protocol version to speak ACK Confirmation that the protocol in `version` is a valid protocol to speak. This SHALL be the value specified in the previous VERSION packet. RST All valid protocol versions have been exhausted. The connection will be closed. Protocol Description ~~~~~~~~~~~~~~~~~~~~ 1. The client sends a protocol message to the server. The content of this message is as described above. The version number selected should (initially) be the most recent version supported by the client. 2. The server determines the greatest protocol version less than or equal to that specified by the client. If it is capable of speaking the version requested by the client it SHOULD return an ACK packet with said version. Versions that were once valid can be deprecated if they prove to be a security risk. If the server does not speak the protocol version it should reply with its own VERSION packet where `version` does not exceed that of the previous packet's `version` value. 3. The client and evaluates the packet in the same manner the server did in step two. This process repeats steps two and three until either an ACK or a RST packet is sent by either party. In the event an ACK packet is sent, the client and server resume communication using the agreed-upon protocol. In the event a RST packet is sent, both parties terminate the connection. :: Client Server | | |\_________________________________________________________________ | 1. | VERSION \| | | | Determines suitability of protocol version. | | | _________________________________________________________________/| 2. |/ VERSION or ACK or RST | | | Determines suitability of protocol version. | | | Peer-to-Peer Protocol --------------------- The Peer-to-Peer protocol (P2P) is used by two nodes to transfer files. The connection is such that the client is the downstream and server is the upstream. Each connection is unidirectional with respect to the direction blobs flow. Message Formats ~~~~~~~~~~~~~~~ Messages in the peer-to-peer protocol have a one-byte opcode that serves to identify the message sent. The opcode defines the class of message (e.g. peer-to-peer, peer-to-tracker, etc.), the type of message (e.g. standard message, error message). The following form is used for the OP field in each of the message formats below. .. note:: This diagram is labelled to make it easy to visualize the meaning of an opcode from its hexadecimal representation. For example: 0xb3 is 0b10110011 which breaks down to 0b10 0b11 0b0011 and therefore is a peer-to-tracker message send from the upstream (tracker) to the downstream (peer), is an error type. It is the fourth message of this type. :: 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ |CS |TP |MESSAGE| +---+---+-------+ +---------+------+----------------------------------------------------------+ | Field | Bits | Description | +=========+======+==========================================================+ | CS | 2 | The class of message (peer-to-peer, peer-to-tracker, | | | | master-to-tracker, or other) | | | | | | | | 0b00 | | | | Other | | | | 0b01 | | | | Peer-to-peer | | | | 0b10 | | | | Peer-to-tracker | | | | 0b11 | | | | Master-to-tracker | +---------+------+----------------------------------------------------------+ | TP | 2 | Sub-classifiers that define the message. For | | | | peer-to-peer messages, the values are: | | | | | | | | 0b00 | | | | Downstream to upstream (expected). | | | | 0b01 | | | | Upstream to downstream (expected). | | | | 0b10 | | | | Downstream to upstream (error). | | | | 0b11 | | | | Upstream to downstream (error). | +---------+------+----------------------------------------------------------+ | MESSAGE | 4 | Message types are defined below. | +---------+------+----------------------------------------------------------+ 0x30 INVALOP The INVALOP message can be sent by any component involved in interprocess or network communication. This is a potential response to every message in the protocol signifying an error condition. 0x40 HELLO The HELLO message is used by the downstream to initiate a transfer request with the upstream. :: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OP = 0x40 (1) | COL LEN (1) | HASH LEN (2) | +---------------+---------------+-------------------------------+ | COLLECTION (variable < 256) | +---------------------------------------------------------------+ | HASH (variable < 65536) | +---------------------------------------------------------------+ +------------+--------+----------------------------------------------------+ | Field | Octets | Description | +============+========+====================================================+ | OP | 1 | Peer-to-peer, downstream to upstream, message 0. | +------------+--------+----------------------------------------------------+ | COL LEN | 1 | An unsigned integer that specifies the length of | | | | the collection string in bytes (including null | | | | terminator). | +------------+--------+----------------------------------------------------+ | HASH LEN | 2 | An unsigned integer that specifies the length of | | | | the hash provided. | +------------+--------+----------------------------------------------------+ | COLLECTION | < 256 | A human readable, null-terminated string that | | | | identifies the collection. This is the value set | | | | by the vendor of the collection. | +------------+--------+----------------------------------------------------+ | HASH | < 65536| A binary representation of the hash that uniquely | | | | identifies a manifest published by the collection. | | | | This value can function as a shared secret and | | | | loosely enforce the structured release of updates. | +------------+--------+----------------------------------------------------+ .. note:: The HASH field is not to be used as the only means of communication. The security of this shared secret rests on the following assumptions: * No node attempts to pull the blobs from a node that was not listed by the tracker as valid peers for the (COLLECTION, HASH) pair. * No node leaks information about valid hashes that are not yet released to the public. If these two assumptions hold, it is the case that only authorized nodes will be able to fetch updates. Thus, a tracker may release new manifest versions in a manner that creates an arbitrary number of tiers of nodes (thus creating a tree structure). Potential Responses (Typical Control Flow): * 0x50 HAVE Potential Responses (Error Conditions): * 0x70 INVALHELLO * 0x71 INVALCOLL * 0x72 INVALHASH * 0x73 INVALMAN 0x50 HAVE The HAVE message is the upstream's method of indicating which blobs it will make available to the downstream. As the collection and manifest version are settled, clients on both ends may calculate the number of unique blobs in the manifest. The HAVE message consists of two bitfields. The first bitfield indicates blobs that are available for the downstream to request. The second bitfield indicates blobs that the downstream should request *first*. The second bitfield loosely correlates with what the upstream believes to be in the filesystem cache at any given point in time. More complex algorithms for the second bitfield may emerge in the future so as to share the contents of any given manifest over many hosts (and thus coordinate a set of mirrors which serve the entire collection straight from memory). Each bitfield is formed by taking the list of unique blobs in the manifest and sorting it according to increasing hash value (if each hash value is considered a number). The presence of a blob in the pool or cache shall be indicated with a set bit; absence of a blob is indicated by zeroing the bit. :: 0 0 1 2 3 4 5 6 7 n +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OP = 0x50 (1) | CONCATENATED BITFIELDS (ceil(n/8.0)) | +---------------+-----------------------------------------------+ .. note:: The size of the bitfield is not transmitted as manifest specified in HELLO contains all information necessary to determine the size. Potential Responses (Typical Control Flow): * 0x41 WANT Potential Responses (Error Conditions): * 0x60 INVALHAVE * 0x61 INVALCACHE 0x41 WANT The WANT message indicates a subset of the blobs specified as available in the HAVE message. The downstream should consider other peers before requesting an available blob that is not in cache. :: 0 0 1 2 3 4 5 6 7 n +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OP = 0x41 (1) | BITFIELD (ceil(n/8.0)) | +---------------+-----------------------------------------------+ Potential Responses (Typical Control Flow): * 0x51 SENDING Potential Responses (Error Conditions): * 0x74 INVALWANT 0x51 SENDING The SENDING message is a subset of the blobs requested by the downstream in the WANT message. The content of this bitfield should be a subset of that sent by the downstream in the WANT message. :: 0 0 1 2 3 4 5 6 7 n +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OP = 0x51 (1) | BITFIELD (ceil(n/8.0)) | +---------------+-----------------------------------------------+ Potential Responses (Typical Control Flow): * 0x42 CONT Potential Responses (Error Conditions): * 0x62 INVALSEND 0x42 CONT The CONT message is sent by the downstream when it confirms that the SENDING bitfield is appropriate. :: 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ | OP = 0x42 (1) | +---------------+ 0x52 BLOB The BLOB message is a header that precedes the content of a blob. It is followed immediately by the blob. :: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | OP = 0x52 (1) | BLOB SIZE (8) | +---------------+ - - - - - - - + - - - - - - - + - - - - - - - + | | + - - - - - - - +---------------+---------------+---------------+ | | BLOB HASH (variable < 65536) | +---------------+-----------------------------------------------+ +------------+--------+----------------------------------------------------+ | Field | Octets | Description | +============+========+====================================================+ | OP | 1 | Peer-to-peer, upstream to downstream, message 2. | +------------+--------+----------------------------------------------------+ | BLOB SIZE | 8 | An unsigned integer that specifies the size of the | | | | blob to be sent. | +------------+--------+----------------------------------------------------+ | BLOB HASH | < 65536| A binary representation of the hash that | | | | corresponds to blob that follows. | +------------+--------+----------------------------------------------------+ Responses should come after receipt of the blob. Should the upstream break protocol and send an unrequested blob, the downstream should end the connection. Potential Responses (Typical Control Flow): * 0x43 BLOBACK * 0x44 BLOBNACK Potential Responses (Error Conditions): * 0x63 INVALBLOB 0x43 BLOBACK 0x44 BLOBNACK The BLOBACK message is sent by the downstream when it confirms receipt of the blob, and the blob matches its hash sum. The BLOBNACK is sent when a blob does not match its hash sum. :: 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ | OP = 0x43 (1) | +---------------+ Potential Responses (Typical Control Flow): * 0x52 BLOB Potential Responses (Error Conditions): * 0x53 EOT * 0x54 EOC