.. Copyright (c) 2010, Robert Escriva, Joe Werther, Ben Boeckel
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of CHASM nor the names of its contributors may be used
to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Cryptographic-Hash-Algorithm-Secured Mirroring
==============================================
Technical Design Document
-------------------------
.. contents:: Table of Contents
:depth: 2
The Cryptographic-Hash-Algorithm-Secured Mirroring project (CHASM) creates a
mirroring network in which the network may be thought of as a true peer-to-peer
network as opposed to the traditional set of point-to-point links typically used
in mirroring software.
Traditional tools (e.g. `rsync `_) are well
understood and have been used in practice. CHASM is a relatively new idea, and
this document provides insight into key design decisions that have been made.
Key Definitions
===============
This section defines terms that will be used throughout this document to
describe CHASM's behavior and implementation.
blob
A unit of data that logically corresponds to a file structure. Blobs are
uniquely identified by a cryptographic hash of the content. This term is
based on the term used in the popular `git-scm `_.
A blob is considered active if it belongs to one or more collections handled
by CHASM.
collection
A set of directory structures published over the CHASM protocol. Examples of
collections include the Fedora Linux distribution, or the Kernel.org public
ftp archive.
The collection is independent of updates over time.
downstream
A node which is functioning as the client in a peer-to-peer transaction,
pulling blobs from upstream.
manifest
A published document explicitly describing the state of a collection at a
specific point in time.
master
The node that generates information about updates to the collection. It does
so by creating new manifests.
node
A server or client that wishes to partake in the process of mirroring a
collection.
pool
A set of blobs belonging to one or more collections.
public directory root
The pathname referring to the top-most directory in the directory hierarchy
that makes the collection.
tracker
A third party with knowledge of all nodes in the network.
upstream
A node which is functioning as the server in a peer-to-peer transaction,
pushing blobs downstream.
Components
==========
This section describes the components that are composed to form CHASM. Each of
these components is more thoroughly described in its own section later in this
document. Components should be implemented as isolated daemons that interact
with each other over well-defined protocols.
Master Control Program
A program which is tasked with monitoring other components and restarting
them with the correct permissions should it become necessary.
Pool Manager
A program which handles the active blobs for one or more collections.
Manifest Manager
A program which handles non-obsoleted manifests.
Pool-Manifest Oracle
A program which maintains a cross-reference between blobs in the pool and
manifests maintained by the manifest manager. The cross-reference is stored
in the form of an ordered list of booleans corresponding to the sorted order
of blobs in the manifest.
Cache Daemon
A program which is responsible for monitoring which blobs are most likely to
be in cache at any particular point in time. It listens for announcements
from other components for declarations of using blobs.
At any point in time it may be provided a list of blob identifiers and be
asked to return a list of which are likely to be in cache.
Peer-to-Peer Downstream Daemon
A program which negotiates the transfer of blobs from other peers. The
peer-to-peer daemon queries the tracker directly for a list of peers. It
negotiates with the pool-manifest oracle to determine which blobs to request
from the upstream nodes.
Peer-to-Peer Upstream Daemon
A program which negotiates the transfer of blobs to other peers. It
functions as a server to service requests from the Peer-to-Peer Downstream
Daemon. It negotiates with the pool-manifest oracle to determine which blobs
it may provide to downstream nodes.
Tracker Daemon
A program which coordinates pairing of peers. It performs a role similar to
that of a `bittorrent tracker
`_ in that it is aware of
all peers in a network.
This tracker should take into account the status of nodes (e.g. "official"
nodes vs. "private" nodes) when pairing peers. It also should consider the
network location of nodes to decrease latency.
Production Manager
A program which translates the manifest and pool into the exact form
specified by the master.
The Manifest
============
The manifest is a text file that explicitly defines the state of the collection
at a given point in time; it describes directories, links, and files. Each
object is described on its own LF-terminated line and TABs are used to separate
metasyntactic variables.
Directories have the format::
TAB LF
Symbolic links have the format::
TAB LF
Files have the format::
TAB TAB TAB LF
<{directory|file|link|target} pathname>
Each pathname may contain any character valid for a traditional Unix path
with two exceptions, LF and TAB. These characters represented as \n and \t,
respectively, are reserved as they have special meaning within the context of
the manifest. While paths are described as absolute paths they are relative
to the public directory root. The leading '/' cleans up corner cases when
referring to the root directory.
An ASCII-encoded octal number that specifies the user, group, and other read,
write, and execute permissions. Set user and group ID bits are ignored.
An ASCII-encoded decimal number specifying the size (in bytes) of the file.
This number is bound by the size of a 64-bit signed integer. It is incorrect
for this value to be negative.
Cryptographic hash of $(basename ) with the hash-algorithm
defined by the collection.
The absolute minimum manifest::
/ 755
A more typical (abbreviated) manifest::
/ 755
/linux/ 755
/linux/core/ 755
/linux/core/1/ 755
/linux/core/2/ 755
/linux/core/3/ 755
/linux/core/4/ 755
/linux/development/13/i386/os/RPM-GPG-KEY-fedora /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary
/linux/development/13/i386/os/RPM-GPG-KEY-fedora-i386 /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary
/linux/development/13/i386/os/RPM-GPG-KEY-fedora-ppc /linux/development/13/i386/os/RPM-GPG-KEY-fedora-13-primary
0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/i386/gallery2-gd-2.3-14.fc11.noarch.rpm
0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/ppc/gallery2-gd-2.3-14.fc11.noarch.rpm
0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/ppc64/gallery2-gd-2.3-14.fc11.noarch.rpm
0000100dc4ee87106dacb991fad99befb20ccba6f3f24b067705c48a02452cf7 644 /linux/updates/11/x86_64/gallery2-gd-2.3-14.fc11.noarch.rpm
Properties of the Manifest
--------------------------
Using the above manifest format provides some useful qualities that may or may
not be immediately apparent.
Compatible with Command-Line Tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The line-oriented format makes for easy analysis with command-line tools.
Statistics about the expected size of a complete collection can easily be
collected from the manifest. Two manifests may be compared easily tools like
``diff`` and reconstructed using ``patch``.
Easily Compressible
~~~~~~~~~~~~~~~~~~~
A manifest generated from a snapshot of the Fedora archive takes ~100M of disk
space when using the SHA384 message digest. Using the SHA256 digest the same
snapshot only consumes ~80M of disk space. The snapshots compress to ~30M and
~20M, respectively.
Human Readable
~~~~~~~~~~~~~~
The manifest is human-readable assuming all pathnames contained within are
human-readable as well.
Network Protocols
=================
Unless otherwise specified, all values are in network byte order.
Protocol Version Negotiation
----------------------------
In order to ensure interoperability with future versions of CHASM, the protocol
has explicit support for negotiating the version of the protocol used. When
establishing a remote connection, clients will perform the negotiation described
in this section to establish the latest version of the protocol supported by
both clients.
Message Formats
~~~~~~~~~~~~~~~
::
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| op (1) | version (1) |
+---------------+---------------+
+---------+--------+----------------------------------------------------------+
| Field | Octets | Description |
+=========+========+==========================================================+
| op | 1 | | Message op code / message type. |
| | | | 0 = VERSION, 1 = ACK, 2 = RST |
+---------+--------+----------------------------------------------------------+
| version | 1 | The version of the protocol referred to in the message. |
+---------+--------+----------------------------------------------------------+
VERSION
Client/server declaration of the protocol version to speak
ACK
Confirmation that the protocol in `version` is a valid protocol to speak.
This SHALL be the value specified in the previous VERSION packet.
RST
All valid protocol versions have been exhausted. The connection will be
closed.
Protocol Description
~~~~~~~~~~~~~~~~~~~~
1. The client sends a protocol message to the server. The content of this
message is as described above. The version number selected should
(initially) be the most recent version supported by the client.
2. The server determines the greatest protocol version less than or equal to
that specified by the client. If it is capable of speaking the version
requested by the client it SHOULD return an ACK packet with said version.
Versions that were once valid can be deprecated if they prove to be a
security risk. If the server does not speak the protocol version it should
reply with its own VERSION packet where `version` does not exceed that of
the previous packet's `version` value.
3. The client and evaluates the packet in the same manner the server did in
step two.
This process repeats steps two and three until either an ACK or a RST packet is
sent by either party. In the event an ACK packet is sent, the client and server
resume communication using the agreed-upon protocol. In the event a RST
packet is sent, both parties terminate the connection.
::
Client Server
| |
|\_________________________________________________________________ |
1. | VERSION \|
| |
| Determines suitability of protocol version.
| |
| _________________________________________________________________/|
2. |/ VERSION or ACK or RST |
| |
Determines suitability of protocol version. |
| |
Peer-to-Peer Protocol
---------------------
The Peer-to-Peer protocol (P2P) is used by two nodes to transfer files. The
connection is such that the client is the downstream and server is the upstream.
Each connection is unidirectional with respect to the direction blobs flow.
Message Formats
~~~~~~~~~~~~~~~
Messages in the peer-to-peer protocol have a one-byte opcode that serves to
identify the message sent.
The opcode defines the class of message (e.g. peer-to-peer, peer-to-tracker,
etc.), the type of message (e.g. standard message, error message). The
following form is used for the OP field in each of the message formats below.
.. note::
This diagram is labelled to make it easy to visualize the meaning of an
opcode from its hexadecimal representation.
For example: 0xb3 is 0b10110011 which breaks down to 0b10 0b11 0b0011 and
therefore is a peer-to-tracker message send from the upstream (tracker) to
the downstream (peer), is an error type. It is the fourth message of this
type.
::
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|CS |TP |MESSAGE|
+---+---+-------+
+---------+------+----------------------------------------------------------+
| Field | Bits | Description |
+=========+======+==========================================================+
| CS | 2 | The class of message (peer-to-peer, peer-to-tracker, |
| | | master-to-tracker, or other) |
| | | |
| | | 0b00 |
| | | Other |
| | | 0b01 |
| | | Peer-to-peer |
| | | 0b10 |
| | | Peer-to-tracker |
| | | 0b11 |
| | | Master-to-tracker |
+---------+------+----------------------------------------------------------+
| TP | 2 | Sub-classifiers that define the message. For |
| | | peer-to-peer messages, the values are: |
| | | |
| | | 0b00 |
| | | Downstream to upstream (expected). |
| | | 0b01 |
| | | Upstream to downstream (expected). |
| | | 0b10 |
| | | Downstream to upstream (error). |
| | | 0b11 |
| | | Upstream to downstream (error). |
+---------+------+----------------------------------------------------------+
| MESSAGE | 4 | Message types are defined below. |
+---------+------+----------------------------------------------------------+
0x30 INVALOP
The INVALOP message can be sent by any component involved in interprocess
or network communication. This is a potential response to every message
in the protocol signifying an error condition.
0x40 HELLO
The HELLO message is used by the downstream to initiate a transfer request
with the upstream.
::
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OP = 0x40 (1) | COL LEN (1) | HASH LEN (2) |
+---------------+---------------+-------------------------------+
| COLLECTION (variable < 256) |
+---------------------------------------------------------------+
| HASH (variable < 65536) |
+---------------------------------------------------------------+
+------------+--------+----------------------------------------------------+
| Field | Octets | Description |
+============+========+====================================================+
| OP | 1 | Peer-to-peer, downstream to upstream, message 0. |
+------------+--------+----------------------------------------------------+
| COL LEN | 1 | An unsigned integer that specifies the length of |
| | | the collection string in bytes (including null |
| | | terminator). |
+------------+--------+----------------------------------------------------+
| HASH LEN | 2 | An unsigned integer that specifies the length of |
| | | the hash provided. |
+------------+--------+----------------------------------------------------+
| COLLECTION | < 256 | A human readable, null-terminated string that |
| | | identifies the collection. This is the value set |
| | | by the vendor of the collection. |
+------------+--------+----------------------------------------------------+
| HASH | < 65536| A binary representation of the hash that uniquely |
| | | identifies a manifest published by the collection. |
| | | This value can function as a shared secret and |
| | | loosely enforce the structured release of updates. |
+------------+--------+----------------------------------------------------+
.. note::
The HASH field is not to be used as the only means of communication. The
security of this shared secret rests on the following assumptions:
* No node attempts to pull the blobs from a node that was not listed by
the tracker as valid peers for the (COLLECTION, HASH) pair.
* No node leaks information about valid hashes that are not yet released
to the public.
If these two assumptions hold, it is the case that only authorized nodes
will be able to fetch updates. Thus, a tracker may release new manifest
versions in a manner that creates an arbitrary number of tiers of nodes
(thus creating a tree structure).
Potential Responses (Typical Control Flow):
* 0x50 HAVE
Potential Responses (Error Conditions):
* 0x70 INVALHELLO
* 0x71 INVALCOLL
* 0x72 INVALHASH
* 0x73 INVALMAN
0x50 HAVE
The HAVE message is the upstream's method of indicating which blobs it will
make available to the downstream. As the collection and manifest version are
settled, clients on both ends may calculate the number of unique blobs in the
manifest.
The HAVE message consists of two bitfields. The first bitfield indicates
blobs that are available for the downstream to request. The second bitfield
indicates blobs that the downstream should request *first*. The second
bitfield loosely correlates with what the upstream believes to be in the
filesystem cache at any given point in time. More complex algorithms for the
second bitfield may emerge in the future so as to share the contents of any
given manifest over many hosts (and thus coordinate a set of mirrors which
serve the entire collection straight from memory).
Each bitfield is formed by taking the list of unique blobs in the manifest
and sorting it according to increasing hash value (if each hash value is
considered a number). The presence of a blob in the pool or cache shall be
indicated with a set bit; absence of a blob is indicated by zeroing the bit.
::
0
0 1 2 3 4 5 6 7 n
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OP = 0x50 (1) | CONCATENATED BITFIELDS (ceil(n/8.0)) |
+---------------+-----------------------------------------------+
.. note::
The size of the bitfield is not transmitted as manifest specified in HELLO
contains all information necessary to determine the size.
Potential Responses (Typical Control Flow):
* 0x41 WANT
Potential Responses (Error Conditions):
* 0x60 INVALHAVE
* 0x61 INVALCACHE
0x41 WANT
The WANT message indicates a subset of the blobs specified as available in
the HAVE message. The downstream should consider other peers before
requesting an available blob that is not in cache.
::
0
0 1 2 3 4 5 6 7 n
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OP = 0x41 (1) | BITFIELD (ceil(n/8.0)) |
+---------------+-----------------------------------------------+
Potential Responses (Typical Control Flow):
* 0x51 SENDING
Potential Responses (Error Conditions):
* 0x74 INVALWANT
0x51 SENDING
The SENDING message is a subset of the blobs requested by the downstream in
the WANT message. The content of this bitfield should be a subset of that
sent by the downstream in the WANT message.
::
0
0 1 2 3 4 5 6 7 n
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OP = 0x51 (1) | BITFIELD (ceil(n/8.0)) |
+---------------+-----------------------------------------------+
Potential Responses (Typical Control Flow):
* 0x42 CONT
Potential Responses (Error Conditions):
* 0x62 INVALSEND
0x42 CONT
The CONT message is sent by the downstream when it confirms that the SENDING
bitfield is appropriate.
::
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| OP = 0x42 (1) |
+---------------+
0x52 BLOB
The BLOB message is a header that precedes the content of a blob. It is
followed immediately by the blob.
::
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OP = 0x52 (1) | BLOB SIZE (8) |
+---------------+ - - - - - - - + - - - - - - - + - - - - - - - +
| |
+ - - - - - - - +---------------+---------------+---------------+
| | BLOB HASH (variable < 65536) |
+---------------+-----------------------------------------------+
+------------+--------+----------------------------------------------------+
| Field | Octets | Description |
+============+========+====================================================+
| OP | 1 | Peer-to-peer, upstream to downstream, message 2. |
+------------+--------+----------------------------------------------------+
| BLOB SIZE | 8 | An unsigned integer that specifies the size of the |
| | | blob to be sent. |
+------------+--------+----------------------------------------------------+
| BLOB HASH | < 65536| A binary representation of the hash that |
| | | corresponds to blob that follows. |
+------------+--------+----------------------------------------------------+
Responses should come after receipt of the blob. Should the upstream break
protocol and send an unrequested blob, the downstream should end the
connection.
Potential Responses (Typical Control Flow):
* 0x43 BLOBACK
* 0x44 BLOBNACK
Potential Responses (Error Conditions):
* 0x63 INVALBLOB
0x43 BLOBACK
0x44 BLOBNACK
The BLOBACK message is sent by the downstream when it confirms receipt of the
blob, and the blob matches its hash sum. The BLOBNACK is sent when a blob does
not match its hash sum.
::
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| OP = 0x43 (1) |
+---------------+
Potential Responses (Typical Control Flow):
* 0x52 BLOB
Potential Responses (Error Conditions):
* 0x53 EOT
* 0x54 EOC