RedStuff Encoding Algorithm

The RedStuff encoding algorithm used in Walrus is an adaptation of the Twin-Code framework presented by Rashmi et al. [1].

Goals and overview

The goal of the Walrus system is to provide a distributed storage infrastructure, where a decentralized set of entities—the storage nodes—collaborate to store and serve files (blobs of data). When it comes to storage properties, Walrus has 3 key goals:

  1. To support extremely high availability and durability of the data.
  2. To have low storage overhead compared to full replication, meaning you do not store each blob on every storage node.
  3. To gracefully support node failures, and in particular to allow for efficient node recovery (more on this later).

Given these requirements, one good option is to erasure encode the blobs across the storage nodes. At a high level, erasure encoding (or erasure coding) allows you to encode the data into $N$ parts, such that the aggregate size of the $N$ parts is a small multiple of the original blob size, and any subset of $k$ of these parts is sufficient to recover the original blob. The next section formalizes these concepts, but note that erasure coding already allows you to achieve goals 1 and 2 above, because:

  1. Erasure coding allows you to recover a blob even if $N - k$ storage nodes fail, providing high availability and durability.
  2. The overall storage overhead is much smaller than for full replication. For a blob of size $S$, the total storage used in the system is $S \cdot c$ instead of $S \cdot N$, where $c \ll N$ is a small constant (4.5 in Walrus's case).

To achieve the third requirement, however, simple erasure coding is insufficient. A failed node that wants to reconstruct its part of the encoding needs to first fetch at least $k$ other parts, reconstruct the blob, and then re-encode its own part. Therefore, the communication overhead for recovery is on the order of the size of the whole blob, $S$. With RedStuff, you can instead reconstruct the encoded part of a failed node by fetching only $O(S/N)$ data, that is, on the order of the size of the lost part alone. This achieves goal 3.

Background

This section provides the essential background on the coding schemes used in RedStuff.

Erasure codes

Erasure coding addresses the problem of error correction in the case of bit erasures, where some bits of the message are lost, as in the case of a lossy channel. An erasure code divides a blob (or message) of $S$ bytes into $k$ symbols (bitstrings of fixed length $\sim S/k$), which are then encoded to form a longer message of $N$ symbols, such that the original blob can be recovered from any subset of $k'$ of the $N$ symbols. The ratio $k/N$ is called the code rate.
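As a toy illustration of these definitions (and not the code used in Walrus), the following Python sketch implements a rate-$2/3$ erasure code: $k = 2$ source symbols plus one XOR parity symbol, such that any $k' = 2$ of the $N = 3$ parts recover the message:

```python
# A minimal erasure-code illustration (not RaptorQ): split a message into
# k = 2 source symbols and add one XOR parity symbol, giving N = 3 parts.
# Any k' = 2 of the 3 parts suffice to recover the original message.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blob: bytes) -> list[bytes]:
    half = len(blob) // 2
    s0, s1 = blob[:half], blob[half:]
    return [s0, s1, xor_bytes(s0, s1)]

def decode(parts: dict[int, bytes]) -> bytes:
    # parts maps symbol index -> symbol; any two entries are enough.
    if 0 in parts and 1 in parts:
        return parts[0] + parts[1]
    if 0 in parts:                                   # recover s1 = s0 XOR parity
        return parts[0] + xor_bytes(parts[0], parts[2])
    return xor_bytes(parts[1], parts[2]) + parts[1]  # recover s0

blob = b"redstuf!"                     # 8 bytes, i.e. two 4-byte symbols
s0, s1, p = encode(blob)
assert decode({1: s1, 2: p}) == blob   # symbol 0 lost, still recoverable
```

Losing any single part still leaves two parts, which is exactly the recovery threshold of this code; losing two parts would make the message unrecoverable.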

Fountain codes

Fountain codes are a class of erasure codes. The key property of fountain codes is that the encoding process is rateless, meaning the encoder can produce an arbitrary number of encoded parts without knowing in advance the total number of parts that will be produced. This is useful for the RedStuff use case, as it allows you to choose the rate of the encoder. For example, by encoding $f+1$ source symbols into $N$ recovery symbols, you guarantee that any subset of $f+1$ symbols can reconstruct the source. Fountain codes are also extremely efficient, as they typically require only XOR operations to encode and decode data.

RaptorQ

RedStuff is based on the RaptorQ fountain code. RaptorQ is one of the fastest and most efficient fountain codes, and has the following properties:

  1. It is systematic, meaning the first $k$ symbols of the encoded message correspond to the original message.
  2. It is a linear code, meaning the encoding process is a linear transformation of the input symbols; in other words, the encoded symbols are linear combinations of the input symbols.
  3. It is almost optimal, meaning that $k' \approx k$. Specifically, the probability of decoding failure when $k' = k + H$ symbols are received is $< 1/256^{H+1}$.
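To make the bound concrete, this short snippet evaluates $1/256^{H+1}$ for the first few values of $H$:

```python
# Evaluate the decoding-failure bound < 1/256^(H+1) for H = 0, 1, 2.
bounds = [256.0 ** -(h + 1) for h in range(3)]
print(bounds)  # about 3.9e-03, 1.5e-05, 6.0e-08
```

Even a single extra symbol beyond $k$ already pushes the failure probability below $2 \cdot 10^{-5}$.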

RedStuff encoding

An established approach in distributed storage is to use an erasure code to encode blobs of data across multiple storage nodes. By using a rate-$k/N$ erasure code for $N$ nodes and $k$ source symbols, the system can tolerate $N - k$ node failures with just an $N/k$ factor of storage overhead. However, in the case of a node failure, the recovery process is inefficient: the failed node needs to fetch $k$ other parts, reconstruct the blob, and then re-encode its own part. Therefore, the communication overhead for recovery is on the order of the size of the whole blob, $S$.

The Twin-Code framework solves this issue by allowing for efficient node recovery in erasure-coded storage systems. The RedStuff encoding algorithm is an adaptation of this framework; this section briefly describes how it is used in RedStuff. For specific details, refer to the original paper [1].

Consider a scenario in which a blob of data is encoded and stored across $N$ shards (multiple shards can be mapped to the same storage node) in a Byzantine setting. Up to $f$ of the shards can be corrupted by an adversary, with $f < N/3$, and the remaining $N - f$ shards are honest.

Encoding and recovery

The RedStuff encoding and recovery process works as follows:

  • First, the data blob of size $S$ is divided into symbols and arranged in a rectangular message matrix of up to $N - 2f$ rows and $N - f$ columns of symbols. The number of rows ($n_R$) and columns ($n_C$) is fixed, and determines the symbol size $s$ as follows:

$$s = \left\lceil S / (n_R \cdot n_C) \right\rceil$$

  • Then, the columns and the rows of the message matrix are encoded separately with RaptorQ.
    • The primary encoding, performed on columns, expands the $n_R$ symbols of each column to $N$ symbols. The rateless nature of RaptorQ allows you to choose the number of encoded symbols.
    • The secondary encoding, performed on rows, expands the $n_C$ symbols of each row to $N$ symbols.
  • $n_R$ is also called the number of primary source symbols, and $n_C$ the number of secondary source symbols. The primary encoding has rate $n_R / N$, and the secondary encoding has rate $n_C / N$.
  • The encoded rows and columns are then used to obtain primary and secondary slivers, which are distributed to shards and used for blob reconstruction and sliver recovery:
    • Primary slivers are the rows of the matrix of size $N \times n_C$ obtained with the primary encoding of the message matrix. Each primary sliver is therefore composed of $n_C$ symbols.
    • Secondary slivers are the columns of the matrix of size $n_R \times N$ obtained with the secondary encoding of the message matrix. Each secondary sliver is therefore composed of $n_R$ symbols.
  • Each shard receives a primary and a secondary sliver, based on the shard number and the row and column numbers of the slivers. See the section on sliver-to-shard mapping for more details.
  • The fundamental property achieved with this construction, thanks to the linearity of RaptorQ, is that encoding the primary slivers (as rows) with the secondary encoding and the secondary slivers (as columns) with the primary encoding results in the same $N \times N$ expanded message matrix. This property enables lost sliver recovery:
    • To reconstruct a lost primary sliver, a shard can request $N-f$ symbols from the encodings of the secondary slivers of other shards. Because the primary encoding of secondary slivers yields the symbols of the primary slivers, and the secondary encoding has $n_C$ source symbols with $n_C \leq N-f$, the shard can decode the original primary sliver from the obtained recovery symbols with high probability. See the discussion on recovery probability for more details.
    • The reconstruction of secondary slivers is identical, but with the roles of primary and secondary slivers and encodings inverted.
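The first step above can be sketched in Python as follows. The row-major symbol placement, the zero padding, and the omission of the safety limit $\sigma$ are simplifying assumptions for illustration; the document does not specify these details:

```python
import math

def encoding_params(n_shards: int) -> tuple[int, int, int]:
    # f, and the matrix dimensions n_R and n_C (sigma ignored for simplicity).
    f = (n_shards - 1) // 3
    n_rows = n_shards - 2 * f      # primary source symbols
    n_cols = n_shards - f          # secondary source symbols
    return f, n_rows, n_cols

def message_matrix(blob: bytes, n_shards: int) -> list[list[bytes]]:
    _, n_rows, n_cols = encoding_params(n_shards)
    symbol_size = math.ceil(len(blob) / (n_rows * n_cols))
    # Pad with zero bytes so the blob fills the matrix exactly (assumption).
    padded = blob.ljust(n_rows * n_cols * symbol_size, b"\0")
    symbols = [padded[i:i + symbol_size]
               for i in range(0, len(padded), symbol_size)]
    # Row-major layout: row r, column c holds symbol r * n_cols + c.
    return [symbols[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

m = message_matrix(b"a" * 15, 7)   # N = 7: a 3 x 5 matrix of 1-byte symbols
assert len(m) == 3 and len(m[0]) == 5
```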

The following example concretely shows this process.

Worked example

Encoding

Consider a Walrus instance with $N = 7 = 3f + 1$ shards. This means the number of primary source symbols is $N - 2f = 3$, and the number of secondary source symbols is $N - f = 5$. A blob of size $S = 15 \cdot s$ can therefore be divided into 15 symbols of size $s$, and arranged in the matrix as follows.

$$\left[ \begin{array}{ccccc} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} & s_{0,4} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} & s_{1,4} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} & s_{2,4} \\ \end{array} \right]$$

Then, the primary encoding acts on the columns of the matrix, expanding them such that each column is composed of 3 source symbols and 4 recovery symbols ($s_{i,j}$ indicates source symbols, while $r_{i,j}$ indicates recovery symbols).

$$\left[ \begin{array}{c|c|c|c|c} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} & s_{0,4} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} & s_{1,4} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} & s_{2,4} \\ \color{blue} r_{3,0} & \color{blue} r_{3,1} & \color{blue} r_{3,2} & \color{blue} r_{3,3} & \color{blue} r_{3,4} \\ \color{blue} r_{4,0} & \color{blue} r_{4,1} & \color{blue} r_{4,2} & \color{blue} r_{4,3} & \color{blue} r_{4,4} \\ \color{blue} r_{5,0} & \color{blue} r_{5,1} & \color{blue} r_{5,2} & \color{blue} r_{5,3} & \color{blue} r_{5,4} \\ \color{blue} r_{6,0} & \color{blue} r_{6,1} & \color{blue} r_{6,2} & \color{blue} r_{6,3} & \color{blue} r_{6,4} \\ \end{array} \right]$$

Each of the rows of this column expansion is a primary sliver. For example, the 5th primary sliver is $[r_{5,0}, r_{5,1}, r_{5,2}, r_{5,3}, r_{5,4}]$.

Similarly, the secondary encoding on the rows of the matrix produces the expanded rows.

$$\left[ \begin{array}{ccccccc} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} & s_{0,4} & \color{blue} r_{0,5} & \color{blue} r_{0,6} \\ \hline s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} & s_{1,4} & \color{blue} r_{1,5} & \color{blue} r_{1,6} \\ \hline s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} & s_{2,4} & \color{blue} r_{2,5} & \color{blue} r_{2,6} \\ \end{array} \right]$$

Each of the columns of this row expansion is a secondary sliver. For example, $[r_{0,6}, r_{1,6}, r_{2,6}]$.

The $i$th sliver pair is composed of the $i$th primary and the $i$th secondary sliver. For simplicity, consider that the $i$th sliver pair is stored on shard $i$. The sliver-pair-to-shard mapping section discusses the full mapping.

Thanks to the linearity of RaptorQ, the expansion of:

  • the recovery secondary slivers (columns 5 and 6) with the primary encoding, and
  • the recovery primary slivers (rows 3, 4, 5, and 6) with the secondary encoding,

results in the same set of symbols, which is essential for recovery. These symbols can be represented as the lower-right quadrant of what is called the fully expanded message matrix.

$$\left[ \begin{array}{ccccc|cc} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} & s_{0,4} & r_{0,5} & r_{0,6} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} & s_{1,4} & r_{1,5} & r_{1,6} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} & s_{2,4} & r_{2,5} & r_{2,6} \\ \hline r_{3,0} & r_{3,1} & r_{3,2} & r_{3,3} & r_{3,4} & \color{blue} r_{3,5} & \color{blue} r_{3,6} \\ r_{4,0} & r_{4,1} & r_{4,2} & r_{4,3} & r_{4,4} & \color{blue} r_{4,5} & \color{blue} r_{4,6} \\ r_{5,0} & r_{5,1} & r_{5,2} & r_{5,3} & r_{5,4} & \color{blue} r_{5,5} & \color{blue} r_{5,6} \\ r_{6,0} & r_{6,1} & r_{6,2} & r_{6,3} & r_{6,4} & \color{blue} r_{6,5} & \color{blue} r_{6,6} \\ \end{array} \right]$$

These symbols do not need to be stored on any node, because they can always be recomputed by expanding either a primary or a secondary sliver. For example, $r_{4,5}$ can be obtained by:

  • the secondary-encoding expansion of the 4th primary sliver: $[r_{4,0}, r_{4,1}, r_{4,2}, r_{4,3}, r_{4,4}, \color{blue} r_{4,5}, r_{4,6}]$, or
  • the primary-encoding expansion of the 5th secondary sliver: $[r_{0,5}, r_{1,5}, r_{2,5}, r_{3,5}, \color{blue} r_{4,5}, r_{5,5}, r_{6,5}]$.
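Because this property relies only on linearity, it can be demonstrated with a toy linear code in place of RaptorQ. In the sketch below, each "encoding" appends a single XOR parity symbol to a $2 \times 2$ message matrix, and expanding columns-then-rows yields the same fully expanded matrix as expanding rows-then-columns:

```python
# Commutativity of the two expansions for a toy linear code: each encoding
# appends one XOR parity symbol (RaptorQ is linear too, which is why the
# same property holds in RedStuff).

def extend(symbols: list[int]) -> list[int]:
    parity = 0
    for s in symbols:
        parity ^= s
    return symbols + [parity]

matrix = [[1, 2], [3, 4]]                     # 2 x 2 message matrix

# Columns first, then rows of the expanded matrix:
cols = [extend([row[c] for row in matrix]) for c in range(2)]
by_cols = [extend([cols[c][r] for c in range(2)]) for r in range(3)]

# Rows first, then columns, transposed back to row-major order:
rows = [extend(row) for row in matrix]
by_rows_t = [extend([rows[r][c] for r in range(2)]) for c in range(3)]
by_rows = [[by_rows_t[c][r] for c in range(3)] for r in range(3)]

assert by_cols == by_rows                     # same fully expanded matrix
```

The corner symbol (the analogue of $r_{4,5}$ above) is the parity of all four source symbols, regardless of which expansion produced it.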

Recovery

Consider that shard 3 fails, losing its slivers, and needs to recover them. In the following matrices, the symbols of the lost slivers are highlighted in red (the lower-right quadrant is never stored).

$$\left[ \begin{array}{ccccc|cc} s_{0,0} & s_{0,1} & s_{0,2} & \color{red} s_{0,3} & s_{0,4} & r_{0,5} & r_{0,6} \\ s_{1,0} & s_{1,1} & s_{1,2} & \color{red} s_{1,3} & s_{1,4} & r_{1,5} & r_{1,6} \\ s_{2,0} & s_{2,1} & s_{2,2} & \color{red} s_{2,3} & s_{2,4} & r_{2,5} & r_{2,6} \\ \hline \color{red} r_{3,0} & \color{red} r_{3,1} & \color{red} r_{3,2} & \color{red} r_{3,3} & \color{red} r_{3,4} & & \\ r_{4,0} & r_{4,1} & r_{4,2} & r_{4,3} & r_{4,4} & & \\ r_{5,0} & r_{5,1} & r_{5,2} & r_{5,3} & r_{5,4} & & \\ r_{6,0} & r_{6,1} & r_{6,2} & r_{6,3} & r_{6,4} & & \\ \end{array} \right]$$

To recover its primary sliver, the node contacts 5 other shards and requests the recovery symbols for the 3rd primary sliver. Because the symbols of this sliver are recovery symbols, the shards need to encode their secondary slivers (highlighted as columns) to obtain them. For example, shards 0, 1, 2, 4, and 6 provide the symbols:

$$\left[ \begin{array}{c|c|c|c|c|c|c} s_{0,0} & s_{0,1} & s_{0,2} & \color{red} s_{0,3} & s_{0,4} & r_{0,5} & r_{0,6} \\ s_{1,0} & s_{1,1} & s_{1,2} & \color{red} s_{1,3} & s_{1,4} & r_{1,5} & r_{1,6} \\ s_{2,0} & s_{2,1} & s_{2,2} & \color{red} s_{2,3} & s_{2,4} & r_{2,5} & r_{2,6} \\ \color{green} r_{3,0} & \color{green} r_{3,1} & \color{green} r_{3,2} & & \color{green} r_{3,4} & & \color{green} r_{3,6}\\ \end{array} \right]$$

To recover the secondary sliver, the node contacts at least 3 other shards to obtain recovery symbols. In this case, the recovery symbols are already part of the primary slivers (highlighted as rows) stored by the other shards, so no re-encoding is necessary. For example, shards 0, 1, and 5 provide the recovery symbols:

$$\left[ \begin{array}{ccccc} s_{0,0} & s_{0,1} & s_{0,2} & \color{green} s_{0,3} & s_{0,4} \\ \hline s_{1,0} & s_{1,1} & s_{1,2} & \color{green} s_{1,3} & s_{1,4} \\ \hline s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} & s_{2,4} \\ \hline \color{red} r_{3,0} & \color{red} r_{3,1} & \color{red} r_{3,2} & \color{red} r_{3,3} & \color{red} r_{3,4} \\ \hline r_{4,0} & r_{4,1} & r_{4,2} & r_{4,3} & r_{4,4} \\ \hline r_{5,0} & r_{5,1} & r_{5,2} & \color{green} r_{5,3} & r_{5,4} \\ \hline r_{6,0} & r_{6,1} & r_{6,2} & r_{6,3} & r_{6,4} \\ \end{array} \right]$$

In this case, the symbols $s_{0,3}$, $s_{1,3}$, and $s_{2,3}$ are already stored directly in the primary slivers of shards 0, 1, and 2. Therefore, by requesting them from those shards, shard 3 can recover its secondary sliver without any decoding.

Properties and observations

Why the rectangular layout?

The rectangular layout of the message matrix is an optimization for the Byzantine setting. When storing the blob, a client can only await $N - f$ responses, as the remaining $f$ shards might be Byzantine. Yet, $f$ of these $N - f$ responders might themselves be the Byzantine ones, while the $f$ that did not reply were merely slow due to network asynchrony. Therefore, the blob needs to be encoded such that $N - 2f$ symbols are sufficient to recover the original blob. This is achieved by the primary encoding.

After the initial sharing phase, the honest shards share and reconstruct the missing slivers among each other. At steady state, you can therefore always assume that $N - f$ honest shards are in possession of their slivers. This allows the secondary encoding to have a higher rate, $(N-f)/N$, decreasing the storage overhead while maintaining the same fault-tolerance properties.

Worst case initial sharing

The following describes how the $N-f$ honest shards can obtain their sliver pairs in the worst case outlined above, where the client shares the slivers with $N-f$ shards, $f$ of which are Byzantine and drop them.

  1. The $N-2f$ honest shards receive their sliver pairs.
  2. The remaining $f$ honest shards are notified of the stored blob, such as through the chain, and start the process to recover their sliver pairs.
  3. First, they recover their secondary slivers, as these can be decoded from $N-2f$ recovery symbols.
  4. Then, once all $N-f$ honest shards have their secondary slivers, they can start recovering the primary slivers, which require $N-f$ recovery symbols.
  5. All honest shards now have their sliver pairs.

Storage overhead

Assume for simplicity that $N=3f+1$. Then, the original blob is divided into roughly $f \cdot 2f = 2f^2$ symbols. The system stores $N \cdot 2f$ primary sliver symbols and $N \cdot f$ secondary sliver symbols, for a total storage of about $9f^2$ symbols.

Therefore, the storage overhead due to RedStuff encoding is about $9f^2 / 2f^2 = 4.5$ times the original blob size.
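This back-of-the-envelope figure can be checked with the exact symbol counts, keeping the terms the approximation drops; $f = 1000$ here is an arbitrary illustrative value:

```python
# Exact symbol counts for N = 3f + 1, without the approximations above.
f = 1000
N = 3 * f + 1
source_symbols = (N - 2 * f) * (N - f)            # message matrix, ~2f^2
stored_symbols = N * (N - f) + N * (N - 2 * f)    # N primary + N secondary slivers
overhead = stored_symbols / source_symbols
print(round(overhead, 3))                          # close to 4.5
```

The exact ratio approaches 4.5 from below as $f$ grows.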

Differences with the Twin-Code framework

The key modifications in RedStuff, compared to the original Twin-Code framework, are the following:

  • RedStuff uses the RaptorQ fountain code for both the Type 0 and Type 1 encodings, as they are called in the paper. The rates are about $(N-2f)/N$ and $(N-f)/N$, respectively. The Type 0 encoding is called the primary encoding, and the Type 1 encoding the secondary encoding.
  • The blob is not laid out in a square message matrix, but in a rectangular one. This is an optimization for the specific BFT setting described here.
  • Both the Type 0 and Type 1 encodings of the blob are stored on each shard. These are called slivers, and the two together form a sliver pair.

Walrus-specific parameters and considerations

Sliver-pair-to-shard mapping

The previous sections assumed that sliver pair ii is stored on shard ii. In practice, sliver pairs are mapped to shards in a pseudo-random fashion to ensure that the systematic slivers, which contain the original data, are not always stored on the same shard.

This is important because systematic slivers are the most frequently accessed, as they allow you to access the data without any decoding.

The mapping works as follows: each encoded blob is assigned a 32-byte pseudo-random blob ID. This ID is interpreted as an unsigned big-endian integer, and its remainder modulo $N$ is used as a rotation offset, such that sliver pair $i$ is stored on shard $(i + \text{offset}) \bmod N$.
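In Python, the rotation can be sketched as follows; the blob ID used here is a made-up example value, not a real Walrus blob ID:

```python
# Sliver-pair-to-shard rotation: interpret the 32-byte blob ID as a
# big-endian unsigned integer and rotate by its remainder modulo N.
N = 7
blob_id = bytes(range(32))                      # hypothetical example ID
offset = int.from_bytes(blob_id, "big") % N

def shard_for_sliver_pair(i: int) -> int:
    return (i + offset) % N

print([shard_for_sliver_pair(i) for i in range(N)])
```

Because the rotation is a bijection on shard indices, every shard still receives exactly one sliver pair; only the position of the systematic slivers changes per blob.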

Decoding probability and decoding safety limit

As mentioned above, the reconstruction failure probability of the RaptorQ code is $O(256^{-(H+1)})$, where $H$ is the number of extra symbols received. Therefore, in a system with $f$ Byzantine shards, the number of source symbols for the primary encoding should be slightly below $N-2f$, and for the secondary encoding slightly below $N-f$. This ensures that whenever a validity or quorum threshold of messages is received, there is always a positive $H$, and hence a low failure probability.

The following parameters are used in the encoding configuration:

  • $f$, the maximum number of Byzantine shards, is $\lfloor (N-1) / 3 \rfloor$.
  • The safety limit for the encoding, $\sigma$, is set as a function of $N$ to ensure a high reconstruction probability (see the table below).
  • The number of primary source symbols (equal to the number of symbols in a secondary sliver) is $N - 2f - \sigma$.
  • The number of secondary source symbols (equal to the number of symbols in a primary sliver) is $N - f - \sigma$.

Currently, $\sigma$ is selected depending on the number of shards as follows:

| N shards from | N shards to (incl.) | $\sigma$ | Prob. failure |
|---|---|---|---|
| 0 | 15 | 0 | 0.00391 |
| 16 | 30 | 1 | 1.53e-05 |
| 31 | 45 | 2 | 5.96e-08 |
| 46 | 60 | 3 | 2.33e-10 |
| 61 | 75 | 4 | 9.09e-13 |
| 76 | inf | 5 | 3.55e-15 |

For example, the following settings apply:

| N shards | f | $\sigma$ | Num primary | Num secondary |
|---|---|---|---|---|
| 7 | 2 | 0 | 3 | 5 |
| 10 | 3 | 0 | 4 | 7 |
| 31 | 10 | 2 | 9 | 19 |
| 100 | 33 | 5 | 29 | 62 |
| 300 | 99 | 5 | 97 | 196 |
| 1000 | 333 | 5 | 329 | 662 |
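Both tables can be reproduced from the rules above with a few lines of Python (the $\sigma$ schedule is the one from the first table):

```python
# Recompute f, sigma, and the primary/secondary source-symbol counts.
def params(n_shards: int) -> tuple[int, int, int, int]:
    f = (n_shards - 1) // 3
    # Safety limit sigma from the preceding table:
    for upper, sigma in [(15, 0), (30, 1), (45, 2), (60, 3), (75, 4)]:
        if n_shards <= upper:
            break
    else:
        sigma = 5                  # 76 shards and above
    return f, sigma, n_shards - 2 * f - sigma, n_shards - f - sigma

assert params(7) == (2, 0, 3, 5)
assert params(31) == (10, 2, 9, 19)
assert params(1000) == (333, 5, 329, 662)
```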

Blob size limits

In RaptorQ, the size of a symbol is encoded as a 16-bit integer. Therefore, the maximum size of a symbol is $2^{16} - 1 = 65535$ bytes. At a minimum, a symbol must be at least 1 byte.

Because the blob is encoded in the rectangular message matrix, the blob size is bounded above by source_symbols_primary * source_symbols_secondary * u16::MAX bytes and below by source_symbols_primary * source_symbols_secondary bytes. A few examples for the same configurations as above:

| N shards | Min blob size | Max blob size | Min encoded blob size | Max encoded blob size |
|---|---|---|---|---|
| 7 | 15.0 B | 983 KB | 56.0 B | 3.67 MB |
| 10 | 28.0 B | 1.83 MB | 110 B | 7.21 MB |
| 31 | 171 B | 11.2 MB | 868 B | 56.9 MB |
| 100 | 1.80 KB | 118 MB | 9.10 KB | 596 MB |
| 300 | 19.0 KB | 1.25 GB | 87.9 KB | 5.76 GB |
| 1000 | 218 KB | 14.3 GB | 991 KB | 64.9 GB |
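A minimal sketch of how the unencoded bounds follow from the symbol-count and symbol-size limits; the helper function is illustrative, not part of the Walrus API:

```python
# Blob size bounds: every symbol is between 1 byte and u16::MAX bytes.
U16_MAX = 2**16 - 1

def blob_size_limits(primary_symbols: int, secondary_symbols: int):
    n_symbols = primary_symbols * secondary_symbols
    return n_symbols, n_symbols * U16_MAX

min_size, max_size = blob_size_limits(3, 5)   # N = 7: a 3 x 5 message matrix
print(min_size, max_size)                      # 15 B and ~983 KB
```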

Sliver authentication, blob metadata, and the blob ID

Alongside the efficient encoding performed by RedStuff, shards need to be able to authenticate that the slivers and encoding symbols they receive belong to the blob they requested. This section briefly outlines how this is achieved.

For each sliver, primary or secondary, a Merkle tree is constructed.

The tree is constructed over all NN symbols of the fully expanded sliver. The root node of the Merkle tree (the sliver hash) is included in the metadata for the blob. To prove that a symbol is part of a sliver, the prover supplies the symbol alongside the Merkle path to the root hash, which every node has as part of the metadata.

A Merkle tree over the sliver hashes is then computed to obtain a blob hash. This is computed by concatenating the primary and secondary sliver hashes of each sliver pair (the concatenations $c_i$), and then constructing the Merkle tree over these concatenations. This construction reduces the number of hashing operations compared to hashing each sliver Merkle root individually.

To prove that a sliver is part of a blob, it is sufficient to provide the Merkle path to the root.

Finally, the encoding type tag (representing the RedStuff version or alternative encoding), the length of the blob before encoding, and the Merkle root of the tree over the slivers are hashed together to obtain the blob ID.
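The overall hashing structure can be sketched as follows. The choice of SHA-256, the duplicate-last-node rule for odd Merkle levels, and the exact field serialization are illustrative assumptions, not the actual Walrus wire format:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle tree; duplicates the last node on odd levels
    (an illustrative convention, not necessarily the one Walrus uses)."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def blob_id(primary_slivers, secondary_slivers, encoding_type: int,
            unencoded_len: int) -> bytes:
    # c_i: concatenated primary and secondary sliver hashes of pair i.
    pair_hashes = [merkle_root(p) + merkle_root(s)
                   for p, s in zip(primary_slivers, secondary_slivers)]
    blob_hash = merkle_root(pair_hashes)
    # Hash the encoding type tag, the unencoded length, and the blob hash.
    return hashlib.sha256(bytes([encoding_type])
                          + unencoded_len.to_bytes(8, "big")
                          + blob_hash).digest()

# Toy slivers for N = 3 shards, each sliver given as a list of symbols.
primary = [[b"p0a", b"p0b"], [b"p1a", b"p1b"], [b"p2a", b"p2b"]]
secondary = [[b"s0"], [b"s1"], [b"s2"]]
bid = blob_id(primary, secondary, encoding_type=0, unencoded_len=6)
assert len(bid) == 32
```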

Metadata overhead

Each storage node stores the full metadata for the blob. The metadata consists of:

  • A 32 B Merkle root hash for each primary and secondary sliver.
  • The 32 B blob ID, computed as above.
  • The erasure code type (1 B).
  • The length of the unencoded blob (8 B).

The hashes for the primary and secondary slivers can be a considerable overhead when the number of shards is high. The following table shows the cumulative size of the hashes stored on the system, depending on the number of nodes and shards.

| N shards | One node | floor(N/floor(log2(N))) nodes | N nodes |
|---|---|---|---|
| 7 | 448 B | 1.34 KB | 3.14 KB |
| 10 | 640 B | 1.92 KB | 6.40 KB |
| 31 | 1.98 KB | 13.9 KB | 61.5 KB |
| 100 | 6.40 KB | 102 KB | 640 KB |
| 300 | 19.2 KB | 710 KB | 5.76 MB |
| 1000 | 64.0 KB | 7.10 MB | 64.0 MB |
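The "One node" column is simply $2N$ sliver hashes of 32 B each, and the other columns multiply this by the node count. A quick check:

```python
# Cumulative sliver-hash size: one 32 B Merkle root per primary and
# secondary sliver, i.e. 2N hashes, stored in full by every node.
def sliver_hashes_size(n_shards: int, n_nodes: int = 1) -> int:
    return 2 * n_shards * 32 * n_nodes

assert sliver_hashes_size(7) == 448                   # N = 7, one node
assert sliver_hashes_size(1000) == 64_000             # 64.0 KB per node
assert sliver_hashes_size(1000, 1000) == 64_000_000   # 64.0 MB across N nodes
```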

The cumulative size of the hashes in the case of 1000 shards and 1000 nodes (1 node per shard) is 64 KB per node, or 64 MB for the whole system. The number of shards is fixed and constant, while the number of nodes may vary (each node holds 1 or more shards), potentially lowering the overhead on the system. The following table shows the ratio between the size of the hashes stored on the system and the minimum and maximum blob sizes, for N = 1000 shards and different numbers of nodes (1 node, floor(N/floor(log2(N))) = 111 nodes, and 1000 nodes).

| N = 1000 | Total metadata size | Factor min blob | Factor max blob | Factor min encoded blob | Factor max encoded blob |
|---|---|---|---|---|---|
| Single node | 64.0 KB | 0.294 | 4.48e-06 | 0.0646 | 9.85e-07 |
| floor(N/floor(log2(N))) nodes | 7.10 MB | 32.6 | 0.000498 | 7.17 | 0.000109 |
| N nodes | 64.0 MB | 294 | 0.00448 | 64.6 | 0.000985 |

For realistic node counts and small blob sizes, the total metadata overhead can be a multiple of the size of the initial unencoded blob. For larger blob sizes, the overhead is negligible.


Reference

[1] K. V. Rashmi, N. B. Shah and P. V. Kumar, "Enabling node repair in any erasure code for distributed storage," 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 2011, pp. 1235-1239, doi: 10.1109/ISIT.2011.6033732.