This file implements the delta file block reuse algorithm. More...

#include "mediablocklist.h"
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <iostream>
#include <fstream>
#include <zypp-core/base/Logger.h>
#include <zypp-core/base/String.h>
#include <zypp-core/AutoDispose.h>
#include <zypp-core/base/Exception.h>

Include dependency graph for mediablocklist.cc:

Go to the source code of this file.

Namespaces
namespace	zypp
	Easy-to use interface to the ZYPP dependency resolver.

namespace	zypp::media

Macros
#define	UPDATE_RSUM(a, b, oldc, newc, bshift) do { (a) += ((unsigned char)(newc)) - ((unsigned char)(oldc)); (b) += (a) - ((oldc) << (bshift)); } while (0)

Functions
static size_t	zypp::media::fetchnext (FILE fp, unsigned char bp, size_t blksize, size_t pushback, unsigned char *pushbackp)

Detailed Description

This file implements the delta file block reuse algorithm.

If the available hashes use rolling checksum style hashes, we use the zsync algorithm to search for blocks that are available.

The basic idea of zsync is that on the server side a metadata file is generated, this file contains a list of checksums for each block in the target file ( the file we want to download ) as well as some other important informations like the size of the blocks, file checksum and mirrors. The client ( we ) then can use a delta file, basically a older version of the file we want to download, to find reusable blocks identified by their checksum.

A simple approach would be now to calculate all checksums in the delta file for each block, but that does not take into account that the blocks we want to reuse have moved in the file, e.g. a new set bytes of data was put in the beginning of the file, shifting all blocks to a new offset most likely not as a multiple of blocksize or worse not even a power of 2. Think a block is 2k but the data inserted is only 200 bytes, offsetting all blocks for 200bytes, in that case the simple approach could not reuse a single block of data because it can't find them.

To work around this problem the code calculates the checksums on every possible byte offset of the delta file. To speed this process up when finding a block we skip forward to the end of the block. If there is a match at x, its very unlikely to find another match starting inside the boundaries of the current block ( between x and x + blocksize - 1 ), except if the file has a lot of redundancy.

This process naturally can take very long if there are no, or just a few matching blocks in the two files. To speed this up a multi step checksum process is used. First we calculate a very cheap to calculate weak checksum for each block based on the Adler-32 checksum:

\[ a(k,l) = (\sum_{i=k}^l X_i) \bmod M \]

\[ \begin{align} b(k,l) &= (\sum_{i=k}^l (l-i+1)X_i) \bmod M \\ &= (\sum_{i=k}^l a(k,i) ) \bmod M \end{align} \]

\[ s(k,l) = a(k,l) + 2^{16} b(k,l) \]

where \(s(k, l)\) is the rolling checksum of the bytes \(X_k \ldots X_l\). For simplicity and speed, we use \(M = 2^{16}\).

The important property of this checksum is that successive values can be computed very efficiently using recurrence relations:

\[ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M \]

\[ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M \]

To understand why we can use recurrence relations its important to look at the formulas written out for a simple example, we will ignore the modulo operation for simplicity:

For the a formula is simple to see that we just need to subtract the value of the byte moving out of the block and add the value of the new byte coming in at the end

\[ \begin{align} a(0,2) &= X_0 + X_1 + X_2 \\ a(1,3) &= a(0,2) - X_0 + X_3 \\ &= X_0 + X_1 + X_2 - X_0 + X_3 \\ &= X_1 + X_2 + X_3 \end{align} \]

The b part is a bit more complex, writing the formula out also shows us why b can be expressed as: \(b=(\sum_{i=k}^l a(k,i) ) \bmod M\):

\[ \begin{align} b(0,2) &= 3X_0 + 2X_1 + X_2 \\ &= X_0 + X_0 +X_0 + X_1 + X_1 + X_2 \\ &= X0 + X0 + X1 + X0 +X1 +X2 \\ &= a(0,0) + a(0,1) + a(0,2) = (\sum_{i=k}^l a(k,i) ) \end{align} \]

\[ \begin{align} b(1,3) &= b(0,2) - (2-0+1)X_0 + a(1,3) \\ &= b(0,2) - 3X_0 + a(1,3) \\ &= 3X_0 + 2X_1 + X_2 - 3X_0 + X_1 + X_2 + X_3 \\ &= 3X_1 + 2X_2 + X_3 \end{align} \]

This shows us that we can very quickly calculate the new checksum of a block at offset i+1 based on the checksum of the block at offset i.

If the weak checksum matches, a stronger MD4 checksum is calculated for the same block to make sure we really got the block we wanted. If the strong checksum matches as well the block is written to the target file and removed from the list of blocks we still need to find. The read offset then jumps to the end of the current bock: offset+blocksize and continues the process from there.

Zsync additionally can require that always a sequence of 2 neighboring blocks need to match before they are considered a match. This depends on the filesize and is specified in the zsync metadata file. According to the zsync docs this greatly lowers the probability of a false match and allows to send smaller checksums for the blocks, minimizing the data we need to download in the meta datafile.

More in depth docs can be found at http://zsync.moria.org.uk/paper and https://rsync.samba.org/tech_report

Definition in file mediablocklist.cc.

Macro Definition Documentation

◆ UPDATE_RSUM

#define UPDATE_RSUM	(	a,
		b,
		oldc,
		newc,
		bshift
	)	do { (a) += ((unsigned char)(newc)) - ((unsigned char)(oldc)); (b) += (a) - ((oldc) << (bshift)); } while (0)

Definition at line 145 of file mediablocklist.cc.

Variable Documentation

◆ a

unsigned short a = 0

Definition at line 125 of file mediablocklist.cc.

◆ b

unsigned short b = 0

Definition at line 126 of file mediablocklist.cc.

Namespaces

Macros

Functions

Detailed Description

Macro Definition Documentation

◆ UPDATE_RSUM

Variable Documentation

◆ a

◆ b