Main » Projects » delta patching, bsdiff edition
Pages: 1
Posted on 19-09-18, 12:32
Full mod

Post: #342 of 420
Since: 10-30-18

Last post: 1 day
Last view: 8 hours
byuu's been tinkering recently with beat, his patch tool for the BPS file format, and I've been testing it out. The big change in the recent versions is that instead of brute-force searching for common parts in the old and new files, it uses a suffix array (and other things) to quickly and efficiently find the best matches. I've been benchmarking it against other tools, but although it's an improvement, it's still nowhere near as efficient as bsdiff.

I've tried before to understand how bsdiff manages to be so efficient, and failed every time. Partially this is because I was reading the doctoral thesis containing "a far more sophisticated algorithm", which is hella dense. I'm not sure if I noticed it before, but the bsdiff webpage also links to a much smaller paper titled "Naïve differences of executable code", which describes a much simpler algorithm. Probably you should go read it yourself to understand it properly, but for practice I want to try describing it in my own words, and I figured I'd do it publicly so Google can find it and maybe help somebody out later.

Like modern beat, bsdiff starts with suffix arrays. Unlike beat, it doesn't just start at the beginning of the target and blindly encode every byte as best it can. Instead, it maps out the common chunks of each file and analyses them.
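As a toy illustration (my own Python sketch, nothing like beat's actual implementation, which has to be far more careful about memory and speed): a suffix array is just the source's suffix start positions in sorted order, and finding where a target chunk occurs in the source becomes a binary search.

```python
# Toy sketch: build a suffix array by sorting suffix start positions.
# Real tools use O(n) or O(n log n) construction; this is fine for small inputs.
def build_suffix_array(s: bytes) -> list:
    return sorted(range(len(s)), key=lambda i: s[i:])

# Binary-search the suffix array for the first suffix starting with `chunk`.
def find_match(s: bytes, sa: list, chunk: bytes) -> int:
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(chunk)] < chunk:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s[sa[lo]:sa[lo] + len(chunk)] == chunk:
        return sa[lo]
    return -1  # no match anywhere in the source

source = b"aaabbcccccccdddeeeeeff"
sa = build_suffix_array(source)
print(find_match(source, sa, b"ccccccc"))  # 5
print(find_match(source, sa, b"eeeee"))    # 15
```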

Let's look at an example:

source: aaabbcccccccdddeeeeeff
target: XcccccccYYYeeeeeZZZaaa


The content "aaa" is the same between source and target, as is the content "ccc..." and "eee...". However, the target replaces "bb" with "X", "ddd" with "YYY", "ff" with "ZZZ", and moves the "aaa" block from the beginning to the end.

A tool like beat would encode this diff with instructions like this:

- write "X"
- copy 7 bytes from source offset 5
- write "YYY"
- copy 5 bytes from source offset 15
- write "ZZZ"
- copy 3 bytes from source offset 0
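Applying those instructions is mechanical. Here's a toy Python sketch (real BPS instructions store relative, varint-encoded offsets, so this is only the idea):

```python
# Minimal interpreter for the two instruction kinds above.
def apply_patch(source: bytes, instructions) -> bytes:
    target = bytearray()
    for op, *args in instructions:
        if op == "write":                     # write literal bytes
            target += args[0]
        elif op == "copy":                    # (length, source_offset)
            length, offset = args
            target += source[offset:offset + length]
    return bytes(target)

source = b"aaabbcccccccdddeeeeeff"
patch = [
    ("write", b"X"),
    ("copy", 7, 5),    # "ccccccc" starts at source offset 5
    ("write", b"YYY"),
    ("copy", 5, 15),   # "eeeee" starts at source offset 15
    ("write", b"ZZZ"),
    ("copy", 3, 0),    # "aaa" starts at source offset 0
]
print(apply_patch(source, patch))  # b'XcccccccYYYeeeeeZZZaaa'
```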

However, bsdiff is a bit smarter. There's not much that bsdiff can do about "aaa" moving, or the changed content in "X" or "ZZZ", so it handles them in much the same way as beat would. However, it treats "ccc...", "YYY" and "eee..." specially. You see, the important thing about "ccc..." and "eee..." is not only that they exactly correspond between the source and target, but that they are the same distance apart in the source and target.

When a program is updated, usually only a small part of it is updated. Most of the code remains the same, and therefore the compiler generally produces the same instructions. However, small changes often ripple through the entire program - if a data structure gains an extra field, making it 16 bytes large instead of 12, a whole bunch of 12s will change to 16s throughout the code even if the surrounding code stays the same. The key point is this: when an instruction stream like "load address; add 12; push" changes to "load address; add 16; push", the parts that stay the same stay the same distance apart, and so these kinds of changes get picked up by the "same distance apart" logic described in the previous paragraph.

Here's the clever bit: when bsdiff detects a sequence of same-distance-apart matches, it doesn't record a "copy the source" or "write literal bytes" instruction, it records a "subtraction" instruction. It takes the whole consistent-length sequence, mismatches included, subtracts the source from the target, and records the subtracted data in the patch. If our source is "load address; add 12; push" and our target is "load address; add 16; push", the result of the subtraction might be 0, 0, 0, 0, 0, 0, 4, 0, 0. That is, since a whole bunch of 12s got changed to 16s, the subtraction is going to produce a whole bunch of 0s and 4s. By itself, that's not much of an improvement, but the thing is that this kind of encoding compresses really, really well.
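Here's a minimal sketch of that subtraction, illustrative only (bsdiff's real control/diff/extra block layout is more involved):

```python
# Byte-wise subtraction (mod 256) over a same-distance region. Matching bytes
# subtract to 0; only the genuinely changed bytes produce non-zero deltas.
old = bytes([0xAD, 0x00, 0x10, 12, 0x48])  # pretend: load address; add 12; push
new = bytes([0xAD, 0x00, 0x10, 16, 0x48])  # pretend: load address; add 16; push
delta = bytes((n - o) % 256 for o, n in zip(old, new))
print(list(delta))  # [0, 0, 0, 4, 0]

# Patch application is the inverse: add the delta back onto the source bytes.
restored = bytes((o + d) % 256 for o, d in zip(old, delta))
assert restored == new
```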

So, the bsdiff encoding of our original patch might look something like this:

- write "X"
- add 15 bytes at source offset 5: 0, 0, 0, 0, 0, 0, 0, 245, 245, 245, 0, 0, 0, 0, 0 (each 245 being 'Y' minus 'd', mod 256)
- write "ZZZ"
- copy 3 bytes from source offset 0

The bsdiff file-format includes bzip2 compression, squeezing down those "add" operations significantly.
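You can see why with Python's standard bz2 module: a delta that's almost all zeros shrinks to almost nothing.

```python
import bz2

# A 10 KB delta that is mostly zeros, with a sprinkling of 4s where 12s
# became 16s -- the typical shape of bsdiff's subtraction output.
delta = bytearray(10_000)
delta[::500] = b"\x04" * 20   # 20 non-zero bytes, evenly spaced
compressed = bz2.compress(bytes(delta))
print(len(compressed))         # a few dozen bytes, down from 10,000
```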

The bad news is that because this "add" operation is a new and different thing, BPS patches can't take advantage of it.

The good news is that the kinds of changes bsdiff is optimised for are fairly specific to code compiled from a high-level language. Code written in assembly generally doesn't make changes like this because it's a total pain in the ass, and I imagine things like fan-translations generally leave the original code alone or replace it wholesale, and BPS was really designed for that scenario.

The ending of the words is ALMSIVI.
Posted on 19-09-19, 20:16 (revision 1)
Stirrer of Shit
Post: #631 of 717
Since: 01-26-19

Last post: 244 days
Last view: 242 days
What about using something like a rolling hash to find overlaps? Far simpler to implement, fast, and it should find them all.
Rsync uses it, and they presumably know what they're doing, as do you, so I must be missing something.

You take the file, and iterate over it in rolling n-byte windows.
When the hash of a window ends with k zero bits, you append that position to an array a, whose first element is 0 and whose last element is the index of the last byte in the file.
For a hash function you can either use a rolling hash, or just use a fast constant-length hash function and apply it n times with shifts - it wouldn't be too slow with AVX anyway.
Then, for each file, you iterate over its array a, hashing the slice of the file from a[i] to a[i+1] for each i up to the last element, and add the resulting hashes to a set.
You then compute the intersection of the two files' respective sets.
You then go back and expand the overlapping regions using some kind of linear search, also checking for false positives.
If you're interested in finding regions for which the 'subtract' operation is appropriate, an optimization is to use AVX-512 intrinsics, assuming a fairly recent processor.
To be more precise, do a rolling XOR working forwards and backwards from the overlapping region, and keep moving as long as popcount(a^b) (i.e. the Hamming distance) stays below some arbitrary threshold c.
Since there is no _mm512_popcnt_si512 intrinsic, you will have to compute e.g. horizontal_add(_mm512_popcnt_epi64(_mm512_xor_si512(a, b))) > c, but an alternative might be to simply compute something like _mm512_cmp_epi64_mask(_mm512_popcnt_epi64(_mm512_xor_si512(a, b)), _mm512_set1_epi64(c/4), _MM_CMPINT_LT).
Comparing with zero is a bit tricky, though.
You have _mm256_testc_si256, and that would work, but you need to either split your _mm512 up or do the whole process on _mm256 registers.
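The windowing/anchor-selection part of the above can be sketched in scalar Python (my own toy Rabin-Karp-style hash, no AVX, not rsync's actual hash):

```python
# Slide an n-byte window over the data with a polynomial rolling hash;
# positions where the hash ends in k zero bits become anchor points.
def anchors(data: bytes, n: int = 16, k: int = 4) -> list:
    BASE, MOD = 257, (1 << 61) - 1   # polynomial base, Mersenne-prime modulus
    if len(data) < n:
        return [0, len(data) - 1]
    top = pow(BASE, n - 1, MOD)      # weight of the byte that slides out
    h = 0
    for b in data[:n]:               # hash of the first window
        h = (h * BASE + b) % MOD
    out = [0]                        # the array always starts at 0...
    for i in range(n, len(data)):
        if h & ((1 << k) - 1) == 0:  # window hash ends in k zero bits
            out.append(i - n)        # record the window's start position
        h = ((h - data[i - n] * top) * BASE + data[i]) % MOD
    out.append(len(data) - 1)        # ...and ends at the last byte
    return out

print(len(anchors(b"hello world, hello world, hello again " * 40)))
```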

Anyway, you now have a list of matching blocks for each file, and you can define 'matching' as pretty much whatever you want due to the linear search.
The only false negatives are the common chunks smaller than (on average) n*2^k.
For values of k = 0 and n = 32 (256/8), we are talking ~10 cycles per hash, so ~320 cycles per 32 starting bytes.
Then there's moving the results into RAM - that's limited by memory bandwidth, but otherwise caps at about 1/16 of a cycle to move a uint16_t, assuming proper batching etc.
The fastest and easiest way to do set intersection is probably to sort both hash arrays and then just iterate over them both; whenever a 'gap' is found 'fast forward' the other sequence, and if you get two matching ones you put them in the new set.
This is all O(n) for hashes, but trying to count the cycles is folly because of all the cache effects and whatnot.
It still should be quite fast though if you do something like radix sort.
Maybe there is a better data structure to get the intersection of two sets as well - I know C++ has something.
Since the location also is needed, you might as well use 64-bit ints and mask them before checking equality; even if it's wasteful it's not exactly a tragedy to waste a few megabytes of ram.
For a memory-constrained application, you can get away with around 4*log2(file size) bits of memory per byte of input data in largest file, possibly shaving off or adding a few bits depending on the false positive rate tolerance.
But for bigger files, you can always increase the value of k - I can't imagine trying to find matching 32b chunks in an 8GB file to be a very productive nor space-saving endeavor.
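The sorted-merge intersection described a few lines up looks like this as a plain-Python sketch (sort both arrays, walk them in lockstep, fast-forward past gaps):

```python
# Intersect two hash arrays by sorting and merging. O(n log n) for the sorts,
# O(n) for the merge; a radix sort would make the whole thing O(n).
def intersect_sorted(xs, ys):
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:       # match: keep it, advance both
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:      # gap: fast-forward the smaller side
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([5, 1, 9, 3], [9, 2, 5, 7]))  # [5, 9]
```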

The downside is that it locks you to specific CPU setups, and if you want it to be fast actually quite recent ones.
Am I missing something else?

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.
Posted on 19-09-19, 22:34
Post: #91 of 168
Since: 11-01-18

Last post: 8 days
Last view: 4 hours
I'm not seeing the purpose of your post. It's a nice explanation of how rsync works, but I'm not seeing how it bears on Screwtape's contrasting of bsdiff and beat.
Posted on 19-09-20, 10:54

Post: #201 of 210
Since: 10-29-18

Last post: 357 days
Last view: 329 days
Hey guys, can we use auto-predict in network gaming?!
Posted on 19-09-20, 11:25
The Thirteenth Doctor

Post: #404 of 520
Since: 10-29-18

Last post: 8 days
Last view: 42 min.
Posted by Kakashi
Hey guys, can we use auto-predict in network gaming?!
Could you please not?
Posted on 19-09-20, 12:26
Stirrer of Shit
Post: #632 of 717
Since: 01-26-19

Last post: 244 days
Last view: 242 days
Posted by funkyass
I'm not seeing the purpose of your post. It's a nice explanation of how rsync works, but I'm not seeing how it bears on Screwtape's contrasting of bsdiff and beat.

By my understanding, both bsdiff and beat use suffix arrays. I am curious whether it would be possible to use hash sets instead - as I understand it, that should be faster and easier to implement. After all, tools like rdiff-backup use this same algorithm to encode a delta between two (big) files, so shouldn't it work for smaller ones too?

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.
Posted on 19-09-20, 12:48
Post: #92 of 168
Since: 11-01-18

Last post: 8 days
Last view: 4 hours
The vast majority of files that run through rsync would be small files. But bsdiff and beat are optimized for specific kinds of files, not just any file.
Posted on 19-09-20, 20:26

Post: #202 of 210
Since: 10-29-18

Last post: 357 days
Last view: 329 days
Didn't read, but I'm assuming he's assuming that rsync doesn't merely re-copy files wholesale when they've changed.
Posted on 19-09-20, 21:09
Custom title here

Post: #699 of 932
Since: 10-30-18

Last post: 2 days
Last view: 6 hours
What about nsync?

--- In UTF-16, where available. ---
Posted on 19-09-21, 12:38
Full mod

Post: #343 of 420
Since: 10-30-18

Last post: 1 day
Last view: 8 hours
Posted by sureanem
What about using something like a rolling hash to find overlaps?

xdelta3 uses rolling hashes, and does pretty well - but not as well as bsdiff.

The problem with using a hash to index the contents of the source is that it's not practical to record a hash for *every* offset (that would take even more memory than a suffix array) so you pick some subset of offsets (say, the ones whose hash has 5 trailing zero bits) and hope that that's enough to find all the matches. But it's not guaranteed; there's always a chance that one of the matches has 4 or 3 trailing zero bits, or even that there's matches smaller than the rolling-hash width.
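As a quick sanity check of that sampling rate (toy code, with CRC32 standing in for whatever hash a real tool uses): requiring k trailing zero bits keeps roughly 1 in 2^k offsets, so any match that never hashes to such an offset simply never gets indexed.

```python
import random
import zlib

random.seed(0)
data = bytes(random.randrange(256) for _ in range(100_000))
k, n = 5, 16
mask = (1 << k) - 1

# Count window positions whose hash ends in k zero bits.
kept = sum(1 for i in range(len(data) - n)
           if zlib.crc32(data[i:i + n]) & mask == 0)
print(kept / (len(data) - n))  # roughly 1/32, i.e. about 0.03
```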

If I ever write another BPS patcher, I'll start with a suffix array, and see if I can figure a way to get, say, 10 near-matches rather than the algorithmically "best" match. Because of the way BPS works, a not-quite-optimal match nearby can actually be more efficient than a slightly longer match further away. I still think there's a lot of room to make even more-efficient BPS patches.

The ending of the words is ALMSIVI.
Posted on 19-09-21, 23:14
Stirrer of Shit
Post: #633 of 717
Since: 01-26-19

Last post: 244 days
Last view: 242 days
Posted by Kakashi
Didn't read, but I'm assuming he's assuming that rsync doesn't merely re-copy files wholesale when they've changed.

They do do delta patching, so that is indeed the case.
Posted by CaptainJistuce
What about nsync?

I'm afraid I can't find anything about it online - do you have a link?
Posted by Screwtape
it's not practical to record a hash for *every* offset (that would take even more memory than a suffix array) so you pick some subset of offsets (say, the ones whose hash has 5 trailing zero bits) and hope that that's enough to find all the matches. But it's not guaranteed; there's always a chance that one of the matches has 4 or 3 trailing zero bits, or even that there's matches smaller than the rolling-hash width.

Isn't it? You're dealing with ROMs whose sizes number in the low megabytes; you can afford to be lavish with RAM if it's just for a few seconds. Your hash size should be roughly log2(rom size in bytes) bits, no matter how big the window (assuming you don't get collisions in the actual hashed data) - that uniquely identifies a position. So for a 512 KB ROM, that's 19 bits per byte, 2.4 times larger (1.2 MB) than the source - smaller than suffix arrays' 4B/B. For practical purposes, though, I assume you'd want to round it up to 64b/B, so that's 4 MB. Still no big deal, but I might be missing something.
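Spelling out that arithmetic (just the numbers from the paragraph above):

```python
import math

rom = 512 * 1024                        # 512 KB ROM
bits = math.ceil(math.log2(rom))        # bits to uniquely address one byte
table_mb = rom * bits / 8 / (1024 ** 2) # one entry per byte of source
print(bits, round(table_mb, 2))         # 19 bits per position, ~1.19 MB total
```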

I wouldn't think the small matches are all too useful - even if you can copy 4 bytes from offset X rather than writing them out as immediates, you'd likely have spent more data on encoding the offsets than you'd have just copying them straight out.

If I ever write another BPS patcher, I'll start with a suffix array, and see if I can figure a way to get, say, 10 near-matches rather than the algorithmically "best" match. Because of the way BPS works, a not-quite-optimal match nearby can actually be more efficient than a slightly longer match further away. I still think there's a lot of room to make even more-efficient BPS patches.

How do you mean 'further away' and 'nearby'? Do you mean like optimizing for "longest common sequence, regardless of whether it's perfect or not," rather than "longest exactly common sequence, or if no such, longest almost common sequence weighted for X"? I mean, if you have a list like this:

target offset | length | hamming distance
0x1234 | 721 | 0
0x1244 | 289 | 20
0x2981 | 1192 | 40
0x3000 | 600 | 0

Then isn't it a solved problem to determine which combination of overlaps yields the smallest patch? I think it should reduce to the knapsack problem, except for the part where you can break items into two at will with only minimal loss.

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.
Posted on 19-09-22, 06:13
Upper Class Twit

Post: #405 of 520
Since: 10-29-18

Last post: 8 days
Last view: 42 min.
Posted by sureanem
Posted by CaptainJistuce
What about nsync?
I'm afraid I can't find anything about it online - do you have a link?
I think that was a joke about washed-out boy bands.
Posted on 19-09-22, 06:21

Post: #203 of 210
Since: 10-29-18

Last post: 357 days
Last view: 329 days
Posted by Kawa
Posted by sureanem
Posted by CaptainJistuce
What about nsync?
I'm afraid I can't find anything about it online - do you have a link?
I think that was a joke about washed-out boy bands.

https://en.wikipedia.org/wiki/NKOTBSB
Posted on 19-09-22, 07:26
Full mod

Post: #346 of 420
Since: 10-30-18

Last post: 1 day
Last view: 8 hours
Posted by sureanem
Isn't it? You're dealing with ROMs the sizes of which number in the low megabytes;

One of the nice things about the BPS format is that it supports files larger than 16MB. Arbitrarily sized, even, given it uses variable-length integers. It would be a shame to have a file-format that supports patching terabyte disk images, but the patch-creation tools only support files of a few megabytes.

How do you mean 'further away' and 'nearby'?

A BPS patching tool maintains "source offset" and "target offset" variables. When you copy a byte at offset X in the source to offset Y in the target, the "source offset" and "target offset" variables are updated appropriately. The source copy instruction stores its offsets relative to the current values of the source and target offset variables. Since they're stored as variable length integers, making the next copy instruction start close to where the last one left off can be a significant savings.
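A sketch of why that matters, using a generic sign-folded varint (illustrative only - BPS's actual byte encoding may differ): relative offsets near zero fit in one byte, distant ones need more.

```python
# Fold the sign into the low bit, then emit 7 bits per byte; the high bit
# marks the final byte. (Hypothetical encoding, not necessarily BPS's.)
def encode_relative(offset: int) -> bytes:
    value = (abs(offset) << 1) | (1 if offset < 0 else 0)
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value == 0:
            out.append(byte | 0x80)  # terminator bit set on the last byte
            return bytes(out)
        out.append(byte)

print(len(encode_relative(3)))        # 1 byte: the next match is nearby
print(len(encode_relative(-500_000))) # 3 bytes: the next match is far away
```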

The ending of the words is ALMSIVI.
Posted on 19-09-22, 08:21 (revision 1)

Post: #199 of 295
Since: 10-29-18

Last post: 6 days
Last view: 1 hour
Posted by sureanem
dealing with ROMs the sizes of which number in the low megabytes

Unless you're dealing with CD images or MSU1 files.

My current setup: Super Famicom ("2/1/3" SNS-CPU-1CHIP-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
Posted on 19-09-22, 19:32
Stirrer of Shit
Post: #634 of 717
Since: 01-26-19

Last post: 244 days
Last view: 242 days
Posted by Screwtape
One of the nice things about the BPS format is that it supports files larger than 16MB. Arbitrarily sized, even, given it uses variable-length integers. It would be a shame to have a file-format that supports patching terabyte disk images, but the patch-creation tools only support files of a few megabytes.

Right, but this should - correct me if I'm wrong - be an issue with nearly any patch-creation tool unless there is some other very clever algorithm to it. If the memory usage scales as O(n), then you've got yourself a problem when n is too big, no matter the coefficient. At least with the 'trailing zeroes' approach, you can get as much mileage out of a given amount of RAM as possible. Whereas with suffix arrays - again, correct me if I'm wrong - that 4n memory usage would make 4TB of RAM a hard requirement to patch it.

How do you mean 'further away' and 'nearby'?

A BPS patching tool maintains "source offset" and "target offset" variables. When you copy a byte at offset X in the source to offset Y in the target, the "source offset" and "target offset" variables are updated appropriately. The source copy instruction stores its offsets relative to the current values of the source and target offset variables. Since they're stored as variable length integers, making the next copy instruction start close to where the last one left off can be a significant savings.

Well, isn't that just traveling salesman? Even if it takes 4 bytes to encode any address and 0 to encode the optimally chosen address, that seems like it easily gets dwarfed by the savings from having a 4 byte longer run and thus saving 4 bytes of literals.

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.