# Gautier's blog

## Tuesday, October 4, 2016

### Economy, September 2016

## Tuesday, September 20, 2016

### LZMA parametrization

## Saturday, September 10, 2016

### LZMA compression - a few charts

## Thursday, September 1, 2016

### Negative rates - ever lower!

## Thursday, August 18, 2016

### LZMA compression explained

## Thursday, July 7, 2016

### GLOBE_3D: now, a bit of fog...


## Wednesday, July 6, 2016

### GLOBE_3D Release 2016-07-05 - "Blender edition"

One fascinating property of the LZMA data compression format is that it is actually a family of formats, with three numeric parameters that can be set: **lc**, **lp** and **pb** (their meaning is recalled at the bottom of this post).

For instance, when (**lc**, **lp**, **pb**) = (8, 0, 0), you have a simple Markov model similar to the one used by the old "Reduce" format for Zip archives. Of course, the encoding of this Markov-compressed data is much smarter with LZMA than with "Reduce".

Additionally, you have a non-numeric parameter which is the choice of the LZ77 algorithm – the first stage of LZMA.

The stunning thing is how much changes in these parameters lead to different compression quality. Let's take a format that is difficult to compress losslessly as binary data: raw audio files (.wav), 16-bit PCM.

By running Zip-Ada's **lzma_enc** tool with the -b (benchmark) option, all combinations are tried – in total, 900 different combinations of parameters! For many .wav files (but not all), the combination leading to the smallest .lzma archive is (0, 1, 0) – see the list at the bottom [1].

It means that the previous byte is useless for predicting the next one, and that the compression has an affinity with 16-bit alignment, which seems to make sense. The data seems pretty random, but the magic of LZMA manages to squeeze 15% off the raw data, without loss. The fortuitous repetitions are not helpful: the weakest LZ77 implementation gives the best result! Actually, pushing this logic further, I have implemented for this purpose a "0-level" LZ77 [2] that doesn't do any LZ compression at all. It gives the best output for most raw sound data. Amazing, isn't it? It seems that repetitions are so rare that they output a very large code through the range encoder, while slightly and temporarily weakening the probability of outputting a literal – see the probability evolution curves in the second article, "LZMA compression - a few charts".
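To see why an affinity with 16-bit alignment makes sense, here is a small experiment – a sketch in Python rather than Ada, with synthetic low-amplitude 16-bit PCM standing in for a real .wav. The byte distribution at even positions (low bytes) differs sharply from the one at odd positions (high bytes), so a model indexed by the position modulo 2 (that is, lp = 1) faces much less entropy:

```python
import math
import random
from collections import Counter

def entropy(counts):
    """Shannon entropy, in bits per symbol, of a histogram of byte counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

random.seed(1)
# Synthetic 16-bit little-endian PCM: low-amplitude noise, so the high
# bytes are almost always 0x00 or 0xFF while the low bytes are spread out.
samples = [random.randint(-300, 300) for _ in range(50_000)]
data = b"".join(s.to_bytes(2, "little", signed=True) for s in samples)

even = Counter(data[0::2])   # low bytes  (position mod 2 = 0)
odd  = Counter(data[1::2])   # high bytes (position mod 2 = 1)
both = Counter(data)

print(f"one model for all bytes  : {entropy(both):.2f} bits/byte")
print(f"two models, pos mod 2    : {(entropy(even) + entropy(odd)) / 2:.2f} bits/byte")
```

The split-by-parity model always needs at most as many bits as the single model, and for 16-bit data the gap is large – which is exactly what lp = 1 exploits.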

Graphically, the ordered compressed sizes look like this (chart: "The 900 parameter combinations", with a zoom on "The best 100 combinations"):

and the various parameters look like this:

Many thanks to Stephan Busch, who maintains the only public data compression corpus, to my knowledge, with enough size and variety to be really meaningful for "real life" usage of data compression. You can find the benchmark at http://www.squeezechart.com/ . Stephan is always keen to share his knowledge about compression methods.

Previous articles:

____

[1] Here is the directory, in descending order of size (the original file is a2.wav).

```
37'960 a2.wav
37'739 w_844_l0.lzma
37'715 w_843_l0.lzma
37'702 w_842_l0.lzma
37'696 w_841_l0.lzma
37'693 w_840_l0.lzma
37'547 w_844_l2.lzma
...
32'733 w_020_l0.lzma
32'717 w_010_l1.lzma
32'717 w_010_l2.lzma
32'707 w_011_l1.lzma
32'707 w_011_l2.lzma
32'614 w_014_l0.lzma
32'590 w_013_l0.lzma
32'577 w_012_l0.lzma
32'570 w_011_l0.lzma
32'568 w_010_l0.lzma
```

[2] In the package LZMA.Encoding you find the very sophisticated "Level 0" algorithm:

```ada
if level = Level_0 then
  while More_bytes loop
    LZ77_emits_literal_byte (Read_byte);
  end loop;
else
  My_LZ77;
end if;
```

Hope you appreciate ;-)

As a reminder, here is what the three numeric parameters do:

- The "Literal context bits" (**lc**) parameter sets the number of bits of the previous literal (a byte) that will be used to index the probability model. With 0, the previous literal is ignored; with 8, you have a full 256 x 256 Markov chain matrix, with the probability of getting literal j when the previous one was i.
- The "Literal position" (**lp**) parameter takes into account the position of each literal in the uncompressed data, modulo 2^lp. For instance, lp=1 is better fitted for 16-bit data.
- The **pb** parameter plays the same role in a more general context, where repetitions occur.
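In LZMA, the first two parameters are combined into the index of the literal probability sub-model. Here is a sketch of that indexing in Python, following the scheme just described (the helper name is mine, for illustration):

```python
def literal_context(pos, prev_byte, lc, lp):
    """Index of the probability sub-model used to encode the literal at
    position `pos`: the `lp` low bits of the position, combined with the
    `lc` high bits of the previous byte (LZMA-style)."""
    pos_part = pos & ((1 << lp) - 1)
    prev_part = prev_byte >> (8 - lc) if lc else 0
    return (pos_part << lc) | prev_part

# (lc, lp) = (8, 0): a full 256-state Markov chain on the previous byte.
assert literal_context(pos=123, prev_byte=0xAB, lc=8, lp=0) == 0xAB
# (lc, lp) = (0, 1): only the 16-bit alignment of the position matters.
assert literal_context(pos=122, prev_byte=0xAB, lc=0, lp=1) == 0
assert literal_context(pos=123, prev_byte=0xCD, lc=0, lp=1) == 1
# The number of distinct literal contexts grows as 2^(lc + lp).
assert len({literal_context(p, b, 2, 1)
            for p in range(4) for b in range(256)}) == 2 ** (2 + 1)
```

This is why large lc and lp values multiply the size of the probability model so quickly.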


Here are a few plots that I set up while exploring the LZMA compression format.


You can pick and choose various LZ77 variants - for LZMA as well as for other LZ77-based formats like Deflate. Of course this choice can be extended to the compression formats themselves. There are two ways of dealing with this choice.

- You compress your data with all variants and choose the smallest size – brute force, post-selection; this is what the **ReZip** recompression tool does.
- You have a criterion for selecting a variant before the compression, and hope it will be good enough – this is what **Zip.Compress**, method Preselection, does (and the **ZipAda** tool with -eps).
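The two strategies can be sketched in Python, with zlib and lzma standing in for the LZ77/format variants; the candidate set and the size threshold below are made up purely for illustration:

```python
import zlib
import lzma

def candidates(data):
    """A few compression variants to choose from (illustrative stand-ins
    for the LZ77 variants discussed above)."""
    yield "deflate-9", zlib.compress(data, 9)
    yield "deflate-1", zlib.compress(data, 1)
    yield "lzma", lzma.compress(data, format=lzma.FORMAT_ALONE)

def brute_force(data):
    """ReZip-style post-selection: try everything, keep the smallest."""
    return min(candidates(data), key=lambda nc: len(nc[1]))

def preselect(data):
    """Zip.Compress-style pre-selection: guess a variant from the
    uncompressed size alone (the threshold here is arbitrary)."""
    return "deflate-9" if len(data) < 1000 else "lzma"

sample = b"the quick brown fox " * 500
name, packed = brute_force(sample)
print(name, len(packed), "bytes; preselection guesses:", preselect(sample))
```

Post-selection is guaranteed optimal over the candidate set but costs one compression pass per variant; pre-selection is a single pass and is only as good as its criterion.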

A criterion that appears obvious when playing with recompression is the uncompressed size (one of the things you know *before* trying to compress).

Obviously the BT4 variant (one of the LZ77 match finders in the LZMA SDK) is better on larger sizes than IZ_10 (Info-Zip's match finder for their Deflate implementation), but is it always the case? Difficult to say from this chart. But if you accumulate the differences, things begin to become interesting.

Funny, isn't it? The criterion would be to choose IZ_10 for sizes smaller than the x-value where the green curve reaches its bottom, and BT4 for sizes larger than that.

Another (hopefully) interesting chart shows how the probability model in LZMA (this time, the "MA" part, explained last time) adapts to new data. The increasing curves show the effect of a series of '0' on a certain probability value used for range encoding; the decreasing curves show the effect of a series of '1'. On the x-axis you have the number of steps.
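These curves come from LZMA's simple adaptation rule. A sketch in Python – the constants match the LZMA SDK (11-bit probabilities, a shift of 5); the rest is illustrative:

```python
TOTAL, MOVE_BITS = 1 << 11, 5   # 11-bit probabilities, adaptation shift of 5

def update(p, bit):
    """Adapt the probability-of-0 `p` after coding `bit`, LZMA-style:
    move a 1/32 fraction of the remaining distance towards 0 or 1."""
    return p + ((TOTAL - p) >> MOVE_BITS) if bit == 0 else p - (p >> MOVE_BITS)

p = TOTAL // 2                   # start at probability 1/2
zeros = []
for _ in range(50):              # a run of '0' bits pushes p towards 1...
    p = update(p, 0)
    zeros.append(p / TOTAL)      # ...along exactly such an increasing curve

print([round(x, 2) for x in zeros[:5]], "...", round(zeros[-1], 3))
```

Each step covers a fixed fraction of the remaining gap, which is why the curves rise steeply at first and then flatten out.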

This summer vacation's project was completed almost on schedule: write an LZMA encoder, whilst enjoying vacation - that is, work early in the morning and late in the evening when everybody else is sleeping; and have fun (bike, canoe, visiting caves and amazing dinosaur facsimiles, enjoying special beers, ...) the rest of the day.

Well, "schedule" is a bit of a stretch, because with a topic as tricky as data compression, it is difficult to tell when, and even whether, you will succeed...

**LZMA** is a compression format invented by Igor Pavlov, which combines **LZ77** compression and **range encoding**.

With **LZ77**, imagine you are copying a text, character by character, but want to take some shortcuts. You send either single characters, or a pair of numbers (distance, length) meaning "please copy 'length' characters, starting back 'distance' characters in the copied text, from the point where the cursor is right now". That's it!

LZ77 is a well covered subject and is the first stage of most compression algorithms. Basically you can pick and choose an implementation, depending on the final compression size.
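To make the copy mechanism concrete, here is a toy, greedy LZ77 in Python (real match finders such as BT4 are far more elaborate; this sketch just illustrates the literal/(distance, length) token stream):

```python
def lz77_tokens(data, window=4096, min_len=3):
    """Greedy toy LZ77: emit single bytes, or (distance, length) copies."""
    i, out = 0, []
    while i < len(data):
        best_d, best_l = 0, 0
        for j in range(max(0, i - window), i):   # scan the sliding window
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1                           # overlap past i is allowed
            if l > best_l:
                best_d, best_l = i - j, l
        if best_l >= min_len:
            out.append((best_d, best_l)); i += best_l
        else:
            out.append(data[i]); i += 1
    return out

def lz77_decode(tokens):
    buf = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            d, l = t
            for _ in range(l):
                buf.append(buf[-d])   # byte-by-byte, so overlapping copies work
        else:
            buf.append(t)
    return bytes(buf)

msg = b"abcabcabcabX"
print(lz77_tokens(msg))   # → [97, 98, 99, (3, 8), 88]
```

Note the (3, 8) token: the copy is longer than the distance, which is perfectly legal since the decoder copies byte by byte.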

**Range encoding** is a fascinating way of compressing a message of any nature. Say you want to send a very large number N, but with fewer digits. It's possible – if some of the digits (0 to 9) appear more frequently, and some, less. The method is the following.

You begin with a range, say [0, 999[.

You subdivide it in ten intervals, corresponding to the digits 0 to 9, and calibrated depending on their probability of occurrence, p0 .. p9. The first digit of N is perhaps 3, and its corresponding interval is, say, [295, 405[.

Then, you continue with the second digit by subdividing [295, 405[ in ten intervals. If the second digit is 0, you have perhaps now [295, 306[, representing the partial message "30". You see, of course, that if you want to stick with integers (with computers you don't have infinite precision anyway), you quickly lose precision when you set up the ten intervals with the probabilities p0 .. p9. The solution is to append from time to time a 0 to the interval, when the width is too small. So, if you decide to multiply everything by 10 each time the width is less than 100, then the interval for "30" will be now [295**0**, 306**0**[.

Some n digits later (after n subdivisions, and some x10 renormalizations when needed), your interval will perhaps look like [**298056**312, **298056**701[. The bounds become larger and larger – a second problem. Solution: you see that the leftmost digits won't change anymore. You can get rid of them and send them as a chunk of the compressed message. The compression is better when some symbols are much more frequent than others: the closer the probability is to 1, the more the range width is preserved. If the probability were exactly 1, the width wouldn't change at all, and this trivial message with only the same symbol wouldn't take any space in its compressed form! It is an absurd case, but it shows why compression methods such as LZMA are extremely good for very redundant data.
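The steps above can be transcribed literally in Python. The cumulative digit probabilities below are an assumption, chosen so that they reproduce the intervals of the example:

```python
# Cumulative digit frequencies in per-mille (assumed, for illustration):
# digit 3 gets [295, 405) out of [0, 1000), digit 0 gets the first 10%.
cum = [0, 100, 200, 295, 405, 500, 600, 700, 800, 900, 1000]

low, high = 0, 1000                      # start with the range [0, 1000)
for digit in [3, 0]:
    width = high - low
    # Subdivide [low, high) according to the digit's probability slice.
    low, high = (low + width * cum[digit] // 1000,
                 low + width * cum[digit + 1] // 1000)
    print(f"after digit {digit}: [{low}, {high})")
    if high - low < 100:                 # too narrow: append a 0 (x10)
        low, high = low * 10, high * 10
        print(f"renormalized   : [{low}, {high})")
```

Running it reproduces the narrative: digit 3 gives [295, 405), digit 0 narrows it to [295, 306), and the renormalization widens that to [2950, 3060).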

That's how the basic range encoding works.

Then, a funny thing is that you can encode a mix of different alphabets (say digits '0' to '9' and letters 'A' to 'Z') or even the same alphabet, but with different probabilities depending on the context, provided the decoder knows what to use when. That's all for range encoding (you find a more detailed description in the original article [1]).

**LZMA**'s range encoder works exclusively on a single, binary alphabet (0's and 1's), so the range is always divided in two parts. But it works with lots of contextual probabilities. With some parameters you can have **millions** of different probabilities in the model! The probabilities are not known in advance, so in this respect LZMA is a purely adaptive compression method: the encoder and the decoder adapt the probabilities as the symbols are sent and received. After each bit encoded, sent, received, and decoded, the entire probability set is (and has to be) in exactly the same state on the encoder and the decoder sides.
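This encoder/decoder synchrony can be shown with a toy adaptive binary coder in Python. It uses exact rational arithmetic instead of LZMA's 32-bit integer renormalization, so it is a conceptual sketch rather than the real codec; the probability update, however, follows the LZMA-style rule:

```python
from fractions import Fraction

PROB_BITS, MOVE = 11, 5
ONE = 1 << PROB_BITS

def adapt(p, bit):
    """LZMA-style adaptation of the probability-of-0 `p`."""
    return p + ((ONE - p) >> MOVE) if bit == 0 else p - (p >> MOVE)

def encode(bits):
    low, width, p = Fraction(0), Fraction(1), ONE // 2
    for b in bits:
        split = width * p / ONE          # share of the interval for '0'
        if b == 0:
            width = split
        else:
            low, width = low + split, width - split
        p = adapt(p, b)                  # same update on both sides...
    return low + width / 2               # any number inside the interval

def decode(x, n):
    low, width, p, out = Fraction(0), Fraction(1), ONE // 2, []
    for _ in range(n):
        split = width * p / ONE
        if x < low + split:
            b, width = 0, split
        else:
            b, low, width = 1, low + split, width - split
        out.append(b)
        p = adapt(p, b)                  # ...so the probabilities stay in sync
    return out

msg = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
assert decode(encode(msg), len(msg)) == msg
```

The decoder never receives the probabilities: it recomputes them, bit after bit, from its own identical copy of the model – exactly the invariant described above.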

Developing an encoder from scratch, even if you have open-source code to reproduce, is fun, but debugging it is a pain. A bug feels like something not working in a PhD-level maths project: no way to get help from anybody or by browsing the Web. By nature, the compressed data contains no redundancy that would help you fix bugs. The decoder gets confused on faulty compressed data and cannot say why. For range encoding, it is worse: as in the example, the digits sent have nothing to do with the message to be encoded. The interval subdivision, the shipping of the leading interval digits, and the appending of trailing '0's occur in a way that is completely asynchronous. So, the good tactic is, as elsewhere, to simplify and to divide the issues into the smallest possible cases.

First, manage to encode an empty message (wow!). It seems trivial, but the range encoder works like a pipeline; you need to initialize it and flush it correctly. Then, an empty message and the end-of-stream marker. And so on.

Another source of help for LZMA is the probability set: it needs to be identical at every point as said before.

The results of this effort, in a few numbers:

- **LZMA.Encoding**: started July 28th, first working version August 16th (revision 457).
- Less than 450 lines – including lots of comments and some debugging code to be removed!
- 5 bugs had to be fixed.

In my (of course biased) opinion, this is the first LZMA encoder that a normal human can understand by reading the source code.

Zip-Ada's Zip.Compress makes use of LZMA encoding since revision 459.

**The source code is available here (main SourceForge SVN repository) or here (GitHub mirror).**

Back to the vacation topic (which is what you often do when you're back *from* vacation): a tourist info sign was just perfect for a 32x32 pixel "info" icon for the AZip archive manager.


[Photos: the beautiful sign – click to enlarge]

Here is the code activating the fog in the background.

```ada
if foggy then
  Enable (FOG);
  Fog (FOG_MODE, LINEAR);
  Fog (FOG_COLOR, fog_colour (0)'Unchecked_Access);
  Hint (FOG_HINT, FASTEST);
  Fog (FOG_START, 1.0);
  Fog (FOG_END, 0.4 * fairly_far);
end if;
```

As usual with GL, it looks very obvious, but (as usual, too) it is one of the few combinations that actually work.

GLOBE_3D is a GL Object Based 3D Engine written in the Ada programming language.

URL: http://globe3d.sf.net


Incidentally, the Wavefront file format is so simple that you can also write 3D models "by hand" in that format. An example made in an Excel sheet is provided along with the importer, in the ./tools/wavefront directory.

Enjoy!


Latest additions:

- Use of the Generic Image Decoder (GID) in GL.IO; now most image formats are supported for textures and other bitmaps to be used with GLOBE_3D (or any GL app)
- New Wavefront format (.obj / .mtl) importer
- Doom 3 / Quake 4 map importer more complete
- Unified GNAT project file (.gpr), allowing selection of the target operating system (Windows, Linux, Mac) and compilation mode (fast, debug, small) for demos, tools, etc.
- Project file for ObjectAda 9.1+ updated

Here is an example:

