Compression

From Data Crystal
Revision as of 19:56, 2 February 2024 by Lelegofrog (talk | contribs) (→‎Super Mario World)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Compression is any process that transforms data so that it takes less storage. The process must be reversible, so that the original data can be obtained. Compression works by exploiting the fact that most data has high redundancy; that is, that given a set of data there will generally be strings that occur much more frequently than others.

Below, some common compression methods used in games are described. When looking at the examples, keep in mind that they are only to illustrate the concepts, and actual implementations may store the data very differently.

RLE (Run Length Encoding)

RLE is the simplest type of compression, in which a run (a sequence consisting of one byte repeated) is replaced with the byte along with the length of the run.

Here is an example of RLE. Given the data

4B 68 61 61 61 61 61 61 61 6E 21

a naive implementation of RLE might compress it as follows. The run-length bytes are shown in bold.

4B 01 68 01 61 07 6E 01 21 01

The string of 61s has been shrunk, but the one-byte "runs" have doubled in size, making the compressed data almost as long as the original. If it is known that the original data never uses the most significant bit (i.e. every byte is in the range of 00-7F), then this can be improved by using that bit to indicate a run, eliminating the need to store the length for single bytes. Now the data will become

4B 68 E1 07 6E 21

which is almost half the original size.

RLE was often used to compress maps in NES games. The NES had very little RAM, so a large map generally could not be loaded all at once. Because RLE allows the program to decompress from any point by counting lengths, the desired sections could be loaded while ignoring the rest of the map.

Dictionary

In dictionary compression, commonly used strings are replaced with one- or two- byte sequences that don't occur in the original data. For example, if the character ~ is defined as representing " dog", then the string

If you dog a dog during the dog days of summer, you'll be a dog tired dogcatcher.

would become

If you~ a~ during the~ days of summer, you'll be a~ tired~catcher.

In dictionary compression, one dictionary can be used to compress many small blocks, making it a good method for text.

Huffman coding

In Huffman coding, each character is assigned a sequence of bits to represent it. More common characters get shorter codes, while rarer characters get longer codes. It is named after David Huffman, who discovered an algorithm for generating an optimal code given a set of character frequencies.

Example: With the code 000=e, 001=g, 010=C, 011=(end), 10=a, 11=b, the string "Cabbage" will become 010 10 11 11 10 001 000 011. Putting the bits into groups of 8 to form bytes, this becomes 01010111 11000100 00110000, or "WÄ0".

Like dictionary compression, Huffman coding is often used for text because one Huffman code can be used to compress many small blocks of text.

LZSS (Lempel-Ziv-Storer-Szymanski)

LZSS is sort of like dictionary compression, except that instead of a fixed dictionary, the dictionary is a "sliding window" that consists of the last few kilobytes of data. In LZSS there are single bytes, and references to strings inside the sliding window. These references contain a displacement that says how many bytes before the current position the string starts at, and the length. These are distinguished by the use of "flag bytes", in which each bit determines whether a byte or a reference is to come.

LZSS is sometimes confused with LZ77, an older compression type that it is based on. LZ77 did not use flag bytes, instead the compressed data just alternated between bytes and references.

LZSS works well for large blocks of data, but poorly for small blocks due to the lack of repetition. It is often used for graphics, tiles, and maps. Chrono Trigger uses LZSS, as does the standard GBA compression format.

zlib

https://en.wikipedia.org/wiki/Zlib

Common zlib headers (2 bytes). Default compression is the most common:

78 01 - No Compression/low
78 9C - Default Compression
78 DA - Best Compression

gzip

https://en.wikipedia.org/wiki/Gzip

The gzip header is 10 bytes and always starts with hex bytes: 1F 8B

Super Mario World

Super Mario World uses a very "ad-hoc" compression format, with 5 different commands. Variations of it are used in many other SNES games by Nintendo, such as Yoshi's Island, Super Mario Kart, and EarthBound.

See also