Compression Isn't Always What You Might Think
Contents
About this document
Related information
What is compression?
What happens if you have already compressed a file?
How does the file become bigger?
Where is compression performed?
What about "best case" compression ratios?
So what can be done?
What else should I know about compression?
About this document
The purpose of this document is to clarify the use of
compression in backing up data, and to answer questions such as the following:
Why am I not getting three-to-one compression?
My tapes hold 120 gigs, but I'm only getting 38 gigs. Why?
The information in this document applies to all versions of AIX.
Related information
AIX and related product documentation is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html
See "Understanding Data Compression" in System Management Concepts:
Operating System and Devices
What is compression?
Compression finds repeating patterns in data, leaves a tag in their place, and
keeps the repeated data only once. This means that:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
could become something like:
/2A
Here / is the escape character (in this non-representative example), 2 is the
count (the character 2 has a value of 50 in US ASCII), and A is the character
to repeat 50 times. There are obviously more complicated schemes than this,
but in the end they all fall into the category of reducing repetition in a file.
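The escape/count/character idea above can be sketched as a tiny run-length
encoder. This is only an illustration of the toy scheme in this document, not
the method used by any real tape drive or utility; the names rle_encode,
rle_decode, and min_run are hypothetical.

```python
ESCAPE = b"/"  # escape byte taken from the toy example in the text


def rle_encode(data: bytes, min_run: int = 4) -> bytes:
    """Replace runs of one repeated byte with ESCAPE + count byte + byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        # Extend the run, capped at 255 so the count fits in a single byte.
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        run = j - i
        # Long runs are tagged; a literal "/" is always tagged so the
        # decoder never mistakes it for the start of a tag.
        if run >= min_run or data[i:i + 1] == ESCAPE:
            out += ESCAPE + bytes([run, data[i]])
        else:
            out += data[i:j]
        i = j
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand every ESCAPE + count + byte tag back into the original run."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i:i + 1] == ESCAPE:
            count, value = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


print(rle_encode(b"A" * 50))  # the 50 A's collapse to the 3-byte tag /2A
```

Data with no long runs passes through unchanged, and every literal / grows
from one byte to three, which is a small preview of how compression can make
data larger.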
What happens if you have already compressed a file?
Compressed data generally does NOT recompress. Binary programs are
semi-compressed in that machine code (binary) is more compact than the source
code (text) it was built from. Graphics files tend to be already compressed,
because raw image data is usually highly repetitive and most graphics formats
take advantage of that.
Obviously, compressed files are already compressed, and these include
.ZIP by Phil Katz, the old UNIX compress .Z, GNU gzip .gz files, the newer
open bzip2 .bz2 files, and numerous other available formats.
Compressed data is already mostly non-repeating. If you try to recompress it,
one of three things will happen:
- A lot of CPU time is spent making the data only half a percent smaller.
- The file ends up being the same size.
- The file ends up being larger.
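A quick way to observe this is to compress some data twice and compare sizes.
The sketch below uses Python's standard zlib module as a stand-in for whatever
compressor you actually use; the sample text and word list are made up for the
illustration, and exact byte counts will vary with the data and the compressor.

```python
import random
import zlib

# Hypothetical sample data: pseudo-text built from a small vocabulary,
# standing in for an ordinary uncompressed document.
WORDS = ["backup", "tape", "drive", "server", "client", "network", "volume",
         "file", "data", "restore", "media", "capacity", "stream", "buffer"]
rng = random.Random(0)  # fixed seed so every run produces the same bytes
original = " ".join(rng.choice(WORDS) for _ in range(20_000)).encode("ascii")

once = zlib.compress(original, 9)   # first pass removes the repetition
twice = zlib.compress(once, 9)      # second pass has little left to find

print("original:        ", len(original), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

The first pass should shrink the data considerably, while the second pass
should leave the size roughly unchanged.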
How does the file become bigger?
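Compressed files carry extra information of their own, such as a decompression
table, similar to a key, which is used during the extraction of data from the
files. If compression cannot shrink the data itself, this overhead is still
added, so the output ends up larger than the input.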
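The overhead can be measured directly. The sketch below feeds pseudo-random
bytes (used here as a stand-in for data that is already compressed) to Python's
standard zlib module. In zlib's case the extra bytes are a header, a checksum,
and block framing rather than a literal table, so treat this as an
illustration of the principle only.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
# Pseudo-random bytes: no repeating patterns for the compressor to exploit.
incompressible = bytes(rng.getrandbits(8) for _ in range(100_000))

packed = zlib.compress(incompressible, 9)
overhead = len(packed) - len(incompressible)

# Even an empty input produces a non-empty compressed stream.
empty_size = len(zlib.compress(b""))

print("input bytes: ", len(incompressible))
print("output bytes:", len(packed))
print("overhead:    ", overhead)
print("empty input compresses to", empty_size, "bytes")
```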
Where is compression performed?
There are several ways compression takes place:
- Sometimes it is part of the file format for your application. In this case, the
application uses CPU time to compress the file, but the result is not always smaller.
- You might use a third-party application, such as RAR, WinZip, ARJ, GNU
Zip, your backup software, or pencil and paper to make calculations.
- Your operating system may have a feature to compress an entire disk,
partition, filesystem, or individual files in a transparent manner to your
applications.
- You may use hardware, such as the CPU inside your tape drive, or an
add-in card for your disk drives.
What about "best case" compression ratios?
Sales and Marketing for most major companies and many other organizations tend
to indicate compression ratios that are either best case, such as a file
consisting of 37 gigabytes of the same character, or they may even be values
that are not attainable in a real-world device.
While the technology, in best cases, can actually achieve huge
compression ratios, these are virtually impossible to achieve unless you are
backing up empty databases.
So what can be done?
Be aware of the native capacity or the raw capacity of your
storage media, and of the data characteristics of what you are storing. If
you are backing up users' workstations, don't expect to get 120 gigs on a
tape with a native capacity of 40 gigs.
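The arithmetic behind the advertised and observed figures can be worked
through as follows. The 40, 120, and 38 gigabyte values come from the examples
in this document; the function names are hypothetical, and the model ignores
any other per-tape overhead.

```python
def effective_capacity_gb(native_gb: float, ratio: float) -> float:
    """Data that fits on a tape if every byte compressed at the given ratio."""
    return native_gb * ratio


def achieved_ratio(stored_gb: float, native_gb: float) -> float:
    """Compression ratio implied by how much data actually fit on the tape."""
    return stored_gb / native_gb


# A 40-gig native tape advertised at three-to-one compression.
advertised = effective_capacity_gb(40, 3.0)

# The same tape filling up after only 38 gigs of client data.
ratio_seen = achieved_ratio(38, 40)

print("advertised capacity:", advertised, "gigs")
print("ratio actually achieved:", ratio_seen)
```

A ratio just under 1 is one sign that the data was already compressed, or
otherwise incompressible, before it reached the drive.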
What else should I know about compression?
Heavy encryption randomizes data so thoroughly that no repeating patterns remain
for compression to find.
Performance, a complex topic, is affected by compression. If your network is
heavily loaded, slow, or poorly implemented, you will want to compress data
before you send large amounts of it over the network. If your network is good,
or if your client systems' CPUs are slow, then don't use client-side
compression. Instead, let your tape drive do the work.
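The trade-off can be estimated with a rough model. Everything in this sketch
is an assumption made for illustration: the function name, the throughput
figures, the two-to-one ratio, and the simplification that the client finishes
compressing before it starts sending.

```python
def transfer_seconds(size_mb, network_mb_s, ratio=1.0, compress_mb_s=None):
    """Estimate seconds to move size_mb of data from client to server.

    ratio is the assumed compression ratio (1.0 means no compression) and
    compress_mb_s is the assumed client compression speed. Compression and
    sending are modeled as happening one after the other, with no overlap.
    """
    compress_time = size_mb / compress_mb_s if compress_mb_s else 0.0
    send_time = (size_mb / ratio) / network_mb_s
    return compress_time + send_time


# Slow network (1 MB/s), reasonably fast client CPU (10 MB/s compression).
slow_net_plain = transfer_seconds(1000, network_mb_s=1)
slow_net_packed = transfer_seconds(1000, network_mb_s=1, ratio=2.0,
                                   compress_mb_s=10)

# Fast network (50 MB/s), slow client CPU (5 MB/s compression).
fast_net_plain = transfer_seconds(1000, network_mb_s=50)
fast_net_packed = transfer_seconds(1000, network_mb_s=50, ratio=2.0,
                                   compress_mb_s=5)

print("slow network:", slow_net_plain, "s plain,", slow_net_packed, "s compressed")
print("fast network:", fast_net_plain, "s plain,", fast_net_packed, "s compressed")
```

With these made-up numbers, client-side compression wins on the slow network
and loses on the fast one. Measure your own network and CPU speeds before
drawing conclusions.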
Note that the streaming throughput of your tape drive may be rated at its
maximum compressed rate. Before you decide your drive is running too slowly,
you will need to determine the head-to-tape transfer speed of your tape
drive. It may have several such speeds, since tape drives often run at
multiple speeds to prevent buffer underruns.
Compression is unpredictable. For example, a user on your network may have a
hidden stash of MP3s on an encrypted volume, so although you might see the data,
it may be incompressible.
Also, some backup utilities, such as Tivoli Storage Manager, will not report
the expected amount of data on your volumes if the data was compressed by the
client. The server sees what the client sent, not what the client read from
disk. Because client-compressed data will not recompress on the tape drive, the
server sees each tape fill at about its raw capacity; however, that data
actually represents more original client data than the server indicates.
If you have further questions about compression or would like information about
IBM's consulting services, contact your local service representative.
[ Doc Ref: 98406788611840 Publish Date: Mar. 23, 2001 4FAX Ref: 1029 ]