Compression Comparison Guide

Discussion in 'Reviews & Articles' started by peaz, Nov 30, 2002.

  1. Mgz

    Mgz Newbie

    I just wonder...


    Why doesn't this super guide include the CAB format (with the LZX algorithm, "unbeatable" :? ; this is the format M$ uses to pack their OS)?

    You can use the M$ tool (freeware) @ http://support.microsoft.com/default.aspx?scid=KB;en-us;310618& or some shareware like Cabinet Manager 2002.


    How about stating the algorithm each program uses... like LZX (LZW, LZ77/LZ78, etc.), Blowfish, Burrows-Wheeler block sorting, Huffman; just some brief info :D
     
  2. drab

    drab Newbie

    The only compressor I use is to dig the drive up! Discs are so big these days, don't bother :roll:
     
  3. Fulcrum2000

    Fulcrum2000 Newbie

    Maximum Compression

    Very nice compressor comparison you have on your site! Great work.

    Also have a look at Maximum Compression (http://www.maximumcompression.com/) for some more up-to-date benchmarks.
     
  4. MisterE

    MisterE Newbie

    Hi there,

    I found my way to your article via Slashdot and I wanted to commend you on your efforts. That's a lot of work. I do have a few comments about your results.

    (Full disclosure: I am a QA Engineer at Aladdin/Allume/Smith Micro)

    First, you should make it clear that you are comparing compression formats, not compression applications. There is a distinction. Some applications can create compressed files in a variety of formats (StuffIt Standard can create archives in StuffIt X, Zip, gzip, bzip2 among others).

    You might want to go into some detail about what each compression format is doing. One format may be using a very different algorithm than another. There are several common compression algorithms:

    Deflate - LZ-Huffman - [see Wikipedia - Huffman Coding and Wikipedia - Deflate ] - a matching algorithm.

    LZ-Arithmetic - [see Wikipedia - Arithmetic Coding ] - a matching algorithm.

    BWT - "Burrows-Wheeler transform" also called "block-sorting compression" - [see Wikipedia - BWT and Wikipedia - Burrows-Wheeler Transform ] - the order of characters in a file are rearranged to increase redundancy and optimize compression size.

    PPM - "Prediction by Partial Matching" - [see Wikipedia - PPM ] - a prediction algorithm, especially suited for text.

    Zip and GZip use Deflate. BZip2 uses the BWT algorithm. Knowing what algorithm is being used can help explain why one format might compress a particular data set smaller, or faster. The StuffIt X format can use any of the above algorithms, alone or in combination, to get optimal compression for a given data set. The "Faster" or "Better" setting is not necessarily best for any particular data set, but the custom settings give users a lot of control over how their data is compressed.
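
    To see that in practice, here is a minimal Python sketch (the sample text and the resulting sizes are purely illustrative) that compresses the same buffer with Deflate (zlib), a BWT-based coder (bz2) and LZMA from the standard library:

        # Rough illustration: different algorithms, same input, different sizes.
        # zlib implements Deflate (LZ77 + Huffman), bz2 uses the Burrows-Wheeler
        # transform, and lzma is an LZ-based coder with a range coder.
        import bz2
        import lzma
        import zlib

        data = b"the quick brown fox jumps over the lazy dog. " * 2000

        for name, compress in [("Deflate (zlib)", zlib.compress),
                               ("BWT (bz2)", bz2.compress),
                               ("LZMA (lzma)", lzma.compress)]:
            out = compress(data)
            print(f"{name:15} {len(data)} -> {len(out)} bytes "
                  f"({100 * len(out) / len(data):.1f}%)")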

    Your tests focus mainly on compression size and compression speed. Users need to consider a number of other factors as well when deciding which format to use for their archiving needs: decompression speed, cross-platform compatibility, open standards (ie: non-proprietary format), security, or some combination of all of these. If using an open standard compression format is very important, then one might choose a format that doesn't compress as well.

    Also, compression speed is usually inversely related to compression size. If you are only compressing files a little, you can usually compress them really fast. If you are trying to compress them as much as possible, then it usually takes a long time. Since compression algorithms look for the redundancy in files, files that have little redundancy (eg: lossy compressed files such as MP3) take a lot of effort and give little return. It's like putting a crushed soda can into a trash compactor. It may get a little smaller, but you probably aren't going to see great results, so you may not want to bother.
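
    To make the speed-versus-size and redundancy points concrete, a small sketch along these lines (Python, with made-up data; the timings are illustrative, not a benchmark) compares a fast and a maximum zlib setting on repetitive data and on random bytes standing in for an already-compressed file:

        # Higher levels trade time for size on redundant data; random bytes
        # (a stand-in for MP3s or other lossy-compressed files) barely shrink
        # at any level.
        import os
        import time
        import zlib

        redundant = b"abcabcabc" * 200_000        # highly repetitive input
        random_ish = os.urandom(len(redundant))   # stands in for compressed data

        for label, data in [("redundant", redundant), ("random", random_ish)]:
            for level in (1, 9):
                start = time.perf_counter()
                out = zlib.compress(data, level)
                elapsed = time.perf_counter() - start
                print(f"{label:9} level {level}: {len(out):>8} bytes "
                      f"in {elapsed:.3f}s")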

    A few issues with your StuffIt results:

    You give StuffIt high marks for JPEG compression, but you should note that there are two different StuffIt formats: the older StuffIt (.sit) format and the newer StuffIt X format introduced in 2000. JPEG compression is an enhancement added to StuffIt X at the beginning of 2004. It is not part of the older StuffIt format. It is proprietary and requires StuffIt to expand, but it does get up to 30% compression on JPEG files which previously were considered difficult to compress (see below).

    You fault the StuffIt (X) format for not compressing certain file types (such as MP3 files), but there is a default setting in StuffIt (the application) to not compress already compressed items (I believe the Windows version says "Do not Recompress Items"). If you uncheck this option and re-run your tests, you will see results more comparable to the other compression applications. But as noted above, the time and effort used to compress files that have little redundancy can be better spent, so the StuffIt app defaults to adding the files to the archive without further compression.

    This post is already too long. Sorry... I'm not trying to sell StuffIt here; I just want you to give it a fair review.

    Thanks!

    --Eric K
     
  5. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Hello Eric! :mrgreen:

    Thanks for your comments. Yeah, it is a lot of work, and I've only compressed them at the fastest settings. It's going to take a lot more time for the other settings.

    Actually, I did consider testing all supported formats of each compressor, but that would take a lot more time and effort. If the response is good and there is a significant number of requests for non-native formats to be tested in each compressor, I wouldn't mind adding them to the results.

    But I would prefer to think that I was testing each data compressor in its native format. I was not actually comparing compression formats per se, since different compressors using the same format will have different results. It was really about the data compressors' ability at compressing their native formats.

    Yes, I agree that users should consider other factors and not only the compression speed and performance when they choose a data compressor. This comparison guide was never meant to advise readers on the other factors. It is essentially a performance comparison of the few popular data compressors.

    Now, regarding the issues on StuffIt.

    1. Yup, I know about the old format. In fact, it was covered in the first version of this comparison guide. In this guide, we used the latest StuffIt X format to ensure maximum performance.

    2. I think I overlooked that setting. I will need to check it out and redo the tests. Thanks for letting me know about it. :thumbs:

    Thanks again for your comments. I appreciate you taking the time to help us improve the article. Do let us know if you have further comments on the article.

    Thanks!
     
  6. cranstone

    cranstone Newbie

    Hi Adrian,

    Can you send me a link to the full report - I'm interested in learning more. Many thanks.

    (Full disclosure - I'm one of the co-inventors of mod_gzip for Apache, the first PD module of its kind for compressing data in real time from an Apache web server.)

    All the best,

    Peter
     
  7. CeeJay

    CeeJay Newbie

    Adrian ...

    If you're trying to find the fastest compressor, you really need to test LZOP http://www.lzop.org/
    Other compression comparisons consistently rank it as the fastest compressor on the planet.

    Seeing Peter Cranstone post here impels me to point out that rojakpot.com is not being served gzip-compressed, and you could save a huge amount of bandwidth by doing that.
    And the site would load a lot faster to boot.
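
    For anyone who wants to check this themselves, a quick Python sketch like the one below (the URL is just an example) requests a page with an Accept-Encoding header and reports whether the response came back gzip-compressed:

        # Ask for gzip and inspect the Content-Encoding header of the reply.
        # urllib does not decompress automatically, so the raw size is what
        # actually travelled over the wire.
        import gzip
        import urllib.request

        req = urllib.request.Request("http://www.rojakpot.com/",
                                     headers={"Accept-Encoding": "gzip"})
        with urllib.request.urlopen(req) as resp:
            encoding = resp.headers.get("Content-Encoding", "")
            raw = resp.read()

        print("Content-Encoding:", encoding or "(none)")
        print("Bytes on the wire:", len(raw))
        if encoding == "gzip":
            print("Decompressed size:", len(gzip.decompress(raw)))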
     
  8. cranstone

    cranstone Newbie

    Hi Adrian,

    A quick query on Netcraft shows that your site (http://forums.rojakpot.com) is running Microsoft-IIS on Windows Server 2003.

    Many moons ago, Kevin and I actually built a server-side IIS compression filter which was faster than Microsoft's version.

    I don't know what the current state of the art is on IIS, but I would recommend turning on compression. All the current blog sites are using it and the bandwidth savings are considerable - not to mention faster load times for the viewer (customer) (ad server).

    All the best,


    Peter
     
  9. peaz

    peaz ARP Webmaster Staff Member

    Hi there Peter.. thanks for the response! :D

    Ok, this may be a bit off topic for this forum but...
    We have already turned on IIS compression. But I guess, as you said, it's just not as efficient. I'll look into the available IIS compression filters. And I'd also like to try out the version you built as well, if you don't mind.

    Anyways, email me at ken[at]rojakpot{dot}com for further discussions on this topic. Thanks and cheers! :)
     
  10. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Hello Peter,

    I'm still working on the other tests. The Fastest tests alone took 4-5 days of solid testing. :wall:

    I will try to get the other test results out ASAP. :pray:
     
  11. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Hello CeeJay!

    Thanks for the tip. Will go check it out ASAP. Maybe I can add it to the Fastest test results before I get the results of the other tests. :think:
     
  12. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Hello Eric,

    Just checked the StuffIt settings. It looks like the "Do not Recompress Compressed Items" setting was NOT enabled in my tests. At least, it doesn't seem to be enabled now, and since I have not changed any settings since then, that should have been the case during the tests, so the current results should be valid.
     
  13. MisterE

    MisterE Newbie

    Hey Adrian,

    I stand corrected... The Don't Compress Already Compressed Items setting only applies to other formats that StuffIt can create (ie: zip, gzip, StuffIt 5).

    The StuffIt X format excludes certain files that are typically difficult to compress (ie: .mp3, .mov). These files are added to the archive without compression. I knew this was the case with the Mac version, but I was under the impression that with the Windows versions of StuffIt, the Don't Recompress option affected all compression formats. (I test the Mac version...)

    Again, the reasoning for excluding these files is that for general use, in most cases the effort taken to compress a lossy-compressed file gives little return. (JPEG being the exception - we get great compression in JPEGs!)

    I expect that we will give users more control over this setting when using the StuffIt X format in future versions.

    --Eric K
     
    Last edited: Jan 4, 2006
  14. Olle P

    Olle P Newbie

    I'd say there's a fairly straightforward relationship where the computing time grows exponentially with the compression rate, no matter what algorithm is used.
    For a comparison in this test it would be nice to also know the time needed to do a straight copy, not using any compressor.
    This would show how much extra time it takes to save some space. I'd guess that the extra time is negligible when compressing a small amount of data on a "fast" setting, and then becomes the major factor when opting for maximum compression.
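
    Something along these lines (a rough Python sketch with hypothetical paths; gzip stands in for whichever compressor is being tested) would give that "copy" baseline:

        # Time a plain copy between two drives and a fast gzip'ed copy of the
        # same file, so the extra time spent on compression becomes visible.
        # SRC and DST are hypothetical paths.
        import gzip
        import shutil
        import time

        SRC = "D:/testdata/fileset.bin"   # source fileset on one drive
        DST = "E:/archive/"               # target folder on another drive

        start = time.perf_counter()
        shutil.copy(SRC, DST + "copy.bin")
        copy_time = time.perf_counter() - start

        start = time.perf_counter()
        with open(SRC, "rb") as fin, \
             gzip.open(DST + "copy.bin.gz", "wb", compresslevel=1) as fout:
            shutil.copyfileobj(fin, fout)
        compress_time = time.perf_counter() - start

        print(f"straight copy: {copy_time:.1f}s, gzip -1: {compress_time:.1f}s")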

    The only conclusions I can draw from the comparison, as presented this far, are:
    - Win RK doesn't opt for speed.
    - StuffIt is crap for anything but JPEG files.

    I'm looking forward to the maximum compression test.

    Cheers
    Olle
     
  15. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Straight copy? Hmm.. I used two different hard drives to minimize the effect of the controller on their performance. Also, I used a slow PC to reduce the effect of the hard drives on the results. In effect, I'm biasing the results towards the actual compression, instead of hard drive activity.

    BTW, I think the effect of hard drive activity will be more pronounced during the fast tests, since the processor needs far less time to compress the data and the compressed archive will be bigger.

    Will get the other results out as soon as I can. Been travelling a lot these days and I will still need to travel more in the coming weeks. :wall: :wall:
     
  16. Olle P

    Olle P Newbie

    Still, the actual access time to read/write can't be neglected.
    ... which I also pointed out, with the addition that it also requires somewhat "small" files, so that the computer won't need to use virtual memory (on the HDD) to temporarily store half-baked calculations during the compression.

    Cheers
    Olle
     
  17. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Correct, but as long as the variables are minimized and kept constant, the results should not be affected much.

    Also, some compressors compress directly to the archive, instead of compressing to a temporary folder first and then copying the final archive out. In such cases, they would be penalized if we deduct the time to copy from one hard drive to another.

    This is one of the reasons why I did it this way. Although the effect of the hard drive system is minimized, there should also be no manipulation of the results if possible. In this case, deducting the access/write time will skew the results in favour of compressors that compress to a temporary folder first.
     
  18. Olle P

    Olle P Newbie

    :confused: :confused: Quite the opposite I'd say!
    You deduct the time it takes to make one full-size copy. Saving to a temporary folder means making two copies, which takes almost twice as long.

    In theory a compressor that doesn't save to a temporary folder might produce a compressed archive faster than it takes to make a straight copy, if the CPU power needed for the calculations doesn't interfere with the data transfer.

    As you yourself previously wrote, time is also a factor! If I need to quickly copy something for archiving or distribution, then I'm interested in knowing how much longer it will take me to compress the data and do the copy, saving ~10% space, compared to making a straight copy.
    When using a slow medium for the file transfer, it's almost always better to compress as much as possible first. (I vividly recall downloading 100MB files through a null modem at 10MB/h...)

    Cheers
    Olle
     
  19. Adrian Wong

    Adrian Wong Da Boss Staff Member

    Err... there are basically two ways these compressors work. Either they write directly to the archive as they compress the files, or they compress to a temporary folder, usually on the OS drive, and then write it out to the archive, which in this case is on another hard drive.

    Yes, saving to a temporary folder will take longer, but not really twice as long. In fact, moving the archive from the temporary folder to the actual folder takes only seconds, which is far shorter than the compression time.

    Actually, for both methods, the CPU does not actually interfere with the data transfer. The only difference between the two methods is that the "temporary storage" method needs to move the final archive out to the actual folder. That's it. Nothing else differs, at least as far as the CPU is concerned.
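
    To picture the difference, here is a rough Python sketch of the two methods (hypothetical paths, with gzip standing in for any archiver):

        # Method A streams the archive straight to the destination drive.
        # Method B compresses to a temporary file on the OS drive first,
        # then moves the finished archive over (the extra step discussed above).
        import gzip
        import shutil
        import tempfile

        SRC = "D:/testdata/fileset.bin"      # hypothetical source file
        DEST = "E:/archive/fileset.bin.gz"   # archive on the second drive

        def compress_direct(src, dest):
            """Method A: write the compressed archive directly to the destination."""
            with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
                shutil.copyfileobj(fin, fout)

        def compress_via_temp(src, dest):
            """Method B: compress to a temp file on the OS drive, then move it out."""
            with tempfile.NamedTemporaryFile(delete=False, suffix=".gz") as tmp:
                tmp_path = tmp.name
            with open(src, "rb") as fin, gzip.open(tmp_path, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            shutil.move(tmp_path, dest)      # the extra copy/move step

        # e.g. compress_direct(SRC, DEST) or compress_via_temp(SRC, DEST)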

    Yes, time is a factor, which is why it would not be fair to arbitrarily deduct the time it takes to copy the files from one hard drive to another. I will just list down the reasons:

    1. Deducting the transfer time for compressors that use a temporary folder will reduce the speed advantage of compressors that write directly to the archive. That wouldn't be fair or accurate since in the real world, the former compressors will still take time to move the archive from the temporary folder to the actual folder.

    2. Timing how long it takes to copy the fileset from one hard drive to another isn't even accurate in the first place since the compressors would not be copying the UNCOMPRESSED fileset but the compressed fileset which would be smaller.

    3. Even if we become truly purist (and crazy!) and deduct the time it takes to copy every compressed fileset from one hard drive to another, we CANNOT account for hard drive caching which will affect the results.

    In short, it is not only inaccurate to do what you are suggesting, it is also unfair to compressors that write directly to the archive. They have a speed advantage in that sense and it isn't right to mess with the results just to deprive them of that advantage.
     
  20. Olle P

    Olle P Newbie

    Seems like we're still thinking past each other...

    Not as such, but it does interfere with the data being transferred and the time between "read" and "write".
    The CPU does the optimisation calculation that figures out exactly what to write into the compressed archive. This calculation adds time between "read" of the original file and "write" to the archive.
    Without that "interference" it would be a straight copy.

    I'm less up to date with exactly how much the CPU is involved with the actual HDD and data bus I/O control.
    Depends on file size, bus and media speeds, doesn't it?
    I've not suggested deducting the actual transfer time used by each method, but rather measuring the total time needed to make one straight copy of the original data.
    Just call that compression method "Copy" and give it a compression rating of exactly 0.0% for all file types.
    Doing the compression requires CPU time, and one of the key issues here is to compare how long it takes, from start to finish, to create a compressed archive as opposed to creating an uncompressed file/folder.
    That's where large RAM and small originals come into play. Compressing a smaller amount of data can be done without HDD caching.
    As I've pointed out above, this statement is based on a misinterpretation of my suggestion.
    Adding the straight copy as just another compression method in the comparison will suffice.

    To me the key issue when deciding what compression method to use (if any) is why to do it.
    There are a couple of possibilities:

    1) I have some data that I want to have accessible in one location on my HDD.
    Then a single archive file is convenient, and any compression is fairly irrelevant. A straight copy into one folder will do just fine as well. A single file will also use less space on the drive, because it completely fills every cluster it uses except the last one.

    2) I'm archiving files for possible future use (for example a back-up copy), and don't want the stored data to use up too much space.
    Then I want a fairly efficient compression that may take some time to perform, but not hours.

    3) I'm going to store some defined amount of data on a media that isn't quite that large. (Like 1GB raw data onto one 800MB CD-R.)
    Then any compressor that provides a sufficient amount of compression in a timely manner will do.

    4) I have some data that needs to be transferred through a slow channel with limited bandwidth ASAP.
    Depending on the amount of data involved, I need to optimise the combination of extra time needed to do the compression versus the reduction in transfer time gained by the reduced file size (see the sketch after this list).

    5) I've got some sizeable data (like a movie or program) that I've created, and want to publish it on a website for others to download. Minimizing the file size is crucial to cut download times and bandwidth use.
    Then the most powerful compressor is a very good choice, even if it takes hours to create the compressed file. There needs to be an easily accessible and free decompressor publicly available, though.
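
    As a back-of-the-envelope check for case 4 above, something like this (all figures invented) captures the trade-off:

        # Compressing first is worth it when compression time plus the transfer
        # time of the smaller file beats the transfer time of the original.
        def worth_compressing(size_mb, ratio, compress_mb_per_s, link_mb_per_s):
            """Return True if compress-then-send is faster than sending raw."""
            plain = size_mb / link_mb_per_s
            compressed = (size_mb / compress_mb_per_s
                          + (size_mb * ratio) / link_mb_per_s)
            return compressed < plain

        # e.g. 100 MB shrinking to 60% of its size, compressed at 2 MB/s,
        # sent over a ~10 MB/h (0.003 MB/s) null-modem link: worth the wait.
        print(worth_compressing(100, 0.6, 2.0, 0.003))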

    I hope this clears things up a bit.

    Cheers
    Olle
     
