
Nubis Novem

Consulting On Cloud Nine

Archive-Zip for Perl, a moody princess: limitations, shortcomings, workarounds

“Her Majesty’s a pretty nice girl,
But she doesn’t have a lot to say
Her Majesty’s a pretty nice girl
But she changes from day to day”
“Her Majesty”, from The Beatles album “Abbey Road” (1969)

When you need to handle ZIP archives in Perl, well... there is the Archive-Zip module, introduced as early as January 2001 and since supported by several maintainers with regular updates (if CPAN references serve us right). Most of the time Archive-Zip is all right, but it has limitations. Programmer, beware! 2016 is almost over, and Archive-Zip still does not know how to handle the newer “64-bit” header ZIP format (ZIP64). Not only can it not read 64-bit ZIPs; alas, it cannot create them either. With the older “32-bit” header ZIP archives, compressing a large amount of data presents a bigger challenge than it should. Yes, you might use a different compression format or technique. But what if we must stick with the good old ZIP file as our standard?

Recently we discovered a bug related to that module in one of the data processing systems we support. Below is our code snippet, which expects an array of files to compress. It worked well most of the time:

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
...
# Creating a new zip file
my $zip = Archive::Zip->new();
my $zipname = "archive";
my $zipfile = $zipname . ".zip";
# Trying to read the existing zip structure, when zip archive already exists
$zip->read( $zipfile ) if -s $zipfile;
...
foreach my $file ( @files_to_compress ) {
   # remove if the current file was already in the zip:
   $zip->removeMember( $file );
   # add a new file to zip object in memory, using best compressionLevel = 9
   $zip->addFile( $file, $file, 9 );
}
...
if ( $zip->numberOfMembers ) {
   # Save to a zip file $zipfile
   $zip->overwriteAs( $zipfile );
}

Note that we accumulate the zip object in memory until the very end of the operation, which may be a problem if this particular thread has a memory limit.

What is buggy here? Well, this code worked fine until either of the module's limits was reached:

  • if the number of files to compress is greater than 65535, or
  • if the compressed output (the new ZIP archive) exceeds 2 GB in size.
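
The two limits listed above can be turned into a simple pre-flight check. The following is only a minimal sketch: the constant and function names are hypothetical, and the threshold values are the ones quoted in this post rather than anything read from Archive-Zip itself.

```perl
use strict;
use warnings;

# Limits as quoted in the text; the names are hypothetical, not part of Archive::Zip.
use constant MAX_MEMBERS       => 65_535;
use constant MAX_ARCHIVE_BYTES => 2 * 1024 * 1024 * 1024;    # 2 GB

# Returns a list of reasons why a single archive would be unsafe.
# An empty list means both limits are respected.
sub zip_limit_problems {
    my ( $member_count, $total_bytes ) = @_;
    my @problems;
    push @problems, "too many members ($member_count)"
        if $member_count > MAX_MEMBERS;
    push @problems, "archive too large ($total_bytes bytes)"
        if $total_bytes > MAX_ARCHIVE_BYTES;
    return @problems;
}

# Example: 70,000 small files break the member limit but not the size limit.
print "$_\n" for zip_limit_problems( 70_000, 1_000_000 );
```

The caller supplies the member count and an estimate of the archive size; how good that estimate is decides how useful the check can be.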

When either limit is hit, the result is a failure. In our case the processing aborted with an I/O error message, and a bad ZIP file was left behind containing only a handful of good source files. Most of the original files we were compressing would be lost.

This is still the case for Archive-Zip 1.30, currently the most widely used version. It should be the same for all versions up to and including the latest, 1.59, since the regular ZIP format limits still apply.

The workaround is to guard against both limits in the Perl source code. We did that by accumulating the total compressed size of all added files in the variable $csize (via the ZIP member's compressedSize method) and by checking the number of files (members) added to the archive image with numberOfMembers.

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
...
# Creating a new ZIP file in memory
my $zip = Archive::Zip->new();
my $zipname = "archive";
my $zipfile = $zipname . ".zip";
# Trying to load the existing ZIP structure, when ZIP archive already exists
$zip->read( $zipfile ) if -s $zipfile;
...
my $zipcnt = 0;
my $csize = 0;
my $csize_LIMIT = 1024 * 1024 * 1024;    # roughly 1 GB of compressed data
my $cfiles_LIMIT = 10000;                # set limit of 10K files per ZIP

foreach my $file ( @files_to_compress ) {
    # remove if the current file was already in the zip:
    $zip->removeMember( $file );
    # add a new file to zip object in memory, using best compressionLevel = 9
    my $file_member = $zip->addFile( $file, $file, 9 );
    $csize += $file_member->compressedSize;
    next unless $csize > $csize_LIMIT || $zip->numberOfMembers >= $cfiles_LIMIT;
    # Save our work to a ZIP file
    $zip->overwriteAs( $zipfile );
    $csize = 0;   # reset compressed size
    $zip = Archive::Zip->new();
    $zipfile = $zipname . ++$zipcnt . '.zip';
    # reading new ZIP file into memory object if exists
    $zip->read( $zipfile ) if -s $zipfile;    
}
...
# Final saving to a ZIP file $zipfile if compressed files are still in memory object
$zip->overwriteAs( $zipfile ) if $zip->numberOfMembers;

Notes and after-thoughts
This logic creates multiple ZIP files, numbering them with an archive counter that starts from 1 (archive.zip, archive1.zip, archive2.zip, and so on). It does not create a true split archive, but that is still much better than ending up with a bad ZIP! A future improvement would be to detect large files that have to be split across multiple ZIP files. But that is a bit challenging for a quick hack.

We were a bit discouraged by a warning about the compressedSize method in the latest Archive-Zip online documentation, which says: “This will not be set for members that were constructed from strings or external files until after the member has been written.” At first it looked as if that did not apply to module version 1.30 and later: a size in bytes was available right away (but see the update below).

Update (11/2/2016): after the fix was in place, we found that the compressedSize method returns the original (uncompressed) file size at that point. There is in fact no way to know the compressed size (duh!) until the ZIP file is written to disk, which is when the actual compression happens. In other words, if you are not sure what kind of data the source files contain, it is wise to set the limit to 2 000 000 000 bytes (slightly less than 2 GB, leaving some room for headers and other overhead). That way, even when the data is barely compressible (compressed size close to 99% of the original, as with some images or already compressed data), the output should not turn into a bad ZIP file due to format limitations.
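
To illustrate the idea of limiting by original file size, here is a minimal sketch that plans batches up front instead of checking the limits while adding files. It is not the code we run in production: the function name plan_batches and the sample file names are hypothetical, and the sizes are passed in directly (in real use they would come from -s on each file). Note that a single file larger than the byte limit still ends up in a batch of its own, so it would need separate handling.

```perl
use strict;
use warnings;

# Group files into batches so that neither the summed on-disk size nor the
# file count of a batch exceeds the given limits.
# $files is an array ref of [ name, size_in_bytes ] pairs.
sub plan_batches {
    my ( $files, $max_bytes, $max_files ) = @_;
    my @batches = ( [] );
    my $bytes   = 0;
    for my $entry (@$files) {
        my ( $name, $size ) = @$entry;
        my $current = $batches[-1];
        # Start a new batch when adding this file would cross a limit.
        # An oversized file is never split; it gets a batch to itself.
        if ( @$current
            && ( $bytes + $size > $max_bytes || @$current >= $max_files ) )
        {
            push @batches, [];
            $bytes = 0;
        }
        push @{ $batches[-1] }, $name;
        $bytes += $size;
    }
    return grep { @$_ } @batches;    # drop the empty batch if no files were given
}

my @batches = plan_batches(
    [ [ 'a.dat', 900 ], [ 'b.dat', 900 ], [ 'c.dat', 300 ], [ 'd.dat', 5000 ] ],
    2000,      # byte limit, tiny here for demonstration; the text suggests 2_000_000_000
    10_000,    # file-count limit
);
print join( ',', @$_ ), "\n" for @batches;
```

Each resulting batch could then be fed to its own Archive::Zip object and saved under its own numbered file name, as in the loop shown earlier.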

Another, more realistic solution is to add a trial compression into a temporary ZIP file. Here is the code that worked for us to find out the real compressed size. As you can see, we compress each file into a temporary archive named $ziptmp with the fast compression option (the desired compression level parameter in addFile is 1 instead of 9), then read the written temp ZIP file back to find the approximate compressed size of that file:

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
...
# Creating a new ZIP file in memory
my $zip = Archive::Zip->new();
my $zipname = "archive";
my $zipfile = $zipname . ".zip";
my $ziptmp = "tmp001.zip";
# Trying to load the existing ZIP structure, when ZIP archive already exists
$zip->read( $zipfile ) if -s $zipfile;
...
my $zipcnt = 0;
my $csize = 0;
my $csize_LIMIT = 1024 * 1024 * 1024;    # roughly 1 GB of compressed data
my $cfiles_LIMIT = 10000;                # set limit of 10K files per ZIP

foreach my $file ( @files_to_compress ) {
    # remove if the current file was already in the zip:
    $zip->removeMember( $file );
    # add a new file to zip object in memory, using best compressionLevel = 9
    my $file_member = $zip->addFile( $file, $file, 9 );
    my $zipt = Archive::Zip->new();
    $zipt->addFile( $file, $file, 1 );
    unlink $ziptmp;
    $zipt->writeToFileNamed( $ziptmp );
    $zipt = Archive::Zip->new();
    $zipt->read( $ziptmp ) if -s $ziptmp;
    $file_member = $zipt->memberNamed( $file ); 
    $zipt = undef;
    unlink $ziptmp;
    $csize += $file_member->compressedSize;
    next unless $csize > $csize_LIMIT || $zip->numberOfMembers >= $cfiles_LIMIT;
    # Save our work to a ZIP file
    $zip->overwriteAs( $zipfile );
    $csize = 0;   # reset compressed size
    $zip = Archive::Zip->new();
    $zipfile = $zipname . ++$zipcnt . '.zip';
    # reading new ZIP file into memory object if exists
    $zip->read( $zipfile ) if -s $zipfile;    
}
...
# Final saving to a ZIP file $zipfile if compressed files are still in memory object
$zip->overwriteAs( $zipfile ) if $zip->numberOfMembers;
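
One thing to watch in the snippet above: the fixed temporary name tmp001.zip could collide if two instances of the script ran in the same directory. A minimal sketch of obtaining a unique temporary name with the core File::Temp module follows. This is a suggestion under that assumption, not part of the fix described in this post, and the template string is arbitrary.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Ask File::Temp for a unique file with a .zip suffix in the system temp directory.
# UNLINK => 1 removes the file automatically when the program exits.
my ( $tmp_fh, $ziptmp ) = tempfile(
    'ziptmpXXXXXX',
    SUFFIX => '.zip',
    TMPDIR => 1,
    UNLINK => 1,
);
close $tmp_fh;    # only the name is needed; the archive writer reopens the file itself

print "temporary ZIP name: $ziptmp\n";
```

The resulting $ziptmp could replace the hard-coded name in the loop without further changes.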

Happy compressing!
