zlib/contrib/asm586/README.586 - gcc - Git at Google

 This is a patched version of zlib modified to use
 Pentium-optimized assembly code in the deflation algorithm. The files
 changed/added by this patch are:

 README.586
 match.S

 The effectiveness of these modifications is a bit marginal, as the the
 program's bottleneck seems to be mostly L1-cache contention, for which
 there is no real way to work around without rewriting the basic
 algorithm. The speedup on average is around 5-10% (which is generally
 less than the amount of variance between subsequent executions).
 However, when used at level 9 compression, the cache contention can
 drop enough for the assembly version to achieve 10-20% speedup (and
 sometimes more, depending on the amount of overall redundancy in the
 files). Even here, though, cache contention can still be the limiting
 factor, depending on the nature of the program using the zlib library.
 This may also mean that better improvements will be seen on a Pentium
 with MMX, which suffers much less from L1-cache contention, but I have
 not yet verified this.

 Note that this code has been tailored for the Pentium in particular,
 and will not perform well on the Pentium Pro (due to the use of a
 partial register in the inner loop).

 If you are using an assembler other than GNU as, you will have to
 translate match.S to use your assembler's syntax. (Have fun.)

 Brian Raiter
 breadbox@muppetlabs.com
 April, 1998


 Added for zlib 1.1.3:

 The patches come from
 http://www.muppetlabs.com/~breadbox/software/assembly.html

 To compile zlib with this asm file, copy match.S to the zlib directory
 then do:

 CFLAGS="-O3 -DASMV" ./configure
 make OBJA=match.o
	This is a patched version of zlib modified to use
	Pentium-optimized assembly code in the deflation algorithm. The files
	changed/added by this patch are:

	README.586
	match.S

	The effectiveness of these modifications is a bit marginal, as the the
	program's bottleneck seems to be mostly L1-cache contention, for which
	there is no real way to work around without rewriting the basic
	algorithm. The speedup on average is around 5-10% (which is generally
	less than the amount of variance between subsequent executions).
	However, when used at level 9 compression, the cache contention can
	drop enough for the assembly version to achieve 10-20% speedup (and
	sometimes more, depending on the amount of overall redundancy in the
	files). Even here, though, cache contention can still be the limiting
	factor, depending on the nature of the program using the zlib library.
	This may also mean that better improvements will be seen on a Pentium
	with MMX, which suffers much less from L1-cache contention, but I have
	not yet verified this.

	Note that this code has been tailored for the Pentium in particular,
	and will not perform well on the Pentium Pro (due to the use of a
	partial register in the inner loop).

	If you are using an assembler other than GNU as, you will have to
	translate match.S to use your assembler's syntax. (Have fun.)

	Brian Raiter
	breadbox@muppetlabs.com
	April, 1998


	Added for zlib 1.1.3:

	The patches come from
	http://www.muppetlabs.com/~breadbox/software/assembly.html

	To compile zlib with this asm file, copy match.S to the zlib directory
	then do:

	CFLAGS="-O3 -DASMV" ./configure
	make OBJA=match.o