[2427] in cryptography@c2.net mail archive

home help back first fref pref prev next nref lref last post

RE: Biham's vectorized DES

daemon@ATHENA.MIT.EDU (Trei, Peter)
Tue Mar 31 15:10:53 1998

From: "Trei, Peter" <ptrei@securitydynamics.com>
To: "'karn@qualcomm.com'" <karn@qualcomm.com>, cryptography@c2.net
Date: Tue, 31 Mar 1998 15:05:22 -0500



> -----Original Message-----
> From:	Phil Karn [SMTP:karn@maggie.ka9q.AMPR.ORG]
> Sent:	Tuesday, March 31, 1998 2:18 PM
> To:	cryptography@c2.net
> Cc:	karn@qualcomm.com
> Subject:	Biham's vectorized DES
> 
> Last night yet I found another fascinating Biham paper on his web
> page:
> 
> "A Fast New DES Implementation in Software"
> 
> http://www.cs.technion.ac.il/users/biham/cgi-binw/tr-get.cgi/1997/CS/C
> S0891.ps.gz
> 
> Eli took the DES and expressed it as a sequence of basic logic
> operations: AND, OR, XOR and NOT. This required expressing the S-boxes
> as logic gates, which wasn't too hard because the DES was originally
> designed in the 1970s for just this kind of implementation.  Then
> using the Alpha as if it were a SIMD vector processor with 64 1-bit
> processors, he encrypts 64 blocks in parallel.
> 
	-----------------------
	Phil:

	This is actually old news - the paper has been out for about
	a year. 

	There are a few implementations available, notably one
	from Matthew Kwan: http://www.cs.mu.oz.au/~mkwan/bitslice/

	The speed of this algorithm is dependent on three things. 
	1. The 'natural width' of the machine word; ie, the
	longest bit vector which can be used in a single instruction.
	Thus, all else being equal, a 64 bit machine will be twice as
	fast as a 32 bit machine.

	2. The effiency with which the DES boxes are calculated. A
simple
	product-of-sums or sum-of-products implementation is very
	suboptimal, and simplifying the equations gains you speed for
every
	operation eliminated.

	3. Register count. The implementations I've seen typically
maintain
	over 30 partial results as they run. This is the
	MMX x86 processor's biggest handicap. There are only 8 64 bit
	registers in the MMX set, and so you have to swap partial
results
	in and out of the MMX registers frequently. This really drags
down
	the speed.

	I don't have the figures to hand, but I don't think that MMX has
	led to impressive speed improvements. 

	It really rocks on 64 bit processors with lots of registers,
though.
	The best I've seen is on an Alpha 21164a, which uses 94 clock
cycles/key. Contrast this with 195 on a vanilla Pentium.

	This is widely called a 'bitslice' implementation. Bitslice key
	searchers were heavily used in the recent DES challenges. 

	Peter Trei
>   
> 
>  

home help back first fref pref prev next nref lref last post