Patchwork ulzma delay

Submitter Arne Georg Gleditsch
Date 2010-03-05 14:37:07
Permalink /patch/1010/
State Rejected, archived



In the same way as unrv2b used to, ulzma exhibits very bad instruction
fetch behavior on my Opteron CPUs/Tyan S2912 board.  I'm not entirely
sure what causes this, and despite spending significant time digging
through it I haven't been able to fix it by adjusting MTRRs or other
cache settings.  So I've taken the easy route and patched ulzma to copy
LzmaDecode to the stack before executing it.  This brings the running
time for the fallback stage uncompress from nearly two minutes down to
50ms here.

I'm also seeing weird performance behavior from memset.  It is not
consistent, and it just started appearing at some point in my
development here.  I assume it has something to do with alignment, but
I was frankly not in the mood to start debugging this as well.  I've
included a patch that reduces memset to "rep stosb" on x86, which
eliminates the worst-case behavior (on the order of minutes spent in
cbfs_load_stage) I was seeing here.
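
For reference, a "rep stosb" fill can be sketched as a standalone
function like the one below.  This is only an illustration under a few
assumptions, not the code in the patch: the name memset_stosb is
hypothetical (picked to avoid clashing with the libc memset), the
read/write operands and the "memory" clobber are one way to tell GCC
that the instruction changes the destination and count registers and
writes through the pointer, and the byte loop is kept as a fallback
for non-x86 builds.

```c
#include <stddef.h>

/* Hypothetical helper: fill n bytes at s with the low byte of c. */
static void *memset_stosb(void *s, int c, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	/* Work on copies: rep stosb advances the destination register
	 * and counts the count register down to zero. */
	void *dst = s;
	size_t cnt = n;

	/* AL holds the fill byte.  "+D" and "+c" mark the destination
	 * and count registers as read and written; "memory" tells the
	 * compiler that the buffer contents change. */
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(cnt)
			 : "a"(c)
			 : "memory");
#else
	/* Portable fallback: plain byte loop. */
	unsigned char *p = s;
	size_t i;

	for (i = 0; i < n; i++)
		p[i] = (unsigned char)c;
#endif
	return s;
}
```

The sketch relies on the direction flag being clear on function entry,
which the usual x86 calling conventions guarantee.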

Signed-off-by: Arne Georg Gleditsch <>


diff --git a/src/lib/lzma.c b/src/lib/lzma.c
index fc533c0..245fece 100644
--- a/src/lib/lzma.c
+++ b/src/lib/lzma.c
@@ -25,6 +25,9 @@  unsigned long ulzma(unsigned char * src, unsigned char * dst)
 	CLzmaDecoderState state;
 	SizeT mallocneeds;
 	unsigned char scratchpad[15980];
+	unsigned char LzmaDecode_buf[4096] __attribute__ ((aligned(64)));
+	int (*LzmaDecode_rel)(CLzmaDecoderState *, const unsigned char *, SizeT, SizeT *,
+			      unsigned char *, SizeT, SizeT *);
 	memcpy(properties, src, LZMA_PROPERTIES_SIZE);
 	outSize = *(UInt32 *)(src + LZMA_PROPERTIES_SIZE);
@@ -38,7 +41,9 @@  unsigned long ulzma(unsigned char * src, unsigned char * dst)
 		return 0;
 	state.Probs = (CProb *)scratchpad;
-	res = LzmaDecode(&state, src + LZMA_PROPERTIES_SIZE + 8, (SizeT)0xffffffff, &inProcessed,
+	memcpy(LzmaDecode_buf, LzmaDecode, sizeof(LzmaDecode_buf));
+	LzmaDecode_rel = (void *)LzmaDecode_buf;
+	res = LzmaDecode_rel(&state, src + LZMA_PROPERTIES_SIZE + 8, (SizeT)0xffffffff, &inProcessed,
 		dst, outSize, &outProcessed);
 	if (res != 0) {
 		printk_warning("lzma: Decoding error = %d\n", res);
diff --git a/src/lib/lzmadecode.h b/src/lib/lzmadecode.h
index dedde0d..91160f5 100644
--- a/src/lib/lzmadecode.h
+++ b/src/lib/lzmadecode.h
@@ -62,6 +62,7 @@  typedef struct _CLzmaDecoderState
 int LzmaDecode(CLzmaDecoderState *vs,
     const unsigned char *inStream, SizeT inSize, SizeT *inSizeProcessed,
-    unsigned char *outStream, SizeT outSize, SizeT *outSizeProcessed);
+    unsigned char *outStream, SizeT outSize, SizeT *outSizeProcessed)
+    __attribute__ ((aligned(64)));
diff --git a/src/lib/memset.c b/src/lib/memset.c
index bac3305..1167178 100644
--- a/src/lib/memset.c
+++ b/src/lib/memset.c
@@ -2,11 +2,15 @@ 
 void *memset(void *s, int c, size_t n)
+	asm volatile("rep stosb" :: "D"(s), "a"(c), "c"(n));
 	int i;
 	char *ss = (char *) s;
 	for (i = 0; i < (int)n; i++)
 		ss[i] = c;
 	return s;
diff --git a/src/mainboard/tyan/s2912_fam10/romstage.c b/src/mainboard/tyan/s2912_fam10/romstage.c
index 45d0d94..c319735 100644
--- a/src/mainboard/tyan/s2912_fam10/romstage.c
+++ b/src/mainboard/tyan/s2912_fam10/romstage.c
@@ -371,7 +371,6 @@  void real_main(unsigned long bist, unsigned long cpu_init_detectedx)
-	printk_debug("\n*** Yes, the copy/decompress is taking a while, FIXME!\n");
 	post_cache_as_ram();	// BSP switch stack to ram, copy then execute LB.
 	post_code(0x43);	// Should never see this post code.