Patchwork : memcpy/memset inline asm & config_compress disabled when needed

login
register
about
Submitter Bao, Zheng
Date 2010-02-22 04:48:22
Message ID <DD1CC71B621B004FA76856E5129D6B1703743236@sbjgexmb1.amd.com>
Download mbox | patch
Permalink /patch/950/
State Rejected
Headers show

Comments

Bao, Zheng - 2010-02-22 04:48:22
My fam10 is quite slow when it decompressing the module from rom to ram.
So I have to make some modification. It is not the final resolution. But
both of these two patches have their own meanings. 

disable_compress_asneeded.patch:
Sometimes we need to disable the COMPRESS_FLAG. The default value
is 1 as it was. If we set CONFIG_COMPRESS is 0 in mainboard Kconfig,
we should disable the CBFS_COMPRESS_FLAG.

   Kconfig:
   ------------------
	config COMPRESS
		hex
		default 0
		depends on BOARD_AMD_MAHOGANY_FAM10

Signed-off-by: Zheng Bao <zheng.bao@amd.com>
ron minnich - 2010-02-22 16:29:11
On Sun, Feb 21, 2010 at 8:48 PM, Bao, Zheng <Zheng.Bao@amd.com> wrote:


>
> Index: src/lib/memcpy.c
> ===================================================================
> --- src/lib/memcpy.c    (revision 5133)
> +++ src/lib/memcpy.c    (working copy)
> @@ -3,10 +3,14 @@
>  {
>        const char *src = vsrc;
>        char *dest = vdest;
> -       int i;
>
> -       for (i = 0; i < (int)bytes; i++)
> -               dest[i] = src[i];
> +       __asm__ __volatile__ (                          \
> +               "cld \n\t"                              \
> +               "rep \n\t"                              \
> +               "movsb"                                 \
> +               :               /* No output */         \
> +               : "S"(src), "D"(dest), "c"(bytes)       \
> +               );
>
>        return vdest;
>  }
> Index: src/lib/memset.c
> ===================================================================
> --- src/lib/memset.c    (revision 5133)
> +++ src/lib/memset.c    (working copy)
> @@ -2,11 +2,15 @@
>
>  void *memset(void *s, int c, size_t n)
>  {
> -       int i;
>        char *ss = (char *) s;
>
> -       for (i = 0; i < (int)n; i++)
> -               ss[i] = c;
> +       __asm__ __volatile__ (                  \
> +               "cld\n\t"                       \
> +               "rep\n\t"                       \
> +               "stosb"                         \
> +               :                               \
> +               : "a"(c), "D"(ss), "c"(n)       \
> +               );
>
>        return s;
>  }

I'm glad this works, but I am afraid I have a concern about it.

I've become opposed to inline assembly on several principles in the
last few years:

- we've seen one problem after another as the manner in which __asm__
is supposed to be used has varied as gcc changes.
  we've also had code that had errors in __asm__ which nobody noticed.

- I'd like to understand why the C code is so much worse. I can't see
a good reason for C to be so bad at this simple task.
For example, with the memset, what if we just created a memzero
function instead, which used a constant '0' instead of a parameter:
does the code improve?

- I now feel a .c file should contain C. If we need a .s, let's create a .s.

So, first, can we please have a look at the .s produced by the C code
and see if we can understand why it is so slow?

Thanks

Ron
Stefan Reinauer - 2010-02-22 17:23:22
On 2/22/10 5:29 PM, ron minnich wrote:
> I'm glad this works, but I am afraid I have a concern about it.
>
> I've become opposed to inline assembly on several principles in the
> last few years:
>   
I have another one:

- no assembly outside of cpu/ and arch/i386
Bao, Zheng - 2010-02-23 01:45:20
I think the memcpy/memset and decompressing are slow because of the Icache. Only one instruction executes repeatedly in the asm code. It doesn't have to access the instructions in the ROM every time. The memcpy/memset are easy to narrow down to a single instruction. But ulzma() can not. So I don't like my patch either. Does anybody have the idea to finally solve this problem?

Zheng

-----Original Message-----
From: ron minnich [mailto:rminnich@gmail.com] 
Sent: Tuesday, February 23, 2010 12:29 AM
To: Bao, Zheng
Cc: coreboot@coreboot.org
Subject: Re: [coreboot] [patch]: memcpy/memset inline asm & config_compress disabled when needed

On Sun, Feb 21, 2010 at 8:48 PM, Bao, Zheng <Zheng.Bao@amd.com> wrote:


>
> Index: src/lib/memcpy.c
> ===================================================================
> --- src/lib/memcpy.c    (revision 5133)
> +++ src/lib/memcpy.c    (working copy)
> @@ -3,10 +3,14 @@
>  {
>        const char *src = vsrc;
>        char *dest = vdest;
> -       int i;
>
> -       for (i = 0; i < (int)bytes; i++)
> -               dest[i] = src[i];
> +       __asm__ __volatile__ (                          \
> +               "cld \n\t"                              \
> +               "rep \n\t"                              \
> +               "movsb"                                 \
> +               :               /* No output */         \
> +               : "S"(src), "D"(dest), "c"(bytes)       \
> +               );
>
>        return vdest;
>  }
> Index: src/lib/memset.c
> ===================================================================
> --- src/lib/memset.c    (revision 5133)
> +++ src/lib/memset.c    (working copy)
> @@ -2,11 +2,15 @@
>
>  void *memset(void *s, int c, size_t n)
>  {
> -       int i;
>        char *ss = (char *) s;
>
> -       for (i = 0; i < (int)n; i++)
> -               ss[i] = c;
> +       __asm__ __volatile__ (                  \
> +               "cld\n\t"                       \
> +               "rep\n\t"                       \
> +               "stosb"                         \
> +               :                               \
> +               : "a"(c), "D"(ss), "c"(n)       \
> +               );
>
>        return s;
>  }

I'm glad this works, but I am afraid I have a concern about it.

I've become opposed to inline assembly on several principles in the
last few years:

- we've seen one problem after another as the manner in which __asm__
is supposed to be used has varied as gcc changes.
  we've also had code that had errors in __asm__ which nobody noticed.

- I'd like to understand why the C code is so much worse. I can't see
a good reason for C to be so bad at this simple task.
For example, with the memset, what if we just created a memzero
function instead, which used a constant '0' instead of a parameter:
does the code improve?

- I now feel a .c file should contain C. If we need a .s, let's create a .s.

So, first, can we please have a look at the .s produced by the C code
and see if we can understand why it is so slow?

Thanks

Ron
Marc Jones - 2010-02-24 00:29:50
A couple of thoughts. Can we move the data as dwords instead of bytes
for uncompressed data? As for the cache, can you check the MTRRs and
CR0 before the copy? We should cache the cbfs instructions and data
being moved. Does the cbfs code need a call to setup the cache with
the source or should it be setup prior to calling cbfs?

Marc


On Mon, Feb 22, 2010 at 6:45 PM, Bao, Zheng <Zheng.Bao@amd.com> wrote:
> I think the memcpy/memset and decompressing are slow because of the Icache. Only one instruction executes repeatedly in the asm code. It doesn't have to access the instructions in the ROM every time. The memcpy/memset are easy to narrow down to a single instruction. But ulzma() can not. So I don't like my patch either. Does anybody have the idea to finally solve this problem?
>
> Zheng
>
> -----Original Message-----
> From: ron minnich [mailto:rminnich@gmail.com]
> Sent: Tuesday, February 23, 2010 12:29 AM
> To: Bao, Zheng
> Cc: coreboot@coreboot.org
> Subject: Re: [coreboot] [patch]: memcpy/memset inline asm & config_compress disabled when needed
>
> On Sun, Feb 21, 2010 at 8:48 PM, Bao, Zheng <Zheng.Bao@amd.com> wrote:
>
>
>>
>> Index: src/lib/memcpy.c
>> ===================================================================
>> --- src/lib/memcpy.c    (revision 5133)
>> +++ src/lib/memcpy.c    (working copy)
>> @@ -3,10 +3,14 @@
>>  {
>>        const char *src = vsrc;
>>        char *dest = vdest;
>> -       int i;
>>
>> -       for (i = 0; i < (int)bytes; i++)
>> -               dest[i] = src[i];
>> +       __asm__ __volatile__ (                          \
>> +               "cld \n\t"                              \
>> +               "rep \n\t"                              \
>> +               "movsb"                                 \
>> +               :               /* No output */         \
>> +               : "S"(src), "D"(dest), "c"(bytes)       \
>> +               );
>>
>>        return vdest;
>>  }
>> Index: src/lib/memset.c
>> ===================================================================
>> --- src/lib/memset.c    (revision 5133)
>> +++ src/lib/memset.c    (working copy)
>> @@ -2,11 +2,15 @@
>>
>>  void *memset(void *s, int c, size_t n)
>>  {
>> -       int i;
>>        char *ss = (char *) s;
>>
>> -       for (i = 0; i < (int)n; i++)
>> -               ss[i] = c;
>> +       __asm__ __volatile__ (                  \
>> +               "cld\n\t"                       \
>> +               "rep\n\t"                       \
>> +               "stosb"                         \
>> +               :                               \
>> +               : "a"(c), "D"(ss), "c"(n)       \
>> +               );
>>
>>        return s;
>>  }
>
> I'm glad this works, but I am afraid I have a concern about it.
>
> I've become opposed to inline assembly on several principles in the
> last few years:
>
> - we've seen one problem after another as the manner in which __asm__
> is supposed to be used has varied as gcc changes.
>  we've also had code that had errors in __asm__ which nobody noticed.
>
> - I'd like to understand why the C code is so much worse. I can't see
> a good reason for C to be so bad at this simple task.
> For example, with the memset, what if we just created a memzero
> function instead, which used a constant '0' instead of a parameter:
> does the code improve?
>
> - I now feel a .c file should contain C. If we need a .s, let's create a .s.
>
> So, first, can we please have a look at the .s produced by the C code
> and see if we can understand why it is so slow?
>
> Thanks
>
> Ron
>
>
>
> --
> coreboot mailing list: coreboot@coreboot.org
> http://www.coreboot.org/mailman/listinfo/coreboot
>

Patch

Index: Makefile
===================================================================
--- Makefile	(revision 5133)
+++ Makefile	(working copy)
@@ -231,6 +231,9 @@ 
 CFLAGS += -fno-common -ffreestanding -fno-builtin -fomit-frame-pointer
 
 CBFS_COMPRESS_FLAG:=l
+ifeq ($(CONFIG_COMPRESS),0)
+CBFS_COMPRESS_FLAG:=
+endif
 CBFS_PAYLOAD_COMPRESS_FLAG:=
 ifeq ($(CONFIG_COMPRESSED_PAYLOAD_LZMA),y)
 CBFS_PAYLOAD_COMPRESS_FLAG:=l 


memcpy_memset_inline_asm.patch:
Rewrite the memset/memcpy as inlined asmembly code. It make the code
run much faster if memcpy runs in ROM.

Signed-off-by: Zheng Bao <zheng.bao@amd.com>

Index: src/lib/memcpy.c
===================================================================
--- src/lib/memcpy.c	(revision 5133)
+++ src/lib/memcpy.c	(working copy)
@@ -3,10 +3,14 @@ 
 {
 	const char *src = vsrc;
 	char *dest = vdest;
-	int i;
 
-	for (i = 0; i < (int)bytes; i++)
-		dest[i] = src[i];
+	__asm__ __volatile__ (				\
+		"cld \n\t"				\
+		"rep \n\t"				\
+		"movsb"					\
+		:		/* No output */		\
+		: "S"(src), "D"(dest), "c"(bytes)	\
+		);
 
 	return vdest;
 }
Index: src/lib/memset.c
===================================================================
--- src/lib/memset.c	(revision 5133)
+++ src/lib/memset.c	(working copy)
@@ -2,11 +2,15 @@ 
 
 void *memset(void *s, int c, size_t n)
 {
-	int i;
 	char *ss = (char *) s;
 
-	for (i = 0; i < (int)n; i++)
-		ss[i] = c;
+	__asm__ __volatile__ (			\
+		"cld\n\t"			\
+		"rep\n\t"			\
+		"stosb"				\
+		:				\
+		: "a"(c), "D"(ss), "c"(n)	\
+		);
 
 	return s;
 }