Patchwork use memset

login
register
about
Submitter Myles Watson
Date 2010-03-11 21:14:21
Message ID <2831fecf1003111314y380d6602n1426653b1af8e540@mail.gmail.com>
Download mbox | patch
Permalink /patch/1035/
State Accepted
Headers show

Comments

Myles Watson - 2010-03-11 21:14:21
I was having trouble with stack corruption.  Using memset (C) instead of
clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.

TSC difference with clear_memory 0xFA884D
TSC difference with memset 0x826742

Anyway, this patch removes a couple of files that don't need to exist
anymore, given that only K8 was using clear_memory.

SIgned-off-by: Myles Watson <mylesgw@gmail.com>

Thanks,
Myles
Patrick Georgi - 2010-03-11 21:20:14
Am 11.03.2010 22:14, schrieb Myles Watson:
> I was having trouble with stack corruption.  Using memset (C) instead of
> clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
> 
> TSC difference with clear_memory 0xFA884D
> TSC difference with memset 0x826742
Interesting that the C version is faster in this case.

> Anyway, this patch removes a couple of files that don't need to exist
> anymore, given that only K8 was using clear_memory.
Yay :-)

> SIgned-off-by: Myles Watson <mylesgw@gmail.com <mailto:mylesgw@gmail.com>>
Acked-by: Patrick Georgi <patrick.georgi@coresystems.de>
Myles Watson - 2010-03-11 21:39:48
> Am 11.03.2010 22:14, schrieb Myles Watson:
> > I was having trouble with stack corruption.  Using memset (C) instead of
> > clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
I should have said it got rid of the memory corruption I was seeing, too :)

> > TSC difference with clear_memory 0xFA884D
> > TSC difference with memset 0x826742
> Interesting that the C version is faster in this case.
Yeah.  Maybe you could do a lot better with a different asm implementation.
You'd hope that the compiler would have a pretty good routine for clearing
memory, though.  I didn't dig into it to see what the real difference was.

> > SIgned-off-by: Myles Watson <mylesgw@gmail.com
> <mailto:mylesgw@gmail.com>>
> Acked-by: Patrick Georgi <patrick.georgi@coresystems.de>
Rev 5201.

Thanks,
Myles
Arne Georg Gleditsch - 2010-03-18 08:54:22
Myles Watson <mylesgw@gmail.com> writes:
> I was having trouble with stack corruption.  Using memset (C) instead of
> clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
>
> TSC difference with clear_memory 0xFA884D
> TSC difference with memset 0x826742

That's odd.  I just recently sent a patch to the list ("ulzma delay")
that did pretty much the opposite, as I was seeing really bad
performance for the C memset function on my Opteron (Istanbul) boxes.
memset would take minutes to do what ran in a handful of ms using "rep
stosb", by all accounts because of instruction cache thrashing.

I see clear_memory was using "stosl", but apart from that it looks
very similar to the variant I ended up with to improve performance.

Could you see if you experience stack corruption with the "rep stosb"
patch I posted for memset as well?  I'd like to see that go in, but of
course it's a problem if it results in a performance degradation on
other platforms. Perhaps we could enable it only for the platforms where
instruction footprint/fetches is known to be an issue, ie fam10?
Myles Watson - 2010-03-18 13:28:11
> Myles Watson <mylesgw@gmail.com> writes:
> > I was having trouble with stack corruption.  Using memset (C) instead of
> > clear_memory(asm) speeds it up by almost a factor of 2 for a 1M region.
> >
> > TSC difference with clear_memory 0xFA884D
> > TSC difference with memset 0x826742
> 
> That's odd.  I just recently sent a patch to the list ("ulzma delay")
> that did pretty much the opposite, as I was seeing really bad
> performance for the C memset function on my Opteron (Istanbul) boxes.
> memset would take minutes to do what ran in a handful of ms using "rep
> stosb", by all accounts because of instruction cache thrashing.
> 
> I see clear_memory was using "stosl", but apart from that it looks
> very similar to the variant I ended up with to improve performance.
Once caching works correctly for fam10, maybe you'll see similar performance
numbers?

> Could you see if you experience stack corruption with the "rep stosb"
> patch I posted for memset as well?  
It's hard to tell if you experience stack corruption... unless it bites you.
There are a lot of places on the stack where it won't matter if it gets
corrupted.  I don't have a good way to test that.

> I'd like to see that go in, but of
> course it's a problem if it results in a performance degradation on
> other platforms. Perhaps we could enable it only for the platforms where
> instruction footprint/fetches is known to be an issue, ie fam10?
The best thing would be to fix caching on fam10.  Of course if that's not
feasible for some reason, then adding asm just for that architecture could
be the way to go.  In general, the more we can keep it straight C, the
better for me.

Thanks,
Myles

Patch

Index: svn/src/cpu/amd/car/clear_init_ram.c
===================================================================
--- svn.orig/src/cpu/amd/car/clear_init_ram.c
+++ /dev/null
@@ -1,22 +0,0 @@ 
-/* by yhlu 6.2005 */
-/* be warned, this file will be used core 0/node 0 only */
-
-static void __attribute__((noinline)) clear_init_ram(void)
-{
-	// gcc 3.4.5 will inline the copy_and_run and clear_init_ram in post_cache_as_ram
-	// will reuse %edi as 0 from clear_memory for copy_and_run part, actually it is increased already
-	// so noline clear_init_ram
-
-#if CONFIG_HAVE_ACPI_RESUME == 1
-	/* clear only coreboot used region of memory. Note: this may break ECC enabled boards */
-	clear_memory( CONFIG_RAMBASE, (CONFIG_RAMTOP) - CONFIG_RAMBASE - CONFIG_DCACHE_RAM_SIZE);
-#else
-        clear_memory(0, ((CONFIG_RAMTOP) - CONFIG_DCACHE_RAM_SIZE));
-#endif
-}
-
-/* be warned, this file will be used by core other than core 0/node 0 or core0/node0 when cpu_reset*/
-static void set_init_ram_access(void)
-{
-	set_var_mtrr(0, 0x00000000, CONFIG_RAMTOP, MTRR_TYPE_WRBACK);
-}
Index: svn/src/cpu/amd/car/post_cache_as_ram.c
===================================================================
--- svn.orig/src/cpu/amd/car/post_cache_as_ram.c
+++ svn/src/cpu/amd/car/post_cache_as_ram.c
@@ -3,8 +3,6 @@ 
  */
 #include "cpu/amd/car/disable_cache_as_ram.c"
 
-#include "cpu/amd/car/clear_init_ram.c"
-
 static inline void print_debug_pcar(const char *strval, uint32_t val)
 {
         printk_debug("%s%08x\r\n", strval, val);
@@ -64,7 +62,8 @@  static void post_cache_as_ram(void)
         #error "You need to set CONFIG_RAMTOP greater than 1M"
 #endif
 	
-	set_init_ram_access(); /* So we can access RAM from [1M, CONFIG_RAMTOP) */
+	/* So we can access RAM from [1M, CONFIG_RAMTOP) */
+	set_var_mtrr(0, 0x00000000, CONFIG_RAMTOP, MTRR_TYPE_WRBACK);
 
 //	dump_mem(CONFIG_DCACHE_RAM_BASE+CONFIG_DCACHE_RAM_SIZE-0x8000, CONFIG_DCACHE_RAM_BASE+CONFIG_DCACHE_RAM_SIZE-0x7c00);
 	print_debug("Copying data from cache to RAM -- switching to use RAM as stack... ");
@@ -94,7 +93,12 @@  static void post_cache_as_ram(void)
 	disable_cache_as_ram_bsp();  
 
         print_debug("Clearing initial memory region: ");
-        clear_init_ram(); //except the range from [(CONFIG_RAMTOP) - CONFIG_DCACHE_RAM_SIZE, (CONFIG_RAMTOP))
+#if CONFIG_HAVE_ACPI_RESUME == 1
+	/* clear only coreboot used region of memory. Note: this may break ECC enabled boards */
+	memset((void*) CONFIG_RAMBASE, (CONFIG_RAMTOP) - CONFIG_RAMBASE - CONFIG_DCACHE_RAM_SIZE, 0);
+#else
+        memset((void*)0, ((CONFIG_RAMTOP) - CONFIG_DCACHE_RAM_SIZE), 0);
+#endif
         print_debug("Done\r\n");
 
 //	dump_mem((CONFIG_RAMTOP) - 0x8000, (CONFIG_RAMTOP) - 0x7c00);
Index: svn/src/cpu/amd/model_10xxx/model_10xxx_init.c
===================================================================
--- svn.orig/src/cpu/amd/model_10xxx/model_10xxx_init.c
+++ svn/src/cpu/amd/model_10xxx/model_10xxx_init.c
@@ -34,7 +34,6 @@ 
 #include <cpu/cpu.h>
 #include <cpu/x86/cache.h>
 #include <cpu/x86/mtrr.h>
-#include <cpu/x86/mem.h>
 #include <cpu/amd/quadcore.h>
 #include <cpu/amd/model_10xxx_msr.h>
 
Index: svn/src/cpu/amd/model_fxx/model_fxx_init.c
===================================================================
--- svn.orig/src/cpu/amd/model_fxx/model_fxx_init.c
+++ svn/src/cpu/amd/model_fxx/model_fxx_init.c
@@ -24,7 +24,6 @@ 
 #include <cpu/cpu.h>
 #include <cpu/x86/cache.h>
 #include <cpu/x86/mtrr.h>
-#include <cpu/x86/mem.h>
 
 #include <cpu/amd/dualcore.h>
 
@@ -238,7 +237,7 @@  static inline void clear_2M_ram(unsigned
 
                 /* clear memory 2M (limitk - basek) */
                 addr = (void *)(((uint32_t)addr) | ((basek & 0x7ff) << 10));
-                clear_memory(addr, size);
+                memset(addr, size, 0);
 }
 
 static void init_ecc_memory(unsigned node_id)
Index: svn/src/mainboard/amd/serengeti_cheetah_fam10/romstage.c
===================================================================
--- svn.orig/src/mainboard/amd/serengeti_cheetah_fam10/romstage.c
+++ svn/src/mainboard/amd/serengeti_cheetah_fam10/romstage.c
@@ -130,7 +130,6 @@  static int spd_read_byte(u32 device, u32
 #include "northbridge/amd/amdfam10/amdfam10.h"
 #include "northbridge/amd/amdht/ht_wrapper.c"
 
-#include "include/cpu/x86/mem.h"
 #include "northbridge/amd/amdfam10/raminit_sysinfo_in_ram.c"
 #include "northbridge/amd/amdfam10/raminit_amdmct.c"
 #include "northbridge/amd/amdfam10/amdfam10_pci.c"
Index: svn/src/mainboard/msi/ms9652_fam10/romstage.c
===================================================================
--- svn.orig/src/mainboard/msi/ms9652_fam10/romstage.c
+++ svn/src/mainboard/msi/ms9652_fam10/romstage.c
@@ -110,7 +110,6 @@  static inline int spd_read_byte(unsigned
 #include "northbridge/amd/amdfam10/amdfam10.h"
 #include "northbridge/amd/amdht/ht_wrapper.c"
 
-#include "include/cpu/x86/mem.h"
 #include "northbridge/amd/amdfam10/raminit_sysinfo_in_ram.c"
 #include "northbridge/amd/amdfam10/raminit_amdmct.c"
 #include "northbridge/amd/amdfam10/amdfam10_pci.c"
Index: svn/src/mainboard/supermicro/h8dmr_fam10/romstage.c
===================================================================
--- svn.orig/src/mainboard/supermicro/h8dmr_fam10/romstage.c
+++ svn/src/mainboard/supermicro/h8dmr_fam10/romstage.c
@@ -108,7 +108,6 @@  static inline int spd_read_byte(unsigned
 #include "northbridge/amd/amdfam10/amdfam10.h"
 #include "northbridge/amd/amdht/ht_wrapper.c"
 
-#include "include/cpu/x86/mem.h"
 #include "northbridge/amd/amdfam10/raminit_sysinfo_in_ram.c"
 #include "northbridge/amd/amdfam10/raminit_amdmct.c"
 #include "northbridge/amd/amdfam10/amdfam10_pci.c"
Index: svn/src/mainboard/supermicro/h8qme_fam10/romstage.c
===================================================================
--- svn.orig/src/mainboard/supermicro/h8qme_fam10/romstage.c
+++ svn/src/mainboard/supermicro/h8qme_fam10/romstage.c
@@ -108,7 +108,6 @@  static inline int spd_read_byte(unsigned
 #include "northbridge/amd/amdfam10/amdfam10.h"
 #include "northbridge/amd/amdht/ht_wrapper.c"
 
-#include "include/cpu/x86/mem.h"
 #include "northbridge/amd/amdfam10/raminit_sysinfo_in_ram.c"
 #include "northbridge/amd/amdfam10/raminit_amdmct.c"
 #include "northbridge/amd/amdfam10/amdfam10_pci.c"
Index: svn/src/mainboard/tyan/s2912_fam10/romstage.c
===================================================================
--- svn.orig/src/mainboard/tyan/s2912_fam10/romstage.c
+++ svn/src/mainboard/tyan/s2912_fam10/romstage.c
@@ -110,7 +110,6 @@  static inline int spd_read_byte(unsigned
 #include "northbridge/amd/amdfam10/amdfam10.h"
 #include "northbridge/amd/amdht/ht_wrapper.c"
 
-#include "include/cpu/x86/mem.h"
 #include "northbridge/amd/amdfam10/raminit_sysinfo_in_ram.c"
 #include "northbridge/amd/amdfam10/raminit_amdmct.c"
 #include "northbridge/amd/amdfam10/amdfam10_pci.c"
Index: svn/src/northbridge/amd/amdk8/raminit.c
===================================================================
--- svn.orig/src/northbridge/amd/amdk8/raminit.c
+++ svn/src/northbridge/amd/amdk8/raminit.c
@@ -4,7 +4,6 @@ 
 	2005.02 yhlu add E0 memory hole support
 */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/cache.h>
 #include <cpu/x86/mtrr.h>
 #include <stdlib.h>
Index: svn/src/northbridge/amd/amdk8/raminit_f.c
===================================================================
--- svn.orig/src/northbridge/amd/amdk8/raminit_f.c
+++ svn/src/northbridge/amd/amdk8/raminit_f.c
@@ -20,7 +20,6 @@ 
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
 */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/cache.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/tsc.h>
Index: svn/src/northbridge/intel/e7520/raminit.c
===================================================================
--- svn.orig/src/northbridge/intel/e7520/raminit.c
+++ svn/src/northbridge/intel/e7520/raminit.c
@@ -18,7 +18,6 @@ 
  *
  */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/cache.h>
 #include <stdlib.h>
Index: svn/src/northbridge/intel/e7525/raminit.c
===================================================================
--- svn.orig/src/northbridge/intel/e7525/raminit.c
+++ svn/src/northbridge/intel/e7525/raminit.c
@@ -18,7 +18,6 @@ 
  *
  */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/cache.h>
 #include <stdlib.h>
Index: svn/src/northbridge/intel/i3100/raminit.c
===================================================================
--- svn.orig/src/northbridge/intel/i3100/raminit.c
+++ svn/src/northbridge/intel/i3100/raminit.c
@@ -19,7 +19,6 @@ 
  *
  */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/cache.h>
 #include <stdlib.h>
Index: svn/src/northbridge/intel/i3100/raminit_ep80579.c
===================================================================
--- svn.orig/src/northbridge/intel/i3100/raminit_ep80579.c
+++ svn/src/northbridge/intel/i3100/raminit_ep80579.c
@@ -19,7 +19,6 @@ 
  *
  */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/cache.h>
 #include "raminit_ep80579.h"
Index: svn/src/northbridge/intel/i945/raminit.c
===================================================================
--- svn.orig/src/northbridge/intel/i945/raminit.c
+++ svn/src/northbridge/intel/i945/raminit.c
@@ -17,7 +17,6 @@ 
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
  */
 
-#include <cpu/x86/mem.h>
 #include <cpu/x86/mtrr.h>
 #include <cpu/x86/cache.h>
 #include <spd.h>
Index: svn/src/cpu/amd/model_10xxx/init_cpus.c
===================================================================
--- svn.orig/src/cpu/amd/model_10xxx/init_cpus.c
+++ svn/src/cpu/amd/model_10xxx/init_cpus.c
@@ -421,7 +421,7 @@  static u32 init_cpus(u32 cpu_init_detect
 		 */
 		//wait_till_sysinfo_in_ram();
 
-		set_init_ram_access();
+		set_var_mtrr(0, 0x00000000, CONFIG_RAMTOP, MTRR_TYPE_WRBACK);
 
 		STOP_CAR_AND_CPU();
 		printk_debug("\nAP %02x should be halted but you are reading this....\n", apicid);
Index: svn/src/cpu/amd/model_fxx/init_cpus.c
===================================================================
--- svn.orig/src/cpu/amd/model_fxx/init_cpus.c
+++ svn/src/cpu/amd/model_fxx/init_cpus.c
@@ -317,7 +317,7 @@  static unsigned init_cpus(unsigned cpu_i
 			        print_initcpu8("while waiting for BSP signal to STOP, timeout in ap ", apicid);
 			}
                         lapic_write(LAPIC_MSG_REG, (apicid<<24) | 0x44); // bsp can not check it before stop_this_cpu
-                        set_init_ram_access();
+			set_var_mtrr(0, 0x00000000, CONFIG_RAMTOP, MTRR_TYPE_WRBACK);
 	#if CONFIG_MEM_TRAIN_SEQ == 1
 			train_ram_on_node(id.nodeid, id.coreid, sysinfo,
 					  (unsigned) STOP_CAR_AND_CPU);
Index: svn/src/include/cpu/x86/mem.h
===================================================================
--- svn.orig/src/include/cpu/x86/mem.h
+++ /dev/null
@@ -1,32 +0,0 @@ 
-#ifndef CPU_X86_MEM_H
-#define CPU_X86_MEM_H
-
-/* Optimized generic x86 assembly for clearing memory */
-static inline void clear_memory(void *addr, unsigned long size)
-{
-        asm volatile(
-                "cld \n\t"
-                "rep; stosl\n\t"
-                : /* No outputs */
-                : "a" (0), "D" (addr), "c" (size>>2)
-                );
-
-}
-
-#endif /* CPU_X86_MEM_H */
-#ifndef CPU_X86_MEM_H
-#define CPU_X86_MEM_H
-
-/* Optimized generic x86 assembly for clearing memory */
-static inline void clear_memory(void *addr, unsigned long size)
-{
-        asm volatile(
-                "cld \n\t"
-                "rep; stosl\n\t"
-                : /* No outputs */
-                : "a" (0), "D" (addr), "c" (size>>2)
-                );
-
-}
-
-#endif /* CPU_X86_MEM_H */