Patchwork GeodeLX RAM initialisation issue

Submitter Nathan Williams
Date 2009-11-10 20:26:02
Message ID <4AF9CC5A.7020803@traverse.com.au>
Permalink /patch/550/
State Not Applicable

Comments

Nathan Williams - 2009-11-10 20:26:02
Marc Jones wrote:
> On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>> Another observation I made was that by setting the debug_level to BIOS_CRIT,
>> instead of dying at the usual spot in disable_car() and stopping, coreboot
>> would reset continuously (cycling every 1-2 seconds)

Since I needed a BIOS without much debugging enabled
for a customer sample, I looked a bit deeper to find the cause of this
continuous reset behaviour.  Even changing the debug level from
BIOS_SPEW to BIOS_DEBUG caused the reset.  I tracked it down to a single
printk, and with my attached patch it now works at BIOS_CRIT, just
with a few extra debug lines.  Without the printk, the code gets to
"missing phase4_read_resources" (just a few lines down from my patch)
before restarting.

>>
>> Another issue that's partly related is the ability for coreboot to set  the
>> GeodeLink speed depending on the detected RAM speed.  As a work-around, we
>> are only using 333MHz SODIMMs and have set the bootstrap bits for
>> GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500Mhz CPU,
>> 333MHz GLIU instead of bypass mode.  In bypass mode, the GLIU is 266MHz and
>> some of our 333MHz RAM will fail in disable_car(). As a test, I have
>> experimented with
>> pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to
>> change the GLIU to 333MHz.  I probably didn't have the correct bits set, so
>> even though I managed to set GLIU, it failed the last test (DLL) in
>> sdram_enable() and would reset.
> 
> Your second problem might explain the first. You should look closely
> at the detection problem. It depends on the reset and the state of the
> rstpll flags. There could be a corner case or something unusual going
> on. How did you set the boot strap bits with hardware (straps)? You
> should use pll_reset(ManualConf) settings to change it with hardware.
> 
> Marc
> 
> 

Sorry, I should have explained that we set the bootstrap bits in hardware:

Bit 7: PW1 pad - active high when the PCI clock is 66 MHz, low for 33 MHz.
Bit 6: IRQ13 pad - active high for stall-on-reset debug feature, otherwise low.
Bit 5: PW0 pad - part of CPU/GLIU frequency selects.
Bit 4: SUSPA# pad - part of CPU/GLIU frequency selects.
Bit 3: GNT2# pad - part of CPU/GLIU frequency selects.
Bit 2: GNT1# pad - part of CPU/GLIU frequency selects.
Bit 1: GNT0# pad - part of CPU/GLIU frequency selects.

We have pulled these pins up or down to be "0010110", which corresponds
to CPU 500MHz, GLIU 333MHz in table 6-87.  This should also mean that
on reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch
(except that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on
Power On Reset (POR)).  So should I be using pll_reset(ManualConf)?  I'll
try it later today and see if I can get some debugging output.
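
As a sanity check on the strap value quoted above, here is a minimal sketch that splits the low dword of GLCP_SYS_RSTPLL into the fields described in this message. The bit positions are the ones listed above; the struct and function names are hypothetical and are not taken from the coreboot tree.

```c
#include <stdint.h>

/* Hypothetical container for the fields named in the message above. */
struct rstpll_straps {
    unsigned pci_66mhz;      /* bit 7: PW1 pad, 1 = 66 MHz PCI clock */
    unsigned stall_on_reset; /* bit 6: IRQ13 pad, 1 = stall-on-reset debug */
    unsigned freq_select;    /* bits 5:1: PW0, SUSPA#, GNT2#, GNT1#, GNT0# */
    unsigned bootstraps;     /* bits 7:1 taken together as one 7-bit value */
    unsigned swflags;        /* bits 31:26: SWFLAGS */
};

/* Decode the low 32 bits of GLCP_SYS_RSTPLL (illustrative helper only). */
static struct rstpll_straps decode_rstpll_lo(uint32_t lo)
{
    struct rstpll_straps s;

    s.pci_66mhz = (lo >> 7) & 0x1;
    s.stall_on_reset = (lo >> 6) & 0x1;
    s.freq_select = (lo >> 1) & 0x1f;
    s.bootstraps = (lo >> 1) & 0x7f;
    s.swflags = (lo >> 26) & 0x3f;
    return s;
}
```

Feeding it 0x0300182C, the low dword of the expected reset value, gives a 7-bit bootstrap field of 0x16, which is the "0010110" pattern above, with SWFLAGS equal to 0.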

Regards,
Nathan
Marc Jones - 2009-11-10 23:07:53
On Tue, Nov 10, 2009 at 1:26 PM, Nathan Williams <nathan@traverse.com.au> wrote:
> Marc Jones wrote:
>>
>> On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams <nathan@traverse.com.au>
>> wrote:
>>>
>>> Another observation I made was that by setting the debug_level to
>>> BIOS_CRIT,
>>> instead of dying at the usual spot in disable_car() and stopping,
>>> coreboot
>>> would reset continuously (cycling every 1-2 seconds)
>
> Since I needed to have a BIOS that didn't have much debugging enabled for a
> customer sample, I looked a bit deeper to find the cause of this continuous
> reset behaviour.  Even changing the debug level from BIOS_SPEW to BIOS_DEBUG
> caused the reset.  I tracked it down to a single  printk and my attached
> patch means it works at BIOS_CRIT now, just with a few extra debug lines.
>  Without the printk, the code gets to "missing phase4_read_resources" (just
> a few lines down from my patch) before restarting.

This sounds like it is probably blowing the stack or the stack hits
memory that isn't working correctly.


>
>>>
>>> Another issue that's partly related is the ability for coreboot to set
>>>  the
>>> GeodeLink speed depending on the detected RAM speed.  As a work-around,
>>> we
>>> are only using 333MHz SODIMMs and have set the bootstrap bits for
>>> GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500Mhz CPU,
>>> 333MHz GLIU instead of bypass mode.  In bypass mode, the GLIU is 266MHz
>>> and
>>> some of our 333MHz RAM will fail in disable_car(). As a test, I have
>>> experimented with
>>> pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to
>>> change the GLIU to 333MHz.  I probably didn't have the correct bits set,
>>> so
>>> even though I managed to set GLIU, it failed the last test (DLL) in
>>> sdram_enable() and would reset.
>>
>> Your second problem might explain the first. You should look closely
>> at the detection problem. It depends on the reset and the state of the
>> rstpll flags. There could be a corner case or something unusual going
>> on. How did you set the boot strap bits with hardware (straps)? You
>> should use pll_reset(ManualConf) settings to change it with hardware.
>>
>> Marc
>>
>>
>
> Sorry, I should have explained that we set the bootstrap bits in hardware:
>
> Bit 7: PW1 pad - active high when the PCI clock is 66 MHz, low for 33 MHz.
> Bit 6: IRQ13 pad - active high for stall-on-reset debug feature, otherwise
> low.
> Bit 5: PW0 pad - part of CPU/GLIU frequency selects.
> Bit 4: SUSPA# pad - part of CPU/GLIU frequency selects.
> Bit 3: GNT2# pad - part of CPU/GLIU frequency selects.
> Bit 2: GNT1# pad - part of CPU/GLIU frequency selects.
> Bit 1: GNT0# pad - part of CPU/GLIU frequency selects.
>
> We have pulled these pins up or down to be "0010110", which corresponds to
> CPU 500MHz, GLIU 333MHz in table 6-87.  This should also mean that the on
> reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch (except
> that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on Power On Reset
> (POR).  So I should be using pll_reset(ManualConf)?  I'll try it later today
> and see if I can get some debugging output.

If it is set by straps, it should be doing the right thing and you
don't need to use the ManualConf. There could still be a corner case
and you should try to trace through the soft reset that is causing the
problem. Also, have you diff'd the MC settings between the BIOS and
coreboot? I would be interested in any discrepancies.

Marc
Nathan Williams - 2009-11-23 07:27:35
Marc Jones wrote:
> On Tue, Nov 10, 2009 at 1:26 PM, Nathan Williams <nathan@traverse.com.au> wrote:
>> Marc Jones wrote:
>>> On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams <nathan@traverse.com.au>
>>> wrote:
>>>> Another observation I made was that by setting the debug_level to
>>>> BIOS_CRIT,
>>>> instead of dying at the usual spot in disable_car() and stopping,
>>>> coreboot
>>>> would reset continuously (cycling every 1-2 seconds)
>> Since I needed to have a BIOS that didn't have much debugging enabled for a
>> customer sample, I looked a bit deeper to find the cause of this continuous
>> reset behaviour.  Even changing the debug level from BIOS_SPEW to BIOS_DEBUG
>> caused the reset.  I tracked it down to a single  printk and my attached
>> patch means it works at BIOS_CRIT now, just with a few extra debug lines.
>>  Without the printk, the code gets to "missing phase4_read_resources" (just
>> a few lines down from my patch) before restarting.
> 
> This sounds like it is probably blowing the stack or the stack hits
> memory that isn't working correctly.
> 
> 
>>>> Another issue that's partly related is the ability for coreboot to set
>>>>  the
>>>> GeodeLink speed depending on the detected RAM speed.  As a work-around,
>>>> we
>>>> are only using 333MHz SODIMMs and have set the bootstrap bits for
>>>> GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500Mhz CPU,
>>>> 333MHz GLIU instead of bypass mode.  In bypass mode, the GLIU is 266MHz
>>>> and
>>>> some of our 333MHz RAM will fail in disable_car(). As a test, I have
>>>> experimented with
>>>> pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to
>>>> change the GLIU to 333MHz.  I probably didn't have the correct bits set,
>>>> so
>>>> even though I managed to set GLIU, it failed the last test (DLL) in
>>>> sdram_enable() and would reset.
>>> Your second problem might explain the first. You should look closely
>>> at the detection problem. It depends on the reset and the state of the
>>> rstpll flags. There could be a corner case or something unusual going
>>> on. How did you set the boot strap bits with hardware (straps)? You
>>> should use pll_reset(ManualConf) settings to change it with hardware.
>>>
>>> Marc
>>>
>>>
>> Sorry, I should have explained that we set the bootstrap bits in hardware:
>>
>> Bit 7: PW1 pad - active high when the PCI clock is 66 MHz, low for 33 MHz.
>> Bit 6: IRQ13 pad - active high for stall-on-reset debug feature, otherwise
>> low.
>> Bit 5: PW0 pad - part of CPU/GLIU frequency selects.
>> Bit 4: SUSPA# pad - part of CPU/GLIU frequency selects.
>> Bit 3: GNT2# pad - part of CPU/GLIU frequency selects.
>> Bit 2: GNT1# pad - part of CPU/GLIU frequency selects.
>> Bit 1: GNT0# pad - part of CPU/GLIU frequency selects.
>>
>> We have pulled these pins up or down to be "0010110", which corresponds to
>> CPU 500MHz, GLIU 333MHz in table 6-87.  This should also mean that the on
>> reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch (except
>> that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on Power On Reset
>> (POR).  So I should be using pll_reset(ManualConf)?  I'll try it later today
>> and see if I can get some debugging output.
> 
> If it is set by straps, it should be doing the right thing and you
> don't need to use the ManualConf. There could still be a corner case
> and you should try trace through the soft reset that is causing the
> problem. Also, have you diff'd the MC settings between the BIOS and
> coreboot. I would be interested in discrepancies.
> 
> Marc
> 
> 

I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:

http://coreboot.pastebin.com/m39b22c21

The only differences I can see are related to interrupts, which shouldn't matter in relation to
my RAM problems.

I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.

Nathan
Marc Jones - 2009-11-23 21:16:17
On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
<nathan@traverse.com.au> wrote:
> Marc Jones wrote:
>> On Tue, Nov 10, 2009 at 1:26 PM, Nathan Williams <nathan@traverse.com.au> wrote:
>>> Marc Jones wrote:
>>>> On Fri, Nov 6, 2009 at 7:57 AM, Nathan Williams <nathan@traverse.com.au>
>>>> wrote:
>>>>> Another observation I made was that by setting the debug_level to
>>>>> BIOS_CRIT,
>>>>> instead of dying at the usual spot in disable_car() and stopping,
>>>>> coreboot
>>>>> would reset continuously (cycling every 1-2 seconds)
>>> Since I needed to have a BIOS that didn't have much debugging enabled for a
>>> customer sample, I looked a bit deeper to find the cause of this continuous
>>> reset behaviour.  Even changing the debug level from BIOS_SPEW to BIOS_DEBUG
>>> caused the reset.  I tracked it down to a single  printk and my attached
>>> patch means it works at BIOS_CRIT now, just with a few extra debug lines.
>>>  Without the printk, the code gets to "missing phase4_read_resources" (just
>>> a few lines down from my patch) before restarting.
>>
>> This sounds like it is probably blowing the stack or the stack hits
>> memory that isn't working correctly.
>>
>>
>>>>> Another issue that's partly related is the ability for coreboot to set
>>>>>  the
>>>>> GeodeLink speed depending on the detected RAM speed.  As a work-around,
>>>>> we
>>>>> are only using 333MHz SODIMMs and have set the bootstrap bits for
>>>>> GLCP_SYS_RSTPLL[7:1] (section 6.14.2.13 of LX databook) to 500Mhz CPU,
>>>>> 333MHz GLIU instead of bypass mode.  In bypass mode, the GLIU is 266MHz
>>>>> and
>>>>> some of our 333MHz RAM will fail in disable_car(). As a test, I have
>>>>> experimented with
>>>>> pll_reset(MANUALCONF, PLLMSRHI, PLLMSRLO) in initram.c in an attempt to
>>>>> change the GLIU to 333MHz.  I probably didn't have the correct bits set,
>>>>> so
>>>>> even though I managed to set GLIU, it failed the last test (DLL) in
>>>>> sdram_enable() and would reset.
>>>> Your second problem might explain the first. You should look closely
>>>> at the detection problem. It depends on the reset and the state of the
>>>> rstpll flags. There could be a corner case or something unusual going
>>>> on. How did you set the boot strap bits with hardware (straps)? You
>>>> should use pll_reset(ManualConf) settings to change it with hardware.
>>>>
>>>> Marc
>>>>
>>>>
>>> Sorry, I should have explained that we set the bootstrap bits in hardware:
>>>
>>> Bit 7: PW1 pad - active high when the PCI clock is 66 MHz, low for 33 MHz.
>>> Bit 6: IRQ13 pad - active high for stall-on-reset debug feature, otherwise
>>> low.
>>> Bit 5: PW0 pad - part of CPU/GLIU frequency selects.
>>> Bit 4: SUSPA# pad - part of CPU/GLIU frequency selects.
>>> Bit 3: GNT2# pad - part of CPU/GLIU frequency selects.
>>> Bit 2: GNT1# pad - part of CPU/GLIU frequency selects.
>>> Bit 1: GNT0# pad - part of CPU/GLIU frequency selects.
>>>
>>> We have pulled these pins up or down to be "0010110", which corresponds to
>>> CPU 500MHz, GLIU 333MHz in table 6-87.  This should also mean that the on
>>> reset, the value of GLCP_SYS_RSTPLL should be 0000049C_0300182Ch (except
>>> that SWFLAGS (GLCP_SYS_RSTPLL[31:26]) is only reset to 0 on Power On Reset
>>> (POR).  So I should be using pll_reset(ManualConf)?  I'll try it later today
>>> and see if I can get some debugging output.
>>
>> If it is set by straps, it should be doing the right thing and you
>> don't need to use the ManualConf. There could still be a corner case
>> and you should try trace through the soft reset that is causing the
>> problem. Also, have you diff'd the MC settings between the BIOS and
>> coreboot. I would be interested in discrepancies.
>>
>> Marc
>>
>>
>
> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>
> http://coreboot.pastebin.com/m39b22c21
>
> The only differences I can see are related to interrupts, which shouldn't matter in relation to
> my RAM problems.
>
> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.

That looks like just the southbridge cs5536 target. The memory
differences would be in the processor geodelx target. Can you send
those results?

Marc
Nathan Williams - 2009-11-24 08:09:11
Marc Jones wrote:
> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
> <nathan@traverse.com.au> wrote:
>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>
>> http://coreboot.pastebin.com/m39b22c21
>>
>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>> my RAM problems.
>>
>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
> 
> That looks like just the southbridge cs5536 target. The memory
> differences would be in the processor geodelx target. Can you send
> those results?
> 
> Marc
> 

I did some new MSR dumps.

Diff:
./msrtool -t geodelx -t cs5536 -d amd_ref_bios
http://coreboot.pastebin.com/m5e487f87

AMD NAS reference BIOS:
./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
http://coreboot.pastebin.com/madc04ac

My Coreboot:
./msrtool -t geodelx -t cs5536 -l -s nathan_bios
http://coreboot.pastebin.com/m7f35d855


The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
Last time I added some code to force it to match the commercial BIOS
GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.

I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
Each time I did a msrtool diff to one I saved on disk.

Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.

Could there be an issue with the initialisation sequence that reading MSRs
after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
Are there any obvious ones that should be added in?

Regards,
Nathan
Marc Jones - 2009-11-24 17:28:59
On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams <nathan@traverse.com.au> wrote:
> Marc Jones wrote:
>> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
>> <nathan@traverse.com.au> wrote:
>>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>>
>>> http://coreboot.pastebin.com/m39b22c21
>>>
>>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>>> my RAM problems.
>>>
>>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
>>
>> That looks like just the southbridge cs5536 target. The memory
>> differences would be in the processor geodelx target. Can you send
>> those results?
>>
>> Marc
>>
>
> I did some new MSR dumps.
>
> Diff:
> ./msrtool -t geodelx -t cs5536 -d amd_ref_bios
> http://coreboot.pastebin.com/m5e487f87
>
> AMD NAS reference BIOS:
> ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
> http://coreboot.pastebin.com/madc04ac
>
> My Coreboot:
> ./msrtool -t geodelx -t cs5536 -l -s nathan_bios
> http://coreboot.pastebin.com/m7f35d855
>
>
> The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
> Last time I added some code to force it to match the commercial BIOS
> GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
>
> I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
> Each time I did a msrtool diff to one I saved on disk.
>
> Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.
>
> Could there be an issue with the initialisation sequence that reading MSRs
> after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
> Are there any obvious ones that should be added in?
>

--- AMD NAS reference BIOS
+++ Nathan's coreboot v3
#
# GLCP_DELAY_CONTROLS
#
-0x4c00000f 0x83f1_00aa_5696_0404
+0x4c00000f 0x8271_005a_5696_0404

It looks like coreboot and the ref BIOS detect a different DIMM
configuration. This timing setup could be part of the instability (I
don't think it explains the reset problem). Look at the code in
SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets
set to see what might be happening. Make sure that MTest is disabled
in the ref BIOS setup. This setting is based on the number of devices
(load) there are on the DIMM.
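
To see exactly which bits of GLCP_DELAY_CONTROLS differ between the two dumps, a minimal sketch is to XOR the two 64-bit values and count the set bits. The helper names are hypothetical; only the two register values come from the diff above.

```c
#include <stdint.h>

/* Bits that differ between two 64-bit MSR values (illustrative helper). */
static uint64_t msr_diff_mask(uint64_t a, uint64_t b)
{
    return a ^ b;
}

/* Count the set bits in a mask by clearing the lowest set bit each pass. */
static int count_set_bits(uint64_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}
```

For 0x83f100aa56960404 and 0x8271005a56960404 the mask is 0x018000f000000000: six bits differ, all of them in the upper dword, and the lower dword is identical.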

I didn't realize that so few registers were in the msr tool for
geodelx. You should add these:
20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA) 10071007_00000040h Page 227
20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233

4C00000Fh R/W GLCP I/O Delay Controls (GLCP_DELAY_CONTROLS) 00000000_00000000h Page 549
4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL) Bootstrap specific Page 554
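
For reference, the registers listed above can be written down as a small lookup table. This is only a sketch: the struct layout and names are hypothetical and do not reflect how msrtool's geodelx.c actually defines its MSRs. The addresses and register names are the ones from the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; not msrtool's real data structure. */
struct msr_entry {
    uint32_t addr;
    const char *name;
};

/* Addresses and names as listed in the message above. */
static const struct msr_entry geodelx_extra_msrs[] = {
    { 0x20000018, "MC_CF07_DATA" },
    { 0x20000019, "MC_CF8F_DATA" },
    { 0x2000001A, "MC_CF1017_DATA" },
    { 0x2000001B, "MC_CFPERF_CNT1" },
    { 0x2000001C, "MC_PERCNT2" },
    { 0x2000001D, "MC_CFCLK_DBUG" },
    { 0x4C00000F, "GLCP_DELAY_CONTROLS" },
    { 0x4C000014, "GLCP_SYS_RSTPLL" },
};

/* Return the register name for an address, or NULL if it is not listed. */
static const char *msr_name(uint32_t addr)
{
    size_t i;
    size_t n = sizeof(geodelx_extra_msrs) / sizeof(geodelx_extra_msrs[0]);

    for (i = 0; i < n; i++) {
        if (geodelx_extra_msrs[i].addr == addr)
            return geodelx_extra_msrs[i].name;
    }
    return NULL;
}
```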

Marc
Nathan Williams - 2009-11-26 07:09:59
Marc Jones wrote:
> On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>> Marc Jones wrote:
>>> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
>>> <nathan@traverse.com.au> wrote:
>>>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>>>
>>>> http://coreboot.pastebin.com/m39b22c21
>>>>
>>>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>>>> my RAM problems.
>>>>
>>>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>>>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
>>> That looks like just the southbridge cs5536 target. The memory
>>> differences would be in the processor geodelx target. Can you send
>>> those results?
>>>
>>> Marc
>>>
>> I did some new MSR dumps.
>>
>> Diff:
>> ./msrtool -t geodelx -t cs5536 -d amd_ref_bios
>> http://coreboot.pastebin.com/m5e487f87
>>
>> AMD NAS reference BIOS:
>> ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
>> http://coreboot.pastebin.com/madc04ac
>>
>> My Coreboot:
>> ./msrtool -t geodelx -t cs5536 -l -s nathan_bios
>> http://coreboot.pastebin.com/m7f35d855
>>
>>
>> The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
>> Last time I added some code to force it to match the commercial BIOS
>> GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
>>
>> I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
>> Each time I did a msrtool diff to one I saved on disk.
>>
>> Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.
>>
>> Could there be an issue with the initialisation sequence that reading MSRs
>> after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
>> Are there any obvious ones that should be added in?
>>
> 
> --- AMD NAS reference BIOS
> +++ Nathan's coreboot v3
> #
> # GLCP_DELAY_CONTROLS
> #
> -0x4c00000f 0x83f1_00aa_5696_0404
> +0x4c00000f 0x8271_005a_5696_0404
> 
> It looks like coreboot and the ref bios detect different dimm
> configuration. This timing setup could be part of the instability (I
> don't think it explains the reset problem). Look at the code here:
> SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets
> set to see what might be happening. Make sure that MTest is disabled
> in the ref bios setup. This setting is based on the number of devices
> (load) there is on the dimm.
> 
> I didn't realize that so few registers were in the msr tool for
> geodelx. You should add these:
> 20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA)
> 10071007_00000040h Page 227
> 20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
> 2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
> 2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
> 2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
> 2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233
> 
> 4C00000Fh R/W GLCP I/O Delay
> Controls(GLCP_DELAY_CONTROLS)00000000_00000000h Page 549
> 4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL)
> Bootstrap specific Page 554
> 
> Marc
> 

I've now added the MSRs and uploaded to pastebin:

AMD NAS:
http://coreboot.pastebin.com/m53aed60b

My coreboot:
http://coreboot.pastebin.com/md23bc6a

./msrtool -d AMD_NAS:
http://coreboot.pastebin.com/m77663de5

Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards
just in case there are some hidden hardware issues.

Regards,
Nathan
Peter Stuge - 2009-11-26 07:38:59
Nathan Williams wrote:
> AMD NAS:
> http://coreboot.pastebin.com/m53aed60b

If you want to unclutter output a little, you can wipe the 5536 MSRs
from the file after the first run. msrtool only considers the MSRs
that are explicitly listed in the input file when run with -d.

(Another alternative is to list relevant MSRs in the file before the
first run and run with -s rather than -l -s. The former reads and
outputs values only for listed MSRs, the latter reads all known MSRs
but has the benefit that no file needs to be created beforehand.)


//Peter
Nathan Williams - 2009-11-26 11:41:45
Peter Stuge wrote:
> Nathan Williams wrote:
>> AMD NAS:
>> http://coreboot.pastebin.com/m53aed60b
> 
> If you want to unclutter output a little, you can wipe the 5536 MSRs
> from the file after the first run. msrtool only considers the MSRs
> that are explicitly listed in the input file when run with -d.
> 
> (Another alternative is to list relevant MSRs in the file before the
> first run and run with -s rather than -l -s. The former reads and
> outputs values only for listed MSRs, the latter reads all known MSRs
> but has the benefit that no file needs to be created beforehand.)
> 
> 
> //Peter
> 

Thanks for the tips.  Very helpful.

Nathan
Nathan Williams - 2009-11-27 09:05:47
Nathan Williams wrote:
> Marc Jones wrote:
>> On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>>> Marc Jones wrote:
>>>> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
>>>> <nathan@traverse.com.au> wrote:
>>>>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>>>>
>>>>> http://coreboot.pastebin.com/m39b22c21
>>>>>
>>>>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>>>>> my RAM problems.
>>>>>
>>>>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>>>>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
>>>> That looks like just the southbridge cs5536 target. The memory
>>>> differences would be in the processor geodelx target. Can you send
>>>> those results?
>>>>
>>>> Marc
>>>>
>>> I did some new MSR dumps.
>>>
>>> Diff:
>>> ./msrtool -t geodelx -t cs5536 -d amd_ref_bios
>>> http://coreboot.pastebin.com/m5e487f87
>>>
>>> AMD NAS reference BIOS:
>>> ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
>>> http://coreboot.pastebin.com/madc04ac
>>>
>>> My Coreboot:
>>> ./msrtool -t geodelx -t cs5536 -l -s nathan_bios
>>> http://coreboot.pastebin.com/m7f35d855
>>>
>>>
>>> The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
>>> Last time I added some code to force it to match the commercial BIOS
>>> GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
>>>
>>> I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
>>> Each time I did a msrtool diff to one I saved on disk.
>>>
>>> Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.
>>>
>>> Could there be an issue with the initialisation sequence that reading MSRs
>>> after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
>>> Are there any obvious ones that should be added in?
>>>
>> --- AMD NAS reference BIOS
>> +++ Nathan's coreboot v3
>> #
>> # GLCP_DELAY_CONTROLS
>> #
>> -0x4c00000f 0x83f1_00aa_5696_0404
>> +0x4c00000f 0x8271_005a_5696_0404
>>
>> It looks like coreboot and the ref bios detect different dimm
>> configuration. This timing setup could be part of the instability (I
>> don't think it explains the reset problem). Look at the code here:
>> SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets
>> set to see what might be happening. Make sure that MTest is disabled
>> in the ref bios setup. This setting is based on the number of devices
>> (load) there is on the dimm.
>>
>> I didn't realize that so few registers were in the msr tool for
>> geodelx. You should add these:
>> 20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA)
>> 10071007_00000040h Page 227
>> 20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
>> 2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
>> 2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
>> 2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
>> 2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233
>>
>> 4C00000Fh R/W GLCP I/O Delay
>> Controls(GLCP_DELAY_CONTROLS)00000000_00000000h Page 549
>> 4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL)
>> Bootstrap specific Page 554
>>
>> Marc
>>
> 
> I've now added the MSRs and uploaded to pastebin:
> 
> AMD NAS:
> http://coreboot.pastebin.com/m53aed60b
> 
> My coreboot:
> http://coreboot.pastebin.com/md23bc6a
> 
> ./msrtool -d AMD_NAS:
> http://coreboot.pastebin.com/m77663de5
> 
> Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards
> just in case there are some hidden hardware issues.
> 
> Regards,
> Nathan
> 

On the NAS reference board I got the following diff between coreboot
and the commercial BIOS:

http://coreboot.pastebin.com/m1353db1a

As you can see, there are a lot of latency differences.
Unfortunately, it was only later that I realised the differences are
because the bootstraps are set to bypass, which means coreboot uses 266
as the speed, whereas the commercial BIOS uses 333.  So when I repeat
the same on our boards, the only difference in the geodelx MSRs is:

# MC_CFCLK_DBUG
-0x2000001d 0x0000000000000000
+0x2000001d 0x0000000000001000
#    12 TRISTATE_DIS TRI-STATE Disable
-0: Tri-stating enabled
+1: Tri-stating disabled

Nathan
Marc Jones - 2009-11-30 22:17:20
On Fri, Nov 27, 2009 at 2:05 AM, Nathan Williams <nathan@traverse.com.au> wrote:
> Nathan Williams wrote:
>> Marc Jones wrote:
>>> On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>>>> Marc Jones wrote:
>>>>> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
>>>>> <nathan@traverse.com.au> wrote:
>>>>>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>>>>>
>>>>>> http://coreboot.pastebin.com/m39b22c21
>>>>>>
>>>>>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>>>>>> my RAM problems.
>>>>>>
>>>>>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>>>>>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
>>>>> That looks like just the southbridge cs5536 target. The memory
>>>>> differences would be in the processor geodelx target. Can you send
>>>>> those results?
>>>>>
>>>>> Marc
>>>>>
>>>> I did some new MSR dumps.
>>>>
>>>> Diff:
>>>> ./msrtool -t geodelx -t cs5536 -d amd_ref_bios
>>>> http://coreboot.pastebin.com/m5e487f87
>>>>
>>>> AMD NAS reference BIOS:
>>>> ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
>>>> http://coreboot.pastebin.com/madc04ac
>>>>
>>>> My Coreboot:
>>>> ./msrtool -t geodelx -t cs5536 -l -s nathan_bios
>>>> http://coreboot.pastebin.com/m7f35d855
>>>>
>>>>
>>>> The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
>>>> Last time I added some code to force it to match the commercial BIOS
>>>> GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
>>>>
>>>> I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
>>>> Each time I did a msrtool diff to one I saved on disk.
>>>>
>>>> Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.
>>>>
>>>> Could there be an issue with the initialisation sequence that reading MSRs
>>>> after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
>>>> Are there any obvious ones that should be added in?
>>>>
>>> --- AMD NAS reference BIOS
>>> +++ Nathan's coreboot v3
>>> #
>>> # GLCP_DELAY_CONTROLS
>>> #
>>> -0x4c00000f 0x83f1_00aa_5696_0404
>>> +0x4c00000f 0x8271_005a_5696_0404
>>>
>>> It looks like coreboot and the ref bios detect different dimm
>>> configuration. This timing setup could be part of the instability (I
>>> don't think it explains the reset problem). Look at the code here:
>>> SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets
>>> set to see what might be happening. Make sure that MTest is disabled
>>> in the ref bios setup. This setting is based on the number of devices
>>> (load) there is on the dimm.
>>>
>>> I didn't realize that so few registers were in the msr tool for
>>> geodelx. You should add these:
>>> 20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA) 10071007_00000040h Page 227
>>> 20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
>>> 2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
>>> 2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
>>> 2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
>>> 2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233
>>>
>>> 4C00000Fh R/W GLCP I/O Delay Controls (GLCP_DELAY_CONTROLS) 00000000_00000000h Page 549
>>> 4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL) Bootstrap specific Page 554
>>>
>>> Marc
>>>
>>
>> I've now added the MSRs and uploaded to pastebin:
>>
>> AMD NAS:
>> http://coreboot.pastebin.com/m53aed60b
>>
>> My coreboot:
>> http://coreboot.pastebin.com/md23bc6a
>>
>> ./msrtool -d AMD_NAS:
>> http://coreboot.pastebin.com/m77663de5
>>
>> Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards
>> just in case there are some hidden hardware issues.
>>
>> Regards,
>> Nathan
>>
>
> On the NAS reference board I got the following diff between coreboot
> and the commercial BIOS:
>
> http://coreboot.pastebin.com/m1353db1a
>
> As you can see there are a lot of latency differences.
> Unfortunately it was only later that I realised that the differences are
> because the bootstraps are set to bypass, which means coreboot uses 266MHz
> as the speed, whereas the commercial BIOS uses 333MHz.  When I repeat the
> same test on our boards, the only difference in the geodelx MSRs is:
>
> # MC_CFCLK_DBUG
> -0x2000001d 0x0000000000000000
> +0x2000001d 0x0000000000001000
> #    12 TRISTATE_DIS TRI-STATE Disable
> -0: Tri-stating enabled
> +1: Tri-stating disabled


Nathan,

I don't think the tri-state disable bit explains the problems you have
seen. Since the memory has the same settings, the problem must be
somewhere else. You will need to go back to the reboot path to
investigate. It seems like something in the reset isn't doing a
complete reset, which causes a problem with the cache disable.

Marc
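
For anyone following the register diffs quoted above, the comparison can also be done by hand. The following is only a minimal sketch in C, not part of msrtool or coreboot; the helper names msr_diff_mask and msr_bit are made up for illustration. XORing two 64-bit MSR values leaves a 1 in every bit position that differs, and a shift-and-mask reads back a single flag such as bit 12 (TRISTATE_DIS) of MC_CFCLK_DBUG.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers for comparing MSR dumps by hand.
 * They are not taken from msrtool or coreboot.
 */

/* Return a mask with a 1 in every bit position where the two values differ. */
uint64_t msr_diff_mask(uint64_t a, uint64_t b)
{
	return a ^ b;
}

/* Return 1 if the given bit (0..63) is set in an MSR value, otherwise 0. */
int msr_bit(uint64_t value, unsigned int bit)
{
	assert(bit < 64);
	return (int)((value >> bit) & 1u);
}
```

Applied to the two GLCP_DELAY_CONTROLS values quoted above (0x83f1_00aa_5696_0404 and 0x8271_005a_5696_0404), the XOR mask is 0x0180_00f0_0000_0000, so only the upper 32 bits differ. For the two MC_CFCLK_DBUG values (0x0 and 0x1000) the mask is 1 << 12, which matches the single TRISTATE_DIS difference.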
Nathan Williams - 2009-11-30 23:17:47
Marc Jones wrote:
> On Fri, Nov 27, 2009 at 2:05 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>> Nathan Williams wrote:
>>> Marc Jones wrote:
>>>> On Tue, Nov 24, 2009 at 1:09 AM, Nathan Williams <nathan@traverse.com.au> wrote:
>>>>> Marc Jones wrote:
>>>>>> On Mon, Nov 23, 2009 at 12:27 AM, Nathan Williams
>>>>>> <nathan@traverse.com.au> wrote:
>>>>>>> I managed to get the commercial BIOS to boot on my board and diffed it with coreboot:
>>>>>>>
>>>>>>> http://coreboot.pastebin.com/m39b22c21
>>>>>>>
>>>>>>> The only differences I can see are related to interrupts, which shouldn't matter in relation to
>>>>>>> my RAM problems.
>>>>>>>
>>>>>>> I have also run a memtest86 with the commercial BIOS (from bootable CDROM) and as a payload in coreboot.
>>>>>>> The commercial BIOS didn't have any errors, but my coreboot did.  So the hardware can't be too bad.
>>>>>> That looks like just the southbridge cs5536 target. The memory
>>>>>> differences would be in the processor geodelx target. Can you send
>>>>>> those results?
>>>>>>
>>>>>> Marc
>>>>>>
>>>>> I did some new MSR dumps.
>>>>>
>>>>> Diff:
>>>>> ./msrtool -t geodelx -t cs5536 -d amd_ref_bios
>>>>> http://coreboot.pastebin.com/m5e487f87
>>>>>
>>>>> AMD NAS reference BIOS:
>>>>> ./msrtool -t geodelx -t cs5536 -l -s amd_ref_bios
>>>>> http://coreboot.pastebin.com/madc04ac
>>>>>
>>>>> My Coreboot:
>>>>> ./msrtool -t geodelx -t cs5536 -l -s nathan_bios
>>>>> http://coreboot.pastebin.com/m7f35d855
>>>>>
>>>>>
>>>>> The diffs I did today show some differences with GLCP_DELAY_CONTROLS.
>>>>> Last time I added some code to force it to match the commercial BIOS
>>>>> GLCP_DELAY_CONTROLS MSR, but it didn't seem to make any difference.
>>>>>
>>>>> I also tested all the SODIMMS I have here (about 10) with the commercial BIOS.
>>>>> Each time I did a msrtool diff to one I saved on disk.
>>>>>
>>>>> Most are 333MHz, but 2 are 400MHz.  There weren't any changes to the MSRs.
>>>>>
>>>>> Could there be an issue with the initialisation sequence that reading MSRs
>>>>> after booting won't show?  Also, quite a few MSRs aren't defined in geodelx.c yet.
>>>>> Are there any obvious ones that should be added in?
>>>>>
>>>> --- AMD NAS reference BIOS
>>>> +++ Nathan's coreboot v3
>>>> #
>>>> # GLCP_DELAY_CONTROLS
>>>> #
>>>> -0x4c00000f 0x83f1_00aa_5696_0404
>>>> +0x4c00000f 0x8271_005a_5696_0404
>>>>
>>>> It looks like coreboot and the ref bios detect different dimm
>>>> configuration. This timing setup could be part of the instability (I
>>>> don't think it explains the reset problem). Look at the code here:
>>>> SetDelayControl(void) and anywhere else that GLCP_DELAY_CONTROLS gets
>>>> set to see what might be happening. Make sure that MTest is disabled
>>>> in the ref bios setup. This setting is based on the number of devices
>>>> (load) there is on the dimm.
>>>>
>>>> I didn't realize that so few registers were in the msr tool for
>>>> geodelx. You should add these:
>>>> 20000018h R/W Refresh and SDRAM Program (MC_CF07_DATA) 10071007_00000040h Page 227
>>>> 20000019h R/W Timing and Mode Program (MC_CF8F_DATA) 18000008_287337A3h Page 229
>>>> 2000001Ah R/W Feature Enables (MC_CF1017_DATA) 00000000_11080001h Page 231
>>>> 2000001Bh RO Performance Counters (MC_CFPERF_CNT1) 00000000_00000000h Page 232
>>>> 2000001Ch R/W Counter and CAS Control (MC_PERCNT2) 00000000_00FF00FFh Page 233
>>>> 2000001Dh R/W Clocking and Debug (MC_CFCLK_DBUG) 00000000_00001300h Page 233
>>>>
>>>> 4C00000Fh R/W GLCP I/O Delay Controls (GLCP_DELAY_CONTROLS) 00000000_00000000h Page 549
>>>> 4C000014h R/W GLCP System Reset and PLL Control (GLCP_SYS_RSTPLL) Bootstrap specific Page 554
>>>>
>>>> Marc
>>>>
>>> I've now added the MSRs and uploaded to pastebin:
>>>
>>> AMD NAS:
>>> http://coreboot.pastebin.com/m53aed60b
>>>
>>> My coreboot:
>>> http://coreboot.pastebin.com/md23bc6a
>>>
>>> ./msrtool -d AMD_NAS:
>>> http://coreboot.pastebin.com/m77663de5
>>>
>>> Tomorrow I'll try the tests on the NAS hardware, instead of our own motherboards
>>> just in case there are some hidden hardware issues.
>>>
>>> Regards,
>>> Nathan
>>>
>> On the NAS reference board I got the following diff between coreboot
>> and the commercial BIOS:
>>
>> http://coreboot.pastebin.com/m1353db1a
>>
>> As you can see there are a lot of latency differences.
>> Unfortunately it was only later that I realised that the differences are
>> because the bootstraps are set to bypass, which means coreboot uses 266MHz
>> as the speed, whereas the commercial BIOS uses 333MHz.  When I repeat the
>> same test on our boards, the only difference in the geodelx MSRs is:
>>
>> # MC_CFCLK_DBUG
>> -0x2000001d 0x0000000000000000
>> +0x2000001d 0x0000000000001000
>> #    12 TRISTATE_DIS TRI-STATE Disable
>> -0: Tri-stating enabled
>> +1: Tri-stating disabled
> 
> 
> Nathan,
> 
> I don't think the tri-state disable bit explains the problems you have
> seen. Since the memory has the same settings, the problem must be
> somewhere else. You will need to go back to the reboot path to
> investigate. It seems like something in the reset isn't doing a
> complete reset, which causes a problem with the cache disable.
> 
> Marc
> 
> 

I suspect that the reset problem only occurs when I'm using a laptop hard drive
off the 44-pin IDE connector on our board.  I have tried booting with a 3.5" drive
and external 12V, and I can't replicate the problem that way.  With the 3.5" drive,
a reboot from fsck works fine.  Hopefully the next PCB revision will perform better
because we've moved the 5V plane further away from the DDR tracks.

I don't know if I mentioned another problem that has similar symptoms.  Some RAM causes
the same cache disable problem, even if there are no IDE devices connected.  This happens
from power-up, so it's not a reset issue.

Nathan

Patch

--- a/device/device.c
+++ b/device/device.c
@@ -282,7 +282,7 @@  void read_resources(struct bus *bus)
 	/* Walk through all devices and find which resources they need. */
 	for (curdev = bus->children; curdev; curdev = curdev->sibling) {
 		int i;
-		printk(BIOS_SPEW,
+		printk(BIOS_CRIT,
 		       "%s: %s(%s) dtsname %s enabled %d\n",
 		       __func__, bus->dev ? bus->dev->dtsname : "NOBUSDEV",
 		       bus->dev ? dev_path(bus->dev) : "NOBUSDEV",