Discussion:
SRP timeout
Matthew Lowe
2010-06-22 10:25:05 UTC
Permalink
I'm receiving a connection timed out error while trying to boot from a SRP target. The same SRP target works perfectly under a locally running OS (tested in Linux and Windows).

My setup includes one Linux box connected in a back-to-back configuration to a desktop. My goal is to boot the desktop from the Linux server's fast raid array. Each computer has a Mellanox MHES18-XTC (MT25204) card in it, which is running the latest firmware. I've followed the instructions for SRP booting in the Wiki. I think my problem is related to the arbel driver.

On boot gPXE loads fine, it loads the arbel driver, it detects the card, it obtains an IP address from the DHCP server, gets the root-path string and then attempts a sanboot. This is where it times out.

To troubleshoot I compiled gPXE with DEBUG=srp:3 and only got more frequent reporting of the timeout error. The following set of messages repeat several times:

SRP 0x1d9d4 TX request tag 00000000000000000000001b SRP 0x1d9d4 socket closed: Connection timed out (0x4c206035)

The Linux box is a little more helpful. As we can see below, the connection is established just fine, but eventually times out on this end too.

Any help would be greatly appreciated.

- Matt

[24923.754700] [19451]: scst: scst_register_target:469:Target ib_srpt_target_4 (rel ID 5) for template ib_srpt registered successfully
[24955.191467] ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c9020021b829, t_port_id 0x2c9020021b8d0:0x2c9020021b8d0 and it_iu_len 80 on port 1 (guid=0xfe80000000000000:0x2c9020021b8d1)
[24955.192272] [1207]: scst: scst_init_session:6166:Using security group "Default" for initiator "0x00000000000000000002c9020021b829"
[24955.192282] [1207]: scst_alloc_add_tgt_dev:2976:Device vdisk0 on SCST lun=0
[24955.192290] [1207]: scst_alloc_set_UA:6004:Adding new UA to tgt_dev ffff8800d77928a0
[24955.192537] [19457]: scst: scst_cmd_thread:4171:Processing thread vdisk00_0 (PID 19457) started
[24955.192546] [19457]: scst_cmd_thread:4192:Alloced new IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192573] [19458]: scst: scst_cmd_thread:4171:Processing thread vdisk00_1 (PID 19458) started
[24955.192576] [19458]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192588] [19459]: scst: scst_cmd_thread:4171:Processing thread vdisk00_2 (PID 19459) started
[24955.192591] [19459]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192599] [19460]: scst: scst_cmd_thread:4171:Processing thread vdisk00_3 (PID 19460) started
[24955.192601] [19460]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192612] [19461]: scst: scst_cmd_thread:4171:Processing thread vdisk00_4 (PID 19461) started
[24955.192615] [19461]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192622] [19462]: scst: scst_cmd_thread:4171:Processing thread vdisk00_5 (PID 19462) started
[24955.192625] [19462]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192637] [19463]: scst: scst_cmd_thread:4171:Processing thread vdisk00_6 (PID 19463) started
[24955.192641] [19463]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192659] [19464]: scst: scst_cmd_thread:4171:Processing thread vdisk00_7 (PID 19464) started
[24955.192663] [19464]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24967.090591] ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c9020021b829, t_port_id 0x2c9020021b8d0:0x2c9020021b8d0 and it_iu_len 80 on port 1 (guid=0xfe80000000000000:0x2c9020021b8d1)
[24967.090601] [1207]: scst_unregister_session:6380:Unregistering session ffff880015b0c9c0 (wait 0)
[24967.090610] [1207]: scst_rx_mgmt_fn:5907:TM fn 10
[24967.090614] [1207]: scst_rx_mgmt_fn:5917:sess=ffff880015b0c9c0, tag_set 0, tag 0, lun_set 1, lun=0, cmd_sn_set 0, cmd_sn 0, priv (null)
[24967.090622] [1207]: scst_post_rx_mgmt_cmd:5836:Adding mgmt cmd ffff8801154a8958 to active mgmt cmd list
[24967.090630] ib_srpt: ***ERROR***: rejected SRP_LOGIN_REQ because another login request is being processed.
[24967.090734] [4319]: scst_tm_thread:5722:Deleting mgmt cmd ffff8801154a8958 from active cmd list
[24967.090738] [4319]: scst_abort_all_nexus_loss_sess:5280:Nexus loss for sess ffff880015b0c9c0 (mcmd ffff8801154a8958)
[24967.090742] [4319]: scst_call_dev_task_mgmt_fn:4540:Calling dev handler vdisk_fileio task_mgmt_fn(fn=10)
[24967.090745] [4319]: scst_call_dev_task_mgmt_fn:4544:Dev handler vdisk_fileio task_mgmt_fn() returned 1
Stefan Hajnoczi
2010-06-22 10:41:00 UTC
Permalink
Post by Matthew Lowe
SRP 0x1d9d4 TX request tag 00000000000000000000001b SRP 0x1d9d4 socket closed: Connection timed out (0x4c206035)
11:31 < stefanha> gpxebot: error 0x4c206035
11:31 < gpxebot> pxenv_status=PXENV_STATUS_TFTP_READ_TIMEOUT uniq=0
errfile=ERRFILE_ib_mi
posix_errno=ETIMEDOUT

gpxe/src/net/infiniband/ib_mi.c:
static void ib_mi_timer_expired ( struct retry_timer *timer, int expired ) {

gPXE is giving up.
Post by Matthew Lowe
The Linux box is a little more helpful. As we can see below, the connection is established just fine, but eventually times out on this end too.
[..]
Post by Matthew Lowe
[24923.754700] [19451]: scst: scst_register_target:469:Target ib_srpt_target_4 (rel ID 5) for template ib_srpt registered successfully
[24955.191467] ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c9020021b829, t_port_id 0x2c9020021b8d0:0x2c9020021b8d0 and it_iu_len 80 on port 1 (guid=0xfe80000000000000:0x2c9020021b8d1)
The target receives the login request and sets up the new session...
Post by Matthew Lowe
[24955.192272] [1207]: scst: scst_init_session:6166:Using security group "Default" for initiator "0x00000000000000000002c9020021b829"
[24955.192282] [1207]: scst_alloc_add_tgt_dev:2976:Device vdisk0 on SCST lun=0
[24955.192290] [1207]: scst_alloc_set_UA:6004:Adding new UA to tgt_dev ffff8800d77928a0
[24955.192537] [19457]: scst: scst_cmd_thread:4171:Processing thread vdisk00_0 (PID 19457) started
[24955.192546] [19457]: scst_cmd_thread:4192:Alloced new IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192573] [19458]: scst: scst_cmd_thread:4171:Processing thread vdisk00_1 (PID 19458) started
[24955.192576] [19458]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192588] [19459]: scst: scst_cmd_thread:4171:Processing thread vdisk00_2 (PID 19459) started
[24955.192591] [19459]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192599] [19460]: scst: scst_cmd_thread:4171:Processing thread vdisk00_3 (PID 19460) started
[24955.192601] [19460]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192612] [19461]: scst: scst_cmd_thread:4171:Processing thread vdisk00_4 (PID 19461) started
[24955.192615] [19461]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192622] [19462]: scst: scst_cmd_thread:4171:Processing thread vdisk00_5 (PID 19462) started
[24955.192625] [19462]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192637] [19463]: scst: scst_cmd_thread:4171:Processing thread vdisk00_6 (PID 19463) started
[24955.192641] [19463]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24955.192659] [19464]: scst: scst_cmd_thread:4171:Processing thread vdisk00_7 (PID 19464) started
[24955.192663] [19464]: scst_cmd_thread:4211:Linked IO context ffff88004b1bfbd0 (p_cmd_threads ffff8800d7792990)
[24967.090591] ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c9020021b829, t_port_id 0x2c9020021b8d0:0x2c9020021b8d0 and it_iu_len 80 on port 1 (guid=0xfe80000000000000:0x2c9020021b8d1)
gPXE doesn't notice that the session has been started. It retransmits
the login request.

Building with DEBUG=ib_mi:3 would show the management interface debug
messages from gPXE. I wonder if whether gPXE sees the reply but
discards it or doesn't see the reply at all.

Stefan
M Lowe
2010-06-23 05:44:50 UTC
Permalink
Post by Stefan Hajnoczi
gPXE doesn't notice that the session has been started. It retransmits
the login request.
Building with DEBUG=ib_mi:3 would show the management interface debug
messages from gPXE. I wonder if whether gPXE sees the reply but
discards it or doesn't see the reply at all.
I built with DEBUG=ib_mi:3 and attempted another boot. I couldn't quite
catch all
of the debug output, because it scrolls quite fast. From what I could
catch,
there was not any packets received, only transmits appeared in the
output I could
read. My motherboard doesn't have a serial port, so that's not an
option. Unless
gPXE supports USB-Serial converters?

I also tried a build with DEBUG=arbel,srp,ib_srp,ib_mi. The arbel driver
reported
some odd errors that I'm not sure are related.

Arbel 0x1f7c4 firmware version 1.2.0
Arbel 0x1f7c4 requires 8192 kB for firmware
Arbel 0x1f7c4 command failed with status 22:
000404c8: 00 00 00 00 00 00 00 00-00 00 00 00 00 (rest of line cut off)
000404d8: cf ec 00 00 00 00 0 (rest of line cut off)
Arbel 0x1f7c4 requires 1080kB ICM and 12kB AUX ICM
Arbel 0x1f7c4 EQN 0x0 ring at [0x1f940,0x1f9c0]
Arbel 0x1f7c4 command failed with status 0a:
0004019c: 00 00 00 00 cf eb f0 00-00 00 00 02 00 00 00 00 :
...............
000401ac: cf ec 00 00 00 00 00 00-0a 00 30 24 : .........0$
Arbel 0x1f7c4 could not issue MAD IFC: Input/output error (0x1d714039)

After I ran autoboot, I got the following output after each attempt to
retry the
connection:

Arbel 0x1f7c4 CQN 0x83 ring at [0x21ae0, 0x21be0)
Arbel 0x1f7cr4 QPN 0x550404 send ring at [0x22b60,0x22d60)
arbel 0x1f7c4 qpn 0x550404 receive ring at [0x22d60,0x22de0)
SRP 0x20f34 socket closed: Connection timed out (0x4c206035)

To try and step through the whole process I tried to run gdb (following
the
instructions in the wiki). It seems that running gdbstub halts whatever
thread is
handling the network IO, making it impossible to connect to gdbstub over
udp.
After exiting gdbstub, gPXE starts responding to pings and arp requests
again.

To ensure the card itself was not damaged I tried a different card that
is exactly
the same model. The results were identical. I also tried booting over
iSCSI which
worked fine.

Any ideas?
Stefan Hajnoczi
2010-06-23 08:27:59 UTC
Permalink
Post by M Lowe
My motherboard doesn't have a serial port, so that's not an
option. Unless gPXE supports USB-Serial converters?
gPXE doesn't support USB. Can your BIOS can redirect to (USB-)serial?
Post by M Lowe
000404c8: 00 00 00 00 00 00 00 00-00 00 00 00 00 (rest of line cut off)
000404d8: cf ec 00 00 00 00 0 (rest of line cut off)
I think this is the error code (from the Linux driver):

/* HCA local attached memory not present: */
MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,

The gPXE source says this error can be ignored.
Post by M Lowe
...............
000401ac: cf ec 00 00 00 00 00 00-0a 00 30 24 : .........0$
Arbel 0x1f7c4 could not issue MAD IFC: Input/output error (0x1d714039)
Error code from Linux again:

/* Index out of range: */
MTHCA_CMD_STAT_BAD_INDEX = 0x0a,

I think this happens here:
/* Update MAD parameters */
for ( i = 0 ; i < ARBEL_NUM_PORTS ; i++ )
ib_smc_update ( arbel->ibdev[i], arbel_mad );

The driver defines ARBEL_NUM_PORTS to 2, so perhaps it is probing a port that
doesn't exist. This should be fine, too.
Post by M Lowe
It seems that running gdbstub halts whatever thread is handling the network
IO, making it impossible to connect to gdbstub over udp. After exiting
gdbstub, gPXE starts responding to pings and arp requests again.
The gdbstub performs low-level network I/O - it directly polls the network
device for packets. The network stack will not respond while the gdbstub is
active. However, the gdbstub implements ARP response directly.

Are you running gdbudp on the NIC you are trying to debug? In order to be able
to debug the arbel driver, gdbudp needs to use another NIC (e.g. an e1000
card). This is because setting breakpoints in the arbel code won't work if
gdbudp is using the arbel card.
Post by M Lowe
Any ideas?
I think you are on the right track looking at DBG() messages. You've
established that transmit is working and the target receives the login
request.

You might need to reduce the number of DBG() messages in gPXE's
receive code path when debugging without a serial port. Run without
the ":3" on the DEBUG= options for less verbose output. You can also
try commenting out or moving DBG() messages that are too frequent and
not useful.

The aim would be to find out if the response is being received at each
layer of the stack (arbel driver, infiniband, srp) and then understand
the reason for dropping the response.

Michael Brown and Itay Gazit may have better Infiniband and SRP
debugging ideas. I have CCed them and added the gPXE mailing list
(the Etherboot-discuss list has been replaced by ***@etherboot.org).

Stefan
Itay Gazit
2010-06-25 17:46:59 UTC
Permalink
Hi Matthew,

Stefan is right, you should reduce the DEBUG messages depth to find the fail
cause.

I have tried SRP boot only with Hermon driver (ConnectX) and it worked for
me.

Regards,
Itay
Post by Stefan Hajnoczi
Post by M Lowe
My motherboard doesn't have a serial port, so that's not an
option. Unless gPXE supports USB-Serial converters?
gPXE doesn't support USB. Can your BIOS can redirect to (USB-)serial?
Post by M Lowe
000404c8: 00 00 00 00 00 00 00 00-00 00 00 00 00 (rest of line cut off)
000404d8: cf ec 00 00 00 00 0 (rest of line cut off)
/* HCA local attached memory not present: */
MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22,
The gPXE source says this error can be ignored.
Post by M Lowe
...............
000401ac: cf ec 00 00 00 00 00 00-0a 00 30 24 : .........0$
Arbel 0x1f7c4 could not issue MAD IFC: Input/output error (0x1d714039)
/* Index out of range: */
MTHCA_CMD_STAT_BAD_INDEX = 0x0a,
/* Update MAD parameters */
for ( i = 0 ; i < ARBEL_NUM_PORTS ; i++ )
ib_smc_update ( arbel->ibdev[i], arbel_mad );
The driver defines ARBEL_NUM_PORTS to 2, so perhaps it is probing a port that
doesn't exist. This should be fine, too.
Post by M Lowe
It seems that running gdbstub halts whatever thread is handling the
network
Post by M Lowe
IO, making it impossible to connect to gdbstub over udp. After exiting
gdbstub, gPXE starts responding to pings and arp requests again.
The gdbstub performs low-level network I/O - it directly polls the network
device for packets. The network stack will not respond while the gdbstub is
active. However, the gdbstub implements ARP response directly.
Are you running gdbudp on the NIC you are trying to debug? In order to be able
to debug the arbel driver, gdbudp needs to use another NIC (e.g. an e1000
card). This is because setting breakpoints in the arbel code won't work if
gdbudp is using the arbel card.
Post by M Lowe
Any ideas?
I think you are on the right track looking at DBG() messages. You've
established that transmit is working and the target receives the login
request.
You might need to reduce the number of DBG() messages in gPXE's
receive code path when debugging without a serial port. Run without
the ":3" on the DEBUG= options for less verbose output. You can also
try commenting out or moving DBG() messages that are too frequent and
not useful.
The aim would be to find out if the response is being received at each
layer of the stack (arbel driver, infiniband, srp) and then understand
the reason for dropping the response.
Michael Brown and Itay Gazit may have better Infiniband and SRP
debugging ideas. I have CCed them and added the gPXE mailing list
Stefan
M Lowe
2010-07-12 00:18:11 UTC
Permalink
I have been able to log the debug messages now however I see no errors
that would indicate where the problem is.

Just to recap quickly, the problem is that san-booting over InfiniBand
using SRP doesn't work and just times out. The timeout occurs while
waiting for a response to the SRP login request. I'm fairly certain the
problem lies within gPXE because I can access the SRP target just fine
through a local installation of Windows. In addition, on the SRP target
side I have traced through the ib_srpt module and found that a login
response is generated and sent (or at least posted to the mthca module
work queue).

On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine the packet is lost at some point in the
Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().This
would indicate the problem exists within the Arbel driver and explains
why SRP sanboot worked with the Hermon driver. Despite compiling with
DEBUG=arbel:3 I get no errors indicating there are any problems or
dropped packets.

Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib
_pathrec,ib_sma,ib_smc,ib_srp

Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages.

Booting from root path
"ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020
022e5e4:0002c9020022e5e4"
SRP 0xbb134 using
ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c90200
22e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with
CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5
0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)



From: Itay Gazit [mailto:***@gmail.com]
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Cc: etherboot-***@lists.sourceforge.net; gpxe; Michael Brown
Subject: Re: [Etherboot-discuss] SRP timeout

Hi Matthew,
Stefan is right, you should reduce the DEBUG messages depth to find the
fail cause.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay
Itay Gazit
2010-07-13 14:40:49 UTC
Permalink
Michael,
Do you have an idea? What can be the problem with the arbel driver?

Itay
Post by M Lowe
I have been able to log the debug messages now however I see no errors
that would indicate where the problem is.
Just to recap quickly, the problem is that san-booting over InfiniBand
using SRP doesn't work and just times out. The timeout occurs while
waiting for a response to the SRP login request. I'm fairly certain the
problem lies within gPXE because I can access the SRP target just fine
through a local installation of Windows. In addition, on the SRP target
side I have traced through the ib_srpt module and found that a login
response is generated and sent (or at least posted to the mthca module
work queue).
On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine the packet is lost at some point in the
Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().This
would indicate the problem exists within the Arbel driver and explains
why SRP sanboot worked with the Hermon driver. Despite compiling with
DEBUG=arbel:3 I get no errors indicating there are any problems or
dropped packets.
Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib
_pathrec,ib_sma,ib_smc,ib_srp
Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages.
Booting from root path
"ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020
022e5e4:0002c9020022e5e4"
SRP 0xbb134 using
ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c90200
22e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with
CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5
0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Subject: Re: [Etherboot-discuss] SRP timeout
Hi Matthew,
Stefan is right, you should reduce the DEBUG messages depth to find the
fail cause.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay
M Lowe
2010-07-22 04:19:55 UTC
Permalink
Even ideas on how I can debug this issue further would help. I don't
mind putting in the leg work at all, but a lot of this code is over my
head.





From: Itay Gazit [mailto:***@gmail.com]
Sent: Tuesday, July 13, 2010 8:41 AM
To: M Lowe; Michael Brown
Cc: Stefan Hajnoczi; etherboot-***@lists.sourceforge.net; gpxe
Subject: Re: [Etherboot-discuss] SRP timeout



Michael,

Do you have an idea? What can be the problem with the arbel driver?



Itay

On Mon, Jul 12, 2010 at 3:18 AM, M Lowe <***@shaw.ca> wrote:

I have been able to log the debug messages now however I see no errors
that would indicate where the problem is.

Just to recap quickly, the problem is that san-booting over InfiniBand
using SRP doesn't work and just times out. The timeout occurs while
waiting for a response to the SRP login request. I'm fairly certain the
problem lies within gPXE because I can access the SRP target just fine
through a local installation of Windows. In addition, on the SRP target
side I have traced through the ib_srpt module and found that a login
response is generated and sent (or at least posted to the mthca module
work queue).

On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine the packet is lost at some point in the
Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().This
would indicate the problem exists within the Arbel driver and explains
why SRP sanboot worked with the Hermon driver. Despite compiling with
DEBUG=arbel:3 I get no errors indicating there are any problems or
dropped packets.

Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib
_pathrec,ib_sma,ib_smc,ib_srp

Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages.

Booting from root path
"ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020
022e5e4:0002c9020022e5e4"
SRP 0xbb134 using
ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c90200
22e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with
CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5
0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)



From: Itay Gazit [mailto:***@gmail.com]
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Cc: etherboot-***@lists.sourceforge.net; gpxe; Michael Brown
Subject: Re: [Etherboot-discuss] SRP timeout


Hi Matthew,
Stefan is right, you should reduce the DEBUG messages depth to find the
fail cause.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay
Viswanath Krishnamurthy
2010-07-23 05:03:56 UTC
Permalink
I just looked into arbel code. It does not implement RC qp type at all (It
implements UD only). This means SRP will not work with arbel card since SRP
uses RC/RDMA for data transfer.

Your best bet at this moment is to us Hermon card (unless you want to add RC
QP capability into arbel driver :-) )

Itay,

Can you reconfirm this from the arbel code?

Thx,
Viswa
Even ideas on how I can debug this issue further would help. I don’t mind
putting in the leg work at all, but a lot of this code is over my head.
*Sent:* Tuesday, July 13, 2010 8:41 AM
*To:* M Lowe; Michael Brown
*Subject:* Re: [Etherboot-discuss] SRP timeout
Michael,
Do you have an idea? What can be the problem with the arbel driver?
Itay
I have been able to log the debug messages now however I see no errors
that would indicate where the problem is.
Just to recap quickly, the problem is that san-booting over InfiniBand
using SRP doesn't work and just times out. The timeout occurs while
waiting for a response to the SRP login request. I'm fairly certain the
problem lies within gPXE because I can access the SRP target just fine
through a local installation of Windows. In addition, on the SRP target
side I have traced through the ib_srpt module and found that a login
response is generated and sent (or at least posted to the mthca module
work queue).
On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine the packet is lost at some point in the
Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().This
would indicate the problem exists within the Arbel driver and explains
why SRP sanboot worked with the Hermon driver. Despite compiling with
DEBUG=arbel:3 I get no errors indicating there are any problems or
dropped packets.
Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib
_pathrec,ib_sma,ib_smc,ib_srp
Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages.
Booting from root path
"ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020
022e5e4:0002c9020022e5e4"
SRP 0xbb134 using
ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c90200
22e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with
CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5
0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Subject: Re: [Etherboot-discuss] SRP timeout
Hi Matthew,
Stefan is right, you should reduce the DEBUG messages depth to find the
fail cause.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay
------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
_______________________________________________
Etherboot-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/etherboot-discuss
Itay Gazit
2010-07-24 11:01:31 UTC
Permalink
Viswa,
Michael is trying to add the RC support for the Arbel.

Michael,
I suspect something earlier went wrong before posting the WQE. Can you
verify the all MAP_FA process suceed.

Regards,

Itay
On Fri, Jul 23, 2010 at 8:03 AM, Viswanath Krishnamurthy <
Post by Viswanath Krishnamurthy
I just looked into arbel code. It does not implement RC qp type at all (It
implements UD only). This means SRP will not work with arbel card since SRP
uses RC/RDMA for data transfer.
Your best bet at this moment is to us Hermon card (unless you want to add
RC QP capability into arbel driver :-) )
Itay,
Can you reconfirm this from the arbel code?
Thx,
Viswa
Even ideas on how I can debug this issue further would help. I don’t
mind putting in the leg work at all, but a lot of this code is over my head.
*Sent:* Tuesday, July 13, 2010 8:41 AM
*To:* M Lowe; Michael Brown
*Subject:* Re: [Etherboot-discuss] SRP timeout
Michael,
Do you have an idea? What can be the problem with the arbel driver?
Itay
I have been able to log the debug messages now however I see no errors
that would indicate where the problem is.
Just to recap quickly, the problem is that san-booting over InfiniBand
using SRP doesn't work and just times out. The timeout occurs while
waiting for a response to the SRP login request. I'm fairly certain the
problem lies within gPXE because I can access the SRP target just fine
through a local installation of Windows. In addition, on the SRP target
side I have traced through the ib_srpt module and found that a login
response is generated and sent (or at least posted to the mthca module
work queue).
On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine the packet is lost at some point in the
Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().This
would indicate the problem exists within the Arbel driver and explains
why SRP sanboot worked with the Hermon driver. Despite compiling with
DEBUG=arbel:3 I get no errors indicating there are any problems or
dropped packets.
Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib
_pathrec,ib_sma,ib_smc,ib_srp
Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages.
Booting from root path
"ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020
022e5e4:0002c9020022e5e4"
SRP 0xbb134 using
ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c90200
22e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with
CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5
0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0
rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB
80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB
00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Subject: Re: [Etherboot-discuss] SRP timeout
Hi Matthew,
Stefan is right, you should reduce the DEBUG messages depth to find the
fail cause.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay
------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
_______________________________________________
Etherboot-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/etherboot-discuss
Loading...