2014-03-16

Tourist trip to MX fabric

Tourist, because this is mostly original research, so the quality may be dubious.

You can infer a lot about the fabric by looking at the 'show hsl2 ...' commands. Let's start.

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0) serdes
MQCHIP(0) serdes table :
MQCHIP(0)-Avago 65NM-0 [0xf300000]: 24 links 0 - 23
MQCHIP(0)-Avago 65NM-1 [0xf304000]: 24 links 24 - 47
MQCHIP(0)-Avago 65NM-2 [0xf308000]: 8 links 48 - 55
MQCHIP(0)-Avago 65NM-3 [0xf309000]: 8 links 56 - 63
MQCHIP(0)-Avago 65NM-4 [0xf30a000]: 8 links 64 - 71
MQCHIP(0)-Avago 65NM-5 [0xf30b000]: 8 links 72 - 79
MQCHIP(0)-Avago 65NM-6 [0xf30c000]: 8 links 80 - 87
MQCHIP(0)-Avago 65NM-7 [0xf30d000]: 8 links 88 - 95
MQCHIP(0)-Avago 65NM-8 [0xf30e000]: 8 links 96 - 103
MQCHIP(0)-Avago 65NM-9 [0xf30f000]: 8 links 104 - 111
MQCHIP(0)-Avago 65NM-10 [0xf310000]: 8 links 112 - 119
MQCHIP(0)-Avago 65NM-11 [0xf311000]: 8 links 120 - 127
MQCHIP(0)-Avago 65NM-12 [0xf312000]: 8 links 128 - 135
MQCHIP(0)-Avago 65NM-13 [0xf313000]: 8 links 136 - 143
MQCHIP(0)-Avago 65NM-14 [0xf318000]: 2 links 144 - 145
MQCHIP(0)-Avago 65NM-15 [0xf31a000]: 2 links 146 - 147

Avago is a well-known manufacturer of SerDes (SERializer/DESerializer) blocks, and 65NM probably refers to Avago's 65nm lithography product line. The SerDes are presented here as unidirectional links. Still, that is quite a large number of SerDes links. They are mapped/bundled into channels, and by looking at those channels we can get a bit more of a clue about them:

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0)
MQCHIP(0)(16) 74 channels 16 serdes 148 links chip_flag 0x9
Flag: log_poll_error log_ioerr
rd_fn 0x40760d64 wr_fn 0x40760d10 cookie 0x4282bf28 disaster_fn 0x0
MQCHIP(0) channel table :
MQCHIP(0)-chan-rx-0 0x7200000 <= Table QXCHIP(0)-chan-tx-1 MQCHIP(0)-Avago 65NM-link-146
MQCHIP(0)-chan-tx-1 0x7201000 => Table QXCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-144
MQCHIP(0)-chan-rx-2 0x7202000 <= Table LUCHIP(0)-chan-tx-2 MQCHIP(0)-Avago 65NM-link-24
MQCHIP(0)-chan-rx-3 0x7202000 <= Table LUCHIP(0)-chan-tx-3 MQCHIP(0)-Avago 65NM-link-36
MQCHIP(0)-chan-tx-4 0x7203000 => Table LUCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-0
MQCHIP(0)-chan-tx-5 0x7203000 => Table LUCHIP(0)-chan-rx-1 MQCHIP(0)-Avago 65NM-link-12
MQCHIP(0)-chan-rx-6 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-7 0x7204000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-8 0x7204000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-84
MQCHIP(0)-chan-rx-9 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-10 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-82
MQCHIP(0)-chan-rx-11 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-84
MQCHIP(0)-chan-rx-12 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-86
MQCHIP(0)-chan-rx-13 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-14 0x7205000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-15 0x7205000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-92
MQCHIP(0)-chan-rx-16 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-17 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-90
MQCHIP(0)-chan-rx-18 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-92
MQCHIP(0)-chan-rx-19 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-94
MQCHIP(0)-chan-rx-20 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-21 0x7206000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-22 0x7206000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-100
MQCHIP(0)-chan-rx-23 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-24 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-98
MQCHIP(0)-chan-rx-25 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-100
MQCHIP(0)-chan-rx-26 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-102
MQCHIP(0)-chan-rx-27 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-28 0x7207000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-29 0x7207000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-108
MQCHIP(0)-chan-rx-30 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-31 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-106
MQCHIP(0)-chan-rx-32 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-108
MQCHIP(0)-chan-rx-33 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-110
MQCHIP(0)-chan-tx-34 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-35 0x7208000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-36 0x7208000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-52
MQCHIP(0)-chan-tx-37 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-38 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-50
MQCHIP(0)-chan-tx-39 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-52
MQCHIP(0)-chan-tx-40 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-54
MQCHIP(0)-chan-tx-41 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-42 0x7209000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-43 0x7209000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-60
MQCHIP(0)-chan-tx-44 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-45 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-58
MQCHIP(0)-chan-tx-46 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-60
MQCHIP(0)-chan-tx-47 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-62
MQCHIP(0)-chan-tx-48 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-49 0x720a000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-50 0x720a000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-68
MQCHIP(0)-chan-tx-51 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-52 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-66
MQCHIP(0)-chan-tx-53 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-68
MQCHIP(0)-chan-tx-54 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-70
MQCHIP(0)-chan-tx-55 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-56 0x720b000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-57 0x720b000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-76
MQCHIP(0)-chan-tx-58 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-59 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-74
MQCHIP(0)-chan-tx-60 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-76
MQCHIP(0)-chan-tx-61 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-78
MQCHIP(0)-chan-rx-62 0x720c000 <= Channel remote/unknown MQCHIP(0)-Avago 65NM-link-128
MQCHIP(0)-chan-rx-63 0x720c000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-128
MQCHIP(0)-chan-rx-64 0x720c000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-132
MQCHIP(0)-chan-rx-65 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-136
MQCHIP(0)-chan-rx-66 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-136
MQCHIP(0)-chan-rx-67 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-140
MQCHIP(0)-chan-tx-68 0x720e000 => Channel remote/unknown MQCHIP(0)-Avago 65NM-link-112
MQCHIP(0)-chan-tx-69 0x720e000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-112
MQCHIP(0)-chan-tx-70 0x720e000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-116
MQCHIP(0)-chan-tx-71 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-120
MQCHIP(0)-chan-tx-72 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-120
MQCHIP(0)-chan-tx-73 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-124

That's a lot of information, but if we ignore the 'Invalid' entries we're left with 'Primitive' and 'Table'. The 'Table' ones clearly mark what they connect to, in this case the LU chip and the QX chip, so we can work on the assumption that 'Primitive' is used to connect to the fabric. We can further determine there are 8 such channels in each direction, and two of the channels always sit numerically next to each other with a lot of empty channels in between; this may prove useful. But let's take a closer look at one channel.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 7
MQCHIP(0)-chan-rx-7 : Up Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX
reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-80
Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default
Cell received: 3309287443 CRC errors: 0
Cell last : 2328887065 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0
sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0
sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0
sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0
sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0
Last poll:
last_int_status : 0x0
last_cell_received : 0x8acfff19
last_crc_errors[0] : 0x0
last_crc_errors[1] : 0x0
last_crc_errors[2] : 0x0
last_crc_errors[3] : 0x0
last_crc_errors[4] : 0x0
last_crc_errors[5] : 0x0
last_crc_errors[6] : 0x0
last_crc_errors[7] : 0x0
last_bcr_overflow : 0x0
last_bcr_overflow : 0x0
last_bcr_multiple_psr_push : 0x0
last_sub_idle_override : 0x0
last_psr_overflow : 0x0
last_psr_illegal_8b10b : 0x0
last_psr_8b10b_valids : 0x0
last_psr_illegal_64b66b : 0x0
last_psr_64b66b_framing : 0x0
last_psr_clock_correction : 0x0
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#

What is interesting here is 'Sub channel 1 of 2 with 4 links'. Sub-channel 1 of 2? Where is the 2nd? And 4 links is interesting, so a sub-channel has 4 links. Let's check the next one. Also worth noticing: we have a cell counter, and there is 64b/66b encoding overhead.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 8
MQCHIP(0)-chan-rx-8 : Up Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX
reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-84
Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default
Cell received: 3309287443 CRC errors: 0
Cell last : 2328887065 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
Check base channel for rest of the sub-channel stats.
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#

Ok, 7 and 8 sit next to each other because they are logically one channel made out of two sub-channels. So the 8 channels we saw per direction are really 4 logical channels with two sub-channels each. Looking at the other channels, everything looks the same, until we hit 28 and 29, i.e. the 4th channel:

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 28
MQCHIP(0)-chan-rx-28 : Down Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX
reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-104
Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default
Cell received: 0 CRC errors: 0
Cell last : 0 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
             : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
             : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
             : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0
sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0
sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0
sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0
sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0
Last poll:
last_int_status : 0x0
last_cell_received : 0x0
last_crc_errors[0] : 0x0
last_crc_errors[1] : 0x0
last_crc_errors[2] : 0x0
last_crc_errors[3] : 0x0
last_crc_errors[4] : 0x0
last_crc_errors[5] : 0x0
last_crc_errors[6] : 0x0
last_crc_errors[7] : 0x0
last_bcr_overflow : 0x0
last_bcr_overflow : 0x0
last_bcr_multiple_psr_push : 0x0
last_sub_idle_override : 0x0
last_psr_overflow : 0x0
last_psr_illegal_8b10b : 0x0
last_psr_8b10b_valids : 0x0
last_psr_illegal_64b66b : 0x0
last_psr_64b66b_framing : 0x0
last_psr_clock_correction : 0x0
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 29
MQCHIP(0)-chan-rx-29 : Down Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX
reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-108
Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default
Cell received: 0 CRC errors: 0
Cell last : 0 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
Check base channel for rest of the sub-channel stats.
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#

It's down, and no cells have been transferred. Isn't this convenient: we have 4 channels per direction, but one of them is down, so we're left with 3 channels per direction, and this being an MX960, that matches our SCB count. But why two sub-channels per SCB? Because the fabric actually has two Avago chips; you need to connect to both, but it is presented as a single channel to higher-level use. Why do we have four channels if we only use 3 of them? Well, if we look at this same output on an MX240/MX480, we actually see the fourth channel up and pushing cells! Clearly the MX960 uses 8+8+8 fabric SerDes links per MQ and the MX240/MX480 use 8+8+8+8 per MQ, 8 more! But the MX240/MX480 have only two SCBs, so it must be arranged as (8+8)+(8+8).
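To keep the terminology straight, here is a tiny illustrative sketch (my own bookkeeping in Python, not anything from the PFE) of the structure inferred so far: the 'Primitive' sub-channel pairs from the channel table grouped into logical fabric channels, and the resulting SerDes link counts.

# Illustrative model of the MQ fabric channel layout inferred above.
LINKS_PER_SUBCHANNEL = 4        # "Sub channel 1 of 2 with 4 links"
SUBCHANNELS_PER_CHANNEL = 2     # one per Avago chip on the fabric side

# rx sub-channel pairs marked 'Primitive', with the first SerDes link of each
rx_fabric_channels = {
    0: ((7, 80), (8, 84)),
    1: ((14, 88), (15, 92)),
    2: ((21, 96), (22, 100)),
    3: ((28, 104), (29, 108)),  # the pair that is Down on MX960
}

links_per_channel = LINKS_PER_SUBCHANNEL * SUBCHANNELS_PER_CHANNEL
print("logical fabric channels per direction:", len(rx_fabric_channels))
print("SerDes links per logical channel:", links_per_channel)
print("SerDes links per direction:", len(rx_fabric_channels) * links_per_channel)
# MX960: 3 channels up -> 24 links per direction (8 per SCB)
# MX240/MX480: 4 channels up -> 32 links per direction (16 per SCB)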

Now, we know an MX240/MX480 SCB can handle MPC2 (2*MQ, i.e. 2*40G) fabric connectivity redundantly with a single SCB up, so we know (8+8) (single SCB up) must be enough to run 40G, i.e. 16 SerDes must be enough to carry a single 40G MQ's fabric capacity. This gives us a lower bound for SerDes capacity of 40G/16 SerDes = 2.5Gbps. But wait, we had 64b/66b encoding, and we know the fabric uses cells which have a header, which is overhead, so 2.5Gbps can't be right, it must be more. With a bit of googling we'll find that Avago makes 3.125Gbps SerDes, which makes sense. So the raw capacity available to an MQ is 16*3.125 = 50Gbps, redundantly! Double that in an MX240/MX480 with both SCBs up, so 100Gbps of fabric capacity, which seems like a LOT! But multicast replication in MPC2 is 'btree', so each linecard needs to send two copies of multicast over the fabric to replicate it, so for the multicast scenario the 100Gbps is halved right back to 50Gbps, and cell overhead can be devastating (think of a small packet using two cells but just 1 byte of the 2nd cell: efficiency is about 50%!).
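As a sanity check of the arithmetic, a minimal sketch assuming 3.125Gbps per SerDes; the 64-byte cell payload in the overhead example is my assumption, purely to illustrate the worst case described above.

SERDES_GBPS = 3.125   # Avago 65nm SerDes, gross line rate per link

def mq_fabric_gbps(serdes_per_scb, scbs_up, serdes_gbps=SERDES_GBPS):
    """Gross fabric capacity of one MQ for a given SCB arrangement."""
    return serdes_per_scb * scbs_up * serdes_gbps

# MX240/MX480: 16 SerDes per SCB per MQ
print(mq_fabric_gbps(16, 1))       # 50.0  -- a single SCB covers a 40G MQ
print(mq_fabric_gbps(16, 2))       # 100.0 -- both SCBs up
print(mq_fabric_gbps(16, 2) / 2)   # 50.0  -- btree multicast sends two copies

# Cell overhead example (64-byte cell payload assumed):
CELL_PAYLOAD = 64
pkt_bytes = CELL_PAYLOAD + 1                # spills 1 byte into a 2nd cell
cells = -(-pkt_bytes // CELL_PAYLOAD)       # ceiling division -> 2 cells
print(pkt_bytes / (cells * CELL_PAYLOAD))   # ~0.51 -- efficiency around 50%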

This also means that an MX960 has just 75Gbps of fabric capacity per MQ with all SCBs up, so plenty for unicast, but not enough for worst-case multicast btree replication. Still, SCB and MPC2 seem like a good match for almost any realistic/practical scenario; no one is running 100% multicast, so there is plenty of fabric capacity for real life in the MX960 and MX240/MX480, with all SCBs up or one SCB down, still full wire-rate fabric capacity for unicast.
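The same arithmetic for the MX960, where each MQ has only 8 SerDes per SCB:

SERDES_GBPS = 3.125
print(8 * 3 * SERDES_GBPS)        # 75.0 -- MX960 per MQ, all 3 SCBs up
print(8 * 2 * SERDES_GBPS)        # 50.0 -- one SCB down, still covers 40G unicast
print(8 * 3 * SERDES_GBPS / 2)    # 37.5 -- worst-case btree multicast, short of 40G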

But if we actually try to send traffic from a DPCE to an MPC2 in the lab, we'll see that the MPC2 MQ=>DPCE direction is fine: a single MQ can congest the 40G DPCE card. But DPCE=>MQ in an MX960 gets just 26Gbps out of the 40Gbps needed, even when all SCBs are up. Wait, what? Thought this was murica! We had 75Gbps of MX960 capacity, and we can do just 26Gbps? What is up with that! Furthermore, if we pull one SCB out, the lab test is 100% clean: DPCE=>MQ gets 40Gbps out of the 40Gbps needed, full line rate. My head hurts!

Let's ponder a bit. We know the DPCE does not connect to the redundant SCB, so it always uses just 2 SCBs in an MX960, even though the MQ in an MPC2 connects to 3 SCBs. So if we divide the 40Gbps the MQ needs nominally by 3 SCBs, it needs 13.33Gbps per SCB, which multiplied by 2 is 26.66Gbps. Looking at fabric requests and grants, we see that the MQ is not giving grants to the DPCE (ichip): we have fabric capacity, but the MQ is not allowing the DPCE to send. Clearly the MQ does not want to deal with oversubscription, for one reason or another, and it only gives out fabric grants at 13.33Gbps per SCB, so that under no circumstances could the aggregate fabric traffic exceed 40Gbps, i.e. no oversubscription. But this strategy breaks down when the MQ has 3 connections and the DPCE only 2, and the issue is remedied when the MQ downgrades to 2 connections, as the MQ then gives out fabric grants at 40/2 => 20Gbps per SCB instead of just 13.33Gbps.
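A minimal sketch of the grant pacing as I understand it; the even per-SCB split is inferred from the 26Gbps measurement, not from any documentation.

def grants_received_gbps(mq_capacity_gbps, mq_scbs, peer_scbs):
    """Grant bandwidth a far-end card ends up with, if the MQ spreads its
    grants evenly over the SCBs it is connected to, so the aggregate can
    never exceed its own fabric capacity."""
    per_scb = mq_capacity_gbps / mq_scbs
    return per_scb * min(mq_scbs, peer_scbs)

print(grants_received_gbps(40, mq_scbs=3, peer_scbs=2))  # 26.67 -- DPCE=>MQ, MX960, all SCBs up
print(grants_received_gbps(40, mq_scbs=2, peer_scbs=2))  # 40.0  -- SCB pulled or redundancy-mode redundant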

Why does the MQ not want to deal with oversubscription? Maybe it wouldn't have sufficient information at ingress to drop intelligently enough to keep QoS contracts? Maybe this behaviour exists to avoid software complexity? Turns out you don't even have to pull out an SCB to force the MQ to connect to just two SCBs: you can use 'set chassis fabric redundancy-mode redundant' and MPC2<=>DPCE is wire rate, and the 3rd SCB in the MX960 is unused, as it's not needed for MPC2; remember, we have 50Gbps per MQ without it and 75Gbps with it. So the only reason we might even need it is if we do multicast.

Interestingly, if MPC2 were 120Gbps instead of 80Gbps, i.e. 60Gbps of fabric capacity per MQ instead of 40Gbps, this issue shouldn't exist, as you'd be giving out fabric grants at 60/3 => 20Gbps per SCB, so even in non-redundant/capacity mode the DPCE would get the 40Gbps it needs. The other way to fix this is of course to upgrade the SerDes from 3.125Gbps to 6.25Gbps, and it turns out Avago has such chips. This means the MQ can get 6.25*(8+8), i.e. 100Gbps, with just two SCBEs. That is enough to do btree replication, and as it's just 2 SCBs, you're still giving the DPCE the full 40Gbps of fabric grants. So clearly SCBE is the better fix, but had the MQ originally been 60Gbps instead of 40Gbps, I think that would have been much better as well.
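Both fixes fall out of the same back-of-the-envelope math (hypothetical numbers):

# Fix 1: hypothetical 120G MPC2, i.e. 60G of fabric per MQ over 3 SCBs
print(60 / 3 * 2)        # 40.0  -- a 2-SCB DPCE would get its full 40G of grants

# Fix 2: SCBE doubles the SerDes to 6.25Gbps, MQ uses 8+8 links to 2 SCBEs
print(6.25 * (8 + 8))    # 100.0 -- enough even for btree multicast
print(40 / 2 * 2)        # 40.0  -- only 2 SCBs in play, DPCE still gets 40G of grants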

Can this issue happen again in the future? Can't see why not. Imagine an SCBE system with the far-end linecard being 2*80Gbps (enough to connect to just 2 SCBEs) and the near-end linecard being a 2*100Gbps MQ, with similar behaviour of curbing fabric grants to 100Gbps per MQ: you'd be giving out 100/3 = 33.33Gbps per SCB, and the far end would be getting 66.66Gbps of fabric grants out of the 80Gbps needed. But looking at DPCE and MPC, I don't think the problem will come back, as looking at the fabric streams in the MQ and the DPCE (ichip), it appears traffic coming from an MQ can be classified in the far-end MQ and dropped while retaining the QoS contract. The current DPCE=>MQ case is probably an unfortunate combination of mismatched bandwidths and the DPCE having a single dumb fabric stream, so the far end can't easily determine what to drop in case of congestion, and the best policy was simply to make sure congestion never happens at ingress.
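The hypothetical mismatch in numbers, same arithmetic as before:

# Hypothetical SCBE-era repeat: near-end MQ curbs grants to 100G over 3 SCBEs,
# far-end card sits on only 2 of them and needs 80G.
print(100 / 3 * 2)   # 66.67 -- short of the 80 Gbps the far end would need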

We can check the fabric streams and see that there is priority in the MQ:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream all
Input Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: DWWW---------------------------- --------------------------------
1088..1151: -----------------------------WWL L------------------------------H
Output Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: WW------------------------------ --------------------------------
1088..1151: -----------------------------WWL L------------------------------H
NPC1(test2nxb10-re0.dk vty)#

Notice there are four fabric streams (2 fabric streams per NPU in the system): 4, 5 and 132, 133. It's worth another post, but there is an offset of 128 between the low-prio and high-prio streams. We can quickly try to prove this:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 4
Input Stream 4
-----------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  port : 0
  conn : 0
  tclass : 0
  hi prio : 0
  weight : 0
Output Stream 4
------------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  wan if : 0
  port : 0
  conn : 0
  weight : 0
  sched : 1 MQ1
  l1 node : 0
  queue : 4
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 132
Input Stream 132
-----------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  port : 0
  conn : 0
  tclass : 0
  hi prio : 0
  weight : 0
Output Stream 132
------------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  wan if : 0
  port : 0
  conn : 0
  weight : 0
  sched : 1 MQ1
  l1 node : 0
  queue : 132
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 4
Q node 4:
  allocated : true
  parent node : 0
  guarantee prio : 3 GL
  excess prio : 2 EL
  rate enable : 0
  m rate : 0 bps
  m credits : 0
  guarantee enable : 1
  g rate : 0 bps
  g credits : 0
  adjust cmode : 0
  adjust value : 0
  weight : 5
  mnc in gp enable : 1
  -----------------------
  guarantee credits : 0
  excess credits added: 0
  excess credits : 0
  max credits : 0
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 132
Q node 132:
  allocated : true
  parent node : 2
  guarantee prio : 1 GH
  excess prio : 2 EL
  rate enable : 0
  m rate : 0 bps
  m credits : 0
  guarantee enable : 1
  g rate : 0 bps
  g credits : 0
  adjust cmode : 0
  adjust value : 0
  weight : 95
  mnc in gp enable : 1
  -----------------------
  guarantee credits : 65408
  excess credits added: 1
  excess credits : 0
  max credits : 0
NPC1(test2nxb10-re0.dk vty)#

So in-contract traffic going to the fabric in stream 4 gets 'GL' (guaranteed-low), and in-contract traffic going to the fabric in stream 132 gets 'GH' (guaranteed-high). So provided the far-end MQ has this stream information, it would be very easy for it to know what to drop under oversubscription: drop 4, not 132. The IChip does not have this, so maybe that's the reason why it was simply guaranteed that there is no oversubscription from the fabric?
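A small sketch of how an egress PFE could use the 128 offset and the GL/GH split to pick drop victims under fabric oversubscription; purely illustrative, not the actual MQ logic:

HIGH_PRIO_OFFSET = 128   # stream 4 (GL) pairs with stream 132 (GH)

def fabric_stream_priority(stream):
    """Priority class of a fabric stream, per the observed 128 offset
    (only meant for the fabric streams seen here, not WAN streams)."""
    return "GH" if stream >= HIGH_PRIO_OFFSET else "GL"

def drop_victims(streams):
    """Under fabric oversubscription, prefer dropping guaranteed-low streams."""
    return [s for s in streams if fabric_stream_priority(s) == "GL"]

print(fabric_stream_priority(4), fabric_stream_priority(132))  # GL GH
print(drop_victims([4, 5, 132, 133]))                          # [4, 5]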

It's important to realize that fabric bandwidth is just one issue to think about: the MQ internally has less than the needed 80Gbps (40G fabric, 40G WAN) of memory bandwidth. The actual number depends on cell alignment etc., but it can be 80Gbps in a good climate, maybe 75Gbps in an average climate, or a lot less with a manufactured climate.

Some bullet points for the TL;DR people (the capacity math from these numbers is sketched right after the list):

  1. MQ has 32 SerDes towards the fabric
  2. Each SerDes is 3.125Gbps gross (before 64b/66b, cell overhead + alignment)
  3. MX960 allocates the SerDes as 8+8+8 to 3 SCBs
  4. MX240/MX480 allocate the SerDes as 16+16 to 2 SCBs
  5. SCBE upgrades the SerDes to 6.25Gbps
  6. SCBE2 upgrades the SerDes to 16.5Gbps (matches what Avago offers, but not very sure)
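Putting the TL;DR numbers together, a tiny cheat-sheet of gross per-MQ fabric capacity (before encoding and cell overhead):

SETUPS = [
    # (label, SerDes per direction in use, Gbps per SerDes)
    ("MX960, 3x SCB (8+8+8)",       24, 3.125),
    ("MX240/MX480, 2x SCB (16+16)", 32, 3.125),
    ("2x SCBE (8+8)",               16, 6.25),
]

for label, serdes, gbps in SETUPS:
    print(f"{label:30s} {serdes * gbps:6.1f} Gbps gross per MQ")
# 75.0, 100.0 and 100.0 respectively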
