2014-03-17

Quick look at Trio ddos-protection with flow-detection

Some things are easy to protect with an iACL and lo0 ACL, but others are really hard. BGP, for example: you need to allow BGP from customers and from the core, and it's not convenient or practical to handle them separately in a lo0 ACL + policer. Luckily JunOS has a feature called flow-detection, which you turn on with 'set system ddos-protection global flow-detection'.

I'm sending a DoS attack from a single source to lo0, and my iBGP goes down immediately. After I turn on flow-detection, iBGP connectivity is restored. Looking at the PFE, we can see what is happening:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx  rindex prot aggr IIF/IFD     pkts      bytes source-info
--- ---- ------ ---- ---- ------- -------- ---------- ----------
0   0    721    1400 sub  338           21      79161 c158ef22 c158ef1f 53571 179
0   1    2679   1400 sub  356     11159404 2187242988 64640102 c158ef1f 179 179
0   2    2015   1400 sub  338           29     112468 c158ef23 c158ef1f 179 65020
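
The source-info column packs the addresses as raw hex; a quick pry one-liner (same trick as in the capture post further below) turns them back into dotted quads:

[ytti@ytti.fi ~]% pry
[1] pry(main)> %w(64640102 c158ef1f).map{|ip| ip.scan(/../).map{|e| e.to_i(16)}.join('.')}
=> ["100.100.1.2", "193.88.239.31"]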

Pretty nice and clear: 64640102 => c158ef1f (100.100.1.2 => 193.88.239.31) is our attack traffic, and it's getting its own policer. iBGP is stable, as the attack traffic is policed separately. Let's check those policers more closely:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 2679
PFE: 0
Flow Record Index: 2679
Flow Key:
  Proto-ID: 1400   Key type: 1   IIF: 356
  Src IP addr: 64640102 (100.100.1.2)   Dst IP addr: c158ef1f (193.88.239.31)
  Src port: 179   Dst port: 179
Flow Context Data:
  Rcvd ack_add: 1   Rcvd ack_del: 0   Rcvd last flow op: 2   Flow state: 2
  Aggr level: 0   Proto idx: 79   Policer idx: 3885
  Time inserted: 3484405383   Time last violated: 3484485825   Last received: 42480417
Flow Statitics:
  Packet Count: 42655279   Byte Count: 8360434488
MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp stats
DDOS Policer Statistics:
                                                 arrival pass  # of
idx prot group proto     on loc      pass   drop rate    rate  flows
--- ---- ----- --------- -- ----- ------- ------ ------- ----- -----
 79 1400 bgp   aggregate Y  UKERN 8378805      0       9     9     3
                            PFE-0 8393411 2310827652 528095  9     3
MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp configuration
DDOS Policer Configuration:
                                    UKERN-Config  PFE-Config
idx prot group proto     on Pri   rate burst    rate burst
--- ---- ----- --------- -- --- ------ ----- ------ -----
 79 1400 bgp   aggregate Y  Lo    8000 20000   8000 20000
MX104-ABB-0(test13nqa1-re0.dk vty)#

Looks good. What if we increase the sources from a /32 to a /16? Let's see:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx  rindex prot aggr IIF/IFD pkts  bytes source-info
--- ---- ------ ---- ---- ------- ---- ------ ----------
0   0    3083   1400 sub  356       80  15484 6464e743 c158ef1f 179 179
0   1    805    1400 sub  356      217  42336 64642df4 c158ef1f 179 179
0   2    672    1400 sub  356      291  56840 6464eb64 c158ef1f 179 179
0   3    3382   1400 sub  356      256  49980 6464ba66 c158ef1f 179 179
0   4    3956   1400 sub  356       82  15876 64643b3d c158ef1f 179 179
0   5    1999   1400 sub  356       92  17836 646403a6 c158ef1f 179 179
0   6    2808   1400 sub  356       22   4116 64642f33 c158ef1f 179 179
<..>
0   3455 866    1400 sub  356       80  15484 646475a1 c158ef1f 179 179
0   3456 4096   1400 sub  356      319  62328 646470a8 c158ef1f 179 179
MX104-ABB-0(test13nqa1-re0.dk vty)#

No mas! I expected to see 65537 or so, but instead I'm seeing just 3456. My iBGP also went down! Let's poke around a bit:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info DDOS-SCFD global context ------------------------------------------------------ FLow entry/state/hash size: 288/12/8 bytes Flow scan: Yes Send async msg to RE: Yes Send periodic update to RE: No Default enabled: No Enabled: Yes Last aggr op is: Aggr Next available flow id: 5b39 Culprit flows: 3902 Culprit flows on scan: 3902 Violated protocols: 1 Violated protocols on scan: 1 Violation report rate: 100(pps) Flow change report rate: 100(pps) Scan cookie: 588425 Free SCFD states: 4096 Free flow entries: 194 Free notif blocks: 400 Free re request blocks: 400 Free flow msg blocks: 4096 Free flow policers: 321 Socket notif queue size: 0 Has queued work state items: 0 Has queued re requests: 0 Has queued flow rate modifies: 0 Has queued flow messages: 0 Send packet size: 16384 Send batch size: 1 Last aggr op time: 2942210140 Per PFE flows: 0=3902 Run out of flows: 0 Reuse an entry not freed yet: 0 Run out of state items: 0 Bad proto ID: 0 rindex changed for same flow: 0 Remove flow on an empty proto: 0 Remove non-exist flow: 0 Read ASIC failed: 0 Failed tries write flow params: 0 Failed change flow params: 0 Run out of policers: 0 Run out of msg blocks: 0 Run out of mod flow blocks: 0 SCFD stats for PFE 0 Global configuration violation report rate: 100 flow report rate: 100 Flow counters read from LU current suspicious flows: 191 current culprit flows: 3905 discovered suspicious flows: 123802 discovered culprit flows: 23356 deleted culprit flows: 19451 false positives: 100255 hash insertion errors: 148669170 hash deletion errors: 0 max flow tbl scan time(ms): 0 debug values: 0 Flow reports received through PUNT policer violation: 104836 flow found: 23353 flow timeout: 0 flow return to normal: 115 flow cleared: 19336 unknown reports: 0 bad flow type: 0 Violation indication policer stats Passed indications: 21346 Dropped indications: 84304888

What immediately caught my attention was 'Free flow entries: 194'. Oops! We're running out of flow entries. But as we wait a few more seconds, my iBGP comes back up. Why?

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows pfe idx rindex prot aggr IIF/IFD pkts bytes source-info --- ---- ------ ---- ---- ------- ------- -------- ---------- 0 0 3183 1400 ifl 338 96 220159 0 0 0 0 0 1 1439 1400 ifl 356 70851932 13886978476 0 0 0 0 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 3183 PFE: 0 Flow Record Index: 3183 Flow Key: Proto-ID: 1400 Key type: 1 IIF: 338 Src IP addr: 0 (0.0.0.0) Dst IP addr: 0 (0.0.0.0) Src port: 0 Dst port: 0 Flow Context Data: Rcvd ack_add: 1 Rcvd ack_del: 1 Rcvd last flow op: 3 Flow state: 3 Aggr level: 1 Proto idx: 79 Policer idx: 2105 Time inserted: 3485722214 Time last violated: 3485723825 Last received: 105 Flow Statitics: Packet Count: 105 Byte Count: 221115 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 1439 PFE: 0 Flow Record Index: 1439 Flow Key: Proto-ID: 1400 Key type: 1 IIF: 356 Src IP addr: 0 (0.0.0.0) Dst IP addr: 0 (0.0.0.0) Src port: 0 Dst port: 0 Flow Context Data: Rcvd ack_add: 1 Rcvd ack_del: 0 Rcvd last flow op: 2 Flow state: 2 Aggr level: 1 Proto idx: 79 Policer idx: 2155 Time inserted: 3485637825 Time last violated: 3485792825 Last received: 81852797 Flow Statitics: Packet Count: 82308537 Byte Count: 16132473056 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info DDOS-SCFD global context ------------------------------------------------------ FLow entry/state/hash size: 288/12/8 bytes Flow scan: Yes Send async msg to RE: Yes Send periodic update to RE: No Default enabled: No Enabled: Yes Last aggr op is: Aggr Next available flow id: 5b40 Culprit flows: 2 Culprit flows on scan: 2 Violated protocols: 1 Violated protocols on scan: 1 Violation report rate: 100(pps) Flow change report rate: 100(pps) Scan cookie: 588463 Free SCFD states: 4096 Free flow entries: 4094 Free notif blocks: 400 Free re request blocks: 400 Free flow msg blocks: 4096 Free flow policers: 4221 Socket notif queue size: 0 Has queued work state items: 0 Has queued re requests: 0 Has queued flow rate modifies: 0 Has queued flow messages: 0 Send packet size: 16384 Send batch size: 1 Last aggr op time: 2942210140 Per PFE flows: 0=2 Run out of flows: 0 Reuse an entry not freed yet: 0 Run out of state items: 0 Bad proto ID: 0 rindex changed for same flow: 0 Remove flow on an empty proto: 0 Remove non-exist flow: 0 Read ASIC failed: 0 Failed tries write flow params: 0 Failed change flow params: 0 Run out of policers: 0 Run out of msg blocks: 0 Run out of mod flow blocks: 0 SCFD stats for PFE 0 Global configuration violation report rate: 100 flow report rate: 100 Flow counters read from LU current suspicious flows: 0 current culprit flows: 2 discovered suspicious flows: 123806 discovered culprit flows: 23360 deleted culprit flows: 23358 false positives: 100446 hash insertion errors: 149071671 hash deletion errors: 0 max flow tbl scan time(ms): 0 debug values: 0 Flow reports received through PUNT policer violation: 104836 flow found: 23360 flow timeout: 0 flow return to normal: 117 flow cleared: 23241 unknown reports: 0 bad flow type: 0 Violation indication policer stats Passed indications: 21346 Dropped indications: 84304888

Wow! Only two flows are left, and they've changed from sub to ifl; the keys are all zero, no src/dst IP match, no src/dst port match! But my iBGP seems to work, as my attack traffic (simulating a BGP customer) is coming in on a separate IFL from my iBGP. I guess it's a lot better than nothing, and a lot better than overflowing the 4k policers.

Yet, waiting a bit longer, I see that my INET/INET6 iBGP is happy, but my VPNv4/VPNv6 iBGP is unhappy; both are on the same IFL, but a different one than the attack traffic. Turns out the default limit for the sub and ifl levels is 10pps, and for ifd it is 20kpps. My iBGP was 5 sessions and was periodically exceeding the 10pps IFL limit, and by default, once the flow limit is hit, you don't just drop the exceeding traffic (flow-control-mode police), you drop everything in the offending flow (flow-control-mode drop), which I think is a great default. But is sub==ifl==10pps a great default? I don't think so. Anyway, my iBGP IFL was hitting 10pps and dropping everything for the duration of the violation. I'm not sure why it affected the 3 VPNv4/VPNv6 sessions but not INET/INET6 (same hold-time and keepalive), but it doesn't matter, likely a misconfig on my part. Once I added set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 1000, VPNv4/VPNv6 stabilized.

I noticed that for some reason I could no longer make the platform automatically switch from sub detection to ifl detection during a DoS. This meant my configuration was broken again, as the subscriber flows congested my HW's 4k flow slots. I'm not sure why this is or how it should work. I ended up disabling 'sub' level detection with set system ddos-protection protocols bgp aggregate flow-level-detection subscriber off. I guess it would be nice to keep it on, if we could reliably and rapidly switch from sub=>ifl while the policers are congested, but right now it seems that if there is a danger of seeing 4k of sub-level flows, it might be safest to just ignore them and protect only at the ifl level. My full config is now:

{master}[edit system ddos-protection protocols bgp]
fisakytt@test13nqa1-re0.dk# show
aggregate {
    bandwidth 8000;
    flow-level-bandwidth {
        logical-interface 100;
    }
    flow-level-detection {
        subscriber off;
    }
}

It seems to consistently protect my BGP from the attacks I'm interested in. That is quite a bit of config if you're doing it for 200 DDoS-policer protocols (after researching, for each relevant one, what you even should do). It might be nice if Juniper had a concept of 'protected flows' which always have priority: flows which have a manual hint in configuration. In Cisco LPTS, when you configure BGP, you get a BGP policer for that configured session; JNPR could easily do the same, configure those persistently, and in case of congestion never break those, only break policers among the dynamically detected flows. I'd also really want to figure out how the sub=>ifl fallback should work, and why it works sometimes and sometimes not.

2014-03-16

Tourist trip to MX fabric

Tourist, because it's mostly original research, so the quality may be dubious.

You can infer a lot about the fabric by looking at the 'show hsl2 ...' commands. Let's start.

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0) serdes
MQCHIP(0) serdes table :
MQCHIP(0)-Avago 65NM-0  [0xf300000]: 24 links   0 - 23
MQCHIP(0)-Avago 65NM-1  [0xf304000]: 24 links  24 - 47
MQCHIP(0)-Avago 65NM-2  [0xf308000]:  8 links  48 - 55
MQCHIP(0)-Avago 65NM-3  [0xf309000]:  8 links  56 - 63
MQCHIP(0)-Avago 65NM-4  [0xf30a000]:  8 links  64 - 71
MQCHIP(0)-Avago 65NM-5  [0xf30b000]:  8 links  72 - 79
MQCHIP(0)-Avago 65NM-6  [0xf30c000]:  8 links  80 - 87
MQCHIP(0)-Avago 65NM-7  [0xf30d000]:  8 links  88 - 95
MQCHIP(0)-Avago 65NM-8  [0xf30e000]:  8 links  96 - 103
MQCHIP(0)-Avago 65NM-9  [0xf30f000]:  8 links 104 - 111
MQCHIP(0)-Avago 65NM-10 [0xf310000]:  8 links 112 - 119
MQCHIP(0)-Avago 65NM-11 [0xf311000]:  8 links 120 - 127
MQCHIP(0)-Avago 65NM-12 [0xf312000]:  8 links 128 - 135
MQCHIP(0)-Avago 65NM-13 [0xf313000]:  8 links 136 - 143
MQCHIP(0)-Avago 65NM-14 [0xf318000]:  2 links 144 - 145
MQCHIP(0)-Avago 65NM-15 [0xf31a000]:  2 links 146 - 147

Avago is a well-known manufacturer of SerDes (SERializer/DESerializer); 65NM probably refers to Avago's 65nm lithography line of products. The SerDes presented here are unidirectional. That is still quite a large number of SerDes links. They are mapped/bundled into channels, and by looking at those channels we can get a bit more of a clue about them:

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0) MQCHIP(0)(16) 74 channels 16 serdes 148 links chip_flag 0x9 Flag: log_poll_error log_ioerr rd_fn 0x40760d64 wr_fn 0x40760d10 cookie 0x4282bf28 disaster_fn 0x0 MQCHIP(0) channel table : MQCHIP(0)-chan-rx-0 0x7200000 <= Table QXCHIP(0)-chan-tx-1 MQCHIP(0)-Avago 65NM-link-146 MQCHIP(0)-chan-tx-1 0x7201000 => Table QXCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-144 MQCHIP(0)-chan-rx-2 0x7202000 <= Table LUCHIP(0)-chan-tx-2 MQCHIP(0)-Avago 65NM-link-24 MQCHIP(0)-chan-rx-3 0x7202000 <= Table LUCHIP(0)-chan-tx-3 MQCHIP(0)-Avago 65NM-link-36 MQCHIP(0)-chan-tx-4 0x7203000 => Table LUCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-0 MQCHIP(0)-chan-tx-5 0x7203000 => Table LUCHIP(0)-chan-rx-1 MQCHIP(0)-Avago 65NM-link-12 MQCHIP(0)-chan-rx-6 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-80 MQCHIP(0)-chan-rx-7 0x7204000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-80 MQCHIP(0)-chan-rx-8 0x7204000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-84 MQCHIP(0)-chan-rx-9 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-80 MQCHIP(0)-chan-rx-10 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-82 MQCHIP(0)-chan-rx-11 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-84 MQCHIP(0)-chan-rx-12 0x7204000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-86 MQCHIP(0)-chan-rx-13 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-88 MQCHIP(0)-chan-rx-14 0x7205000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-88 MQCHIP(0)-chan-rx-15 0x7205000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-92 MQCHIP(0)-chan-rx-16 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-88 MQCHIP(0)-chan-rx-17 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-90 MQCHIP(0)-chan-rx-18 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-92 MQCHIP(0)-chan-rx-19 0x7205000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-94 MQCHIP(0)-chan-rx-20 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-96 MQCHIP(0)-chan-rx-21 0x7206000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-96 MQCHIP(0)-chan-rx-22 0x7206000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-100 MQCHIP(0)-chan-rx-23 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-96 MQCHIP(0)-chan-rx-24 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-98 MQCHIP(0)-chan-rx-25 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-100 MQCHIP(0)-chan-rx-26 0x7206000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-102 MQCHIP(0)-chan-rx-27 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-104 MQCHIP(0)-chan-rx-28 0x7207000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-104 MQCHIP(0)-chan-rx-29 0x7207000 <= Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-108 MQCHIP(0)-chan-rx-30 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-104 MQCHIP(0)-chan-rx-31 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-106 MQCHIP(0)-chan-rx-32 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-108 MQCHIP(0)-chan-rx-33 0x7207000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-110 MQCHIP(0)-chan-tx-34 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-48 MQCHIP(0)-chan-tx-35 0x7208000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-48 MQCHIP(0)-chan-tx-36 0x7208000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-52 MQCHIP(0)-chan-tx-37 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-48 
MQCHIP(0)-chan-tx-38 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-50 MQCHIP(0)-chan-tx-39 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-52 MQCHIP(0)-chan-tx-40 0x7208000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-54 MQCHIP(0)-chan-tx-41 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-56 MQCHIP(0)-chan-tx-42 0x7209000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-56 MQCHIP(0)-chan-tx-43 0x7209000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-60 MQCHIP(0)-chan-tx-44 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-56 MQCHIP(0)-chan-tx-45 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-58 MQCHIP(0)-chan-tx-46 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-60 MQCHIP(0)-chan-tx-47 0x7209000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-62 MQCHIP(0)-chan-tx-48 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-64 MQCHIP(0)-chan-tx-49 0x720a000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-64 MQCHIP(0)-chan-tx-50 0x720a000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-68 MQCHIP(0)-chan-tx-51 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-64 MQCHIP(0)-chan-tx-52 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-66 MQCHIP(0)-chan-tx-53 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-68 MQCHIP(0)-chan-tx-54 0x720a000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-70 MQCHIP(0)-chan-tx-55 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-72 MQCHIP(0)-chan-tx-56 0x720b000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-72 MQCHIP(0)-chan-tx-57 0x720b000 => Primitive remote/unknown MQCHIP(0)-Avago 65NM-link-76 MQCHIP(0)-chan-tx-58 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-72 MQCHIP(0)-chan-tx-59 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-74 MQCHIP(0)-chan-tx-60 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-76 MQCHIP(0)-chan-tx-61 0x720b000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-78 MQCHIP(0)-chan-rx-62 0x720c000 <= Channel remote/unknown MQCHIP(0)-Avago 65NM-link-128 MQCHIP(0)-chan-rx-63 0x720c000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-128 MQCHIP(0)-chan-rx-64 0x720c000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-132 MQCHIP(0)-chan-rx-65 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-136 MQCHIP(0)-chan-rx-66 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-136 MQCHIP(0)-chan-rx-67 0x720d000 <= Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-140 MQCHIP(0)-chan-tx-68 0x720e000 => Channel remote/unknown MQCHIP(0)-Avago 65NM-link-112 MQCHIP(0)-chan-tx-69 0x720e000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-112 MQCHIP(0)-chan-tx-70 0x720e000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-116 MQCHIP(0)-chan-tx-71 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-120 MQCHIP(0)-chan-tx-72 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-120 MQCHIP(0)-chan-tx-73 0x720f000 => Invalid remote/unknown MQCHIP(0)-Avago 65NM-link-124

That's a lot of information, but we can ignore the 'Invalid' entries; then we're left with 'Primitive' and 'Table'. The 'Table' ones clearly mark where they connect to, in this case the LU chip and the QX chip, so we can work on the assumption that 'Primitive' is used to connect to the fabric. We can further determine there are 8 such channels in each direction; two of the channels always sit next to each other numerically, with a lot of empty channels in between, which may prove useful. But let's take a closer look at one channel.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 7 MQCHIP(0)-chan-rx-7 : Up Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-80 Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 3309287443 CRC errors: 0 Cell last : 2328887065 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0 sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0 sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0 sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0 sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0 Last poll: last_int_status : 0x0 last_cell_received : 0x8acfff19 last_crc_errors[0] : 0x0 last_crc_errors[1] : 0x0 last_crc_errors[2] : 0x0 last_crc_errors[3] : 0x0 last_crc_errors[4] : 0x0 last_crc_errors[5] : 0x0 last_crc_errors[6] : 0x0 last_crc_errors[7] : 0x0 last_bcr_overflow : 0x0 last_bcr_overflow : 0x0 last_bcr_multiple_psr_push : 0x0 last_sub_idle_override : 0x0 last_psr_overflow : 0x0 last_psr_illegal_8b10b : 0x0 last_psr_8b10b_valids : 0x0 last_psr_illegal_64b66b : 0x0 last_psr_64b66b_framing : 0x0 last_psr_clock_correction : 0x0 last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#

What is interesting here is 'Sub channel 1 of 2 with 4 links'. Sub-channel 1 of 2? What is the 2nd? And 4 links is interesting: so a channel has 4 links. Let's check the next one. It's also interesting to notice we have a cell counter and there is 64b/66b encoding overhead.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 8 MQCHIP(0)-chan-rx-8 : Up Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-84 Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 3309287443 CRC errors: 0 Cell last : 2328887065 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) Check base channel for rest of the sub-channel stats. last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#

Ok, 7 and 8 are next to each other, since they are logically one channel made out of two sub-channels. So the 8 channels we saw per direction really are 4 logical channels with two sub-channels each. Looking at the other channels, everything looks the same, until we hit 28 and 29, i.e. the 4th channel:

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 28 MQCHIP(0)-chan-rx-28 : Down Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-104 Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 0 CRC errors: 0 Cell last : 0 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0 sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0 sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0 sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0 sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0 Last poll: last_int_status : 0x0 last_cell_received : 0x0 last_crc_errors[0] : 0x0 last_crc_errors[1] : 0x0 last_crc_errors[2] : 0x0 last_crc_errors[3] : 0x0 last_crc_errors[4] : 0x0 last_crc_errors[5] : 0x0 last_crc_errors[6] : 0x0 last_crc_errors[7] : 0x0 last_bcr_overflow : 0x0 last_bcr_overflow : 0x0 last_bcr_multiple_psr_push : 0x0 last_sub_idle_override : 0x0 last_psr_overflow : 0x0 last_psr_illegal_8b10b : 0x0 last_psr_8b10b_valids : 0x0 last_psr_illegal_64b66b : 0x0 last_psr_64b66b_framing : 0x0 last_psr_clock_correction : 0x0 last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 29 MQCHIP(0)-chan-rx-29 : Down Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-108 Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 0 CRC errors: 0 Cell last : 0 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) Check base channel for rest of the sub-channel stats. 
last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#

It's down, and no cells have been transferred. Isn't this convenient: we have 4 channels per direction, but one of them is down, so we're left with 3 channels per direction, and this being an MX960, that matches our SCB count. But why two sub-channels per SCB? Because the fabric actually has two Avago chips; you need to connect to both, but they are presented as a single channel to higher-level use. Why do we have four channels, if we only use 3 of them? Well, if we look at this same stat on an MX240/MX480, we actually see the fourth channel up and pushing cells! Clearly the MX960 uses 8+8+8 links per MQ for fabric, and the MX240/MX480 uses 8+8+8+8 per MQ, 8 more! But the MX240/MX480 has only two SCBs, so clearly it must be arranged as (8+8)+(8+8).

Now, we know an MX240/MX480 SCB can handle MPC2 (2*MQ, i.e. 2*40G) fabric connectivity redundantly, i.e. with a single SCB up, so we know (8+8) links (single SCB up) must be enough to run 40G; 16 SerDes must be enough to carry a single MQ's 40G of fabric capacity. This gives us a lower bound on SerDes capacity: 40G/16 SerDes = 2.5Gbps. But wait, we had 64b/66b encoding, and we know the fabric uses cells which have a header, which is overhead, so 2.5Gbps can't be possible; it must be more. With a bit of googling we find that Avago makes 3.125Gbps SerDes, which makes sense. So the raw capacity available to the MQ is 16*3.125 = 50Gbps, redundantly! Double that in the MX240/MX480 when both SCBs are up, so 100Gbps of fabric capacity, which seems like a LOT! But multicast replication in MPC2 is 'btree', so each linecard needs to send two copies of multicast over the fabric to replicate it, so the 100Gbps is halved back to 50Gbps in the multicast scenario, and cell overhead can be devastating (think of a small packet using two cells, but just 1 byte of the 2nd cell: efficiency is ~50%!).
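
A back-of-envelope sketch of the arithmetic above, in Ruby. The 3.125Gbps SerDes rate and link counts are from the reasoning above; the 64B cell payload is purely my illustrative assumption (Trio's real cell format isn't public), just to show how a packet barely spilling into a second cell wastes roughly half the capacity:

SERDES_GBPS = 3.125
puts 16 * SERDES_GBPS    # => 50.0, Gbps per MQ with one SCB pair (8+8)
puts 32 * SERDES_GBPS    # => 100.0, Gbps with both MX240/MX480 SCBs up

CELL_PAYLOAD = 64        # assumed payload bytes per fabric cell
pkt   = CELL_PAYLOAD + 1 # packet spilling 1 byte into a 2nd cell
cells = (pkt.to_f / CELL_PAYLOAD).ceil
puts pkt.to_f / (cells * CELL_PAYLOAD) # => ~0.51, fabric efficiency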

This also means that the MX960, with all SCBs up, has just 75Gbps (24*3.125) of capacity per MQ, plenty for unicast, but not enough for worst-case multicast btree replication. Still, SCB and MPC2 seem like a good match for almost any realistic/practical scenario. No one is running 100% multicast, so there is plenty of fabric capacity for real life in the MX960 and MX240/MX480: with all SCBs up, or with one SCB down, it's still full wire-rate fabric capacity for unicast.

But if we actually try to send traffic from a DPCE to an MPC2 in the lab, we'll see MPC2 MQ=>DPCE is fine: a single MQ can congest the 40G DPCE card. But DPCE=>MQ in the MX960 gets just 26Gbps out of the 40Gbps needed, even when all SCBs are up. Wait, what? Thought this was murica! We had 75Gbps of MX960 capacity, and we can do just 26Gbps? What is up with that! Furthermore, if we pull one SCB out, the lab test is 100% clean: DPCE=>MQ gets 40Gbps out of the 40Gbps needed, full linerate. My head hurts!

Let's ponder a bit. We know the DPCE does not connect to the redundant SCB, so it's always just 2 SCBs in an MX960, even though the MQ in MPC2 connects to 3 SCBs. So if we divide the 40Gbps the MQ needs nominally by 3 SCBs, it needs 13.33Gbps per SCB, which multiplied by 2 is 26.66Gbps. Looking at fabric requests and grants, we see that the MQ is not giving grants to the DPCE (ichip): we have fabric capacity, but the MQ is not allowing the DPCE to send. Clearly the MQ does not want to deal with oversubscription, for one reason or another, and it only hands out fabric grants at 13.33Gbps per SCB, so that in no case would the aggregate amount of fabric traffic be over 40Gbps: no oversubscription. But this strategy breaks down when the MQ has 3 connections and the DPCE only 2, and the issue is remedied when the MQ downgrades to 2 connections, as now the MQ is giving out fabric grants at 40/2 => 20Gbps per SCB, instead of just 13.33Gbps.
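
The same division as a sketch, showing why dropping the MQ to 2 SCBs fixes DPCE=>MQ:

MQ_GBPS = 40.0           # fabric capacity one MQ wants to protect
[3, 2].each do |scbs|    # MQ connected to 3 vs 2 SCBs
  per_scb = MQ_GBPS / scbs
  puts format('%d SCBs: %.2f Gbps grants per SCB, DPCE (on 2 SCBs) gets %.2f Gbps',
              scbs, per_scb, 2 * per_scb)
end
# 3 SCBs: 13.33 Gbps grants per SCB, DPCE (on 2 SCBs) gets 26.67 Gbps
# 2 SCBs: 20.00 Gbps grants per SCB, DPCE (on 2 SCBs) gets 40.00 Gbps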

Why doesn't the MQ want to deal with oversubscription? Maybe it wouldn't have sufficient information at ingress to drop intelligently enough to keep QoS contracts? Maybe this behavior is there to avoid software complexity? Turns out you don't even have to pull out the SCB to force the MQ to connect to just two SCBs: you can use set chassis fabric redundancy-mode redundant, and MPC2<=>DPCE is wire rate, with the 3rd SCB in the MX960 unused, as it's not needed for MPC2. Remember, we have 50Gbps per MQ without it and 75Gbps with it, so the only reason we might even need it is if we do multicast.

Interestingly, if MPC2 were 120Gbps instead of 80Gbps, i.e. 60Gbps of fabric capacity per MQ instead of 40Gbps, this issue shouldn't exist, as you'd be giving out fabric grants at 60/3 => 20Gbps per SCB, so even in non-redundant/capacity mode the DPCE would get the 40Gbps it needs. The other way to fix this is of course to upgrade the SerDes from 3.125Gbps to 6.25Gbps, and it turns out Avago has such chips. This means the MQ can get 6.25*(8+8), i.e. 100Gbps, with just two SCBEs: enough to do btree replication, and as it's just 2 SCBs, you're still giving the DPCE its full 40Gbps of fabric grants. So clearly SCBE is the better fix, but had the MQ originally been 60Gbps instead of 40Gbps, I think that would have been much better as well.

Can this issue happen again in the future? Can't see why not. Imagine an SCBE system with the far-end linecard being 2*80Gbps (enough to connect to just 2 SCBEs) and the near-end linecard being a 2*100Gbps MQ, with similar behavior of curbing fabric grants to 100Gbps per MQ: you'd be giving out 100/3 = 33.33Gbps per SCB, and the far end would be getting 66.66Gbps of fabric grants out of the 80Gbps needed. But looking at DPCE and MPC, I don't think the problem will come again: looking at the fabric streams in the MQ and DPCE (ichip), it looks like traffic coming from an MQ can be classified in the far-end MQ and dropped while retaining the QoS contract. The current DPCE=>MQ case is probably an unfortunate combination of mismatched bandwidths and the DPCE having a single stupid fabric stream, so the far end can't easily determine what to drop in case of congestion; the best policy was just to make sure congestion never happens at ingress.

We can check the fabric streams and see that there is now priority in the MQ:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream all Input Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST) ------------------------------------------------------------- 0.. 63: ----FF-------------------------- -------------------------------- 64.. 127: -------------------------------- -------------------------------- 128.. 191: ----FF-------------------------- -------------------------------- 192.. 255: -------------------------------- -------------------------------- 256.. 319: -------------------------------- -------------------------------- 320.. 383: -------------------------------- -------------------------------- 384.. 447: -------------------------------- -------------------------------- 448.. 511: -------------------------------- -------------------------------- 512.. 575: -------------------------------- -------------------------------- 576.. 639: -------------------------------- -------------------------------- 640.. 703: -------------------------------- -------------------------------- 704.. 767: -------------------------------- -------------------------------- 768.. 831: -------------------------------- -------------------------------- 832.. 895: -------------------------------- -------------------------------- 896.. 959: -------------------------------- -------------------------------- 960..1023: -------------------------------- -------------------------------- 1024..1087: DWWW---------------------------- -------------------------------- 1088..1151: -----------------------------WWL L------------------------------H Output Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST) ------------------------------------------------------------- 0.. 63: ----FF-------------------------- -------------------------------- 64.. 127: -------------------------------- -------------------------------- 128.. 191: ----FF-------------------------- -------------------------------- 192.. 255: -------------------------------- -------------------------------- 256.. 319: -------------------------------- -------------------------------- 320.. 383: -------------------------------- -------------------------------- 384.. 447: -------------------------------- -------------------------------- 448.. 511: -------------------------------- -------------------------------- 512.. 575: -------------------------------- -------------------------------- 576.. 639: -------------------------------- -------------------------------- 640.. 703: -------------------------------- -------------------------------- 704.. 767: -------------------------------- -------------------------------- 768.. 831: -------------------------------- -------------------------------- 832.. 895: -------------------------------- -------------------------------- 896.. 959: -------------------------------- -------------------------------- 960..1023: -------------------------------- -------------------------------- 1024..1087: WW------------------------------ -------------------------------- 1088..1151: -----------------------------WWL L------------------------------H NPC1(test2nxb10-re0.dk vty)#

Notice there are four fabric streams (2 fabric streams per NPU in the system): 4,5 and 132,133. It's worth another post, but there is an offset of 128 between low-prio and high-prio. We can quickly try to prove this:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 4 Input Stream 4 ----------------- attached : 0 enabled : 1 pic slot : 0 mac mode : 0 port : 0 conn : 0 tclass : 0 hi prio : 0 weight : 0 Output Stream 4 ------------------ attached : 0 enabled : 1 pic slot : 0 mac mode : 0 wan if : 0 port : 0 conn : 0 weight : 0 sched : 1 MQ1 l1 node : 0 queue : 4 NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 132 Input Stream 132 ----------------- attached : 0 enabled : 1 pic slot : 0 mac mode : 0 port : 0 conn : 0 tclass : 0 hi prio : 0 weight : 0 Output Stream 132 ------------------ attached : 0 enabled : 1 pic slot : 0 mac mode : 0 wan if : 0 port : 0 conn : 0 weight : 0 sched : 1 MQ1 l1 node : 0 queue : 132 NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 4 Q node 4: allocated : true parent node : 0 guarantee prio : 3 GL excess prio : 2 EL rate enable : 0 m rate : 0 bps m credits : 0 guarantee enable : 1 g rate : 0 bps g credits : 0 adjust cmode : 0 adjust value : 0 weight : 5 mnc in gp enable : 1 ----------------------- guarantee credits : 0 excess credits added: 0 excess credits : 0 max credits : 0 NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 132 Q node 132: allocated : true parent node : 2 guarantee prio : 1 GH excess prio : 2 EL rate enable : 0 m rate : 0 bps m credits : 0 guarantee enable : 1 g rate : 0 bps g credits : 0 adjust cmode : 0 adjust value : 0 weight : 95 mnc in gp enable : 1 ----------------------- guarantee credits : 65408 excess credits added: 1 excess credits : 0 max credits : 0 NPC1(test2nxb10-re0.dk vty)#

So in-contract traffic going to the fabric in stream 4 gets 'GL' (guaranteed-low), and in-contract traffic going to the fabric in stream 132 gets 'GH' (guaranteed-high). So provided the far-end MQ has this stream information, it would be very easy for it to know what to drop under oversubscription: drop 4, not 132. The IChip does not have this, so maybe that's the reason why it was simply guaranteed that there is no oversubscription from the fabric?

It's important to realize fabric bandwidth is just one issue to think about; the MQ internally has less than the needed 80Gbps (40G fabric, 40G wan) of memory bandwidth. The actual number depends on cell alignment etc., but it can be 80Gbps in a good climate, maybe 75Gbps in an average climate, or a lot less in a manufactured climate.

Some bullet points for TL;DR people

  1. MQ has 32 SerDes towards fabric
  2. Each SerDes is 3.125Gbps gross (deduct 64b/66b encoding, cell overhead and alignment)
  3. MX960 allocates SerDes as 8+8+8 to 3 SCB
  4. MX240/MX480 allocates SerDes as 16+16 to 2 SCB
  5. SCBE upgrades the SerDes to 6.25Gbps
  6. SCBE2 upgrades the SerDes to 16.5Gbps (matches what Avago offers, but not very sure)

2014-03-15

Capture your fancy, part two, Trio

Like with the 7600/PFC3, it is possible to capture transit traffic on Juniper Trio (MPC, MX80, MX104, FPC5 etc). First decide what you know about the packet and convert that data to hex; the pattern can be pretty much anywhere in the first 320B or so of the packet.

[ytti@ytti.fi ~]% pry
[1] pry(main)> '194.100.7.227'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "c26407e3"
[2] pry(main)> '91.198.120.24'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "5bc67818"
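
The stdlib IPAddr can do the same conversion, and handles IPv6 too, should you want to match that instead:

[3] pry(main)> require 'ipaddr'
=> true
[4] pry(main)> "%08x" % IPAddr.new('194.100.7.227').to_i
=> "c26407e3"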

I'm boringly using IPv4 addresses, but I could have used anything. Unlike on PFC3, you do not need to tell the location in the packet where the pattern must occur; you just give the pattern, and any packet having that pattern anywhere triggers. Let's try it:

fisakytt@mec-pe1-re0.hel.fi> start shell pfe network tfeb0 TFEB platform (1000Mhz MPC 8544 processor, 1024MB memory, 512KB flash) TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem enable TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem capture 0x3 5bc67818c26407e3 TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem dump Received 116 byte parcel: Dispatch cookie: 0x0074000000000000 0x00 0x08 0x80 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xaa 0x40 0x00 0x3f 0x01 0x1b 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x8b 0xb8 0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca 0x95 0x0c 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 Sent 111 byte parcel: 0x08 0xbf 0xe0 0x11 0x71 0x00 0x00 0x60 0x80 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00 0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81 0xaa 0x40 0x00 0x3e 0x01 0x1c 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x8b 0xb8 0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca 0x95 0x0c 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 <...> Received 116 byte parcel: Dispatch cookie: 0x0074000000000000 0x00 0x09 0x00 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xcb 0x40 0x00 0x3f 0x01 0x1b 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x47 0xbf 0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16 0x6e 0x03 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 Sent 111 byte parcel: 0x08 0xbf 0xe0 0x12 0x71 0x00 0x00 0x60 0x10 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00 0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81 0xcb 0x40 0x00 0x3e 0x01 0x1c 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x47 0xbf 0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16 0x6e 0x03 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem disable TAZ-TBB-0(mec-pe1-re0.hel.fi vty)#

So the format is test jnh (mq_where_interface_is) packet-via-dmem capture (16_bit_type_mask_in_hex) (up-to-8-bytes-pattern) (optional offset from start of the packet). You should use mask 0x3; in my test, bits 1 and 2 are production traffic, bits 10 and 15 are some crap, and the others are just types I don't seem to be using on my boxes.

Unlike on PFC3, we capture many packets, and the list is constantly updated until you stop the capture. This is very nice when you're not exactly sure what you're looking for and you know your trigger will also match packets you don't care about. We also easily see both the received and the sent packet, so we can be sure the traffic is arriving at the box; from the MAC addresses we can determine how and where, and the sent parcel gives us a high degree of confidence that the packet is leaving the box.

I don't know what the dispatch cookie means, nor what the first 6 bytes of the received parcel or the first 13 bytes of the sent parcel are. I'm guessing that is some internal metadata; quickly trying to check for the stream ID in MQ and IX, or the IFL and IFD numbers, I can't find a match for them. But I'm mostly interested in just seeing that the packet came in with the expected DMAC+SMAC and went out with the expected DMAC+SMAC. It would be very useful to be able to at least somehow extract the ingress and egress port information (MQ, IX, port, IFL, IFD, anything).
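
For eyeballing the MACs, a small Ruby sketch that strips those unknown leading bytes (6 on receive, per the guess above) and prints DMAC/SMAC/ethertype. The helper name and the offsets are my own guesswork, not a documented parcel format:

def parcel_macs(dump, skip_bytes)
  eth = dump.scan(/0x(\h{2})/).flatten.drop(skip_bytes)
  { dmac: eth[0, 6].join(':'), smac: eth[6, 6].join(':'), etype: eth[12, 2].join }
end

rcvd = '0x00 0x08 0x80 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 ' \
       '0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47'
parcel_macs(rcvd, 6)
# => {:dmac=>"5c:5e:ab:0b:6e:60", :smac=>"b0:a8:6e:7c:60:52", :etype=>"8847"}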

If you're capturing on a box with multiple MQs and a fabric, you won't see the sent parcel carrying the rewrite information (but you'll see more metadata); you need to jump onto the egress MQ to catch the rewrite information. If you are having trouble reading the hexdump, you can always use 'text2pcap' from wireshark to turn it into a PCAP file and browse it in wireshark. A lot more playing around is needed to understand the parcel types, cookie, metadata and use in multiple-MQ scenarios.
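
text2pcap wants offset-prefixed hexdump lines, not the '0xNN 0xNN' format above; a throwaway Ruby sketch to massage a parcel dump into shape (again dropping the 6 metadata bytes, which is my assumption):

bytes = ARGF.read.scan(/0x(\h{2})/).flatten.drop(6)
bytes.each_slice(16).with_index do |slice, i|
  puts format('%06x %s', i * 16, slice.join(' '))
end
# ruby parcel2hex.rb parcel.txt > dump.txt && text2pcap dump.txt dump.pcap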

2014-03-01

Capture your fancy, part one, PFC3

It's often incredibly useful to be able to capture transit traffic; it's a quick way to prove that you're actually receiving some frames and, with any luck, get a good idea of how and where you are sending them. It's unfortunately common, especially on 7600/6500 PFC3, to have a bug where packets are not going where the software FIB suggests they are. Luckily there is quite good tooling to inspect what is really happening. So we're taking a peek at 'ELAM'.

We have traffic coming in unlabeled to the 7600 and going out labeled. Let's see how to capture it:

psl2-pe2.hel.fi#show platform capture elam asic superman slot 5 psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 help SEQ_NUM [5] QOS [3] QOS_TYPE [1] TYPE [4] STATUS_BPDU [1] IPO [1] NO_ESTBLS [1] RBH [3] CR [1] TRUSTED [1] NOTIFY_IL [1] NOTIFY_NL [1] DISABLE_NL [1] DISABLE_IL [1] DONT_FWD [1] INDEX_DIRECT [1] DONT_LEARN [1] COND_LEARN [1] BUNDLE_BYPASS [1] QOS_TIC [1] INBAND [1] IGNORE_QOSO [1] IGNORE_QOSI [1] IGNORE_ACLO [1] IGNORE_ACLI [1] PORT_QOS [1] CACHE_CNTRL [2] VLAN [12] SRC_FLOOD [1] SRC_INDEX [19] LEN [16] FORMAT [2] MPLS_EXP [3] REC [1] NO_STATS [1] VPN_INDEX [10] PACKET_TYPE [3] L3_PROTOCOL [4] L3_PT [8] MPLS_TTL [8] SRC_XTAG [4] DEST_XTAG [4] FF [1] MN [1] RF [1] SC [1] CARD_TYPE [4] DMAC [48] SMAC [48] IPVER [1] IP_DF [1] IP_MF [1] IP_HDR_LEN [4] IP_TOS [8] IP_LEN [16] IP_HDR_VALID [1] IP_CHKSUM_VALID [1] IP_L4HDR_VALID [1] IP_OFFSET [13] IP_TTL [8] IP_CHKSUM [16] IP_SA [32] IP_DA [32] IP_DATA [120] TCP_SPORT [16] TCP_DPORT [16] TCP_SEQ [32] TCP_ACK [32] TCP_CODE [6] TCP_CHKSUM [16] UDP_SPORT [16] UDP_DPORT [16] UDP_DATA [88] IGMP_VERS [4] IGMP_TYPE [4] IGMP_DATA [112] ICMP_TYPE [8] ICMP_CODE [8] ICMP_DATA [104] CRC [16] EXTRA_TYPE [4] THR_PKT_HI_NF [14] THR_BYTE_HI_NF [22] THR_PKT_LO_NF [14] THR_BYTE_LO_NF [22] PKT_CNT_NF [14] BYTE_CNT_NF [22] GEMINI_NF_SEG [3] FINRST [1] STATS [1] DUMMY [1] CRC_16 [16] well-know values for field TYPE ETHERNET 0x0 TOKENRING 0x1 FDDI 0x2 ATM 0x3 VOICE 0x4 VIDEO 0x5 INBAND 0x7 well-know values for field FORMAT IP 0x0 IPX 0x1 INBAND 0x2 OTHERS 0x3 well-know values for field PACKET_TYPE ETHERNET 0x0 RAW_IPX 0x1 802_C1 0x2 802_C2 0x3 802_SNAP 0x4 ENCAP_INVALID 0x7 well-know values for field L3_PROTOCOL IPV4 0x0 IPX 0x1 APPLETALK 0x2 DECNET 0x3 VINES 0x4 XNS 0x5 IPV6 0x6 INVALID 0xF well-know values for field L3_PT ICMP 0x1 IGMP 0x2 IP 0x4 TCP 0x6 UDP 0x11 well-know values for field CACHE_CNTRL NORMAL 0x0 RESERVED 0x1 GEMINI_MISS 0x2 DBUS_EXTRA_CICLE 0x3 well-know values for field IPVER IPV4 0x0 IPV6 0x1 psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if IP_SA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam start psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24 elam capture completed psl2-pe2.hel.fi#show platform capture elam data DBUS data: SEQ_NUM [5] = 0x9 QOS [3] = 1 QOS_TYPE [1] = 0 TYPE [4] = 0 [ETHERNET] STATUS_BPDU [1] = 0 IPO [1] = 1 NO_ESTBLS [1] = 0 RBH [3] = b000 CR [1] = 0 TRUSTED [1] = 0 NOTIFY_IL [1] = 0 NOTIFY_NL [1] = 0 DISABLE_NL [1] = 0 DISABLE_IL [1] = 0 DONT_FWD [1] = 0 INDEX_DIRECT [1] = 0 DONT_LEARN [1] = 0 COND_LEARN [1] = 0 BUNDLE_BYPASS [1] = 0 QOS_TIC [1] = 0 INBAND [1] = 0 IGNORE_QOSO [1] = 0 IGNORE_QOSI [1] = 0 IGNORE_ACLO [1] = 0 IGNORE_ACLI [1] = 0 PORT_QOS [1] = 0 CACHE_CNTRL [2] = 0 [NORMAL] VLAN [12] = 3510 SRC_FLOOD [1] = 0 SRC_INDEX [19] = 0x7F LEN [16] = 102 FORMAT [2] = 0 [IP] MPLS_EXP [3] = 0x0 REC [1] = 0 NO_STATS [1] = 0 VPN_INDEX [10] = 0x3 PACKET_TYPE [3] = 0 [ETHERNET] L3_PROTOCOL [4] = 0 [IPV4] L3_PT [8] = 1 [ICMP] MPLS_TTL [8] = 0 SRC_XTAG [4] = 0x8 DEST_XTAG [4] = 0x0 FF [1] = 0 MN [1] = 0 RF [1] = 0 SC [1] = 0 CARD_TYPE [4] = 0x0 DMAC = 0027.0dc9.8800 SMAC = 5c5e.abee.d320 IPVER [1] = 0 
[IPV4] IP_DF [1] = 1 IP_MF [1] = 0 IP_HDR_LEN [4] = 5 IP_TOS [8] = 0x0 IP_LEN [16] = 84 IP_HDR_VALID [1] = 1 IP_CHKSUM_VALID [1] = 1 IP_L4HDR_VALID [1] = 1 IP_OFFSET [13] = 0 IP_TTL [8] = 63 IP_CHKSUM [16] = 0x44BA IP_SA = 91.198.120.24 IP_DA = 194.100.7.227 ICMP_TYPE [8] = 0x8 ICMP_CODE [8] = 0x0 ICMP_DATA [104] 0000: 61 FD 10 EA 15 39 1B AD 11 53 00 78 0C "a....9...S.x." CRC [16] = 0x8310 RBUS data: SEQ_NUM [5] = 0x9 CCC [3] = b100 [L3_RW] CAP1 [1] = 0 CAP2 [1] = 0 QOS [3] = 0 EGRESS [1] = 0 DT [1] = 1 [GENERIC] TL [1] = 0 [B32] FLOOD [1] = 1 DEST_INDEX [19] = 0xC60 VLAN [12] = 3168 RBH [3] = b001 RDT [1] = 0 GENERIC [1] = 0 EXTRA_CICLE [1] = 0 FABRIC_PRIO [1] = 0 L2 [1] = 0 FCS1 [8] = 0x1 DELTA_LEN [8] = 4 REWRITE_INFO i0 - replace bytes from ofs 0 to ofs 13 with seq 'B0 A8 6E 7C 62 42 00 27 0D C9 88 00 88 47'. insert seq '49 58 01 FF' before ofs 14. FCS2 [8] = 0x0 psl2-pe2.hel.fi#

Hooray, great success. We can only capture 1 packet: the first packet hitting the trigger criteria is captured, then the capture is complete and nothing else is seen. We see 'DBUS' and 'RBUS' data. DBUS is where the lookup engine gets the lookup question from the linecard; the question is 64B (IPv4) or 96B (MPLS, IPv6). DBUS runs at 62.5MHz and carries 32B per cycle, so 62.5/2 = 31.25Mpps and 62.5/3 = 20.83Mpps are pretty much our absolute theoretical limits on how much work we can give the lookup engine; at that point the DBUS is congested and we cannot send it more queries. RBUS is where the lookup engine sends the lookup result back to the linecard; by looking at this data, the linecard knows where to send the frame and how to rewrite the packet headers.
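
The pps limits spelled out (all numbers from the text: 62.5MHz bus, 32B per cycle, 64B or 96B questions):

BUS_MHZ = 62.5
CYCLE_B = 32
{ 'IPv4' => 64, 'MPLS/IPv6' => 96 }.each do |kind, bytes|
  cycles = bytes / CYCLE_B
  puts format('%-9s %dB => %d cycles => %.2f Mpps', kind, bytes, cycles, BUS_MHZ / cycles)
end
# IPv4      64B => 2 cycles => 31.25 Mpps
# MPLS/IPv6 96B => 3 cycles => 20.83 Mpps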

In the DBUS data we can see the incoming VLAN (typically an internal VLAN), whether it was recirculated (CR), QOS, IP, PORT, ICMP, etc. But it's usually the RBUS which is more interesting. Is CCC 'L3_RW'? If it were something like 'L2_POLICER', the packet wasn't forwarded but punted, usually for 'mls rate-limit'. DEST_INDEX+VLAN tell us where the packet is going; here DEST_INDEX is just the VLAN, but it could be some special adjacency too, which we can check in 'remote command switch show platform hardware tycho register 0 1794 | XXXX'. Say it was a CoPP drop: the RBUS would likely show 7FFF. There are several register numbers which have the value 7FFF, and I don't know how to check which one was used, but luckily the registers are named and the names are usually very indicative.

There is an interesting trick we can do for packets hitting those special adjacencies. Say the DBUS gave us '7F0A' and we guessed it's really '0x03C7: RED_PKT_ERR_IDX = 0x00007F0A', i.e. RED_PKT_ERR_IDX. Now we can reprogram 0x3c7 to send out some physical port, instead of 0x7f0a, which is going to end up dropped. You might want to do this to figure out more closely what was wrong with some dropped packet, or maybe just to check what your CoPP policies are dropping. To reprogram the value, we need to know how ports are numbered in HW. Each port has an 'LTL Index', calculated as (SlotN-1)*64 + (PortN-1), so giga5/2 would be 4*64 + 1, or 257. Then we could poke the register: 'remote command switch show platform hardware tycho poke 3c7 257', and the 3c7 register would be diverted to giga5/2 instead of the original 0x7f0a special drop adjacency. You can repoke it to 0x7f0a when satisfied.
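
The LTL formula from above as a sketch (the helper name is mine):

def ltl_index(slot, port)
  (slot - 1) * 64 + (port - 1)
end

ltl_index(5, 2) # => 257, giga5/2, i.e. what we poke into register 0x3c7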

RBH is the hash result for load-balancing. And then, maybe most importantly, we have REWRITE_INFO; this tells us with high confidence that the packet was sent out, and we can see if it just got a new DMAC+SMAC, or if it also got a VLAN or MPLS or both. Here we can clearly see 0x8847, so it's going out with the label stack entry 0x495801FF: label value 0x49580, cos+bos 0x1 (so cos 0, bos on), ttl 0xff. We can confirm that the REWRITE_INFO matches what we expect:

psl2-pe2.hel.fi#sh mls cef lookup 194.100.7.227 detail
Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit
       D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel
       V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1
       RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select
Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix)
Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix)
M(121242 ): E | 1 FFF 0 0 0 0 255.255.255.240
V(121242 ): 8 | 1 0 0 0 0 0 194.100.7.224 (A:165858 ,P:1,D:0,m:0 ,B:0 )
M(121242 ): E | 1 FFF 0 0 0 255.255.255.240
V(121242 ): 9 | 1 0 0 0 0 194.100.7.224 (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 )
psl2-pe2.hel.fi#sh mls cef adjacency entry 165858 detail
Index: 165858  smac: 0027.0dc9.8800, dmac: b0a8.6e7c.6242
               mtu: 9196, vlan: 3168, dindex: 0x23BC4, l3rw_vld: 1
               format: MPLS, flags: 0x1000208518
               label0: 0, exp: 0, ovr: 0
               label1: 0, exp: 0, ovr: 0
               label2: 300416, exp: 0, ovr: 0
               op: PUSH_LABEL2
 packets: 1, bytes: 80
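
As a cross-check of the label math, decoding the 0x495801FF stack entry from REWRITE_INFO (standard MPLS shim layout: 20-bit label, 3-bit EXP, 1-bit BoS, 8-bit TTL):

entry = 0x495801FF
puts entry >> 12        # => 300416 (0x49580), matches 'label2: 300416'
puts (entry >> 9) & 0x7 # => 0, EXP/cos
puts (entry >> 8) & 0x1 # => 1, bottom of stack
puts entry & 0xFF       # => 255, TTL 0xff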

Same MAC addresses, egress VLAN and label value, so everything is fine here. Now what if we want to capture the packet coming in from the MPLS side? That is quite a bit harder, as we cannot use any syntactic-sugar trigger values. But ELAM will allow you to match on any bit and mask any bit, so it's just a matter of patience, or a little script, to have a proper capture filter for MPLS packets too. We should figure out as much as we can about how the packet is coming in, especially how many labels it has (as it affects our offset); the matching begins from the DMAC and continues as on the wire. We could match SMAC+DMAC+ETYPE+dont_care+SADDR+DADDR etc. Right now I'm lazy and I'll just match the MPLS etype and DADDR. We were seeing SADDR 91.198.120.24 going to the MPLS network; let's see if we can find DADDR 91.198.120.24 (5b c6 78 18) coming from the MPLS network. Here it is beneficial to have a cursory understanding of the ethernet + mpls + ip headers, especially your offset to the bits you're interested in. The ELAM capture is specified in groups of 4 bytes and can be compressed to 0 if you don't care about any of them.
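
Here's what that 'little script' could look like: a sketch (the helper name and interface are mine) that packs (byte offset, hex pattern) pairs into the 4-byte DATA words and ff-masks the trigger below uses. Offsets count from the DMAC, so the MPLS etype sits at byte 12 and, with one label, the IP DADDR at byte 34:

def elam_words(total_words, fields)
  data = Array.new(total_words, 0)
  mask = Array.new(total_words, 0)
  fields.each do |offset, hex|
    hex.scan(/../).each_with_index do |byte, i|
      word, pos = (offset + i).divmod(4)     # which word, which byte in it
      data[word] |= byte.to_i(16) << (24 - 8 * pos)
      mask[word] |= 0xff << (24 - 8 * pos)
    end
  end
  [data, mask].map { |ws| ws.map { |w| w.zero? ? '0' : format('0x%08x', w) }.join(' ') }
end

data, mask = elam_words(10, [[12, '8847'], [34, '5bc67818']])
# data => "0 0 0 0x88470000 0 0 0 0 0x00005bc6 0x78180000"
# mask => "0 0 0 0xffff0000 0 0 0 0 0x0000ffff 0xffff0000"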

psl2-pe2.hel.fi#show platform capture elam trigger dbus others if DATA=0 0 0 0x88470000 0 0 0 0 0x00005bc6 0x78180000 [ 0 0 0 0xffff000 0 0 0 0 0x0000ffff 0xffff0000 ] VLAN=3168 psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=OTHERS DATA=0 0 0 0X88470000 0 0 0 0 0X00005BC6 0X78180000 [ 0 0 0 0XFFFF0000 0 0 0 0 0X0000FFFF 0XFFFF0000 ] VLAN=3168 psl2-pe2.hel.fi#show platform capture elam data DBUS data: SEQ_NUM [5] = 0x12 QOS [3] = 0 QOS_TYPE [1] = 0 TYPE [4] = 0 [ETHERNET] STATUS_BPDU [1] = 0 IPO [1] = 1 NO_ESTBLS [1] = 0 RBH [3] = b000 CR [1] = 0 TRUSTED [1] = 1 NOTIFY_IL [1] = 0 NOTIFY_NL [1] = 0 DISABLE_NL [1] = 0 DISABLE_IL [1] = 0 DONT_FWD [1] = 0 INDEX_DIRECT [1] = 0 DONT_LEARN [1] = 0 COND_LEARN [1] = 0 BUNDLE_BYPASS [1] = 0 QOS_TIC [1] = 1 INBAND [1] = 0 IGNORE_QOSO [1] = 0 IGNORE_QOSI [1] = 0 IGNORE_ACLO [1] = 0 IGNORE_ACLI [1] = 0 PORT_QOS [1] = 0 CACHE_CNTRL [2] = 0 [NORMAL] VLAN [12] = 3168 SRC_FLOOD [1] = 0 SRC_INDEX [19] = 0xC1 LEN [16] = 106 FORMAT [2] = 3 [OTHERS] PACKET_TYPE [3] = 0 [ETHERNET] L3_PROTOCOL [4] = 15 [INVALID] L3_PT [8] = 0 FF [1] = 0 MN [1] = 0 RF [1] = 0 SC [1] = 0 CARD_TYPE [4] = 0x0 ISL [16] = 0x0 DATA [592] 0000: 00 27 0D C9 88 00 B0 A8 6E 7C 62 42 88 47 00 00 ".'......n|bB.G.." 0010: 03 FE 45 00 00 54 A6 E8 00 00 3F 01 36 9B C2 64 "..E..T....?.6..d" 0020: 07 E3 5B C6 78 18 00 00 A3 FD 12 28 0A E5 2F B9 "..[.x......(../." 0030: 11 53 00 00 00 00 CF BB 01 00 00 00 00 00 42 42 ".S............BB" 0040: 42 42 42 42 42 42 42 42 42 42 "BBBBBBBBBB" CRC [16] = 0x8184 RBUS data: SEQ_NUM [5] = 0x12 CCC [3] = b100 [L3_RW] CAP1 [1] = 0 CAP2 [1] = 0 QOS [3] = 1 EGRESS [1] = 0 DT [1] = 0 [IP] TL [1] = 1 [B64] FLOOD [1] = 0 DEST_INDEX [19] = 0x7FFA VLAN [12] = 4012 RBH [3] = b011 RDT [1] = 1 GENERIC [1] = 0 EXTRA_CICLE [1] = 0 FABRIC_PRIO [1] = 0 L2 [1] = 0 FCS1 [8] = 0x1 IP_TOS_VALID [1] = 0 IP_TOS_OFS [7] = 19 IP_TOS [8] = 0x0 IP_TTL_VALID [1] = 1 IP_TTL_OFS [7] = 26 IP_TTL [8] = 63 IP_CSUM_VALID [1] = 1 IP_CSUM_OFS [7] = 28 IP_CSUM [16] = 0x369B DELTA_LEN [8] = 252 REWRITE_INFO i0 - replace bytes from ofs 0 to ofs 13 with seq '00 27 0D C9 88 00 00 00 00 00 00 00 08 00'. remove bytes from ofs 14 to ofs 17. FCS2 [8] = 0x0

Faaantastic. We have the expected DADDR+SADDR, it's ICMP, and I even set the payload to 0x42 just to be clear. But hmm, DEST_INDEX 0x7FFA does not look good; that's something special, and a cursory look at the REWRITE_INFO makes me dubious. Let's dig around a bit.

psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail Index: 67379 smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320 mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1 format: MAC_TCP, flags: 0x2000208408 delta_seq: 0, delta_ack: 0 packets: 986, bytes: 100544 psl2-pe2.hel.fi#sh mls cef lookup 91.198.120.24 detail Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1 RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix) Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix) M(159300 ): E | 1 FFF 0 0 0 0 255.255.255.0 V(159300 ): 8 | 1 0 0 0 0 0 91.198.120.0 (A:67379 ,P:1,D:0,m:0 ,B:0 ) M(159300 ): E | 1 FFF 0 0 0 255.255.255.0 V(159300 ): 9 | 1 0 0 0 0 91.198.120.0 (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 ) psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail Index: 67379 smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320 mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1 format: MAC_TCP, flags: 0x2000208408 delta_seq: 0, delta_ack: 0 packets: 203, bytes: 20666

Yeah, we're definitely not rewriting those MAC addresses, and the VLAN is definitely not correct in the RBUS. But what are that VLAN and that DEST_INDEX?

psl2-pe2.hel.fi#remote command switch show platform hardware tycho register 0 1794 | i 7FFA
0x03C1: RED_CEN_RW_IDX = 0x00007FFA [32762 ]
0x03C3: RED_IN_RW_FF_IDX = 0x00007FFA [32762 ]
psl2-pe2.hel.fi#show vlan internal usage | i 4012
4012 vrf_0_vlan

Aaah, it's a recirculation. We're putting it into VRF 0 (INET) and sending 'er right back. Notice how we're changing the MAC addresses and dropping the MPLS label on rewrite. So as the MPLS label is gone, we can again capture the packet with the fancy triggers; just to be sure we're capturing the recirculated one, we'll also set CR=1 (if you need another recirculation, too bad, there may not be any way to see the 3rd pass).

psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if CR=1 IP_DA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 CR=1 IP_DA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam start psl2-pe2.hel.fi#show platform capture elam data DBUS data: SEQ_NUM [5] = 0x7 QOS [3] = 1 QOS_TYPE [1] = 0 TYPE [4] = 0 [ETHERNET] STATUS_BPDU [1] = 0 IPO [1] = 1 NO_ESTBLS [1] = 0 RBH [3] = b000 CR [1] = 1 TRUSTED [1] = 1 NOTIFY_IL [1] = 0 NOTIFY_NL [1] = 0 DISABLE_NL [1] = 0 DISABLE_IL [1] = 0 DONT_FWD [1] = 0 INDEX_DIRECT [1] = 0 DONT_LEARN [1] = 0 COND_LEARN [1] = 0 BUNDLE_BYPASS [1] = 0 QOS_TIC [1] = 1 INBAND [1] = 0 IGNORE_QOSO [1] = 0 IGNORE_QOSI [1] = 0 IGNORE_ACLO [1] = 0 IGNORE_ACLI [1] = 0 PORT_QOS [1] = 0 CACHE_CNTRL [2] = 0 [NORMAL] VLAN [12] = 4012 SRC_FLOOD [1] = 0 SRC_INDEX [19] = 0xC1 LEN [16] = 102 FORMAT [2] = 0 [IP] MPLS_EXP [3] = 0x0 REC [1] = 0 NO_STATS [1] = 0 VPN_INDEX [10] = 0x7F PACKET_TYPE [3] = 0 [ETHERNET] L3_PROTOCOL [4] = 0 [IPV4] L3_PT [8] = 1 [ICMP] MPLS_TTL [8] = 0 SRC_XTAG [4] = 0xF DEST_XTAG [4] = 0xA FF [1] = 0 MN [1] = 0 RF [1] = 1 SC [1] = 0 CARD_TYPE [4] = 0x0 DMAC = 0027.0dc9.8800 SMAC = 0000.0000.0000 IPVER [1] = 0 [IPV4] IP_DF [1] = 0 IP_MF [1] = 0 IP_HDR_LEN [4] = 5 IP_TOS [8] = 0x0 IP_LEN [16] = 84 IP_HDR_VALID [1] = 1 IP_CHKSUM_VALID [1] = 1 IP_L4HDR_VALID [1] = 1 IP_OFFSET [13] = 0 IP_TTL [8] = 63 IP_CHKSUM [16] = 0x8774 IP_SA = 194.100.7.227 IP_DA = 91.198.120.24 ICMP_TYPE [8] = 0x0 ICMP_CODE [8] = 0x0 ICMP_DATA [104] 0000: 89 06 13 36 17 1E 5A C3 11 53 00 B0 61 "...6..Z..S..a" CRC [16] = 0x136E RBUS data: SEQ_NUM [5] = 0x7 CCC [3] = b100 [L3_RW] CAP1 [1] = 0 CAP2 [1] = 0 QOS [3] = 0 EGRESS [1] = 0 DT [1] = 0 [IP] TL [1] = 0 [B32] FLOOD [1] = 1 DEST_INDEX [19] = 0xDB6 VLAN [12] = 3510 RBH [3] = b010 RDT [1] = 0 GENERIC [1] = 0 EXTRA_CICLE [1] = 0 FABRIC_PRIO [1] = 0 L2 [1] = 0 FCS1 [8] = 0x1 IP_TOS_VALID [1] = 0 IP_TOS_OFS [7] = 15 IP_TOS [8] = 0x0 IP_TTL_VALID [1] = 1 IP_TTL_OFS [7] = 22 IP_TTL [8] = 62 IP_CSUM_VALID [1] = 1 IP_CSUM_OFS [7] = 24 IP_CSUM [16] = 0x8874 DELTA_LEN [8] = 0 REWRITE_INFO i0 - replace bytes from ofs 0 to ofs 11 with seq '5C 5E AB EE D3 20 00 27 0D C9 88 00'. FCS2 [8] = 0x0

Greeaaat! We have the expected rewrite info and VLAN, so we have high confidence in both directions of packets, and we've learned the packets coming from the MPLS side are getting recirculated (it's because of EXP0, but that's another story; there are many reasons why it might happen).