2016-04-22

Junos and DHCP relay

There are two different ways to configure DHCP relaying in Junos: bootp helper and dhcp-relay. They work in very different manners; bootp helper is being phased out and is not supported on, for example, QFX10k. The behaviour of bootp helper is obvious, it works like it works on every other sensible platform. The behaviour of dhcp-relay is very confusing and is not documented anywhere.
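
For concreteness, the two styles look roughly like this (addresses, group and interface names are mine, purely illustrative). The bootp helper style:

    set forwarding-options helpers bootp server 192.0.2.1
    set forwarding-options helpers bootp interface ge-0/0/0.0

And the dhcp-relay style:

    set forwarding-options dhcp-relay server-group DHCP 192.0.2.1
    set forwarding-options dhcp-relay group CUSTOMERS active-server-group DHCP
    set forwarding-options dhcp-relay group CUSTOMERS interface ge-0/0/0.0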

If it's possible on your platform to configure the bootp helper, do it. If not, complain to Juniper about the dhcp-relay implementation and ask them to fix it. The main problem with the dhcp-relay implementation is that once you've configured it, you're punting all DHCP traffic on all interfaces. Normal transit traffic crossing your router is subject to this punt, so transit customers will experience larger jitter and delay on the punted packets, and almost certainly reordering, because a non-DHCP packet that came after, but was not subject to the punt, will be forwarded first. Technically reordering does not matter as long as it does not happen inside a flow, but it's not desirable.

This is how the sequence of operations works in Junos for dhcp-relay:

  1. A transit packet touches the ingress NPU
  2. After the L2 lookup, but before the L3 lookup, the ingress NPU punts the transit packet to the PFE CPU
  3. The PFE CPU, for reasons obvious only to people on drugs, encapsulates the transit DHCP packet with a new IP header, UDP header and 28 magic bytes: [ip_new, udp_new, 28_bytes, ip_old, udp_old, payload]
  4. The PFE CPU sends the encapsulated DHCP packet to the RE, so that JDHCPD can inspect it
  5. JDHCPD determines whether it's relevant to it or not; in our case it's not, and in a normal configuration it proceeds to drop the transit packet, as it was not interesting to us!

I don't disagree that in some instances it may be desirable to snoop transit DHCP messages, to see what the server unicasts to our client, but that is a small percentage of the traffic. We know the DHCP servers we have, so why can't we decide what is punted and what is not?

If the operator is doing her due diligence and has a strict lo0 filter, which only allows into the control-plane what is actually needed there, all of this breaks, but in a very confusing way. Firstly, these transit DHCP packets are NOT subject to the hardware-level lo0 filter; you cannot drop them there, which to me makes sense: they are transit, the lo0 filter shouldn't affect them. However, after the punt and encapsulation they magically become subject to the lo0 filter on the software side! In your typical lo0 filter with an ultimate 'discard all' term, you'll see these x.y.z.k => 255.255.255.255 packets being discarded, and you might be a bit confused how on earth you're getting DADDR 255.255.255.255 from your neighbouring core routers! So perhaps you'll run 'monitor traffic interface X matching "host 255.255.255.255" detail' to understand better what is going on. Well, you won't see any of the 255.255.255.255 packets you were dropping, because Junos has added support for this DHCP encapsulation to its own tcpdump, so you're actually seeing the original embedded headers, not the new top headers (which is where host 255.255.255.255 would match). If you add 'write-file dhcp.pcap' and open that file in wireshark, you'll see the injected new headers and the packet interpreted as DHCP, including the original header portion, which makes for a VERY confusing looking DHCP packet. If you manually pop ip_new, udp_new and the 28 bytes from the packet, you'll see the expected transit packet.
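
As a rough illustration of that last step, a minimal Ruby sketch of the manual decapsulation, assuming the outer IPv4 header carries no options (the function name is mine):

    # Pop the outer encapsulation Junos adds to punted transit DHCP:
    # [ip_new, udp_new, 28 magic bytes, ip_old, udp_old, payload]
    def pop_dhcp_snoop_encap(raw)
      outer_ihl = (raw.bytes[0] & 0x0f) * 4            # outer IPv4 header length
      raw.byteslice(outer_ihl + 8 + 28, raw.bytesize)  # skip outer IP + UDP + magic
    end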

The strangeness does not end here. You can easily discard/log/count these 'destination-address 255.255.255.255' packets in the lo0 filter (in software, not in hardware), but when you change that 'discard' to 'accept', you won't see anything in the counters or logs anymore! Yet it is crucial that you do accept them, because otherwise they are dropped before jdhcpd has a chance to process them, and you're killing all your transit DHCP. And even after you add this confusing rule to permit transit traffic to enter jdhcpd, you're still going to be dropping all transit DHCP until you configure set forwarding-options dhcp-relay forward-snooped-clients all-interfaces.
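
What ends up being needed is something along these lines in the software lo0 filter (a sketch only; the term name is mine, and the port matching may need adjusting for your setup), together with the forward-snooped-clients knob quoted above:

    term ACCEPT-DHCP-SNOOP {
        from {
            destination-address {
                255.255.255.255/32;
            }
            protocol udp;
            destination-port [ bootps bootpc ];
        }
        then accept;
    }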

The problem here, of course, is that in the lo0 filter we can no longer discriminate between transit packets hitting the software-processed lo0 and real DHCP discover packets coming from local interfaces. I usually specifically allow DHCP discovers only from interfaces where I've enabled DHCP. But with this dhcp-relay configuration, I have to allow them everywhere! I'm no longer protected from customers having L2 loops and injecting a wire-rate of DHCP discovers into my control-plane; I now have to accept those, because I cannot discriminate in the lo0 filter between transit packets and discovers.

What should JNPR do? Continue to support bootp helper style operation, where no transit traffic is ever punted. Make dhcp-relay work like that out-of-the-box, and those who need to snoop transit must enable it explicitly, with tools to scope it by various keys: saddr/daddr, interface, NPU. I'm pretty sure there are now a bunch of JNPR boxes silently dropping transit DHCP, because there is no documentation anywhere on how this works.

I'm not as convinced as JTAC that this isn't simply a bug; it feels odd that all of this really would be the intended behaviour. The telling problem here is that JNPR is somehow avoiding lo0 evaluation in HW. I suspect this is because the packet is not classified as IPv4 protocol but as dhcp-snooping protocol (yes, Junos has ipv4, ipv6, mpls, bridge, fibrechannel and dhcp-snooping protocol route tables!), and as it's not IPv4 it's not subject to the HW lo0 filter. However, they seem to drop the ball after the punt, making the embedded packet subject to the SW lo0 filter; I think it really should not behave like this.

I wish I could say this is the only situation where transit traffic can hit the lo0 filter, but that's not true. Some JNPR platforms punt transit IP options and transit IPv6 HBH packets through the lo0 filter. In those cases you need a rule matching all local addresses plus ip-options to drop, then a second rule to allow all ip-options, unless you also want to drop all transit ip-options (which is probably just fine). Pretty much no one knows this, so people likely don't know what their network's policy regarding ip-options is; the actual policy is just determined by your network upgrade cycle.
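
A sketch of that rule pair (the prefix-list name is mine; verify ip-options matching support on your platform):

    term LOCAL-IP-OPTIONS {
        from {
            destination-prefix-list {
                LOCAL-ADDRESSES;
            }
            ip-options any;
        }
        then discard;
    }
    term TRANSIT-IP-OPTIONS {
        from {
            ip-options any;
        }
        then accept;
    }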

2014-03-17

Quick look at Trio ddos-protection with flow-detection

Some things are easy to protect with an iACL and lo0 ACL, but others, like BGP, are really hard: you need to allow BGP from customers and from the core, and it's not convenient or practical to handle them separately in the lo0 ACL + policer. Luckily Junos has a feature called flow-detection, which you turn on with set system ddos-protection global flow-detection.

I'm sending a DoS from a single source to lo0, and my iBGP goes down immediately. After I turn on flow-detection, iBGP connectivity is restored. Looking at the PFE, we can see what is happening:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx rindex prot aggr IIF/IFD     pkts      bytes source-info
--- --- ------ ---- ---- ------- -------- ---------- -----------
  0   0    721 1400  sub     338       21      79161 c158ef22 c158ef1f 53571 179
  0   1   2679 1400  sub     356 11159404 2187242988 64640102 c158ef1f 179 179
  0   2   2015 1400  sub     338       29     112468 c158ef23 c158ef1f 179 65020

Pretty nice and clear: 64.64.01.02 => c1.58.ef.1f is our attack traffic and it's getting its own policer; iBGP is stable, and the attack traffic is policed separately. Let's check those policers more closely:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 2679
PFE: 0
Flow Record Index: 2679
Flow Key:
   Proto-ID: 1400   Key type: 1
   IIF: 356
   Src IP addr: 64640102 (100.100.1.2)
   Dst IP addr: c158ef1f (193.88.239.31)
   Src port: 179   Dst port: 179
Flow Context Data:
   Rcvd ack_add: 1   Rcvd ack_del: 0   Rcvd last flow op: 2
   Flow state: 2   Aggr level: 0
   Proto idx: 79   Policer idx: 3885
   Time inserted: 3484405383   Time last violated: 3484485825
   Last received: 42480417
Flow Statitics:
   Packet Count: 42655279
   Byte Count: 8360434488

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp stats
DDOS Policer Statistics:
                                                            arrival   pass  # of
 idx prot group proto     on loc        pass       drop        rate   rate flows
 --- ---- ----- --------- -- ----- --------- ---------- ----------- ------ -----
  79 1400 bgp   aggregate Y  UKERN   8378805          0           9      9     3
                              PFE-0   8393411 2310827652      528095      9     3

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp configuration
DDOS Policer Configuration:
                                   UKERN-Config    PFE-Config
 idx prot group proto     on Pri    rate burst    rate burst
 --- ---- ----- --------- -- ---   ----- -----   ----- -----
  79 1400 bgp   aggregate Y  Lo     8000 20000    8000 20000
MX104-ABB-0(test13nqa1-re0.dk vty)#

Looks good. What if we increase the sources from a /32 to a /16? Let's see:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe  idx rindex prot aggr IIF/IFD pkts bytes source-info
--- ---- ------ ---- ---- ------- ---- ----- -----------
  0    0   3083 1400  sub     356   80 15484 6464e743 c158ef1f 179 179
  0    1    805 1400  sub     356  217 42336 64642df4 c158ef1f 179 179
  0    2    672 1400  sub     356  291 56840 6464eb64 c158ef1f 179 179
  0    3   3382 1400  sub     356  256 49980 6464ba66 c158ef1f 179 179
  0    4   3956 1400  sub     356   82 15876 64643b3d c158ef1f 179 179
  0    5   1999 1400  sub     356   92 17836 646403a6 c158ef1f 179 179
  0    6   2808 1400  sub     356   22  4116 64642f33 c158ef1f 179 179
<..>
  0 3455    866 1400  sub     356   80 15484 646475a1 c158ef1f 179 179
  0 3456   4096 1400  sub     356  319 62328 646470a8 c158ef1f 179 179
MX104-ABB-0(test13nqa1-re0.dk vty)#

No mas! I expected to see 65537 or so flows, but instead I'm seeing just 3456. My iBGP also went down! Let's poke around a bit:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info
DDOS-SCFD global context
------------------------------------------------------
FLow entry/state/hash size: 288/12/8 bytes
Flow scan: Yes
Send async msg to RE: Yes
Send periodic update to RE: No
Default enabled: No
Enabled: Yes
Last aggr op is: Aggr
Next available flow id: 5b39
Culprit flows: 3902
Culprit flows on scan: 3902
Violated protocols: 1
Violated protocols on scan: 1
Violation report rate: 100(pps)
Flow change report rate: 100(pps)
Scan cookie: 588425
Free SCFD states: 4096
Free flow entries: 194
Free notif blocks: 400
Free re request blocks: 400
Free flow msg blocks: 4096
Free flow policers: 321
Socket notif queue size: 0
Has queued work state items: 0
Has queued re requests: 0
Has queued flow rate modifies: 0
Has queued flow messages: 0
Send packet size: 16384
Send batch size: 1
Last aggr op time: 2942210140
Per PFE flows: 0=3902
Run out of flows: 0
Reuse an entry not freed yet: 0
Run out of state items: 0
Bad proto ID: 0
rindex changed for same flow: 0
Remove flow on an empty proto: 0
Remove non-exist flow: 0
Read ASIC failed: 0
Failed tries write flow params: 0
Failed change flow params: 0
Run out of policers: 0
Run out of msg blocks: 0
Run out of mod flow blocks: 0
SCFD stats for PFE 0
Global configuration
  violation report rate: 100
  flow report rate: 100
Flow counters read from LU
  current suspicious flows: 191
  current culprit flows: 3905
  discovered suspicious flows: 123802
  discovered culprit flows: 23356
  deleted culprit flows: 19451
  false positives: 100255
  hash insertion errors: 148669170
  hash deletion errors: 0
  max flow tbl scan time(ms): 0
  debug values: 0
Flow reports received through PUNT
  policer violation: 104836
  flow found: 23353
  flow timeout: 0
  flow return to normal: 115
  flow cleared: 19336
  unknown reports: 0
  bad flow type: 0
Violation indication policer stats
  Passed indications: 21346
  Dropped indications: 84304888

What immediately caught my attention was Free flow entries: 194, oops! We're running out of flow entries (the HW flow table holds 4096: 3902 culprit flows + 194 free). But as we wait a few more seconds, my iBGP comes back up. Why?

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx rindex prot aggr IIF/IFD     pkts       bytes source-info
--- --- ------ ---- ---- ------- -------- ----------- -----------
  0   0   3183 1400  ifl     338       96      220159 0 0 0 0
  0   1   1439 1400  ifl     356 70851932 13886978476 0 0 0 0
MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 3183
PFE: 0
Flow Record Index: 3183
Flow Key:
   Proto-ID: 1400   Key type: 1
   IIF: 338
   Src IP addr: 0 (0.0.0.0)
   Dst IP addr: 0 (0.0.0.0)
   Src port: 0   Dst port: 0
Flow Context Data:
   Rcvd ack_add: 1   Rcvd ack_del: 1   Rcvd last flow op: 3
   Flow state: 3   Aggr level: 1
   Proto idx: 79   Policer idx: 2105
   Time inserted: 3485722214   Time last violated: 3485723825
   Last received: 105
Flow Statitics:
   Packet Count: 105
   Byte Count: 221115
MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 1439
PFE: 0
Flow Record Index: 1439
Flow Key:
   Proto-ID: 1400   Key type: 1
   IIF: 356
   Src IP addr: 0 (0.0.0.0)
   Dst IP addr: 0 (0.0.0.0)
   Src port: 0   Dst port: 0
Flow Context Data:
   Rcvd ack_add: 1   Rcvd ack_del: 0   Rcvd last flow op: 2
   Flow state: 2   Aggr level: 1
   Proto idx: 79   Policer idx: 2155
   Time inserted: 3485637825   Time last violated: 3485792825
   Last received: 81852797
Flow Statitics:
   Packet Count: 82308537
   Byte Count: 16132473056
MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info
DDOS-SCFD global context
------------------------------------------------------
FLow entry/state/hash size: 288/12/8 bytes
Flow scan: Yes
Send async msg to RE: Yes
Send periodic update to RE: No
Default enabled: No
Enabled: Yes
Last aggr op is: Aggr
Next available flow id: 5b40
Culprit flows: 2
Culprit flows on scan: 2
Violated protocols: 1
Violated protocols on scan: 1
Violation report rate: 100(pps)
Flow change report rate: 100(pps)
Scan cookie: 588463
Free SCFD states: 4096
Free flow entries: 4094
Free notif blocks: 400
Free re request blocks: 400
Free flow msg blocks: 4096
Free flow policers: 4221
Socket notif queue size: 0
Has queued work state items: 0
Has queued re requests: 0
Has queued flow rate modifies: 0
Has queued flow messages: 0
Send packet size: 16384
Send batch size: 1
Last aggr op time: 2942210140
Per PFE flows: 0=2
Run out of flows: 0
Reuse an entry not freed yet: 0
Run out of state items: 0
Bad proto ID: 0
rindex changed for same flow: 0
Remove flow on an empty proto: 0
Remove non-exist flow: 0
Read ASIC failed: 0
Failed tries write flow params: 0
Failed change flow params: 0
Run out of policers: 0
Run out of msg blocks: 0
Run out of mod flow blocks: 0
SCFD stats for PFE 0
Global configuration
  violation report rate: 100
  flow report rate: 100
Flow counters read from LU
  current suspicious flows: 0
  current culprit flows: 2
  discovered suspicious flows: 123806
  discovered culprit flows: 23360
  deleted culprit flows: 23358
  false positives: 100446
  hash insertion errors: 149071671
  hash deletion errors: 0
  max flow tbl scan time(ms): 0
  debug values: 0
Flow reports received through PUNT
  policer violation: 104836
  flow found: 23360
  flow timeout: 0
  flow return to normal: 117
  flow cleared: 23241
  unknown reports: 0
  bad flow type: 0
Violation indication policer stats
  Passed indications: 21346
  Dropped indications: 84304888

Wow! Only two flows left, and they've changed from sub to ifl; the keys are all zero, no src/dst IP match, no src/dst port match! But my iBGP seems to work, as my attack traffic (simulating a BGP customer) is coming in on a separate IFL from my iBGP. I guess it's a lot better than nothing, certainly a lot better than overflowing the 4k policers.

Yet, waiting a bit longer, I see that my INET/INET6 iBGP is happy, but my VPNv4/VPNv6 iBGP is unhappy; both are on the same IFL, but a different one than the attack traffic. It turns out the default limit for the sub and ifl levels is 10pps, and for ifd it's 20kpps. My iBGP was 5 sessions and was periodically exceeding the 10pps IFL limit, and by default, once the flow limit is hit, you don't drop just the exceeding traffic (flow-control-mode police), you drop everything in the offending flow (flow-control-mode drop), which I think is a great default. But is sub==ifl==10pps a great default? I don't think so. Anyway, my iBGP IFL was hitting 10pps and dropping everything for the duration of the violation. I'm not sure why it affected the 3 VPNv4/VPNv6 sessions but not INET/INET6, same hold-time and keepalive, but it doesn't matter, a misconfig on my part. Once I added set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 1000, VPNv4/VPNv6 stabilized.

I noticed that for some reason I could no longer make the platform automatically switch from sub detection to ifl detection during a DoS. This means my configuration was again broken, as the subscriber flows congested my HW's 4k flow slots. Not sure why this is or how it should work. I ended up disabling 'sub' level detection with set system ddos-protection protocols bgp aggregate flow-level-detection subscriber off. I guess it would be nice to keep it on, if we could reliably and rapidly switch from sub=>ifl while the policers are congested, but right now it seems that if there is danger of seeing 4k of sub-level flows, it might be safest to just ignore them and protect only at the ifl level. My full config is now:

{master}[edit system ddos-protection protocols bgp]
fisakytt@test13nqa1-re0.dk# show
aggregate {
    bandwidth 8000;
    flow-level-bandwidth {
        logical-interface 100;
    }
    flow-level-detection {
        subscriber off;
    }
}

It seems to consistently protect my BGP from the attacks I'm interested in. That's quite a bit of config if you're doing it for 200 DDoS policer protocols (after researching each relevant one to decide what you even should do). It might be nice if Juniper had a concept of 'protected flows', which always have priority: flows which have a manual hint in configuration. Like in Cisco LPTS, where configuring BGP gets you a policer for that configured session; JNPR could easily do the same, program those persistently, and in case of congestion never break them, only break policers among the dynamically detected flows. I'd also really want to figure out how the sub=>ifl fallback should work, and why it works sometimes and sometimes not.

2014-03-16

Tourist trip to MX fabric

Tourist, because this is mostly original research, so the quality may be dubious.

You can infer a lot about the fabric by looking at the 'show hsl2 ...' commands. Let's start.

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0) serdes
MQCHIP(0) serdes table :
MQCHIP(0)-Avago 65NM-0  [0xf300000]: 24 links   0 - 23
MQCHIP(0)-Avago 65NM-1  [0xf304000]: 24 links  24 - 47
MQCHIP(0)-Avago 65NM-2  [0xf308000]:  8 links  48 - 55
MQCHIP(0)-Avago 65NM-3  [0xf309000]:  8 links  56 - 63
MQCHIP(0)-Avago 65NM-4  [0xf30a000]:  8 links  64 - 71
MQCHIP(0)-Avago 65NM-5  [0xf30b000]:  8 links  72 - 79
MQCHIP(0)-Avago 65NM-6  [0xf30c000]:  8 links  80 - 87
MQCHIP(0)-Avago 65NM-7  [0xf30d000]:  8 links  88 - 95
MQCHIP(0)-Avago 65NM-8  [0xf30e000]:  8 links  96 - 103
MQCHIP(0)-Avago 65NM-9  [0xf30f000]:  8 links 104 - 111
MQCHIP(0)-Avago 65NM-10 [0xf310000]:  8 links 112 - 119
MQCHIP(0)-Avago 65NM-11 [0xf311000]:  8 links 120 - 127
MQCHIP(0)-Avago 65NM-12 [0xf312000]:  8 links 128 - 135
MQCHIP(0)-Avago 65NM-13 [0xf313000]:  8 links 136 - 143
MQCHIP(0)-Avago 65NM-14 [0xf318000]:  2 links 144 - 145
MQCHIP(0)-Avago 65NM-15 [0xf31a000]:  2 links 146 - 147

Avago is a well known manufacturer of SerDes (SERialization/DESerialization); 65NM probably refers to Avago's 65nm lithography line of products. The SerDes presented here are unidirectional. Still, that is quite a large number of SerDes links. They are mapped/bundled into channels, and looking at those channels we can get a bit more clue about them:

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0)
MQCHIP(0)(16) 74 channels 16 serdes 148 links chip_flag 0x9
Flag: log_poll_error log_ioerr rd_fn 0x40760d64 wr_fn 0x40760d10 cookie 0x4282bf28 disaster_fn 0x0
MQCHIP(0) channel table :
MQCHIP(0)-chan-rx-0  0x7200000 <= Table     QXCHIP(0)-chan-tx-1 MQCHIP(0)-Avago 65NM-link-146
MQCHIP(0)-chan-tx-1  0x7201000 => Table     QXCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-144
MQCHIP(0)-chan-rx-2  0x7202000 <= Table     LUCHIP(0)-chan-tx-2 MQCHIP(0)-Avago 65NM-link-24
MQCHIP(0)-chan-rx-3  0x7202000 <= Table     LUCHIP(0)-chan-tx-3 MQCHIP(0)-Avago 65NM-link-36
MQCHIP(0)-chan-tx-4  0x7203000 => Table     LUCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-0
MQCHIP(0)-chan-tx-5  0x7203000 => Table     LUCHIP(0)-chan-rx-1 MQCHIP(0)-Avago 65NM-link-12
MQCHIP(0)-chan-rx-6  0x7204000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-7  0x7204000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-8  0x7204000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-84
MQCHIP(0)-chan-rx-9  0x7204000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-80
MQCHIP(0)-chan-rx-10 0x7204000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-82
MQCHIP(0)-chan-rx-11 0x7204000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-84
MQCHIP(0)-chan-rx-12 0x7204000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-86
MQCHIP(0)-chan-rx-13 0x7205000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-14 0x7205000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-15 0x7205000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-92
MQCHIP(0)-chan-rx-16 0x7205000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-88
MQCHIP(0)-chan-rx-17 0x7205000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-90
MQCHIP(0)-chan-rx-18 0x7205000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-92
MQCHIP(0)-chan-rx-19 0x7205000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-94
MQCHIP(0)-chan-rx-20 0x7206000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-21 0x7206000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-22 0x7206000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-100
MQCHIP(0)-chan-rx-23 0x7206000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-96
MQCHIP(0)-chan-rx-24 0x7206000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-98
MQCHIP(0)-chan-rx-25 0x7206000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-100
MQCHIP(0)-chan-rx-26 0x7206000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-102
MQCHIP(0)-chan-rx-27 0x7207000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-28 0x7207000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-29 0x7207000 <= Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-108
MQCHIP(0)-chan-rx-30 0x7207000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-104
MQCHIP(0)-chan-rx-31 0x7207000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-106
MQCHIP(0)-chan-rx-32 0x7207000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-108
MQCHIP(0)-chan-rx-33 0x7207000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-110
MQCHIP(0)-chan-tx-34 0x7208000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-35 0x7208000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-36 0x7208000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-52
MQCHIP(0)-chan-tx-37 0x7208000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-48
MQCHIP(0)-chan-tx-38 0x7208000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-50
MQCHIP(0)-chan-tx-39 0x7208000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-52
MQCHIP(0)-chan-tx-40 0x7208000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-54
MQCHIP(0)-chan-tx-41 0x7209000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-42 0x7209000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-43 0x7209000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-60
MQCHIP(0)-chan-tx-44 0x7209000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-56
MQCHIP(0)-chan-tx-45 0x7209000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-58
MQCHIP(0)-chan-tx-46 0x7209000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-60
MQCHIP(0)-chan-tx-47 0x7209000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-62
MQCHIP(0)-chan-tx-48 0x720a000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-49 0x720a000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-50 0x720a000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-68
MQCHIP(0)-chan-tx-51 0x720a000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-64
MQCHIP(0)-chan-tx-52 0x720a000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-66
MQCHIP(0)-chan-tx-53 0x720a000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-68
MQCHIP(0)-chan-tx-54 0x720a000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-70
MQCHIP(0)-chan-tx-55 0x720b000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-56 0x720b000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-57 0x720b000 => Primitive remote/unknown      MQCHIP(0)-Avago 65NM-link-76
MQCHIP(0)-chan-tx-58 0x720b000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-72
MQCHIP(0)-chan-tx-59 0x720b000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-74
MQCHIP(0)-chan-tx-60 0x720b000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-76
MQCHIP(0)-chan-tx-61 0x720b000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-78
MQCHIP(0)-chan-rx-62 0x720c000 <= Channel   remote/unknown      MQCHIP(0)-Avago 65NM-link-128
MQCHIP(0)-chan-rx-63 0x720c000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-128
MQCHIP(0)-chan-rx-64 0x720c000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-132
MQCHIP(0)-chan-rx-65 0x720d000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-136
MQCHIP(0)-chan-rx-66 0x720d000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-136
MQCHIP(0)-chan-rx-67 0x720d000 <= Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-140
MQCHIP(0)-chan-tx-68 0x720e000 => Channel   remote/unknown      MQCHIP(0)-Avago 65NM-link-112
MQCHIP(0)-chan-tx-69 0x720e000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-112
MQCHIP(0)-chan-tx-70 0x720e000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-116
MQCHIP(0)-chan-tx-71 0x720f000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-120
MQCHIP(0)-chan-tx-72 0x720f000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-120
MQCHIP(0)-chan-tx-73 0x720f000 => Invalid   remote/unknown      MQCHIP(0)-Avago 65NM-link-124

That's a lot of information, but if we ignore the 'Invalid' entries we're left with 'Primitive' and 'Table'. The 'Table' ones clearly mark where they connect, in this case the LU chip and the QX chip, so we can work on the assumption that 'Primitive' is used to connect to the fabric. We can further determine there are 8 such channels in each direction; two of the channels are always numerically next to each other, with a lot of empty channels between them, which may prove useful. But let's take a closer look at one channel.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 7 MQCHIP(0)-chan-rx-7 : Up Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-80 Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 3309287443 CRC errors: 0 Cell last : 2328887065 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0 sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0 sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0 sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0 sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0 Last poll: last_int_status : 0x0 last_cell_received : 0x8acfff19 last_crc_errors[0] : 0x0 last_crc_errors[1] : 0x0 last_crc_errors[2] : 0x0 last_crc_errors[3] : 0x0 last_crc_errors[4] : 0x0 last_crc_errors[5] : 0x0 last_crc_errors[6] : 0x0 last_crc_errors[7] : 0x0 last_bcr_overflow : 0x0 last_bcr_overflow : 0x0 last_bcr_multiple_psr_push : 0x0 last_sub_idle_override : 0x0 last_psr_overflow : 0x0 last_psr_illegal_8b10b : 0x0 last_psr_8b10b_valids : 0x0 last_psr_illegal_64b66b : 0x0 last_psr_64b66b_framing : 0x0 last_psr_clock_correction : 0x0 last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#

What is interesting here is Sub channel 1 of 2 with 4 links. Sub-channel 1 of 2? Where is the 2nd? But 4 links is interesting too: so a channel has 4 links. Let's check the next one. It's also interesting to notice that we have a cell counter, and that there is 64b/66b encoding overhead.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 8
MQCHIP(0)-chan-rx-8 : Up
Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX
reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-84
Flag: 0x80000 64b66b No-plesio input-width:0
Latency: hardware default
Cell received: 3309287443 CRC errors: 0
Cell last    : 2328887065 CRC last  : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
Check base channel for rest of the sub-channel stats.
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#

Ok, 7 and 8 are next to each other because logically they are one channel made out of two sub-channels. So the 8 channels we saw per direction are really 4 logical channels with two sub-channels each. Looking at the other channels, everything looks the same, until we hit 28 and 29, i.e. the 4th channel:

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 28 MQCHIP(0)-chan-rx-28 : Down Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-104 Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 0 CRC errors: 0 Cell last : 0 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0 sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0 sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0 sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0 sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0 Last poll: last_int_status : 0x0 last_cell_received : 0x0 last_crc_errors[0] : 0x0 last_crc_errors[1] : 0x0 last_crc_errors[2] : 0x0 last_crc_errors[3] : 0x0 last_crc_errors[4] : 0x0 last_crc_errors[5] : 0x0 last_crc_errors[6] : 0x0 last_crc_errors[7] : 0x0 last_bcr_overflow : 0x0 last_bcr_overflow : 0x0 last_bcr_multiple_psr_push : 0x0 last_sub_idle_override : 0x0 last_psr_overflow : 0x0 last_psr_illegal_8b10b : 0x0 last_psr_8b10b_valids : 0x0 last_psr_illegal_64b66b : 0x0 last_psr_64b66b_framing : 0x0 last_psr_clock_correction : 0x0 last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 29 MQCHIP(0)-chan-rx-29 : Down Sub channel 2 of 2 with 4 links link mask 0xf0 HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-108 Flag: 0x0 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 0 CRC errors: 0 Cell last : 0 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) Check base channel for rest of the sub-channel stats. 
last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#

It's down, and no cells have been transferred. Isn't this convenient: we have 4 channels per direction, but one of them is down, so we're left with 3 channels per direction, and this being an MX960, that matches our SCB count. But why two sub-channels per SCB? Because the fabric actually has two Avago chips; you need to connect to both, but it is presented as a single channel to higher-level use. Why do we have four channels, if we only use 3 of them? Well, if we look at this same stat on an MX240/MX480 we actually see the fourth channel up and pushing cells! Clearly MX960 has 8+8+8 channels per MQ for fabric and MX240/MX480 have 8+8+8+8 per MQ, 8 more! But MX240/MX480 have only two SCBs, so clearly it must be arranged as (8+8)+(8+8).

Now, we know an MX240/MX480 SCB can handle MPC2 (2*MQ, i.e. 2*40G) fabric connectivity redundantly with a single SCB up, so we know (8+8) (single SCB up) must be enough to run 40G, i.e. 16 SerDes must be enough to carry a single 40G MQ's fabric capacity. This gives us a lower bound for SerDes capacity of 40G/16 SerDes = 2.5Gbps. But wait, we had 64b/66b encoding, and we know the fabric uses cells which have a header, which is overhead, so 2.5Gbps can't be right, it must be more. With a bit of googling we find that Avago makes 3.125Gbps SerDes; this makes sense. So the raw capacity available to an MQ is 16*3.125 = 50Gbps, redundantly! Double that in MX240/MX480 when both SCBs are up, so 100Gbps of fabric capacity, which seems like a LOT! But multicast replication in MPC2 is actually 'btree', so each linecard needs to send two copies of multicast over the fabric to replicate it, so the 100Gbps is halved back to 50Gbps in the multicast scenario, and cell overhead can be devastating (think of a small packet using two cells, but just 1 byte of the 2nd cell: efficiency is 50%!).
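
To make the cell overhead point concrete, a small Ruby sketch; the cell payload size is a placeholder, since the post doesn't give Trio's actual cell dimensions:

    CELL_PAYLOAD = 64  # assumed per-cell payload, for illustration only

    def cell_efficiency(pkt_bytes)
      cells = pkt_bytes.fdiv(CELL_PAYLOAD).ceil
      pkt_bytes.fdiv(cells * CELL_PAYLOAD)
    end

    puts cell_efficiency(65)   # ~0.51, barely spills into a 2nd cell
    puts cell_efficiency(128)  # 1.0, exact multiple of the cell size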

This also means that an MX960 MQ has just 75Gbps of capacity with all SCBs up, so plenty for unicast, but not enough for worst-case multicast btree replication. Still, SCB and MPC2 seem like a good match for almost any realistic/practical scenario: no one is running 100% multicast, so there is plenty of fabric capacity for real life in MX960 and MX240/MX480, and with all SCBs up or one SCB down you still have full wire-rate fabric capacity for unicast.
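
The raw numbers above reduce to simple multiplication; the same arithmetic as a quick sketch:

    SERDES_GBPS = 3.125          # per link, gross; 6.25 with SCBE

    def mq_fabric_gbps(links)
      links * SERDES_GBPS        # before 64b/66b and cell overhead
    end

    puts mq_fabric_gbps(16)      # MX240/MX480, one SCB up  => 50.0
    puts mq_fabric_gbps(32)      # MX240/MX480, both SCBs   => 100.0
    puts mq_fabric_gbps(24)      # MX960, all three SCBs    => 75.0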

But if we actually try to send traffic from a DPCE to an MPC2 in the lab, we'll see MPC2 MQ=>DPCE being fine: a single MQ can congest the 40G DPCE card. But DPCE=>MQ in MX960 gets just 26Gbps out of the 40Gbps needed, even when all SCBs are up. Wait, what? Thought this was murica! We had 75Gbps of MX960 capacity, and we can do just 26Gbps? What is up with that! Furthermore, if we pull one SCB out, the lab test is 100% clean: DPCE=>MQ gets 40Gbps out of the 40Gbps needed, full linerate. My head hurts!

Let's ponder a bit. We know the DPCE does not connect to the redundant SCB, so it's always just 2 SCBs for it in MX960, even though the MQ in MPC2 connects to 3 SCBs. So if we divide the 40Gbps the MQ needs nominally by 3 SCBs, it needs 13.33Gbps per SCB, which multiplied by 2 is 26.66Gbps. Looking at fabric requests and grants, we see that the MQ is not giving grants to the DPCE (ichip): we have fabric capacity, but the MQ is not allowing the DPCE to send. Clearly the MQ does not want to deal with oversubscription, for one reason or another, and it only gives out fabric grants at 13.33Gbps per SCB, so that in no case could the aggregate amount of fabric traffic be over 40Gbps: no oversubscription. But this strategy breaks down when the MQ has 3 connections and the DPCE only 2, and correspondingly the issue is remedied when the MQ downgrades to 2 connections, as the MQ is then giving out fabric grants at 40/2 => 20Gbps per SCB, instead of just 13.33Gbps.
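
The same grant arithmetic in a couple of lines:

    mq_capacity = 40.0               # Gbps one MQ accepts from the fabric
    grant_per_scb = mq_capacity / 3  # MQ splits grants across the 3 SCBs it sees
    puts grant_per_scb * 2           # => 26.66, DPCE connects to only 2 SCBs
    puts (mq_capacity / 2) * 2       # => 40.0, once the MQ uses only 2 SCBs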

Why does the MQ not want to deal with oversubscription? Maybe it wouldn't have sufficient information at ingress to drop intelligently enough to keep QoS contracts? Maybe this behavior was chosen to avoid software complexity? It turns out you don't even have to pull out the SCB to force the MQ to connect to just two SCBs: you can use set chassis fabric redundancy-mode redundant, and MPC2<=>DPCE is wire rate, while the 3rd SCB in the MX960 sits unused, as it's not needed for MPC2; remember, we have 50Gbps per MQ without it, and 75Gbps with it. So the only reason we might even need it is if we do multicast.

Interestingly, if MPC2 were 120Gbps instead of 80Gbps, i.e. 60Gbps of fabric capacity per MQ instead of 40Gbps, this issue shouldn't exist, as you'd be giving out fabric grants at 60/3 => 20Gbps per SCB, so even in the non-redundant/capacity mode the DPCE would get the 40Gbps it needs. The other way to fix this is of course to upgrade the SerDes from 3.125 to 6.25Gbps, and it turns out Avago has such chips. This means the MQ can get 6.25*(8+8), i.e. 100Gbps, with just two SCBEs: enough to do btree replication, and as it's just 2 SCBs, you're still giving the DPCE its full 40Gbps of fabric grants. So clearly SCBE is the better fix, but had the MQ originally been 60Gbps instead of 40Gbps, I think that would have been much better as well.

Can this issue happen again in the future? Can't see why not. Imagine an SCBE system with the far-end linecard being 2*80Gbps (enough to connect to just 2 SCBEs) and the near-end linecard being 2*100Gbps MQ, with similar behavior of curbing fabric grants to 100Gbps per MQ: you'd be giving out 100/3 = 33.33Gbps per SCB, and the far end would be getting 66.66Gbps of fabric grants out of the 80Gbps needed. But looking at DPCE and MPC, I don't think the problem will come again: judging by the fabric streams in the MQ and the DPCE (ichip), traffic coming from an MQ can be classified at the far-end MQ and dropped while retaining the QoS contract. The current DPCE=>MQ case is probably an unfortunate combination of mismatched bandwidths and the DPCE having a single stupid fabric stream, so the far end can't easily determine what to drop in case of congestion, and the best policy was just to make sure congestion never happens at ingress.

We can check the fabric streams and see that there is priority in the MQ:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream all
Input Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: DWWW---------------------------- --------------------------------
1088..1151: -----------------------------WWL L------------------------------H
Output Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: WW------------------------------ --------------------------------
1088..1151: -----------------------------WWL L------------------------------H
NPC1(test2nxb10-re0.dk vty)#

Notice there are four fabric streams (2 fabric streams per NPU in the system), 4,5 and 132,133. It's worth another post, but there is an offset of 128 between low-prio and high-prio. We can quickly try to prove this:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 4
Input Stream 4
-----------------
attached : 0
enabled  : 1
pic slot : 0
mac mode : 0
port     : 0
conn     : 0
tclass   : 0
hi prio  : 0
weight   : 0
Output Stream 4
------------------
attached : 0
enabled  : 1
pic slot : 0
mac mode : 0
wan if   : 0
port     : 0
conn     : 0
weight   : 0
sched    : 1 MQ1
l1 node  : 0
queue    : 4
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 132
Input Stream 132
-----------------
attached : 0
enabled  : 1
pic slot : 0
mac mode : 0
port     : 0
conn     : 0
tclass   : 0
hi prio  : 0
weight   : 0
Output Stream 132
------------------
attached : 0
enabled  : 1
pic slot : 0
mac mode : 0
wan if   : 0
port     : 0
conn     : 0
weight   : 0
sched    : 1 MQ1
l1 node  : 0
queue    : 132
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 4
Q node 4:
  allocated        : true
  parent node      : 0
  guarantee prio   : 3 GL
  excess prio      : 2 EL
  rate enable      : 0
  m rate           : 0 bps
  m credits        : 0
  guarantee enable : 1
  g rate           : 0 bps
  g credits        : 0
  adjust cmode     : 0
  adjust value     : 0
  weight           : 5
  mnc in gp enable : 1
  -----------------------
  guarantee credits   : 0
  excess credits added: 0
  excess credits      : 0
  max credits         : 0
NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 132
Q node 132:
  allocated        : true
  parent node      : 2
  guarantee prio   : 1 GH
  excess prio      : 2 EL
  rate enable      : 0
  m rate           : 0 bps
  m credits        : 0
  guarantee enable : 1
  g rate           : 0 bps
  g credits        : 0
  adjust cmode     : 0
  adjust value     : 0
  weight           : 95
  mnc in gp enable : 1
  -----------------------
  guarantee credits   : 65408
  excess credits added: 1
  excess credits      : 0
  max credits         : 0
NPC1(test2nxb10-re0.dk vty)#

So in-contract traffic going to the fabric in stream 4 gets 'GL', guaranteed-low, and in-contract traffic going to the fabric in stream 132 gets 'GH', guaranteed-high. So provided the far-end MQ has this stream information, it would be very easy for it to know what to drop on oversubscription: drop 4, not 132. The IChip does not have this, so maybe that's why it was simply guaranteed that there is no oversubscription from the fabric?

It's important to realize that fabric bandwidth is just one issue to think about: the MQ internally has less than the needed 80Gbps (40G fabric, 40G WAN) of memory bandwidth. The actual number depends on cell alignment etc; it can be 80Gbps in good climate, maybe 75Gbps in average climate, or a lot less with manufactured climate.

Some bullet points for TL;DR people

  1. MQ has 32 SerDes towards fabric
  2. Each SerDes is 3.125Gbps gross (account for 64b/66b encoding plus cell overhead and alignment)
  3. MX960 allocates SerDes as 8+8+8 to 3 SCB
  4. MX240/MX480 allocates SerDes as 16+16 to 2 SCB
  5. SCBE upgrades the SerDes to 6.25Gbps
  6. SCBE2 upgrades the SerDes to 16.5Gbps (matches what Avago offers, but I'm not very sure)

2014-03-15

Capture your fancy, part two, Trio

Like with the 7600/PFC3, it is possible to capture transit traffic on Juniper Trio (MPC, MX80, MX104, FPC5 etc). First decide what you know about the packet and convert that data to hex; it can be pretty much anywhere in the first 320B or so of the packet.

[ytti@ytti.fi ~]% pry
[1] pry(main)> '194.100.7.227'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "c26407e3"
[2] pry(main)> '91.198.120.24'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "5bc67818"

I'm boringly using IPv4 addresses, but I could have used anything. Unlike on the PFC3, you do not need to tell the location in the packet where the pattern must occur; you just give the pattern, and any packet having that pattern anywhere triggers. Let's try it:

fisakytt@mec-pe1-re0.hel.fi> start shell pfe network tfeb0 TFEB platform (1000Mhz MPC 8544 processor, 1024MB memory, 512KB flash) TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem enable TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem capture 0x3 5bc67818c26407e3 TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem dump Received 116 byte parcel: Dispatch cookie: 0x0074000000000000 0x00 0x08 0x80 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xaa 0x40 0x00 0x3f 0x01 0x1b 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x8b 0xb8 0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca 0x95 0x0c 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 Sent 111 byte parcel: 0x08 0xbf 0xe0 0x11 0x71 0x00 0x00 0x60 0x80 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00 0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81 0xaa 0x40 0x00 0x3e 0x01 0x1c 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x8b 0xb8 0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca 0x95 0x0c 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 <...> Received 116 byte parcel: Dispatch cookie: 0x0074000000000000 0x00 0x09 0x00 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xcb 0x40 0x00 0x3f 0x01 0x1b 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x47 0xbf 0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16 0x6e 0x03 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 Sent 111 byte parcel: 0x08 0xbf 0xe0 0x12 0x71 0x00 0x00 0x60 0x10 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00 0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81 0xcb 0x40 0x00 0x3e 0x01 0x1c 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x47 0xbf 0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16 0x6e 0x03 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem disable TAZ-TBB-0(mec-pe1-re0.hel.fi vty)#

So the format is test jnh (mq_where_interface_is) packet-via-dmem capture (16_bit_type_mask_in_hex) (up_to_8_byte_pattern_in_hex) (optional_offset_from_start_of_packet). You should use mask 0x3; in my tests bits 1 and 2 are production traffic, bits 10 and 15 are some crap, and the others are just types I don't seem to be using on my boxes.
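
For reference, the whole workflow from the session above boils down to (N being the MQ your interface lives on):

    test jnh N packet-via-dmem enable
    test jnh N packet-via-dmem capture 0x3 <pattern-in-hex> [offset]
    test jnh N packet-via-dmem dump
    test jnh N packet-via-dmem disable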

Unlike on the PFC3, we capture many packets, and the list is constantly updated until you stop the capture. This is very nice when you're not exactly sure what you're looking for and you know your trigger will also match packets you don't care about. We also easily see both the packet received and the packet sent, so we can be sure the traffic is arriving at the box; from the MAC addresses we can determine how and where, and the sent parcel gives us a high degree of confidence that the packet is leaving the box.

I don't know what the dispatch cookie means, nor what the first 6 bytes of the received parcel or the first 13 bytes of the sent parcel are. I'm guessing it is some internal metadata; quickly checking for the MQ stream ID, IX, or IFL and IFD numbers, I can't find a match. But I'm mostly interested in just seeing that the packet came in with the expected DMAC+SMAC and went out with the expected DMAC+SMAC. It would be very useful to be able to at least extract the ingress and egress port information somehow (MQ, IX, port, IFL, IFD, anything).

If you're capturing on a box with multiple MQs and a fabric, you won't see the sent parcel carrying the rewrite information (but you'll see more metadata); you need to jump onto the egress MQ to catch the rewrite information. If you are having trouble reading the hexdump, you can always use 'text2pcap' from wireshark to turn it into a PCAP file and browse it in wireshark. A lot more playing around is needed to understand the parcel types, cookie, metadata and usage in multiple-MQ scenarios.
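
The parcel dump isn't directly in the layout text2pcap wants ('offset byte byte ...'), but massaging it is trivial; a rough Ruby sketch, the script name being mine:

    # parcel2hex.rb: feed it the '0x08 0xbf ...' dump on stdin, e.g.
    #   ruby parcel2hex.rb < parcel.txt | text2pcap - dump.pcap
    bytes = STDIN.read.scan(/0x(\h{2})/).flatten
    bytes.each_slice(16).with_index do |chunk, i|
      puts format('%06x %s', i * 16, chunk.join(' '))
    end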

2014-03-01

Capture your fancy, part one, PFC3

It's often incredibly useful to be able to capture transit traffic: it's a quick way to prove that you're actually receiving some frames, and with any luck to get a good idea of how and where you are sending them. It's unfortunately common, especially on the 7600/6500 PFC3, to hit a bug where packets are not going where the software FIB suggests they are. Luckily there is quite good tooling to inspect what is really happening. So we're taking a peek at 'ELAM'.

We have traffic coming in unlabeled to a 7600 and going out labeled. Let's see how to capture it:

psl2-pe2.hel.fi#show platform capture elam asic superman slot 5 psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 help SEQ_NUM [5] QOS [3] QOS_TYPE [1] TYPE [4] STATUS_BPDU [1] IPO [1] NO_ESTBLS [1] RBH [3] CR [1] TRUSTED [1] NOTIFY_IL [1] NOTIFY_NL [1] DISABLE_NL [1] DISABLE_IL [1] DONT_FWD [1] INDEX_DIRECT [1] DONT_LEARN [1] COND_LEARN [1] BUNDLE_BYPASS [1] QOS_TIC [1] INBAND [1] IGNORE_QOSO [1] IGNORE_QOSI [1] IGNORE_ACLO [1] IGNORE_ACLI [1] PORT_QOS [1] CACHE_CNTRL [2] VLAN [12] SRC_FLOOD [1] SRC_INDEX [19] LEN [16] FORMAT [2] MPLS_EXP [3] REC [1] NO_STATS [1] VPN_INDEX [10] PACKET_TYPE [3] L3_PROTOCOL [4] L3_PT [8] MPLS_TTL [8] SRC_XTAG [4] DEST_XTAG [4] FF [1] MN [1] RF [1] SC [1] CARD_TYPE [4] DMAC [48] SMAC [48] IPVER [1] IP_DF [1] IP_MF [1] IP_HDR_LEN [4] IP_TOS [8] IP_LEN [16] IP_HDR_VALID [1] IP_CHKSUM_VALID [1] IP_L4HDR_VALID [1] IP_OFFSET [13] IP_TTL [8] IP_CHKSUM [16] IP_SA [32] IP_DA [32] IP_DATA [120] TCP_SPORT [16] TCP_DPORT [16] TCP_SEQ [32] TCP_ACK [32] TCP_CODE [6] TCP_CHKSUM [16] UDP_SPORT [16] UDP_DPORT [16] UDP_DATA [88] IGMP_VERS [4] IGMP_TYPE [4] IGMP_DATA [112] ICMP_TYPE [8] ICMP_CODE [8] ICMP_DATA [104] CRC [16] EXTRA_TYPE [4] THR_PKT_HI_NF [14] THR_BYTE_HI_NF [22] THR_PKT_LO_NF [14] THR_BYTE_LO_NF [22] PKT_CNT_NF [14] BYTE_CNT_NF [22] GEMINI_NF_SEG [3] FINRST [1] STATS [1] DUMMY [1] CRC_16 [16] well-know values for field TYPE ETHERNET 0x0 TOKENRING 0x1 FDDI 0x2 ATM 0x3 VOICE 0x4 VIDEO 0x5 INBAND 0x7 well-know values for field FORMAT IP 0x0 IPX 0x1 INBAND 0x2 OTHERS 0x3 well-know values for field PACKET_TYPE ETHERNET 0x0 RAW_IPX 0x1 802_C1 0x2 802_C2 0x3 802_SNAP 0x4 ENCAP_INVALID 0x7 well-know values for field L3_PROTOCOL IPV4 0x0 IPX 0x1 APPLETALK 0x2 DECNET 0x3 VINES 0x4 XNS 0x5 IPV6 0x6 INVALID 0xF well-know values for field L3_PT ICMP 0x1 IGMP 0x2 IP 0x4 TCP 0x6 UDP 0x11 well-know values for field CACHE_CNTRL NORMAL 0x0 RESERVED 0x1 GEMINI_MISS 0x2 DBUS_EXTRA_CICLE 0x3 well-know values for field IPVER IPV4 0x0 IPV6 0x1 psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if IP_SA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24 psl2-pe2.hel.fi#show platform capture elam start psl2-pe2.hel.fi#show platform capture elam status active ELAM info: Slot Cpu Asic Inst Ver PB Elam ---- --- -------- ---- --- -- ---- 5 0 ST_SMAN 0 3.2 Y DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24 elam capture completed psl2-pe2.hel.fi#show platform capture elam data DBUS data: SEQ_NUM [5] = 0x9 QOS [3] = 1 QOS_TYPE [1] = 0 TYPE [4] = 0 [ETHERNET] STATUS_BPDU [1] = 0 IPO [1] = 1 NO_ESTBLS [1] = 0 RBH [3] = b000 CR [1] = 0 TRUSTED [1] = 0 NOTIFY_IL [1] = 0 NOTIFY_NL [1] = 0 DISABLE_NL [1] = 0 DISABLE_IL [1] = 0 DONT_FWD [1] = 0 INDEX_DIRECT [1] = 0 DONT_LEARN [1] = 0 COND_LEARN [1] = 0 BUNDLE_BYPASS [1] = 0 QOS_TIC [1] = 0 INBAND [1] = 0 IGNORE_QOSO [1] = 0 IGNORE_QOSI [1] = 0 IGNORE_ACLO [1] = 0 IGNORE_ACLI [1] = 0 PORT_QOS [1] = 0 CACHE_CNTRL [2] = 0 [NORMAL] VLAN [12] = 3510 SRC_FLOOD [1] = 0 SRC_INDEX [19] = 0x7F LEN [16] = 102 FORMAT [2] = 0 [IP] MPLS_EXP [3] = 0x0 REC [1] = 0 NO_STATS [1] = 0 VPN_INDEX [10] = 0x3 PACKET_TYPE [3] = 0 [ETHERNET] L3_PROTOCOL [4] = 0 [IPV4] L3_PT [8] = 1 [ICMP] MPLS_TTL [8] = 0 SRC_XTAG [4] = 0x8 DEST_XTAG [4] = 0x0 FF [1] = 0 MN [1] = 0 RF [1] = 0 SC [1] = 0 CARD_TYPE [4] = 0x0 DMAC = 0027.0dc9.8800 SMAC = 5c5e.abee.d320 IPVER [1] = 0 
[IPV4] IP_DF [1] = 1 IP_MF [1] = 0 IP_HDR_LEN [4] = 5 IP_TOS [8] = 0x0 IP_LEN [16] = 84 IP_HDR_VALID [1] = 1 IP_CHKSUM_VALID [1] = 1 IP_L4HDR_VALID [1] = 1 IP_OFFSET [13] = 0 IP_TTL [8] = 63 IP_CHKSUM [16] = 0x44BA IP_SA = 91.198.120.24 IP_DA = 194.100.7.227 ICMP_TYPE [8] = 0x8 ICMP_CODE [8] = 0x0 ICMP_DATA [104] 0000: 61 FD 10 EA 15 39 1B AD 11 53 00 78 0C "a....9...S.x." CRC [16] = 0x8310 RBUS data: SEQ_NUM [5] = 0x9 CCC [3] = b100 [L3_RW] CAP1 [1] = 0 CAP2 [1] = 0 QOS [3] = 0 EGRESS [1] = 0 DT [1] = 1 [GENERIC] TL [1] = 0 [B32] FLOOD [1] = 1 DEST_INDEX [19] = 0xC60 VLAN [12] = 3168 RBH [3] = b001 RDT [1] = 0 GENERIC [1] = 0 EXTRA_CICLE [1] = 0 FABRIC_PRIO [1] = 0 L2 [1] = 0 FCS1 [8] = 0x1 DELTA_LEN [8] = 4 REWRITE_INFO i0 - replace bytes from ofs 0 to ofs 13 with seq 'B0 A8 6E 7C 62 42 00 27 0D C9 88 00 88 47'. insert seq '49 58 01 FF' before ofs 14. FCS2 [8] = 0x0 psl2-pe2.hel.fi#

Hooray, great success. We can only capture 1 packet: the first packet hitting the trigger criteria completes the capture, and nothing else is seen. We see 'DBUS' and 'RBUS' data. DBUS is where the lookup engine gets the lookup question from the linecard; the question is 64B (IPv4) or 96B (MPLS, IPv6). DBUS runs at 62.5MHz, carrying 32B per cycle, so 62.5/2 = 31.25Mpps (IPv4) and 62.5/3 = 20.83Mpps (MPLS, IPv6) are pretty much our absolute theoretical limits on how much work we can give the lookup engine; at that point the DBUS is congested and we cannot send more queries to it. RBUS is where the lookup engine sends the lookup result back to the linecard; by looking at this data, the linecard knows where to send the frame and how to rewrite the packet headers.
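
The same arithmetic spelled out:

    clock_mpps = 62.5      # DBUS cycles per microsecond, one 32B slot each
    puts clock_mpps / 2    # => 31.25 Mpps, 64B IPv4 questions
    puts clock_mpps / 3    # => ~20.83 Mpps, 96B MPLS/IPv6 questions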

In DBUS we can see the incoming VLAN (typically an internal VLAN), whether it was recirculated (CR), QOS, IP, port, ICMP, etc. But it's usually the RBUS which is more interesting. CCC here is 'L3_RW'; if it were something like 'L2_POLICER', the packet wasn't forwarded but punted, usually due to 'mls rate-limit'. DEST_INDEX+VLAN tell us where the packet is going; here DEST_INDEX is just the VLAN, but it could be some special adjacency too, which we can check with 'remote command switch show platform hardware tycho register 0 1794 | XXXX'. Say it was a CoPP drop: RBUS would likely show 7FFF. Checking, there are several register numbers which have value 7FFF, and I don't know how to tell which one was used; luckily the registers are named and the names are usually very indicative.

There is an interesting trick we can do for packets hitting those special adjacencies. Say RBUS gave us '7F0A' and we guessed it's really '0x03C7: RED_PKT_ERR_IDX = 0x00007F0A', or RED_PKT_ERR_IDX. Now we can reprogram 0x3c7 to send out to some physical port, instead of 0x7f0a which is going to end up dropped. You might want to do this to figure out more closely what was wrong in some dropped packet, or maybe just to check what your CoPP policies are dropping. To reprogram the value, we need to know how ports are numbered in HW. Each port has an 'LTL Index', calculated as (SlotN-1)*64 + (PortN-1), so giga5/2 would be 4*64 + 1, or 257. Then we could poke the register with 'remote command switch show platform hardware tycho poke 3c7 257' and the 3c7 register would be diverted to giga5/2 instead of the original 0x7f0a special drop adjacency. You can repoke it to 0x7f0a when satisfied.
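If you don't want to do the LTL arithmetic in your head, here's a throwaway Python helper (the function name is mine, not a Cisco API):

# LTL index for slotN/portN: (slot-1)*64 + (port-1)
def ltl_index(slot, port):
    return (slot - 1) * 64 + (port - 1)

print(ltl_index(5, 2))       # giga5/2 -> 257, the value to poke into 0x3c7
print(hex(ltl_index(5, 2)))  # 0x101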

RBH is the hash result for load-balancing. And then, maybe most importantly, we have REWRITE_INFO; this tells us with high confidence that the packet was sent out, and we can see whether it just got a new DMAC+SMAC, or whether it also got a VLAN or MPLS header, or both. Here we can clearly see etype 0x8847, so it's going out as MPLS with label stack entry 0x495801FF: label value 0x49580, cos+bos 0x1 (so cos 0, bos on), ttl 0xff. We can confirm that the REWRITE_INFO matches what we expect (a quick decode of the label stack entry follows the output below):

psl2-pe2.hel.fi#sh mls cef lookup 194.100.7.227 detail
Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit
       D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel
       V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1
       RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select
Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix)
Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix)
M(121242 ): E | 1 FFF 0 0 0 0 255.255.255.240
V(121242 ): 8 | 1 0   0 0 0 0 194.100.7.224     (A:165858 ,P:1,D:0,m:0 ,B:0 )
M(121242 ): E | 1 FFF 0 0 0 255.255.255.240
V(121242 ): 9 | 1 0   0 0 0 194.100.7.224       (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 )
psl2-pe2.hel.fi#sh mls cef adjacency entry 165858 detail
Index: 165858  smac: 0027.0dc9.8800, dmac: b0a8.6e7c.6242
               mtu: 9196, vlan: 3168, dindex: 0x23BC4, l3rw_vld: 1
               format: MPLS, flags: 0x1000208518
               label0: 0, exp: 0, ovr: 0
               label1: 0, exp: 0, ovr: 0
               label2: 300416, exp: 0, ovr: 0
               op: PUSH_LABEL2
               packets: 1, bytes: 80
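As a cross-check on the label arithmetic, a small Python sketch (my own, nothing Cisco-specific) decoding the 4-byte label stack entry '49 58 01 FF' from REWRITE_INFO, using the standard MPLS label/exp/bos/ttl bit layout:

def decode_mpls_lse(entry):
    return {"label": entry >> 12,         # 20 bits
            "exp":   (entry >> 9) & 0x7,  # 3 bits
            "bos":   (entry >> 8) & 0x1,  # 1 bit
            "ttl":   entry & 0xFF}        # 8 bits

print(decode_mpls_lse(0x495801FF))
# {'label': 300416, 'exp': 0, 'bos': 1, 'ttl': 255}
# 300416 matches 'label2: 300416' in the adjacency above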

Same MAC addresses, egress VLAN and label value, so everything is fine here. Now what if we want to capture the packet coming in from the MPLS side? That is quite a bit harder, as we cannot use any syntactic-sugar trigger values. But ELAM allows you to match on any bit and to mask any bit, so it's just a matter of patience, or of a little script, to build a proper capture filter for MPLS packets too (see the sketch below). We should figure out as much as we can about how the packet is coming in, especially how many labels it has (as that affects our offsets); the matching begins from the DMAC and continues as on the wire. We could match SMAC+DMAC+ETYPE+dont_care+SADDR+DADDR etc. Right now I'm lazy and I'll just match the MPLS etype and the DADDR: we were seeing SADDR 91.198.120.24 going towards the MPLS network, so let's see if we can find DADDR 91.198.120.24 (5b c6 78 18) coming from the MPLS network. Here it is beneficial to have a cursory understanding of the ethernet + MPLS + IP headers; especially, you want to know the offset to the bits you're interested in. The ELAM trigger is specified in groups of 4 bytes, and a group you don't care about at all can be compressed to a plain 0.
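This is roughly what such a script could look like: a Python sketch (my own helper; name and interface are made up) that turns 'match these bytes at this offset from the DMAC' into the 4-byte DATA/mask word pairs the trigger syntax wants:

def elam_words(matches, n_words=10):
    # matches: list of (byte_offset_from_dmac, bytes_to_match)
    data = [0] * n_words
    mask = [0] * n_words
    for offset, value in matches:
        for i, b in enumerate(value):
            word, byte_pos = divmod(offset + i, 4)
            shift = 24 - 8 * byte_pos
            data[word] |= b << shift
            mask[word] |= 0xFF << shift
    return data, mask

# etype 0x8847 at bytes 12-13; inner DADDR 91.198.120.24 at bytes 34-37
# (14B ethernet + 4B single label + 16B into the IP header)
data, mask = elam_words([(12, bytes.fromhex("8847")),
                         (34, bytes([91, 198, 120, 24]))])
print(" ".join("0x%08x" % w for w in data))
print(" ".join("0x%08x" % w for w in mask))
# data: ... 0x88470000 ... 0x00005bc6 0x78180000 -- matches the trigger below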

psl2-pe2.hel.fi#show platform capture elam trigger dbus others if DATA=0 0 0 0x88470000 0 0 0 0 0x00005bc6 0x78180000 [ 0 0 0 0xffff0000 0 0 0 0 0x0000ffff 0xffff0000 ] VLAN=3168
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu   Asic Inst Ver PB Elam
---- --- -------- ---- --- -- ----
   5   0  ST_SMAN    0 3.2  Y
DBUS trigger: FORMAT=OTHERS DATA=0 0 0 0X88470000 0 0 0 0 0X00005BC6 0X78180000 [ 0 0 0 0XFFFF0000 0 0 0 0 0X0000FFFF 0XFFFF0000 ] VLAN=3168
psl2-pe2.hel.fi#show platform capture elam data
DBUS data:
SEQ_NUM [5] = 0x12  QOS [3] = 0  QOS_TYPE [1] = 0  TYPE [4] = 0 [ETHERNET]
STATUS_BPDU [1] = 0  IPO [1] = 1  NO_ESTBLS [1] = 0  RBH [3] = b000  CR [1] = 0
TRUSTED [1] = 1  NOTIFY_IL [1] = 0  NOTIFY_NL [1] = 0  DISABLE_NL [1] = 0  DISABLE_IL [1] = 0
DONT_FWD [1] = 0  INDEX_DIRECT [1] = 0  DONT_LEARN [1] = 0  COND_LEARN [1] = 0  BUNDLE_BYPASS [1] = 0
QOS_TIC [1] = 1  INBAND [1] = 0  IGNORE_QOSO [1] = 0  IGNORE_QOSI [1] = 0  IGNORE_ACLO [1] = 0
IGNORE_ACLI [1] = 0  PORT_QOS [1] = 0  CACHE_CNTRL [2] = 0 [NORMAL]  VLAN [12] = 3168
SRC_FLOOD [1] = 0  SRC_INDEX [19] = 0xC1  LEN [16] = 106  FORMAT [2] = 3 [OTHERS]
PACKET_TYPE [3] = 0 [ETHERNET]  L3_PROTOCOL [4] = 15 [INVALID]  L3_PT [8] = 0
FF [1] = 0  MN [1] = 0  RF [1] = 0  SC [1] = 0  CARD_TYPE [4] = 0x0  ISL [16] = 0x0
DATA [592]
0000: 00 27 0D C9 88 00 B0 A8 6E 7C 62 42 88 47 00 00  ".'......n|bB.G.."
0010: 03 FE 45 00 00 54 A6 E8 00 00 3F 01 36 9B C2 64  "..E..T....?.6..d"
0020: 07 E3 5B C6 78 18 00 00 A3 FD 12 28 0A E5 2F B9  "..[.x......(../."
0030: 11 53 00 00 00 00 CF BB 01 00 00 00 00 00 42 42  ".S............BB"
0040: 42 42 42 42 42 42 42 42 42 42                    "BBBBBBBBBB"
CRC [16] = 0x8184
RBUS data:
SEQ_NUM [5] = 0x12  CCC [3] = b100 [L3_RW]  CAP1 [1] = 0  CAP2 [1] = 0  QOS [3] = 1  EGRESS [1] = 0
DT [1] = 0 [IP]  TL [1] = 1 [B64]  FLOOD [1] = 0  DEST_INDEX [19] = 0x7FFA  VLAN [12] = 4012
RBH [3] = b011  RDT [1] = 1  GENERIC [1] = 0  EXTRA_CICLE [1] = 0  FABRIC_PRIO [1] = 0  L2 [1] = 0
FCS1 [8] = 0x1  IP_TOS_VALID [1] = 0  IP_TOS_OFS [7] = 19  IP_TOS [8] = 0x0
IP_TTL_VALID [1] = 1  IP_TTL_OFS [7] = 26  IP_TTL [8] = 63
IP_CSUM_VALID [1] = 1  IP_CSUM_OFS [7] = 28  IP_CSUM [16] = 0x369B
DELTA_LEN [8] = 252
REWRITE_INFO
i0 - replace bytes from ofs 0 to ofs 13 with seq '00 27 0D C9 88 00 00 00 00 00 00 00 08 00'.
     remove bytes from ofs 14 to ofs 17.
FCS2 [8] = 0x0

Faaantastic. We have the expected DADDR+SADDR, it's ICMP, and I even set the payload to 0x42 just to be clear. But hmm, DEST_INDEX 0x7FFA does not look good, that's something special, and a cursory look at the REWRITE_INFO makes me dubious. Let's dig around a bit.

psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail
Index: 67379  smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320
              mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1
              format: MAC_TCP, flags: 0x2000208408
              delta_seq: 0, delta_ack: 0
              packets: 986, bytes: 100544
psl2-pe2.hel.fi#sh mls cef lookup 91.198.120.24 detail
Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit
       D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel
       V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1
       RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select
Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix)
Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix)
M(159300 ): E | 1 FFF 0 0 0 0 255.255.255.0
V(159300 ): 8 | 1 0   0 0 0 0 91.198.120.0      (A:67379 ,P:1,D:0,m:0 ,B:0 )
M(159300 ): E | 1 FFF 0 0 0 255.255.255.0
V(159300 ): 9 | 1 0   0 0 0 91.198.120.0        (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 )
psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail
Index: 67379  smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320
              mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1
              format: MAC_TCP, flags: 0x2000208408
              delta_seq: 0, delta_ack: 0
              packets: 203, bytes: 20666

Yeah, definitely we're not rewriting to those MAC addresses, and definitely the VLAN in the RBUS is not the expected one. But what are that VLAN and that DEST_INDEX?

psl2-pe2.hel.fi#remote command switch show platform hardware tycho register 0 1794 | i 7FFA
0x03C1: RED_CEN_RW_IDX    = 0x00007FFA [32762 ]
0x03C3: RED_IN_RW_FF_IDX  = 0x00007FFA [32762 ]
psl2-pe2.hel.fi#show vlan internal usage | i 4012
4012 vrf_0_vlan

Aaah, it's a recirculation. We're putting it into VRF 0 (INET) and sending 'er right back. Notice how we're changing the MAC addresses and popping the MPLS label on rewrite. So, as the MPLS label is gone, we can again capture the packet with the fancy triggers; just to be sure we're capturing the recirculated copy, we'll also set CR=1 (if it needs yet another recirculation, too bad, there may not be any way to see the 3rd pass).

psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if CR=1 IP_DA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu   Asic Inst Ver PB Elam
---- --- -------- ---- --- -- ----
   5   0  ST_SMAN    0 3.2  Y
DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 CR=1 IP_DA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam start
psl2-pe2.hel.fi#show platform capture elam data
DBUS data:
SEQ_NUM [5] = 0x7  QOS [3] = 1  QOS_TYPE [1] = 0  TYPE [4] = 0 [ETHERNET]
STATUS_BPDU [1] = 0  IPO [1] = 1  NO_ESTBLS [1] = 0  RBH [3] = b000  CR [1] = 1
TRUSTED [1] = 1  NOTIFY_IL [1] = 0  NOTIFY_NL [1] = 0  DISABLE_NL [1] = 0  DISABLE_IL [1] = 0
DONT_FWD [1] = 0  INDEX_DIRECT [1] = 0  DONT_LEARN [1] = 0  COND_LEARN [1] = 0  BUNDLE_BYPASS [1] = 0
QOS_TIC [1] = 1  INBAND [1] = 0  IGNORE_QOSO [1] = 0  IGNORE_QOSI [1] = 0  IGNORE_ACLO [1] = 0
IGNORE_ACLI [1] = 0  PORT_QOS [1] = 0  CACHE_CNTRL [2] = 0 [NORMAL]  VLAN [12] = 4012
SRC_FLOOD [1] = 0  SRC_INDEX [19] = 0xC1  LEN [16] = 102  FORMAT [2] = 0 [IP]
MPLS_EXP [3] = 0x0  REC [1] = 0  NO_STATS [1] = 0  VPN_INDEX [10] = 0x7F
PACKET_TYPE [3] = 0 [ETHERNET]  L3_PROTOCOL [4] = 0 [IPV4]  L3_PT [8] = 1 [ICMP]  MPLS_TTL [8] = 0
SRC_XTAG [4] = 0xF  DEST_XTAG [4] = 0xA  FF [1] = 0  MN [1] = 0  RF [1] = 1  SC [1] = 0  CARD_TYPE [4] = 0x0
DMAC = 0027.0dc9.8800  SMAC = 0000.0000.0000
IPVER [1] = 0 [IPV4]  IP_DF [1] = 0  IP_MF [1] = 0  IP_HDR_LEN [4] = 5  IP_TOS [8] = 0x0  IP_LEN [16] = 84
IP_HDR_VALID [1] = 1  IP_CHKSUM_VALID [1] = 1  IP_L4HDR_VALID [1] = 1  IP_OFFSET [13] = 0  IP_TTL [8] = 63
IP_CHKSUM [16] = 0x8774  IP_SA = 194.100.7.227  IP_DA = 91.198.120.24
ICMP_TYPE [8] = 0x0  ICMP_CODE [8] = 0x0
ICMP_DATA [104]
0000: 89 06 13 36 17 1E 5A C3 11 53 00 B0 61  "...6..Z..S..a"
CRC [16] = 0x136E
RBUS data:
SEQ_NUM [5] = 0x7  CCC [3] = b100 [L3_RW]  CAP1 [1] = 0  CAP2 [1] = 0  QOS [3] = 0  EGRESS [1] = 0
DT [1] = 0 [IP]  TL [1] = 0 [B32]  FLOOD [1] = 1  DEST_INDEX [19] = 0xDB6  VLAN [12] = 3510
RBH [3] = b010  RDT [1] = 0  GENERIC [1] = 0  EXTRA_CICLE [1] = 0  FABRIC_PRIO [1] = 0  L2 [1] = 0
FCS1 [8] = 0x1  IP_TOS_VALID [1] = 0  IP_TOS_OFS [7] = 15  IP_TOS [8] = 0x0
IP_TTL_VALID [1] = 1  IP_TTL_OFS [7] = 22  IP_TTL [8] = 62
IP_CSUM_VALID [1] = 1  IP_CSUM_OFS [7] = 24  IP_CSUM [16] = 0x8874
DELTA_LEN [8] = 0
REWRITE_INFO
i0 - replace bytes from ofs 0 to ofs 11 with seq '5C 5E AB EE D3 20 00 27 0D C9 88 00'.
FCS2 [8] = 0x0

Greeaaat! We have the expected rewrite info and VLAN, so we have high confidence in both directions of the packet, and we've learned that the packets coming from the MPLS side are getting recirculated (it's because of EXP0, but that's another story; there are many reasons why it might happen).

2014-02-25

JunOS and ARP Glean

I'm using the Cisco vocabulary 'glean' here, as I don't know a better word for it. A glean packet is any IPv4 packet heading to a connected host which is not yet resolved. It is NOT an ARP packet, so ARP policers won't help you. These packets are punted, since the router needs to generate an ARP query and try to resolve the destination.

In the 7600 we can use 'mls rate-limit unicast cef glean 200 50' to limit how many packets per second are punted to the control-plane for glean purposes. How can we limit this in JunOS? As far as I can see, there is no way. But I remember testing this attack and being unable to break an MX80, so why didn't it break?

First, let's check what a connected network looks like:

fisakytt@test13nqa1-re0.dk> show route forwarding-table destination 62.236.255.179/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
62.236.255.0/24    intf     0                    rslv   828     1 xe-0/0/0.42

Ok, fair enough. Type 'rslv', which we can guess means the packet is punted to the control-plane for ARP resolution. Let's rapidly ping some address which does not resolve and check what it looks like:

fisakytt@test13nqa1-re0.dk> show route forwarding-table destination 62.236.255.179/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
62.236.255.179/32  dest     0 62.236.255.179     hold   830     1 xe-0/0/0.42

Interesting, we get a host/32 'hold' entry for it, no longer just the whole connected network. The hold adjacency is programmed by software, so between the 1st packet coming in and the hold adjacency being programmed, all packets in between are punted as well. I'm a bit optimistic now that maybe there is some intelligence here. Maybe we'll get more information by checking the index (it changes quite often, so try to catch it in time):

MX104-ABB-0(test13nqa1-re0.dk vty)# show nhdb id 830 extensive
   ID      Type      Interface    Next Hop Addr    Protocol       Encap     MTU               Flags  PFE internal Flags
-----  --------  -------------  ---------------  ----------  ------------ -----  ------------------  ------------------
  830      Hold    xe-0/0/0.42                -        IPv4      Ethernet     0  0x0000000000000000  0x0000000000000000
BFD Session Id: 0
Hold Status
  Packet: 0x4499d788
  Timer: Running
Flags: 0x0000000000000000

Well, not too interesting. I didn't try to figure out what 'Packet' means yet; maybe it's the memory location of the packet which caused the hold, maybe it's timer information. But the curious thing is that it's timed, so we know it'll go away automatically after some time. But how can we figure out what happens to hold and rslv packets? Maybe they are exceptions?

MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh 0 exceptions terse
Reason                             Type        Packets      Bytes
==================================================================
Routing
----------------------
hold route                       DISC(70)         6339     532476
resolve route                    PUNT(33)          175      14700

Yup, the hold routes are discarded. I'm sending 10pps and we're punting 0.27 packets per second, from which we can extrapolate the hold timer being about 3.7s; so one packet per 3.7s is punted (arithmetic below). If I had been sending a lot more packets, I'd see a lot more punts; 10pps is just a very slow rate, so the 'hold' adjacency drops the rest of them. Also, if we have very many addresses, this can add up to a good number of packets. I might try to walk through the PUNT(33) path to see whether there is a policer attached to it or not (there should be, but in certain situations it will be bypassed; this will be fixed, and hopefully at the same time we'll get ddos-protection policers for resolve packets).
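The extrapolation, spelled out in Python (counter values are from the output above; 10pps is my send rate):

discards, punts, pps = 6339, 175, 10
seconds = (discards + punts) / float(pps)  # how long we've been sending
punt_rate = punts / seconds
print("%.2f punts/s -> hold timer ~%.1fs" % (punt_rate, 1 / punt_rate))
# 0.27 punts/s -> hold timer ~3.7s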

Just to be sure we've interpreted all this correctly, let's check that DISC(70) is seeing our ping packets to 62.236.255.179:

MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions 70 discard
MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions-trace
MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh exceptions-trace
[430247] jnh_exception_packet_trace: ###############
[430248] jnh_exception_packet_trace: [iif:338,code/info:198/0x273,score:(0x0),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:102]
[430249] jnh_exception_packet_trace: 0x00: 20 00 c6 00 27 30 01 52 00 12 00 66 80 00 00 20
[430250] jnh_exception_packet_trace: 0x10: 12 00 00 12 00 10 00 00 00 3c 8a b0 8d a6 4a 28
[430251] jnh_exception_packet_trace: 0x20: c0 da 07 c1 c8 88 47 00 00 0d 3e 45 c0 00 54 06
[430252] jnh_exception_packet_trace: 0x30: 62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3 08
[430253] jnh_exception_packet_trace: 0x40: 00 99 6b 3a c5 16 07 53 0c c7 ed 00 00 07 cb 08
[430254] jnh_exception_packet_trace: 0x50: 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18
[430255] jnh_exception_packet_trace: 0x60: 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28
[430256] jnh_exception_packet_trace: 0x70: 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37

3e ec ff b3 => 62.236.255.179, yup, that's our ping packet being discarded due to the 'hold' adjacency.
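Verifying the address decode is a Python one-liner:

print(".".join(str(b) for b in bytes.fromhex("3eecffb3")))
# 62.236.255.179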

2014-02-24

JunOS 'L3 incompletes', what and why?

There is quite often chatter about L3 incompletes, and it seems there are a lot of opinions on what they are. Maybe some of these opinions are based on some particular counter bug in some release. Juniper has also introduced a toggle to stop the counter from working. It seems very silly to use this toggle, as the counter is really one of the few ways you can gather information about broken packets via SNMP.

What they (at least) are not

  • Unknown unicast
  • CDP
  • BPDU
  • Packet from connected host which does not ARP
  • Packet from unconfigured VLAN

What they (at least) are

  • IP header checksum error
  • IP header error (impossibly small IHL, IP version 3, etc)
  • IP header size does not match packet size

Troubleshooting

So if you are seeing them, what can you do? As it is an aggregate counter for many different issues, how do you actually know which one it is, and is there a way to figure out who is sending them? Luckily, for Trio-based platforms the answers are highly encouraging: we have very good tools to troubleshoot the issue.

To figure out what exactly they are, first you need to find your internal IFD index (not the SNMP ifIndex):

im@ruuter> show interfaces xe-7/0/0 |match index:
  Interface index: 224, SNMP ifIndex: 586

After figuring out the index, we can log in to the PFE and check the stream counters for that IFD:

im@ruuter> start shell pfe network fpc7
NPC platform (1067Mhz MPC 8548 processor, 2048MB memory, 512KB flash)
NPC7(ruuter vty)# show jnh ifd 224 stream
ifd = 224, Stream = 33
Stream ID: 33 (inst = 0)
Cntr : 0x00c0f102
Encap : Ether
Encap = 0, StartNH = 0xc040e1
lacp:+, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, dc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001
Input Statistics:
  0003126353191368 pkts, 3351074223070319 bytes
Detail Statistics:
  rx0:                0000000000000000 pkts, 0000000000000000 bytes
  rx1:                0000007792865413 pkts, 0000923636240746 bytes
  rx2:                0003118560325955 pkts, 3350150586829573 bytes
  drop0:              0000000000000000 pkts, 0000000000000000 bytes
  drop1:              0000000000000000 pkts, 0000000000000000 bytes
  drop2:              0000000000000000 pkts, 0000000000000000 bytes
  unknown-iif:        0000000000000000 pkts, 0000000000000000 bytes
  checksum:           0000000000625225 pkts, 0000000268883747 bytes
  unknown-proto:      0000000000024793 pkts, 0000000006398918 bytes
  bad-ucastmac:       0000000218713670 pkts, 0000034352327467 bytes
  bad-ucastmac-IPv6:  0000000002160892 pkts, 0000000172764339 bytes
  bad-smac:           0000000000000000 pkts, 0000000000000000 bytes
  in-stp:             0000000000000000 pkts, 0000000000000000 bytes
  out-stp:            0000000000000000 pkts, 0000000000000000 bytes
  vlan-check:         0000000000000000 pkts, 0000000000000000 bytes
  frame-errors:       0000000000000108 pkts, 0000000000014451 bytes
  bad-IPv4-hdr:       0000000000033339 pkts, 0000000012708126 bytes
  bad-IPv4-len:       0000000000070901 pkts, 0000000025836710 bytes
  bad-IPv6-hdr:       0000000000000133 pkts, 0000000000009508 bytes
  bad-IPv6-len:       0000000000000993 pkts, 0000000000071269 bytes
  out-mtu-errors:     0000000000003391 pkts, 0000000005122005 bytes
  L4-len:             0000000000038084 pkts, 0000000001765247 bytes
Stream Features:
Topology:
  stream-(33)
    Flavor: i-root (1), Refcount 0, Flags 0x1
    Addr: 0x4513f3c8, Next: 0x4fdd3c78, Context 0x4513f3c0
    Link 0: da40602e:32000303, Offset 12, Next: da40602e:32000303
    Link 1: 00000000:00000000, Offset 12, Next: 00000000:00000000
    Link 2: 00000000:00000000, Offset 12, Next: 00000000:00000000
    Link 3: 00000000:00000000, Offset 12, Next: 00000000:00000000
Topology Neighbors:
  [none]-> stream-(33)-> flist-master(stream)
Feature List: stream
  [pfe-0]: 0xda40602e32000303; f_mask:0x80000000000000; c_mask:0x8000000000000000; f_num:9; c_num:1, inst:0
  Idx#8 iif-lookup:
    [pfe-0]: 0xda40602e32000303

Here we can see 'checksum', 'bad-IPvX-hdr' and 'bad-IPvX-len'; at least all of these are 'L3 incompletes'. There may be other reasons, but that's the absolute minimum. We can also see aggregate counters for all the interfaces on a given Trio; we'll need some of this information later:

NPC7(ruuter vty)# show jnh 0 exceptions terse
Reason                             Type        Packets      Bytes
==================================================================
PFE State Invalid
----------------------
sw error                         DISC(64)     197636729  13174899216
invalid fabric token             DISC(75)            68         4311
unknown family                   DISC(73)         24793      6398918
iif down                         DISC(87)          4516       337076
egress pfe unspecified           DISC(19)       5857595   1900968530
Packet Exceptions
----------------------
bad ipv4 hdr checksum            DISC( 2)        660667    289608849
bad IPv6 options pkt             DISC( 9)             3          216
bad IPv4 hdr                     DISC(11)         33339     12708126
bad IPv6 hdr                     DISC(56)           133         9508
bad IPv4 pkt len                 DISC(12)        108203     33978274
bad IPv6 pkt len                 DISC(57)          1009        72421
L4 len too short                 DISC(13)        143678      6622571
frag needed but DF set           DISC(22)         21915     33137575
ttl expired                      PUNT( 1)      51770371   3124910479
IP options                       PUNT( 2)           777       108006
frame format error               DISC( 0)           108        14451
my-mac check failed              DISC(28)     218721556  34352693971
my-mac check failed IPv6         DISC(58)       2161073    172779483
DDOS policer violation notifs    PUNT(15)       2438770    326360620
Firewall
----------------------
firewall discard                 DISC(67)    1284437202 544450843517
firewall discard V6              DISC(101)     34130853   4143870718
Routing
----------------------
discard route                    DISC(66)    3740477632 552767152133
discard route IPv6               DISC(102)   3894436247 281512808097
hold route                       DISC(70)           471        35151
resolve route                    PUNT(33)            10          776
resolve route V6                 PUNT(69)           818        63097
control pkt punt via nh          PUNT(34)     993912636  45817932600
host route                       PUNT(32)     228855708  19337408523
mcast host copy                  PUNT( 6)          2591       422909
reject route                     PUNT(40)       2855554    402183663
reject route V6                  PUNT(68)          9277      1901206

The counters for L3 incompletes are 'bad ipv4 hdr checksum', 'bad IPvX hdr' and 'bad IPvX pkt len'. Notice how there is no IPv6 hdr checksum counter, obviously because IPv6 does not have one; it was deemed unnecessary, but we'll shortly see that this may have been a bad decision.

Now why could we possibly see L3 incompletes increasing? If a frame is mangled, the ethernet CRC fails (which is a much better check than the IP checksum) and we never even get to these checks; we drop the frame much earlier. So clearly we received a packet which had a correct ethernet CRC yet was broken. We recently had an issue where pretty much all egress PE boxes started logging 'L3 incompletes' because the IPv4 header checksum was failing; they incremented maybe 20 times per hour, so very moderately. But how is this possible? If someone generated a broken IP packet and sent it to us, we'd drop it at the ingress PE box while incrementing these counters, yet the packets traversed the MPLS core all the way to the egress PE. So clearly we were mangling them ourselves. Obviously the core is like Jon Snow: it's just an MPLS frame to it, it does not need to know it's IP, nor should it verify it for correctness, so the core will happily pass broken packets around. It seems like a complex problem to figure out who is mangling the packets. Luckily Trio gives us the ability to capture exception packets; here we need to use the exception number we saw above in the exception counters:

NPC7(ruuter vty)# debug jnh exceptions 2 discard
NPC7(ruuter vty)# debug jnh exceptions-trace
NPC7(ruuter vty)# show jnh exceptions-trace
[1768975] jnh_exception_packet_trace: ###############
[1768976] jnh_exception_packet_trace: [iif:344,code/info:130/0x0,score:tcp|(0x40),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:60]
[1768977] jnh_exception_packet_trace: 0x00: 20 40 82 00 00 00 01 58 00 12 00 3c 80 00 00 20
[1768978] jnh_exception_packet_trace: 0x10: 12 00 00 3c 00 00 00 00 00 28 c0 da 07 c0 00 00
[1768979] jnh_exception_packet_trace: 0x20: 12 1e d5 97 f8 88 47 00 00 03 3d 45 00 00 28 1e
[1768980] jnh_exception_packet_trace: 0x30: 4f 40 00 87 06 44 d4 XX XX 39 58 XX XX 3f 6a d0
[1768981] jnh_exception_packet_trace: 0x40: c0 e2 82 4f 0d a2 2d cc ec aa 6b fd 78 0f 10 22
[1768982] jnh_exception_packet_trace: 0x50: 60 cd 42 00 00

It should be a lot simpler for us now to troubleshoot the issue: we just figure out where that SADDR (XX XX 39 58) enters the network. With any luck, the mangling node is somewhere at the edge of the network and we will find some common theme in the source addresses. If it's in the core, you're pretty much out of luck; you'll replace the whole network or accept that you mangle something. I wrote a little script to which you can give the IP header, the incorrect checksum and the correct checksum, and it'll show you what each 16b field should be to result in the correct checksum (a sketch of the idea is below). With luck you can use it to figure out which part of the packet is being mangled, as some of those 16b fields are bound to require impossible/invalid values to yield a correct checksum, which allows you to exclude them and concentrate on the rest of the fields. But a field like ID can be anything, so it's impossible to exclude. SADDR is also problematic; DADDR usually is not (if DADDR had been the mangled value, the packet would not have been routed to my network).
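Here's a minimal Python sketch of the idea behind that script (a reimplementation from memory, not the original; it derives everything from the received header itself). The internet checksum is a one's-complement sum, so if the header's total sum falls short of 'valid' by some delta, adding that delta to any single 16b field would make the checksum come out right:

def ones_add(a, b):
    # one's-complement addition with end-around carry
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def fix_candidates(header):
    # header: the received IPv4 header bytes, bad checksum included;
    # field names assume a plain 20B header (IHL=5, no options)
    words = [int.from_bytes(header[i:i+2], "big") for i in range(0, len(header), 2)]
    total = 0
    for w in words:
        total = ones_add(total, w)
    delta = 0xFFFF - total  # how far the sum is from 'valid' (0xFFFF)
    names = ["ver/ihl/tos", "total len", "id", "flags/frag", "ttl/proto",
             "checksum", "saddr hi", "saddr lo", "daddr hi", "daddr lo"]
    for name, w in zip(names, words):
        print("%-11s 0x%04x -> 0x%04x" % (name, w, ones_add(w, delta)))

Fields whose 'fixed' value would be impossible (version not 4, total length not matching the frame, a DADDR that isn't yours) can be ruled out as the mangled spot; fields like ID cannot.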

Why does this happen? How often does it happen? I really would like to know. Obviously, as the CRC is correct, it's not happening because of errors in links/optics/etc. My guess is this happens mostly because of bad memory in the forwarding logic. A packet can touch many memories on its path, ring => sram => dram => sram => ring; do all of these have ECC? Does the PHY ring ever have ECC? I don't know, but clearly there can be problems, as L3 incompletes exist. How common are they? Probably a lot more common than we think, as we only learn about mangling when it happens to hit the IPv4 header (IPv6 header mangling would usually pass unnoticed). The IPv4 header is 20B and a packet is typically up to 1500B, so are we only seeing the 1.3% tip of the iceberg?

I guess the key takeaway here is: don't use 'ignore-l3-incompletes'. Monitor your 'L3 incompletes' via SNMP, figure out why they are happening and fix them. Especially, check right now whether your egress PE has L3 incompletes coming from the core.