random musings about networks and everything

Junos and DHCP relay

2016-04-22T16:19:00.001+03:00

There are two different ways to configure DHCP relaying in Junos: bootp helper and dhcp-relay. They work in very different ways. Bootp helper is being phased out and is not supported on, for example, QFX10k. The behaviour of bootp helper is obvious: it works like it does on every other sensible platform. The behaviour of dhcp-relay is very confusing, and it's not documented anywhere.

If your platform lets you configure bootp helper, do it. If not, complain to Juniper about the dhcp-relay implementation and ask them to fix it. The main problem with the dhcp-relay implementation is that once you've configured it, you're punting all DHCP traffic on all interfaces. Normal transit traffic crossing your router is subject to this punt, so transit customers will see more jitter and delay on the punted packets, and almost certainly reordering, because a non-DHCP packet that arrived later but was not punted will be forwarded first. Technically reordering does not matter as long as it does not happen inside a flow, but it's not desirable.

The sequence of operations in Junos for dhcp-relay:

Transit packet touches ingress NPU
After L2 lookup, before L3 lookup, ingress NPU punts the transit packet to PFE CPU
PFE CPU, for reasons obvious to people on drugs, encapsulates the transit DHCP packet with a new IP header, a new UDP header and a 28-byte magic header: [ip_new, udp_new, 28_bytes, ip_old, udp_old, payload]
PFE CPU sends the encapsulated DHCP packet to RE, so that JDHCPD can inspect it
JDHCPD determines whether the packet is relevant to it. In our case it's not, and in a normal configuration JDHCPD proceeds to drop the transit packet because it was not interesting to us!

I don't disagree that in some cases it may be desirable to snoop transit DHCP messages, to see what the server unicasts to our client, but that is a small percentage of the traffic. We know which DHCP servers we have, so why can't we decide what is punted and what is not?

If an operator is doing her due diligence and has an exact lo0 filter which only allows into the control plane what is actually needed there, all of this is broken, but in a very confusing way. Firstly, these transit DHCP packets are NOT subject to the hardware-level lo0 filter, so you cannot drop them there. To me that makes sense: they are transit, and the lo0 filter shouldn't affect them. However, after punt and encapsulation they magically become subject to the lo0 filter on the software side! In your typical lo0 filter with a final 'discard all' term, you'll see these x.y.z.k => 255.255.255.255 packets being discarded, and you might be a bit confused about how on earth you're getting DADDR 255.255.255.255 from your neighbouring core routers. So perhaps you'll run 'monitor traffic interface X matching "host 255.255.255.255" detail' to understand better what is going on. Well, you won't see any of the 255.255.255.255 packets you were dropping, because Junos has added support for this DHCP encapsulation to its own tcpdump, so you're actually seeing the original embedded headers, not the new outer headers (which are where host 255.255.255.255 would match). If you add 'write-file dhcp.pcap' and open that file in Wireshark, you'll see the injected new headers and the packet being interpreted as DHCP, including the original header portion, which makes for a VERY confusing-looking DHCP packet. If you manually pop ip_new, udp_new and the 28 bytes from the packet, you'll see the expected transit packet.
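Popping the extra headers by hand can be sketched in a few lines of Python. This is only a sketch under assumptions: the input is taken to start at ip_new (any link-layer or capture header already removed), the 28 bytes are treated as opaque because their contents are unknown to me, and strip_punt_encapsulation and fake_ipv4_udp are hypothetical helper names, not anything Junos provides.

```python
import socket
import struct

MAGIC_LEN = 28   # opaque header size as described in the post
UDP_HDR_LEN = 8


def strip_punt_encapsulation(packet: bytes) -> bytes:
    """Drop ip_new, udp_new and the 28 magic bytes; return ip_old onwards."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("outer header is not IPv4")
    outer_ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if outer_ihl < 20 or packet[9] != socket.IPPROTO_UDP:
        raise ValueError("outer header is not a valid IPv4/UDP header")
    inner = packet[outer_ihl + UDP_HDR_LEN + MAGIC_LEN:]
    if len(inner) < 20 or inner[0] >> 4 != 4:
        raise ValueError("no IPv4 header found after the magic bytes")
    return inner


def fake_ipv4_udp(src: str, dst: str, payload: bytes) -> bytes:
    """Build a minimal IPv4+UDP packet (checksums left at zero) for testing."""
    udp = struct.pack("!HHHH", 67, 67, UDP_HDR_LEN + len(payload), 0) + payload
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(udp), 0, 0, 64,
                     socket.IPPROTO_UDP, 0,
                     socket.inet_aton(src), socket.inet_aton(dst))
    return ip + udp


# Synthetic stand-in for a punted packet: documentation addresses, zeroed magic.
transit = fake_ipv4_udp("198.51.100.10", "203.0.113.5", b"dhcp-payload")
punted = fake_ipv4_udp("192.0.2.1", "255.255.255.255",
                       bytes(MAGIC_LEN) + transit)
print(strip_punt_encapsulation(punted) == transit)
```

The packets here are synthetic; against a real capture you would first have to remove whatever framing the pcap carries in front of ip_new.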

The strangeness does not end here. You can easily discard/log/count these 'destination-address 255.255.255.255' packets in the lo0 filter (in software, not in hardware), but when you change that 'discard' to 'accept' you won't see anything in the counter or logs anymore! Yet it is crucial that you do accept them, because otherwise they are dropped before jdhcpd has a chance to process them, and you're killing all your transit DHCP. Even after you add this confusing rule to permit transit traffic to reach jdhcpd, you're still going to be dropping all transit DHCP until you configure 'set forwarding-options dhcp-relay forward-snooped-clients all-interfaces'.

The problem here, of course, is that the lo0 filter can no longer tell apart the transit packets hitting the software-processed lo0 and the real DHCP discover packets coming from local interfaces. I usually allow DHCP discover only from interfaces where I've enabled DHCP. But with this dhcp-relay configuration, I have to allow it everywhere! I'm no longer protected from customers having L2 loops and injecting wire-rate DHCP discovers into my control plane. I now have to accept those, because I cannot tell in the lo0 filter whether they are transit or discovers.

What should JNPR do? Continue to support bootp helper style operation, where no transit traffic is ever punted. Make dhcp-relay work like that out of the box, so that people who need to snoop transit traffic must explicitly enable it, and give them tools to enable it based on various keys: saddr/daddr, interface, NPU. I'm pretty sure there are now a bunch of JNPR boxes which silently drop transit DHCP, because there is no documentation anywhere on how this works.

I'm not as convinced as JTAC that this isn't simply a bug; it feels odd that all this really would be the intended behaviour. The telling detail is that JNPR is somehow avoiding lo0 evaluation in HW. I suspect that is because the packet is not classified as IPv4 protocol but as DHCP-snooping protocol (yes, Junos has ipv4, ipv6, mpls, bridge, fibrechannel and dhcp-snooping protocol route tables!), and as it's not IPv4 it's not subject to the HW lo0 filter. However, they seem to drop the ball after the punt, making the embedded packet subject to the SW lo0 filter. I think it really should not behave like this.

I wish I could say this is the only situation where transit traffic can hit the lo0 filter, but that's not true. Some JNPR platforms punt transit IP options and transit IPv6 HBH through the lo0 filter. So in those cases you need one term which matches on all local addresses and ip-options and drops, then a second term which allows all ip-options, unless you also want to drop all transit ip-options (which is probably just fine). Pretty much no one knows this, so people likely don't know what their network's policy regarding ip-options is, and the actual policy is just determined by the network upgrade cycle.

Quick look at Trio ddos-protection with flow-detection

2014-03-17T14:47:00.000+02:00

Some things are easy to protect with iACL and lo0 ACL, but others are really hard. BGP is one: you need to allow BGP from customers and from core, and it's not convenient or practical to handle them separately in lo0 ACL + policer. Luckily Junos has a feature called flow-detection, which you turn on with 'set system ddos-protection global flow-detection'.

I'm sending a DoS from a single source to lo0, and my iBGP immediately goes down. After I turn on flow-detection, iBGP connectivity is restored. Looking at the PFE, we can see what is happening:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx rindex prot aggr IIF/IFD pkts     bytes      source-info
--- --- ------ ---- ---- ------- -------- ---------- ----------
0   0   721    1400 sub  338     21       79161      c158ef22 c158ef1f 53571 179
0   1   2679   1400 sub  356     11159404 2187242988 64640102 c158ef1f 179 179
0   2   2015   1400 sub  338     29       112468     c158ef23 c158ef1f 179 65020

Pretty nice and clear: 64.64.01.02 => c1.58.ef.1f (addresses in hex) is our attack traffic and it's getting its own policer. iBGP is stable, and attack traffic is policed separately. Let's check those policers more closely:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 2679
PFE: 0
Flow Record Index: 2679
Flow Key:
  Proto-ID: 1400
  Key type: 1
  IIF: 356
  Src IP addr: 64640102 (100.100.1.2)
  Dst IP addr: c158ef1f (193.88.239.31)
  Src port: 179
  Dst port: 179
Flow Context Data:
  Rcvd ack_add: 1
  Rcvd ack_del: 0
  Rcvd last flow op: 2
  Flow state: 2
  Aggr level: 0
  Proto idx: 79
  Policer idx: 3885
  Time inserted: 3484405383
  Time last violated: 3484485825
  Last received: 42480417
Flow Statitics:
  Packet Count: 42655279
  Byte Count: 8360434488

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp stats
DDOS Policer Statistics:
                                                     arrival pass  # of
idx prot group proto     on loc   pass    drop       rate    rate  flows
--- ---- ----- --------- -- ----- ------- ---------- ------- ----- -----
79  1400 bgp   aggregate Y  UKERN 8378805 0          9       9     3
                            PFE-0 8393411 2310827652 528095  9     3

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp configuration
DDOS Policer Configuration:
                                UKERN-Config  PFE-Config
idx prot group proto     on Pri rate  burst   rate  burst
--- ---- ----- --------- -- --- ----- -----   ----- -----
79  1400 bgp   aggregate Y  Lo  8000  20000   8000  20000
MX104-ABB-0(test13nqa1-re0.dk vty)#
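The source-info column in 'show ddos scfd asic-flows' prints addresses as bare hex, which gets tedious to read once there are more than a handful of flows. A minimal sketch for decoding them, where scfd_hex_to_ip is just a hypothetical helper name:

```python
import ipaddress


def scfd_hex_to_ip(hex_addr: str) -> str:
    """Convert an 8-digit hex IPv4 address (as printed by SCFD) to dotted quad."""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


for raw in ("64640102", "c158ef1f", "c158ef22", "c158ef23"):
    print(raw, "=>", scfd_hex_to_ip(raw))
```

The first two values decode to the same addresses the PFE prints in parentheses in the asic-flow-rindex output, 100.100.1.2 and 193.88.239.31.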

Looks good. What if we increase the sources from a /32 to a /16? Let's see:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe idx  rindex prot aggr IIF/IFD pkts bytes source-info
--- ---- ------ ---- ---- ------- ---- ----- ----------
0   0    3083   1400 sub  356     80   15484 6464e743 c158ef1f 179 179
0   1    805    1400 sub  356     217  42336 64642df4 c158ef1f 179 179
0   2    672    1400 sub  356     291  56840 6464eb64 c158ef1f 179 179
0   3    3382   1400 sub  356     256  49980 6464ba66 c158ef1f 179 179
0   4    3956   1400 sub  356     82   15876 64643b3d c158ef1f 179 179
0   5    1999   1400 sub  356     92   17836 646403a6 c158ef1f 179 179
0   6    2808   1400 sub  356     22   4116  64642f33 c158ef1f 179 179
<..>
0   3455 866    1400 sub  356     80   15484 646475a1 c158ef1f 179 179
0   3456 4096   1400 sub  356     319  62328 646470a8 c158ef1f 179 179
MX104-ABB-0(test13nqa1-re0.dk vty)#

No mas! I expected to see 65537 or so, but instead I'm seeing just 3456. My iBGP also went down! Let's poke around a bit:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info DDOS-SCFD global context ------------------------------------------------------ FLow entry/state/hash size: 288/12/8 bytes Flow scan: Yes Send async msg to RE: Yes Send periodic update to RE: No Default enabled: No Enabled: Yes Last aggr op is: Aggr Next available flow id: 5b39 Culprit flows: 3902 Culprit flows on scan: 3902 Violated protocols: 1 Violated protocols on scan: 1 Violation report rate: 100(pps) Flow change report rate: 100(pps) Scan cookie: 588425 Free SCFD states: 4096 Free flow entries: 194 Free notif blocks: 400 Free re request blocks: 400 Free flow msg blocks: 4096 Free flow policers: 321 Socket notif queue size: 0 Has queued work state items: 0 Has queued re requests: 0 Has queued flow rate modifies: 0 Has queued flow messages: 0 Send packet size: 16384 Send batch size: 1 Last aggr op time: 2942210140 Per PFE flows: 0=3902 Run out of flows: 0 Reuse an entry not freed yet: 0 Run out of state items: 0 Bad proto ID: 0 rindex changed for same flow: 0 Remove flow on an empty proto: 0 Remove non-exist flow: 0 Read ASIC failed: 0 Failed tries write flow params: 0 Failed change flow params: 0 Run out of policers: 0 Run out of msg blocks: 0 Run out of mod flow blocks: 0 SCFD stats for PFE 0 Global configuration violation report rate: 100 flow report rate: 100 Flow counters read from LU current suspicious flows: 191 current culprit flows: 3905 discovered suspicious flows: 123802 discovered culprit flows: 23356 deleted culprit flows: 19451 false positives: 100255 hash insertion errors: 148669170 hash deletion errors: 0 max flow tbl scan time(ms): 0 debug values: 0 Flow reports received through PUNT policer violation: 104836 flow found: 23353 flow timeout: 0 flow return to normal: 115 flow cleared: 19336 unknown reports: 0 bad flow type: 0 Violation indication policer stats Passed indications: 21346 Dropped indications: 84304888

What immediately caught my attention was 'Free flow entries: 194'. Oops! We're running out of flow entries. But after we wait a few more seconds, my iBGP comes back up. Why?

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows pfe idx rindex prot aggr IIF/IFD pkts bytes source-info --- ---- ------ ---- ---- ------- ------- -------- ---------- 0 0 3183 1400 ifl 338 96 220159 0 0 0 0 0 1 1439 1400 ifl 356 70851932 13886978476 0 0 0 0 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 3183 PFE: 0 Flow Record Index: 3183 Flow Key: Proto-ID: 1400 Key type: 1 IIF: 338 Src IP addr: 0 (0.0.0.0) Dst IP addr: 0 (0.0.0.0) Src port: 0 Dst port: 0 Flow Context Data: Rcvd ack_add: 1 Rcvd ack_del: 1 Rcvd last flow op: 3 Flow state: 3 Aggr level: 1 Proto idx: 79 Policer idx: 2105 Time inserted: 3485722214 Time last violated: 3485723825 Last received: 105 Flow Statitics: Packet Count: 105 Byte Count: 221115 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 1439 PFE: 0 Flow Record Index: 1439 Flow Key: Proto-ID: 1400 Key type: 1 IIF: 356 Src IP addr: 0 (0.0.0.0) Dst IP addr: 0 (0.0.0.0) Src port: 0 Dst port: 0 Flow Context Data: Rcvd ack_add: 1 Rcvd ack_del: 0 Rcvd last flow op: 2 Flow state: 2 Aggr level: 1 Proto idx: 79 Policer idx: 2155 Time inserted: 3485637825 Time last violated: 3485792825 Last received: 81852797 Flow Statitics: Packet Count: 82308537 Byte Count: 16132473056 MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info DDOS-SCFD global context ------------------------------------------------------ FLow entry/state/hash size: 288/12/8 bytes Flow scan: Yes Send async msg to RE: Yes Send periodic update to RE: No Default enabled: No Enabled: Yes Last aggr op is: Aggr Next available flow id: 5b40 Culprit flows: 2 Culprit flows on scan: 2 Violated protocols: 1 Violated protocols on scan: 1 Violation report rate: 100(pps) Flow change report rate: 100(pps) Scan cookie: 588463 Free SCFD states: 4096 Free flow entries: 4094 Free notif blocks: 400 Free re request blocks: 400 Free flow msg blocks: 4096 Free flow policers: 4221 Socket notif queue size: 0 Has queued work state items: 0 Has 
queued re requests: 0 Has queued flow rate modifies: 0 Has queued flow messages: 0 Send packet size: 16384 Send batch size: 1 Last aggr op time: 2942210140 Per PFE flows: 0=2 Run out of flows: 0 Reuse an entry not freed yet: 0 Run out of state items: 0 Bad proto ID: 0 rindex changed for same flow: 0 Remove flow on an empty proto: 0 Remove non-exist flow: 0 Read ASIC failed: 0 Failed tries write flow params: 0 Failed change flow params: 0 Run out of policers: 0 Run out of msg blocks: 0 Run out of mod flow blocks: 0 SCFD stats for PFE 0 Global configuration violation report rate: 100 flow report rate: 100 Flow counters read from LU current suspicious flows: 0 current culprit flows: 2 discovered suspicious flows: 123806 discovered culprit flows: 23360 deleted culprit flows: 23358 false positives: 100446 hash insertion errors: 149071671 hash deletion errors: 0 max flow tbl scan time(ms): 0 debug values: 0 Flow reports received through PUNT policer violation: 104836 flow found: 23360 flow timeout: 0 flow return to normal: 117 flow cleared: 23241 unknown reports: 0 bad flow type: 0 Violation indication policer stats Passed indications: 21346 Dropped indications: 84304888

Wow! Only two flows are left, and they've changed from sub to ifl. The keys are all zero: no src/dst IP match, no src/dst port match! But my iBGP seems to work, as my attack traffic (simulating a BGP customer) is coming from a different IFL than my iBGP. I guess it's a lot better than nothing, and a lot better than overflowing the 4k policers.
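The numbers in the first global-info output explain why the /16 attack could never fit. A quick back-of-the-envelope check, assuming (my inference, not something the output states) that culprit flows plus free flow entries add up to the table size:

```python
# Values copied from the first 'show ddos scfd global-info' output.
culprit_flows = 3902
free_flow_entries = 194

# Assumption: used + free entries equals the total flow table size.
flow_table_size = culprit_flows + free_flow_entries

# A /16 worth of spoofed sources, each wanting its own sub-level flow.
attack_prefix_len = 16
attack_sources = 2 ** (32 - attack_prefix_len)

print(flow_table_size)                    # 4096
print(attack_sources)                     # 65536
print(attack_sources // flow_table_size)  # 16
```

So the attack wants about 16 times more sub-level flows than there are entries, which is consistent with the second output, where 2 culprit flows leave 4094 entries free.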

Yet, waiting a bit longer, I see that my INET/INET6 iBGP is happy but my VPNv4/VPNv6 iBGP is unhappy. Both are on the same IFL, which is a different one from the attack traffic. It turns out the default level for sub and ifl is 10pps, and for ifd it is 20kpps. Our iBGP was 5 sessions and was periodically exceeding the 10pps IFL limit. By default, once the flow limit is hit, you don't drop just the exceeding traffic (flow-control-mode police), you drop everything in the offending flow (flow-control-mode drop), which I think is a great default. But is sub==ifl==10pps a great default? I don't think so. Anyway, my iBGP IFL was hitting 10pps and dropping everything for the duration of the violation. I'm not sure why it affected the 3x VPNv4/VPNv6 sessions but not INET/INET6, as they have the same hold-time and keepalive, but it doesn't matter; it was a misconfiguration on my part. Once I added 'set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 1000', VPNv4/VPNv6 stabilized.
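The difference between the two flow-control modes is easy to see in a toy model. This is a deliberately simplified sketch, not how the PFE implements it: time is sliced into one-second buckets, the policer has no burst, and passed_packets is a hypothetical function name.

```python
def passed_packets(arrivals_pps, limit_pps, mode):
    """Packets let through per one-second bucket under a per-flow limit.

    'police' passes traffic up to the limit and drops only the excess;
    'drop' passes nothing for any bucket in which the flow violates the limit.
    """
    passed = []
    for pps in arrivals_pps:
        if pps <= limit_pps:
            passed.append(pps)
        elif mode == "police":
            passed.append(limit_pps)
        elif mode == "drop":
            passed.append(0)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return passed


# An iBGP IFL hovering around the 10pps default limit.
ibgp_ifl = [4, 12, 9, 11]
print(passed_packets(ibgp_ifl, 10, "police"))  # [4, 10, 9, 10]
print(passed_packets(ibgp_ifl, 10, "drop"))    # [4, 0, 9, 0]
```

With a legitimate flow sitting just around the limit, drop mode blanks out whole intervals, which is the kind of behaviour that would let hold-timers expire on an otherwise healthy IFL.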

I noticed that for some reason I could no longer make the platform automatically switch from sub detection to ifl detection during a DoS. This means my configuration was again broken, as the subscriber flows congested my HW's 4k slots. I'm not sure why this is or how it should work. I ended up disabling 'sub' level detection with 'set system ddos-protection protocols bgp aggregate flow-level-detection subscriber off'. I guess it would be nice to keep it on if we could reliably and rapidly switch from sub=>ifl when the policers are congested, but right now it seems that if there is a danger of seeing 4k sub-level flows, it might be safest to just ignore them and protect only at the ifl level. My full config is now:

{master}[edit system ddos-protection protocols bgp]
fisakytt@test13nqa1-re0.dk# show
aggregate {
    bandwidth 8000;
    flow-level-bandwidth {
        logical-interface 100;
    }
    flow-level-detection {
        subscriber off;
    }
}

It seems to consistently protect my BGP from the attacks I'm interested in. It is quite a bit of config, though, if you're doing this for 200 DDoS policer protocols (after researching for each relevant one what you even should do). It might be nice if Juniper had a concept of 'protected flows' which always have priority: flows which have a manual hint in the configuration. In Cisco LPTS, when you configure BGP, you get a BGP policer for that configured session. JNPR could easily do the same, configure those persistently and, in case of congestion, not break them, only breaking policers among the dynamically detected flows. I'd also really like to figure out how sub=>ifl fallback should work, and why it works sometimes and sometimes not.

Tourist trip to MX fabric

2014-03-16T14:51:00.000+02:00

Tourist, because it's mostly original research so quality may be dubious.

You can infer a lot about the fabric by looking at the 'show hsl2 ...' commands. Let's start.

NPC0(test13nqe1-re1.dk vty)# show hsl2 asic mqchip(0) serdes
MQCHIP(0) serdes table :
MQCHIP(0)-Avago 65NM-0 [0xf300000]: 24 links 0 - 23
MQCHIP(0)-Avago 65NM-1 [0xf304000]: 24 links 24 - 47
MQCHIP(0)-Avago 65NM-2 [0xf308000]: 8 links 48 - 55
MQCHIP(0)-Avago 65NM-3 [0xf309000]: 8 links 56 - 63
MQCHIP(0)-Avago 65NM-4 [0xf30a000]: 8 links 64 - 71
MQCHIP(0)-Avago 65NM-5 [0xf30b000]: 8 links 72 - 79
MQCHIP(0)-Avago 65NM-6 [0xf30c000]: 8 links 80 - 87
MQCHIP(0)-Avago 65NM-7 [0xf30d000]: 8 links 88 - 95
MQCHIP(0)-Avago 65NM-8 [0xf30e000]: 8 links 96 - 103
MQCHIP(0)-Avago 65NM-9 [0xf30f000]: 8 links 104 - 111
MQCHIP(0)-Avago 65NM-10 [0xf310000]: 8 links 112 - 119
MQCHIP(0)-Avago 65NM-11 [0xf311000]: 8 links 120 - 127
MQCHIP(0)-Avago 65NM-12 [0xf312000]: 8 links 128 - 135
MQCHIP(0)-Avago 65NM-13 [0xf313000]: 8 links 136 - 143
MQCHIP(0)-Avago 65NM-14 [0xf318000]: 2 links 144 - 145
MQCHIP(0)-Avago 65NM-15 [0xf31a000]: 2 links 146 - 147
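The SerDes table above can be tallied to sanity-check the link numbering. A small sketch, with the per-block link counts copied from the output and link_ranges being a hypothetical helper name:

```python
# Link counts per Avago SerDes block, 65NM-0 through 65NM-15.
serdes_blocks = [24, 24] + [8] * 12 + [2, 2]


def link_ranges(blocks):
    """Return (first_link, last_link) per block, numbering links from 0."""
    ranges = []
    first = 0
    for count in blocks:
        ranges.append((first, first + count - 1))
        first += count
    return ranges


ranges = link_ranges(serdes_blocks)
print(len(serdes_blocks), sum(serdes_blocks))  # 16 148
print(ranges[2], ranges[14], ranges[15])       # (48, 55) (144, 145) (146, 147)
```

16 blocks and 148 links is also what the plain 'show hsl2' summary reports for this MQ ("16 serdes 148 links").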

Avago is a well-known manufacturer of SerDes (SERializer/DESerializer); 65NM probably refers to Avago's 65nm lithography line of products. The SerDes presentation here is unidirectional, but that is still quite a large number of SerDes links. They are mapped/bundled into channels, and looking at those channels gives us a few more clues about them:

NPC0(test13nqe1-re1.dk vty)# show hsl2 MQCHIP(0)(16) 74 channels 16 serdes 148 links Flag: log_poll_error log_ioerr rd_fn 0x40760d64 wr_fn 0x40760d10 MQCHIP(0) channel table : MQCHIP(0)-chan-rx-0 0x7200000 <= Table MQCHIP(0)-chan-tx-1 0x7201000 => Table MQCHIP(0)-chan-rx-2 0x7202000 <= Table MQCHIP(0)-chan-rx-3 0x7202000 <= Table MQCHIP(0)-chan-tx-4 0x7203000 => Table MQCHIP(0)-chan-tx-5 0x7203000 => Table MQCHIP(0)-chan-rx-6 0x7204000 <= Invalid MQCHIP(0)-chan-rx-7 0x7204000 <= Primitive MQCHIP(0)-chan-rx-8 0x7204000 <= Primitive MQCHIP(0)-chan-rx-9 0x7204000 <= Invalid MQCHIP(0)-chan-rx-10 0x7204000 <= Invalid MQCHIP(0)-chan-rx-11 0x7204000 <= Invalid MQCHIP(0)-chan-rx-12 0x7204000 <= Invalid MQCHIP(0)-chan-rx-13 0x7205000 <= Invalid MQCHIP(0)-chan-rx-14 0x7205000 <= Primitive MQCHIP(0)-chan-rx-15 0x7205000 <= Primitive MQCHIP(0)-chan-rx-16 0x7205000 <= Invalid MQCHIP(0)-chan-rx-17 0x7205000 <= Invalid MQCHIP(0)-chan-rx-18 0x7205000 <= Invalid MQCHIP(0)-chan-rx-19 0x7205000 <= Invalid MQCHIP(0)-chan-rx-20 0x7206000 <= Invalid MQCHIP(0)-chan-rx-21 0x7206000 <= Primitive MQCHIP(0)-chan-rx-22 0x7206000 <= Primitive MQCHIP(0)-chan-rx-23 0x7206000 <= Invalid MQCHIP(0)-chan-rx-24 0x7206000 <= Invalid MQCHIP(0)-chan-rx-25 0x7206000 <= Invalid MQCHIP(0)-chan-rx-26 0x7206000 <= Invalid MQCHIP(0)-chan-rx-27 0x7207000 <= Invalid MQCHIP(0)-chan-rx-28 0x7207000 <= Primitive MQCHIP(0)-chan-rx-29 0x7207000 <= Primitive MQCHIP(0)-chan-rx-30 0x7207000 <= Invalid MQCHIP(0)-chan-rx-31 0x7207000 <= Invalid MQCHIP(0)-chan-rx-32 0x7207000 <= Invalid MQCHIP(0)-chan-rx-33 0x7207000 <= Invalid MQCHIP(0)-chan-tx-34 0x7208000 => Invalid MQCHIP(0)-chan-tx-35 0x7208000 => Primitive MQCHIP(0)-chan-tx-36 0x7208000 => Primitive MQCHIP(0)-chan-tx-37 0x7208000 => Invalid MQCHIP(0)-chan-tx-38 0x7208000 => Invalid MQCHIP(0)-chan-tx-39 0x7208000 => Invalid MQCHIP(0)-chan-tx-40 0x7208000 => Invalid MQCHIP(0)-chan-tx-41 0x7209000 => Invalid MQCHIP(0)-chan-tx-42 0x7209000 => Primitive 
MQCHIP(0)-chan-tx-43 0x7209000 => Primitive MQCHIP(0)-chan-tx-44 0x7209000 => Invalid MQCHIP(0)-chan-tx-45 0x7209000 => Invalid MQCHIP(0)-chan-tx-46 0x7209000 => Invalid MQCHIP(0)-chan-tx-47 0x7209000 => Invalid MQCHIP(0)-chan-tx-48 0x720a000 => Invalid MQCHIP(0)-chan-tx-49 0x720a000 => Primitive MQCHIP(0)-chan-tx-50 0x720a000 => Primitive MQCHIP(0)-chan-tx-51 0x720a000 => Invalid MQCHIP(0)-chan-tx-52 0x720a000 => Invalid MQCHIP(0)-chan-tx-53 0x720a000 => Invalid MQCHIP(0)-chan-tx-54 0x720a000 => Invalid MQCHIP(0)-chan-tx-55 0x720b000 => Invalid MQCHIP(0)-chan-tx-56 0x720b000 => Primitive MQCHIP(0)-chan-tx-57 0x720b000 => Primitive MQCHIP(0)-chan-tx-58 0x720b000 => Invalid MQCHIP(0)-chan-tx-59 0x720b000 => Invalid MQCHIP(0)-chan-tx-60 0x720b000 => Invalid MQCHIP(0)-chan-tx-61 0x720b000 => Invalid MQCHIP(0)-chan-rx-62 0x720c000 <= Channel MQCHIP(0)-chan-rx-63 0x720c000 <= Invalid MQCHIP(0)-chan-rx-64 0x720c000 <= Invalid MQCHIP(0)-chan-rx-65 0x720d000 <= Invalid MQCHIP(0)-chan-rx-66 0x720d000 <= Invalid MQCHIP(0)-chan-rx-67 0x720d000 <= Invalid MQCHIP(0)-chan-tx-68 0x720e000 => Channel MQCHIP(0)-chan-tx-69 0x720e000 => Invalid MQCHIP(0)-chan-tx-70 0x720e000 => Invalid MQCHIP(0)-chan-tx-71 0x720f000 => Invalid MQCHIP(0)-chan-tx-72 0x720f000 => Invalid MQCHIP(0)-chan-tx-73 0x720f000 => Invalid asic mqchip(0) chip_flag 0x9 cookie 0x4282bf28 disaster_fn 0x0 QXCHIP(0)-chan-tx-1 MQCHIP(0)-Avago 65NM-link-146 QXCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-144 LUCHIP(0)-chan-tx-2 MQCHIP(0)-Avago 65NM-link-24 LUCHIP(0)-chan-tx-3 MQCHIP(0)-Avago 65NM-link-36 LUCHIP(0)-chan-rx-0 MQCHIP(0)-Avago 65NM-link-0 LUCHIP(0)-chan-rx-1 MQCHIP(0)-Avago 65NM-link-12 remote/unknown MQCHIP(0)-Avago 65NM-link-80 remote/unknown MQCHIP(0)-Avago 65NM-link-80 remote/unknown MQCHIP(0)-Avago 65NM-link-84 remote/unknown MQCHIP(0)-Avago 65NM-link-80 remote/unknown MQCHIP(0)-Avago 65NM-link-82 remote/unknown MQCHIP(0)-Avago 65NM-link-84 remote/unknown MQCHIP(0)-Avago 65NM-link-86 remote/unknown 
MQCHIP(0)-Avago 65NM-link-88 remote/unknown MQCHIP(0)-Avago 65NM-link-88 remote/unknown MQCHIP(0)-Avago 65NM-link-92 remote/unknown MQCHIP(0)-Avago 65NM-link-88 remote/unknown MQCHIP(0)-Avago 65NM-link-90 remote/unknown MQCHIP(0)-Avago 65NM-link-92 remote/unknown MQCHIP(0)-Avago 65NM-link-94 remote/unknown MQCHIP(0)-Avago 65NM-link-96 remote/unknown MQCHIP(0)-Avago 65NM-link-96 remote/unknown MQCHIP(0)-Avago 65NM-link-100 remote/unknown MQCHIP(0)-Avago 65NM-link-96 remote/unknown MQCHIP(0)-Avago 65NM-link-98 remote/unknown MQCHIP(0)-Avago 65NM-link-100 remote/unknown MQCHIP(0)-Avago 65NM-link-102 remote/unknown MQCHIP(0)-Avago 65NM-link-104 remote/unknown MQCHIP(0)-Avago 65NM-link-104 remote/unknown MQCHIP(0)-Avago 65NM-link-108 remote/unknown MQCHIP(0)-Avago 65NM-link-104 remote/unknown MQCHIP(0)-Avago 65NM-link-106 remote/unknown MQCHIP(0)-Avago 65NM-link-108 remote/unknown MQCHIP(0)-Avago 65NM-link-110 remote/unknown MQCHIP(0)-Avago 65NM-link-48 remote/unknown MQCHIP(0)-Avago 65NM-link-48 remote/unknown MQCHIP(0)-Avago 65NM-link-52 remote/unknown MQCHIP(0)-Avago 65NM-link-48 remote/unknown MQCHIP(0)-Avago 65NM-link-50 remote/unknown MQCHIP(0)-Avago 65NM-link-52 remote/unknown MQCHIP(0)-Avago 65NM-link-54 remote/unknown MQCHIP(0)-Avago 65NM-link-56 remote/unknown MQCHIP(0)-Avago 65NM-link-56 remote/unknown MQCHIP(0)-Avago 65NM-link-60 remote/unknown MQCHIP(0)-Avago 65NM-link-56 remote/unknown MQCHIP(0)-Avago 65NM-link-58 remote/unknown MQCHIP(0)-Avago 65NM-link-60 remote/unknown MQCHIP(0)-Avago 65NM-link-62 remote/unknown MQCHIP(0)-Avago 65NM-link-64 remote/unknown MQCHIP(0)-Avago 65NM-link-64 remote/unknown MQCHIP(0)-Avago 65NM-link-68 remote/unknown MQCHIP(0)-Avago 65NM-link-64 remote/unknown MQCHIP(0)-Avago 65NM-link-66 remote/unknown MQCHIP(0)-Avago 65NM-link-68 remote/unknown MQCHIP(0)-Avago 65NM-link-70 remote/unknown MQCHIP(0)-Avago 65NM-link-72 remote/unknown MQCHIP(0)-Avago 65NM-link-72 remote/unknown MQCHIP(0)-Avago 65NM-link-76 remote/unknown 
MQCHIP(0)-Avago 65NM-link-72 remote/unknown MQCHIP(0)-Avago 65NM-link-74 remote/unknown MQCHIP(0)-Avago 65NM-link-76 remote/unknown MQCHIP(0)-Avago 65NM-link-78 remote/unknown MQCHIP(0)-Avago 65NM-link-128 remote/unknown MQCHIP(0)-Avago 65NM-link-128 remote/unknown MQCHIP(0)-Avago 65NM-link-132 remote/unknown MQCHIP(0)-Avago 65NM-link-136 remote/unknown MQCHIP(0)-Avago 65NM-link-136 remote/unknown MQCHIP(0)-Avago 65NM-link-140 remote/unknown MQCHIP(0)-Avago 65NM-link-112 remote/unknown MQCHIP(0)-Avago 65NM-link-112 remote/unknown MQCHIP(0)-Avago 65NM-link-116 remote/unknown MQCHIP(0)-Avago 65NM-link-120 remote/unknown MQCHIP(0)-Avago 65NM-link-120 remote/unknown MQCHIP(0)-Avago 65NM-link-124

That's a lot of information, but we can ignore the 'Invalid' ones, which leaves us with 'Primitive' and 'Table'. The 'Table' ones clearly mark where they connect to, in this case the LU chip and the QX chip, so we can work on the assumption that 'Primitive' is used to connect to the fabric. We can further determine that there are 8 channels in each direction. Two of the channels are always next to each other numerically, with a lot of empty channels in between; this may prove useful. But let's take a closer look at one channel.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 7 MQCHIP(0)-chan-rx-7 : Up Sub channel 1 of 2 with 4 links link mask 0xf HSL2_TYPE_T_RX reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-80 Flag: 0x80000 64b66b No-plesio input-width:0 Latency: hardware default Cell received: 3309287443 CRC errors: 0 Cell last : 2328887065 CRC last : 0 Rate thresholds: Fatal=1 Non-fatal:1 # of times crc errors exceed (0) the error_aggr_threshold (4294967296) bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0 : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0 : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0 : 24:0x0 25:0x0 26:0x0 
27:0x0 28:0x0 29:0x0 30:0x0 31:0x0 sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0 sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0 sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0 sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0 sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0 sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0 Last poll: last_int_status : 0x0 last_cell_received : 0x8acfff19 last_crc_errors[0] : 0x0 last_crc_errors[1] : 0x0 last_crc_errors[2] : 0x0 last_crc_errors[3] : 0x0 last_crc_errors[4] : 0x0 last_crc_errors[5] : 0x0 last_crc_errors[6] : 0x0 last_crc_errors[7] : 0x0 last_bcr_overflow : 0x0 last_bcr_overflow : 0x0 last_bcr_multiple_psr_push : 0x0 last_sub_idle_override : 0x0 last_psr_overflow : 0x0 last_psr_illegal_8b10b : 0x0 last_psr_8b10b_valids : 0x0 last_psr_illegal_64b66b : 0x0 last_psr_64b66b_framing : 0x0 last_psr_clock_correction : 0x0 last word_align: 0 max word_align 0 last chan_bond : 0 max chan_bond 0 NPC0(test13nqe1-re1.dk vty)#
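The channel flags show 64b66b, and the counters cover both 8b10b and 64b66b, so it's worth putting a number on what the line code costs. A minimal sketch of the generic line-code arithmetic (nothing Trio-specific; line_code_efficiency is a hypothetical helper name):

```python
from fractions import Fraction


def line_code_efficiency(payload_bits: int, coded_bits: int) -> Fraction:
    """Fraction of the raw SerDes bit rate that carries payload."""
    return Fraction(payload_bits, coded_bits)


for name, payload, coded in (("8b10b", 8, 10), ("64b66b", 64, 66)):
    eff = line_code_efficiency(payload, coded)
    print(f"{name}: efficiency {float(eff):.4f}, overhead {float(1 - eff):.4f}")
```

64b66b spends 2 bits out of every 66 on framing, roughly 3% of the raw link rate, against 20% for 8b10b.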

What is interesting here is 'Sub channel 1 of 2 with 4 links'. Sub-channel 1 of 2? What is the 2nd? And a channel has 4 links. It's also interesting to notice that we have a cell counter and that 64b66b encoding is in use, with its overhead. Let's check the next one.

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 8
MQCHIP(0)-chan-rx-8 : Up
Sub channel 2 of 2 with 4 links link mask 0xf0
HSL2_TYPE_T_RX reg: 0x7204000 first_link: MQCHIP(0)-Avago 65NM-link-84
Flag: 0x80000 64b66b No-plesio input-width:0
Latency: hardware default
Cell received: 3309287443 CRC errors: 0
Cell last : 2328887065 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
Check base channel for rest of the sub-channel stats.
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#
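The link masks line up nicely with the first_link values if you read each mask bit as an offset from the base channel's first link. That reading is my assumption, not something the output states; links_from_mask is a hypothetical helper name.

```python
def links_from_mask(mask: int, base_link: int = 0) -> list:
    """Expand a link mask into link numbers, bit 0 being base_link."""
    return [base_link + bit for bit in range(mask.bit_length())
            if (mask >> bit) & 1]


BASE_LINK = 80  # first_link of chan-rx-7, the base channel

print(links_from_mask(0x0F, BASE_LINK))  # [80, 81, 82, 83]
print(links_from_mask(0xF0, BASE_LINK))  # [84, 85, 86, 87]
```

Under that reading sub-channel 1 takes links 80-83 and sub-channel 2 takes links 84-87, which agrees with chan-rx-8 reporting link-84 as its first link.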

OK, 7 and 8 are next to each other because they are logically one channel made out of two sub-channels. So the 8 channels we saw per direction really are 4 logical channels with two sub-channels each. Looking at the other channels, everything looks the same until we hit 28 and 29, i.e. the 4th channel:

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 28
MQCHIP(0)-chan-rx-28 : Down
Sub channel 1 of 2 with 4 links link mask 0xf
HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-104
Flag: 0x0 64b66b No-plesio input-width:0
Latency: hardware default
Cell received: 0 CRC errors: 0
Cell last : 0 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
bcr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
bcr_multiple_psr_push : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
psr_overflow : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_8b10b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_8b10b_valids : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_illegal_64b66b : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_64b66b_framing : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
psr_clock_correction : 0:0x0 1:0x0 2:0x0 3:0x0 4:0x0 5:0x0 6:0x0 7:0x0
  : 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x0 14:0x0 15:0x0
  : 16:0x0 17:0x0 18:0x0 19:0x0 20:0x0 21:0x0 22:0x0 23:0x0
  : 24:0x0 25:0x0 26:0x0 27:0x0 28:0x0 29:0x0 30:0x0 31:0x0
sub_illegal_soc : 0:0x0 1:0x0 2:0x0 3:0x0
sub_idle_override : 0:0x0 1:0x0 2:0x0 3:0x0
sub_bcr_unsync : 0:0x0 1:0x0 2:0x0 3:0x0
sub_soc_thresh : 0:0x0 1:0x0 2:0x0 3:0x0
sub_upq_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_uflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_mem_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_obuf_cntl_oflow : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rxok_count : 0:0x0 1:0x0 2:0x0 3:0x0
sub_rt_rx_count : 0:0x0 1:0x0 2:0x0 3:0x0
Last poll:
last_int_status : 0x0
last_cell_received : 0x0
last_crc_errors[0] : 0x0
last_crc_errors[1] : 0x0
last_crc_errors[2] : 0x0
last_crc_errors[3] : 0x0
last_crc_errors[4] : 0x0
last_crc_errors[5] : 0x0
last_crc_errors[6] : 0x0
last_crc_errors[7] : 0x0
last_bcr_overflow : 0x0
last_bcr_overflow : 0x0
last_bcr_multiple_psr_push : 0x0
last_sub_idle_override : 0x0
last_psr_overflow : 0x0
last_psr_illegal_8b10b : 0x0
last_psr_8b10b_valids : 0x0
last_psr_illegal_64b66b : 0x0
last_psr_64b66b_framing : 0x0
last_psr_clock_correction : 0x0
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0

NPC0(test13nqe1-re1.dk vty)# show hsl2 channel mqchip(0) 29
MQCHIP(0)-chan-rx-29 : Down
Sub channel 2 of 2 with 4 links link mask 0xf0
HSL2_TYPE_T_RX reg: 0x7207000 first_link: MQCHIP(0)-Avago 65NM-link-108
Flag: 0x0 64b66b No-plesio input-width:0
Latency: hardware default
Cell received: 0 CRC errors: 0
Cell last : 0 CRC last : 0
Rate thresholds: Fatal=1 Non-fatal:1
# of times crc errors exceed (0) the error_aggr_threshold (4294967296)
Check base channel for rest of the sub-channel stats.
last word_align: 0 max word_align 0
last chan_bond : 0 max chan_bond 0
NPC0(test13nqe1-re1.dk vty)#

It's down, and no cells have been transferred. Isn't this convenient: we have 4 channels per direction, but one of them is down, so we're left with 3 channels per direction, and this being MX960, that matches our SCB count. But why two sub-channels per SCB? Because the fabric actually has two Avago chips; you need to connect to both, but they are presented as a single channel to higher-level use. Why do we have four channels if we only use 3 of them? Well, if we look at the same stat in MX240/MX480, we actually see the fourth channel up and pushing cells! Clearly MX960 has 8+8+8 links per MQ for fabric and MX240/MX480 have 8+8+8+8 per MQ, 8 more! But MX240/MX480 have only two SCBs, so clearly it must be arranged as (8+8)+(8+8).

Now, we know MX240/MX480 SCB can handle MPC2 (2*MQ, i.e. 2*40G) fabric connectivity redundantly with a single SCB up, so we know (8+8) (single SCB up) must be enough to run 40G, i.e. 16 SerDes must be enough for a single MQ's 40G of fabric capacity. This gives us a lower bound for SerDes capacity of 40G/16 SerDes = 2.5Gbps. But wait, we had 64b/66b encoding, and we know the fabric uses cells, which have headers, which is overhead, so 2.5Gbps can't be right, it must be more. A bit of googling shows that Avago has 3.125Gbps SerDes, and this makes sense. So the raw capacity available to MQ is 16*3.125 = 50Gbps, redundantly! Double that in MX240/MX480 when both SCBs are up, so 100Gbps of fabric capacity, which seems like a LOT! But multicast replication in MPC2 is actually 'btree', so each linecard needs to send two copies of multicast over the fabric to replicate it, so 100Gbps is halved back to 50Gbps in the multicast scenario, and cell overhead can be devastating (think of a small packet using two cells, but just 1 byte of the 2nd cell: efficiency is 50%!).
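The arithmetic above fits in a few lines of Ruby. This is only a back-of-the-envelope sketch: the method and constant names are made up for illustration, and the only overhead modeled is 64b/66b line coding, since the cell header and alignment overhead are not quantified here.

```ruby
# Back-of-the-envelope fabric capacity per MQ, using the numbers from the text.
# Only 64b/66b line coding is modeled; cell header and alignment overhead are
# left out because the text gives no figures for them.
SERDES_GBPS = 3.125     # gross rate per SerDes
SERDES_PER_SCB = 16     # MX240/MX480: (8+8) SerDes per SCB

# Raw capacity of a group of SerDes.
def raw_gbps(serdes_count, rate = SERDES_GBPS)
  serdes_count * rate
end

# Capacity left after 64b/66b encoding (64 payload bits per 66 line bits).
def after_64b66b(gbps)
  gbps * 64.0 / 66.0
end

one_scb = raw_gbps(SERDES_PER_SCB)        # 50.0
two_scb = raw_gbps(2 * SERDES_PER_SCB)    # 100.0

puts "lower bound per SerDes: #{40.0 / SERDES_PER_SCB} Gbps"
puts "one SCB: #{one_scb} Gbps raw, #{after_64b66b(one_scb).round(2)} Gbps after 64b/66b"
puts "two SCB: #{two_scb} Gbps raw, #{two_scb / 2} Gbps with btree multicast"
```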

This also means that MX960 with all SCBs up has just 75Gbps (24*3.125) of capacity per MQ, so plenty for unicast, but not enough for worst-case multicast btree replication. Still, SCB and MPC2 seem like a good match for almost any realistic scenario: no one is running 100% multicast, and there is plenty of fabric capacity for real life, in MX960 and MX240/MX480, with all SCBs up or one SCB down, still full wire-rate fabric capacity for unicast.

But if we actually try to send traffic from DPCE to MPC2 in the lab, we'll see MPC2 MQ=>DPCE being fine; a single MQ can congest the 40G DPCE card. But DPCE=>MQ in MX960 gets just 26Gbps out of the 40Gbps needed, even when all SCBs are up. Wait, what? Thought this was murica! We had 75Gbps of MX960 capacity, and we can do just 26Gbps? What is up with that! Furthermore, if we pull one SCB out, the lab test is 100% clean: DPCE=>MQ gets 40Gbps out of the 40Gbps needed, full linerate. My head hurts!

Let's ponder a bit. We know DPCE does not connect to the redundant SCB, so it always uses just 2 SCBs in MX960, even though MQ in MPC2 connects to 3 SCBs. So if we divide the 40Gbps that MQ nominally needs by 3 SCBs, it needs 13.33Gbps per SCB, which multiplied by 2 is 26.66Gbps. Looking at fabric requests and grants, we see that MQ is not giving grants to DPCE (ichip): we have fabric capacity, but MQ is not allowing DPCE to send. Clearly MQ does not want to deal with oversubscription, for one reason or another, and it only gives out fabric grants at 13.33Gbps per SCB, so that in no case would the aggregate amount of fabric traffic be over 40Gbps, i.e. no oversubscription. But this strategy breaks down when MQ has 3 connections and DPCE only 2, and the issue is remedied when MQ downgrades to 2 connections, as MQ is now giving out fabric grants at 40/2 => 20Gbps per SCB, instead of just 13.33Gbps.
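A minimal sketch of that reasoning, assuming the receiving MQ simply splits its budget evenly over the SCBs it is connected to. That is a model of the behaviour observed in the lab, not of the actual grant algorithm, and the method names are made up for illustration.

```ruby
# Model of the observed grant behaviour: the receiving MQ divides its fabric
# budget evenly over the SCBs it is connected to, and the sender can only use
# the SCBs it is itself connected to.
def grant_per_scb(mq_gbps, mq_scbs)
  mq_gbps.to_f / mq_scbs
end

def achievable_gbps(mq_gbps, mq_scbs, sender_scbs)
  usable = [mq_scbs, sender_scbs].min
  grant_per_scb(mq_gbps, mq_scbs) * usable
end

# MX960 with all three SCBs up: DPCE only uses two of them.
puts achievable_gbps(40, 3, 2).round(2)   # 26.67
# One SCB pulled: MQ is down to two connections as well.
puts achievable_gbps(40, 2, 2).round(2)   # 40.0
```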

Why doesn't MQ want to deal with oversubscription? Maybe it doesn't have sufficient information at ingress to drop intelligently enough to keep QoS contracts? Maybe this behavior is there to avoid software complexity? Turns out, you don't even have to pull out the SCB to force MQ to connect to just two SCBs; you can use 'set chassis fabric redundancy-mode redundant', and MPC2<=>DPCE is wire rate. The 3rd SCB in MX960 is then unused, as it's not needed for MPC2; remember we have 50Gbps per MQ without it, and 75Gbps with it. So the only reason we might even need it is if we do multicast.

Interestingly, if MPC2 were 120Gbps instead of 80Gbps, so 60Gbps of fabric capacity per MQ instead of 40Gbps, this issue shouldn't exist, as you'd be giving out fabric grants at 60/3 => 20Gbps per SCB, so even in non-redundant/capacity mode DPCE would get the 40Gbps it needs. The other way to fix this is of course to upgrade the SerDes from 3.125Gbps to 6.25Gbps; turns out, Avago has such chips. This means MQ can get 6.25*(8+8), i.e. 100Gbps, with just two SCBEs. That's enough to do btree replication, and as it's just 2 SCBs, you're still giving DPCE the full 40Gbps of fabric grants. So clearly SCBE is the better fix, but had MQ originally been 60Gbps instead of 40Gbps, I think that would have been much better as well.

Can this issue happen again in the future? Can't see why not. Imagine an SCBE system with the far-end linecard being 2*80Gbps (enough to connect to just 2 SCBE) and the near-end linecard being 2*100Gbps MQ, with similar behavior of curbing fabric grants to 100Gbps per MQ, so you're giving out 100/3 = 33.33Gbps per SCBE, and the far-end would be getting 66.66Gbps of fabric grants out of the 80Gbps needed. But looking at DPCE and MPC, I don't think the problem will come up again: looking at fabric streams in MQ and DPCE (ichip), it looks like traffic coming from MQ can be classified in the far-end MQ and dropped while retaining the QoS contract. The current DPCE=>MQ issue is probably an unfortunate combination of mismatching bandwidths and DPCE having a single stupid fabric stream, so the far-end can't easily determine what to drop in case of congestion, and the best policy was just to make sure congestion never happens at ingress.

We can check the fabric streams and see there is now priority in MQ:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream all
Input Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: DWWW---------------------------- --------------------------------
1088..1151: -----------------------------WWL L------------------------------H

Output Streams enabled (F=FAB, W=WAN, D=DROP, L=LPBK, H=HOST)
-------------------------------------------------------------
   0..  63: ----FF-------------------------- --------------------------------
  64.. 127: -------------------------------- --------------------------------
 128.. 191: ----FF-------------------------- --------------------------------
 192.. 255: -------------------------------- --------------------------------
 256.. 319: -------------------------------- --------------------------------
 320.. 383: -------------------------------- --------------------------------
 384.. 447: -------------------------------- --------------------------------
 448.. 511: -------------------------------- --------------------------------
 512.. 575: -------------------------------- --------------------------------
 576.. 639: -------------------------------- --------------------------------
 640.. 703: -------------------------------- --------------------------------
 704.. 767: -------------------------------- --------------------------------
 768.. 831: -------------------------------- --------------------------------
 832.. 895: -------------------------------- --------------------------------
 896.. 959: -------------------------------- --------------------------------
 960..1023: -------------------------------- --------------------------------
1024..1087: WW------------------------------ --------------------------------
1088..1151: -----------------------------WWL L------------------------------H
NPC1(test2nxb10-re0.dk vty)#

Notice there are four fabric streams (2 fabric streams per NPU in the system): 4,5 and 132,133. It's worth another post, but there is an offset of 128 between low-prio and high-prio. We can briefly try to prove this:

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 4
Input Stream 4
-----------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  port : 0
  conn : 0
  tclass : 0
  hi prio : 0
  weight : 0
Output Stream 4
------------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  wan if : 0
  port : 0
  conn : 0
  weight : 0
  sched : 1 MQ1
  l1 node : 0
  queue : 4

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 stream 132
Input Stream 132
-----------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  port : 0
  conn : 0
  tclass : 0
  hi prio : 0
  weight : 0
Output Stream 132
------------------
  attached : 0
  enabled : 1
  pic slot : 0
  mac mode : 0
  wan if : 0
  port : 0
  conn : 0
  weight : 0
  sched : 1 MQ1
  l1 node : 0
  queue : 132

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 4
Q node 4:
  allocated : true
  parent node : 0
  guarantee prio : 3 GL
  excess prio : 2 EL
  rate enable : 0
  m rate : 0 bps
  m credits : 0
  guarantee enable : 1
  g rate : 0 bps
  g credits : 0
  adjust cmode : 0
  adjust value : 0
  weight : 5
  mnc in gp enable : 1
  -----------------------
  guarantee credits : 0
  excess credits added: 0
  excess credits : 0
  max credits : 0

NPC1(test2nxb10-re0.dk vty)# show mqchip 0 sched 1 q 132
Q node 132:
  allocated : true
  parent node : 2
  guarantee prio : 1 GH
  excess prio : 2 EL
  rate enable : 0
  m rate : 0 bps
  m credits : 0
  guarantee enable : 1
  g rate : 0 bps
  g credits : 0
  adjust cmode : 0
  adjust value : 0
  weight : 95
  mnc in gp enable : 1
  -----------------------
  guarantee credits : 65408
  excess credits added: 1
  excess credits : 0
  max credits : 0
NPC1(test2nxb10-re0.dk vty)#

So in-contract traffic going to fabric in stream 4 gets 'GL', guaranteed-low, and in-contract traffic going to fabric in stream 132 gets 'GH', guaranteed-high. So provided the far-end MQ has this stream information, it would be very easy for it to know what to drop under oversubscription: drop 4, not 132. IChip does not have this, so maybe that's the reason it was simply guaranteed that there is no oversubscription from fabric?

It's important to realize fabric bandwidth is just one issue to think about. MQ internally has less memory bandwidth than the 80Gbps needed (40G fabric, 40G WAN); the actual number depends on cell alignment etc., but it can be 80Gbps in good climate, maybe 75Gbps in average climate, or a lot less in a manufactured climate.

Some bullet points for TL;DR people

MQ has 32 SerDes towards fabric
Each SerDes is 3.125Gbps gross (before 64b/66b encoding, cell overhead and alignment)
MX960 allocates SerDes as 8+8+8 to 3 SCB
MX240/MX480 allocates SerDes as 16+16 to 2 SCB
SCBE upgrades the SerDes to 6.25Gbps
SCBE2 upgrades the SerDes to 16.5Gbps (matches what Avago offers, but not very sure)

Capture your fancy, part two, Trio

2014-03-15T12:03:00.000+02:00

Like with 7600/PFC3, it is possible to capture transit traffic on Juniper Trio (MPC, MX80, MX104, FPC5 etc). First decide what you know about the packet and convert that data to hex, it can be pretty much anywhere in the packet in the first 320B or so.

[ytti@ytti.fi ~]% pry
[1] pry(main)> '194.100.7.227'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "c26407e3"
[2] pry(main)> '91.198.120.24'.split('.').map{|e|"%02x" % [e.to_i]}.join
=> "5bc67818"

I'm boringly using IPv4 addresses, but I could have used anything. Unlike in PFC3, you do not need to tell where in the packet the pattern must occur; you just give the pattern, and any packet having that pattern anywhere triggers the capture. Let's try it:

fisakytt@mec-pe1-re0.hel.fi> start shell pfe network tfeb0

TFEB platform (1000Mhz MPC 8544 processor, 1024MB memory, 512KB flash)

TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem enable
TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem capture 0x3 5bc67818c26407e3
TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem dump
Received 116 byte parcel:
Dispatch cookie: 0x0074000000000000
0x00 0x08 0x80 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c
0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xaa 0x40 0x00
0x3f 0x01 0x1b 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x8b 0xb8
0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca 0x95 0x0c 0x00
0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b
0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b
0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37
Sent 111 byte parcel:
0x08 0xbf 0xe0 0x11 0x71 0x00 0x00 0x60 0x80 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00
0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81
0xaa 0x40 0x00 0x3e 0x01 0x1c 0xd9 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08
0x00 0x8b 0xb8 0x0e 0xa4 0xed 0xdb 0xb6 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0xca
0x95 0x0c 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18
0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28
0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37
<...>
Received 116 byte parcel:
Dispatch cookie: 0x0074000000000000
0x00 0x09 0x00 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 0xb0 0xa8 0x6e 0x7c
0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe 0x45 0x00 0x00 0x54 0x81 0xcb 0x40 0x00
0x3f 0x01 0x1b 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08 0x00 0x47 0xbf
0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16 0x6e 0x03 0x00
0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b
0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b
0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37
Sent 111 byte parcel:
0x08 0xbf 0xe0 0x12 0x71 0x00 0x00 0x60 0x10 0x0e 0x80 0x18 0x9e 0x52 0x54 0x00
0x5c 0x97 0x46 0x5c 0x5e 0xab 0x0b 0x6e 0x7e 0x08 0x00 0x45 0x00 0x00 0x54 0x81
0xcb 0x40 0x00 0x3e 0x01 0x1c 0xb8 0x5b 0xc6 0x78 0x18 0xc2 0x64 0x07 0xe3 0x08
0x00 0x47 0xbf 0x0e 0xa4 0xed 0xfc 0xb7 0x0b 0x24 0x53 0x00 0x00 0x00 0x00 0x16
0x6e 0x03 0x00 0x00 0x00 0x00 0x00 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18
0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28
0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37
TAZ-TBB-0(mec-pe1-re0.hel.fi vty)# test jnh 0 packet-via-dmem disable
TAZ-TBB-0(mec-pe1-re0.hel.fi vty)#

So the format is 'test jnh (mq_where_interface_is) packet-via-dmem capture (16_bit_type_mask_in_hex) (up-to-8-bytes-pattern) (optional offset from start of the packet)'. You should use mask 0x3: in my tests bits 1 and 2 are production traffic, bits 10 and 15 are some crap, and the others are just some types I don't seem to be using on my boxes.
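If you build these captures often, the pattern and the command line can be generated. A small sketch, where ip_to_hex and capture_command are names made up for this example, and the optional offset is simply appended as given because its expected notation is not described above:

```ruby
# Builds the hex pattern and the vty command for packet-via-dmem capture.
# The command layout is the one described in the text; the helper names are
# hypothetical.
def ip_to_hex(ip)
  ip.split('.').map { |octet| format('%02x', octet.to_i) }.join
end

def capture_command(mq, pattern_hex, mask: 0x3, offset: nil)
  raise ArgumentError, 'pattern is limited to 8 bytes' if pattern_hex.length > 16
  cmd = format('test jnh %d packet-via-dmem capture 0x%x %s', mq, mask, pattern_hex)
  cmd += " #{offset}" if offset
  cmd
end

# Source and destination address sit back to back in the IPv4 header,
# so together they form one 8-byte pattern.
pattern = ip_to_hex('91.198.120.24') + ip_to_hex('194.100.7.227')
puts capture_command(0, pattern)
# test jnh 0 packet-via-dmem capture 0x3 5bc67818c26407e3
```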

Unlike in PFC3, we capture many packets, and the list is constantly updated until you stop the capture. This is very nice when you're not exactly sure what you're looking for and you know your trigger will also match packets you don't care about. We also easily see both the packet received and the packet sent, so we can be sure the traffic is arriving at the box; from the MAC addresses we can determine how and where, and the sent parcel gives us a high degree of confidence that the packet is leaving the box.

I don't know what the dispatch cookie means, nor what the first 6 bytes in the received parcel or the first 13 bytes in the sent parcel are. I'm guessing they are some internal metadata; quickly trying to check for stream ID in MQ and IX, or IFL and IFD numbers, I can't find a match for them. But I'm mostly interested in just seeing that the packet came in from the expected DMAC+SMAC and went out with the expected DMAC+SMAC. It would be very useful to be able to at least somehow extract ingress and egress port information (MQ, IX, port, IFL, IFD, anything).

If you're capturing on a box with multiple MQs and fabric, then you won't see the sent parcel having the rewrite information (but you'll see more metadata); you need to jump to the egress MQ to catch the rewrite information. If you are having trouble reading the hexdump, you can always use 'text2pcap' from Wireshark to turn it into a PCAP file and browse it in Wireshark. A lot more playing around is needed to understand parcel types, cookie, metadata and use in the multiple MQ scenario.
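For the text2pcap route, the dump first has to be turned into the offset plus hex bytes layout that text2pcap reads. A rough sketch, where parcel_to_hexdump is a made-up helper and the skip value is an assumption based on the 6 unknown leading bytes seen in the received parcels above (13 for the sent ones):

```ruby
# Turns a "0x00 0x08 ..." parcel dump into the offset + hex byte layout that
# text2pcap reads. `skip` drops leading internal metadata so the output starts
# at the Ethernet header. The \b anchors keep the 64-bit dispatch cookie from
# being picked up as a byte.
def parcel_to_hexdump(dump, skip: 0)
  bytes = dump.scan(/\b0x([0-9a-fA-F]{2})\b/).flatten.drop(skip)
  bytes.each_slice(16).with_index.map do |row, i|
    format('%06x %s', i * 16, row.join(' '))
  end.join("\n")
end

# First 24 bytes of the first received parcel above.
sample = '0x00 0x08 0x80 0xf0 0x80 0x08 0x5c 0x5e 0xab 0x0b 0x6e 0x60 ' \
         '0xb0 0xa8 0x6e 0x7c 0x60 0x52 0x88 0x47 0x00 0x00 0x01 0xfe'
puts parcel_to_hexdump(sample, skip: 6)
```

Saving the output to a file and running 'text2pcap parcel.txt parcel.pcap' should then give something Wireshark opens as an Ethernet frame.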

Capture your fancy, part one, PFC3

2014-03-01T13:32:00.000+02:00

It's often incredibly useful to be able to capture transit traffic; it's a quick way to prove that you're actually receiving some frames and, with any luck, to get a good idea of how and where you are sending them. It's unfortunately common, especially in 7600/6500 PFC3, to have a bug where packets are not going where the software FIB suggests they are. Luckily there is quite good tooling to inspect what really is happening. So we're taking a peek at 'ELAM'.

We have traffic coming in unlabeled to 7600 and going out labeled. Let's see how to capture it

psl2-pe2.hel.fi#show platform capture elam asic superman slot 5
psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 help
SEQ_NUM [5]  QOS [3]  QOS_TYPE [1]  TYPE [4]  STATUS_BPDU [1]  IPO [1]
NO_ESTBLS [1]  RBH [3]  CR [1]  TRUSTED [1]  NOTIFY_IL [1]  NOTIFY_NL [1]
DISABLE_NL [1]  DISABLE_IL [1]  DONT_FWD [1]  INDEX_DIRECT [1]  DONT_LEARN [1]
COND_LEARN [1]  BUNDLE_BYPASS [1]  QOS_TIC [1]  INBAND [1]  IGNORE_QOSO [1]
IGNORE_QOSI [1]  IGNORE_ACLO [1]  IGNORE_ACLI [1]  PORT_QOS [1]  CACHE_CNTRL [2]
VLAN [12]  SRC_FLOOD [1]  SRC_INDEX [19]  LEN [16]  FORMAT [2]  MPLS_EXP [3]
REC [1]  NO_STATS [1]  VPN_INDEX [10]  PACKET_TYPE [3]  L3_PROTOCOL [4]
L3_PT [8]  MPLS_TTL [8]  SRC_XTAG [4]  DEST_XTAG [4]  FF [1]  MN [1]  RF [1]
SC [1]  CARD_TYPE [4]  DMAC [48]  SMAC [48]  IPVER [1]  IP_DF [1]  IP_MF [1]
IP_HDR_LEN [4]  IP_TOS [8]  IP_LEN [16]  IP_HDR_VALID [1]  IP_CHKSUM_VALID [1]
IP_L4HDR_VALID [1]  IP_OFFSET [13]  IP_TTL [8]  IP_CHKSUM [16]  IP_SA [32]
IP_DA [32]  IP_DATA [120]  TCP_SPORT [16]  TCP_DPORT [16]  TCP_SEQ [32]
TCP_ACK [32]  TCP_CODE [6]  TCP_CHKSUM [16]  UDP_SPORT [16]  UDP_DPORT [16]
UDP_DATA [88]  IGMP_VERS [4]  IGMP_TYPE [4]  IGMP_DATA [112]  ICMP_TYPE [8]
ICMP_CODE [8]  ICMP_DATA [104]  CRC [16]  EXTRA_TYPE [4]  THR_PKT_HI_NF [14]
THR_BYTE_HI_NF [22]  THR_PKT_LO_NF [14]  THR_BYTE_LO_NF [22]  PKT_CNT_NF [14]
BYTE_CNT_NF [22]  GEMINI_NF_SEG [3]  FINRST [1]  STATS [1]  DUMMY [1]
CRC_16 [16]
well-know values for field TYPE
  ETHERNET 0x0  TOKENRING 0x1  FDDI 0x2  ATM 0x3  VOICE 0x4  VIDEO 0x5  INBAND 0x7
well-know values for field FORMAT
  IP 0x0  IPX 0x1  INBAND 0x2  OTHERS 0x3
well-know values for field PACKET_TYPE
  ETHERNET 0x0  RAW_IPX 0x1  802_C1 0x2  802_C2 0x3  802_SNAP 0x4  ENCAP_INVALID 0x7
well-know values for field L3_PROTOCOL
  IPV4 0x0  IPX 0x1  APPLETALK 0x2  DECNET 0x3  VINES 0x4  XNS 0x5  IPV6 0x6  INVALID 0xF
well-know values for field L3_PT
  ICMP 0x1  IGMP 0x2  IP 0x4  TCP 0x6  UDP 0x11
well-know values for field CACHE_CNTRL
  NORMAL 0x0  RESERVED 0x1  GEMINI_MISS 0x2  DBUS_EXTRA_CICLE 0x3
well-know values for field IPVER
  IPV4 0x0  IPV6 0x1

psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if IP_SA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu   Asic   Inst Ver PB Elam
---- --- -------- ---- --- -- ----
5    0   ST_SMAN  0    3.2 Y
DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam start
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu   Asic   Inst Ver PB Elam
---- --- -------- ---- --- -- ----
5    0   ST_SMAN  0    3.2 Y
DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 IP_SA=91.198.120.24
elam capture completed
psl2-pe2.hel.fi#show platform capture elam data
DBUS data:
SEQ_NUM [5] = 0x9
QOS [3] = 1
QOS_TYPE [1] = 0
TYPE [4] = 0 [ETHERNET]
STATUS_BPDU [1] = 0
IPO [1] = 1
NO_ESTBLS [1] = 0
RBH [3] = b000
CR [1] = 0
TRUSTED [1] = 0
NOTIFY_IL [1] = 0
NOTIFY_NL [1] = 0
DISABLE_NL [1] = 0
DISABLE_IL [1] = 0
DONT_FWD [1] = 0
INDEX_DIRECT [1] = 0
DONT_LEARN [1] = 0
COND_LEARN [1] = 0
BUNDLE_BYPASS [1] = 0
QOS_TIC [1] = 0
INBAND [1] = 0
IGNORE_QOSO [1] = 0
IGNORE_QOSI [1] = 0
IGNORE_ACLO [1] = 0
IGNORE_ACLI [1] = 0
PORT_QOS [1] = 0
CACHE_CNTRL [2] = 0 [NORMAL]
VLAN [12] = 3510
SRC_FLOOD [1] = 0
SRC_INDEX [19] = 0x7F
LEN [16] = 102
FORMAT [2] = 0 [IP]
MPLS_EXP [3] = 0x0
REC [1] = 0
NO_STATS [1] = 0
VPN_INDEX [10] = 0x3
PACKET_TYPE [3] = 0 [ETHERNET]
L3_PROTOCOL [4] = 0 [IPV4]
L3_PT [8] = 1 [ICMP]
MPLS_TTL [8] = 0
SRC_XTAG [4] = 0x8
DEST_XTAG [4] = 0x0
FF [1] = 0
MN [1] = 0
RF [1] = 0
SC [1] = 0
CARD_TYPE [4] = 0x0
DMAC = 0027.0dc9.8800
SMAC = 5c5e.abee.d320
IPVER [1] = 0 [IPV4]
IP_DF [1] = 1
IP_MF [1] = 0
IP_HDR_LEN [4] = 5
IP_TOS [8] = 0x0
IP_LEN [16] = 84
IP_HDR_VALID [1] = 1
IP_CHKSUM_VALID [1] = 1
IP_L4HDR_VALID [1] = 1
IP_OFFSET [13] = 0
IP_TTL [8] = 63
IP_CHKSUM [16] = 0x44BA
IP_SA = 91.198.120.24
IP_DA = 194.100.7.227
ICMP_TYPE [8] = 0x8
ICMP_CODE [8] = 0x0
ICMP_DATA [104]
0000: 61 FD 10 EA 15 39 1B AD 11 53 00 78 0C "a....9...S.x."
CRC [16] = 0x8310

RBUS data:
SEQ_NUM [5] = 0x9
CCC [3] = b100 [L3_RW]
CAP1 [1] = 0
CAP2 [1] = 0
QOS [3] = 0
EGRESS [1] = 0
DT [1] = 1 [GENERIC]
TL [1] = 0 [B32]
FLOOD [1] = 1
DEST_INDEX [19] = 0xC60
VLAN [12] = 3168
RBH [3] = b001
RDT [1] = 0
GENERIC [1] = 0
EXTRA_CICLE [1] = 0
FABRIC_PRIO [1] = 0
L2 [1] = 0
FCS1 [8] = 0x1
DELTA_LEN [8] = 4
REWRITE_INFO
 i0 - replace bytes from ofs 0 to ofs 13 with seq 'B0 A8 6E 7C 62 42 00 27 0D C9 88 00 88 47'.
 insert seq '49 58 01 FF' before ofs 14.
FCS2 [8] = 0x0
psl2-pe2.hel.fi#

Hooray, great success. We can only capture 1 packet: the first packet hitting the trigger criteria completes the capture, and nothing else is seen. We see 'DBUS' and 'RBUS' data. DBUS is where the lookup engine gets the lookup question from the linecard; the question is 64B (IPv4) or 96B (MPLS, IPv6). DBUS runs at 62.5MHz and carries 32B per cycle, so 62.5/2 = 31.25Mpps and 62.5/3 = 20.83Mpps are pretty much our absolute theoretical limits for how much work we can give the lookup engine; at that point DBUS is congested and we cannot send more queries to it. RBUS is where the lookup engine sends the lookup result back to the linecard; by looking at this data, the linecard knows where to send the frame and how to rewrite the packet headers.
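The lookup rate ceiling follows directly from the bus numbers. A tiny sketch of that calculation, using only the figures quoted above (the names are made up for illustration):

```ruby
# Lookup rate ceiling implied by the DBUS numbers in the text:
# 62.5 MHz bus clock, 32 bytes carried per cycle.
DBUS_MHZ = 62.5
DBUS_BYTES_PER_CYCLE = 32

# Million lookups per second for a query of the given size.
def max_mpps(query_bytes)
  cycles = (query_bytes.to_f / DBUS_BYTES_PER_CYCLE).ceil
  DBUS_MHZ / cycles
end

puts max_mpps(64)            # 31.25 (IPv4)
puts max_mpps(96).round(2)   # 20.83 (MPLS, IPv6)
```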

In DBUS we can see the incoming VLAN (typically an internal VLAN), whether the packet was recirculated (CR), QOS, IP, PORT, ICMP, etc. But it's usually the RBUS which is more interesting. Is CCC 'L3_RW'? If it were something like 'L2_POLICER', the packet wasn't forwarded but was punted, usually for 'mls rate-limit'. DEST_INDEX+VLAN tell where the packet is going; here DEST_INDEX is just the VLAN, but it could be some special adjacency too, which we can see in 'remote command switch show platform hardware tycho register 0 1794 | XXXX'. Say it was a CoPP drop: RBUS would likely show 7FFF, and we could check that there are several registers which have the value 7FFF. I don't know how to check which one was used, but luckily the registers are named and the names are usually very indicative.

There is an interesting trick we can do for packets hitting those special adjacencies. Say RBUS gave us '7F0A' and we guessed it's really '0x03C7: RED_PKT_ERR_IDX = 0x00007F0A', i.e. RED_PKT_ERR_IDX. Now we can reprogram 0x3c7 to send out to some physical port, instead of 0x7f0a, which is going to end up dropped. You might want to do this to figure out more closely what was wrong in some dropped packet, or maybe just to check what your CoPP policies are dropping. To reprogram the value, we need to know how ports are numbered in HW. Each port has an 'LTL Index', calculated as (SlotN-1)*64 + (PortN-1), so giga5/2 would be 4*64 + 1, or 257. Then we could poke the register with 'remote command switch show platform hardware tycho poke 3c7 257', and the 3c7 register would be diverted to giga5/2 instead of the original 0x7f0a special drop adjacency. You can repoke it to 0x7f0a when satisfied.
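The LTL index calculation is simple enough to script. A sketch assuming the 64 indexes per slot used in the formula above; ltl_index is a name made up for this example:

```ruby
# LTL index as described in the text: (slot - 1) * 64 + (port - 1).
PORTS_PER_SLOT = 64

def ltl_index(slot, port)
  (slot - 1) * PORTS_PER_SLOT + (port - 1)
end

puts ltl_index(5, 2)   # 257 for giga5/2
```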

RBH is the hash result for load-balancing. And then, maybe most importantly, we have REWRITE_INFO; this tells us with high confidence that the packet was sent out, and we can see if it just got a new DMAC+SMAC, or did it also get VLAN or MPLS or both. Here we can clearly see 0x8847, so it's going out with label stack entry 0x495801FF, so label value 0x49580 (300416 in decimal), cos+bos 0x1 (so cos 0, bos on), ttl 0xff. We can confirm that the REWRITE_INFO matches what we expect:

psl2-pe2.hel.fi#sh mls cef lookup 194.100.7.227 detail
Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit
       D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel
       V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1
       RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select
Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix)
Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix)
M(121242 ): E | 1 FFF 0 0 0 0 255.255.255.240
V(121242 ): 8 | 1 0 0 0 0 0 194.100.7.224 (A:165858 ,P:1,D:0,m:0 ,B:0 )
M(121242 ): E | 1 FFF 0 0 0 255.255.255.240
V(121242 ): 9 | 1 0 0 0 0 194.100.7.224 (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 )
psl2-pe2.hel.fi#sh mls cef adjacency entry 165858 detail
Index: 165858 smac: 0027.0dc9.8800, dmac: b0a8.6e7c.6242
              mtu: 9196, vlan: 3168, dindex: 0x23BC4, l3rw_vld: 1
              format: MPLS, flags: 0x1000208518
              label0: 0, exp: 0, ovr: 0
              label1: 0, exp: 0, ovr: 0
              label2: 300416, exp: 0, ovr: 0
              op: PUSH_LABEL2
              packets: 1, bytes: 80
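To check the label arithmetic without squinting at hex, the 32-bit label stack entry from REWRITE_INFO can be unpacked with the standard MPLS layout (20 bits label, 3 bits EXP, 1 bit bottom-of-stack, 8 bits TTL). A minimal sketch, with decode_mpls being a name made up for this example:

```ruby
# Splits a 32-bit MPLS label stack entry into its fields:
# 20 bits label, 3 bits EXP/TC, 1 bit bottom-of-stack, 8 bits TTL.
def decode_mpls(entry)
  {
    label: entry >> 12,
    exp:   (entry >> 9) & 0x7,
    bos:   (entry >> 8) & 0x1,
    ttl:   entry & 0xff
  }
end

fields = decode_mpls(0x495801FF)
puts fields[:label]   # 300416, the label2 value in the adjacency above
puts fields[:exp]     # 0
puts fields[:bos]     # 1
puts fields[:ttl]     # 255
```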

Same MAC addresses, egress VLAN, label value, so everything is fine here. Now what if we want to capture the packet coming in from MPLS? That is quite a bit harder, as we cannot use any syntactic sugar trigger values. But ELAM allows you to match on any bit and to mask any bit. So it's just a matter of patience or a little script to have a proper capture filter for MPLS packets too. We should try to figure out as much as we can about how the packet is coming in, especially how many labels it has (as that affects our offset); the matching begins from DMAC and continues as on the wire. We could match DMAC+SMAC+ETYPE+dont_care+SADDR+DADDR etc. Right now I'm lazy and I'll just match the MPLS etype and DADDR: we were seeing SADDR 91.198.120.24 going to the MPLS network, so let's see if we can find DADDR 91.198.120.24 (5b c6 78 18) coming from the MPLS network. Here it is beneficial to have a cursory understanding of the ethernet + mpls + ip headers; especially you want to know the offset to the bits you're interested in. The ELAM capture is specified in groups of 4 bytes, and a group can be compressed to 0 if you don't care about any of its bytes.
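The "little script" could look something like this sketch. It takes byte offsets from the start of the frame (offset 0 is the first byte of DMAC) and produces the 4-byte DATA words and the matching mask words. The helper name and its argument shape are made up here; only the DATA=... [ ... ] layout is taken from the trigger used below.

```ruby
# Builds the 32-bit DATA and mask words for an ELAM 'dbus others' trigger from
# byte offsets into the frame. Words that are not matched stay 0.
def elam_words(matches)
  last = matches.map { |offset, bytes| offset + bytes.size - 1 }.max
  count = last / 4 + 1
  data = Array.new(count, 0)
  mask = Array.new(count, 0)
  matches.each do |offset, bytes|
    bytes.each_with_index do |byte, i|
      word, pos = (offset + i).divmod(4)
      shift = (3 - pos) * 8            # byte 0 of a word is the most significant
      data[word] |= byte << shift
      mask[word] |= 0xff << shift
    end
  end
  fmt = ->(words) { words.map { |w| w.zero? ? '0' : format('0x%08x', w) }.join(' ') }
  "DATA=#{fmt.call(data)} [ #{fmt.call(mask)} ]"
end

# One MPLS label: ethertype at byte 12, IPv4 DADDR at 14 + 4 + 16 = 34.
puts elam_words({ 12 => [0x88, 0x47], 34 => [0x5b, 0xc6, 0x78, 0x18] })
# DATA=0 0 0 0x88470000 0 0 0 0 0x00005bc6 0x78180000 [ 0 0 0 0xffff0000 0 0 0 0 0x0000ffff 0xffff0000 ]
```

With two labels the DADDR offset moves to 38, which is exactly the kind of bookkeeping that is easy to get wrong by hand.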

psl2-pe2.hel.fi#show platform capture elam trigger dbus others if DATA=0 0 0 0x88470000 0 0 0 0 0x00005bc6 0x78180000 [ 0 0 0 0xffff0000 0 0 0 0 0x0000ffff 0xffff0000 ] VLAN=3168
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu   Asic   Inst Ver PB Elam
---- --- -------- ---- --- -- ----
5    0   ST_SMAN  0    3.2 Y
DBUS trigger: FORMAT=OTHERS DATA=0 0 0 0X88470000 0 0 0 0 0X00005BC6 0X78180000 [ 0 0 0 0XFFFF0000 0 0 0 0 0X0000FFFF 0XFFFF0000 ] VLAN=3168
psl2-pe2.hel.fi#show platform capture elam data
DBUS data:
SEQ_NUM [5] = 0x12
QOS [3] = 0
QOS_TYPE [1] = 0
TYPE [4] = 0 [ETHERNET]
STATUS_BPDU [1] = 0
IPO [1] = 1
NO_ESTBLS [1] = 0
RBH [3] = b000
CR [1] = 0
TRUSTED [1] = 1
NOTIFY_IL [1] = 0
NOTIFY_NL [1] = 0
DISABLE_NL [1] = 0
DISABLE_IL [1] = 0
DONT_FWD [1] = 0
INDEX_DIRECT [1] = 0
DONT_LEARN [1] = 0
COND_LEARN [1] = 0
BUNDLE_BYPASS [1] = 0
QOS_TIC [1] = 1
INBAND [1] = 0
IGNORE_QOSO [1] = 0
IGNORE_QOSI [1] = 0
IGNORE_ACLO [1] = 0
IGNORE_ACLI [1] = 0
PORT_QOS [1] = 0
CACHE_CNTRL [2] = 0 [NORMAL]
VLAN [12] = 3168
SRC_FLOOD [1] = 0
SRC_INDEX [19] = 0xC1
LEN [16] = 106
FORMAT [2] = 3 [OTHERS]
PACKET_TYPE [3] = 0 [ETHERNET]
L3_PROTOCOL [4] = 15 [INVALID]
L3_PT [8] = 0
FF [1] = 0
MN [1] = 0
RF [1] = 0
SC [1] = 0
CARD_TYPE [4] = 0x0
ISL [16] = 0x0
DATA [592]
0000: 00 27 0D C9 88 00 B0 A8 6E 7C 62 42 88 47 00 00 ".'......n|bB.G.."
0010: 03 FE 45 00 00 54 A6 E8 00 00 3F 01 36 9B C2 64 "..E..T....?.6..d"
0020: 07 E3 5B C6 78 18 00 00 A3 FD 12 28 0A E5 2F B9 "..[.x......(../."
0030: 11 53 00 00 00 00 CF BB 01 00 00 00 00 00 42 42 ".S............BB"
0040: 42 42 42 42 42 42 42 42 42 42 "BBBBBBBBBB"
CRC [16] = 0x8184

RBUS data:
SEQ_NUM [5] = 0x12
CCC [3] = b100 [L3_RW]
CAP1 [1] = 0
CAP2 [1] = 0
QOS [3] = 1
EGRESS [1] = 0
DT [1] = 0 [IP]
TL [1] = 1 [B64]
FLOOD [1] = 0
DEST_INDEX [19] = 0x7FFA
VLAN [12] = 4012
RBH [3] = b011
RDT [1] = 1
GENERIC [1] = 0
EXTRA_CICLE [1] = 0
FABRIC_PRIO [1] = 0
L2 [1] = 0
FCS1 [8] = 0x1
IP_TOS_VALID [1] = 0
IP_TOS_OFS [7] = 19
IP_TOS [8] = 0x0
IP_TTL_VALID [1] = 1
IP_TTL_OFS [7] = 26
IP_TTL [8] = 63
IP_CSUM_VALID [1] = 1
IP_CSUM_OFS [7] = 28
IP_CSUM [16] = 0x369B
DELTA_LEN [8] = 252
REWRITE_INFO
 i0 - replace bytes from ofs 0 to ofs 13 with seq '00 27 0D C9 88 00 00 00 00 00 00 00 08 00'.
 remove bytes from ofs 14 to ofs 17.
FCS2 [8] = 0x0

Faaantastic. We have the expected DADDR+SADDR, it's ICMP, and I even set the payload to 0x42 now just to be clear. But hmm, DEST_INDEX 0x7FFA does not look good, that's something special, and a cursory look at the REWRITE_INFO makes me dubious. Let's dig around a bit.

psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail
Index: 67379 smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320
             mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1
             format: MAC_TCP, flags: 0x2000208408
             delta_seq: 0, delta_ack: 0
             packets: 986, bytes: 100544
psl2-pe2.hel.fi#sh mls cef lookup 91.198.120.24 detail
Codes: M - mask entry, V - value entry, A - adjacency index, P - priority bit
       D - full don't switch, m - load balancing modnumber, B - BGP Bucket sel
       V0 - Vlan 0,C0 - don't comp bit 0,V1 - Vlan 1,C1 - don't comp bit 1
       RVTEN - RPF Vlan table enable, RVTSEL - RPF Vlan table select
Format: IPV4_DA - (8 | xtag vpn pi cr recirc tos prefix)
Format: IPV4_SA - (9 | xtag vpn pi cr recirc prefix)
M(159300 ): E | 1 FFF 0 0 0 0 255.255.255.0
V(159300 ): 8 | 1 0 0 0 0 0 91.198.120.0 (A:67379 ,P:1,D:0,m:0 ,B:0 )
M(159300 ): E | 1 FFF 0 0 0 255.255.255.0
V(159300 ): 9 | 1 0 0 0 0 91.198.120.0 (V0:0 ,C0:0 ,V1:0 ,C1:0 ,RVTEN:0 ,RVTSEL:0 )
psl2-pe2.hel.fi#sh mls cef adjacency entry 67379 detail
Index: 67379 smac: 0027.0dc9.8800, dmac: 5c5e.abee.d320
             mtu: 1518, vlan: 3510, dindex: 0x0, l3rw_vld: 1
             format: MAC_TCP, flags: 0x2000208408
             delta_seq: 0, delta_ack: 0
             packets: 203, bytes: 20666

Yeah, we're definitely not rewriting those MAC addresses and the VLAN in the RBUS is definitely not correct. But what are that VLAN and the DEST_INDEX?

psl2-pe2.hel.fi#remote command switch show platform hardware tycho register 0 1794 | i 7FFA
0x03C1: RED_CEN_RW_IDX = 0x00007FFA [32762 ]
0x03C3: RED_IN_RW_FF_IDX = 0x00007FFA [32762 ]
psl2-pe2.hel.fi#show vlan internal usage | i 4012
4012 vrf_0_vlan

Aaah, it's a recirculate. We're putting it to VRF 0 (INET) and sending 'er right back. Notice how we're changing the MAC addresses and dropping the MPLS label on rewrite. As the MPLS label is gone, we can again capture the packet with fancy triggers. Just to be sure we're capturing the recirculated one, we'll also set CR=1 (if you need another recirculate, too bad, there may not be any way to see the 3rd one).

psl2-pe2.hel.fi#show platform capture elam trigger dbus ipv4 if CR=1 IP_DA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam status
active ELAM info:
Slot Cpu Asic     Inst Ver PB Elam
---- --- -------- ---- --- -- ----
5    0   ST_SMAN  0    3.2 Y
DBUS trigger: FORMAT=IP L3_PROTOCOL=IPV4 CR=1 IP_DA=91.198.120.24
psl2-pe2.hel.fi#show platform capture elam start
psl2-pe2.hel.fi#show platform capture elam data
DBUS data:
SEQ_NUM [5] = 0x7
QOS [3] = 1
QOS_TYPE [1] = 0
TYPE [4] = 0 [ETHERNET]
STATUS_BPDU [1] = 0
IPO [1] = 1
NO_ESTBLS [1] = 0
RBH [3] = b000
CR [1] = 1
TRUSTED [1] = 1
NOTIFY_IL [1] = 0
NOTIFY_NL [1] = 0
DISABLE_NL [1] = 0
DISABLE_IL [1] = 0
DONT_FWD [1] = 0
INDEX_DIRECT [1] = 0
DONT_LEARN [1] = 0
COND_LEARN [1] = 0
BUNDLE_BYPASS [1] = 0
QOS_TIC [1] = 1
INBAND [1] = 0
IGNORE_QOSO [1] = 0
IGNORE_QOSI [1] = 0
IGNORE_ACLO [1] = 0
IGNORE_ACLI [1] = 0
PORT_QOS [1] = 0
CACHE_CNTRL [2] = 0 [NORMAL]
VLAN [12] = 4012
SRC_FLOOD [1] = 0
SRC_INDEX [19] = 0xC1
LEN [16] = 102
FORMAT [2] = 0 [IP]
MPLS_EXP [3] = 0x0
REC [1] = 0
NO_STATS [1] = 0
VPN_INDEX [10] = 0x7F
PACKET_TYPE [3] = 0 [ETHERNET]
L3_PROTOCOL [4] = 0 [IPV4]
L3_PT [8] = 1 [ICMP]
MPLS_TTL [8] = 0
SRC_XTAG [4] = 0xF
DEST_XTAG [4] = 0xA
FF [1] = 0
MN [1] = 0
RF [1] = 1
SC [1] = 0
CARD_TYPE [4] = 0x0
DMAC = 0027.0dc9.8800
SMAC = 0000.0000.0000
IPVER [1] = 0 [IPV4]
IP_DF [1] = 0
IP_MF [1] = 0
IP_HDR_LEN [4] = 5
IP_TOS [8] = 0x0
IP_LEN [16] = 84
IP_HDR_VALID [1] = 1
IP_CHKSUM_VALID [1] = 1
IP_L4HDR_VALID [1] = 1
IP_OFFSET [13] = 0
IP_TTL [8] = 63
IP_CHKSUM [16] = 0x8774
IP_SA = 194.100.7.227
IP_DA = 91.198.120.24
ICMP_TYPE [8] = 0x0
ICMP_CODE [8] = 0x0
ICMP_DATA [104]
0000: 89 06 13 36 17 1E 5A C3 11 53 00 B0 61 "...6..Z..S..a"
CRC [16] = 0x136E
RBUS data:
SEQ_NUM [5] = 0x7
CCC [3] = b100 [L3_RW]
CAP1 [1] = 0
CAP2 [1] = 0
QOS [3] = 0
EGRESS [1] = 0
DT [1] = 0 [IP]
TL [1] = 0 [B32]
FLOOD [1] = 1
DEST_INDEX [19] = 0xDB6
VLAN [12] = 3510
RBH [3] = b010
RDT [1] = 0
GENERIC [1] = 0
EXTRA_CICLE [1] = 0
FABRIC_PRIO [1] = 0
L2 [1] = 0
FCS1 [8] = 0x1
IP_TOS_VALID [1] = 0
IP_TOS_OFS [7] = 15
IP_TOS [8] = 0x0
IP_TTL_VALID [1] = 1
IP_TTL_OFS [7] = 22
IP_TTL [8] = 62
IP_CSUM_VALID [1] = 1
IP_CSUM_OFS [7] = 24
IP_CSUM [16] = 0x8874
DELTA_LEN [8] = 0
REWRITE_INFO
i0 - replace bytes from ofs 0 to ofs 11 with seq '5C 5E AB EE D3 20 00 27 0D C9 88 00'.
FCS2 [8] = 0x0

Greeaaat! We have expected rewrite info and VLAN, so we have high confidence for both directions of packets, and we've learned the packets coming from MPLS side are getting recirculated (it's because of EXP0, but that's another story, there are many reasons why it might happen).

JunOS and ARP Glean

2014-02-25T18:51:00.000+02:00

I'm using Cisco vocabulary 'glean' here as I don't know better word for it. Glean is any IPv4 packet which is going to connected host which is not resolved. It is NOT an ARP packet, so ARP policers won't help you. They are punted, since you need to generate ARP packet and try to resolve them.

In 7600 we can use 'mls rate-limit unicast cef glean 200 50' to limit how many packets per second are punted to control-plane for glean purposes. How can we limit this in JunOS? As far as I can see, there is no way. But I remember testing this attack and was unable to break MX80, so why didn't it break?

First let's check what does connected network look like

fisakytt@test13nqa1-re0.dk> show route forwarding-table destination 62.236.255.179/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
62.236.255.0/24    intf     0                    rslv   828     1 xe-0/0/0.42

Ok, fair enough. Type 'rslv', which we can guess means packet is punted to control-plane for resolving ARP. Let's try to ping some address rapidly which does not resolve and check what it looks like

Interesting, we get host/32 'hold' entry for it, no longer whole connected network as itself. Hold adjacency is programmed by software, so between 1st packet coming in and hold adjacency being programmed all packets in between are punted as well. I'm bit optimistic now that maybe there is some intelligence here. Maybe we'll get more information by checking the index (it changes quite often, so try to catch it at same time)

MX104-ABB-0(test13nqa1-re0.dk vty)# show nhdb id 830 extensive
ID    Type     Interface     Next Hop Addr   Protocol   Encap        MTU  Flags              PFE internal Flags
----- -------- ------------- --------------- ---------- ------------ ---- ------------------ ------------------
830   Hold     xe-0/0/0.42   -               IPv4       Ethernet     0    0x0000000000000000 0x0000000000000000
BFD Session Id: 0
Hold Status
Packet: 0x4499d788
Timer: Running
Flags: 0x0000000000000000

Well not too interesting. Didn't try to figure out what 'Packet' means yet, maybe it's memory location for packet which caused it to be on hold, maybe it's timer information. But curious thing is, that it's timed. So we'll know it'll go away automatically after some time. But how can we figure out what happens to hold and rslv? Maybe they are exceptions?

MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh 0 exceptions terse
Reason                        Type          Packets         Bytes
==================================================================
Routing
----------------------
hold route                    DISC(70)         6339        532476
resolve route                 PUNT(33)          175         14700

Yup, the hold routes are discarded. I'm sending 10pps, so we're punting 0.27 per second (175 punts over roughly 650 seconds of pinging). From that we can extrapolate the hold timer being about 3.7s, so 1 packet per 3.7s is punted. If I had been sending a lot more packets, I'd see a lot more punts; 10pps is just a very slow rate, so the 'hold' adjacency will drop the rest of them. Also, if we have very many addresses this can add up to a good number of packets. I might try to walk through the punt(33) to see if there is a policer attached to it or not (there should be, but in certain situations it will be bypassed; this will be fixed and hopefully at the same time we'll see ddos-policer limiters for resolve packets).
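The extrapolation above can be redone from the two counters. This is just a sketch of the arithmetic, and it assumes the counters started from zero when the ping started and that every ping ended up in one of the two counters; the function name is made up for illustration.

```python
def estimate_hold_timer(hold_discards, resolve_punts, send_rate_pps):
    """Return (elapsed seconds, punts per second, estimated hold timer in seconds).

    Assumes both counters were zero when the test started and that each
    packet sent was counted either as a hold discard or as a resolve punt.
    """
    total_packets = hold_discards + resolve_punts
    elapsed_s = total_packets / send_rate_pps
    punts_per_s = resolve_punts / elapsed_s
    # One punt per hold period, so the period is the inverse of the punt rate.
    hold_timer_s = 1 / punts_per_s
    return elapsed_s, punts_per_s, hold_timer_s


# Counter values from 'show jnh 0 exceptions terse' above, ping rate 10pps.
elapsed_s, punts_per_s, hold_timer_s = estimate_hold_timer(6339, 175, 10)
print(f"{elapsed_s:.0f} s elapsed, {punts_per_s:.2f} punts/s, "
      f"hold timer about {hold_timer_s:.1f} s")
# prints: 651 s elapsed, 0.27 punts/s, hold timer about 3.7 s
```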

Just to be sure we've interpreted all this correctly, let's check that DISC(70) is seeing our ping packet to 62.236.255.179

MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions 70 discard
MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions-trace
MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh exceptions-trace
[430247] jnh_exception_packet_trace: ###############
[430248] jnh_exception_packet_trace: [iif:338,code/info:198/0x273,score:(0x0),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:102]
[430249] jnh_exception_packet_trace: 0x00: 20 00 c6 00 27 30 01 52 00 12 00 66 80 00 00 20
[430250] jnh_exception_packet_trace: 0x10: 12 00 00 12 00 10 00 00 00 3c 8a b0 8d a6 4a 28
[430251] jnh_exception_packet_trace: 0x20: c0 da 07 c1 c8 88 47 00 00 0d 3e 45 c0 00 54 06
[430252] jnh_exception_packet_trace: 0x30: 62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3 08
[430253] jnh_exception_packet_trace: 0x40: 00 99 6b 3a c5 16 07 53 0c c7 ed 00 00 07 cb 08
[430254] jnh_exception_packet_trace: 0x50: 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18
[430255] jnh_exception_packet_trace: 0x60: 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28
[430256] jnh_exception_packet_trace: 0x70: 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37

3e ec ff b3 => 62.236.255.179, yup that's our ping packet being discarded due to 'hold' adjacency
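If you don't want to eyeball hex, a few lines of Python can do the lookup. This is a minimal sketch; the helper name find_ipv4_offsets is made up here, and the input is just the 0x30 row of the trace above.

```python
import ipaddress


def find_ipv4_offsets(trace_bytes, address):
    """Return every byte offset where the packed IPv4 address occurs."""
    needle = ipaddress.IPv4Address(address).packed
    offsets = []
    start = trace_bytes.find(needle)
    while start != -1:
        offsets.append(start)
        start = trace_bytes.find(needle, start + 1)
    return offsets


# The 0x30 row of the exception trace; bytes.fromhex() ignores the spaces.
row_0x30 = bytes.fromhex("62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3 08")

print(ipaddress.IPv4Address(bytes.fromhex("3eecffb3")))  # 62.236.255.179
print(find_ipv4_offsets(row_0x30, "62.236.255.179"))     # [11]
```

The same search works on the whole trace if you concatenate the rows first, which matters when the address straddles two rows.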

JunOS 'L3 incompletes', what and why?

2014-02-24T09:45:00.000+02:00

There is quite often chatter about L3 incompletes, and it seems there are lot of opinions what they are. Maybe some of these opinions are based on some particular counter bug in some release. Juniper has introduced also toggle to allow stopping the counter from working. It seems very silly to use this toggle, as it is really one of the few ways you can gather information about broken packets via SNMP.

What they (at least) are not

Unknown unicast
CDP
BPDU
Packet from connected host which does not ARP
Packet from unconfigured VLAN

What they (at least) are

IP header checksum error
IP header error (impossibly small IHL, IP version 3, etc)
IP header size does not match packet size

Troubleshooting

So if you are seeing them, what can you do? As it is an aggregate counter for many different issues, how do you actually know which one it is, and is there a way to figure out who is sending them? Luckily for Trio based platforms the answers are highly encouraging, we have very good tools to troubleshoot the issue.

To figure out what they exactly are, first you need to figure out your internal IFD index (not snmp ifindex)

im@ruuter> show interfaces xe-7/0/0 |match index:
Interface index: 224, SNMP ifIndex: 586

After figuring out the index, we can log in to the PFE and check stream counters for that IFD

im@ruuter> start shell pfe network fpc7
NPC platform (1067Mhz MPC 8548 processor, 2048MB memory, 512KB flash)
NPC7(ruuter vty)# show jnh ifd 224 stream
ifd = 224, Stream = 33
Stream ID: 33 (inst = 0)
Cntr : 0x00c0f102
Encap : Ether
Encap = 0, StartNH = 0xc040e1
lacp:+, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, dc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001
Input Statistics:
0003126353191368 pkts, 3351074223070319 bytes
Detail Statistics:
rx0: 0000000000000000 pkts, 0000000000000000 bytes
rx1: 0000007792865413 pkts, 0000923636240746 bytes
rx2: 0003118560325955 pkts, 3350150586829573 bytes
drop0: 0000000000000000 pkts, 0000000000000000 bytes
drop1: 0000000000000000 pkts, 0000000000000000 bytes
drop2: 0000000000000000 pkts, 0000000000000000 bytes
unknown-iif: 0000000000000000 pkts, 0000000000000000 bytes
checksum: 0000000000625225 pkts, 0000000268883747 bytes
unknown-proto: 0000000000024793 pkts, 0000000006398918 bytes
bad-ucastmac: 0000000218713670 pkts, 0000034352327467 bytes
bad-ucastmac-IPv6: 0000000002160892 pkts, 0000000172764339 bytes
bad-smac: 0000000000000000 pkts, 0000000000000000 bytes
in-stp: 0000000000000000 pkts, 0000000000000000 bytes
out-stp: 0000000000000000 pkts, 0000000000000000 bytes
vlan-check: 0000000000000000 pkts, 0000000000000000 bytes
frame-errors: 0000000000000108 pkts, 0000000000014451 bytes
bad-IPv4-hdr: 0000000000033339 pkts, 0000000012708126 bytes
bad-IPv4-len: 0000000000070901 pkts, 0000000025836710 bytes
bad-IPv6-hdr: 0000000000000133 pkts, 0000000000009508 bytes
bad-IPv6-len: 0000000000000993 pkts, 0000000000071269 bytes
out-mtu-errors: 0000000000003391 pkts, 0000000005122005 bytes
L4-len: 0000000000038084 pkts, 0000000001765247 bytes
Stream Features:
Topology: stream-(33)
Flavor: i-root (1), Refcount 0, Flags 0x1
Addr: 0x4513f3c8, Next: 0x4fdd3c78, Context 0x4513f3c0
Link 0: da40602e:32000303, Offset 12, Next: da40602e:32000303
Link 1: 00000000:00000000, Offset 12, Next: 00000000:00000000
Link 2: 00000000:00000000, Offset 12, Next: 00000000:00000000
Link 3: 00000000:00000000, Offset 12, Next: 00000000:00000000
Topology Neighbors:
[none]-> stream-(33)-> flist-master(stream)
Feature List: stream
[pfe-0]: 0xda40602e32000303;
f_mask:0x80000000000000; c_mask:0x8000000000000000; f_num:9; c_num:1, inst:0
Idx#8 iif-lookup:
[pfe-0]: 0xda40602e32000303

Here we can see 'checksum', 'bad-IPvX-hdr' and 'bad-IPvX-len'; at least all of these are 'L3 incompletes'. There may be other reasons, but that's the absolute minimum. We can also see aggregate counters for all the interfaces in a given Trio, we'll need some of this information later:

NPC7(ruuter vty)# show jnh 0 exceptions terse
Reason                        Type          Packets         Bytes
==================================================================
PFE State Invalid
----------------------
sw error                      DISC(64)    197636729   13174899216
invalid fabric token          DISC(75)           68          4311
unknown family                DISC(73)        24793       6398918
iif down                      DISC(87)         4516        337076
egress pfe unspecified        DISC(19)      5857595    1900968530
Packet Exceptions
----------------------
bad ipv4 hdr checksum         DISC( 2)       660667     289608849
bad IPv6 options pkt          DISC( 9)            3           216
bad IPv4 hdr                  DISC(11)        33339      12708126
bad IPv6 hdr                  DISC(56)          133          9508
bad IPv4 pkt len              DISC(12)       108203      33978274
bad IPv6 pkt len              DISC(57)         1009         72421
L4 len too short              DISC(13)       143678       6622571
frag needed but DF set        DISC(22)        21915      33137575
ttl expired                   PUNT( 1)     51770371    3124910479
IP options                    PUNT( 2)          777        108006
frame format error            DISC( 0)          108         14451
my-mac check failed           DISC(28)    218721556   34352693971
my-mac check failed IPv6      DISC(58)      2161073     172779483
DDOS policer violation notifs PUNT(15)      2438770     326360620
Firewall
----------------------
firewall discard              DISC(67)   1284437202  544450843517
firewall discard V6           DISC(101)    34130853    4143870718
Routing
----------------------
discard route                 DISC(66)   3740477632  552767152133
discard route IPv6            DISC(102)  3894436247  281512808097
hold route                    DISC(70)          471         35151
resolve route                 PUNT(33)           10           776
resolve route V6              PUNT(69)          818         63097
control pkt punt via nh       PUNT(34)    993912636   45817932600
host route                    PUNT(32)    228855708   19337408523
mcast host copy               PUNT( 6)         2591        422909
reject route                  PUNT(40)      2855554     402183663
reject route V6               PUNT(68)         9277       1901206

The counters for L3 incompletes are 'bad ipv4 hdr checksum', 'bad ipvX hdr' and 'bad ipvX pkt len'. Notice how we're missing IPv6 hdr checksum, obviously because IPv6 does not have this, because it was deemed unnecessary, but we'll shortly see this may have been bad decision.

Now why could we possibly see L3 incomplete increasing? If frame is mangled, we'll see ethernet CRC failing (which is much better than IP checksum) and we'll never even check any of these, we'll drop frame much earlier. So clearly we received packet which had correct ethernet CRC yet it was broken. We recently had issue where pretty much all egress PE boxes started logging 'l3 incompletes' because IPv4 header checksum was failing, they incremented maybe 20 times per hour, so very moderately. But how is this possible? If someone generated broken IP packet and sent it to us, we'd drop it in ingress PE box with incrementing these counters, yet packet traversed through MPLS core all the way to egress PE. So clearly we were mangling them. Obviously core is like Jon Snow, it's just MPLS frame to it, it does not need to know it's IP nor should it verify it for correctness, so core will happily pass broken packets around. It seems like complex problem to try to figure out who is mangling the packets. Luckily Trio gives us ability to capture exception packets, here we need to use the exception number we see above in exceptions counters

NPC7(ruuter vty)# debug jnh exceptions 2 discard
NPC7(ruuter vty)# debug jnh exceptions-trace
NPC7(ruuter vty)# show jnh exceptions-trace
[1768975] jnh_exception_packet_trace: ###############
[1768976] jnh_exception_packet_trace: [iif:344,code/info:130/0x0,score:tcp|(0x40),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:60]
[1768977] jnh_exception_packet_trace: 0x00: 20 40 82 00 00 00 01 58 00 12 00 3c 80 00 00 20
[1768978] jnh_exception_packet_trace: 0x10: 12 00 00 3c 00 00 00 00 00 28 c0 da 07 c0 00 00
[1768979] jnh_exception_packet_trace: 0x20: 12 1e d5 97 f8 88 47 00 00 03 3d 45 00 00 28 1e
[1768980] jnh_exception_packet_trace: 0x30: 4f 40 00 87 06 44 d4 XX XX 39 58 XX XX 3f 6a d0
[1768981] jnh_exception_packet_trace: 0x40: c0 e2 82 4f 0d a2 2d cc ec aa 6b fd 78 0f 10 22
[1768982] jnh_exception_packet_trace: 0x50: 60 cd 42 00 00

It should be lot simpler for us now to troubleshoot the issue, we just figure out where is that SADDR (XX XX 39 58) entering the network, with any luck, the mangling node is somewhere in the edge of the network and we will find some common theme in source addresses. If it's in core, you're pretty much out of luck, you'll replace whole network or accept that you mangle something. I wrote little script which you can give IP header, incorrect checksum, correct checksum and it'll show you what each 16b field should be, to result in correct checksum, with luck you can use it to figure out which part of the packet is being mangled, as some of those 16b fields are bound to result in impossible/invalid values to have correct checksum, which allows you to exclude them and concentrate efforts on rest of the fields. But field like ID can be anything, so impossible to exclude. SADDR is also problematic, DADDR usually not (if it were this value, it would not have been routed to my network).
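The script itself is not reproduced here, but the idea can be sketched in a few lines of Python. This is my reconstruction of the approach, not the original script: it assumes a 20-byte header without options and that exactly one 16-bit word was mangled, and for each word it prints the value that word would need for the header to verify. Since the addresses in the trace above are redacted, the demo uses a well-known textbook header (checksum 0xb861) with its TTL byte deliberately corrupted.

```python
import struct

# 16-bit words of an option-less IPv4 header, in wire order.
FIELD_NAMES = ["ver/ihl/tos", "total length", "id", "flags/frag", "ttl/proto",
               "checksum", "saddr hi", "saddr lo", "daddr hi", "daddr lo"]


def ones_complement_sum(words):
    """16-bit one's complement sum with end-around carry."""
    total = sum(words)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def ipv4_checksum(header):
    """Checksum as a sender computes it: checksum field counted as zero."""
    words = list(struct.unpack(f"!{len(header) // 2}H", header))
    words[5] = 0
    return ~ones_complement_sum(words) & 0xFFFF


def candidate_fixes(header):
    """For each 16-bit word, the value it would need for the header to verify.

    A valid header sums to 0xFFFF, so the needed value for one word is the
    one's complement of the sum of all the other words.
    """
    words = list(struct.unpack("!10H", header[:20]))
    fixes = []
    for i, name in enumerate(FIELD_NAMES):
        others = words[:i] + words[i + 1:]
        needed = ~ones_complement_sum(others) & 0xFFFF
        fixes.append((name, words[i], needed))
    return fixes


good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
mangled = bytearray(good)
mangled[8] = 0x3F  # pretend the TTL byte (0x40) was corrupted in transit

for name, seen, needed in candidate_fixes(bytes(mangled)):
    print(f"{name:13} seen 0x{seen:04x}, would need 0x{needed:04x}")
```

Most rows will show a value that makes no sense for that field, which is what lets you exclude them; the row for the word that was really hit shows the original value.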

Why does this happen? How often does this happen? I really would like to know. Obviously as CRC is correct, it's not happening because of errors in links/optics/etc. My guess is, this happens mostly because of bad memory in forwarding logic. Packet can touch many memories on its path, ring => sram => dram => sram => ring, do all of these have ECC? Does PHY ring ever have ECC? I don't know, but clearly there can be problems, as L3 incompletes exist. How common are they? Probably a lot more common than we think, as we only know about mangling when it happens to happen in the IPv4 header (IPv6 header mangling would usually pass unnoticed). IPv4 header is 20B, packet length typically is 1500B, so are we only seeing 1.3% of the tip of the iceberg?

I guess the key takeaway here is, don't use 'ignore-l3-incompletes', monitor your 'l3 incompletes' via SNMP, figure out why they are happening and fix them. Especially check right now if your egress PE has L3 incompletes from core.

Why you should want metered INET?

2014-02-21T09:29:00.000+02:00

When people think about metered, they may think about mobile roaming or old outrageous per minute PSTN billing. Those are not fair prices, they are not what I'm talking about.

Also INET should be always on, billing should take this into consideration, maybe once you exceed your paid capacity, your connection is policed to 256kbps unless you pay for more. You could get notice when this limit is nearing by SMS and Email.

Flat-rate billing is based on assumption that on average INET is not used much at all, in such scenario it works. Consumers get flat-rate stove-gas in Helsinki, because its use is almost non-existing. But services like Youtube and Netflix which are relatively new can alone be 2/3 of all your traffic, meaning what ever average use you planned for, it's not true, average use is increasing as more services users care for appear.

1. Quality

When you pay flat rate there is financial incentive for your operator not to provide you bits, every bit not provided improves your margins. Operators today regularly keep some ports congested, because it would be expensive to upgrade, instead they try get someone else to pay for it, if they have the leverage.

If consumers pay for bits then delivering bits means more money and Internet quality is market driven fact, everyone has good Internet.

2. Equality

I may stream HD content whole day just to have background noise, my retired mother might check email 3 times per week, yet we pay the same. Clearly my mother is subsidizing my use, considering her minuscule pension is 1/5th of my still moderate wage, it does not seem very fair at all.

Should I pay same for electricity, petrol or grocery shopping as everyone else? In post-scarcity society we wouldn't pay for anything, but unfortunately we're not there yet.

http://www.telecompetitor.com/jdsu-less-than-1-of-users-comprise-over-50-of-lte-usage/

3. Free market

Eyeball providers know that providing bits is bad for business, so if you are large enough like FT or Comcast you can extort money from content owners by keeping your transit ports congested. This causes consumers to complain to content shops and forces content shops to either lose business to these consumers or pay FT/Comcast/etc premium rate to access their network. This premium rate isn't anything to sneeze at, it can be four times the cost you pay for IP transit.

So essentially consumers are paying part of their access indirectly through content shops.

In addition to the clear inefficiencies that indirect billing causes this is also highly anti-competitive, if you are small provider you lack the leverage to extort content owners making it much harder for you to compete with the market leaders, over time this creates de facto monopolies.

http://www.lightreading.com/cable-video/ott/cogent-gearing-for-another-peering-battle/d/d-id/707831?

What might happen if some eyeball operator would change to fair per-use billing over-night? I'll assume very naive model as I'm not ready to give this sufficient thought. Let's assume they create scheme which delivers exactly same amount of money from consumers as they get today, 1/3 pays more, 1/3 pays same and 1/3 pays less.

I would guess the obvious outcome is, 1/3, the heavy users, switch to alternative provider if such exists in their market. This removes huge quantity of capacity, so essentially provider can save all CAPEX spending to the network.

What would happen to the networks who got these 1/3 heavy users? Their business case would get radically poorer, as they didn't get enough regular users to pay for those heavy users, so they would struggle to deliver the bits leading to congested poorly performing network.

Heavy users who want quality, would have to switch back to a network which financially can and will deliver the service they want and other operators now struggling with broken fundamentals would need to implement same billing scheme.

my Cisco Live 2013 US program

2013-03-31T21:45:00.000+03:00

Once again I'll be attending Cisco Live 2013 in the United States. This year it's in Orlando, Florida. Below is my tentative schedule, although there was some overlap with a few other sessions that I would have liked to attend. Hope to see a lot of the people I've met at previous Networkers and Lives there too :-)

This is my schedule. There are many like it but this one is mine.

Monday

BRKOPT-2106 DWDM 101
BRKNMS-2517 Operations Architecture
BRKCOM-3003 UCS Ethernet Troubleshooting of the uplinks to the Data Center LAN Switches

Tuesday

BRKOPT-2117 High Speed WAN Interconnections - Evaluating existing and emerging technologies
GENKEY-1295 KEYNOTE: Tomorrow Starts Here
BRKARC-3470 Cisco Nexus 7000 Switch Architecture
BRKRST-3321 Scaling BGP

Wednesday

BRKARC-3472 NX-OS Routing & Layer 3 Switching
GENKEY-1296 KEYNOTE: Unlocking the Value of Innovation
BRKSPG-2904 ASR-9000/IOS-XR hardware Architecture, QOS, EVC, IOS-XR Configuration and Troubleshooting
BRKARC-3453 Nexus 6000 - Architecture of the next-generation Switch for

Thursday

BRKRST-3371 Advances in BGP
BRKOPT-2116 High Speed Data Networks - 40G, 100G & Beyond
BRKSPG-2333 Securing Cisco ASR 9000 Routers
GENKEY-1297 Celebrity Closing Keynote
BRKSPG-2905 ISSU on high-end routers

Sorry state of JunOS control plane protection

2012-10-24T21:47:00.000+03:00

I've been looking into how to protect MX80 11.4R5 from various accidental and intentional attempts to congest control plane and I'm drawing pretty much blank.

Main discoveries so far.

ISIS always leaked to control plane, even when no 'family iso' or 'protocol isis' on interface
PVST always leaked to control plane. Even when just 'family inet' configured to interface
LLDP protocol not matched by ddos-protection feature
Essentially impossible to protect against attack from eBGP
ddos-protection feature mis-dimensioned

ISIS

This is pretty bad for anyone running ISIS, as you cannot use ddos-protection to limit ISIS, as it won't distinguish between bad and good ISIS. If you don't use ISIS, just set ddos-protection limit low and you're good to go.

ISIS is punted with different code than IP packets, but resolving the punt path it goes to the same path. This path is still seeing full wire rate, i.e. there isn't magic 10kpps limit before it

HCFPC2(le_ruuter vty)# show jnh 0 exceptions
control pkt punt via nh PUNT(34) 9134818 1065269880
HCFPC2(le_ruuter vty)# show jnh 0 exceptions nh 34 punt
Nexthop Chain:
CallNH:desc_ptr:0xc02bbc, mode=0, rst_stk=0x0, count=0x3
0xc02bb8 0 : 0x127fffffe00003f0
0xc02bb9 1 : 0x2ffffffe07924a00
0xc02bba 2 : 0xda00601499000a04
0xc02bbb 3 : 0x3af46014fcd08810
HCFPC2(le_ruuter vty)# show jnh 0 decode 0xda00601499000a04
IndexNH:key_ptr:0x80/0, desc_ptr=0xc02932, max=10, nbits=4
HCFPC2(le_ruuter vty)# show jnh 0 vread 0xc02932 4
Addr:0xc02932, Data = 0x42f47fffff8b0010
Addr:0xc02933, Data = 0xda026014b6801004
Addr:0xc02934, Data = 0x60040740000e822f
Addr:0xc02935, Data = 0x60041bc0000e828a
HCFPC2(le_ruuter vty)# show jnh 0 decode 0x60040740000e822f
JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x000080e8 base_ptr = 0x000e822f
HCFPC2(le_ruuter vty)# show jnh 0 decode 0x60041bc0000e828a
JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x00008378 base_ptr = 0x000e828a
HCFPC2(le_ruuter vty)# show filter
Index    Semantic   Name
-------- ---------- ------
46137345 Classic    HOSTBOUND_IPv4_FILTER
46137346 Classic    HOSTBOUND_IPv6_FILTER
HCFPC2(le_ruuter vty)# show filter index 46137345 detail
JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x000080e8 base_ptr = 0x000e822f
HCFPC2(le_ruuter vty)# show filter index 46137346 detail
JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x00008378 base_ptr = 0x000e828a

PVST

If you don't need PVST, you can just limit in ddos-protection. But still it's pretty annoying it's leaked to control-plane, especially as Trio already does support 'punt mask' for LACP, STP, LLDP etc per physical interface, but even if STP punting is turned off, PVST is still punted.

HCFPC2(le_ruuter vty)# show ifd brief
Index Name                 Type        Flags              Slot  State
----- -------------------- ----------- ------------------ ----- ------
190   xe-2/0/6             Ethernet    0x0000000000008000 2     Up
HCFPC2(le_ruuter vty)# show jnh ifd 190 stream
lacp:-, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001

BGP

Problem with protecting against eBGP attack is, that policers work by bps (except DDoS policers, you can actually almost certainly make any policer in PFE bps->pps by changing its application by poking directly at memory, but it would be cleared by next reboot or 'commit full'). And you can only cope with maybe 4Mbps of traffic, so either you accept convergence issues in BGP or you accept that eBGP can bring you down. If you absolutely positively must fix this, then one way to get closer is to police <1400B BGP at very low rate and >1400B BGP at high enough rate for convergence, but you'd need separate policers per BGP so that one BGP neighbor cannot bring another down by killing hello packets.

One quick and dirty fix to protect eBGP from iBGP, but not from other eBGP would be to run all your eBGP as 'passive' and run your route reflectors as 'passive'. Then your PE would open connection to RR and your customers would open connection to your PE, this is already classified to different terms in ddos-protection filter:

HCFPC2(le_ruuter vty)# show filter
Index    Semantic   Name
-------- ---------- ------
46137345 Classic    HOSTBOUND_IPv4_FILTER
HCFPC2(le_ruuter vty)# show filter index 46137345 program
term HOSTBOUND_BGP_TERM1
term priority 0
    payload-protocol 6
    destination-port 179
    then
        accept
        queue 0
        policer template __ddos_BGP_aggregate_policer__
        policer __ddos_BGP_aggregate_policer__-HOSTBOUND_BGP_TERM1
            app_type 23
            bandwidth-limit 34359738360 bits/sec
            burst-size-limit 16777215 bytes
            discard
        count __ddos_BGP_aggregate_pass__
        ddos proto 5120
term HOSTBOUND_BGP_TERM2
term priority 0
    payload-protocol 6
    source-port 179
    then
        accept
        queue 0
        policer template __ddos_BGP_aggregate_policer__
        policer __ddos_BGP_aggregate_policer__-HOSTBOUND_BGP_TERM2
            app_type 23
            bandwidth-limit 34359738360 bits/sec
            burst-size-limit 16777215 bytes
            discard
        count __ddos_BGP_aggregate_pass__
        ddos proto 5120

Now only change needed, would be to put these under different BGP policer, then your customers would be policed separately to your iBGP and attack wouldn't bring core down.

Dimensioning

It's really strange how Juniper has dimensioned their boxes. MX80 goes down on 4Mbps/10kpps flood, while RE CPU (PQ3, 8572) and LC CPU (PQ3, 8544) both are 90% idle during the event, while all ISIS, LDP, BGP remain down until attack stops.

MX960 RP CPU (4xXEON) MPC2 LC CPU (PQ3, 8548) isn't faring significantly better than MX80, if attack and protected service are in same MPC, it cannot handle anywhere near stock ddos-protection 20kpps, but will bring core BGP down. Maybe MX960 can do 15kpps.

T4k RP CPU (2xXEON) FPC5 LC CPU (QorIQ P2020) can actually handle stock ddos-protection 20kpps rate, but not 30kpps, so if you can push two protocols to ddos-protection, it's still going to be down.

There clearly is some per linecard non-configurable policer which limits JunOS control-plane performance to much slower rate than what it realistically can handle, this is as stupid as 'mls rate-limit unicast cef receive' on 7600, essentially you're underclocking your control-plane, making it die under lower than max load. However, if we could control traffic in pps level, it wouldn't matter 5kpps is plenty for BGP convergence. But as we must limit in bps and prepare for worst-case scenario, policer values need to be ridiculously small. Maybe you allow from customer BGP, VRRP, DHCP, PIM, BFD. Now if you want that VRRP flood will only kill VRRP not other service, you need separate policer for each, but on aggregate they can't be over 4Mbps, so you're left with 800kbps per protocol, if you share equally, of course BGP is only capacity hungry protocol.
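To put numbers on the bps versus pps problem, here is a small sketch of the arithmetic. The 4Mbps aggregate, the five protocols and the 5kpps figure come from the text above; the 64B and 1500B packet sizes are my assumptions for best and worst case, and L2 overhead is ignored.

```python
AGGREGATE_BPS = 4_000_000                       # what the box survives, per the text
PROTOCOLS = ["BGP", "VRRP", "DHCP", "PIM", "BFD"]
SMALL_PACKET_BYTES = 64                         # assumed small control packet
LARGE_PACKET_BYTES = 1500                       # assumed full-size BGP update


def equal_share_bps(aggregate_bps, protocols):
    """Split the aggregate budget equally between per-protocol policers."""
    return aggregate_bps // len(protocols)


def pps_admitted(policer_bps, packet_bytes):
    """Packets per second a bps policer lets through at a given packet size."""
    return policer_bps / (packet_bytes * 8)


def bps_needed(target_pps, packet_bytes):
    """Policer rate needed to admit target_pps packets of a given size."""
    return target_pps * packet_bytes * 8


share = equal_share_bps(AGGREGATE_BPS, PROTOCOLS)
print(f"per-protocol policer: {share} bps")
print(f"admits {pps_admitted(share, SMALL_PACKET_BYTES):.0f} pps of small packets")
print(f"admits {pps_admitted(share, LARGE_PACKET_BYTES):.0f} pps of large packets")
print(f"5kpps of large packets needs {bps_needed(5000, LARGE_PACKET_BYTES)} bps")
```

With these assumptions an 800kbps share admits only a few dozen full-size packets per second, while letting 5kpps of full-size BGP through would take 60Mbps, far above the 4Mbps the control plane survives. That gap is why a pps policer would make the problem go away.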

Solution

It's confusing why control plane protection is even user configurable feature, as it can be strictly restricted and pps limited per session dynamically as services are turned on, this is perfectly doable in Trio hardware, no user input needed.

And at very least you should be able to do L2 filters on L3 interfaces, so you could drop everything except IPv4, IPv6, ARP ethertypes and remove most hard to protect attack vectors.

If you need something useful today, put core and edge in different MPC and use ddos-protection feature so that edge cannot congest core. For single linecard system like MX80 there unfortunately isn't any really practical way today.

My Cisco Live 2012 program

2012-04-25T21:44:00.001+03:00

My Cisco Live 2012 program is below. I was able to select most of the sessions, although there was some overlap with one session (something ARC overlapping with another ARC). You might note that the closing keynote is by the guys who play Mythbusters on TV - that'll most likely be a fun filled event! I'll blog about my pre-event feelings if I have the time and interest ;-)

Cisco Live 2012

Personal Schedule

Printed below is your personal schedule.

Sunday
8:00 AM 5:00 PM	TECCOM-2001	Cisco Unified Computing System Technical Day
Monday
8:00 AM 9:30 AM	BRKNMS-2658	Securely Managing Your Networks with SNMPv3
10:00 AM 12:00 PM	BRKSEC-3021	Maximizing Firewall Performance
1:00 PM 3:00 PM	BRKRST-2335	IS-IS Network Design and Deployment
Tuesday
8:00 AM 9:30 AM	BRKRST-2310	Deploying OSPF in a Large-Scale Network
10:00 AM 11:30 AM	GENKEY-4346	Keynote and Welcome Address
12:30 PM 2:30 PM	BRKARC-3470	Cisco Nexus 7000 Switch Architecture
4:00 PM 6:00 PM	BRKSEC-4054	DMVPN Deployment Models
Wednesday
8:00 AM 9:30 AM	BRKSEC-3013	Advanced IPSec with FlexVPN and IKEv2
10:00 AM 11:30 AM	GENKEY-4347	Cisco Technology Keynote
12:30 PM 2:30 PM	BRKSPG-2402	Best Practices to Deploy High-Availability in Service Provider Edge and Aggregation Architectures
4:00 PM 6:00 PM	BRKARC-3471	Cisco NX-OS Software Architecture
Thursday
8:00 AM 9:30 AM	BRKARC-2003	Cisco ASR 9000 Architecture
10:00 AM 11:30 AM	BRKARC-2007	IOS Strategy and Evolution
12:00 PM 1:30 PM	BRKSPG-2404	Energy Efficient Data Center: Best Practices
2:00 PM 3:00 PM	GENKEY-4358	Closing Keynote: An Afternoon with Adam Savage and Jamie Hyneman
3:30 PM 5:30 PM	BRKMPL-3101	Advanced Topics and Future Directions in MPLS


We don't understand hashes

2012-04-06T12:15:00.000+03:00

At least I don't, nor do I understand math. It only dawned on me this week that we consistently choose the wrong hash for password hashing.

When I started using Linux, DES was the standard way to hash your passwd, then it was MD5, and now at least Ubuntu is using SHA. I'd bet that in 2 years' time SHA-3 (which will be selected this year) will be widely used for protecting passwords.

But what were the design goals for MD5 and SHA? The design goals are obviously avoidance of collisions and, more importantly, algorithmic cheapness: low computational requirements and the ability to implement the hash cheaply and easily in hardware, requiring no branching. So MD5 and SHA are _by design_ simple to brute-force in hardware, even the new SHA-3. You don't want your 'git commit' or 'sha3sum /dev/cdrom' to take days, you want it to be very very fast and very very unlikely to represent any other data.

It should be quite obvious that those requirements are at odds with the requirements of a password hash. Avoidance of collisions is not critical to a password hash, and the absolutely opposite requirement of computational expensiveness exists: a password hash needs to be very expensive to compute and poorly implementable in cheap hardware.

This only dawned on me when a co-worker was cracking the password of a DWDM system. As it was DES, we naively assumed it would be cracked in seconds, but it turns out that on CUDA systems DES is hundreds of times slower to crack than MD5. (This may have been an apples-to-oranges comparison of unix DES against raw MD5 instead of unix MD5, but that doesn't change the fact that a hash like bcrypt makes much more sense for passwords than hashes like MD5 or SHA.) It was a really illuminating moment for me. MD5 was designed to be fast (and so will SHA-3 be), which is good for general applications of a hash, but fast to calculate also means fast to brute-force. The only problem is that we're not understanding our application when choosing a hash for protecting passwords: MD5 was never designed to be good for that application. Bcrypt is. I'm not familiar with it and I don't really know how good it is, but at least it's designed to be computationally very expensive, and as machines get faster you can make it slower and slower without changing the implementation; you just give it a different parameter.
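To make the "different parameter" point concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a recommendation: bcrypt is not in the Python standard library, so PBKDF2-HMAC-SHA256 from hashlib stands in as the tunable hash. PBKDF2 shares the adjustable work factor but is not designed to be awkward in hardware the way bcrypt is. The password, the iteration count and the function names are made up for the example.

```python
import hashlib
import hmac
import os
import time


def fast_hash(password: bytes, salt: bytes) -> bytes:
    """One round of a general-purpose hash: cheap by design."""
    return hashlib.sha256(salt + password).digest()


def slow_hash(password: bytes, salt: bytes, iterations: int) -> bytes:
    """PBKDF2-HMAC-SHA256: same primitive, but with a tunable work factor."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


def verify(password: bytes, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the slow hash and compare in constant time."""
    return hmac.compare_digest(slow_hash(password, salt, iterations), expected)


if __name__ == "__main__":
    salt = os.urandom(16)
    candidates = (
        ("single SHA-256", lambda: fast_hash(b"hunter2", salt)),
        ("PBKDF2, 100k iterations", lambda: slow_hash(b"hunter2", salt, 100_000)),
    )
    for label, guess in candidates:
        start = time.perf_counter()
        guess()
        # Every brute-force guess costs the attacker this much work too.
        print(f"{label}: {time.perf_counter() - start:.6f} s per guess")
```

Raising the iteration count is the only change needed as machines get faster; stored hashes just have to record which count they were made with.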

Silver bullet for home QoS

2012-03-31T10:40:00.000+03:00

Rationale

I mentioned in an earlier post how prioritizing small packets upstream is almost the proverbial silver bullet when it comes to QoS at home. I'm sure any ADSL user who uses interactive applications such as SSH has noticed how laggy SSH gets when you upload something from home, say your holiday pictures with scp to your web server. Download is also quite slow during upload. VoIP and online gaming suffer too. The canonical solution is to use DSCP markings at the sender end, or to DSCP mark based on IP address or port.

But I feel that is unnecessarily complex for a typical home use scenario, since all of the important/interactive stuff uses small packets and the bandwidth hogging applications are all essentially sending MTU size packets. I've chosen <200B as a small packet, which is an arbitrary decision I made about a decade ago when setting this up for the first time; I'm sure it could just as well be something like 1300B. So without further rambling, I'll give IOS (ISR) and JunOS (SRX) examples of how to roll this out on your CPE.

IOS example

class-map match-any SMALL-PACKETS
 match packet length max 200
!
policy-map WAN-OUT
 class SMALL-PACKETS
  priority percent 75
 class class-default
  fair-queue
  random-detect
!
interface ATM0.100 point-to-point
 pvc 0/100
  vbr-nrt 2000 2000
  tx-ring-limit 3
  service-policy output WAN-OUT
 !
!

JunOS example

ytti@gw.fi> show configuration interfaces vlan unit 0 family inet filter
input FROM_LAN;

ytti@gw.fi> show configuration firewall family inet filter FROM_LAN
term small_packets {
    from {
        packet-length 0-200;
    }
    then {
        forwarding-class expedited-forwarding;
        next term;
    }
}
term rest {
    then accept;
}

ytti@gw.fi> show configuration class-of-service interfaces at-1/0/0
unit 0 {
    scheduler-map WAN_OUT;
}

ytti@gw.fi> show configuration class-of-service scheduler-maps WAN_OUT
forwarding-class best-effort scheduler BE;
forwarding-class assured-forwarding scheduler AF;
forwarding-class expedited-forwarding scheduler EF;
forwarding-class network-control scheduler NC;

ytti@gw.fi> show configuration class-of-service schedulers
BE {
    transmit-rate percent 5;
}
AF {
    transmit-rate percent 5;
}
EF {
    transmit-rate percent 85;
}
NC {
    transmit-rate percent 5;
}

ytti@gw.fi>

Additional information

You need to tune your vbr-nrt in IOS to match your upstream ATM rate. Modern IOS will automatically scale it down to the real rate if it's too high, so you don't have to worry about it too much. If you have a very slow connection like 256kbps, you might want to set tx-ring-limit to 2. Unfortunately I've not found out how to tune the tx-ring size on JNPR, and it feels a bit too large by default, as IOS is somewhat more responsive during congestion.

You can test if it's working by sending a large file upstream and pinging some other host. See how it looks with and without QoS applied to the egress interface. You should see high delay on ping without QoS and normal delay with QoS. In JunOS you can use 'show interfaces queue X' or 'show interfaces X detail' to confirm that you're seeing drops in best-effort, not in expedited-forwarding. In IOS you can use 'show policy-map interface X output' to confirm you're seeing drops in class-default, not in SMALL-PACKETS.
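The effect this test should show can be estimated with simple arithmetic. The following Python sketch is a rough model under stated assumptions, not a simulation of either router: it assumes a 2 Mbps upstream (as in 'vbr-nrt 2000 2000'), a made-up backlog of 20 MTU-size packets from the upload, and it ignores the packet already on the wire. All function names are illustrative.

```python
LINK_BPS = 2_000_000      # assumed upstream rate, as in "vbr-nrt 2000 2000"
SMALL_PACKET_MAX = 200    # bytes, the same arbitrary threshold as the configs


def serialization_ms(size_bytes, link_bps=LINK_BPS):
    """Time to clock one packet onto the wire, in milliseconds."""
    return size_bytes * 8 / link_bps * 1000


def fifo_delay_ms(queued_sizes, probe_size):
    """Without QoS the probe waits behind everything already queued."""
    return sum(serialization_ms(s) for s in queued_sizes) + serialization_ms(probe_size)


def priority_delay_ms(queued_sizes, probe_size):
    """With small-packet priority, a small probe only waits behind other small packets.

    Simplification: the packet currently being transmitted is ignored, and
    large probes are treated as waiting behind the whole queue.
    """
    if probe_size <= SMALL_PACKET_MAX:
        ahead = [s for s in queued_sizes if s <= SMALL_PACKET_MAX]
    else:
        ahead = queued_sizes
    return sum(serialization_ms(s) for s in ahead) + serialization_ms(probe_size)


if __name__ == "__main__":
    upload = [1500] * 20   # hypothetical scp backlog sitting in the queue
    ping = 100             # small probe arriving behind it
    print(f"FIFO:     {fifo_delay_ms(upload, ping):.1f} ms")
    print(f"priority: {priority_delay_ms(upload, ping):.1f} ms")
```

With these numbers the probe waits roughly 120 ms in a plain FIFO and well under 1 ms with small-packet priority, which is the kind of difference the ping test should make visible.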

There is one particular problem for people using ssh ControlMaster, which multiplexes multiple connections over the same network socket. It's really great, as you only log in to the remote host once and further ssh/scp sessions start without delay and without authentication. It's especially great if you're hopping through multiple intermediate ssh hosts, where it can reduce the delay of opening an ssh session from 4-5s to 100ms. But when it comes to QoS it's quite poor: if you have an interactive ssh session to your server and then use scp to upload data to that same server, you will notice that the interactive ssh is laggy even with QoS. This is of course how it should work. While your CPE will reorder packets to send the interactive (small) packets first, the far end server will not give the out-of-order small packets to userspace, as TCP guarantees ordering, so the far end server keeps those packets jailed until the original (laggy) order is restored. A quick fix is to disable ControlMaster for scp, via 'scp -o ControlPath=none foo bar:'.

LLDP / 802.1AB-2009 blows

2012-03-14T12:52:00.002+02:00

If you're designing an L2 discovery protocol, I suppose one of your mandatory requirements is that you can 'machine walk' the network after you find one box, i.e. you are able to learn your neighbor devices and their ports. LLDP makes no such guarantees.

You have 4 mandatory TLVs, [0123]: End of LLDPDU, Chassis ID, Port ID and TTL. Chassis ID has 7 subtypes which the implementation is free to choose from: entPhysicalAlias (two distinct cases), ifAlias, MAC address, networkAddress, ifName or locally assigned. Port ID also has 7 subtypes which the implementation is free to choose from: ifAlias, entPhysicalAlias, MAC address, networkAddress, ifName, agent circuit ID or locally assigned.

Now you can send whatever trash you like via locally assigned and be a fully compliant implementation. It seems it would be wise to mandate sending the management address (networkAddress) in Chassis ID and the SNMP ifIndex in Port ID (plus any _additional_ ones you may want to send, i.e. more than 1, which is not allowed today). This way you'd immediately know what OID to query and from which node. Obviously this assumes that we always have an IP address and an SNMP implementation. If we absolutely must support some corner cases where this is not true, we should specify different mandatory requirements for devices without networkAddress and an SNMP implementation. Now, because of some corner cases, we can never trust an LLDP implementation to be useful.

A clear sign that LLDP is not actually meeting real-world demands is that Port ID is often locally assigned and populated with the SNMP ifIndex. You just have to know that a given device works like this; there is no way to know it programmatically beforehand.
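To illustrate why a machine walker gets stuck, here is a minimal Python sketch of pulling the mandatory TLVs out of an LLDPDU. It assumes the usual TLV header layout of 7 bits of type and 9 bits of length, and the subtype numbering as I recall it from the standard; the helper names and the sample frame are made up, and the sketch does no validation of malformed TLVs.

```python
import struct

CHASSIS_ID_SUBTYPES = {
    1: "chassis component (entPhysicalAlias)",
    2: "interface alias (ifAlias)",
    3: "port component (entPhysicalAlias)",
    4: "MAC address",
    5: "network address",
    6: "interface name (ifName)",
    7: "locally assigned",
}
PORT_ID_SUBTYPES = {
    1: "interface alias (ifAlias)",
    2: "port component (entPhysicalAlias)",
    3: "MAC address",
    4: "network address",
    5: "interface name (ifName)",
    6: "agent circuit ID",
    7: "locally assigned",
}


def tlv(tlv_type, value):
    """Helper for building sample frames: 7-bit type, 9-bit length, value."""
    return struct.pack("!H", (tlv_type << 9) | len(value)) + value


def parse_tlvs(lldpdu):
    """Yield (type, value) pairs until End of LLDPDU (type 0)."""
    offset = 0
    while offset + 2 <= len(lldpdu):
        (header,) = struct.unpack_from("!H", lldpdu, offset)
        tlv_type, length = header >> 9, header & 0x1FF
        yield tlv_type, lldpdu[offset + 2 : offset + 2 + length]
        offset += 2 + length
        if tlv_type == 0:
            break


def describe_neighbor(lldpdu):
    """Collect the mandatory TLVs; the first value byte of IDs is the subtype."""
    info = {}
    for tlv_type, value in parse_tlvs(lldpdu):
        if tlv_type == 1:
            info["chassis"] = (CHASSIS_ID_SUBTYPES.get(value[0], "reserved"), value[1:])
        elif tlv_type == 2:
            info["port"] = (PORT_ID_SUBTYPES.get(value[0], "reserved"), value[1:])
        elif tlv_type == 3:
            info["ttl"] = struct.unpack("!H", value)[0]
    return info


if __name__ == "__main__":
    frame = (
        tlv(1, b"\x04" + bytes.fromhex("001122334455"))  # chassis: MAC address
        + tlv(2, b"\x07" + b"517")                       # port: locally assigned
        + tlv(3, struct.pack("!H", 120))
        + tlv(0, b"")
    )
    print(describe_neighbor(frame))
```

The port value "517" in the sample might be an ifIndex, or it might be anything else; nothing in the frame says which, and that is exactly the complaint.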

Future residential INET users, I'm so sorry

2012-01-22T18:13:00.001+02:00

I never believed IPv6 would be NAT free, but as an idealist I hoped there was a good chance that there would mostly be only 1:1 NAT and that each and every connection would get its own routable network, a /56 or so: residential DSL, mobile data, everything.

Unfortunately that ship has sailed. It's almost certain that the majority of residential/non-business products will only contain a single directly connected network, since we (as a community, I don't want to put all the blame on IPv6 kooks) failed to produce a feasible technical way to do it and spent too much time arguing about irrelevant matters. I'm reviewing two ways to provide INET access on DSL, with no PPPoX as it's not done in my corner of the world, and showing why it's not practical to provide the end customer a routable network.

Statically configure per customer interface

At the DSLAM (or other access device) each customer would be placed in a unique virtual circuit (Q, QinQ...), and each would be terminated on a unique L3 logical interface in the PE router. The interface would have a static /64 IPv6 address and an IPv6 /56 network routed to, say, ::c in that /64. IPv4 could continue to be a shared subnet via an 'unnumbered' interface.
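As a rough illustration of how mechanical this provisioning is, here is a Python sketch that derives the per-customer numbers from a customer index. The two pools are hypothetical documentation-range prefixes, the PE address ::1 is an arbitrary choice, and only the ::c next hop comes from the description above.

```python
import ipaddress
from itertools import islice

# Hypothetical pools, carved out of the 2001:db8::/32 documentation range.
LINKNETS = ipaddress.ip_network("2001:db8::/48")      # one /64 per customer interface
ROUTED = ipaddress.ip_network("2001:db8:100::/40")    # one /56 per customer LAN


def customer_plan(index):
    """Return the link network addresses and routed /56 for customer number `index`."""
    linknet = next(islice(LINKNETS.subnets(new_prefix=64), index, None))
    routed = next(islice(ROUTED.subnets(new_prefix=56), index, None))
    return {
        "pe_address": f"{linknet[1]}/64",      # static address on the PE logical interface
        "cpe_address": f"{linknet[0xc]}/64",   # the "::c" next hop for the routed /56
        "routed_prefix": str(routed),
    }


if __name__ == "__main__":
    for customer in range(3):
        print(customer, customer_plan(customer))
```

Nothing here depends on the CPE or the DSLAM, which is the appeal of the approach; the cost is one logical interface per customer on the PE.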

This is by far my favorite way of doing residential IPv6: it supports customer owned and provider owned CPE, it supports routing and bridging in the CPE, if the CPE is bridged it doesn't even need to be updated, and it requires no magic features in the DSLAM. IPv6 will just work.

But the reason this does not work in practice is that there isn't a PE router supporting 100k logical interfaces on a single physical port, which would be needed for most non-trivial size operators today. Of course you could always buy more L3 termination points and terminate fewer DSLAMs (or even just one) per PE, but the CAPEX isn't justifiable just to produce IPv6. Maybe through organic network upgrades L3 will move closer to the edge and this will become feasible.

Routing in CPE, DHCP-PD

The customer CPE would use a linknet shared with other customers, and some magic feature in the DSLAM would stop CPEs from reaching each other at L2 (like they do today in IPv4). For the LAN address the customer would use DHCP-PD, and the CPE would distribute the delegated prefix to the LAN interface via SLAAC. This is supported and works today.

But there is no supported way to provide a static route to the CPE. There is draft-ietf-mif-dhcpv6-route-option-03, which would add the needed functionality, but I don't think there is much chance it'll materialize in time, as SPs are already busy defining how their residential V6 is produced.

The problems here, apart from this not being supported, are that a new CPE is needed and that DSLAM vendors need to develop IPv6 specific magic features (ND, RA, DHCP) to work around shared LAN problems.

Conclusion

So all I can say is sorry: again users need to pay for a premium product to get a routable network, and ad-hoc network sharing from a random available 3G mobile or DSL connection still means setting up NAT (or an ND proxy to lie that you're on a connected net when you're not). It's a damn shame, as it's a really trivial technical problem which just seems to be completely ignored.

junos vrf-import funnies

2011-11-04T13:33:00.000+02:00

Consider this configuration:

> show configuration routing-instances VRF1
instance-type vrf;
route-distinguisher 42:1;
vrf-import [ VRF1-IMPORT VRF-DEFAULT-IMPORT ];
vrf-export [ VRF1-EXPORT VRF-DEFAULT-EXPORT ];
vrf-table-label;

> show configuration policy-options policy-statement VRF1-IMPORT
from community [ VRF1 VRF2 ];

> show configuration policy-options policy-statement VRF-DEFAULT-IMPORT
term cust_routes {
    from protocol bgp;
    then default-action accept;
}

> show configuration policy-options community VRF1
members target:42:1;

> show configuration policy-options community VRF2
members target:42:2;

If you configure this on any router in your network, it'll work: the VRF will import the correct and only the correct routes. This will lead you to assume that VRF import in JunOS works like this:

start with empty array of routes to evaluate policy against
when you hit 'match community' push matching routes from bgp.l3vpn.0 to the list
evaluate rules normally against the list

If you create several of these on a single router, and you only have a single 'from community [ X ]' in each, it also works perfectly. However, if you have more than one community in 'from community' AND you have more than one VRF using 'VRF-DEFAULT-IMPORT', things go wrong. Say we have three routes:

10.10.1.0/24 RT:42:1
10.10.2.0/24 RT:42:1 RT:42:2 RT:42:3
10.10.3.0/24 RT:42:1 RT:42:3

VRF1 will correctly import all of these, but it will also leak #2 to other VRFs on the same PE that have 'VRF-DEFAULT-IMPORT'; it won't leak #1 or #3. It's not actually a bug. Rather, the fact that it works at all is a side effect of an optimization that applies when a route hits exactly 1 'show bgp targets' entry, and evaluation is not done the way the results of the simple test might suggest.

no usage scenario for ssh-agent forwarding

2011-11-03T11:41:00.000+02:00

Many people, especially those in the consulting business, need to access the 'jump boxes' of multiple different organizations, from which they can ssh towards the organization's servers. For security it makes sense to have a different ssh key allowed for each organization's servers. For convenience people often allow ssh-agent forwarding towards the 'jump boxes'.

The problem with ssh-agent is that it has no idea who is requesting the key signing. It could very well be organization1's evil admin asking for your organization2 key in order to ssh into the organization2 jump box, and your agent would simply allow this.

One solution to the problem could be that whenever signing is requested, the user gets a prompt 'localhost < organization2-jump < organization2 requests sign of organization1 identity, allow yes/no, [ ] always'. Now you'd have an idea whether the sign request is legit or not. However, this would require protocol changes to ssh, as ssh-agent has no idea who is requesting the signing, much less of the full path, which would be absolutely needed to make this feature work.

So I asked the openssh dev mailing list how this problem should be solved. It turns out there is a recently added feature in openssh which could potentially remove the need for agent forwarding completely: to access organization1-server through organization1-jump you'd run ssh -oProxyCommand='ssh -W %h:%p organization1-jump' organization1-server. Now obviously this is inconvenient, especially if there is more than 1 box through which you need to jump. .ssh/config can help somewhat:

# cat >> ~/.ssh/config
Host org1-ultimate
  ProxyCommand ssh -W %h:%p org1-secondjump
Host org1-secondjump
  ProxyCommand ssh -W %h:%p org1-firstjump
^d

Now you'd run 'ssh org1-ultimate', which would really go org1-firstjump -> org1-secondjump -> org1-ultimate. The ssh key would work without forwarding it, and the transit nodes wouldn't see unencrypted data. However, it still seems like a large overhead. What if there were syntactic sugar to do this:

# cat >> .ssh/config
Host org1-ultimate
  path org1-firstjump, org1-secondjump
^d
# ssh org2-firstjump,org2-secondjump,org2-ultimate
# ssh org1-ultimate
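Until such sugar exists, the stanzas can be generated. The Python sketch below is one way to do it, assuming nothing beyond the ProxyCommand form shown earlier; the function name is made up. (Later OpenSSH releases did add a ProxyJump option and a -J flag that cover much the same ground.)

```python
def proxy_chain_config(hops):
    """Build ssh_config stanzas so that each host is reached through the previous one.

    `hops` is the full path in order, e.g. first jump box, second jump box, target.
    The first hop needs no stanza, as it is reached directly.
    """
    stanzas = []
    for previous, host in zip(hops, hops[1:]):
        stanzas.append(f"Host {host}\n    ProxyCommand ssh -W %h:%p {previous}\n")
    return "\n".join(stanzas)


if __name__ == "__main__":
    print(proxy_chain_config(["org1-firstjump", "org1-secondjump", "org1-ultimate"]))
```

Appending the output to ~/.ssh/config gives the same chain as the hand-written example, one stanza per hop after the first.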

Playing hide and seek with JunOS

2011-10-02T19:14:00.004+03:00

JunOS has some commands which are either unsupported, do not work on the platform you're using, are undocumented, or are unnecessary for the vast majority of operators. These commands are hidden in the UI, so they are only accessible if you know what (and more importantly why) you want (them).

Today I was searching for a way to quiet my SRX210HE-POE, as it makes an annoying lot of noise. I failed to find a configuration way to force it to normal spinning speed, but I did notice that the CLI exposes hidden commands. I actually found the same in IOS several years back and wrote a little perl script to search for them (exec only); it proved a bad idea, as several of them purposely crash your system. If you want to dig deeper, in IOS the tell is the difference between an incomplete and an invalid command response. However, some commands are truly hidden in IOS; a particular example is the toggle for unsupported transceivers.

Neither the JunOS nor the IOS issue is something you can blame the vendor for. The vendor isn't trying to stop you from using these commands, they just want to be very clear that if you use them TAC ain't got your back.

The code is quick 2h hack (running it takes longer, but I'm certain the search/walk can be optimized) and it depends on ssh/telnet library I've done. This library was meant for optimal way to do exec commands, not configuration commands. And best way to do exec commands in JunOS is to open new ssh channels with exec('command') per command, this way you never ever need to do screen scraping for prompt, as when ssh channel closes, command has finished. Unfortunately this approach does not work for config, and I didn't bother disabling forcing this behavior in the library, so right now it only supports telnet (if you really want ssh, hack it to assume remote is 'cisco' then it'll open shell, instead of exec, since IOS does not support multiple channels over existing ssh connection).

When should you advertise default route?

2011-08-15T16:34:00.003+03:00

Never

There are two typical scenarios when people carry default route in dynamic routing protocol, I'll address these separately and explain why you shouldn't do it, and what you should do instead.

CE (eBGP) PE

This is probably the most common scenario: maybe you're giving your customer a default route, maybe it's your own firewall, or really any situation where the neighbor won't carry a full routing table and isn't strictly in the same administrative domain.

The problem with a default route here is that if your PE gets disconnected from the core, you're still originating the default route, the CE is unaware of this, and you're blackholing customer traffic until BGP is manually shut down. You could conditionally advertise the default, but that is just useless overhead. Instead of a default you should advertise to the CE any aggregate route which is originated from multiple core boxes, such as your PA aggregate, or really any stable route originated from multiple places but not from the local PE.

Customer would just add this to their router:

# ios
ip route 0.0.0.0 0.0.0.0 192.0.2.0 name floating_default

# junos
route 0.0.0.0/0 {
    qualified-next-hop 192.0.2.0 {
        interface xe-0/0/0.0;
    }
    resolve;
}

Now if your PE gets disconnected from core, you'll stop originating 192.0.2.0/24 and this ip route no longer will recurse to CE<->PE interface. If there is no more 192.0.2.0/24 route available anywhere, static route is invalid, and next available default route can be used. If there still is 192.0.2.0/24 available via alternative provider that will be automatically used.

A slight cosmetic complaint is that if you add an interface to the static route, IOS disables recursion, so you cannot enforce that the static route disappears when the next hop does not recurse behind that one interface. But it is purely cosmetic, as the functionality remains regardless of whether 192.0.2.0/24 continues to exist or completely disappears. If it continues to exist, the customer just needs to local-pref/med 192.0.2.0/24 to get the expected backup default selection.
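The recursion logic can be sketched in a few lines of Python. This is a toy model under assumptions, not how either OS implements it: the RIB is a plain list, the neighbor names are made up, and the static default is taken to be active only if its next hop resolves through some non-default route.

```python
import ipaddress


def route(prefix, via):
    """Toy RIB entry: a prefix and the neighbor it was learned from."""
    return {"prefix": ipaddress.ip_network(prefix), "via": via}


def resolve(rib, address):
    """Longest-prefix match that ignores default routes, so the static
    default can never resolve through itself or through another default."""
    addr = ipaddress.ip_address(address)
    candidates = [r for r in rib if r["prefix"].prefixlen > 0 and addr in r["prefix"]]
    return max(candidates, key=lambda r: r["prefix"].prefixlen, default=None)


def floating_default(rib, next_hop="192.0.2.0"):
    """Return the route the static default recurses through, or None if it is inactive."""
    return resolve(rib, next_hop)


if __name__ == "__main__":
    healthy = [route("192.0.2.0/24", "PE-A")]
    pe_isolated = []                                   # PE-A stopped announcing the aggregate
    dual_homed = [route("192.0.2.0/24", "PE-B")]       # alternative provider still has it
    for name, rib in (("healthy", healthy), ("isolated", pe_isolated), ("dual", dual_homed)):
        hit = floating_default(rib)
        print(name, "->", hit["via"] if hit else "static default inactive")
```

The point of the design is that the CE needs no conditional logic of its own: the validity of the default follows from whether the aggregate is still being received.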

PE router without full table

The typical solution is to have two RR iBGP peers originate the default route. This has the problem that the RRs probably aren't always in the optimal forwarding path, especially during a single fault, and in many cases never are. So you'd stop iBGP from originating the default, and instead add this to every router that has a full bgp view:

interface Loopback1
 description Anycast default
 ip address 192.0.2.0 255.255.255.255
 no ip redirects
 no ip proxy-arp
!
router isis
 passive-interface Loopback1

Obviously the PE box would just have a static default towards 192.0.2.0. This way the PE always forwards packets towards the nearest core box which is up and has a full bgp table, so you always get best path egress forwarding without having a full bgp view and without having a best path RR. Effectively it is as if every router had an iBGP session to you and was originating a default.

Exception that proves the rule

If the end device does not support recursing routes, then obviously this won't work. And there still are such devices, though it's unclear whether you want to be routing on such devices to begin with.

IPv6 ACL bypass

2011-08-11T15:07:00.008+03:00

The IPv6 designers recognized that the IPv4 header has several faults, and these were addressed to varying degrees. Particularly annoying were IPv4 options, which made the IPv4 header length variable and so caused the TCP/UDP/ICMP data to shift. The IPv6 header is fixed length, and it has a 'next-header' field which tells how to parse the data after the IP header. Typically 'next-header' would be TCP, UDP or ICMP, and the rest of the packet would be exactly like in IPv4 (apart from the mandatory checksum in UDP).

Where the complexity (some might say design fault) lies is that 'next-header' could also be any of a large number of more exotic extension headers, each of which has a 'next-header' field of its own. The standard does not specify any limit on how many headers you can have, so you need to be able to parse the packet up to MTU length. The 'next-header' of the final extension header would typically be TCP/UDP/ICMP, and a normal IPv4 style payload would follow.

Unfortunately no practical router has an MTU wide view of the packet. You have a 64B, 128B or 256B view, after which you are completely unaware of the packet content; it's just bits in memory which you cannot process in any meaningful way. Your PC won't have the same problem, as it does not have specialized hardware to quickly forward a large amount of packets, so your PC will happily parse the packet up to the MTU length.

What this translates to is that you can craft an IPv6 packet where the TCP port information lies beyond the router's view, so the router will not know it is a TCP packet nor what ports it is using, but the receiving PC will understand it normally. So if you have ACL rules where you drop some tcp/udp/icmp packets and then allow the rest, those rules can be bypassed on a very typical router. An example could be:

term my_smtp {
    from {
        destination-address 2001:db8::42/128;
    }
    then accept;
}
term no_spam {
    from {
        next-header tcp;
        destination-port 25;
    }
    then discard;
}
term accept {
    then accept;
}

Now this will be bypassed, because our 'next-header' is not tcp but an extension header. The far end unmodified PC with unmodified software will treat the packet normally, though. Or maybe it is a server where you allow ssh from the management net, drop all other packets to tcp/22 and permit the rest. As long as you permit the rest instead of discarding the rest, the bypass will work.
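The mechanics can be shown with a small Python sketch. It is a model under assumptions, not a description of any real forwarding chip: the parser sees a fixed 128-byte window (one of the view sizes mentioned above), the crafted packet uses 12 minimal destination-option headers, the addresses are documentation addresses, and the TCP header carries no valid checksum. Fragment, AH and ESP headers are left out because they follow different length rules.

```python
import ipaddress
import struct

TCP = 6
HOP_BY_HOP, ROUTING, DEST_OPTS = 0, 43, 60
# Extension headers sharing the "next header, length in 8-byte units" layout.
CHAINABLE = {HOP_BY_HOP, ROUTING, DEST_OPTS}


def build_packet(dest_opt_count, dport):
    """IPv6 header + N minimal destination-option headers + bare TCP header."""
    tcp = struct.pack("!HHIIBBHHH", 40000, dport, 0, 0, 5 << 4, 0x02, 8192, 0, 0)
    ext = b""
    for i in range(dest_opt_count):
        following = DEST_OPTS if i < dest_opt_count - 1 else TCP
        # next header, hdr ext len 0 (8 bytes total), PadN option (type 1, len 4) + 4 zero bytes
        ext += struct.pack("!BBBB4x", following, 0, 1, 4)
    payload = ext + tcp
    first = DEST_OPTS if dest_opt_count else TCP
    src = ipaddress.IPv6Address("2001:db8::1").packed
    dst = ipaddress.IPv6Address("2001:db8::42").packed
    return struct.pack("!IHBB", 6 << 28, len(payload), first, 64) + src + dst + payload


def find_tcp_dport(packet, view):
    """Walk the header chain using only the first `view` bytes of the packet.

    Returns the TCP destination port, or None if TCP is not found in view.
    """
    window = packet[:view]
    next_header, offset = window[6], 40
    while next_header in CHAINABLE:
        if offset + 2 > len(window):
            return None                      # chain runs past what the parser can see
        next_header, ext_len = window[offset], window[offset + 1]
        offset += (ext_len + 1) * 8
    if next_header != TCP or offset + 4 > len(window):
        return None
    return struct.unpack_from("!H", window, offset + 2)[0]


if __name__ == "__main__":
    plain = build_packet(0, 25)
    crafted = build_packet(12, 25)
    print("first next-header:", plain[6], "vs", crafted[6])
    print("128B view, plain:  ", find_tcp_dport(plain, 128))
    print("128B view, crafted:", find_tcp_dport(crafted, 128))
    print("full view, crafted:", find_tcp_dport(crafted, len(crafted)))
```

A 'next-header tcp' match only ever looks at the first field, so it misses the crafted packet outright; the windowed walk misses it because the TCP header starts at byte 136, while the receiving host, with the full packet in hand, still finds port 25.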

How should this be fixed? Well, IPv6 should have modified ICMP/TCP/UDP/etc to contain a 'next-header' field, and mandated that they appear before any extension header, forcing non-extension headers to live in fixed bit positions. Obviously the ship has sailed for this fix. Now it is heavily platform dependent what will happen. cisco.com claims that they punt packets which they fail to parse correctly; this is sane, just be sure to police the punts and you have a pretty good solution. Juniper before Trio is pretty much a lost cause.

Juniper Trio behaves remarkably well, but the CLI is lagging behind. Trio will actually find the TCP/UDP headers as long as there are 29 or fewer 'destination-option' headers before TCP/UDP. If there are 30 'destination-option' headers before TCP/UDP, the packet is dropped in hardware by the 'bad IPv6 options pkt DISC(9)' exception. The problem is that the CLI is unaware of this capability: you don't have 'protocol tcp' to say you want TCP, you only have 'next-header tcp', which only looks at the first next-header field in the IP packet. If you omit 'next-header' and just match 'destination-port', and you have 29 or fewer 'destination-option' headers, JNPR will match correctly; you just lose the ability to differentiate between tcp and udp. This is true for 10.4R4 and 11.2R1.

Trio should be fixed by adding a 'protocol' match to the CLI (Trio already classifies the packet correctly), and the 'bad IPv6 options pkt DISC(9)' exception should punt (via a policer) instead of discarding, so that the RE can parse the packet correctly. You could ask what /realistic/ packet would be dropped by the Trio parser, but I think that is beside the point: the IPv6 standard allows for it, so you should parse it, even if via punt with poor performance.

You can see packets failing trio parser via PFE:

# show jnh 0 exceptions terse
Reason                             Type         Packets      Bytes
==================================================================
Packet Exceptions
----------------------
bad IPv6 options pkt               DISC( 9)    24808567 26495549556

eBGP triggered blackhole for customers

2010-09-28T11:05:00.008+03:00

Very many large scale transit providers, if not most of them, support eBGP remote triggered blackhole only via a separate multihop eBGP session. I suspect this is because they've used a single shared route-map for transit customers for a very long time, and it is not immediately obvious how you can support blackholing without a customer specific route-map. Requiring a customer specific route-map would probably be a less than minor change in their provisioning systems. However, it is perfectly doable and the same idea works just the same in JunOS and IOS. Here is a pseudoIOShy example of how to do it:

router bgp N
 neighbor eBGP peer-group
 neighbor eBGP route-map eBGP-IN in
 neighbor eBGP disable-connected-check
 neighbor CUSTIP peer-group eBGP
 neighbor CUSTIP prefix-list C-CUSTID-IN in
!
route-map eBGP-IN permit 100
 match community BLACKHOLE
 set ip next-hop BLACKHOLE
 set community BLACKHOLE additive
route-map eBGP-IN permit 200
 match ip address prefix-list eBGP-TRANSIT-FULL
 set community full-transit additive
route-map eBGP-IN permit 300
 match ip address prefix-list eBGP-TRANSIT-PARTIAL
 set community partial-transit additive
route-map eBGP-IN permit 400
 match ip address prefix-list eBGP-PUNCHOLE
 set community no-export additive
!
ip prefix-list C-CUSTID-IN permit 192.0.2.0/24 le 32
ip prefix-list C-CUSTID-IN permit 10.10.42.0/28 le 32
ip prefix-list eBGP-TRANSIT-FULL permit 192.0.2.0/24
ip prefix-list eBGP-PUNCHOLE permit 10.10.42.0/28

When you provision a new customer, you simply add the prefix to one customer specific prefix-list and to another shared prefix-list. The shared prefix-list is there just to verify that we are getting the correct mask, not a more specific. The same route-map allows you to support multiple types of transit agreements, and you can even use the same route-map for customers using your PA addresses. When you add a new BGP feature for customers, you update a single route-map and all customers and all products benefit from it.
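The way the per-customer prefix-list and the shared route-map combine can be modeled in a few lines of Python. This is a sketch of the pseudo config above, not of real IOS behavior: it assumes both filters must permit a route, that route-map entry 400 is meant as a match on eBGP-PUNCHOLE, and that the empty eBGP-TRANSIT-PARTIAL list matches nothing. The function names are made up.

```python
import ipaddress

# Per-customer list C-CUSTID-IN: (prefix, longest allowed mask, i.e. "le")
CUSTOMER_IN = [("192.0.2.0/24", 32), ("10.10.42.0/28", 32)]
# Shared lists match the exact prefix only
TRANSIT_FULL = {"192.0.2.0/24"}
TRANSIT_PARTIAL = set()
PUNCHOLE = {"10.10.42.0/28"}


def customer_permits(net):
    """True if the prefix falls inside one of the customer's blocks."""
    return any(
        net.subnet_of(ipaddress.ip_network(allowed)) and net.prefixlen <= le
        for allowed, le in CUSTOMER_IN
    )


def ebgp_in(prefix, communities=()):
    """Return the attributes set on an accepted route, or None if it is denied."""
    net = ipaddress.ip_network(prefix)
    if not customer_permits(net):
        return None
    result = {"next_hop": None, "communities": set(communities)}
    if "BLACKHOLE" in result["communities"]:      # permit 100
        result["next_hop"] = "BLACKHOLE"
    elif str(net) in TRANSIT_FULL:                # permit 200
        result["communities"].add("full-transit")
    elif str(net) in TRANSIT_PARTIAL:             # permit 300
        result["communities"].add("partial-transit")
    elif str(net) in PUNCHOLE:                    # permit 400
        result["communities"].add("no-export")
    else:
        return None                               # implicit deny at the end of the route-map
    return result


if __name__ == "__main__":
    print(ebgp_in("192.0.2.0/24"))                    # normal transit route
    print(ebgp_in("192.0.2.1/32", {"BLACKHOLE"}))     # customer-triggered blackhole
    print(ebgp_in("192.0.2.1/32"))                    # more specific without the community
    print(ebgp_in("198.51.100.0/24", {"BLACKHOLE"}))  # someone else's prefix
```

More specifics are only accepted when they carry the blackhole community, and only from inside the customer's own blocks, which is what lets one shared route-map serve every customer.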

The egress route-map is less interesting, as it is simple and typically contains no features. You most likely have two types of egress route-map: one that matches the global table route community and leaks those routes to the customer, and another which sends just a candidate route. Or, if you are doing it wrong, you're sending an actual default route. This is wrong and should never be done, but it warrants its own post. If you wanted a single egress route-map as well, you could make one egress route-map which sends all global table routes, and for candidate only customers add an egress prefix-list which stops all prefixes except the single candidate route, which you originate in a few crucial core locations.

Consumer hardware vendors, boxes and versions

2010-07-30T22:41:00.006+03:00

Yes, this is actually a rant.

<rant>

I have an Apple Time Capsule which I love and cherish. It's about a year old. No, it isn't the latest model anymore. It claims it is able to talk IPv6... but it doesn't. Sadly, it runs something called version 7.4.2 - that works fine but where IPv6 is broken. To have functioning IPv6 I should have 7.5 or later. That would require me to pay more money for protection to Apple as it seems like it is only available on the Very Latest Time Capsules.

Now, someone explain to me:

what the fuck? over.
why must I buy a new box to get something to work which is supposedly there already
immediate cessation in software updates upon release of incremental hardware update
if you changed the chipset then how hard can it be to make a conditional instead of drop all future upgrades
fail to communicate what's going to work and what's not

Feel free to google "time capsule 7.4.2 ipv6" for more info.

</rant>

Feels much better now. :-)

IPv6 and the enterprise of tomorrow(ish)

2010-07-23T10:31:00.006+03:00

One of the great promises of IPv6 has been to get rid of NAT, no more will IT do RFC1918 and NAPT to single public IP. But how is IPv6 going to accomplish this, what is the magical toggle for it? Let's get disappointed.

Some devices, like Cisco IOS, allow you to configure an IPv6 prefix as a 'macro', so you could say that macro 'ME' is 2001:db8::/32 and everywhere you write an IPv6 address you use macro 'ME' instead. So in theory, when your prefix changes, you simply change the macro. So the great renumbering benefit is the ability to always get the same size network. But of course this was true for IPv4 too: you got the network size you needed. Why isn't this utilized? Because enterprises don't have one Cisco IOS device, they have a plethora of devices from different vendors: firewalls, slb, ips, ids, servers, OSS systems and so forth. You'd still need to go into all of these to change the 'macro', not all devices even have the concept, and quite frankly no enterprise of non-trivial size will even know without months of work _where_ and _what_ needs to be changed for renumbering to be successful. I know industry professionals who've done renumbering costing MEURs in a single company. So in practice IPv6 gives you no benefit in renumbering; renumbering always was easy for a trivial network and always hard for a non-trivial network, and will continue to be so.

So how will enterprises run IPv6? Getting PI/ASN is actually a bit harder, as you MUST be dual-homed, while many enterprises just want PI/ASN to have the ability to change operator. There have also been visibility problems with /48s, but these will be remedied in due time, when even the rest of the people realize there will not be a magical new protocol in IPv6 for address mobility. I'm quite certain that IPv6 will be deployed exactly like IPv4: instead of RFC1918 you will use RFC4193, but with the twist that most companies will find that their PRNG always returns '0', so they will get the beautifully short fd00::/48 block, and a lot of cursing in M&A, as with RFC1918.

Companies will NAT this RFC4193 space to an external, provider provided IP block. No renumbering is needed, and you can still change operator without complex and expensive renumbering. But there is something we will win: we can easily do 1:1 NAT instead of NAPT, which has a great many benefits. No more will you need session logging to comply with legal requirements, and you will also get mostly working end-to-end connectivity, apart from protocols which carry the address in the payload.

For those few who have been bitten by RFC1918 in M&A and who feel that IPv6 address memorability and beauty is less valuable than the many hours of pain RFC1918 causes in M&A, here is a one-liner to generate a random RFC4193 prefix:

# ruby -e 'p(("fd%010x" % rand(2**40)).scan(/.{4}/).join(":")+"::/48")'
"fdbf:49be:e67d::/48"
#

Maybe it should be tuned with RFC2777, with NASDAQ as the seed, so that your client can be sure you didn't give preferential treatment to another client to whom you issued a much more beautiful network. Blaming the stock market can be very satisfying.

IEEE OUI address (MAC address) allocation

2010-07-21T09:13:00.004+03:00

I've recently noticed that it is becoming more and more common to see 'weird' MAC addresses, i.e. MAC addresses which do not start with 00. Previously it was very easy to spot at a glance software defects which caused strange MAC addresses to appear; this has helped me diagnose several issues in the past. We're now beginning to lose that advantage, as IEEE has started to allocate MAC addresses quite randomly across the address space.

I emailed IEEE and asked what the motivation and perceived advantage of this change was, and the reply was quite simply 'We changed our allocation methods to prevent vendors using unregistered mac addresses.' An OUI costs a one-time fee of 1650 USD, but IEEE appears to be concerned that some vendors choose not to pay it and instead allocate themselves an OUI somewhere far out in the address space, effectively thinking they are getting a free OUI with little to no possibility of overlap. It would be curious to know whether a party that wants to save 1650 USD would care about this slightly changed climate. I personally suspect the change, while well-intentioned, is completely ineffective, and the slight operational benefit serial assignment had is lost. (/me starts crying over spilled milk).

On a slightly related note, the regional IXP here is using static, manually assigned MAC addresses from 4000.0X, where X is the number of the IXP site, followed by the ASN in base 10 and then 4 free bits for the user. So AS4242 in site 1 would be 4000.0104.2420. Unfortunately, when these were assigned someone made a mistake with the bit significance order, and this MAC address is not locally administered as was intended but a normal public (universally administered) MAC address. I'm recommending a new scheme of xEzz.yyyy.yyyy, where 'y' is the ASN in hex (supporting 4-byte ASNs), 'z' is customer assigned, 'E' is static and 'x' is the IXP site number. The 'E' could also be 2, 6 or a, but 'E' for exchange is kind of cute.
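A rough Python sketch of the proposed scheme, mostly to show why the second hex digit matters: the locally administered flag is bit 0x02 of the first octet and the group (multicast) flag is bit 0x01, so the second digit must be 2, 6, a or e. The function names are mine and the output is lowercase dotted notation; treat it as an illustration, not as anything the IXP actually runs.

```python
def ixp_mac(site: int, asn: int, customer: int = 0) -> str:
    """Build a MAC in the xEzz.yyyy.yyyy layout: site, static 'e', customer, ASN."""
    if not 0 <= site <= 0xF:
        raise ValueError("site must fit in one hex digit")
    if not 0 <= customer <= 0xFF:
        raise ValueError("customer part must fit in two hex digits")
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN must fit in 32 bits")
    digits = f"{site:x}e{customer:02x}{asn:08x}"
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def first_octet(mac: str) -> int:
    """Return the first octet of a dotted MAC such as 4000.0104.2420."""
    return int(mac.replace(".", "")[:2], 16)


def is_locally_administered(mac: str) -> bool:
    return bool(first_octet(mac) & 0x02)


def is_unicast(mac: str) -> bool:
    return not (first_octet(mac) & 0x01)


new_style = ixp_mac(site=1, asn=4242)
print(new_style)                                   # 1e00.0000.1092
print(is_locally_administered(new_style))          # True
print(is_locally_administered("4000.0104.2420"))   # False, the old scheme
```

The old 4000.0X addresses fail the check because 0x40 has the 0x02 bit clear, which is exactly the bit-order mistake described above.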

Resolving SNMP OID in snmpbulkwalk and tshark

2010-07-09T14:46:00.017+03:00

This isn't exactly esoteric science, but at least for me it has been too hard, so I've rather looked up OIDs manually, which makes little sense in the long run.

NET-SNMP

First of all, you probably want to set the system-wide version and community, so you never need to type them for snmpbulkwalk:

# cat /etc/snmp/snmp.conf
defVersion 2c
defCommunity supersecret
#

Also, I prefer to have vendor MIBs for my user only, as I might want to load a lot of them, which will give a lot of warnings that other users may not want.

# cat ~/.snmp/snmp.conf
mibs +ALL
#

Instead of +ALL you could list named MIBs to load; the ~/.snmp/mibs/ directory is searched for the user's MIBs. You could start by uncompressing ftp://ftp.cisco.com/pub/mibs/v2/v2.tar.gz there. Now you should be able to snmpbulkwalk a router by giving just its name, with no version and no community, and OIDs should be resolved.

WIRESHARK/TSHARK

To me this is more useful than net-snmp, to lab what type of traps a router would send and in what situation. For some weird reason wireshark/tshark doesn't honour net-snmp settings, but needs its own settings.

# cat ~/.wireshark/preferences
name_resolve: mtC
name_resolve_load_smi_modules: TRUE
snmp.display_oid: TRUE
snmp.desegment: TRUE
snmp.var_in_tree: TRUE
# cat ~/.wireshark/smi_paths
"/home/WINRARISU/.snmp/mibs"
# cat ~/.wireshark/smi_modules
"IP-MIB"
"IF-MIB"
"TCP-MIB"
"UDP-MIB"
"SNMPv2-MIB"
"RFC1213-MIB"
"IPV6-ICMP-MIB"
"IPV6-MIB"
"SNMP-COMMUNITY-MIB"
"SNMP-FRAMEWORK-MIB"
"SNMP-MPD-MIB"
"SNMP-NOTIFICATION-MIB"
"SNMP-PROXY-MIB"
"SNMP-TARGET-MIB"
"SNMP-USER-BASED-SM-MIB"
"SNMP-USM-DH-OBJECTS-MIB"
"SNMP-VIEW-BASED-ACM-MIB"
"BGP4-MIB"
"CISCO-BGP4-MIB"
"MPLS-LDP-CAPABILITY"
"MPLS-LDP-GENERIC-STD-MIB"
"MPLS-LDP-STD-MIB"
"CISCO-IETF-ISIS-CAPABILITY"
"CISCO-IETF-ISIS-MIB"
#

Now 'tshark port snmp-trap or port snmp' should happily display resolved names as well. To decide which MIBs to add to smi_modules, you can use 'snmptranslate OID' to translate a given OID to its symbolic name for an easy grep against your MIB dir.
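The grep step could be scripted roughly like this. It is a sketch under simple assumptions: each file in the MIB directory holds one module, the module name is whatever precedes the DEFINITIONS keyword, and a symbol counts as defined when it starts a line followed by a macro such as OBJECT-TYPE. The function name `modules_defining` and the demo MIB are made up; the demo writes a tiny fake MIB into a temporary directory so the snippet runs without any real MIB files.

```python
import re
import tempfile
from pathlib import Path

# "IF-MIB DEFINITIONS ::= BEGIN" -> captures "IF-MIB"
MODULE_NAME = re.compile(r"^\s*([A-Za-z][\w-]*)\s+DEFINITIONS\b", re.MULTILINE)
MACROS = r"(?:OBJECT-TYPE|NOTIFICATION-TYPE|OBJECT-IDENTITY|MODULE-IDENTITY|OBJECT IDENTIFIER)"


def modules_defining(symbol, mib_dir):
    """Return names of MIB modules in mib_dir that appear to define symbol."""
    symbol_re = re.compile(r"^\s*" + re.escape(symbol) + r"\s+" + MACROS, re.MULTILINE)
    found = []
    for path in sorted(Path(mib_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        if symbol_re.search(text):
            match = MODULE_NAME.search(text)
            # Fall back to the file name if no module header was recognised.
            found.append(match.group(1) if match else path.name)
    return found


# Demo with a throwaway MIB; point mib_dir at ~/.snmp/mibs for real use.
with tempfile.TemporaryDirectory() as mib_dir:
    Path(mib_dir, "demo.my").write_text(
        "DEMO-MIB DEFINITIONS ::= BEGIN\n"
        "demoCounter OBJECT-TYPE\n"
        "    SYNTAX Counter32\n"
        "END\n"
    )
    result = modules_defining("demoCounter", mib_dir)

print(result)  # ['DEMO-MIB']
```

The printed module names are what would go, quoted, one per line, into ~/.wireshark/smi_modules.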