2014-02-25

JunOS and ARP Glean

I'm using the Cisco term 'glean' here as I don't know a better word for it. A glean packet is any IPv4 packet destined to a connected host whose address is not yet resolved. It is NOT an ARP packet, so ARP policers won't help you. These packets are punted, since the router needs to generate an ARP request and try to resolve the host.

On the 7600 we can use 'mls rate-limit unicast cef glean 200 50' to limit how many packets per second are punted to the control-plane for glean purposes. How can we limit this in JunOS? As far as I can see, there is no way. But I remember testing this attack and being unable to break an MX80, so why didn't it break?

First let's check what a connected network looks like:

fisakytt@test13nqa1-re0.dk> show route forwarding-table destination 62.236.255.179/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
62.236.255.0/24    intf     0                    rslv   828     1 xe-0/0/0.42

Ok, fair enough. The type is 'rslv', which we can guess means the packet is punted to the control-plane for ARP resolution. Let's rapidly ping an address which does not resolve and check what it looks like:

fisakytt@test13nqa1-re0.dk> show route forwarding-table destination 62.236.255.179/32 table default
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
62.236.255.179/32  dest     0 62.236.255.179     hold   830     1 xe-0/0/0.42

Interesting, we get a host/32 'hold' entry for it, instead of the whole connected network. The hold adjacency is programmed by software, so all packets arriving between the first packet and the hold adjacency being programmed are punted as well. I'm a bit optimistic now that maybe there is some intelligence here. Maybe we'll get more information by checking the index (it changes quite often, so try to catch it at the same time):

MX104-ABB-0(test13nqa1-re0.dk vty)# show nhdb id 830 extensive
   ID      Type      Interface    Next Hop Addr    Protocol         Encap   MTU               Flags  PFE internal Flags
-----  --------  -------------  ---------------  ----------  ------------  ----  ------------------  ------------------
  830      Hold    xe-0/0/0.42                -        IPv4      Ethernet     0  0x0000000000000000  0x0000000000000000

BFD Session Id: 0

Hold Status
  Packet: 0x4499d788
  Timer: Running
  Flags: 0x0000000000000000

Well, not too interesting. I didn't try to figure out what 'Packet' means yet; maybe it's the memory location of the packet which caused the hold, maybe it's timer information. But the curious thing is that it's timed, so we know it'll go away automatically after some time. But how can we figure out what happens to hold and rslv packets? Maybe they are exceptions?

MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh 0 exceptions terse
Reason                        Type           Packets         Bytes
==================================================================

Routing
----------------------
hold route                    DISC(70)          6339        532476
resolve route                 PUNT(33)           175         14700

Yup, the hold routes are discarded. I'm sending 10pps, so we're punting 0.27 packets per second (175 punts over the roughly 650 seconds it takes to send 6514 packets). From that we can extrapolate the hold timer to be about 3.7s, so 1 packet per 3.7s is punted. If I had been sending a lot more packets, I'd see a lot more punts; 10pps is just such a slow rate that the 'hold' adjacency drops nearly all of them. Also, if we have very many addresses this can add up to a good number of packets. I might try to walk through PUNT(33) to see if there is a policer attached to it or not. (There should be, but in certain situations it will be bypassed. This will be fixed, and hopefully at the same time we'll see ddos-policer limiters for resolve packets.)
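The arithmetic can be reproduced with a few lines of Python. This is only a sketch: the counter values are the ones from the 'show jnh 0 exceptions terse' output above, 10pps is the ping rate used in the test, and the function name is made up for illustration. It assumes both counters were at zero when the ping started and that every packet sent was either punted or discarded.

```python
def estimate_hold_timer(discarded, punted, send_rate_pps):
    """Estimate punt rate and hold timer from exception counters.

    Assumes every packet sent was either discarded by the hold
    adjacency or punted by the resolve route, and that both
    counters were zero when the test started.
    """
    total_packets = discarded + punted
    duration_s = total_packets / send_rate_pps
    punts_per_second = punted / duration_s
    seconds_per_punt = duration_s / punted
    return punts_per_second, seconds_per_punt


# Counters from 'hold route DISC(70)' and 'resolve route PUNT(33)'.
rate, timer = estimate_hold_timer(discarded=6339, punted=175, send_rate_pps=10)
print(f"{rate:.2f} punts/s, roughly one punt per {timer:.1f}s")
```

With the numbers above this gives about 0.27 punts per second, or one punt every 3.7 seconds.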

Just to be sure we've interpreted all this correctly, let's check that DISC(70) is seeing our ping packet to 62.236.255.179

MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions 70 discard
MX104-ABB-0(test13nqa1-re0.dk vty)# debug jnh exceptions-trace
MX104-ABB-0(test13nqa1-re0.dk vty)# show jnh exceptions-trace
[430247] jnh_exception_packet_trace: ###############
[430248] jnh_exception_packet_trace: [iif:338,code/info:198/0x273,score:(0x0),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:102]
[430249] jnh_exception_packet_trace: 0x00: 20 00 c6 00 27 30 01 52 00 12 00 66 80 00 00 20
[430250] jnh_exception_packet_trace: 0x10: 12 00 00 12 00 10 00 00 00 3c 8a b0 8d a6 4a 28
[430251] jnh_exception_packet_trace: 0x20: c0 da 07 c1 c8 88 47 00 00 0d 3e 45 c0 00 54 06
[430252] jnh_exception_packet_trace: 0x30: 62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3 08
[430253] jnh_exception_packet_trace: 0x40: 00 99 6b 3a c5 16 07 53 0c c7 ed 00 00 07 cb 08
[430254] jnh_exception_packet_trace: 0x50: 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18
[430255] jnh_exception_packet_trace: 0x60: 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28
[430256] jnh_exception_packet_trace: 0x70: 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37

3e ec ff b3 => 62.236.255.179, yup that's our ping packet being discarded due to 'hold' adjacency
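Reading addresses out of a hex dump by eye gets old quickly, so here is a small Python sketch that decodes the IPv4 header from the trace. The 20 header bytes were picked out by hand (they start at '45 c0', right after the MPLS ethertype 88 47 and the 4-byte label); the function names are mine and nothing here is a Juniper tool.

```python
import ipaddress
import struct

# The 20 bytes starting at '45 c0' in the trace above, i.e. the IPv4
# header that follows the MPLS label. The offset was found by eye.
HEADER_HEX = "45 c0 00 54 06 62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3"


def parse_ipv4_header(raw):
    """Unpack the fixed 20-byte part of an IPv4 header."""
    (ver_ihl, tos, length, ident, frag, ttl, proto,
     checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "ihl": ver_ihl & 0x0F,
        "length": length,
        "ttl": ttl,
        "protocol": proto,
        "checksum": checksum,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


def checksum_ok(raw):
    """True if the one's complement sum of the header words is 0xffff."""
    total = sum(struct.unpack("!10H", raw[:20]))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


raw = bytes.fromhex(HEADER_HEX)
header = parse_ipv4_header(raw)
print(header["src"], "->", header["dst"], "proto", header["protocol"])
print("checksum ok:", checksum_ok(raw))
```

The destination decodes to 62.236.255.179 and the protocol to 1 (ICMP), which matches the ping. The IP length of 84 plus the 18 bytes of ethernet and MPLS in front of it also agrees with 'len:102' in the trace.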

2014-02-24

JunOS 'L3 incompletes', what and why?

There is quite often chatter about L3 incompletes, and it seems there are a lot of opinions about what they are. Maybe some of these opinions are based on some particular counter bug in some release. Juniper has also introduced a toggle to stop the counter from working. It seems very silly to use this toggle, as the counter is really one of the few ways you can gather information about broken packets via SNMP.

What they (at least) are not

  • Unknown unicast
  • CDP
  • BPDU
  • Packet from connected host which does not ARP
  • Packet from unconfigured VLAN

What they (at least) are

  • IP header checksum error
  • IP header error (impossibly small IHL, IP version 3, etc)
  • IP header size does not match packet size

Troubleshooting

So if you are seeing them, what can you do? As it is an aggregate counter for many different issues, how do you actually know which one it is, and is there a way to figure out who is sending them? Luckily for Trio-based platforms the answers are highly encouraging; we have very good tools to troubleshoot the issue.

To figure out exactly what they are, first you need to find your internal IFD index (not the SNMP ifIndex):

im@ruuter> show interfaces xe-7/0/0 |match index:
  Interface index: 224, SNMP ifIndex: 586

After figuring out the index, we can log in to the PFE and check the stream counters for that IFD:

im@ruuter> start shell pfe network fpc7

NPC platform (1067Mhz MPC 8548 processor, 2048MB memory, 512KB flash)

NPC7(ruuter vty)# show jnh ifd 224 stream
ifd = 224, Stream = 33

Stream ID: 33 (inst = 0)
    Cntr : 0x00c0f102
    Encap : Ether
    Encap = 0, StartNH = 0xc040e1
    lacp:+, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, dc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001
    Input Statistics:
        0003126353191368 pkts, 3351074223070319 bytes
    Detail Statistics:
        rx0: 0000000000000000 pkts, 0000000000000000 bytes
        rx1: 0000007792865413 pkts, 0000923636240746 bytes
        rx2: 0003118560325955 pkts, 3350150586829573 bytes
        drop0: 0000000000000000 pkts, 0000000000000000 bytes
        drop1: 0000000000000000 pkts, 0000000000000000 bytes
        drop2: 0000000000000000 pkts, 0000000000000000 bytes
        unknown-iif: 0000000000000000 pkts, 0000000000000000 bytes
        checksum: 0000000000625225 pkts, 0000000268883747 bytes
        unknown-proto: 0000000000024793 pkts, 0000000006398918 bytes
        bad-ucastmac: 0000000218713670 pkts, 0000034352327467 bytes
        bad-ucastmac-IPv6: 0000000002160892 pkts, 0000000172764339 bytes
        bad-smac: 0000000000000000 pkts, 0000000000000000 bytes
        in-stp: 0000000000000000 pkts, 0000000000000000 bytes
        out-stp: 0000000000000000 pkts, 0000000000000000 bytes
        vlan-check: 0000000000000000 pkts, 0000000000000000 bytes
        frame-errors: 0000000000000108 pkts, 0000000000014451 bytes
        bad-IPv4-hdr: 0000000000033339 pkts, 0000000012708126 bytes
        bad-IPv4-len: 0000000000070901 pkts, 0000000025836710 bytes
        bad-IPv6-hdr: 0000000000000133 pkts, 0000000000009508 bytes
        bad-IPv6-len: 0000000000000993 pkts, 0000000000071269 bytes
        out-mtu-errors: 0000000000003391 pkts, 0000000005122005 bytes
        L4-len: 0000000000038084 pkts, 0000000001765247 bytes

Stream Features:
    Topology: stream-(33)
        Flavor: i-root (1), Refcount 0, Flags 0x1
        Addr: 0x4513f3c8, Next: 0x4fdd3c78, Context 0x4513f3c0
        Link 0: da40602e:32000303, Offset 12, Next: da40602e:32000303
        Link 1: 00000000:00000000, Offset 12, Next: 00000000:00000000
        Link 2: 00000000:00000000, Offset 12, Next: 00000000:00000000
        Link 3: 00000000:00000000, Offset 12, Next: 00000000:00000000
    Topology Neighbors:
        [none]-> stream-(33)-> flist-master(stream)
    Feature List: stream
        [pfe-0]: 0xda40602e32000303;
        f_mask:0x80000000000000; c_mask:0x8000000000000000; f_num:9; c_num:1, inst:0
        Idx#8 iif-lookup:
            [pfe-0]: 0xda40602e32000303

Here we can see 'checksum', 'bad-IPvX-hdr' and 'bad-IPvX-len'; at least all of these count as 'L3 incompletes'. There may be other reasons, but that's the absolute minimum. We can also see aggregate counters for all the interfaces on a given Trio; we'll need some of this information later:

NPC7(ruuter vty)# show jnh 0 exceptions terse
Reason                        Type           Packets         Bytes
==================================================================

PFE State Invalid
----------------------
sw error                      DISC(64)     197636729   13174899216
invalid fabric token          DISC(75)            68          4311
unknown family                DISC(73)         24793       6398918
iif down                      DISC(87)          4516        337076
egress pfe unspecified        DISC(19)       5857595    1900968530

Packet Exceptions
----------------------
bad ipv4 hdr checksum         DISC( 2)        660667     289608849
bad IPv6 options pkt          DISC( 9)             3           216
bad IPv4 hdr                  DISC(11)         33339      12708126
bad IPv6 hdr                  DISC(56)           133          9508
bad IPv4 pkt len              DISC(12)        108203      33978274
bad IPv6 pkt len              DISC(57)          1009         72421
L4 len too short              DISC(13)        143678       6622571
frag needed but DF set        DISC(22)         21915      33137575
ttl expired                   PUNT( 1)      51770371    3124910479
IP options                    PUNT( 2)           777        108006
frame format error            DISC( 0)           108         14451
my-mac check failed           DISC(28)     218721556   34352693971
my-mac check failed IPv6      DISC(58)       2161073     172779483
DDOS policer violation notifs PUNT(15)       2438770     326360620

Firewall
----------------------
firewall discard              DISC(67)    1284437202  544450843517
firewall discard V6           DISC(101)     34130853    4143870718

Routing
----------------------
discard route                 DISC(66)    3740477632  552767152133
discard route IPv6            DISC(102)   3894436247  281512808097
hold route                    DISC(70)           471         35151
resolve route                 PUNT(33)            10           776
resolve route V6              PUNT(69)           818         63097
control pkt punt via nh       PUNT(34)     993912636   45817932600
host route                    PUNT(32)     228855708   19337408523
mcast host copy               PUNT( 6)          2591        422909
reject route                  PUNT(40)       2855554     402183663
reject route V6               PUNT(68)          9277       1901206

The counters for L3 incompletes are 'bad ipv4 hdr checksum', 'bad IPvX hdr' and 'bad IPvX pkt len'. Notice how there is no IPv6 header checksum counter: IPv6 simply does not have a header checksum, because it was deemed unnecessary. We'll shortly see this may have been a bad decision.
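If you collect this output regularly, pulling the relevant counters out of it is easy to script. Below is a minimal Python sketch that parses a few lines copied from the exceptions output and sums the reasons named above. The regular expression, the function names and the choice of which reasons to add up are my assumptions, not anything Juniper provides, and the sum is per Trio, not per interface.

```python
import re

# A few lines copied from the 'show jnh 0 exceptions terse' output above.
SAMPLE = """\
bad ipv4 hdr checksum DISC( 2) 660667 289608849
bad IPv6 options pkt DISC( 9) 3 216
bad IPv4 hdr DISC(11) 33339 12708126
bad IPv6 hdr DISC(56) 133 9508
bad IPv4 pkt len DISC(12) 108203 33978274
bad IPv6 pkt len DISC(57) 1009 72421
L4 len too short DISC(13) 143678 6622571
ttl expired PUNT( 1) 51770371 3124910479
"""

# Reasons treated as 'L3 incompletes' here; an assumption based on the text.
L3_INCOMPLETE = {
    "bad ipv4 hdr checksum", "bad ipv4 hdr", "bad ipv6 hdr",
    "bad ipv4 pkt len", "bad ipv6 pkt len",
}

# reason, DISC or PUNT, exception code, packets, bytes
LINE_RE = re.compile(r"^(.+?)\s+(DISC|PUNT)\(\s*(\d+)\)\s+(\d+)\s+(\d+)$")


def parse_exceptions(text):
    """Return {reason: (type, code, packets, bytes)} for each counter line."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            reason, kind, code, packets, nbytes = match.groups()
            counters[reason.lower()] = (kind, int(code), int(packets), int(nbytes))
    return counters


def l3_incomplete_packets(counters):
    """Sum the packet counts of the reasons listed in L3_INCOMPLETE."""
    return sum(v[2] for k, v in counters.items() if k in L3_INCOMPLETE)


counters = parse_exceptions(SAMPLE)
print("L3 incomplete candidates:", l3_incomplete_packets(counters))
```

The exception code kept in the parsed tuple (2 for the IPv4 header checksum, for example) is the number needed when capturing exception packets on the PFE.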

Now why could we possibly see L3 incompletes increasing? If a frame is mangled on the wire, the ethernet CRC fails (it is a much stronger check than the IP checksum) and we never even check any of these; we drop the frame much earlier. So clearly we received a packet which had a correct ethernet CRC and yet was broken.

We recently had an issue where pretty much all egress PE boxes started logging 'L3 incompletes' because the IPv4 header checksum was failing. They incremented maybe 20 times per hour, so very moderately. But how is this possible? If someone generated a broken IP packet and sent it to us, we'd drop it at the ingress PE box and increment these counters there, yet the packet traversed the MPLS core all the way to the egress PE. So clearly we were mangling them ourselves. Obviously the core is like Jon Snow: it's just an MPLS frame to it, it does not need to know it's IP nor should it verify it for correctness, so the core will happily pass broken packets around.

It seems like a complex problem to figure out who is mangling the packets. Luckily Trio gives us the ability to capture exception packets; here we need to use the exception number we saw above in the exception counters:

NPC7(ruuter vty)# debug jnh exceptions 2 discard
NPC7(ruuter vty)# debug jnh exceptions-trace
NPC7(ruuter vty)# show jnh exceptions-trace
[1768975] jnh_exception_packet_trace: ###############
[1768976] jnh_exception_packet_trace: [iif:344,code/info:130/0x0,score:tcp|(0x40),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:60]
[1768977] jnh_exception_packet_trace: 0x00: 20 40 82 00 00 00 01 58 00 12 00 3c 80 00 00 20
[1768978] jnh_exception_packet_trace: 0x10: 12 00 00 3c 00 00 00 00 00 28 c0 da 07 c0 00 00
[1768979] jnh_exception_packet_trace: 0x20: 12 1e d5 97 f8 88 47 00 00 03 3d 45 00 00 28 1e
[1768980] jnh_exception_packet_trace: 0x30: 4f 40 00 87 06 44 d4 XX XX 39 58 XX XX 3f 6a d0
[1768981] jnh_exception_packet_trace: 0x40: c0 e2 82 4f 0d a2 2d cc ec aa 6b fd 78 0f 10 22
[1768982] jnh_exception_packet_trace: 0x50: 60 cd 42 00 00

It should be a lot simpler for us to troubleshoot the issue now: we just figure out where that SADDR (XX XX 39 58) is entering the network. With any luck the mangling node is somewhere at the edge of the network and we will find some common theme in the source addresses. If it's in the core, you're pretty much out of luck; you'll replace the whole network or accept that you mangle something.

I wrote a little script which you can give the IP header, the incorrect checksum and the correct checksum, and it'll show you what each 16b field would have to be to result in a correct checksum. With luck you can use it to figure out which part of the packet is being mangled, as some of those 16b fields are bound to need impossible/invalid values to produce a correct checksum, which allows you to exclude them and concentrate your efforts on the rest of the fields. But a field like ID can be anything, so it is impossible to exclude. SADDR is also problematic, DADDR usually not (if it had been the other value, the packet would not have been routed to my network).
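The idea behind such a script can be sketched in Python as follows. This is not the script mentioned above, just my own minimal reconstruction of the technique: it assumes a 20-byte header without options and that exactly one 16-bit word was mangled, and the names, the sample header and the simulated corruption are made up for illustration.

```python
import struct

FIELDS = ["ver/ihl/tos", "length", "id", "flags/frag", "ttl/proto",
          "checksum", "src hi", "src lo", "dst hi", "dst lo"]


def fold(total):
    """Fold carries back in, as one's complement addition requires."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def candidate_words(header):
    """For each 16-bit word, the value that would make the checksum valid.

    Assumes exactly one 16-bit word of a 20-byte header was mangled.
    """
    words = struct.unpack("!10H", header[:20])
    total = sum(words)
    return [0xFFFF - fold(total - word) for word in words]


# An example header with a valid checksum.
good = bytes.fromhex("45 c0 00 54 06 62 00 00 40 01 84 6d c1 58 ef 21 3e ec ff b3")

# Simulate mangling: flip one bit in the TTL byte (0x40 becomes 0x41).
bad = bytearray(good)
bad[8] ^= 0x01
received = struct.unpack("!10H", bytes(bad))
candidates = candidate_words(bytes(bad))

for name, have, want in zip(FIELDS, received, candidates):
    print(f"{name:<12} received 0x{have:04x}  would need 0x{want:04x}")
```

In this made-up case the first word would need to be 0x44c0, i.e. an IHL of 4, which is impossible, so that word can be excluded. The ttl/proto word would need 0x4001, a perfectly plausible TTL of 64 with protocol 1, which makes it a good suspect. The 'checksum' row is simply the correct checksum for the header as received.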

Why does this happen? How often does it happen? I really would like to know. Obviously, as the CRC is correct, it's not happening because of errors in links/optics/etc. My guess is that this happens mostly because of bad memory in the forwarding logic. A packet can touch many memories on its path, ring => sram => dram => sram => ring; do all of these have ECC? Does the PHY ring ever have ECC? I don't know, but clearly there can be problems, as L3 incompletes exist. How common are they? Probably a lot more common than we think, as we only know about mangling when it happens to hit the IPv4 header (IPv6 header mangling would usually pass unnoticed). The IPv4 header is 20B and a typical packet length is 1500B, so are we only seeing the 1.3% tip of the iceberg?

I guess the key takeaway here is: don't use 'ignore-l3-incompletes', monitor your 'L3 incompletes' via SNMP, figure out why they are happening and fix them. Especially check right now if your egress PE has L3 incompletes coming from the core.
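The 1.3% figure is just the header's share of the packet's bytes. A tiny Python sketch makes it easy to try other packet sizes; it assumes mangling hits every byte of the packet with equal probability, which is my simplification, and the function name is made up.

```python
def header_share(header_bytes=20, packet_bytes=1500):
    """Fraction of a packet's bytes covered by the IPv4 header checksum."""
    return header_bytes / packet_bytes


for size in (64, 576, 1500):
    print(f"{size:>5}B packet: {header_share(20, size):.1%} of bytes covered")
```

Smaller packets are of course better covered, but for full-size packets almost all of the bytes fall outside the IPv4 header checksum.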

2014-02-21

Why you should want metered INET?

When people think about metered billing, they may think about mobile roaming or the old, outrageous per-minute PSTN billing. Those are not fair prices, and they are not what I'm talking about.

Also, INET should be always on, and billing should take this into consideration. Maybe once you exceed your paid capacity, your connection is policed to 256kbps unless you pay for more. You could get a notice by SMS and email when you are nearing this limit.

Flat-rate billing is based on the assumption that on average INET is not used much at all; in such a scenario it works. Consumers get flat-rate stove gas in Helsinki, because its use is almost non-existent. But relatively new services like Youtube and Netflix can alone be 2/3 of all your traffic, meaning whatever average use you planned for no longer holds; average use keeps increasing as more services that users care for appear.


1. Quality

When you pay a flat rate there is a financial incentive for your operator not to provide you bits; every bit not provided improves its margins. Operators today regularly keep some ports congested, because it would be expensive to upgrade; instead they try to get someone else to pay for it, if they have the leverage.

If consumers pay for bits, then delivering bits means more money and Internet quality becomes market driven; everyone has good Internet.

2. Equality

I may stream HD content the whole day just to have background noise, while my retired mother might check email 3 times per week, yet we pay the same. Clearly my mother is subsidizing my use. Considering her minuscule pension is 1/5th of my still moderate wage, it does not seem very fair at all.

Should I pay the same for electricity, petrol or groceries as everyone else? In a post-scarcity society we wouldn't pay for anything, but unfortunately we're not there yet.

http://www.telecompetitor.com/jdsu-less-than-1-of-users-comprise-over-50-of-lte-usage/

3. Free market

Eyeball providers know that providing bits is bad for business, so if you are large enough, like FT or Comcast, you can extort money from content owners by keeping your transit ports congested. This causes consumers to complain to content shops and forces content shops to either lose the business of these consumers or pay FT/Comcast/etc a premium rate to access their network. This premium rate isn't anything to sneeze at; it can be four times the cost you pay for IP transit.

So essentially consumers are paying part of their access indirectly through content shops.

In addition to the clear inefficiencies that indirect billing causes, this is also highly anti-competitive. If you are a small provider you lack the leverage to extort content owners, making it much harder for you to compete with the market leaders; over time this creates de facto monopolies.

http://www.lightreading.com/cable-video/ott/cogent-gearing-for-another-peering-battle/d/d-id/707831?

What might happen if some eyeball operator changed to fair per-use billing overnight? I'll assume a very naive model, as I'm not ready to give this sufficient thought. Let's assume they create a scheme which brings in exactly the same amount of money from consumers as they get today: 1/3 pay more, 1/3 pay the same and 1/3 pay less.

I would guess the obvious outcome is that the 1/3 who pay more, the heavy users, switch to an alternative provider if one exists in their market. This removes a huge amount of traffic from the network, so essentially the provider can save all CAPEX spending on the network.

What would happen to the networks that got these 1/3 heavy users? Their business case would get radically poorer, as they didn't get enough regular users to pay for those heavy users, so they would struggle to deliver the bits, leading to a congested, poorly performing network.

Heavy users who want quality would have to switch back to a network which financially can and will deliver the service they want, and the other operators, now struggling with broken fundamentals, would need to implement the same billing scheme.