2014-02-24

JunOS 'L3 incompletes', what and why?

There is quite often chatter about L3 incompletes, and there seem to be a lot of opinions about what they are. Maybe some of these opinions are based on a particular counter bug in some release. Juniper has also introduced a toggle ('ignore-l3-incompletes') that stops the counter from working. Using this toggle seems very silly, as the counter is one of the few ways you can gather information about broken packets via SNMP.

What they (at least) are not

  • Unknown unicast
  • CDP
  • BPDU
  • Packet from connected host which does not ARP
  • Packet from unconfigured VLAN

What they (at least) are

  • IP header checksum error
  • IP header error (impossibly small IHL, IP version 3, etc)
  • IP header size does not match packet size
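
The three conditions above can be sketched in a few lines of Python. This is only an illustration of the kinds of checks involved, not a description of what Trio actually does internally; the function names and the exact order of checks are my own assumptions.

```python
def ones_complement_sum(data):
    """16-bit ones' complement sum over a byte string of even length."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def ipv4_header_problems(packet):
    """Return a list of reasons why `packet` (bytes starting at the IPv4
    header) would plausibly be counted as an L3 incomplete."""
    if len(packet) < 20:
        return ["packet shorter than the minimum 20-byte IPv4 header"]
    problems = []
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    if version != 4:
        problems.append("bad header: IP version %d" % version)
    if ihl < 5:
        problems.append("bad header: IHL %d is below the minimum of 5" % ihl)
        return problems
    header_len = ihl * 4
    if header_len > len(packet):
        problems.append("bad length: header longer than the packet")
        return problems
    total_length = (packet[2] << 8) | packet[3]
    # A total length smaller than the received data is fine (Ethernet padding);
    # only a length that cannot fit, or is smaller than the header, is broken.
    if total_length < header_len or total_length > len(packet):
        problems.append("bad length: total length %d, received %d bytes"
                        % (total_length, len(packet)))
    # A valid header sums to 0xFFFF when the checksum field is included.
    if ones_complement_sum(packet[:header_len]) != 0xFFFF:
        problems.append("bad header checksum")
    return problems


# Well-known valid example header (total length 115), padded with a dummy payload.
GOOD = bytes.fromhex("450000730000400040" "11b861c0a80001c0a800c7") + bytes(95)
print(ipv4_header_problems(GOOD))
print(ipv4_header_problems(bytes([0x35]) + GOOD[1:]))
```

Note that changing the version nibble also breaks the checksum, so a single mangled byte can trip more than one check.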

Troubleshooting

So if you are seeing them, what can you do? As it is an aggregate counter for many different issues, how do you know which one it is, and is there a way to figure out who is sending them? Luckily, for Trio-based platforms the answers are highly encouraging: we have very good tools to troubleshoot the issue.

To figure out exactly what they are, you first need to find the internal IFD index (not the SNMP ifIndex):

im@ruuter> show interfaces xe-7/0/0 | match index:
  Interface index: 224, SNMP ifIndex: 586

After figuring out the index, we can log in to the PFE and check the stream counters for that IFD:

im@ruuter> start shell pfe network fpc7

NPC platform (1067Mhz MPC 8548 processor, 2048MB memory, 512KB flash)

NPC7(ruuter vty)# show jnh ifd 224 stream
ifd = 224, Stream = 33

Stream ID: 33 (inst = 0)
  Cntr : 0x00c0f102
  Encap : Ether
  Encap = 0, StartNH = 0xc040e1
  lacp:+, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, dc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001
  Input Statistics:
    0003126353191368 pkts, 3351074223070319 bytes
  Detail Statistics:
    rx0:               0000000000000000 pkts, 0000000000000000 bytes
    rx1:               0000007792865413 pkts, 0000923636240746 bytes
    rx2:               0003118560325955 pkts, 3350150586829573 bytes
    drop0:             0000000000000000 pkts, 0000000000000000 bytes
    drop1:             0000000000000000 pkts, 0000000000000000 bytes
    drop2:             0000000000000000 pkts, 0000000000000000 bytes
    unknown-iif:       0000000000000000 pkts, 0000000000000000 bytes
    checksum:          0000000000625225 pkts, 0000000268883747 bytes
    unknown-proto:     0000000000024793 pkts, 0000000006398918 bytes
    bad-ucastmac:      0000000218713670 pkts, 0000034352327467 bytes
    bad-ucastmac-IPv6: 0000000002160892 pkts, 0000000172764339 bytes
    bad-smac:          0000000000000000 pkts, 0000000000000000 bytes
    in-stp:            0000000000000000 pkts, 0000000000000000 bytes
    out-stp:           0000000000000000 pkts, 0000000000000000 bytes
    vlan-check:        0000000000000000 pkts, 0000000000000000 bytes
    frame-errors:      0000000000000108 pkts, 0000000000014451 bytes
    bad-IPv4-hdr:      0000000000033339 pkts, 0000000012708126 bytes
    bad-IPv4-len:      0000000000070901 pkts, 0000000025836710 bytes
    bad-IPv6-hdr:      0000000000000133 pkts, 0000000000009508 bytes
    bad-IPv6-len:      0000000000000993 pkts, 0000000000071269 bytes
    out-mtu-errors:    0000000000003391 pkts, 0000000005122005 bytes
    L4-len:            0000000000038084 pkts, 0000000001765247 bytes

Stream Features:
Topology: stream-(33)
  Flavor: i-root (1), Refcount 0, Flags 0x1
  Addr: 0x4513f3c8, Next: 0x4fdd3c78, Context 0x4513f3c0
  Link 0: da40602e:32000303, Offset 12, Next: da40602e:32000303
  Link 1: 00000000:00000000, Offset 12, Next: 00000000:00000000
  Link 2: 00000000:00000000, Offset 12, Next: 00000000:00000000
  Link 3: 00000000:00000000, Offset 12, Next: 00000000:00000000

Topology Neighbors:
  [none]-> stream-(33)-> flist-master(stream)
Feature List: stream
  [pfe-0]: 0xda40602e32000303;
  f_mask:0x80000000000000; c_mask:0x8000000000000000; f_num:9; c_num:1, inst:0
  Idx#8 iif-lookup:
    [pfe-0]: 0xda40602e32000303

Here we can see 'checksum', 'bad-IPvX-hdr' and 'bad-IPvX-len'. At least all of these are 'L3 incompletes'; there may be other reasons, but that is the absolute minimum. We can also see aggregate counters for all the interfaces on a given Trio; we'll need some of this information later:

NPC7(ruuter vty)# show jnh 0 exceptions terse

Reason                             Type         Packets          Bytes
==================================================================

PFE State Invalid
----------------------
sw error                           DISC(64)   197636729    13174899216
invalid fabric token               DISC(75)          68           4311
unknown family                     DISC(73)       24793        6398918
iif down                           DISC(87)        4516         337076
egress pfe unspecified             DISC(19)     5857595     1900968530

Packet Exceptions
----------------------
bad ipv4 hdr checksum              DISC( 2)      660667      289608849
bad IPv6 options pkt               DISC( 9)           3            216
bad IPv4 hdr                       DISC(11)       33339       12708126
bad IPv6 hdr                       DISC(56)         133           9508
bad IPv4 pkt len                   DISC(12)      108203       33978274
bad IPv6 pkt len                   DISC(57)        1009          72421
L4 len too short                   DISC(13)      143678        6622571
frag needed but DF set             DISC(22)       21915       33137575
ttl expired                        PUNT( 1)    51770371     3124910479
IP options                         PUNT( 2)         777         108006
frame format error                 DISC( 0)         108          14451
my-mac check failed                DISC(28)   218721556    34352693971
my-mac check failed IPv6           DISC(58)     2161073      172779483
DDOS policer violation notifs      PUNT(15)     2438770      326360620

Firewall
----------------------
firewall discard                   DISC(67)  1284437202   544450843517
firewall discard V6                DISC(101)   34130853     4143870718

Routing
----------------------
discard route                      DISC(66)  3740477632   552767152133
discard route IPv6                 DISC(102) 3894436247   281512808097
hold route                         DISC(70)         471          35151
resolve route                      PUNT(33)          10            776
resolve route V6                   PUNT(69)         818          63097
control pkt punt via nh            PUNT(34)   993912636    45817932600
host route                         PUNT(32)   228855708    19337408523
mcast host copy                    PUNT( 6)        2591         422909
reject route                       PUNT(40)     2855554      402183663
reject route V6                    PUNT(68)        9277        1901206

The counters for L3 incompletes are 'bad ipv4 hdr checksum', 'bad IPvX hdr' and 'bad IPvX pkt len'. Notice that there is no IPv6 header checksum counter: IPv6 has no header checksum, because it was deemed unnecessary. We'll shortly see that this may have been a bad decision.
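
If you collect this output regularly, pulling out just those rows is easy to script. The following is a minimal sketch that assumes the output has one exception per line in the shape shown above; the row names are copied from that output, while the function name and the idea of summing them are my own, and the sum is PFE-wide, not per interface.

```python
import re

# Exception rows that the text above identifies as L3 incompletes.
L3_INCOMPLETE_ROWS = {
    "bad ipv4 hdr checksum",
    "bad ipv4 hdr",
    "bad ipv6 hdr",
    "bad ipv4 pkt len",
    "bad ipv6 pkt len",
}

ROW = re.compile(r"^\s*(.+?)\s+(DISC|PUNT)\(\s*(\d+)\)\s+(\d+)\s+(\d+)\s*$")


def l3_incomplete_packets(output):
    """Return {row name: packet count} for the L3-incomplete rows only."""
    found = {}
    for line in output.splitlines():
        m = ROW.match(line)
        if m and m.group(1).lower() in L3_INCOMPLETE_ROWS:
            found[m.group(1)] = int(m.group(4))
    return found


SAMPLE = """\
bad ipv4 hdr checksum              DISC( 2)      660667      289608849
bad IPv6 options pkt               DISC( 9)           3            216
bad IPv4 hdr                       DISC(11)       33339       12708126
bad IPv6 hdr                       DISC(56)         133           9508
bad IPv4 pkt len                   DISC(12)      108203       33978274
bad IPv6 pkt len                   DISC(57)        1009          72421
ttl expired                        PUNT( 1)    51770371     3124910479
"""

rows = l3_incomplete_packets(SAMPLE)
for name, pkts in rows.items():
    print("%-24s %d" % (name, pkts))
print("total", sum(rows.values()))
```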

Now why could we possibly see L3 incompletes increasing? If a frame is mangled on the wire, the Ethernet CRC fails (and the CRC is a much stronger check than the IP checksum), so we drop the frame much earlier and never even get to these checks. So clearly we received a packet which had a correct Ethernet CRC and yet was broken.

We recently had an issue where pretty much all egress PE boxes started logging 'L3 incompletes' because the IPv4 header checksum was failing. They incremented maybe 20 times per hour, so very moderately. But how is this possible? If someone had generated a broken IP packet and sent it to us, we would have dropped it at the ingress PE and incremented these counters there, yet the packet traversed the MPLS core all the way to the egress PE. So clearly we were mangling the packets ourselves. Obviously the core is like Jon Snow: to it this is just an MPLS frame, it does not need to know that it carries IP, nor should it verify it for correctness, so the core will happily pass broken packets around.

Figuring out who is mangling the packets seems like a complex problem. Luckily Trio gives us the ability to capture exception packets. Here we need the exception number we saw above in the exception counters:

NPC7(ruuter vty)# debug jnh exceptions 2 discard
NPC7(ruuter vty)# debug jnh exceptions-trace
NPC7(ruuter vty)# show jnh exceptions-trace
[1768975] jnh_exception_packet_trace: ###############
[1768976] jnh_exception_packet_trace: [iif:344,code/info:130/0x0,score:tcp|(0x40),ptype:2/0,orig_ptype:2,offset:18,orig_offset:18,len:60]
[1768977] jnh_exception_packet_trace: 0x00: 20 40 82 00 00 00 01 58 00 12 00 3c 80 00 00 20
[1768978] jnh_exception_packet_trace: 0x10: 12 00 00 3c 00 00 00 00 00 28 c0 da 07 c0 00 00
[1768979] jnh_exception_packet_trace: 0x20: 12 1e d5 97 f8 88 47 00 00 03 3d 45 00 00 28 1e
[1768980] jnh_exception_packet_trace: 0x30: 4f 40 00 87 06 44 d4 XX XX 39 58 XX XX 3f 6a d0
[1768981] jnh_exception_packet_trace: 0x40: c0 e2 82 4f 0d a2 2d cc ec aa 6b fd 78 0f 10 22
[1768982] jnh_exception_packet_trace: 0x50: 60 cd 42 00 00
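
Reading addresses out of a hex dump by eye gets old quickly, so here is a rough Python sketch that does it for the trace above. It rests on assumptions of mine, not on any documented format: the dump starts with some PFE-internal bytes that I simply skip, the first '88 47' byte pair is taken to be the MPLS EtherType, and the IPv4 header follows the bottom-of-stack label. Bytes shown as XX (masked by me) are kept as unknown.

```python
import re

TRACE = """\
[1768977] jnh_exception_packet_trace: 0x00: 20 40 82 00 00 00 01 58 00 12 00 3c 80 00 00 20
[1768978] jnh_exception_packet_trace: 0x10: 12 00 00 3c 00 00 00 00 00 28 c0 da 07 c0 00 00
[1768979] jnh_exception_packet_trace: 0x20: 12 1e d5 97 f8 88 47 00 00 03 3d 45 00 00 28 1e
[1768980] jnh_exception_packet_trace: 0x30: 4f 40 00 87 06 44 d4 XX XX 39 58 XX XX 3f 6a d0
[1768981] jnh_exception_packet_trace: 0x40: c0 e2 82 4f 0d a2 2d cc ec aa 6b fd 78 0f 10 22
[1768982] jnh_exception_packet_trace: 0x50: 60 cd 42 00 00
"""


def parse_dump(text):
    """Collect the dumped bytes; masked 'XX' bytes become None."""
    data = []
    for line in text.splitlines():
        m = re.search(r"trace: 0x[0-9a-f]+: (.*)$", line)
        if m:
            for tok in m.group(1).split():
                data.append(None if tok == "XX" else int(tok, 16))
    return data


def ipv4_offset_after_mpls(data):
    """Heuristic: find EtherType 0x8847, then skip labels until the S bit."""
    for i in range(len(data) - 1):
        if data[i] == 0x88 and data[i + 1] == 0x47:
            j = i + 2
            while j + 4 <= len(data):
                if data[j + 2] is None:
                    return None
                bottom_of_stack = data[j + 2] & 0x01
                j += 4
                if bottom_of_stack:
                    return j
            return None
    return None


def dotted(octets):
    return ".".join("?" if o is None else str(o) for o in octets)


data = parse_dump(TRACE)
off = ipv4_offset_after_mpls(data)
hdr = data[off:off + 20]
info = {
    "total_length": (hdr[2] << 8) | hdr[3],
    "ttl": hdr[8],
    "protocol": hdr[9],
    "checksum": (hdr[10] << 8) | hdr[11],
    "saddr": dotted(hdr[12:16]),
    "daddr": dotted(hdr[16:20]),
}
print(off, info)
```

For this trace the header is found 18 bytes into the Ethernet frame (14 bytes of Ethernet plus one 4-byte label), which agrees with the 'offset:18' field in the trace header line.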

It should be a lot simpler for us to troubleshoot the issue now. We just figure out where that SADDR (XX XX 39 58) enters the network. With any luck the mangling node is somewhere at the edge of the network and we will find some common theme in the source addresses. If it's in the core, you're pretty much out of luck: you either replace the whole network or accept that you mangle something.

I wrote a little script which takes the IP header, the incorrect checksum and the correct checksum, and shows what each 16-bit field would have to be for the checksum to come out correct. With luck you can use it to figure out which part of the packet is being mangled, because for some of those 16-bit fields the required value is bound to be impossible or invalid, which lets you exclude them and concentrate your efforts on the rest. A field like ID can be anything, so it is impossible to exclude. SADDR is also problematic; DADDR usually is not (if it had been the other value, the packet would not have been routed to my network).
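
The following is not that script, just a minimal Python sketch of the same idea, under the assumption that exactly one 16-bit word of the header (possibly the checksum word itself) was mangled. It takes only the header as received, since the checksum is already in it. A valid header's 16-bit words sum to 0xFFFF in ones' complement arithmetic, so for each word we can compute the value it would need given all the others. The field names and the example header are mine.

```python
FIELD_NAMES = [
    "version/IHL/TOS", "total length", "ID", "flags/frag offset",
    "TTL/protocol", "header checksum",
    "SADDR high", "SADDR low", "DADDR high", "DADDR low",
]


def ones_complement_sum(words):
    """Fold a list of 16-bit words into a 16-bit ones' complement sum."""
    total = sum(words)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def candidate_fixes(header):
    """For every 16-bit word, return (name, received value, value it would
    need for the header checksum to verify if only that word were wrong)."""
    words = [(header[i] << 8) | header[i + 1] for i in range(0, len(header), 2)]
    fixes = []
    for i, received in enumerate(words):
        rest = ones_complement_sum(words[:i] + words[i + 1:])
        needed = 0xFFFF - rest  # rest + needed == 0xFFFF, so the sum verifies
        name = FIELD_NAMES[i] if i < len(FIELD_NAMES) else "option word"
        fixes.append((name, received, needed))
    return fixes


# A well-known valid example header with one bit flipped in the TTL byte
# (0x40 became 0x41); the checksum field 0xb861 is left untouched.
MANGLED = bytes.fromhex("4500007300004000" "4111b861" "c0a80001c0a800c7")

fixes = candidate_fixes(MANGLED)
for name, received, needed in fixes:
    print("%-18s received 0x%04x  would need 0x%04x" % (name, received, needed))
```

In this example the first word would need to be 0x4400, i.e. an IHL of 4, which is impossible, so that field can be excluded. The TTL/protocol word would need 0x4011, a perfectly plausible TTL 64 with UDP, so it stays on the list of suspects.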

Why does this happen? How often does it happen? I really would like to know. Obviously, as the CRC is correct, it is not happening because of errors in links, optics and so on. My guess is that it happens mostly because of bad memory in the forwarding logic. A packet can touch many memories on its path, ring => sram => dram => sram => ring. Do all of these have ECC? Does a PHY ring ever have ECC? I don't know, but clearly there can be problems, as L3 incompletes exist. How common are they? Probably a lot more common than we think, as we only learn about mangling when it happens to hit the IPv4 header (IPv6 header mangling would usually pass unnoticed). The IPv4 header is 20B and a typical packet length is 1500B, so are we only seeing the 1.3% tip of the iceberg?
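
To make the iceberg arithmetic explicit, here is a back-of-the-envelope sketch in Python. It assumes, very crudely, that every packet is 1500 bytes and that corruption is equally likely to hit any byte of the packet; the 20 per hour figure is the rate from the incident described earlier. Real traffic has a mix of packet sizes, so treat the result as an order-of-magnitude guess at best.

```python
def visible_fraction(header_bytes=20, packet_bytes=1500):
    """Share of corruption that lands in the IPv4 header and is thus counted."""
    return header_bytes / packet_bytes


def implied_total(observed_header_hits, header_bytes=20, packet_bytes=1500):
    """Scale observed header corruption up to the whole packet."""
    return observed_header_hits * packet_bytes / header_bytes


print("visible: %.1f%%" % (visible_fraction() * 100))
print("implied manglings per hour: %.0f" % implied_total(20))
```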

I guess the key takeaway here is: don't use 'ignore-l3-incompletes'. Monitor your 'L3 incompletes' via SNMP, figure out why they are happening and fix them. In particular, check right now whether your egress PE has L3 incompletes coming from the core.

4 comments:

  1. Great post Saku. thank you!

  2. Do you want to join AS15169?

    We saw a case where this actually happened (and actually caused a problem); it was a bug in library code.

    https://github.com/google/gopacket/issues/231

    Thanks for your post we identified the issue quickly.

  3. Great Post and Explanation.. Thank you ! Keep up doing good work.

  4. Great post, doesn't work for sub interface index numbers, which unfortunately is what i needed to see.
