2012-10-24

Sorry state of JunOS control plane protection

I've been looking into how to protect MX80 11.4R5 from various accidental and intentional attempts to congest control plane and I'm drawing pretty much blank.

Main discoveries so far.

ISIS always leaked to control plane, even when no 'family iso' or 'protocol isis' on interface
PVST always leaked to control plane. Even when just 'family inet' configured to interface
LLDP protocol not matched by ddos-protection feature
Essentially impossible to protect against attack from eBGP
ddos-protection feature mis-dimensioned

ISIS

This is pretty bad for anyone running ISIS, as you cannot use ddos-protection to limit ISIS, as it won't distinguish between bad and good ISIS. If you don't use ISIS, just set ddos-protection limit low and you're good to go.

ISIS is punted with different code than IP packets, but resolving the punt path it goes to the same path. This path is still seeing full wire rate, i.e. there isn't magic 10kpps limit before it

HCFPC2(le_ruuter vty)# show jnh 0 exceptions control pkt punt via nh PUNT(34) 9134818 1065269880 HCFPC2(le_ruuter vty)# show jnh 0 exceptions nh 34 punt Nexthop Chain: CallNH:desc_ptr:0xc02bbc, mode=0, rst_stk=0x0, count=0x3 0xc02bb8 0 : 0x127fffffe00003f0 0xc02bb9 1 : 0x2ffffffe07924a00 0xc02bba 2 : 0xda00601499000a04 0xc02bbb 3 : 0x3af46014fcd08810 HCFPC2((le_ruuter vty)# show jnh 0 decode 0xda00601499000a04 IndexNH:key_ptr:0x80/0, desc_ptr=0xc02932, max=10, nbits=4 HCFPC2(le_ruuter vty)# show jnh 0 vread 0xc02932 4 Addr:0xc02932, Data = 0x42f47fffff8b0010 Addr:0xc02933, Data = 0xda026014b6801004 Addr:0xc02934, Data = 0x60040740000e822f Addr:0xc02935, Data = 0x60041bc0000e828a HCFPC2(le_ruuter vty)# show jnh 0 decode 0x60040740000e822f JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x000080e8 base_ptr = 0x000e822f HCFPC2(le_ruuter vty)# show jnh 0 decode 0x60041bc0000e828a JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x00008378 base_ptr = 0x000e828a HCFPC2(le_ruuter vty)# show filter Index Semantic Name -------- ---------- ------ 46137345 Classic HOSTBOUND_IPv4_FILTER 46137346 Classic HOSTBOUND_IPv6_FILTER HCFPC2(le_ruuter vty)# show filter index 46137345 detail JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x000080e8 base_ptr = 0x000e822f HCFPC2(le_ruuter vty)# show filter index 46137346 detail JNH_FW_START: opcode = 0x0000000c desc_ptr = 0x00008378 base_ptr = 0x000e828a

PVST

If you don't need PVST, you can just limit in ddos-protection. But still it's pretty annoying it's leaked to control-plane, especially as Trio already does support 'punt mask' for LACP, STP, LLDP etc per physical interface, but even if STP punting is turned off, PVST is still punted.

HCFPC2(le_ruuter vty)# show ifd brief Index Name Type Flags Slot State ----- -------------------- ----------- ------ ----- ------ 190 xe-2/0/6 Ethernet 0x0000000000008000 2 Up HCFPC2(le_ruuter vty)# show jnh ifd 190 stream lacp:-, stp:-/0, esmc:-, lfm:-, erp:-, lldp:-, mvrp:-/-, smac_mcast_clear:-, vc:-, natVlan:-/4095, native tpid 0, tpidMask:0x0001

BGP

Problem with protecting against eBGP attack is, that policers work by bps (except DDoS policers, you can actually almost certainly make any policer in PFE bps->pps by changing its application by poking directly at memory, but it would be cleared by next reboot or 'commit full'). And you can only cope with maybe 4Mbps of traffic, so either you accept convergence issues in BGP or you accept that eBGP can bring you down. If you absolutely positively must fix this, then one way to get closer is to police <1400B BGP at very low rate and >1400B BGP at high enough rate for convergence, but you'd need separate policers per BGP so that one BGP neighbor cannot bring another down by killing hello packets.

One quick and dirty fix to protect eBGP from iBGP, but not from other eBGP would be to run all your eBGP as 'passive' and run your route reflectors as 'passive'. Then your PE would open connection to RR and your customers would open connection to your PE, this is already classified to different terms in ddos-protection filter:

HCFPC2(le_ruuter vty)# show filter Index Semantic Name -------- ---------- ------ 46137345 Classic HOSTBOUND_IPv4_FILTER HCFPC2(le_ruuter vty)# show filter index 46137345 program term HOSTBOUND_BGP_TERM1 term priority 0 payload-protocol 6 destination-port 179 then accept queue 0 policer template __ddos_BGP_aggregate_policer__ policer __ddos_BGP_aggregate_policer__-HOSTBOUND_BGP_TERM1 app_type 23 bandwidth-limit 34359738360 bits/sec burst-size-limit 16777215 bytes discard count __ddos_BGP_aggregate_pass__ ddos proto 5120 term HOSTBOUND_BGP_TERM2 term priority 0 payload-protocol 6 source-port 179 then accept queue 0 policer template __ddos_BGP_aggregate_policer__ policer __ddos_BGP_aggregate_policer__-HOSTBOUND_BGP_TERM2 app_type 23 bandwidth-limit 34359738360 bits/sec burst-size-limit 16777215 bytes discard count __ddos_BGP_aggregate_pass__ ddos proto 5120

Now only change needed, would be to put these under different BGP policer, then your customers would be policed separately to your iBGP and attack wouldn't bring core down.

Dimensioning

It's really strange how Juniper has dimensioned their boxes. MX80 goes down on 4Mbps/10kpps flood, while RE CPU (PQ3, 8572) and LC CPU (PQ3, 8544) both are 90% idle during the event, while all ISIS, LDP, BGP remain down until attack stops.

MX960 RP CPU (4xXEON) MPC2 LC CPU (PQ3, 8548) isn't faring significantly better than MX80, if attack and protected service are in same MPC, it cannot handle anywhere near stock ddos-protection 20kpps, but will bring core BGP down. Maybe MX960 can do 15kpps.

T4k RP CPU (2xXEON) FPC5 LC CPU (QorIQ P2020) can actually handle stock ddos-protection 20kpps rate, but not 30kpps, so if you can push two protocols to ddos-protection, it's still going to be down.

There clearly is some per linecard non-configurable policer which limits JunOS control-plane performance to much slower rate than what it realistically can handle, this is as stupid as 'mls rate-limit unicast cef receive' on 7600, essentially you're underclocking your control-plane, making it die under lower than max load. However, if we could control traffic in pps level, it wouldn't matter 5kpps is plenty for BGP convergence. But as we must limit in bps and prepare for worst-case scenario, policer values need to be ridiculously small. Maybe you allow from customer BGP, VRRP, DHCP, PIM, BFD. Now if you want that VRRP flood will only kill VRRP not other service, you need separate policer for each, but on aggregate they can't be over 4Mbps, so you're left with 800kbps per protocol, if you share equally, of course BGP is only capacity hungry protocol.

Solution

It's confusing why control plane protection is even user configurable feature, as it can be strictly restricted and pps limited per session dynamically as services are turned on, this is perfectly doable in Trio hardware, no user input needed.

And at very least you should be able to do L2 filters on L3 interfaces, so you could drop everything except IPv4, IPv6, ARP ethertypes and remove most hard to protect attack vectors.

If you need something useful today, put core and edge in different MPC and use ddos-protection feature so that edge cannot congest core. For single linecard system like MX80 there unfortunately isn't any really practical way today.

2012-04-25

My Cisco Live 2012 program

My Cisco Live 2012 program is below. I was able to select most of the sessions although there were some overlap with one session (something ARC overlapping with another ARC). You might note that the closing keynote is by the guys who play Mythbusters on TV - that'll most likely be a fun filled event! I'll blog about my pre-event feelings if I have the time and interest ;-)

Cisco Live 2012

Personal Schedule

Printed below is your personal schedule.

Sunday
8:00 AM 5:00 PM	TECCOM-2001	Cisco Unified Computing System Technical Day
Monday
8:00 AM 9:30 AM	BRKNMS-2658	Securely Managing Your Networks with SNMPv3
10:00 AM 12:00 PM	BRKSEC-3021	Maximizing Firewall Performance
1:00 PM 3:00 PM	BRKRST-2335	IS-IS Network Design and Deployment
Tuesday
8:00 AM 9:30 AM	BRKRST-2310	Deploying OSPF in a Large-Scale Network
10:00 AM 11:30 AM	GENKEY-4346	Keynote and Welcome Address
12:30 PM 2:30 PM	BRKARC-3470	Cisco Nexus 7000 Switch Architecture
4:00 PM 6:00 PM	BRKSEC-4054	DMVPN Deployment Models
Wednesday
8:00 AM 9:30 AM	BRKSEC-3013	Advanced IPSec with FlexVPN and IKEv2
10:00 AM 11:30 AM	GENKEY-4347	Cisco Technology Keynote
12:30 PM 2:30 PM	BRKSPG-2402	Best Practices to Deploy High-Availability in Service Provider Edge and Aggregation Architectures
4:00 PM 6:00 PM	BRKARC-3471	Cisco NX-OS Software Architecture
Thursday
8:00 AM 9:30 AM	BRKARC-2003	Cisco ASR 9000 Architecture
10:00 AM 11:30 AM	BRKARC-2007	IOS Strategy and Evolution
12:00 PM 1:30 PM	BRKSPG-2404	Energy Efficient Data Center: Best Practices
2:00 PM 3:00 PM	GENKEY-4358	Closing Keynote: An Afternoon with Adam Savage and Jamie Hyneman
3:30 PM 5:30 PM	BRKMPL-3101	Advanced Topics and Future Directions in MPLS

= Conference Event

2012-04-06

We don't understand hashes

At least I don't, nor do I understand math. It only this week dawned to me, that we consistently choose wrong hash for password hashing.

When I started using Linux DES was the standard way to hash your passwd, then it was MD5, now at least Ubuntu is using SHA. And I can bet that in 2 years time SHA-3 (will be selected this year) will be used widely for protecting passwords.

But what were design goals for MD5 and SHA? Design goals are obviously avoidance of collisions and more importantly algorithmic cheapness in terms of computational requirements and ability to implement it cheaply and easily in hardware requiring no branching. So MD5 and SHA are _by design_ simple to brute-force in hardware, even the new SHA-3. You don't want your 'git commit' or 'sha3sum /dev/cdrom' to take days, you want it to be very very fast and very very unlikely to represent any other data.

It should be quite obvious that those requirements are orthogonal to the requirements of password hash. Avoidance of collisions is not critical to password hash and absolutely opposite requirement of computational expensiveness exists for password hash, it needs to be very expensive and it needs to be poorly implementable in cheap hardware.

This only dawned to me, when co-worker was cracking password of one DWDM system and as it was DES we naively assumed it'll be cracked in seconds, but it turns out in CUDA systems DES is hundreds of times slower to crack than MD5 (it might have been that this was apples to oranges unix DES compared to MD5 instead of unix MD5, but doesn't change the fact that hash like bcrypt makes much more sense for passwords than hashes like MD5 or SHA). It was really illuminating moment for me, obviously this is good for general applications of hash, MD5 was designed to be fast (and so will SHA-3), obviously fast to calculate also means fast to brute-force. Only problem here is, that we're not understanding our application when choosing hash for protecting password, MD5 never was designed to be good for that application. Bcrypt is, I'm not familiar with it, I don't really know how good it is, but at least it's designed to be computationally very expensive and as machines get faster you can make it slower and slower without changing the implementation, you'll just give it different parameter.

2012-03-31

Silver bullet for home QoS

Rationale

I mentioned in one of the posts about how prioritizing small packets upstream is almost the proverbial silver bullet when it comes to QoS at home. I'm sure any ADSL user who uses interactive applications, such as SSH have noticed how laggy the SSH gets when you upload something from home, say your holiday pictures with scp to your web server. Also download is quite slow during upload. VoIP and online gaming will suffer too. Canonical solution is to use DSCP markings at sender end or DSCP mark based on IP address or port.

But I feel that is unnecessarily complex for typical home use scenario, since all of the important/interactive stuff are using small packets and the bandwidth hogging applications are all essentially doing MTU size packets. I've chosen <200B as small packet, which is arbitrary decision I did about decade ago when setting this up first time, I'm sure it could just as well be like 1300B. So without further rambling, I'll give IOS (ISR) and JunOS (SRX) examples how to roll this on your CPE.

IOS example

class-map match-any SMALL-PACKETS match packet length max 200 ! policy-map WAN-OUT class SMALL-PACKETS priority percent 75 class class-default fair-queue random-detect ! interface ATM0.100 point-to-point pvc 0/100 vbr-nrt 2000 2000 tx-ring-limit 3 service-policy output WAN-OUT ! !

JunOS example

ytti@gw.fi> show configuration interfaces vlan unit 0 family inet filter input FROM_LAN; ytti@gw.fi> show configuration firewall family inet filter FROM_LAN term small_packets { from { packet-length 0-200; } then { forwarding-class expedited-forwarding; next term; } } term rest { then accept; } ytti@gw.fi> show configuration class-of-service interfaces at-1/0/0 unit 0 { scheduler-map WAN_OUT; } ytti@gw.fi> show configuration class-of-service scheduler-maps WAN_OUT forwarding-class best-effort scheduler BE; forwarding-class assured-forwarding scheduler AF; forwarding-class expedited-forwarding scheduler EF; forwarding-class network-control scheduler NC; ytti@gw.fi> show configuration class-of-service schedulers BE { transmit-rate percent 5; } AF { transmit-rate percent 5; } EF { transmit-rate percent 85; } NC { transmit-rate percent 5; } ytti@gw.fi>

additional information

You need to tune in IOS your vbr-nrt to match your upstream ATM rate, modern IOS will automatically scale it down to real rate, if it's too high, so you don't have to worry about it too much. If you have very slow connection like 256kbps, you might want to put tx-ring-limit to 2. Unfortunately I've not found out how to tune tx-ring size on JNPR, and it feels bit too large by default, as IOS is somewhat more responsive during congestion.

You can test if it's working by sending large file upstream and pinging some other host. See how it looks with and without QoS applied to the egress interface. You should see high delay on ping without QoS and normal delay with QoS. In JunOS you can use 'show interfaces queue X' or 'show interface X detail' to confirm that you're seeing drops in best-efford, not in expedited-forwarding. In IOS you can use 'show policy-map interface X output' to confirm you're seeing drops in class-default not in SMALL-PACKETS.

There is one particular problem for people using ssh ControlMaster which multiplexes multiple connections under same network socket. It's really great as you only login to remote host once and further ssh/scp start without delay and without authentication, especially great if you're hopping through multiple intermediate ssh hosts can reduce delay from 4-5s to 100ms on opening ssh session. But when it comes to QoS it's quite poor, if you have interactive ssh session to your server and then you use scp to upload data to that same server, you will notice that interactive ssh is laggy even with QoS. This is of course how it should work, while your CPE will reorder packets, to send interactive (small) packets first, the far end server, will not give the unordered small packets to userspace, as TCP guarantees packet ordering, so the far end server is keeping those packet jailed until original (laggy) order is restored. Quick fix is to disable ControlMaster for scp, via scp -o ControlPath=none foo bar:'.

2012-03-14

LLDP / 802.1AB-2009 blows

If you're designing L2 discovery protocol, I suppose one of your mandatory requirements is, that you can 'machine walk' the network, after you find one box. I.e. you are able to know your neighbor devices and their ports. LLDP makes no such guarantees

You have 4 mandatory TLVs, [0123], End of LLDPDU, Chassis ID, Port ID and TTL. Chassis ID has 7 subtypes which implementation is free to choose, EntPhysicalAlias (two distinct cases), IfAlias, MAC address, networkAddress, ifName or locally assigned. Port ID also has 7 subtypes which implementation is free to choose, ifAlias, entPhysicalAlias, MAC address, networkAddress, ifName, agent circuit ID, locally assigned.

Now you can send what ever trash via locally assigned and be fully compliant implementation. It seems that it would be wise to mandate sending management address (networkAddress) in ChassisID and SNMP ifindex in PortID (and any _additional_ ones you may want to send, i.e. more than 1, which is not allowed). This way you'd immediately know what OID to query and from which node. Obviously this makes assumption that we have IP address always and SNMP implementation always. If we absolutely must support some corner cases where this is not true, we should specify different mandatory requirements for devices without networkAddress and SNMP implementation. Now because of some corner cases we can never trust LLDP implementation to be useful.

Clear sign that LLDP is not actually meeting real-world demands is that PortID often is locally assigned which is populated with SNMP index, you just have to know it that given device works like this, there is no way to programmatically know it beforehand.

2012-01-22

Future residential INET users, I'm so sorry

I never believed IPv6 will be NAT free, but as idealist I hoped there is good chance there will be mostly only 1:1 NAT and each and every connection will get own routable network, /56 or so, residential DSL, mobile data, everything

Unfortunately that ship has sailed, it's almost certain majority of residential/non-business products will only contain single directly connected network, since we (as a community, I don't want to put all the blame to IPv6 kooks) failed produce feasible technical way to do it and spent too much time arguing on irrelevant matters. I'm reviewing two ways to provide INET access on DSL, no PPPoX, as it's not done in my corner of the world, and show why it's not practical to provide the end customer routable network

Statically configure per customer interface

At DSLAM (or other access device) customer would be placed in unique virtual-circuit (Q, QinQ...) all would terminated on unique L3 logical interface in PE router. Interface would have static /64 ipv6 address and ipv6/56 network routed to say ::c/64. IPv4 could continue to be shared subnet via 'unnumbered' interface.

This is by far my favorite way of doing residential IPv6 it, it supports customer owned and provider owned CPE, it supports routing and bridging in CPE, if CPE is bridged, CPE doesn't even need to be updated and it requires no magic features in DSLAM, IPv6 will just work.

But why this in practice does not work, is because there isn't PE router supporting 100k logical interfaces in single physical port, which would be needed for most non-trivial size operators today. Of course you could always buy more L3 termination points and terminate fewer (or even just one DSLAM) per PE, but the CAPEX isn't justifiable just to produce IPv6. Maybe due to organic network upgrades L3 goes closer to the edge and this will become feasible.

Routing in CPE, DHCP-PD

Customer CPE would use shared linknet with other customers, and some magic feature in DSLAM would stop CPEs from reaching each other in L2 (like they do today in IPv4), for LAN address customer would use DHCP-PD, which would be distributed by CPE as SLAAC to LAN interface, this is supported and works today.

But there is no supported way to provide static route to CPE, there is draft-ietf-mif-dhcpv6-route-option-03 which would add needed functionality, but I don't think there is much chance it'll materialize in time, as SPs are already busy defining how their residential V6 is produced.

Problems here are, apart from not being supported, is that new CPE is needed, DSLAM vendors need to develop IPv6 specific magic features (ND, RA, DHCP) to workaround shared LAN problems.

Conclusion

So all I can say is sorry, again users need to pay for premium product to get routable network and ad-hoc network sharing from random available 3G mobile or DSL still means setting up NAT (or ND proxy to lie that you're on connected net, when you're not). It's damn shame, as it's really trivial technical problem which seems just to be completely ignored.