2012-01-22

Future residential INET users, I'm so sorry

I never believed IPv6 would be NAT-free, but as an idealist I hoped there was a good chance it would be mostly 1:1 NAT only, and that each and every connection would get its own routable network, a /56 or so: residential DSL, mobile data, everything.

Unfortunately that ship has sailed. It's almost certain that the majority of residential/non-business products will only contain a single directly connected network, since we (as a community; I don't want to put all the blame on the IPv6 kooks) failed to produce a feasible technical way to do it and spent too much time arguing about irrelevant matters. I'm reviewing two ways to provide INET access on DSL (no PPPoX, as it's not done in my corner of the world) and showing why it's not practical to provide the end customer a routable network.

Statically configure per customer interface

At the DSLAM (or other access device) each customer would be placed in a unique virtual circuit (Q, QinQ...), and each of these would terminate on a unique L3 logical interface in the PE router. The interface would have a static /64 IPv6 address, and an IPv6 /56 network would be routed to, say, ::c in that /64. IPv4 could continue to be a shared subnet via an 'unnumbered' interface.

This is by far my favorite way of doing residential IPv6: it supports customer-owned and provider-owned CPE, it supports routing and bridging in the CPE, if the CPE is bridged it doesn't even need to be updated, and it requires no magic features in the DSLAM. IPv6 will just work.

The reason this does not work in practice is that there is no PE router supporting 100k logical interfaces on a single physical port, which is what most operators of non-trivial size would need today. Of course you could always buy more L3 termination points and terminate fewer DSLAMs (or even just one) per PE, but the CAPEX isn't justifiable just to produce IPv6. Maybe organic network upgrades will move L3 closer to the edge and this will become feasible.
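The per-customer addressing described above can be sketched in a few lines of Python. This is only an illustration: the two pools, their sizes and the function names are my assumptions (documentation prefixes, 65,536 customers each), not anything an operator actually uses.

```python
import ipaddress

# Hypothetical pools taken from the documentation prefix 2001:db8::/32.
LINKNET_POOL = ipaddress.ip_network("2001:db8::/48")     # one /64 per customer interface
ROUTED_POOL = ipaddress.ip_network("2001:db8:100::/40")  # one /56 per customer


def nth_subnet(pool, new_prefix, index):
    """Return the index-th subnet of length new_prefix inside pool."""
    count = 1 << (new_prefix - pool.prefixlen)
    if not 0 <= index < count:
        raise ValueError(f"{pool} only holds {count} /{new_prefix} subnets")
    base = int(pool.network_address) + (index << (128 - new_prefix))
    return ipaddress.IPv6Network((base, new_prefix))


def customer_plan(customer_id):
    """Static plan for one customer: /64 linknet plus a /56 routed to ::c."""
    linknet = nth_subnet(LINKNET_POOL, 64, customer_id)
    return {
        "linknet": linknet,
        "pe_address": linknet[1],       # address on the PE logical interface
        "cpe_next_hop": linknet[0xC],   # the ::c next hop for the static route
        "routed": nth_subnet(ROUTED_POOL, 56, customer_id),
    }


plan = customer_plan(42)
print(f"route {plan['routed']} via {plan['cpe_next_hop']}")
```

The arithmetic is trivial, which is the point: the hard part is not the address plan but terminating that many logical interfaces on one port.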

Routing in CPE, DHCP-PD

The customer CPE would share a linknet with other customers, and some magic feature in the DSLAM would stop CPEs from reaching each other at L2 (like they do today in IPv4). For LAN addresses the customer would use DHCP-PD, and the CPE would distribute the delegated prefix to the LAN interface with SLAAC. This is supported and works today.
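What the CPE does with a delegated prefix can be sketched roughly as below. The delegated /56, the MAC address and the function names are made up for illustration, and the modified EUI-64 interface identifier is only one of the ways a host may form a SLAAC address.

```python
import ipaddress
from itertools import islice


def lan_prefixes(delegated, count):
    """Carve the first `count` /64 LAN prefixes out of a delegated prefix."""
    pd = ipaddress.ip_network(delegated)
    return list(islice(pd.subnets(new_prefix=64), count))


def slaac_address(prefix, mac):
    """Combine a /64 with a modified EUI-64 interface ID derived from a MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(
        int(prefix.network_address) | int.from_bytes(iid, "big")
    )


lans = lan_prefixes("2001:db8:100:2a00::/56", 2)
print(lans[0], slaac_address(lans[0], "00:11:22:33:44:55"))
```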

But there is no supported way to provide a static route to the CPE. There is draft-ietf-mif-dhcpv6-route-option-03, which would add the needed functionality, but I don't think there is much chance it'll materialize in time, as SPs are already busy defining how their residential V6 is produced.

The problems here, apart from it not being supported, are that a new CPE is needed and that DSLAM vendors need to develop IPv6-specific magic features (ND, RA, DHCP) to work around shared LAN problems.

Conclusion

So all I can say is sorry: again users need to pay for a premium product to get a routable network, and ad-hoc network sharing from whatever 3G mobile or DSL happens to be available still means setting up NAT (or an ND proxy to lie that you're on a connected net when you're not). It's a damn shame, as it's a really trivial technical problem which just seems to be completely ignored.

2011-11-04

junos vrf-import funnies

Consider this configuration:

> show configuration routing-instances VRF1
instance-type vrf;
route-distinguisher 42:1;
vrf-import [ VRF1-IMPORT VRF-DEFAULT-IMPORT ];
vrf-export [ VRF1-EXPORT VRF-DEFAULT-EXPORT ];
vrf-table-label;

> show configuration policy-options policy-statement VRF1-IMPORT
from community [ VRF1 VRF2 ];

> show configuration policy-options policy-statement VRF-DEFAULT-IMPORT
term cust_routes {
    from protocol bgp;
    then default-action accept;
}

> show configuration policy-options community VRF1
members target:42:1;

> show configuration policy-options community VRF2
members target:42:2;

If you configure this on any router in your network, it'll work: the VRF will import the correct routes and only the correct routes. This leads you to assume that VRF import in JunOS works like this:

start with empty array of routes to evaluate policy against
when you hit 'match community' push matching routes from bgp.l3vpn.0 to the list
evaluate rules normally against the list
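That assumed model can be written down as a short Python sketch. To be clear, this models the assumption in the list above, not what JunOS really does; the route table, the extra 42:9 route and all names are mine.

```python
# Toy stand-in for bgp.l3vpn.0: prefix plus its route targets.
L3VPN = [
    {"prefix": "10.10.1.0/24", "protocol": "bgp", "targets": {"42:1"}},
    {"prefix": "10.10.2.0/24", "protocol": "bgp", "targets": {"42:1", "42:2", "42:3"}},
    {"prefix": "10.10.3.0/24", "protocol": "bgp", "targets": {"42:1", "42:3"}},
    {"prefix": "10.10.9.0/24", "protocol": "bgp", "targets": {"42:9"}},
]


def naive_vrf_import(routes, from_community, default_policy):
    """The assumed model: pre-select by community, then evaluate the rest."""
    candidates = []                                  # 1. start with an empty list
    for route in routes:                             # 2. 'from community' pushes matches
        if route["targets"] & from_community:
            candidates.append(route)
    return [r["prefix"] for r in candidates if default_policy(r)]  # 3. normal evaluation


def vrf_default_import(route):
    """Rough equivalent of VRF-DEFAULT-IMPORT: accept anything learned via BGP."""
    return route["protocol"] == "bgp"


print(naive_vrf_import(L3VPN, {"42:1", "42:2"}, vrf_default_import))
```

Under this model VRF1 gets exactly the three routes carrying target 42:1 or 42:2 and nothing else, which is what a single-VRF test shows.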

If you create several of these on a single router and each has only a single 'from community [ X ]', it also works perfectly. However, if you have more than one community in 'from community' AND more than one VRF using 'VRF-DEFAULT-IMPORT', things go wrong. Say we have three routes:

10.10.1.0/24 RT:42:1
10.10.2.0/24 RT:42:1 RT:42:2 RT:42:3
10.10.3.0/24 RT:42:1 RT:42:3

VRF1 will correctly import all of these, but it will also leak #2 to other VRFs on the same PE that have 'VRF-DEFAULT-IMPORT'; it won't leak #1 or #3. It's not actually a bug. Rather, the fact that it works at all is a side effect of an optimization that applies when a route hits exactly one 'show bgp targets' entry. Evaluation is not done the way the results of the simple test might suggest.

2011-11-03

no usage scenario for ssh-agent forwarding

Many people, especially those in the consulting business, need to access several different organizations' 'jump boxes', from which they can ssh towards that organization's servers. For security it makes sense to have a different ssh key allowed for each organization's servers. For convenience people often allow ssh-agent forwarding towards the 'jump boxes'.

The problem with ssh-agent is that it has no idea who is requesting the key signing. It could very well be organization2's evil admin asking for your organization1 key while you are sshing into the organization2 jump box, and your agent would simply allow this.

One solution to the problem could be that whenever signing is requested, the user gets a prompt: 'localhost < organization2-jump < organization2 requests sign of organization1 identity, allow yes/no, [ ] always'. Now you'd have an idea whether the sign request is legit or not. However, this would require protocol changes to ssh, as ssh-agent has no idea who is requesting the signing, much less the full path, which would be absolutely needed to make this feature work.

So I asked the openssh dev mailing list how this problem should be solved. It turns out there is a recently added feature in openssh which could potentially remove the need for agent forwarding completely: to access organization1-server through organization1-jump you'd do ssh -oProxyCommand='ssh -W %h:%p organization1-jump' organization1-server. Obviously this is inconvenient, especially if there is more than one box through which you need to jump. .ssh/config can help somewhat:

# cat >> ~/.ssh/config
Host org1-ultimate
    ProxyCommand ssh -W %h:%p org1-secondjump
Host org1-secondjump
    ProxyCommand ssh -W %h:%p org1-firstjump
^d

Now you'd run 'ssh org1-ultimate', which would really go org1-firstjump -> org1-secondjump -> org1-ultimate. The ssh key would work without forwarding the agent, and transit nodes wouldn't see unencrypted data. However, it still seems like a large overhead. What if there were syntactic sugar to do this:

# cat >> .ssh/config
Host org1-ultimate
    path org1-firstjump, org1-secondjump
^d
# ssh org2-firstjump,org2-secondjump,org2-ultimate
# ssh org1-ultimate
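Until something like that exists, the boilerplate can be generated. Below is a minimal sketch that expands a hop list into chained ProxyCommand stanzas of the kind shown earlier; the function name is mine and the host names are the placeholder ones from this post.

```python
def proxy_chain_config(hops):
    """Expand [first jump, ..., final host] into chained ssh_config stanzas.

    Every host after the first is reached through the host before it,
    using 'ssh -W %h:%p' exactly as in the hand-written config.
    """
    if len(hops) < 2:
        raise ValueError("need at least one jump host and a target")
    stanzas = []
    # Walk from the final target back towards the first jump box.
    for index in range(len(hops) - 1, 0, -1):
        host, via = hops[index], hops[index - 1]
        stanzas.append(f"Host {host}\n    ProxyCommand ssh -W %h:%p {via}\n")
    return "".join(stanzas)


print(proxy_chain_config(["org1-firstjump", "org1-secondjump", "org1-ultimate"]), end="")
```

Append the output to ~/.ssh/config and 'ssh org1-ultimate' behaves as in the manual example.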

2011-10-02

Playing hide and seek with JunOS

JunOS has some commands which are unsupported, do not work on the platform you're using, are undocumented, or are unnecessary for the vast majority of operators. These commands are hidden in the UI, so they are only accessible if you know what you want (and, more importantly, why).

Today I was searching for a way to quiet my SRX210HE-POE, as it makes an annoying amount of noise. I failed to find a configuration knob to force it to normal spinning speed, but I did notice that the CLI exposes hidden commands. I found the same in IOS several years back and wrote a little perl script to search for them (exec only); it proved a bad idea, as several of them purposely crash your system. If you want to dig deeper: in IOS the difference shows up as 'incomplete command' versus 'invalid command'. However, some commands are truly hidden in IOS; a particular example is the toggle for unsupported transceivers.

Neither the JunOS nor the IOS behavior is something you can blame the vendor for. The vendor isn't trying to stop you from using these commands; they just want to be very clear that if you use them, TAC ain't got your back.

The code is a quick 2h hack (running it takes longer, but I'm certain the search/walk can be optimized) and it depends on an ssh/telnet library I've written. This library was meant as an optimal way to run exec commands, not configuration commands. The best way to run exec commands in JunOS is to open a new ssh channel with exec('command') per command; this way you never need to screen-scrape for the prompt, because when the ssh channel closes, the command has finished. Unfortunately this approach does not work for config, and I didn't bother to make this forced behavior optional in the library, so right now the code only supports telnet (if you really want ssh, hack it to assume the remote is 'cisco'; then it'll open a shell instead of exec, since IOS does not support multiple channels over an existing ssh connection).
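The IOS distinction mentioned earlier (incomplete versus invalid command) suggests roughly the following search loop. This is my reading of the idea, not the actual script: `run` is a hypothetical callable that sends one word to a device and returns the reply, and the device here is faked so the sketch can run without hardware (and without crashing any).

```python
INCOMPLETE = "% Incomplete command."
INVALID = "% Invalid input detected"


def find_hidden(run, visible, candidates):
    """Return candidate words the parser knows but the '?' help does not list.

    Only words answered with 'Incomplete command' are reported, so anything
    that would execute on its own is deliberately not classified here.
    """
    hidden = []
    for word in candidates:
        if word in visible:
            continue  # already listed by '?', nothing to discover
        if INCOMPLETE in run(word):
            hidden.append(word)
    return hidden


def fake_device(word):
    """Stand-in for a real session; 'secret' plays the hidden command."""
    known = {"show", "clear", "secret"}
    return INCOMPLETE if word in known else INVALID + " at '^' marker."


print(find_hidden(fake_device, {"show", "clear"}, ["show", "secret", "bogus"]))
```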

2011-08-15

When should you advertise default route?

Never

There are two typical scenarios in which people carry a default route in a dynamic routing protocol. I'll address them separately and explain why you shouldn't do it and what you should do instead.

CE (eBGP) PE

This is probably the most common scenario. Maybe you're giving your customer a default route, maybe it's your own firewall, or really any situation where the neighbor won't carry a full routing table and isn't strictly in the same administrative domain.

The problem with a default route here is that if your PE gets disconnected from the core, you're still originating the default route, the CE is unaware of this, and you're blackholing customer traffic until BGP is manually shut down. You could advertise the default conditionally, but that is just useless overhead. Instead of a default you should advertise to the CE some aggregate route which is originated from multiple core boxes, such as your PA aggregate, or really any stable route originated from multiple places but not from the local PE.

Customer would just add this to their router:

# ios
ip route 0.0.0.0 0.0.0.0 192.0.2.0 name floating_default

# junos
route 0.0.0.0/0 {
    qualified-next-hop 192.0.2.0 {
        interface xe-0/0/0.0;
    }
    resolve;
}

Now if your PE gets disconnected from the core, you'll stop originating 192.0.2.0/24 and this static route will no longer recurse to the CE<->PE interface. If 192.0.2.0/24 is no longer available anywhere, the static route is invalid and the next available default route can be used. If 192.0.2.0/24 is still available via an alternative provider, that will be used automatically.
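The recursion described above can be modelled in a few lines. This is a simplified sketch under my own assumptions: the route table is a plain list of (prefix, learned-from) pairs, the static default is only valid if its next hop resolves through a non-default route, and choosing between equal prefixes from two providers (normally BGP best path) is not modelled.

```python
import ipaddress


def resolve_floating_default(next_hop, learned_routes):
    """Return where the static default exits, or None if it is invalid.

    learned_routes: list of (prefix, learned_from) pairs, e.g. from eBGP.
    The default route itself is never used to resolve the next hop.
    """
    target = ipaddress.ip_address(next_hop)
    covering = []
    for prefix, learned_from in learned_routes:
        network = ipaddress.ip_network(prefix)
        if network.prefixlen > 0 and target in network:
            covering.append((network.prefixlen, learned_from))
    if not covering:
        return None  # static route invalid, next default can take over
    return max(covering)[1]  # longest match wins


print(resolve_floating_default("192.0.2.0", [("192.0.2.0/24", "provider-A")]))
print(resolve_floating_default("192.0.2.0", []))
print(resolve_floating_default("192.0.2.0", [("192.0.2.0/24", "provider-B")]))
```

The three calls correspond to the three cases in the text: the aggregate is heard from the primary PE, heard from nobody, and heard only from an alternative provider.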

A slight cosmetic complaint is that if you add an interface to the static route, IOS disables recursion, so you cannot enforce that the static route disappears when the next hop does not recurse behind that one interface. But this is purely cosmetic, as the functionality remains regardless of whether 192.0.2.0/24 continues to exist or disappears completely. If it continues to exist, the customer just needs to local-pref/MED 192.0.2.0/24 to get the expected backup default selection.

PE router without full table

The typical solution is to have two RR iBGP peers originate the default route. The problem with this is that the RRs probably aren't always in the optimal forwarding path, especially under a single fault, and in many cases they never are. So you'd stop originating the default in iBGP, and instead you'd add this to every router that has a full BGP view:

interface Loopback1
 description Anycast default
 ip address 192.0.2.0 255.255.255.255
 no ip redirects
 no ip proxy-arp
!
router isis
 passive-interface Loopback1

Obviously the PE box would just have a static default towards 192.0.2.0. This way the PE always forwards packets towards the nearest core box which is up and has a full BGP table, so you always get best-path egress forwarding without having a full BGP view and without having a best-path RR. Effectively it is as if every router had an iBGP session to you and were originating a default.
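Which box the PE ends up using is just a shortest-path question in the IGP. Here is a small sketch with an invented topology and metrics; `originators` is my name for the set of routers currently advertising the anycast loopback.

```python
import heapq


def nearest_anycast(graph, source, originators):
    """Dijkstra from source; return (router, cost) of the closest originator."""
    best = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node in originators:
            return node, cost  # first originator popped is the nearest one
        for neighbor, metric in graph.get(node, {}).items():
            new_cost = cost + metric
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None  # nobody advertises the anycast address


# Invented IS-IS topology: link metrics are symmetric.
TOPOLOGY = {
    "PE1": {"P1": 10, "P2": 20},
    "P1": {"PE1": 10, "CORE-A": 10},
    "P2": {"PE1": 20, "CORE-B": 10},
    "CORE-A": {"P1": 10, "CORE-B": 10},
    "CORE-B": {"P2": 10, "CORE-A": 10},
}

print(nearest_anycast(TOPOLOGY, "PE1", {"CORE-A", "CORE-B"}))
print(nearest_anycast(TOPOLOGY, "PE1", {"CORE-B"}))  # CORE-A withdrew the loopback
```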

Exception that proves the rule

If the end device does not support recursive routes, then obviously this won't work. Such devices still exist, though it's questionable whether you want to be routing on them to begin with.

2011-08-11

IPv6 ACL bypass

The IPv6 designers recognized that the IPv4 header has several faults, and these were addressed to differing degrees. Particularly annoying were IPv4 options, which caused the TCP/UDP/ICMP data to shift, as they made the IPv4 header length variable. The IPv6 header is fixed length, and it has a 'next-header' field which tells you how to parse the data after the IP header. Typically 'next-header' would be TCP, UDP or ICMP, and the rest of the packet would be exactly like in IPv4 (apart from the mandatory checksum in UDP).

The complexity (some might say design fault) is that 'next-header' could be any of a large number of more exotic extension headers, each of which has a 'next-header' field of its own. The standard does not specify any limit on how many headers you can have, so you need to be able to parse the packet up to MTU length. The final extension header would typically point to TCP/UDP/ICMP, and a normal IPv4-style packet would follow.

Unfortunately no practical router has an MTU-wide view of the packet. You have a 64B, 128B or 256B view, after which you are completely unaware of the packet content; it's just bits in memory which you cannot process in any meaningful way. Your PC won't have the same problem: it does not have specialized hardware to quickly forward large amounts of packets, so it will happily parse the packet up to the MTU length.

What this translates to is that you can craft an IPv6 packet where the TCP port information lies beyond the router's view, so the router will not know it is a TCP packet nor what ports it is using, but the receiving PC will understand it normally. So if you have an ACL where you drop some tcp/udp/icmp packets and then allow the rest, those rules can be bypassed on a very typical router. An example could be:

term my_smtp {
    from {
        destination-address 2001:db8::42/128;
    }
    then accept;
}
term no_spam {
    from {
        next-header tcp;
        destination-port 25;
    }
    then discard;
}
term accept {
    then accept;
}

Now this will be bypassed, because our 'next-header' is not tcp but an extension header, while the far-end unmodified PC with unmodified software will treat the packet normally. Or maybe it is a server where you allow ssh from the management net, drop all other packets to tcp/22 and permit the rest. As long as you permit the rest instead of discarding the rest, the bypass will work.
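To make the mechanism concrete, here is a rough Python sketch that builds such a packet and parses it twice: once with a limited view, like forwarding hardware, and once with the whole packet, like a host. The addresses are left as zeros, the TCP header is minimal with no valid checksum, and the function names and the 128-byte view are my choices for illustration.

```python
import struct

TCP, DSTOPTS = 6, 60  # IPv6 next-header values


def build_packet(n_dstopts, dport):
    """IPv6 header + n minimal destination-option headers + bare TCP header."""
    tcp = struct.pack("!HHIIBBHHH", 40000, dport, 0, 0, 5 << 4, 0x02, 8192, 0, 0)
    ext = b""
    for i in range(n_dstopts):
        following = DSTOPTS if i < n_dstopts - 1 else TCP
        # next-header, length 0 (= 8 bytes total), PadN option with 4 bytes of padding
        ext += bytes([following, 0, 1, 4, 0, 0, 0, 0])
    first = DSTOPTS if n_dstopts else TCP
    payload = ext + tcp
    header = struct.pack("!IHBB", 6 << 28, len(payload), first, 64) + bytes(32)
    return header + payload


def find_tcp_dport(packet, view):
    """Walk the header chain using only the first `view` bytes of the packet."""
    seen = packet[:view]
    next_header, offset = seen[6], 40
    while next_header == DSTOPTS:
        if offset + 2 > len(seen):
            return None  # chain continues beyond what we can see
        next_header, length = seen[offset], (seen[offset + 1] + 1) * 8
        offset += length
    if next_header != TCP or offset + 4 > len(seen):
        return None
    return struct.unpack("!H", seen[offset + 2:offset + 4])[0]


packet = build_packet(16, 25)
print("first next-header is tcp:", packet[6] == TCP)
print("router with 128B view:", find_tcp_dport(packet, 128))
print("end host:", find_tcp_dport(packet, len(packet)))
```

A filter matching 'next-header tcp' only looks at the first value, so it misses this packet even with a single extension header; the limited view is what defeats smarter parsers.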

How should this be fixed? Well, IPv6 should have modified ICMP/TCP/UDP/etc. to contain a 'next-header' field and mandated that they appear before any extension header, forcing the non-extension headers to live at fixed bit positions. Obviously the ship has sailed for this fix. Now what happens is heavily platform-dependent. cisco.com claims that they punt packets which they fail to parse correctly; this is sane, just be sure to police the punts and you have a pretty good solution. Juniper before trio is pretty much a lost cause.

Juniper trio behaves remarkably well, but the CLI is lagging behind. Trio will actually find the TCP/UDP headers as long as there are no more than 29 'destination-option' headers before TCP/UDP. If there are 30 'destination-option' headers before TCP/UDP, the packet is dropped in hardware by the 'bad IPv6 options pkt DISC(9)' exception. The problem is that the CLI is unaware of this capability: you don't have 'protocol tcp' to say you want TCP, you only have 'next-header tcp', which only looks at the first next-header field in the IP packet. If you omit 'next-header' and just match 'destination-port', and you have 29 or fewer 'destination-option' headers, JNPR will match correctly; you just lose the ability to differentiate between tcp and udp. This is true for 10.4R4 and 11.2R1.

Trio should be fixed by adding a 'protocol' match to the CLI (trio already classifies the packet correctly), and the 'bad IPv6 options pkt DISC(9)' exception should punt (via a policer) instead of discarding, so that the RE can parse the packet correctly. You could ask what /realistic/ packet would be dropped by the trio parser, but I think that is beside the point: the IPv6 standard allows for it, so you should parse it, even if via a punt with poor performance.

You can see packets failing trio parser via PFE:

# show jnh 0 exceptions terse
Reason                             Type          Packets        Bytes
==================================================================
Packet Exceptions
----------------------
bad IPv6 options pkt               DISC( 9)     24808567  26495549556

2010-09-28

eBGP triggered blackhole for customers

Very many large-scale transit providers, if not most of them, support eBGP remote triggered blackhole only via a separate multihop eBGP session. I suspect this is because they've used a single shared route-map for transit customers for a very long time, and it is not immediately obvious how you can support blackholing without a customer-specific route-map. Requiring a customer-specific route-map would probably be a far from minor change in their provisioning systems. However, it is perfectly doable with a shared route-map, and the same idea works just the same in JunOS and IOS. Here is a pseudo-IOSish example of how to do it:

router bgp N
 neighbor eBGP peer-group
 neighbor eBGP route-map eBGP-IN in
 neighbor eBGP disable-connected-check
 neighbor CUSTIP peer-group eBGP
 neighbor CUSTIP prefix-list C-CUSTID-IN in
!
route-map eBGP-IN permit 100
 match community BLACKHOLE
 set ip next-hop BLACKHOLE
 set community BLACKHOLE additive
route-map eBGP-IN permit 200
 match ip address prefix-list eBGP-TRANSIT-FULL
 set community full-transit additive
route-map eBGP-IN permit 300
 match ip address prefix-list eBGP-TRANSIT-PARTIAL
 set community partial-transit additive
route-map eBGP-IN permit 400
 match ip address prefix-list eBGP-PUNCHOLE
 set community no-export additive
!
ip prefix-list C-CUSTID-IN permit 192.0.2.0/24 le 32
ip prefix-list C-CUSTID-IN permit 10.10.42.0/28 le 32
ip prefix-list eBGP-TRANSIT-FULL permit 192.0.2.0/24
ip prefix-list eBGP-PUNCHOLE permit 10.10.42.0/28

When you provision a new customer, you simply add the prefix to one customer-specific prefix-list and to one shared prefix-list. The shared prefix-list is there just to verify that we are getting the correct mask, not a more specific. The same route-map allows you to support multiple types of transit agreements, and you can even use the same route-map for customers using your PA addresses. When you add a new BGP feature for customers, you update a single route-map and all customers and all products benefit from it.
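How the customer-specific prefix-list and the shared route-map combine can be checked with a small model. This is a sketch of the pseudo-config above, not of any real IOS behavior beyond first-match-wins and the implicit deny; the returned labels and function names are mine, and the partial-transit list is left empty because the example config defines no entries for it.

```python
import ipaddress

# (prefix, le) pairs; le=None means the mask must match exactly.
C_CUSTID_IN = [("192.0.2.0/24", 32), ("10.10.42.0/28", 32)]
TRANSIT_FULL = [("192.0.2.0/24", None)]
TRANSIT_PARTIAL = []
PUNCHOLE = [("10.10.42.0/28", None)]


def in_prefix_list(prefix, entries):
    net = ipaddress.ip_network(prefix)
    for base, le in entries:
        base_net = ipaddress.ip_network(base)
        longest = base_net.prefixlen if le is None else le
        if net.subnet_of(base_net) and net.prefixlen <= longest:
            return True
    return False


def ebgp_in(prefix, communities):
    """Customer prefix-list first, then the shared route-map, first match wins."""
    if not in_prefix_list(prefix, C_CUSTID_IN):
        return "deny"
    if "BLACKHOLE" in communities:                      # permit 100
        return "blackhole"
    if in_prefix_list(prefix, TRANSIT_FULL):            # permit 200
        return "full-transit"
    if in_prefix_list(prefix, TRANSIT_PARTIAL):         # permit 300
        return "partial-transit"
    if in_prefix_list(prefix, PUNCHOLE):                # permit 400
        return "no-export"
    return "deny"                                       # implicit deny


print(ebgp_in("192.0.2.0/24", set()))
print(ebgp_in("192.0.2.1/32", {"BLACKHOLE"}))
print(ebgp_in("192.0.2.1/32", set()))
```

A /32 is accepted only when it carries the blackhole community; without it the shared exact-mask lists reject the more specific.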

The egress route-map is less interesting, as it is simple and typically contains no features. You most likely have two types of egress route-map: one that matches the global table route community and leaks those routes to the customer, and another which sends just a candidate route. Or, if you are doing it wrong, you're sending an actual default route; this is wrong and should never be done, but it warrants its own post. If you wanted a single egress route-map as well, you could make one egress route-map which sends all global table routes, and for candidate-only customers add an egress prefix-list which stops all prefixes except the single candidate route that you originate in a few crucial core locations.