2010-09-28

eBGP triggered blackhole for customers

Very many large-scale transit providers, if not most of them, support eBGP remote triggered blackhole via a separate multihop eBGP session. I suspect this is because they have for a very long time used a single shared route-map for transit customers, and it is not immediately obvious how you can support blackholing without a customer-specific route-map. Requiring a customer-specific route-map would probably be more than a minor change in their provisioning systems. However, it is perfectly doable with a shared route-map, and the same idea works just the same in JunOS and IOS. Here is a pseudo-IOS-ish example of how to do it:

router bgp N
 neighbor eBGP peer-group
 neighbor eBGP route-map eBGP-IN in
 neighbor eBGP disable-connected-check
 neighbor CUSTIP peer-group eBGP
 neighbor CUSTIP prefix-list C-CUSTID-IN in
!
route-map eBGP-IN permit 100
 match community BLACKHOLE
 set ip next-hop BLACKHOLE
 set community BLACKHOLE additive
route-map eBGP-IN permit 200
 match ip address prefix-list eBGP-TRANSIT-FULL
 set community full-transit additive
route-map eBGP-IN permit 300
 match ip address prefix-list eBGP-TRANSIT-PARTIAL
 set community partial-transit additive
route-map eBGP-IN permit 400
 match ip address prefix-list eBGP-PUNCHOLE
 set community no-export additive
!
ip prefix-list C-CUSTID-IN permit 192.0.2.0/24 le 32
ip prefix-list C-CUSTID-IN permit 10.10.42.0/28 le 32
ip prefix-list eBGP-TRANSIT-FULL permit 192.0.2.0/24
ip prefix-list eBGP-PUNCHOLE permit 10.10.42.0/28

When you provision a new customer, you simply add the prefix to one customer-specific prefix-list and to one shared prefix-list. The shared prefix-list is there just to verify that we are getting the correct mask, not a more specific. The same route-map lets you support multiple types of transit agreement, and you can even use the same route-map for customers using your PA addresses. When you add a new BGP feature for customers, you update a single route-map and all customers and all products benefit from it.
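
To make the fall-through concrete, here is a rough Python sketch of how an announcement would pass the customer-specific prefix-list and the shared route-map above. It is only a model under stated assumptions: the function names (customer_filter, route_map, ingress) and the string results are made up for illustration, the example prefixes are copied from the pseudo-config, and the order in which a real router applies prefix-list and route-map is not modelled, only the fact that both must permit the route.

```python
import ipaddress

# Illustrative data copied from the pseudo-config above (IPv4 only).
CUSTOMER_IN = [("192.0.2.0/24", 32), ("10.10.42.0/28", 32)]  # C-CUSTID-IN: (prefix, le)
TRANSIT_FULL = {"192.0.2.0/24"}      # eBGP-TRANSIT-FULL, exact match
TRANSIT_PARTIAL = set()              # eBGP-TRANSIT-PARTIAL, empty in the example
PUNCHOLE = {"10.10.42.0/28"}         # eBGP-PUNCHOLE, exact match


def customer_filter(prefix):
    """Customer-specific prefix-list: permit the prefix or any more specific up to 'le'."""
    net = ipaddress.ip_network(prefix)
    for allowed, le in CUSTOMER_IN:
        if net.subnet_of(ipaddress.ip_network(allowed)) and net.prefixlen <= le:
            return True
    return False


def route_map(prefix, communities):
    """Shared route-map eBGP-IN: first matching clause wins, no match means deny."""
    if "BLACKHOLE" in communities:   # permit 100
        return "blackhole"
    if prefix in TRANSIT_FULL:       # permit 200
        return "full-transit"
    if prefix in TRANSIT_PARTIAL:    # permit 300
        return "partial-transit"
    if prefix in PUNCHOLE:           # permit 400
        return "no-export"
    return "deny"                    # implicit deny at the end of a route-map


def ingress(prefix, communities=()):
    """A route is accepted only if both the prefix-list and the route-map permit it."""
    prefix = str(ipaddress.ip_network(prefix))  # normalise the textual form
    if not customer_filter(prefix):
        return "deny"
    return route_map(prefix, communities)


if __name__ == "__main__":
    print(ingress("192.0.2.0/24"))                     # full-transit
    print(ingress("192.0.2.1/32", ["BLACKHOLE"]))      # blackhole
    print(ingress("192.0.2.1/32"))                     # deny
    print(ingress("198.51.100.1/32", ["BLACKHOLE"]))   # deny
```

Note how a /32 inside the customer's block is accepted only when it carries the blackhole community; without it, the exact-match shared prefix-lists reject the more specific.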

The egress route-map is less interesting, as it is simple and typically contains no features. You most likely have two types of egress route-map: one that matches the global-table route community and leaks those routes to the customer, and another that sends just a candidate route. Or, if you are doing it wrong, you're sending an actual default route; this should never be done, but it warrants its own post. If you want a single egress route-map as well, you could make one egress route-map that sends all global-table routes and, for candidate-only customers, add an egress prefix-list that stops all prefixes except the single candidate route, which you originate in a few crucial core locations.

2010-07-30

Consumer hardware vendors, boxes and versions

Yes, this is actually a rant.
<rant>
I have an Apple Time Capsule which I love and cherish. It's about a year old. No, it isn't the latest model anymore. It claims it is able to talk IPv6... but it doesn't. Sadly, it runs something called version 7.4.2, which works fine except that IPv6 is broken. To have functioning IPv6 I should have 7.5 or later. That would require me to pay more money for protection to Apple, as it seems to be available only on the Very Latest Time Capsules.
Now, someone explain to me:
  1. what the fuck? over.
  2. why must I buy a new box to get something to work which supposedly is there already
  3. immediate cessation in software updates upon release of incremental hardware update
  4. if you changed the chipset then how hard can it be to make a conditional instead of drop all future upgrades
  5. fail to communicate what's going to work and what's not
Feel free to google "time capsule 7.4.2 ipv6" for more info.
</rant>
Feels much better now. :-)

2010-07-23

IPv6 and the enterprise of tomorrow(ish)

One of the great promises of IPv6 has been to get rid of NAT; no more will IT do RFC1918 and NAPT to a single public IP. But how is IPv6 going to accomplish this, what is the magical toggle for it? Let's get disappointed.

Some devices, like Cisco IOS, allow you to configure an IPv6 prefix as a 'macro', so you could say that macro 'ME' is 2001:db8::/32 and use macro 'ME' everywhere you would write an IPv6 address. So in theory, when your prefix changes, you simply change the macro. The great renumbering benefit, then, is the ability to always get a network of the same size. But of course this was true for IPv4 too: you got the network size you needed. Why isn't this utilized? Because enterprises don't have one Cisco IOS device, they have a plethora of devices from different vendors: firewalls, SLB, IPS, IDS, servers, OSS systems and so forth. You'd still need to go into all of these to change the 'macro', not all devices even have the concept, and quite frankly no enterprise of non-trivial size will even know, without months of work, _where_ and _what_ needs to be changed for renumbering to be successful. I know industry professionals who've done renumbering costing MEURs in a single company. So in practice IPv6 gives you no benefit in renumbering: renumbering always was easy for a trivial network and always hard for a non-trivial network, and will continue to be so.

So how will enterprises run IPv6? Getting PI/ASN is actually a bit harder, as you MUST be dual-homed, while many enterprises just want PI/ASN to have the ability to change operator. There have also been visibility problems with /48s, but these will be remedied in due time, when even the rest of the people realize there will be no magical new protocol in IPv6 for address mobility. I'm quite certain that IPv6 will be deployed exactly like IPv4: instead of RFC1918 you will use RFC4193, but with the twist that most companies will find that their PRNG always returns '0', so that they get a beautiful short fd00::/48 block, and a lot of cursing in M&A, as with RFC1918.

Companies will NAT this RFC4193 space to an external, provider-assigned IP block: no renumbering needed, and you can still change operator without complex and expensive renumbering. But there is something we will win: we can easily do 1:1 NAT instead of NAPT, which has a great many benefits. No more will you need session logging to comply with legal requirements, and you will also get mostly working end-to-end connectivity, apart from protocols which carry the address in the payload.

For those few who have been bitten by RFC1918 in M&A and who consider IPv6 address memorability and beauty less valuable than the many hours of pain RFC1918 causes in M&A, here is a one-liner to generate a random RFC4193 prefix:

# ruby -e 'p ("fd%010x" % rand(2**40)).scan(/.{1,4}/).join(":")+"::/48"'
"fdbf:49be:e67d::/48"
#
Maybe it should be tuned with RFC2777, using NASDAQ as the seed, so that your client can be sure you didn't give preferential treatment to another client to whom you issued a much more beautiful network. Blaming the stock market can be very satisfying.
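
For those who would rather not trust a one-liner, the same idea can be sketched in Python. This is a minimal sketch, not a faithful RFC4193 implementation: the RFC suggests deriving the 40-bit Global ID from a hash of the time and an EUI-64, whereas this simply draws 40 bits from the OS random source. The function name random_ula_prefix is made up for illustration.

```python
import ipaddress
import secrets


def random_ula_prefix():
    """Return a random /48 under fd00::/8: the byte 0xfd followed by 40 random bits."""
    global_id = secrets.randbits(40)        # assumption: CSPRNG instead of the RFC's hash recipe
    top48 = (0xFD << 40) | global_id        # first 48 bits of the prefix
    return ipaddress.IPv6Network((top48 << 80, 48))


if __name__ == "__main__":
    print(random_ula_prefix())
```

Building the prefix from an integer avoids the padding problem you get when formatting a small random number as hex without leading zeros.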

2010-07-21

IEEE OUI address (MAC address) allocation

I've recently noticed that it is becoming more and more common to see 'weird' MAC addresses, i.e. MAC addresses which do not start with 00. Previously it was very easy to spot, almost automatically, software defects which caused strange MAC addresses to appear; this has helped me diagnose several issues in the past. We're now beginning to lose that advantage, as IEEE has started to allocate MAC addresses quite randomly across the address space.

I emailed IEEE and asked what the motivation and perceived advantage of this change was, and the reply was quite simply 'We changed our allocation methods to prevent vendors using unregistered mac addresses.'. An OUI costs a 1650 USD one-time fee, but IEEE appears to be concerned that some vendors choose not to pay it and instead allocate themselves an OUI somewhere far out in the address space, effectively thinking they are getting a free OUI with little to no possibility of overlap. It would be curious to know whether a vendor that wants to save 1650 USD would care about this slightly changed climate. I personally suspect the change, while well-intentioned, is completely ineffective, and the slight operational benefit serial assignment had is lost. (/me starts crying over spilled milk).

On a slightly related note, the regional IXP here is using static, manually assigned MAC addresses from 4000.0X, where X is the number of the IXP site, followed by the ASN in base 10 and then 4 free bits for the user. So AS4242 at site 1 would be 4000.0104.2420. Unfortunately, when these were assigned someone made a mistake with the bit significance order, and this MAC address is not locally administered as was intended but a normal public MAC address. I'm recommending a new scheme of xEzz.yyyy.yyyy, where 'y' is the ASN in hex (supporting 4-byte ASNs), 'z' is customer assigned, 'E' is static and 'x' is the IXP site number. E could also be 2, 6 or a, but 'E' for exchange is kind of cute.
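
The bit that went wrong is easy to check mechanically. Below is a small Python sketch of the proposed xEzz.yyyy.yyyy scheme together with a test of the locally administered bit (0x02 in the first octet) and the group bit (0x01). The function names and the argument layout are my own illustration, not anything the IXP actually runs.

```python
def ixp_mac(site, asn, customer=0):
    """Build xEzz.yyyy.yyyy: x = site nibble, E = fixed, zz = customer byte, y = 32-bit ASN."""
    if not 0 <= site <= 0xF:
        raise ValueError("site must fit in 4 bits")
    if not 0 <= customer <= 0xFF:
        raise ValueError("customer part must fit in 8 bits")
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN must fit in 32 bits")
    value = (((site << 4) | 0xE) << 40) | (customer << 32) | asn
    digits = f"{value:012x}"
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def first_octet(mac):
    """First octet of a MAC written in dotted hex (e.g. 4000.0104.2420)."""
    return int(mac.replace(".", "")[:2], 16)


def is_locally_administered(mac):
    """True if the U/L bit (second least significant bit of the first octet) is set."""
    return bool(first_octet(mac) & 0x02)


def is_unicast(mac):
    """True if the group bit (least significant bit of the first octet) is clear."""
    return not first_octet(mac) & 0x01


if __name__ == "__main__":
    print(is_locally_administered("4000.0104.2420"))  # False: the old scheme is a public address
    mac = ixp_mac(site=1, asn=4242)
    print(mac)                                         # 1e00.0000.1092
    print(is_locally_administered(mac), is_unicast(mac))  # True True
```

Any low nibble of 2, 6, a or e in the first octet gives a locally administered unicast address, which is why those four values are the only candidates for the static digit.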

2010-07-09

Resolving SNMP OID in snmpbulkwalk and tshark

This isn't exactly esoteric science but at least for me it has been too hard and I've rather looked up OIDs manually, which makes little sense in the long run.

NET-SNMP

First of all, you probably want to set a system-wide version and community, so you never need to type them for snmpbulkwalk:
# cat /etc/snmp/snmp.conf
defVersion 2c
defCommunity supersecret
#
Also, I prefer to have vendor MIBs loaded for my user only, as I might want to load a lot of them, which produces a lot of warnings that other users may not want.
# cat ~/.snmp/snmp.conf
mibs +ALL
#
Instead of +ALL you could name the MIBs to load; the ~/.snmp/mibs/ directory is searched for the user's MIBs. You could start by uncompressing ftp://ftp.cisco.com/pub/mibs/v2/v2.tar.gz there. Now you should be able to snmpbulkwalk a router by giving just its name, with no version and no community, and OIDs should be resolved.

WIRESHARK/TSHARK

To me this is more useful than net-snmp, for testing in the lab what type of traps a router would send and in what situation. For some weird reason wireshark/tshark doesn't honour net-snmp settings, but needs its own settings.
# cat ~/.wireshark/preferences
name_resolve: mtC
name_resolve_load_smi_modules: TRUE
snmp.display_oid: TRUE
snmp.desegment: TRUE
snmp.var_in_tree: TRUE
# cat ~/.wireshark/smi_paths
"/home/WINRARISU/.snmp/mibs"
# cat ~/.wireshark/smi_modules
"IP-MIB"
"IF-MIB"
"TCP-MIB"
"UDP-MIB"
"SNMPv2-MIB"
"RFC1213-MIB"
"IPV6-ICMP-MIB"
"IPV6-MIB"
"SNMP-COMMUNITY-MIB"
"SNMP-FRAMEWORK-MIB"
"SNMP-MPD-MIB"
"SNMP-NOTIFICATION-MIB"
"SNMP-PROXY-MIB"
"SNMP-TARGET-MIB"
"SNMP-USER-BASED-SM-MIB"
"SNMP-USM-DH-OBJECTS-MIB"
"SNMP-VIEW-BASED-ACM-MIB"
"BGP4-MIB"
"CISCO-BGP4-MIB"
"MPLS-LDP-CAPABILITY"
"MPLS-LDP-GENERIC-STD-MIB"
"MPLS-LDP-STD-MIB"
"CISCO-IETF-ISIS-CAPABILITY"
"CISCO-IETF-ISIS-MIB"
#
Now 'tshark port snmp-trap or port snmp' should happily display resolved names as well. To decide which MIBs to add to smi_modules, you can use 'snmptranslate OID' to translate a given OID to its symbolic name, which is then easy to grep for in your MIB dir.

2010-07-06

Cisco Network^WLive! 2010 Las Vegas Conference report, part 2

Part 2... where we take a stab at the food one gets served for 1700 (thereabouts) bucks at conferences.
Breakfast
When: every day 0700-0800 (one hour only)
What's available: cereal + milk, sugary doughnuts, croissants, bottles of juice or soda pop.
Grade: C-. The time could be longer and the stuff available isn't exactly from the food pyramid either. I did this once and the rest of the time paid for hotel breakfast which was excellent.
Lunch
When: 11-13 (or thereabouts)
What's available: different menu every day, lots of choices, veggie, meat, fish, salad, etc. Drinks: water, iced tea, lemonade, soda pop
Grade: B+. Certainly isn't a Michelin restaurant experience but seemed nutritious enough, wasn't all greased up, there was plenty of choice each day (beef/fish, fish/chicken, chicken, etc.) and always a veggie choice.
Snacks
When: around 1000, around 1400
What's available: granola bars, sugary coated doughnuts, coffee, tea, water, some fruit, juice, ice cream, ... depending on the day
Grade: B-. Lots of choice. Sometimes the snacks went quite quickly, so I didn't get any on a few days when the presentation ended later than planned. Either there were too few snacks to start with or some people stocked up on a load of them.
That's it for the official conference included servings.
Extras
I partook in a few extracurricular events as well such as the Data Center of the Future evening party at MIX which had quite good food and the Data Center of the Future customer social event at the Foundation Room which had excellent food and beverages.
Grade: A+
Customer appreciation event
There's probably going to be a better writeup of the CAE later when I remember ;) more of it. ;-) Suffice it to say, the food there was kind of fastfoodish and a lot of people seemed to stay in line all the time to get some of it. I don't think I tasted any, with the exception of the mac & cheese, which I liked. Drinks were aplenty, consisting of various beers, some hard liquor and soda pop. The food served at the CAE at Cisco Live! 2009 in SFO was better, I think. Then again, there wasn't any CAE at all in BCN last winter, which really, really sucked.
Grade: B
That's it for now, time for lunch. :-)

Cisco Network^WLive! 2010 Las Vegas Conference report, part 1

I, once again, had the privilege to participate in Cisco Live! 2010 (formerly known as Cisco Networkers, what's up with the name change btw?) which was held in the Mandalay Bay Conference Center in Las Vegas, Nevada. For people without google maps, that's in the USA. Travel there involved sitting long distances in several flying aluminium tubes, getting searched by both private security companies and the TSA, passport control, tolerating long queues, etc. What can I say? IT IS WORTH IT.
Being a fairly early bird when registering, I was among the first able to select which techtorials, labs, panels, etc. to participate in. This year I decided to mix some SP and DC stuff instead of doing only one track. The day before the conference really starts there are usually either half- or full-day techtorials/labtorials, which means either "passively listen to the presenter(s)" or "lab like hell". Previously I've enjoyed myself immensely by "passing" the CCIE R&S prep (~7 hrs actual lab) and CCIE SP prep (~7 hrs actual lab) to see what the skill level and makeup of the real test would be like. In 2010 I decided to skip that stuff and instead go and get an L2VPN refresher in TECMPL-3001 (title: Layer 2 Virtual Private Networks - Converged IP/MPLS Network), as I don't tend to do much L2VPN nowadays.
The techtorial was divided into three parts, each with its own lecturer. Part 1 consisted of your basic L2VPN setup, how it works, etc. Part 2 was VPLS for dummies and Part 3 covered future enhancements such as PBB/802.1ah, H-VPLS with PBB, etc. Future products such as ME3600X/ME3800X were briefly mentioned and presented as well. All three speakers were pretty experienced and, I think, all from a BU.
All in all a good day.

To kill a spanning tree

There is a lot of hate out there for STP. One of the main problems is its lack of robustness against poor planning and poor operations, which cause broadcast/unknown unicast storms; when these happen they typically kill all traffic through a given switch, unlike L3 issues, which typically affect only a portion of traffic. And unlike with L3 problems, you typically cannot reach the switch to troubleshoot the issue. So it is rather easy to cause downtime and quite hard to fix it.
Then there are the more inherent problems: unusable links in extreme-capacity environments and convergence time in high-availability networks.
What should you replace your STP with?
STP
Why replace at all? STP has a proven track record; it is used ubiquitously, anywhere from home LANs to complex enterprise LANs to service provider metro networks, so clearly it has to work.
Virtually all of the robustness problems are caused by the assumption that ethernet is plug-and-play; if you deploy STP you have to design it.
It would really need its own article to explain the basics of deploying STP successfully, but a few key points are that you decide beforehand which port will block in any given situation and which ports will participate in STP, and that you add BPDU guard/filter and broadcast + unknown unicast storm-control.
But of course, for complex networks which are being reconfigured multiple times per day, it is a rather tall order to wish for planning and operations that won't break things.
With RSTP and MST you can get convergence in the 1s range, which might satisfy your convergence budget. The only thing you cannot really fix is the unused blocking links, especially if you need high capacity in a single VLAN, which is something the largest IXPs worry about. When you notice you can't add more members to a port-channel, you'd really want more flexibility in how to steer your traffic.
REP, EAPS, MRP, FRRP, RPR
These all typically require a ring topology (well, REP strictly does not) and guarantee lower convergence time, down to 50ms. There is also some inherent operational robustness, as they are not configured by default on ports, so you are forced to do some planning when deploying them; usually you'll at least avoid running them with your customers.
However, you are still blocking a link, and the topology restrictions make them unacceptable in many scenarios; they are mostly usable in service provider metro networks.
STACKING
When Cisco introduced the 3750 stackwise years ago I was very much a proponent of stacking technology. I was so desperate to get rid of STP that I assumed anything and everything must be better, without really reviewing the option.
While I've been quite happy with the level of reliability of stackwise in the 3750, it has had more software issues than STP, which is to be expected as it is more complex and less mature. But even assuming that both implementations were perfect, for my requirements STP would remain superior.
STP is vendor agnostic: in your high-availability L2 setup you can pull one vendor's switch out and replace it with another vendor's, causing maybe 1s of downtime. In 3750 stackwise not only can you obviously not do that, you can't even upgrade the software without causing a very long downtime, as you are killing the whole stack for a moment. Certainly ISSU could be implemented in stacks, but as long as you can't stack CSCO with JNPR, you can't get the same level of robustness you get from STP.
A large selling point of stacks to many is the ability to connect a server redundantly to two switches, not because they need the capacity but because they need the redundancy. A stack is actually not needed for this: every OS you'd possibly want to run can do a sort-of 802.1AX (which is not 802.1AX at all), where you connect your server to two independent switches and use ARP towards the default gateway to decide whether the primary link should still be used. It is a very simple and very robust solution for the redundancy requirement. If you need higher capacity, just use 2+2 links with real 802.1AX.
JNPR seems to be attempting to put the whole DC under a single stack; I know I won't be listening to that sales pitch.
VPLS and TRILL
This is actually why I wrote the article: yesterday I once again participated in a chat about VPLS vs TRILL. At a high level they are remarkably similar; both do best-path, loop-free forwarding based on an IGP.
TRILL is a bit of a flashback to the 90's, when you carried user routes in the IGP, while in VPLS you'll typically use BGP, which fits today's view of best practices better.
I've been lurking on the rbridge mailing list for years and I really like the idea of it. But wish as I might, I don't believe in the commercial success of TRILL. Cheaper and cheaper chips are getting MPLS/VPLS capability, which further drives its adoption and further drives the costs down. Due to economies of scale I fully expect the same thing to happen to entry-level switches with MPLS as happened with L3: high volume will bring the premium you pay for MPLS/VPLS down to entry/consumer level.
I can't see how TRILL could overtake this momentum. You can do much more with MPLS than just VPLS, so why would you buy TRILL, which is pretty much guaranteed to be more expensive due to low volume?
I was informed that TRILL was designed to be implementable in any device which can stack MPLS labels, but looking at the draft I can say this is not true: the EoS bit of the MPLS label lands on the rbridge nickname in the TRILL header, and I'm sure most current MPLS chips couldn't keep it set when they add another label, as within TRILL they would have to. And generally I do not expect this level of programmability in most ASIC implementations, say EARL7/PFC3.
TRILL really should have happened 10 years ago and should today be available in pure L2 switches which cost under 2kEUR per 48x1GE ports. Seeing devices like the Alcatel SAS and Cisco ME3800X makes me quite pessimistic about the future of TRILL; I do hope I'm wrong.

2010-07-04

Incumbents and Value Added Service (VAS)

I'm completely at a loss as to how incumbent telcos are able to fool themselves into believing that they can replace high-margin voice-minute revenues with new VAS, without needing to streamline operations for a low-margin product future.

Most incumbents offer residential VoIP, some kind of music-on-demand service, IP-TV and so forth. These services are marketed only to existing subscribers, so the potential market is from 100k's to a million or two. I cannot see how they could compete against skype, itunes, spotify, tivo etc., who are targeting the whole Internet and are doing it as their core business. Skype can produce the service much more cheaply due to economies of scale; there simply is no way an incumbent can have this market. An incumbent would be as likely to succeed in any random new enterprise, such as starting a pizza chain, as in VAS. Not impossible, but it is much more likely that a new startup will succeed at it: startups don't have a century of red tape around them, and vastly more of them are attempting it.

Rather than VAS, telcos should concentrate their efforts on being profitable at pushing bits around: attempt to become skype's preferred PSTN<->SIP provider, and provision and manage residential DSL so that you are profitable without counting on VAS to fix your cashflow.

It would be a competitive advantage for a telco to realise that it is a utility like electricity and water. The most important thing it has is its physical infrastructure, which it can sell to end users and competitors. Expanding that infrastructure is an expensive long-term investment, maybe not fitting for the trendy quarterly economy, but putting your money on black and hoping to win the house (skype et al.) is simply delusional and a waste of time.

Be Profitable Bartering Bits, BPBB.

Network Neutrality

I was reading a finncrypt() article about the impending violation of network neutrality in Europe and found it largely humorous. The network hasn't been neutral since the introduction of QoS (i.e. ever), and the only easy and fast way to make the net neutral is to remove QoS everywhere, which would serve the benefit of ideology only, not the benefit of any single individual; some services that work today simply would stop working.

In real life operators use QoS to prioritise the VAS they offer, such as VoIP or IP-TV. If we follow network neutrality principles, we should guarantee the same level of service to VoIP and IP-TV provided by 3rd parties too. This is not done today, but not in order to protect the operator's competitiveness; rather, it is not done because there is no obvious technical way to do it.

When network neutrality is broken in the scenario above, the operator typically gives higher priority to certain IP addresses which are used to provide the VAS. To be able to do this in a network-neutral way, the operator, or rather the network devices, would need to know which IP addresses are important to a given end user and should get preferential treatment. Neither the operator nor the device can magically know these, and this is where the problem lies.

The obvious solution to many is the TOS (Type Of Service) field in the IP packet header. Here you can tell what type of treatment a given packet should receive, so the end user can dictate which packets are important to her and which are not, and the remote end will honour the TOS and reply with the same TOS value set. Alas, this will not work: between different ASes (operators) the TOS byte is often zeroed out. If this were not done, no QoS would work in pure IP backbones today, as a single remote AS could fill the priority queues with trash, breaking not only QoS between these two operators but even the QoS inside the one operator which works today.
Say the end user marks a skype packet as a priority packet. All good, they can prioritise it over their DSL line to the operator, but when it starts transiting other operators' networks there is a high chance that somewhere down the line the TOS byte is zeroed out. Then the reply skype packet isn't marked as a priority packet and cannot be treated preferentially when the data is sent over the DSL to the end user.

Technically there is no particular reason for an operator to try to stop the last mile from preferring whatever packets end users request. The only problem is that the information is lost in transit, and it is not a problem for which there is a simple and fast solution.

We can make a surprisingly good prediction of what is preferential traffic and what is not by simply looking at packet size. If we preferred packets which are, say, 200 bytes or less, it would have an immediate and quite large effect on the perceived quality of the Internet.
You can observe this problem easily on your home DSL by sending large files, like pictures, to the Internet while attempting to use some latency-critical service such as ssh, VoIP or online gaming: the experience will be terrible for the duration of the upload. Even non-latency-critical traffic like downloads or buffered music streaming suffers, due to upstream TCP ACKs being dropped in the queue.
In many ways this is the proverbial silver bullet: it is extremely simple, doesn't change from customer to customer and its effects are immediately visible; you can continue online gaming while torrenting. The only downside is that most DSLAM/MSAN/CMTS devices at operators do not actually support preferring packets based on their size.
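
The size-based idea is simple enough to sketch in a few lines of Python. This is a toy strict-priority scheduler, assuming the 200-byte threshold from the text and treating a packet as a plain bytes object; the class name SizePriorityQueue is made up, and a real device would also need queue limits so that small packets cannot starve the rest.

```python
from collections import deque

SMALL_PACKET_BYTES = 200  # threshold suggested in the text


class SizePriorityQueue:
    """Two FIFO queues; small packets are always transmitted before large ones."""

    def __init__(self, threshold=SMALL_PACKET_BYTES):
        self.threshold = threshold
        self.small = deque()
        self.large = deque()

    def enqueue(self, packet):
        """Classify purely on packet length, with no knowledge of addresses or TOS."""
        if len(packet) <= self.threshold:
            self.small.append(packet)
        else:
            self.large.append(packet)

    def dequeue(self):
        """Return the next packet to send, or None if both queues are empty."""
        if self.small:
            return self.small.popleft()
        if self.large:
            return self.large.popleft()
        return None


if __name__ == "__main__":
    q = SizePriorityQueue()
    for size in (1500, 64, 1500, 40):   # bulk upload mixed with ACKs / ssh keystrokes
        q.enqueue(bytes(size))
    print([len(q.dequeue()) for _ in range(4)])  # [64, 40, 1500, 1500]
```

The classifier needs no per-customer state at all, which is what makes the approach the same for every customer.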

A solution might be a new protocol, CQSP (Customer QoS Signalling Protocol), which could instruct both the CPE and the DSLAM/MSAN/CMTS about networks which should be prioritized; this way, when you install skype, it would automatically use CQSP to configure the required settings. This is deployable today and requires only software updates. Hoping that TOS would work across the Internet any time soon is not realistic, nor is updating DSLAM/MSAN/CMTS hardware to a version which would support packet-size-based QoS.