Transcript - APOPS
Due to the difficulties capturing a live speaker's words, it is possible this transcript may contain errors and mistranslations. APNIC accepts no liability for any event or action resulting from the transcripts.
Monday, 29 August 2011.
Philip Smith: Good morning again. Welcome to the first APOPS session at this APNIC 32 Conference here in Busan.
This week, we have two operators presentation sessions for you; one today, the other one on Wednesday.
I'll explain a little bit about APOPS before we start with the presentations in the first session and also explain a little bit about the etiquette in the room here as well.
You can see on the wall behind me, we have two big screens where you'll see the presentations, and in the middle above the stage you'll see a transcript of what the speakers are actually saying, so you see my words up there on the screen.
This is to try and help people to follow some of the presentations as well as watch the slides that the presenters are putting up.
With that in mind, when it comes to the question and answer section in the presentations, if any of you have questions to ask, would you please state your name and, if you would like to, you can state your affiliation, just for the record, so that other people in the room who maybe don't know you, do get to know who you are.
Also, this session is being webcast, so it's available on the Internet for people who are not able to be present here. It also means that the remote viewers for these presentations will know who's asking the question, so they can actually relate to some of the discussions that are going on.
I'd ask anybody who is asking questions to please state their names. I suppose the same goes for the speakers.
Any of the presenters -- and this applies for the whole week -- please don't talk at 100 kilometres per hour. Please talk slowly, clearly, so that everybody can understand what you're saying.
It's also so that the stenographers can actually follow you at an appropriate speed. We've had instances in the past where the stenographers are waving frantically to slow you down when you're talking, so anybody who is speaking, please bear all this in mind.
Let's have a look at APOPS.
A little bit of background. APOPS started quite a few years ago as a mailing list and it's grown over the years. Three of us try and guide APOPS happening on, I suppose, a day-to-day, week-to-week, month-to-month basis, so myself, Philip Smith, now working at APNIC; we have Tomoya Yoshida now working at Multi-Feed -- and we have Matsuzaki Yoshinobu who is still working at IIJ. Some things change; some things don't.
There is a website which doesn't really get much attention, I'm sorry to say, but there is a mailing list which gets a little bit more attention. You're welcome to subscribe to that and take part. Not many of the discussions are actually taking place. It's a relatively quiet mailing list compared with many, but there is a mailing list which I have mentioned on the screen there.
APNIC 32, it's part of the regular APNIC program, it's where we put the operations content that's volunteered for this Conference.
We did a general call for contributions that started over a couple of months ago. The program committee was formed from the general community, so we did a call for volunteers, to people to help with the program. Jonny Martin chaired the PC, I was the co-chair for it. We did the usual model of submission and review as we did at APRICOT in Hong Kong in previous events like this.
So I have put the agenda URL up on the screen, if you haven't already found it.
Before we get into the presentations, we have lightning talks and I think this is the first time I've advertised the lightning talks, apart from to the newcomers on Saturday evening. They are 10-minute presentations on current topics, so if you had a brilliant inspiration and it was too late for the call for presentations for the APNIC Conference, the APOPS session itself, you have an opportunity to put this presentation forward for the lightning talks sessions.
No slides are needed, as long as you can give us an idea of the title and abstract, roughly what you want to talk about, that would be appreciated.
There will be two sessions: one on Tuesday from 4 to 4.20 for IPv6 topics and that will be part of the IPv6 day that's taking place on Tuesday, and one on Wednesday, from 6 pm to 7 pm or 1800 to 1900, for general topics.
We're accepting submissions now through the submission system. It's pretty much first come, first served, because the lightning talk subjects that we get are always pretty good subjects, so if you want to get a chance to speak about something, nice, quick, then please put your submissions in now. You don't need to prepare slides, just a title and an abstract is really all that's required.
As for today, I'm going to chair this first session.
Tomoya is going to be chairing the session on Wednesday at the same time. So we'll have three presentations here. I have two of my speakers up on the stage, the other speaker is sitting down in the audience.
Three presentations: The first one looking at 6rd, the second one looking at the v6 Day effects in NTT and the third one comparing the performance of NAT64 and NAT44, so all three of them are quite interesting presentations. I would like to ask Shishio Tsuchiya from Cisco Japan to start us off.
Shishio Tsuchiya: Good morning. My name is Shishio Tsuchiya, consulting systems engineer, Cisco Systems Japan.
I will explain my slide. 6rd implementation and case study.
Here is today's agenda. 6rd overview; 6rd and related technology standard update; and implementation and case study.
I think most people already know what 6rd is, but to remind you, I would like to quickly explain 6rd overview.
6rd is sometimes compared with other tunnelling technologies, especially 6to4, because the 6rd prototype was modified from 6to4.
So I would like to explain 6rd comparing it with 6to4.
6to4 is an automatic, stateless IPv6 tunnelling technology over an IPv4 network. It is defined in RFC 3056.
6to4 uses 2002::/16 in the IPv6 network as the 6to4 prefix.
A 6to4 prefix is made by converting the decimal public IPv4 address to hexadecimal and prepending 2002.
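As an illustration of the derivation just described (an editorial sketch, not part of the talk), the following Python builds the 6to4 /48 prefix from a public IPv4 address. The function name `sixto4_prefix` and the sample address 192.0.2.1 (a documentation address) are assumptions made for the example.

```python
import ipaddress


def sixto4_prefix(ipv4: str) -> ipaddress.IPv6Network:
    """Return the 6to4 /48 prefix for a public IPv4 address (RFC 3056 layout)."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    # 2002 fills the top 16 bits; the 32-bit IPv4 address follows directly.
    value = (0x2002 << 112) | (v4 << 80)
    return ipaddress.IPv6Network((value, 48))


# 192.0.2.1 is c0.00.02.01 in hexadecimal.
print(sixto4_prefix("192.0.2.1"))  # 2002:c000:201::/48
```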
In 6rd, we can use an IPv6 prefix which is assigned to the service provider, so we can control routing the same as for native IPv6.
This slide describes 6rd in one slide. This is 6rd customer edge, it's called the 6rd CE.
This is the 6rd border relay. It is called the 6rd BR.
6rd needs IPv6 Internet reachability, but the core IPv4 network doesn't need an upgrade to IPv6.
IPv6 service in the home is essentially identical to native IPv6 service.
IPv6 packets follow IPv4 routing.
6rd BR is traversed only when exiting or entering a 6rd domain.
6rd BR are fully stateless, no limit on number of subscribers supported.
BR may be placed in multiple locations, addressed via anycast.
The subscriber's IPv6 prefix is built based on subscriber's global IPv4 address and ISP IPv6 prefix.
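To make the last point concrete (an editorial sketch, not part of the talk), this Python computes a subscriber's 6rd delegated prefix by appending the full 32-bit IPv4 address to the ISP's 6rd prefix. The function name, the 2001:db8::/32 documentation prefix and the 192.0.2.1 address are assumptions chosen for illustration, not values from the slides.

```python
import ipaddress


def sixrd_delegated_prefix(sp_prefix: str, ce_ipv4: str) -> ipaddress.IPv6Network:
    """Append all 32 bits of the CE's IPv4 address to the ISP's 6rd prefix."""
    sp = ipaddress.IPv6Network(sp_prefix)
    v4 = int(ipaddress.IPv4Address(ce_ipv4))
    delegated_len = sp.prefixlen + 32
    if delegated_len > 64:
        raise ValueError("6rd prefix too long to leave room for a /64 subnet")
    # Place the IPv4 bits immediately after the ISP prefix bits.
    value = int(sp.network_address) | (v4 << (128 - delegated_len))
    return ipaddress.IPv6Network((value, delegated_len))


print(sixrd_delegated_prefix("2001:db8::/32", "192.0.2.1"))  # 2001:db8:c000:201::/64
```

Because the mapping is a pure function of the two inputs, neither the CE nor the BR needs per-subscriber state, which is the stateless property mentioned above.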
Next on the agenda is 6rd and related technology standard update.
As I mentioned, 6to4 is defined in RFC 3056 and 6rd has two RFCs. RFC 5569 is an informational RFC, which describes Free's 6rd deployment.
RFC 5969 is a standards-track RFC which defines the protocol specification of 6rd.
In IETF80 there are active proposals about 6to4.
One is "6to4 to Historic". This draft requests moving 6to4 to Historic status: RFC 3056, which is the 6to4 RFC, and RFC 3068 would move to Historic status.
Also, that draft requested IPv6 node behaviour over 6to4. IPv6 node should treat 6to4 as service of last resort.
Implementation capable of acting as 6to4 routers should not enable 6to4 without explicit user configuration.
6to4 should be disabled by default.
Additionally, the draft asked IANA to mark 2002::/16, which is the 6to4 prefix, the 6to4 reverse DNS and the 6to4 anycast address as obsolete.
That means the draft aimed to end 6to4 completely.
It went through working group last call, which means many people agreed with the draft, but some people disagreed with the consensus. The IESG judged that this draft could not reach final consensus and now it has dead status. The author proposed a new draft. In the new draft, the IANA request was removed, but the request about IPv6 node 6to4 behaviour remains.
What I would like to say in this section, 6rd has two RFCs. If you would like to study 6rd, please refer to RFC 5969.
6to4 has a lot of issues. Most people would not like to use it. If you would like to know the 6to4 issues in detail, please refer to RFC 6343. As a guideline for 6to4, it will be useful.
6rd looks like 6to4 as mechanism (automatic/stateless), but it is completely different in security, efficiency and so on.
So the third agenda item is 6rd implementation and case study.
Here is a list of customers who have publicly announced 6rd deployments and trials. SoftBank is a nationwide Internet service provider in Japan. Free is in France, Swisscom is in Switzerland, NextGenTel is in Norway. They are pretty large service providers in their countries. And Comcast, Charter and Videotron are cable operators in the United States and Canada. SAKURA Internet is a data centre provider in Japan.
Here are 6rd supported platforms: Cisco already supported 6rd since IOS-XE 3.1S. Most of the customers selected this platform as ...
CGSE, which is a module of the CRS-1 or CRS-3, already supports it. The IOS platform already supports 6rd in the latest train.
IP Infusion, they are already supporting 6rd. This company is top one out of 6rd development.
I'm not sure about Juniper and Alcatel, maybe they already support 6rd. Anyway, the most significant supporter is the Linux kernel, which has supported 6rd since 2.6.33. As a result, a lot of Linux-based platforms, such as home gateways and server operating systems, already support 6rd today.
Here is case study number 1, Free. Free is the developer of 6rd and the first service provider to deploy it. The reason for developing and deploying 6rd was that their DSLAMs did not support IPv6, so Free could not provide IPv6 service to users in a lot of areas.
About four years passed from the development of the prototype of 6rd. How did they employ Free's IPv6 service?
Here is IPv6 adoption over Google statistics.
Google has been continuously measuring IPv6 connectivity among Google users since September 2008. September 2008, native IPv6 user was 0.04 per cent against IPv4 users.
Today, 0.32 per cent, it is eight times increase, but only still 0.32 per cent.
IETF 81 Technical Plenary, the participant of World IPv6 Day, people explained IPv6 statistics. Google says still 0.3 per cent, but France is 3.4 per cent.
Yahoo! 0.229 per cent user visits to Yahoo! IPv6, but France led the pack with over 3 per cent.
RIPE also measured IPv6 connectivity on their site and reported it. Native IPv6 client capability in France of over 4 per cent. This is mainly caused by free.fr. That accounts for 70 per cent of the native IPv6 client measured. France free.fr is completely IPv6 leader in the world now.
This is case study 2, SAKURA Internet. Most service providers deployed 6rd with their residential gateway in the access network. But SAKURA Internet is a data centre provider. They are providing housing service, hosting service, dedicated server service and VPS service.
The issue is their layer3 switches are pretty old, and so they needed cost and network downtime to support IPv6.
To support IPv6, SAKURA Internet considered server-based 6rd, because most of the operating systems in use today are Linux distributions and Linux has supported 6rd since 2.6.33. FreeBSD and CentOS do not provide 6rd by default, but a patch exists.
SAKURA Internet provide IPv6 Internet reachability and 6rd BR and information of server-based 6rd.
Case study 3. NextGenTel. NextGenTel is the second largest Internet service provider in Norway. They are planning 6rd in their residential network.
This is normal. Same as Free. But I'm interested in NextGenTel's 6rd address assignment. As I mentioned before, 6rd address made from ISP IPv6 prefix and subscriber's global IPv4 address. If ISP has large, wide IPv6 prefix space and it's assigned 6rd, subnet ID will be larger.
6rd address allocation method in each of the ISPs.
Please pay attention to SoftBank. SoftBank use /24 as 6rd SP prefix. So customer can use /56 as IPv6 prefix and customer has 256 IPv6 networks.
This is Swisscom and SAKURA Internet and NextGenTel.
Swisscom is the same as Free and SAKURA Internet only provides one IPv6 network, but it is enough, because SAKURA Internet solution is server-based 6rd and NextGenTel use /40 for 6rd SP prefix but provide /56 to customer, same as SoftBank.
Focus on the IPv4 prefix length. This is 16.
This is the point where NextGenTel differs from the other ISPs. Back to the RFCs. RFC 5569 fixes the length of the embedded IPv4 address at 32 bits.
But RFC 5969 does not fix the length of the IPv4 address, so the IPv4 bits common to a 6rd domain can be omitted, which shortens the embedded address.
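The prefix-length arithmetic behind the SoftBank and NextGenTel figures can be checked with a few lines of Python (an editorial sketch, not part of the talk). The function names are hypothetical; the SoftBank entry assumes the full 32-bit IPv4 address is embedded, which is what the /24 and /56 figures quoted above imply.

```python
def delegated_prefix_length(sp_prefix_len: int, ipv4_mask_len: int = 0) -> int:
    """6rd delegated prefix length when the first ipv4_mask_len IPv4 bits are omitted."""
    embedded_bits = 32 - ipv4_mask_len
    return sp_prefix_len + embedded_bits


def networks_per_customer(delegated_len: int) -> int:
    """Number of /64 networks that fit in a delegated prefix."""
    if delegated_len > 64:
        raise ValueError("delegated prefix is longer than /64")
    return 2 ** (64 - delegated_len)


# (ISP, 6rd SP prefix length, common IPv4 bits omitted) as quoted in the talk.
for isp, sp_len, mask_len in [("SoftBank", 24, 0), ("NextGenTel", 40, 16)]:
    length = delegated_prefix_length(sp_len, mask_len)
    print(isp, f"/{length}", networks_per_customer(length))
```

Both cases come out at a /56 with 256 networks per customer, even though NextGenTel starts from a much smaller /40.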
Next on the agenda is this character of 6rd and IPv6 prefix.
This is our implementation of 6rd. We can specify the prefix length and suffix length of IPv4 transport address common to all the 6rd routers in a domain.
This is a potential solution. If 6rd is deployed to enterprise customers, an enterprise network which uses private IPv4 addresses has many common parts. For example, in this case, 10.100 is common, and the host address is also common.
So we can specify 6rd common prefix length 16 and suffix length 8.
So ::/56 would be enough to deploy IPv6 to large enterprise networks.
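The enterprise case can be sketched as follows (an editorial illustration, not the vendor implementation described in the talk). The function name, the 2001:db8:0:ab00::/56 documentation prefix and the address 10.100.37.1 are assumptions; only the common prefix length of 16 and suffix length of 8 come from the presentation.

```python
import ipaddress


def sixrd_prefix(sp_prefix: str, ce_ipv4: str,
                 common_prefix_len: int = 0,
                 common_suffix_len: int = 0) -> ipaddress.IPv6Network:
    """Embed only the IPv4 bits that differ between sites in the 6rd domain."""
    sp = ipaddress.IPv6Network(sp_prefix)
    v4 = int(ipaddress.IPv4Address(ce_ipv4))
    embedded_bits = 32 - common_prefix_len - common_suffix_len
    if embedded_bits < 0:
        raise ValueError("common prefix and suffix exceed 32 bits")
    # Drop the common suffix bits, then mask off the common prefix bits.
    middle = (v4 >> common_suffix_len) & ((1 << embedded_bits) - 1)
    delegated_len = sp.prefixlen + embedded_bits
    if delegated_len > 64:
        raise ValueError("delegated prefix would be longer than /64")
    value = int(sp.network_address) | (middle << (128 - delegated_len))
    return ipaddress.IPv6Network((value, delegated_len))


# 10.100.x.1 sites: 16 common leading bits, 8 common trailing bits, 8 bits embedded.
print(sixrd_prefix("2001:db8:0:ab00::/56", "10.100.37.1", 16, 8))
# 2001:db8:0:ab25::/64
```

With only 8 bits embedded, a single /56 yields one /64 for each of up to 256 sites.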
Next is also potential solution. IPv6 on mobile Internet. 3GPP supports IPv6 PDP, but it needs multiple PDP sessions if customer requests IPv4/IPv6 dual stack.
LTE supports IPv4v6 PDP which supports dual stack on a single bearer. But most operators still not ready for LTE service.
6rd provides IPv6 Internet even if mobile operator does not support multiple PDP and IPv4v6 PDP.
We did a test using our 3G router and separate Internet 6rd BR on DoCoMo 3G network.
NTT DoCoMo already supports IPv4v6 PDP for their LTE users, but we ran our test as 3G HSPA users.
This is the result. Ipv4-test.com, 2001:e41/32 is a ... 6rd BR address prefix and here the IPv4 address is Mopera. Mopera is ISP over NTT DoCoMo. IPv6 address made ... 6rd address and IPv4 address.
Summary. 6rd support platforms are expanding since Linux Kernel has been supporting 6rd from 2.6.33.
Managed 6rd (6rd with home gateway).
Free, SoftBank, Swisscom, ComCast, Videotron and NextGenTel.
User on demand 6rd, 6rd BR with server and so on.
SAKURA Internet is here and Charter is also here.
If you use common prefix and suffix, you can provide a lot of network to a customer even if they have a small IPv6 prefix range. This we can study from the NextGenTel case.
Of course, we should consider migrating from IPv4 to IPv6, but even if there is equipment that cannot be upgraded to IPv6, you can rapidly deploy IPv6 with 6rd.
We can study from all the user cases.
This is the end of my presentation.
Any questions or any comment?
Geoff Huston (APNIC): I'm curious as to the business case that would motivate 6rd. What it appears is that 6rd is making the implicit claim that it's more efficient and effective to do a complete software upgrade of every single customer CPE router, because that's cheaper than upgrading the tens, at most hundreds, of ISP switches.
But there are hundreds of thousands, if not millions, of customer CPE devices. How can it possibly be cheaper? This just seems to be putting it backwards.
Certainly in those kinds of environments, where the customers own and operate and manage their own CPE, how do you get them to upgrade it? In those rare environments where the provider manages the CPE, logistically, how do they upgrade hundreds of thousands, if not millions, of CPE devices?
It's a cute technical solution, but what on earth is the business case that could make this viable as compared to upgrading ten, maybe a hundred, switches?
I'm interested in your answer. I just don't understand 6rd in that respect.
Shishio Tsuchiya: I think it depends on the user service style. For example, Free and SoftBank use completely managed CPE, so 6rd is a very acceptable solution for them. Maybe Cisco and Comcast mostly manage CPE, so capability also is good, but as you mentioned, some service providers do not manage users' CPE, so it is difficult. But for example, SAKURA Internet and Charter's case provide 6rd BR information and a convincing example of 6rd server and CPE. It is a possible solution, I think.
Geoff Huston (APNIC): Is making the number of choices available to customers larger more confusing or better for transition?
Shishio Tsuchiya: I like to consider.
Geoff Huston (APNIC): Thank you.
Erik Kline (Google): I had a response to Geoff, actually.
Geoff, you have to upgrade the CPE, no matter what, even if you do native v6. There is an email I saw recently about an ISP that said, "We have rolled out IPv6 to 100 per cent of our customers." And if we take them at their word, I went and looked at our data for the v6 data that we see it hasn't moved since March.
I'm perfectly willing to believe them, that they did the work to enable IPv6 for all of their customers, but all of those CPEs need to be upgraded. You have to upgrade them to get native v6, you have to upgrade them to get 6rd, you have to upgrade them to get anything at all.
Geoff Huston (APNIC): Do you mind if I conduct this conversation with him, because it's an interesting point? If you have to upgrade all those CPEs to do 6rd just for transition, you have to do it again to do 6 some time down the track, do you not?
Erik Kline (Google): No, I hope not, not if 6rd and 6 are in the same thing.
Geoff Huston (APNIC): So that tunnelling protocol through the ISP is perfect? Cool, I never knew tunnelling was that good, so thank you.
Erik Kline (Google): I never claimed it was perfect.
Owen DeLong (Hurricane Electric): Geoff, you shouldn't have to upgrade the CPE again. The CPE should be capable of supporting 6rd or native in the CPE in one software upgrade, you shouldn't need more than one. As they said, you do have to upgrade the CPE regardless, so you might as well upgrade it to dual capability CPE, use 6rd until you can get your switches upgraded and move forward.
Philip Smith: I think in the interests of time, we probably should move this to break-time conversation and discussion and move on with the next presentation.
Thank you very much for your interesting presentation. Thank you.
Philip Smith: Next up we have two presenters from NTT talking about summary of IPv6 Day. Let me just get the presentation up.
We have Daisuke Yamada who is going to talk about IPv6 Day.
Daisuke Yamada: Good morning. My name is Daisuke Yamada from NTT East Japan. I appreciate having this opportunity to have presentation today.
I'm going to talk about a summary of IPv6 Day on the NTT-NGN.
This is the agenda of my presentation. First of all, I will briefly talk about NTT-NGN and then about IPv6 Day.
I will talk about some concerns with IPv6 Day in Japan. We investigated in order to dispel the concerns, so I will talk about the investigation and the results of the analysis.
Then I will talk about NTT measures for IPv6 Day, technologies and consideration.
Let me introduce NTT-NGN briefly. I will explain the structure of NTT-NGN. NTT-NGN is a huge closed network which is constructed by using IPv6 mainly.
We provide our own services on NGN, for example, video delivery and QoS based IP telephony and so on.
They are provided by IPv6.
On the other hand, we ... to access rights for Internet connection by creating tunnel using IPv4. In a word, when a user connects to the Internet using IPv4, NGN is the access network for ISPs. From this June, NTT has started Internet connection by IPv6.
Internet connection to the Internet using IPv6, we have provided two methods. One is creating tunnel, same as Internet using IPv4.
And the other is NTT coordinated with IPv6 connection to the Internet as layer three. This one is creating tunnel. We call it tunnel method.
This one is routing quasi-native. We call it quasi-native method.
Quasi-native method is only provided by IPv6 connection. Like this, NTT has started to provide the Internet connection service by IPv6. Now we are providing IPv6 so that many users can use it.
Let me start to talk about World IPv6 Day.
In Japan, we have some concerns for IPv6 Day. It was said users could not communicate because TCP fallback did not work correctly. And also it was said when one site upgrade to IPv6, some users cannot browse the site for some reason.
Some companies indicated 5 per cent of all users cannot access the website correctly.
For this reason, there were concerns that a lot of users cannot browse web at IPv6 Day in Japan.
For this problem, we have some workarounds at each position. We respond to communication problems as follows. NTT built a TCP resetter. I will talk about the TCP resetter later.
ISPs introduce an AAAA filter at the DNS server. I will talk about the AAAA filter at a later time.
Customers apply RFC 3484 tool which ... policy table to their client so that they can select the correct route. For this, they don't need to do fallback.
NTT actually was going to build a TCP resetter for IPv6 Day, but before we did it, we investigated whether there were grounds for the concern that a lot of users cannot connect to the Internet. In a word, we investigated to confirm there is no cause for anxiety around NTT-NGN.
In the first place, if users select the IPv6 Internet connection service, the fallback problem basically would not occur, but we have a lot of users who use the IPv4 Internet connection service, so we have to deal with this problem and keep the Internet connection. So we concentrated on responding by building the TCP resetter.
I talk about our investigation in the next slide.
Before talking about the investigation, let me explain our definition of terms, which I will use during this presentation. This term is beacon. The beacon means web beacon. I'll talk about the beacon first.
This time we used the beacon to check whether users can access some site. When users receive the beacon, browsers access the site which beacons indicate automatically. So checking access at the site, for example, we can survey which site users can access or not. Using this mechanism, we can trace users' access and investigate reachability to any site.
Let me start to explain the investigation.
NTT had investigated with IPv6 Promotion Council of Japan. We investigated the behaviour of users' access accessing to IPv4 and IPv6 website using NTT-NGN.
When users access to a certain web server, the server responds to information with five kinds of beacons and we measure the reachability of the site which beacons indicate. The five kinds of beacons are these.
Beacon A is for checking the A record. Beacons B and C are for checking the AAAA record. These beacons classify users by whether they can access the servers which the beacons indicate.
So we can confirm users have reachability to IPv4 by beacon A and we can confirm users have reachability to IPv6 by beacon B and beacon C.
We assumed the access may be like this, beacon A is like this and beacons B and C are like this (demonstrates on slide).
The IPv6 address which beacon C indicates is special address. Only the NTT users can access this IPv6 address. We need this beacon because we can check whether this access is from NTT user or not.
Beacons D and E are for checking whether users can do fallback.
If users who need to do fallback access site with beacons D and E, we assume the access moves like this, beacon D (demonstrates on slide). And beacon E (demonstrates on slide).
We assume the access moves like this, where I show one example. For example, one user cannot access server which beacon B indicates. This user can access server which beacon D indicates. What does this mean?
This means this user do fallback correctly. We assume it's access like this (demonstrates on slide).
So we need to focus on beacon D mainly. As I said before, if users don't have reachability to the site which this beacon indicates, basically this means fallback doesn't run correctly in this user's environment. We mitigate it like this.
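The beacon logic described above can be summarised in a short Python sketch (an editorial illustration, not the measurement code used in the investigation). It assumes one reading of the talk: a user who reaches beacon B already has IPv6 reachability, a user who misses B but reaches D has fallen back correctly, and a user who misses both is counted as a fallback failure. The function name and the category labels are hypothetical, and beacon E is not modelled.

```python
def classify_user(reached):
    """Classify one user from the set of beacon letters their browser fetched."""
    reached = set(reached)
    result = {
        # Beacon C points at an address only NTT users can reach.
        "ntt_user": "C" in reached,
    }
    if "B" in reached:
        result["status"] = "ipv6-reachable"
    elif "A" not in reached:
        result["status"] = "no-ipv4-reachability"
    elif "D" in reached:
        result["status"] = "fallback-ok"
    else:
        result["status"] = "fallback-failed"
    return result


print(classify_user({"A", "C", "D"}))  # NTT user without IPv6 Internet, fallback works
print(classify_user({"A", "C"}))       # NTT user whose fallback did not complete
```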
Result of investigation. We proved fallback can work correctly over Internet environment. The users total 21 million. As a result, we found 0.2 per cent of users could not access the web.
We consider the users who could not access the web in regulation time to be a failure.
Focus on NTT East users, 68 per cent of users who couldn't access website normally used applications which have fallback problem.
This graph shows relation of OS and applications which are used. Please pay attention. This is fallback ratio graph in NGN users by OS type. 95 per cent of users used Windows. Mac and other OS are little.
Next is browser type. 66 per cent of users used Internet Explorer. Firefox is second and third are Chrome and Opera, a comparable rate. I think the ratio of this graph shows it's almost the same of market share in Japan.
By this investigation, NTT confirmed there was no serious problem in the connection to the IPv6 Internet over the NTT-NGN environment. NTT investigated and confirmed this, and then faced IPv6 Day.
Now, I explain the measures of NTT for IPv6 Day.
On NTT-NGN, when users receive an AAAA response, we realise the connection to the Internet by falling back from IPv6 to IPv4.
When users cannot access to dual stack site by using IPv6, they have to try fallback to IPv4. NTT-NGN support their fallback quickly by sending TCP reset packet to users.
So NTT has built a system which sends TCP reset packets, called the TCP resetter. As a result, users can fall back quickly without waiting for the timeout.
It is as shown in this figure. First, client reserve a ... to DNS server. Second, the DNS server responds with an AAAA record and an A record. Next, the client communicates to the AAAA address by IPv6.
As I said in the beginning, the communication goes to NGN because NGN is constructed by IPv6. But NGN is a closed network, so there is no destination in NGN. Then the TCP resetter responds with a TCP reset. Last, as a result, users can fall back quickly and can access the site by its IPv4 address.
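From the client's point of view, the effect of the reset can be sketched in Python (an editorial illustration, not NTT's implementation). A TCP RST surfaces as `ConnectionRefusedError`, so the client can move to the next address immediately instead of waiting for the connect timeout. The function name, the injectable `connect` parameter and the 21-second timeout are assumptions made so that the sketch can be exercised without a network.

```python
import socket


def connect_with_fallback(candidates, connect=socket.create_connection, timeout=21.0):
    """Try each (host, port) in order, IPv6 first; return the first connection made."""
    last_error = None
    for address in candidates:
        try:
            return connect(address, timeout=timeout)
        except ConnectionRefusedError as exc:
            # A TCP RST came back: fail fast and try the next address at once.
            last_error = exc
        except OSError as exc:
            # No answer at all: the client only gets here after the full timeout.
            last_error = exc
    if last_error is None:
        raise OSError("no candidate addresses given")
    raise last_error
```

Without the resetter, the IPv6 attempt would end in the second branch only after the timeout expires, which is the delay the resetter is meant to remove.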
Thinking of NTT users, we expect the TCP SYN traffic to the TCP resetter to be nearly all of the IPv6 communication traffic.
So NTT build up this TCP resetter and prepared for IPv6 Day.
I will introduce IPv6 Day solution in the next slide.
First, please look at this figure. This is the number of reset packets on the TCP resetter. In Japan, IPv6 Day started from 9 am JST on June 8.
We can see the number of reset packets increased rapidly at 9 am. Comparing the packet numbers before and after 9 am, we can see it increased to about seven times the usual level. After 9 am, the number of TCP packets increased little by little. The peak time was about 10 pm.
After 10 pm, the number of TCP reset packets decreased and from June 9, morning, the number increased only a little.
During v6 Day, NTT-NGN received many TCP SYN packets, but NTT-NGN could respond with TCP reset packets perfectly. We checked all of them to confirm it.
This is another data. This figure shows reset response ratio of TCP resetter. Please look to the day before and after IPv6 Day. June 7, the blue line; June 9, the green line.
The reset response ratio on June 9 was more than on June 7. This means that there were websites which still responded to AAAA.
Then I talk about the result from IPv6 Day until the end.
At first I talk of NTT-NGN. NTT has built up TCP resetter and taken measure for IPv6 Day. As a result NGN customer did not have any serious problems at the IPv6 Day.
TCP resetters responded to all TCP SYN, so there was no serious network trouble and no serious users' reports.
But having experienced the IPv6 Day, we could confirm that NGN does not become a barrier to the Internet adopting IPv6, and on an IPv6 network like NTT-NGN, the action of using the TCP resetter also proved effective in dealing with the fallback problem.
Last, I talk about what we should do before the next IPv6 Day.
Probably it will take a long period for all users to adopt IPv6. Of course, it is necessary that not only end users support IPv6. However, it is not easy to realize an all-IPv6 network. So we can assume fallback will occur for several years.
On the network side, it is very important to support fallback correctly and quickly.
So on the network side, for example, we can support fallback quickly by using TCP resetter. We show it in this presentation.
On NTT situation, we have also prepared IPv6 Internet connection service which doesn't need to do fallback basically.
Through our investigation, we found that some applications have unstable implementations around TCP fallback. On the application side, some applications have to be improved.
We have to note one thing. It is that we must not put any pressure on our customers. On this IPv6 Day, some users applied the RFC 3484 tool to avoid the fallback problem and some users adjusted their browsers and applications.
However, not all users can take a correct way when they are faced with the fallback problem.
So we have to consider the method to correspond to fallback problem which have no pressure to customers.
In a word, we have to improve network and applications without users' load.
Network: for example, we can update TCP resetter in our network. On application, we have to check and update around fallback implementation.
Taking this action towards next IPv6 day or week, we will be able to construct comfortable network to IPv6.
We at NTT will cooperate with every organization to take some measure to realize IPv6 Internet.
That is the end of my presentation.
Philip Smith: Are there any questions?
Lorenzo Colitti (Google): Can you go back to slide 8?
That red thing that says fallback, you didn't mention that it takes one second.
Daisuke Yamada: In one second we can do fallback.
Lorenzo Colitti (Google): So every user that goes to a v6 website will have to wait one second?
Daisuke Yamada: We don't check all users' behaviour, but normally we can do fallback users, we can do fallback within one second.
Lorenzo Colitti (Google): So we have been measuring this data and we see that for all our users in Japan, all Google users in Japan, the average latency impact of announcing IPv6 is 870 milliseconds. So if we turn on IPv6 for a day, that's millions of users with 870 milliseconds. If we turn it on for a week, it's millions of users with 870 milliseconds. If we turn it on forever, it's still millions of users with 870 milliseconds latency penalty.
As you can imagine, this is quite a barrier to actually enabling IPv6 on a large website. One of the things that we have to consider is how do we turn on IPv6 for the world without giving big problems to Internet users in Japan?
One of the things that we thought of is that if a question comes from a DNS server in Japan, we'll just remove the AAAA. We'll just say no, there's no AAAA.
So we could turn it on for the rest of the world but say, well, if one of these DNS servers asks us for a AAAA from Japan, we'll say no, we'll just remove it, we'll filter it out.
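The filtering idea described here can be sketched in Python (an editorial illustration, not part of the discussion and not Google's implementation). The function name, the record tuples and the 198.51.100.0/24 resolver range are hypothetical; the sketch only shows the decision of dropping AAAA answers when the query comes from a listed resolver.

```python
import ipaddress


def filter_answers(records, resolver_ip, filtered_networks):
    """Drop AAAA records when the querying resolver sits in a filtered network."""
    source = ipaddress.ip_address(resolver_ip)
    if any(source in network for network in filtered_networks):
        return [record for record in records if record[1] != "AAAA"]
    return list(records)


# Hypothetical data: (name, type, value) answers and one filtered resolver range.
answers = [
    ("www.example.com", "A", "192.0.2.10"),
    ("www.example.com", "AAAA", "2001:db8::10"),
]
filtered = [ipaddress.ip_network("198.51.100.0/24")]

print(filter_answers(answers, "198.51.100.53", filtered))  # A record only
print(filter_answers(answers, "203.0.113.5", filtered))    # both records
```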
Daisuke Yamada: We want to consider.
Tomoya Yoshida (NTT Com): Yes, I think it's true, one second to the IPv6 global Internet from the NTT-NGN in the current situation, but this year we started IPv6 native service and also the tunnel service to the IPv6 global Internet, so after that, no one second, just going to the --
Erik Kline (Google): When they called, they told me 2013, September.
Randy Bush (IIJ): NTT tells people -- I live in some strange little island over there and, you know, 2013, maybe.
This is a disaster. This is not a little problem.
This means that consumer IPv6 is dead in Japan. Any reasonable content provider will block AAAA records queries from Japan because Amazon and Google and Facebook et cetera do not want their users and by the way, one second is minimal, Geoff has seriously worse measurements over there for people with things like Windows XP where you're looking at two and a half seconds. This is a disaster.
If NTT wishes to do it, that's on ... but what was really horrifying is NTT had said to other large global providers that other people should deploy this. Nobody should deploy this.
Tomoya Yoshida (NTT Com): I'm sorry, I don't have any measurements myself, so I don't have any detailed information for you, but ...
Randy Bush (IIJ): So you deployed this without measurement?
Tomoya Yoshida (NTT Com): No, no. From which, I can't tell you, but, yes.
Lorenzo Colitti (Google): I suggest that we take this discussion into tomorrow. Erik will be presenting some data that we have, you'll see what the graphs look like, you'll see what they have looked like in the last few months, you'll see how much v6 adoption we measure, you'll see the latency impact, so please, look at his presentation and the data are there.
Tomoya Yoshida (NTT Com): OK. Thank you.
Philip Smith: Again, in the interests of time, and as Lorenzo suggested, let's carry on this discussion in coffee break and tomorrow. We have a whole IPv6 day happening tomorrow, so lots of opportunity to talk.
Thank you very much for your presentation. Very interesting. Thank you for that.
Next up we have Kenneth Llanto from the Philippines who is going to talk about NAT64 and NAT44.
Kenneth Joachim Llanto: Good morning. Today I'm going to talk about our research regarding the performance of NAT64 versus NAT44 in the context of IPv6 migration.
This is the overview of the presentation. First, I'll give you a short overview of the situation. I know you're all familiar with it, so I'll just try to make it fast. Since the late 1980s, it has been a concern that IPv4 addresses will run into exhaustion. Just recently, IANA allocated the last five /8 blocks of IPv4 addresses to the world's RIRs. At the rate we are going, we are probably expecting the IPv4 addresses to run out next year.
We know that there has been an imbalanced distribution of IPv4 addresses. So in the Philippines, we were forced to live with an IPv4 shortage mentality by implementing technology such as NAT, CIDR, private addresses as a measure to compensate for this lack of addresses.
Currently, most networks in the Philippines are still using NAT. Most of them are hidden inside networks. Though it has lengthened the life of IPv4, we know that it's going to run out.
IPv4 addresses will soon run out. That's why we are all here in the Conference in the first place.
Not to mention APNIC, the RIR which we are under, will become the first registry to be fully depleted, based on the number of remaining IPv4 addresses.
We all know that the solution is IPv6, however the path to translation, transition or migration is still unclear to some of us or to most of us.
Transitioning from IPv4 to IPv6 still seems daunting to some. Its acceptance and adoption have not been what was expected.
Technology such as NAT, CIDR, private addresses have delayed its acceptance. Moreover, there are some migration concerns like cost, network performance and, of course, fear.
Some people especially in our country have not started because of mostly fear. It's unknown to some of us.
This is why the research would like to prove that such a migration technique is feasible without sacrificing network performance and is also cost effective.
In line with this, I'm proud to present our study.
This is actually a two-phase study. I'm going to present the first part only for today, the performance of NAT64 versus NAT44 in the context of IPv6 migration.
The objective of this research is to provide network operators a simple and clean path towards IPv6 migration, show that the performance of NAT64 is comparable with the current NAT44 that we are using.
We aim to answer the following question: Is there a significant difference in the performance of IPv4 NAT network versus the IPv6 NAT64 network?
Also, in this research, we plan to tackle the following specific questions.
What are the possible measures that we could use to compare NAT44 with NAT64?
How do we implement IPv6 NAT64? What considerations must we address in order to perform a successful transition from IPv4 NAT to NAT64?
Will this NAT64 implementation be comparable to the performance of NAT44?
How do we plan to do it? We plan to do a comparative study using first an experimental test bed network. We created three test beds. The first one would be a NAT44; then a native IPv6 network, this is to show or to give us a view of what the future seems; and lastly, a NAT64 network.
The second part of the study will be the actual networks. The first one would be a laboratory in Ateneo de Manila using NAT44; then the second one, we will implement NAT64 in that laboratory and perform testing and measurements of the data.
For the statistical treatment, we are going to use a two-sample t-test assuming unequal variances, with an alpha of .05. This is the logical diagram of the IPv4 NAT44 experimental network.
In this scenario, IPv4 clients connecting to the Internet don't have connectivity to IPv6 Internet. The clients are issued non-routable IPv4 addresses and would need to go through a NAT router to connect to the IPv4 Internet using the router's IPv4 routable address.
The next experimental network is a native IPv6 network diagram.
This kind of set-up would not require address translation. This is mainly because we all know that IPv6 has a huge amount of IP addresses.
This is how it's going to work. At this point, in the IPv6 scenario, transition strategies are no longer required because complete migration has been achieved.
This is our final goal. But I believe we're still far from it. So we have to take the small steps necessary to start going to IPv6.
The next experimental network is the NAT64 network diagram.
In this scenario, IPv6 clients are issued with routable addresses. A client can access an IPv4-only website or a website with support for IPv6. The latter may be a website with both IPv4 and IPv6 addresses or a pure IPv6 website.
In the case that the client would like to access an IPv4-only website, first, it will go through a DNS64 proxy daemon. This daemon will contact an actual DNS server and ask for an A or AAAA record. Since the website only supports IPv4, the server will only return an A record. This A record will pass through the DNS64 daemon, and the daemon will add an IPv6 prefix to the address in that A record and, in turn, return a AAAA record to the client.
This will allow the IPv6 client to think that the IPv4 website supports IPv6 and then using the NAT64 router, we will be able to communicate to the IPv4 Internet via translation.
If a client would access a network which supports IPv6, what happens is there is no need for translation.
The client can connect directly to the IPv6 enabled Internet using its assigned IPv6 address.
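The DNS64 step described above can be sketched in a few lines. This is an illustration rather than the code used in the study: it assumes a /96 translation prefix and uses the RFC 6052 well-known prefix 64:ff9b::/96, which the talk does not specify, and the function names are hypothetical.

```python
import ipaddress

# Assumption: the RFC 6052 well-known prefix. The prefix actually configured
# in the test bed is not stated in the talk.
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def synthesize_aaaa(ipv4, prefix=NAT64_PREFIX):
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def dns64_answer(a_records, aaaa_records):
    """Return what the DNS64 daemon hands back for a AAAA query.

    Real AAAA records are passed through unchanged; if the name only has
    A records, AAAA records are synthesized from them.
    """
    if aaaa_records:
        return list(aaaa_records)
    return [str(synthesize_aaaa(a)) for a in a_records]


# IPv4-only site: the client receives a synthesized address that routes
# to the NAT64 translator.
print(dns64_answer(["192.0.2.33"], []))
# Dual-stack site: no synthesis, the client connects natively over IPv6.
print(dns64_answer(["192.0.2.33"], ["2001:db8::33"]))
```

The NAT64 router performs the reverse step on each packet, recovering the IPv4 destination from the low 32 bits of the synthesized address.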
This is a summary of the software and daemons that we used in order to implement the set-ups. For NAT, we used a Linux web server, then a Linux client, a DNS server, and a Linux router using iptables. For the web server, we just added radvd to provide the clients' IPv6 addresses.
For the NAT64, we used TAYGA, an out-of-kernel stateless NAT64 implementation for Linux, and for DNS64, we used the Trick or Treat Daemon.
For the test we used ping and ping6 to send 21 packets to the web server to test for connectivity and round-trip time, while we used the ApacheBench tool to test using 1,000, 2,000 and 3,000 requests, at concurrency levels of 10, 100, 200 and 300. The web page that was accessed had a size of 173,225 bytes. We were limited to these concurrency levels because the hardware that we used was not able to handle higher levels.
These are the initial results of the ping test. As you can see, in overall performance, IPv6 is better in all aspects compared to NAT44 and NAT64. Looking at the average, it seems that NAT64 performs a little better compared to NAT44.
This is not actually the case. In order to ping, we used the domain names, so the first ping packet would require DNS resolution, and it took longer compared to the other ping packets. That first ping packet significantly increased the average of NAT44.
So in general, NAT44 performed a little better compared to NAT64 in terms of ping.
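The effect of that first packet on the average can be shown with a small calculation. The round-trip times below are made-up numbers chosen only to illustrate the point, not measurements from the study; only the count of 21 packets comes from the talk.

```python
from statistics import mean


def rtt_summary(rtts_ms):
    """Compare the mean RTT with and without the first (DNS-delayed) sample."""
    return {
        "mean_all": mean(rtts_ms),
        "mean_without_first": mean(rtts_ms[1:]),
    }


# Hypothetical run of 21 pings: the first one includes a slow DNS lookup.
samples = [25.0] + [1.0] * 20
summary = rtt_summary(samples)
print(summary)
```

Dropping the first sample, or pinging by address rather than by name, keeps name resolution time out of the round-trip comparison.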
For the ApacheBench tool, we compared total time difference, total byte transfer, successful Keep Alive packet difference, transfer rate difference, request per second and time per request.
We compared both networks against NAT64, because this is what we're actually testing for.
In the comparison of the NAT64 versus NAT44, it can be seen that the results are very close. There is very little difference in terms of speed, total time, total bytes, transfer rate and other metrics.
For the NAT64 versus native IPv6, we also implemented the same measurement techniques, the same metrics.
The results were as follows.
For all the metrics we were able to find out that IPv6 performed better. This is a good thing, because it gives us a preview of what the future brings.
These are the initial results of the statistical treatment, the t-test, for all the same metrics.
To test for a significant difference, we used an alpha of .05. A result less than .05 would mean the difference is significant. For this one, it is only the comparison of NAT64 versus NAT44. We no longer compared the results of IPv6 with NAT64 because we used it just to give us a preview of what the future might bring.
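For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two-sample t-test with unequal variances (Welch's test). It computes only the t statistic and the Welch-Satterthwaite degrees of freedom; the p-value to compare against the alpha of .05 would come from the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`. The sample values are invented for illustration and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for a two-sample t-test assuming unequal variances."""
    # Squared standard error of each sample mean.
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical measurements for one metric on two networks.
nat44 = [1.0, 2.0, 3.0, 4.0, 5.0]
nat64 = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(nat44, nat64)
print(round(t, 4), round(df, 4))
```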
For total time, total bytes transferred, successful Keep Alive packets, requests per second and time per request, we found that there was no significant difference between IPv4 NAT44 and IPv6 NAT64. But for the transfer rate, there was a significant difference. Although there is a significant difference, you were able to see on the previous slide that NAT64 yielded a higher throughput compared to NAT44. This is a difference that we look forward to, because it's a positive difference.
Based on phase 1 of the research, these are our initial recommendations. Use of NAT64 as a translation strategy for IPv6 migration is easy and viable.
It provides IPv4 and IPv6 connectivity.
Open source technology can be used to lessen the financial constraints brought about by this migration.
And we must also take the first step now towards IPv6 migration.
For future work, this is the configuration of the laboratory networks. We already are done with the set-up of NAT44 laboratory and currently, we are also done with the configuration of NAT64 laboratory.
We are performing ongoing testing on this laboratory.
After that, we will gather the data and perform comparison and analysis of the results.
The laboratory network is a live test of the experimental network and will be applied to more machines with actual connection to the Internet.
In the laboratory classroom we were given operating systems of Windows XP and Ubuntu 8.04, but we also test this set-up using several Windows 7 devices and Ubuntu 10.04 clients.
Here are the logical diagrams for the NAT44 set-up and this one for the NAT64 with connectivity to IPv4 Internet. At this point, we are still requesting connection to the IPv6 Internet outside in the laboratory.
In closing, I would like to leave you with this. In the end, we are often hindered with fear but with adequate information and motivation, we are assured of our path towards the next step.
Questions and suggestions and comments.
You can contact us through email.
Masataka Ohta (Tokyo Institute of Technology): With the configuration of NAT64, you have a server behind the NAT, don't you?
Kenneth Joachim Llanto: Yeah.
Masataka Ohta (Tokyo Institute of Technology): Then what's wrong using NAT44 with servers behind NAT?
Kenneth Joachim Llanto: OK, yeah.
Masataka Ohta (Tokyo Institute of Technology): Then we don't have to migrate to IPv6 because there are a lot of port numbers, 16 bits available. So we have 48 bits of addresses.
Kenneth Joachim Llanto: For IPv4, you are saying?
Masataka Ohta (Tokyo Institute of Technology): OK?
Kenneth Joachim Llanto: In this set-up, this is just a micro set-up. In the future, we are planning to implement it on a larger scale, so this one is just the first phase. So actually, the campus has many clients, about 5,000, and hopefully if we get the chance, we will try to implement it in a bigger network, but you're right, for small networks, there are a lot of port numbers.
Masataka Ohta (Tokyo Institute of Technology): No, no.
Port number space is 10 times larger than 5,000 clients, so I think IPv4 with NAT and servers behind NAT can scale well, but we may disagree, OK.
Geoff Huston (APNIC): I was actually wondering how you have tested it under load, because the real difference as far as I can see between NAT64 and NAT44 lies in the way it behaves when you really load it up. Part of the difference is that they behave differently under different kinds of load. In NAT44, the problem is that when you get a new connection as distinct from an existing one, you have to find a new binding port and it's really the algorithm that looks up the port table to find the next port. If that's a linear search, you will be waiting forever.
If they are actually intelligent and do some kind of balanced tree structure inside the implementation, it will be quicker, but even so, the more active bindings, the slower it gets. So it's not transfer rate, it's simultaneous connects. Whereas NAT64 stateless, as far as I can see, what you're doing is a direct table look up, one address is equivalent to another. It doesn't matter how many simultaneous connections you have, because you keep on servicing them until you run out, because it's a static table. But every packet costs because you have to rewrite the header and redo the check sum, so there it's a case of packets per second.
So rather than just a simple measurement, this fast, have you looked at measurements under the two kinds of load, more connections, more packets per second, because I suspect that's the real insight there into these effective performance differences of the two kinds of technologies.
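The two cost models Geoff contrasts can be sketched as follows. This is a toy illustration, not how any particular NAT implementation works: the class and function names are hypothetical, the stateful side deliberately uses the naive linear search he warns about, and the stateless side is reduced to a static table lookup.

```python
class LinearScanNapt:
    """Toy stateful NAT44: each new flow needs a free external port."""

    def __init__(self, first_port=1024, last_port=65535):
        self.ports = range(first_port, last_port + 1)
        self.by_flow = {}    # flow tuple -> external port
        self.in_use = set()  # external ports currently bound
        self.probes = 0      # ports examined so far (the setup cost)

    def bind(self, flow):
        """Return the external port for a flow, allocating one if needed."""
        if flow in self.by_flow:
            return self.by_flow[flow]  # existing connection: no search
        for port in self.ports:        # new connection: linear search
            self.probes += 1
            if port not in self.in_use:
                self.in_use.add(port)
                self.by_flow[flow] = port
                return port
        raise RuntimeError("port pool exhausted")


def stateless_map(ipv6_host, table):
    """Toy stateless NAT64: a static one-to-one address lookup.

    No per-connection state is created; the per-packet cost (header
    rewrite and checksum update) is not modelled here.
    """
    return table[ipv6_host]


napt = LinearScanNapt(first_port=1024, last_port=1027)
for n in range(3):
    napt.bind(("10.0.0.%d" % n, 40000, "203.0.113.1", 80))
print(napt.probes)  # grows with the number of active bindings
```

With this naive search, the work per new connection grows with the number of active bindings, which is why a test that ramps up simultaneous connections stresses NAT44 differently from one that ramps up packets per second.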
Kenneth Joachim Llanto: Thank you, Geoff. Actually, it's a good suggestion. At this point, we haven't done that kind of testing, but in the future, hopefully we can implement something like that. Thank you.
David Farmer (University of Minnesota): Following up on Geoff, do you have any theories on the difference in performance yet?
Kenneth Joachim Llanto: Not that much. We're still investigating.
David Farmer (University of Minnesota): Thank you.
Philip Smith: Thank you very much for the questions.
Thank you very much for your presentation. Before you go away, a small thank you gift, Kenneth.
I think we'll wrap up this session now. I apologize for being 10 minutes late. We did actually start 10 minutes late, so we have used the 90 minutes that we had.
Before we go, there is the NRO NC on-site election being held during the Policy SIG on Wednesday, so that's from 11 o'clock to 4 o'clock on Wednesday.
The online election for APNIC members for the NRO NC, that's going to close at 9 am in this time zone, so 9 am here in Korea on Tuesday. That's tomorrow.
That's just the reminder about the NRO NC election if you are taking part in that.
Let's have our tea break right now. Unfortunately it's only 20 minutes, so please go out, quick tea or coffee break and if you can come back here at 11 o'clock for the opening plenary of APNIC 32.
Thank you all very much. Thank you to the speakers, thank you to people who asked questions and thanks to you, the audience.
Wednesday, 31 August 2011.
Tomoya Yoshida: Good morning. Last night was a very good social event. For me, my throat is not in good condition.
Anyway, today we have the APOPS 2 session, following the APOPS 1 session on Monday morning. My name is Tomoya Yoshida of Internet Multifeed.
Anyway, so I will give some information to you about the lightning talks.
Yesterday, we had lightning talks for the IPv6 topics, but today we have another lightning talk. The time is 6 o'clock to 7 o'clock this evening.
We have four speakers, but still one is vacant for you. So if you would like some topics or idea, please submit, you can see the URL, APNIC.net, so just 10 minutes for each. If you would like to have some talk, please.
This morning, we have three presentations: one is from Raphael, on the carrier ethernet exchange; the second one is from Randy Bush, on BGPsec; and the last one is from Masataka-san, regarding path MTU discovery.
Raphael Ho: Good morning, everybody. Hope you are all bright and early and awake.
My name is Raphael Ho, I'm director of engineering operations for Equinix Asia. Robert Huey is my colleague in the US, he's the lead engineer for the carrier ethernet exchange, so I'm using his presentation from NANOG.
Firstly, what is a carrier ethernet exchange? It is basically very similar to an Internet exchange, where multiple service providers would interconnect at a single point to exchange routing information and to exchange IP packets. So a carrier ethernet exchange is a place where multiple carriers would interconnect and exchange ethernet frames with each other for things like MPLS VPN services, for example, the pseudo wire type services, the local loop services and the metro ethernet services. Essentially, they can exchange any ethernet frames on this. Basically, the goal here is similar to an Internet exchange: we build a vibrant carrier community and service provider community, so that customers can have fast access to ethernet type services all over the world.
Why do we have this ethernet exchange?
Imagine if you are a network service provider and you receive a tender for a 100-site MPLS VPN network service all around the world. You only have direct access to a very small proportion of customers and even then, you would need a local loop provider to help you to bring that service from your POP to the customer's location.
And you would also need to find carrier partners to deliver services to other locations that you cannot reach and they in turn would require interconnecting with local service providers to deliver that end service on the far end.
So basically, the way this ethernet exchange tries to assist in delivering that goal is to provide three major services. The first one is really a marketplace: local loop service providers can upload their building lists into a portal and international carriers can upload their country lists into a portal. So buyers of network services within these carriers can actually go to one place and identify who has reach to a particular building. It is a very painful job sending quote requests to five different local service providers: three of them come back, some come back later, it's very difficult to do that, and a lot of carriers have a very large team of people just doing that, just getting quotes. So that's part number 1.
Part number 2 is we actually operate an ethernet exchange fabric which is essentially an MPLS switch running pseudo wires to interconnect V-LANs of different carrier partners and, thirdly, is the NOC to actually support the ongoing troubleshooting and the demarcation of the services. So I'll go into a bit of that in the next few slides.
As I mentioned, one of the key components is the ethernet switch fabric. For the Equinix carrier ethernet exchange, we are actually using Alcatel-Lucent's 7450 and we have configured it in a redundant topology and the exchange portal, as I mentioned we have the marketplace which actually matches buyers and sellers, we actually integrated this portal with the service manager for Alcatel-Lucent, the management system, which allows us to provide automated service provisioning, SLA monitoring, as well as some troubleshooting features.
So this is the Equinix carrier ethernet exchange portal architecture, so again multiple service providers would be uploading the building list, on that building list into the portal, so that other carriers can actually search that and identify them as a provider of local access services or international access services.
Basically, when they identify, let's say, two or three carriers can provide that service to that particular location, they can actually generate a quote request on line which would go into the network trading team of those particular companies and they will come back with a quote.
Once all the legal and financial stuff is agreed between these carriers, they can just click a button and the cross-connect will automatically be provisioned at the meeting point that they're in. So if they can connect in Hong Kong, for example, then they would obviously get a quote from Hong Kong to the local customer's premises and the cross-connect will be placed in Hong Kong, for example.
How would a carrier interconnect to this exchange?
There are obviously many types of ways to connect.
In this particular example, we have a metro area where you have three data centres, IBX-A, IBX-B and IBX-C. The carriers can choose to connect via a single circuit, so that's carrier A; via link aggregation onto the same switch, so that's carrier B; via multi-chassis LAG within the same location, so that's carrier C; and via multi-chassis LAG with a redundant location, so that's carrier D.
The current service we are offering is gigabit ethernet and 10 gigabit ethernet and these different types of interconnection services, modes.
So within the ethernet switch fabric and within the community, as I mentioned earlier, this is a VPLS pseudo wire type service, so for example, between carrier A to carrier B, they would interconnect via a particular V-LAN and we would connect those two V-LANs using an MPLS point-to-point pseudo wire.
Another service type would be a multi-point type service. If you look at the blue line, you've got carriers B, C and D interconnected using basically VPLS and under metro ethernet forum language, this would be called E-lines and E-LANs.
That's about it. So each carrier would interconnect with an 802.1Q trunk and we would interconnect virtual cross-connects on to particular V-LANs, essentially.
So that doesn't sound very exciting. That just looks like a switch with lots of V-LANs and that's true, but a lot of the challenges with carrier ethernet is that, you know, the providers might not have the same V-LAN settings, it will be almost impossible to go through different parts of the network on the same V-LAN, so we do offer the translation service on the carrier frame, so for example, the V-LAN frame type, the V-LAN ID, et cetera, and we also help maintain the QoS between the different providers. I'll go into a little bit more detail on that.
Unfortunately, even with all the guidelines that the community has provided and MEF has defined, there is really no standard service for ethernet circuits internationally or domestically. Some people are offering, like, three classes of service, some people are offering five classes of service and some people offer no class of service.
A lot of times, these services are designed for the corporate who will be demanding end-to-end service level guarantees. So the good news is all of these service levels, all of these guarantees are written inside the service provider description.
What we are also doing is really to support re-mapping of different carriers, service level between networks and between the customers' end points.
How do we do that?
Most of the carrier ethernet services that we have come across, a lot of them are marking the service via what they call P bits, which is actually a field inside the 802.1Q frame and what we do is we mark those P bits and we map them into an internal class of service based on the MPLS EXP bits.
On ingress, we mark those traffic packets and on egress we again re-mark them based on the egress provider's service profile.
So this is an example of the standard translation table that we have. We have from a single class of service up to a possible eight classes of service, based on the 802.1Q P bits, and also the MPLS EXP bits that we are mapping them to. If a service provider offers eight classes of service, great. If a service provider offers one class of service, it will be marked under EXP bit zero, the first EXP value.
Again, on the egress, what we have is a similar mapping table, so if the service provider on the other end offers five classes of service, then this is how we'll map the service.
So what this means is, let's say one service provider on one side offers three classes of service and they are interconnecting to a provider that has five classes of service. As you can see, if you just take a look under "3", the orange box, the first class of service will be mapped to the first class of service on the other side, the second class of service would be marked as EXP 4 on the egress to the service provider and the expedited forwarding would be marked with EXP 5 on egress.
For example, obviously, this can be customized, if necessary, if a customer wants certain class of service to be mapped differently, we can do that. But this is the default setting that will actually work with the carriers on each end to do.
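The two-stage re-marking described above (provider P bits to an internal EXP value on ingress, then EXP to the egress provider's P bits) can be sketched as two lookup tables. Every value in the tables below is a made-up example, not the mapping shown on the slides, and the profile names and the best-effort fallback are assumptions.

```python
# Hypothetical profiles. Keys and values are 3-bit fields in the range 0..7.
# Ingress: a three-class provider's 802.1Q P bits -> internal MPLS EXP.
INGRESS_3_CLASS = {0: 0, 3: 4, 5: 5}
# Egress: internal MPLS EXP -> a five-class provider's 802.1Q P bits.
EGRESS_5_CLASS = {0: 0, 1: 1, 2: 2, 4: 3, 5: 5}


def remap_pbits(pbit, ingress_profile, egress_profile):
    """Return (internal EXP, egress P bits) for a frame's ingress P bits.

    Unknown markings fall back to 0 (best effort) in this sketch; a real
    deployment would follow whatever the carriers agreed in their profiles.
    """
    if not 0 <= pbit <= 7:
        raise ValueError("P bits are a 3-bit field")
    exp = ingress_profile.get(pbit, 0)
    return exp, egress_profile.get(exp, 0)


for pbit in (0, 3, 5):
    print(pbit, remap_pbits(pbit, INGRESS_3_CLASS, EGRESS_5_CLASS))
```

Keeping the ingress and egress tables separate is what lets a three-class provider hand off to a five-class provider without either side changing its own markings.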
The P bit setting is one thing. Obviously we need to adhere to the QoS policies that are provided by the customers. So each class of service, the green bits, the blue bits and the red bits, are obviously part of an ethernet virtual circuit and basically, those queues would be transmitted based on the priority and then within each of the physical circuits there will be multiple ethernet virtual circuits, the different V-LANs. Again, there will be a port scheduler to make sure that the QoS are adhered to before sending it on to the network. These are generally checked on ingress.
On egress, this would be very similar to the ingress QoS policy. Again, based on the EXP bits that were mapped, we will have the settings, the transmit settings, the priority settings on the different queues.
Then all the port scheduler would do is go round in the priority base and send that to the egress port.
This is probably the biggest challenge with international ethernet service delivery. Ethernet was originally designed to be a local area network protocol, and throughout the years it has been extended and extended to deliver services from wide area networks up to carrier high-speed backbones.
In order to continue to use that properly, we actually need to work on some OAM services, that is, operations, administration and maintenance, because traditionally, with end-to-end ethernet services, it either works or it doesn't. So if there was packet loss, if there were errors, you won't know which part of the path introduced that error and it's actually very difficult to troubleshoot.
The good folks at IEEE and ITU have worked very hard on this and have come up with a few standards, so 802.1ag and ITU-T Y.1731. Basically what this provides is the full sectionalization capability that would be required in an international network under a hierarchical maintenance domain. What that means is on the lower part of the diagram, as you can see, there are different levels that you can set. For example, if you set the maintenance domain level to 4, you can actually manage the end to end from NID to NID, from the A-end NID to the Z-end NID. The NID is the network interface device; it's kind of like a mini ethernet tester that carriers generally deploy on the customer's premises in order to make it easier to troubleshoot and not need to send out personnel to the customer site to do testing.
For example, if you set the maintenance domain level to 1, then that little green arrow reduces the scope of that testing domain to a particular service provider.
Based on the MEF recommendations, each service provider would be on a different maintenance domain to allow sectionalization and troubleshooting of the end-to-end service.
In addition to that, ITU-T has made Y.1731, which actually provides frame loss measurement and frame delay measurement within each maintenance domain.
Some of the lessons learnt: Equinix has deployed this service for about a year and a half. A lot of the difficulty is really with the initial profile setting, the determination of how the QoS should map. Once we have actually set that up, it's actually to test against that defined policy. That is a process that we continue to work with our carrier partners on.
In the industry, the general standard testing for ethernet services is RFC 2544 and we found that standard to be fairly limiting in terms of the international ethernet services. Even though it tests for throughput, burst rate, frame loss and latency, it is not OAM aware, so it doesn't actually know which parts of the network were generating the frame loss or the latency.
It is also not multi-service aware. Because RFC 2544 basically tests ethernet frames, it doesn't test any services. For example, let's say you are using TCP/IP with a 10 Mbps service delivered over a 100 Mbps physical line, so the provider would rate limit you to 10 Mbps. However, if the burst sizes are not set correctly, you are going to have a very low TCP window.
You might be only getting a TCP window of 1 and your throughput would be far less than 10 Mbps, although if you stuck a tester on both ends, you would get that 10 Mbps.
Again, this is some provisioning standards that are not necessarily standard throughout the whole industry.
The stacked NIDs problem -- if you go back to this particular diagram, the local loop provider on the right, for example, would probably want to deploy a NID to help troubleshoot for them and also the end-to-end service provider might also want to drop a NID at both end points to help them troubleshoot.
At the customer premises, you might have two NIDs or three NIDs, and depending on whether this circuit was a resale circuit, you might have a lot of NIDs.
I believe the NID vendors are working on certain things, so that one NID can actually do the different management domains, but the NID vendors obviously want to sell more NIDs as well. This is an issue that the industry is working on.
Even though OAM is a standard, I think we all understand how standards work: there are differences, there are compatibility issues, and there is certain equipment that you would have in your network that is not managed and therefore would not respond to OAM cells and could potentially cause trouble in the network.
Anyway, these are some of the lessons learned. This is an overview of the service that carrier ethernet exchange offers and I hope it was informative to all of you.
Tomoya Yoshida: Thank you, Raphael. Any questions?
Raphael Ho: Sorry, these are some of the reference documents that you can find on line to describe the ENNIs for network service providers, et cetera and you can download this slide from the APNIC website.
Tomoya Yoshida: Thank you. The next speaker is Randy.
Randy Bush: My apologies. I lost my laser pointer in giving a talk in San Juan, Puerto Rico. I ordered a new one, it came last week and they sent me a military weapon. I hope I don't hurt anybody or burn a hole in the screen.
Let's assume we have the RPKI, the resource public key infrastructure which allows us to have formal definition of who has -- this is going to be fun -- what address space all the way down to who owns -- yes, it is a little bright. I warned you. It even has a safety, remove it and it won't go on at all so it doesn't burn your luggage.
Let's assume we have what we call route origin authorizations which say, hey, this piece of address space may be announced by that AS number. When I say assuming this, this is actually running code to maintain this infrastructure and the code in the routers is now in test by Cisco and Juniper and will be delivered about the end of the year in production code.
So we assume that we have this global RPKI infrastructure and we can gather it, et cetera, and we can stick it in routers. This whole chain is actually demonstrable and I run workshops showing it and taking users through it.
You can actually see that a router can tell which are valid origins and which are invalid origins based on those data.
Therefore, essentially, as an operator, I can say if it's valid, set a local preference, if it's not found, set a low local preference and if it's invalid, drop it.
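The policy just described can be sketched in a few lines of Python. This is a minimal illustration, not router code: the ROA table, the AS numbers and the local preference values 200 and 50 are assumptions chosen for the example, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical ROA table: (prefix, max length, authorized origin AS).
ROAS = [
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
]

def origin_state(prefix, origin_as, roas=ROAS):
    """Return 'valid', 'invalid' or 'not-found' for an announcement."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_net, max_len, roa_as in roas:
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True  # some ROA covers this address space
            if net.prefixlen <= max_len and origin_as == roa_as:
                return "valid"
    return "invalid" if covered else "not-found"

def apply_policy(prefix, origin_as):
    """Map the validation state to an action, as in the talk."""
    state = origin_state(prefix, origin_as)
    if state == "valid":
        return ("accept", 200)   # assumed high local preference
    if state == "not-found":
        return ("accept", 50)    # assumed low local preference
    return ("drop", None)        # invalid origin
```

With this table, an announcement of 192.0.2.0/24 from AS 64500 is accepted with the higher preference, the same prefix from any other AS is dropped, and a prefix with no covering ROA is accepted with the lower preference.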
As an operator, I can now prevent the YouTube incident, can prevent the 7007 incident, prevent the accidental mis-announcements. So we have now cleaned up 95 per cent of what we see as origin errors on the Internet.
The problem is there is a gap. There is no cryptographic assurance of the AS path or of the origin.
In other words, when I got that announcement, yes, there was an ROA that said 4128 could announce it, but the AS path itself, let's find an AS path, there we go, that 3130 itself was not signed. So we don't know that somebody did not cheat. So we can stop accidental misconfiguration, but a malicious router could announce as the AS. They could forge the origin.
Somebody who is a black hat, OK. This would pass ROA validation. How do we stop that kind of nonsense?
The way we do it is formal path validation for the full path for the entire AS path, so we can protect against origin forgery and monkey-in-the-middle attacks on the AS path, the classic one being what Kapela and Pilosov demonstrated at DEFCON a few years back.
So we can show not only that the received AS path is not impossible, but that it is formally this is the path that it followed.
We cannot know the intent of an announcement.
I don't know whether Mary should have announced the prefix to Bob. But I should be able to test whether Mary actually did announce the prefix to Bob and somebody is not lying to me about this.
Right? So policy is whether Mary should. I can't know what policy is, because of new circuits being brought up, because of peering relations changing, et cetera, policy on the global Internet changes, you know, faster than I can change slides.
We already have a protocol to distribute policy.
It's called BGP. So the BGPsec validates that the protocol has not been violated.
OK? It's not about testing whether Mary should have announced it, it's testing whether and making sure that Mary really did announce it.
So it's about making sure you can't cheat the protocol, not about making sure the business model is correct.
Now we'll have an example and this is a trivial example of an attack. It's called a path shortening attack. Z is the attacker. B announces her prefix to W, W announces it to X and Z, and X announces it to A. And Z announces to X that he connects directly to B. He lies. He shortens the effective path.
So Z lies to X, X believes the lie, announces it here, the traffic flows this way and as it passes through, Z takes the money out.
How can I stop that? I do something called forward path signing. AS n signs that it is sending it to AS n+1, by AS number. In other words, when I'm announcing to Mas, I sign cryptographically, so that you can verify that I'm sending it to Mas and so that Peter cannot lie and say that I sent it to him.
Here we have AS0 signing that it is sending it to AS1. So this is hashed and signed so anybody can validate that it went to AS1 and AS1 says it's sending it to AS2. So AS3 cannot say, "Oh, I received it from AS1," because there's no cryptographically signed block saying so.
In this case, Z cannot say that he received the route directly from B. Because B never signed to Z. B signed to W and W signed saying I received it from B and I'm passing it to X. Similarly, X signs it to A.
So this doesn't work, it is easily detected by X as a lie, X ignores it and the money flows this way all the way to B. This is good because B stands for Bush.
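The B, W, X, Z example can be sketched as code. This is only an illustration of the chaining idea under loud assumptions: the design described here uses public-key signatures with keys published through the RPKI, while the sketch below substitutes HMAC with made-up per-AS secrets so that it runs with the Python standard library alone. The key table and all function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical per-AS secrets. HMAC is a stand-in for real public-key
# signatures; it keeps the sketch self-contained, nothing more.
KEYS = {asn: f"secret-{asn}".encode() for asn in ("B", "W", "X", "Z", "A")}

def sign_hop(prefix, signer, target, prev_sig=b""):
    """Signer attests that it sent the prefix to target."""
    msg = f"{prefix}|{signer}|{target}".encode() + prev_sig
    return hmac.new(KEYS[signer], msg, hashlib.sha256).digest()

def announce(prefix, path):
    """Build the signature chain for path = [origin, ..., receiver]."""
    sigs, prev = [], b""
    for signer, target in zip(path, path[1:]):
        prev = sign_hop(prefix, signer, target, prev)
        sigs.append((signer, target, prev))
    return sigs

def verify(prefix, sigs, receiver):
    """Check that each hop was signed towards the next AS in the chain."""
    prev = b""
    for i, (signer, target, sig) in enumerate(sigs):
        expected = sign_hop(prefix, signer, target, prev)
        if not hmac.compare_digest(sig, expected):
            return False  # signature does not match the claimed hop
        next_as = sigs[i + 1][0] if i + 1 < len(sigs) else receiver
        if target != next_as:
            return False  # the hop was signed to somebody else
        prev = sig
    return True
```

The honest chain B to W to X verifies at X. If Z takes B's signature, which names W as the target, and appends its own hop towards X, verification fails because B never signed to Z.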
We have spent the last couple of years designing a protocol that actually started out in about 1999 and it is similar to a protocol that started out in 1999 called SBGP out of BBN and it's called BGPsec. It's been developed outside of the IETF by a large and diverse group. It assumes that consenting routers will use capability exchange to say, yes I can do it. This is a change to BGP, so it has to be a new capability.
Since it holds signatures, things could get a little big, so we have to remove the 4096 byte PDU limit for updates and that's already progressing through the IDR working group. If it's not agreed, then only classic BGP data are sent.
So every router can have a key. They don't have to.
If you want one key per AS, you can do that. But it means if I use a separate key for every router, if my Busan router is compromised by bad guys, the rest of the routers in my network are not threatened.
Key distribution is going to be a little more complex. There is a cute thing: the router could generate its own key pair and send the certificate request up to the RPKI for signing. Per-router keys look like this. We already have a certificate that says that I own this AS, so the AS can sign all the keys for its routers, and the name of the certificate, kind of, is the AS and then the router ID. So every router has a certificate; that every router has a public/private key pair is what's more important.
So when I originate this AS, from this AS, this used to be the standard BGP announcement. Prefix, AS number in the path, hand it to the next person.
Now, I also have to tell you which router it is and what the forward AS is, who I'm handing it to, I name them specifically and I sign it.
As you have seen before, here is the next one in the path. So now we have an assured path. When I receive an announcement that says it started at A, went to B, went to C, went to D, went to E, nobody can lie about that.
I only need to test this at the provider edge.
I only need to test this at my border routers. It's not in my IGP. I don't really need it in my iBGP except to transport it across my network and to deal with it in route reflectors.
So it's used inter-provider only and note that I can upgrade my edges incrementally. In other words, I can get a nice new router that has this feature in Busan and, six months later, install it in Seoul. What's nice is your enterprise customer has to do far less.
An enterprise customer is multi-homed, because they're BGP speakers we presume, so they are multi-homed, so they're trusting their traffic to these two people, so they can trust that these people are validating the route. They don't need to validate it, which means they don't need to have the entire crypto-database or any of the validation stuff. All they have to do is sign their prefixes and that protects their prefixes and their BGP as it goes up.
So that means they only need to have one key, their private key, sign the announcements and this can be done with current hardware. No hardware upgrade. So you can do this on a 7200, at a customer edge.
It's meant to be incrementally deployed and it does not require a flag day.
It is specifically designed not to increase operational data exposure by ISPs. Many ISPs are sensitive to publishing their peering information. So they don't really publish everything in the IRR. So this scheme gets around that problem. It just uses BGP normally, does not add to public data.
Confederations and your iBGP, confederations actually look like eBGP internally, so route reflectors and confederation boundaries, if you wanted to carry through your network, then those routers have to be BGPsec capable.
It only checks the prefix and the AS path. We don't understand the security threats and what protection mechanisms we want for things like communities, MED, so forth. Since we don't know what the security threat is or what protections we need, we don't sign them. Note that things like no export are not supposed to be transitive anyway.
The proposal as it sits in the IETF today is quite unoptimized. This is intentional, because we wanted to get the semantics correct and we wanted it simple, so we could understand that it was correct and prove it's correct.
We presume that there will be plenty of hacking and improving things and optimizing things over the next years as the design is finalized.
One example I like to give for optimization is how to deal with prepend counts. A lot of people stack a lot of preps and you don't want to repeatedly sign 25 prepends. So I have proposed that we just have a prepend count in what's signed as another attribute.
What's cute about that -- I hope I have it in the next slide, yes -- is that there is a 1 byte prepend count.
Transparent route servers -- the problem with route servers is, if they are in AS, they can't sign for me with my key because it's my private key. The reason you want them to be transparent is so that somebody downstream calculating the AS path length does not see the increase.
So let the transparent route server sign with a prepend count of zero. BGPsec speakers will calculate the path length by summing the prepend counts and so the route server will still be transparent as far as AS path length's calculations go. When a BGPsec speaker passes a signed announcement to a non-speaker, it has to strip all the signature data anyway and it expands the prepend.
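The prepend count idea can be sketched as follows. The representation of a signed hop as an (AS number, prepend count) pair and the function names are assumptions for illustration; as noted, this optimization is not in the formal documents.

```python
def path_length(signed_hops):
    """AS path length as a BGPsec speaker would compute it.

    signed_hops is a list of (asn, pcount) pairs, one per signature.
    """
    return sum(pcount for _asn, pcount in signed_hops)

def expand_for_non_speaker(signed_hops):
    """Strip signature data and expand prepends into a classic AS path."""
    as_path = []
    for asn, pcount in signed_hops:
        as_path.extend([asn] * pcount)  # pcount 0 contributes nothing
    return as_path
```

For example, with hops `[(64500, 1), (64510, 0), (64520, 3)]`, where 64510 is a transparent route server, the path length is 4 and the expanded classic path is `[64500, 64520, 64520, 64520]`. The route server signs, but it does not lengthen the path.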
This is an optimization, as I said, but it's not yet in the formal documents in the IETF because we're trying to get those right first.
This uses the global RPKI. So the way the RPKI data are delivered to routers will get bigger and more complex, but that's between the RPKI and the routers, et cetera. You won't see it.
Origin validation is still assumed, as it works today and is being delivered and tested now, so you can fall back if you don't speak BGPsec, you still have origin validation and if you do have it, the fact that the ROA is in the router already, you don't have to include it in the signatures. You get it for free.
This is just another BGP decision. The router should not do anything automatically. The router should just mark this as valid or invalid and let me with my local policy, decide what to do with the result. I'm sure there will be plenty of knobs provided by the vendors.
So what are some of the consequences? Everything can't be perfect. It's going to require faster hardware. Because of the cryptography, and because the router is going to have to store essentially much of the RPKI, so memory utilization is going to go up. You don't know this, but almost all the routers in existence, the code is 32 bit model, not 64. We're talking about something that's a few years out, because of that.
But all along this path, we have found that things actually turn out -- router vendors, the programmers like Ker and Hanis, et cetera, seem to be very good. It turns out that for origin validation, for any prefix to be validated, in a Cisco 7200, is taking 10 microseconds. That's the length of time testing one prefix in an access list. So if you have an access list route filter of 100 entries, this is on the average 30 times faster.
The amount of time it takes to dump the entire set of ROAs for a full BGP table into a router, 350,000 ROAs into a router, we started testing this a couple of weeks ago, we wanted to draw graphs, we can't get the data fast enough because the whole thing only takes about 4 to 5 seconds.
This stuff turns out to be faster than we think, but we just really don't think it's going to fit in the current generation of hardware. Maybe in an ASR 1K.
That is a 64 bit model.
On the size estimation, the American National Institute of Standards and Technology did some modelling. I wouldn't bet a lot of money on this model, but it gives us a rough feeling. They're assuming a deployment model that's low. They're assuming deployment starts in about 2014, three years from now.
Nobody has developed the code yet, et cetera. We are talking about before we break the 4 gig model memory, we're some years out, so this is thought not to be a major problem. I'm not that confident in this model, but it gives us a feeling.
Among other things, I don't know if you know it, but in a BGP announcement, especially at start-up, address prefixes which all have the same identical attribute set, the same identical communities, the same AS path, et cetera, you can put in one BGP packet a list of prefixes that share that attribute.
With this, you can't do that because when it gets to the next router, that list of prefixes is going to be broken up and distributed differently under policy, which would break the signature. So you can't have PDU packing any more. It's going to be one prefix per announcement.
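The cost of losing PDU packing is easy to see in a small sketch. The route representation (a prefix plus a hashable tuple of attributes) and the function names are assumptions for illustration only.

```python
from collections import defaultdict

def packed_updates(routes):
    """Classic BGP: one update per distinct attribute set."""
    groups = defaultdict(list)
    for prefix, attrs in routes:
        groups[attrs].append(prefix)
    return list(groups.items())

def per_prefix_updates(routes):
    """Signed announcements: one update per prefix."""
    return [(attrs, [prefix]) for prefix, attrs in routes]
```

If three prefixes share one attribute set and a fourth has a different one, classic packing needs two updates while the signed form needs four.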
We have done some serious measurements on real routers, et cetera. The penalty is not large. It's less than 50 per cent.
It turns out, incidentally, from those experiments, that the vast majority of the time of when you receive a full routing table is nothing to do with path calculation, it's nothing to do with the transmit time, it's how long it takes to go from the routing table, the RIB, to the line card on the FIB. That's where you're spending your time in your router. So nothing here is going to improve that.
Proxy aggregation is dead. That's OK. Nobody uses it.
Lastly, this does not lock the data plane. In other words, the routing table could say, do this, but you can have default pointing out over there. That's your choice.
As it turns out, as you can see in the 2009 paper we did, we measured -- and I think I have done this presentation at APNIC or APRICOT -- that 70 per cent of the ASs in the default-free zone in fact have some form of default. So really what we call the default-free zone is not that default free.
These are the people that are to blame for this.
Some of them are in this room. Notice they're from places like Google, vendors, government, operators, et cetera, even some academics, from Asia and from the States and from Europe. So we hope that the fact that vendors and operators and academics were involved, the security will be reasonable and it will be deployable.
I hypothesize that the hell that we have gone through trying to use DNSSEC was because of a lack of operational and vendor involvement in the original design and it was utterly non-deployable. We believe this will actually be reasonably deployable.
Masato Yamanishi (SoftBank): Thank you for giving the presentation. I have a comment about the slide for new hardware generation. I think you mentioned that upgrading for IPv6 support is a very good chance to --
Randy Bush: This was essentially said by somebody from a vendor. The routers today that you and I have or that you and Mas have running IPv6 are not doing everything in reasonable hardware. In Juniper, they're often going around the TCAM twice. In Cisco, we don't want to think about what's happening.
If v6 actually takes off, and we pray it does, then we're going to have serious v6 traffic. To handle that traffic, we're going to be upgrading our routers.
So what this vendor said is this v6 router upgrade cycle will be about the same time you're going to want to do this cycle and so they will try to plan to give us both, because both of them are new ASICs and new code.
Masato Yamanishi (SoftBank): I understand your point, but I just want to say that upgrading to support 100 gig E is also a good chance to implement this technology.
Randy Bush: Good point. Actually, no. We need 100 gig E sooner. We need 100 gig E now. We don't need full speed v6 now.
Masato Yamanishi (SoftBank): Not now for me.
Tomoya Yoshida: Still Cisco and Juniper are the only vendors currently support --
Randy Bush: Well, actually, I believe there are Linux UNIX based versions also, but no other hardware-based vendor that I'm aware of is doing the origin validation code.
I have been approached by -- no, no, no. Huawei has a team on it now and I have been approached by some others. If Cisco and Juniper are doing it, the industry will do it. End game.
Masataka Ohta: Did you say 40 seconds was for signature generation?
Randy Bush: No, I did not.
Masataka Ohta: Then how much time does it take to generate signature?
Randy Bush: We have data on that. I don't believe the data.
Masataka Ohta: OK. I also think that it is necessary to confirm, verify signatures which is a lot more time consuming than generating signatures.
Randy Bush: That is generally correct, yes.
Masataka Ohta: So it's, I think, a huge workload.
Randy Bush: No. Well, here we go down a rat hole. There are chips that cost less than your 10 gig E that can do this way fast enough. I don't think this is going to be a big blocker.
Steve Kent (BBN): A couple of observations. Whether or not validation is faster than signing depends entirely on the algorithm you choose.
Randy Bush: Correct.
Steve Kent (BBN): It is the opposite for RSA versus DSA or ECDSA, for example. As Randy pointed out, there is hardware available that certainly can go ahead and keep up with this, can go a lot faster, at the moment, that hardware is not, you know, the fastest versions of that hardware are expensive enough that you probably just wouldn't want to throw it in every piece of equipment, but over time, Moore's Law is on our side.
So I feel fairly comfortable, like Randy does, that we will be able to have adequate hardware support for the generation and validation. It also turns out that there are some optimizations as Randy said, we haven't worked really hard at optimizing this. We haven't worked very hard at all and there are optimizations that are being explored by router vendors that allow them to put off the validation processing if they're really busy because in many cases, they might not have to perform it at all.
Let's face it, when you're getting a lot of updates from various peers, only one of them could possibly offer the best new path, if any of them changes your idea about what the best new path is for a given prefix.
So if you validate all of them cryptographically in a sense you're wasting some time now, because there could be only one winner. If the code structure allowed it, and you could go ahead and run the selection algorithm first and say, gee, if this were valid cryptographically under BGPsec, would it change my mind about the best path? If the answer is no, don't spend the time doing the crypto validation. There are a number of potential ways to address the concerns about the performance hit that this takes from a crypto processing standpoint.
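The deferred validation idea can be sketched like this. It is a simplified illustration, not any vendor's implementation: best-path selection is reduced to local preference and AS path length, routes are plain dictionaries, the current best path is taken as given, and `validate` is a hypothetical callable standing in for the expensive cryptographic check.

```python
def select_best(current_best, candidates, validate):
    """Run best-path comparison first; verify signatures only on winners."""
    def key(route):
        # Smaller key is better: higher local_pref, then shorter path.
        return (-route["local_pref"], route["as_path_len"])

    best = current_best
    for route in sorted(candidates, key=key):
        if best is not None and key(route) >= key(best):
            break  # cannot win even if valid, so skip the crypto
        if validate(route):
            best = route
            break  # candidates are sorted, the rest cannot do better
    return best
```

In this sketch a candidate that would lose the comparison anyway is never passed to `validate`, which is where the saving comes from.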
Randy Bush: This is one of the areas where we have disagreement in the design team. I think doing fancy stuff like that is going to cost more in the long run than it will gain. Tony Hoare said, "Premature optimization is the root of all evil."
The figures -- I don't want to show you because I don't really trust them -- show that, for instance, the Intel Westmere chips have enough crypto acceleration that you don't even have to buy another chip on the board. That's just commodity processors.
So it's something, of course, one has to measure, one has to keep an eye on and one has to deal with in the design phase. But there are 36 other problems also.
This one is not one that particularly scares people.
Tomoya Yoshida: Any other questions? Right. Thank you.
Randy Bush: Thank you.
Tomoya Yoshida: The last speaker is Masataka Ohta. He will talk about the path MTU discovery problem.
Masataka Ohta: I will talk about how path MTU discovery does not work with IPv6.
It's one of several protocol problems of IPv6; because it's operational, I'm presenting the issue here.
Path MTU discovery is a method to measure path MTU, the minimum MTU of a path by ICMP Packet Too Big method.
Path MTU is set to the value contained in the ICMP packet. But as you know, path MTU discovery does not work if ICMP Packet Too Big is filtered by intermediate routers or not generated by target routers.
A problem of path MTU discovery is that the path may change, and if the path changes, the path MTU may also change. Thus, path MTU discovery is designed to periodically send a packet larger than the current path MTU to detect an MTU increase from possible path changes.
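The sender side behaviour described so far can be sketched as a small state holder. This is a minimal illustration, assuming a hypothetical class name and leaving out timers entirely; the only fixed number used is the 1280 octet IPv6 minimum MTU.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU required by IPv6

class PathMtuEstimate:
    """Tracks the sender's estimate of the path MTU to one destination."""

    def __init__(self, first_hop_mtu):
        self.first_hop_mtu = first_hop_mtu
        self.pmtu = first_hop_mtu

    def on_packet_too_big(self, reported_mtu):
        # ICMP feedback only ever lowers the estimate, never below 1280.
        self.pmtu = max(IPV6_MIN_MTU, min(self.pmtu, reported_mtu))

    def probe_for_increase(self):
        # Periodic probe: optimistically return to the first-hop MTU.
        # If the path is still narrow, a new Packet Too Big lowers it again.
        self.pmtu = self.first_hop_mtu
        return self.pmtu
```

The sketch also shows the failure mode: if Packet Too Big is filtered, `on_packet_too_big` is never called, the estimate stays at the first-hop MTU and oversized packets are silently lost.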
According to the node requirements, path MTU discovery should be supported by all IPv6 nodes.
The problem is whether ISPs will filter ICMP Packet Too Big or not. My thought is that it will be filtered.
This is path MTU discovery for IPv6 and the draft standard specifies path MTU discovery supports multicast as well as unicast destinations.
It means that a lot of multiple ICMP may be generated against a single packet. It is documented.
The local representation of the path to a multicast destination must in fact represent a potentially large set of paths, so a potentially large set of ICMPs will be generated. How large a potentially large set is, is the problem.
With this typical configuration, where the access network uses PPPoE or 6 over 4, the MTU is a little smaller than 1,500 bytes. The backbone MTU is 1,500 bytes. Then the sender first sends packets at the current path MTU, but it will periodically increase the MTU, so that it will periodically send 1,500 byte packets.
That causes generation of ICMP Packet Too Big messages, and because they are generated at the access network, the number of ICMP messages generated will be proportional to the number of subscribers, which can be large, very large, millions or more.
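The scaling argument can be written down as a rough estimate. This sketch assumes that every subscriber is a receiver of the multicast group behind its own access link, and that every access router answers as RFC 2463 requires; the function name is hypothetical. The PPPoE figure uses the usual 8 bytes of PPPoE and PPP header overhead on Ethernet.

```python
PPPOE_MTU = 1500 - 8  # Ethernet payload minus PPPoE/PPP headers

def packet_too_big_count(subscribers, probe_size, access_mtu):
    """ICMP Packet Too Big messages triggered by one multicast probe."""
    # Each access link narrower than the probe produces one message.
    return subscribers if probe_size > access_mtu else 0
```

Under those assumptions, one 1,500 byte probe towards a million PPPoE subscribers returns a million ICMP messages to the sender, while a 1,492 byte packet returns none.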
Some multicast routing protocols allow source address spoofing, so ICMP may be used as a DoS amplifier against source addresses. Maybe this is not a problem, because almost all ISPs do not enable multicast routing protocols.
Inter-domain multicast is not used very often, and ISPs do not allow ordinary users to send multicast packets, maybe, but it is still a problem because rational ISPs want to avoid relying on the rational operation of other ISPs.
Instead, the multicast PMTUD problem is yet another reason for ISPs to disable multicast. It is ironic because multicast PMTUD was introduced to promote both multicast and MTU discovery, which resulted in killing multicast and MTU discovery.
So RFC 2463 requires a Packet Too Big -- must be sent by a router in response to a packet. It's a must.
There is another ICMP message, Parameter Problem, which should also be generated against multicast packets.
So there are two types of ICMP which will cause implosion.
To prevent ICMP implosions, we must violate RFC 2463 and stop generating ICMP Packet Too Big and Parameter Problem against multicast packets. In addition, we should, or must, filter ICMP Packet Too Big and Parameter Problem messages generated against multicast packets.
That is the minimum thing we should do but as it is already a violation some ISPs may, perhaps will, simply stop generating any ICMP and filter all the ICMP, which means unicast path MTU discovery won't work. It's a rational behaviour against multicast path MTU discovery. You can't argue that it's against an RFC because RFC is broken.
Fundamental solution, of course, is to update RFC 2463 to prohibit generation of ICMP against multicast packets and write BCP to force ISPs not to filter ICMP. But perhaps it will take another decade or two, during which time we can't use path MTU discovery.
Without path MTU discovery, according to RFC 2460, a node must simply restrict itself to sending packets no larger than 1280 octets.
So we can't send larger packets. However, consider IP over IP tunnels. Almost all packets, perhaps especially those for TCP, will then be 1280 octets long, and if such a packet is tunnelled over an IP over IP tunnel, the packet will be fragmented if the MTU outside the tunnel is 1280 bytes. So as a realistic compromise, tunnels must use an MTU a little larger than 1280 bytes.
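The tunnel arithmetic is simple enough to sketch. The header sizes are the fixed 40 octet IPv6 header and the 20 octet IPv4 header without options; the function names are hypothetical and other encapsulation overhead is ignored.

```python
IPV6_HEADER = 40  # fixed IPv6 header size in octets
IPV4_HEADER = 20  # IPv4 header without options

def required_outer_mtu(inner_size, outer_header):
    """Outer path MTU needed to carry the inner packet unfragmented."""
    return inner_size + outer_header

def needs_fragmentation(inner_size, outer_header, outer_path_mtu):
    """True if the encapsulated packet exceeds the outer path MTU."""
    return required_outer_mtu(inner_size, outer_header) > outer_path_mtu
```

A 1280 octet packet carried in an IPv6 in IPv6 tunnel becomes 1320 octets on the wire, so it is fragmented over a 1280 octet outer path and fits only if the outer path offers at least 1320.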
To conclude my presentation, multicast path MTU discovery is broken because it causes ICMP implosion, and ISPs should filter ICMP packets, at least those generated against multicast packets, but maybe all, or there may be an intermediate filtering policy. But we can't expect unicast path MTU discovery to work anyway, and we shouldn't send packets larger than 1280 bytes, except for tunnels.
Tomoya Yoshida: Thank you. Any comments, questions, suggestions?
Owen Delong (Hurricane Electric): Breaking path MTU discovery is a longstanding issue on the Internet.
I think most of us are familiar with it, and in IPv4 it doesn't help. I agree that the ICMP implosion is basically analogous to the IPv4 Smurf attack that we all got rid of through getting rid of directed broadcast and we probably need to block multicast related path MTU discovery packets until we can come up with a better solution.
Blocking unicast ICMP Packet Too Big packets is a very, very bad idea. It breaks things, it does harm, it should not be suggested or recommended.
We actually have proof in the form of draft Weil and draft BDGKS or something like that going through the IETF process now, that if the operator community organizes, develops a presence at IETF and pushes for policy we need, in the form of RFCs, we can actually get it moved forward relatively quickly. The draft Weil has already gone through working group last call and is moving forward. I expect both to become RFC status or draft RFC status very soon, possibly as early as the next conference call.
We, as operators, need to engage more in the IETF and push the policies in the IETF that we need to make things actually work in the real world.
The reason we get these ivory towered solutions that don't actually work is because they look great on the drawing board and the operators aren't there to say this doesn't work in the real world and here is why.
Participate, change the IETF, change the rules so that we have something that works. Don't break path MTU discovery to adapt to a broken standard; fix the standard.
Masataka Ohta: But I'm afraid that it will take a decade or two, so that all the equipment will follow the new RFC.
Owen Delong (Hurricane Electric): It will take a decade if the operator community doesn't get involved. We have proof in the form of the draft going through for the shared space /10 that if the operator community organizes and makes a presence at IETF heard, the IETF moves quickly in our favour.
Masataka Ohta: No, I'm talking about replacing old equipment.
Owen Delong (Hurricane Electric): It's software, not equipment.
Masataka Ohta: But then in your theory, I think IPv4 path MTU discovery should also work today.
Tomoya Yoshida: Regarding this topic, do you have any other opinions?
Masataka Ohta: Then my final comment that I want to make about this issue is, before IPv6 was finalized, and working group watered down it -- no one cared. OK?
Tomoya Yoshida: Comments?
Martin Levy (Hurricane Electric): Can I just confirm what you're saying on this slide, the material of what you want to change and I'll go back to your summary, the slide you were on.
I'm maybe not going to disagree on the multicast stuff, but I have a real problem with the unicast statement on that. The statement we can't expect unicast PMTUD to work, if I read that by itself, I may have to accept it. But the reality is if I take the reverse of that statement, we know it works in a majority of situations and we know that when we filter it, things break. So we do our best in the operator community to make sure it isn't filtered and with the large-scale testing that went on with World IPv6 Day, we saw people who had realized that this was the one thing that they had forgotten, fix their filters and they were happy, happier. Sorry. I don't want to be absolute.
So if I take everything that you have presented, that one statement is the one I have a problem with, the one that I would like you to redress. Because leave the unicast part alone. There has been a lot of discussion about whether it works or not works, but we have to keep telling the end users and the operators to make sure that at least they can do their best to make sure that those ICMP messages go through. I wouldn't want the message to come away from this that that should change.
That's my point, if you understand, that's my focus on the unicast.
Masataka Ohta: OK. Yes. The majority of operators may not filter unicast ones, but if some operators filter, it will still be a problem. You can't say it's against an RFC yet.
Martin Levy (Hurricane Electric): OK. Interesting comment. If you believe that, then I just did a tracepath6 against your university which worked quite perfectly, thank you, very good, and against my laptop sitting here, it worked great from California. Feel free to filter your university and see how long you can survive with using v6. It won't work. It will have problems.
My comment is leave it alone. It's not perfect, but leave it alone for the moment. That's the point I'm trying to make.
Masataka Ohta: The reality is that IPv4 path MTU discovery is not working. How can you expect --
Martin Levy (Hurricane Electric): This v4 stuff, I don't understand what that is. All right. I'm talking about v6 only. That may be the clarification I should have put there.
Masataka Ohta: But same operator operates v4 and v6.
Martin Levy (Hurricane Electric): Yes, but my comment is about v6.
David Farmer (University of Minnesota, ARIN AC): Do not break path MTU discovery for unicast. It more or less works in v6. Does it work in v4? No, it doesn't. But v4 is a legacy protocol now, guys, let's get over it and move forward. But let's not break v6 because we have brokenness in a legacy protocol.
I find the multicast v6 ICMP thing very interesting and I'll have to think about it and I wouldn't argue with your conclusions on the multicast only, but for unicast, do not break it.
If you want to research why unicast path MTU discovery has problems, that would be really good, but don't just jump to the assumption that because it's broken, we should break it more. Leave it unfiltered in v6 and work on why it doesn't work in some cases.
I have a few clues on that. In v6, we have some operational -- one moment, take a step back. In v4, we have some operational practices that we have moved forward into v6 that don't work in v6. One of those is it doesn't matter what the two MTUs on a link are as long as they're both above some minimum. In v6, the two ends of a link have to have an identical MTU for path MTU discovery to work correctly. This has not been the operational practice in v4. We need to correct that operational practice in v6.
If you start filtering ICMP Packets Too Big, you're just masking the underlying problem, you're giving up on path MTU discovery and masking the true underlying problems that if we can solve those, we will actually have functioning path MTU discovery for IPv6.
Masataka Ohta: I understand you hope so, but path MTU discovery has another problem that it will load routers because larger packets are periodically sent, so it increases router load. That is another reason that path MTU discovery should be disabled.
Geoff Huston (APNIC): My comment is actually in response to David, and I hope he's listening here. I observe in the v6 path coming into this particular venue that the MTU of packets leaving in v6 is different from the MTU of packets incoming and you said there the two ends of a link have to have an identical MTU for path MTU discovery on a link basis.
I was trying to say, it's actually not working at the moment. It's not identical for us and I was going to ask if you're having problems with this site's configuration. Because I'm looking at this asymmetry and thinking this is actually fascinating, have we broken anyone by this inadvertent asymmetry in MTUs on this link? I haven't seen anyone complain and I was going to give anyone the opportunity to put their hand up and say, "I have a v6 problem," because we are running different MTUs, by accident, on the connections into this room.
David Farmer (University of Minnesota, ARIN AC): Let me expand on my comments. That was on a link basis. We have had the operational practice of setting the MTU of a router to its maximum MTU. If you have routers that have different maximum MTUs, you have different MTUs on each end of a piece of wire. The Packet Too Big ends up with what I'll call an impedance mismatch in one direction when you do that. On a link-by-link basis, the routers on both ends of the link have to have an identical MTU for path MTU discovery to work correctly. Thank you.
Tomoya Yoshida: Other comments, suggestions?
All right. So did you already share your opinion with the other NOGs or other operational communities?
Masataka Ohta: No, this is my first presentation, save my previous presentation more than 10 years ago at IETF.
Tomoya Yoshida: Some key part is very important, yes.
I think that's key discussion.
OK. Any other comments? All right. So thank you very much to the three speakers. Let's give a big hand.
Tomoya Yoshida: The next session will be at 11 o'clock.
We will take a coffee break now. Thank you.