Transcript: APOPS - Session 2

Disclaimer

Due to the difficulties capturing a live speaker's words, it is possible this transcript may contain errors and mistranslations. APNIC accepts no liability for any event or action resulting from the transcripts.

Wednesday, 25 August 2010, 11:00-12:30 (UTC +10)

YOSHINOBU MATSUZAKI: OK.

YOSHINOBU MATSUZAKI: I'm chairing the second APOPS session.

And we are a little behind schedule. Again, we have three speakers in this session. The first is George Michaelson from APNIC, talking about a second look at measuring IPv6.

GEORGE MICHAELSON: So I'm back again. I was lucky enough to once go and see University Challenge being filmed, and they do two episodes back-to-back in the studio at the same time. The presenter changes his jacket to make you think that it is happening on another day. I've put on a jersey!

This is another activity that has been the prime focus of work for the R&D group in the year. This is a body of work Geoff has been doing since the beginning of the year, and also some time before, in fact.

But, I was lucky enough to collaborate on this, as did Byron, and also Emile Aben from RIPE NCC. We have had a collaborative relationship with the RIPE NCC research group for a number of years now. It has been enormously fruitful, very rewarding for us in terms of having people to talk with, bounce ideas off and this is one where we both independently came to the same activity at the same time. And it's just been a really lovely exercise of collaborative research.

So no presentation that Geoff ever does about the nature of the address pool is complete without this diagram. You have all seen this picture many, many times before. Usually, when we do this, you know, people are focused on the nature of the curve. But it's just so noticeable. I've drained the battery on this baby!

Yep, it is so noticeable how close we are to the red circle on the right. There we are beyond the point where this is something we are talking about - we are living it.

We kind of expect things that are going to happen around mid-next year for the initial rundown on the IANA range. We expect that by the beginning of 2012 we are going to hit our own process outcomes that face the same activity. But we are not certain. If you look at this picture, which is showing how the data prediction has varied over time - so time is across the bottom - this is a very variable measure. It has been high, low, sometimes it is accelerating, sometimes decelerating. This is not a smooth, measurable activity.

Doesn't really matter how you look at it, though, the strong sense that you are going to take is that we're running out. This is a downward curve, the exhaustion dates are trending down.

The rate of the trend might change, but we are all going one place.

So there's this standard view of where we thought we were going. We had an idea 20 years ago that IPv6 use was going to grow as the IPv4 pool size wound down, and we were going to reach somewhere kind of ugly. The idea was that we would do the transition smoothly, gently, in the mid-stream life of v4, and we'd get to a happy place.

So this kind of goes to the perpetual question: How are we doing? Are we on track? But that goes to the second-order question, what are the measures that really tell us if we are on track?

We have thought up a number of questions over time, how much is supporting v6, how much is running v6, how quickly is the Internet becoming capable of doing the whole end-to-end dialogue, how long is the lifetime of dual-stack. Then you get to questions of whether you are sampling in the space, or are you doing the whole data capture - via components or via metrics? Snapshots or time series?

So you get all kinds of mathematical questions you can put in, but I'm more comfortable walking away from that one. So we decided - Geoff had this focus that if we go to the end-to-end model, if you take the view of two systems - a client and server trying to construct an end-to- end dialogue - all of the infrastructure in between has to work to make that work.

So if you measure the end-to-end, you get the provisioning, the routing, the DNS. You get all the systems that are going to be aspects of this story by invoking an end-to-end transaction, doing what people have to do to make the web work for them, mail work for them, news work for them - whatever it may be. It measures the whole system.

So some kind of analysis looking into the web server logs starts to inform you about the complete picture of all of this technology, not just the application in this specific instance.

So you take a dual-homed server - we took APNIC.net - and you count the number of distinct v6 and v4 addresses per day. You are just taking the ratio of the distinct clients that come from IPv4 or IPv6. Not the number of web hits, just the ratio of the populations of distinct clients that come over v4 or v6. So it's the number of unique addresses you see. You have to do that, because you get some impacts from automatic machine-level behaviour that will get in the way: if a robot hits your site, that is 10,000 hits scanned in one go. The impact of a single robot happening to use IPv6 could skew your figures if you didn't do this.
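A minimal sketch of that daily counting, assuming a combined-format web log called access.log with the client address in the first field (the file name and field positions are assumptions, not details from the talk):

import ipaddress
from collections import defaultdict

v4 = defaultdict(set)   # day -> set of unique IPv4 clients
v6 = defaultdict(set)   # day -> set of unique IPv6 clients

with open("access.log") as log:                        # hypothetical combined-format log
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        client, day = fields[0], fields[3].strip("[").split(":")[0]
        try:
            addr = ipaddress.ip_address(client)
        except ValueError:
            continue                                    # first field was not an address; skip
        (v6 if addr.version == 6 else v4)[day].add(client)

for day in sorted(set(v4) | set(v6)):
    total = len(v4[day]) + len(v6[day])
    ratio = 100.0 * len(v6[day]) / total if total else 0.0
    print(f"{day}  v4={len(v4[day]):6d}  v6={len(v6[day]):5d}  v6-share={ratio:.2f}%")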

That takes us to a diagram you may have seen a number of times. This is a measure that actually has quite a long life - we have been doing this right the way back to 2004. It was at sub-1% figures for a very, very long time. If you went into this in fine detail between 2004 and 2007, this was a story of increase, but everything was suppressed down below 1%. We get to 2008 and you will see a little tick up there, just underneath the '8' of 2008. That is one of our conferences. We can actually tell when you guys come on site and start doing things on-site and get exposed to IPv6, because we see a rise in the number of IPv6 hits on our website.

So then we have that wonderful kick from New Zealand - do you remember when we had this game of the Kiwis and the sheep on the web page for the conference?

Well, some guys in China posted it on a Slashdot-type site. They were saying, "Wow, look at this amazing interactive Java game you can play over v6." Our hits went through the roof.

A long tail down - we had to do some work to cut it off. There were some peaks showing up. This is the meeting cycle impact, when people are released into an IPv6 environment and things start happening. But the really lovely thing here is that in 2010, we have seen the numbers consistently rise to a point where we think we know there is a real v6 audience out there and we're seeing real traffic numbers. We are out of the sub-1% range and into the single digits.

So we thought we would refine this. Instead of just looking at ratios, there's this idea of actually pulling a little bit more out of the client - maybe running a script on them that tickles a little bit more traffic out of them - and seeing what we can find.

So the basic model is that you are expecting a client to do a fetch off the web page, but you do some scripted activity and you get that data onto a measurement server that gives you fine-grained information about what is really going on.

So the script forces the client to undertake five different fetches of an invisible image, and each of these fetches reflects a different kind of situation in both the name resolution and the nature of the transport used to fetch the image. So we have a v6-only image, but you could get the name over v4. Or the image is available on both v4 and v6 and the DNS name is available on both. All of those combinations. And we also include this idea that the name that you fetch something under is random - it is uncacheable. That has two interesting properties.
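A minimal sketch of that per-visitor URL generation, assuming Python and invented hostnames (the object names and the exact five DNS/transport combinations used in the real experiment are not given here; the set below is only illustrative):

import uuid

OBJECTS = {                                   # hypothetical test hostnames
    "v4-only":    "v4only.example.net",       # A record only, object reachable over IPv4 only
    "v6-only":    "v6only.example.net",       # AAAA record only, object reachable over IPv6 only
    "dual-stack": "dual.example.net",         # both A and AAAA, object reachable both ways
}

def test_urls():
    token = uuid.uuid4().hex                  # unique per visitor, so nothing can be cached
    return {name: f"http://{token}.{host}/1x1.gif"
            for name, host in OBJECTS.items()}

for name, url in test_urls().items():
    print(f"{name:10s} {url}")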

If the name is different for everyone who comes to your site, then you can really correlate every individual user's actions in the DNS look-up and the web look-up. If it is uncacheable, there is no risk that, because the information has been put into the system, somebody else's timings are going to vary. We decided to be a bit nicer to the user: in the context where people were looking at the website, we were worried about the impact on the site visit, so we only look at you once. We use a cookie to make sure that we don't test you multiple times. This gets you a grid of different patterns of access. If you are able to fetch on v4, and you are offered a v6-only object but you don't take it, and you are offered a dual one - the image you're offered is available on dual-stack - and you only take it on v4, that tells us that we tried, in DNS and in web, two different ways to say to you, "Come on 6," and you didn't do it.

You were told only on 6, you were told on both, and you only came on 4 - we can be comfortable that you have no true end-to-end v6 capability. There is the opposite case. We give it to you on v6 and you fetch it, we give it to you only on v4 and you don't fetch it, and on the dual-stack one you come on v6 - we say, "Wow, you are a v6-only client? You are not even dual-stack, you really only have v6 connectivity." Then we can go to the preference issues. If we see that you can fetch on v6 and you do fetch on v6, that means that you are preferring v6 - something about you makes you prefer that source - or v4. And we can even measure this rather odd corner case: we tell you that it is available on v6, and you don't fetch it at all.

Even if it is available on v4 - that is kind of scary. This gets us to a diagram that we feel is a fairly new and quite substantive offering into this measurement exercise. We know that the rate of uptake of v6 is something that has become strategically important in regions, the European Union has funded a measurement exercise. The OECD is collecting data. We feel this is the beginnings of new information that we would like to share with the community.
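A minimal sketch of turning one visitor's results into the categories just described (the field names and the shape of the results dictionary are invented for illustration):

def classify(results):
    """results maps object type -> protocol actually used ('v4', 'v6', or None)."""
    v4_only = results.get("v4-only")
    v6_only = results.get("v6-only")
    dual    = results.get("dual-stack")
    if v6_only == "v6" and v4_only is None:
        return "v6-only client"
    if v6_only == "v6" and dual == "v6":
        return "v6-capable, prefers v6"
    if v6_only == "v6" and dual == "v4":
        return "v6-capable, prefers v4"
    if v6_only is None and dual is None and v4_only == "v4":
        return "fetches nothing that has a AAAA record"    # the worrying corner case
    if v6_only is None and dual == "v4":
        return "no working end-to-end v6"
    return "unclassified"

print(classify({"v4-only": "v4", "v6-only": None, "dual-stack": "v4"}))   # -> no working end-to-end v6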

This slide is showing what happens when you tickle just a little bit more of v6 out of the user. That number sitting around 2% lifts to somewhere around 5%. 2% to 5% - that is doubling.

So there is double the amount of v6 capability out there than we have been natively measuring.

Now, there is a second message on this slide. If you look at the March-to-April period, that was a lovely scenario of increasing activity and growth in v6. Then between April and July, all of the ISPs - all of you guys - you all went on holiday. And you didn't add any more v6 into the system. We have gone into a flat region. So there are some aspects to this trend that are a little worrying for us: we are not seeing constantly rising behaviour. We think there may be a need for more initiatives to boost v6 uptake, to boost the distribution of v6 capability. The evidence is there that users are interested in using it - and twice as much as we have been telling you.

So around 5.5% are capable, and if that is about double the preference rate, and if that difference is an artifact coming out of dual-stack behaviours, that is getting us to quite a good place.

This graph is looking at the incidence of dual-stack compared to v6-only. You get this really weird situation that we actually have a measurable quantity - it is around half a percent, heading up towards 1% - of people who are happy to be v6-only in our community. I might add that our sample set of IPs coming to the website is somewhere around 5,000 to 10,000 unique IPs a day, so it is statistically significant. Yes, in the context of the global net, this is a small number. But there is a statistically significant number of people that are v6-only in the universe.

That is pretty weird.

So if we go into the subtypes under v6, if we drill down a bit here and start to look at tunnelling technologies, like Teredo versus unicast, you see differentiation happening in how people are prepared to use v6. You get the signal that 6to4 is able to offer a really quite comparable service to v6 unicast. In fact, if you look carefully at these lines, what you're being told is that there are occasions when a tunnelled v6 service is better than native v6. Now, that is an interesting statement, because there is a really strong religion out there that tunnels are bad. Tunnels are bad! Don't engineer tunnels! Go native! Yes, it is true, it is a lot better to have native v6. But we have to acknowledge that in the current scenario, tunnels can actually present a better v6 experience.
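A minimal sketch of how the v6 subtypes being compared here can be told apart from the client's source address alone (the sample addresses are made up):

import ipaddress

SIX_TO_FOUR = ipaddress.ip_network("2002::/16")    # 6to4
TEREDO      = ipaddress.ip_network("2001::/32")    # Teredo

def v6_subtype(addr):
    a = ipaddress.ip_address(addr)
    if a.version != 6:
        return "not IPv6"
    if a in SIX_TO_FOUR:
        return "6to4"
    if a in TEREDO:
        return "Teredo"
    return "native unicast"

for sample in ("2002:c633:6401::1", "2001:0:53aa:64c::1", "2401:2000::1"):
    print(sample, "->", v6_subtype(sample))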

The second thing to observe is that Teredo, which is the mechanism that is built into current-spec Windows systems and is, we believe, on by default, has a really low preference in this framework. So it is the manually configured 6to4 that is showing up in our models, not Teredo.

So if we go there and we look into this, you can actually see that there is this preference for v6 unicast: if it is available, if someone has true v6, they are going to use it. They only go on tunnels if they have no choice. If they are going on tunnels, there is a preference for 6to4.

So the auto-tunnelling capability kind of takes you to this view that if you are presented with an automatically configured tunnel, you will wind up preferring v4 in your dual-stack. That seems to suggest that the preference weightings the software developers have made are influencing the ratios that we are seeing in use of the technology.

So Teredo versus 6to4: 3% using 6to4, 0.3% using Teredo. Hang on - that is kind of strange, because we think most domestic use lies behind v4 NATs, and you can't do 6to4 behind a v4 NAT. The only technology that is going to work is Teredo, because it has all of that hunting to find the path through. So we really thought we should have seen more Teredo. This was really counterintuitive for us.

We're still working around what the issues are here. We don't entirely understand if this is about a filtering problem, if it is about disabling of systems, or if there is some other system level aspect that is interfering with deployment of Teredo. We actually have a suspicion that it may relate to some of the problems that go on here.

So we also have noticed that 6to4 has a weekend peak effect. And if we go back to something I said earlier in the morning about diurnal behaviour - if you have a diurnal pattern that is human-centric, you tend to differentiate day-to-day and weekends. If you see more at the weekend, you know it is domestic. If you see weekends dropping, you know it is business, because that's people at work doing it. When you see it rise at the weekend, it is a domestic effect. It seems to suggest that 6to4 is not yet something that corporate communities, corporate networks, are actively deploying. People are able to do this in the home network where they are interested, because they can avoid the policy restrictions.

OK, so the thing with tunnels - the classic model is that tunnelling is going to extend the packet path: you get the addition of the relay and you also get some asymmetry issues. If you look at the classic diagram of 6to4, blue is the outbound path, where the client is using a v4 network that can get to the end point. But what happens is that the tunnel traffic goes to the well-known relay - it is using the upper half of this diagram to send the packets in. Then the dual-stack server is sending the response via a relay on the v6 side. You have an asymmetry in the path and you don't have a lot of control over what those delays are. There is a mechanism you can use that will mitigate that, where you bind a route for 2002::/16 onto the dual-stack server. At that point, you have collapsed half of the round-trip delay to being the same as v4.

It is entering a v4 tunnel directly at the host. So you have really only got that tiny bit of timing excess that comes from the tunnel encapsulation; the transit time is exactly the same as v4 for that half of the journey.
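A minimal sketch of the shortcut being described: a 6to4 source address carries the client's public IPv4 address inside it, so a dual-stack server that routes 2002::/16 locally can encapsulate the reply straight back over IPv4 (the sample address is documentation space):

import ipaddress

def embedded_v4(addr):
    a = ipaddress.IPv6Address(addr)
    if a not in ipaddress.ip_network("2002::/16"):
        raise ValueError("not a 6to4 address")
    return a.sixtofour             # the IPv4 address packed into bits 16-47

print(embedded_v4("2002:c633:6401::1"))    # -> 198.51.100.1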

With Teredo, you get this other situation that there is a negotiation going on; there is a whole sequence of probe packets that gets sent out saying, "Am I behind a NAT, are you behind a NAT, am I behind a double NAT? Which way is the cone pointing?" Then, having done this negotiation, what happens in Teredo is that you prefer the server's choice. So where normally you would expect that the Teredo solution is going to take you to something local to the client, the negotiation has pushed you to something that is considered to be what the server is able to use.

So we do this wildcarding to make sure that we get no DNS effects happening here, to get rid of any possibility that you have cached state in the DNS.

We have included modules in the Apache server to get microsecond timing in our logs. We also made a deliberate decision that we are going to stick to a sequence of objects that is order-preferencing v6. We have done that because we have been concerned that some of the v6/v4 effects of knowing that something is out there and fetching it change your behaviour. We stand distinct from the RIPE NCC here, who have gone along a path of randomizing the order of fetches, so there are differences between our measurement techniques. We have gone quite deliberately with the constant-order fetch. We are using v4 as the benchmark: if it is as good as v4, that is a plus; if it is worse than v4, it is a minus.
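A minimal sketch of the comparison this produces, assuming the per-visitor fetch times have already been pulled out of the microsecond-stamped logs (the figures below are invented):

from statistics import median

timings = {        # visitor -> {object type: fetch time in seconds}, invented figures
    "visitor-a": {"v4": 0.21, "v6": 0.23, "6to4": 1.10, "teredo": 2.40},
    "visitor-b": {"v4": 0.18, "v6": 0.19, "6to4": 0.95},
}

def relative_cost(kind):
    # positive = slower than that visitor's own v4 fetch, negative = faster
    deltas = [t[kind] - t["v4"] for t in timings.values() if kind in t and "v4" in t]
    return median(deltas) if deltas else None

for kind in ("v6", "6to4", "teredo"):
    print(kind, relative_cost(kind))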

So this diagram is showing you the relative cost of doing things set against v4, where 0 means it is pretty much the same. You will notice that v6 unicast is sitting along zero line.

Most of the time, it is actually not a heck of a lot worse than doing v4. That is a really positive statement - v6 is already getting close to as good as v4 for most fetches. So then you can start to see the 6to4 costs, you can start to see the Teredo costs. These are adding around 1 to 2 seconds of delay to the fetch. You will notice that Teredo is hugely variable. Very, very variable.

So RIPE, on the other hand - and they have used a very similar technique to display this data. It is a slide from Emile Aben. The URL there is where a lot more information is available on the RIPE NCC measurement activity. We are comfortable that we are seeing confirmation of behaviour.

In this particular instance, the v6 measure is broadly the same. We get the same thing with unicast, true v6 - that it's consistent with v4. Their tunnelling figures are just a little bit different. We are interested in why that might be.

So if the tunnelling is adding somewhere around 0.6 of a second, that is a long time - that is enough to do two round trips up through a satellite, and that is the amount of extra cost that is being put on. And we don't think there is a lot of satellite in the global network any more. We have a feeling that this is coming down to some issues that relate to the deployment of the 6to4 end-points. We have had a look at some of the work that Hurricane Electric has been doing deploying 6to4 anycast. There are a lot of nodes in America and Europe, but not that many nodes in the Asian footprint. The RIPE NCC are doing their measurements predominantly in a community that has access to a lot of 6to4 routers, which may account for that difference in measure. We think that might be a signal that one of the approaches we should take as a community is to encourage more ISPs to look at deploying 6to4 end-points inside their own infrastructure.

The evidence is that you can make a significant change to the end-to-end delay if you do this. That is what we are seeing quite consistently.

Teredo is adding 1 to 2 seconds. Now, I don't know about you, but my experience of the average user is that if a page hasn't started to load in 30 seconds, they're gone. But I think 1 to 2 seconds, while it might be a lot less than 30, is actually a long time. My suspicion is that they are gone too - these guys are probably not staying around when they see that extra delay. We think that delay has a cost consequence that is both the set-up phase, with all that extra packet-hunting, and also the possibility that there is not a good distribution of Teredo relays. We are not 100% sure, but we think there may be congestion issues in the availability of this service. So what we've decided is that we will extend this experiment and include a local Teredo relay, similar to what we have done with 6to4, to try to minimize those effects and get a better handle on it.

If you are thinking Teredo is interesting, bear in mind that initial establishment cost is very, very high.

So unicast looks to be as fast as v4; auto-tunnelling does have overheads, but you have to understand the context. If we can get more relay servers globally, this is going to help a lot. There are a lot of devices that do v6. There is a fairly good chance that quite a few of you have one of these devices. If you take it out and open your Safari browser and connect to www.apnic.net or arin.net or ripe.net, you may be pleasantly surprised to see that you connect over a v6 address. They have included v6 in the last release, and we think there are a lot more v6-capable clients out there than there used to be - possibly as many as 50 million more v6 clients.

OK, so there is this interesting corner case: clients that will fetch the v4-only object but will not go near the dual-stack one. If anything tells you that there is v6 out there, you don't come at all.

That is a distressingly high number - somewhere around 0.8% to 1% are persistently saying, "No, if you tell me 6, I'm not coming." If you notice, that curve is pretty much the same curve. So this is predominantly a problem of Windows machines. And we think that this may be something happening around misconfiguration, in time-out behaviour, in application stack behaviour, but something is not looking good for some class of machines - around 1% - when it comes to dual-stack behaviour.

Could also be that the users are bombing out. We just don't know yet. We have to do more work on this.

Now, this measurement exercise is valuable to us. We think that there is information here that we are going to carry into the community about the rate of v6 uptake, but we desperately need more data. We need your help in this process. We need high-hit sites, heavy sites; we need sites in interesting locations in different economies. We would love to help collect this data. We have thought quite hard about the scripting. We have put a lot of energy into making this a low-impact technology - it is comparable to Google Analytics. If you run up Firebug or one of those web analysis tools, you would see our fetches happening on the network, but we assure you it has a very low impact on page view time, on render time. So this is not something that's going to be highly intrusive on your user base. But we would love to get you involved in the measurement. And this is where you go to see the stuff.

YOSHINOBU MATSUZAKI: Thanks George, any comments or questions? OK, sorry about that.

OWEN DELONG: Just so you know, every place we've got a 6to4 server, we have a Teredo server sitting next to it.

GEORGE MICHAELSON: That's very interesting info to get, Owen. We think that what you guys have done has certainly had a visible impact on reachability, but it is not yet showing up in terms of Teredo, and I suspect that may be the Asian footprint thing. We think you've only got three or four nodes in that region.

SHEENG WEI KU: To also look at the IPv6 since next year. We also find the same to why it is there, so much smaller than 6to4. We find it as part of most person use, when we start in the 7. When you see the open 6 to 4 Teredo IPv6. But if the hosts are there with the IPv6 address, the DNS client only send...

GEORGE MICHAELSON: So you think that there might be DNS impacts happening. So that may also be the nature of the request side of the DNS, as well as not sending. We're very interested in the work that you've been doing. I'm sorry that Geoff isn't here, but we would love to talk to you more, because we're interested in the convergence of these measurement issues. Perhaps we can all find some common ground here and think about resolutions, so thank you for doing the work.

ARTURO SERVIN: Arturo Servin from LACNIC. How useful are the log files from, for example, web servers like Apache? Are they useful to analyse?

GEORGE MICHAELSON: The qualification is that native Apache logs at one-second granularity. So if you're attempting to measure microsecond behaviour without modifying Apache, it is hard to do, because the best you can do with clever mathematics is about half-a-second resolution across multiple events within one second - that's my understanding of the maths. You can usually fudge twice the resolution out of something with enough samples, but the timings here are really tiny, so you need to modify your Apache to include microsecond timing. Now, Apache have a position that says, in general, microsecond timing is not a good idea, because there are so many variables on a web server that can give you an additional microsecond of delay. In the general case, that's true. But in this specific case, where you know that the client is coming to you, you know that the client is coming to the DNS and you can do that combination of data, it is valuable and it does work. You've got to modify Apache to get the microseconds. If you don't do that, you still get data - so yes, you can get data out of this, but it is not quite as useful; you boost it significantly if you do that. But if you kept the web logs, they are incredibly valuable for going back in time and back-projecting the figures. That's valuable.

ARTURO SERVIN: I ask the question because I have logs from 2006 with the behaviour of IPv6, so it might be interesting.

GEORGE MICHAELSON: I think that I could be persuaded to buy you beer for those logs!

ARTURO SERVIN: And also, we have started doing - well, not the same research, because we are much smaller, but we have that there and we want to work, get some information from it.

GEORGE MICHAELSON: I'd like to say that I've often been in the position of presenting research group material about v6, and the message is often fairly downbeat - we're saying it isn't happening. It is nice to be able to say that we're seeing something that doubles the amount of IPv6 that we see. I think it is a good message for us and something that we want to work on.

ARTURO SERVIN: Thank you.

YOSHINOBU MATSUZAKI: Thank you, George, and thank you. The next speaker is Shingo Kudo from Softbank. He is talking about considerations for IPv6 addressing.

SHINGO KUDO: Hi, everyone, my name is Shingo Kudo from Softbank Telecom. Today I would like to talk about considerations for an IPv6 addressing plan. I have prepared three agenda items.

The agenda: addressing for infrastructure considering the iACL, reverse DNS setting, and addressing and reverse DNS in the Softbank IPv6 deployment. First, what is an iACL, and what is the motivation? It is denying packets from outside to infrastructure devices - for example, "deny any xxx.xxx.xxx.xxx/26", applied on border routers, where the /26 is the infrastructure address block. The motivation is to hide ISP devices and topologies from the outside by not responding to ping/traceroute, and to prevent high CPU utilization caused by packets sent directly to the devices. The IPv4 situation, as a simple example: this figure has eight devices and 14 point-to-point links - 8 loopback addresses (/32) and 14 link addresses (/30) - so a /26 is enough for the infrastructure addresses, and the iACL is "deny any xxx.xxx.xxx.xxx/26". When the network is expanded, the iACL has to be changed again, to also deny yyy.yyy.yyy.yyy/26, over and over.

In IPv6, if we choose /64 for link addresses, with the same figure - 8 loopback addresses (/128) and 14 link addresses (/64) - the infrastructure block is a /60, and the iACL is "deny any xxx:xxx:xxx::/60". Is that enough for future expansion of your network? If it is not enough, you have to add line after line to the iACL again.

IPv6 addressing for infrastructure considering the iACL: IPv6 implementation is now ongoing. If you are planning to use /64 for point-to-point links and /48 for customers, the first allocated /32 may not be enough.

In the IETF 6man working group, draft-kohno-ipv6-prefixlen-p2p, on the IPv6 prefix length for point-to-point links, should be considered. If you assign /127 for point-to-point links, a /64 should be enough for the infrastructure space, even after considering future expansion. I can't read it! Sorry - many, many /127s fit in a /64.
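A minimal sketch of the arithmetic behind the two choices (the example topology of 8 loopbacks and 14 links is from the slides; the prefixes are documentation space):

import ipaddress

infra = ipaddress.ip_network("2001:db8::/60")
print("/64 subnets in a /60:", len(list(infra.subnets(new_prefix=64))))    # 16 - tight for 14 links plus growth
print("/127 subnets in one /64:", 2 ** (127 - 64))                         # ~9.2e18 - effectively unlimited
print("links + loopbacks needed today:", 14 + 8)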

And the reverse DNS. IPv6 reverse DNS delegation happens under ip6.arpa on nibble boundaries: for example, a 2001:db8::/32 allocation is delegated as 8.b.d.0.1.0.0.2.ip6.arpa, and further delegations can go down to a /36 or a /40 below it.
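A minimal sketch of those nibble-boundary delegations, using the 2001:db8::/32 documentation prefix as the example allocation:

import ipaddress

def ip6_arpa_zone(prefix):
    net = ipaddress.ip_network(prefix)
    if net.prefixlen % 4:
        raise ValueError("ip6.arpa delegations fall on 4-bit (nibble) boundaries")
    nibbles = net.network_address.exploded.replace(":", "")[: net.prefixlen // 4]
    return ".".join(reversed(nibbles)) + ".ip6.arpa"

for p in ("2001:db8::/32", "2001:db8::/36", "2001:db8:100::/40"):
    print(p, "->", ip6_arpa_zone(p))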

One side effect of reverse DNS in IPv6 - this is from our experience. For example, you plan to assign the first /36 for your infrastructure and the rest for customers. The reverse DNS for the first /36 is delegated to the NOC team, and the rest to the service provisioning team. The service provisioning team uses up all of its assigned space and wants more address space from the unused space in the first /36, but the NOC team has a different policy for reverse DNS, so the service provisioning team can't set the reverse DNS there.

Another side effect of reverse DNS in IPv6. In Japan, typical IPv4 reverse records for a customer service look like, for example, 192.168.255.255 mapping to softbank192168255255.bbtec.net. But some xSPs don't set IPv6 reverse records, because the zone becomes very huge. Assigning a /48 means... I can't read it! It takes more than one minute to read it!
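A minimal sketch of why: one IPv4 customer needs one PTR record, while one /48 spans 2^80 possible addresses, so the IPv4 habit of pre-generating a record per address simply does not carry over:

addresses_per_v4_customer = 1                  # one IPv4 address, one PTR record
addresses_in_a_v6_48      = 2 ** (128 - 48)    # every address a /48 customer could use
print(addresses_per_v4_customer, f"{addresses_in_a_v6_48:.3e}")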

I think we don't need to set IPv6 reverse records for consumer service.

NEW SPEAKER: Exactly, correct.

SHINGO KUDO: What do you think? Softbank IPv6 deployment: Softbank set forth the plan of IPv6 for everybody. Yahoo! BB started their IPv6 service in June this year. We are using 6rd technology, with a border relay which is deployed and operated by BBIX, and it is currently available to FTTH users. The 6rd specification is now a Proposed Standard, and 6rd is the most reasonable solution for us.

Addressing and reverse DNS... sorry, we assigned a /24 as the prefix for 6rd, with the IPv4 address and subnet ID following it. We can't set reverse DNS records for that, because... OK, a /24 is very huge - I can't read it!
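A minimal sketch (the /24 6rd prefix value here is invented, not Softbank's) of how 6rd derives each customer's IPv6 prefix by appending the 32-bit IPv4 address to the operator's prefix, which is also why a per-customer reverse zone is impractical:

import ipaddress

SIXRD_PREFIX = ipaddress.ip_network("2400:8800::/24")    # hypothetical /24 6rd prefix

def delegated_prefix(customer_v4):
    v4   = int(ipaddress.IPv4Address(customer_v4))
    base = int(SIXRD_PREFIX.network_address)
    plen = SIXRD_PREFIX.prefixlen + 32                    # 24 prefix bits + 32 IPv4 bits = /56
    return ipaddress.ip_network((base | (v4 << (128 - plen)), plen))

print(delegated_prefix("192.0.2.1"))     # that customer's /56, derived purely from their IPv4 address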

And addressing in the Softbank IPv6 deployment infrastructure. The backbone is divided into about 50 regions, with over 300 devices and 1,000 links in the biggest area, so we should consider IGP aggregation. We used an area ID division and put the area ID into the backbone IPv6 address range - the area ID sits here.

In summary, we should consider various factors in IPv6 addressing design. One example is addressing for infrastructure devices: without considering future network expansion and without minimizing the address space for each link, the iACL has to be changed frequently. Reverse DNS delegation and IGP aggregation are other examples of such factors. So, make a solid addressing plan before you deploy.

Thank you. Any questions or comments?

YOSHINOBU MATSUZAKI: So, any questions?

OWEN DELONG: Hurricane Electric, ARIN Advisory Council. Rather than using /127s for point-to-points and ending up with all these odd-sized prefixes, doesn't it potentially make more sense to start issuing, say, /28s to ISPs instead of /32s, if the ISP is large enough to justify that?

SHINGO KUDO: Apologies about the discussion. Sorry.

YOSHINOBU MATSUZAKI: Any other comments or questions? OK, thank you.

YOSHINOBU MATSUZAKI: The next speaker is Mike Jager, whose presentation is titled "Monkeying Around on the APE".

MIKE JAGER: Sorry about that.

Good afternoon.

I'll try to keep this short, because I know that we are running a bit late. This is the result of how I spent a weekend poking around some routers at my local exchange point back home in New Zealand.

Maybe!

In the middle of last year, we had a new port. I plugged my laptop in to make sure that it was working as it should be, and I got four v6 RAs on my laptop. They were saying, "Hey, I know how to reach the entire v6 Internet, send your packets to me." I thought that if that happens on an IXP - which I thought it shouldn't, because everything should be configured using a routing protocol, not learned from an RA like that - what else happens?

So I'm Mike Jager. I'm a senior network engineer at Web Drive in New Zealand. We're one of the largest hosting providers in New Zealand, so we have a fair share of the market. We do the content side of things. A lot of you guys do access networks and stuff. If anyone knows how to solve the v4 exhaustion thing with a CGN, but in reverse, please let me know!

So the general idea of this is pretty simple. When you are connecting to a shared network, you need to take steps to ensure that you are protecting your network from malicious activity. So it is nothing new - these problems have existed since day one of IXPs - but at least on the exchange point that I tested, people seem to ignore them. So, a quick IXP recap: it is a shared layer 2 network. The IXP operator will assign IP addresses to the members. The members stand up BGP either between themselves or with route servers that the IXP operates. So they exchange routes and packets flow.

So there are holy wars around whether peering is worth it. I understand that in the US it is often cheaper to buy transit than it is, in operational time, to peer. In New Zealand, transit is a lot more expensive, so it's pretty much a given that everyone will peer. It keeps local traffic local and reduces the amount of transit that you need, so you save money. You might get more bandwidth between peers across IXPs than across transit links, reduced latency - all that good stuff. It basically means that you get easier access to the networks of other members of the IXP. But it also means that those other members of the IXP have easier access to you. A lot of the other members are probably your competitors. So how much access to other parts of your network do you give to your competitors?

Not very much at all, right? You want to make sure that they can't see what you're doing, they cannot get commercially sensitive information out of it, or whatever.

So, because transit is typically delivered over a point-to-point link - there is a router at each end, and it is delivered across metro Ethernet or some other sort of backhaul - it is harder to send a nasty packet into someone's network. You need to convince every router in the path to forward that packet for you. And because every router has its own routing table, it may make a routing decision not in your favour; it might route the packet to a different router than the one you want it to go into.

But on an IXP, there are no routers between you and another - a target - network. It is all Ethernet. If you send an Ethernet frame into the IXP with a destination MAC address of whoever you are trying to attack, the Ethernet fabric forwards it. You don't have to convince a router to make a decision to forward it for you. It's quite cheap to do - it is a few hundred dollars a month, if that, for an IXP port. For a low cost, you have direct layer 2 access to all your competitors' peering routers. When I saw these RAs, I thought, hmmm, I wonder what else is going on? So I put a packet dumper on the APE.

So any nasty packets or misconfigurations that your router is sending, I will be able to see them. Because it is switched Ethernet, the only things I should see are unicast traffic destined for me - I had sessions open with the route servers - plus the non-unicast traffic required for operation of the exchange: neighbour solicitation for v6, and maybe some multicast.
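A minimal sketch of that kind of passive capture, assuming scapy and an IXP-facing interface called eth0 (the interface name and MAC address are assumptions):

from collections import Counter
from scapy.all import sniff, ARP, Ether
from scapy.layers.inet6 import ICMPv6ND_RA

MY_MAC = "00:11:22:33:44:55"               # hypothetical: our own port's MAC address
seen = Counter()

def classify(pkt):
    if Ether not in pkt or pkt[Ether].dst.lower() == MY_MAC:
        return                             # plain unicast to us is the expected case
    if ARP in pkt:
        seen["ARP"] += 1
    elif ICMPv6ND_RA in pkt:
        seen["IPv6 router advertisement"] += 1
    else:
        seen["ethertype 0x%04x" % pkt[Ether].type] += 1

sniff(iface="eth0", prn=classify, store=False, timeout=3600)   # watch for an hour
for what, count in seen.most_common():
    print(count, what)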

What did I see? I saw a lot of ARP - a huge amount of ARP - and a huge amount of ARP for addresses which were not in the APE /24. Who knows what this is? I cannot tell what is being stood up there or what those neighbours have discovered. Perhaps it is people peering across the APE fabric using address space which is not within the APE /24. Perhaps people are selling transit across the APE, using address space for the BGP sessions which is not in the APE /24. I saw DHCP requests, IPv6 RAs, multicast discovery floating around, and DECnet MOP, which is interesting - that is enabled by default in various IOS versions. If that is enabled on a box by default, it means that somebody hasn't turned it off. What else is also enabled on that box by default that has not been turned off?

I have OSPF crossed out. The one box that was speaking OSPF stopped about five hours after I first presented this in NZNOG, so someone was paying attention. Maybe I can send packets into another network and use their networks in ways that maybe they don't want me to be doing.

So normally, when a router routes, it picks a path based on the best path in the forwarding table. It will have the next hop IP address, it will resolve that IP address to a MAC address, and it will put that MAC address in the Ethernet header. What happens if someone, like me, ignores the routing table and just puts MAC addresses in Ethernet frames and sends those out into the IXP - pretends the routing process doesn't exist?
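A minimal sketch of that test, assuming scapy, an interface eth0, a peer's MAC address learned from the fabric, and an off-exchange host you control to watch for the probe arriving (every value below is an assumption):

from scapy.all import Ether, IP, ICMP, sendp

PEER_MAC = "aa:bb:cc:dd:ee:ff"     # hypothetical: a peer router's MAC on the exchange
TARGET   = "203.0.113.9"           # hypothetical host we control, well outside the IXP
MY_SRC   = "198.51.100.10"         # hypothetical: our own source address

# Hand the packet straight to the chosen MAC, ignoring our own routing table.
frame = Ether(dst=PEER_MAC) / IP(src=MY_SRC, dst=TARGET) / ICMP()
sendp(frame, iface="eth0", verbose=False)
# Whether it was carried off-exchange is confirmed by watching for the probe
# arriving at TARGET, as in the test described later - not by anything returned here.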

So there are a few examples of this, and how you could make use of it. The first: you have the IXP, three members, and this is me over here. Normally, what would happen is that ASX would announce prefixes to me over BGP and I would send packets back that way. Let's say ASY won't peer with me. There is nothing stopping me finding ASY's MAC address, putting frames onto the IXP fabric, and going straight into ASY. This is quite relevant for New Zealand. Six or seven years ago, one of the two big telcos decided that they were not going to peer any more. They didn't let anyone know. Peering sessions got turned off, even with people who, depending on who you ask, had a peering arrangement with them. Some other members on the exchange went, well, hang on, this is rubbish.

You cannot just turn it off. They started doing this. They learned prefixes from, say, ASY, via transit, and next hopped them into ASY directly so then they don't have to use their transit path for that.

Extending that - this one is not particularly useful on its own: I can send a frame destined for ASY into ASX. So, I mean, why would I do this? I could send it straight into ASY. But it shows that I'm using an intermediary network to reach the destination that I actually want to get to.

But now, taking that example, if I extend it and say, "Well, maybe ASY is not at the IXP that I'm at. Maybe they're at another one, or there's a private peering link between X and Y," I cannot use that path directly, unless I get packets straight into ASX and across to Y.

So, you know, extending further, instead of reaching one AS, I can reach a lot of other ASes through ASX. Maybe they are at an IXP which I don't want to build out to, can't afford to build out to, don't send quite enough traffic to justify building out to, but I would rather avoid using transit if possible.

I just send a bunch of packets for a bunch of destination ASes and ASX forwards it on.

At this point, ASM could be running transit-free from an egress point of view. If it set a default route pointing at ASX, all of ASM's outbound packets will go into and follow ASX's best path out into the Internet. This is basically - it is transit theft. ASM has no commercial relationship with ASX, but ASX is carrying ASM's packets.

So, well, what if we try this?

How many networks will move a packet outside of the IXP, or outside of the IXP members' networks, with no money changing hands? Some people said, "Just try it." Others went, "You cannot do that, that is wrong." My answer to that is that you cannot break into my house and steal my TV either; however, to stop people trying to steal my TV, I'll still lock the door when I go out.

So, this ARP-scans the APE /24. I think it has cut the side off.

So it has found 78 hosts on the APE. Now, it is sending a packet into each of those routers, destined for an international destination. Because this is a New Zealand-based IXP, it is extremely unlikely that this international destination will be within any of the peers' networks.

So of my 78 hosts that I've found, I've successfully sent a packet via 46 of them. A packet has gone into their network, across the Internet, across the transit links, out to my destination, which is in the US.

So depending on the volume of transit that you buy in New Zealand, it can cost - it depends on who you ask - but for the volumes that most APE participants will be buying it in, it will be between $100 and $500 US per megabit per month. That could be quite costly. If I start using megabits and megabits of outbound bandwidth and you don't know it is happening, and then you dimension your transit circuits, you are effectively paying for my egress transit. It is not just small networks that do this.

I went through and checked which routers were carrying the frames. I put the MAC addresses up there - I'm not out to name and shame, but anyone who's on the APE can find that out. There are some Australian ISPs on the APE - some of them are here - and some of them carry packets across their transit links for free.

There are regional ISPs, national ISPs, telcos, metro Ethernet providers, and content providers in New Zealand - which is a bigger issue for the content providers, who are typically constrained in the outbound direction on their transit anyway.

International transit providers that are connected to the APE are letting me use their transit for free. Anycast DNS nodes, some of them are there. Some banks in New Zealand, one in particular lets you use their transit.

Yeah! It is good, hey!

I don't condone stealing transit, but it is a good project.

So that is in the outbound direction. But for a lot of people at an IXP, they need to get packets back into their network. It is all very well for an eyeball ISP to be able to send packets out to the Internet, but they may be weighted 10 to 1 in the inbound direction. It may not be that useful.

How do you get the packets back in? It is complicated. You cannot just put a destination MAC address on an Ethernet frame and push it out onto the wire. You have to convince people to send packets back into you via a certain path. It means that whatever you are trying to pull data from, at the far end, has to have a route back to your network which is via the IXP. You need to somehow manipulate routing tables between you and the source of the packets that you are trying to receive, so they take the right path.

So how to do this?

There are two sorts of methods to approach this, depending on the destination address on your side of the packets. So if the destination - I've got the slides.

If you have initiated an outbound connection and the source address of that connection is from your IXP address, so your address on the IXP fabric, it requires that the far-end has a route back to the IXP fabric.

If - so that is one option. This is why you would generally not advertise the IXP space anywhere - definitely not across an AS boundary. If you announce it to your upstream and they propagate it, it will end up in the DFZ, and you are providing free transit to the IXP. But nobody would do this, right?

This is from routeviews as of January, I think.

So this is the number of routes in the DFZ to the 154.0 prefix, which is the APE /24, as seen by routeviews. It is not 0 for some of that graph.

So, unfortunately, between thinking "this would be a good piece of research to do" and actually doing it, it did drop to zero.

So a bit of a bummer. But oh well.

The one at the end is just because there is a peer at the APE who announces their routing table to routeviews for data collection. So that is not transit.

This is also - this is a good one.

This is from one of the APE route servers. Someone was announcing the APE /24 into the APE route servers, which is - yeah, always a bit of a laugh. That got cleaned up, and the route servers no longer accept announcements for the IXP prefix.

So don't announce IXP prefix outside of your network, because, yeah, people will probably abuse it.

So if you don't use the IXP prefix source address, it means that you are sourcing packets from your own address space. It means the far-end must have a path back to your network that goes across the IXP. So you have to try to - you have to try to get someone to do that. You need to get another provider between you and the destination to announce prefixes to the Internet, but deliver the packets to you over the IXP.

Remember that provider that was speaking OSPF on the APE? What happens if I speak OSPF with them and maybe they redistribute that into BGP and send it upstream? You probably shouldn't do that, but if they are speaking OSPF on an IXP in the first place, there is a fair chance they are doing something else wrong. Otherwise, you play games with your upstream's routing tables.

So normally, you will announce your /16 to your upstream, they announce it up to the Internet, and packets will come through the path as you would expect.

Now, if ASM is also at an IXP talking to the route servers and ASU is at the IXP listening to the route servers, ASU will hear that /16 via two paths; one will be via the transit link to the customer and one via the IXP.

So the ASU may local pref the routes they receive from the route servers in order to send as much traffic as possible across the IXP. Forgetting that they have customers that are also at that IXP. Back home, generally, it is not uncommon. People don't want to send stuff across transit links if they can avoid it, certainly for smaller players, it is not a cheap exercise to have transit.

So this may mean that ASU prefers the path via the IXP rather than the transit path.

And so if ASU does, say, rate-limiting based on the port that ASM plugs into ASU with, you have now bypassed that rate limiting, because the packets are not heading out that port any more - they are heading out across the IXP. ASU says, "We will try to avoid this. Let's increase the local pref of customer routes above that of the IXP." If I'm ASM, I want to use as little transit as possible. Let's say I break my /16 into two /17s, send the two /17s to the route servers, and send the /16 across the transit link. So even though ASU may have local-preferenced the prefix received across the transit link above the IXP, the longest prefix wins - the same case exists as before.
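A minimal sketch of why the more-specifics win, with documentation prefixes standing in for the /16 and /17s (local-pref is only compared between routes for the same prefix):

import ipaddress

rib = [                                                    # (prefix, learned via, local-pref)
    (ipaddress.ip_network("203.0.113.0/24"),   "transit", 200),   # stand-in for the customer /16
    (ipaddress.ip_network("203.0.113.0/25"),   "ixp",     100),   # stand-ins for the two /17s
    (ipaddress.ip_network("203.0.113.128/25"), "ixp",     100),
]

def best_path(dest):
    dest = ipaddress.ip_address(dest)
    candidates = [r for r in rib if dest in r[0]]
    # longest match decides first; local-pref only breaks ties at equal length
    return max(candidates, key=lambda r: (r[0].prefixlen, r[2]))

print(best_path("203.0.113.10"))    # chosen via the IXP despite the lower local-pref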

So if you go - if you combine inbound and outbound transit, or connectivity, and you put them together, and you make use of an AS, which is at two IXPs, can we establish bidirectional communication across that AS between two parts of your own network?

So I'm from New Zealand. There are about 4.3 million people and 43 million sheep! We are very creative with the naming of our two largest islands. We have seven IXPs, all operated by a Wellington-based company called CityLink. The two largest are the Auckland Peering Exchange and the Wellington Internet Exchange. Many of the larger ISPs are present at both of these exchange points. I wanted to have this research updated by now to determine the feasibility of this; unfortunately, I wasn't able to obtain the resources I needed in time. I hope to get it completed in the next few months. If anyone is interested, I will be happy to share it with you.

In theory, if you were to buy an APE and a WIX port, it is in the order of 100 US dollars a month for a gigabit port. If you were to buy an Ethernet service between Auckland and Wellington - I don't buy this, but asking a few people around, it seems like it is roughly about $800 US per month for ten megabits. Even at 10 megabits, it is almost an order of magnitude cheaper to steal bandwidth off someone else's network than to extend your own. If you want 100 meg between Auckland and Wellington, it is even better. So I'm interested to see how feasible this is.

Other things to think about: there is a strong/weak host model thing. It is not necessarily used for stealing transit, but it may disclose information to an attacker about your network that you would rather they didn't know. If you have a router with two interfaces - one on the exchange and one on the inside of your network somewhere - and someone sends an ARP request for the inside address across the exchange point, then the router may respond with the external MAC address of that router. So, OK, big deal - so what? You may say.

What if 10.2.3.4 is the management address of your router? Now, the attacker knows, "Oh, anything numbered out of something that looks similar to that, is likely to be a managing address of some box in the network." It narrows the scope that they will be looking at to target.

Linux is particularly bad for this. If you have Linux boxes acting as routers, which I know plenty of people do, make sure that you are not leaking information out this way. Proxy ARP is obviously very similar to this - it just extends the scope of it. So disable proxy ARP, and use the strong host model.

What if I really want to cause trouble? Let's say I'm trying to have a go at someone. I could respond to DHCP requests: "I'm your nameserver, I'm the default gateway" - let's see what happens. I could send more v6 RAs. Or I could answer router solicitations - sort of a similar thing to DHCP, but for v6.

Some of that ARP traffic was not being answered. So if I start answering that traffic - sorry, answering those requests - maybe that is a router with a BGP session configured where the remote end has gone away for whatever reason. If I become that remote end, will you speak BGP with me? If you speak BGP with me, will you filter the routes that I'm sending you? I suspect not.

If I really want to cause trouble, and end up in jail? Then I could be more aggressive. I could spoof ARP: "Hi, I'll be your new route server." That's another way of getting prefixes into a network - plenty of ISPs don't filter what they receive from the route servers. The route servers - I'm not sure on this any more - certainly didn't use to do MD5 on their BGP sessions; I hear that may not be the case any more. With the OSPF speaker, if I send a more specific prefix into them for a well-known nameserver, what could I do to them?

Are they redistributing OSPF into BGP? Another way to get routes or prefixes announced out to the Internet? And I could reset people's BGP sessions.

This is all doom and gloom, apparently.

That first point is wrong - there is no routing workshop.

But go up and get - there are plenty of workshop slides that have been done for previous events. Go and read them. They do talk about how to configure your network according to best practice. If you are at an IXP and you have not read the AMS-IX guide, read it. It's chock-full of really good information.

It comes down to making sure that you are only accepting packets that you want to be accepting. This means you probably only want packets coming into your network at the IXP that are destined for your network and your customers. Even more simply, you want packets coming into you only for destinations which you are announcing to the route servers or to your exchange point peers, things like that.

If you sell transit via the IXP, you cannot do this. Because people will be sending packets to arbitrary destinations. Don't sell transit via an IXP if you can help it. It is asking for trouble. And again don't advertise IXP prefix outside of your network.

So you need to stop those packets that you don't want coming in from coming in.

So, you know, there's a million ways of doing this. I'm not trying to tell you how to engineer your networks. There is an easy way to do it and a complicated way to do it, depending how much time you have. Some examples: either stop the packets coming in, right at the edge, or ensure that if they do get in, they cannot go anywhere useful - the packet comes in and gets dropped somewhere within that router because it has no route. If it is a peering-only router, for example, you probably don't want that packet coming in in the first place, but at least they cannot steal your transit by doing it. If you are a really small network, you may have only a couple of routers, and it probably means that your customers are not coming and going very often.

You have a lot more things statically configured and you almost certainly don't have a dedicated peering router. Your routers probably carry a default route, which means that any packet coming in, unless there are filters, will go out to the Internet. You need to apply filters on your IXP interface so that only packets with destinations within your network can come in. You can configure those statically - especially if you don't have customers with their own address space coming and going. You pretty much set and forget; you only ever have to change it if you have another address block being advertised from your AS. If you have a dedicated peering router, then you are probably a little bit bigger, and you may not want to, or may not be able to, change a static access list or packet filter every time a customer comes or goes.

I don't really buy that argument, because you have a lot of other work that you need to do to provision or deprovision a customer - make it part of the process - but maybe people don't want to do it. Ensure the peering router carries only your prefixes and the prefixes learned from the IXP, nothing else. Make sure that it doesn't have a default route - make sure that there is no default route in that box at all, not even a statically configured one. If you are a larger network with complex routing policies, you can make use of multiple routing tables. If you have customers and peering exchanges on the same box, make sure that the routing tables are partitioned off so that you cannot jump into the IXP and then out to transit or whatever.

So make sure that the same thing is a dedicated peering router but in a routing table rather than within a separate box.

If we look back at the example of ASM getting packets from the Internet through ASU through the peering exchange, this is basically one of the things that you need to think about. You don't want to reach a customer via an IXP if you have a dedicated transit link to them - you want to use the transit link. If you lose the transit link to the customer, and the customer is peering at the exchange point, you probably still want to reach the customer, even if only for peering traffic as opposed to transit. But you probably still don't want to provide - I mean, you may want to provide transit over the IXP as a back-up path, but I recommend not doing that. If you have VRFs set up, you can ensure that from your network you reach the customer via the IXP for packets from your network, and you reach the customer via the transit link for packets coming in from the Internet.

So this means that even though we are now receiving the same prefix via both transit and the IXP, the only route that will be announced out to the Internet will be the one that comes from the transit link. If the transit link goes away, ASU will still receive the /16 via the IXP, but the route to the Internet will be withdrawn, because it won't be in the right VRF.

So, again, in summary - hopefully it is clear now - IXPs are not just a case of plug and play. It is a special case. It is a shared network, and it is unlikely to be like any of the other links within your ISP. Attaching to the Internet is a risky business anyway - there were talks today about people scanning and looking for vulnerable boxes out on the Internet, and you need to be careful of those. But there are different ways to abuse a network via an IXP than just across the Internet.

Thanks for watching.

YOSHINOBU MATSUZAKI: Any questions? Comments?

People want to go to lunch. So, OK, thanks, Mike.

Thanks to the speakers. Today we have lightning talks - if you want to present something, please tell us, or express your opinion or interest.

Thank you.
