
Plenary Transcript

Chaired By:
Valerie Aurora, Eric Andrei Băleanu
Session:
Plenary
Date:
Time:
(UTC +0300)
Room:
Main Room

RIPE 91

2 pm

Main Hall

Plenary.

ERIC ANDREI BALEANU: Hello. Hello, welcome back from lunch. We'll start this session with Antonios from Cisco, with a BGP clock and a BGP observatory, so the floor is yours.

ANTONIOS CHARITON: Thank you very much. OK. Thank you everyone for coming. I know it's a great slot right after lunch, you are all very excited to learn about BGP. So we are going to talk about these two things that we are using to hunt stuck routes, but first of all we have to understand what these BGP stuck routes are, as a very quick recap for everyone not familiar with the concept. Basically, software and hardware have bugs, and therefore routers also contain these bugs. One particular class of them causes update messages between routers to be lost. Every time you are informed that a prefix isn't available or a path is no longer available, you need to learn about that so you can keep your routing table in order.

Sometimes this just doesn't happen and the update is not processed. We are going to focus mostly on what happens if you lose the withdrawal of a prefix, because an AS will not be informed that a path is no longer available: if a cable was cut, you are not going to learn about that and you are going to keep sending traffic to this destination.

Or if a link is congested and someone tried to do some kind of draining and move traffic to other links, you are going to keep sending everything to this congested link.

And this is going to cause packet loss, and of course if you have downstream customers on BGP, they are not going to be informed about any of that either, so they are also going to send the traffic to the wrong destination, instead of using the BGP best path selection algorithm and finding a path around this failure.

This happens, according to our research, hundreds of times a day on the internet, possibly more, and we have some new standards that are trying to reduce the impact of that. There is a very recent RFC that was published, the send hold timer, that will hopefully make some of that impact go away. However, it needs to be supported by vendors; software BGP stacks tend to have it, and now we need hardware routers to support this as well. So if you have a support contract and you can talk to your vendors, it would be interesting to ask about support for this RFC; it would make this problem better for everyone. As more and more ASes start to deploy this, these routes will eventually reduce and one day maybe even go away.

So the problem is that we don't necessarily know why this happens, because it's not something that was designed; it's a bug. We don't know which routers are affected, which models, which software versions; we have many open questions. This is something we started to work on last year, which was the start of this programme: figure out what the potential causes are and understand the problem so we know how to face it.

The RFC is one step of the way, but eventually there are more steps that will get us to having these routes disappear from the internet.

So what did we do last year. We were in Prague, where we told you about the research that we had done; it was presented at RIPE 89 and we told you about some findings that we had, like which ASes or implementations were impacted and so on. What we did back then was, every 15 minutes, advertise an IPv6 prefix, and then we used sources like RIPE RIS where we looked, after this prefix was withdrawn, like one or two hours later, after it shouldn't exist anywhere on the internet, at whether other ASes still see it and still have a valid route to the destination. Basically we discovered that, months after we stopped this, there were ASes that still had this route stuck in them, and we even converted the space to RPKI invalid just to see if anyone would remove it, and the routes were still stuck. I think the record was nine months, with routes being stuck in a few ASes, so that was not reassuring about what's to follow.

Now, because we listened to you and we listened to your feedback, you had some suggestions last year. The first one is, what happens with IPv4; some people still care about that, so we wanted to have a look into that as well. The previous study was IPv6 only; IPv4 is expensive, I could not get enough space committed at the time, at least not as much as the IPv6 space, so that study was IPv6 only. The other thing is that we had a beacon prefix going, and some people asked us: can you keep advertising this, because we only did it for a few weeks back then; can you keep doing this so we can check our networks? They wanted to see if the prefix disappeared in a looking glass or in internal tooling. We had academics and researchers saying: we have resources, we can commit some students or researchers, we want to look into that, maybe publish something, so it would be helpful if you could keep doing this. And there were other people from the industry, from the community basically, who wanted to understand the problem better and for whom the beacon would help them collect the data, because we are advertising from a single location and all of that is stored in Route Views and RIPE RIS; whoever collects BGP data has all of the paths and all of the information to study this phenomenon, maybe go and start asking operators, hey, what kind of equipment are you using, and so on. But we didn't have the capacity to do that ourselves.

And the answer to all of that is yes, we acted on all of your feedback from last year, and the first step is the BGP clock, because now you can learn the time from BGP, as we encode it into these prefixes. With a simple command on your routers you don't need NTP any more; we are slowly replacing everything with BGP here.

So, what is this thing? Well, basically there is the IPv6 prefix; it goes out every ten minutes and it contains the index of that ten-minute period within the year. So on January 1st at midnight UTC, the prefix with index zero goes out, and then ten minutes later this prefix is withdrawn and the prefix with index one goes out; we start at zero of course. But this time we also have IPv4: you take the hour in UTC, modulo eight, and you can see which IPv4 prefix corresponds to that. This allows us to look further into the IPv4 space that we missed in our previous study.
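As an illustration of the encoding just described, here is a minimal sketch of the arithmetic; the actual prefix values and the authoritative encoding live in the whois documentation mentioned below, so treat this as illustrative only.

```python
from datetime import datetime, timezone

def bgp_clock_indices(now=None):
    """Index of the current ten-minute slot within the year (IPv6 beacon)
    and the hour-modulo-eight slot (IPv4 beacon), as described in the talk."""
    now = now or datetime.now(timezone.utc)
    start_of_year = datetime(now.year, 1, 1, tzinfo=timezone.utc)
    v6_index = int((now - start_of_year).total_seconds() // 600)  # 600 s = 10 minutes
    v4_slot = now.hour % 8                                        # one of eight IPv4 prefixes
    return v6_index, v4_slot

print(bgp_clock_indices())
```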

So if you want to learn more, because information changes, we have documentation that's readily available right there: if you run the whois command, the inetnum and inet6num objects contain everything you need to know, and our documentation is version controlled as well; if you go to ftp.ripe.net you can find the older versions there. We document exactly what happens and which communities we are adding, to allow everybody to study these routes in more detail and tell us what they find; we are looking forward to what people find and the results that will come out of this.

So, one note: on macOS, if you ask the default (IANA) whois server and it's legacy IPv4 space, it doesn't work, so you have to add the -r flag to query the RIPE server; just a minor detail.

So, what's the process like? These prefixes originate from my personal AS, 4601, just as they did last year. Something important to note here is that this is something I am offering personally; this is not something that Cisco is doing. I am using my personal AS, personal IPv4 and IPv6, and my own connectivity, router and servers, and I plan to offer this service to the community for as long as it's valuable. So everything is originated from that. Eventually I want to move it to a dedicated AS, so that whoever wants to operate a clock of their own can maybe reuse the prefixes; that would be more helpful.

IPv6 is recycled yearly, so we can study whether routes are stuck even after periods of as long as a year minus ten minutes, and we can see cases that are very difficult to detect with other beacons that basically recycle every four hours. IPv4 is updated every eight hours currently; we found that to be a good trade-off. You can study what happens after the four-hour mark, which is usually where it starts to stabilise: whoever has it four hours later is likely to keep it for weeks or maybe even months in some cases. But it also allows us to recycle these prefixes frequently enough, because what we have discovered is that in some software there is a certain probability of the route getting stuck, let's say it's 1%, and the more frequently we do this, the more often we basically roll the dice, and this leads to the phenomenon being more visible than if we did it every day or at some longer interval.
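To see why frequent cycling makes the problem more visible, here is a quick sketch using the speaker's purely illustrative 1% per-cycle figure; the calculation assumes independent cycles, which is a simplification.

```python
# Probability of observing at least one stuck route after k announce/withdraw
# cycles, assuming an independent per-cycle probability p (the 1% is only the
# speaker's example figure, not a measurement).
p = 0.01
for k in (1, 24, 24 * 7, 24 * 30):   # hourly cycles over a day, a week, a month
    print(f"{k:4d} cycles -> {1 - (1 - p) ** k:.2%}")
```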

And the prefixes are announced at the top of the hour and they are withdrawn ten minutes later. And then the next hour, you have the next prefix and so on.

I have thought of a way to extend the use of the IPv4 space: you can advertise everything as eight /24s, then go to /23s, then to /22s, but eventually what we need, I think, is more IPv4 space, so if someone is willing to offer us some temporarily, it would be very useful to get more data and insight into this.

In any case, how do you check all of that? I operate this service, so how can you see if you are affected? You need to run whatever command your router has. I have here BIRD; whoever uses Rotonda for monitoring can look into that as well; for IOS XR, I think this is the current command for this month; and Junos. You can figure out the command for whatever you are using on your equipment, and you search for the IPv6 prefix and the IPv4 prefix. You should see up to two routes; if you see three, five, ten, a hundred, that means you have the problem, and all of the extra routes you see you should investigate, see which equipment it is, and if you can reach out and let us know about that, it would be great, so we can aggregate the data in a single place.

And if you can reach out officially, that's best, but if you have any more information that you want to share, I am happy to learn about that outside at the breaks.

So you got the picture, right? For Rotonda. Great. Moving on.

OK. So what we were thinking is that there are some existing collection networks like RIPE RIS, there is Route Views, we have our own data as well, and between them there are probably a thousand, maybe one and a half thousand, unique BGP peers, so we can get this data, study it and maybe find a way to identify which ASes have problems. And this is where we created something like the stuck route observatory; this is something that Cisco has created, so this is the Cisco part for a brief period of the presentation.

And in this tool we basically get the data from all of these public sources and we try to figure out which ASes see this route after it was withdrawn. This tool is available for free to anyone, so you don't have to log in, you don't need a Cisco account; it's a public, anonymous website where you can search by AS, and the idea is that if your AS is found anywhere in a path between a RIPE RIS collector and the origin, we notify you based on the three possible outcomes. The first outcome is that you don't appear in any path: we look for an AS here and it doesn't show up in anything, which means that everything is clean; this AS is not affected as far as we know.

The other possible outcome is that you have this prefix visible from your AS and we think, heuristically, that the route is probably stuck there. There is no way to tell unless you have access to every router, so we are trying to use some heuristic algorithms that can pinpoint where the route may be stuck.

In this case, we will let you know that it is stuck in the AS that you have searched for, and if it's yours, there is some information, like the prefixes that we detected as stuck and some AS paths, so you can begin troubleshooting.

The other possible case is that, with the same heuristics, you probably don't have a route stuck; however, one of your upstreams likely sent you a route that's no longer there, and if they did it for the synthetic routes that we are using for detection, maybe they are doing it for real routes as well. And we are giving you which ASes we think... oh, it's still the old slide here, oh, there is the right one. So in any case, yes, we tell you which ASes we think sent you a route that is stuck there, and we also give you the prefixes and the AS paths, so you can go and troubleshoot it with your upstream provider or with your partners.
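To make the attribution idea concrete, here is a naive illustrative heuristic, not the observatory's actual algorithm (which the talk does not spell out): rank ASes by how many of the still-visible post-withdrawal paths they appear on; an AS that appears on all of them, close to the origin, is a reasonable first suspect. The AS numbers are placeholders from the documentation ranges.

```python
from collections import Counter

def suspect_ases(stale_paths, origin_as):
    """Rank ASes (excluding the origin) by how many stale paths they appear on.

    stale_paths: AS paths still observed at collectors after the withdrawal,
    each ordered collector side first, origin last."""
    counts = Counter(asn for path in stale_paths for asn in set(path) if asn != origin_as)
    return counts.most_common()

paths = [
    [64500, 64496, 65536],   # placeholder ASNs (RFC 5398 documentation ranges)
    [64501, 64496, 65536],
    [64502, 64496, 65536],
]
print(suspect_ases(paths, origin_as=65536))   # 64496 appears on every stale path
```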

You can then log in to your own routers; we give you some pointers that we hope will be enough.

Now, while creating all of this tooling, we faced some challenges, and I mostly want to talk about those, because the tool itself will be out soon and you can go and use it, but that's not really the point of a technical presentation. So I wanted to tell you about some of the challenges we faced while trying to build an automated way to detect these things and report on them.

The first thing is visibility, which we basically don't have. If you look at this diagram here, in blue at the centre you can see where the prefixes are being advertised from. So I had to do something very difficult, like imagine I am the centre of the internet or the universe, and then I send data to everyone else; we can only see the paths that end up at a RIPE RIS collector or a Route Views collector, every other path we cannot see. We can only see paths that start from me and go to collectors. It turns out that, out of all the ASes that exist on the internet, around one and a half thousand ASes are visible; there are only so many between a collector and my AS right now.

Which means that we can only make conclusions about these specific ASes; unfortunately that's close to 1% of the total ASes on the internet, which is not enough. Ideally we would like to have this network grow larger and larger and we would like to have more visibility, but that's what we have right now, so we have to make do. The good thing is that this 1% is likely the more important 1%, if you can even say that, because it's all of the tier one providers and a lot of large networks that eventually have a lot of downstreams or stub networks at the edge.

And at the edge of the graph, last year there was a specific AS with 2,000 downstream customers; only 20 of them have collectors, and we see that all of these 20 have stuck routes, so we can reasonably say the other 1,980 also have the stuck routes, but we can never be sure without having data collected from them. Eventually we can make educated guesses for more than half of the ASes, but for that you have to basically take all of the customer cones and try to figure out the probability of a route being stuck there as well; maybe they have multiple upstreams and so on.

The other problem, other than visibility, is invisibility, because there are some ASes that do not appear in the AS path; when we collect the BGP feeds, we just cannot see them. For example, an IXP route server here: these ASes do not appear by design, so basically any IXP route server would probably not be visible, and if a route gets stuck in the route server, we will end up blaming almost every member, while in fact it is the IXP's fault. I guess there are some other providers that do this, even if they are not necessarily IXPs, but if a route is stuck there, we have incorrect attribution, and this is not very easy to fix with our existing visibility into the ecosystem.

Thankfully, most IXPs use up-to-date software; a lot of them use BIRD and OpenBGPD, which are implementations that do not seem to be affected by this phenomenon in any of their modern versions from the past few years. Anecdotally, there is one IXP that uses appliance route servers, and in this particular IXP we think there is a route stuck; of course it's very difficult to prove that it is there, but we see almost every member having this route stuck.

The final challenge, let's say, is that in BGP you see one AS, but this AS may have, you know, ten routers or 10,000 routers; you only see one number. You see 4601 and there could be a thousand routers behind it, and it could be using route reflectors or whatever else they may want to use. And it so happens that only one of these routers may have a problem; let's say every router is up to date but one particular one is not. That means that this router will propagate the stuck routes, and the rest of the AS, depending on how it's configured, may be fine.

And it's very difficult to identify, to narrow it down to the router or anything more specific than the AS itself. We are trying to look at the communities they advertise that have a meaning, like OK, this is North America, or this is, you know, Frankfurt, but it's really difficult; communities don't propagate across the internet very well, and there's no easy and automated way to identify which particular piece of equipment is behind a session if we are looking from far away, from the RIPE collectors and so on.

We found cases where for example there is a tier one that has a stuck route and only 5% of their customers have this route stuck, only 5% of the customers that we can see.

That usually means these are networks in a single neighbourhood that have this problem, and every other part of this tier one network is perfectly fine. And what do you do in this case, how do you attribute this route? They need more data to be able to troubleshoot it and move further.

And of course there are some internal implementation details: if you use iBGP, they may split the updates, they may send them a few at a time; if something gets stuck in a route reflector, a few routers may have it, and they may have different hardware and software versions and vendors, so you have a lot of red herrings in this research. This is something that proves to be very difficult to solve, and it's not fully solved in the final tool either; we try to give the data and let the operators troubleshoot it. We made this trade-off and we hope it's something that will allow you to eliminate this problem more easily.

And finally, yes, we spoke with some operators, we told them they have stuck routes, and they even had trouble identifying which router it was within their network, just because when you see a prefix, you don't know who sent you this prefix, who actually got it stuck in the first place. There are some tools like BMP that could help you figure out which router received it first; they can create a timeline and an audit trail of everything that happened, but BMP support in modern routers is not completely there yet, I would say; there are things like, for example, stability issues or feature support. And BMP monitoring solutions, especially with historical data, like going back through time, do not really exist in the wild so far; we don't have too many, if any, of them right now.

So these were the challenges that I wanted to mention that we faced while we were studying this phenomenon and I yield my time, thank you very much.

(APPLAUSE.)

ERIC ANDREI BALEANU: OK, we have questions?

AUDIENCE SPEAKER: Tom Strickx, Cloudflare. This is really interesting, thank you. I was wondering, from a community perspective, whether there might be a way that we can think through this like we do for AS112, where we have a bunch of community operators running the BGP side of things. If we could do it for the BGP clock; I think a bit more coordination is required, because obviously if everybody is running a clock at different intervals, you don't get the clock, but maybe that's worth having a conversation about.

ANTONIOS CHARITON: That's a great point, and I want to move to a dedicated AS and make it possible for other people to use the prefix, so we issue one ROA and then everybody can use this AS and just has to change the IRR, basically. It would be ideal if it's standard and you don't have to commit, you know, hundreds of thousands of dollars of IPv4 into that, and it would be great if more people run it. I have around, like, 2,000 peers right now; there are similar networks that have more, and if they are willing to run this it would be great to have more reachability, because we would detect more things, and it also wouldn't depend on me. So I am very happy to have this conversation and I would like this to be a community project; I will reach out to you after this talk. Thank you very much.

AUDIENCE SPEAKER: Awesome. Thank you.

ERIC ANDREI BALEANU: Any other questions? No, thank you very much.

(APPLAUSE.)

Next we have Branimir.

BRANIMIR RAJTAR: Thank you for the introduction, and thank you all for coming. OK, so I am going to be talking about how to get as much as possible from a single x86 server. First of all, I have been working in a telco, deploying Cisco and Juniper hardware, in the first period of my career, and then I started to work on a side project, building a BNG in my spare time and putting it on GitHub, and, for the record, don't use OpenBRAS as your project name; it might be a bit misleading when you type it into Google.

Actually it led to founding five by nine networks; I am currently the CEO, and as the company grew, I do less and less of the coding and mostly just get smart people to actually do most of the work, I'd say.

I am also the chair of the Croatian network operator group, where we have some beers and, if you are lucky, some presentations and meetups as well.

So first of all, what is a BNG? It's actually a network function which not a lot of people know exists. It's short for broadband network gateway, and it terminates fixed line subscribers: if you have a modem or CPE at home, it brings up a PPPoE or IPoE session, which in turn provides L3 connectivity to the internet. So, contrary to popular view, once you connect the modem at home you don't go directly to the internet; rather, you pass through a BNG that enforces your bandwidth, speed and access lists, does AAA, and so on, and from the BNG you go to the internet. Vice versa, when the traffic goes back to your home, it also has to traverse the BNG, which counts the packets and also does some basic security, etc. So in a nutshell that's a BNG; every operator has one, not a lot of people are actually aware that it exists, and it's required.

Just for context, our product: I am not going to make this a marketing slide, but I want you to understand the context of why we need performance gains from a BNG. Our system is a virtualised system; it can run on a virtual machine or a container, and it has CUPS, two levels of hierarchy: we have the dashboard, the overarching management; we have the controller, which does routing, AAA, address allocation and authorisation; and we have the forwarder, which actually does the forwarding of the packets. The idea behind the performance gains is basically that if you have a lot of bandwidth and you need five servers for it, you need to charge your customers for five servers, but if you can make optimisation gains, maybe instead of five servers the customer needs to buy only three or two, and the lower the price of this hardware, the lower the TCO, and our solution is therefore more competitive for the end customer, the end ISP. So this topic is interesting from an engineering standpoint, but going into full performance optimisation on the server was also a business driven decision.

You have a server, you run KVM or OpenStack, and on top of that platform you have a VM which is running the components of the BNG; in this case we'll focus just on the forwarding path. How do we get the packet inside the virtual machine? First of all, we have something called single root input/output virtualisation, SR-IOV: it takes the network interface card, takes a physical port and slices it into multiple virtual functions, let's say, and each virtual function can then be mapped to a different virtual machine, bypassing the host OS kernel. Basically this is how you map a virtual function of the physical NIC to the virtual machine: when the packet comes in to the server, it goes immediately into the virtual machine, bypassing the whole host operating system. It's not supported by all hardware, but nowadays it has become a standard feature of most modern NICs; it used to be rare, but now it's more or less supported and really widely used.

Once you get the packet inside the VM, you also need to bypass the operating system of the guest OS, so we use DPDK for that. The Data Plane Development Kit is an open source project which maps the interface on the virtual machine to a user space programme; it's a library in the C programming language which constantly asks the network device, are there any more packets for me, are there any more packets, and this causes 100% CPU load on the core because it's poll mode: the CPU keeps asking the NIC whether there are more packets and brings them into the user plane programme. In this case the packet will never see the guest kernel, will never see iptables; once you deploy DPDK, the interface becomes invisible to the guest OS, and now you have the packet inside your C programme and you can perform some actual optimisations.

So we actually started by building the small VM concept: OK, for the forwarding plane of the BNG, let's give it a small number of CPUs, like four CPUs, and for someone who needs more performance, let's just scale horizontally. That was OK because you have more optimal resource utilisation, you have a uniform configuration, and it's really a small number of resources to get started.

But if you have a big deployment, that can lead to a large number of instances.

At that point in time, this looked like an OK sacrifice for us because we had some really cool features like zero configuration and a lot of automation, so we said, OK, we can have a large number of instances, but they will be automatically configured and managed, so who cares. Even though all the other vendors went with the concept of a really big VM consuming as many resources as it can, we went in totally the opposite direction.

Initial performance on Intel Xeon Generation 2, which is, I would say, six or seven years old: we had 16 forwarders on a machine and, without quality of service, we got roughly 40 million packets per second, which is roughly 160 gigabits per second at a 500-byte packet size; we are measuring everything at 500-byte packet sizes. With quality of service we went down to roughly 26 million packets per second, which is a big difference if you are using shaping and queueing and things like that.
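For reference, the conversion behind these figures is just packets per second times packet size times eight bits; a quick check of the quoted numbers (Ethernet framing overhead ignored, as the round figures suggest):

```python
def pps_to_gbps(pps, packet_bytes=500):
    """Throughput in Gbit/s for a given packet rate at a fixed packet size."""
    return pps * packet_bytes * 8 / 1e9

print(pps_to_gbps(40e6))    # ~160 Gbit/s, the figure quoted without QoS
print(pps_to_gbps(26e6))    # ~104 Gbit/s, with QoS enabled
print(pps_to_gbps(200e6))   # ~800 Gbit/s, the later single-socket figure
```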

The Intel reference documents claimed to get 100 million packets per second per CPU on newer hardware, with a simpler application, but we said, yeah, it's still a significant difference in performance; we need to do something about it.

Because, like I said, the more performance we get, the less hardware is needed and the more price-competitive our solution is.

So first of all, we moved to newer hardware and optimised the data plane for it, and initially we got a 30% performance gain, great. Then we did some in-depth DPDK tuning: we needed to understand the BIOS, how PCIe works, what the CPU options are, some tweaks on the network cards, and that also brought us about a 30% performance gain.

This is actually really time consuming, because every time we change something we need to retest everything, and that takes a lot of time. So in the end, every change means a lot of resources invested, and basically there is no end to this; you could probably keep optimising for the next 20 years.

And yeah, we needed a lot of time to verify the performance of each change and also to make sure it doesn't break anything which is currently working OK.

We also did a deep dive into the code. We used the Intel VTune application, which basically puts breakpoints into the code itself and runs a really in-depth analysis of how the CPU internally handles the code. The problem with this is that monitoring the exact performance of the code impacts the performance itself, because you are using additional CPU cycles just for the monitoring. We saw significant CPU waits in some parts of the code; we needed to rewrite code to avoid those waits, we needed to rewrite loops to use multiple execution units, and we needed to rewrite code to get better branch prediction. We also tested different compiler optimisations, which is also something you have to keep doing: each time a new compiler version comes out, you need to verify the checks and the new compilation options and how they work, so basically, yeah, it's an ongoing process.

And in the end this part gave us roughly I would say 20% performance gain.

So it gets, you know, step by step, we get there.

In the end, what we saw from all this experimenting is that the CPU cache was the main problem in our case: there were a lot of CPU cache misses, and we needed to dig deep to find out how we could optimise the usage of the CPU cache. The CPU has a limited cache, from, say, 60 megabytes to several hundred megabytes on the upper end of the most expensive modern CPUs, and once the data doesn't fit in that space, it gets pushed to RAM, and if the CPU doesn't find it locally, it has to fetch it from RAM back into the cache, and this is where we lost the majority of the performance. So we needed to rewrite our code to accommodate the difference in these read and write operations. You need to lay out the data to accommodate the cache line size: when the CPU reads memory, it reads a cache line, which is usually 64 bytes, and we needed to fit all or most of a data object into that single cache line, in order not to waste multiple CPU cycles fetching multiple cache lines. This is something we really had to work hard on; we spent a lot of time rewriting everything and checking how everything works.
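As an illustration of the kind of packing being described, and not the actual BNG data structures, here is a sketch checking that a hypothetical hot-path subscriber record fits in a single 64-byte cache line; every field name and size here is invented for the example.

```python
import ctypes

CACHE_LINE = 64  # bytes, typical for current x86 CPUs

class SubscriberHot(ctypes.Structure):
    """Hypothetical hot-path record: only the fields touched per packet, so a
    single cache line fetch is enough. Rarely used data (counters, names,
    timestamps) would live in a separate, colder structure."""
    _fields_ = [
        ("ipv4",        ctypes.c_uint32),
        ("session_id",  ctypes.c_uint32),
        ("vlan",        ctypes.c_uint16),
        ("flags",       ctypes.c_uint16),
        ("rate_down",   ctypes.c_uint32),   # bits per second
        ("rate_up",     ctypes.c_uint32),
        ("next_hop_id", ctypes.c_uint32),
        ("qos_class",   ctypes.c_uint8),
        ("pad",         ctypes.c_uint8 * 3),
    ]

assert ctypes.sizeof(SubscriberHot) <= CACHE_LINE
print(ctypes.sizeof(SubscriberHot), "bytes, fits in one cache line")
```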

Yeah, so we separated the read and write data sections and the read and write threads; what was frequently used we made as small as possible, while some data structures like counters etc., which were rarely used, were pushed into a bigger structure which is not often pulled into the CPU cache. We needed to identify additional PCIe bottlenecks, so we grouped the transactions, which is really low level work, and we introduced a more advanced hashing algorithm; it's maybe not that important when you only have a default gateway, but for example if you have MPLS and a whole BGP routing table, it amounts to big gains. We used this technique where we reduced the lookup to one CPU cycle and reduced the in-memory routing table by 90%. Just to put it into perspective: we can fit the full internet routing table, which at the time we were testing was roughly 800,000 prefixes, into 4.7 megabytes, which I would say is really good, and we were able to get roughly 77 million lookups per second, with different source addresses going to random destination addresses, so this was a really big change from the previous routing algorithms we had.
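As a back-of-the-envelope check on the quoted table size, using only the numbers given in the talk:

```python
prefixes = 800_000
table_bytes = 4.7 * 2**20        # 4.7 MB as quoted (similar result if you use 10**6)
print(table_bytes / prefixes)    # roughly 6 bytes of state per prefix on average
```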

And while we were doing this, we actually saw that the small VM concept is inefficient, because if you have multiple BNG instances on the same machine, a lot of information will be stored multiple times in the cache: routing tables, interface tables, etc. The routing table is a good example of this, because if you have 16 VMs on the same physical machine doing the exact same thing, the routing table will be the same on all 16 VMs, so it will unnecessarily use 16 times more memory than it needs to.

In the end, yeah, we saw why the other vendors were doing the big VM, so we threw away the small VM concept to utilise the whole CPU and server, added a QoS mechanism and the route lookup improvements, and you see the bolded number at the end: 1.6 terabits per second from a single server. This is a really powerful CPU, the latest Intel Xeon generation; one VM can handle 200 million packets per second, so that's roughly 800 gigabits per second, but when scaling to two CPUs in the same server you get to 1.6 Tbps.

We didn't have the capacity to test the whole server, because it's a lot of ports, but you can imagine that a single server with two CPUs is basically two separate servers fused into one box, so we can approximate the capacity of the whole server from one CPU. I also want to point out that we did QoS improvements, with quality of service and without quality of service: now we have some logic that turns on quality of service only when the customer actually needs it, because enabling quality of service leads to roughly 30% less throughput on the server, and if only around 1% of customers actually use QoS at a given time, you don't want to lose the performance of the whole server just because a couple of users are using quality of service.

So basically, in the end, 1.6 Tbps is on a really good server, but don't think that this server costs 50,000 euro; if you go to a store and buy it, it's approximately, I would say, 15 to 17,000 euro, so it's not really that expensive a server when you compare it to the kind of performance you get from it.

Lessons learned: basically, always use the latest hardware. It sounds obvious, but each generation of servers really is much better than the previous one, 20 to 30% better performance in each new generation of CPUs. And yeah, in the beginning we were actually sceptical; when we started, about eight years ago, these kinds of speeds were not really achievable. In the beginning Intel had 100 million packets per second in their best case and now we are up to 1.6 Tbps, so it's a really big change in how things have evolved over those eight years. Servers have evolved, and there's a big difference: there's more cache, they have higher frequency, they have lower power usage, which is also really important. And it's not just CPUs; it's not listed here, but I just saw a couple of days ago that there's now an 800 gig NIC for servers, so you can put two times 800 gig cards into a server; it's really powerful, and you can really see the benefit of using a server compared with a hardware based solution.

What we also did is embrace the change: we moved from the small VM concept to what's called the big VM concept, but it's not big VM in the sense of losing flexibility; we made it so that it can accommodate any number of CPUs you give to it, so you could say that we merged the small and big VM concepts into a single, let's say, type of VM. We take any hardware the customer gives us, because usually customers maybe have some servers that are not used and they might want to use them for the BNG; they give us what they have, we put the forwarding instance on top of it, and regardless of how many CPUs, how much RAM etc. they have, our VM can accommodate that and use it most efficiently.

What's also important is that you need to use the available tools: open source, closed source, AI. I like using AI; it helped us sometimes when we were stuck, not by giving us the correct answer, but maybe by pointing us in the right direction or giving us a new idea, some inspiration for how to do things. And you need to spend time investigating tools that can help you in your work. Like I said, continuous optimisation is a must; this is a never ending story, we will not wake up one day and say, OK, this is it for performance, we won't go any further. You can continuously work on it and continuously improve it; there's no ending in sight, and that's optimistic: what else can we get out of a single server? You need to be curious, have a learning mindset, and what gets measured gets improved. That's all from my side. Thank you.

(APPLAUSE.)

ERIC ANDREI BALEANU: Questions?

AUDIENCE SPEAKER: Hello, I have two questions. First, are the performance measurements between two different cards, or does one card send back out the same line? Because usually it is much faster to send a packet back out the same interface than to forward it between two different cards.

BRANIMIR RAJTAR: We use an Ixia for this.

AUDIENCE SPEAKER: The server has two interfaces or just one?

BRANIMIR RAJTAR: Two actually; we actually use eight interfaces, we managed to fit eight on a single NUMA zone.

AUDIENCE SPEAKER: The second question: you mentioned that you scaled to two sockets, to two CPUs, but the generated traffic was connected to one. What is the difference when the processing is on the remote CPU?

BRANIMIR RAJTAR: Basically, to measure the whole server: if you measure a single socket, it's basically a single server, and two sockets are basically two servers just in the same chassis, so we just approximate: if you have the numbers for one socket, you multiply by two and you get the performance for the whole server. Unfortunately we don't have the capacity to test the whole server all together.

AUDIENCE SPEAKER: OK.

BRANIMIR RAJTAR: It's not perfect, but.

AUDIENCE SPEAKER: If you have two chassis, then you have to have two networks.

BRANIMIR RAJTAR: In production you will have full capacity on both sockets, a NIC on one socket and a NIC on the other; for the performance testing we are only using one socket, and the other socket is empty, doing nothing.

AUDIENCE SPEAKER: OK. Thank you.

ERIC ANDREI BALEANU: Next question.

AUDIENCE SPEAKER: I have got a couple of questions about your lcore architecture. Within the same network card, are you using one lcore for all of the queues? How do you spread it, what does your architecture look like? And since you already mentioned that you are trying to keep all network cards in the same NUMA zone, what does your balancing architecture look like: do you have one lcore per queue, one queue per lcore, how do you do it?

BRANIMIR RAJTAR: We use different lcores for RX and for TX, and since fixed line subscriber internet traffic is asymmetric, we use more lcores for downstream, for downloads, than for upstream. It can't be perfect; it's usually like four to one or five to one, depending on the core count. So we have a pool of cores for upstream, a pool for downstream, and we also have a pool of threads for QoS, so it moves around a bit. Initially, I think I mentioned it on the slide, everything was in the same thread doing everything, but then from testing we saw that it was better to split everything up; we were using mbufs and splitting everything up.

VALERIE AURORA: We have a question from online.

"Do you have a packet per second figure for the machine and the end it's about IP packets forwarding? Do you have a packet per second figure for the machine?"

BRANIMIR RAJTAR: Yeah, I think I have it here somewhere. 200 million packets per second, and that's per socket, per CPU, so...

VALERIE AURORA: I have a question myself. I noticed that you are very up front about significantly changing your design. Do you have any tips about doing the organisational aspect of that? I know many people here have had the experience of committing to a design early on and having a hard time getting management to change it so any thoughts?

BRANIMIR RAJTAR: We are a small company, so it's not difficult for us; we are actually too small to be specialised in anything, so our teams are cross-functional, let's say, and we adapt as we go. Basically, like, recently we found the programming language Go is well suited for something, so OK, let's write it in that. We are too small to be attached to a legacy, or to say that because two years ago this was the best way, we have to keep doing it. So I think it's different in a big organisation; as part of a big organisation, I don't know, if they change something after a few years, somebody has to admit they were wrong, and that's the biggest problem the big organisations have. But here we saw what fits better, or we knew we needed to get more performance, and if it didn't work in the past, let's try to change it and try to get better performance.

VALERIE AURORA: Yes, be small and agile. Thank you.

ERIC ANDREI BALEANU: Thank you.

(APPLAUSE.)

So next we have Mike Joseph from Meta, with PON in the data centre, hyperscale for management and console.

MIKE JOSEPH: Thank you. I am Mike Joseph, many people call me MJ, and I do work for Meta. This is my first time presenting here at RIPE, it's my first RIPE meeting, and I am honoured to be here presenting to you; I have been fairly active in North American events, so some of you may know me from there. Today I am going to talk about PON in the data centre, which is kind of an interesting concept for a lot of people.

First, I am going to talk about who I am and what we do. I mentioned I am MJ; I run the infrastructure network engineering team at Meta. We are responsible for the provisioning, management, out-of-band, disaster recovery and VPN remote access systems at Meta, and we run the facilities network that connects the infrastructure of our buildings.

Basically, we are a little bit unique within Meta in that we are the only production networking team that touches 100% of racks in POPs and data centres; all the other networks at Meta might use faster links, but we are the ones that touch everything. We have our own backbone, we have our own infrastructure and we don't ride Meta's optical transport; in fact, we do everything ourselves in order to maintain that independence and stay out of band.

So first let's talk about how most people operate networks for management and recovery. This is actually very similar to what I think many of you probably do, with end-of-row Ethernet switches and terminal servers. Meta does the same thing: historically we have put in one or more Ethernet switches, and in Meta's case we put them on goal posts; these are 19 inch double wide racks elevated above the data centre floor, but aside from the interesting height and shape of them, they are pretty similar to what you use in your data centres. These are great for pretty basic deployments, but they have some limitations.

One of the challenges with them is that they are fixed. Now, we could put more switches in and choose not to elevate them, but at the end of the day you have to make some fixed decisions about how big you will make the deployment and what the ratio is going to look like. When it gets really big, you are carrying a lot of copper from that end to each rack, and you have to have a copper tray to do that. On top of which, it makes it hard to change out the racks as your needs change. One thing that happens at Meta relatively frequently is that we put new racks in, or we have new rack designs coming into the data centres, and those racks may have differing needs. Today most of them use one Ethernet and one serial port, but we have racks that use 12, 15, 20; we have some racks going to use 80 or more copper drops per rack. Because of this we need the ability to deliver service flexibly.

When you put more console servers and Ethernet switches in, those are additional devices to manage: you have to manage them, you have to upgrade them and manage software, and you also have to cable them; if we do want to change the ratio that we deliver per row or per rack, you have to potentially order additional devices to be installed, and you have to have somebody remove them later if you pull them out. So we decided to do something wild: install PON in a data centre. Most of you probably know it as a traditional residential access technology, usually serving residences, sometimes SMBs and sometimes cell towers. To the best of our knowledge, we are the first ones to do this in a data centre in a production way, and for us it solves a number of specific problems. One of the biggest things we get is flexibility.

We can support an increasing number of connections per rack position, and with our current PON deployment on our current generation hardware we can deliver 40 drops per rack position, which we generally split into 20 Ethernet and 20 console; we can vary the split to get up to 32 Ethernet. We are actively in the process of developing additional products to support higher density racks; I mentioned those 80 copper drops per rack for AI racks or high density network racks, and we are working on that and on some next generation PON technology devices I will showcase later.

We also increase scale. With end-of-row devices, if you put in two pairs of Ethernet switches and two pairs of console servers, you might get maybe a hundred ports per row. In our current design we can get almost 2,000 ports per row, and that's with the current generation of technology; we can use a single router to handle 12 rows at a time, and because of the nature of PON, which I will talk about in a moment, that one device can effectively manage all of the ONUs in those 12 rows together. It also improves our workflow: one of the great things about the way PON works is that it was designed for rapid deployment in residential services.

ISPs don't want to have technicians in your home mucking about with your residential gateway, so from the ground up it's been designed to provision as it comes up, and most of the systems designed to support PON support this deployment model. At a Meta data centre, when a rack comes in, it comes pre-built off a truck; it's unwrapped, scanned by the data centre technicians and then delivered to its position, either by a human or by a robot. Once it's put in position, it goes through a process called set, level and energise, where it's dropped into place, raised to a particular height, powered on, and cooling containment is placed around the rack. This process is actually relatively fast, and our goal was to introduce no significant additional time in deploying PON, so we have gotten our PON deployment down to five minutes per rack. All the technicians have to do when they go to the rack is put a canopy on top, which I will show you in a moment, plug in two fibres that are dropped to every rack position and pre-staged when the data centre is built, and then scan the PON equipment with their barcode reader; that automatically enables PON for that rack, marks in our systems that that rack has those particular ONUs at that position, and triggers a push to that one pair of OLT aggregation devices that allows the ONUs to come up.

And so this rapid deployment really fits well with the way Meta brings on racks in general. We also get some efficiency gains: because we deploy in small ONU footprints, we are able to deploy in four port increments. Most racks are one-and-one, one Ethernet and one console, and we can support that easily, and we can support large racks, those 20 plus 20 rack configurations, just by deploying additional ONUs, so you get cost efficiency here, and we also get power savings. One of the reasons we chose four port increments is that with generally off-the-shelf PON silicon you can easily get four Ethernet ports, so we get the four ports pretty cost effectively without having to add additional chips to increase the capacity, though we do have some ONUs that do that for higher density deployments.

I mentioned earlier the hassle of deploying lots of copper in your data centres. By deploying fiber directly to the rack through PON, we don't have to carry copper at all. We were the last network at Meta to use copper in the data halls; now we don't, and it frees up an entire tier: you typically have tiers of cable trays, and we free up an entire tier. Copper trays are bigger and heavier than fiber trays, so we free up a lot more capacity in the cable tray space by eliminating copper entirely.

The other thing that's really nice is that we design all of our PON solutions to have a very well standardised handoff. We tightly specify the position, and we design this in a particular way such that we can have both forward and backward compatibility: not only do we have support for all the racks that Meta is making now, we can make a rack three or four years from now knowing it will work not only in the data centres Meta is building then but also in the ones being built now, as long as it adheres to the well defined specification of the two fibres that we drop today. And because the handoff is so well defined, it gives us the flexibility to iterate the rack design independently of the data centre design.

Now, another thing that's really helpful is that because PON uses a whole suite of standards that are well defined, we can iterate new product designs quickly as well. For example, we can take on new individual models from our existing vendors; if a vendor changes PON chipset, we can handle it easily, because we don't actually configure the ONUs directly, and I will get to how it's managed in a moment; the configuration layer and the operations are all centralised and managed through existing PON standards. We still have to do the basic NPI fit and function work, but we don't have to do the systems onboarding and integration that we would normally do when we onboard a new platform. It also helps with supply chain diversification efforts. One surprising benefit of PON is that it improves staffing: for example, we can hire people coming out of carriers, we can hire people coming out of cable operators who have experience in the PON space, and they may be a very good fit for my team. We also enable cross training within Meta for people with an interest in adding industry skills to their resumé, and they are able to train up in those areas and get that experience.

Now we do have a lot of efficiencies, we don't need a huge team but we still have the ability to bring people in when we need to and by the way, even though it is a small team, Meta is hiring, so come see our booth.

Now let's talk a little bit about PON fundamentals. Many of you are probably familiar with PON; for those who want one, I will give you a refresher on how it works. PON is an access technology for ISPs, and it's a shared medium. At one end you have the OLT; that's the head end device that controls the PON. It controls all aspects of the PON, in terms of admitting new ONUs and pushing software to them, as well as realtime control, by setting the timing for the PON and issuing grants for upstream data transmission; it's managed end to end by the OLT. There are two protocols: one is PLOAM, the low level protocol, and the other is OMCI, a higher level protocol used to programme the MIB of the ONU. Here's an example of a PON topology. On the left you have the OLT and on the right you have the ONUs, and as you can see it's a tree topology with splitters. Most commonly the splits will be even, so they are splitting, say, one to four or one to two, and a splitter does basically what it sounds like: it takes light in on one end and splits it out evenly at the other; you do take a loss, and it works in both directions.

And then at the other end you have the ONUs themselves, and finally the UNI ports; in a residential deployment these ONUs would typically be on or in your house, in our case they are on the rack. There are a couple of different PON technologies in the industry: there are the G-PON standards and there are the E-PON standards. They are competing, they are similar in a lot of ways, and many devices will interoperate with both, though not at the same time. We use XGS-PON, on the right, which gives us ten gigabits symmetric up and down. One thing to note about PON is that it is bidirectional: the single fiber coming out of the OLT on this diagram here, basically this is one single fiber, one single fiber out of here and one into here, and one strand of fiber is used in both directions.
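Since the talk notes that every split costs optical power, here is the standard ideal-splitter arithmetic, purely for orientation; real splitters add some excess and connector loss on top, and the actual link budget depends on the optics used, none of which is specified in the talk.

```python
import math

def ideal_split_loss_db(n):
    """Ideal power-division loss of a 1:N optical splitter, excess loss excluded."""
    return 10 * math.log10(n)

for n in (2, 4, 16):
    print(f"1:{n} splitter: {ideal_split_loss_db(n):.1f} dB ideal")

# A cascade such as the 16-way split plus a further 4-way split at the rack,
# as described later in the talk, costs roughly the sum of the stages:
print(ideal_split_loss_db(16) + ideal_split_loss_db(4))   # ~18 dB ideal
```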

This single-fiber approach is not a huge benefit in Meta's case; it wouldn't be a big deal for us to pull more fiber, but in residential, where you can have much larger fiber plants, it helps reduce costs significantly. XGS-PON is an iteration: the family went from G-PON to XG-PON to NG-PON and finally to XGS-PON, and there's more beyond this, there's 25G PON and so on, but fundamentally some of these use different wavelengths, like XGS-PON and XG-PON versus G-PON, when moving from one technology to the other, while for most of it you are running at a fixed clock and just the bit rate varies. XGS-PON is what we use here and it allows us to operate 10 gig symmetric.

Now, let's talk for a minute about redundancy. ITU specifies a number of different techniques for redundancy; the most common is type B, and this is also specified by ITU: you have one OLT active and you have another one that's a warm standby, watching to take over in case the other one dies. In single homed type B, both of the PON ports are on the same device; in dual homed, they are on different devices. It's kind of interesting the way it works: the two OLTs don't actually have to communicate, because no ONU can transmit without a specific grant from the OLT, a permission to transmit at the next interval, so the standby can detect if the other one went down because it won't see northbound light.

In this case you will notice the PON went dark, so even though the standby doesn't see the active OLT's output, it can tell, because the ONUs have obviously gone away, and so it will take over. Now, it's not perfect: if there's a fiber cut here, you won't notice that, but in most cases this tends to work well. Meta enhanced this: we took this type B dual homed redundancy, decided to invent our own thing, and created type F, which is an enhancement that also adds the ability for the active OLT to detect the loss of a subset of ONUs. Let's say this fiber got cut here: if the active OLT detects that it has lost a certain threshold of ONUs, it will stop transmitting, and within a certain time the other one will take over.

And so that's a way to speed up recovery of cases where you have a partial break in the ODN. That works in particular places because of the way we design our ODN.
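Purely as an illustration of the failover logic just described, and not Meta's implementation (whose thresholds and timers are not given in the talk), here is a sketch of the decision the active OLT might make:

```python
def should_stop_transmitting(onus_expected, onus_responding, loss_threshold=0.5):
    """Illustrative 'type F' style check: if the active OLT has lost more than
    loss_threshold of the ONUs it expects to hear from, it stops transmitting
    so the standby OLT, which watches for the PON going dark, takes over after
    its own timeout. The threshold value is invented for this example."""
    if not onus_expected:
        return False
    lost = onus_expected - onus_responding
    return len(lost) / len(onus_expected) > loss_threshold

expected = {f"onu{i}" for i in range(16)}
responding = {f"onu{i}" for i in range(4)}   # partial ODN break: only 4 still answer
print(should_stop_transmitting(expected, responding))   # True -> yield to the standby OLT
```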

So, let's talk about our design. We worked with our initial vendor to bring these visions to reality. This is not a vendor talk, this is a Meta talk; I am talking about Meta technology and a Meta solution here, but I am featuring a lot of designs from the vendor and you will see a number of products featured here. I am not selling anything, but they are all available in the product catalogue; very few are custom to us, and where they are, we are working to get them added to that vendor's or other vendors' catalogues so they become available. We participate in the OCP ecosystem, and so where possible we are trying to push our designs into that, and there are some cases I will talk about later where we have done that.

This is the OLT aggregation device; it's basically a layer 2/3 router with 36 ports. The OLT we use from this vendor is based on a modular pluggable: it looks like an SFP+, but it's a lot more than an optic. Here's one here. It's actually a PON OLT on a stick: on one end it has basically a three port switch, on the other end it speaks PON, a different framing, and in the middle it has a host interface as well for management of the device itself. So we can cram 36 of these into one of these 1RU devices and through that manage about five hundred racks from a single device; we only have to configure this one device to make it work.

We also created this canopy concept, the canopy chassis we put on top of the racks. This is an Open Rack V3 standard rack; we modified the open rack and are pushing this back into OCP, so there are now mounting features on OCP ORv3 racks to handle a canopy chassis being installed; if anyone wants to acquire canopy chassis, you can put them on ORv3 racks. Here's an example of a PON chassis in the lab; this is a lab, so obviously there's lots of extra cabling and lots of extra fiber, and it would never look this messy in a real data centre, but this is how we deploy in one of our lab environments. Here's a close up view. One interesting thing to note: this chassis is entirely passive, it just distributes power. Each of these cards, the PON ones with the actual fiber ports, interface with the backplane just for power; these are console cards, and they don't interface with the backplane at all. We use these clever USB clips; we had a bunch of these cables made that are used to connect the cards, and that's how a card is both powered and connected for data.

And then at the end here we have a splitter I will talk about in a minute. Here you can see this is the input, the two A and B feeds and then here we have eight outputs that go to these cards here.

Here's a close-up view of those ONUs, the cards. This is the ONU; you can see it's much longer. It has the fiber for the PON, and here we have the four RS-232 ports and the uplink. These are managed exclusively over PLOAM and OMCI; we do arrange to ping it, but you can't SSH to this thing, so again full management is through its upstream device.

And here's an example of some splitters. Here's a close-up of the two splitters we put into the canopy chassis. I mentioned we have a 16-way splitter in our goal posts; here's an example of the cassette we load into our goal-post splitters, and this is the type of splitter we use when we have to do head-end splitting: it's high density, it has three 16-way splitters in it.

I mentioned 16-way splits in our goal posts; we peel one of the legs off, the 16th, locally to use as a test point for technicians.
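
As a rough sanity check on the numbers mentioned earlier, 36 pluggable OLTs per 1RU aggregation device with a 16-way split and one leg reserved as a test point works out to roughly the "about five hundred racks" quoted above. The arithmetic below is an illustration based on those figures, not an exact quoted number.

    # Back-of-the-envelope check of the capacity figures above: 36 pluggable
    # OLTs per 1RU aggregation device, 16-way splits, one leg peeled off as a
    # technician test point. Illustrative only; real deployments also depend
    # on the A/B redundancy layout.
    olt_ports_per_device = 36
    split_ratio = 16
    test_legs_per_split = 1

    racks_per_pon = split_ratio - test_legs_per_split        # 15
    racks_per_device = olt_ports_per_device * racks_per_pon  # 540

    print(racks_per_device)  # ~540, in line with "about five hundred racks"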

Now, in Meta's case we run two different types of PON networks. We run a normal availability (NA) network, which uses that type F I talked about before: two OLTs, two 16-way splitters, and then you merge into a two-to-two or two-to-four splitter in the rack and go to a single ONU with consoles for each device. This can handle four devices, and this is what every rack in Meta data centres gets. However, this does have some single points of failure, here and here and here and here. Because of that, we have some devices that need a higher level of redundancy. This kind of device is probably something like a switch or router, which probably has an in-band management port, but Meta uses some devices that don't have in-band management. In order to handle those, we developed a high availability (HA) solution. In this case we still use a pair of OLTs just like with the NA network, but with the HA network these OLTs are actually completely independent: they don't talk to each other, they are not redundant for each other, and in a given rack there are two sets of ONUs, so the client device has its own connection to each of the OLTs. It still uses the same chassis; since it's a passive chassis, it's less of a concern, and this device ends up with two management connections.
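
To summarise the two profiles, here is a small, purely illustrative data model of the NA and HA console networks as just described; the field names, and the HA device count in particular, are assumptions rather than published Meta parameters.

    # Illustrative data model of the two console-network profiles described
    # above. Field names and the HA device count are assumptions.
    from dataclasses import dataclass

    @dataclass
    class RackConsoleProfile:
        name: str
        olts_feeding_rack: int
        olts_independent: bool   # HA: the two OLTs do not back each other up
        onus_per_rack: int
        max_console_devices: int
        protection: str

    NA = RackConsoleProfile(
        name="normal availability",
        olts_feeding_rack=2,
        olts_independent=False,  # type F pair, one active plus one standby
        onus_per_rack=1,
        max_console_devices=4,
        protection="type F (dual-homed, partial-loss detection)",
    )

    HA = RackConsoleProfile(
        name="high availability",
        olts_feeding_rack=2,
        olts_independent=True,   # two fully independent OLT/ONU chains
        onus_per_rack=2,
        max_console_devices=4,   # hypothetical; the talk does not give a number
        protection="two independent management connections per client device",
    )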

As I mentioned, we care a lot about redundancy. Now, in the last number of years a lot of vendors have been moving in the direction of off-box controllers. This might work well in residential access, where the priority is customer activations, billing, etc. For our solution we need our PON to be more survivable. In order to arrange that, we worked with the vendor to implement on-box control, and it's actually redundant with its peer, so we have a lot of autonomous workflows: when a rack comes in, it automatically causes the config to update, and that process is pretty neat. As a result we actually use NETCONF and gNMI, and we have to maintain really large stats. If you are interested in learning how to do this, my colleague did a talk about this that's available on YouTube.
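
As an example of what driving this over NETCONF can look like, here is a minimal sketch that pulls the running configuration from the OLT aggregation device using the ncclient library; the hostname, credentials and port are placeholders, and the specific YANG models the vendor exposes are not covered in this talk.

    # Minimal NETCONF example against the OLT aggregation device, which the
    # platform manages via NETCONF and gNMI. Hostname, credentials and port
    # are placeholders; the vendor's YANG models are not shown here.
    from ncclient import manager  # pip install ncclient

    with manager.connect(
        host="olt-agg-1.example.net",  # hypothetical device
        port=830,
        username="netops",
        password="********",
        hostkey_verify=False,
    ) as m:
        running = m.get_config(source="running")
        print(running.data_xml[:2000])  # print the start of the running config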

Finally, where are we now? PON is the plan of record for all of our data centres going forward; pretty much every new rack and data centre we build is going to be PON enabled, and we have a number of them up and running now, over four hundred active already. That may not sound like a lot, but we are just getting started: our very first data centres are coming online, and our initial rack roll for new data centres tends to be a little bit slow at the beginning and ramps up faster as the data centre gets built, but it's still an important milestone for us. As I mentioned, every new Meta data centre being built or under construction is PON enabled from the get-go.

Finally, where are we going in the future? We are working on developing new rack products with the current and new vendor, things like higher density ONUs and chassis; everything I showed was open rack standard, on top of rack. We are also working on one or two ONU designs, and we need to develop an RU device. We operate a facilities network in our data centres, air conditioners, transfer switches, etc., and we want to bring PON to that, it's a good fit. And finally we want to bring PON to pops outside of data centres, where we would be able to use RU devices in a more commodity footprint, not just our own facilities. That's a smaller deployment but still unified with our vision. That said, thank you for the opportunity and I am happy to take any questions.

(APPLAUSE.)

ERIC ANDREI BALEANU: Any questions?

AUDIENCE SPEAKER: I was wondering about the RS-232 ONU, is that a standard thing your vendor is offering, or did you develop that, or did they develop it based on your requirements?

MIKE JOSEPH: It's probably a bit of a grey area between who developed what with our vendor. We collaborated with the new vendor to bring a new product line to market and we were also the launch customer for the product line, so everything you see here is available publicly; Meta drove the requirements for most of it. This device here is a daughter card for the ONU itself, so you can purchase the ONU without the daughter card if you don't need RS-232. All of this was based on our specifications, but those are not publicly available.

AUDIENCE SPEAKER: Thank you. The second question is in terms of cost. I get that deprecating copper brings benefits at the data centre level, but for the equipment costs, like the hardware, the OLTs and the ONUs, how does that work out?

MIKE JOSEPH: Our cost per port went down. It turns out Ethernet switches are not that expensive, but terminal servers are really expensive because of the market dynamics there, so this device saves us a great deal of money. And we can target the deployment: we can put in one of these pairs for a rack that's small and five of them for a rack that's large, so we can size the deployment exactly for the rack. We still have to run the head end, but a pair of those 1RU devices covers over 12 rows, so we get quite a bit of cost efficiency there.

JEN LINKOVA: Thank you very much, very cool. I will take a question, and I guess people in the room might be very impressed and excited, so if someone would like to do something like that, what was the biggest challenge, and what would be your advice for people, what not to do?

MIKE JOSEPH: Probably the biggest challenge was selling people internally, and now that we have done that and have given a talk on it, it may be easier for you to do. What not to do: one of the challenges, I think the biggest challenge, is that at Meta everything is a moving target, right. We were up against a very tight timeline; we had all these new racks coming with high density requirements that could not be met with the previous design, so because we basically said we are all in on PON, we had to absolutely land all aspects of the new design in time for a particular data centre build, which has a construction schedule that's governed by people with backhoes and walls and a particular set of racks, so we can't be late, and the biggest challenge is making sure we don't delay production. You may have seen Zuck's comments; new builds are a very high priority for Meta.

ERIC ANDREI BALEANU: Any other questions? No, thank you very much. (APPLAUSE.)

Now it's time for a coffee break. We'll be back in half an hour.

(COFFEE BREAK)