
DNS Transcript

Chaired By:
Moritz Müller, Yevheniya Nosyk, Willem Toorop
Session:
DNS
Date:
23 October 2025
Time:
11:00 (UTC +0300)
Room:
Side Room

11am

DNS WORKING GROUP

RIPE 91

23 OCTOBER 2025

WILLEM TOOROP: Welcome everybody to the DNS working group session. The DNS working group is led by three chairs: Moritz, Yevheniya and me, for the last time this session, because from the next meeting, in Edinburgh, we will have a new co‑chair, and I am very happy to announce that the new co‑chair will be Ulrich Wisser.

(APPLAUSE.).

I'd like to point out that three weeks ago there was the DNS week in Stockholm, with lots and lots of DNS activity. It started with a CENTR meeting on internet governance, then there was the excellent DNS Hackathon, there were community sessions organised by DNS‑OARC and the DNS‑OARC meeting itself, and the Netnod meeting and party. One of the side meetings, or community sessions, that was held was an initiative for the DNS community to start work on best current practices at DNS‑OARC; currently that group is in the process of creating a charter. So if you are interested in that, please have a look or search for OARC 45, where all the events are listed, and there is also a lightning talk about the best current practices initiative at DNS‑OARC.

Right. We have a very interesting programme today. We have our new co‑chair, Ulrich, presenting on TLD resilience in the light of signature validation constraints, then Dmytro Kohmanyuk on XoT practice with the Ukrainian ccTLD, and Shane Kerr talking about TTL upper limits in practice, and then the traditional RIPE NCC DNS update by Anand Buddhdev.

But first of all, we will start with the announcement from Jim Reid about best common practice on hyperlocal root. Also don't forget to rate all the talks. Jim?

JIM REID: OK. This is just a very, very short presentation because the real work ‑‑ I assume you can hear me now? Thank you. And awake? You think I'm awake or not awake?! Anyway.

This is a very short presentation because the real bulk of the activity is going to take place at the IETF; there should be discussion on the agenda at the meeting in Montreal in a couple of weeks' time.

So I hope everybody in this room knows what hyperlocal root is. The essential idea is that local resolving servers have a copy of the root zone so they can answer queries themselves, rather than touching the root servers to start the whole resolution chain off. This has been around for a while; the initial RFC was done about ten years ago and it was updated about five years ago, but those were only given informational status at the IETF and uptake has been a little bit disappointing. The attitude has been: this is only an informational RFC, we can't be bothered, we don't want to know. But things have moved on a bit from that now, because I think, first of all, we can't continue relying on the current root server system infrastructure as it is. Although the root server operators do a wonderful job and it is rock solid and reliable, we need more defence in depth in this whole system, and part of that is going to be having more copies of the root zone available to those who need it.

This will also solve, I think, a lot of geopolitical problems, because as you may imagine there are lots of places in the world where they say they want to have a root server, and of course they can't really get one, but if anyone has a copy of the root zone on their local resolving infrastructure, they have a root server for all intents and purposes. The first question is what sort of infrastructure we would need to roll this out: we can't rely on zone transfers from the existing root server operators, so maybe we want to look at other mechanisms such as rsync or making it available on the web somewhere through some cloud provider. We need additional checks on root zone contents; we can use ZONEMD for that, obviously. Maybe it's going to be extra work for the IANA people, who might need to be in agreement with those providing some kind of distribution service for the root zone. And of course we have to do some education and outreach, and that's part of the reason why I am here today.

With that, I am just going to ask for questions and any comments. I don't want to get into a discussion of why we should be doing hyperlocal roots and all the rest of it; that's a conversation that can take place at DNSOP. If anyone has any questions about the motivation of the authors and why we are doing this, now is your chance to come to the mic and throw it at me.

WILLEM TOOROP: We have two more minutes for questions.

JIM REID: Nothing? OK. Thank you.

(APPLAUSE.)

WILLEM TOOROP: Next I would like to invite Ulrich Wisser to come to the stage and present on TLD resilience in the light of signature validation constraints.

ULRICH WISSER: Hello, thank you for your support, and let's see how we do this. What we did is we looked at all the TLDs and at a little bit of the timing configuration that you need to have for your DNS zones. We did this for TLDs because they are easy to observe and a limited list; I could actually do it from my home server. But these timing considerations are important for all zones that you run with DNSSEC, even for your own zones, so I hope that you can take something from this home.

So, if you look at a usual DNS setup, the configuration is that you do some XFRs, and every time that stops, the SOA expire timer basically starts. Your secondaries will serve the zone for as long as the SOA expire time lasts; it's quite easy, it's basically the definition of it. Nothing much to discuss. What it means is that if you have some kind of disaster happening to your infrastructure, basically to your hidden primary somewhere, then there is a limited amount of time that you have to fix your disaster and get up and running again.
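(A minimal sketch, assuming the dnspython library and a placeholder zone name, of how to read the SOA timers being discussed here; EXPIRE is the window in which XFR‑fed secondaries keep serving the zone after transfers stop.)

```python
# Minimal sketch (assumptions: dnspython is installed; "example.com." is a
# placeholder zone). Reads the zone's SOA record and prints its timers.
import dns.resolver

def soa_timers(zone: str) -> None:
    soa = dns.resolver.resolve(zone, "SOA")[0]
    print(f"{zone}: refresh={soa.refresh}s retry={soa.retry}s "
          f"expire={soa.expire}s (~{soa.expire / 86400:.1f} days) "
          f"minimum={soa.minimum}s")

soa_timers("example.com.")
```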

I guess we can all agree on that. It's good. Yes, so the first thing I did was look at what SOA expire times people use. Well, there was one TLD that, until a couple of months ago or so, used a two‑hour zone expire time, which I think is very challenging for your disaster recovery, but they fixed that. But I would say that even seven days is a very short time, depending on the disaster and the timing of the disaster; in this part of the world, I would say Christmas Eve would be a really bad time to have only seven days to fix things, and other parts of the world have their own holidays that make timing difficult.

But you see most are 4 days or shorter, though there are some exceptions; 70 days also seems very popular, and there's one with 7,000 days. I asked, but I did not get an answer as to why they do this. Well, your guess is as good as mine. Yes.

Good, so there's some variation in this. But obviously, for the DNS to work, and especially for DNSSEC‑signed zones, the RRSIGs, the signatures in your zone, need to be valid, otherwise disaster. How does this work? We have the signature validity: you have the inception time, you have the expiration time, and in between is one of these. But you have the refresh, and then a little bit of jitter at the end, and the time in between is the time that you have left for your disaster recovery. So if you think of how the refresh works: let's say you sign, in this case, for 12 days and every four days you renew your signatures. For one record it would look like this, but for all the records in your zone it would look something like this, and then you see there's one record that basically should have been refreshed but is not, because something happened to the infrastructure, and then you are left with this remaining validity time.

So, and that's a problem. Why is it a problem? Well, that remaining time needs to be longer than your SOA expire time, otherwise you would be serving expired signatures. Well, I am not the first one to recognise this problem; there's an RFC about it, how could there not be. And the RFC actually says, as you can see here, the SOA expire should be a third or a quarter of your signature lifetime. Or, to put it the other way around: given your SOA expire time, your RRSIG lifetime should be three or four times that. OK, really long signature lifetimes.

And so then I looked at what people are actually doing, how the world is configured: how many TLDs actually have lifetimes on the signatures over their DNSKEY record that are longer than the SOA expire time. There are two data sets in here, one is the ccTLDs and one is the gTLDs. You can see that for the ccTLDs, approximately half of them are shorter than the SOA expire and half of them are longer. Then you have the gTLDs, and you will see the graph hops, it moves a lot. That is basically because there is a number of gTLDs that sign, and then the remaining lifetime is longer than the SOA expire; as time goes on, before they refresh, they drop below the SOA expire, then a few days later the signer refreshes the signatures and they hop back up. Because it's a large number of gTLDs that do this, you see this strange graph. But you can see some trend lines in here, and I would say about 300, approximately a fourth of the gTLDs, are constantly too short.
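(A rough sketch of this kind of check, assuming dnspython, a resolver path that returns RRSIGs when the DO bit is set, and a placeholder zone name; Ulrich's actual measurement code is on his GitHub, this is not it.)

```python
# Rough sketch (assumptions: dnspython; "example.com." is a placeholder zone).
# Fetch the DNSKEY RRset with DNSSEC records requested, find the RRSIG that
# covers it, and compare the remaining signature lifetime with the SOA expire.
import time
import dns.flags
import dns.rdatatype
import dns.resolver

def check_zone(zone: str) -> None:
    resolver = dns.resolver.Resolver()
    resolver.use_edns(0, dns.flags.DO, 1232)      # ask for RRSIGs in the answer

    soa = resolver.resolve(zone, "SOA")[0]
    dnskey_answer = resolver.resolve(zone, "DNSKEY")

    now = time.time()
    remaining = None
    for rrset in dnskey_answer.response.answer:
        if rrset.rdtype != dns.rdatatype.RRSIG:
            continue
        for rrsig in rrset:
            if rrsig.type_covered == dns.rdatatype.DNSKEY:
                left = rrsig.expiration - now
                remaining = left if remaining is None else min(remaining, left)

    if remaining is None:
        print(f"{zone}: no RRSIG(DNSKEY) seen (unsigned, or resolver strips it)")
    elif remaining < soa.expire:
        print(f"{zone}: remaining signature lifetime {remaining / 86400:.1f}d "
              f"is SHORTER than SOA expire {soa.expire / 86400:.1f}d")
    else:
        print(f"{zone}: OK, {remaining / 86400:.1f}d of signature left vs "
              f"{soa.expire / 86400:.1f}d SOA expire")

check_zone("example.com.")
```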

So yes, then we have some interesting real‑world examples; I looked at some TLDs here. You see this one made a configuration change: first they had a 30‑day SOA expire, but the signatures were only three or four days, which is not very long. And then they realised that something needed to be done; they made a change, they lowered the SOA expire to seven days, and now you can see that the signatures are longer, but just at the end of the validity period, before they refresh, they actually drop below the SOA expire time.

Yes, so this is a TLD that actually does offline KSK signing: when they start with new signatures, they are longer than the 30‑day SOA expire, but at some point, depending on their routines and what holidays fall in the week, they sometimes drop down to a three‑day remaining lifetime, which means that if something happened, they would have three days to fix any problem, which I find scary, if you ask me.

This one is an interesting one. You see at the beginning there was some ‑‑ I don't know what they were thinking, but they got too ‑‑ really close to expiring the whole TLD and then they changed something in the infrastructure and now what I would guess happens is that they have two different signers running at the same time and at different points they switch the distribution from one signer to the other.

And so you have the one signer with very long lifetimes and the other signer with short lifetimes, and depending on which one is in charge on a given day, you get different signature lifetimes.

Yeah, this is .se and they have always been above the SOA expire time and then they made a change to their zone generation and now they are doing refresh much more often than they used to. So.

Yes, so I have no explanation for this. Maybe you have? In the beginning they seem to be doing new signatures whenever they generate a new zone, that's why you get this constant line. And then they changed something and, yeah, it goes very fast. I have no idea.

Yes. So there's some conclusions from this.

Yes. So obviously, use a reasonable SOA expire time. And really look at your own disaster recovery routines and decide what you need: when could it happen, how fast can you buy a new server, how fast can you install it, how fast can you get your backup onto it, things like that. Especially if you are also talking about DNSSEC: think of your zone, you have to recreate the state your DNSSEC signing was in, and it's not always that easy. Yes, so then you have the refresh to consider for the SOA, and you have the signature expiration to consider for DNSSEC, and yeah, there should be really long lifetimes for your signatures. Monitor your lifetimes; it's actually good to know that all your signatures are valid. And the last thing is: follow RFC 6781, but given how many people follow it, or how few, we might need to think about whether we should change the recommendations.

That was my presentation.

I hope you have many questions.

(APPLAUSE.)

SHANE KERR: Hi. This is really interesting, I love this kind of research. I have two observations, not really questions. The first is that not everyone uses the standardised zone transfer mechanism.

ULRICH WISSER: Obviously not, it doesn't apply to everybody, you need to know if the zone expire applies to you.

SHANE KERR: The second observation is that, while you are not supposed to serve a zone past the zone expiry, just as you can answer records out of cache, you could continue to serve the zone once the expiry has passed, so passing the expiry doesn't necessarily mean that all of a sudden you are going to disappear.

ULRICH WISSER: If your signatures have expired, obviously a validating resolver will be really angry with you. One thing you could do, if you have long signature lifetimes, let's say 70 days: if you are at day 60 and you still couldn't figure out your problem, then maybe you should just take away the DS record. It gives you a long time to fix your problem, and then you remove the DS and everybody still gets your zone. It's not ideal, I totally agree.

SHANE KERR: We pretend it's not possible. So in our case we actually used to respect the expiry time when we were secondary; we removed that quite by accident, but no customer has ever asked for it. At some point we'll probably add it back for people worried about that situation.

JIM REID: Speaking for myself. This is interesting, so thanks very much for gathering the data. On one of the slides you mentioned choosing reasonable values for the SOA record parameters. Where would we define what is meant by "reasonable values"? We had an attempt in the working group to come up with parameters and there was no enthusiasm to update that; that was about 15 years ago. But I think it would be useful to try and document those ideas about what the definition of reasonable is. It may not necessarily be absolute upper limits, the value must be X, but I think it would be helpful if something was documented that says these are the values and these are the trade‑offs, so people can make informed decisions rather than blindly cutting and pasting something they have downloaded from the net somewhere. The question is where do we document that, and how.

ULRICH WISSER: Good question, I have no answer, and I totally agree we need that. I don't know ‑‑ I work for ICANN in technical engagement, and what I do is talk to ccTLD operators, for example, and many of them would be very happy to have guidelines on how to choose this. Currently I have no document to point at; all I can say is, well, I worked for .se for 15 years and we did it that way, which might be a good reference, but you know ‑‑

JIM REID: Just do what dot com does ‑‑ that's obviously an attitude people have. It would be helpful to try and document it; whether it could be done here or has to be done through a formal setting like the IETF, I don't know.

ULRICH WISSER: I don't agree with the dot com part but otherwise, yes.

JIM REID: OK, yes. Thank you.

SPEAKER: On your slide 13 when you have the slightly diverging things with lots of ‑‑ yeah, there seems to be a very clear regularity there, have you tried to figure out if all these TLDs are operated by the same registry or possibly by the same type of software?

ULRICH WISSER: I work for ICANN and we cannot answer that question. Take your best guess who operates many TLDs.

ANAND BUDDHDEV: This is Anand Buddhdev from the RIPE NCC. A completely different aspect to this problem: we are talking about adjusting SOA timer values and things like that, but one discussion I have had with some people, which has not really gone too far, is that if authoritative name servers tracked the signature expiry of the zone, the earliest signature expiry in the zone, and expired the zone regardless of what the SOA timer says, then that would prevent the disaster: it would prevent the authoritative server serving expired signatures from the zone. That might have some merit.

ULRICH WISSER: I agree, but that would also take your zone off‑line.

ANAND BUDDHDEV: Well, speaking from experience at the RIPE NCC, where one of our secondaries was serving a zone with expired signatures because it hadn't noticed the SOA timer, you know, in the chain of XFRs: it would be better for that secondary to SERVFAIL for the zone, so the resolver would go elsewhere, instead of handing out expired signatures and the resolver saying, oops, can't validate. So it's perhaps an interesting discussion to start.

ULRICH WISSER: Yes.

NIALL O'REILLY: Speaking as the administrator of a micro zone on which my wife's email depends: this is analysis I have been meaning to do on this zone for a while, and maybe in the copious free time that I will have from the end of this week I can do that. Do you have the script, or are there tools out there, which make this easy for a small operator, a micro‑enterprise or a private individual?

ULRICH WISSER: The code is on my Github.

NIALL O'REILLY: Great, I will find it there. Thanks. Oh and thanks for the lovely visualisations, they make things so clear.

WILLEM TOOROP: There's one question online, which you might already have answered when Shane asked his question: do we know if these TLDs are actually using zone transfers for distribution of the zone data?

ULRICH WISSER: We know that many of them do not, but also a lot of them do, yes. I know a lot of operators that use zone transfers; not everybody does, but for those that do, it's relevant, yeah.

WILLEM TOOROP: Thank you, Ulrich.

(APPLAUSE.) I would like to invite Dmytro to the stage to present on transfer ‑‑

DMYTRO KOHMANYUK: Hello everybody. The Address Policy group session ‑‑ XoT is a job for a quiet Saturday afternoon, or maybe the whole weekend. We are running one of our domains and I am not sure this one is running, I need to check. So ‑‑ where is the clicker ‑‑ we have been running DNS for .ua for quite some time; this has been presented previously, but we have a long history of doing DNSSEC and authoritative service, and we have this current setup which is roughly three layers. We call the first one the zone generators, the machines which build the zone up from the database, thanks to automated management; then the distribution layer, which is a few machines running BIND 9, and these are basically fetching the zones from the source ‑‑ actually there is more than one source, but those are details; and then we have the anycast layer, this is our anycast, and each of those nodes is basically fetching the zones from this group of intermediates, so that's the distribution layer. So our anycast is part of that.

So we have this deployment done by ‑‑ here I am praising the Knot signer; we are partnering with CZ.NIC and have direct access to part of the development team. We found lots of issues in that initial setup; some of these were due to the way we had done things, some of them were small bugs which have been fixed. There's no really special thing you have to do except enable TLS. The interesting thing about Knot ‑‑ and I am sorry for not including lots of configuration fragments, I feel like I have to clean these up ‑‑ but briefly, you can create TLS peers and use all of them.

If you want to fall back from TLS to regular unencrypted... as we should call it now, you need to describe the peer twice, once with TLS and once without, and then of course there's the issue of using multiple sources: I want to use all of the TLS sources first and then the non‑encrypted sources, or vice versa, and there's more than one way to skin this animal. And yeah, the BIND features we have are interesting. But why are we going to use TLS? Most internet traffic is encrypted; SMTP is largely unencrypted, we'll see to that.

Many people use ... as a way to know which peer you are talking to, but this is a shared secret and we use that too. We did not stop using TSIG; we use shared secrets on both peers, which is optional. Basically the primary servers run ... and regular DNS on port 53, and the trial would be ......... (?)
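(The shared‑secret side of this looks roughly like the sketch below, assuming dnspython and placeholder key name, secret and primary address; the transfer here runs over plain TCP port 53 with TSIG, while the TLS wrapping described in the talk is done by the name servers themselves.)

```python
# Minimal sketch (assumptions: dnspython; key name, secret and server address
# are placeholders). Pulls a zone with a TSIG-authenticated AXFR on port 53,
# i.e. the shared-secret authentication the speaker says they kept alongside TLS.
import dns.query
import dns.tsig
import dns.tsigkeyring
import dns.zone

keyring = dns.tsigkeyring.from_text({"xfr-key": "c2VjcmV0c2VjcmV0c2VjcmV0"})

xfr = dns.query.xfr("192.0.2.1", "example.ua",
                    keyring=keyring,
                    keyname="xfr-key",
                    keyalgorithm=dns.tsig.HMAC_SHA256)
zone = dns.zone.from_xfr(xfr)
print(f"transferred {zone.origin} with {len(zone.nodes)} names")
```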

As we know, handshakes take some time, which is true for both unencrypted and encrypted; the encrypted transfer additionally has the key exchange mechanisms. The other issue that we did not work on, because we didn't feel it was worth the risk, is using a CA, because we planned to use our DNS distribution with other parties, and I am not sure I want them to be hard‑wiring our CA keys into their DNS configuration; some of them use automated provisioning and cannot do that. Instead we just use pre‑shared keys and basically tell them to trust whatever key is coming from us, assuming that the IP address matches. That's, I guess, a weaker form of security, but it's better than nothing.

And another point about using a CA: you have to issue the certificates with a certain lifetime, and you don't really want to set up your perfectly valid configuration and let it die by itself after two or three years, or whatever your key lifetime was.

So these are, I guess ‑‑ we had about 40 zones on our .ua primary server, and we went one at a time, moving each of them to the new peer configuration. The secret is basically to use ‑‑ I am going to show you this one on the next slide ‑‑ a group like that; that's what I was talking about, where you have the master using TLS with the specific remote... and then the plain address for the particular server. And the second green part is the typical configuration for the primary server, which is basically declaring the TLS service on a specific address.

I want to go back one slide, yeah. Actually two. We ran into some issues when converting our configuration to TLS. I don't recall exactly, but Knot had some issues with either session tickets or general TLS connection establishment: if you do transfers quickly enough, the instance would try to reuse session tickets; we found ... and that issue was fixed. For reference, the Knot server does not use OpenSSL, it uses ... as its TLS library, so maybe that was a dependency on a particular library ‑‑ I am not an expert in that field, but just be aware.

And like I said, the issue was fixed in a later release; there is also a workaround, to treat it with longer refresh times, or maybe just wait it out and have it retried. Not a big thing; our zones, we update them once an hour, so it's really not an issue.

Going forward again: like I mentioned, we are not using CA chain validation, and the thing I want to say is that there is a tricky part with the NOTIFY messages. Depending on the way you write your primary configuration, you may have a situation somewhere where there's no TLS used for the NOTIFY because of the particular setup ‑‑ I guess it's mostly the per‑zone configuration part ‑‑ so your NOTIFY may still come unencrypted. And that's not a big thing, because you can still process it as advice, do the actual query, and then do the transfer.

I guess in an ideal world, we might consider implementing NOTIFY over TCP as a mandatory requirement ‑‑ I mean, well, good practice. You probably need a specific syntax for that. It's all doable, it's just a matter of convenience and of the operator being aware; that is especially true for, let's say, BIND configurations, which have too many options you can change.

I did mention the BIND logging. Generally speaking you would get a zone transfer logged; if there's some error, you may not be sure whether it was CA validation or the handshake or something else, so there should be more detailed logging ‑‑ I think that concerns Knot as well ‑‑ and things change as we speak, because every once in a while new software is released. But I think for anyone using TLS: be prepared to run some tcpdump sessions and see what's happening if you run into issues. We haven't seen a lot of trouble ourselves.

Our future plans are to ‑‑ well, like I said before, do the CA part, and maybe that will be fine within our own anycast and distribution layer; with external partners I guess we'll still prefer to accept anything as long as the TSIG matches. I also want to experiment ‑‑ which is not quite related to this presentation ‑‑ with the multi‑signer setup which is available in Knot, meaning that you have multiple instances of primary servers which each generate the signed zone, different RRSIGs but the same key set. I can't imagine having difficulties with XoT there, but we'll see to that.

And lastly, we are only using XoT now within our network, our anycast, and with CZ.NIC, which is actually our primary helper with this project; they gave a presentation, I think at one of the CENTR meetings ‑‑ CENTR, the organisation of European ccTLD registries ‑‑ with some European domain operators.

Also, I haven't checked the IETF document repository, but just quickly: there is still no XoQ standardised for transfer over QUIC. I think it would be a good idea, especially for high‑latency, maybe high‑loss configurations like intercontinental satellite links and all that, because QUIC performs better than TCP, depending on your TCP. So yeah, if anybody here has heard of any DNS software being supplemented with transfer over QUIC, I guess that would be an interesting thing to try; maybe report on that in a couple of years.

These are my generic outlines for how to be successful in new enterprises: start small, do things slowly, and be prepared to fail and recover.

These are my small ... thanks to everybody involved. And yeah, we are still doing this while the war in Ukraine continues; I don't want to talk about that at this moment, but yeah, it has been a bit difficult dealing with it. That's unrelated to the topic of the working group, but I always mention it. Any questions for me, please either email me or ask me right now; I am also on Mastodon. Thank you. (APPLAUSE.)

WILLEM TOOROP: We are good with time.

DMYTRO KOHMANYUK: I hope I was good with time; I probably wasn't really watching, I probably had some time left.

WILLEM TOOROP: Absolutely.

DMYTRO KOHMANYUK: Sorry if I was talking a bit too fast.

DAVE KNIGHT: Dave Knight. I just want to share a common experience: about 18 months ago we did this for our own TLD platform at Ultra, and I just want to say that our experience and the decisions we made seem very similar to what you have done. When we started, at that time Knot didn't yet support XoT, so we also had a BIND 9 distribution layer, but beneath that our customers can choose if they want to run BIND or NSD or Knot. So really, just to share that we now have 18 months of experience of interop internally. We made some slightly different decisions: we forced ‑‑ like, there's no fallback to Do53 internally, we force XoT for all of our internal communications, and we configure it explicitly externally depending on what the customer wants to do. Also, we made the same decision as you: we don't attempt to do TLS authentication, we keep the same operational semantics that everybody is familiar with, we have TLS and use TSIG for authentication. I wanted to share that experience.

DMYTRO KOHMANYUK: I am guilty, I updated the deck last night, because I gave a presentation at ICANN this year and I wanted to focus more on the technical side; these are more like reference points, and there I was talking with a different focus. There are three modes in the RFC ‑‑ I can't recall the number right now ‑‑ mutual, opportunistic and ...: mutual means they have a common pre‑shared key, opportunistic means whatever comes in we trust, and the other one is fully CA‑validated. I think we should go all in and implement key sharing in the DNS, right, because we can put that into the DNS, that makes sense ‑‑ no, I am not really arguing for that right now. But barring that, I don't want to build a pyramid of CA code and validation code at the bottom of that zone distribution infrastructure's dependencies, because I love crypto, but crypto is hard, things break, a lot of people don't have the experience, and it's already difficult to configure the DNS infrastructure; it's becoming more difficult with DNSSEC and everything, and XoT is just another variable. Thank you for your experience, we can talk about it later, and congrats on the new job. I will let other people ask. Thank you for your comments.

SPEAKER: Eric Vin. Thank you for the presentation, it's nice to see the output of the IETF working group being deployed in reality. Talking about the IETF, there's RFC 9250, which is about DNS over QUIC, and I think there's zone transfer in it.

DMYTRO KOHMANYUK: Yeah I know it's the usual... recommended to 50.

SPEAKER: I think they got zone transfer into it.

WILLEM TOOROP: Yes I can confirm, this was also mentioned on chat.

DMYTRO KOHMANYUK: I don't think XoQ is a term we have seen; I am sure the RFC can cover the transfers. I would just think maybe we should start using this acronym ‑‑ in parallel, perhaps there can be another RFC specifically about transfers.

NIALL O'REILLY: As a micro zone manager: I am aware of a couple of other name server implementations which have integrated key provisioning and signature provisioning into the name server; there are probably more than the ones I am aware of. I missed whether you had evaluated any of the other ones that seem, at least from reading the docs, to give the same kind of functionality, and what the advantage of Knot over them was, if you did.

DMYTRO KOHMANYUK: Sorry ‑‑ I got it. You are saying other software can support easier provisioning of these shared keys between different servers? And you think we should try that instead?

NIALL O'REILLY: Well no, I'm doing something similar with BIND, I think PowerDNS does the same thing, and maybe Knot has some advantages that I am not aware of. In my copious forthcoming free time I hope to play with some of those, and I know that the architecture of NSD is different and there's a different provisioning chain there, and I am talking to the folks at NLnet Labs. So I want to get back to exploring this stuff and taking the input I have from people who got there before me. We'll follow up by email or Mastodon.

DMYTRO KOHMANYUK: Thanks for the suggestion; catalog zones can potentially be used for provisioning, but that's a big discussion, we could use an entire hour.

NIALL O'REILLY: Let's not do it now.

DMYTRO KOHMANYUK: Thanks everybody.

(APPLAUSE.)

WILLEM TOOROP: I had one more comment in the chat, which I will relay to you now, which is ... commenting on your observation that notifies aren't logged properly, and he says: we are happy to improve logging, please tell us what is missing, we don't know what we don't know, you know.

Next up is Shane Kerr with DNS TTL upper limits in practice.

SHANE KERR: Hello everyone, good morning still. So it's always difficult at a RIPE meeting to know exactly what level of technical detail to aim for; at a RIPE meeting I tend to assume the level of DNS expertise is not quite as high, so if you don't do DNS every day, good, this is maybe for you; if you do, I apologise. But it should still be interesting.

So this is about DNS TTL upper limits in practice. The reason I started looking at this: we are a large authoritative server vendor, and what we want to do is better understand the behaviour of DNS resolvers, because those are really the other part of the network we interact with. We have a bunch of models in our brains of how this stuff works, and we have also dealt with problematic configurations, but what we don't have is data and actual detailed analysis, so we are starting to step into that gap now and figure out how resolvers work with our systems.

So this is basically a presentation about a very simple measurement that we did. The idea was that we want to check what TTL caps resolvers have in the real world, and the idea was to use the RIPE Atlas network to do this.

Before I get into the detail of that, let's talk about how TTLs work. Any component can cache a DNS record: you look it up, keep it, and don't look it up again for up to a certain amount of time, and that certain amount of time is left up to the person who manages the authoritative zone. The idea there is, for example, if you know that you have a mail server that changes frequently, you will give it a low TTL so you can move it around; if you have a record that's basically static, you can give it a long one and save everybody the effort of having to look it up.

So here are some examples; I put them in orange because I live in Holland. You can see different records have different TTLs; these are simple queries. The bottom one is a CNAME chain, and you can see how the records each have their own TTLs, and the resolvers and devices keep track of all of them.
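(A minimal sketch, assuming dnspython and a placeholder name, of how each RRset in an answer ‑‑ including every link of a CNAME chain ‑‑ carries its own TTL.)

```python
# Minimal sketch (assumptions: dnspython; the queried name is a placeholder).
# Every RRset in the answer section, e.g. each CNAME in a chain plus the final
# A record, has its own TTL that caches count down independently.
import dns.rdatatype
import dns.resolver

answer = dns.resolver.resolve("www.example.com", "A")
for rrset in answer.response.answer:
    print(f"{rrset.name} TTL={rrset.ttl} {dns.rdatatype.to_text(rrset.rdtype)}")
```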

Now, are there problems with caching? I think anyone who has ever worked with the DNS has hit problems with having records cached that you want to go away, so caching for long periods of time can be problematic. On the recursive resolver side it takes more memory, and of course serving stale answers is really a problem. It makes network migrations and things like that tricky, because you can't safely move to a new IP address if the old one is still being used in some places.

Now, my hypothesis ‑‑ I state it as a fact here ‑‑ is that large TTLs are probably a mistake: the person configuring the data didn't understand the implications. If you see a TTL of six months or a year, it's probably that they didn't really think about it or understand what it would actually mean.

Final slide about TTLs; this is where we get a little bit closer to the measurement. Resolvers apply a maximum TTL: because large TTLs are a mistake, a recursive resolver doesn't want to cache them for that long, so TTLs usually have a maximum. If I have a web record and I say mine is six months, the recursive resolver will probably save it for a maximum of a day or an hour. I think all recursive resolvers have defaults for this, and administrators can play around with it, depending on the balance of memory and operational concerns. As an aside, there's also usually a minimum TTL; this is often not configured at all by default, but a lot of recursive resolver operators set it. The idea being that if a record only has a one‑second TTL, it's basically going to get queried every time a user wants to look at it, and a recursive resolver operator may say: we don't care, we are going to set the minimum to five or ten seconds or even a minute. It's not allowed in the protocol, but it is common practice and probably fine, since the DNS is not a very tightly consistent system at all times.
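(A toy sketch of the clamping behaviour just described ‑‑ purely illustrative; real resolvers implement this internally and expose it through settings such as a maximum, and sometimes a minimum, cache TTL.)

```python
# Toy sketch of how a resolver caps the TTL it caches (illustrative values).
MAX_CACHE_TTL = 86400    # e.g. cap everything at one day
MIN_CACHE_TTL = 0        # a minimum is usually off by default

def clamp_ttl(authoritative_ttl: int) -> int:
    # Cache for the published TTL, but never longer than the max
    # and never shorter than the (optional) min.
    return max(MIN_CACHE_TTL, min(authoritative_ttl, MAX_CACHE_TTL))

for ttl in (1, 3600, 604800, 2**31 - 1):
    print(f"published {ttl:>10}s -> cached {clamp_ttl(ttl)}s")
```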

All right. So why do we care about these details? High TTLs can reduce the query load ‑‑ fewer queries ‑‑ which is important for scaling things; also, we charge based on queries, so our customers are very concerned about how many queries we are getting, probably more than we are: our infrastructure can handle it, but they have to pay for it. Low TTLs are very important if you are doing traffic engineering and monitoring of your server back end, and if you want to change between IP addresses and things like that; so if your services are cloud based and all your things are ephemeral, you want low TTLs.

How do we find these maximums that the resolvers have configured? We configure a record with the maximum possible allowed TTL. The TTL field is a 32‑bit field; it used to be signed, then they declared it couldn't be negative, so now it's effectively a 31‑bit field. DNS is an awful protocol. Anyway.

So we have this record, we query for it, and we use a unique name to ensure it's not already cached, so we can have a more accurate assessment of what the TTL is. We use a wildcard, we look it up via a caching recursive resolver and check the TTL; that's basically it. We use RIPE Atlas for this. If you are not familiar with RIPE Atlas, it's a bunch of small probes put in networks and homes and offices all around the world, and it's an awesome resource; I use it for a lot of stuff, and one of the things that the RIPE Atlas probes can do is use the recursive resolver that's provided by the network. There's actually a REST API that can do almost anything with RIPE Atlas, and I used it to find all of the RIPE Atlas probes listed as connected ‑‑ at that time there were 14,000 of them ‑‑ and then, I didn't have to, but I created a separate measurement for each probe to look up the long‑TTL record, and that's all we did. We spread them out over time a little bit, just to not destroy the RIPE Atlas network; I suspect they have protections anyway, but it was fine. And yeah, that's it basically: we are just going to look up this record from everywhere.
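(A rough sketch of creating such a per‑probe measurement through the RIPE Atlas REST API, here via the ripe.atlas.cousteau client library; the API key, probe ID and query name are placeholders, and this illustrates the approach rather than reproducing Shane's actual tooling.)

```python
# Rough sketch (assumptions: ripe.atlas.cousteau installed, a valid Atlas API
# key, placeholder probe ID and test domain). Creates one one-off DNS
# measurement for one probe, asking the probe's own resolvers for a unique
# name under a wildcard that the authoritative side serves with a huge TTL.
from ripe.atlas.cousteau import AtlasCreateRequest, AtlasSource, Dns

API_KEY = "YOUR_ATLAS_API_KEY"   # placeholder
probe_id = 12345                 # placeholder; in practice, one per connected probe

measurement = Dns(
    af=4,
    query_class="IN",
    query_type="TXT",
    query_argument=f"probe-{probe_id}.ttltest.example.com",  # unique, so not cached yet
    use_probe_resolver=True,     # query the resolvers the network gave the probe
    description=f"TTL cap check, probe {probe_id}",
)
source = AtlasSource(type="probes", value=str(probe_id), requested=1)

request = AtlasCreateRequest(
    key=API_KEY,
    measurements=[measurement],
    sources=[source],
    is_oneoff=True,
)
is_success, response = request.create()
print("created" if is_success else f"failed: {response}")
```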

So that's on the client side. On the server side it wasn't as easy as I had hoped: we needed an authoritative server with this huge TTL. We at IBM NS1 recently changed the maximum TTL we answer with to one day: we do online DNSSEC signing, and we sign for 25 hours so we get a little bit of clock skew allowance in how long the signatures are good, and we want to make sure the signatures have expired from the caches of non‑validating intermediaries, so we set the maximum to a day. So I couldn't use our own system to do this.

I tried to set up a BIND 9 server and it couldn't do it either: I set it up as a DNSSEC‑enabled thing, and because of something in BIND 9 ‑‑ I don't really know exactly what ‑‑ it was refusing to do long TTLs, probably for similar reasons to what we had discovered on our side; I think it was maybe more about rotation of keys. Anyway.

The simple solution was to serve the zone unsigned, and this was fine because the TTLs themselves are not signed in DNSSEC; the maximum TTL isn't validated, so it wouldn't really affect the data, it's fine. A few limitations ‑‑ I will show you the graphs in a minute. The first limitation, like I said: RIPE Atlas is awesome, but it is a biased network. It's funded and run by the RIPE NCC, it's heavily European‑centric, and the hosts are mostly people that care about networking and things like that, so they put the probes in mostly well‑run networks, and so on and so forth. Another limitation is that resolvers can be chained, and that can obscure results: a lot of the time the RIPE Atlas probe will see the recursive resolver address as a private address, like an RFC 1918 address, which is then forwarding somewhere else, and there's no way we can know that. There may be ways, but it's not within the scope of this research to figure that out.

Related to the limitations, there are some quirks: about 10% of the measurements failed. I didn't really try to figure that out; it's fine. Also, we saw a lot of results with off‑by‑one errors, and I am not 100% sure they are off by one ‑‑ you can't know what's going on on the recursive resolver side. If you see a result that's 86,399, it's probably not that the maximum TTL was configured to that value; it's probably that the maximum TTL was a day and you are querying the same server a second later and it's counting down. That's the assumption I am going with; if anyone has a better explanation, I am open to it.

What this is all about: the RIPE Atlas probe will get the list of resolvers from the network, I guess using DHCP, I don't really know. In the measurement it will query them all. So if it's querying them from the same place, and they actually go to the same server, you get these off‑by‑ones. It doesn't drastically affect the results.

We have a slide. Cool. So I struggled to find a good way to represent this data, but basically you can see here we just have a few stand‑out spikes. One hour pops up, which makes sense: I can imagine a lot of people think an hour is long enough, and if it gets queried again, that's fine. That's six hours there, that's Google; if you go to quad‑8, it has a six‑hour maximum, and anything longer than that, it will query again. Twelve hours, I think, is Quad9, which is why that's there. There are other resolvers configured this way by operators, but the vast majority are these large operators. One day is the peak; that's where the biggest number of resolvers have their limit. I guess it's probably a default in some places; as far as I know none of the big public resolvers have that, it's people running their own ISPs and things like that. There's also a one‑week spike: OpenDNS has that as a limit, and it's also, I think, configured as a default in other resolvers as well ‑‑ maybe BIND has that, I am not sure.

I put that little arrow at one year, not because there are a lot of resolvers configured that way, but just to give you an idea: if you look carefully you see the bottom line of the TTLs the different probes got. I want to give you a feel for how we get to the far right, where you see the arrow pointing at 68 years. If you do the maths, two to the 31 seconds comes out to be about 68 years; you will see there are a few dots there on the right, because there's 68 years and 68 years minus one or two seconds, due to that kind of off‑by‑one error I was talking about.

So now, that's that.

I mentioned public resolvers when we were looking at the previous slide. Only about 20% of the probes use a public DNS provider; it's probably higher, because, as I talked about, people go to a local address which may be forwarding ‑‑ again, in this experiment there was no way to know that. It was six hours for Google, 12 hours for Quad9, OpenDNS has seven days, and as far as I can tell, Cloudflare has no maximum TTL. I tested this to prove it to myself, on my laptop, and yeah, you can watch the TTL count down, and I am fairly confident this is a real thing. It's possible, for example, that you could build a recursive resolver which does a query, answers it back, and then inserts it into the cache as a kind of asynchronous process off to the side; so it could be that you would, for example, do the first query, get the full 2^31‑second answer, and then subsequent queries would get read out of cache and would be lower, whatever the cache limit is ‑‑ but that doesn't seem to be what's happening here. I actually reached out to Cloudflare and said, hey, you don't have TTL limits, and they said, we do, and I never heard back; maybe they are fixing a bug. This is not a problem, it's all within spec, it's just really surprising to me.

And, for example, you may have an operational thing where you reboot each server daily so it's not really a problem because it will get flushed out of the cache, I don't know what's going on there but that's that.

I also did the same graph again, pulling out these large public resolvers, and it doesn't look too different. The six hours for Google is the big difference, and there are of course a lot fewer of the 12‑hour ones from Quad9. All this does is accentuate the fact that one day is really the most commonly configured value for the maximum TTL.

We talked about the peaks here, and I think what's important is to look at what this means for you as someone publishing the data: 70% of the resolvers have a TTL maximum of one day or less. One day is the most common value, about 20% of them. If you go to the next peak, one week, 90% of resolvers will cap the maximum TTL at that, and that means of course, if you do the maths, that 10% have a limit higher than that.

What does that mean? Hold on, I am not at that yet. Just a special mention: you don't see it on my charts here, because the scale is logarithmic and you can't represent zero on a scale like this, but there are resolvers that seem to have a cap of zero ‑‑ I think there were 24 or something like that ‑‑ which is very weird. It basically means they have turned off caching, which again should be fine, there's nothing wrong with that: it's in RFC 1035, it's in the foundational documents of DNS. But it's also weird and unexpected, and I have seen many strange things happen from TTL zero ‑‑ don't do it. I know at one point BIND 4, for a few versions, had the minimum TTL set to one to avoid problems with TTL zero. So, anyway.

They are in there. You can find the data in the measurement if you want to go look on the RIPE Atlas.

Now, what does it all mean. Basically some resolvers have a very short maximum TTL which means that you cannot rely on your data being cached, right, you can hope for it but you can't rely on it, there's nothing in the protocols or anything to prevent people from querying again and again.

And contrariwise, some resolvers have no maximum, which means you can't rely on the data expiring until the full TTL has passed, which is why you need to be conscious of this value.

So my advice for zone owners is: don't bother with a TTL of more than one day. It's not going to buy you anything ‑‑ only 30% of resolvers will ever keep a value longer than that ‑‑ and the longer you make that TTL, the more operational pain you are inflicting on yourself.

For recursive resolver operators, it's probably OK to have a low‑ish cap; we know that six hours is fine because Google does it, and especially if you have some sort of pre‑fetching, then even with a low cap you are still going to be reading from cache. If you don't know pre‑fetching: the idea is that if you have a recursive resolver that's reading answers out of cache and it's getting close to the time where the cache entry is going to expire, the resolver will go query again so that it never actually expires ‑‑ it can refresh the value. There are some resolvers which do that, and that will prevent you having this spike where things take a long time. Anyway, that's basically it; that was the research and the recommendations.

(APPLAUSE.)

ONDREJ SURY: So my question is: there's no recommendation for the recursive resolver implementers. BIND has a default of one week for the maximum TTL; should we make it one day? Would it make sense?

SHANE KERR: Yes. I don't really have strong recommendations, but I think what's interesting is that it's all kind of ad hoc; looking at the data, one day is by far the most used. However, again, this is a biased data set, so we can't really know, but it looks like one day is probably the best.

SPEAKER: I have two questions, the first one is have you looked for or found any differences of how things are cached depending on where, which level in the tree, so are TLDs cached in a different way than enterprise level?

SHANE KERR: There are lots of different possibilities here. The question was specifically about different levels in the hierarchy. I didn't look; I would be surprised if that were a thing, although maybe for the root zone, because the root zone is a bit special. I also didn't look at different types: address records may have different limits than other records ‑‑ this was a TXT record ‑‑ so, for example, Cloudflare may have a one‑hour cap for address records, right, because operationally those matter quite a bit, and similarly NS records and CNAMEs and even SOAs and DNSSEC data may have different caps. I also didn't look at any software or anything like that; I didn't look at source code.

SPEAKER: Thank you. That ties into my second question actually which is have you identified any, do you have a notion of how many different implementations you have been looking at? Can you ‑‑ oh, 25% are running, I don't know what it is, but it's the same. No such numbers?

SHANE KERR: I didn't look at that at all. It's a really interesting question; I am not sure how you would derive that data.

SPEAKER: Neither do I but you are better than me.

SHANE KERR: It's possible that version.BIND or NSID queries might return interesting things there, I didn't do that, that would be a separate measurement which would be very interesting and I hope somebody does that and let us all know.

SPEAKER: Thank you.

SPEAKER: I am just wondering how to set the TTL values for reverse zones. These are used, for instance, by mail servers, which have their own resolvers with their own cache, and for this it would make sense to have a long TTL, wouldn't it?

SHANE KERR: For which type of zones do you want to do this for?

SPEAKER: ...for PTR records.

SHANE KERR: That's an interesting question. I think with reverse zones it's going to be the same as for forward zones, in the sense that it's going to depend on how often you expect the data to change. Right? If you are looking, for example, at residential cable modems or whatever, and you want to remove records as they go offline and things like that, maybe a lower TTL; if it's your own infrastructure that's going to be around for weeks and years, yeah, a high TTL seems like the way to go. But even in that case, with a TTL greater than a day you are not buying anything, you are making your life hard.

SPEAKER: My idea was that we need reverse resolution for mail servers to avoid them being considered spammy, and as the resolution is done by other mail servers... there could be one exception maybe.

SHANE KERR: That's an interesting idea, for a specific type of service to have a best practice of running a longer TTL. Now, there was some discussion when Willem presented his research at ‑‑ which working group, was it MAT? Yeah ‑‑ about what mail server people were running, and there was a mention that mail server operators don't use public resolvers, they run their own. I have no intuition about whether that's true, but if it is true, having best practices for mail server operators in setting TTLs seems reasonable to me; I just don't know if that would work in practice.

WILLEM TOOROP: Before you respond ‑‑ and this is without my co‑chair hat ‑‑ there is an RFC which recommends a maximum TTL, and it's the serve‑stale RFC, 8767, and it recommends seven days, so BIND is following...

ONDREJ SURY: I stopped listening when you said serve‑stale, no, thank you. But I would like to react to why we would make a special exception for mail servers, because everything over one day is not going to save you anything. So what are you trying to save, a few milliseconds on the next resolution? Or mail delivery during a DNS outage? If there's an outage of the DNS, you don't control the expiration anyway; the DNS must be running all the time. I don't see any advantage in having different TTLs for different services, because it's not going to save you anything. There's a difference between five seconds and a day, but there's no difference between a day and seven days, or a week, or a month. So I completely agree with you: large TTLs make no sense, even for different services, I think.

SHANE KERR: I kind of agree with you, but I'm not a mail operator, so I don't really know; certainly I would think mail as a service is so redundant, and was built at an early time when the internet was even flakier than today, that it can handle outages and delays.

SPEAKER: Have you brought this to the IETF?

SHANE KERR: To who?

SPEAKER: IETF.

SHANE KERR: No, I haven't. I just finished this a month ago because I thought it might be interesting.

SPEAKER: I looked at the root zone, and the NS records are six days and the DNSKEY records are two days.

SHANE KERR: I think that's a really interesting point, and I looked at these records while preparing for the presentation to get a feel. I do think that many of the TTLs in the root zone are in the worst possible space: they are taking all the pain of not being able to change quickly and not getting any benefit, because nobody actually stores them that long ‑‑ or 90% don't store them that long.

SPEAKER: One more point: thank you for doing this, because this group of people has a tendency to just pull numbers out of thin air, and you actually did the research, so we have solid data on which to base a decision about what the maximum TTL should be. Because it's often the case that someone goes, hmm, seven days would be good enough, and that goes into the standard. That's one of the things I am trying to change ‑‑ make decisions based on data ‑‑ so this is really great research that you did.

SHANE KERR: I am glad that you think so. Just as a follow‑up, I have worked with a few of these types of measurements, and if you are curious about this and have questions about how the DNS is run, and think something like RIPE Atlas could help ‑‑ there are other techniques you can use as well ‑‑ go ahead and contact me. There's no secret sauce in anything I do, and I am happy to help anyone who wants to do this.

WILLEM TOOROP: Thank you, Shane. (APPLAUSE.)

Last but not least, Anand Buddhdev. I can't help noticing that you again look more stunning than you already normally do!

Is there a special reason for that?

ANAND BUDDHDEV: Good afternoon everyone. And yes, Willem, we have been celebrating the festival of Diwali, starting Friday last week and ending today. So it's a festive week for Hindu people everywhere in the world, and I thought I would dress up a little bit.

WILLEM TOOROP: That's fantastic.

ANAND BUDDHDEV: So I am Anand and I am from the RIPE NCC, and I am going to present a short update on what we have been doing since the previous RIPE meeting. First of all, hosted DNS: the RIPE NCC runs various instances of K‑root and AuthDNS in partnership with our community, and this is just a little slide showing how many instances we have ‑‑ we have 128 K‑root instances. This is because we have been expanding K‑root for a long time, so we naturally have a lot more of those. A few years ago we also started expanding our AuthDNS instances, and we have 27 at the moment. Of course we want to have many more of these, because the AuthDNS instances carry ripe.net and all the reverse DNS zones operated by the RIPE NCC. So it's an important service and we would like to anycast it even further.

Since the previous RIPE meeting in Lisbon, we have added five new AuthDNS instances: in Makati City in the Philippines, Latina in Italy, Beirut in Lebanon, Maracaibo in Venezuela, and Colombo in Sri Lanka. That brings the total to 27, and I believe my colleague activated one more today, so maybe it's 28 now.

We would still like more hosts to come forward and offer to host AuthDNS instances. If you are interested in applying, please go to hosted‑dns.ripe.net and apply to host an instance. You can provide a server ‑‑ it can be a virtual server ‑‑ and as long as it meets the requirements we'd be happy to activate an instance of AuthDNS at your internet exchange or in your network.

Something we have been doing since earlier this year is renumbering the IPv6 addresses of our AuthDNS service, and the reason for it is that previously we had just one /48‑sized prefix. Those who operate anycast networks will know that this can cause some problems: when you run an anycast network, it is often a good idea to have a covering prefix which is less specific, but because we only had a /48 there was no way to have a covering prefix.

It also leaves no scope for announcing any kind of test prefix. So our AuthDNS suffered from this issue, in the sense that when we activated a new instance, we had no way of knowing whether everything was working properly, and we would only find out after announcing the production prefix from there. We have solved this by carving out a small prefix from the RIPE NCC's infrastructure IPv6 addresses, and we now have a /48‑sized production prefix and a /48‑sized test prefix. What this allows us to do is announce the test prefix from a new instance, look at our monitoring to see whether the test prefix is reachable, and when we are happy, confidently announce the production prefix and know that things will work well.

So this is a slow process, because it takes time to renumber servers and services. At the moment the new prefixes are fully available everywhere they are announced. We have also renumbered our own name servers ‑‑ we have several name servers with names like manus and pri ‑‑ so we have renumbered those. The RIPE NCC also provides secondary ccTLD service, so we have 24 ccTLDs who get secondary DNS service from us, and the name servers that we operate for them have the form xx.<cctld>.authdns.ripe.net. These are in the root zone, and the root zone needs to be updated with the new IPv6 addresses; we have been working with the TLDs, and 18 are done. There are six that are still pending, so you still see the old addresses for them, and we are in communication with IANA about updating these.

But sometimes things take time.

And we aim to withdraw the old prefix before the end of the year at which point the name servers will no longer answer on the old address.
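
For the remaining glue updates, a simple check along these lines can show which of the xx.cctld.authdns.ripe.net names still resolve into the old prefix. This is only a sketch, not RIPE NCC tooling: it uses the dnspython library, and the prefixes and the server name below are placeholders.

    # Sketch: flag secondary ccTLD name server names whose AAAA records
    # still point into the old, to-be-withdrawn prefix. All values below
    # are hypothetical placeholders.
    import ipaddress
    import dns.resolver  # dnspython

    OLD_PREFIX = ipaddress.ip_network("2001:db8:aaaa::/48")  # old prefix (placeholder)
    NEW_PREFIX = ipaddress.ip_network("2001:db8:bbbb::/48")  # new prefix (placeholder)
    NAMES = ["xx.example.authdns.ripe.net"]                  # placeholder server name

    for name in NAMES:
        for rdata in dns.resolver.resolve(name, "AAAA"):
            addr = ipaddress.ip_address(rdata.address)
            if addr in OLD_PREFIX:
                status = "old prefix - root zone glue not yet updated"
            elif addr in NEW_PREFIX:
                status = "new prefix - renumbering done"
            else:
                status = "unexpected prefix"
            print(f"{name} -> {addr}: {status}")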

We have DNSSEC signers; we have been running them for a long time and we use them to sign all the zones that we operate, including ripe.net. Our previous signers were getting old, they were seven years old, and there was no more warranty available for them. So we purchased a pair of new Dell 360s with similar specs to the old signers, installed the OS on them and restored the secret key material. I would like to point out that we haven't changed anything here: we haven't changed the OS, the configuration or the signer software; everything is exactly as it was. We just have new hardware with warranty, so in case anything goes wrong, we can get support from our vendor.

And both of these signers have been active since August this year.

One of the things that we have been looking at is our statistics collection and graphing. We use a tool called DSC, which collects DNS statistics on our name servers, and for a long time we have been presenting the graphs using the DSC presenter. It's fairly old software; it still does the job, but it's getting a little bit difficult to maintain, because when we upgrade the OS some things don't run on the new OS. So we have been transitioning to Prometheus-based collection: DSC still collects the statistics, but they are then transformed into something that Prometheus can scrape and store in its database, and Grafana on top of that can display the statistics. On this chart here I have two examples: the image on the left is the DSC presenter, showing a server receiving about 1,400 queries per second, and on the right we have the same visualisation in Grafana showing the same data, and as you can see the query rates are the same. Our aim is to switch completely to Grafana soon. One of the things is that we publish these graphs on the ripe.net website, so we need to do some automation within Grafana to get static images out that we can publish on our website for folks to see; we publish stats for both K‑root and the AuthDNS service.
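
The shape of that pipeline can be sketched in a few lines. Assumptions: a hypothetical parse_dsc_counts() helper stands in for reading DSC's output, and the prometheus_client library is used; this is not the RIPE NCC's actual exporter. The idea is simply that DSC-derived counters are exposed on an HTTP endpoint Prometheus scrapes, and Grafana then graphs whatever Prometheus stored.

    # Sketch: expose DSC-derived query rates to Prometheus. The helper and
    # its input format are hypothetical; only the prometheus_client calls
    # are real library API.
    import time
    from prometheus_client import Gauge, start_http_server

    QPS = Gauge("dns_queries_per_second", "Query rate per name server", ["server"])

    def parse_dsc_counts(path):
        """Hypothetical helper: return {server_name: queries_per_second}."""
        return {"authdns-example": 1400.0}   # placeholder instead of real DSC data

    if __name__ == "__main__":
        start_http_server(9100)              # endpoint for Prometheus to scrape
        while True:
            for server, qps in parse_dsc_counts("/var/lib/dsc/latest.xml").items():
                QPS.labels(server=server).set(qps)
            time.sleep(60)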

And with that I come to the end of my presentation. Thank you for listening. And if you have any questions, or comments, I would be willing to listen to those. (APPLAUSE.)

JIM REID: Speaking for myself as always. Nice work, Anand, and thanks to your colleagues; you do a great job keeping all that infrastructure going, and we should thank you for it on a regular basis. You were talking about switching off some of the old IPv6 addresses because of this migration to the new numbering scheme you are using. When the old addresses are no longer live, will you still have something there to keep track of the incoming queries that might be coming to those old addresses? We have seen this happen with other things and TLDs, and whenever the root servers are changed, queries still come in on the old legacy addresses. Do you plan to keep track to see who is maybe still knocking on the old door, as it were?

ANAND BUDDHDEV: Yes, on our servers we have PCAP collection tools and we'll keep collecting the queries that come in, even to the old addresses, because that doesn't really cause any harm. So we can keep doing that for a few months afterwards and see what's going on, and if we see any significant number of queries there, we can try and reach out to people.
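
A sketch of what such a residual-traffic check might look like (assuming the scapy library, a hypothetical capture file and a placeholder old address; this is not the actual RIPE NCC collection setup) is to walk a PCAP and count which sources are still directing DNS queries at the old address:

    # Sketch: count sources still querying the old, to-be-withdrawn address.
    # File name and address are placeholders.
    from collections import Counter
    from scapy.all import rdpcap, IPv6, UDP, DNS

    OLD_ADDR = "2001:db8:aaaa::53"      # hypothetical old service address
    sources = Counter()

    for pkt in rdpcap("authdns-old-addr.pcap"):
        if IPv6 in pkt and UDP in pkt and DNS in pkt and pkt[DNS].qr == 0:
            if pkt[IPv6].dst == OLD_ADDR:
                sources[pkt[IPv6].src] += 1

    for src, count in sources.most_common(10):
        print(f"{src} sent {count} queries to the old address")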

JIM REID: Great stuff, thank you.

ONDREJ SURY: So I was thinking, can we work together so you can just remove DSC and pull the data you need into Prometheus directly from BIND and the other vendors? What I mean is, from the software developer's point of view, it's often very hard to just make a specification that would work for the operators. So if you have time to actually write down what you need from us, it would tremendously help us to make something useful for more people, and you do have the expertise to know what you want. So what I am asking is for you to do more work, sorry! But it would be helpful to hear from you so we can implement something that's useful for more people.

ANAND BUDDHDEV: Thank you very much for that comment, Ondrej, and I think it would be useful, but one of the reasons we use DSC is that we have multiple name servers: we have BIND in some places, Knot and NSD in others, and they produce reports in very different ways. The output they give us is quite different, and DSC provides a uniform view of the queries coming in, the responses going out and things like that.

So that's actually the reason we are not using stats from the software directly.

ONDREJ SURY: I understand that. What I was trying to say is to make this a multi‑vendor initiative, like catalog zones or RPZ or stuff like that. I know that there are some differences between the servers, but DSC makes them unified; if we could have a set of statistics that all of us would provide in the same format, it would probably help other people than just RIPE.

ANAND BUDDHDEV: I get you now; what you are suggesting is DSC-like output, uniform across the various different bits of software? That would be great, I think.

ONDREJ SURY: I am also creating work for the other vendors, but I think it would be really useful to have something unified among all the vendors; well, all the open source vendors.

ANAND BUDDHDEV: Sure.

ONDREJ SURY: Are you enthusiastic about this?

ANAND BUDDHDEV: Excellent, well, there is enthusiasm; I think this is something we can discuss further. All right, any other questions or comments? In that case, thank you again for listening.

WILLEM TOOROP: Thank you Anand.

(APPLAUSE.)

I have a few household messages still.

First of all, don't forget to rate the talks; this is helpful for the co‑chairs and also for the speakers. But more importantly, this afternoon there is a Birds of a Feather session on DNS over TLS and DNS over QUIC from the recursive resolver to the authoritative, so please don't miss that because it's going to be great.

Do you want to say something?

SPEAKER: Lehman here. You might get more ratings for the talks if you don't require a two‑step login system just to be able to rate the talk.

(APPLAUSE.)

WILLEM TOOROP: Yes, so we will definitely take this up in the chair meeting, which will be at lunch today.

Some other things: remember to vote in the Programme Committee election by five o'clock today. If you registered to vote in the NRO NC election, remember to vote by Friday at 9:00 in the morning, and you can chat with the NRO NC representatives from 1:00 till 2:00 today at the Meet & Greet desk.

And so that concludes our working group session.

MORITZ MULLER: Hold on, hold on. As this is the last session you are chairing, we would like to thank you a lot for the work you have put into this working group. It was a pleasure working with you.

WILLEM TOOROP: Likewise.

MORITZ MULLER: And I mean... the Programme Committee thanks you for everything; a big round of applause for Willem.

(APPLAUSE.)

WILLEM TOOROP: It's tea! High mountain Oolong tea! Thank you. Thank you for the card.

(APPLAUSE.)

(Lunch break)