Keenan

Devblog: Server Issues Postmortem & Future


12 minutes ago, Morhedron said:

What'll happen to the hamsters?

 

We tie balloons to them. This is how cloud works.


The last time I priced out a high-IOPS low-latency always-on 'cloud' offering it was several times as expensive as our own physical boxen over a three year cycle (duh) while being less powerful. Nice if you're running generic web crap, but a horrendous (and expensive) way to host game servers especially if you can't scale down during AUTZ without downtime.

 

//edit I hope you set spending limits on your account. You can rack up extremely high bills in a very, very short amount of time. AWS support does waive them on request, usually, but I wouldn't rely on it.

Edited by Batolemaeus


Without all the details that Keenan and the Wurm developers have, we can't make strong conclusions about what's "right" or "wrong". AWS, like many other options, will work just fine and cost a reasonable amount when implemented properly.

4 hours ago, Batolemaeus said:

The last time I priced out a high-IOPS low-latency always-on 'cloud' offering it was several times as expensive as our own physical boxen over a three year cycle (duh) while being less powerful. Nice if you're running generic web crap, but a horrendous (and expensive) way to host game servers especially if you can't scale down during AUTZ without downtime.

 

//edit I hope you set spending limits on your account. You can rack up extremely high bills in a very, very short amount of time. AWS support does waive them on request, usually, but I wouldn't rely on it.

 

I've crunched the performance numbers on the EBS volumes and of course they're lower than SSD hardware. We plan on doing a stress test to see what the impact of that is and what we can do to mitigate it. Wurm doesn't rely on reads as much as writes, and writes can be optimized in other ways. The only heavy read time on a Wurm server is during initial load. This is why Xanadu takes so long - it literally loads nearly everything into memory except offline players, their inventories, and their tamed animals.
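The write-side optimization mentioned here can be sketched generically: coalesce dirty entries in memory and flush them in batches, so slow storage latency is paid once per batch rather than once per write. A minimal illustration with invented names, not Wurm's actual persistence code:

```python
import threading

class WriteBuffer:
    """Coalesce frequent small writes and flush them in one batch, so
    slow storage latency is paid once per flush instead of per write.
    Illustrative sketch only - not Wurm's actual persistence code."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # receives a dict of dirty entries
        self.dirty = {}            # key -> latest value; rewrites coalesce
        self.lock = threading.Lock()

    def write(self, key, value):
        with self.lock:
            self.dirty[key] = value    # no I/O on the caller's path

    def flush(self):
        with self.lock:
            batch, self.dirty = self.dirty, {}
        if batch:
            self.flush_fn(batch)       # one slow I/O call for the whole batch

# 1000 updates to 10 distinct keys collapse into a single flush of 10 entries.
flushed = []
buf = WriteBuffer(flushed.append)
for i in range(1000):
    buf.write(i % 10, i)
buf.flush()
print(len(flushed), len(flushed[0]))  # 1 10
```

The point is that a rewrite of an already-dirty key costs nothing, which suits a game server that touches the same tiles over and over between saves.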

 

We will certainly be putting proper limits on the account. :)


I haven't got a clue what half of this technical stuff means, but I'd just like to say thanks for all the hard work you're doing!


Keenan, I just want to say...

Thank you for keeping us updated and for doing this.

It is much appreciated.


I just like the Frankfurt hosting... if it's anything close to the Frankfurt server for PoE, 43-45ms latency sounds great!

* Before the latest lag issues, on release I was getting the same latency, so I couldn't complain about lag before anyway... yet right now latency varies ~80-120ms 😬

12 hours ago, Keenan said:

 

I've crunched the performance numbers on the EBS volumes and of course they're lower than SSD hardware. We plan on doing a stress test to see what the impact of that is and what we can do to mitigate it. Wurm doesn't rely on reads as much as writes, and writes can be optimized in other ways. The only heavy read time on a Wurm server is during initial load. This is why Xanadu takes so long - it literally loads nearly everything into memory except offline players, their inventories, and their tamed animals.

 

We will certainly be putting proper limits on the account. :)

Yes! It feels like my PC is loading the whole Xanadu server into memory every time I open the client and log in... and my RAM usage seems to think so too!

 

Wait... Oh! You mean the actual server, not my PC... Got it...!



Very excited about these cloud hamsters.


It's been a bit, so time for an update!

 

I'm still working out the details of getting our server code on the instances in a sane way. I've also been swamped at work so far, but it is only Tuesday.

 

With the downtime today, Independence is back on the spare hardware. It was definitely yet another drive issue that was causing the lag. At this point, Indy will simply stay on the spare hardware until we move hosts. In the meantime I'll be bringing the situation up with Hetzner in case we need the hardware for another emergency. (Well, I'll ask Rolf to!)

 

I will say that it's really cool to see how all this works. I've always been fascinated by just how much you can create with a few lines of script when doing this kind of thing. The fact that I've broken it apart enough that I can just drop a logical game server (i.e. Release) onto an instance and it sets up all its resources is pretty nice. I wasn't entirely sure if I could abstract it out like that. :)

 

So the steps remaining (TESTING happening between each obviously!):

1) Deploy the proper build of the wurm server automatically based on cluster (i.e. latest snapshot for test, or latest release for live)

2) Script EBS snapshots for backup purposes

3) Update the database with the proper public and private IPs, since we allocate these when an instance starts.

> --- < This is where we can begin testing servers on AWS > --- <

4) Handle Golden Valley - we'll either need to move the DNS hosting to AWS so I can update it with the current public IP for GV or I'll have to find some other way to ensure the DNS points to the login server.

5) Work with Taufiq on moving the shop to AWS

6) Websites: Wiki, Forum, Www
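Step 2 above is mostly rotation logic; the actual EBS calls would go through the AWS CLI or boto3 (`create_snapshot` / `describe_snapshots` / `delete_snapshot`), which I won't fake here. A sketch of just the retention decision, with invented names:

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshots, keep_days=7, now=None):
    """Given (snapshot_id, created_at) pairs, return the ids that have
    aged out of the retention window. Pure logic only; a real backup
    script would feed this from the EC2 API and act on the result."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=keep_days)
    return [sid for sid, created in snapshots if created < cutoff]

now = datetime(2019, 3, 10)
snaps = [
    ("snap-aaa", datetime(2019, 3, 1)),  # 9 days old -> prune
    ("snap-bbb", datetime(2019, 3, 8)),  # 2 days old -> keep
]
print(snapshots_to_delete(snaps, keep_days=7, now=now))  # ['snap-aaa']
```

Keeping the decision separate from the API calls also makes it trivial to test without touching an AWS account.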


No matter the pros and cons of cloud versus physical machines, I love how technical details are discussed here. Finally this forum is becoming a place where you can share your knowledge and learn new things. AWESOME!

 

Another thought we already discussed briefly is that Wurm could really benefit from working with Docker containers, which would also make Wurm more cloud friendly. Have a clean way to install a MySQL or MariaDB database in the container, some scripts which automatically configure DB access when the container is installed, a webserver, and the other helpful things the server needs, all inside the container.

 

This would make it possible to take any machine which runs Docker and just start a server on it within seconds. It would be a big benefit not just for WO; especially WU players who want to host a server and do not know much about the technical details behind it could have an improved experience.
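As a rough illustration of the idea, a minimal Dockerfile for such a server could look like the sketch below. The base image, paths, port, and memory setting are made up for illustration, not the real Wurm Unlimited layout:

```dockerfile
# Hypothetical sketch: base image, paths, port and heap size are
# examples only, not the actual Wurm Unlimited server layout.
FROM openjdk:8-jre
WORKDIR /wurm
COPY server/ /wurm/
# Example game port; real ports depend on the server configuration
EXPOSE 3724
CMD ["java", "-Xmx4g", "-jar", "server.jar"]
```

A companion MariaDB container plus a startup script that writes the DB credentials into the server config (e.g. via docker-compose) would complete the picture.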

 

 

Small warning: do not host a Wurm server on a system whose IO times are worse than an SSD's. Wurm doesn't need many IOPS, but it needs very fast individual IO operations, since it is single threaded, and as we analysed, IO delays are the main cause of lag. Nowadays you should get about 3000 MByte/s reads and 1000 MByte/s writes, which is what NVMe SSDs can give you. Plus the seek time of NVMe SSDs is around 0.02ms, which is also about a factor of 10 faster than SATA SSD seek times. So if you want to really improve the hosting situation, choose a platform which comes close to those speeds, and also take into account that the solution should be able to keep up with the next years of new technology.
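The point about latency mattering more than throughput is easy to check on any box. A small Python sketch (not Wurm code) that measures the cost of the write-plus-sync round trip a persistence layer pays on every synchronous commit:

```python
import os
import tempfile
import time

def fsync_latency_ms(samples=50):
    """Average time for a small write followed by fsync - the kind of
    synchronous disk round-trip that stalls a single-threaded game loop."""
    fd, path = tempfile.mkstemp()
    try:
        total = 0.0
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, b"x" * 4096)   # one 4 KiB page
            os.fsync(fd)                # force it to stable storage
            total += time.perf_counter() - start
        return 1000.0 * total / samples
    finally:
        os.close(fd)
        os.remove(path)

if __name__ == "__main__":
    print(f"avg write+fsync latency: {fsync_latency_ms():.3f} ms")
```

On local NVMe this typically lands well under a millisecond; on network-backed volumes it can be an order of magnitude higher, which is exactly the kind of stall being warned about here.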

 

Those are the speeds for sequential (block size = 128 KiB) reads/writes with multiple queues & threads on my newest NVMe SSD. It is a beast; you've never seen a Wurm server run that smoothly.

 

[benchmark screenshot]

 

Edited by Sklo:D


So it's been a bit since I've last posted.

 

I've spent a lot of time trying to sort out how to get our build artifacts out of the Nexus repository we have without doing janky things. It's looking like my best bet is a maven project to handle building the server docker instance. I also borrowed some time from this project to move Independence again as well as do my part for the WU Beta.

 

As for Sklo's last post, I'm well aware of the performance issues surrounding EBS volumes. I'm just not entirely sure how they'll affect us until we get some testing done on them. I'm doing all my testing in my own personal AWS playground so as not to commit us to anything just yet. If things start looking bad and I can't make it work without serious performance hits, then we'll adjust our strategy and find a new path forward. I will again say that Wurm's servers depend more on write time than read time, and most of that write time is split between map saves and the database. MySQL already handles buffering writes fairly well, so we could work around disk write latency by doing something similar with map saves, or even offloading them to a new thread. I think that was intended at some point, but if memory serves, it's still tied to the main server loop. Correct me if I'm wrong, but be nice about it! It's been over a year since I last looked at that code.

 

More updates as I have them.

1 hour ago, Keenan said:

MySQL

 

That's a good point. While IO latency murders WU because sqlite does a bunch of syncing on writes (and for good reason), WO is probably less affected because it runs on MySQL, which happily buffers writes.

 

1 hour ago, Keenan said:

doing something similar with map saves - or even offloading it to a new thread

 

It already is, if USE_SCHEDULED_EXECUTOR_TO_SAVE_DIRTY_MESH_ROWS is enabled (and it's on by default in WU).
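The effect of that flag can be mimicked in a few lines: the game loop only marks rows dirty, and a background timer does the slow saving. A rough Python analogue with invented names, not the actual WU code path:

```python
import threading
import time

class DirtyRowSaver:
    """Rough analogue of saving dirty mesh rows on a scheduled executor:
    the game loop only marks rows dirty (O(1), no I/O); a background
    timer thread performs the slow saves. Not the actual WU code."""

    def __init__(self, save_fn, interval=0.01):
        self.save_fn = save_fn
        self.interval = interval
        self.dirty = set()
        self.lock = threading.Lock()
        self.stopped = threading.Event()

    def mark_dirty(self, row):         # called from the game loop
        with self.lock:
            self.dirty.add(row)

    def _tick(self):
        if self.stopped.is_set():
            return
        with self.lock:
            rows, self.dirty = self.dirty, set()
        for row in sorted(rows):
            self.save_fn(row)          # disk write happens off the main loop
        threading.Timer(self.interval, self._tick).start()

    def start(self):
        threading.Timer(self.interval, self._tick).start()

    def stop(self):
        self.stopped.set()

saved = []
saver = DirtyRowSaver(saved.append)
saver.start()
for row in (3, 1, 2, 1):               # the "game loop": marking is instant
    saver.mark_dirty(row)
time.sleep(0.1)                        # give the background thread time
saver.stop()
print(saved)  # duplicates coalesced, written off the main loop
```

Note the swap-under-lock in `_tick`: the dirty set is exchanged for an empty one in O(1), so the game loop is never blocked while the save is in progress.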

 


Wasn't there a mod or something for WU that allowed it to run on MySQL instead of SQLite?


Can we get an English translation for the other 98% of us who didn't go to geek school?

2 minutes ago, Angelklaine said:

Can we get an English translation for the other 98% of us who didn't go to geek school?

They are trying to build a new server environment that's quick to set up, reliable, less laggy, and keeps players happy. More details to follow...

On 3/5/2019 at 8:42 PM, Keenan said:

I will say that it's really cool to see how all this works. I've always been fascinated at just how much you create with a few lines of script when doing this kind of thing. The fact that I've broken it apart enough so I can just drop a logical game server (i.e. Release) onto an instance and it sets up all resources is pretty nice. I wasn't entirely sure if I could abstract it out like that. :)

 

 

Now set up integration testing. ;) 

21 hours ago, bdew said:

 

That's a good point. While IO latency murders WU because sqlite does a bunch of syncing on writes (and for good reason), WO is probably less affected because it runs on MySQL, which happily buffers writes.

 

 

It already is, if USE_SCHEDULED_EXECUTOR_TO_SAVE_DIRTY_MESH_ROWS is enabled (and it's on by default in WU)

 

 

If I recall, I think it *tries* to do this but there's still something blocking the main game loop on write. I'll have to poke at it again.

1 hour ago, Batolemaeus said:

 

Now set up integration testing. ;) 

 

Testing is a sore spot with me! We need much more of it, but right now our best testing comes from a single person who is amazing at it. Still, we have issues that simply fail to show up until the code hits live. What's exciting about this is how I will be able to take a snapshot of a live server EBS volume and stand up a test server mirror in a few minutes. :)

6 hours ago, Keenan said:

Still, we have issues that simply fail to show up until the code hits live.

This would explain why so many developers across games get blindsided by some pretty obvious bugs. Why does it happen? I'm curious what makes testing (or beta) different from live.

35 minutes ago, Angelklaine said:

This would explain why so many developers across games get blindsided by some pretty obvious bugs. Why does it happen? I'm curious what makes testing (or beta) different from live.

Hundreds of people going at something instead of a few is usually the biggest difference.

1 hour ago, Samool said:

Hundreds of people going at something instead of a few is usually the biggest difference.

I have always felt bad for developers because of this. There is no possible way for them to test every possible combination of things. Sometimes bugs don't show up until there's a full server of test subjects. This is one reason why they ask people to try the beta and use the test server.

 


Update!

 

It always happens like this. You think you're ready for something, and then you have a think. Suddenly you realize you missed something very important. For me, I missed the ability to establish DNS records for each instance. This did, however, lead me to a solution for a problem I had inaccurately said was impossible: I figured out what I was doing wrong when trying to assign multiple public IPs to a single network adapter. This basically means we can have a DNS record and IP address for each server, no matter how we're hosting it at a given time.

 

It also means we'll be using a new domain name and we'll have fewer DNS resolution issues.

 

I'm currently troubleshooting some connectivity issues, but I hope to have them resolved today. Then it's back to getting the server starting up and working on the database updates needed to populate the IP addresses. The good news there is that the update only ever has to happen if I need to delete and recreate the network stack. That stack is so simple, though, that I should never have to do that. Either way, I'll have a command that does the update in the event it's needed.

 

As for the game server, I was working on getting Spotify's dockerfile-maven plugin working. This way I can build and push a versioned Docker image of the server on each build. The way I'm going about this should see a separate repository for Wurm Live and Wurm Test. I'm aiming to have everything separate down the line. Right now we use a single Maven repository, which makes things a little more complicated. We can use RELEASE and LATEST, but those are deprecated, and I'd also prefer to be more specific about what version we're pushing to the live servers. Finally, I'd like this process to be something we can kick off from a web interface such as Jenkins. These are the challenges I'm working to overcome.
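For reference, wiring that plugin into a build looks roughly like the fragment below. The repository name, tag, and version are placeholders, not the actual Wurm setup:

```xml
<!-- Hypothetical sketch: repository, tag and plugin version are examples -->
<plugin>
  <groupId>com.spotify</groupId>
  <artifactId>dockerfile-maven-plugin</artifactId>
  <version>1.4.10</version>
  <executions>
    <execution>
      <id>build-and-push</id>
      <goals>
        <goal>build</goal>
        <goal>push</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <repository>registry.example.com/wurm/server</repository>
    <tag>${project.version}</tag>
  </configuration>
</plugin>
```

Tagging with `${project.version}` is what gives each build a pinned, reproducible image instead of relying on the deprecated RELEASE/LATEST resolution.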

 

I'll give more updates as I make more progress. I know this is dragging on longer than I had hoped as I made some quick progress out the gate.


Well, theory, then planned architecture, and then actual implementation are an organic thing that evolves during the process. Thanks a lot for sharing the milestones! At least they are interesting to me, and I don't think I'm the only one...

