Keenan

Devblog: Server Issues Postmortem & Future


 


Hi Everyone,

 

As previously promised, I've taken the time to write a postmortem of the stability issues Independence has had, along with our future plans for server hosting. 

 

Independence began to lag considerably some time before February 7th. We had done a maintenance restart as scheduled and hoped this would fix the issue. It did not. Later in the day we restarted only Independence in an attempt to fix the lag. During this restart I rebooted the server that Independence runs on and upgraded packages. This didn’t resolve the lag either, and I was beginning to suspect it was hardware-related. By February 8th we had a fairly good idea that a drive in the RAID was failing, so we began to move Independence to spare hardware.

 

That hardware was our old Bridges test server, for those interested. It wasn’t new by any means, but it had fewer cycles on it and the drives were fresher. Independence lived there for about two weeks while we worked on restoring the previous hardware. In the end, Hetzner replaced the failed drive as well as the rest of the hardware, leaving just one of the older drives with all the data. I restored the RAID and we upgraded the operating system along with all packages. I had already done this on a test server, but the intention was to make Independence the first live server to get this treatment in quite some time.

 

You may recall that we were planning a very long downtime in the future. This was to do the same for all the other servers and get everything back up to date. Well, more on that in a bit.

 

In the end, Independence is still experiencing lag, and we’re quite aware of it. I believe this to be hardware-related once again, and we will continue to monitor it. The sad part is that all of these issues overshadowed the fine work Samool did to reduce lag across all servers. We can move Independence back to the spare hardware should it become necessary, though I am trying to isolate the problem first. If it becomes unplayable, we’ll do the move.

 

While Independence was taking up all my time, Xanadu wanted some attention as well. You may recall a few crashes. These were long-standing, known crashes that we had been unable to trace before. Thanks to the scrutiny and the diagnostic tools we’ve had running to single out lag hot spots, we were able to trace the crashes back and fix them. Finally. One of these issues actually took Celebration down back in January. All of this was completely unrelated to the issues Independence was facing, and yet it made our stability look pretty awful.

 

Budda and I were working on solutions in the background, not just for our stability issues but for a number of other problems as well. Hetzner has not been the most reliable host for many here, with network slowdowns and hiccups, even router outages and emergency work that left us helpless while people were unable to play. Working with Rolf, we developed a plan to move from Hetzner onto more stable infrastructure with Amazon Web Services.

 

This move is in its early stages. I am in the process of writing the code for the infrastructure and I am planning on standing up our test instances there first. If all goes well, I can begin writing the live server infrastructure and we can write up a future plan on the move and the required downtime to make this happen. This is what I meant by that extended downtime in the future - instead of patching up old hardware, we will be moving to new instances in a reliable cloud environment. For those concerned, we plan on using the Frankfurt, EU location so the servers won’t be “moving” all that far.

 

I’ve had a lot of experience with AWS and I am very excited for what this means. While we maintain backups right now, all of this will become more secure and easier to manage. We can allocate more resources to a specific server if it becomes needed, or scale back and save money if a server becomes less populated. It means flexibility and stability for Wurm now and into the future, especially with the option to purchase reserved resources.

 

I’m excited and I ask for everyone to have patience while we work through this transition. If anyone is curious, I can detail the infrastructure a bit once I’ve ironed out the details. Until then, happy Wurming!


Awesome news, can't wait for a lag-free Lagadu!


Awesome mate, nice work.


I for one welcome our new Amazon overlords. Thanks.


Keenan, thank you for this very informative post on the situation. It is great to have such a capable and dedicated person as yourself on the Wurm development team to sort out these and other issues with the game and resolve them over time. I hope you will remain with the dev team for many years to come and continue to gain satisfaction from challenging work well done.

 

=Ayes=


Very well explained; I appreciate that, as it gives me some idea, in my limited tech mind, of what has been going on and what to expect. That allows for more understanding and patience, just knowing something. I'm pleased to hear about the move as well, for I've had intermittent route failures connecting to Hetzner more often than I should. I'm glad you have chosen to find better than the level of service they were providing; it was hurting Code Club.


I applaud the transparency in the ongoing server issues. Sincerely, well done and good luck to you all.


Thanks a lot for all of your hard work- you honestly deserve a drop of your favourite tipple and some 'feet-up' time.


Thanks for all the work into this!

The lag has almost injured me a dozen times; it's not fun falling down a couple of 300-slope ledges.

But what if this change doesn't fix the lag issue?


@Keenan does this mean we are upgrading our hamsters to guinea pigs?

 

 

1 hour ago, Keenan said:

move from Hetzner

 

Probably the greatest news for Wurm in the last year.


I'm sure not a lot excites you, Keenan, because in all my years I've barely seen you say it. I think you were most excited about the Epic transfer of characters because of the issues that might happen, though that was probably evil excitement at the thought of us all crying out "where's my account?".

But if this move excites you, I'm excited too and I have zero idea about server hardware. 

 


As many here have said, the informative updates are much appreciated. Thank you! Looking forward to it.


!!Nerd Warning!! The following content may not be suitable for all audiences.

 

So I can share a little of what my first step here is. I'm a very big fan of infrastructure as code. The idea of being able to define whole server farms in code and commit that to a repository... well, it rivals coffee.

 

That's why I'm using CloudFormation and a Python library called troposphere. I've played around with some other solutions in the past, but I've had the most experience with CloudFormation and recently troposphere and Python in general. To those not in the know trying to follow along: CloudFormation is an AWS-provided way of spinning up cloud resources (machines, networks, etc) by providing templates. Troposphere is a way to generate those templates from Python code. This is allowing me to abstract out things like the actual game servers. The reason for this is that it may actually be more cost-effective to have one or more servers share one larger machine instance rather than spin up two smaller ones. It also allows us to more easily move "game servers" around as needed. Since we'll be trying to fine-tune what our exact needs are, the ability for me to just "pick up" Xanadu and plop it on a beefier machine with a few edits and a button push is quite appealing to me.
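For anyone trying to picture what "generate those templates from Python code" means, here's a minimal stdlib-only sketch of the idea troposphere automates: a CloudFormation template is just a JSON document, so it can be assembled as plain Python data and serialized. The resource names and CIDR blocks below are illustrative, not our actual layout.

```python
import json

# Roughly what troposphere automates: build the template as plain Python
# data, then serialize it to the JSON that CloudFormation consumes.
# Resource names and CIDRs here are illustrative only.
def make_template(vpc_cidr: str, subnet_cidr: str) -> dict:
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "TestVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            "TestSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "CidrBlock": subnet_cidr,
                    # {"Ref": ...} is a CloudFormation intrinsic, resolved
                    # at stack-creation time, not by this script.
                    "VpcId": {"Ref": "TestVpc"},
                },
            },
        },
    }

template_json = json.dumps(make_template("192.168.0.0/16", "192.168.56.0/24"), indent=2)
print(template_json)
```

Troposphere wraps this in typed Python classes so you get validation and reuse instead of hand-built dictionaries, which is what makes abstractions like a "game server" object practical.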

 

In the last day or so I've gotten my code and layers to the point where I've successfully spun up a small test cluster. This is exciting!

 

I've got some more work to do on this, but I hope to be working on the provisioning scripts before the weekend is out. I still have some decisions to make on how I'll handle that, but the main goal there is going to be stability and ease of use. If we ever want to do something like Challenge again, or any other limited-time specialty live or test server, I'd rather it be something as easy as a button push to do it.
 

Tl;Dr: I did something cool with clouds and parseltongue.


Woot, AWS to the rescue! I've talked with friends in the past about "what if it was in AWS vs a classic host; imagine a near-lagless Xanadu". Seeing Indy fail and forcing you guys to actually work towards this makes me happy, as it will overall be cheaper for you, give you more freedom and safety against hardware failure, and with the nature of cloud-based things, backup lag might (and I say might) actually become lesser or non-existent.

Posted (edited)

Another clear sign of how "old-school" server hosting should really be frightened by cloud providers offering ever more versatile and easier-to-manage solutions - and as some tech dragons are currently fighting over the prey (us, the users), the costs keep dropping all the time.

If nothing else, two aspects are amazing: setting the power level of the hamsters by simply turning a knob, and automatic geo-redundant backups/mirrors (I don't know the actual AWS offerings, though).

EDIT: I'm really interested in any details.

Edited by Jaz


Thank you for posting this information. I'm sure for many this is what we wanted/needed.

Posted (edited)
7 hours ago, Keenan said:


 

Amazon Web Services has pros and cons: you are giving even more control out of your hands and will come to rely fully on Amazon. Especially when it comes to fast I/O operations, a cloud environment can always become very interesting.

 

Hetzner has been a very reliable host for me personally. In all the years I have been hosting a Wurm Unlimited server there, there were maybe two short downtimes due to router outages; Steam had far more outages in the same time frame, which prevented logins to our server, for example. So I wouldn't expect Amazon to be more reliable than your own physical box at Hetzner. When your VM is moved to another physical storage or physical machine in the background, there will always be bigger lag spikes, and you can NOT control that. Wurm is almost a real-time application, which in my opinion doesn't fit into a cloud very well. But time will tell. I personally had Wurm servers at Hetzner in their VM cloud without too many issues, but that is a smaller and less changeable environment; when our VM was moved, we always experienced about two minutes of non-stop lag, which was out of our control and not communicated to us. Still, Amazon should have mechanisms to prevent a bigger outage.

 

When it comes to deployment, you don't need Amazon for that. You can do a lot with, for example, Jenkins for automated deployment, installing everything with simple bash scripts and a VM used as a template. You could even go further by using an open-source tool for automated VM setup and configuration.

 

Since our big expansion, when the Humble Bundle with Wurm Unlimited came out, we have been running one big machine with a lot of RAM and SSDs, using Windows Hyper-V Server 2016 for all our VMs. Every server has its own VM, and we are running a login server, two big 4k worlds with more than a million items on each, a smaller hunting server, a virtual firewall, a Windows 10 management machine, a gateway server managing the game servers, doing WU updates and handling mods, a test server with two worlds, and a web server serving our live maps. No real lag; it is just really awesome how much power you can get for such a small coin. We needed to get as much performance as cheaply as possible, because someone needs to be able to pay for the party. Luckily our donors are supporting us very well.

 

Still, Amazon is an interesting concept; while I currently see it as a waste of money for smaller companies, other people are big fans. My opinion would be: hire a server administrator who has a good amount of knowledge of different tools and platforms, find the best combination of Linux and Windows, and run an extremely reliable environment at the same cost as now. And compared to the problems that are Wurm-related and hidden deep in the code, hardware reliability is almost perfect in comparison.

 

Edited by Sklo:D

9 minutes ago, Sklo:D said:

 


 

 

I've worked hands-on with AWS for well over a year, with a good chunk of that time spent as part of a two-person team responsible for transitioning a company's entire infrastructure up to AWS with minimal downtime.

 

I've never experienced the things you've mentioned in AWS, but I have been blocked out or lagged out of the Wurm servers on a number of occasions thanks to Hetzner's lousy routing. Time will tell, as you say, but there are plenty of options within AWS. My current full-time job uses AWS exclusively for the project I'm on, which requires high availability and performance. There's a plethora of companies that trust AWS with their livelihood, so I'm not entirely sure how it can be as bad as you say. Perhaps you've not had much experience with them?

 

The deployment of cloud resources will not be slapped together and hacked into a Jenkins build; that's a terrible way to do it. It sounds like you're proposing AWS CLI calls from a central server to spin up resources? No. If you mean deployment of the code, that's not even what I was discussing in the previous post, so you'd likely be mistaken. Code is already built and deployed from Jenkins to the test servers. I don't trust live deployments to Jenkins, but they'll be much simpler as a result of the work I'm doing here.

 

As for "hire a server administrator who has a good bunch of knowledge"... I'll send you my resume if you want. I can't tell if that was an intentional slight or not. ;) Oh and... I'll never, ever run Wurm servers on Windows. All that wasted overhead? Gah, it makes me hurt just thinking about it.


@Sklo:D Cloud environments can be a terrible experience if you try to copy the physical environment there. As far as I'm aware, they require a different baseline of architectural thinking, and then you can end up with a reliable, custom-tailored and not overpriced setup. Sure, never go for the basic setups offered on the first page :D

Basic offers rarely give you proper IOPS for a disk-intensive application, that is granted.

 


I don't know what I like more: this news, or seeing how much fun you're having working on this project.

 

I am really curious how this will continue, so yes, please share all the nerdy details you are allowed to. :D

A guaranteed like from me for those every time.

 

Thanks a lot @Keenan and @Budda and the other devs as well, I suppose... and the guy who writes the text. 😛

4 hours ago, Keenan said:

 


 

While AWS is a new and modern concept with great features, I don't see that many advantages in AWS compared to physical hosting, no matter whether you are using Hetzner, OVH or whatever.

If you work with it, then it is a good thing to use what you know; if you are no longer with Wurm one day, the people following you will also use their own style, and so would I.

 

And of course I am not using Windows as the operating system for the server VMs I run; that would be ridiculous and a waste of money. All servers run on their own Debian 9 VM.

Anyway, a good server administrator should know the advantages of both systems. Windows Hyper-V Server is a very compact solution and doesn't produce real overhead; there is no GUI and only the basic features are pre-installed, which makes it the perfect option for us with only two server administrators. Using it as a hypervisor saves us a lot of work, since we can back up, move and clone VMs and create snapshots without the 100% overhead KVM snapshots have, and we could even add a failover cluster, making it possible to move a VM to another physical machine with the shortest downtime possible if we need that feature one day.

 

AWS is a good way to move, but it will bring up new problems, and it is anything but cheap.

Cloud in general is the future, of course, but only for applications designed for it. If Wurm used microservices, or supported clustering, or just supported multiple instances doing the work of one wurm-server instance - so you could add instances when the server is busy and shut some down when it is quiet - then nobody would disagree that cloud is the way to go.

 

From the current state of research, there are only three cases in which moving to the cloud pays off:

 

  • Demand for a service varies with time
    • e.g. different loads over the course of the day
  • Demand is unknown in advance
    • e.g. you are new on the market and don't know how much performance you will need
  • Batch analytics
    • e.g. 100 AWS instances for one hour cost the same as one instance for 100 hours

 

And Wurm doesn't really fit into any of these categories at the moment, so from the current scientific view it is the wrong workload for a cloud.
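To make the batch-analytics bullet above concrete, it is just instance-hour arithmetic: with pure hourly billing, the total cost depends only on instance-hours consumed, not on how they are split across machines. The hourly rate below is made up for illustration, not a real AWS price.

```python
# With pure hourly billing, cost = instances * hours * rate, so splitting
# the same work across many machines costs the same as one long-running one.
# The rate is an assumed illustration, not an actual AWS price.
HOURLY_RATE = 0.25  # USD per instance-hour (assumed)

def cost(instances: int, hours: float, rate: float = HOURLY_RATE) -> float:
    return instances * hours * rate

wide = cost(instances=100, hours=1)   # 100 machines for one hour
long = cost(instances=1, hours=100)   # one machine for 100 hours
assert wide == long == 25.0
```

The catch for a game server is that it cannot use the "wide" shape: one world must run as one process, so this equivalence mostly benefits parallelizable batch work.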

10 minutes ago, Sklo:D said:

 


 

In my real-world experience, there are more use cases: for example, highly available services (which Wurm is) and data redundancy and security (encrypted volumes and snapshots, both of which we will be using). One thing you're entirely failing to consider is the abstraction of hardware; not having to maintain hardware is a huge benefit. While you are correct that cloud shines more with microservices or scaling resources, that's not the only use case.

 

Anyway, this has been a fun tit-for-tat, but I'll be getting back to things now. To each their own, and I wish you well.

Posted (edited)

AWS has grown a lot beyond just hosts designed to be spun up and replaced quickly and often. Either way, the real victory is getting off Hetzner.

 

us-west-2 us-west-2 us-west-2

Edited by Chakron


!! MORE Nerd Talk !!

 

So I've managed to get a server for our test servers up and running, using a framework I wrote.

 

from troposphere import Template

from clusters.test.Servers import Oracle, Druska, Baphomet
from wurm.server.GameServerInstance import GameServerInstance
from wurm.networking.Network import Network
from wurm.networking.PublicIngressRules import PublicSshIngress, PublicHttpIngress, PublicHttpsIngress

t = Template()
net = Network(t, availability_zone='us-east-1a', title_prefix='TestCluster')
net.add_vpc('192.168.0.0/16')
net.add_subnet('192.168.56.0/24')
net.add_internet_gateway()
net.add_route_table()
net.add_route("0.0.0.0/0")
net.add_security_group(title='DefaultSecurityGroup', description='Default ports',
                       rules=[PublicSshIngress, PublicHttpIngress, PublicHttpsIngress])
instance = GameServerInstance(t, net, availability_zone='us-east-1a')
instance.image_id = 'ami-0f9e7e8867f55fd8e'  # Debian Stretch
instance.instance_type = 't3.medium'
instance.ssh_key_name = 'wantsmore-coffee-us-east-1'
instance.game_servers = [Oracle, Baphomet, Druska]
instance.add_instance()

with open("test-instance.json", "w") as f:
    f.write(t.to_json())

 

It may not look like much, but that's because I've abstracted much of it behind GameServerInstance and Network. There's a bit more to do, such as moving security groups out of Network so that I can spin up the VPC (essentially the entire network) in its own stack. Then the instances themselves will spin up in their third and final stack. The first stack contains the volumes for each server, but that isn't shown here for brevity.

 

This is all I've done in about 36 hours of plugging away. The next step is provisioning - I want to be able to provision new servers automatically. Then I can transfer a snapshot of all test servers over to AWS and see what we've got. The provisioning is really the hard part, as I'll be doing some newer things - such as pulling the server artifacts from Maven instead of manually uploading them or building them before launch. I've been meaning to do this for literally years now, so what better time than now!
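For the curious, pulling an artifact from Maven is mostly URL construction, since Maven repositories use a fixed directory layout. A rough sketch of that layout follows; the repository URL and coordinates are made up for illustration, not our actual ones.

```python
# Maven repositories lay artifacts out predictably:
#   <repo>/<groupId with '.' -> '/'>/<artifactId>/<version>/<artifactId>-<version>.<packaging>
# so a provisioning script can locate a release by its coordinates alone.
# The repository URL and coordinates below are hypothetical.
def maven_artifact_url(repo: str, group_id: str, artifact_id: str,
                       version: str, packaging: str = "jar") -> str:
    group_path = group_id.replace(".", "/")
    return (f"{repo.rstrip('/')}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}.{packaging}")

url = maven_artifact_url("https://repo.example.com/releases",
                         "com.example.wurm", "wurm-server", "1.9.1.5")
print(url)
```

A provisioning script would then just download that URL (for example with urllib) instead of copying a hand-built jar around.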

 

Ideally, this will lead to shorter update downtime and less human error.

 

And before the techno-critics rip this apart: it's not final! This is a rushed test script, but it generates a proper template that I've validated in AWS.

 

Oh and if anyone wants to look at the template file the last run generated, it's here: https://pastebin.com/8QTxnXTH

Don't worry, there's nothing sensitive in it. :)

