Keenan

Devblog: Server Issues Postmortem & Future


1 minute ago, Jaz said:

Well, theories, then planned architecture, and then actual implementation are an organic thing that evolves during the process. Thanks a lot for sharing the milestones! At least these are interesting to me, but I think that's not just me...

 

This thing has more evolutions than an Eevee.

6 hours ago, Keenan said:

Update!

 

I'll give more updates as I make more progress. I know this is dragging on longer than I had hoped, as I made some quick progress out of the gate.

 

Please do this "right" and not "fast".

7 hours ago, zethreal said:

 

Please do this "right" and not "fast".

+100000000000000000000

On 3/17/2019 at 8:06 PM, zethreal said:

 

Please do this "right" and not "fast".

 

That's the intention!

 

Status Update:

Every time I touch this, there seems to be more work to do. That's okay though, I'm plugging along.

 

Docker!

 

Wurm's server build configuration pushes a Docker image of the server to a repository. This means that every build updates the image's "latest" tag and gives us a fallback point should we need it. Think of these as snapshots of the running build. Combined with volume snapshots, this means that if things go horribly wrong in an update, we can very easily set things back to where they were before the update happened. No one likes the word "rollback", but at least it'd be less painful than it currently is if we need it. Lately we've dealt with problems by fixing what went wrong after the fact. The poor GM team has had the burden of that, but I'd much rather lose 15 minutes of progress for the few who have connected than spend two weeks trying to catch everyone affected and fix their issues.
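Roughly, that build step looks something like the sketch below - a minimal Python version that tags each build with a timestamp as well as "latest" before pushing. The registry URL and image name are placeholders, not our actual build configuration.

```python
# Hypothetical build step: build the server image, tag it with a timestamp as a
# fallback point, tag it as "latest", and push both tags. Registry and image
# name are placeholders.
import datetime
import subprocess

REGISTRY = "registry.example.com/wurm"          # placeholder repository
IMAGE = f"{REGISTRY}/wurm-server"

def build_and_push() -> None:
    build_tag = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    subprocess.run(["docker", "build", "-t", f"{IMAGE}:{build_tag}", "."], check=True)
    subprocess.run(["docker", "tag", f"{IMAGE}:{build_tag}", f"{IMAGE}:latest"], check=True)
    for tag in (build_tag, "latest"):
        subprocess.run(["docker", "push", f"{IMAGE}:{tag}"], check=True)

if __name__ == "__main__":
    build_and_push()
```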

 

More Docker!

I've successfully run our Oracle test server in a Docker instance. It's using a Docker instance for MySQL as well as an EBS volume for the map and logs. This is precisely what I've been after, but there's still some manual configuration that I need to script so that this happens automagically when a "stack" is started. I want to do as little manually as possible, as human error seeps in. Plus I'm lazy, okay? Gosh. No, seriously - it's about human error.
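For a sense of what that scripting might look like, here's a rough sketch using the Docker SDK for Python to bring up the two containers against an EBS-backed mount. The image names, mount paths, port, and credentials are all placeholders rather than our real configuration.

```python
# Hypothetical "stack" startup: a MySQL container and the game server container,
# both keeping their data on an EBS-backed mount. Images, paths, port, and
# credentials are placeholders.
import docker

client = docker.from_env()

def start_stack(ebs_mount: str = "/mnt/wurm-data") -> None:
    # MySQL in its own container, data directory on the EBS volume.
    client.containers.run(
        "mysql:5.7",
        name="wurm-mysql",
        detach=True,
        environment={"MYSQL_ROOT_PASSWORD": "changeme"},   # placeholder secret
        volumes={f"{ebs_mount}/mysql": {"bind": "/var/lib/mysql", "mode": "rw"}},
    )
    # The game server container, with map and logs on the same EBS volume.
    client.containers.run(
        "registry.example.com/wurm/wurm-server:latest",    # placeholder image
        name="wurm-oracle",
        detach=True,
        ports={"3724/tcp": 3724},                          # placeholder port
        volumes={
            f"{ebs_mount}/oracle/map": {"bind": "/wurm/map", "mode": "rw"},
            f"{ebs_mount}/oracle/logs": {"bind": "/wurm/logs", "mode": "rw"},
        },
    )

if __name__ == "__main__":
    start_stack()
```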

 

Downsides

So far there are some downsides that I need to mitigate. For one, I want the servers to auto-recover. What happened to Indy yesterday really shouldn't happen in this new environment, so I need to solve that problem. At the very least, I want to make it so there are a number of people who can press a button to recover a server. We obviously want direct access to be restricted and not needed for basic things. I was hoping to use an Auto Scaling Group for this, but I had forgotten how restrictive those are when it comes to configuration, and I'd much prefer the network configuration I have over a mechanism that may not even work well for us. The idea of it killing a server because of a failed health check makes me worry, so that idea is dead.

 

Another downside is that I'm using static private IPs. It was a way to make things work, but I really want to make them dynamic. The reason is that it would let me do stack updates instead of delete-and-recreate, and the latter takes considerably more time. I want to minimize downtime for things like OS updates and such.

 

Finally, there's the point Sklo has brought up a number of times: we need to slam the I/O and see what happens. Samool has suggested that we basically work with thousands of items at a time. Given that item updates are one of the most costly things in Wurm, I think that might work for the database. We'll also need to test map updates, so perhaps we can find a way to get a good number of you on this server once it's up and start digging holes! I'm not sure what we can do to reward such testers, but I'll bring it up with Retrograde. I know I'd prefer a good hundred people or so, alts or otherwise - just enough to give a good live-server-ish test.
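To make the batching idea concrete, here's a tiny benchmark sketch comparing per-item commits against one batched transaction. It uses SQLite purely to keep it self-contained and the schema is made up, so treat it as an illustration of the approach rather than our actual code.

```python
# Hypothetical micro-benchmark: update a few thousand "items" with one commit
# per row versus one commit for the whole batch. SQLite is used only to keep
# this self-contained; the schema is made up.
import os
import sqlite3
import tempfile
import time

def benchmark(batch: bool, n: int = 2000) -> float:
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, damage REAL)")
        conn.executemany("INSERT INTO items VALUES (?, 0.0)", ((i,) for i in range(n)))
        conn.commit()

        updates = [(float(i % 100), i) for i in range(n)]
        start = time.perf_counter()
        if batch:
            # One transaction for the whole batch - one sync to disk.
            conn.executemany("UPDATE items SET damage = ? WHERE id = ?", updates)
            conn.commit()
        else:
            # Commit (and sync) after every single item update.
            for damage, item_id in updates:
                conn.execute("UPDATE items SET damage = ? WHERE id = ?", (damage, item_id))
                conn.commit()
        elapsed = time.perf_counter() - start
        conn.close()
        return elapsed
    finally:
        os.remove(path)

print(f"per-item commits: {benchmark(batch=False):.3f}s")
print(f"one batched commit: {benchmark(batch=True):.3f}s")
```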

 

Going Forward

The plan now is to finish converting the manual configuration to automatic and then move the test servers over fully. I also need to get the logs into CloudWatch so we can set up proper alarms when things go wrong and give high-level staff members access to look through them when needed. I'd also like to get some monitoring going in CloudWatch, so we can tell when an instance is over-burdened and may need to be bumped up. There's also the moving of our build server, which I've not even started yet - though that can wait until after everything else is moved. Finally, there are the special cases around Golden Valley - including the shop.
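For the monitoring side, this is the kind of alarm I have in mind - a minimal boto3 sketch that flags an instance whose CPU stays high and notifies via SNS. The instance ID, thresholds, and topic ARN below are placeholders.

```python
# Hypothetical CloudWatch alarm: flag an instance whose average CPU stays above
# 80% for 15 minutes and notify an SNS topic. Instance ID, thresholds, and the
# topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

def create_cpu_alarm(instance_id: str, sns_topic_arn: str) -> None:
    cloudwatch.put_metric_alarm(
        AlarmName=f"wurm-{instance_id}-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=300,              # five-minute samples
        EvaluationPeriods=3,     # three periods in a row = 15 minutes
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],   # e.g. a topic staff are subscribed to
    )

# create_cpu_alarm("i-0123456789abcdef0", "arn:aws:sns:eu-west-1:123456789012:wurm-alerts")
```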

 

That's all for now.


I feel I should add a little about the I/O solutions here.

 

We're fully willing to make changes to the Wurm server to compensate for I/O issues. I'm less worried about the database and more worried about map saves as I mentioned.

 

If I find that map saves are a problem and we can't work around it with code, then another option is to use either a provisioned drive or an instance with an attached NVMe.

 

The latter is literally an SSD attached to the instance, and the I/O speeds are on par with direct hardware. The main issue with an attached NVMe is that it's ephemeral, so I'd have to copy the map to it before start and schedule regular copies to the EBS volume to ensure it's constantly backed up. Another issue is that the cost per instance goes up with that option, and I am being cost-aware here. We're willing to pay for the benefits, but the more frugal I am the better. Obviously!
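That copy scheme would look roughly like the sketch below: restore from the EBS copy before the server starts, then periodically sync the NVMe map back to EBS. The paths and interval are placeholders, and the real thing would more likely be a systemd timer or cron job than a Python loop.

```python
# Hypothetical backup loop for the ephemeral-NVMe option: restore the map from
# the EBS copy before the server starts, then sync it back on a schedule.
# Paths and interval are placeholders.
import subprocess
import time

NVME_MAP_DIR = "/mnt/nvme/map/"          # fast, but lost when the instance stops
EBS_BACKUP_DIR = "/mnt/ebs/map-backup/"  # durable copy
INTERVAL_SECONDS = 15 * 60

def restore_from_ebs() -> None:
    # Run once at startup, before the server touches the map.
    subprocess.run(["rsync", "-a", "--delete", EBS_BACKUP_DIR, NVME_MAP_DIR], check=True)

def backup_loop() -> None:
    while True:
        time.sleep(INTERVAL_SECONDS)
        subprocess.run(["rsync", "-a", "--delete", NVME_MAP_DIR, EBS_BACKUP_DIR], check=True)

if __name__ == "__main__":
    restore_from_ebs()
    backup_loop()
```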

 

The provisioned drive also costs more; basically, it allows for faster I/O speeds at a price. That will be a bit more complicated, as I'd have to do some benchmarks to see where our IOPS need to be. You essentially say "I need this much I/O per second" and pay for it. If you don't use it, you still pay for it.
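In code, the provisioned-IOPS option is just a volume request with an IOPS figure attached - something like this boto3 sketch, where the zone, size, and IOPS number are placeholders until the benchmarks say otherwise.

```python
# Hypothetical provisioned-IOPS request with boto3: you state the IOPS you need
# up front and pay for them whether or not you use them. Zone, size, and the
# IOPS figure are placeholders.
import boto3

ec2 = boto3.client("ec2")

volume = ec2.create_volume(
    AvailabilityZone="eu-west-1a",   # placeholder zone
    Size=200,                        # GiB, placeholder
    VolumeType="io1",                # provisioned-IOPS SSD
    Iops=3000,                       # the number the benchmarks would determine
)
print(volume["VolumeId"])
```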


Definitely look into i3's if you're concerned about IO.

3 hours ago, Chakron said:

Definitely look into i3's if you're concerned about IO.

 

Will do!

 

Update, Pt 2

So I couldn't put this down today at all. It's been about 13 hours total with an hour break for dinner. One of those days.

 

And yet as I type this, I have all three test servers running in a sandbox and talking to each other. Samool was kind enough to do a test connection, and it worked. This doesn't mean it's ready for you folks yet! I still need to move them to their final home. I managed to automate the database updates for IPs and ports, and I've got a path forward on auto-restarting the server for updates and recovery. A simple daemon will suffice. No, not demon. Hamsters are enough trouble.

 

A daemon is basically something that runs in the background. In this case, the daemon will watch to make sure all Wurm Docker instances are operational. Since the Docker instance goes away upon termination, if a server crashes the daemon will know. With a proper configuration, it'll know which server went down and can start it back up. At the same time, I can tell it to pull down the latest image - and I plan on using one repository for test and one for live, so that there's never a chance of test code getting onto live accidentally. Live will be push-button, whereas test will be a continuous deployment pipeline up until the act of shutting down the servers. That'll still be manual on both sides.
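A stripped-down sketch of that watcher, using the Docker SDK for Python: poll the expected containers, and if one has gone away, pull the latest image from its repository and start it again. The container names, registry URLs, and the bare run() call are placeholders - a real version would wire in ports, volumes, and environment.

```python
# Hypothetical watcher daemon: poll the expected Wurm containers, and if one has
# gone away, pull the latest image from its repository and start it again.
# Names and registry URLs are placeholders.
import time

import docker
from docker.errors import NotFound

client = docker.from_env()

# Which repository each server runs from (separate repos for test and live).
SERVERS = {
    "wurm-test-oracle": "registry.example.com/wurm-test/wurm-server:latest",
    "wurm-test-two": "registry.example.com/wurm-test/wurm-server:latest",     # placeholder
    "wurm-test-three": "registry.example.com/wurm-test/wurm-server:latest",   # placeholder
}

def ensure_running(name: str, image: str) -> None:
    try:
        container = client.containers.get(name)
        if container.status == "running":
            return
        container.remove(force=True)       # clear out the dead container
    except NotFound:
        pass                               # container is gone entirely - treat as crashed
    repo, tag = image.rsplit(":", 1)
    client.images.pull(repo, tag=tag)      # always start from the latest image
    client.containers.run(image, name=name, detach=True)

if __name__ == "__main__":
    while True:
        for name, image in SERVERS.items():
            ensure_running(name, image)
        time.sleep(30)
```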

 

I've decided that keeping static IPs is actually for the best, yet I've written everything with the possibility of using ports instead. What that means is that if more than one game server shares a host for cost efficiency, they either need separate IPs or separate ports. I prefer the IP method, but ports are an option as well. The reason I'm okay with this is the EBS volume that stores all server data. It can't be attached during an update either, so if the instances need to be replaced I'll have to delete the stack and recreate it anyway. The stack will likely take 15-20 minutes to create, so honestly that's not a huge amount of downtime. If I'm doing something that requires it, then we'll just do an "extended 1-hour downtime".

 

Finally, now that I'm this far, we can soon start testing for I/O and I can start building the live cluster profile. I'd still like to devise a way to auto-import the live data, but if the choice is between spending hours doing it manually once or spending days getting an auto-import right? We'll do the hours. I don't want this held up on some fancy thing that I'll probably use once.

 

Once this is all done, I'll be turning my gaze to the shop, GV, and our build infrastructure. After that will be the forums, WO and WU web, and finally Wurmpedia. For that last one, I would like to take the time to address some requests that @sEeDliNgS has made, so it may take some time. I'd also like a sandbox for her to play in, since who doesn't like sandboxes?

22 hours ago, Keenan said:

copy the map to it before start and schedule regular copies to the EBS volume to ensure it's constantly backed up

 

And at that point you might as well just put it on a ramdrive, which would be cheaper, faster, and not any less durable. For Xanadu that's something like 1.28 GB of RAM (plus overhead), with the other servers way smaller.


Hi all.

 

It's been a while since my last update here. I'm going to start by taking this quote from another thread:

 

3 hours ago, Sklo:D said:

I think we currently see a lot of unneeded changes and additions, like moving Wurm into the Amazon cloud. A cloud can be a great thing for modern software, but when it comes to Wurm there is no cloud optimisation, so the advantages are really small compared to the additional costs. Plus this has been an ongoing project for more than 3 months now and it has become quite silent, which shows that nobody expected so much work coming up with that, which furthermore shows that the planning is often not done right and wrong design decisions are made. Cloud is a cool thing for sure, but absolutely not something Wurm would benefit from in a way which would cover the work and costs that go into it.

 

Let me break this apart into the main bits:

 

No cloud optimization for Wurm

This is true in general. Wurm cannot scale in the sense that we can't spin up, say, two or three instances to help Xanadu cope with lag. Yet this is not the only thing the cloud has to offer. I'm mainly looking for the stability that comes with hosting on AWS from a network perspective, as well as the ability to build in even more safeguards against data loss. It also means faster recovery in the event of a server outage. Hetzner doesn't give us a whole lot of options in that regard, and the network has been abysmal for quite some time.

 

The Costs

The costs are a huge burden on me, as a matter of fact. One thing I did when talking to Rolf about this was show that, if done right, we can meet or beat the current hosting costs. This is primarily why it has been taking me so long - I'm trying to do more heavy lifting with code than infrastructure. It would be easier for me to shove things into the more expensive offerings, but it would be bad for Wurm as a whole to incur such a high cost in comparison. We are also looking into the three-year pricing to save even more money and to ensure that Wurm will be around for quite some time.

 

The Silence

It hasn't quite been three months, but I can explain the silence. First off, part of that has been me taking a leave. I got things to a point where I could start standing up a test cluster, and we've had a successful connection test with my infrastructure in place. Since then my focus has been on Wurm Unlimited as well as some personal things that came up. I hate to use this as a shield, but keep in mind that I work on Wurm in addition to a full-time job. I needed some time off from everything so I could come back to this fresh. My day job has had me working deeply with AWS as well, and I've actually learned some new tricks that might help me with Wurm's infrastructure. I need to spend a little time with that and see if it'll be a better way than how I've been doing it.

 

Going Forward

My time over the next two weeks will be scarce, but starting in mid-May, I plan on diving back into this in all my off-time again. My current road map for this looks like:

  1. Server auto-deployment code. (We currently do a manual deploy to Test for the server.)
  2. Data backup and restore from S3. (This will allow me to clone a server during downtime - see the sketch after this list.)
  3. Using the above, stand up clones of all three test servers in the new Wurm AWS account.
  4. Deploy a special test client that connects to them.
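For item 2, a rough sketch of what the backup and restore could look like with boto3. The bucket name and directory layout are placeholders, and a real version would also want compression and integrity checks.

```python
# Hypothetical S3 backup/restore for a server's data directory. Bucket name and
# layout are placeholders.
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "wurm-server-backups"   # placeholder bucket

def backup(server: str, data_dir: str) -> None:
    # Upload every file under the server's data directory, keyed by server name.
    for root, _dirs, files in os.walk(data_dir):
        for filename in files:
            path = os.path.join(root, filename)
            key = f"{server}/{os.path.relpath(path, data_dir)}"
            s3.upload_file(path, BUCKET, key)

def restore(server: str, data_dir: str) -> None:
    # Pull everything back down, recreating the directory layout.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{server}/"):
        for obj in page.get("Contents", []):
            rel = obj["Key"][len(server) + 1:]
            dest = os.path.join(data_dir, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)

# backup("oracle", "/mnt/wurm-data/oracle")
# restore("oracle-clone", "/mnt/wurm-data/oracle-clone")
```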

After this, there will be a period of observation and hopefully some stress testing from all of you. I'll work with Retrograde and Budda on some kind of event with rewards, but the main problem is that it may require several attempts. I'll mainly need to push the server to its limits and see where any bottlenecks are.

 

I welcome questions, comments, and criticism equally.


Just to tangent to a more light-hearted matter: will we get scheduled auto-updated maps for non-PvP servers, like what is possible on WU? This would mean new deeds/guard towers/highways etc. would update automatically so nobody has to maintain them. Or is that likely to put too much load on the precious resources?

 

Thanks for all your hard work by the way, it is much appreciated!!

5 hours ago, solmark said:

Just to tangent to a more light-hearted matter: will we get scheduled auto-updated maps for non-PvP servers, like what is possible on WU? This would mean new deeds/guard towers/highways etc. would update automatically so nobody has to maintain them. Or is that likely to put too much load on the precious resources?

 

Thanks for all your hard work by the way, it is much appreciated!!

 

This is one of my pet projects that I do hope to complete. Since I have to do it all manually right now, I'll more than likely automate the system at some point. I'm not sure if it'll be a live map, as we do support events like mazes and such, but something that lets us put out a more scheduled dump of the maps would be amazing.


How are things going with the AWS stuff?

20 hours ago, solmark said:

How are things going with the AWS stuff?

 

As I already mentioned in some other posts, there are a lot of things which work differently in the cloud. We are speaking of Wurm, a non-cloud-optimized application, which means even the smallest lag or downtime will put a world completely on hold. My favorite example of cloud-optimized software is Netflix: if there is lag or downtime in the cloud, there are many other services available which can cover for the frozen service, so in the worst case the user does one reload or waits a few seconds and everything is fine again, since they got the requested data from a different service in the cloud. Wurm is one big service per world, so there is 100% downtime for a world if something happens. The cloud is not designed for one big application; the main concept behind clouds is microservices, which are redundant and not error-prone if one of those microservices becomes unavailable all of a sudden. And a very big thing is I/O speed: it can be really bad in the cloud, which doesn't mean much for Netflix since another service can take over, but for Wurm it means a big lag spike - since Wurm writes I/O in the same thread where it processes user input and everything else, the server will freeze. Even if the AWS project were already finished, there would be months of significant testing needed before you could be sure that Wurm has a future in the cloud.

 

It is better that they take their time than rush things; I would be very careful about putting big single-threaded applications in the cloud.

I also wouldn't be surprised if we never see Wurm on AWS in the end.

Edited by Sklo:D

Wurm's server is pretty monolithic by nature. I don't think it would be very practical to try and break it up for cloud services.


I'm guessing the first step is getting it working on the cloud at all. The next step is probably making it work better by splitting stuff into a more cloud-optimised form. Clearly the current hosting isn't working, so something needs to be done; I'm just glad someone on the dev team clearly has a plan and is working towards getting it working. And the dev diaries are interesting, too.

3 hours ago, Wonka said:

I'm guessing the first step is getting it working on the cloud at all. The next step is probably making it work better by splitting stuff into a more cloud-optimised form. Clearly the current hosting isn't working, so something needs to be done; I'm just glad someone on the dev team clearly has a plan and is working towards getting it working. And the dev diaries are interesting, too.

 

The problem is that they currently seem to be using the wrong, outdated hardware at Hetzner.

I mean, more than 5 years after the last server move, the hardware of most worlds is old. So when hardware hasn't been changed in the last 5 years, it doesn't surprise me that there are hardware outages nowadays. The next thing is that they most likely are not using the Dell Enterprise line, which would be the only suitable line for business products. Sklotopolis, for example, is hosted at Hetzner on the cheaper lines for monetary reasons, but even on the lower lines we have never had any bigger issues. There have been smaller network downtimes from time to time, but those have been fixed within less than an hour. In general we measure the uptime and we are above 99.9% (8 hours of downtime per year), which is the minimum percentage we pay for.

I don't see any issues with Hetzner. You can't expect that after the cloud move the uptime of Wurm will be 100%; the biggest downtimes over the year are still the restarts, updates, and bug hotfixes... The cloud will not fix that. Also, the routing issues can be local issues on the players' side, which means they can happen on the cloud too. And in general the cloud is a lot slower for applications like Wurm; there is a lot of work to invest in the cloud move before Wurm can run in the cloud, while I am sure there is a way to put Wurm into the cloud.

On 3/2/2019 at 7:39 PM, Keenan said:

 

 

Security really isn't hard. It's when people "cheat" and think it'll be "okay" that things fall down.

 

 

Said from a network engineer, not a CISSP. Security IS hard, as it should be. It HAS to be hard to keep out those who will go to all lengths to penetrate, corrupt, or steal information.

 

*said from someone in the information security industry for 15 years*

