Keenan

AWS Status Update - The Rise of Jackal


Hello all!

 

This morning we brought Jackal into AWS. Everything went smoothly, and we're now monitoring the server for any indication of bumps in the road ahead. \o/

 

You may be wondering, "But isn't Jackal over?" - and yes, it is. We need to keep the server hot for Jackal transfers right now, and since you were all so busy over there creating items and building things, there's plenty of data to poll to help us see where any bottlenecks may be. The real test will be Celebration, but this gives us insight into what needs to be changed or optimized before we go that far.

 

Right now the tentative uplift schedule for Celebration is early to mid March, but this will depend on our findings over the next few weeks. We'll give a firmer date once we have one. We are not looking to rush this!


[21:35:33] 59 other players are online. You are on Celebration (533 totally in Wurm).

Cele will need a boost soon if it keeps getting more and more crowded. Nowadays I cannot get hit by a troll without waiting out a serious queue.

Also found that some of the late immigrants brought a few of their most adored  lag   spi ke   s      with them from Xana.

Edited by Jaz
3 hours ago, Jaz said:

[21:35:33] 59 other players are online. You are on Celebration (533 totally in Wurm).

Cele will need a boost soon if it keeps getting more and more crowded. Nowadays I cannot get hit by a troll without waiting out a serious queue.

Also found that some of the late immigrants brought a few of their most adored  lag   spi ke   s      with them from Xana.

The Tears are real ...


Yay, can't wait for Cele to go live on AWS! Maybe we can reopen Jackal on AWS and let it flood with people foraging for pelts again :)

 

edit: just a way for us to test out AWS on Jackal :)

Edited by Evilreaper


Awesome work there Keenan! Can't wait for the moment you guys take Hetzner out back and shoot it. Who knows, maybe we will even see a lag free Xanadu!

10 hours ago, Angelklaine said:

Awesome work there Keenan! Can't wait for the moment you guys take Hetzner out back and shoot it. Who knows, maybe we will even see a lag free Xanadu!

 

AWS does not give latency or IOPS guarantees unless you pay big bucks. Its benefit is easier right-sizing of the instance when you aren't running cloud-native services, i.e. allocating fewer resources. If you need more, going to AWS makes no sense because it gets expensive real fast.

 

Some expectation management might be in order…

5 hours ago, Batolemaeus said:

 

AWS does not give latency or IOPS guarantees unless you pay big bucks. Its benefit is easier right-sizing of the instance when you aren't running cloud-native services, i.e. allocating fewer resources. If you need more, going to AWS makes no sense because it gets expensive real fast.

 

Some expectation management might be in order…

 

Exactly what I think.

 

I've warned about that in many, many threads. AWS will not bring the performance increase everybody is hoping for.

I've worked with AWS and Google Cloud services, and performance is not that great compared to bare-metal servers.

Edited by Sklo:D

5 hours ago, Batolemaeus said:

 

AWS does not give latency or IOPS guarantees unless you pay big bucks. Its benefit is easier right-sizing of the instance when you aren't running cloud-native services, i.e. allocating fewer resources. If you need more, going to AWS makes no sense because it gets expensive real fast.

 

Some expectation management might be in order…

Can I get this in English now?


Performance on AWS is lower than on your own hardware while being more expensive. It's used primarily to shift capex to opex and for autoscaling web ###### that doesn't need performance guarantees.


The goal of the shift isn't performance based; it's downtime and network based. We're doing a lot of optimisation work to ensure that server performance isn't impacted, but the bigger benefits come from moving away from Hetzner's ongoing network issues and from server redundancy - the ability to have minimal downtime if anything goes wrong.

49 minutes ago, Retrograde said:

The goal of the shift isn't performance based; it's downtime and network based. We're doing a lot of optimisation work to ensure that server performance isn't impacted, but the bigger benefits come from moving away from Hetzner's ongoing network issues and from server redundancy - the ability to have minimal downtime if anything goes wrong.

I agree with this completely. With Xanadu at its peak populations, performance was an issue. I don't see AWS's moderate performance specs being an issue at all for the rest of our servers. The primary complaints over the years (outside of the Xan stuff) have been about servers dropping due to poor hosting providers - at least from what I can recall.

 

AWS has some pretty solid uptime numbers. A ton of my large enterprise clients have made the move from their own on-prem servers to AWS and were very happy with it. They might be paying for premium SLAs, but Amazon rarely let them down.

12 hours ago, Batolemaeus said:

 

AWS does not give latency or IOPS guarantees unless you pay big bucks. Its benefit is easier right-sizing of the instance when you aren't running cloud-native services, i.e. allocating fewer resources. If you need more, going to AWS makes no sense because it gets expensive real fast.

 

Some expectation management might be in order…

 

I've never given any indication that we'll see huge performance increases on the server side. In fact I've been careful to temper things - we're very cautiously looking at the I/O latency. I'm very much aware of the limitations of AWS. Samool and I agree that issues we find should likely be code-fixed though, with hardware being a second option if a code fix isn't possible. We've already made a number of optimizations to the Wurm server in the lead-up to this.

 

The main draws for AWS:

1) Scaling - while we cannot scale the Wurm server itself (i.e. add more instances to Xanadu to make it lag-free), we *will* be able to scale the instances to match demand (i.e. scale up a low-pop server if needed) - there's a rough sketch of what that looks like after this list.

2) Infrastructure as Code - this is huge for us. Especially with Steam WO on the horizon - the ability to stand up a new server quickly will be a huge plus.

3) Disaster Recovery & Prevention - Another huge thing for us. I've practiced on our test servers and I was able to take a test server down completely - kill the entire stack - and revive it from backup (which is stored in S3) in about 30 minutes. Right now a major disaster would result in quite the extended downtime to bring everything back. We do have backups and they are stored off-site, but configuring a box from scratch and restoring will take hours for each server. Jackal was stood up in half an hour.

4) Network - Hetzner has had a number of issues over the years with routing and stability. Just recently we were offline because a router close to our rack went offline and it took them hours to fix the issue. This isn't the first time.

5) Maintenance - Right now we're looking at some serious maintenance on the Hetzner servers. If we don't move to AWS then I'll have to begin scheduling multi-hour downtimes per server to start replacing hard drives. The process is tedious - a support ticket placed to replace one drive, then a RAID rebuild, then replacing the other drive and rebuilding the RAID again (Edit - Not to mention the chance the rebuild will fail and we'll have to resort to a complete reconfigure and restore.). With AWS, any kind of maintenance required will take much less time due to our infrastructure being what it is. Again - Jackal was 30 minutes. I expect Xanadu to take me up to an hour to uplift, maybe longer. That's also how long it'd take to tear the server down to nothing and stand it back up again - completely replacing the hardware underneath it at the same time.
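
For anyone curious what point 1 looks like in practice, here's a minimal sketch of resizing an instance with boto3 - the region, instance ID, and target size below are made-up placeholders, not our actual tooling or setup:

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID
NEW_TYPE = "m5.xlarge"               # hypothetical target size

# An EC2 instance has to be stopped before its type can be changed.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Swap the instance type, then bring the server back up.
ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, InstanceType={"Value": NEW_TYPE})
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

That stop/resize/start cycle is the whole hardware swap - no support tickets or RAID rebuilds involved.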

 

AWS is not going to make Xanadu's lag disappear, since that lag is caused by code issues. You cannot throw hardware at code issues and hope they go away. We need to do a better job at managing the main thread of the server, and that's where our focus has been. As for the costs - the owners are aware of the increase and still support the move.

9 hours ago, Keenan said:

 

I've never given any indication that we'll see huge performance increases on the server side. In fact I've been careful to temper things - we're very cautiously looking at the I/O latency. I'm very much aware of the limitations of AWS.

 

It would probably be worth reimplementing <blink /> tags in the forum and perhaps bumping that to 64p size with a flaming effect. Every time the topic has come up in kchat or on this forum, people have taken the AWS move as a panacea for all of Wurm's problems rather than infra modernisation.

 

I feel for you on the RAID, by the way. We've managed to finally excise all hardware RAID from our infra entirely, and it has been liberating.

 

9 hours ago, Keenan said:

configuring a box from scratch and restoring will take hours for each server.

 

Is that comparing manual setup with automated? We got similar gains when we moved to preseed+Ansible, except we are mostly physical. Most of the time is actually wasted because server manufacturers love twiddling their thumbs in firmware instead of booting, while the actual configuration work is done in a minute or two. (And restoring from backup would take days to weeks because, cloud or no, you aren't restoring 0.5PB over lunch.)


Do agree that AWS is coming across as a panacea - you’re clearly choosing them for the right reasons, but most people won’t see it that way. Might be better to refer to this as just “server migration”.

1 hour ago, Chakron said:

Do agree that AWS is coming across as a panacea - you’re clearly choosing them for the right reasons, but most people won’t see it that way. Might be better to refer to this as just “server migration”.

 

That's what it is for sure. I've just heard the term "uplift" used when referring to cloud migrations.

 

I really do want to temper expectations though. I've been quite cheerful in my posts because I've been working on this for a while and it's great seeing it work and work so well. Yet the reality is that I have anxiety about server performance and that the first servers will have growing pains while we optimize the instance sizes and code to match expected performance. People will no doubt be keen to notice even the smallest delay in a right-click menu and possibly assume this is all for the worst. That's what I'm most afraid of.

 

So yeah, for those following along - just because it works well in testing right now doesn't mean that your server will uplift painlessly. We'll need patience and we'll handle issues as they arise. What we really need is faith in us handling the issues that do arise as quickly as possible.

30 minutes ago, Keenan said:

So yeah, for those following along - just because it works well in testing right now doesn't mean that your server will uplift painlessly. We'll need patience and we'll handle issues as they arise. What we really need is faith in us handling the issues that do arise as quickly as possible.

I'm ready for the slower upkeep burn on Cele - just like it is on Xana :P

Plus worst case I'm ready to eat through any sleep bonus ...

I'm not so very concerned about the facelift uplift. Looking at how few server resources are needed to run a WU server, I'm quite sure a reliable and redundant network layer will count for more in the experience of players around various parts of the world (I'm one of the lucky ones with a mostly stable and fast connection to the current hosting).

21 hours ago, Keenan said:

What we really need is faith in us handling the issues that do arise as quickly as possible.

 

Sir, I only have 30 faith. But you can have all of it. 
On 2/19/2020 at 1:03 PM, Keenan said:

 

That's what it is for sure. I've just heard the term "uplift" used when referring to cloud migrations.

 

I really do want to temper expectations though. I've been quite cheerful in my posts because I've been working on this for a while and it's great seeing it work and work so well. Yet the reality is that I have anxiety about server performance and that the first servers will have growing pains while we optimize the instance sizes and code to match expected performance. People will no doubt be keen to notice even the smallest delay in a right-click menu and possibly assume this is all for the worst. That's what I'm most afraid of.

 

So yeah, for those following along - just because it works well in testing right now doesn't mean that your server will uplift painlessly. We'll need patience and we'll handle issues as they arise. What we really need is faith in us handling the issues that do arise as quickly as possible.

My experience with cloud migrations has been the pain of not having a dedicated backend network for cluster chatter. This of course gets exponentially worse the more nodes there are in a cluster, even on bare metal with a dedicated fiber switch. In AWS, where you don't have any control over Layer 2, this can get super painful. Especially since AWS has hidden service limits on throughput that you only find out about once you hit them (although I truly can't see that much traffic getting pumped inside the Freedom cluster any time soon anyway).

Faith is one thing I definitely have in you, Keenan. I can't say I fully extend it to every aspect of the game, code, team, or migration, but in my esteemed opinion you are top notch and I'd pick you for my e-kickball team any day. (I'd have a lot more patience if I had an in-game wiki to pacify me though... just saying.)


Quick update as folks have been asking.

 

So we had a rather rocky start to Jackal being in AWS.

  • We picked too small of an instance size to start with.
  • Samool made some server-side optimizations to help with a bit of lag and resource usage.
  • There were configurations required to run a production-level Wurm server that did not show up in testing.

This led to crashes, lag, and more crashes. Jackal couldn't remain up for more than a day without issues.

 

Fixes for all of this went out earlier this month, but then we had a rather rocky update last week which resulted in several restarts - not enough time to measure the performance and stability of Jackal to any reasonable degree. Our next patch is scheduled for this coming Thursday, which will mean over a week of uptime for all servers - including Jackal. That will help us determine whether we're stable enough to bring Celebration into AWS. If we're not, then we'll consider other options for the server, such as relocating it on the Hetzner side of things to see if it's a hardware issue.
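
To give a rough idea of the kind of check involved, a sketch like the following pulls a week of CPU numbers from CloudWatch for one instance with boto3 (the instance ID and region are placeholders; this isn't our actual monitoring):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")  # placeholder region
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

now = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - datetime.timedelta(days=7),
    EndTime=now,
    Period=3600,                        # one-hour buckets
    Statistics=["Average", "Maximum"],
)

# Print hourly average and peak CPU over the last week, oldest first.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))

If those numbers sit comfortably within the instance's capacity over a full week of uptime, that's a good sign for Celebration.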

 

On 3/21/2020 at 11:15 AM, Keenan said:

Quick update as folks have been asking.

 

So we had a rather rocky start to Jackal being in AWS.

  • We picked too small of an instance size to start with.
  • Samool made some server-side optimizations to help with a bit of lag and resource usage.
  • There were configurations required to run a production-level Wurm server that did not show up in testing.

This led to crashes, lag, and more crashes. Jackal couldn't remain up for more than a day without issues.

 

Fixes for all of this went out earlier this month, but then we had a rather rocky update last week which resulted in several restarts - not enough time to measure the performance and stability of Jackal to any reasonable degree. Our next patch is scheduled for this coming Thursday, which will mean over a week of uptime for all servers - including Jackal. That will help us determine whether we're stable enough to bring Celebration into AWS. If we're not, then we'll consider other options for the server, such as relocating it on the Hetzner side of things to see if it's a hardware issue.

 

How did the test to see if it's stable enough go? We're probably just as curious as you guys about that result.

