Keenan

Celebrate the migration of Celebration to AWS on April 1, 2020!


15 minutes ago, Keenan said:

Servers are online now. Let's see how we do!

 

Server time doesn't seem to be accurate yet. Current server time is 13:36 while it should be 15:36 (Swedish time).


After a few minutes of constant right-clicking ... the lag still seems to be there.

EDIT: but it's less frequent, and I can simply walk through gates and doors now.

Edited by Jaz

2 minutes ago, Jaz said:

After a few minutes of constant right-clicking ... the lag still seems to be there.

 

I didn't want to say that without logging in, but according to the MRTG graphs the lag is at least the same, if not worse.

Edited by Sklo:D

1 hour ago, Sklo:D said:

 

I didn't want to say that without logging in, but according to the MRTG graphs the lag is at least the same, if not worse.

 

The big spike in lag at the start of the new MRTG is from the old shutdown data. That's what the lag was at before AWS. I forgot to purge the old MRTG data from the wurm folder when I copied everything over.

 

Edit: I removed the stale data from the new graphs: https://celebration.live.wurmonline.io/mrtg/lag.html

12 minutes ago, Jaz said:

After a few minutes of constant right-clicking ... the lag still seems to be there.

EDIT: but it's less frequent, and I can simply walk through gates and doors now.

 

Right-clicking is a hard way to judge. I felt some delay too (east coast US), though Samool, who is closer to the servers, felt it was instant.

The doors I noticed immediately though. I think I had one instance of it hitching shortly after start-up, but since then it's been nice. I'll be on Keenan most of the day doing some dog-fooding. :)


2 hours ago, Keenan said:

I'll be on Keenan most of the day doing some dog-fooding. :)

So I can pester you about the in-game wiki?


Um, Keenan?  It broke. lol  It wasn't me!  Zethreal did it!

2 minutes ago, PandyLynn said:

Um, Keenan?  It broke. lol  It wasn't me!  Zethreal did it!

Working on it!


Celebration is coming back up now. Unfortunately we had a 3-minute period of data loss. Please create a support ticket if you find yourself in a stuck or bugged situation!


So a bit of a postmortem on the crash from earlier.

 

Samool worked hard to get database connection pooling working for us, which is a nice optimization to have. Unfortunately the legacy code has some areas where connections aren't given back, as they need to be for pooling to work. These situations cause resource leaks that can exhaust the number of connections available to the server. We have the ability to log these leaks, but we weren't able to stress-test the code enough on test and Jackal to foresee the exhaustion of connections. Basically, we thought we had them all.
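Not Wurm's actual code, just a minimal JDBC sketch of the leak pattern being described (the table and column names are made up): with a pool, every borrowed connection has to be closed so it goes back to the pool, and try-with-resources is the usual way to guarantee that.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public final class CreatureNameLookup {

        // Leaky pattern: the pooled connection is borrowed but never closed,
        // so it is never handed back to the pool. Enough of these and the
        // pool runs dry, new requests block, and the server stalls or crashes.
        static String leaky(DataSource pool, long id) throws SQLException {
            Connection con = pool.getConnection();
            PreparedStatement ps = con.prepareStatement("SELECT NAME FROM CREATURES WHERE WURMID = ?");
            ps.setLong(1, id);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getString(1) : null; // con, ps and rs all leak here
        }

        // Fixed pattern: try-with-resources closes everything and returns the
        // connection to the pool even if the query throws.
        static String safe(DataSource pool, long id) throws SQLException {
            try (Connection con = pool.getConnection();
                 PreparedStatement ps = con.prepareStatement("SELECT NAME FROM CREATURES WHERE WURMID = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }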

 

We've since disabled this feature for the time being, until we can fix the leaks we've detected and harden the system against data loss. We'll be increasing the max connections as well, which should give us a better buffer before a crash happens. In this case we want the server to crash sooner rather than later to prevent data loss.
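Purely as an illustration of the two knobs mentioned here (leak logging and the connection ceiling), this is how they are typically configured in a pool like HikariCP; that's just an example pool and placeholder settings, not necessarily what is actually in use:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public final class GamePool {
        public static HikariDataSource create() {
            HikariConfig cfg = new HikariConfig();
            // Placeholder connection details, not the real server's settings.
            cfg.setJdbcUrl("jdbc:mysql://localhost:3306/wurm");
            cfg.setUsername("wurm");
            cfg.setPassword("secret");
            // Hard ceiling on pooled connections: a larger value buys more
            // headroom before leaked connections exhaust the pool.
            cfg.setMaximumPoolSize(50);
            // Log a warning (with a stack trace) for any connection held
            // longer than 30 seconds without being returned to the pool.
            cfg.setLeakDetectionThreshold(30_000);
            return new HikariDataSource(cfg);
        }
    }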

 

Sadly, this optimization proved to be quite the lag reduction, as the lag graph shows: https://celebration.live.wurmonline.io/mrtg/lag.html

So we will have it re-enabled as quickly as possible.

 

Thank you for your patience!



I live on the east coast US, and after the switch it was running great for me. I didn't really even have any right-click or menu lag. But since that last restart the lag is so bad it makes me long for our old server's lag. Hope you guys get it right soon.

1 hour ago, Evilvision said:

I live on the east coast US, and after the switch it was running great for me. I didn't really even have any right-click or menu lag. But since that last restart the lag is so bad it makes me long for our old server's lag. Hope you guys get it right soon.

 

I'm investigating now.


Hamster down, I repeat, hamster down. This isn't even "connection" lag, this is straight up the server going "derp derp, I'm frozen for a good 20-30 seconds at random, then fine again, then broken again" :(

No ping spike, no packet loss, latency is lower than before, but this is bad. I logged in, took a minute to get out of my house (2 doors), and received 3 PMs from people saying "get out now, it's worse than before", as if Cele was burning or something. It was kinda funny to see that kind of panic, but yeah, it's sad.

It does seem to be the software hanging up for a bit, which I figured was the case before: over the last few days during my prospecting grind I was getting random double or triple tick sizes due to "lag spikes", which makes me believe it wasn't a connection or hardware issue but a software freeze, so actions take longer than expected and give more skill gain as a result of the spikes 😕

Here, let me add some examples to show it:

No "lag spike": [15:12:51] Prospecting increased by 0.0051 to 67.8535

"Lag spike": [15:13:57] Prospecting increased by 0.0128 to 67.8664 and [15:15:35] Prospecting increased by 0.0235 to 67.8899; that last tick came 15 seconds after the action.

Edited by wipeout

20 minutes ago, wipeout said:

Hamster down, I repeat, hamster down. This isn't even "connection" lag, this is straight up the server going "derp derp, I'm frozen for a good 20-30 seconds at random, then fine again, then broken again" :(

That's hardware lag.


The cause of the lag has been identified. To put it simply, a server with a player load uses more disk I/O than we had allotted.

 

To put it technically, we tested and went with the mid-tier general purpose drives. They worked well without a player load, and we couldn't load-test them enough to get a good idea of how well they would hold up. The reason things were good at the start is that there's a "burst balance" of I/O operations per second (IOPS), which we exhausted. That's when the lag started, as our IOPS capped at a rather low number.
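For anyone curious, the burst balance described above is exposed as a per-volume CloudWatch metric (BurstBalance, in the AWS/EBS namespace), so it can be watched before it runs out. A minimal sketch with the AWS SDK for Java v2; the volume ID is a placeholder, not the real one:

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
    import software.amazon.awssdk.services.cloudwatch.model.Dimension;
    import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
    import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
    import software.amazon.awssdk.services.cloudwatch.model.Statistic;

    public class BurstBalanceCheck {
        public static void main(String[] args) {
            try (CloudWatchClient cw = CloudWatchClient.create()) {
                GetMetricStatisticsRequest req = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/EBS")
                    .metricName("BurstBalance")               // percent of gp2 burst credits left
                    .dimensions(Dimension.builder().name("VolumeId").value("vol-0123456789abcdef0").build())
                    .startTime(Instant.now().minus(6, ChronoUnit.HOURS))
                    .endTime(Instant.now())
                    .period(300)                              // 5-minute buckets
                    .statistics(Statistic.MINIMUM)
                    .build();
                GetMetricStatisticsResponse resp = cw.getMetricStatistics(req);
                resp.datapoints().forEach(dp ->
                    System.out.printf("%s  burst balance min: %.1f%%%n", dp.timestamp(), dp.minimum()));
            }
        }
    }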

 

The solution is two-part and will more than likely need adjustments. We're upgrading to the provisioned-IOPS drives (io1) and allocating a baseline of IOPS. It's unclear whether the baseline we allocate will be sufficient right off, but based on the usage graph I have, it should be right in the ballpark of where we need to be. We may need to increase this value should lag persist.

 

Celebration is on the way back up, and we will continue to monitor the situation.



Well, the bad lag is gone, but the same old lag spikes still pop up every once in a while with the same effect. Getting there ❤️

3 hours ago, Keenan said:

The solution is two-part and will more than likely need adjustments. We're upgrading to the provisioned-IOPS drives (io1) and allocating a baseline of IOPS. It's unclear whether the baseline we allocate will be sufficient right off, but based on the usage graph I have, it should be right in the ballpark of where we need to be. We may need to increase this value should lag persist.

 

Well, this is exactly what I was talking about for the last 6 months, because I have been there and done that, not in AWS but in another cloud.

 

Even io1 storage is still way too weak for that: 64k MAX IOPS is really not that much, and at some point you are sharing it, so it is really just a best-case value. We did a lot of testing with different drives, we have been monitoring the IOPS cap for years, and luckily, thanks to that long research, we can provide numbers. What Wurm needs, and what we for example are using, are drives that sustain at least 90k IOPS 24/7; this is the most important thing for running a Wurm server (especially 4k and bigger) without any bigger lag.

And even that is not enough nowadays (we are not 100% satisfied with the performance, but it works well), so we are even considering moving to new server infrastructure in August, with drives giving us 350k-550k IOPS, which is like 4-8 times faster than io1; that is quite a number. And this is where AWS is a no-go for me personally, because now we are coming to the point where costs explode. (I love using AWS or GCS for other stuff, though.)


 

Actually, there are 3 options in my opinion:

1) Stick with AWS and have the most expensive time of your life trying to get the fastest storage AWS can offer, which maybe still will not be enough.

2) Move back to hardware servers with NVMe drives, which guarantee ~350k+ IOPS; this is the easy way, and you can use, for example, Kubernetes if you want your own easy cloud deployment. (This is what I would do.) This most certainly gives you the best performance of all your options, but, well, no AWS...

3) Stick with AWS and adjust the buffer you were talking about, making sure it can't overflow. But no matter how many simultaneous connections to the DB you allow, you will hit the limit, if not on Celebration then when moving Indy. It is not a bad idea, but there will be very frequent crashes; if not after 3 hours, then after 10 hours for sure.

So I don't think you will get far by increasing DB connections. What you could do, and what is the best bet, is to stop static polling when the buffer reaches a 90% limit. What I am talking about is freezing item, fence, and structure polling (not creature polling) and reactivating it when you are back down to 20% of the buffer size. This could be one of the only ways to get that working, because the pollers need huge amounts of IO, and each time you should only need to freeze them for less than a minute.
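A rough sketch of that 90%/20% hysteresis idea; the class and method names are made up, and the real hooks would depend on how the pollers and the buffer gauge are actually wired into the server:

    // Hypothetical throttle: pauses static polling (items, fences, structures)
    // when the connection pool or write buffer is nearly exhausted, and resumes
    // it only once usage has fallen well below the trigger, so it doesn't flap.
    public final class StaticPollThrottle {

        private static final double PAUSE_AT = 0.90;  // pause when 90% of the buffer is in use
        private static final double RESUME_AT = 0.20; // resume once usage drops back to 20%

        private boolean paused = false;

        // Called once per server tick with the current usage (0.0 - 1.0).
        // Returns true if static polling should run on this tick.
        public synchronized boolean shouldPollStatics(double usage) {
            if (!paused && usage >= PAUSE_AT) {
                paused = true;
            } else if (paused && usage <= RESUME_AT) {
                paused = false;
            }
            return !paused;
        }
    }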

 

Just my 2 cents about what I would do if I were in the same situation.


Seems the AWS outlook is grim, since we only had 20-25 people on Cele and it was already capping... isn't the server supposed to allow 200+ at once?

5 hours ago, Sklo:D said:

 

Even io1 storage is still way too weak for that: 64k MAX IOPS is really not that much, and at some point you are sharing it, so it is really just a best-case value. We did a lot of testing with different drives, we have been monitoring the IOPS cap for years, and luckily, thanks to that long research, we can provide numbers. What Wurm needs, and what we for example are using, are drives that sustain at least 90k IOPS 24/7; this is the most important thing for running a Wurm server (especially 4k and bigger) without any bigger lag.

 

I appreciate the input.

 

The current plan forward is to increase the provisioned IOPS and see what effect it has on lag. I actually found a nice tutorial on how to more precisely determine the required IOPS, and we're nowhere near the 64K max so far. I've doubled the IOPS and we will monitor the situation.

 

My weekend will be spent crunching costs and savings plans that will help. :) I can't wait... I love spreadsheets.

8 minutes ago, Evilreaper said:

Seems the AWS outlook is grim, since we only had 20-25 people on Cele and it was already capping... isn't the server supposed to allow 200+ at once?

 

It wasn't the player count causing the issue, it was zone polling. The two are correlated (more players = more stuff), but the main bottleneck was how fast the database could write updates. Basically, we were getting throttled on the operations per second on the drive. AWS charges for this, as Sklo has pointed out. I've raised the limit again this morning and will continue monitoring. We've got a lot more wiggle room with that limit.


Still pretty nasty today. Could the old lag spikes be related to the JVM garbage collector running periodically? I know there's a fair amount of tuning that can be done to reduce lag spikes caused by it if that's the case. 


We've again increased the number of IOPS, and we're continuing to monitor the situation.


It's been much better tonight. Maybe a little bit of lag still, but that could just as easily be network lag.

 

Edit: had a couple of pretty bad lag spikes while dropping a bunch of dirt on the ground for terraforming. Might have just been a coincidence, but it shows up at 8 on your MRTG graph.

Edited by morbiddog


I don’t know anything about writing games. What makes Wurm so IOPS intensive? Why would it need to interact with storage that much? Are you running the DB locally instead of using RDS?

12 hours ago, Chakron said:

I don’t know anything about writing games. What makes Wurm so IOPS intensive? Why would it need to interact with storage that much? Are you running the DB locally instead of using RDS?

 

We are running it locally, yes. RDS is an option, but an expensive one.

 

TL;DR on why it's so intensive: every time an item is touched (dropped, picked up, transferred), it hits the database with a write. Then there's zone polling, so a zone (a group of tiles) with a number of structures, items, and creatures (like a zone with a village in it) hits the database pretty hard with polling writes. Everything has a last-polled time that has to get updated. I'm sure any WU server owner who has done any profiling can attest to this. :)
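For illustration only, here's roughly what that polling write pattern looks like in plain JDBC, along with the usual way to soften it: instead of one auto-committed UPDATE per item (one write apiece), batch a whole zone's last-polled updates into a single transaction. The table and column names below are made up, not the actual schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public final class ZonePollWriter {

        // Hypothetical schema: one row per item, touched on every zone poll.
        private static final String UPDATE_LAST_POLLED =
            "UPDATE ITEMS SET LASTPOLLED = ? WHERE WURMID = ?";

        public static void markZonePolled(Connection db, List<Long> itemIds, long now) throws SQLException {
            boolean oldAutoCommit = db.getAutoCommit();
            db.setAutoCommit(false);
            try (PreparedStatement ps = db.prepareStatement(UPDATE_LAST_POLLED)) {
                for (long id : itemIds) {
                    ps.setLong(1, now);
                    ps.setLong(2, id);
                    ps.addBatch();      // queue the write instead of flushing it immediately
                }
                ps.executeBatch();      // one flush for the whole zone's poll
                db.commit();
            } catch (SQLException e) {
                db.rollback();
                throw e;
            } finally {
                db.setAutoCommit(oldAutoCommit);
            }
        }
    }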

 

It's not the best system; it's as old as the game itself, really. We've been strategizing on better ways of handling it, but we're reluctant to touch such important code. You know, if it works, don't touch it?

 

We're still reviewing metrics right now to see what we can do next to address the lag issue.

