Keenan

Desertion Issues - Postmortem


Hello all.

 

We were made aware of Desertion's issues a few weeks ago, and at the time the plan was to roll it into AWS after Celebration's optimizations were complete. Today we found that Desertion had degraded too far to keep waiting for our hosting solution to be ready, so we decided to move it to a spare within Hetzner during maintenance.

 

After the move, the server was unable to connect to the other servers. This outage lasted about an hour until we found the configuration issue and corrected it.

 

We will be posting an update on our hosting situation in the near future as we are currently in talks with other providers to see what fits us best. We've learned many lessons in our attempted move to AWS and we plan on using those lessons as we move forward.

 

- Happy Wurming!

  • Like 3


So does this mean AWS is off the table and another host is being looked for?

11 hours ago, Jeston said:

So does this mean AWS is off the table and another host is being looked for?

 

Sounds like it, which is a very smart decision. 

Posted (edited)
3 minutes ago, Sklo:D said:

 

Sounds like it, which is a very smart decision. 

 

Wurm should contact CCP Games (the EVE Online devs) in Iceland and ask them who deals with their hosting. It's a very big sandbox space MMO with lots of persistence needs.

Edited by Jeston
  • Like 2

Posted (edited)
3 minutes ago, Jeston said:

 

Wurm should contact CCP Games (the EVE Online devs) in Iceland and ask them who deals with their hosting. It's a very big sandbox space MMO with lots of persistence needs.

 

Different technologies: what works for them won't necessarily work for Wurm. They also have a lot more resources to work on cloud-optimised code.

Edited by Sklo:D

Posted (edited)
8 minutes ago, Sklo:D said:

 

Different technologies: what works for them won't necessarily work for Wurm.

 

A server is a server; there are probably millions of ways to configure the most minute things. I was merely suggesting asking a game company with persistence and read/write requirements on a similar level to Wurm what kind of host, if not themselves, they use hardware-wise. I play from the States and have no issues with latency all the way to Iceland, unless there are 200 people in a star system.

Edited by Jeston


So what about all the lag on the other servers?

Posted (edited)
25 minutes ago, Jeston said:

 

A server is a server; there are probably millions of ways to configure the most minute things. I was merely suggesting asking a game company with persistence and read/write requirements on a similar level to Wurm what kind of host, if not themselves, they use hardware-wise. I play from the States and have no issues with latency all the way to Iceland, unless there are 200 people in a star system.

 

You would be surprised how different systems can be when they are optimised for different aspects. Cloud, for example, is not optimised for performance but for fault tolerance and scalability. While one cloud node is usually a lot weaker than a server node, the right combination of technologies used in software engineering can make your software run a few thousand times better in the cloud than on normal servers. I am currently working with a good bunch of distributed system technologies; there is so much potential, but you need to use those modern technologies very smartly. The biggest part of my Master of Science education is about cloud technologies, so there really is a lot.

Edited by Sklo:D
  • Like 4


In regards to AWS commentary...

The clues that I have gathered suggest that running Wurm Online without more modern database code, structures and calls may be pushing Wurm onto the highest-tier IOPS hardware, which can get expensive and unsustainable, and most likely will not reach the desired results, as Sklo outlined. If you are already maxing out hardware speeds with the fastest CPU and disk, then the next place to look is probably some of the code, in conjunction with a less expensive cloud hosting solution. Given $$ and time there are ways to track this stuff down (cough.. "LoadRunner"... cough)

 

This is not easy stuff and there is likely no magic bullet that will just fix everything in a single stroke.   

 

Keep at it Keenan!!!   

  • Like 3

1 hour ago, Sklo:D said:

 

Different technologies: what works for them won't necessarily work for Wurm. They also have a lot more resources to work on cloud-optimised code.

but server go vroom

  • Like 2


I was saving the bigger announcement for our hosting status update, but yes - AWS is not a viable solution for us. While in testing it worked fine, once we put more load on the system we ran into bottlenecks on the disk I/O that not even raising IOPS would resolve to a satisfactory condition.

 

I'll give more information in another post over the next few days. I will say that I'm in the middle of testing one provider and in talks with a second.

  • Like 8

2 hours ago, Keenan said:

I was saving the bigger announcement for our hosting status update, but yes - AWS is not a viable solution for us. While in testing it worked fine, once we put more load on the system we ran into bottlenecks on the disk I/O that not even raising IOPS would resolve to a satisfactory condition.

 

I'll give more information in another post over the next few days. I will say that I'm in the middle of testing one provider and in talks with a second.

 

Actually, to me it was absolutely clear from the beginning that AWS couldn't work out for hosting Wurm. It is a decision many businesses aim for, but I am personally very careful with cloud stuff; it is still somewhat overhyped. During my university lectures I gained a lot of insight into cloud technologies, so I have finally developed a bit of a feeling for what is a good thing in the cloud and what isn't.

I tested Wurm in a cloud back in 2016/2017, and yes, that was exactly the biggest problem I ran into as well. I found (hardware) SSDs to give satisfying results, though especially on SQLite Wurm seems to be even more IOPS-hungry, maybe due to different transaction handling and so on. Actually, my first tests on the newest generation of NVMe SSDs had mind-blowing results for Wurm, and I would really like to dive deeper into that topic. I will post on the forums once I know how big the improvements are.
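
To illustrate why transaction handling can matter for IOPS on SQLite, here is a minimal Python sketch. It is not Wurm's actual code: the `items` table and all function names are made up for illustration. It only contrasts committing after every row with committing once per batch. On a file-backed database each commit generally has to reach the disk, so the first style issues far more small writes for the same data; the in-memory database used here just shows that both styles store the same rows.

```python
import sqlite3


def make_db(path=":memory:"):
    """Create a database with a hypothetical 'items' table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, quality REAL)")
    return conn


def insert_per_row_commit(conn, rows):
    """Commit after every single row: one transaction per insert."""
    for item_id, quality in rows:
        conn.execute(
            "INSERT INTO items (id, quality) VALUES (?, ?)", (item_id, quality)
        )
        conn.commit()


def insert_batched(conn, rows):
    """Insert the whole batch inside one transaction, committed once."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT INTO items (id, quality) VALUES (?, ?)", rows
        )


def count_rows(conn):
    """Return how many rows the hypothetical 'items' table holds."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

To see an actual difference in disk activity you would have to pass a file path to `make_db` and time both functions on the storage in question; the sketch makes no claim about how large that difference is.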

 

Anyway, I guess it is better to fail than to never try something new. I would just recommend being a bit more careful about hyping plans; it leads to disappointment.

 

  • Like 5


Perhaps see if it is possible to run on servers using Optane storage. In my experience they are perfect for poorly optimized, bursty legacy code.

 

The 900P and 905P both have very high IOPS, don't get hammered by mixed workloads, and have absurd endurance (10 DWPD).


That is good to hear, Sklo. My new WU server is running on new NVMe storage. I don't have many users, but there are 50k critters and every click feels instant. I also run a Conan Exiles server from the same box and have no problems with either.

Posted (edited)
12 hours ago, nygen said:

Perhaps see if it is possible to run on servers using Optane storage. In my experience they are perfect for poorly optimized, bursty legacy code.

 

The 900P and 905P both have very high IOPS, don't get hammered by mixed workloads, and have absurd endurance (10 DWPD).

 

Intel Optane is a caching solution, mainly meant to improve the performance of old HDDs; it uses additional software which uses AI to analyse which programs and files you use and caches them so you can access them a lot faster. I haven't seen that in servers yet. Generally, caching disks in servers are mostly used when you have something like 100TB of storage on HDDs and want to read/write the data faster. Wurm doesn't need that much space; you can store a whole server in 1GB of storage, so NVMe storage is cheap enough nowadays to run Wurm on it directly without relying on caching solutions. Those would not really increase the performance of NVMe storage anyway, as Intel Optane is also based on that type of storage, so the performance is similar.

 

(This explanation is simplified a lot; the topic is quite complex.)

Edited by Sklo:D

17 hours ago, Sklo:D said:

 

Actually, to me it was absolutely clear from the beginning that AWS couldn't work out for hosting Wurm. It is a decision many businesses aim for, but I am personally very careful with cloud stuff; it is still somewhat overhyped. During my university lectures I gained a lot of insight into cloud technologies, so I have finally developed a bit of a feeling for what is a good thing in the cloud and what isn't.

I tested Wurm in a cloud back in 2016/2017, and yes, that was exactly the biggest problem I ran into as well. I found (hardware) SSDs to give satisfying results, though especially on SQLite Wurm seems to be even more IOPS-hungry, maybe due to different transaction handling and so on. Actually, my first tests on the newest generation of NVMe SSDs had mind-blowing results for Wurm, and I would really like to dive deeper into that topic. I will post on the forums once I know how big the improvements are.


Anyway, I guess it is better to fail than to never try something new. I would just recommend being a bit more careful about hyping plans; it leads to disappointment.

2 hours ago, Sklo:D said:

 

Intel Optane is a caching solution, mainly meant to improve the performance of old HDDs; it uses additional software which uses AI to analyse which programs and files you use and caches them so you can access them a lot faster. I haven't seen that in servers yet. Generally, caching disks in servers are mostly used when you have something like 100TB of storage on HDDs and want to read/write the data faster. Wurm doesn't need that much space; you can store a whole server in 1GB of storage, so NVMe storage is cheap enough nowadays to run Wurm on it directly without relying on caching solutions. Those would not really increase the performance of NVMe storage anyway, as Intel Optane is also based on that type of storage, so the performance is similar.

 

(This explanation is simplified a lot; the topic is quite complex.)

 

Honestly, it blows my mind that you're not being brought in to advise more on how to optimize the servers. I've seen many posts since this whole AWS thing started where you spoke pretty much against it, and it seems those words fell on deaf ears.

  • Like 1

9 minutes ago, Ruger said:

Honestly, it blows my mind that you're not being brought in to advise more on how to optimize the servers. I've seen many posts since this whole AWS thing started where you spoke pretty much against it, and it seems those words fell on deaf ears.

 

He also praises Hetzner like they are gods, which is about as far from the truth as possible for anyone not in Germany, and that is a huge part of why we're moving away from them.

  • Like 1

Posted (edited)
3 hours ago, Sklo:D said:

 

Intel Optane is a caching solution, mainly meant to improve the performance of old HDDs; it uses additional software which uses AI to analyse which programs and files you use and caches them so you can access them a lot faster. I haven't seen that in servers yet. Generally, caching disks in servers are mostly used when you have something like 100TB of storage on HDDs and want to read/write the data faster. Wurm doesn't need that much space; you can store a whole server in 1GB of storage, so NVMe storage is cheap enough nowadays to run Wurm on it directly without relying on caching solutions. Those would not really increase the performance of NVMe storage anyway, as Intel Optane is also based on that type of storage, so the performance is similar.

 

(This explanation is simplified a lot; the topic is quite complex.)

 

Look up the 900P and 905P. They are full-fledged SSDs with the best IOPS rate on the market, available as AIC or M.2. Up to 1.5 TB, with low-queue-depth throughput 3x that of any NAND NVMe SSD.

 

You are confusing them with the add-in solutions, which sucked, that were sold to try to improve XPoint volume. What I am talking about is only useful for the 5% of edge cases that have mixed loads or super high IOPS. NAND NVMe solutions have really gaudy sequential rates, but tank in mixed workloads at low QD.

 

*edited for clarity*


Edited by nygen

18 minutes ago, MrGARY said:

 

He also praises Hetzner like they are gods, which is about as far from the truth as possible for anyone not in Germany, and that is a huge part of why we're moving away from them.

 

I am not praising Hetzner. Hetzner is quite a big company which provides great DDoS protection and is reliable, so their hosting is professional; that is a fact. There are many hosting companies which aren't reliable in this way, and finding something better than Hetzner is quite difficult, as they are quite good at what they do compared to the few hundred other hosts just in Germany. OVH.de and Strato.de, for example, also look very good with their DDoS protection, price-performance ratio and so on. Then there is Host-Europe, which I don't know much about, but they also seem to be a bigger company with reliable service. But after that the air gets thinner: big professional hosts with good, reliable service and strong DDoS protection are hard to find. I have suffered multiple stronger DDoS attacks, which showed me how hard it is to get a host which protects you; my first host, for example, just disabled the server's network for hours.

 

Conclusion: There are other good hosts, but Hetzner is already quite a professional and reliable host. It will be a long road to find a host which provides such a big benefit that the work of moving the servers pays off in the end. As for not being close to Germany, even AWS and GCS can have trouble with routing to e.g. the US at some point, so finding a solution to this problem could be even more problematic.

37 minutes ago, nygen said:

 

Look up the 900P and 905P. They are full-fledged SSDs with the best IOPS rate on the market, available as AIC or M.2. Up to 1.5 TB, with low-queue-depth throughput 3x that of any NVMe SSD.

 

You are confusing them with the add-in solutions, which sucked, that were sold to try to improve XPoint volume.

 

Yeah, got it. Sorry, I am not very familiar with Intel products; I learned of them as caching drives and have looked that up now. That is exactly the kind of performance that would give a Wurm server a mind-blowing increase. Those Intel Optane drives are pretty much just high-end NVMe drives. Still, most hosts sadly don't give you a choice of which drives you want in there, since they have their contracts with different hardware manufacturers, of course. But anything NVMe is for sure a big advantage when it comes to polling lag on a Wurm server. I personally am looking forward to using NVMe drives with Wurm servers.

3 minutes ago, Sklo:D said:

 

Yeah, got it. Sorry, I am not very familiar with Intel products; I learned of them as caching drives and have looked that up now. That is exactly the kind of performance that would give a Wurm server a mind-blowing increase. Those Intel Optane drives are pretty much just high-end NVMe drives. Still, most hosts sadly don't give you a choice of which drives you want in there, since they have their contracts with different hardware manufacturers, of course. But anything NVMe is for sure a big advantage when it comes to polling lag on a Wurm server. I personally am looking forward to using NVMe drives with Wurm servers.

 

Oh yeah.

 

They are very expensive per GB ($1.50 per GB, vs $0.35 for NAND). The *only* benefit is for garbage loads like this, plus durability and reliability, because it's a hybrid of NAND and RAM by function.

On 4/23/2020 at 1:52 PM, Keenan said:

I was saving the bigger announcement for our hosting status update, but yes - AWS is not a viable solution for us. While in testing it worked fine, once we put more load on the system we ran into bottlenecks on the disk I/O that not even raising IOPS would resolve to a satisfactory condition.

 

I'll give more information in another post over the next few days. I will say that I'm in the middle of testing one provider and in talks with a second.

Thank you for the update, Keenan. I know you don't have to, but telling us these things is very much appreciated. The transparency and involvement are well received.

  • Like 1


What happened with Desertion today around 4:10 PM server time? I was with a group of players and we all lost connection for a few minutes. Is this what can be expected from Desertion in the future?

3 hours ago, John said:

What happened with Desertion today around 4:10 PM server time? I was with a group of players and we all lost connection for a few minutes. Is this what can be expected from Desertion in the future?

 

It was a crash. The cause was identified as an edge case with tower influence and will be included in the update coming later this week. We don't expect it to be an issue again until then.

 

So no, this isn't what can be expected.

  • Like 2


Everything works smoothly now. It's a strange feeling when you go through doors without any resistance.

  • Like 1
