
Migrating To The Cloud

Introduction

This is the story of Kognitio’s migration from on-premise IT infrastructure to cloud computing.

It includes:

  • Why we decided to migrate
  • Which cloud provider we chose
  • How we did the migration
  • Lessons learned

Background

Kognitio develop an analytical platform that can run on clusters of commodity servers.

As Hadoop gained acceptance, Kognitio ported their software to run on Hadoop. As part of that port, the software could also run on Amazon Elastic MapReduce (EMR), which is essentially a managed Hadoop environment in the Amazon cloud.

This gave Kognitio initial experience in using Amazon Web Services (AWS), as part of developing and testing the EMR version of our software.

The benefits over on-premise solutions were obvious:

1.      It provided a quick and easy way for prospects to evaluate the software. Prospects didn’t need to get commitment from their IT department to create a small Hadoop cluster for testing, or to deploy new software on an existing production Hadoop cluster.

2.      It was possible to go from nothing to a working cluster ready to load data in about 20 minutes.

3.      On-demand clusters could be run for about $14 per TB RAM per hour, or around $3 per TB RAM per hour using spot pricing.

4.      It was easy to scale up. Running, say, a 5TB RAM cluster using spot instances was immediately achievable; previously we’d have had to reserve a lot of dedicated internal servers for that sort of exercise.

The potential downsides were:

1.      Networking was significantly worse than with dedicated on-premise servers. At the time, the best AWS systems offered only around 12% of the network bandwidth per TB RAM that we saw with on-premise servers.

2.      The marginal cost of running an on-demand system 24×7 was higher than the equivalent on-premise system. This could be reduced somewhat with a long-term commitment in AWS, but would still have been more expensive than the equivalent on-premise solution. It would also have sacrificed some of the flexibility from moving to cloud.

Going All-in On Cloud

Having seen the rewards to be reaped from cloud adoption, our next step was to migrate our own on-premise infrastructure to the cloud.

As a provider of an MPP data processing solution, we had many hundreds of commodity servers for use by product development, QA, and presales. We also had a number of servers for internal infrastructure – file servers, build servers, etc.

A number of factors made this infrastructure ideally suited for a move to cloud:

1.      Some of the kit was old, so the power, cooling and other hosting costs associated with it were significantly higher than one would expect for newer hardware.

2.      A lot of the kit was only used for a small percentage of hours in a week:

  • the development kit was typically used during the normal working day
  • the QA kit was predominantly used for running tests overnight
  • the presales kit was used sporadically on projects or for demos.

The Choice Of Cloud Provider

We decided to migrate as much infrastructure to the cloud as possible. We chose AWS because it was an environment we were familiar with from the Amazon EMR work mentioned above, and we also wanted to take advantage of spot pricing (see later).

Given the factors mentioned in point 2 above, we expected to mitigate costs by not having systems running 24×7, and by ensuring systems were shut down or terminated when not in use.

We also planned to use spot instances for at least development and QA work. Here the price benefits outweighed any inconvenience from occasionally losing instances (something which in practice happened very rarely).

We knew we’d incur higher costs in some areas, such as moving our Linux file systems to AWS, given that they would be running 24×7. However, such systems were a very small percentage of the total overall cost, both on-premise and in the cloud.

The Migration To Cloud

Our first steps were to migrate some key bits of infrastructure to AWS. This included a Linux file system, plus servers for builds, git, Bugzilla and Jenkins for development, and for generating software licences for customers.

A lot of these components, such as git, had been built on our existing VMware infrastructure. These could just be picked up and deposited in the cloud using a Lift and Shift approach; we’d then change the network routing and carry on using them as before.
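
The article doesn’t go into the mechanics of a Lift and Shift move, but the general idea can be sketched with the EC2 VM Import API. A minimal boto3 sketch, assuming the VM has already been exported from VMware as a VMDK and uploaded to S3 (the region, bucket and key names below are hypothetical):

    # Hypothetical lift-and-shift import of an exported VMware disk image.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

    response = ec2.import_image(
        Description="git server lifted from on-premise VMware",
        DiskContainers=[
            {
                "Description": "git server root disk",
                "Format": "VMDK",
                "UserBucket": {
                    "S3Bucket": "example-vm-import-bucket",  # hypothetical bucket
                    "S3Key": "exports/git-server.vmdk",      # hypothetical key
                },
            }
        ],
    )
    # The import runs asynchronously; poll describe_import_image_tasks with this ID.
    print("Import task started:", response["ImportTaskId"])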

Other components didn’t transfer so easily, particularly things we had already imported from physical hardware into VMs on-premise, as the Amazon VM import tools couldn’t always convert them. Our wiki was one such example; in that case we had to rebuild the service and import the data into it. In situations like this we also took the opportunity to, for example, split up services which had been co-located on one server, so that each used a dedicated (and hence smaller) Amazon VM after the migration.

We then developed some simple scripts for launching multi-node systems for development and QA purposes (Kognitio software is an MPP offering, so it is relatively rare to launch a system comprising a single node). This replaced the RDP infrastructure we had on-premise for deploying to physical servers.
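
The launch scripts themselves aren’t reproduced here, but a much simplified sketch of the idea, using boto3, might look like the following (the AMI, key pair, security group and tag values are all hypothetical):

    # Launch a small multi-node system in one call and tag it so it can be
    # found (and shut down) later.  All identifiers here are placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="eu-west-1")

    NODE_COUNT = 8  # an MPP system is rarely a single node

    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",            # hypothetical Linux AMI
        InstanceType="r4.4xlarge",
        MinCount=NODE_COUNT,
        MaxCount=NODE_COUNT,
        KeyName="dev-keypair",                      # hypothetical key pair
        SecurityGroupIds=["sg-0123456789abcdef0"],  # hypothetical security group
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "kognitio-system", "Value": "dev-cluster-1"}],
        }],
    )
    print("Launched", len(instances), "nodes")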

This gave us tremendous flexibility compared to our previous on-premise infrastructure. We could trivially launch large systems with any Linux distribution, different node types, etc. We could also do competitive analysis with other products, which made things like benchmarking much more straightforward. For example, we were able to compare Kognitio on Hadoop against other SQL on Hadoop offerings, as you can read at https://kognitio.com/blog/how-different-sql-on-hadoop-fare-in-99-tpc-ds-test-queries/.

Having these scripts to simplify the process of launching systems was critical in ensuring that developers and other staff were not averse to shutting down systems when they weren’t using them. With more friction on spinning up resources, the temptation would be for people to hold onto nodes for longer “just in case” they wanted to use them later in the day.
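
The other half of keeping friction low is making shutdown just as easy. A hedged sketch of a cleanup pass that stops any running instances carrying the (hypothetical) tag used in the launch sketch above:

    # Stop every running instance tagged as a development/QA system.
    # The tag key is a placeholder matching the earlier launch sketch.
    import boto3

    ec2 = boto3.resource("ec2", region_name="eu-west-1")

    running = ec2.instances.filter(
        Filters=[
            {"Name": "tag-key", "Values": ["kognitio-system"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )

    for instance in running:
        print("Stopping", instance.id)
        instance.stop()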

Following that, we moved other bits and pieces to AWS as and when it made sense – for example, our QA infrastructure.

Later we were able to restart all our core AWS infrastructure to…

  • encrypt all filesystems
  • restructure the file system to have more space and use multiple volumes
  • upgrade the node types used for our build and NFS servers to make them faster and have better networking
  • upgrade the OS on the build server
  • install Meltdown mitigation patches as required

…all in less than two hours.
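
One of those changes, moving a server to a faster node type with better networking, can be sketched with boto3. This is an illustrative outline rather than the exact procedure we used, and the instance ID and target type are hypothetical:

    # Stop an instance, change its type, and start it again.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical build server

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": "r4.8xlarge"},  # hypothetical target type
    )

    ec2.start_instances(InstanceIds=[instance_id])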

Lessons Learned

As always with this sort of migration, there were some teething troubles along the way, and some further lessons learned subsequently:

1.      We had some short-term issues during the transition phase. At this point we had a hybrid environment, with some resources in AWS and some on-premise, and there were some communication glitches which were only relevant whilst we were partially migrated to AWS. The lesson from that is to minimise such transition periods, and to avoid having lots of phases which lead to problems only seen during the transition process itself.

2.      Hitting AWS limits for our accounts:

  • Originally, by default, we could only have five spot requests at a time. We sometimes have up to a hundred in play on each account, so we needed to get Amazon to increase the default limit for our accounts. In practice, the only way to do this was to hit the limit in place, get a new higher limit, then hit that and ask for it to be increased again; Amazon would not speculatively increase the limit to a much higher value in one step.
  • We have a number of accounts to isolate, for example, development, QA and presales from one another, and the limit-increasing exercise described above had to be conducted on each of those accounts in turn.
  • Similar limits existed for on-demand (e.g. a default limit of only 1 x r4.16xlarge instance originally), so again we had to hit those limits and get them increased.
  • Some limits were not visible to us. On one occasion we had to raise a support ticket and wait two days until all the appropriate limits were raised to allow 40 x r4.16xlarge spot nodes to be allocated at once.

3.      We found a lot of places that had hard-coded references to IP addresses, or hard-coded paths that had to change as part of the migration.

4.      Occasional oddities, such as the problem described at https://stackoverflow.com/questions/31783160/why-vim-is-changing-first-letter-to-g-after-opening-a-file, which we saw when moving to Amazon Linux for our development.

5.      Changes in how Amazon handle spot, which you can read about at https://blog.spotinst.com/2017/11/29/everything-spot-instances-reinvent-2017/. The previous process of bidding for spot instances, where the highest bids would get instances, went away. Now the price changes gradually: you can still set a maximum price, but everyone who gets nodes pays the same gradually changing price. So you can no longer put in a high bid to take an instance from an existing spot user at a price still lower than the on-demand price, which was how things operated before. If all the nodes are in use, you either wait for someone to finish with them, or use on-demand rather than spot. One consequence of this is that we tend to use more of the older generation nodes (e.g. R4 nodes), as you can usually get spot instances of those without having to wait. A sketch of making a spot request under the current model follows below.
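
Under the current model you no longer bid in the old sense: you simply request spot capacity and can optionally cap the price you are willing to pay (if no cap is given, it defaults to the on-demand price). A boto3 sketch, with a hypothetical AMI and price cap:

    # Request a single spot instance under the current pricing model.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI
        InstanceType="r4.16xlarge",       # older generation, usually available as spot
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "MaxPrice": "2.00",             # optional cap, in USD per hour
                "SpotInstanceType": "one-time",
            },
        },
    )
    print(response["Instances"][0]["InstanceId"])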

Other Benefits Of Cloud

There have been additional, unanticipated benefits of cloud, including:

1.      We have been able to show customers and prospects how to run Kognitio on AWS, to give them another deployment option. This is particularly useful for development, testing and project requirements, where the time and cost to deploy new infrastructure on-premise or with a traditional hosting provider can be prohibitive.

2.      We have found issues when running our product on different Linux distributions, because it is now so much easier to try any distribution of choice. Our nightly QA can use a wide range of Linux distributions, and those can be changed very simply, as sketched below.
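
As an illustration of how simply the distribution can be changed, a test run can be driven by a single parameter that selects the AMI to launch. A sketch using boto3; the owner IDs and name patterns below are illustrative rather than exhaustive, and this is not our actual QA code:

    # Look up the most recent AMI for a chosen Linux distribution.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    DISTRO_FILTERS = {
        # distribution -> (AMI owner account, AMI name pattern), illustrative values
        "amazon-linux-2": ("137112412989", "amzn2-ami-hvm-*-x86_64-gp2"),
        "ubuntu-18.04": ("099720109477",
                         "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"),
    }

    def latest_ami(distro):
        owner, name_pattern = DISTRO_FILTERS[distro]
        images = ec2.describe_images(
            Owners=[owner],
            Filters=[{"Name": "name", "Values": [name_pattern]}],
        )["Images"]
        # CreationDate is an ISO timestamp, so max() picks the newest image.
        return max(images, key=lambda i: i["CreationDate"])["ImageId"]

    print(latest_ami("ubuntu-18.04"))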

In addition, our original focus on a product for Hadoop was hampered by broader issues in Hadoop adoption. Hadoop proved difficult for organisations to deploy, and once deployed, people were wary of changing production systems, having suffered many Hadoop issues already. So a lot of the expected Hadoop market has moved straight to cloud to avoid the complexity of Hadoop. Fortunately, we moved to cloud as a company at the same time, which served us well by ensuring we soon had a cloud product available.

What Next?

We still have some dedicated hardware for a variety of purposes. Migrating that, as and when it is required and practical, is something for the future:

1.      Windows domain infrastructure, which we haven’t tried to migrate.

2.      Dedicated hardware for product testing with high-bandwidth networking that is not available in AWS. We have customers using on-premise systems with excellent networking, and currently we have no way to test that sort of environment in AWS.

3.      Hardware to allow us to back up our AWS storage outside AWS, ensuring we have a copy outside that cloud environment.

Summary

Migrating from on-premise to cloud for the workloads we’ve chosen has been an unexpectedly straightforward journey from my viewpoint (although I didn’t have to do the implementation!).

If you would like more information on anything discussed in this article, please contact me via the comments section.

More information on using Kognitio in AWS is here, including a link taking you to the AWS Marketplace page for Kognitio.
