profile for Gajendra D Ambi on Stack Exchange, a network of free, community-driven Q&A sites

Sunday, January 22, 2023

Importance of lift and shift architecture and migration to Azure in a week or less!

The architecture for the hardware layer, OS layer, orchestration layer, application layer and auto DevOps was all in my bucket for one project, along with a lot more. We wanted to have a system in house, a private cloud, a private cloud for a hardware company. Yes, you heard it right. This is also why the core of what we wanted was not available from any public cloud provider. They in turn depended on us for their CPUs, GPUs and all the BIOS, drivers, firmware and frameworks associated with them to work perfectly in their cloud for their customers. Then why did we move to the cloud, and which part?!

I have been in IT since 2005 and I am from the generation which went gaga over Intel Celeron, floppy drives etc. Building computers, workstations, storage servers, NAS and compute servers was a side hustle even during my call center days. So when people were moving from bare metal to hypervisors (the VMware boom), many of us realized how important it is to have a decoupled, well connected lift and shift architecture, where you can lift your entire architecture and shift it to a new platform, whenever you want, wherever you want. This requires a couple of things which seem trivial, but they are a sweet trap that many people fall into.

  1. Choose technologies/products which are independent, i.e. not vendor specific
    1. ex: using Jenkins over GitHub Actions or GitLab Enterprise, because Jenkins works with anyone.
  2. Prefer open source over closed source
    1. ex: Jenkins over GitHub Actions (FYI, we are using GitHub Actions, so I know I am not practising what I preach, but you will see why later)
  3. Mature over new kid on the block
    1. ex: Jenkins or Jenkins X over Bamboo or CircleCI. [I keep saying Jenkins not because I am married to it, but because most people can relate to it]
  4. Don't chase the shiny.
    Some have this habit of chasing the latest, greatest, new shiny thing. Maybe it is the Gen Z fast fashion habit. It does not matter whether the product or solution is dinosaur old or born yesterday; if it solves most of your problems, then you choose it. If there is a tie or a dilemma, then choose the oldest. The older and more mature a product is, the bigger its userbase and the higher the number of issues reported and solutions offered per issue. So almost all the time, your issues with it will be old ones, solved by somebody else years ago.
  5. One who can do more, rather than more who do one.
    When you are choosing products, choose those which meet most of your requirements rather than choosing one product per requirement.
    ex: You want a NoSQL and a SQL database, so you go with PostgreSQL and MongoDB. I would go with just PostgreSQL since it offers both SQL and NoSQL (through its JSON/JSONB types), unless there is a specific requirement which is only met by MongoDB and its document structure etc.
  6. Automate, except the 1st one
    I had to have many k8s clusters deployed, managed and monitored. So I did the 1st one manually, then automated everything, destroyed it and rebuilt it with the automation. When I was happy with the results, I used the same automation for all the k8s clusters.
  7. Backup and restore
    Automate the backup of your databases; you should be able to restore to a point 15-30 days prior to the corruption. You should only have to attend to it when it is not working, and never need to attend to it, or even know that it exists, while it is working.
  8. DR and SR
    For our use case this was not an issue, but maybe for you it will be. All your backups should ideally be on a different network, different site, different platform. If your DB is in an Azure CA region, then your backups should be in, say, an AWS Alaska region. The backup site which comes up should be in a different site, a different state and, if possible, a different timezone. This is from my experience with the many datacenters we built at EMC for some of the world's biggest banks like State Bank of India, Citibank etc. Disaster Recovery and Site Recovery are a whole new game. It should never be, and never can be, a one man show.
The above are the most important but not all. I was also hosting our own git with GitLab on k8s, which was also our AutoDevOps cluster with GitLab runners.
The rest of the clusters were for different environments and projects.
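
The backup and DR rules from the list (restore window of at least 15 days, backups on a different platform and in a different region) can be turned into a check that runs without anyone watching it. This is only a minimal sketch: the `Location` and `BackupPolicy` types, the `policy_problems` function and the region labels are hypothetical names made up for illustration, not something from our actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    provider: str  # e.g. "azure" or "aws"
    region: str    # free-form label, not validated against real region names


@dataclass(frozen=True)
class BackupPolicy:
    database: Location
    backup: Location
    retention_days: int  # how far back a restore can reach


def policy_problems(policy: BackupPolicy, min_retention_days: int = 15) -> list[str]:
    """Return human-readable problems; an empty list means the policy passes."""
    problems = []
    if policy.retention_days < min_retention_days:
        problems.append(
            f"retention is {policy.retention_days} days, "
            f"need at least {min_retention_days}"
        )
    if policy.backup.provider == policy.database.provider:
        problems.append("backups live on the same platform as the database")
    if policy.backup.region == policy.database.region:
        problems.append("backups live in the same region as the database")
    return problems


# Illustrative policies only; provider and region values are placeholders.
good = BackupPolicy(Location("azure", "region-a"), Location("aws", "region-b"), 30)
bad = BackupPolicy(Location("azure", "region-a"), Location("azure", "region-a"), 7)

print(policy_problems(good))
for problem in policy_problems(bad):
    print("-", problem)
```

A check like this would sit in a scheduled pipeline and only make noise when it returns a non-empty list, which matches the idea that you should never have to know the backup exists while it is working.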

AutoDevOps

When you are the one guy who is supposed to install, manage, administer and monitor multiple k8s clusters, and who is responsible for hosting the codebase for the projects, then you had better also have automated DevOps just like the other pieces, else you will have insomnia.
  1. Implement a git plan (in our case it was gitflow)
  2. Production branches are always protected
  3. Develop branches are always protected
  4. No one can merge directly to the develop branch
  5. Anyone can branch off develop and raise a pull request
  6. No one shall merge to develop without a peer reviewed PR.
  7. Depending on which branch the code is pushed to, it should only get deployed to that branch's environment. If code gets merged to the develop branch, then it gets deployed to the develop environment, the main branch to production, etc.
  8. All repos go through SAST, and reports are generated and enforced
  9. No secrets in the code, and avoid putting them even in the repo's settings. Let them be injected during CI/CD.
You will add more to this list as you go along, but these should be a must.
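
To make rules 4 to 7 concrete, here is a minimal sketch of the branch logic. In practice this lives in the branch protection settings and the pipeline rules of your git host, not in Python; the names `BRANCH_TO_ENVIRONMENT`, `target_environment` and `may_merge` are hypothetical, and the requirement of at least one approval is my assumption of what "peer reviewed" means.

```python
from typing import Optional

# Only the two mappings named in the list above; extend as your git plan grows.
BRANCH_TO_ENVIRONMENT = {
    "develop": "develop",
    "main": "production",
}


def target_environment(branch: str) -> Optional[str]:
    """Return the environment a merge to `branch` deploys to, or None for no deploy."""
    return BRANCH_TO_ENVIRONMENT.get(branch)


def may_merge(target_branch: str, via_pull_request: bool, peer_approvals: int) -> bool:
    """Protected branches accept only peer reviewed pull requests."""
    if target_branch not in BRANCH_TO_ENVIRONMENT:
        return True  # unprotected branch, e.g. a feature branch
    # Assumption: one approval from a peer is enough to count as reviewed.
    return via_pull_request and peer_approvals >= 1


print(target_environment("develop"))
print(may_merge("develop", via_pull_request=False, peer_approvals=0))
print(may_merge("develop", via_pull_request=True, peer_approvals=1))
```

Feature branches map to no environment at all, so a push there builds and scans but never deploys.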
So now we come to the part where we make the applications 'lift and shift' compliant.
  1. 1 application per repo
  2. All applications get dockerized during CI/CD
  3. Docker images get pushed to a remote registry; you can have a separate namespace per environment. ex: docker.io/myproject/develop/myapp:latest, docker.io/myproject/production/myapp:latest etc.
  4. All apps get deployed via Helm charts, with variables injected through values.yaml during CI/CD.
  5. Always be stateless with your containerized applications unless you really can't, like with databases.
  6. Have Prometheus + Grafana alerts set for each of your apps in case it goes down or becomes unavailable, linked to your email or office chat application.
The above makes your applications 'lift and shift' ready. Now we want to make what lies underneath, k8s, lift and shift ready too.
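
The image naming and Helm rules in the application list can be sketched as two small helpers that a CI/CD job would call. Treat this as an illustration under assumptions: `image_reference` and `helm_deploy_command` are hypothetical names, using the environment name as the Kubernetes namespace is my assumption, and the chart path and values file names in the example are placeholders.

```python
def image_reference(project: str, environment: str, app: str,
                    tag: str = "latest", registry: str = "docker.io") -> str:
    """Build an image reference with one namespace per environment."""
    return f"{registry}/{project}/{environment}/{app}:{tag}"


def helm_deploy_command(app: str, environment: str,
                        chart_dir: str, values_file: str) -> list[str]:
    """Build (but do not run) a Helm command that installs or upgrades one app.

    Assumption: each environment gets its own namespace of the same name.
    """
    return [
        "helm", "upgrade", "--install", app, chart_dir,
        "--namespace", environment, "--create-namespace",
        "--values", values_file,
    ]


for env in ("develop", "production"):
    print(image_reference("myproject", env, "myapp"))
    # Placeholder chart directory and per-environment values file.
    print(" ".join(helm_deploy_command("myapp", env, "./chart", f"values-{env}.yaml")))
```

Because the only things that change between environments are the namespace in the image path and the values file, the same job definition can serve every branch.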
Just as I had a gitlab-ci.yaml for each application's AutoDevOps, you now want to have a gitlab-ci.yaml (or whatever you are using) for your k8s cluster, which deploys all of the below in order, with 1 click to run one CI/CD pipeline. I suggest you always have static reserved IP addresses available for your crucial apps like the nginx ingress controller.
  1. It first deploys storage. I recommend NFS via a Helm chart from the open source community; it is the most reliable and easy option
  2. deploys the nginx ingress controller
  3. deploys an on prem LB like MetalLB with an IP pool
  4. deploys databases and backend applications like RabbitMQ, Vault etc.
  5. deploys all the middleware, i.e. web servers built on Django, Flask, Node etc.
  6. deploys all the front end apps in series
  7. deploys all the ingresses, one for each app
  8. deploys your monitoring solutions (Prometheus, Loki, Grafana etc.)
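
The point of the ordering is that each stage depends on the ones before it, so the pipeline has to stop at the first failure instead of carrying on. A minimal sketch of that behaviour, with hypothetical stage names and a fake deploy function standing in for the real Helm or kubectl calls:

```python
from typing import Callable

# Stage names are labels for the eight steps above, in the same order.
STAGES = [
    "storage",
    "ingress-controller",
    "load-balancer",
    "backends",
    "middleware",
    "frontends",
    "ingresses",
    "monitoring",
]


def run_pipeline(stages: list[str], deploy: Callable[[str], bool]) -> list[str]:
    """Deploy stages in order; stop at the first failure and return what completed."""
    completed = []
    for stage in stages:
        if not deploy(stage):
            break
        completed.append(stage)
    return completed


def fake_deploy(stage: str) -> bool:
    """Stand-in for a real deployment; pretends the backends stage fails."""
    return stage != "backends"


done = run_pipeline(STAGES, fake_deploy)
print("completed:", done)
print("stopped before:", STAGES[len(done)] if len(done) < len(STAGES) else "nothing")
```

In a real CI/CD engine the same effect comes from sequential stages or job dependencies; the sketch only shows why the order in the list matters.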
If you are wondering what I used for the on prem k8s installation: for the production cluster it was rke2 with kube-vip, with the cluster master nodes offering HA (High Availability) and FT (Fault Tolerance); for all other clusters, it was rke1. Now let us make this k8s layer lift and shift ready too.
For rke it is easy. If you have the cluster.yaml file with all the configuration ready, then it is a 1 click shell script to deploy or destroy your whole cluster.
rke2 takes a bit more work to create that automation script, but the result will be the same.
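
For rke1, the 1 click script is little more than a wrapper around the `rke up` and `rke remove` subcommands pointed at the cluster config file. Below is a minimal sketch of such a wrapper, written in Python rather than shell so it can be dry-run; `rke_command` and `run` are hypothetical names, and it assumes the `rke` binary is on the PATH and that `cluster.yaml` is the config file name used above. It is not the script we actually used.

```python
import subprocess


def rke_command(action: str, config_path: str = "cluster.yaml") -> list[str]:
    """Map 'deploy' or 'destroy' to the matching rke subcommand."""
    subcommands = {"deploy": "up", "destroy": "remove"}
    if action not in subcommands:
        raise ValueError(f"unknown action: {action!r}")
    return ["rke", subcommands[action], "--config", config_path]


def run(action: str, config_path: str = "cluster.yaml", dry_run: bool = True) -> int:
    """Print the command in dry-run mode, otherwise execute it and return its exit code."""
    command = rke_command(action, config_path)
    if dry_run:
        print(" ".join(command))
        return 0
    # rke remove asks for confirmation interactively unless told otherwise.
    return subprocess.run(command, check=False).returncode


run("deploy")   # dry run: only prints the command
run("destroy")  # dry run: only prints the command
```

Keeping the dry run as the default is a deliberate choice here, since the destroy path is not something you want to trigger by accident.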
So now you have
  1. a 1 click script which can stand up a multi node k8s cluster.
  2. your installed clusters mapped to your CI/CD engine (Jenkins, GitLab CI etc.)
  3. the CI/CD from the previously defined stage, with which you deploy all the cluster's apps: backend, middleware, front end, monitoring tools etc.
If you do it right, then if your cluster goes down or gets destroyed in the evening, you will have the cluster redeployed in less than a few hours. Where have you heard of an entire project or infrastructure which can be destroyed and redeployed with almost full automation in less than a few hours? Thanks to k8s and containerization, this is possible.

Azure Migration

Even though we had on prem k8s clusters, the problem was the hardware: a network or power outage, drives failing, a motherboard frying or some other failure would cripple us, and I had to be on my toes, along with my recently joined colleague. We realized we were spending more on maintaining the infrastructure than on our actual goal, so we decided to move to Azure. So instead of on prem k8s, we would use AKS, the Azure Kubernetes Service. Just to give you an example, our monitoring, logging and alerting system consisted of:

  1. 1 Prometheus server per k8s cluster
  2. 1 log collecting Loki deployment per k8s cluster
  3. 1 Thanos metric aggregator which collects all the data from all the Prometheus servers and offers it to Grafana
  4. 1 Grafana front end
  5. 1 dashboard per k8s environment
  6. 1 dashboard per app per cluster
  7. Alerts set per app per cluster/environment
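
To see how quickly this adds up, here is a minimal sketch that counts the pieces from the list for a given set of clusters. The function name `monitoring_inventory` and the example clusters and apps are hypothetical, and it assumes one environment per cluster.

```python
def monitoring_inventory(apps_by_cluster: dict[str, list[str]]) -> dict[str, int]:
    """Count the monitoring pieces to maintain, following the list above."""
    clusters = len(apps_by_cluster)
    app_instances = sum(len(apps) for apps in apps_by_cluster.values())
    return {
        "prometheus_servers": clusters,      # 1 per cluster
        "loki_deployments": clusters,        # 1 per cluster
        "thanos_aggregators": 1,             # shared across all clusters
        "grafana_frontends": 1,              # shared across all clusters
        "environment_dashboards": clusters,  # assumes 1 environment per cluster
        "app_dashboards": app_instances,     # 1 per app per cluster
        "app_alert_sets": app_instances,     # 1 per app per cluster
    }


# Made-up example: two clusters with a handful of apps.
example = {
    "develop": ["api", "web"],
    "production": ["api", "web", "worker"],
}
inventory = monitoring_inventory(example)
for name, count in inventory.items():
    print(f"{name}: {count}")
print("total pieces:", sum(inventory.values()))
```

Every new app or cluster grows the per-app and per-cluster counts, which is the maintenance load a hosted monitoring service takes off your hands.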
After we moved to Azure, Sysdig and an ELK stack hosted on the cloud took care of this part too. More free time, more investment in our actual goal.
I will give you the oversimplified version of the migration from on prem k8s to AKS.
  1. Move the repos from GitLab to GitHub
  2. Convert the GitLab CI pipelines to GitHub Actions. In almost all of our GitHub Actions, the actual action is a shell script, so the wrapper was different but the shell script was a copy paste in many jobs
  3. So now the GitHub Actions workflows deploy all your apps in one go (earlier this was done by GitLab CI)
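
Step 2 is easy to picture: a GitLab CI job keeps its shell lines under `script`, and a GitHub Actions job keeps them under a step's `run`, so only the wrapper changes. A minimal sketch of that idea, working on plain dictionaries as they would look after parsing the YAML; the function `to_github_actions_job`, the example job and the `ubuntu-latest` runner are assumptions for illustration, and a real conversion also has to handle variables, images, caches and so on.

```python
def to_github_actions_job(gitlab_job: dict, runs_on: str = "ubuntu-latest") -> dict:
    """Wrap a GitLab CI job's script lines in a GitHub Actions job structure."""
    script_lines = gitlab_job.get("script", [])
    return {
        "runs-on": runs_on,
        "steps": [
            # GitLab clones the repo for you; GitHub Actions needs an explicit checkout.
            {"uses": "actions/checkout@v4"},
            # The shell script itself is carried over unchanged.
            {"name": "Run script", "run": "\n".join(script_lines)},
        ],
    }


# Made-up GitLab CI job, as parsed from its YAML.
gitlab_job = {
    "stage": "deploy",
    "script": [
        "echo deploying myapp",
        "helm upgrade --install myapp ./chart --values values.yaml",
    ],
}

github_job = to_github_actions_job(gitlab_job)
print(github_job["steps"][1]["run"])
```

This is why the conversion was quick for us: the part that carries the real logic, the shell, did not need to be rewritten.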
If we discount the delay in getting access and some delays with IT and networking which were not in our hands, then the actual migration happened in one weekday. The best part is that I got access late, which means it was my colleague, who had joined the organization and the project late, who was able to migrate it in a weekday. Maybe someone else would have taken more time, but nonetheless such migrations used to take weeks of planning and a month or more, with many people involved. In the case of virtualization techs, it would take months. Since we moved to Azure and had a good enterprise subscription, I got greedy and did not want to host our own codebase anymore, especially when we were already paying.
If your entire infrastructure can be moved in a week by someone who has only second hand knowledge of it, it shows 2 things: how *lift and shift* ready the architecture was, and how good the guy who did it is.
TLDR
Ensure all of your applications are as decoupled, connected and containerized as possible, and that you have no vendor lock-in products in your architecture.