profile for Gajendra D Ambi on Stack Exchange, a network of free, community-driven Q&A sites

Sunday, January 22, 2023

Importance of lift and shift architecture and migration to Azure in a week or less!

The architecture for the hardware layer, OS layer, orchestration layer, application layer and auto DevOps was all in my bucket for one project, along with a lot more. We wanted an in-house system: a private cloud for a hardware company. Yes, you heard it right. This is also why the core of what we wanted was not available from any public cloud provider. They in turn depended on us for their CPUs, GPUs and all the BIOS, drivers, firmware and frameworks associated with them to work perfectly in their cloud for their customers. Then why did we move to the cloud, and which part?!

I have been in IT since 2005, and I am from the generation that went gaga over the Intel Celeron, floppy drives and so on. Building computers, workstations, storage servers, NAS and compute servers was a side hustle even during my call center days. So when people were moving from bare metal to hypervisors (the VMware boom), many of us realized how important it is to have a decoupled, well-connected lift and shift architecture, where you can lift your entire architecture and shift it to a new platform whenever and wherever you want. This requires a few things which seem trivial, yet skipping them is a sweet trap that many people fall into.

  1. Choose technologies/products which are independent, not vendor specific
    1. ex: Jenkins over GitHub Actions or GitLab Enterprise, because Jenkins works with anyone.
  2. Prefer open source over closed source
    1. ex: Jenkins over GitHub Actions (FYI, we are using GitHub Actions. I know I am not practising what I preach, but you will see why later)
  3. Mature over new kid on the block
    1. ex: Jenkins or Jenkins X over Bamboo or CircleCI. [I keep saying Jenkins not because I am married to it, but because most people can relate to it]
  4. Don't chase the shiny.
    Some have this habit of chasing the latest, greatest, new shiny thing. Maybe it is the Gen Z fast-fashion habit. It does not matter whether the product or solution is dinosaur old or born yesterday; if it solves most of your problems, choose it. If there is a tie or a dilemma, choose the oldest. The older and more mature a product is, the bigger its user base, and the higher the number of issues reported and solutions offered per issue. So almost all the time, your issues with it will be old ones that somebody else solved years ago.
  5. One that does more over many that each do one.
    When you are choosing products, choose those which meet most of your requirements rather than one product per requirement.
    ex: You want a NoSQL and a SQL database, so you go with PostgreSQL and MongoDB. I would go with just PostgreSQL since it offers both SQL and NoSQL, unless there is a specific requirement which is only met by MongoDB and its document structure etc.
  6. Automate everything except the 1st one
    I had to have many k8s clusters deployed, managed and monitored. So I did the 1st one manually, then automated everything, destroyed it and rebuilt it with the automation. When I was happy with the results, I used the same automation for all the k8s clusters.
  7. Backup and restore
    Automate the backup of your databases; you should be able to restore to a point 15-30 days prior to the corruption. You should only have to attend to it when it is not working, and never need to know it exists while it is working.
  8. DR and SR (Disaster Recovery and Site Recovery)
    For our use case this was not an issue, but for you it may be. All your backups should ideally be on a different network, a different site and a different platform. If your DB is in an Azure CA region, then your backups should be with another provider, such as AWS, in a distant region. The backup site which comes up should be in a different site, a different state and, if possible, a different timezone. This is from my experience with the many datacenters we built at EMC for some of the world's biggest banks, like State Bank of India, Citibank etc. Disaster Recovery and Site Recovery are a whole new game. It should never be, and never can be, a one-man show.
The above are the most important points, but not all of them. I was also hosting our own git with GitLab on k8s, which was also our AutoDevOps cluster with GitLab runners.
The rest of the clusters were for different environments and projects.
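
To make the backup point (item 7) a bit more concrete, here is a minimal Python sketch of a retention window. It is only an illustration: the 30-day figure is the upper end of the 15-30 day range mentioned above, and the function names are hypothetical, not part of any backup tool.

```python
from datetime import date, timedelta

# Assumption: keep 30 days of backups (upper end of the 15-30 day range above).
RETENTION_DAYS = 30


def split_backups(backup_dates, today, retention_days=RETENTION_DAYS):
    """Split backup dates into (keep, prune) for a simple retention window."""
    cutoff = today - timedelta(days=retention_days)
    keep = sorted(d for d in backup_dates if d >= cutoff)
    prune = sorted(d for d in backup_dates if d < cutoff)
    return keep, prune


def latest_restore_point(backup_dates, corrupted_on):
    """Return the newest backup taken strictly before the corruption date."""
    candidates = [d for d in backup_dates if d < corrupted_on]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    today = date(2023, 1, 22)
    # One backup per day for the last 45 days (made-up data for the demo).
    backups = [today - timedelta(days=n) for n in range(45)]
    keep, prune = split_backups(backups, today)
    print(len(keep), "kept,", len(prune), "pruned")
    print("restore from:", latest_restore_point(keep, date(2023, 1, 10)))
```

A scheduled job could run something like this after each backup and alert only when no restore point exists, which fits the "only attend to it when it is not working" idea.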

AutoDevOps

When you are the one guy who is supposed to install, manage, administer and monitor multiple k8s clusters, and who is responsible for hosting the codebase for the projects, then you had better have automated DevOps too, just like the other pieces, or else you will have insomnia.
  1. Implement a git plan (in our case it was gitflow)
  2. Production branches are always protected
  3. Develop branches are always protected
  4. Nobody can push directly to the develop branch
  5. Anybody can branch off develop and open a pull request
  6. Nothing gets merged to develop without a peer-reviewed PR.
  7. Depending on which branch the code is pushed to, it should only get deployed to that environment. If code gets merged to the develop branch, then it gets deployed to the develop environment, the main branch to production etc.
  8. All repos go through SAST, and reports are generated and enforced
  9. No secrets in the code; avoid even putting them in the repo's settings. Let them be injected during CI/CD.
You will add more to this list as you go along, but these should be a must.
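
The branch-to-environment rule (item 7) can be sketched in a few lines of Python. This is an illustrative sketch, not our actual pipeline code: the mapping and the function name are assumptions, and any branch not listed is simply not deployed.

```python
# Hypothetical mapping for a gitflow-style setup; adjust names to your own plan.
BRANCH_TO_ENV = {
    "develop": "develop",
    "main": "production",
}


def target_environment(ref):
    """Return the environment a branch deploys to, or None for no deployment.

    Accepts either a bare branch name ("develop") or a full git ref
    ("refs/heads/develop"), since CI systems expose both forms.
    """
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return BRANCH_TO_ENV.get(branch)


if __name__ == "__main__":
    for ref in ("develop", "refs/heads/main", "feature/login"):
        print(ref, "->", target_environment(ref))
```

Keeping this decision in one small function (or one pipeline rule) means feature branches can never reach an environment by accident.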
So now we come to the part where we make the applications 'lift and shift' compliant.
  1. 1 application per repo
  2. All applications get dockerized during CI/CD
  3. Docker images get pushed to a remote registry; you can use a separate namespace per environment. ex: docker.io/myproject/develop/myapp:latest, docker.io/myproject/production/myapp:latest etc.
  4. All apps get deployed via Helm chart, with variables injected through values.yaml during CI/CD.
  5. Always be stateless with your containerized applications unless you really can't, like with databases.
  6. Have Prometheus + Grafana alerts set for each of your apps for when it goes down or becomes unavailable, linked to your email or office chat application.
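
As a small illustration of item 3, this Python sketch builds the per-environment image reference in the same shape as the examples in that item. The function name and the validation rule are assumptions made for the sketch.

```python
def image_reference(registry, project, environment, app, tag="latest"):
    """Build '<registry>/<project>/<environment>/<app>:<tag>'.

    Every part must be non-empty and free of '/' and ':' so that the
    resulting reference cannot be silently malformed.
    """
    parts = {
        "registry": registry,
        "project": project,
        "environment": environment,
        "app": app,
        "tag": tag,
    }
    for name, value in parts.items():
        if not value or "/" in value or ":" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"{registry}/{project}/{environment}/{app}:{tag}"


if __name__ == "__main__":
    print(image_reference("docker.io", "myproject", "develop", "myapp"))
    print(image_reference("docker.io", "myproject", "production", "myapp"))
```

The same helper can be called from the CI/CD job with the environment derived from the branch, so the image always lands in the matching namespace.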
The above makes your applications 'lift and shift' ready. Now we want to make what lies underneath, k8s, lift and shift ready too.
Just as I had a gitlab-ci.yaml for each application's AutoDevOps, you now want a gitlab-ci.yaml (or whatever you are using) for your k8s cluster, which deploys all of the below, in order, with 1 click that runs one CI/CD pipeline. I suggest you always keep static, reserved IP addresses available for your crucial apps, like the nginx ingress controller.
  1. It first deploys storage. I recommend NFS via a Helm chart from the open source community; it is the most reliable and easy option
  2. deploys the nginx ingress controller
  3. deploys an on-prem LB like MetalLB with an IP pool
  4. deploys databases and backend applications like RabbitMQ, Vault etc.
  5. deploys all the middleware, i.e. web servers like Django, Flask, Node etc.
  6. deploys all the front end apps in series
  7. deploys the ingresses for each app
  8. deploys your monitoring solutions (Prometheus, Loki, Grafana etc.)
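
The ordering above can be sketched as a list of stages that a single pipeline job walks through. This Python sketch only builds and prints the `helm upgrade --install` commands; the release names, chart paths and values-file layout are hypothetical placeholders, not the ones we used.

```python
import shlex

# Stage order follows the list above. Release names and chart paths are
# hypothetical placeholders for this sketch.
STAGES = [
    ("storage", "./charts/nfs-storage"),
    ("ingress-controller", "./charts/ingress-nginx"),
    ("load-balancer", "./charts/metallb"),
    ("backend", "./charts/backend"),
    ("middleware", "./charts/middleware"),
    ("frontend", "./charts/frontend"),
    ("ingresses", "./charts/ingresses"),
    ("monitoring", "./charts/monitoring"),
]


def helm_commands(environment, stages=STAGES):
    """Return one 'helm upgrade --install' command per stage, in order."""
    commands = []
    for release, chart in stages:
        commands.append([
            "helm", "upgrade", "--install", release, chart,
            "--namespace", environment, "--create-namespace",
            # Assumed layout: one values file per environment and release.
            "--values", f"values/{environment}/{release}.yaml",
            "--wait",  # block until this stage is ready before the next one
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in helm_commands("develop"):
        print(shlex.join(cmd))
```

To actually deploy, each command could be passed to `subprocess.run(cmd, check=True)` so the pipeline stops at the first failing stage.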
If you are wondering what I used for the on-prem k8s installation: for the production cluster it was rke2 with kube-vip, with the cluster master nodes offering HA (High Availability) and FT (Fault Tolerance); for all other clusters it was rke1. Now let us make this k8s layer lift and shift ready too.
For rke it is easy. If you have the cluster.yaml file with all the configuration ready, then it is a 1-click shell script to deploy or destroy your whole cluster.
For rke2 it takes a bit of work to create that automation script, but the result is the same.
So now the flow is:
  1. A 1-click script brings up a multi-node k8s cluster.
  2. You map your installed clusters to your CI/CD engine (Jenkins, GitLab CI etc.)
  3. Using the CI/CD from the previously defined stage, you deploy all the cluster's apps: backend, middleware, front end, monitoring tools etc.
If you do it right, then if your cluster goes down or gets destroyed in the evening, you will have it redeployed in less than a few hours. Where have you heard of an entire project or infrastructure which can be destroyed and redeployed with almost full automation in less than a few hours? Thanks to k8s and containerization, this is possible.

Azure Migration

Even though we had on-prem k8s clusters, the problem was the hardware: network trouble, power outages, drives failing, motherboards frying or some other failure would cripple us, and I had to be on my toes, along with my recently joined colleague. We realized we were spending more on maintaining the infrastructure than on our actual goal, so we decided to move to Azure. So instead of on-prem k8s, we would use AKS, the Azure Kubernetes Service. Just to give you an example, our monitoring, logging and alerting system consisted of:

  1. 1 Prometheus server per k8s cluster
  2. 1 log-collecting Loki deployment per k8s cluster
  3. 1 Thanos metric aggregator which collects data from all the Prometheus servers and offers it to Grafana
  4. 1 Grafana front end
  5. 1 dashboard per k8s environment
  6. 1 dashboard per app per cluster
  7. Alerts set per app per cluster/environment
After we moved to Azure, Sysdig and a cloud-hosted ELK stack took care of this part too. More free time, more investment in our actual goal.
I will give you the oversimplified version of the migration from on-prem k8s to AKS.
  1. Move the repos from GitLab to GitHub
  2. Convert GitLab CI to GitHub Actions. In almost all of our GitHub Actions workflows the actual work is a shell script, so the wrapper was different but the shell script could be copied and pasted across in many jobs
  3. So now GitHub Actions deploys all your apps in one go (earlier this was done by GitLab CI)
If we discount the delay in getting access and some delays with IT and networking which were not in our hands, then the actual migration happened in one weekday. The best part is that I got access late, so it was my colleague, who had joined the organization and the project only recently, who migrated it in a weekday. Maybe someone else would have taken more time, but nonetheless such migrations used to take weeks of planning and a month or more, with many people involved. In the case of virtualization tech, it would take months. Since we had moved to Azure and had a good enterprise subscription, I was greedy and did not want to host our own codebase anymore, especially when we were already paying.
If your entire infrastructure can be moved in a week by someone who has only second-hand knowledge of it, it shows 2 things: how *lift and shift* ready the architecture was, and how good the guy who did it is.
TLDR
Ensure all of your applications are as decoupled, connected and containerized as possible, and that you have no vendor lock-in products in your architecture.

Wednesday, March 2, 2022

Thursday, May 20, 2021

Redis 6 and multi threading

There is an excellent article at https://www.digitalocean.com/community/tutorials/how-to-install-redis-from-source-on-ubuntu-18-04 on how to install Redis from source.

This article shows how to configure Redis for multi threading:

https://www.programmersought.com/article/30635498543/

Update both settings: the number of threads (matched to your cores) and the switch that enables multi threading.

Apart from the above, make sure to set the following

bind 0.0.0.0
protected-mode no
in the Redis configuration file at /etc/redis/redis.conf.
Once we restart the Redis service, we should be good to go. In our case we were using Redis to speed up our build process (as a cache) and thus did not need the data snapshots, so using redis-cli we ran 'config set stop-writes-on-bgsave-error no'.
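
For repeatable setups, the edits above can be scripted. The following Python sketch rewrites redis.conf text with the directives discussed in this post. It assumes the two multi threading settings are `io-threads` and `io-threads-do-reads` (the Redis 6 threaded I/O directives), the thread count of 4 is a made-up value, and writing `stop-writes-on-bgsave-error` into the file is the persistent counterpart of the redis-cli command. The function name is hypothetical.

```python
# Directives from this post. The io-threads value (4) is an assumption;
# pick a number that suits your core count.
DESIRED = {
    "io-threads": "4",
    "io-threads-do-reads": "yes",
    "bind": "0.0.0.0",
    "protected-mode": "no",
    "stop-writes-on-bgsave-error": "no",
}


def apply_directives(conf_text, desired=DESIRED):
    """Return conf_text with each desired directive set exactly once.

    Existing uncommented lines for a directive are replaced (duplicates are
    dropped); directives missing from the file are appended at the end.
    Comment lines are left untouched.
    """
    seen = set()
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        key = None
        if stripped and not stripped.startswith("#"):
            key = stripped.split(None, 1)[0]
        if key in desired:
            if key not in seen:
                out.append(f"{key} {desired[key]}")
                seen.add(key)
            # else: drop the duplicate so it cannot override our value
        else:
            out.append(line)
    out.extend(f"{k} {v}" for k, v in desired.items() if k not in seen)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "bind 127.0.0.1\n# io-threads 4\nprotected-mode yes\n"
    print(apply_directives(sample), end="")
```

Run it against a copy of /etc/redis/redis.conf first and diff the result before replacing the real file.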

Friday, April 9, 2021

awx on kubernetes with operator

I was using AWX 15 on our on-prem k8s and had to reinstall it because the community Helm chart had been yanked. The official AWX GitHub page was no good: the AWX operator could only install the latest version, but we needed 15 and it gave us 18+. Luckily the older operator 0.6.0 deployed AWX 15, but without the admin secret needed to log in. Thanks to this https://github.com/ansible/awx-operator/issues/123#issuecomment-797820294 I could log in, create a new superuser and get on with it. The AWX operator has a long way to go.

Friday, March 12, 2021

Battle Royale or Scam Royale? An experience with Apex Legends by EA and Respawn

A top AAA title to play for free, forever. What is the catch? Nothing; well, at least nothing that you can see or know or prove.

I have been an admirer and a continuous player of Apex Legends since season 01, and so far I have 1000s of hours in this game, with the highest level in the game too. I still believe that, to my taste, it is the best battle royale as of today and that others should learn from it.

I am a free player (I do not pay or spend on anything in the game) and I am not an influencer (not a streaming celebrity whose stream has millions of viewers who get influenced by me on what they should play). Here I have of course taken Apex Legends only as an example; the same can be done in any online game.

Perfect Crime

A perfect crime is an act of unethical, immoral or dishonest behavior where proving the crime is next to impossible. Some might want to dismiss this as a conspiracy, cuckoo, tin foil hat theory. My only answer to that is: in this world there are people who commit heinous crimes like murder for a few bucks, where the chances of getting caught are very high. What then are the chances of someone committing a dishonest, almost impossible to prove crime where tens of crores (or billions) in money are riding on it, and where, even if it is proven, there is no jail time and no lives are lost or harmed? The code/game keeps getting updated and new features keep getting added. In this constantly changing state of the game, it is of course next to impossible to prove a crime. A search engine giant constantly updates its search engine and does not disclose its algorithm, which means that if they used the search engine to influence elections or promote their partners over others, there would be no way of proving it, since they would have already updated the code/algorithm used by the search engine. These companies are so big that it is unthinkable for any of their users to have the resources, energy and time to face such goliaths in court or outside.
So how does the game actually make money then? The only source for them is players doing micro transactions. They (Respawn and EA, the developer and publisher of this game) have claimed that no micro transaction will affect the gameplay, or in other words the experience (your spending won't affect your chances of losing or winning), and that the gameplay will be the same for players who spend and players who do not. There are 3 types of players in any free to play video game.

  1. Influencers
  2. Spenders
  3. Free players (or fodder, or none of the above)
I feel/think (of course I do not have any proof of it, and it is solely based on my personal observations and experience) that your chances of winning and losing can be increased or decreased; in other words, game developers have a system which can give a boost to the players they want. Currently the motivation is of course money, but when money is not the factor it can also be done based on the ideology that they want to promote: conservatism, politics, religion etc.
It is also important that they do this only to the extent where the players don't completely quit and never play the game again. It should be done only as much as fodder-level players will tolerate while still playing. If there are those who play it anyway, then you can always target them and feed them as fodder.

Influencers 

Influencers are the ones who bring players to the game. If they are not having a great time playing it, they lose viewership, and influencers (streamers) won't stream a game for which they will lose viewers. It is in the streamers' best interest to stream games which will get them viewers and subscribers and make them famous. If streamers won't stream a game where they seem to suck, lose or do badly, then the game companies lose; hence it is in a game company's best interest that streamers always have an experience which promotes their growth (in terms of subscribers and viewers). All gamers know what cheat codes and god modes are. It is in a game company's best interest to apply cheat codes to the influencers' profiles in the game's database, on the server side itself, and based on that tag attached to their profile the game server will increase their chances of winning (I will explain how). It is the influencers who bring in spenders and free players.

Spenders

Spenders are the ones who actually pay money to buy items like skins and other cosmetics, which have (or should have) no effect on the game's outcome (winning or losing). A spender will only spend on a game he does not have to pay for if he is having a lot of fun. It is in the company's best interest that these spenders are a high priority (not higher than influencers, of course). Based on whether a person is a spender or a free player, an AI can attach certain tags to the player's profile in the database, and based on those the game server will either degrade or upgrade your experience (think of these as cheat codes, but applied on the server side, where nobody can see them except the developers of that AI).

Free Player (fodder)

This is me. If influencers need to get on average 10+ kills per game and spenders should get 5-10 kills per game, then it is the free players and some bots who are being fed as fodder to the influencers and spenders.

How?

The following tactics can be used to alter a player's experience and thus the outcome of the player's game (winning, losing or points gained: kill counts, wins, level ups etc.). Here I have taken the game I play, Apex Legends, for reference. The following can/will be altered (increased or decreased) based on what type of player (influencer, spender, free) is playing the game
  1. network lag
  2. input lag
  3. output lag
  4. damage done
  5. damage taken
  6. aim assist
  7. availability of the type of gun and ammo on entry
  8. availability of the type of ammo during the gameplay
  9. reliability of the abilities
  10. cool down time
  11. heal and recharge items
  12. heal and recharge time
  13. auto/burst symbols disappear or won't show when you switch modes
  14. sound issues
  15. can't/won't hear enemies
  16. false sound (enemy on the left is heard as enemy on the right or center and vice versa)
  17. spawning of items (it is no more random but customized for player types, otherwise one wouldn't be getting the weakest weapons on landing, 7/10 times)
They can always dismiss the above, and many more, as just a bug.
I will list some possible ways in which they can cripple gameplay for type 3 players, for different characters

Caustic

  1. Gas trap won't deploy
  2. Gas trap won't get triggered
  3. Gas trap trigger time increases
  4. Gas cloud damage done decreases (or increases, depending on the type of player)
  5. Disconnect the player, and when he reconnects he has already been shot by type 1, 2 players.
  6. Make the player suddenly appear in front of an enemy (type 1, 2 players)

Wattson

  1. Her fence won't deploy

Loba

  1. Jump drive comes back and won't work when you really need it


The whole thing can be done by AI with no or minimal intervention, and an ML model can learn about the player and adapt. Currently AI/ML is used by Twitter to decide whom/what to trend, whom/what to suppress, whom/what to cancel or censor, whom/what to shadow ban etc. This is no secret. The same is true for these online games. Today the motivation is most probably money, but if you look at the top players of Apex Legends, a great majority of them, if not all, are western, white and/or Christian, and this tends to make one think about other motivations too.