Badoo is the world's largest social network for meeting new people. We run thousands of servers across two data centers, and some of them inevitably crash. To execute scheduled tasks we use so-called 'script running' machines, which launch PHP scripts from the command line. Until recently, we launched scheduled tasks with ordinary cron jobs and used an internal utility to automate crontab generation. Even so, developers still had to manually pick the machines on which each cron job would run. This resulted in tight coupling to specific servers, and when a server crashed we had to move its scripts to another server by hand. To balance the load evenly across servers and provide automatic failover, we decided to create an internal cloud. This talk is about how we built that cloud and the initial results we got.
- Specs and requirements
- Alternative solutions
- Load balancing
- Diagnostics and crash recovery
- Cloud monitoring