Case Study: Clusterone's Kubernetes Cluster Implementation

The Challenge 

Building the cluster for this engagement began with preparing the target environment:

  • We prepared an extensive inventory for our Ansible-based installer. The goal was to provide access to each server and to describe its hardware configuration accurately (a hypothetical sketch follows this list).
  • Our installer standardized the OS versions on all servers and installed the necessary additional software and tools.
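
The original inventory is not reproduced here; as a rough illustration, an inventory along these lines might look as follows, with all host names, addresses, and variables being hypothetical:

```yaml
# Hypothetical Ansible inventory sketch (inventory.yml). Host names,
# addresses, and variables are illustrative, not the customer's data.
all:
  children:
    control_plane:
      hosts:
        master-01:
          ansible_host: 10.0.0.10
    gpu_nodes:
      hosts:
        gpu-node-01:
          ansible_host: 10.0.0.21
          gpu_model: "Tesla V100"      # drives Nvidia driver selection
        gpu-node-02:
          ansible_host: 10.0.0.22
          gpu_model: "RTX 2080 Ti"
    storage_nodes:
      hosts:
        ceph-node-01:
          ansible_host: 10.0.0.31
          ceph_osd_devices: [/dev/sdb, /dev/sdc]   # disks offered to Ceph
```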

In the next steps, containerization and network support were installed and configured. We created a Kubernetes cluster and joined all the remaining machines to it as nodes. From the pool of available additional disks, we built a distributed storage system using Ceph. The installer set up persistent volumes backed by Ceph on Kubernetes, installed the appropriate driver versions on the GPU-equipped servers, and made the cards available to ML jobs running as Pods on the cluster.
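
The case study doesn't name the provisioner that exposed Ceph to Kubernetes; assuming the ceph-csi RBD driver, a minimal StorageClass sketch might look like this (the cluster ID, pool, and secret names are placeholders):

```yaml
# Illustrative StorageClass for Ceph RBD via ceph-csi; values are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-id>          # the Ceph cluster's fsid
  pool: kubernetes                      # RBD pool dedicated to K8s volumes
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
```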

How do we normally plan it? 7 steps to fit customer needs:

When planning our Kubernetes cluster, we embark on a journey that demands careful consideration of our unique needs and objectives. Here's how we approach this crucial task, drawing wisdom from a variety of expert sources:

  1. Selecting the Ideal Kubernetes Configuration: For our initial forays or when our focus is on learning, we opt for Minikube. This choice offers us a compact, manageable Kubernetes environment that's perfect for individual exploration or small-scale testing. With Docker as the container runtime, setting up Minikube becomes a straightforward task, enabling us to dive into Kubernetes with minimal setup.
  2. Preparing for Production: As we transition to production, our strategy evolves. We lean towards Kubeadm for its robustness in creating a production-grade Kubernetes cluster. This method allows us to construct a cluster with a single master node and an etcd configuration, setting the stage for a scalable and resilient environment. The process involves a sequence of steps, starting from installing essential Kubernetes components to initializing our cluster with kubeadm init (a configuration sketch follows this list).
  3. Machine Type and Node Size Decision-Making: Our workloads' specific demands guide our choice of node machine types and sizes. Whether our applications require basic resources for testing or more dedicated resources for intensive tasks, platforms like DigitalOcean offer a spectrum of Droplet plans that cater to these needs. This decision affects not just the performance but also the efficiency and scalability of our applications.
  4. Node Size and Count Considerations: The architecture of our cluster hinges on the size and number of nodes we deploy. This balance is critical for ensuring that our cluster has the necessary resources to perform optimally while maintaining the flexibility to scale. We aim for a configuration that supports our workload without unnecessary excess, ensuring efficient use of resources.
  5. Adopting a Data-Driven Approach: Once our cluster is operational, we engage in benchmarking and load testing to gauge performance under various scenarios. This empirical approach allows us to refine our resource allocation and node configuration, tailoring our cluster to meet the precise needs of our applications effectively.
  6. Ensuring High Availability and Scalability: High availability and the ability to scale dynamically are paramount for our critical workloads. By incorporating high-availability features for the control plane and enabling autoscaling, we enhance our cluster's resilience and adaptability, ensuring that our applications remain robust and responsive under varying loads.
  7. Implementing Kubernetes Best Practices: Our strategy encompasses a broad array of best practices, from optimizing container images to securing and organizing our cluster using namespaces, labels, and RBAC. These practices not only streamline our operations but also bolster the security and manageability of our environment (a namespace and RBAC sketch also follows this list).
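
To make step 2 concrete, here is a minimal, hypothetical kubeadm configuration; the version, endpoint, and pod subnet are placeholders that depend on the target environment:

```yaml
# kubeadm-config.yaml -- illustrative only; apply with:
#   kubeadm init --config kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0
controlPlaneEndpoint: "k8s-master.example.com:6443"
networking:
  podSubnet: "10.244.0.0/16"   # must match the CNI plugin's configuration
```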
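
And for step 7, a minimal sketch of scoping workloads with a namespace plus least-privilege RBAC; all names here are hypothetical:

```yaml
# Illustrative namespace and Role for running ML jobs.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-jobs
  labels:
    team: ml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-runner
  namespace: ml-jobs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: job-runner-binding
  namespace: ml-jobs
subjects:
  - kind: ServiceAccount
    name: ml-runner
    namespace: ml-jobs
roleRef:
  kind: Role
  name: job-runner
  apiGroup: rbac.authorization.k8s.io
```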

What solution did we use for Clusterone?

The goal was to install a Kubernetes cluster on shared hardware to run the Clusterone solution. The shared servers came from different manufacturers and were equipped with various models of Nvidia cards, each requiring a different driver version. Access to the infrastructure was possible only through a single SSH port redirected on the firewall.
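
The write-up doesn't specify how the installer worked through that single port; one common pattern, shown here purely as an assumption, is an SSH ProxyJump set in the inventory's group variables (the bastion address and port are placeholders):

```yaml
# Hypothetical group_vars/all.yml: route every connection through the one
# forwarded SSH port on the firewall.
ansible_ssh_common_args: >-
  -o ProxyJump=ansible@firewall.example.com:2222
  -o StrictHostKeyChecking=accept-new
```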

We implemented the ability to run ML tasks on the Kubernetes cluster at scheduled times, with the expected result of using all available infrastructure on the client's side. The solution met the customer's expectations: ML tasks are automatically distributed across the cluster, allowing the use of all available hardware (an illustrative scheduled job is sketched below).
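
The case study doesn't show the actual job definitions; a scheduled GPU task could plausibly be expressed as a Kubernetes CronJob like the following, with the image, schedule, and namespace being hypothetical:

```yaml
# Illustrative CronJob for a nightly ML task on a GPU node.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-training
  namespace: ml-jobs
spec:
  schedule: "0 2 * * *"               # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: train
              image: registry.example.com/ml/train:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1   # exposed by the Nvidia device plugin
```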

Technologies

Ceph

We have successfully run Ceph in a variety of deployments, both for virtualization and within Kubernetes clusters. Our expertise includes Ceph optimization, specifically OSD configuration and type, journaling, and disk selection based on physical disk parameters. We have also tested MooseFS and LizardFS implementations; however, neither has built-in support for acting as persistent volumes in Kubernetes clusters.

Setting up configuration and infrastructure

Rather than using Foreman for production, we have drawn on our extensive knowledge of Ansible; the combination of Foreman and the forklift tool aligns well with that proficiency. Alongside Ansible, we use Terraform and Packer on a regular basis.

Why Ansible?

  1. Container Build and Management: Ansible streamlines the container build process, offering an alternative to relying solely on Dockerfiles for container image creation. By using Ansible, teams can utilize more expressive and maintainable playbooks for building container images, leveraging tools like Buildah and Ansible-Bender. This approach enhances the clarity and maintainability of container builds, moving beyond the limitations of Dockerfiles.
  2. Cluster Management: Kubernetes clusters, whether self-managed or provided as a service, require meticulous setup and ongoing management. Ansible excels in orchestrating these multi-server applications, managing upgrades, integrations, and the entire lifecycle of Kubernetes clusters. It supports various modules for interacting with cloud services (e.g., Azure, AWS, and Google Cloud), making it easier to manage clusters across different environments. Projects like Kubespray leverage Ansible for custom Kubernetes cluster builds, demonstrating its versatility and compatibility with diverse infrastructure arrangements.
  3. Application Lifecycle Management: Ansible further proves its utility in Kubernetes environments through its role in application lifecycle management. It can be used to develop Kubernetes operators with the Operator SDK, allowing for the encapsulation of complex operational logic into reusable, automated components. This enables precise control over deployment, upgrades, and management tasks within Kubernetes clusters, all while utilizing Ansible's extensive module library and straightforward YAML syntax for playbook creation. A minimal playbook sketch follows this list.
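
As a taste of what driving Kubernetes from Ansible looks like, here is a minimal playbook using the kubernetes.core collection; the namespace name is illustrative, and the control node needs the collection and the Python kubernetes client installed:

```yaml
# Illustrative playbook: ensure a namespace exists via kubernetes.core.k8s.
- name: Manage Kubernetes objects from Ansible
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Ensure the ml-jobs namespace is present
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: ml-jobs
```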

Outcome

To address the challenge of deploying the Clusterone solution on a Kubernetes cluster across shared hardware with varying Nvidia GPU models, a comprehensive Ansible inventory was created. This inventory facilitated server access and documented each machine's hardware configuration in detail. The solution involved standardizing OS versions, installing the necessary software, and setting up containerization and network support. A Kubernetes cluster was established, integrating all servers as nodes and using Ceph for distributed disk storage. Persistent volumes backed by Ceph were configured for Kubernetes, and GPU-equipped servers received the correct drivers for ML job execution. This implementation met the client's expectations, enabling efficient ML task distribution and utilization of the entire infrastructure.
