Moving Beyond GenAI APIs: How SkyPilot Kickstarted the ML Infra Behind Our AI-Native Game
Our early prototypes of Retail Mage’s AI-driven interactions racked up hundreds of dollars in API costs in a single playtest session. Costs like these have led many studios pursuing similar ideas to abandon their experiments entirely.
Initially, cloud-based LLM APIs (such as OpenAI’s GPT models) offered flexibility, but they came with significant drawbacks:
Excessive inference costs made iterative experimentation financially unsustainable.
Limited model availability restricted testing different architectures under consistent gameplay conditions.
Context-length constraints in fine-tuned GPT-3 models limited deeper and more systematic gameplay evaluation.
To thoroughly explore our research, we needed:
Evaluation of multiple AI methodologies (planning, dialogue generation, state management).
Cost assessment across various base models and fine-tuning methods (e.g., 7B vs. 12B vs. 24B parameters, bfloat16 vs. quantized weights, full fine-tuning vs. QLoRA).
Determination of whether fine-tuned open-weight models could match or surpass the game experience quality of proprietary APIs.
This inevitably led us to self-hosted open-weight models, introducing new challenges in securing affordable and reliable GPU infrastructure across multiple cloud providers while leveraging available startup credits.
What is SkyPilot?
SkyPilot is an open-source framework designed to streamline the deployment and management of AI and batch workloads across multiple cloud providers and on-premises clusters. It abstracts away infrastructure complexity behind a unified, intuitive interface, allowing teams to:
Rapidly Provision Compute Resources: Launch clusters quickly across clouds without extensive provider-specific configurations.
Infrastructure as Code: Define and manage environments and workloads through portable YAML specifications.
Efficient Job Management: Automatically schedule, execute, and recover workloads, reducing manual oversight.
Fault Tolerance: Automatically handle interruptions from spot-instance preemptions by checkpointing workloads and resuming them seamlessly.
Resource Optimization: Efficiently manage autoscaling, GPU allocation, and network configuration to minimize developer overhead.
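In practice, a task is described in a short YAML file. A minimal sketch of what one can look like (the name, GPU spec, and commands here are illustrative, not our actual setup):

name: hello-gpu

resources:
  accelerators: A100:1      # the hardware the task needs

setup: |
  # Runs once when the machine is provisioned
  pip install -r requirements.txt

run: |
  # The workload itself
  python train.py

Launching it is then a single command, e.g. sky launch task.yaml, and SkyPilot handles provisioning, environment setup, and file syncing from there.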
For our team, SkyPilot became the unified control plane that eliminated the cognitive overhead of managing diverse cloud environments, helping us maximize our startup credits and accelerate our research velocity.
Managing Multi-Cloud AI Workloads with SkyPilot
Previously, managing multi-cloud ML training required manual effort and days of configuration to spin up GPU clusters and move workloads between providers. I have been there myself—digging through cloud documentation to provision the right GPU instance and manually configuring the matching CUDA and Torch runtime.
With SkyPilot, we transformed this into a seamless, one-command process:
💡 Example:
sky jobs launch job.yaml -n train-job --use-spot
Driven by a SkyPilot YAML configuration like the one sketched after this list, this single command automatically handles:
Selecting the best cloud provider based on availability and cost
Spinning up instances
Syncing our codebase
Launching the training job
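The job.yaml behind that command looks roughly like the sketch below; the GPU count, script names, and paths are placeholders rather than our exact configuration:

name: train-job

resources:
  accelerators: A100:8    # no cloud pinned, so SkyPilot picks the cheapest available provider
  use_spot: true          # equivalent to passing --use-spot on the command line

workdir: .                # the local codebase, synced to the remote cluster

setup: |
  pip install -r requirements.txt

run: |
  python finetune.py --config configs/experiment.yaml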
By codifying our workflow in SkyPilot YAMLs, we could run multiple parallel training jobs consistently, significantly reducing iteration times.
Achieving 80% Cost Savings with Spot Instances
High-end GPUs like A100s and H100s are costly and hard to secure, especially in early 2024 and especially for an early-stage startup. Spot instances offer substantial cost savings but come with the risk of unexpected preemptions.
SkyPilot’s fault-tolerant automation handles these interruptions gracefully by:
Automatically resuming interrupted jobs on new instances
Integrating seamlessly with PyTorch checkpointing, ensuring training continues from the last saved state
Using multi-cloud fallback to instantly switch to another provider if GPU capacity is unavailable
💡 Example: We successfully ran multiple long-running, repeatedly interrupted model fine-tuning jobs that would have cost over $10,000 with on-demand GPUs — for just a fraction of the cost.
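A sketch of how such a preemption-tolerant job can be wired up: checkpoints go to a mounted cloud bucket, and the training script looks for the latest checkpoint on every (re)start. The bucket name, paths, and script are illustrative assumptions, not our exact setup:

resources:
  accelerators: A100:8
  use_spot: true

file_mounts:
  /checkpoints:
    name: my-training-checkpoints   # hypothetical bucket; its contents persist across preemptions
    store: s3
    mode: MOUNT

run: |
  # finetune.py is assumed to save to, and resume from, /checkpoints on startup
  python finetune.py --checkpoint-dir /checkpoints

When a spot instance is reclaimed, the managed-jobs controller provisions replacement capacity (on the same or a different cloud) and reruns the task, which then resumes from the last checkpoint in the bucket instead of starting over.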
Maximizing Cloud Credits Across AWS, GCP, and Azure
As a startup, cloud credits are a significant asset, but managing them across multiple providers—with different APIs, rules, and systems—can be challenging.
SkyPilot's Unified Multi-Cloud Workflow
SkyPilot enabled us to effortlessly:
Run workloads across AWS, GCP, and Azure through a single interface
Exhaust one provider’s credits before switching to another
Optimize GPU usage and allocation across clouds
Since our training pipeline was fully defined in SkyPilot YAML configurations, switching between cloud providers meant changing only a single line.
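For instance, pinning a job to a specific provider is one field in the resources block; a sketch of that one-line change (omit the field to let SkyPilot choose automatically):

resources:
  accelerators: A100:8
  cloud: gcp    # flip this single line between aws / gcp / azure as credits run down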
Takeaways: Engineering Insights
Several important lessons emerged from our experience. First, plan carefully for inter-cloud data transfer costs: moving large datasets between clouds can eat into much of the expected savings.
Adopting MLOps best practices and emphasizing reproducible ML experimentation was crucial. Specifically, we recommend:
Standardizing job definitions to ensure portability across cloud environments without provider-specific customizations.
Establishing infrastructure for rapid A/B testing of models directly within applications to accelerate experimentation.
Developing systematic evaluation frameworks to consistently compare models against defined gameplay metrics.
Additionally, we found it better to start with the best available GPU resources at full precision, rather than reaching for quantization or reduced-precision methods from the outset.
Finally, to maintain reproducibility and efficient experiment tracking, avoid directly editing YAML job files. Instead, create new experiment folders by copying job configurations and codebases.
Beyond Cost Savings: Enabling Systematic Gameplay Innovation
The most significant impact wasn't financial but methodological. By removing GPU constraints, we established a systematic approach to AI experimentation that transformed how we evaluated whether GenAI could truly enable new forms of gameplay:
We could methodically test different approaches to NPC behavior, dynamic narratives, and player interaction models
Different model architectures could be benchmarked in real gameplay scenarios
We could evaluate whether fine-tuning improved specific gameplay elements or if base models were sufficient
This infrastructure approach didn't just save money—it gave us the empirical foundation to answer our fundamental question about whether GenAI enables genuinely new forms of gameplay. For Retail Mage, the evidence points to yes—but only because we could systematically evaluate different approaches without being constrained by API costs or single-cloud GPU availability.
While SkyPilot isn't the only solution for multi-cloud ML orchestration, it is a high-quality product built by people with deep experience building LLMs and tackling real-world engineering challenges. It gave us the flexibility we needed for rapid game-development iteration. Without it, we couldn’t have learned what we needed to in order to ship Retail Mage.
By solving the GPU accessibility puzzle, we could focus on what truly mattered: determining which AI approaches created the most compelling new gameplay experiences. SkyPilot didn’t just unlock GPUs—it unlocked our ability to learn, experiment, and iterate. We’ll be sharing more of what we learned in future posts.
This is part of a series of articles exploring the technical and design challenges we encountered building Retail Mage and other AI-driven games at Jam & Tea Studios. If you're interested in these topics or exploring our INFUSE platform, you can find us at jamandtea.studio. You can find Retail Mage on Steam.