Optimizing AI/ML Workloads with Google Cloud Storage in Vertex AI
This demonstration explores the crucial role of storage solutions within AI/ML workloads, specifically within the Vertex AI ecosystem. Managing large-scale datasets and AI models presents significant challenges. This article illustrates how Google Cloud Storage (GCS) efficiently addresses these complexities, particularly in the context of PaliGemma models. We'll show how to optimize the management of the necessary datasets, enabling efficient model training and deployment.
PaliGemma Model Demonstration
First, we showcase the capabilities of the PaliGemma model as it exists in the Google Cloud Model Garden, prior to any custom fine-tuning. This serves as a baseline. We upload an image of Machu Picchu and use the visual question answering mode to ask the model to describe the image as a comedian would to a child. The output is a simplified yet accurate description of the ancient ruins, highlighting PaliGemma's ability to quickly analyze visual data and produce a relevant initial response that can be further customized.
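As a minimal sketch, here is how such a visual question answering request might be sent to a PaliGemma endpoint deployed from Model Garden, using the Vertex AI SDK. The project, endpoint ID, and the instance schema (base64-encoded image plus a text prompt) are assumptions; check the request format your deployment expects.

```python
# Hedged sketch: query a deployed PaliGemma endpoint with the Vertex AI SDK.
import base64

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Hypothetical endpoint ID for the Model Garden deployment.
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)

with open("machu_picchu.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = endpoint.predict(
    instances=[{
        # Assumed instance schema: prompt text plus base64 image bytes.
        "prompt": "Describe this image as a comedian would to a child.",
        "image": image_b64,
    }]
)
print(response.predictions[0])
```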
Next, we run the fine-tuned PaliGemma model on the same Machu Picchu image. The enhanced custom model delivers a much richer and more detailed description, significantly exceeding the baseline's terse “ancient ruins.”
Demo Architecture: From Data Ingestion to Archiving
The demo architecture outlines a process encompassing data ingestion, preparation, model training and validation, serving, and archiving.
Data Ingestion
The first step uses a Vertex AI Colab Enterprise notebook to automate the data transfer from AWS S3 to Google Cloud Storage with the Storage Transfer Service.
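A minimal sketch of what that notebook cell might look like, using the Storage Transfer Service client library. The bucket names are placeholders, and the AWS credential values shown are assumptions; in practice you would pull them from a secret manager rather than hard-coding them.

```python
# Hedged sketch: create and run an S3-to-GCS transfer job.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": "my-project",
    "description": "Ingest training images from S3",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "transfer_spec": {
        "aws_s3_data_source": {
            "bucket_name": "my-s3-training-data",  # placeholder source bucket
            "aws_access_key": {
                "access_key_id": "AWS_ACCESS_KEY_ID",          # placeholder
                "secret_access_key": "AWS_SECRET_ACCESS_KEY",  # placeholder
            },
        },
        "gcs_data_sink": {"bucket_name": "my-gcs-training-data"},  # placeholder sink
    },
}

job = client.create_transfer_job({"transfer_job": transfer_job})
client.run_transfer_job({"job_name": job.name, "project_id": "my-project"})
```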
Data Preparation
After data ingestion, we access the dataset from Google Cloud Storage (GCS) and use observability metrics to track data access patterns. This includes monitoring the data transfer from AWS into GCS, observing the unpacking of images and writing of processed data back to GCS, and finally tracking the dataset split within the GCS bucket.
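The dataset split itself can be done entirely inside the bucket with server-side copies. A minimal sketch follows; the bucket name, prefixes, and the 90/10 split ratio are assumptions for illustration.

```python
# Hedged sketch: split processed objects into train/ and val/ prefixes in GCS.
import random

from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-gcs-training-data")  # placeholder bucket

blobs = list(client.list_blobs("my-gcs-training-data", prefix="processed/"))
random.shuffle(blobs)
split = int(0.9 * len(blobs))  # assumed 90/10 train/validation split

for i, blob in enumerate(blobs):
    target_prefix = "train/" if i < split else "val/"
    # Server-side copy into the split prefix; no data leaves GCS.
    bucket.copy_blob(blob, bucket, target_prefix + blob.name.split("/")[-1])
```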
Model Training and Validation
We copy the PaliGemma model to GCS and create a Filestore instance for checkpoints, so training progress is saved and training can resume if interrupted. To prepare for training, we configure the VM's resources and mount the file system, giving the training workload its data access and storage connections.
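A minimal sketch of creating that checkpoint Filestore instance with the Filestore client library. The instance name, tier, capacity, zone, and network are assumptions; adjust them to your project's VPC and throughput needs. Once the operation completes, the training VM mounts the share over NFS (e.g. at /mnt/checkpoints).

```python
# Hedged sketch: create a Filestore instance for training checkpoints.
from google.cloud import filestore_v1

client = filestore_v1.CloudFilestoreManagerClient()

instance = filestore_v1.Instance(
    tier=filestore_v1.Instance.Tier.BASIC_SSD,  # assumed tier
    file_shares=[
        # 2560 GB is the Basic SSD minimum capacity.
        filestore_v1.FileShareConfig(name="checkpoints", capacity_gb=2560)
    ],
    networks=[filestore_v1.NetworkConfig(network="default")],  # assumed VPC
)

operation = client.create_instance(
    parent="projects/my-project/locations/us-central1-a",  # assumed zone
    instance_id="training-checkpoints",
    instance=instance,
)
operation.result()  # Block until the share is ready to mount via NFS.
```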
During training, we gain real-time visibility into training workload performance by monitoring VM metrics: network spikes correlate with checkpoint writes to Filestore, while GPU and memory utilization peaks align with training cycles.
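Those dashboard signals can also be pulled programmatically. Below is a minimal sketch querying one of them (network bytes sent, which spikes during checkpoint writes) with the Cloud Monitoring API; the project and instance name are assumptions.

```python
# Hedged sketch: read the training VM's sent-bytes metric for the last hour.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project
        "filter": (
            'metric.type="compute.googleapis.com/instance/network/sent_bytes_count" '
            'AND metric.labels.instance_name="training-vm"'  # assumed VM name
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)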
Monitoring dashboards show training behavior from VM boot, through data caching with Cloud Storage FUSE, to checkpoint writes to Filestore. The demonstration then restarts the training job with a cleared VM cache to observe the impact: training data now loads from the Cloud Storage FUSE cache rather than directly from GCS, making the cache's performance benefit visible.
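The effect is easy to reproduce by hand. A minimal sketch, assuming the bucket is mounted with Cloud Storage FUSE and its file cache enabled (e.g. via `gcsfuse --file-cache-max-size-mb ...`): the first read of a file is served from GCS, the repeat read from the local cache. The mount point and file path are hypothetical.

```python
# Hedged sketch: time a cold read vs. a cached read through a GCS FUSE mount.
import time

def timed_read(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start

# Hypothetical path on the Cloud Storage FUSE mount.
sample = "/mnt/gcs/train/image_00001.jpg"
print(f"cold read:   {timed_read(sample):.3f}s")  # served from GCS
print(f"cached read: {timed_read(sample):.3f}s")  # served from the local file cache
```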
Serving
After training, we select the checkpoint with the lowest loss for serving and copy it from Filestore to Google Cloud Storage. To prepare for serving, we configure the serving GCS bucket and enable Anywhere Cache to optimize read performance. We then configure the GKE cluster and record the model's access point for later serving requests.
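A minimal sketch of the checkpoint selection and copy step. The checkpoint layout (one directory per step containing a metrics.json that records the loss) is an assumption about the training script, as are the mount path and bucket names.

```python
# Hedged sketch: pick the lowest-loss checkpoint and upload it to the serving bucket.
import json
import pathlib

from google.cloud import storage

CKPT_ROOT = pathlib.Path("/mnt/checkpoints")  # assumed Filestore NFS mount

def loss_of(ckpt_dir: pathlib.Path) -> float:
    # Assumed layout: each checkpoint dir holds a metrics.json with a "loss" key.
    return json.loads((ckpt_dir / "metrics.json").read_text())["loss"]

best = min((d for d in CKPT_ROOT.iterdir() if d.is_dir()), key=loss_of)

bucket = storage.Client(project="my-project").bucket("my-serving-bucket")
for f in best.rglob("*"):
    if f.is_file():
        # Preserve the checkpoint's internal layout under a serving prefix.
        bucket.blob(f"paligemma-ft/{f.relative_to(best)}").upload_from_filename(str(f))
```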
We demonstrate the enhanced image descriptions generated by the fine-tuned PaliGemma model. These results are notably richer than those produced by the initial baseline.
To further test the trained model, we use images of additional historical landmarks: Mount Rushmore National Memorial, the Taj Mahal, the Colosseum, and Easter Island.
To demonstrate the advantages of GCS Anywhere Cache, we expand the GKE cluster and show how serving reads are served directly from Anywhere Cache. **This significantly reduces latency and improves the overall user experience.**
Archiving
To conclude, we use the Storage Transfer Service to move the training checkpoints from the Filestore instance to Google Cloud Storage for archiving. **This ensures long-term data retention and availability for future use.**
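Storage Transfer Service reads POSIX file systems, such as a Filestore mount, through a transfer agent pool. A minimal sketch follows; the agent pool name, mount path, and archive bucket are assumptions.

```python
# Hedged sketch: archive Filestore checkpoints to GCS via Storage Transfer Service.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

job = client.create_transfer_job({
    "transfer_job": {
        "project_id": "my-project",
        "description": "Archive training checkpoints",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "transfer_spec": {
            # Assumed agent pool with agents that can see the Filestore mount.
            "source_agent_pool_name": "projects/my-project/agentPools/filestore-pool",
            "posix_data_source": {"root_directory": "/mnt/checkpoints"},
            "gcs_data_sink": {"bucket_name": "my-checkpoint-archive"},  # placeholder
        },
    }
})
client.run_transfer_job({"job_name": job.name, "project_id": "my-project"})
```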
Conclusion
This demonstration showcased the power of Google Cloud for AI/ML workloads, combining Vertex AI with robust storage solutions. We highly recommend delving deeper into the features and products demonstrated here to unlock their full potential. **Leveraging Google Cloud Storage effectively streamlines the entire AI/ML lifecycle, from data ingestion through model training and serving to archiving, leading to improved performance, scalability, and cost efficiency.**