Exploring the Evolution of Gen AI Infrastructure at Amazon AWS

By AWS Events · 2024-02-26

Join us for an in-depth look at the development of gen AI infrastructure at Amazon AWS, from its early stages to the latest innovations. Learn about the key decisions, customer needs, and hardware progression that have shaped this cutting-edge technology.

Behind the Scenes of Gen AI Infrastructure at Amazon

  • The discussion revolves around the development of gen AI infrastructure at AWS, focusing on the history, hardware progression, customer needs, and key decisions taken by the Annapurna Labs team within AWS.

  • Annapurna Labs, a team within AWS, is responsible for building purpose-built chips such as Nitro, Graviton, and Inferentia. The team emphasizes the importance of understanding the software layers and ecosystem that complement its hardware-centric products.

  • The discussion delves into the key principles followed by Annapurna Labs, including portability, reusability, ease of use, and cost-effectiveness, to address the future needs of customers.

  • The team conceived the idea of building machine learning chips in 2017, based on growing demand from businesses within Amazon and from external customers. They identified optimization, performance, and integration as the main opportunities to assist customers.

  • The design process involved setting up a new team within Annapurna Labs, akin to a startup within a startup. The team faced challenges including a shortage of chip, compiler, and application engineers, and minimal prior ML experience.

  • The development and deployment of Inf1 servers within AWS data centers involved meticulous manufacturing, testing, and assembly processes, leading to the deployment of Inf1 servers at scale across 23 AWS regions worldwide.

  • The discussion touches upon the configurations and deployment of Inf1 and Trn1 servers, emphasizing the complexity and scale of the infrastructure put in place to support machine learning workloads.

Key Points About Trainium and Inferentia Chips

  • Trainium and Inferentia chips are the key components inside the servers, with each server containing multiple chips for high-performance computing.

  • The NeuronLink technology in the servers interconnects all the chips in a 2D torus configuration, ensuring a maximum of two hops between any pair of chips for minimal latency and high performance.

  • Trainium chips are optimized for training workloads, while Inferentia2 chips provide 3x the performance of Inferentia1, making them ideal for inference workloads.

  • The Trainium chip features advanced packaging technology and large HBM (High Bandwidth Memory) capacity to maximize compute performance and memory bandwidth, while the Inferentia2 chip is designed for inference workloads with lower compute and higher memory bandwidth.

  • Each Trainium chip has two NeuronCores, each with a tensor engine for compute operations, large on-chip SRAM to minimize data movement, and 16 general-purpose SIMD cores for programmable operations.

  • The servers built with Trainium and Inferentia2 chips offer significant cost benefits for training and deploying models: up to 50% lower cost for training on Trainium, and almost 5x lower cost for deploying models on Inferentia2.

  • The Neuron SDK, which supports open-source models and frameworks like PyTorch and TensorFlow, provides a user-friendly approach to training and deploying models on Trainium and Inferentia2 chips, as the sketch below illustrates.
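
As an illustration of that flow, here is a minimal sketch of compiling a PyTorch model with the Neuron SDK's torch-neuronx package on a Trn1 or Inf2 instance. The ResNet-50 model is a stand-in example, not one discussed in the talk:

```python
import torch
import torch_neuronx
from torchvision import models  # stand-in model for illustration

# Load a pretrained model and switch it to inference mode
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Compile for Neuron hardware with an example input; torch_neuronx.trace
# runs the Neuron compiler and returns a TorchScript module
example = torch.rand(1, 3, 224, 224)
neuron_model = torch_neuronx.trace(model, example)

# Save the compiled artifact for deployment, then run inference as usual
torch.jit.save(neuron_model, "resnet50_neuron.pt")
output = neuron_model(example)
```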

New Feature: Neuron Kernel Interfaces

  • To make models more performant and efficient on the hardware, the compiler maps the inner loops onto the hardware, lowering them to a representation the chip can execute, then applies scheduling and allocation to parallelize the computation for increased efficiency.

  • A new feature called the Neuron Kernel Interface (NKI) is being introduced, allowing developers to write their own high-performance kernels on top of Trainium. This bare-metal interface bypasses most of the compiler steps, enabling direct performance optimization on Trainium.

  • NKI uses the same APIs as Triton from OpenAI, making it familiar to those who have experience with Triton. It provides the flexibility to develop innovative solutions and optimize performance on Trainium.

  • The example of optimizing softmax performance on top of Trainium using NKI reflects its internal usage, and detailed documentation of the hardware will soon be available to customers. This feature will empower customers to invent new models with enhanced performance on Trainium; a minimal kernel sketch appears below.
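
To give a feel for the interface, here is a minimal NKI kernel sketch based on the publicly documented neuronxcc.nki package. The element-wise add (rather than the softmax example above) and the tile shapes are illustrative assumptions:

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def nki_tensor_add_kernel(a_input, b_input):
    """Element-wise add written directly against the hardware.

    Assumes the inputs fit in a single tile (partition dim <= 128).
    """
    # Allocate the output tensor in device memory (HBM)
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Load inputs from HBM into on-chip memory (SBUF)
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # Compute on-chip and store the result back to HBM
    nl.store(c_output, value=a_tile + b_tile)
    return c_output

# Hypothetical invocation on a Trn1/Inf2 instance with torch-neuronx installed:
#   import torch, torch_xla.core.xla_model as xm
#   a = torch.rand(128, 512, device=xm.xla_device())
#   b = torch.rand(128, 512, device=xm.xla_device())
#   c = nki_tensor_add_kernel(a, b)
```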

The Future of AI and Data Analytics

  • The development of an intelligence platform aims to make data analytics more accessible to non-experts, allowing users to query data in natural language and eliminating the need to write SQL.

  • Generative AI is expected to revolutionize the way humans interact with computers, making tasks more efficient rather than replacing human roles, especially in regulated industries where control over the model and data is crucial.

  • The rise of generative AI is changing the future of work, with 75% of CEOs seeing it as a competitive advantage, particularly on the operational side, with a potential impact on product development.

  • The scalability and cost-effectiveness of AI solutions are key factors driving the adoption of generative AI, enabling enterprises to customize models for their specific use cases with lower costs.

  • The platform also caters to domain-specific tuning, providing the necessary control over models, which is particularly important in regulated industries like finance and healthcare.

  • Furthermore, the use of multi-cloud and multi-hardware platforms is aimed at delivering faster, cheaper, and more flexible solutions, aligning with the AWS objective of delivering more for less.

Introduction to Leonardo and Inf1

  • Leonardo is a generative AI company focused on generative visual assets such as images, video, and textures for 3D models. It has millions of users and hundreds of thousands of community models trained on the platform, producing millions of images daily.

  • The company uses specialized text-to-image models such as PhotoReal to generate striking images easily, along with a pipeline called Alchemy for enhancing images with proprietary refinement methods. Rapidly growing demand for the service led them to explore new hardware options for acceleration.

  • In their search for suitable hardware, Leonardo found Inf1 to have an appealing price-performance profile. With a strong commitment to innovation, they integrated Inf1 into their architecture, achieving significant gains in image-generation speed and cost reduction without compromising quality.

  • The architecture integration involved routing job requests to real-time inference endpoints running on Amazon SageMaker (see the sketch after this list). This resulted in a remarkable 80% lower cost for processing a job on Inf1 instances compared to their existing fleet, while maintaining image quality and job time.

  • To further expand their use of AI technology, Leonardo plans to get popular models up and running on Inf2, experiment with data parallelism, and transition from SageMaker to Amazon ECS and EC2. They aim to leverage this transformative technology and improve its accessibility for their users.
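
As a rough illustration of that routing pattern, here is a minimal sketch of invoking a SageMaker real-time inference endpoint with boto3. The endpoint name and payload schema are hypothetical stand-ins; Leonardo's actual setup is not public:

```python
import json
import boto3

# Hypothetical endpoint name for an Inf1-backed real-time endpoint
ENDPOINT_NAME = "image-generation-inf1"

runtime = boto3.client("sagemaker-runtime")

# Route a job request to the real-time inference endpoint
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps({"prompt": "a watercolor fox in a forest", "steps": 30}),
)

# Parse the generated-image metadata returned by the endpoint
result = json.loads(response["Body"].read())
```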

Conclusion

The development of gen AI infrastructure at Amazon AWS is revolutionizing the future of data analytics, machine learning, and generative AI. With a focus on cost-effectiveness, performance optimization, and future scalability, AWS is paving the way for innovative solutions in the field of AI.

Tags: gen AI infrastructure, Amazon AWS, AI chips, Trainium, Inferentia, Neuron Kernel Interfaces, data analytics, machine learning, AWS re:Invent 2023