Reliability, Scalability, and Performance: The Core Principles of Akamai's Content Delivery Network

Understanding the history and evolution behind a particular innovation in the Internet will give you deeper insight into how you can leverage it in your system architecture designs. CDNs are often discussed as third-party services that you can plug and play into your architecture, such as AWS CloudFront, Fastly, Cloudflare, and Google’s Cloud CDN. However, a CDN is a complex distributed system built on top of the Internet, leveraging specialized hardware and software abstractions to provide the best possible delivery of content close to users worldwide.

Daniel Lewin and Tom Leighton founded Akamai and introduced the concept of a Content Delivery Network (CDN) in the late 1990s. Originally, CDNs were designed to improve website performance by caching static web assets close to users at the edge of the Internet. CDNs have since grown to enable much more than basic web caching, such as edge computing, video streaming, analytics, authoritative DNS services, and cybersecurity. The paper "The Akamai Network: A Platform for High-Performance Internet Applications" dives deep into the design, architecture, and principles behind geographically dispersed distributed systems for content delivery.

System Design Principles

A CDN is a virtual network built as a software layer abstraction over the underlying Internet infrastructure, which provides no guarantees of end-to-end reliability or performance.

The Internet comprises a global network of interconnected computer networks that exchange data via standardized communication protocols such as TCP/IP, DNS, and HTTP/S. It consists of hardware servers and devices owned and operated by organizations such as Internet Service Providers (ISPs), cloud service providers, government agencies, and universities. Changes to the underlying infrastructure and protocols of the Internet are slow. For example, IPv6 was specified in the mid-1990s but did not start to see meaningful adoption until around 2007.

When designing a content delivery network on top of the Internet, you cannot assume any changes to the underlying hardware. You must rely on a software layer that can be easily modified and adapted to the changing requirements of your business and customers.

"The Akamai network is a very large distributed system consisting of tens of thousands of globally deployed servers that run sophisticated algorithms to enable the delivery of highly scalable distributed applications. We can think of it as being comprised of multiple delivery networks, each tailored to a different type of content—for example, static web content, streaming media, or dynamic applications."

The Akamai Network: A Platform for High-Performance Internet Applications, page 4.

The Internet is inherently unreliable. It is vulnerable to multiple failures or disruptions, such as network congestion, hardware failures, software bugs, cyber-attacks, natural disasters, and human error. When working with distributed systems, you assume that things will fail, and you design your system and components to respond to and recover automatically from these faults.

Akamai built a virtual network over the existing Internet as a software layer that required no changes to the underlying network. Software can be easily modified and adapted to future requirements as the needs of the Internet and its users change. Akamai built multiple delivery networks targeted at specific types of content, such as static web content, streaming media, or dynamic web applications. Each type of online application or content that CDNs power requires different hardware and software components. By building multiple types of delivery networks, Akamai could serve various use cases while applying the same high-level design and system design principles to each network.

When designing a distributed system for content delivery, you design for reliability, scalability, and performance while limiting human intervention by automating the configuration and management of a large network of servers.

Reliability, scalability, and performance are essential nonfunctional requirements for designing a distributed system. In Chapter 1 of Designing Data-Intensive Applications (Kleppmann, 2017), the author discusses that data-intensive applications and data systems should be reliable, scalable, and maintainable, with performance being a part of scalability. These are core pillars of any system you design, and you need to keep these requirements in mind as you take functional and business requirements from stakeholders.

As a content delivery provider to thousands of customers, Akamai aimed to design a reliable system that delivers 100% end-to-end availability. They started from the fundamental assumption that components throughout the system would fail frequently and in unpredictable ways. As a result, they built in multiple levels of fault tolerance to ensure that each component had full redundancy, with no single point of failure that could bring the entire system to a halt. They achieved this in part by using decentralized leader election with PAXOS, a distributed consensus algorithm for reaching agreement among nodes without a central coordinator.
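
The paper credits decentralized leader election via PAXOS but does not walk through the protocol, so the following is a minimal, in-memory sketch of single-decree Paxos applied to choosing a leader. The acceptor count, proposal number, and candidate name are invented for illustration; this is not Akamai's implementation.

    # A minimal, in-memory sketch of single-decree Paxos used to pick a leader.
    # This illustrates the general idea only; the acceptor count, proposal number,
    # and candidate name are invented and this is not Akamai's implementation.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple


    @dataclass
    class Acceptor:
        """Remembers the highest proposal number promised and any accepted proposal."""
        promised_n: int = -1
        accepted_n: int = -1
        accepted_value: Optional[str] = None

        def prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
            # Phase 1b: promise to ignore proposals numbered below n.
            if n > self.promised_n:
                self.promised_n = n
                return True, self.accepted_n, self.accepted_value
            return False, self.accepted_n, self.accepted_value

        def accept(self, n: int, value: str) -> bool:
            # Phase 2b: accept unless a higher-numbered promise has been made.
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n = n
                self.accepted_value = value
                return True
            return False


    def propose_leader(acceptors: List[Acceptor], proposal_n: int, candidate: str) -> Optional[str]:
        """Run one Paxos round; the chosen value is the elected leader's id."""
        quorum = len(acceptors) // 2 + 1

        # Phase 1: gather promises from a majority of acceptors.
        promises = []
        for acceptor in acceptors:
            promised, accepted_n, accepted_value = acceptor.prepare(proposal_n)
            if promised:
                promises.append((accepted_n, accepted_value))
        if len(promises) < quorum:
            return None  # no majority; a proposer would retry with a higher number

        # If any acceptor already accepted a value, adopt the most recently accepted one.
        already_accepted = [(n, v) for n, v in promises if v is not None]
        if already_accepted:
            candidate = max(already_accepted)[1]

        # Phase 2: ask the acceptors to accept the (possibly adopted) candidate.
        accepted = sum(1 for acceptor in acceptors if acceptor.accept(proposal_n, candidate))
        return candidate if accepted >= quorum else None


    if __name__ == "__main__":
        cluster = [Acceptor() for _ in range(5)]
        print(propose_leader(cluster, proposal_n=1, candidate="node-17"))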

Scalability is the ability of a system to handle growth in load, content, and customers without a decrease in performance or stability. This is usually achieved by adding servers in proportion to the increased load or by optimizing existing resources to handle it. If your user base grows linearly but your hardware must grow exponentially to keep up, your system does not scale reliably and most likely can't sustain 10x growth.

In 2010, Akamai had 61,000 servers across nearly 1,000 networks in 70 countries. To serve hundreds of billions of Internet interactions daily, Akamai designed the communications and control system for populating status information, control messages, and configuration updates in a fault-tolerant and dynamically scalable way. In addition, a data collection and analysis system was built to collect and process data from tens of thousands of server and client logs, along with network and server information. This analytics system was used for monitoring, alerting, analytics, reporting, and billing.
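
The paper describes the requirements of the data collection system rather than its internals. As a rough sketch of one of those requirements, the snippet below shows how a node might buffer metrics locally and re-queue a failed batch so that transient connectivity loss does not drop data; the send_batch transport is a hypothetical stand-in, not an Akamai API.

    # A minimal sketch of reliable metric delivery from an edge node: metrics are
    # buffered locally and flushed in batches, and a failed flush keeps the batch
    # for the next attempt. send_batch is a hypothetical stand-in for the transport.

    import random
    import time
    from collections import deque


    def send_batch(batch):
        """Hypothetical network call; randomly fails to simulate flaky connectivity."""
        if random.random() < 0.3:
            raise ConnectionError("collection endpoint unreachable")
        print(f"delivered {len(batch)} metrics")


    class MetricBuffer:
        def __init__(self, max_buffered=10_000):
            self.queue = deque(maxlen=max_buffered)  # oldest metrics drop first if full

        def record(self, name, value):
            self.queue.append({"name": name, "value": value, "ts": time.time()})

        def flush(self, batch_size=100):
            """Try to deliver buffered metrics; re-queue the batch on failure."""
            while self.queue:
                batch = [self.queue.popleft() for _ in range(min(batch_size, len(self.queue)))]
                try:
                    send_batch(batch)
                except ConnectionError:
                    # Put the batch back at the front and stop; retry on the next flush.
                    self.queue.extendleft(reversed(batch))
                    break


    if __name__ == "__main__":
        buf = MetricBuffer()
        for i in range(250):
            buf.record("requests_served", i)
        buf.flush()
        buf.flush()  # a later flush retries anything that failed earlier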

The core proposition of using a CDN is to optimize the performance of your application. In the early days of the Internet, the World Wide Web was referred to as "World Wide Wait" to describe the slow and often frustrating experience of waiting for web pages to load over the Internet. Akamai's CDN technology improved the performance, reliability, and security of websites and web applications by reducing latency and improving the user experience.

Akamai built their own transport system for their delivery network, using a highly distributed set of edge servers as a high-performance overlay network on top of the Internet. Akamai improved path optimization for routing traffic by dynamically selecting intermediate nodes based on their performance data and knowledge of Internet topology, reducing the impact of packet loss on latency-sensitive applications running over TCP. They also added a proprietary transport-layer protocol to overcome the inefficiencies of TCP and HTTP for long-distance server-to-server communication. These are some of the many innovations Akamai introduced to improve end-to-end performance from the origin servers, where the content resides, to the last-mile servers and, ultimately, to a user's device.
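
The actual node-selection algorithms are proprietary, but the core idea can be sketched: given measured latencies, compare the direct origin-to-edge route with one-hop detours through candidate overlay nodes and pick the fastest. The node names and latency figures below are invented for illustration.

    # A toy sketch of overlay path selection: choose between the direct path and a
    # one-hop relay through an intermediate node, based on measured latencies.
    # The latency table is invented; a real system would feed this from
    # continuous network measurements.

    LATENCY_MS = {
        ("origin", "edge"): 180,       # congested direct route
        ("origin", "relay-1"): 60,
        ("relay-1", "edge"): 70,
        ("origin", "relay-2"): 90,
        ("relay-2", "edge"): 110,
    }


    def best_path(src, dst, relays):
        """Return the lowest-latency path among the direct route and one-hop relays."""
        candidates = {(src, dst): LATENCY_MS[(src, dst)]}
        for relay in relays:
            candidates[(src, relay, dst)] = (
                LATENCY_MS[(src, relay)] + LATENCY_MS[(relay, dst)]
            )
        path, latency = min(candidates.items(), key=lambda kv: kv[1])
        return path, latency


    if __name__ == "__main__":
        path, latency = best_path("origin", "edge", relays=["relay-1", "relay-2"])
        print(" -> ".join(path), f"({latency} ms)")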

System Architecture

  • Origin Server

    • The server that stores the original source content that users request, such as web pages, images, videos, or other files.

  • Storage System

    • A highly available storage system for static objects, large media libraries, and backup sites. This system is designed to have no single point of failure, with server clusters in multiple geographic locations.

  • Parent Clusters

    • Akamai designed a tiered distribution platform for customers with extensive libraries of infrequently accessed content. These parent clusters are well provisioned and have high connectivity to edge clusters. When an edge server does not have the content, it fetches it from a parent cluster instead of the origin server, significantly reducing the load on the origin server (a sketch of this lookup order follows this list).

  • Transport System

    • This system is responsible for moving data and content over the long-haul Internet with high reliability and performance from the origin servers to edge servers.

  • Edge Servers

    • Each edge server is part of a larger edge server platform, a global deployment of servers located around the world close to users. They are responsible for taking incoming requests from users and serving the content.

  • Monitoring Agents

    • Akamai has a global monitoring infrastructure, composed of agents deployed worldwide, to measure stream quality metrics from a user's standpoint. Each agent tests stream quality by continually playing streams. Results are fed into the analytics system.

  • DNS Servers

    • Akamai deploys a globally distributed system of highly available authoritative DNS servers that provide dynamic and static answers to DNS queries. Dynamic answers are based on Akamai's mapping decisions, which direct end users to the nearest available server for the requested content. Static authoritative answers, in contrast, provide DNS resolution for customer zones: specific domains or subdomains managed by Akamai's customers.

  • Management Portal

    • Allows enterprise customers to have fine-grained control over how their content and applications are delivered to users. Customers update configurations across the edge platform from this portal. They also gain visibility into how users interact with their applications and content through reports on user demographics and traffic metrics.

  • Data Collection and Analysis

    • This system ingests and processes data from server and client logs, along with network and server information. The data is then used for monitoring, alerting, analytics, reporting, and billing. Many servers are located in unmanned, third-party data centers that might go offline unexpectedly or experience poor connectivity at any moment, so metrics must be sent quickly, consistently, and reliably in the face of these conditions.
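
To make the tiered distribution described under Parent Clusters concrete, here is a minimal sketch of the lookup order an edge server might follow: its own cache first, then a parent cluster, and only then the origin. The cache classes and fetch_from_origin helper are hypothetical stand-ins, not Akamai components.

    # A minimal sketch of tiered content lookup: edge cache -> parent cluster -> origin.
    # Every name here (fetch_from_origin, the dict-backed caches) is a hypothetical
    # stand-in used only to illustrate the fallback order.

    def fetch_from_origin(key):
        print(f"origin hit for {key}")          # most expensive path
        return f"<content of {key}>"


    class Cache:
        def __init__(self, name):
            self.name = name
            self.store = {}

        def get(self, key):
            return self.store.get(key)

        def put(self, key, value):
            self.store[key] = value


    def serve(key, edge, parent):
        # 1. Cheapest: the edge cluster closest to the user.
        value = edge.get(key)
        if value is not None:
            return value
        # 2. Next: a well-provisioned parent cluster shields the origin.
        value = parent.get(key)
        if value is None:
            # 3. Last resort: go back to the customer's origin server.
            value = fetch_from_origin(key)
            parent.put(key, value)
        edge.put(key, value)
        return value


    if __name__ == "__main__":
        edge, parent = Cache("edge-eu-west"), Cache("parent-eu")
        serve("/videos/intro.mp4", edge, parent)   # first request reaches the origin
        serve("/videos/intro.mp4", edge, parent)   # second request is an edge cache hit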

User Flow

  • A user types a URL into their browser, and the domain name in the URL is resolved to the IP address of an edge server that will serve the content.

    • A mapping system translates the URL into the IP address of an edge server, basing its answer on historical and current data about global network and server conditions. The goal is to pick an edge server close to the end user with minimal latency (a simplified sketch of this mapping decision follows this list).

  • The edge server, part of the edge server platform, uses the transport system to download the data or content required to serve the user's request.

    • The transport system is responsible for moving data and content over the long-haul Internet. It first tries to get the data from parent clusters to reduce the load on the origin servers; if the content does not exist there, it fetches it from the origin server and caches it in a tiered way, depending on the content's access pattern.

  • An enterprise customer logs onto the management portal to view how their content and applications are served to users. They can dynamically update configurations and roll them out to the entire edge platform in a fault-tolerant and timely manner via the communications and control system.
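
As a rough illustration of the mapping decision in the first step above, the sketch below answers a DNS query with the IP of the lowest-latency healthy edge cluster. The cluster names, IP addresses, and latency numbers are invented, and a real mapping system also weighs load, capacity, and cost.

    # A toy sketch of the mapping decision: given recent latency measurements from a
    # user's resolver to candidate edge clusters, return the IP of the best cluster.
    # Cluster names, IPs (documentation range), and latencies are invented.

    EDGE_CLUSTERS = {
        "fra-1": {"ip": "203.0.113.10", "latency_ms": 18, "healthy": True},
        "lhr-2": {"ip": "203.0.113.20", "latency_ms": 25, "healthy": True},
        "iad-3": {"ip": "203.0.113.30", "latency_ms": 95, "healthy": False},
    }


    def resolve(hostname):
        """Answer a DNS query with the IP of the closest healthy edge cluster."""
        # The hostname would normally select the customer's configuration; it is
        # ignored here to keep the sketch focused on the latency-based choice.
        healthy = {name: c for name, c in EDGE_CLUSTERS.items() if c["healthy"]}
        best = min(healthy.values(), key=lambda c: c["latency_ms"])
        return best["ip"]


    if __name__ == "__main__":
        print(resolve("www.example-customer.com"))   # -> 203.0.113.10 (fra-1)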