
Engineering a New PDU from the Ground Up

Kumar Thangudu

Designing and Launching a Smart Modular Power Distribution Unit for High Performance Compute
Whitfield Systems

Note: This post is a detour from the normal. I’ve been wrangling tech in oil and gas with my friends Reggie and Amir for a good bit, and we designed a new type of PDU, so I can’t help but talk about it. Fair warning to readers: you may find this post a bit drier than my normal stuff. It comes from our experience running crypto mining and high performance compute data center operations.

TL;DR:
Two buddies and I designed, architected, and assembled a new type of Section 889 compliant, dielectric fluid immersion-friendly, smart modular power distribution unit (PDU): Zeus (gen 1) and Athena (gen 2).

These are critical for vigilant, real-time data center monitoring and for avoiding costly compute downtime.
It is not our team’s first time working on mission critical energy infrastructure. We’ve worked on engineering that impacts terawatt hours of the US Oil and Gas supply infrastructure, but that’s for another blog post. 
—----------
What is a PDU (Power Distribution Unit)?
In the world of data centers, a Power Distribution Unit (PDU) might look like a simple surge protector, but it serves multiple critical purposes:
  • Power Distribution: Ensures even and efficient distribution of power to multiple devices.
  • Monitoring: Tracks power usage and identifies issues in real-time.
  • Management: Provides remote control capabilities for managing power to various devices.
The market for PDUs is large and highly relevant to avoiding future energy waste.

How Big Is the PDU Problem?

When compute capacity drops unpredictably due to unavoidable defects in silicon, it results in significant electricity waste. 

We’re talking about anywhere from 10 to 40 terawatt hours being lost annually due to this problem.

To give you an idea of the scale, that is about 0.25% to 1% of all US electricity consumption being thrown away every year.
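As a rough check on those percentages, here is a minimal sketch of the arithmetic. It assumes total US electricity consumption of about 4,000 TWh per year; that round figure is an assumption for illustration and is not from this post.

```python
# Assumed round figure for annual US electricity consumption, in TWh.
# This is an illustrative assumption, not a number from the post.
US_ANNUAL_ELECTRICITY_TWH = 4000.0


def share_of_us_electricity(wasted_twh: float,
                            total_twh: float = US_ANNUAL_ELECTRICITY_TWH) -> float:
    """Return wasted energy as a percentage of total consumption."""
    return 100.0 * wasted_twh / total_twh


# The post's estimated range of annual waste: 10 to 40 TWh.
low = share_of_us_electricity(10)
high = share_of_us_electricity(40)
print(f"{low:.2f}% to {high:.2f}%")  # prints "0.25% to 1.00%"
```

Under that assumption, the 10 to 40 TWh range lines up with the 0.25% to 1% figure.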
The Density Problem: Need for New Infrastructure
The next generation of GPU servers is using double, if not triple, the watts of legacy servers. This surge in power requirements presents a significant density problem:
  • The current power infrastructure in many data centers is inadequate for these high-power servers.
  • Upgrading infrastructure is critical to support these servers and avoid power-related issues.
  • Smart PDUs provide the necessary advanced power management solutions to handle these increased power demands.
The Reboot Problem: Downtime Due to Manual Power-Cycling
Manual power-cycling of servers is a major issue in data centers:
  • Servers must often be rebooted to resolve technical issues or apply updates. When servers become inaccessible from the network, the only options left are watchdog devices or straight manual power cycling. 
  • Manual power cycling requires physical intervention, leading to significant round-the-clock labor demand to combat downtime.
  • This downtime can be costly and disruptive to data center operations.


Significance of the Reboot Problem:
  • Downtime due to manual power cycling results in extreme financial losses and operational inefficiencies.
  • Remote reboot functionality, provided by smart PDUs, eliminates the need for physical intervention, reducing downtime and improving efficiency. 
But fixing the reboot problem is only half the battle. Increased infrastructure demand has highlighted a lack of robust real-time monitoring solutions for data centers.

The Importance of Vigilant Monitoring
Managing power distribution and preventing outages is a critical challenge for data centers. The increasing power demands driven by AI and high-performance computing (HPC) have made power distribution more complex [7].
Challenges Include:
  • Coordinating dense electrical equipment.
  • Managing power distribution across compact sites.
  • Meeting regulatory scrutiny [19].
  • Handling increased power-per-rack, which currently ranges from 20 to 40 kW per rack and is expected to double in the future [5][18].
The significance of real-time monitoring in data centers cannot be overstated. Real-time monitoring:
  • Provides continuous oversight of power usage.
  • Enables quick responses to potential issues [4].
  • Helps optimize power distribution, reduce energy waste, and prevent outages by identifying and addressing power anomalies promptly [8].
According to a study by the Uptime Institute, a significant portion of data center outages are caused by electrical failures, which often stem from undetected anomalies in power distribution [1].
For example, we faced a situation where a double electricity charge nearly cost us everything.

This was on a crypto-mining project for a client in a third-party colocation facility. Despite the known uptime, power consumption numbers just didn’t add up.

It started subtly. Our early power line monitors, the precursors to Zeus and Athena, hinted at a lurking issue. The amps were off the charts.


Further investigation revealed more:
  • The facility’s transformer was compromised.
  • The transformer was supplying only 115V instead of the standard 240V.
  • This discrepancy meant our servers were drawing roughly double the amps, leading to an unexpected surge in electricity usage and costs.
Without our monitors, we’d have been blindsided by tens of thousands in phantom electricity fees, not to mention the risk of damage to the colocation provider’s lines.
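The "double the amps" effect follows from I = P / V if the servers behave as roughly constant-power loads, which is an assumption here. The sketch below works through it. The 3,000 W server load and the 10% voltage tolerance are hypothetical values chosen for illustration; only the 240V and 115V figures come from the incident above.

```python
def current_amps(power_watts: float, volts: float) -> float:
    """Current drawn by a load that holds its power constant: I = P / V."""
    return power_watts / volts


def voltage_out_of_band(measured_volts: float, nominal_volts: float,
                        tolerance: float = 0.10) -> bool:
    """Flag a reading that deviates from nominal by more than `tolerance`.

    The 10% default is a hypothetical threshold, not a value from the post.
    """
    return abs(measured_volts - nominal_volts) > tolerance * nominal_volts


SERVER_WATTS = 3000.0    # hypothetical per-server load, illustration only
NOMINAL_VOLTS = 240.0    # expected supply voltage (from the post)
MEASURED_VOLTS = 115.0   # what the compromised transformer supplied (from the post)

expected = current_amps(SERVER_WATTS, NOMINAL_VOLTS)
actual = current_amps(SERVER_WATTS, MEASURED_VOLTS)
ratio = actual / expected

print(f"expected {expected:.1f} A, actual {actual:.1f} A, ratio {ratio:.2f}x")
print("alert:", voltage_out_of_band(MEASURED_VOLTS, NOMINAL_VOLTS))
```

The ratio depends only on the two voltages (240 / 115, about 2.09), so it holds for any server wattage under the constant-power assumption.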
This scenario is not unique; many data centers face similar challenges, underscoring the need for vigilant, real-time monitoring systems. Inadequate power monitoring can lead to significant operational disruptions and financial losses [4]. Without real-time monitoring, data centers get blindsided by unexpected power surges or failures, leading to costly outages and equipment damage [9].

The Role of Smart PDUs
This incident underscored the need for smart PDUs like Zeus and Athena. Traditional PDUs lack the ability to provide detailed, real-time insights into power consumption. Smart PDUs, however, offer advanced features such as:
  • Remote Monitoring: Allows real-time tracking of power usage and quick identification of issues.
  • Power Cycling: Enables remote rebooting of devices, reducing the need for physical intervention.
  • Energy Efficiency: Helps optimize power usage and reduce energy costs [6][20].
Per industry surveys, power disruptions top the list of root causes for outages [1]. Adoption of smart PDUs in data centers is on the rise because these features directly target that uptime problem [2][10][11].
  • Our smart PDUs are equipped with an API that allows any data center to incorporate remote reboot functionality and power monitoring statistics directly into their existing management platform.
  • This integration ensures seamless installation and minimal operational roadblocks.
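To make the remote reboot integration concrete, here is a minimal sketch of a watchdog that power-cycles an outlet after several consecutive failed reachability checks. It does not show the actual Zeus or Athena API: `OutletWatchdog`, `is_reachable`, `power_cycle`, and the three-failure threshold are all hypothetical names and values. In a real deployment the two callables would wrap a health check and a call to the PDU's reboot endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutletWatchdog:
    """Power-cycle an outlet after repeated failed reachability checks.

    Both callables are hypothetical stand-ins: `is_reachable` for a ping or
    health check, `power_cycle` for a call to a PDU's remote reboot API.
    """
    is_reachable: Callable[[], bool]
    power_cycle: Callable[[], None]
    max_failures: int = 3   # hypothetical threshold
    failures: int = 0

    def tick(self) -> bool:
        """Run one check; return True if a power cycle was triggered."""
        if self.is_reachable():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.power_cycle()
            self.failures = 0
            return True
        return False


# Simulated check results: one success, three failures, then recovery.
checks = iter([True, False, False, False, True])
cycles = []
dog = OutletWatchdog(is_reachable=lambda: next(checks),
                     power_cycle=lambda: cycles.append("outlet-1"))
results = [dog.tick() for _ in range(5)]
print(results)  # prints [False, False, False, True, False]
print(cycles)   # prints ['outlet-1']
```

Requiring several consecutive failures before cycling is a design choice to avoid rebooting a server over a single dropped check.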


Beyond Cost Savings: Ensuring Operational Integrity
By implementing smart PDUs, data centers can get closer to the silicon, in a sense, ensuring operational integrity and preventing similar incidents.
A report by the Ponemon Institute highlighted that the average cost of a severe data center outage exceeds $100,000 per incident [3]. Power-related issues are a major contributor to these costs, with primary causes including power failures and inefficient energy management, both of which can be mitigated with smart PDUs [3].
Conclusion: 
High-density compute environments that are grappling with the complexities of GPU servers or crypto mining need a robust smart PDU.

These tools are not just about cost savings; they are about ensuring the reliability and efficiency of data center operations remotely while keeping the deployment modular.
References: 
  1. Uptime Institute Annual Outage Analysis 2022
    1. Data center outages are decreasing but remain costly, with over 50% of severe outages costing more than $100,000, and 16% exceeding $1 million.
    2. Power disruptions are a leading cause of these outages.
  2. Gartner Report on Data Center Infrastructure (2022)
    1. The adoption of smart PDUs is expected to grow significantly, driven by the need for better power management and efficiency.
    2. Smart PDUs provide remote monitoring, power cycling, and enhanced energy efficiency.
  3. Ponemon Institute Cost of Data Center Outages 2021
    1. The average cost of a severe data center outage exceeds $100,000.
    2. Power-related issues are a major contributor to these costs.
  4. Data Center Knowledge Article on Power Management
    1. Real-time monitoring systems are crucial for optimizing power distribution and preventing outages.
    2. Inadequate power monitoring can lead to significant operational disruptions and financial losses.
  5. Uptime Institute Global Data Center Survey 2022
    1. The power-per-rack in data centers has increased significantly, with current ranges between 20 to 40 kW per rack.
    2. This increase in power density necessitates advanced power management solutions like smart PDUs.
  6. TechTarget Article on Smart PDUs
    1. Smart PDUs offer several advantages over traditional PDUs, including remote monitoring, power cycling, and improved energy efficiency.
    2. The adoption of smart PDUs is on the rise due to these benefits.
  7. Data Center Frontier Report on Power Distribution
    1. Data centers face challenges in managing power distribution due to increasing power demands driven by AI and HPC.
    2. Smart PDUs help in addressing these challenges by providing better power management and efficiency.
  8. IEEE Spectrum Article on Data Center Power Management
    1. Real-time monitoring systems help in reducing energy waste and preventing outages by identifying and addressing power anomalies promptly.
  9. Data Center Dynamics Article on Power Issues
    1. Power-related issues can lead to severe financial implications, with unplanned outages costing data centers millions of dollars.
    2. Implementing smart PDUs can mitigate these costs by providing better power management.
  10. Network World Article on Data Center Efficiency
    1. Smart PDUs contribute to improved energy efficiency and reduced operational costs in data centers.
  11. Uptime Institute Report on Data Center Trends
    1. The adoption of smart PDUs is becoming standard in data centers, driven by the need for better power management and efficiency.
  12. Gartner Forecast on Data Center Power Management
    1. Gartner predicts significant growth in the adoption of smart PDUs, with data centers increasingly recognizing their benefits.
  13. Ponemon Institute Report on Data Center Costs
    1. Power-related issues are a major contributor to the high costs of data center outages.
    2. Implementing smart PDUs can lead to significant cost savings by optimizing power usage and reducing energy waste.
  14. Data Center Knowledge Case Study on Smart PDUs
    1. Case studies have shown that smart PDUs can prevent operational issues and reduce costs in data centers.
  15. Uptime Institute Insights on Power Management
    1. Industry experts emphasize the importance of advanced power management in data centers.
    2. Smart PDUs play a critical role in enhancing data center operations and reducing the risk of power-related issues.
  16. Data Center Dynamics Report on Regulatory Compliance
    1. Data centers must comply with various regulatory requirements related to power management, such as Section 889.
    2. Compliance ensures the security and reliability of data center operations.
  17. TechTarget Article on Offshore PDUs
    1. Using offshore PDUs can pose risks, including potential security vulnerabilities and lower quality standards.
    2. It is crucial for data centers to source reliable, high-quality equipment.
  18. IEEE Article on Data Center Power Density
    1. The power-per-rack in data centers is expected to double in the future, necessitating advanced power management solutions like smart PDUs.
  19. Data Center Frontier Insights on Power Distribution
    1. Data centers face challenges in coordinating dense electrical equipment and managing power distribution across compact sites.
    2. Smart PDUs help in addressing these challenges by providing better power management and efficiency.
  20. Network World Report on Smart PDUs
    1. Smart PDUs are becoming increasingly popular in data centers due to their advanced features and benefits.
    2. They provide remote monitoring, power cycling, and enhanced energy efficiency.