Designing and Launching a Smart Modular Power Distribution Unit for High Performance Compute
Note: This post is a detour from my usual material. I’ve been working on oil and gas (O&G) tech with my friends Reggie and Amir for a good while, and we designed a new type of PDU, so I can’t help but talk about it. Fair warning to readers: you may find this post a bit drier than my normal stuff. It comes from our experience running crypto mining and high-performance compute data center operations.
TL;DR: Two buddies and I designed, architected, and assembled a new type of smart modular power distribution unit (PDU) that is Section 889 compliant and dielectric fluid immersion-friendly: Zeus (gen 1) and Athena (gen 2).
Smart PDUs like these are critical for vigilant, real-time data center monitoring and for avoiding costly compute downtime.
It is not our team’s first time working on mission-critical energy infrastructure. We’ve worked on engineering that impacts terawatt-hours of the US oil and gas supply infrastructure, but that’s for another blog post.
What Is a PDU (Power Distribution Unit)?
In the world of data centers, a Power Distribution Unit (PDU) might look like a simple surge protector, but it serves multiple critical purposes:
Power Distribution: Ensures even and efficient distribution of power to multiple devices.
Monitoring: Tracks power usage and identifies issues in real-time.
Management: Provides remote control capabilities for managing power to various devices.
The market for PDUs is large, and PDUs are directly relevant to avoiding future energy waste.
How Big is the PDU problem?
When compute capacity drops unpredictably due to unavoidable defects in silicon, it results in significant electricity waste.
We’re talking about anywhere from 10 to 40 terawatt hours being lost annually due to this problem.
To put that in perspective, it is roughly 0.25% to 1% of annual US electricity consumption (assuming a total on the order of 4,000 TWh) just being thrown away.
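The percentages above can be checked with a few lines of arithmetic. This is only a rough sketch: the 4,000 TWh figure for annual US electricity consumption is an approximation assumed for illustration, and the function name is made up for this example.

```python
# Assumption: annual US electricity consumption of roughly 4,000 TWh.
US_ELECTRICITY_TWH = 4_000.0


def share_of_us_electricity(wasted_twh, total_twh=US_ELECTRICITY_TWH):
    """Return wasted energy as a percentage of total consumption."""
    return 100.0 * wasted_twh / total_twh


# The low and high ends of the 10 to 40 TWh range quoted above.
for wasted in (10, 40):
    pct = share_of_us_electricity(wasted)
    print(f"{wasted} TWh -> {pct:.2f}% of US electricity")
```

Under that assumption, the 10 to 40 TWh range maps onto the 0.25% to 1% range quoted above.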
The Density Problem: Need for New Infrastructure
The next generation of GPU servers is using double, if not triple, the watts of legacy servers. This surge in power requirements presents a significant density problem:
The current power infrastructure in many data centers is inadequate for these high-power servers.
Upgrading infrastructure is critical to support these servers and avoid power-related issues.
Smart PDUs provide the necessary advanced power management solutions to handle these increased power demands.
The Reboot Problem: Downtime Due to Manual Power-Cycling
Manual power-cycling of servers is a major issue in data centers:
Servers must often be rebooted to resolve technical issues or apply updates. When servers become inaccessible from the network, the only options left are watchdog devices or straight manual power cycling.
Manual power cycling requires physical intervention, leading to significant round-the-clock labor demand to combat downtime.
This downtime can be costly and disruptive to data center operations.
Significance of the Reboot Problem:
Downtime due to manual power cycling results in extreme financial losses and operational inefficiencies.
Remote reboot functionality, provided by smart PDUs, eliminates the need for physical intervention, reducing downtime and improving efficiency.
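As a rough illustration of how remote reboot can replace a site visit, here is a minimal sketch of a reboot policy: it asks for a power cycle only after several consecutive failed health checks. The class name, the threshold of three failures, and the idea of feeding it a stream of reachability results are all assumptions for this example, not a description of how Zeus or Athena work internally.

```python
from dataclasses import dataclass


@dataclass
class RebootPolicy:
    """Hypothetical policy: power-cycle after N consecutive failed checks."""

    max_failures: int = 3
    failures: int = 0

    def record(self, reachable: bool) -> bool:
        """Record one health check; return True when a power cycle is due."""
        if reachable:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            # Reset so the server gets time to boot before the next cycle.
            self.failures = 0
            return True
        return False


policy = RebootPolicy(max_failures=3)
checks = [True, False, False, False, False]
decisions = [policy.record(ok) for ok in checks]
print(decisions)  # [False, False, False, True, False]
```

Requiring consecutive failures keeps a single dropped ping from triggering a reboot, which matters when the cycle itself costs compute time.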
But fixing the reboot problem is only half the struggle. Increased infrastructure demand has highlighted a lack of robust real-time monitoring solutions for data centers.
The Importance of Vigilant Monitoring
Managing power distribution and preventing outages is a critical challenge for data centers. The increasing power demands driven by AI and high-performance computing (HPC) have made power distribution more complex [7].
Challenges Include:
Coordinating dense electrical equipment.
Managing power distribution across compact sites.
Meeting regulatory scrutiny [19].
Handling increased power per rack, which ranges from 20 to 40 kW and is expected to double in the future [5][18].
The significance of real-time monitoring in data centers cannot be overstated. Real-time monitoring:
Provides continuous oversight of power usage.
Enables quick responses to potential issues [4].
Helps optimize power distribution, reduce energy waste, and prevent outages by identifying and addressing power anomalies promptly [8].
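To make "identifying power anomalies" concrete, here is a minimal sketch of a threshold check on per-outlet current readings. The function name, the sample readings, and the 20% tolerance are assumptions chosen for illustration; a production monitor would likely use rolling baselines and per-device profiles.

```python
def find_current_anomalies(readings_amps, expected_amps, tolerance=0.2):
    """Return (index, amps) pairs deviating from expected by more than tolerance.

    tolerance is a fraction of expected_amps (0.2 means plus or minus 20%).
    """
    limit = tolerance * expected_amps
    return [
        (i, amps)
        for i, amps in enumerate(readings_amps)
        if abs(amps - expected_amps) > limit
    ]


# Hypothetical samples: steady near 12.5 A, then a jump to about 26 A.
readings = [12.4, 12.6, 12.5, 26.1, 25.9]
print(find_current_anomalies(readings, expected_amps=12.5))
# [(3, 26.1), (4, 25.9)]
```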
According to a study by the Uptime Institute, a significant portion of data center outages are caused by electrical failures, which often stem from undetected anomalies in power distribution [1].
For example, we faced a situation where a double electricity charge nearly cost us everything.
This was on a crypto-mining project for a client in a third-party colocation facility. Despite the known uptime, power consumption numbers just didn’t add up.
It started subtly. Our early power line monitors, the precursors to Zeus and Athena, hinted at a lurking issue. The amps were off the charts.
Further investigation revealed more:
The facility’s transformer was compromised.
The transformer was supplying only 115 V instead of the expected 240 V.
This discrepancy meant our servers were drawing roughly double the amps to deliver the same power, leading to an unexpected surge in electricity usage and costs.
Without our monitors, we’d have been blindsided by tens of thousands in phantom electricity fees, not to mention the risk of damage to the colocation provider’s lines.
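The "double the amps" effect follows from I = P / V for a load that draws roughly constant power. The sketch below works through it; the 3,000 W server draw is a hypothetical figure, and the calculation ignores power factor and assumes a single-phase feed.

```python
def current_amps(power_watts, volts):
    """Current drawn by a constant-power load: I = P / V."""
    return power_watts / volts


POWER_W = 3000.0  # hypothetical per-server draw, for illustration only

normal = current_amps(POWER_W, 240.0)   # expected supply voltage
sagging = current_amps(POWER_W, 115.0)  # what the transformer delivered

ratio = sagging / normal
# Resistive line loss scales with the square of the current (P_loss = I^2 * R).
loss_ratio = ratio ** 2

print(f"240 V: {normal:.1f} A, 115 V: {sagging:.1f} A")
print(f"current x{ratio:.2f}, resistive line loss x{loss_ratio:.2f}")
```

With these numbers the current rises from 12.5 A to about 26.1 A, and resistive losses in the feed wiring rise by a bit more than a factor of four, which is why the provider’s lines were at risk too.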
This scenario is not unique; many data centers face similar challenges, underscoring the need for vigilant, real-time monitoring systems. Inadequate power monitoring can lead to significant operational disruptions and financial losses [4]. Without real-time monitoring, data centers get blindsided by unexpected power surges or failures, leading to costly outages and equipment damage [9].
The Role of Smart PDUs
This incident underscored the need for smart PDUs like Zeus and Athena. Traditional PDUs lack the ability to provide detailed, real-time insights into power consumption. Smart PDUs, however, offer advanced features such as:
Remote Monitoring: Allows real-time tracking of power usage and quick identification of issues.
Power Cycling: Enables remote rebooting of devices, reducing the need for physical intervention.
Energy Efficiency: Helps optimize power usage and reduce energy costs [6][20].
Per industry surveys, power disruptions top the list of root causes for outages [1]. The adoption of smart PDUs in data centers is on the rise due to their advanced features and benefits. Smart PDUs offer remote monitoring, power cycling, and enhanced energy efficiency, making them a strong fit for this uptime problem [2][10][11].
Our smart PDUs are equipped with an API that allows any data center to incorporate remote reboot functionality and power monitoring statistics directly into their existing management platform.
This integration ensures seamless installation and minimal operational roadblocks.
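To give a feel for what such an integration might look like, here is a minimal sketch of a client wrapper. Everything in it is hypothetical: the endpoint paths, field names, and the PduClient class are invented for illustration and are not the actual Zeus or Athena API. The transport is injected so the sketch runs offline; a real integration would swap in an HTTP call.

```python
from typing import Callable

# A transport takes (method, path, body) and returns the decoded reply.
Transport = Callable[[str, str, dict], dict]


class PduClient:
    """Thin wrapper over a hypothetical smart-PDU HTTP API."""

    def __init__(self, transport: Transport):
        self._send = transport

    def outlet_power(self, outlet: int) -> dict:
        """Fetch live readings for one outlet (hypothetical endpoint)."""
        return self._send("GET", f"/outlets/{outlet}/power", {})

    def power_cycle(self, outlet: int, off_seconds: int = 10) -> dict:
        """Request a remote reboot of one outlet (hypothetical endpoint)."""
        body = {"off_seconds": off_seconds}
        return self._send("POST", f"/outlets/{outlet}/cycle", body)


def fake_transport(method: str, path: str, body: dict) -> dict:
    """Stand-in for a real HTTP call so the sketch runs without a PDU."""
    if method == "GET":
        return {"path": path, "volts": 240.0, "amps": 12.5}
    return {"path": path, "status": "cycling", **body}


pdu = PduClient(fake_transport)
print(pdu.outlet_power(3))
print(pdu.power_cycle(3, off_seconds=15))
```

Keeping the transport separate from the client means an existing management platform can reuse its own HTTP stack, authentication, and retries.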
By implementing smart PDUs, data centers can get closer to the silicon, in a sense: they see what each device is actually drawing, which helps ensure operational integrity and prevent similar incidents.
A report by the Ponemon Institute highlighted that the average cost of a severe data center outage exceeds $100,000 per incident [3]. Power-related issues are a major contributor to these costs; the primary causes include power failures and inefficient energy management, both of which can be mitigated with smart PDUs [3].
Conclusion:
High-density compute environments grappling with the complexities of GPU servers or crypto mining need a robust smart PDU.
These tools are not just about cost savings; they are about ensuring the reliability and efficiency of data center operations remotely, while keeping the deployment modular.
References:
[1] Uptime Institute, Annual Outage Analysis 2022.
Data center outages are decreasing but remain costly, with over 50% of severe outages costing more than $100,000, and 16% exceeding $1 million.
Power disruptions are a leading cause of these outages.