Server Motherboard

General observation

The motherboard is shown in the picture below:

compute node

Motherboards for water-cooled supercomputers differ from those for air-cooled servers. Why? There are many reasons, but the main one is that there is no circulating cooling air, so components can be cooled in only two ways:

- heat is removed from components directly by a water-cooled coldplate

- heat from components leaks into the PCB (and is then removed from the PCB by the coldplate)

On the one hand, a coldplate is quite expensive; on the other hand, it should be very thin (to reduce the rack's overall volume and weight). To meet both goals, one coldplate was used to cool 4 motherboards. They are assembled like a "sandwich".



Nevertheless, one important problem arises: how can we cool memory without air circulating around the memory modules, as on a "usual" air-cooled motherboard? Without cooling, the memory would overheat and produce errors almost immediately. What is the solution?
Here is the custom memory module (patent RU 123203, I'm one of its authors):

 memory module

We can see here that all memory modules are placed on the top of the board, so we can press the cooling plate against the memory chips (through a thin layer of thermal rubber, of course) to cool them.



Let's consider heat removal from the motherboard in more detail.

Here is the simulated temperature distribution on the coldplate under maximum compute load (there are 4 boards on one coldplate, 2 on each side):

motherboard cooling

To achieve this (which wasn't simple), a complex coldplate was developed. Look at it.

It has this shape, with protrusions to ensure direct contact with low-profile components and cutouts to avoid collisions with high-profile components. The complex water channels inside the coldplate are shown here (picture from patent RU 2522937):

coldplate channels

To ensure that the most important components are cooled sufficiently, 25 sensors monitor the board: 14 sensors inside chips and 11 external sensor chips. Each sensor has its own configured threshold for reporting a potentially problematic situation, and the states of all sensors are evaluated together. Additionally, if any value exceeds its threshold, the BMC writes the event to a log on SD card storage.
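The threshold check described above can be sketched in a few lines of Python. This is only an illustration: the sensor names, the threshold values and the `check_sensors` helper are hypothetical, and the real BMC firmware is not shown in this article.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    name: str           # hypothetical sensor label
    threshold_c: float  # alarm threshold in degrees Celsius (assumed value)


def check_sensors(sensors, readings):
    """Return one log line per sensor whose reading exceeds its threshold."""
    events = []
    for s in sensors:
        value = readings[s.name]
        if value > s.threshold_c:
            events.append(f"{s.name}: {value:.1f} C exceeds {s.threshold_c:.1f} C")
    return events


# Three made-up sensors instead of the real 25
sensors = [Sensor("cpu0_die", 90.0), Sensor("vrm_ext", 100.0), Sensor("dimm_ext", 85.0)]
readings = {"cpu0_die": 93.5, "vrm_ext": 71.0, "dimm_ext": 85.0}

events = check_sensors(sensors, readings)
board_ok = not events  # the board is healthy only if no sensor raised an event
for line in events:
    print(line)        # a real BMC would append this line to its log on the SD card
```

All readings are checked in one pass, so the overall board state (`board_ok`) reflects every sensor at once rather than only the first one that trips.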


BMC (Board Management Controller)

We need an independent controller chip on the board to monitor whether the main processor is working properly or has problems with power or overheating. Additionally, the chip should have its own network controller, so that it can notify the supercomputer's control system if problems arise. This chip is called the Board Management Controller (more commonly, Baseboard Management Controller), or BMC. It has its own ecosystem, including memory, a network chip and a power source, which are always on.
Its main functions are:
- monitor the non-computational parameters (voltage and temperature) of all components on the motherboard
- control the power-on process on the board: which components should be powered and in which order
- provide its own independent network for remote operator actions when the main CPU is down and therefore cannot process any tasks
- implement its own data protocol, called IPMI, to communicate with the upper- and lower-level components of the system
- route and control most of the configuration paths used by control data flows
- provide a uniform interface for external control, so that an operator can configure all nodes in the cluster at once, for example via SNMP
- provide (with a GUID chip) a unique ID for the node, to identify it for centralized control purposes
- provide a KVM service that lets an operator log in to the node remotely and configure it individually, in graphical mode if required
- periodically log all monitoring data (the System Event Log, SEL) to the SD card, so that the repair service can analyze the causes of a node failure "post-mortem"
Obviously, all these functions cannot be implemented from scratch. Like UEFI BIOS, a BMC has many functions that are the same for all server motherboards. One example of a ready-to-use BMC framework is AMI MegaRAC. The art of implementing BMC firmware lies in building on the basic functions of an existing framework. Otherwise, the developer has to pay a lot of extra money (to the same framework provider, of course) for implementing and testing new BMC functions. However, custom functions are sometimes necessary, for example when we need to monitor the newest GPU card via I2C and the corresponding functions are not yet implemented in the BMC framework.
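To make the power-on control function from the list above more concrete, here is a minimal simulation in Python. It is a sketch under assumptions: the rail names, their order and the `enable` / `power_good` callbacks are hypothetical, and a real BMC would drive hardware enable pins and read power-good signals instead.

```python
# Hypothetical rail names and order; every real board defines its own sequence.
POWER_ON_ORDER = ["12V_main", "3V3_aux", "1V8_io", "1V2_mem", "1V0_cpu_core"]


def power_on(rails, enable, power_good):
    """Enable rails in order; on a failure, switch everything off in reverse order.

    enable(rail, on) switches a rail, power_good(rail) reports whether it is stable.
    Returns (ok, log).
    """
    enabled = []
    log = []
    for rail in rails:
        enable(rail, True)
        enabled.append(rail)
        if not power_good(rail):
            log.append(f"FAIL {rail}")
            for r in reversed(enabled):
                enable(r, False)
                log.append(f"OFF {r}")
            return False, log
        log.append(f"OK {rail}")
    return True, log


# Simulated hardware: a dict of rail states and one deliberately faulty rail
state = {}


def fake_enable(rail, on):
    state[rail] = on


def fake_power_good(rail):
    return rail != "1V2_mem"


ok, log = power_on(POWER_ON_ORDER, fake_enable, fake_power_good)
print(ok)
print(log)
```

In this simulated run the memory rail never reports power-good, so the CPU core rail is never enabled and the rails that were already on are switched off in reverse order.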


Power converters

Power converters for a motherboard are not as simple as they seem at first glance. They must convert the 48V input to many voltages, including the 1V CPU core voltage.
The first question is the number of DC/DC conversion steps. Of course, we would prefer a one-step converter from 48V to each of the required voltages. However, this approach has many drawbacks. The main problem is capacitors. Electrolytic capacitors for high voltages are big, and if we use them for every DC/DC converter, we lose a lot of space inside the node. In addition, there is a cost problem: tantalum capacitors rated for 48V are enormously expensive.
To solve these problems, a 2-step scheme was used: 48V -> 12V and then 12V -> required voltages.
In fact, this scheme has its own advantages and drawbacks. The main advantage is that we can use standard 48V -> 12V quarter-bricks, which means we can use commodity parts and reduce costs dramatically. The drawback is that we have a 500W DC/DC converter placed far from the cooling water paths, which leads to its overheating (power efficiency is about 95%). We solved the problem by splitting it into two 250W bricks, which doubled the area available for contact cooling.
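A rough estimate shows how much heat is involved. The sketch below assumes that 500W is the rated output power and that both brick sizes run at the same 95% efficiency; neither assumption is stated explicitly above, and the `converter_loss_w` helper is introduced only for this illustration.

```python
def converter_loss_w(output_w, efficiency):
    """Heat dissipated by a converter that delivers output_w at the given efficiency."""
    return output_w * (1.0 - efficiency) / efficiency


EFFICIENCY = 0.95  # "about 95%" from the text

single = converter_loss_w(500.0, EFFICIENCY)     # one 500 W converter
per_brick = converter_loss_w(250.0, EFFICIENCY)  # each of two 250 W bricks

print(f"one 500 W converter: {single:.1f} W of heat")
print(f"each 250 W brick:    {per_brick:.1f} W of heat")
```

Under these assumptions the total heat stays the same (about 26 W), but each brick has to get rid of only half of it through its own contact surface.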



Unfortunately, I can't show here how we changed the placement of the high-profile electrolytic capacitors so that the water channels could be routed along optimal paths. Otherwise, these water paths would have intersected the cutouts in the coldplate that were made for these capacitors.


SSD on board

Here is one more detail that you won't find on "usual" server boards: the M.2 connector. An SSD is placed on the motherboard because a supercomputer is not a datacenter. A supercomputing calculation is one whole thing, and all compute nodes must calculate and exchange their results with each other. Now suppose that in the middle of a calculation (say, after a week) one of the nodes breaks. What should we do? Recalculate all the results from the start? But what if we then get another error on another node? According to statistics, IBM's Sequoia gets a node error every 22 hours.

The solution is to insert a number of "control points" (checkpoints) during execution. If our monitoring system reports that a node is down, we freeze program execution, remove the problem node, take out its SSD with the intermediate results together with the FRU (which contains GUIDs, MACs and so on, to avoid system reconfiguration), insert them into a new node and continue the calculation from the last saved control point.
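The control-point idea can be illustrated with a toy example in Python. This is only a sketch: the computation is a trivial running sum, a temporary directory stands in for the on-board SSD, and the `save_checkpoint`, `load_checkpoint` and `run` helpers are hypothetical names, not the software used on the real machine.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically: write a temporary file, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or (0, 0) if none exists."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy computation (sum of step numbers) that can 'fail' at a given step."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"node failed at step {step}")
        step += 1
        state += step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return state


with tempfile.TemporaryDirectory() as ssd:  # stands in for the on-board SSD
    ckpt = os.path.join(ssd, "checkpoint.json")
    try:
        run(ckpt, total_steps=100, checkpoint_every=10, fail_at=57)
    except RuntimeError as err:
        print(err)
    resumed_from, _ = load_checkpoint(ckpt)
    # The "new node" reads the same SSD and continues from the last control point
    result = run(ckpt, total_steps=100, checkpoint_every=10)
    print(resumed_from, result)
```

Only the work done since the last control point is lost. The atomic rename means that a failure in the middle of a write cannot corrupt the previous checkpoint.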



Assembled computational module

As I already mentioned, the assembled computational module measures approx. 20x20x2 inches. Looks nice, doesn't it?



©2021 Nikolay Bodunov. All Rights Reserved.