ISC 2014: HPC, liquid cooling and energy reuse
July 2, 2014
Representatives of RenewIT (451 Research and TUC) attended the recent ISC 2014 conference in Leipzig, Germany.
Among the companies present were several that have developed liquid-cooling technology. The main test-bed for the RenewIT project is based on liquid-cooling technology from German supplier Megware. The project adopted liquid cooling for its improved heat transfer, which benefits heat reuse.
Along with liquid cooling, here are some of the main themes the RenewIT project took away from ISC 2014:
* Experience shows that the ratio of peak Flops to the Flops achieved by real applications is about 7:1.
* Accelerators are a very big topic. Most of the top 10 clusters use either Intel Xeon Phi or NVIDIA CUDA systems.
* There is a slowdown in the HPC market. This year saw the lowest number of new systems in the Top 500.
* Both European and American HPC systems are getting older
* More and more of the total calculation power is concentrated in the top 6 machines. The Top 500 list has a Gini coefficient of 77.
* Almost 100 systems in the Top 500 are not clusters but single machines.
* The director of the K Supercomputer in Japan showed that the average traffic for the entire internet is approximately equal to the internal traffic in the K supercomputer due to the huge communication between nodes.
* The quantum computers from D-Wave that are installed at NASA/Google and Lockheed Martin are operational, and major research in this area is under way. It has been independently confirmed that the computers exhibit true quantum entanglement and actually work.
* IBM sold its x86 server business to Lenovo and is (probably) leaving that market.
* The total computation power of the Top 500 list has grown at a steady rate of a factor of 2 per year over the last 10 years. Processors have “only” grown by a factor of 1.34 per year. This has opened a huge gap between high-power HPC systems and average cloud farms.
* The next big developments to look out for are on-chip optical interconnects with up to 20 Tbit/s and the Intel Omni Scale Fabric interconnect, which will be integrated directly into future Xeons.
* BMW is moving their main data center to Iceland because of the cheap energy, free cooling and better data protection laws.
* The big goal in HPC until 2020 is the exascale challenge: building a 1 Exaflop cluster. With current technology that would take about 1000 MW of IT power.
* Water Cooling was one of the BIG topics at ISC’14.
* There is a giant performance gap between supercomputers and cloud systems for distributed calculations on everything except “embarrassingly parallel” problems, mainly due to the interconnect. Supercomputers tend to have something like a 40 Gbit/s InfiniBand fabric that lets each node talk to every other node at full speed and with very low latency. Cloud systems use 1 Gbit/s Ethernet, with the added disadvantage that the switches have maximum throughput rates of, for example, 160 Gbit/s, compared to the Tbit/s of an InfiniBand fabric.
* The Supercomputing center in Poznan, PL managed to reach a PUE of 1.018 with an Iceotope Cluster and a dry cooler.
* Bull is working on liquid-cooled racks with up to 80 kW per rack.
* Providers offering water cooling include Megware, CoolIT, HP, Bull, IBM and Asetek (RackCDU). Even more extreme are immersion solutions, where the entire server is submerged in a dielectric fluid. Examples are 3M, Iceotope, Green Revolution Cooling, and the most energy-efficient HPC system in the world, the Japanese Tsubame-KFC.
* Asetek offers a completely sealed water cooled system with a complete miniature air cycle including heat exchangers.
* One interesting idea with water cooling is to use controllable miniature pumps on the CPU so that each CPU gets an optimal level of cooling.
* All vendors are working on solutions to control the waste heat temperature to keep it constant for better heat reuse. Nobody has any good ideas about what to do with all the heat except do a little heating.
* HP showed their new Apollo 8000 system, which combines heat pipes and cold plates to get the advantages of both. It reaches up to 80 kW per rack and uses internal high-voltage DC for the power supply.
* LRZ is experimenting with energy-aware scheduling. They measure how much a job slows down when the processor is slowed down, and if the job does not slow down too much (less than 12%), they reduce the processor frequency from 2.7 to 2.3 GHz.
* LRZ saves energy this way, but at the cost of longer-running jobs. They did the math: if total cost of ownership is included in the optimization, no slowdown is justified.
* Internal studies by IBM and LRZ say that there is up to 20% difference in power consumption between identical processors when running the same job due to manufacturing tolerances. There are no publications about this since Intel doesn’t allow it. I’m trying to get more data.
* There are big initiatives for HPC efficiency, but the consensus of all the application programmers at the conference was that the one thing they optimize for is time, not energy.
* Studies by Intel show that at lower CPU temperatures applications run faster since turbo boost can engage longer.
* While in theory HPC applications have a consistent 100% load profile, in reality they spend up to 80% of their time waiting for communication depending on the application.
* Right now, fully simulating an HIV virus with 3 million atoms takes about one day of real time per 20 ns of simulated time on a cluster with more than 10,000 cores.
* Ubercloud is working on dynamic HPC clusters.
* Systemburn and Firestarter (http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/firestarter) are tools for Linux to cause maximum stress.
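
The growth and exascale figures in the list are easier to judge once compounded. The Python sketch below is only a back-of-the-envelope check: it reads the factor of 1.34 for processors as a per-year rate like the Top 500 figure (an assumption), compounds both over 10 years, and derives the energy efficiency implied by the estimate of 1 Exaflop at 1000 MW. All names are made up for this illustration.

```python
# Back-of-the-envelope check of the growth and exascale figures in the notes.
YEARS = 10
TOP500_GROWTH_PER_YEAR = 2.0       # from the notes
PROCESSOR_GROWTH_PER_YEAR = 1.34   # from the notes, assumed to be per year


def compound(rate_per_year: float, years: int) -> float:
    """Total growth factor after compounding a yearly rate."""
    return rate_per_year ** years


top500_total = compound(TOP500_GROWTH_PER_YEAR, YEARS)
processor_total = compound(PROCESSOR_GROWTH_PER_YEAR, YEARS)
gap = top500_total / processor_total

# Efficiency implied by "1 Exaflop would take 1000 MW of IT power".
EXAFLOP_IN_FLOPS = 1e18
POWER_IN_WATTS = 1000e6
gflops_per_watt = EXAFLOP_IN_FLOPS / POWER_IN_WATTS / 1e9

print(f"Top 500 growth over {YEARS} years: {top500_total:.0f}x")
print(f"Processor growth over {YEARS} years: {processor_total:.1f}x")
print(f"Gap between the two: {gap:.0f}x")
print(f"Implied efficiency: {gflops_per_watt:.1f} GFlops per watt")
```

Under these assumptions the list grows roughly 1024-fold while processors grow roughly 19-fold, a gap of about 55, and the exascale estimate corresponds to about 1 GFlops per watt.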
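
The Gini coefficient mentioned for the Top 500 list measures how unevenly performance is spread across systems. As a rough illustration, the following Python sketch computes it for a list of per-system performance values. The function name is hypothetical, and the value of 77 in the notes is assumed here to correspond to 0.77 on the 0 to 1 scale this function returns.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values: 0 means perfectly equal,
    values near 1 mean almost everything sits in a few entries."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive sum")
    # Standard closed form on sorted data, using ranks 1..n.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Made-up performance values, not Top 500 data.
equal_systems = [1.0, 1.0, 1.0, 1.0]
one_dominant_system = [0.0, 0.0, 0.0, 1.0]
print(gini(equal_systems))
print(gini(one_dominant_system))
```

With n systems, the largest possible value is (n - 1) / n, which is reached when a single machine holds all of the performance.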
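
To put the PUE of 1.018 reported for Poznan in perspective: PUE is total facility power divided by IT power, so everything above 1.0 is overhead for cooling and power distribution. The Python sketch below is a minimal illustration with made-up load figures and hypothetical function names.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power over IT power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Non-IT power (cooling, distribution) implied by a given PUE."""
    return it_kw * (pue_value - 1.0)


# A hypothetical 100 kW IT load at the PUE quoted in the notes.
print(overhead_kw(100.0, 1.018))
```

For a hypothetical 100 kW IT load, a PUE of 1.018 leaves about 1.8 kW for everything that is not IT equipment.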
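
The energy-aware scheduling rule described for LRZ can be sketched as a simple threshold decision. This is not LRZ's scheduler, only an illustration of the rule as stated in the notes: the 12% limit and the 2.7 and 2.3 GHz frequencies come from the list above, while the function names and the idea of comparing two measured runtimes are assumptions.

```python
NOMINAL_GHZ = 2.7      # from the notes
REDUCED_GHZ = 2.3      # from the notes
MAX_SLOWDOWN = 0.12    # "less than 12%" from the notes


def slowdown(runtime_nominal_s: float, runtime_reduced_s: float) -> float:
    """Relative slowdown of a job at reduced frequency (0.05 means 5% slower)."""
    if runtime_nominal_s <= 0:
        raise ValueError("nominal runtime must be positive")
    return runtime_reduced_s / runtime_nominal_s - 1.0


def choose_frequency_ghz(runtime_nominal_s: float, runtime_reduced_s: float) -> float:
    """Pick the reduced frequency only if the job tolerates it."""
    if slowdown(runtime_nominal_s, runtime_reduced_s) < MAX_SLOWDOWN:
        return REDUCED_GHZ
    return NOMINAL_GHZ


# Hypothetical jobs: one mostly waits on communication, one is compute-bound.
print(choose_frequency_ghz(100.0, 105.0))  # slows down 5%: run at reduced clock
print(choose_frequency_ghz(100.0, 120.0))  # slows down 20%: keep nominal clock
```

Jobs that spend much of their time waiting for communication would tend to pass this test, since lowering the clock hardly changes their runtime.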