Accurate Power and Energy Measurement on Kepler-based Tesla GPUs
Martin Burtscher
Department of Computer Science

Introduction
- GPU-based accelerators
  - Quickly spreading in PCs and even handheld devices
  - Widely used in high-performance computing
- Power and energy efficiency
  - Heat dissipation is a problem
  - Electric bill and battery life are of growing concern
  - Exascale requires 50x boost in performance per watt
- Important research area
  - Need to develop techniques to reduce power and energy
  - Have to be able to measure power/energy of programs

GPU Power Sensors
- Hardware
  - High-end compute GPUs include power sensors
  - For example, K20/K40 Tesla cards have built-in sensor
  - These cards are the target of this talk
- Software
  - Can query sensor with NVIDIA Management Library
  - http://developer.nvidia.com/nvidia-management-library-nvml

Problems
- Power sensor data behaves strangely
  - Running the same kernel twice yields different energy
    - First launch: 114 J, second launch: 147 J (29% more energy)
  - Running a kernel 2x as long more than doubles energy
    - 1x input: 732 J, 2x input: 1579 J (8% above doubling)
- Power sensor sampling rate varies greatly
  - Ranges from 0.266 ms to 130 ms (7.7 Hz to 3760 Hz)

Methodology
- Hardware
  - Two K20c, two K20m, two K20X, and two K40m GPUs
- Measurement
  - Query power and time in loop on “idle” CPU core
- Test code
  - Compute-intensive regular n-body kernel
  - Constant computation rate of over 2 TFlops on a K20c
  - No data dependences; vary n to adjust kernel runtime

Expected Power Profile
[Figure: expected rectangular power profile over the measurement loop runtime]
- Kernel starts executing
- Kernel stops executing
- GPU idle power

Measured Power Profile
[Figure: measured power profile; duration markers on the plot read 3 s, 5 s, and 4 s]
- Macroscopic phenomena
  - Power ramps up slowly
  - Switch to step shape
  - Power ramps down slowly
  - Idle power reached

Energy = Area Under Power Curve
- Unclear how big energy is
  - Missing energy?
  - Delayed energy?
  - Integrate to where?

Ramp-up Behavior of 2 Short Runs
- Ramp down doesn’t follow
- 2nd run starts higher but also follows curve
- Short run same as longer run

Ramp-down Behavior of Several Runs
[Figure: measured power [W] vs. shifted runtime [s] for several runs; markers t2, t3, t4]
- Shape depends on power at t2
- Driver lowers power level
- Shape always the same
- Steps down every second
- Power increases after kernel done

Sampling Interval Lengths
[Figure: measured power [W] and sampling interval [ms] vs. runtime [s]; markers t1, t2, t3, t4]
- Very long interval
- Driver activity can prevent sampling
- Wide range of intervals
- Short intervals

Sampling Interval Lengths (zoomed-in)
[Figure: measured power [W] and sampling interval [ms] vs. runtime [s], 12.030 s to 12.060 s]
- Identical values
- Very long interval
- Sampled power only ever changes after long interval
- Many short intervals

Correcting the Measurements

Sampling Frequency
- Eliminate redundant samples
  - Only sample once every 15 ms (66.7 Hz)
  - Cannot accurately measure kernels under ~150 ms
- Account for the variation in interval length
  - Use high-resolution time stamps
- Example: energy from t1 to t4
  - Dotted (fixed intervals): 1205 J
  - Solid (variable intervals): 1066 J
  - 13% discrepancy
[Figure: measured power [W] vs. runtime [s]; markers t1 and t4]

True Power
- Sensor hardware
  - Seems to asymptotically approach true power
  - Reminiscent of capacitor charging
- True instant power
  - P_true is a function of the slope of the power profile dP/dt and the power measured by the sensor P_sensor
  - P_true = P_sensor + C × dP_sensor/dt
- “Capacitance” of sensor
  - C ≈ 0.84 s on all tested K20 GPUs

Back-calculated from Expected Profile
- Minimized absolute errors to determine C
- ‘Capacitor’ function matches measured values perfectly

Corrected Power Profile
[Figure: power [W] vs. time [s]; markers t1, t2, t3]
- Wobbles due to sampling errors
- ‘Active idle’ power level
- Corrected profile matches expected rectangular profile

Correction of 2 Short Runs
[Figure: power [W] vs. time [s]; markers t1a, t2a, t1b, t2b, t3b]
- Corrected power profile matches expected profile

Second K20c GPU
[Figure: power [W] vs. time [s]; markers t1, t2, t3]
- Identical to original K20c

K20m GPU
[Figure: power [W] vs. time [s]; markers t1, t2, t3]
- Similar profile but higher power level

K20X GPU
[Figure: power [W] vs. time [s]; markers t1, t2, t4]
- Profile is good, no correction needed!
- Huge 600 ms gap

K40m GPU
- K40m again requires correction

Application to Full CUDA Program
- Implementation of Barnes Hut n-body algorithm
  - Taken from LonestarGPU benchmark suite
  - Contains multiple regular and irregular kernels
  - Highly optimized, but still suffers from load imbalance, divergence, and uncoalesced accesses
  - Main kernel is ‘regularized’ (warp-based)
[Image: NASA/JPL-Caltech/SSC]

Barnes Hut Power Profile (1 Step)
- Slow then fast drop-off
- “Wave” in profile
- Original profile is hard to interpret

Barnes Hut Power Profile (Kernels)
- Slow then fast drop-off
- “Wave” in profile
- Original profile is hard to interpret

Corrected Barnes Hut Power Profile
[Figure: power [W] vs. time [s]; markers a, b, c, d, e, f]
- Corrected profile reveals important info
  - Two similar irregular kernels
  - Regularized main kernel
  - Decrease due to load imbalance
  - One more irregular kernel
  - Very short regular kernel

K20Power Tool
- Output
  - Corrected profile and corresponding ‘active’ energy
- Features
  - Computes instant power using ‘capacitor’ formula
  - Employs high-resolution time steps
  - Samples at true frequency of 66.7 Hz
- Dissemination
  - Open source, research license
  - http://cs.txstate.edu/~burtscher/research/K20power/

Marcher System
- Tool will be part of Marcher system at Texas State
  - NSF-funded green computing infrastructure
- Marcher is a power-measurable cluster system
  - 832 general-purpose cores
  - 12,000 GPU and MIC cores
  - 1.2 TB of DDR3 with power throttling and scaling
  - 50 TB of hybrid storage with hard drives and SSDs
  - Component-level power measurement tools (e.g., CPU, DRAM, Disk, GPU, Xeon Phi)

Summary
- Correctly measuring K20/K40 power and energy
  - Sample at 66.7 Hz and include time stamps
  - Compute true power with presented formula
    - Use neighboring power samples to approximate slope
  - Compute true energy by integrating true power
    - Over intervals where power is above ‘active idle’
- K20Power tool
  - Software tool that implements this methodology
  - Paper at http://cs.txstate.edu/~burtscher/papers/gpgpu14.pdf

Acknowledgments
- Collaborators
  - Ivan Zecena and Ziliang Zong
- U.S. National Science Foundation
  - DUE-1141022, CNS-1217231, and CNS-1305359
- NVIDIA Corporation
  - Grants and equipment donations
- Texas State University
  - Research Enhancement Program
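
Illustrative sketch of the methodology from the Summary

The summary steps (time-stamped samples, the ‘capacitor’ correction with a slope taken from neighboring samples, and integration above ‘active idle’) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the K20Power tool's actual code: the function names `true_power` and `active_energy`, the `active_idle` threshold parameter, the choice of central differences for the slope, and the trapezoidal rule for the integral are all illustrative. Only the formula and the value C ≈ 0.84 s come from the talk. Time stamps are assumed to be in seconds and strictly increasing.

```python
from typing import List, Sequence

# "Capacitance" reported in the talk for the tested K20 GPUs, in seconds.
C_SENSOR = 0.84


def true_power(times: Sequence[float], sensor: Sequence[float],
               c: float = C_SENSOR) -> List[float]:
    """Apply P_true = P_sensor + C * dP_sensor/dt to every sample.

    The slope is approximated from the neighboring samples (central
    difference in the interior, one-sided at both ends). Time stamps
    must be strictly increasing, otherwise the division fails.
    """
    n = len(times)
    if n != len(sensor):
        raise ValueError("times and sensor must have the same length")
    if n < 2:
        return list(sensor)  # no neighbor available, nothing to correct
    corrected = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        slope = (sensor[hi] - sensor[lo]) / (times[hi] - times[lo])
        corrected.append(sensor[i] + c * slope)
    return corrected


def active_energy(times: Sequence[float], power: Sequence[float],
                  active_idle: float) -> float:
    """Integrate power over time with the trapezoidal rule.

    Uses the actual (variable) interval lengths from the time stamps and
    only counts intervals whose two endpoints are both above the
    hypothetical `active_idle` threshold (in watts). Result is in joules.
    """
    energy = 0.0
    for i in range(1, len(times)):
        if power[i - 1] > active_idle and power[i] > active_idle:
            dt = times[i] - times[i - 1]
            energy += 0.5 * (power[i - 1] + power[i]) * dt
    return energy
```

A typical use would pass the recorded time stamps and sensor readings to `true_power`, then pass the corrected values to `active_energy` with a threshold chosen just above the observed ‘active idle’ level. Because the correction amplifies sample-to-sample noise by the factor C/Δt, the wobbles noted on the corrected-profile slide are to be expected from a sketch like this as well.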