
Why does higher CPU-usage slow down a task?

3 votes
2 answers
779 views
I am using whisper.cpp to transcribe some sound files. It is a very CPU-heavy process, so I am trying to find optimal settings. I have run some tests with the thread setting (`-t`), but the results are very confusing.

This is the command I execute:

    date; time ./main -t [number of threads] -m ggml-model.bin -f 5min-16kHz.wav; date

I run this on a MacBook Pro with an Intel i7 with 6 cores (plus 6 hyper-threaded logical cores). I have tried the default setting (4 threads), 6 threads and 12 threads (and 14 threads, but that didn't produce any output although all CPUs ran at 100 %). Here are the results:

Threads | output from time
-|-
4 | 1750.84s user 11.02s system 564% cpu 5:11.87 total
4 | 1862.04s user 18.63s system 553% cpu 5:39.58 total
6 | 2199.42s user 16.79s system 720% cpu 5:07.51 total
6 | 2212.72s user 14.49s system 722% cpu 5:08.22 total
12 | 4595.03s user 22.21s system 1053% cpu 7:18.47 total
12 | 4298.11s user 22.53s system 1059% cpu 6:47.85 total

As you can see, the CPU load increases as I increase the number of threads. You would expect the real time to decrease in proportion to the increase in CPU load (100 % for one minute should approximately correspond to 200 % for half a minute and 50 % for two minutes), but that doesn't happen here. Instead, I get approximately the same real time with 4 and 6 threads, while the CPU time increases by ≈ 25 % when running 6 threads. And 12 threads is even worse: the CPU time doubles compared to 6 threads and the real time increases by 40 %.

I don't understand this. Of course, more threads don't scale linearly, but **CPU time should remain fairly constant when performing the same task, independently of the number of threads, shouldn't it? And real time should decrease when CPU usage increases?**

**And considering the task and my hardware, what would be a reasonable setting for the number of threads to use?** I was expecting it to be the number of cores plus a little extra in case a thread waits for I/O.
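For anyone who wants to redo the arithmetic behind these comparisons, here is a minimal sketch in Python. It assumes that the `cpu` percentage printed by `time` is (user + system) / real, which is consistent with the figures above, and it averages the two runs per thread count. The helper names (`elapsed_to_seconds`, `summarize`, `mean_by_threads`) are made up for this illustration.

```python
from collections import defaultdict

# (threads, user seconds, system seconds, elapsed) copied from the `time` output above
RUNS = [
    (4, 1750.84, 11.02, "5:11.87"),
    (4, 1862.04, 18.63, "5:39.58"),
    (6, 2199.42, 16.79, "5:07.51"),
    (6, 2212.72, 14.49, "5:08.22"),
    (12, 4595.03, 22.21, "7:18.47"),
    (12, 4298.11, 22.53, "6:47.85"),
]


def elapsed_to_seconds(elapsed):
    """Convert 'm:ss.ss' or 'h:mm:ss.ss' into seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def summarize(user_s, system_s, elapsed):
    """Return total CPU seconds, real seconds and the implied CPU percentage."""
    real_s = elapsed_to_seconds(elapsed)
    cpu_s = user_s + system_s
    return {"cpu_s": cpu_s, "real_s": real_s, "cpu_percent": 100 * cpu_s / real_s}


def mean_by_threads(runs):
    """Average CPU seconds and real seconds over all runs with the same thread count."""
    grouped = defaultdict(list)
    for threads, user_s, system_s, elapsed in runs:
        grouped[threads].append(summarize(user_s, system_s, elapsed))
    return {
        threads: {key: sum(row[key] for row in rows) / len(rows) for key in ("cpu_s", "real_s")}
        for threads, rows in grouped.items()
    }


means = mean_by_threads(RUNS)
base = means[4]
for threads, m in sorted(means.items()):
    cpu_ratio = m["cpu_s"] / base["cpu_s"]
    real_ratio = m["real_s"] / base["real_s"]
    print(f"{threads:>2} threads: CPU time x{cpu_ratio:.2f}, real time x{real_ratio:.2f} vs. 4 threads")
```

If the task scaled ideally, the CPU-time ratio would stay near 1 and the real-time ratio would fall as threads are added; the numbers above do the opposite.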
The sound file I process is 10 MB; whisper.cpp uses ≈ 3.6 GB of memory on a computer with 32 GB (currently about 10 GB unused, memory pressure is "green").

----

Edit: corresponding values using only one thread (`-t 1`):

    1619.90s user 20.86s system 197% cpu 13:48.78 total

Note that one thread used almost 200 % CPU. I'm not sure I understand that. But 13 minutes of real time makes sense.

Edit 2: adding more CPUs (`-p`) made the performance worse.

Settings | output from time | Comment
-|-|-
-t 6 -p 3 | 6804.14s user 38.58s system 1040% cpu 10:57.84 total | twice as much real time, 3-4 times more CPU time
-t 8 -p 2 | 10573.58s user 57.47s system 1018% cpu 17:23.63 total | more than 3 times as much real time and 6 times as much CPU time
-t 4 -p 2 | 2962.38s user 28.65s system 854% cpu 5:50.01 total | approximately the same real time as with -t 4

I think `-p` should only be used if you want to limit how much this task affects the computer. Otherwise, it will just use as many processors as it can.

I don't think it is I/O. It reads 3.08 GB in the first 5-10 seconds and then less than 10 MB for the rest of the run (which lasts at least 5 minutes).

Edit 3: using `-t 13`, that is, one more thread than my CPU supports, generates very odd results:

    93213.70s user 450.23s system 978% cpu 2:39:36.88 total

No, I am not joking: more than 50x as much CPU time as `-t 4`, while CPU usage is almost twice as high (978 % vs 564 %) and real time increased more than 30x. Compared with `-t 12`, CPU time increased by more than 20x, CPU usage is approximately the same, and real time also increased by more than 20x. Just by adding ONE more thread. Something is iffy here, isn't it?
**Edit 4:** Selected benchmark data

    ./bench -m ./models/ggml-small.en.bin -t 4

    system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

    whisper_print_timings: load time = 540.82 ms
    whisper_print_timings: encode time = 3490.52 ms
    whisper_print_timings: total time = 4031.40 ms

**5 threads are ≈ 8 % faster than 4 threads:**

    whisper_print_timings: load time = 547.27 ms
    whisper_print_timings: encode time = 3193.27 ms
    whisper_print_timings: total time = 3740.58 ms

**6 threads are less than 1 % slower than 5 threads:**

    whisper_print_timings: load time = 591.16 ms
    whisper_print_timings: encode time = 3158.88 ms
    whisper_print_timings: total time = 3750.10 ms

7 threads are 15 % slower than 6 threads, and it is downhill from there. I guess this task only benefits from the 6 "real" cores I have, not the hyper-threaded ones. In theory, 6 threads should be faster than 5, but I guess the computer performs other tasks that interrupt one of the threads and occupy one core from time to time while the benchmark runs.

**Edit 5:** Running the benchmark with a nice value of -20 gave some interesting results (just listing the total time here):

Threads | Total time (ms) | ∆ (negative is better)
-|-|-
4 | 3512 | -13%
5 | 3510 | -6%
6 | 3251 !! | -13%
7 | 3962 | -8%

∆ compares each run with the same number of threads at normal priority. 6 threads with high priority are 19 % faster than the default setting (4 threads) with normal priority.
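The percentages in Edits 4 and 5 can be rechecked with a small Python sketch. It pulls the `total time` value out of the `whisper_print_timings` lines quoted above and computes the relative change in run time. The function names and the regular expression are made up for this illustration and are not part of whisper.cpp.

```python
import re

# Matches lines such as "whisper_print_timings: total time = 4031.40 ms"
TOTAL_RE = re.compile(r"total time\s*=\s*([0-9.]+)\s*ms")


def total_ms(bench_output):
    """Extract the total time in milliseconds from bench output text."""
    match = TOTAL_RE.search(bench_output)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1))


def percent_change(new_ms, old_ms):
    """Relative change in run time; negative means the new run is faster."""
    return 100 * (new_ms - old_ms) / old_ms


# Totals quoted in Edit 4 (normal priority)
t4 = total_ms("whisper_print_timings: total time = 4031.40 ms")
t5 = total_ms("whisper_print_timings: total time = 3740.58 ms")
t6 = total_ms("whisper_print_timings: total time = 3750.10 ms")

print(f"5 vs 4 threads: {percent_change(t5, t4):+.1f} %")
print(f"6 vs 5 threads: {percent_change(t6, t5):+.1f} %")
# 6 threads at nice -20 (3251 ms, from Edit 5) vs the default 4 threads at normal priority
print(f"6 threads, nice -20 vs default: {percent_change(3251, t4):+.1f} %")
```

Note that a 7.2 % drop in run time (5 vs 4 threads) corresponds to roughly 7.8 % more work per unit of time, which is where the "≈ 8 % faster" figure comes from.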
Asked by d-b (2047 rep)
Feb 10, 2023, 04:49 PM
Last activity: Feb 14, 2023, 04:52 PM