Why does higher CPU-usage slow down a task?
I am using whisper.cpp to transcribe some sound files. It is a very CPU-heavy process, so I am trying to find optimal settings. I have run some tests with the thread setting (-t), but the results are super confusing. This is the command I execute:
date; time ./main -t [number of threads] -m ggml-model.bin -f 5min-16kHz.wav; date
I run this on a MacBook Pro with an Intel i7 with 6 cores (+ 6 hyperthread cores).
I have tried the default setting (4 threads), 6 threads and 12 threads (and 14 threads, but that didn't produce any output although all CPUs ran at 100 %). Here are the results:
Threads | output from time
-|-
4 | 1750.84s user 11.02s system 564% cpu 5:11.87 total
4 | 1862.04s user 18.63s system 553% cpu 5:39.58 total
6 | 2199.42s user 16.79s system 720% cpu 5:07.51 total
6 | 2212.72s user 14.49s system 722% cpu 5:08.22 total
12 | 4595.03s user 22.21s system 1053% cpu 7:18.47 total
12 | 4298.11s user 22.53s system 1059% cpu 6:47.85 total
As you can see, the CPU load increases as I increase the number of threads. You would expect the real time to decrease proportionally to the increase in CPU load (100 % for a minute should, approximately, correspond to 200 % for half a minute and 50 % for two minutes) but that doesn't happen here.
Instead, I get approximately the same real time with 4 and 6 threads, while the CPU time increases by ≈ 25 % when running 6 threads. And 12 threads is even worse: the CPU time doubles compared to 6 threads and the real time increases by 40 %.
I don't understand this. Of course, more threads don't scale linearly but **CPU time should remain quite constant when performing the same task, independently of how many threads, shouldn't it? And real time should decrease when CPU-usage increases?**
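To make the comparison concrete, here is a minimal sketch that parses the `time` lines from the table above and prints CPU seconds (user + system), real seconds, and their ratio for each thread count. The helper names (`parse`, `summarize`) are made up for illustration, and averaging the two runs per thread count is an assumption about how to summarize them.

```python
import re

# zsh `time` lines copied from the table above, keyed by thread count (-t).
RUNS = {
    4: ["1750.84s user 11.02s system 564% cpu 5:11.87 total",
        "1862.04s user 18.63s system 553% cpu 5:39.58 total"],
    6: ["2199.42s user 16.79s system 720% cpu 5:07.51 total",
        "2212.72s user 14.49s system 722% cpu 5:08.22 total"],
    12: ["4595.03s user 22.21s system 1053% cpu 7:18.47 total",
         "4298.11s user 22.53s system 1059% cpu 6:47.85 total"],
}

LINE = re.compile(r"([\d.]+)s user ([\d.]+)s system (\d+)% cpu ([\d:.]+) total")


def parse(line):
    """Return (cpu_seconds, real_seconds) for one `time` output line."""
    user, system, _pct, total = LINE.match(line).groups()
    real = 0.0
    for part in total.split(":"):  # handles m:ss.ss as well as h:mm:ss.ss
        real = real * 60 + float(part)
    return float(user) + float(system), real


def summarize(runs):
    """Average CPU time and real time over the runs of each thread count."""
    out = {}
    for threads, lines in runs.items():
        parsed = [parse(line) for line in lines]
        cpu = sum(p[0] for p in parsed) / len(parsed)
        real = sum(p[1] for p in parsed) / len(parsed)
        out[threads] = (cpu, real)
    return out


summary = summarize(RUNS)
for threads, (cpu, real) in summary.items():
    # cpu / real is what `time` reports as the "% cpu" column.
    print(f"-t {threads:2d}: cpu {cpu:7.1f} s, real {real:6.1f} s, "
          f"cpu/real {100 * cpu / real:5.0f} %")
```

The "% cpu" column is just CPU seconds divided by real seconds, so a higher percentage only means shorter real time if the total CPU seconds stay roughly constant, which is exactly what does not happen in these runs.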
**And considering the task and my hardware, what should be a reasonable setting for the number of threads to use?** I was expecting it to be the number of cores + a little extra in case a thread waits for I/O. The sound file I process is 10 MB, whisper.cpp uses ≈ 3,6 GB on a computer with 32 GB (currently about 10 GB unused, memory pressure is "green").
----
Edit: corresponding values using only one thread (-t 1):
1619.90s user 20.86s system 197% cpu 13:48.78 total
Note that one thread used almost 200 % CPU. Not sure I understand that. But 13 minutes real time makes sense.
Edit 2: adding more CPUs (-p) made the performance worse.
-t 6 -p 3
- 6804.14s user 38.58s system 1040% cpu 10:57.84 total
(twice as much real time, 3-4 times more CPU-time)
-t 8 -p 2
- 10573.58s user 57.47s system 1018% cpu 17:23.63 total
(more than 3 times as much real time and 6 times as much CPU-time)
-t 4 -p 2
- 2962.38s user 28.65s system 854% cpu 5:50.01 total
(approximately the same as with -t 4)
I think -p should only be used if you want to limit how much this task affects the computer. Otherwise, it will just use as many processors as it can.
I don't think it is I/O. It reads 3,08 GB in the first 5-10 seconds and then less than 10 MB for the rest of the run (that lasts at least 5 minutes).
Edit 3: using -t 13, that is, one more thread than my CPU supports, generates very odd results: 93213.70s user 450.23s system 978% cpu 2:39:36.88 total
No, I am not joking, more than 50x as much CPU-time as -t 4, while CPU-usage is almost twice as high (978 % vs 564 %) and real-time increased more than 30x.
If I compare with -t 12, CPU-time increased by more than 20x, CPU usage is approximately the same, and real-time also increased by more than 20x. By just adding ONE more thread.
Something is iffy here, isn't it?
**Edit 4:**
Selected benchmark data
./bench -m ./models/ggml-small.en.bin -t 4
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
whisper_print_timings: load time = 540.82 ms
whisper_print_timings: encode time = 3490.52 ms
whisper_print_timings: total time = 4031.40 ms
**5 threads are ≈ 8 % faster than 4 threads:**
whisper_print_timings: load time = 547.27 ms
whisper_print_timings: encode time = 3193.27 ms
whisper_print_timings: total time = 3740.58 ms
**6 threads are 1 % slower than 5 threads:**
whisper_print_timings: load time = 591.16 ms
whisper_print_timings: encode time = 3158.88 ms
whisper_print_timings: total time = 3750.10 ms
7 threads are 15 % slower than 6 threads, and it is downhill from there. I guess this task only uses the 6 "real" cores I have, not the hyperthreading cores. In theory, 6 threads should be faster than 5, but I guess the computer performs some other tasks that interrupt one of the threads and use one core from time to time while this benchmark is running.
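A sweep like the one above could be scripted instead of run by hand. The following is only a sketch: it assumes the `./bench` binary and model path from the command quoted earlier, and it assumes the `whisper_print_timings` lines may arrive on either stdout or stderr, so it searches both. `total_ms` and `run_bench` are hypothetical helper names.

```python
import re
import subprocess

TOTAL = re.compile(r"whisper_print_timings:\s+total time\s+=\s+([\d.]+)\s+ms")


def total_ms(output):
    """Extract the 'total time' value (in ms) from bench output text."""
    match = TOTAL.search(output)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1))


def run_bench(threads, model="./models/ggml-small.en.bin", bench="./bench"):
    """Run the bench binary once with -t `threads`; return total time in ms."""
    result = subprocess.run(
        [bench, "-m", model, "-t", str(threads)],
        capture_output=True, text=True, check=True,
    )
    # Assumption: timings may be on stdout or stderr, so search both.
    return total_ms(result.stdout + result.stderr)


# Parsing check against the 4-thread output quoted above.
SAMPLE = """\
whisper_print_timings: load time = 540.82 ms
whisper_print_timings: encode time = 3490.52 ms
whisper_print_timings: total time = 4031.40 ms
"""
print(total_ms(SAMPLE))
```

With the binary in the current directory, something like `for t in range(4, 8): print(t, run_bench(t))` would reproduce the comparison. Repeating each run a few times and taking the median would help separate real differences from noise caused by other processes.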
**Edit 5:**
Running the benchmark with a -20 nice value gave some interesting results (just listing the total time here)
Threads | Total time (ms) | ∆ (negative is better)
-|-|-
4 | 3512 | -13%
5 | 3510 | -6%
6 | 3251 !! | -13%
7 | 3962 | -8%
∆ is compared with the same number of threads with normal priority. 6 threads with high priority is 19 % faster than the default settings with normal priority.
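The ∆ values can be rechecked from the totals quoted above. This sketch only covers 4, 5 and 6 threads, because the normal-priority total for 7 threads is not listed in the post; `pct_change` is an illustrative helper name.

```python
# Total times in ms: normal priority from the benchmark output,
# high priority (nice -20) from the table above.
normal = {4: 4031.40, 5: 3740.58, 6: 3750.10}
high_priority = {4: 3512, 5: 3510, 6: 3251}


def pct_change(new, old):
    """Relative change from old to new in percent (negative means faster)."""
    return 100.0 * (new - old) / old


for threads in sorted(normal):
    delta = pct_change(high_priority[threads], normal[threads])
    print(f"{threads} threads: {delta:+.1f} %")

# Best high-priority run (6 threads) against the default (4 threads, normal).
best_vs_default = pct_change(high_priority[6], normal[4])
print(f"6 threads, nice -20 vs 4 threads, normal: {best_vs_default:+.1f} %")
```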
Asked by d-b
(2047 rep)
Feb 10, 2023, 04:49 PM
Last activity: Feb 14, 2023, 04:52 PM