Running `wc` with multiple files is an order of magnitude faster than running it file by file. For example:
```
> time git ls-files -z | xargs -0 wc -l > ../bach.x

real    0m2.765s
user    0m0.031s
sys     0m2.531s

> time git ls-files | xargs -I {} wc -l "{}" > ../onebyone.x

real    0m57.832s
user    0m0.156s
sys     0m3.031s
```
*(The repo contains ~10_000 files, so xargs runs `wc` a few times, not just once, but that's not material in this context.)*
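To sanity-check the batching, here is a quick sketch (assuming GNU xargs; the batch size depends on `ARG_MAX` and the filename lengths) that prints the size of each batch xargs forms, so the number of output lines is the number of `wc` invocations:

```sh
# Each spawned sh prints how many file names its batch contained;
# the line count of the output is the number of batches (= wc runs).
git ls-files -z | xargs -0 sh -c 'echo "$#"' sh | wc -l
```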
In my naivety I think that `wc` needs to open and process each file, so the speedup must come from multi-threading alone. However, I've read that there may be some extra file-system magic going on here.
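One experiment that might separate the two (a sketch; it assumes a `true` binary on the PATH, which ignores its arguments and opens no files): keep the one-process-per-file shape of the slow command but do no file I/O at all:

```sh
# Same per-file spawning pattern as the slow run, but no file access:
# if this is comparably slow, the cost is per-process startup,
# not file-system work inside wc.
time git ls-files | xargs -I {} true "{}" > /dev/null
```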
Is there file-system magic going on here, or is it all multi-threading, or is it something else?
----
### Startup penalty
Following up on @muru's comment, I can see that a) a single execution takes ~8ms and b) running `wc` in a loop scales linearly:
```
> time wc -l ../x.x > /dev/null

real    0m0.008s
user    0m0.000s
sys     0m0.016s

> time for run in {1..10}; do wc -l ../x.x; done > /dev/null

real    0m0.076s
user    0m0.000s
sys     0m0.000s

> time for run in {1..100}; do wc -l ../x.x; done > /dev/null

real    0m0.689s
user    0m0.000s
sys     0m0.063s
```
Since the multi-file run is much faster per file (*10_000f@3_000ms* => *1f@0.3ms*), there seems to be a (huge?) startup penalty for `wc` which is not related to actually counting the `\n`s. The loop timings fit that: ~7ms per run × 10_000 files ≈ 70s, roughly the 57.8s the one-by-one run took.
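For comparison, a loop over a do-nothing external command should expose that per-process cost directly (a sketch; `/usr/bin/true` is an assumption about where coreutils puts it, and the explicit path forces an exec rather than the shell builtin):

```sh
# Nothing is opened and nothing is counted, yet each iteration still
# pays the fork+exec price; compare against the wc -l loops above.
time for run in {1..100}; do /usr/bin/true; done
```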
Asked by tmaj (101 rep) on Mar 8, 2024, 06:27 AM
Last activity: Mar 8, 2024, 07:05 AM