I have a really large archive consisting of really small files, concatenated into a single text file with a "" delimiter. For smaller archives, I would split the archive using "" as a pattern, and then work on the resulting files.
However, in this archive there are on the order of a hundred million such files -- clearly too many to put them all into a single directory. I have created folders `aa`, `ab`, etc., so that I can move the files into them as they are created. However, I ran into issues. Things I've tried:
1) There is no option for `split` to execute a command on each resulting file, so I have to do it by hand.
2) Moving the files into the `aa` directory using `find . -name "xaa*" -exec mv {} aa \+` does not work, because `{}` is not at the end of the command (see the first sketch after this list).
3) The `-t` flag, which reverses the source and the destination, is not available in my version of Unix.
4) I had to pipe the output of `find` into `xargs` for it to work at all. However, this is too slow -- files are being created far faster than they are moved away.
5) I suspect that `xargs` processes fewer files at a time than `find -exec` with a `\+` would. I tried adding a `-R 6000` flag to run 6000 entries at a time, but I don't think it made a difference.
6) I decreased the priority of `split` to the lowest possible (second sketch after this list). There was no change in the amount of CPU it consumed, so it probably had no effect either.
7) I open up to seven command prompts for running the `mv` commands (the last four letters per command prompt) -- however, this is still not nearly enough. I would open more, but once the system gets to seven, the response is so slow that I have to stop the split. For example, the source archive got copied to a USB drive in the time it took an `ls -l | tail` command to return anything.
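
To make points 2-5 concrete, the commands look roughly like this (`PATTERN` and `archive.txt` are placeholders, and the `split -p` / `xargs -R` shown are the BSD-style forms; the exact invocations are approximate):

```
# Splitting on the delimiter pattern; BSD split -p starts a new chunk at
# each input line matching the pattern (PATTERN and archive.txt are
# placeholders for the real delimiter and archive):
split -p 'PATTERN' archive.txt &

# Point 2: with -exec ... +, find requires {} to be the last argument
# before the +, so this form is rejected:
#   find . -name 'xaa*' -exec mv {} aa +
#
# Point 3: GNU mv could take the destination first via -t, putting {} at
# the end, but -t is not available here:
#   find . -name 'xaa*' -exec mv -t aa {} +

# Points 4-5: the pipeline that does work, but too slowly; -I substitutes
# each file name into the mv command, and -R 6000 is the BSD xargs flag I
# added hoping to process 6000 entries at a time:
find . -name 'xaa*' | xargs -I {} -R 6000 mv {} aa
```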
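
For point 6, lowering the priority amounts to something like this (1234 stands for the PID of the running `split`):

```
# Start split at the lowest conventional CPU priority...
nice -n 19 split -p 'PATTERN' archive.txt &

# ...or lower the priority of an already-running split by its PID:
renice -n 19 -p 1234
```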
So what I've been doing is stopping the `split` at that point, waiting for the `mv` commands to catch up, and then restarting the `split`. After restarting, I use `find -exec rm {} \+` to delete the files I already have; this is a bit faster, so by the time it gets to the files I don't have yet, there are fewer files lying around.
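
The catch-up deletion looks roughly like this; `-maxdepth 1` is only there to keep the deletion out of the bucket directories, on the assumption that they sit next to the chunks (the `xaa*` pattern is again illustrative):

```
# split was restarted from the top, so chunks that were already moved into
# aa/, ab/, ... get re-created in the working directory; delete those again
# in batches (the + hands many names to each rm call), keeping find out of
# the bucket directories with -maxdepth 1:
find . -maxdepth 1 -name 'xaa*' -exec rm {} +
```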
The first such iteration lasted ~3 million files, the next one ~2 million, the next ~1.5 million. I am sure there must be a better way, though. Any ideas for what else to try?
Asked by Alex
(1220 rep)
Nov 10, 2023, 02:31 PM
Last activity: Nov 10, 2023, 03:17 PM