
Slow down a `split`

0 votes
1 answer
93 views
I have a really large archive of really small files, concatenated into a single text file with a "" delimiter. For smaller archives, I would split the archive using "" as a pattern (sketched below) and then work on the resulting files. However, this archive contains on the order of a hundred million such files -- clearly too many to put into a single directory. I have created folders `aa`, `ab`, etc., and I am trying to move the files into those directories as they are created. However, I ran into issues. Things I've tried:

1. `split` has no option to run a command on each resulting file, so I have to do the moving myself.
2. Moving the files into the `aa` directory with `find . -name "xaa*" -exec mv {} aa \+` does not work, because `{}` is not at the end of the command line.
3. The `-t` flag of `mv`, which lets you give the destination before the sources, is not available in my version of Unix.
4. I had to pipe the output of `find` into `xargs` to make it work (roughly as in the second sketch after this list). However, this is too slow -- files are being created much faster than they are moved away.
5. I suspect `xargs` processes fewer files at a time than `find ... -exec ... \+` does. I tried adding a `-R 6000` flag to handle 6000 entries at a time, but I don't think it made a difference.
6. I decreased the priority of `split` to the lowest possible. There was no change in the amount of CPU it consumed, so probably no effect either.
7. I opened up to seven command prompts to run the `mv` commands (the last four letters per prompt) -- but this is still not nearly enough. I would open more, but once the system gets to seven, the response is so slow that I have to stop the `split`. For example, the source archive got copied to a USB drive entirely in the time I spent waiting for an `ls -l | tail` command to return anything.

So what I've been doing is stopping the `split` at that point, waiting for the `mv` commands to catch up, and then restarting the `split`. When it restarts, I use `find -exec rm {} \+` to delete the files I already have; that is a bit faster, so by the time `split` gets to the files I don't have yet, there are fewer files lying around. The first such iteration lasted ~3 million files, the next ~2 million, the next ~1.5 million. I am sure there must be a better way, though. Any ideas for what else to try?
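
To make the setup concrete, the splitting step looks roughly like this (a minimal sketch only: the archive name and the delimiter pattern are placeholders, since I've left the real delimiter out above, and `-p` here is the BSD `split` pattern option, which may not match my exact invocation):

```
# Sketch of the splitting step; 'DELIM' and archive.txt are placeholders.
# With -p, BSD split(1) starts a new output file whenever an input line
# matches the pattern, producing files named xaa, xab, xac, ...
split -p 'DELIM' archive.txt
```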
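
The sweep that moves the split output into the per-prefix directories is roughly the following (again only a sketch: the `xaa` prefix, the `aa/` target, the batch size of 6000 and the sleep interval are illustrative, and I show the portable `xargs -n` batching rather than the `-R` flag from item 5):

```
# Sketch of the batched move from items 4-5; names and sizes are illustrative.
while :; do
    # -maxdepth 1: look only at top-level split output, not inside aa/, ab/, ...
    find . -maxdepth 1 -name 'xaa*' -print |
        xargs -n 6000 sh -c 'mv "$@" aa/' sh
    sleep 5   # give split a moment before the next sweep
done
```

Each of the seven prompts runs something along these lines for its own set of prefixes, and together they still cannot keep up with `split`.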
Asked by Alex (1220 rep)
Nov 10, 2023, 02:31 PM
Last activity: Nov 10, 2023, 03:17 PM