
Is it possible to set larger buffers for file access on Linux?

0 votes
1 answer
107 views
I have a process that's reading the whole filesystem and hashing files. It slows down (by 4x or so) because the reads are causing a lot of seeking. Small chunks of each file are being read by each of 4 threads, but if I test a sequential read by copying (cp) I can read far faster. CPU utilisation is at 25%, so it's not CPU bound; I am fairly sure seeking is the problem.

I've read that the kernel has quite sophisticated disk reading strategies to speed up access, so I wonder if kernel buffers are restricting their use here, and whether they could be increased to allow more buffering. I assume the program I am using is only requesting a fairly small chunk of data with each read call, so I don't know if this would be effective. I imagine reading each file fully into memory, one by one, would be most effective, but I can't rewrite the application at the moment (it's not mine, and it's large and bloated imo). But could I get the OS to read each file entirely (or even partially, say 100-500MB at a time) into buffers, sequentially, as it is opened, so that the application threads' small reads are served from memory rather than from disk (causing a seek)?

ADDED LATER: @Artem, the cache does not seem to do the job here, and I guess I can understand why. The kernel is trying to be 'sensible' and saying "I'm not going to read a whole 500MB file into memory just because the user has requested the first MB". Which makes sense. What is loaded in will indeed be cached, so if it's used again (for example by another process) it can be fetched from memory. But what I want is for the kernel to load the whole file into cache on the first read (that first read being what, 2MB maybe?).

So the system call is read(fd, buf, size). If I were programming in C I would never pass a huge buffer as size, and I doubt many programmers would. So the application was probably written using a more normal sort of buffer size, a meg or two. The user process gets a MB or two and enters the hashing function, which keeps it busy for a while, so it stops pestering the kernel for disk reads. Meanwhile there's a read queued by a different thread for a different part of the disk, so the kernel services that now, and the disk seeks to a different position, taking ~15ms.

What's a shame is that files are generally held in quite large extents of sequential blocks on disk. A sustained read of that first file would probably have covered hundreds of thousands, even a million blocks (tens or hundreds of MB) without any seeking. That is high-performance disk reading, and it's what I want to encourage. But the way things are working, the threads are requesting small chunks of data, the kernel is trying to be sensible by not reading massive amounts of data that no one has asked for (which would hold up waiting processes), and as a result it's seeking around like mad and spending all its time seeking.

Contrast this with 'cp -r': only one thread is asking the kernel to read files, so nothing is telling the disk head to seek to a different part of the disk every MB or two, and when subsequent reads come in, the drive is in a position to get the data quickly.

The code could be rewritten with much larger buffers, so that's one option for me. But as I say, I was wondering if it is possible to instruct the kernel to buffer 'ahead' much more. Kind of like 'read-ahead caching'.
That is: predict that files, once opened, are going to be read in their entirety, and fill the kernel buffers with at least n bytes of each file before stopping the physical disk read for that file.
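For illustration, one way to get roughly this behaviour without touching the application's source is an LD_PRELOAD shim that intercepts open() and issues kernel read-ahead hints on every file the program opens for reading. This is a minimal sketch, not part of the original question, and the file and program names are placeholders. It relies on two documented posix_fadvise(2) hints: POSIX_FADV_SEQUENTIAL, which on Linux enlarges the read-ahead window for the descriptor, and POSIX_FADV_WILLNEED, which initiates a non-blocking read of the given range into the page cache (the kernel may trim the amount under memory pressure).

    /* fadvise_shim.c - hypothetical LD_PRELOAD shim, a sketch only.
     * Intercepts open() and asks the kernel to prefetch each file
     * opened read-only, so the application's small read() calls are
     * served from the page cache instead of triggering disk seeks. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <sys/stat.h>

    typedef int (*open_fn)(const char *, int, ...);

    int open(const char *path, int flags, ...)
    {
        static open_fn real_open;
        if (!real_open)
            real_open = (open_fn)dlsym(RTLD_NEXT, "open");

        /* open() takes a third argument only when O_CREAT is set. */
        mode_t mode = 0;
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        int fd = real_open(path, flags, mode);

        if (fd >= 0 && (flags & O_ACCMODE) == O_RDONLY) {
            struct stat st;
            if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode)) {
                /* Enlarge the read-ahead window for this descriptor... */
                posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
                /* ...and start a non-blocking prefetch of the whole file
                 * (len 0 means "to end of file"). */
                posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
            }
        }
        return fd;
    }

Built and used with something like 'gcc -shared -fPIC -o fadvise_shim.so fadvise_shim.c -ldl' and 'LD_PRELOAD=./fadvise_shim.so ./hasher' (names hypothetical). A real shim would likely also need to wrap open64() and openat(), which glibc's stdio tends to use. With four threads prefetching at once the requests can still interleave, but each WILLNEED hands the I/O scheduler a long sequential run instead of a 1-2MB hop, which is the behaviour the question is after.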
Asked by Pete (153 rep)
Apr 7, 2024, 10:00 AM
Last activity: Apr 8, 2024, 09:27 AM