
Does write(fd with O_SYNC) only flush data of THAT fd instead of all caches caused by other fds of same file?

0 votes
1 answer
464 views
I was using the dd command to change a single byte of a block device (the whole-disk device, not a partition), such as /dev/nvme0n1, at a specific position (one not managed by a regular file).
dd of=${DEV:?DEV} seek=${POS:?POS} bs=1 count=1 oflag=seek_bytes conv=notrunc status=none
I ran into a problem with the sync command: on some machines it hangs or takes too long to finish. It seems that sync flushes the caches of all files, which is obviously slow and may even hang because of inconsistent kernel state. It is especially slow when several big VMs are running on the host, sometimes taking 30 minutes. So I started to think that I should not call sync directly; instead I should tell dd to sync only the part it touched, using oflag=sync, like this:
dd of=${DEV:?DEV} seek=${POS:?POS} bs=1 count=1 oflag=sync,seek_bytes conv=notrunc status=none
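Before pointing this at a real device, the syncing variants can be tried on a scratch file. The following is only a minimal sketch, assuming GNU dd (for oflag=sync and oflag=seek_bytes); SCRATCH and RESULT are hypothetical names, and the second command uses conv=fsync, another dd option that syncs the output, for comparison.

```shell
# Scratch file standing in for the block device (hypothetical; not a real disk).
SCRATCH=$(mktemp)
printf 'AAAAAAAA' > "$SCRATCH"
POS=3

# oflag=sync: the output is opened with O_SYNC, so the write itself is synchronous.
printf 'x' | dd of="$SCRATCH" seek="$POS" bs=1 count=1 \
    oflag=sync,seek_bytes conv=notrunc status=none

# conv=fsync: an ordinary buffered write, then one fsync before dd exits.
printf 'y' | dd of="$SCRATCH" seek=$((POS + 1)) bs=1 count=1 \
    oflag=seek_bytes conv=notrunc,fsync status=none

RESULT=$(cat "$SCRATCH")
echo "$RESULT"
rm -f "$SCRATCH"
```

Both variants change exactly one byte and leave the rest of the file alone, so the script prints AAAxyAAA. It says nothing yet about what else gets flushed, which is the actual question.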
Since the difference between oflag=direct, oflag=sync, and conv=fsync is not obvious, I dived into the source of dd. It turns out that:

- oflag=sync opens the output file with the O_SYNC flag, so each write syscall behaves as if it were followed by fsync(fd).
- conv=fsync makes dd issue an additional fsync syscall on the output file after the data has been written, just before dd finishes.
- oflag=direct requires the block size to be a multiple of 512 etc. In my case the write is just 1 byte, so dd turns the flag off and switches to conv=fsync.

All seems good, but I am not sure about one thing:

### If the output file /dev/nvme0n1 has many files cached by Linux, will my dd command end up syncing all of them?

(I actually just want dd to sync the 1 byte to the device, not other contents.)

I checked the kernel source and guess that write(fd with O_SYNC flag) eventually calls vfs_fsync_range at [fs/sync.c#L180](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/fs/sync.c#L180) (at least this is what the fsync syscall eventually calls):
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct inode *inode = file->f_mapping->host;

	if (!file->f_op->fsync)
		return -EINVAL;
	if (!datasync && (inode->i_state & I_DIRTY_TIME))
		mark_inode_dirty_sync(inode);
	return file->f_op->fsync(file, start, end, datasync);
}
but then I was stuck at
file->f_op->fsync(file, start, end, datasync)
I am not sure how the file system driver handles fsync, or whether it touches caches created through other fds; it is not obvious. I will continue checking the kernel source and append an EDIT later.

EDIT: I am almost sure that vfs_fsync_range is what the write syscall eventually calls. The call stack looks like this:

- [fs/read_write.c#L649](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/fs/read_write.c#L649)
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}
- [fs/read_write.c#L637](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/fs/read_write.c#L637)
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
...
		ret = vfs_write(f.file, buf, count, ppos);
}
- [fs/read_write.c#L584](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/fs/read_write.c#L584)
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
...
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
...
}
- [block/fops.c#L551](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/block/fops.c#L551)
static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
...
	ret = __generic_file_write_iter(iocb, from);
	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
...
}
- [include/linux/fs.h](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/include/linux/fs.h#L2466)
static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb_is_dsync(iocb)) {
		int ret = vfs_fsync_range(iocb->ki_filp,
				iocb->ki_pos - count, iocb->ki_pos - 1,
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
	}

	return count;
}
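To make the range arithmetic in generic_write_sync concrete, here is a small worked example in shell arithmetic. It assumes that ki_pos has already been advanced past the written bytes when generic_write_sync runs, which is what the `ki_pos - count` expression suggests. POS, COUNT and the 4 KiB page size are illustrative values, not taken from the kernel.

```shell
POS=1000000      # byte offset given to dd via seek= (illustrative)
COUNT=1          # bytes written by the single write()
PAGE_SIZE=4096   # assumed page size

# Assumption: the file position after the write is POS + COUNT.
KI_POS=$((POS + COUNT))

# Same expressions as the vfs_fsync_range() arguments above.
START=$((KI_POS - COUNT))
END=$((KI_POS - 1))

# Page-cache pages that contain this byte range.
FIRST_PAGE=$((START / PAGE_SIZE))
LAST_PAGE=$((END / PAGE_SIZE))

echo "byte range: $START..$END"
echo "page range: $FIRST_PAGE..$LAST_PAGE"
```

For a 1-byte write, the range handed to vfs_fsync_range is exactly that one byte (1000000..1000000 here), which falls in a single page (244) under the assumed page size.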
To be continued...

- [block/fops.c#L451](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/block/fops.c#L451)
static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
		int datasync)
{
	struct block_device *bdev = filp->private_data;
	int error;

	error = file_write_and_wait_range(filp, start, end);
	if (error)
		return error;

	/*
	 * There is no need to serialise calls to blkdev_issue_flush with
	 * i_mutex and doing so causes performance issues with concurrent
	 * O_SYNC writers to a block device.
	 */
	error = blkdev_issue_flush(bdev);
	if (error == -EOPNOTSUPP)
		error = 0;

	return error;
}
### It should be the above blkdev_fsync doing the sync work.

From this function on, it becomes hard to analyze. I hope some kernel developers can help me. The function further calls into [mm/filemap.c](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/mm/filemap.c) and [block/blk-flush.c](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/block/blk-flush.c#L459); I hope this helps.

I will do a test, but a test alone cannot make me confident, which is why I came here to ask this question.

Tested, but since the sync command itself also finished quickly, I cannot tell whether dd oflag=sync is safer than the sync command.

EDIT:

### I have managed to confirm that dd oflag=sync is safer and quicker than the sync command, so I believe the answer to this question is yes.

> Does write(fd with O_SYNC) only flush data of THAT fd instead of all caches caused by other fds of same file?

YES.

The test goes like this:

- Repeatedly create big files with random data:
for i in {1..10}; do echo $i; dd if=/dev/random of=tmp$i.dd count=$((10*1024*1024*1024/512)); done
- In another terminal, run sync to confirm that it is very slow, as if it hangs. Interrupt the sync command.
- Create a test file and get its physical LBA:
echo test > z
DEV=$(df . | grep /dev |awk '{print $1}')
BIG_LBA=$(sudo debugfs -R "stat $PWD/z" $DEV | grep -F '(0)' | awk -F: '{print $2}')
- In another terminal, run the dd command and confirm that it is very fast:
dd of=${DEV:?DEV} seek=$((BIG_LBA*8*512)) bs=1 count=1 oflag=sync,seek_bytes conv=notrunc status=none <<<"x"
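One way to make this test depend less on feel is to record the kernel's global dirty-page counter around the targeted write. The sketch below reads the Dirty: field of /proc/meminfo and writes to a scratch file rather than the real device. dirty_kb, SCRATCH and BYTE are hypothetical names, and it assumes Linux with GNU dd.

```shell
# Print the system-wide amount of dirty page cache in kB (Linux only).
dirty_kb() {
    awk '/^Dirty:/ {print $2}' /proc/meminfo
}

SCRATCH=$(mktemp)          # stand-in for the device (hypothetical)
printf 'AAAAAAAA' > "$SCRATCH"

BEFORE=$(dirty_kb)
# Same kind of targeted, synchronous 1-byte write as in the test above.
printf 'x' | dd of="$SCRATCH" seek=3 bs=1 count=1 \
    oflag=sync,seek_bytes conv=notrunc status=none
AFTER=$(dirty_kb)

echo "Dirty before: ${BEFORE} kB, after: ${AFTER} kB"

# Read the byte back to check that the write landed where intended.
BYTE=$(dd if="$SCRATCH" skip=3 bs=1 count=1 status=none)
rm -f "$SCRATCH"
```

If a lot of dirty data is pending (for example while the big tmp files are still being written) and the Dirty value is still large when dd returns, that suggests the write did not wait for the unrelated data. This is only an observation aid, not a proof from the source code.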
But I still hope someone can point out where in the source code I can confirm this answer.
Asked by osexp2000 (622 rep)
May 10, 2023, 11:32 AM
Last activity: May 10, 2023, 03:55 PM