The experiment is on Linux, x86 32-bit.
Suppose that in my assembly program I need to periodically (for instance, every time after executing 100000 basic blocks) dump an array in the .bss section from memory to disk. The starting address and the size of the array are fixed. The array records the addresses of executed basic blocks, and its size is 16M right now.
I tried to write some native code to memcpy from the .bss section to the stack, and then write it to disk. But this seems tedious to me, and I am worried about performance and memory consumption, say, allocating a very large chunk of memory on the stack every time...
So here is my question: how can I dump memory from the global data sections in an efficient way? Am I clear enough?
First of all, don't write this part of your code in asm, especially not at first. Write a C function to handle this part, and call it from asm. If you need to perf-tune the part that only runs when it's time to dump another 16MiB, you can hand-tune it then. System-level programming is all about checking error returns from system calls (or C stdio functions), and doing that in asm would be painful.
Obviously you can write anything in asm, since making system calls isn't anything special compared to C. And there's no part of any of this that's easier in asm compared to C, except maybe throwing in an mfence around some locking.
Anyway, I've addressed three variations on what exactly you want to happen with your buffer:
- Overwrite the same buffer in place (mmap(2) / msync(2)).
- Append a snapshot of the buffer to a file (with either write(2) or a probably-not-working zero-copy vmsplice(2) + splice(2) idea).
- Start a new (zeroed) buffer after writing the old one: mmap(2) sequential chunks of your output file.
In-place overwrites
If you just want to overwrite the same area of disk every time, mmap(2) a file and use that as your array. (Call msync(2) periodically to force the data to disk.) The mmapped method won't guarantee a consistent state for the file, though. Writes can get flushed to disk at times other than on request. IDK if there's a way to avoid that with any kind of guarantee (i.e. not just choosing buffer-flush timers and so on so that your pages usually don't get written except by msync(2)).
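A minimal sketch of the mmap(2) / msync(2) approach, assuming a 16MiB array and a caller-chosen file path. The names `trace_map`, `trace_flush` and `TRACE_BYTES` are made up for illustration, and locking is left out:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define TRACE_BYTES (16u * 1024 * 1024)  /* 16MiB, the size from the question */

/* Map `path` (any file name the caller picks) as a shared, file-backed
   array of `len` bytes. Returns the mapping, or NULL on error. */
static void *trace_map(const char *path, size_t len)
{
    assert(len > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    /* The file must be at least `len` bytes long to back the mapping. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Block until the dirty pages of the mapping have been written back.
   Returns 0 on success, -1 on error (errno set by msync). */
static int trace_flush(void *buf, size_t len)
{
    return msync(buf, len, MS_SYNC);
}
```

The asm side would then store block addresses through the pointer returned by `trace_map` instead of into .bss, and call `trace_flush` every N basic blocks. As noted above, the kernel may still write pages back between flushes.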
Append snapshots
The simple way to append a buffer to a file is to simply call write(2) when you want it written. write(2) does everything you need. If your program is multi-threaded, you might need to take a lock on the data before the system call, and release the lock afterwards. I'm not sure how fast the write system call would return. It may only return after the kernel has copied your data to the page cache.
If you just need a snapshot, and all writes into the buffer are atomic transactions (i.e. the buffer is always in a consistent state, rather than holding pairs of values that need to be consistent with each other), then you don't need to take a lock before calling write(2). There will be a tiny amount of bias in this case (data at the end of the buffer will be from a slightly later time than data at the start, assuming the kernel copies in order).
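Something like the following C helper could be what you call from asm. It is only a sketch: `write_all` and `append_snapshot` are names I made up, and it assumes no locking is needed. The loop is there because write(2) is allowed to write fewer bytes than requested.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all `len` bytes of `buf` to `fd`, retrying on short writes and EINTR.
   Returns 0 on success, -1 on error (errno set by write). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(fd >= 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Append one snapshot of the buffer to `path`. Opening per call keeps the
   sketch short; a long-running tracer would keep the descriptor open. */
static int append_snapshot(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    int rc = write_all(fd, buf, len);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

With the usual i386 cdecl convention, the asm caller pushes the three arguments right to left, does `call append_snapshot` (with the function made non-static), checks `eax` for -1, and pops the arguments itself.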
IDK if write(2) returns slower or faster with direct I/O (zero-copy, bypassing the page cache). open(2) your file with O_DIRECT, then write(2) normally.
There has to be a copy somewhere in the process if you want to write a snapshot of the buffer and then keep modifying it. Or else MMU copy-on-write trickery:
Zero-copy append snapshots
There is an API for doing zero-copy writes of user pages to disk files. Linux's vmsplice(2) and splice(2), in that order, would let you tell the kernel to map your pages into the page cache. Without SPLICE_F_GIFT, I assume it sets them up as copy-on-write. (Oops, actually the man page says that without SPLICE_F_GIFT, the following splice(2) will have to copy. So IDK if there is a mechanism to get copy-on-write semantics.)
Assuming there was a way to get copy-on-write semantics for your pages, until the kernel was done writing them to disk and could release them:
Further writes might need the kernel to memcpy one or two pages before the data hit disk, but that would save copying the whole buffer. The soft page faults and page-table manipulation overhead might not be worth it anyway, unless your data access pattern is very spatially localized over the short periods of time until the write hits disk and the to-be-written pages can be released. (I think an API that works this way doesn't exist, because there's no mechanism for getting the pages released right after they hit disk. Linux wants to take them over and keep them in the page cache.)
I haven't ever used vmsplice, so I might be getting some details wrong.
If there's a way to create a new copy-on-write mapping of the same memory, maybe by mmapping a new mapping of a scratch file (on a tmpfs filesystem, probably /dev/shm), that would get you snapshots without holding the lock for long. Then you can just pass the snapshot to write(2), and unmap it ASAP before too many copy-on-write page faults happen.
New buffer for every chunk
If it's OK to start with a zeroed buffer after every write, you could mmap(2) successive chunks of the file, so the data you generate is always already in the right place.
- (optional) fallocate(2) some space in your output file, to prevent fragmentation if your write pattern isn't sequential.
- mmap(2) your buffer to the first 16MiB of your output file.
- Run normally.
- When you want to move on to the next 16MiB:
  - Take a lock to prevent other threads from using the buffer.
  - munmap(2) your buffer.
  - mmap(2) the next 16MiB of the file to the same address, so you don't need to pass the new address around to writers. These pages will be pre-zeroed, as required by POSIX (can't have the kernel exposing memory).
  - Release the lock.
Possibly mmap(buf, 16MiB, ... MAP_FIXED, fd, new_offset) could replace the munmap / mmap pair. MAP_FIXED discards old mappings that it overlaps. I assume this doesn't mean that modifications to the file / shared memory are discarded, but rather that the actual mapping changes, even without a munmap.
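The chunk-by-chunk idea, using the MAP_FIXED variant, might look roughly like this. It is an untuned sketch: `map_chunk` and `CHUNK_BYTES` are hypothetical names, the locking from the steps above is left to the caller, and it grows the file with ftruncate(2) rather than fallocate(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES (16 * 1024 * 1024)  /* 16MiB per chunk */

/* Map chunk number `index` of the output file open on `fd`.
   If `addr` is NULL the kernel picks the address (first chunk).
   Otherwise MAP_FIXED replaces whatever is mapped at `addr`, so writers
   keep using the same pointer. Returns the mapping, or NULL on error.
   Assumes the file offset fits in off_t (on 32-bit x86, build with
   -D_FILE_OFFSET_BITS=64 for outputs beyond 2GiB). */
static void *map_chunk(int fd, void *addr, unsigned index)
{
    off_t off = (off_t)index * CHUNK_BYTES;
    assert(fd >= 0);
    /* Extend the file so the chunk exists; the new range reads as zeros. */
    if (ftruncate(fd, off + CHUNK_BYTES) < 0)
        return NULL;
    int flags = MAP_SHARED | (addr ? MAP_FIXED : 0);
    void *p = mmap(addr, CHUNK_BYTES, PROT_READ | PROT_WRITE, flags, fd, off);
    return p == MAP_FAILED ? NULL : p;
}
```

Usage would be `buf = map_chunk(fd, NULL, 0)` once at startup, then `map_chunk(fd, buf, ++n)` under the lock each time the buffer is full.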