warpme at o2.pl
Fri Feb 14 16:04:03 UTC 2014
On 14/02/14 10:24, Henk D. Schoneveld wrote:
> Maybe I’m totally wrong but could it be that this is the symptom of an underlying problem ?
> In the past before kernel 2.6.33 I didn’t have these problems, later on sometimes more or less. What I discovered then was that kswapd was using 100% of CPU because of waiting for I/O.
Ah - maybe it is related to Jens Axboe's work on the new, more
efficient writeback mechanism in Linux kernel 2.6.32?
That was per-backing-device writeback - so since 2.6.32, every
block device has its own flusher thread (replacing the old shared pdflush
pool), ensuring that dirty pages are periodically written to the
underlying storage device.
> There was plenty of memory, I even disabled swap, but the problem persisted. By disabling swap kswapd in theory has no function at all, nevertheless it ‘halted’ the system for several seconds. Going to a pre 2.6.33 kernel solved my problems. My conclusion FWIW is that kswapd also only is the messenger not the cause.
Right. OS "stalls" on systems with lots of RAM are a well-known problem.
AFAIK the issue is with the default pdflush (writeback) settings.
Looking at the defaults:
dirty_background_ratio (default 10):
Maximum percentage of active memory that can be filled with dirty pages
before pdflush begins to write the page cache back to mass storage.
This means the page cache can accumulate dirty data up to 10% of memory
before the flusher thread triggers writeback. So with 16G of RAM, as much
as 1.6G can be written in one step by the flusher thread working at top
system priority (and of course causing the famous "write hog").
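The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the function name is made up for illustration, and it assumes the ratio is applied to total RAM, whereas the kernel applies it to available (free plus reclaimable) memory, so the real threshold is somewhat lower.

```python
GIB = 1024 ** 3


def background_threshold_bytes(ram_bytes, ratio_percent=10):
    """Rough upper bound on dirty data before background writeback starts.

    Assumption: the ratio is taken against total RAM. The kernel uses
    available memory instead, so treat the result as an upper bound.
    """
    return ram_bytes * ratio_percent // 100


threshold = background_threshold_bytes(16 * GIB)
print(f"{threshold / GIB:.1f} GiB")  # prints "1.6 GiB"
```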
dirty_expire_centisecs (default 3000):
In hundredths of a second, how long data can be in the page cache before
it's considered expired and must be written at the next opportunity.
Note that this default is very long: a full 30 seconds. That means that
under normal circumstances, unless you write enough to trigger the other
pdflush method, Linux won't actually commit anything you write until 30
seconds later.
So data written to disk will sit in memory until either:
a) they're more than 30 seconds old, or
b) the dirty pages have consumed more than 10% of the active, working
memory.
Maybe a) explains JYA's observation that the reader thread sees data with
a 25 sec delay compared to the writer thread - assuming writeback to mass
storage is delayed by the default 30 sec?
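To check which values a given machine is actually running with, the tunables can be read straight from /proc/sys/vm. A minimal Python sketch follows; the helper name and the choice of tunables are mine, and a value that cannot be read is simply reported as None.

```python
from pathlib import Path

# Writeback-related tunables discussed in this thread.
TUNABLES = (
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_tunables(base="/proc/sys/vm"):
    """Return {tunable: int or None} read from the given directory."""
    values = {}
    for name in TUNABLES:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # missing, unreadable, or not a number
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"vm.{name} = {value}")
```

The same numbers are available from the shell via `sysctl vm.dirty_expire_centisecs` and friends.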
I haven't looked at the MythTV code, but some quick google-fu says:
"If you do need guarantees about the consistency of your data on disk or
the order in which it hits disk, there are several solutions: For
file-based I/O, you can pass O_SYNC to open(2) or use the fsync(2),
fdatasync(2), or sync_file_range(2) system calls. For mapped I/O, use
I wonder - are we using any of the above in the reader thread?
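For reference, this is roughly what forcing data out to disk looks like at the system-call level. It is a minimal Python sketch (the os module wraps the same open/write/fsync calls), not MythTV code; the function name is made up for illustration.

```python
import os


def write_durably(path, data):
    """Write data to path and block until the kernel has flushed it to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # write(2) may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # fsync(2): returns only after the file's dirty pages are written out.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Note that fsync() blocks the caller until the flush completes, so calling it on every write trades throughput for bounded writeback delay.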
When I had an old HDD with 512-byte sectors, the following settings let me
get zero "TFW(/myth/tv/8027_20140214090200.mpg:384): write(57528) cnt
38 total 2259196 -- took a long time, 1702 ms" warnings during tests with
16 concurrent HD streams on a single SATA HDD.
# The kernel flusher threads will periodically wake up and write `old' data
# out to disk. This tunable expresses the interval between those wakeups, in
# 100'ths of a second.
# Setting this to zero disables periodic writeback altogether.
# See https://bugzilla.kernel.org/show_bug.cgi?id=12309
# By default the kernel checks for dirty data every 5 sec.
# This setting is for smoothing writeback. Maybe 100 will be
# even better.
# echo 300 > /proc/sys/vm/dirty_writeback_centisecs
# Default is 500
vm.dirty_writeback_centisecs = 100
# Contains the amount of dirty memory at which the background kernel
# flusher threads will start writeback.
# Note: dirty_background_bytes is the counterpart of dirty_background_ratio.
# Only one of them may be specified at a time. When one sysctl is written it is
# immediately taken into account to evaluate the dirty memory limits and the
# other appears as 0 when read.
# Default is 0 (unset, so dirty_background_ratio applies)
vm.dirty_background_bytes = 102400
# This tunable is used to define when dirty data is old enough to be eligible
# for writeout by the kernel flusher threads. It is expressed in 100'ths
# of a second. Data which has been dirty in-memory for longer than this
# interval will be written out next time a flusher thread wakes up.
# Default is 3000
vm.dirty_expire_centisecs = 864000
# Contains the amount of dirty memory at which a process generating disk writes
# will itself start writeback.
# Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
# specified at a time. When one sysctl is written it is immediately taken into
# account to evaluate the dirty memory limits and the other appears as 0
# when read.
# Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
# value lower than this limit will be ignored and the old configuration
# will be retained.
# dirty_bytes = 16777216
# Contains, as a percentage of total available memory that contains free
# and reclaimable pages, the number of pages at which a process which is
# generating disk writes will itself start writing out dirty data.
# The total available memory is not equal to total system memory.
# Default is 20
vm.dirty_ratio = 2
# This control is used to define how aggressive the kernel will swap
# memory pages. Higher values will increase aggressiveness, lower values
# decrease the amount of swap.
# The default value is 60.
vm.swappiness = 0
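The centisecond and byte values above are easier to judge in familiar units. The small Python sketch below does the conversion for the three non-percentage settings; the helper name is made up for illustration.

```python
SETTINGS = {
    "vm.dirty_writeback_centisecs": 100,
    "vm.dirty_background_bytes": 102400,
    "vm.dirty_expire_centisecs": 864000,
}


def describe(name, value):
    """Render a sysctl setting with its value converted to seconds or KiB."""
    if name.endswith("_centisecs"):
        return f"{name} = {value} ({value / 100:g} s)"
    if name.endswith("_bytes"):
        return f"{name} = {value} ({value / 1024:g} KiB)"
    return f"{name} = {value}"


for name, value in SETTINGS.items():
    print(describe(name, value))
```

So the flusher wakes up every second, background writeback starts at 100 KiB of dirty data, and the expiry age is 8640 s (2.4 hours), which effectively leaves the size threshold as the only trigger.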
Now that I have moved to a 4k-sector HDD, the default kernel settings seem
to be OK. Honestly speaking, I don't believe in a correlation between sector
size and pdflush efficiency - so maybe it is pure coincidence between the
HDD change and good performance with the VM defaults. But anyway - you can
try to play with the above knobs...
BTW2: I would love to see this thread in the MythTV forums - so I can
read/reply anywhere via browser - instead of only in a mailer program :-p