[mythtv] mythth 32

Sun Jul 10 12:09:35 UTC 2022

On Sun, 10 Jul 2022 15:01:49 +0800, you wrote:

>Since I asked for opinion I'd not be so rude as to refute any, but some comments <smile>

>> The timings you posted for I/O are for reads. NVME disks are
>> extremely fast always for reads. Where they can be much slower is for
>> writes. That will happen (as on any flash memory) if they run out of
>> erased blocks and have to erase one before they can write. Recording
>> TV to an SSD causes large numbers of SSD blocks to be written to, so
>> it is possible that you are running out of blocks on the erased list.
>> Do you have the SSDs set up to do TRIM only, or have you set them to
>> use the discard option in fstab? If you are doing TRIM only, then
>> that is likely only happening once a day at the most (unless you have
>> changed that), so no blocks will be being erased until TRIM is run.
>> Set the discard option to make the system hand back erasable blocks as
>> soon as they no longer form part of a file, so that the SSD can erase
>> them immediately.
>
>If there are Ts available for writing, surely 'available' is moot?

No, we are talking about the underlying SSD operations here.  Once you
have written more data than the size of the SSD, it will have written
to all its data blocks at least once.  So after that, before it can
write to a block, it will have to erase a block.

In a flash device like an SSD, writes can only change 1 bits to 0
bits, not the other way around.  When a block of flash is erased, it
gets set to all 1 bits and then can be written to by changing all the
1s in that block to 0s wherever the data being written is a 0.  Erase
operations can only be done on a whole block, not at a bit or byte
level.  An erase block is typically much larger than a filesystem
block, from hundreds of kilobytes to several megabytes - it varies
from one SSD to another.  Each
flash block can only be erased a certain number of times before it
fails to erase correctly.  The SSD operating system keeps track of
which physical flash block is assigned to which logical flash block.
It remaps the physical blocks to logical blocks so that it can control
which blocks have been erased most often.  This is how it does wear
leveling.
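
To make the write and erase rules concrete, here is a toy Python model
of a single flash block.  It is only a sketch: the class and method
names are made up for illustration, the block is tiny, and real SSD
firmware is far more involved.

```python
ERASED = 0xFF  # an erased flash byte is all 1 bits


class FlashBlock:
    """Toy model of one flash erase block (hypothetical, for illustration)."""

    def __init__(self, size=16):
        self.data = bytearray([ERASED] * size)
        self.erase_count = 0  # wear leveling would track this per block

    def can_program(self, offset, new):
        """True if writing `new` at `offset` only needs 1 -> 0 changes."""
        old = self.data[offset:offset + len(new)]
        # Every bit that is 1 in the new data must already be 1 in the old.
        return all((o & n) == n for o, n in zip(old, new))

    def program(self, offset, new):
        """Write by clearing bits; a write can never set a 0 back to 1."""
        if not self.can_program(offset, new):
            raise ValueError("needs a 0 -> 1 change: erase the block first")
        for i, n in enumerate(new):
            self.data[offset + i] &= n

    def erase(self):
        """Erase always covers the whole block and adds to its wear."""
        self.data = bytearray([ERASED] * len(self.data))
        self.erase_count += 1
```

Writing 0xF0 and then 0x30 to the same byte works without an erase,
because only 1 bits are being cleared.  Writing 0x0F after 0xF0 does
not, because that would need 0 bits turned back into 1 bits.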

Whenever the operating system stops using a block of data at the
filesystem level, it needs to tell the SSD that it is no longer using
that address space and it can be erased.  That is the function of the
TRIM command that the OS sends to the SSD over its SATA or NVMe
command channel.  The SSD receives the TRIM commands and works out
which of its logical flash blocks are now fully unused by the
operating system.  Whenever it finds a full flash block that is now
unused, it queues it to be erased.  Erase operations are very slow (up
to seconds long) and take a lot of power.  Each of the groups of flash
blocks that make up the SSD can only erase one of its blocks at a time.
Once a flash block has been erased, it is put on the list of erased
blocks.

When the OS writes to a logical block on the SSD, either that logical
block already contains data and data in the block needs to be changed,
or the block is not in use and a new physical block needs to be
assigned to that logical block by the SSD operating system.  In the
latter case, the SSD just looks at its list of erased blocks and
assigns one and does the write to that block.  In the case of changing
data in an existing logical block, the SSD will check to see if it is
just able to write some more 1 bits down to 0 bits - if so, then it
will do that in the existing physical block.  If there are any bits
that need to change from 0 to 1, then it has to use an erased block
where the bits are all 1s.  So it gets an erased block from its list
of erased blocks, copies all the data from the old physical block to
RAM, writes the new data over the old data in RAM, then writes the
entire RAM block to the erased physical block it has just assigned for
this job.  When the flash write is complete, it moves the old physical
block to the list of blocks to be erased, and assigns the new physical
block to become that logical block.
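
The bookkeeping described above (the logical to physical mapping, the
erased list and the queue of blocks waiting to be erased) can be
sketched in Python as below.  This is a simplified toy, not real
firmware: all the names are hypothetical, it always writes a whole
logical block to a fresh erased block, and it leaves out the in-place
1 to 0 shortcut and the copy through RAM.

```python
from collections import deque


class ToySSD:
    """Toy flash translation layer; all names here are hypothetical."""

    def __init__(self, physical_blocks):
        self.mapping = {}        # logical block -> physical block
        self.contents = {}       # physical block -> data stored in it
        self.erased = deque(range(physical_blocks))  # ready to be written
        self.to_erase = deque()  # stale blocks waiting for a slow erase

    def write(self, logical, data):
        """Write a whole logical block into a fresh erased block."""
        if not self.erased:
            raise RuntimeError("no erased blocks: must wait for an erase")
        new_phys = self.erased.popleft()
        self.contents[new_phys] = data
        old_phys = self.mapping.get(logical)
        if old_phys is not None:
            self.to_erase.append(old_phys)  # the old copy is now stale
        self.mapping[logical] = new_phys

    def trim(self, logical):
        """OS says this logical block is unused: unmap it, queue the erase."""
        old_phys = self.mapping.pop(logical, None)
        if old_phys is not None:
            self.to_erase.append(old_phys)

    def background_erase(self):
        """Erase one queued block and put it on the erased list."""
        if self.to_erase:
            phys = self.to_erase.popleft()
            self.contents.pop(phys, None)
            self.erased.append(phys)
```

Note that in this model a block only ever gets onto the to_erase queue
by being overwritten or by being trimmed.  A block belonging to a
deleted file that has not been trimmed just stays mapped.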

The problem with TRIM operations is when they are done.  If you do not
have the discard option set for a filesystem, then the OS only sends
TRIM commands when it runs its fstrim program, typically once a day as
a cron or systemd job.

The way fstrim works is that it looks at all files on the filesystem
and works out what logical blocks (filesystem logical blocks here) are
not in use and sends TRIM commands to the SSD to say what those unused
areas are.  The SSD works out the mappings of those TRIM commands to
its logical flash blocks (which are different from the filesystem
logical blocks) and any logical flash blocks that are now unused by
the operating system are unmapped from physical blocks and their
corresponding physical blocks are queued to be erased.

When discard has been specified for a filesystem, any time the
operating system stops using a filesystem level logical block (when a
file gets deleted, for example), the operating system immediately
queues up a set of TRIM commands for all the filesystem logical blocks
to tell the SSD that those areas are no longer in use.  So shortly
after a file is deleted (or shortened), the unused filesystem logical
blocks will have had TRIM commands received by the SSD and the
corresponding physical flash blocks will be on the queue to be erased.
And several seconds after that the erases will be completed and those
physical blocks will be available on the SSD's erased list.  Without
discard, no TRIM commands will have been sent and all the
corresponding physical flash blocks on the SSD will still be marked as
in use.
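
If you want to see which of your filesystems are currently mounted
with discard, the mount options are listed in /proc/mounts on Linux.
A small Python sketch to pick them out might look like this.  The
function name is my own invention, and it assumes the usual
/proc/mounts layout of device, mount point, filesystem type and
comma-separated options.

```python
def discard_status(mounts_text):
    """Map each mount point to True if it is mounted with discard.

    `mounts_text` is text in the format of /proc/mounts:
    device, mount point, fs type, comma-separated options, dump, pass.
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # ext4 shows plain "discard"; some filesystems show a value,
        # for example "discard=async".
        status[mount_point] = any(
            opt == "discard" or opt.startswith("discard=")
            for opt in options
        )
    return status
```

To use it on a live system, pass it the contents of /proc/mounts, for
example discard_status(open("/proc/mounts").read()).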

So for the purposes of a worst case example, say you have a 1 Tbyte
SSD you are writing MythTV recordings to, and it has the minimum free
storage that MythTV leaves on a recording partition, around 20
Gibytes, and the rest of the partition is full.  MythTV wants to start
a new recording and deletes an old 2 Gibyte recording from the
partition to make room, so there are now 22 Gibytes of free space that
MythTV can see, so it starts the new recording.  However, fstrim has
not been run for a while, and only the 20 Gibytes of free space that
was there is actually erased on the SSD.  MythTV writes its new
recording happily, and keeps on writing as it is a cricket game that is
going to go on for hours.  Every so often, MythTV sees the free space
on the partition drop below its 20 Gibyte limit and deletes another
old recording.  However, fstrim has still not run, so while MythTV is
happily seeing 20 Gibytes or more of free space, the SSD is rapidly
using up its 20 Gibytes of erased flash blocks.  After MythTV has
written 20 Gibytes of new recording file, it has deleted 20 Gibytes of
recording space, but the SSD now no longer has any erased blocks
available.  The next time MythTV writes to that partition, the write
will fail.
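
The arithmetic of that example can be checked with a short Python
simulation.  It is a deliberately crude model under my own
assumptions: it works in whole Gibyte steps, treats erases as instant
once a TRIM has been sent, ignores any spare blocks the SSD keeps for
itself, and the function and parameter names are invented for the
purpose.

```python
def gib_written_before_erased_runs_out(erased_gib, delete_gib,
                                       min_free_gib, discard,
                                       limit_gib=1000):
    """Return GiB written before the SSD has no erased blocks left,
    or None if limit_gib is written without that ever happening."""
    fs_free = erased_gib      # free space as MythTV sees it
    ssd_erased = erased_gib   # space the SSD can actually write to
    written = 0
    while written < limit_gib:
        if fs_free < min_free_gib:
            fs_free += delete_gib         # MythTV deletes an old recording
            if discard:
                ssd_erased += delete_gib  # TRIM sent now, erase follows
        if ssd_erased == 0:
            return written
        # Write one more GiB of the new recording.
        fs_free -= 1
        ssd_erased -= 1
        written += 1
    return None
```

With 20 Gibytes erased, 2 Gibyte deletions and no discard, the erased
blocks are gone after 20 Gibytes have been written, even though the
filesystem still shows about 20 Gibytes free.  With discard the erased
list is topped up on every deletion and never runs out.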

So the rule for SSDs is that if you ever expect to write as much data
to the SSD between fstrim runs as there is free filesystem space when
fstrim last ran, then you need to have discard enabled.  Or you need
to run fstrim more often.  In my opinion, if you are doing heavy write
traffic to an SSD, you just need to use discard, as that way you will
not get into trouble.  In the past, discard was not used as having it
enabled slowed down the filesystem delete operations - the system did
not report the delete complete until the TRIMs had been sent to the
SSD.  I think that problem is long fixed, and the TRIMs are now just
queued to be sent in the background and the delete operation will
report complete as soon as the TRIMs have been queued.  And in any
case, the slight slowdown was not normally noticeable unless you were
doing a benchmark.  But for some reason, most of the advice on the net
still recommends only using fstrim and not discard.  And all the
distros still seem to be set up this way by default.
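
As a rough way of applying the rule above to your own setup, the
comparison can be written as a few lines of Python.  The function name
and parameters are hypothetical, and the bit rate you feed it is
something you would have to measure for your own recordings.

```python
def fstrim_interval_is_safe(write_rate_mbit_s, hours_between_fstrim,
                            free_gib):
    """True if the data written between fstrim runs stays below the
    free space that was available when fstrim last ran."""
    bytes_per_hour = write_rate_mbit_s * 1_000_000 / 8 * 3600
    written_gib = bytes_per_hour * hours_between_fstrim / 2**30
    return written_gib < free_gib
```

As a made-up example, a steady 15 Mbit/s written for 24 hours is about
150 Gibytes, so with 20 Gibytes free a daily fstrim would not be
enough by this rule, while the same rate with fstrim every 2 hours
(about 12.6 Gibytes) would be.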