[mythtv-users] Slightly OT: Powering up remotely

Thu Sep 24 06:33:21 UTC 2020

On Wed, 23 Sep 2020 16:41:54 -0400, you wrote:

>Christopher: Yes, the suspense is killing me. The one or two times this
>happened over the last nine years it was always some fsck/error type
>message for the recording drives (two internal and an external drive I send
>some stuff to after recording). It had never been the OS drive. So it's
>sorta frustrating that the boot process will hang (and ssh not start)
>because of something happening to a data drive. You'd think it would let
>you get to the point where you could ssh in and then say "Here's a problem
>with your system. I can't continue past this." It'll be a while before I
>see it and if it's just a "hit y" wait a few minutes and bingo, I'll yell.

The problem where the system does not boot when fsck happens used to
be common on older versions of Ubuntu using systemd.  For a while now
on 18.04 I have not seen it, but I have not been particularly testing
for it.  There may have been a systemd update that fixes it, or makes
it less likely.  What I suspect is happening is that the fscks take a
long time, which causes other things that systemd has waiting to start
to exceed the timeout after which they get started anyway, even if the
preconditions for them starting have not been met.  So they start up
and then fail.  In the case of ssh, networking may not be up when it
gets started, or systemd may find that it is unable to start the
normal system (multiuser mode) and sshd may only be started when
multiuser mode is started.  Systemd or the configuration of these
systemd units should be fixed to prevent this, and may have been at
least somewhat.

However, if you are relying on the system to do fscks after a crash,
power failure or failed shutdown, then you are living dangerously.
There are two distinct problems with relying on automatic fscks
happening on boot when partitions are marked as dirty.

First, the automatic fsck with the fix option on will only be run
once.  When manually running fsck in this situation, I frequently get
problems that are still present after telling fsck to fix everything.
My worst one so far was on my mother's MythTV box where it took 7 runs
of fsck before it ran without finding new errors to fix.  So if the
system automatic fsck does fix things, that does not mean the
partition is safe to be written to.  To be safe, you have to have fsck
run and not find any errors to fix.  The automatic fsck checks at boot
do not do this.  And when manually running fsck, there can be times
when it tells you that a file is damaged and can not be repaired - you
have to write down the file names and manually fix them later. Usually
the files are ones that are installed by packages and I can just copy
the same file from one of my other PCs running the same distro.

Secondly, there can be data errors caused by writes that never made it
to disk.  Fsck can have run without errors, but software
with complex use of data in its files can still have things in an
inconsistent or corrupt state.  It depends on each program as to how
it maintains consistency and an error free state.  The place I meet
this all the time is the mythconverg database.  If mythbackend is
recording at the time of a power failure or crash, the recordedseek
table is usually left in a corrupt state, as it is written to all the
time during a recording.  If you are unlucky, other database tables
may also have been in the middle of a write at the time of the crash
and can also be left in a corrupt state.  If you make sure to check and fix
all the database tables before running mythbackend again, there is
normally no problem as MySQL/MariaDB will be able to fix the tables.
However, if you run mythbackend without fixing the tables first, you
can cause writes to the database that will go wrong due to the
existing corruption and this can then cause further corruption that
makes the table unfixable.  With the recordedseek table, this is not
so bad as mythbackend can re-create it completely if told to do so (it
takes a long, long time if you have lots of recordings).  But with
most of the other tables, you will have no option but to restore a
backup copy of the database to fix this.  And if you did not notice
the problem for a while, you can find that all your existing backups
of the database are also corrupt and you can lose your entire
database.  I keep both daily and weekly backups to help with this
problem.
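
As a rough sketch of the "check and fix all the database tables before
running mythbackend" step, mysqlcheck (shipped with MySQL/MariaDB) can
check every table in a database and repair the ones that fail.  The
mythtv database user and the MYSQLCHECK, DB_NAME and DB_USER variables
are assumptions made for illustration, and this is not a replacement
for the optimize_mythdb script used further down:

```shell
#!/bin/bash
# Sketch only: check all mythconverg tables, repairing any that fail.
# MYSQLCHECK, DB_NAME and DB_USER are hypothetical knobs, not MythTV settings.
MYSQLCHECK=${MYSQLCHECK:-mysqlcheck}
DB_NAME=${DB_NAME:-mythconverg}
DB_USER=${DB_USER:-mythtv}

check_mythconverg() {
    # --check tests each table; --auto-repair repairs tables that fail.
    # -p makes mysqlcheck prompt for the database password.
    $MYSQLCHECK --check --auto-repair -u "$DB_USER" -p "$DB_NAME"
}
```

Calling check_mythconverg while mythbackend is still stopped would then
report one status line per table.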

So I recommend that MythTV users never, ever rely on the automatic
fscks done at boot to recover from a crash situation like this.  The
right way to recover is to manually boot to a safe state where nothing
is running except the system in read only mode and manually fix
things.  Using the recovery options on the grub menu to boot to a root
prompt might seem like the way to do this, but unfortunately that does
not allow you to run fsck on the boot partition.  So there are two
ways I do it.  One is to have a second boot partition on the same PC
that has the same or later version of the operating system on it.  It
is set up to boot only into its system and not mount any other
partitions.  So from the grub menu I can boot that partition, run apt
to update it to the same versions of packages as the main boot
partition, then use it to run fscks on all the other partitions,
including the normal boot partition.  I run fsck on all partitions and
on any that need fixes I re-run it until it says there are no more
errors.

I normally write myself a set of fsck repair scripts to do that so I
can just run them to get all the partitions being fscked in parallel.
There is one script (chk_all.sh) that runs all the other scripts (one
per drive) in a tabbed window:

root@mypvr:/mnt/ssd1/usr/local/bin# cat chk_all.sh
#!/bin/bash

# Check all partitions in parallel.

xfce4-terminal -H --title=rec1 --command "bash chk_rec1.sh" \
--tab -H --title=rec2 --command "bash chk_rec2.sh" \
--tab -H --title=rec3 --command "bash chk_rec3.sh" \
--tab -H --title=rec4 --command "bash chk_rec4.sh" \
--tab -H --title=rec5 --command "bash chk_rec5.sh" \
--tab -H --title=rec6 --command "bash chk_rec6.sh" \
--tab -H --title=rec7 --command "bash chk_rec7.sh" \
--tab -H --title=stardom --command "bash chk_stardom.sh" \
--tab -H --title=ssd --command "bash chk_ssd.sh"

and each of the scripts does fsck on the partitions on one drive.  For
example:

root@mypvr:/mnt/ssd1/usr/local/bin# cat chk_rec2.sh
#!/bin/bash

echo Checking rec2
fsck -C -f /dev/disk/by-label/rec2
echo Checking rec2boot
fsck -C -f /dev/disk/by-label/rec2boot
echo Finished

The chk_stardom.sh script checks all four drives I have in my Stardom
eSATA drive mount.  They are all on the one eSATA connection and can
not be checked in parallel with each other:

root@mypvr:/mnt/ssd1/usr/local/bin# cat chk_stardom.sh
#!/bin/bash

echo Checking vid1
fsck -C -f /dev/disk/by-label/vid1
echo Checking vid1boot
fsck -C -f /dev/disk/by-label/vid1boot
echo Checking vid2
fsck -C -f /dev/disk/by-label/vid2
echo Checking vid3
fsck -C -f /dev/disk/by-label/vid3
echo Checking vid4
fsck -C -f /dev/disk/by-label/vid4
echo Finished
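
On a box that can only be reached over ssh, where xfce4-terminal is not
an option, something similar could probably be done with tmux, which
keeps each fsck interactive.  This is only a sketch: TMUX_BIN, SESSION
and start_checks are made-up names, and the per-drive scripts are
assumed to be in the current directory:

```shell
#!/bin/bash
# Headless variant of chk_all.sh: one tmux window per drive script.
# TMUX_BIN, SESSION and start_checks are hypothetical names for this sketch.
TMUX_BIN=${TMUX_BIN:-tmux}
SESSION=${SESSION:-fsck}

start_checks() {
    local first=1 script name cmd
    for script in "$@"; do
        # chk_rec2.sh -> window name "rec2"
        name=$(basename "$script" .sh)
        name=${name#chk_}
        # "exec bash" leaves a shell open when the script ends, much like
        # the -H (hold) option of xfce4-terminal.
        cmd="bash $script; exec bash"
        if [ "$first" -eq 1 ]; then
            $TMUX_BIN new-session -d -s "$SESSION" -n "$name" "$cmd"
            first=0
        else
            $TMUX_BIN new-window -d -t "$SESSION:" -n "$name" "$cmd"
        fi
    done
}
```

After "start_checks chk_rec1.sh chk_rec2.sh chk_stardom.sh", running
"tmux attach -t fsck" shows the windows and lets you answer any fsck
prompts.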

After the fsck checks are all done I manually re-run fsck on any
partitions that needed repairs, and keep doing that until fsck reports
no problems.
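
The "re-run until clean" step can be sketched as a loop on the fsck
exit status (0 means no errors, 1 means errors were corrected, 2 means
a reboot is suggested, and 4 or above means something was left unfixed
or went wrong).  FSCK_CMD, MAX_RUNS and fsck_until_clean are names
invented for this example, and the limit of 10 runs is arbitrary:

```shell
#!/bin/bash
# Sketch only: repeat fsck on one partition until a run finds nothing to fix.
# FSCK_CMD and MAX_RUNS are hypothetical knobs added for this example.
FSCK_CMD=${FSCK_CMD:-fsck}
MAX_RUNS=${MAX_RUNS:-10}

fsck_until_clean() {
    local dev="$1" run=1 rc
    while [ "$run" -le "$MAX_RUNS" ]; do
        echo "fsck run $run on $dev"
        $FSCK_CMD -C -f "$dev"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "$dev: clean after $run run(s)"
            return 0
        fi
        # Status bits 1 (errors corrected) and 2 (reboot suggested) mean
        # "run it again"; any other bit needs a human to look at it.
        if [ $((rc & ~3)) -ne 0 ]; then
            echo "$dev: fsck exit status $rc, stopping" >&2
            return "$rc"
        fi
        run=$((run + 1))
    done
    echo "$dev: still finding errors after $MAX_RUNS runs" >&2
    return 1
}
```

For example, "fsck_until_clean /dev/disk/by-label/rec2" would keep
going until a run comes back clean or a problem needs fixing by hand.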

Then, before rebooting, I change the normal boot partition to disable
mythbackend from starting by removing the link in
/etc/systemd/system/multi-user.target.wants that points to
mythbackend:

root@mypvr:/etc/systemd/system# ll multi-user.target.wants/mythtv-backend.service
lrwxrwxrwx 1 root root 42 Jul  2 22:26 multi-user.target.wants/mythtv-backend.service -> /lib/systemd/system/mythtv-backend.service

so my commands to do the rm are:

mount /dev/disk/by-label/ssd2 /mnt/ssd2
rm /mnt/ssd2/etc/systemd/system/multi-user.target.wants/mythtv-backend.service

Those commands are normally in my command history so I re-use them
rather than typing them again each time.
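
If the commands are not in the history, a small helper along these
lines would do the same job and also cope with the link already being
gone.  The function name disable_unit_on_root is made up for this
sketch, and the root partition is assumed to be mounted already:

```shell
#!/bin/bash
# Sketch only: remove a unit's multi-user.target.wants link on a root
# partition mounted somewhere other than /.
# disable_unit_on_root is a hypothetical helper name.
disable_unit_on_root() {
    local root="$1" unit="$2" link
    link="$root/etc/systemd/system/multi-user.target.wants/$unit"
    if [ -L "$link" ]; then
        rm -- "$link"
        echo "Removed $link"
    else
        echo "No link at $link, nothing to do"
    fi
}
```

Usage would be "disable_unit_on_root /mnt/ssd2 mythtv-backend.service".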

Removing that link does exactly what running this command does when
run from the normal boot:

sudo systemctl disable mythtv-backend

so mythbackend will not be started automatically.  Then I reboot into
the normal boot partition.  Immediately after rebooting, you may find
that anacron is running.  This happens if the reboot is done after
midnight and before the normal time that anacron gets run at every
day.  If so, you want to kill anacron as it will want to run its
normal daily (and weekly or monthly) jobs, some of which are for
database backup and repair, and many of which will make the PC very
busy and slow to do your manual repairs.  So to prevent that:

sudo su
pkill anacron

This has to be done immediately after rebooting - anacron waits a few
minutes after it is started before it does anything, so you need to
kill it before it starts its jobs.

It is also possible to have anacron permanently set up to only be run
on its daily timer, rather than also at boot.  To do that, run this
command:

sudo systemctl disable anacron

After that, anacron will still be run by the systemd anacron.timer
unit, but anacron.service will no longer be started on its own at
boot.  I recommend doing this for MythTV boxes that are normally left
running 24/7.

Then I do this:

cd /etc/cron.daily
./optimize_mythdb

I have my optimize_mythdb and mythtv-database commands in cron.daily,
rather than cron.weekly, so I get daily database checks and backups.
This is also highly recommended.

Normally optimize_mythdb will find that the recordedseek table needs
fixing and will repair it.  Occasionally, other tables will also need
repairs.  If I am very unlucky, optimize_mythdb will report it was
unable to repair recordedseek.  Usually I can then use manual repair
commands to repair it.  It is a number of years now since I have had
to restore it from backup.  Then I can do this:

systemctl enable mythtv-backend
systemctl start mythtv-backend
exit

and everything is working again.

Once mythbackend is running, I may need to run "mythcommflag
--rebuild" on all the recordings that had their recordedseek entries
affected by the corruption of that table.  I have a user job set up
that runs that command, so I can just see what recordings were made
since the last database backup and run mythcommflag on each of them
from mythfrontend.  If you also do commercial skip flagging, that
also needs to be redone by a second mythcommflag command after the
--rebuild one.

After that, everything is back to normal until the next time it
happens.

The other way to do repairs like this is to boot a DVD, USB or PXE
live version of the system rather than having an extra bootable
partition.  In that case, you may find you have to install packages to
get the tools you need to do the repairs.  If so, you need to have
your network set up so that live boots can have Internet access.  I
find that I normally will need to install the jfsutils package to get
the fsck module for my JFS partitions.
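
A quick way to see what is missing on a live system before starting is
to test for each tool and print the package to install.  This is a
minimal sketch: need_tool is a made-up helper, and the tool-to-package
pairs are ones you supply yourself (fsck.jfs from jfsutils is the one
relevant to JFS partitions):

```shell
#!/bin/bash
# Sketch only: report whether a repair tool is available, and if not,
# which package to install.  need_tool is a hypothetical helper name.
need_tool() {
    local tool="$1" pkg="$2"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: missing, try: sudo apt install $pkg"
        return 1
    fi
}
```

For example, "need_tool fsck.jfs jfsutils" run from a root shell on the
live system.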