[mythtv] Playback next steps

Fri Dec 14 15:16:59 UTC 2018

Peter/David,

I've been digging around and playing with the OpenGL, VAAPI and OpenMax code.

Focusing on OpenGL for now...

I forked master last week - just to keep track of patches etc. You can see what I've been doing at:

https://github.com/mark-kendall/mythtv/commits/master

In summary so far for OpenGL:

- fixed UYVY kernel deinterlacer (I see that's already in master)
- fixed YV12 kernel deinterlacer (pretty sure linear blend is broken as well, it looks terrible)
- patched mythavtest to add double rate deinterlacing support (mythavtest is really useful for performance testing if you haven't used it before)
- some openglbicubic fixes
- minor improvement to the UYVY kernel deinterlacer
- fix for desktop OpenGL ES2

In the pipeline already:

- add support for NPOT (non-power-of-two) textures on GLES2/GL2.0 - should save a lot of video memory on Pi/Android etc
- optimisations for UYVY and desktop GL2.0
- fix use of glTexImage1D - just use 2D instead (1D not available on ES2.0)
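
To put a rough number on the NPOT saving, here is a minimal sketch. It assumes a 1920x1080 frame held in a 4 bytes-per-texel RGBA texture, and that without NPOT support each dimension is padded up to the next power of two; the helper names are mine, not anything in the MythTV tree.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def texture_bytes(width: int, height: int, bytes_per_texel: int = 4,
                  npot: bool = True) -> int:
    """Video memory for one texture; pads to power-of-two sizes if npot is False."""
    if not npot:
        width, height = next_pow2(width), next_pow2(height)
    return width * height * bytes_per_texel


# Assumed example: one 1080p frame stored as RGBA.
padded = texture_bytes(1920, 1080, npot=False)  # stored as 2048 x 2048
exact = texture_bytes(1920, 1080, npot=True)    # stored as 1920 x 1080
print(f"power-of-two: {padded / 2**20:.1f} MiB, NPOT: {exact / 2**20:.1f} MiB")
```

So under those assumptions the padded texture is roughly twice the size of the exact one, and that is per texture.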

I've also started some extensive debugging/logging code for OpenGLVideo to show exactly what is happening under the hood - it's fairly invasive though.
Does that sound useful?

While digging around and trying to get EGL and OpenGL ES2.0 working properly on my system, I noticed the comment about ES2.0 and OpenMax playback - and all the subsequent ifdeffery required to disable Qt5 OpenGL support...

Not tested the theory yet, but I think the reason OpenMax fails with Qt5 OpenGL/EGL is that Lawrence creates his own EGL render device for the OSD. If using eglfs, this will interfere with the existing Qt screen (I don't think you can create 2 EGL devices). The simple solution, I think, is to check the Qt QPA platform and disallow the EGL OSD in VideoOutputOMX if the platform is eglfs. This should allow you to remove the whole OPENGL_QT5 ifdef stuff - which would really clean things up and ensure as many people as possible actually use the ES2.0 renderer (with or without EGL).

The more involved solution is to fix VideoOutputOMX. At the moment Lawrence's code effectively assumes an X11 desktop. He uses the OMXVideoRender component to put images on screen (does that even work with eglfs?) and, because of that approach, has to handle all sorts of windowing issues/masks etc. He then doesn't like the softblend OSD :) so creates an additional render device to display on top of the video.

A relatively simple solution is:
- for egl/fs, create VideoOutputOMXEGL (probably a sub-class of VideoOutputOpenGL) and replace the OMXVideoRender component with the Broadcom specific egl_render. EGL images are transferred direct to screen and the regular OpenGL OSD is thrown in for free.
- for X11/desktop, I would actually remove the MythRenderEGL code and, if they don't like the softblend OSD, encourage them to use EGL...

There is also some Broadcom specific code that is not properly ifdef'd out.

If I get the chance, I'm going to have a play with Qt5/eglfs/OpenMax over Christmas.

Back to OpenGL proper, having got my head around the code again, I have a better idea of what is happening in the YV12 code - and can compare it to the other options.

Remember the aim of the game is to take a planar YUV420P/YV12 image in main memory and display it as a packed RGBA image on screen.
So there are three significant operations - repacking from planar to packed, transferring to video memory and YUV to RGB conversion - and just like skinning cats, there are multiple ways of doing it.
And remember that a YV12 image is 12bpp and full RGBA is 32bpp.
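
To make the bpp figures concrete, a quick back-of-envelope sketch of the bytes moved per frame. The 1920x1080 frame size is just an assumed example, and the 16bpp entry is the packed UYVY422 format used by two of the routes below.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size in bytes of one video frame at the given bits per pixel."""
    return width * height * bits_per_pixel // 8


WIDTH, HEIGHT = 1920, 1080  # assumed example frame size

FORMATS = (
    ("YV12 planar 4:2:0", 12),
    ("UYVY packed 4:2:2", 16),
    ("RGBA packed", 32),
)

for name, bpp in FORMATS:
    size = frame_bytes(WIDTH, HEIGHT, bpp)
    print(f"{name:<20} {bpp:>2} bpp  {size:>9} bytes  {size / 2**20:.2f} MiB")
```

At 1080p that is about 3.1 MB per frame for YV12, 4.1 MB for UYVY and 8.3 MB for RGBA, so the choice of upload format changes the memory transfer by a factor of up to 2.67.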

The simplest fallback route is to do the entire conversion in memory - repacking and colourspace conversion (note this should never actually happen with the current code):
CPU Load: High
GPU Load: Low
Memory transfer: High - 32bpp image transferred.
Colourspace control: None (using FFmpeg)
Availability: Always

The default option is to repack the frame into a full 32-bit packed format and perform colourspace conversion in the GPU. Repacking requires some custom code - interlaced material needs special handling.
CPU Load: Moderate with MMX support - all other platforms fall back to 'plain C'
GPU Load: Lowish - simple 1 texture sampling and colourspace control
Memory transfer: High - 32bpp
Colourspace control: Full
Availability: Always

The OpenGL 'Lite' route uses custom extensions in the GPU. Taking this route the video frame is repacked into a packed UYVY422 video frame, transferred to video memory and 'magically' converted to RGBA.
CPU Load: Moderate - repack from planar to packed.
GPU Load: ??
Memory transfer: Medium - image is 16bpp
Colourspace control: None
Availability: Variable

The custom UYVY code uses the same UYVY422 packed frame format and uses a custom texture format and shaders to convert to RGBA.
CPU Load: Moderate - repack from planar to packed.
GPU Load: Medium - the packed frame only requires 1 texture sample per pixel (no deint) but does require an extra filter stage to ensure exact 1 to 1 mapping between input and output. Any horizontal interpolation breaks sampling (because 2 pixels are encapsulated in one RGBA sample). Video memory usage is lower as frame is half width.
Memory transfer: Medium - 16bpp
Colourspace control: Full
Availability: Always
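
For anyone who hasn't looked at the packed format, here is a minimal sketch of the repack and of why the 1 to 1 mapping matters. It is an illustration only, not the MythTV packing code: the function names are mine, it takes the three planes as separate buffers, it handles progressive frames only (each chroma row is simply reused for two luma rows) and it ignores line pitch/padding.

```python
def pack_uyvy(y: bytes, u: bytes, v: bytes, width: int, height: int) -> bytes:
    """Repack planar 4:2:0 (separate Y, U, V planes) into packed UYVY 4:2:2."""
    assert width % 2 == 0 and height % 2 == 0
    chroma_width = width // 2
    out = bytearray(width * height * 2)  # 16bpp
    pos = 0
    for row in range(height):
        luma_row = row * width
        chroma_row = (row // 2) * chroma_width  # one chroma row per two luma rows
        for cx in range(chroma_width):
            out[pos] = u[chroma_row + cx]            # U, shared by both pixels
            out[pos + 1] = y[luma_row + 2 * cx]      # Y of the left pixel
            out[pos + 2] = v[chroma_row + cx]        # V, shared by both pixels
            out[pos + 3] = y[luma_row + 2 * cx + 1]  # Y of the right pixel
            pos += 4
    return bytes(out)


def sample_uyvy(packed: bytes, width: int, x: int, row: int) -> tuple:
    """Fetch (Y, U, V) for output pixel (x, row) from a single 4-byte texel."""
    texel = (row * (width // 2) + x // 2) * 4
    u, y0, v, y1 = packed[texel:texel + 4]
    return (y0 if x % 2 == 0 else y1), u, v


# Tiny 4x2 test frame: luma 0..7, one chroma row of two samples.
packed = pack_uyvy(bytes(range(8)), bytes([100, 101]), bytes([200, 201]), 4, 2)
print(list(packed))
print(sample_uyvy(packed, 4, 3, 1))
```

Each 4-byte texel carries two output pixels, so the texture is half the frame width, and the shader has to choose Y0 or Y1 from the output x position. If the GPU interpolates horizontally between texels, the U/Y/V/Y channels of neighbouring pixel pairs get mixed, which is the sampling breakage mentioned above.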

The YV12 code is actually where I started about 10 years ago:) There is no repacking in main memory - the planar frame is transferred to video memory and repacked and converted to RGBA in the GPU. Sounds nice but...
CPU Load: Low to very low.
GPU Load: High to very high. Each output pixel requires 3 texture samples, 2 of which are non-contiguous - as the video data is still planar. For progressive content this is not too bad but deinterlacing gets ugly really quickly:) see below. Also the GLSL shader cannot use rectangular textures so requires more GPU memory - but I have a fix for that coming.
Memory transfer: Low - 12bpp
Colourspace control: Full
Availability: Always

Texture sampling is the most expensive operation in a GLSL shader - and accessing memory away from the current sample is usually more expensive. So it is best to minimise texture sampling and not to access texture memory 'randomly'.

With the software fallback, default, OpenGL lite and UYVY approaches, there is only one, coherent texture sample for progressive content. For OpenGL deinterlacers this increases depending on the deinterlacer: linear blend makes 3 (2 non-contiguous) and kernel 8 (7 non-contiguous) - which is why it is slower.

With YV12 you start with 3 texture samples for progressive - which in my testing offsets the gain from very low CPU usage and memory transfer - but for the kernel deinterlacer that increases to 24 texture samples (21 non-contiguous).
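
The arithmetic is just planes multiplied by deinterlacer taps. A small sketch of it follows; the tap counts are the ones quoted above, while the YV12 linear blend figure is derived by the same multiplication rather than measured.

```python
# Texture samples per output pixel for a packed (single texture) frame.
TAPS = {"progressive": 1, "linear blend": 3, "kernel": 8}

# Number of separate planes the shader must sample for each tap.
PLANES = {"packed": 1, "planar YV12": 3}


def samples_per_pixel(planes: int, taps: int) -> int:
    """Total texture samples needed to produce one output pixel."""
    return planes * taps


for layout, planes in PLANES.items():
    counts = ", ".join(
        f"{mode}: {samples_per_pixel(planes, taps)}" for mode, taps in TAPS.items()
    )
    print(f"{layout:<12} {counts}")
```

Every extra tap a deinterlacer adds costs three samples on the planar route against one on the packed routes, so the gap widens as the deinterlacer gets more complex.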

... and that is why I tried to find an alternative. It's fine for progressive content but deinterlacing performance just gets worse and worse.

I settled on the UYVY code - it balances its 'performance' between CPU, memory transfer and GPU.

In summary:
software fallback - why bother, unless you have a modern CPU and a 15 year old GPU?
default - custom packing code may not be efficient on non-x86 architectures, and the memory transfer is large.
opengl-lite - nice if available but colour rendition not great.
UYVY - simple repacking, smaller memory transfer and lower GPU texturing.
YV12 - low CPU (straight copy), smallest memory transfer but worse to terrible GPU texturing.

The code could probably try to make some assumptions about the best route to take depending on the reported driver/hardware and compile type, e.g. Intel desktop and Pi have shared CPU/GPU memory, so memory transfers probably aren't a bottleneck. A more powerful dedicated video card probably won't blink at the sampling required for YV12. At the end of the day, however, there is no right or wrong solution - as long as it works!
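
Purely as an illustration of the kind of guess I mean, something like the sketch below. The function and its two inputs are hypothetical, nothing like this exists in the code today, and how you would reliably detect either property from the driver strings is left open.

```python
def choose_route(dedicated_gpu: bool, shared_memory: bool) -> str:
    """Guess a frame upload route from two coarse (hypothetical) hardware hints."""
    if dedicated_gpu and not shared_memory:
        # Transfers cross a bus, so the 12bpp upload helps, and a powerful
        # card should cope with the extra texture sampling.
        return "YV12"
    # Shared CPU/GPU memory (e.g. Intel desktop, Pi): the transfer size is
    # probably not the bottleneck, so keep the GPU sampling cost down.
    return "UYVY"


print(choose_route(dedicated_gpu=True, shared_memory=False))
print(choose_route(dedicated_gpu=False, shared_memory=True))
```

A real version would want more inputs (interlaced or progressive content, which deinterlacer is selected) and a user override, since any heuristic like this will guess wrong on some hardware.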

Again, hopefully this is helpful. Any questions, just ask.

Regards
Mark

P.S. Probably worth mentioning that I don't really think the code needs both UYVY and YV12 - and unsurprisingly I would suggest ditching YV12. At the same time the OpenGL code could be simplified greatly by removing OpenGL1 support - I'd be amazed if anyone is actually still using it.