In a previous blog entry about the challenges of mobile video services, we discussed video quality as a perceived metric and how it can be measured subjectively in viewing sessions. In this post, we will take a closer look at automated video quality measurement methods, often called objective measurements, which are applied in real-world field measurements of video quality.
An objective measurement is much more than a simple formula. It is a very complex algorithm that considers many effects on quality and applies sophisticated weightings to match human perception. In general, an automated or objective measurement also provides a video MOS (Mean Opinion Score) value; however, it does not measure the MOS directly. Instead, the algorithm uses the video and image information that influences the viewer's perception; it weights that information and combines it into a single value that can be mapped to a MOS scale.
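To make the final step concrete, here is a deliberately simplified sketch of how weighted quality indicators can be combined into a single value on the MOS scale. The feature names and weights are entirely hypothetical; real objective models such as those standardized by the ITU use far more elaborate feature sets and perceptual weightings.

```python
def predict_mos(features, weights):
    """Combine normalized quality features (0 = worst, 1 = best)
    into a single score mapped onto the 1..5 MOS scale."""
    assert features.keys() == weights.keys()
    total_weight = sum(weights.values())
    quality = sum(features[k] * weights[k] for k in features) / total_weight
    return 1.0 + 4.0 * quality  # map [0, 1] onto the MOS range [1, 5]

# Hypothetical feature values for one measured clip
features = {"sharpness": 0.9, "smoothness": 0.8, "freezing_free": 1.0}
weights = {"sharpness": 2.0, "smoothness": 1.0, "freezing_free": 1.0}
print(round(predict_mos(features, weights), 2))  # prints 4.6
```

The linear combination here is only a placeholder; as discussed later in this post, real models combine degradations non-linearly.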
There are many video quality measurement tools on the market. They can be roughly divided into two groups, depending on how they retrieve their input information: image-based video quality measurements actually analyze the quality of all displayed images, while bitstream-based video quality measurements look into the transmission parameters at IP or bitstream level.
Bitstream-based video quality measures
It seems obvious that the images themselves would offer the basis for any kind of video quality measurement. Yet, some recently standardized measurements do not use images for quality prediction; they derive their quality information from the video bitstream and the IP layer.
Some more complex models actually decode the video with an internal generic decoder. This, however, requires unencrypted video streams, which is rarely the case nowadays. Consequently, the input information for these types of models is very limited. Assumptions about compression artefacts can be made based on codec type, resolution, and bitrate; packet jitter and knowledge about the protocol might provide insight into transmission problems, modelling their effect on video playback.
A bitstream-based video quality measure requires no video images per se, because its measurement point is an IP interface. This makes it applicable for measurements elsewhere in the network or at an IP node, and enables monitoring of network and service level agreements at IP interconnection points. The accuracy of these measures is restricted, however, for the following reasons:
- The measuring interface, from which the IP stream is obtained, sits before the stream reaches the actual player and decoder. The influence of the actual jitter buffer and decoding strategies can thus only be modelled by a generic approach. The measures predict the quality the stream would have if it were decoded by the modelled player, not the quality of what is actually decoded and displayed on the user's device. This generalization may be acceptable for some use cases, but it does not reflect what the user actually sees.
- The “view range” of an IP-based model is restricted to the last transcoding/repackaging step. The IP level provides information only about the current link to which the measure is connected. If a low-bitrate video is transcoded to a higher resolution and bitrate, the measure sees only the high resolution and bitrate. Consequently, it will rate the video quality as good, even though, content-wise, it is not. The more transcoding and re-packaging steps are applied, the more unreliable the measure becomes, because it only factors in the last processing step.
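The parametric nature of such models can be illustrated with a small sketch. The function below estimates a compression-quality score purely from stream parameters, which is all a bitstream-based measure may have. The per-codec saturation thresholds and the exponential quality curve are assumptions for illustration only, not taken from any standardized model.

```python
import math

# Hypothetical bits-per-pixel values at which each codec reaches
# near-transparent quality (assumed figures for illustration only).
SATURATION_BPP = {"h264": 0.12, "h265": 0.07}

def compression_mos(codec, width, height, fps, bitrate_bps):
    """Estimate a compression-quality MOS from stream parameters alone."""
    bpp = bitrate_bps / (width * height * fps)  # bits per pixel
    saturation = SATURATION_BPP[codec]
    # Quality rises with bitrate and saturates near the codec's threshold.
    quality = 1.0 - math.exp(-3.0 * bpp / saturation)
    return 1.0 + 4.0 * quality

# A 720p/30fps H.264 stream at 2 Mbit/s versus the same stream at 500 kbit/s
print(round(compression_mos("h264", 1280, 720, 30, 2_000_000), 2))
print(round(compression_mos("h264", 1280, 720, 30, 500_000), 2))
```

Note how the model is blind to the limitation described above: if a poor low-bitrate source is transcoded to 2 Mbit/s, this function would still report a high score.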
Bitstream-based measures are restricted to certain protocols, codecs, and use cases; nevertheless, they have the unique advantage of being applicable for in-service network monitoring of video services. Besides many proprietary models, standardization activities have led to a series of measures such as ITU-T P.1201, P.1202, and the recent P.1203.
Image-based video quality measures
Compared to bitstream-based models, image-based models receive decoded video images as input signals and derive the quality-relevant information by analyzing them. From a measurement application point of view, these measures are typically “end-point” measures. They access the video at the client side, even directly on the user's device. This makes it possible to measure the real end-user experience, since it considers all processing steps, including the player's actual buffer management, the decoder, and all scaling applied on the device.
Image-based measures look at the content; they detect low quality by analyzing the image itself, independent of the degradation's source. In addition, the display duration of each image gives information about the actual perceived freezing and frame rate. Image analysis, however, consumes a lot of resources: the decoded images, essentially raw bitmaps, have to be handed over to the algorithm for analysis.
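As an illustration of the temporal side, the display duration of each image can be turned into freezing and frame-rate figures with very little code. The sketch below assumes per-image display timestamps in seconds and a hypothetical freeze threshold; real measures apply perceptual weighting on top of such raw figures.

```python
def playback_stats(display_times, freeze_threshold=0.2):
    """Derive total freezing time and effective frame rate from
    the timestamps at which successive images were displayed."""
    gaps = [b - a for a, b in zip(display_times, display_times[1:])]
    # Any gap longer than the (assumed) threshold counts as freezing.
    freezing = sum(g for g in gaps if g > freeze_threshold)
    duration = display_times[-1] - display_times[0]
    fps = len(gaps) / duration  # frames shown per second of playback
    return freezing, fps

# Six displayed images with one 0.5 s stall in the middle
times = [0.00, 0.04, 0.08, 0.58, 0.62, 0.66]
freezing, fps = playback_stats(times)
print(round(freezing, 2), round(fps, 2))
```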
Of course, the analysis can be simpler or more complex, but almost all relevant algorithms today analyze spatial and temporal degradation. Spatial issues cause degraded images; temporal degradation includes freezing times, low frame rates, or error propagation effects. The degradation of images is usually the result of lost spatial details, causing the images to become blurry, as if downscaled or reproduced block-wise. If data is lost during transmission, the result can be a complete “break-up” of the image or the presence of block and color distortion.
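A minimal example of detecting such spatial degradation: blur removes spatial detail, so a blurry frame shows weaker local luma gradients than a sharp one. The sketch below, pure Python on a tiny 2-D frame of 0..255 values, is a toy no-reference sharpness indicator, not a perceptual model.

```python
def mean_gradient(frame):
    """Average absolute difference between horizontal neighbours.
    Blurry (spatially degraded) frames show weaker local gradients."""
    total, count = 0, 0
    for row in frame:
        for a, b in zip(row, row[1:]):
            total += abs(a - b)
            count += 1
    return total / count

sharp = [[0, 255, 0, 255], [255, 0, 255, 0]]            # strong edges
blurry = [[100, 120, 110, 115], [110, 105, 115, 110]]   # weak edges

print(mean_gradient(sharp) > mean_gradient(blurry))  # prints True
```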
In image-based video measurements, visible degradation first needs to be detected and quantified. The second step is perceptual weighting: How annoying is the physical degradation? Does it affect an important area of the image? What does the surrounding area look like, and does it make the degradation more or less prominent?
Perceptual weighting is also applied to temporal degradation such as freezing or low frame rates, including the video's jerkiness in a given scene. Finally, the individual types of degradation are combined into a single score and mapped to the MOS scale. This is not simply a linear combination; there are also inter-degradation masking effects, where one degradation may dominate another.
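A toy sketch of such a non-linear combination: instead of averaging, the worst degradation dominates and masks part of the other's influence. The masking factor below is an assumed illustration value, not taken from any standardized model.

```python
def combine_mos(spatial_mos, temporal_mos, masking=0.7):
    """Combine two per-dimension MOS values with a simple masking effect:
    the dominant (worst) degradation suppresses part of the other's weight."""
    worst = min(spatial_mos, temporal_mos)
    best = max(spatial_mos, temporal_mos)
    return worst + (1.0 - masking) * (best - worst)

# Mild blur (4.5) combined with heavy freezing (2.0): the freezing dominates,
# pulling the overall score close to 2 rather than to the average of 3.25.
print(combine_mos(4.5, 2.0))  # prints 2.75
```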
Full-reference versus no-reference measures
The detection and quantification of the different types of degradation can be done in different ways. So-called “full-reference” measures compare the degraded video frame by frame to the original, undistorted input; differences are detected and rated very accurately. The other approach comprises “no-reference” measures. Here, no comparison to an original video is made; instead, the image is analyzed for distortions such as the presence of block structures or the absence of sharp edges.
Both approaches have their dedicated areas of application. The full-reference approach is applicable when a known, pre-stored video source exists and the transmitted video can be reproduced as exactly as possible at the receiving side. Just as in a voice measurement, this can be applied to a video call: a reference video is inserted at the far-end side and its reception is analyzed at the other side.
No-reference measures, on the other hand, do not need to know what is inserted at the far-end side; they analyze whatever they receive without comparing it to a known, pre-stored input video. Therefore, they can be applied to live streams such as TV and streaming services. They are also more robust against images altered by re-scaling or picture enhancements.
For both strategies, there are approaches combining the image analysis with meta information from the bitstream to consider more information and therefore further improve accuracy. However, both strategies, full-reference and no-reference, require access to real images as shown to the user on his screen.
Today, we are experiencing a lot of standardization activity for automated video quality prediction measures, many of them using image analysis as a basis for MOS calculation. Examples are the recently approved ITU-T J.343 series. These approaches, integrated into real field measurement tools, deliver a valid basis for optimizing and benchmarking real field services.
Video MOS for video quality
The user's experience is best reflected if the measurement point is close to the user's interface. This means the video analysis should run the video service on the same device as the user and analyze the same images the user sees on the screen. In this setup, all components influencing video quality are considered, from video compression and transmission to buffering, decoding, and scaling on the device itself. Any analysis point before the picture shown to the user misses some of these influences.
Of course, special techniques and highly optimized algorithms are required to capture decoded images in real-time and analyze them fast. Today’s smartphones, however, have very high-performing hardware and enable video quality analysis in real-time if the algorithms are customized accordingly.
A measured video MOS already describes the perceived quality of the video while it is being watched. In combination with other measures, such as “time to first picture” and freezing and re-buffering events, a viewer's quality of experience can be described well.
Since 2014, Rohde & Schwarz mobile network testing solutions have offered a real-time video quality analysis application running on all state-of-the-art Android smartphones, including the latest models that exceed full-HD resolution and support 1440p and UHD video. The recent J.343.1 no-reference model directly accesses the displayed images and analyzes them in real-time. The algorithm is embedded in an automated test structure that is applicable to many video services, including YouTube. Apart from the video MOS, many other technical KPIs can be measured, such as freezing, frame rate, video resolution, “time to first picture”, buffering time, and more.
How these measurement applications can be used for network analysis and what information can be retrieved will be discussed in the next post of this blog series.
Stay online, always!