Performance Comparison Between Early Fusion and Late Fusion in Video Analysis
Semantic video analysis has become popular because it enables methods that automatically index segments of interest. The common approach is to extract features from the video data, either from a single information stream (unimodal analysis) or from two or more information streams (multimodal analysis). Based on these extracted features, an algorithm indexes the data with semantic concepts such as car, ice hockey, and beach. There is now sufficient evidence that semantic video analysis yields the most effective index when a multimodal approach is adopted [1, 2].
This article identifies two general fusion methods within the machine learning approach to semantic video analysis: early fusion and late fusion. The question arises which of the two is better suited for analyzing semantic videos. In this article, I compare both multimodal fusion approaches and perform a comparative evaluation to highlight the advantages and disadvantages of each method.
Fusion Approaches
1 Early Fusion
Indexing methods that depend on early fusion first extract unimodal features. After analysis of the various unimodal streams, the extracted features are combined into a single representation. Early fusion methods then rely on supervised learning to classify semantic concepts from this multimodal representation.
Early fusion can generate a truly multimedia feature representation, since all the features are merged from the beginning. Another advantage of early fusion is that it requires only a single learning phase. The disadvantage of this method is the difficulty of combining all the features into a common representation. The general scheme for early fusion is illustrated in the following figure.
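To make the scheme concrete, here is a minimal sketch of feature-level early fusion. The random stand-in data, the feature dimensions, and the choice of a scikit-learn SVM are my illustrative assumptions, not the setup of the original experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative unimodal features per shot (random stand-ins):
# visual_feats could be color/texture descriptors, textual_feats
# could be tf-idf vectors over speech transcripts.
rng = np.random.default_rng(0)
n_shots = 200
visual_feats = rng.normal(size=(n_shots, 64))
textual_feats = rng.normal(size=(n_shots, 32))
labels = rng.integers(0, 2, size=n_shots)  # concept present / absent

# Early fusion: concatenate the unimodal features into one multimodal
# representation, then run a single supervised learning phase on it.
fused = np.hstack([visual_feats, textual_feats])
clf = make_pipeline(StandardScaler(), SVC(probability=True))
clf.fit(fused, labels)
scores = clf.predict_proba(fused)[:, 1]  # per-shot concept scores
```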
Early fusion merges temporal information at the pixel level. Assuming the spatial kernel size of the filters is MxN, this is achieved by replacing the MxNx3 filters with MxNx3xT filters, where T is the size of the temporal window, that is, the number of frames that are processed together. A value of T = 10 is used in the described experiments.
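As an illustration, assuming a PyTorch-style implementation (my assumption; the original experiments do not specify a framework), the temporal window T can be folded into the channel dimension of the first convolution:

```python
import torch
import torch.nn as nn

T = 10           # temporal window: frames processed together
M, N = 11, 11    # spatial kernel size (illustrative choice)

# Single-frame first layer: M x N filters over 3 color channels.
single_frame_conv = nn.Conv2d(in_channels=3, out_channels=96,
                              kernel_size=(M, N), stride=4)

# Early fusion first layer: the M x N x 3 filters become
# M x N x (3*T) filters by stacking the T frames along channels.
early_fusion_conv = nn.Conv2d(in_channels=3 * T, out_channels=96,
                              kernel_size=(M, N), stride=4)

# A clip of T frames, each 3 x 224 x 224, stacked on the channel axis,
# so all frames are fused at the pixel level from the first layer on.
clip = torch.randn(1, 3 * T, 224, 224)
features = early_fusion_conv(clip)
```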
2 Late Fusion
Indexing methods that rely on late fusion also start with feature extraction. Unlike early fusion, however, where the extracted features are combined into a multimodal representation before learning, late fusion approaches learn semantic concepts from the unimodal features directly. In [3], for instance, separate generative probabilistic models are learned for the visual and textual modalities, and their scores are combined afterward to generate a final detection score. In general, late fusion schemes combine the learned unimodal scores into a multimodal representation and then rely on a second supervised learning stage to classify semantic concepts.
One disadvantage of late fusion is its high computational cost, since every modality requires a separate supervised learning stage, and the combined scores require yet another. A further drawback is the potential loss of correlation between modalities, since features from different streams are never combined in a mixed feature space. Figure 1 also illustrates the late fusion scheme.
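For contrast with the early fusion sketch above, here is a minimal late fusion sketch under the same illustrative assumptions: one classifier per modality, with the unimodal scores stacked and fed to a second-stage learner.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_shots = 200
visual_feats = rng.normal(size=(n_shots, 64))
textual_feats = rng.normal(size=(n_shots, 32))
labels = rng.integers(0, 2, size=n_shots)

# Step 1: a separate supervised learning stage per modality.
visual_clf = SVC(probability=True).fit(visual_feats, labels)
textual_clf = SVC(probability=True).fit(textual_feats, labels)

# Step 2: combine the learned unimodal scores into a multimodal
# representation of score vectors, one per shot.
score_vectors = np.column_stack([
    visual_clf.predict_proba(visual_feats)[:, 1],
    textual_clf.predict_proba(textual_feats)[:, 1],
])

# Step 3: learn the final concept detector on the stacked scores
# (in practice this stage is trained on held-out scores to avoid
# overfitting; reusing the training shots here keeps the sketch short).
fusion_clf = LogisticRegression().fit(score_vectors, labels)
final_scores = fusion_clf.predict_proba(score_vectors)[:, 1]
```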
For late fusion, the temporal information is not merged until the first fully connected layer. In the experiments, two single-frame networks operating on frames 15 frames apart are merged at the last stages, where global motion characteristics can be detected (the individual towers cannot detect any motion, given their single-frame input).
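The following PyTorch sketch illustrates this two-tower layout; the backbone, layer sizes, and frame resolution are my assumptions for illustration, not the original architecture.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Two single-frame towers merged at the first fully connected layer."""
    def __init__(self, num_classes=20):
        super().__init__()
        # Shared single-frame tower (illustrative stand-in backbone).
        self.tower = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fusion happens here: the two frame features are concatenated,
        # so this layer can compare them and pick up global motion.
        self.classifier = nn.Linear(2 * 32, num_classes)

    def forward(self, frame_a, frame_b):
        # frame_a and frame_b are taken 15 frames apart in the clip.
        fa = self.tower(frame_a)  # each tower sees a single frame...
        fb = self.tower(frame_b)  # ...and alone cannot detect motion
        return self.classifier(torch.cat([fa, fb], dim=1))

net = LateFusionNet()
logits = net(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```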
Fusion Scheme Implementation and Evaluation
This article implements and evaluates early fusion and late fusion within the TRECVID video retrieval benchmark. The video archive of the 2004 TRECVID benchmark is composed of 184 hours of ABC World News Tonight and CNN Headline News; the remaining 64 hours serve as development data. Together with the video archive came automatic speech recognition results donated by LIMSI [2]. Figure 2 shows the semantic concepts in the dataset.
Average precision is used to evaluate the accuracy of semantic concept detection at the shot level, following the standard in TRECVID evaluations. The following figure illustrates the detection results for all 20 semantic concepts for both early fusion and late fusion.
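For reference, non-interpolated average precision over a ranked shot list can be computed as in the following sketch; the relevance flags are made up for illustration.

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP: mean of the precision values at each
    relevant shot, taken over a list of 0/1 relevance flags ordered
    by descending detection score."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

# Illustrative ranking: 1 = relevant shot, 0 = non-relevant shot.
print(average_precision([1, 0, 1, 1, 0, 0, 1]))  # ~0.75
```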
As the figure shows, late fusion performs better than early fusion for most of the concepts, such as golf, boat, and ice hockey. For some concepts, however, late fusion falls short; for example, the absolute difference in average precision is 0.1 for car and 0.3 for stock quotes.
For the concepts road and ice hockey, the late fusion method improves results compared with early fusion. The scores for the visual and textual modalities show that these two concepts are relatively easy to separate. For stock quotes the situation is different: late fusion correctly classifies the large number of easily separable scores, but it loses track of scores that are less prominent, as the large score difference in the figure above illustrates.
Conclusions
Based on the analysis of the semantic concepts above, we can conclude that a late fusion scheme tends to perform better than early fusion for most concepts, but at the price of increased learning effort. Moreover, late fusion has difficulty classifying shots that lie close to the decision boundary of the SVM; as a consequence, when early fusion does perform better, the improvement tends to be more significant. These results suggest that choosing the fusion method on a per-concept basis yields an optimal strategy for semantic video analysis.
References
[1] A. Amir et al. IBM research TRECVID-2003 video retrieval system. In Proc. TRECVID Workshop, Gaithersburg, USA, 2003.
[2] G. Iyengar, H. Nock, and C. Neti. Discriminative model fusion for semantic concept detection and annotation in video. In ACM Multimedia, pages 255–258, Berkeley, USA, 2003.
[3] T. Westerveld et al. A probabilistic multimedia retrieval model and its evaluation. EURASIP JASP, (2):186–197, 2003.