Dictionary Learning for Scalable Sparse Image Representation with Applications

This paper introduces a novel dictionary learning algorithm intended for scalable sparse representation of high-motion video sequences and natural images. The proposed algorithm builds upon the K-SVD framework, originally designed to learn non-scalable dictionaries for natural images. The proposed design is mainly motivated by the principal perception characteristic of the Human Visual System (HVS). Specifically, its core structure relies on exploiting high-frequency image components and contrast variations in order to achieve identification of visual scene objects at all scalability levels. The design is implemented by introducing a semi-random, Morphological Component Analysis (MCA) based initialization of the K-SVD dictionary and a regularization of its atom update mechanism. In general, dictionary learning for sparse representations leads to state-of-the-art image restoration results for several different problems in the field of image processing. In the experimental section we show that such results are equally achievable by accommodating all dictionary elements to tailor the scalable data representation and reconstruction, hence modeling data that admit sparse representation in a novel manner. The performed simulations include scalable sparse recovery of static and dynamic data changing over time (e.g., video), together with applications to denoising and compressive sensing.


Introduction
Over the past couple of decades, image processing applications have undergone significant improvements. A critical factor in this growth is the sparse coding paradigm, first introduced in [1], based on the assumption that signals (e.g., natural images) admit a sparse decomposition over a learned representational basis, i.e., a dictionary. This so-called sparseland model [2,3,4] has led to numerous state-of-the-art algorithms for several image processing problems [3], specifically in the context of learning a dictionary D ∈ R^{n×K} for a given image signal class. Commonly, the representation of an image Y ∈ R^{b×b} is broken down into a set of N extracted patches {y_i}_{i=1}^N ∈ R^n, which are in turn sparsely represented. Typically (but not necessarily) it is assumed that the dictionary D is overcomplete, i.e., the number of its basis vectors (atoms) is greater than the original signal dimension (K > n). Given one of the pursuit algorithms, e.g., [5,6,7,8,9], and a dictionary D, one can estimate a matrix X containing sparse approximations {x_i}_{i=1}^N ∈ R^K for each y_i. Hence, a set of weighted linear combinations of a few atoms in D satisfactorily approximates each patch y_i ∈ Y, with the image denoted as Y ≈ DX. The applications of dictionary learning [10,11] include areas such as classification [12,13], efficient face recognition [14], inpainting [15], denoising [16,17], super-resolution [18,19], Morphological Component Analysis (MCA) [20,21] and methods designed for sparse color image processing [22,23].
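The sparseland model described above can be illustrated with a minimal sketch (all names and dimensions here are illustrative, not taken from the paper): each patch y_i is synthesized as a weighted combination of at most T_0 dictionary atoms, selected by the non-zero entries of its code x_i.

```python
import numpy as np

# Minimal sketch of the sparseland model Y ~ D X: each patch y_i is a
# weighted combination of a few dictionary atoms selected by the
# non-zero entries of x_i. Dimensions are illustrative.
rng = np.random.default_rng(0)
n, K, N, T0 = 16, 32, 100, 3          # patch dim, atoms, patches, sparsity

D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms (K > n: overcomplete)

# Build synthetic sparse codes with at most T0 non-zeros per column.
X = np.zeros((K, N))
for i in range(N):
    support = rng.choice(K, size=T0, replace=False)
    X[support, i] = rng.standard_normal(T0)

Y = D @ X                             # patches exactly T0-sparse over D
assert np.count_nonzero(X[:, 0]) <= T0
assert Y.shape == (n, N)
```

In practice the codes X are unknown and are estimated by a pursuit algorithm such as OMP; the sketch only shows the synthesis direction Y ≈ DX.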
In this paper, we provide a detailed presentation of the scalable and sparse modeling dictionary learning framework whose basic outline was originally presented in [24,25]. Our focus is placed on designing a procedure for learning a dictionary capable of both adapting to a specific dataset and providing its effective scalable reconstruction. Given that current work on scalable data recovery is based only on conventional predefined dictionaries such as the DCT [26], we find it important to offer an alternative in the form of an adaptive dictionary sparse representation. In addition, to the best of our knowledge, the existing literature provides only dictionary learning algorithms, such as K-SVD [3,10], that assume the fine resolution as the sole representational output. This is neither sufficient nor tailored to provide progressive image recovery over the trained sparse representation. Thus, we take and extend the classical form of the K-SVD, where the proposed learning scheme differs from the regular one [3,10] in the sense that:
• The dictionary is initialized in a controlled, semi-random manner using the image's MCA properties [20,21];
• A novel learning design is introduced as a regularization of the dictionary training procedure prior to the Singular Value Decomposition (SVD) [27,28];
• It allows more flexibility for adapting the scalable representation to specific data by removing the constraints originally imposed on redundant atoms (i.e., mutually coherent or rarely used ones) in [5];
• A firm spatial frequency distribution is enforced over the dictionary atoms as a built-in feature;
• The OMP pursuit algorithm is replaced by a simple matrix inversion for sparse coefficient estimation, given that the proposed dictionary is complete.
Specifically, the proposed implementation is carried out by introducing a regularization of the K-SVD atom update stage, aiming for a scalable sparse image reconstruction which improves gradually as more and more entries per coefficient x_i ∈ X are used to restore {y_i}. The main contributions of this paper are:
1. It tackles for the first time the problem of creating a dictionary tailored to scalable image restoration, offering a novel model for data that admit sparse representation;
2. It enforces a specific spatial frequency distribution as a built-in feature of the trained dictionary;
3. As a solution to the scalable image restoration problem, it provides an extension and upgrade of the K-SVD dictionary learning concept from non-scalable to scalable adaptive image reconstruction by introducing a semi-random dictionary initialization based on the MCA activity norm [3] and by regularizing the learning process of the dictionary elements, overall promoting the HVS perceptual mechanism features;
4. The potential of the proposed method is shown for adaptive scalable denoising and CS.
Evaluation of the scalable recovery is done using high-motion test video sequences and several natural images, successfully attaining progressive frame-to-frame and image scalable restoration. Experimental results confirm that the proposed scalable scheme significantly outperforms the conventional K-SVD at different scalable image recovery levels. In terms of applications, our focus is placed on applying the proposed scalable scheme to denoising [15] and compressive sensing (CS) [5,38,39,40,41]. In relation to denoising, we tackle the processing and computational demands of [16], given that the experimental results in [42] suggest that the objective quality of current state-of-the-art image denoising schemes cannot be improved by more than 0.1 [dB].
This conclusion results from a comparison between the lowest error rates, given a simple statistical measure derived from a huge image patch distribution [42], and the empirical errors of state-of-the-art denoising algorithms. Lastly, given the CS results in [43,44], we test the performance of the proposed scalable dictionary learning method in one of the CS sampling scenarios.

Problem statement and proposed approach
Adhering closely to the notation used in [10], this section provides a detailed description of the proposed dictionary learning scheme for scalable image reconstruction. We build on the regular K-SVD algorithm [10] by altering its initialization and atom update steps. In general, we are given a set of N signals, i.e., overlapping image patches of size √n × √n vectorized as Y = [y_1, ..., y_N] ∈ R^{n×N}. The classical configuration of the K-SVD algorithm aims to approximate these signals in a sparse way, as weighted linear combinations of a few dictionary elements, i.e.,

min_{D,X} ∥Y − DX∥_F^2 subject to ∥x_i∥_0 ≤ T_0 for all i.

Note however that this conventional approach is not capable of providing a scalable image reconstruction based on the progressive recovery of each image patch y_i. For instance, one can form {a | 1 ≤ a ≤ ⌊K/m⌋ = s} recovery layers for each patch, leading to a reconstructed image denoted as L_a. In general, m can vary and take on different values, 1 < m ≤ K, resulting in a number of scalable recovery layers with m as the scaling parameter. This leads to a progressive image restoration provided as a sequence of image layers L_a, each generated as a combination of truncated versions of the sparse representation X and the dictionary D. At the beginning of the progressive recovery, the base layer L_1 is rebuilt from the first m sparse coefficient entries per patch. That is, for each patch i we take [x_i[1] x_i[2] ... x_i[m]] while the remaining entries are set to zero, x_i[j] = 0 for m < j ≤ K. These are combined with the first m corresponding atoms, [d_1, d_2, ..., d_m], leading to a compression rate of m/n. Afterwards, while reconstructing each subsequent layer L_a (a > 1), m additional coefficients are added. That is, [x_i[1] x_i[2] ... x_i[am]] (x_i[j] = 0 for am < j ≤ K) and [d_1, d_2, ..., d_am], producing a compression ratio of (ma)/n.
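The layered truncation just described can be sketched as follows (a minimal illustration with assumed dimensions, not the paper's implementation): layer L_a is synthesized from the first a·m atoms and the first a·m coefficient entries.

```python
import numpy as np

# Sketch of the progressive recovery described above: layer L_a uses the
# first a*m atoms of D_sc and the first a*m entries of each coefficient.
rng = np.random.default_rng(1)
n, K, N, m = 16, 32, 50, 4
s = K // m                             # number of recovery layers

D_sc = rng.standard_normal((n, K))
X = rng.standard_normal((K, N))
full = D_sc @ X                        # full (non-truncated) reconstruction

errors = []
for a in range(1, s + 1):
    t = a * m                          # entries/atoms used at layer L_a
    layer = D_sc[:, :t] @ X[:t, :]     # truncated dictionary and coefficients
    errors.append(np.linalg.norm(full - layer))

# The final layer uses the complete representation and matches D_sc @ X.
assert np.isclose(errors[-1], 0.0)
```

The compression ratio at layer L_a is (a·m)/n, matching the text: the base layer here stores 4 of the 16 signal dimensions per patch.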
The way in which we achieve an effective sparse adaptive scalable image reconstruction is by introducing:
1. An MCA-based semi-random initialization of the dictionary at the very beginning of the training procedure;
2. A regularization scheme over the second K-SVD iterative stage, i.e., the dictionary atom update, which enforces the significance of the high-frequency components during the regularized atom update.
The following terms will be used in the remainder of this paper:
• Y ∈ R^{n×N} – matrix with N overlapping image patches y_i ∈ R^n;
• D_sc ∈ R^{n×K} – proposed scalable dictionary;
• D ∈ R^{n×K} – conventional non-scalable dictionary obtained using the standard K-SVD [1,5];
• K – the number of dictionary atoms in D_sc or D;
• X ∈ R^{K×N} – sparse matrix with sparse coefficient vectors x_i ∈ R^K.

Dictionary initialization
In the classical K-SVD, prior to the two training stages, the dictionary D is initialized with K image training patches y_i extracted at random [10] from the total set of N. In contrast, prior to initialization we divide the N training patches into two classes, C_1 and C_2, containing smooth and texture image content, respectively. As a classification criterion we use an activity measure similar to the TV norm originally used within the K-SVD MCA setup [3], defined over the √n × √n patch as the sum of the absolute values of its horizontal and vertical pixel differences:

Activity(y_i) = Σ_p ( |∇_h y_i(p)| + |∇_v y_i(p)| ).

Subsequently, the Activity is normalized so that its range spans 0 to 1. These values reflect the degree of "smoothness" and "textureness" of each image patch [3]: the higher the Activity, the higher the level of texture within the patch. Thus, the classification is performed via simple thresholding using a heuristically set value A. This value is taken from [3], where it is shown to provide the best possible classification performance for smooth and texture element separation. Specifically, the parameter A assigns each patch to one of the two classes:

y_i ∈ C_1 if Activity(y_i) ≤ A;  y_i ∈ C_2 if Activity(y_i) > A.

Thereafter, the first K/2 atoms of the proposed dictionary D_sc are initialized by randomly choosing K/2 image patches from the C_1 (smooth) class. The remaining K/2 atoms are randomly picked from the C_2 (texture) class. In this way, we enforce a semi-random initialization which directly controls and affects the starting dictionary structure by placing low frequencies (smooth image areas) within the first half of the atoms d_j (1 ≤ j ≤ K/2) and high ones (texture image areas) within the last half (K/2 < j ≤ K). In return, this sets a foundation for the further design, which is organized around applying the proposed regularization scheme and subsequently tuning the dictionary learning to the main HVS perception characteristic.
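The classification and semi-random initialization can be sketched as below. The `activity` implementation (sum of absolute horizontal and vertical differences, then min–max normalization) is an assumed reading of the TV-like measure described above, and the synthetic smooth/texture patches are illustrative only.

```python
import numpy as np

def activity(patch_vec, side):
    """TV-like activity: sum of absolute horizontal and vertical pixel
    differences (an assumed implementation of the measure in the text)."""
    p = patch_vec.reshape(side, side)
    return np.abs(np.diff(p, axis=0)).sum() + np.abs(np.diff(p, axis=1)).sum()

rng = np.random.default_rng(2)
side, n, K, A = 8, 64, 64, 0.27
smooth = rng.standard_normal((n, 200)) * 0.01   # low-variance "smooth" patches
texture = rng.standard_normal((n, 200))         # high-variance "texture" patches
Y = np.hstack([smooth, texture])

act = np.array([activity(Y[:, i], side) for i in range(Y.shape[1])])
act = (act - act.min()) / (act.max() - act.min())  # normalize to [0, 1]

C1 = np.where(act <= A)[0]                         # smooth class
C2 = np.where(act > A)[0]                          # texture class

# Semi-random init: first K/2 atoms from C1, last K/2 from C2.
idx = np.concatenate([rng.choice(C1, K // 2, replace=False),
                      rng.choice(C2, K // 2, replace=False)])
D_sc = Y[:, idx].astype(float)
D_sc /= np.linalg.norm(D_sc, axis=0) + 1e-12       # unit-norm atoms
assert D_sc.shape == (n, K)
```

The first 32 atoms thus carry low-frequency content and the last 32 carry high-frequency content, which is the structural property the later regularization relies on.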

Sparse coding
The first of the two iterative dictionary learning stages (sparse coding) is posed as a constrained optimization problem, defined in [10] as:

min_X Σ_i ∥y_i − D x_i∥_2^2 subject to ∥x_i∥_0 ≤ T_0 for all i,

given the current estimate of the dictionary D, which is kept fixed during this process. The expression ∥x_i∥_0 accounts for the number of non-zero elements in each vector x_i by means of the l_0 pseudo-norm, where T_0 imposes the upper limit on ∥x_i∥_0. Each signal y_i ∈ Y (i = 1, ..., N), extracted from the original image, is mapped into its sparse representation x_i, commonly via OMP [3,6]. However, given that we train a complete dictionary as proposed, OMP is not needed for the sparse coding step. That is, the exact solution for the scalable dictionary is attained via a simple matrix inversion, x_i = D′_sc y_i, followed by keeping up to the T_0 largest non-zero coefficient entries. Each of the K entries x_i[j] corresponds to one of the atoms d_j ∈ D_sc (j = 1, ..., K), where a non-zero entry x_i[j] ≠ 0 means that the particular atom d_j participates in the sparse representation of the signal y_i [10]. We relax the sparsity constraint, permitting T_0 to take a higher value than in [10], while the relation T_0 << n is still maintained. This allows the scalable signal recovery to be established while introducing a T_0 value, on an empirical basis, that still promotes the sparsity prior of the signal.
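For a complete dictionary (K = n) the direct solve replacing OMP can be sketched as follows; the truncation to the T_0 largest-magnitude entries is an assumed reading of "maintaining up to T_0 largest non-zero coefficient entries".

```python
import numpy as np

# Sketch: with a complete (square, invertible) dictionary, the exact code
# is x = D_sc^{-1} y; sparsity is then imposed by keeping the T0
# largest-magnitude entries. Dimensions are illustrative.
rng = np.random.default_rng(3)
n = K = 16
T0 = 4

D_sc = rng.standard_normal((n, K))
y = rng.standard_normal(n)

x_full = np.linalg.solve(D_sc, y)        # exact solution of D_sc x = y
keep = np.argsort(np.abs(x_full))[-T0:]  # indices of the T0 largest entries
x = np.zeros(K)
x[keep] = x_full[keep]

assert np.count_nonzero(x) <= T0
# With all K entries kept the reconstruction would be exact:
assert np.allclose(D_sc @ x_full, y)
```

Unlike OMP, this is a single O(n^3) factorization shared across all patches, which is the computational saving the text refers to.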

Regularized dictionary update stage
Once the sparse coding stage described in Sec. 2.2 is completed, we move to the atom update stage. Usually, the new basis atom d_j is estimated by processing the current representational residual E_j [10], constructed to account for the error of all N patches when the atom d_j is removed:

E_j = Y − Σ_{k≠j} d_k x^k_T,    (3)

where x^k_T represents the coefficients from the k-th row in X, and x^k_T[i] ≠ 0 denotes that the sparse approximation of the patch y_i includes the atom d_k. Prior to the update, each atom d_j is set to zero while the remaining basis elements are kept fixed. Subsequently, the error matrix E_j is subject to shrinking [10], which reduces its compositional structure to one that contains only the error columns of the patches that use d_j. The update of the pair (d_j, x^j_T) is obtained via an SVD decomposition of this shrunken matrix. Shrinking is necessary in order to preserve the sparsity constraint: the new vector x^j_T keeps the sparsity property and is not fully filled with non-zero entries after the SVD. It is performed by identifying all patches that currently use the atom d_j, ω_j = {i | 1 ≤ i ≤ N, x^j_T[i] ≠ 0}, followed by the formation of the matrix Ω_j of size N × |ω_j|. This matrix contains ones only at the positions (ω_j(i), i), with all remaining entries being zeros. Multiplying (3) by Ω_j achieves the necessary shrinking.
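The conventional (unregularized) atom update can be sketched as below; the function name and the use of column indexing in place of an explicit Ω_j matrix are implementation choices, not the paper's code.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, j):
    """Conventional K-SVD update of atom d_j (sketch following [10]):
    form the residual E_j without d_j, shrink to the patches that use d_j,
    and take the rank-1 SVD approximation."""
    omega = np.nonzero(X[j, :])[0]          # patches whose code uses d_j
    if omega.size == 0:
        return D, X
    E_j = Y - D @ X + np.outer(D[:, j], X[j, :])  # add back d_j's contribution
    E_r = E_j[:, omega]                     # shrinking: restrict to omega_j
    U, s, Vt = np.linalg.svd(E_r, full_matrices=False)
    D[:, j] = U[:, 0]                       # new unit-norm atom
    X[j, omega] = s[0] * Vt[0, :]           # updated (still sparse) row
    return D, X

rng = np.random.default_rng(4)
n, K, N = 16, 20, 60
Y = rng.standard_normal((n, N))
D = rng.standard_normal((n, K)); D /= np.linalg.norm(D, axis=0)
X = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.2)

err0 = np.linalg.norm(Y - D @ X)
D, X = ksvd_atom_update(Y, D, X, j=0)
assert np.linalg.norm(Y - D @ X) <= err0 + 1e-9   # update never worsens the fit
```

Because only the ω_j columns of row j are rewritten, the support of x^j_T never grows, which is exactly the sparsity-preservation role of the shrinking step.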
However, as already stated, this is insufficient to generate a dictionary tailored for scalable image restoration. That is why we redefine the structure of E_j in (3) by introducing a regularization scheme. The proposed procedure is mainly motivated by the functional properties of the HVS. Specifically, human eyes tend to pay more attention to the edges of an object, given the high firing rate of the visual cortex neurons at the moment of perception, primarily identifying objects by their bounding shapes [31,32]. Thus, in order to facilitate effective scalable recovery, we find it necessary to ensure that the main object shapes are identified from the very beginning of the image reconstruction. This effect resembles, to some extent, object recognition procedures [35], which mostly rely on the exploitation of the image's high-frequency information. Hence, higher spatial frequencies should be more relevant to scalable dictionary learning. We address this by appropriately favoring the significant changes associated with the edges in the image patches (i.e., the texture) during the D_sc training. This is carried out by dividing the current sparse approximations of all patches in E_j (3) into two weighted batches:

E^R_j = Y − v_0 Σ_{k ≤ K/2, k≠j} d_k x^k_T − v_1 Σ_{k > K/2, k≠j} d_k x^k_T.    (4)

The superscript R stands for regularized and the pair [v_0, v_1] denotes the regularization terms. Each batch corresponds to the low- and high-frequency components of the training image patches:
• the first batch (weighted by v_0) contains only atoms initialized from the C_1 smooth class;
• the second batch (weighted by v_1) contains only atoms initialized from the C_2 texture class.
This separation is plausible due to the semi-random initialization described in Sec. 2.1. The proposed design is the result of testing various weight pairs [v_0, v_1] under the constraint v_0 + v_1 = 1, in order to avoid degeneracy of the learned representation. The outcomes show that a carefully introduced regularization over the smooth and texture image components yields an appropriate dictionary for scalable data representation (Sec. 3). Further, by introducing the low- and high-frequency dictionary parts D^low_sc and D^high_sc with the corresponding coefficient parts X^low_j and X^high_j, the proposed regularized error matrix (4), after shrinking with Ω_j, can be rearranged as:

E^R_j = Y_j − v_0 D^low_sc X^low_j − v_1 D^high_sc X^high_j,    (5)

where Y_j represents the subset of the image patches y_i from Y with indices given in ω_j. The superscripts low and high denote the smooth and texture frequency content associated with the weight pair [v_0, v_1], which regularizes the contribution of their residual components to E^R_j. Consequently, this separation controls the type of information used for the update of the atom d_j.
For all atoms j = 1, ..., K, the proposed update stage is summarized as:
1. STEP 1 – Allocate the image patches whose current sparse approximation, given as the linear superposition D_sc x_i, includes the atom d_j, as done in [10]; map them accordingly with ω_j and denote them as the subset of patches Y_j with the corresponding subset of sparse coefficients X_j;
2. STEP 2 – In contrast to [10], split each current sparse coefficient x_i ∈ X_j (i ∈ ω_j), associated with the atom d_j, into two parts using binary vectors T^low, T^high ∈ R^K:

x^low_i = T^low ⊙ x_i,  x^high_i = T^high ⊙ x_i,

where T^low cancels every element x_i[l] for l > K/2 (entries associated with the dictionary elements initialized from class C_2) and T^high cancels every element x_i[l] for l ≤ K/2 (entries associated with the dictionary elements initialized from class C_1):

T^low[l] = 1 for 1 ≤ l ≤ K/2 and 0 otherwise;  T^high = 1 − T^low.

In this way the smooth and texture patch content are finally extracted as D^low_sc X^low_j and D^high_sc X^high_j, respectively;
3. STEP 3 – Form the regularized error matrix E^R_j in (5) by weighting the two parts with the pair [v_0, v_1];
4. STEP 4 – Update the pair (d_j, x^j_T) via the SVD decomposition of the shrunken E^R_j.
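STEP 2 and the assembly of the regularized residual can be sketched as follows (a minimal illustration with assumed shapes; the P_7 weight pair [1, 0] is taken from the experimental section):

```python
import numpy as np

# Sketch of the regularized residual E^R_j: sparse codes are split with
# binary masks into a "low" part (first K/2 entries) and a "high" part,
# weighted by v0 and v1. Shapes are illustrative.
rng = np.random.default_rng(5)
n, K, N = 16, 8, 30
v0, v1 = 1.0, 0.0                        # the P7 pair used in the paper

D_sc = rng.standard_normal((n, K))
X = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.4)
Y = D_sc @ X + 0.01 * rng.standard_normal((n, N))

T_low = np.zeros(K); T_low[: K // 2] = 1.0   # keep smooth-class entries
T_high = 1.0 - T_low                          # keep texture-class entries

j = 0                                         # atom being updated
Xj = X.copy(); Xj[j, :] = 0.0                 # remove d_j's contribution
X_low = T_low[:, None] * Xj                   # elementwise mask per column
X_high = T_high[:, None] * Xj

E_R = Y - v0 * (D_sc @ X_low) - v1 * (D_sc @ X_high)

# With v1 = 0 the texture approximation is not subtracted, so the
# high-frequency content stays inside E_R for the SVD to capture.
assert E_R.shape == (n, N)
assert np.allclose(E_R, Y - D_sc @ X_low)     # since v0 = 1, v1 = 0
```

Shrinking to the ω_j columns and the rank-1 SVD then proceed exactly as in the conventional update.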
The proposed regularization plays an important role, given that the weights v_0 and v_1 control which spatial frequency content will be joined to E^R_j. Consequently, the SVD decomposition (STEP 4) generates the atoms of the scalable dictionary D_sc based on the information contained within E^R_j. By keeping more of the original high-frequency information in the residual (v_1 < 0.5) and suppressing the low-frequency one (v_0 > 0.5), the algorithm regularizes the learning process, effectively generating a dictionary D_sc suitable for scalable representation. This enables the recovery of the basic image object shapes from the base layer L_1, resulting in a learning procedure tailored to the characteristics of the HVS.

Denoising and scalable dictionary scheme
Prior to presenting the scalable denoising process, we inspect the way in which noise is removed during the classical K-SVD dictionary training. Commonly, noise is iteratively discarded throughout the two stages:
1. While performing sparse coding, OMP stops when the current approximate sparse solution reaches the sphere of radius √(Cn)σ in patch space. This radius constrains the acceptable level of the recovered noise strength, i.e., ∥e∥_2^2 ≤ Cnσ^2. Going below this boundary would result in direct noise reconstruction; here C is a heuristically set constant and σ stands for the noise standard deviation;
2. During the dictionary atom update, noise is removed via the SVD decomposition, which estimates a new "average" direction for each atom, least influenced by the distortion.
The conventional K-SVD denoising energy minimization problem [16,17] is given as:

{D̂, X̂, Ẑ} = arg min_{D,X,Z} λ∥Z − Y∥_2^2 + Σ_i μ_i ∥x_i∥_0 + Σ_i ∥D x_i − R_i Z∥_2^2,    (6)

where Y is the noisy image, Z its clean estimate and R_i the operator extracting the i-th patch. We simplify this complex minimization task by relaxing the regularization process with the introduction of the proposed scalable dictionary D_sc as follows:

arg min_{D_sc} ∥Y − D_sc X∥_F^2.    (7)

In (7) we decide to discard the sparse coding phase, merely performing noise removal during the scalable dictionary D_sc update. Our detailed study of the denoising scheme in [16,17] suggests that the initial sparseness level, i.e., the average number of non-zero coefficients, stays nearly fixed during the dictionary training in the classical K-SVD setup; that is, the level established after the first OMP sparse coding over the initialized dictionary [16]. Furthermore, we impose the assumption that noise distorts texture less than smooth image components, due to the high-frequency nature of the texture information. Justification for this is provided in Sec. 4.3, where we illustrate how various noise levels distort smooth and texture image blocks based on their standard deviation estimated before and after the noise is introduced. Thus, we promote the idea that, after the initial matrix inversion X = D′_sc Y (a substitute for OMP), we can neglect subsequent inversions during the dictionary learning while still obtaining satisfactory denoising results, given that texture information prevails in the modified dictionary update. In this setup, the coefficient entries x^j_T are updated only during the SVD decomposition employed for the atom update. Hence, the introduced modification is expected to result in a considerably shorter computational processing time while achieving quality comparable to that of the non-scalable K-SVD denoising scheme.
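The single-inversion coding step of the simplified scheme can be sketched as below. This is a minimal illustration under assumed details (synthetic unit-norm dictionary, an assumed T_0, top-magnitude truncation standing in for the sparsity constraint), not the paper's denoising pipeline.

```python
import numpy as np

# Sketch of the simplified coding in (7): compute X = D_sc^{-1} Y once,
# then impose sparsity by truncation; subsequent refinement would happen
# only through the SVD-based atom updates, with no repeated sparse coding.
rng = np.random.default_rng(6)
n = K = 16
N, sigma, T0 = 200, 0.1, 4              # T0 is an assumed sparsity level

D_sc = rng.standard_normal((n, K))
D_sc /= np.linalg.norm(D_sc, axis=0)    # unit-norm complete dictionary

clean = D_sc @ (rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.25))
Y = clean + sigma * rng.standard_normal((n, N))   # noisy patches

X = np.linalg.solve(D_sc, Y)            # single inversion replaces OMP

# Keep only the T0 largest-magnitude entries per column (sparsity proxy).
for i in range(N):
    small = np.argsort(np.abs(X[:, i]))[:-T0]
    X[small, i] = 0.0

denoised = D_sc @ X
assert denoised.shape == Y.shape
assert all(np.count_nonzero(X[:, i]) <= T0 for i in range(N))
```

The per-patch cost drops from an iterative pursuit to one shared factorization plus a sort, which is where the claimed runtime saving comes from.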

Computational Complexity
The proposed design does not exceed the cost of the original dictionary learning in [3,10] when training a strictly representative dictionary over a noise-free image. Given that no additional transforms are employed, but only a linear separation of the low- and high-frequency components via the semi-random initialization and the introduced error matrix regularization (as shown in Sec. 2.3), the computational complexity remains of the same order as that of the conventional non-scalable K-SVD. That is, the number of operations per pixel is still O(nT_0 I), where I stands for the number of iterations. By setting the number of atoms to K = n and replacing OMP with a simple matrix inversion, we manage to even decrease the processing demands while achieving good signal recovery (typically, in [3,10], K is equal to 2n, 3n or 4n). This is particularly evident when applying the proposed scheme to denoising, given that the sparse coding stage is removed. More details on the processing time necessary for denoising are given in Sec. 3.2.

Simulation results
The performance of the proposed scalable K-SVD method is evaluated in a set of experiments applied to:
• standard CIF high-motion video test sequences "Stephan" and "Tempete", at a resolution of 352 × 288 and a frame rate of 30 Hz;
• several natural images, e.g., "Boat" and "Peppers", of size 512 × 512.
Variables and parameters for all simulations are summarized in Tab. 1, together with their values and roles. Prior to processing, every frame is broken down into N = 96,945 (or every natural image into N = 255,025) overlapping patches of size 8 × 8 pixels. Thus, the vectorized dimension of the signals used for the scalable dictionary learning algorithm is n = 64 pixels, while both dictionaries D_sc and D contain K = n atoms with redundancy factor r = 1. The sparsity level T_0 is set to 10 for both the training and the reconstruction phase. This provides the best processing effectiveness (in terms of PSNR values) for the proposed scalable learning design, after testing a wide range of sparsity levels. The number of progressively recovered layers L_a is defined with the scaling parameter m = 4 as ⌊K/m⌋ = s = 16 for the scalable recovery of every patch, and therefore of the image, with a = 1, ..., 16. Starting from the first scalable version of the processed image, i.e., the base layer L_1, the reconstruction is carried out using only the first m = 4 entries (i.e., 6.25%) of each sparse coefficient x_i. This first level of truncated coefficients from X is denoted by X^1 ∈ R^{4×N}. Along with this, we employ a truncated version of the trained dictionary D_sc, namely D^1_sc, with only the first m = 4 atoms. The remaining recovery levels are progressively enhanced by adding four (i.e., m) additional entries to each representational vector and four (i.e., m) additional atoms. In general (for any value of m), the progressive recovery of each image patch y_i at a new scalable layer L_a starts by first taking all m(a − 1) entries of the associated sparse coefficient x_i previously employed for the estimation of the scalable version of y_i at the level L_{a−1}.
This continues by adding the subsequent m values of the sparse coefficient x_i, indexed as x_i[m(a − 1) + 1], ..., x_i[ma], with the reconstruction of the scalable patch at L_a given as y^a_i = D^a_sc x^a_i. The end result is that each recovered patch at the new layer L_a contains the same first m(a − 1) reconstructed elements as the patches in L_{a−1}, plus the newly estimated m. For the shown case, this is done until the final restoration level L_16 is attained, using the full sparse representation X^16 = X and all atoms of the dictionary D^16_sc = D_sc. That is, the recovery scheme for each image layer is given as:
• L_1 = D^1_sc X^1: 4 atoms and entries per sparse coefficient;
• L_2 = D^2_sc X^2: 8 atoms and entries per sparse coefficient;
• ...;
• L_15 = D^15_sc X^15: 60 atoms and entries per sparse coefficient;
• L_16 = D^16_sc X^16: 64 atoms and entries per sparse coefficient.
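The incremental nature of the recovery (each layer reuses the previous layer's result and adds only the next m terms) can be sketched as follows, with illustrative dimensions:

```python
import numpy as np

# Sketch of the incremental update above: layer L_a reuses the L_{a-1}
# reconstruction and adds only the next m coefficient/atom products.
rng = np.random.default_rng(7)
n, K, N, m = 16, 64, 40, 4

D_sc = rng.standard_normal((n, K))
X = rng.standard_normal((K, N))

recon = np.zeros((n, N))                 # running reconstruction, L_0 = 0
for a in range(1, K // m + 1):
    lo, hi = m * (a - 1), m * a
    recon = recon + D_sc[:, lo:hi] @ X[lo:hi, :]   # add the new m terms only

# After the final layer the full representation is recovered.
assert np.allclose(recon, D_sc @ X)
```

This formulation makes explicit that upgrading from L_{a−1} to L_a costs only O(mnN) operations, independent of a.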
Unlike the classical, non-scalable sparse dictionary learning, where the practice is to train an overcomplete dictionary (K >> n, r > 1), we promote training a complete one. One of the main reasons for this arises from our observation that, whether we train a complete or an overcomplete basis D_sc, the achieved restoration quality for scalable image representation is highly comparable. Tab. 2 and Tab. 3 show the averaged comparison at every scalable recovery level L_a given the complete and the overcomplete scalable scheme for the video sequence "Stephan" and the image "Peppers", respectively. The number of atoms for the overcomplete dictionary D_sc is K = 128 (r = 2), thus yielding a greater number of recovery levels, ⌊K/m⌋ = s = 32, than the complete scheme for the same scaling factor m = 4. On average, taking into account all testing results, the difference at the highest recovered layer, i.e., L_16, is around 0.15 [dB] (for frame size 352 × 288) and 0.66 [dB] (for image size 512 × 512) in favor of the overcomplete dictionary D_sc. Tab. 2 and Tab. 3 compare every two recovery levels of the overcomplete D_sc with one recovery level of the complete D_sc (e.g., L_25 and L_26 with L_13). The conclusion follows that the scalable performance of the complete D_sc surpasses the overcomplete D_sc at all recovery levels (bold values) except for the final L_16, that is, L_32, for "Stephan", and at almost all levels for "Peppers". Having this in mind, together with the fact that a smaller number of trained atoms:
• minimizes the amount of information necessary for training and signal recovery;
• lowers the computational complexity;
we chose K = 64. Prior to defining the proposed scalable scheme, we performed exhaustive simulations in order to evaluate the performance for various regularization parameter pairs P_i = [v_0, v_1], searching for the one which provides the most effective dictionary D_sc for scalable image restoration. Fig. 1 and Fig. 2 present the averaged PSNR and SSIM estimates at every recovery layer L_a of the scalable restoration, given the high-motion video sequence "Stephan" and 10 averaged iterations of the natural image "Peppers", with K = 64 atoms. As we can see, out of the seven regularization scenarios P_i (1 ≤ i ≤ 7) (Fig. 1 and Fig. 2), P_7 results in the dictionary that is most effectively tailored to scalable sparse image representation, given that it yields overall the highest PSNR and SSIM restoration values. Similar results are achieved for the video sequence "Tempete" and several other natural images such as "Boat". In all tests, the Activity threshold is set to A = 0.27, as in [3]. In addition, we provide results in Fig. 3 (PSNR) and Fig. 4 (SSIM) for all P_i regularization variations when training an overcomplete dictionary D_sc, again with K = 128 atoms and K/m = 32 recovery levels L_a. Again, P_7 achieves the best scalable recovery performance.
To reiterate, v_0 is associated with the D^low_sc elements, which capture low spatial frequencies. These atoms represent the compositional structure of patches extracted from large, smooth, low-variance areas lacking harsh edges, e.g., the tennis field in the "Stephan" sequence or the sky background in the "Tempete" sequence. On the other hand, v_1 weights the contribution of the atoms that contain higher spatial frequencies, that is, areas of high detail with many contrasting edges, such as the audience in "Stephan" or the flower object in "Tempete". Looking at (5) in Sec. 2.3 with the P_7 parametrization ([v_0, v_1] = [1, 0]), we can conclude that the regularization process will, in each iteration:
• discard the texture sparse approximation D^high_sc X^high_j, given that v_1 = 0;
• keep the smooth part D^low_sc X^low_j, with v_0 = 1.
This determines the final content of the regularized error matrix E^R_j, where the texture patches (Activity(y_i) > 0.27) are the dominant information, being directly included in the error synthesis rather than being just a part of the representational residual, as the smooth ones are (Activity(y_i) ≤ 0.27). Furthermore, Sec. 4 provides a detailed discussion of the effective structural differences between the trained dictionaries and of how D_sc is better tailored to the HVS perception system than the non-scalable, conventional dictionary D. Some of these dictionaries are shown in Fig. 5a and Fig. 5b, illustrating examples of the scalable ("SC K-SVD") and non-scalable ("NSC K-SVD") dictionaries trained over the first frame of the "Stephan" sequence.

Table 3. Averaged PSNR quality assessment for scalable restoration given the "Peppers" natural image for two dictionary sizes, K = 64 and K = 128.

Given the first frame of either sequence, i.e., the training frame, the introduced training scheme is carried out only once, generating the dictionary D_sc. Subsequently, while reconstructing each incoming frame we use this single D_sc, not training any new dictionary. This approach considerably reduces the computational complexity of the scalable sparse video representation, since training is done only once instead of for each incoming frame. This is immensely important in the context of real-time scalable image/video application development. It should be mentioned that, in general, when the video scene undergoes significant changes with respect to the training frame, a new training frame should be inserted. This is necessary in order to accommodate the difference in compositional structure between the previous frames and the newly changed one.

Scalability Performance
The comparison of restoration quality is done for the proposed regularized scalable "SC K-SVD" and the conventional non-scalable "NSC K-SVD" algorithm. Fig. 6a and Fig. 6b illustrate the PSNR estimates for the video sequences "Stephan" and "Tempete", respectively. The shown results are averaged over all frames for each of the 16 recovery layers {L_a}_{a=1}^{16}. A minor exception is the "Stephan" sequence, whose frames are divided into two groups: [1, 270] and [271, 300]. This frame separation is carried out in order to demonstrate the variation in the quality of the restored image when a new object is introduced in frame 271. We would expect a certain degradation in restoration quality, given that D_sc is trained over the training frame, which does not contain the newly introduced visual object.
The depicted results clearly demonstrate that the proposed regularized scheme considerably outperforms the standard one [10] over all recovery levels L_a, where an average gain of 11.32 [dB] ("Stephan", first 270 frames) and 8 [dB] ("Tempete", all frames) is achieved. This proves the superiority of the proposed work in terms of scalable recovery. Once a new image object appears (e.g., the tennis net in the "Stephan" sequence), a noticeable drop in the scalable recovery quality can be observed in Fig. 6a for the second frame group [271, 300]. Specifically, on average the "SC K-SVD" PSNR declines by 1.84 [dB], while still outperforming the "NSC K-SVD" by 9.43 [dB]. Overall, only in the case when all the information on the sparse coefficients is available (X^16 ∈ R^{64×N}) does the regular K-SVD algorithm have a slight advantage over the proposed scheme. Besides the standard objective quality assessment, i.e., PSNR, we consider an alternative quality measure, the so-called Structural Similarity Index (SSIM) [37]. It is designed to quantify the degree to which image structural information is degraded, by calculating a quality index ranging from 0 (denoting the highest distortion) up to 1 (denoting no distortion). This measure is especially appealing for the evaluation of the proposed scalable image restoration framework due to the fact that the SSIM is based on the discussed HVS characteristics. Specifically, it takes into account local pixel distortions of the luminance and contrast information. The higher the SSIM index value, the more successful the retrieval of the HVS perception information at each scalable layer L_a, resulting in better visual information and thus providing progressive image recovery of better quality. Therefore, the SSIM index values shown in Fig. 7 quantify the degree of degradation of the structural information in a frame at each scalable reconstruction level L_a. Once again, these estimates are averaged over all frames for both testing video sequences.
As in the case of PSNR, Fig. 7a shows that for the "Stephan" sequence the proposed scalable method generally surpasses the non-scalable one by 0.37 (first frame group) and 0.28 (second frame group). Similarly, Fig. 7b shows an SSIM difference of 0.27 between the two dictionary learning algorithms for the "Tempete" sequence over all recovery levels L_a. Interestingly, the SSIM evaluation exhibits a different trend than PSNR, where switching to the second frame group caused a sharp drop in quality at restoration levels L_14, L_15, L_16. In contrast, Fig. 7a shows highly similar SSIM values of around 0.94 at L_14, L_15, L_16 for both frame groups, meaning that the structural information of the image is preserved despite the new visual object in the scene.
Figure 6. Frame-average PSNR of the scalable reconstructed video test sequences ("Stephan" and "Tempete") given for each layer L_a of the scalable video reconstruction using the scalable and non-scalable K-SVD algorithm. (a) "Stephan" video sequence; (b) "Tempete" video sequence.
A visualization of the results is provided in Fig. 8 and Fig. 9 for "Stephan" and in Fig. 10 and Fig. 11 for the "Tempete" sequence in order to illustrate the subjective perceptual quality. These figures show the scalable reconstruction outcome at every recovery level L_a for the so-called training frame (Fig. 8 and Fig. 10). The last frames of both video sequences are shown in Fig. 9 and Fig. 11. Here one can observe the visual variation in restoration quality when a new object containing high-frequency structure is introduced (i.e., the tennis net in Fig. 9) or when more spatial low frequencies are added (i.e., the background in "Tempete" in Fig. 11). All scalable restorations are performed with the single dictionary trained on the first (training) frame. From these figures one can notice that the proposed scalable scheme is able to recover the frame at recovery level L_3 (D_sc^3 ∈ R^(64×12) and X^3 ∈ R^(12×N)), whereas the standard K-SVD [10] fails to show any scalable characteristics up to L_15 (D_sc^15 ∈ R^(64×60) and X^15 ∈ R^(60×N)) for "Stephan" and L_8 (D_sc^8 ∈ R^(64×32) and X^8 ∈ R^(32×N)) for "Tempete". It should be noted that the "NSC K-SVD" does show slight scalability on the "Tempete" sequence. However, this is still far from the performance of the proposed method, which keeps its reconstruction efficiency consistent across quite different video sequences, hence showing its processing stability.

Application to image processing 1: denoising
This experimental section validates the advantages of the proposed scheme in the denoising application. We compare the introduced scalable scheme (denoted "SC") with the non-scalable one (denoted "NSC") and with the overcomplete classical K-SVD dictionary ("Org"). For these algorithm setups we discuss objective quality assessment and processing time complexity. However, unlike in Sec. 3.1, where dictionary training is done only once over the first undistorted frame in the video sequence, for denoising the dictionary is trained for each incoming noisy frame, as in [15]. All provided results are averaged over the frames of the two video sequences, "Stephan" and "Tempete", together with additional estimates for several conventional images, i.e., "Boat" and "Peppers". For the experiments we consider five different noise standard deviations: σ = 20, 40, 60, 80, 100. The restoration at every scalable level L_a is carried out in the same way as in the previous section. Tab. 4 through Tab. 7 compare the denoising outcomes at every scalable recovery layer L_a for all mentioned data. Additionally, each level is compared against the denoising estimates of the overcomplete K-SVD scheme "Org" (red, bold values) in order to emphasize the effectiveness of the proposed scheme. From the provided results we conclude that the PSNR values of the "SC" at the final restoration level L_16 (Tab. 4 to Tab. 7) are, in most cases, comparable to or surpass (black bold values) the denoising performance of the classical K-SVD setup once the noise exceeds σ = 60. This better performance indicates that the higher frequencies are less influenced by the noise, since they are enforced as the most important content of the trained dictionary and contribute most to the restored frame or image, unlike in the conventional K-SVD. Overall, the proposed method achieves better denoising performance, with lowest and highest gains of 0.1 dB and 5.7 dB, respectively.
Figure 9. Visual assessment of the scalable reconstruction using the scalable and non-scalable K-SVD at every recovery level L_a. (a) Last frame, i.e., 300th, "Stephan" test sequence ("SC K-SVD"); (b) Last frame, i.e., 300th, "Stephan" test sequence ("NSC K-SVD").
Figure 10. Visual assessment of the scalable reconstruction using the scalable and non-scalable K-SVD at every recovery level L_a. (a) Training frame, "Tempete" test sequence ("SC K-SVD"); (b) Training frame, "Tempete" test sequence ("NSC K-SVD").
Figure 11. Visual assessment of the scalable reconstruction using the scalable and non-scalable K-SVD at every recovery level L_a. (a) Last frame, i.e., 260th, "Tempete" test sequence ("SC K-SVD"); (b) Last frame, i.e., 260th, "Tempete" test sequence ("NSC K-SVD").
In addition, we tested a scenario where the sparse coding stage is removed from the classical non-scalable K-SVD scheme in order to further validate the practicality of the proposed scalable design. The final estimates show a drop of 2 dB for "NSC" without the sparse coding stage when compared to the best denoising results of the "SC". Hence, the newly introduced regularization scheme is efficient for noise removal even though, of the two iterative dictionary learning stages, we keep only the atom update when learning over the corrupted image. The greatest benefit of scalable denoising is the direct reduction of both computational complexity and processing time; Tab. 8 shows the total denoising run times in seconds for two image sizes: 1. 352x288 (the size of the video sequence frames); 2. 512x512 (the size of the conventional images).
The reported times were obtained on a Dell machine with a 64-bit Intel Core processor at 2.40 GHz and 8 GB of RAM. The number of iterations is fixed to sixteen for all provided results. Based on the averaged run times, we observe a reduction of: 1. approximately 6.5 times for data of size 352x288 when comparing "SC" vs. "Org"; 2. approximately 7.3 times for data of size 512x512 when comparing "SC" vs. "Org"; 3. approximately 10.8 times for data of size 352x288 when comparing "SC" vs. "NSC"; 4. approximately 11 times for data of size 512x512 when comparing "SC" vs. "NSC"; while still achieving highly comparable (lower noise levels) or better (higher noise levels) results. The fourth column of Tab. 8 shows the time for the error matrix formation per iteration. These numbers demonstrate that the introduced modification of the atom update, in the form of a new error matrix scheme, affects processing complexity only marginally, increasing it by about two seconds on average. Finally, Fig. 12, Fig. 13, Fig. 14 and Fig. 15 provide a visual preview for all discussed data at recovery level L_16 after noise of σ = 40 is removed using the scalable, non-scalable complete, or overcomplete K-SVD scheme. After additional subjective quality assessment we conclude that the provided results are highly comparable, while, as emphasized, the proposed "SC" method imposes considerably lower computational demands than both classical K-SVD setups.
Table 4. PSNR quality assessment for scalable denoising via the scalable and non-scalable K-SVD dictionary, "Stephan" sequence.
Table 5. PSNR quality assessment for scalable denoising via the scalable and non-scalable K-SVD dictionary, "Tempete" sequence.
Table 6. PSNR quality assessment for scalable denoising via the scalable and non-scalable K-SVD dictionary, "Boat" image.

Application to image processing 2: compressive sensing
Following closely the experimental layout suggested in [26], we investigate the effectiveness and performance of the proposed scalable CS video acquisition scheme when combined with a learned basis rather than a predefined one. In particular, the proposed framework aims for frame-by-frame progressive CS recovery where, as in [26], the image is processed block by block. We consider two cases of CS scalable recovery: • with the proposed scalable K-SVD dictionary tailored to this task; • with the conventional non-scalable K-SVD dictionary.
Rather than taking the full number of measurements [5,45,46] over every incoming frame, CS sampling is carried out in incremental steps. Note that this is applicable only to the CS scalable sensing scenario. Given a sufficient number of progressive measurements per patch, s_1, s_2, …, s_L (s_i < n), we are able to recover the frame gradually by incrementally retrieving the entries of the sparse coefficient vectors in X_i via OMP. Furthermore, each incremental number of samples s_i satisfies the fundamental result of CS theory [2] that bounds the number of measurements necessary for satisfactory signal reconstruction.
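To make the progressive recovery mechanism concrete, the following minimal sketch recovers a sparse code from an increasing number of random measurements via a basic OMP. The dictionary here is synthetic Gaussian rather than the learned D_sc, the sampling matrices are plain Gaussian rather than the structured Φ described below, and all sizes are hypothetical:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily pick k columns of A by
    correlation with the residual, then least-squares on the support."""
    residual, support = y.copy(), []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        sol, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ sol
    x[support] = sol
    return x

rng = np.random.default_rng(0)
n, K, k = 64, 128, 3
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
x_true = np.zeros(K)
x_true[[5, 40, 99]] = [1.5, -2.0, 1.0]  # k-sparse code
y = D @ x_true                          # one clean image patch

errors = []
for s in (10, 20, 30, 40, 50):          # progressive measurement counts s_i < n
    Phi = rng.standard_normal((s, n)) / np.sqrt(s)
    x_hat = omp(Phi @ D, Phi @ y, k)    # recover code from compressed samples
    errors.append(np.linalg.norm(D @ x_hat - y))
```

Each pass re-solves the recovery with more samples; in the actual scheme the earlier measurements are retained and only the new ones are appended.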
Unlike conventional CS, for our testing we apply a specially structured sampling matrix Φ. This aims to achieve efficient scalable acquisition of samples over each image layer, commonly denoted as y_CS = Φy = ΦD_sc x. The implementation follows the systematic non-adaptive approach of [26], which generates a structured sampling matrix Φ optimally suited for the scalable task at hand. For each recovery step (as in [26]) we scale the sampling matrix size-wise into its truncated versions Φ_i ∈ R^(s_i×n). Once the sampling is done we obtain a group of samples, each denoted y_CS^i. The sampling is structured such that the basic level is collected via Φ_1, which contains binary entries generated from the Gaussian distribution. The remaining measurements are sampled via Bernoulli binary distributed entries of Φ_i, consecutively added to the basic layer for the scalable restoration. Starting from the base level i = 1 with y_CS^1 = Φ_1 y = Φ_1 D_sc x (approximately 15% of the original patch size), we advance through enhancement layers by uniformly collecting additional samples (e.g., s_2, s_3, …, s_L = S) in each step until the total number of S < n samples is reached. Hence, given the single trained dictionary D_sc (as in Sec. 3.1) learned over the training frame of either video test sequence, one can define an arbitrary number of sampled layers over the extracted image patches. Fig. 16 shows reconstruction results obtained via the proposed adaptive scalable CS approach, averaged over the frames, starting with s_1 = 10 and adding five more samples per patch as frame recovery progresses (e.g., s_2 = 15, s_3 = 20, etc.). We therefore define nine sampling levels in total, resulting in nine patch, i.e., frame, reconstruction layers. The full number of measurements is S = 50 (n > 50), which accounts for roughly 80% of the information of the sampled signal y_i. The gap between the two methods is evident for layers sampled both at low (e.g., 15%, 23%, 31% and 39%) and high (47%, 55%, 62%, 70% and 80%) subrates, at around 3.03 dB for the "Stephan" sequence and 2.96 dB for the "Tempete" frames.
Figure 16. Frame-average PSNR of the scalable CS reconstruction for two test video sequences as a function of the measurement percentage using the scalable ("SC K-SVD") and non-scalable ("NSC K-SVD") algorithm.
We can see that the proposed design succeeds at different subsampling rates, whereas the conventional K-SVD achieves comparable but never better performance as more measurements are added.

Discussion
Training the dictionary for scalable sparse data representation and applying it to denoising adopts a different approach than the one originally introduced by K-SVD [3,9,15]. The atom update illustrated in Sec. 2.3 and the denoising scheme proposed in Sec. 2.5 are grounded in the following assumptions:
• Progressive, quality-scaled recovery of the image/frame can be attained via a learned dictionary by modeling the main HVS perception mechanism properties and integrating them during the dictionary's training;
• This is implemented through MCA-based semi-random initialization, allocation, separation and regularization of the low and high spatial frequency information captured by the atoms during the dictionary training procedure;
• Texture image components are less distorted by noise than smooth ones, so with the newly introduced design, an SVD over the proposed regularized error matrix E_j^R is sufficient for noise removal.
These hypotheses give rise to a series of questions:
1. How are spatial frequencies distributed over the scalable and non-scalable dictionary's atoms?
2. Can this distribution be regarded as a built-in property of the trained dictionaries?
3. Does the proposed design properly adopt the HVS perception mechanism properties?
4. To what degree does noise affect smooth and texture image properties?
The following sections aim to look into some answers to these questions.

Spatial frequencies distribution
In Sec. 2.1 we gave a detailed explanation of the semi-random dictionary initialization, which enforces allocation and separation of the dictionary's atoms into smooth and texture ones. As explained, the classification criterion we use is formulated via the Activity norm of [3]. We therefore further assess the spatial frequency distributions for both dictionaries by analyzing the Activity trend once training is done, as illustrated in Fig. 17. Whether we consider frames of a video sequence (Fig. 17a and Fig. 17b) or a conventional image (Fig. 17c and Fig. 17d), we can conclude that the classical K-SVD scheme results in dictionaries that show no specific structural features in terms of how smooth and texture information is learned and allocated. In contrast, the proposed design shows a clear distinction between atoms that carry:
• low spatial frequencies: Activity(d_j) ≤ A = 0.27 for j = 1, …, K/2 (first K/2 atoms);
• high spatial frequencies: Activity(d_j) > A = 0.27 for j = K/2 + 1, …, K (last K/2 atoms);
thus successfully implementing this specific distribution as a built-in property of the scalable dictionary D_sc, unlike the classical K-SVD scheme.
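The Activity norm itself is defined in [3]; as a stand-in, a simple total-variation-style proxy illustrates the intended smooth/texture split. The proxy measure and the toy atoms below are our own illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def activity(atom, size=8):
    """Hypothetical proxy for the Activity norm of [3]: mean absolute
    horizontal plus vertical pixel difference of the reshaped 8x8 atom."""
    a = atom.reshape(size, size)
    return np.abs(np.diff(a, axis=0)).mean() + np.abs(np.diff(a, axis=1)).mean()

# Toy atoms: a smooth gradient atom vs. a checkerboard texture atom.
smooth = np.outer(np.linspace(0, 1, 8), np.ones(8)).ravel()
texture = (np.indices((8, 8)).sum(axis=0) % 2).astype(float).ravel()

A = 0.27  # threshold used in the paper to split low/high frequency atoms
```

Under this proxy the smooth atom falls below the threshold A and the texture atom well above it, mirroring the split enforced on the first and last K/2 atoms of D_sc.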

Contrast variation
The HVS sensitivity properties are properly integrated if the proposed scalable design reinforces learning of the spatial high-frequency components (see Sec. 2.3), which represent regions of high contrast variation [33,34]. By examining how the atoms of D and D_sc differ in their composition structure (i.e., contrast variation), we verify the credibility of the HVS property modeling. This is done by estimating the contrast levels captured within the atoms during the dictionary learning procedure. The contrast levels are assessed by computing the standard deviation (std) of the atom's pixel intensities. The estimates are averaged over several dictionaries trained over the frames of the same video sequence, or over several training runs on the same image. The computation is adopted from [37], where the authors use std as a measure of image contrast. In Fig. 18 we depict the standard deviation values for the contrast levels of atoms from both D and D_sc, for the same data set as in Sec. 3.1. We notice a distinct pattern in the contrast levels of the scalable "SC K-SVD" dictionary across all results, analogous to the Activity values shown in Sec. 4.1. Specifically, for the first K/2 atoms of each of the presented scalable dictionaries D_sc the contrast is considerably lower, with slight fluctuations (Fig. 18, notation "SC K-SVD") and a highest contrast variation of 0.06. The remaining atoms reach quite high contrast levels with a steep jump up to around 0.13, creating a distinct threshold in the contrast variation distributed over all four D_sc dictionaries. This clear borderline, which splits the atoms into two groups, i.e., those with low and those with high contrast variation, is the final processing effect of the enforced semi-random initialization and regularization. In the case of the conventional K-SVD, i.e., "NSC K-SVD", no such trend exists.
Thus, this indicates that the proposed design complies with the characteristics of the HVS perception mechanism [29,30], since it is more efficient in extracting contrast information from the training images. This is significant because a proper visual understanding of the scene at hand [31,32,34] depends on how well contrast variations are captured by the image representational elements, i.e., the atoms.
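The per-atom contrast analysis can be mimicked on synthetic atoms. The dictionary below is constructed, not learned, and its numeric levels are purely illustrative of the low/high split in per-atom std:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 64, 64  # hypothetical dictionary size: K atoms of n pixels each

# Hypothetical scalable dictionary: first K/2 atoms nearly flat (low
# contrast), last K/2 atoms with strong pixel oscillation (high contrast).
smooth_atoms = 0.125 + 0.02 * rng.standard_normal((K // 2, n))
texture_atoms = 0.125 * np.sign(rng.standard_normal((K // 2, n)))
D_sc = np.vstack([smooth_atoms, texture_atoms])

contrast = D_sc.std(axis=1)  # std over each atom's pixel intensities [37]
low, high = contrast[:K // 2].mean(), contrast[K // 2:].mean()
```

Plotting `contrast` over the atom index would reproduce the step-like borderline described above: a low plateau for the first K/2 atoms followed by a jump for the remaining ones.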

Noise distortion of the smooth and texture image patches
We posed the assumption in Sec. 2.5 that noise affects smooth image components more than texture ones. Specifically, oscillatory components of the scene, i.e., texture, exhibit regularity in terms of frequency content that repeats to some extent over the image. Thus, noise, being a random signal without any consistency in its change, should have a higher impact on image parts that do not exhibit periodic spatial variations, i.e., smooth ones. We show this by estimating changes in std before and after noise is added to specific image blocks of smooth and texture areas. Several of these blocks are depicted in Fig. 19, where the first row shows smooth and the second row texture image blocks of size 30x30. Given the five noise levels as in Sec. 3.3, Tab. 9 shows how the averaged std of the texture and smooth image patches varies before and after noise is added; for the smooth group we observe a comparatively larger variation.
Figure 19. Visual overview of the image patches of size 30x30 used for the std noise impact analysis. The first row shows smooth image content, while the second depicts texture.
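This assumption can be checked numerically on synthetic 30x30 blocks (stand-ins for the patches of Fig. 19, not the paper's actual data): a flat block's std is dominated by the added noise, while a high-contrast block's std barely moves.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 40.0  # one of the noise standard deviations tested in Sec. 3.3

# Hypothetical blocks: a flat (smooth) block and a checkerboard (texture)
# block with strong intensity variation, both on the [0, 255] scale.
smooth = np.full((30, 30), 128.0)
texture = 128.0 + 100.0 * (((np.indices((30, 30)).sum(axis=0) % 2) * 2) - 1)

def std_increase(block):
    """Absolute change in std after additive Gaussian noise."""
    noisy = block + sigma * rng.standard_normal(block.shape)
    return noisy.std() - block.std()

smooth_change = std_increase(smooth)    # roughly sigma, since block std is 0
texture_change = std_increase(texture)  # sqrt(100^2 + sigma^2) - 100, much smaller
```

Since independent variances add, the textured block's std grows from 100 to only about sqrt(100² + 40²) ≈ 108, whereas the flat block's std jumps from 0 to about 40, which is the behavior the Tab. 9 estimates capture.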

Conclusion
This work introduces a design for learning a dictionary for scalable image recovery by enforcing semi-random dictionary initialization and regularization of the K-SVD atom update stage. To the best of our knowledge, this problem, i.e., creating learned sparse representations for the purpose of scalable image restoration, has not been addressed before. The proposed technique is evaluated on two different video test sequences, "Stephan" and "Tempete", and on several conventional images. We demonstrate its practical utility for dynamic data changing over time given a single trained dictionary D_sc. Three potential application schemes are tested: scalable video recovery, scalable denoising and scalable CS. The proposed approach significantly outperforms, or is highly comparable with, the classical K-SVD setting for all the aforementioned purposes: (i) gains of up to 11.32 dB for scalable recovery; (ii) comparable denoising results with computational demands decreased by up to 7.3 times; (iii) gains of around 3.03 dB at all subsampling levels. Future work will address the joint design and optimization of the scalable CS sampling matrix and the dictionary for scalable image recovery, aiming to considerably decrease the current coherence level (i.e., 7.7). Overall, the proposed method can be successfully used for real-time scalable image/video display applications, where video streams tailored to the needs of a diverse user pool operating heterogeneous display equipment are required.