A New Technique for Archiving Manuscript Documents

In the context of safeguarding old documents affected by specific conservation problems (damp patches, folds, tears, shading, etc.), we are interested in evaluating compression methods from both families: reversible (lossless) and irreversible (lossy). In particular, we consider widely used methods such as RLE, LZ and LZW compression, Huffman coding, Deflate, CCITT Group 3 and 4, JPEG, JPEG 2000, PNG, and fractal compression. The goal of this work is to develop a semi-automatic tool able to quantify the quality of a compression of old documents, and thus to answer objectively the question of which compression method to choose for a collection of images. This collection must be representative, in terms of number of images and typology of images, of a single context, here old documents. In this paper, we propose a semi-automatic method that first performs a supervised classification of the original image into K groups with distinct characteristics, where K is chosen beforehand by the user (e.g. writing, background, tear). In a second step, we compute the pixel movements (intra-class and inter-class). The last step consists in computing the degradation measure.


Introduction
Every treatment applied to digital images must be validated by a quality-assessment process. In the context of image compression, and especially the compression of images of ancient manuscripts, this process is based on measuring the quality of the reconstructed image. For applications in which images end up degraded, the only truly correct way to measure the visual quality of an image is subjective assessment. In practice this assessment relies on subjective criteria such as the Mean Opinion Score (MOS) [1]. This measure remains cumbersome and very costly. Moreover, such an assessment cannot in any case be used inside an image-coding scheme [2]. To avoid this problem, the quality of the compressed images must be assessed by measures such as the MSE (Mean Squared Error) or the PSNR (Peak Signal-to-Noise Ratio). To this day the PSNR is considered the most widely used criterion for evaluating quality in image processing; however, the PSNR is a purely quantitative measure and sometimes needs to be complemented by a subjective assessment of the degradation.
Beyond quantitative measures, objective methods for assessing the perceptual quality of an image have traditionally attempted to measure the visibility of the errors between a degraded image and a reference image using a set of properties inspired by the human visual system (HVS) [3][4][5]. Our contribution consists in partitioning the elements of the image into groups and associating a strong weighting with the relevant groups, and a lower weighting with the groups that are less relevant according to the HVS.
Our method takes advantage of both types of assessment (objective and subjective): we developed a semi-automatic tool capable of quantifying the quality of a compression applied to archive documents. It thus makes it possible to answer objectively the question of which image-compression method to choose. The chosen collection must be representative, in terms of number of images and typology of images, of a single context, here old documents.

Process of Classification
Automatic classification techniques are intended to produce groups of objects by an algorithmic procedure. The objects to be classified are in general individuals that we wish to divide into disjoint homogeneous classes. The methods differ widely, but the overall procedure rests on three main axes:
- identify the criteria for selecting the individuals;
- compute the similarities between the individuals;
- use heuristics to classify the individuals.
Moreover, we distinguish two types of classification:
- Supervised classification: the classes and their properties are known a priori. A semantic is generally associated with each class.
- Unsupervised classification: the semantics associated with the classes are harder to determine. The classes are founded on the structure of the elements. These methods are called clustering or grouping methods.
Here, we know the characteristics of the image in advance, so supervised classification is the more suitable choice.

Figure 1. Example of image
In the example of figure 1, we observe three classes in the image: writing, background, tear.

Supervised Classification
Our algorithm uses a supervised classification method, more precisely the minimum-distance method. With this method, the decision rule for assigning a pixel to a class is the minimum distance between the value of the pixel and the center of the cloud of points representing the class. Pixels are thus assigned to the class whose mean is closest. This proximity is usually measured by the Euclidean distance.
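The minimum-distance rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class means (writing, background, tear) and the sample pixel values are invented for the example.

```python
import numpy as np

def minimum_distance_classify(pixels, class_means):
    """Assign each RGB pixel to the class whose mean is closest
    (Euclidean distance). `pixels` is (N, 3), `class_means` is (K, 3)."""
    # Distance from every pixel to every class mean: shape (N, K)
    dists = np.linalg.norm(pixels[:, None, :] - class_means[None, :, :], axis=2)
    return np.argmin(dists, axis=1)  # index of the nearest class mean

# Illustrative class means: writing (dark), background (light), tear (brownish)
means = np.array([[30, 25, 20], [230, 225, 210], [150, 110, 80]], dtype=float)
labels = minimum_distance_classify(
    np.array([[28.0, 24.0, 22.0],    # a dark pixel -> writing
              [200.0, 210.0, 205.0]]),  # a light pixel -> background
    means)
```

In this sketch the class means would in practice be estimated from user-labelled regions of the original image, consistent with the supervised setting.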

Movements Measurement
After each compression we repeat the same process (supervised classification) and compute the pixel movements. We distinguish two types of movement.
Intra-class: the pixel moves within its own class.
Inter-class: the pixel moves to another class. In our work, pixel movements play a central role in measuring compression quality, since our measure is derived from the counts of these movements.
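The two movement counts can be sketched as below. The definitions are a plain reading of the text (an intra-class movement is a pixel whose value changed but whose class label did not; an inter-class movement is a pixel whose label changed); the sample arrays are invented for illustration.

```python
import numpy as np

def count_movements(labels_before, labels_after, pixels_before, pixels_after):
    """Count intra-class and inter-class pixel movements after compression."""
    changed = np.any(pixels_before != pixels_after, axis=1)  # value changed at all
    inter = labels_before != labels_after                    # class label changed
    intra = changed & ~inter                                 # value moved, same class
    return int(intra.sum()), int(inter.sum())

# Tiny illustration: pixel 0 changes value but keeps its class (intra),
# pixel 1 is untouched, pixel 2 changes enough to switch class (inter).
pixels_before = np.array([[10, 10, 10], [200, 200, 200], [100, 100, 100]])
pixels_after  = np.array([[12, 10, 10], [200, 200, 200], [180, 180, 180]])
intra, inter = count_movements(np.array([0, 1, 2]), np.array([0, 1, 1]),
                               pixels_before, pixels_after)
```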

Measure of Degradation
After the movement-measurement stage, one last stage remains: computing our measure. Our measure is given by formula (2), which combines the intra-class and inter-class movement counts, weighted by the importance assigned to each class.
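Formula (2) itself is not reproduced in the source text. A plausible form, consistent with the surrounding description (movement counts per class and appraiser-assigned class weights, with inter-class movements weighing more than intra-class ones), would be the following; the symbols $w_k$, $\alpha$ and the normalisation by $N$ are assumptions, not the authors' published formula:

```latex
% Hypothetical reconstruction of the degradation measure (2):
% m_k^{inter}, m_k^{intra}: movement counts for class k,
% w_k: appraiser-assigned weight of class k, N: number of pixels,
% 0 < \alpha < 1 down-weights the milder intra-class movements.
D \;=\; \frac{1}{N} \sum_{k=1}^{K} w_k \left( m_k^{\mathrm{inter}} + \alpha\, m_k^{\mathrm{intra}} \right)
```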

General schema
We carry out our grouping on the basis of the color coding of the image, in this first version in RGB mode (Red, Green, Blue components). The grouping problem can then be seen as separating a set of pixels into a certain number of groups according to their RGB components. Following the schema given in figure 4, we recall the architecture we propose for a tool evaluating the compression quality of archival manuscript documents:
a) Starting from a corpus of images, a decision maker must answer the question: which compression method should be chosen? Our system, called "semi-automatic", makes it possible to answer this question with both objective evaluations (movement measurements) and subjective ones (ratings by the decision maker). Of course, these evaluations are performed on a subset of the corpus, selected to be representative of the typology of the documents treated.
b) Each image of the subset first passes through the phase called automatic, then through the phase called semi-automatic, the latter being so named because it involves a human appraiser.
- Automatic phase: this phase consists in grouping the pixels distinctively, i.e. the pixels belonging to the writing form one group, those of the background of the image another group, and likewise for stains, shading, and other conservation marks. This operation, called classification, is performed by the process explained in section 2. It remains to determine the number of groups to build (three in our case).
- Semi-automatic phase: once the various groups have been well distinguished in the preceding phase, an appraiser gives more weight to the groups that matter for the legibility of the document, such as the writing and the background; compression should not degrade these groups. Conversely, the groups corresponding to noise are weighted lightly: degrading them under compression is likely of no consequence.
c) It then remains to measure the degradation after choosing a compression method. Several methods should be tested to find the best compromise between the quality factor and the size factor after compression.
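The three steps above (classify, compress, re-classify and score the weighted movements) can be sketched end to end. This is only an illustration of the architecture: the `quantize` stand-in for a real codec, the class means, and the weight values are all invented, and the score shown here counts only weighted inter-class movements.

```python
import numpy as np

def classify(pixels, means):
    # Minimum-distance (nearest class mean) classification in RGB space
    d = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def evaluate(pixels, means, weights, compress):
    """One pass of the proposed pipeline: classify the original image,
    apply a compression round trip, re-classify, then score the
    inter-class movements weighted by class importance (lower is better)."""
    labels = classify(pixels, means)
    rebuilt = compress(pixels)
    labels2 = classify(rebuilt, means)
    moved = labels != labels2
    # Each inter-class movement costs the weight of its source class
    return float(weights[labels[moved]].sum() / len(pixels))

# Stand-in "compression": coarse quantization of RGB values (illustrative only)
quantize = lambda p: (p // 32) * 32
means = np.array([[30, 25, 20], [230, 225, 210], [150, 110, 80]], dtype=float)
weights = np.array([1.0, 0.8, 0.2])  # writing > background > tear
img = np.random.default_rng(0).integers(0, 256, size=(1000, 3)).astype(float)
score = evaluate(img, means, weights, quantize)
```

Swapping `quantize` for real codecs (GIF, PNG, JPEG, JPEG 2000 round trips) and comparing the resulting scores is the intended use of the tool.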

Movements Measurement of Pixels for Several Methods of Compression
We assembled 6 documents to run 6 sets of tests on the proposed model. The tested compression methods are: GIF, PNG-8, PNG-24, JPEG, and JPEG 2000. Three tables of numerical tests (intra-class, inter-class, and degradation measure) should indicate which compression method to choose. In the tables (individuals / variables) below, the individuals represent the test images and the variables represent the compression methods; the values give the number of pixels that moved after compression. For each compression method, it is important to observe the values averaged over the sample of images, as well as the "min" and "max" values, which delimit a fluctuation interval and give a concrete idea of how sensitive the compression method is to the nature of the image.
The results first indicate a correlation between the size of the original and the degradation: the more voluminous the image, the higher the number of movements.
We also notice that the number of movements varies according to the compression method, which indicates that our model is genuinely sensitive to the slightest degradation affecting the image after compression. The following table (table 2) shows the number of inter-class movements. Table 2 clearly shows the marked difference between methods such as JPEG and JPEG 2000 and the others. The JPEG model is known to be irreversible, but its loss is planned in a sensible way, taking into account the capacities of the human eye. Under our model it is quite clear that the JPEG and JPEG 2000 models produce fewer inter-class movements than the rest of the methods, which is already a good comparative measure.
Table 3 shows the stability of the tests, which allows us to note that, at the same compression ratio, two methods (PNG, TIFF) remain close to each other and less effective. Interest therefore goes to the JPEG and JPEG 2000 methods, which give good results in these tests. The economic aspect is also decisive in deciding between these two formats.

Variation of the Degradation According To Compression Ratio
The following table (table 4) shows the variation of the average degradation over 5 selected compression levels (1:12, 1:24, 1:30, 1:40, 1:50). According to our model, the table clearly shows that the degradation grows very quickly as the compression ratio increases, in particular for the GIF, PNG-8, and PNG-24 methods; to a lesser extent, the same holds for JPEG. With JPEG 2000, even after a 1:50 compression the degradation remains tiny. Clearly, the effectiveness of this method has already been proven in the literature and is not the subject of this section; it does, however, confirm the relevance of the evaluation model we have set up.

Conclusion
We have presented a new method for measuring compression quality, based on supervised classification and movement measurement.
The results of evaluating this method first show the consistency of the evaluation model with the results observed with other methods. Moreover, the numerical results produced by the model reflect the conclusions reported in the literature for each studied method. Some control experiments highlighted the major impact of certain post-processing steps on the visibility of compression artefacts. The main post-processing whose use must be supervised in combination with compression is contour enhancement (for rendering the writing).
These evaluations are positive regarding the applicability of the compression-quality measure. They deserve to be complemented by comparative studies against purely objective or purely subjective methods on benchmarks from the literature.