Objectively defining the image parameters for facial detection, recognition, and identification in digital imagery


Axis Communications has published guidelines on the minimum operational requirements for a camera to allow a human observer to detect, recognise, and identify human subjects. They cite the Swedish National Laboratory of Forensic Science and IEC International Standard IEC 62676-4 in support of their recommendations; however, I cannot find these specific texts.

Are there any peer-reviewed studies that examine the effect of the pixel density of an image on a person's ability to detect, recognise, and identify human subjects (specifically faces)?

I have searched various sources, but only seem to find results that address image resolution and face-detection software efficacy.

Results of pedestrian detection in images

To see our pedestrian detection script in action, run it from the command line.

Below I have provided a sample of results from the detection script:

Figure 2: The first result of our pedestrian detection script.

Here we have detected a single person standing next to a police car.

Figure 3: Detecting a single person in the foreground and another person in the background.

In the above example we can see a man detected in the foreground of the image, while a woman pushing a baby stroller is detected in the background.

Figure 4: An example of why applying non-maxima suppression is important.

The above image serves as an example of why applying non-maxima suppression is important. The detectMultiScale function falsely detected two bounding boxes (along with the correct bounding box), both overlapping the true person in the image. By applying non-maxima suppression we were able to suppress the extraneous bounding boxes, leaving us with the true detection.
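Non-maxima suppression can be implemented in a few lines of NumPy. The sketch below follows the common greedy approach of keeping the box sorted last and discarding boxes whose overlap with it exceeds `overlapThresh`; the function name mirrors the one referenced in the text, but this is a generic sketch, not necessarily the exact code used for these figures:

```python
# Greedy non-maxima suppression: boxes are (x1, y1, x2, y2) corners.
import numpy as np

def non_maxima_suppression(boxes, overlapThresh=0.3):
    if len(boxes) == 0:
        return np.empty((0, 4), dtype=int)
    boxes = boxes.astype(float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    idxs = np.argsort(y2)            # process boxes by bottom coordinate
    pick = []
    while len(idxs) > 0:
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)
        # intersection of the picked box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[idxs[:last]])
        yy1 = np.maximum(y1[i], y1[idxs[:last]])
        xx2 = np.minimum(x2[i], x2[idxs[:last]])
        yy2 = np.minimum(y2[i], y2[idxs[:last]])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / area[idxs[:last]]
        # drop the picked box and any box overlapping it too much
        idxs = np.delete(
            idxs, np.concatenate(([last], np.where(overlap > overlapThresh)[0])))
    return boxes[pick].astype(int)
```

A small `overlapThresh` suppresses aggressively; a larger value keeps partially overlapping boxes, which matters when two real people overlap in the frame (as in Figure 6).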

Figure 5: A second example demonstrating non-maxima suppression in action.

Again, we see that multiple false bounding boxes are detected, but by applying NMS we can remove them, leaving us with the true detection in the image.

Figure 6: Detecting pedestrians in a shopping mall.

Here we are detecting pedestrians in a shopping mall. Notice that two people are walking away from the camera while another is walking towards the camera. In both cases, our HOG method is able to detect the people. The larger overlapThresh in the non_maxima_suppression function ensures that the bounding boxes are not suppressed, even though they do partially overlap.

Figure 7: Detecting people in a blurred image.

I was particularly surprised by the results of the above image. Normally the HOG descriptor does not perform well in the presence of motion blur, yet we are still able to detect the pedestrians in this image.

Figure 8: Detecting pedestrians outdoors, walking along the street.

This is another example of multiple, overlapping bounding boxes, but due to the larger overlapThresh they are not suppressed, leaving us with the correct person detections.

Figure 9: Detecting four members of a family.

The above image shows the versatility of our HOG + SVM pedestrian detector. We are not only able to detect the adult male, but also the three small children as well. (Note that the detector is not able to find the other child hiding behind his [presumed to be] father).

Figure 10: Detecting a depiction of pedestrians.

I include this image last simply because I find it amusing. We are clearly viewing a road sign, likely used to indicate a pedestrian crossing. However, our HOG + SVM detector marks the two people in this image as positive classifications!


Face Recognition and Drunk Classification Using Infrared Face Images.

Thermoregulation is a process in which a biological organism modifies its internal temperature within certain limits; it is commanded by the hypothalamus. For humans, the temperature in a normal state is approximately 36.7 °C (internal) and 33.5 °C (skin). If the internal temperature rises above 36.7 °C, thermoregulation triggers two heat-loss processes, sweating and vasodilation. If the temperature falls below 36.7 °C, the thermoregulatory system triggers thermogenesis processes (vasoconstriction and piloerection) to increase the temperature.

Some research studies have shown that the thermoregulatory system can be altered depending on mood or the consumption of certain foods [1]. In a series of publications, it is concluded that alcohol alters the correct operation of the system responsible for thermoregulation [2, 3], generating an induced vasodilation in the skin, which increases heat loss through convection, leading to a decrease in body temperature that is directly related to the amount of alcohol consumed.

The identification of drunk people has its basis in biology, medicine, and toxicology. Alcohol causes motor and psychic disturbances, resulting in abnormal behaviour on a biological level, such as dilation of blood vessels [2-6] and increased blood pressure. In the human face, there is a temperature increase in areas of high capillary density, such as around the nose, forehead, and eyes.

Despite the large number of applications in machine learning, such as face recognition, facial expressions, and personal identification, computer systems applied to the classification of drunk people have not been widely studied. The most notable work is that of researchers at the University of Patras, Greece [7-11], who have tried to distinguish a sober from a drunk person based on variations in facial features. In short, [7, 10] and [11] show that the frontal region and nose are the most appropriate regions for acquiring information to classify people as drunk or sober, using a neural network to perform the classification task. In [9], it is concluded that alcohol causes an increase in the temperature of the eyes, which could be useful for classification. In [8], the extraction of the vascular network is proposed based on the works of Buddharaju et al. [3-6]: by analysing the total area in pixels of the vascular network of a sober subject compared to a drunk subject, it may be possible to obtain an indicator (feature) of whether the subject has consumed alcohol. In [10], the aim is to demonstrate that it is possible to differentiate between sober and drunk persons using the intensities of pixels located in certain regions of the face (forehead, nose, and mouth). A space of separable features can be generated using these intensities; however, that study used only a small number of subjects (8 individuals) who met certain characteristics (similar weight and height), and as such, it is not possible to ensure generalisation of the classifier.

Again in [10], a method is proposed to find regions with higher thermal variation in the face by comparing intensities of a person while sober and while in a drunken state. It is concluded that the forehead region shows an increase in temperature relative to the region of the nose. The feature extraction used in [10] is mainly based on analysing a number of pixels around interesting areas of the face, to which methods to reduce the dimensionality of the features are applied, such as linear discriminant analysis (LDA) or principal component analysis (PCA). The space generated by LDA, called "DrunkSpace," is used to distinguish between the states of sober and drunk.

In this context, the purpose of this study is to generate a classification system, based on the "DrunkSpace" proposed in [10], to identify whether an individual is drunk, using computer vision and pattern recognition approaches. The primary objective is to extract features (patterns) from thermal face images of a drunk person to build a Bayesian classifier based on a Gaussian mixture model (GMM) [12, 13]. The importance of thermal imaging is that it can be used to obtain patterns based on the thermal information of the face, which is linked to the thermoregulation processes of the human face and the amount of alcohol consumed. In addition, since the few available databases contain only a reduced number of thermal images of drunk people, we propose to create a publicly available drunk thermal database for studying drunk person classification.

This section describes in detail how the Pontificia Universidad Catolica de Valparaiso-Drunk Thermal Face database (PUCV-DTF) was acquired.

2.1. Recruitment. An open call was made through posters at the school of Electrical Engineering at the Pontificia Universidad Catolica de Valparaiso. People who attended the call were informed of the research protocol and signed an informed consent form which was previously approved by the Ethics Committee of the university.

2.2. Participants. 46 individuals, 40 men and 6 women, were selected. The average age of individuals was 24 years with a standard deviation of approximately 3 years (the minimum age was 18 years, and the maximum was 29 years), being in good health without problems related to alcohol consumption. This analysis was performed by a screening test designed to exclude people who consume alcohol regularly.

2.3. Procedure. Subjects attended the robotics lab, where they rested for 30 minutes to acclimatise to the temperature conditions of the laboratory. Each subject then consumed a 355 mL can of beer (5.5% alcohol by volume), waited another 30 minutes in the laboratory, and repeated the procedure until four beers had been consumed. See Table 1 for the whole capturing procedure.

Upon completion of the procedure, subjects with about 0.8 g/L of blood alcohol (drunk) were required to remain in place until the alcohol level decreased to under 0.2 g/L, as verified with a breath test. It should be noted that a paramedic was present in the lab during the entire experiment to verify the status of the individuals.

2.4. Thermal Imaging. The thermal camera used was a FLIR TAU 2 [14] with a resolution of 640 x 480 pixels, a frame rate of 30 frames per second, a thermal sensitivity of 50 mK, and a spectral range between 7.5 and 13.5 µm. The database includes 46 individuals with five subsets per subject, for a total of 250 images per subject and 50 images per subset. The five subsets are "Sober," "1 Beer," "2 Beers," "3 Beers," and "4 Beers," corresponding to the capturing procedure summarised in Table 1. Following acquisition, a preprocessing step was performed in which all images were cropped and aligned according to the coordinates of the eyes, which were marked manually, giving a final resolution of 81 x 150 pixels. The thermal images were normalised using (1), which applies a linear mapping of the pixel intensity values into the range $[N_{\min}, N_{\max}]$:

$$I_{\mathrm{norm}}(i,j) = \frac{I(i,j) - I_{\min}}{I_{\max} - I_{\min}}\,(N_{\max} - N_{\min}) + N_{\min}, \quad \forall (i,j) \in \Omega, \qquad (1)$$

where $I_{\min} = \min_{(i,j)\in\Omega} I(i,j)$ and $I_{\max} = \max_{(i,j)\in\Omega} I(i,j)$ are the minimum and maximum intensity values in the image. In the experiments, the value range $[0, 255]$ is used. An example of a subject with the 5 subsets is shown in Figure 1. In order to highlight the information obtained from the alcohol consumption, colour has been applied to the images from the subsets. The figure also shows the forehead temperature (FT), the nose temperature (NT), and the alcohol tester (AT) measures; in parentheses, the average values of FT, NT, and AT for each of the classes are shown. Note that the temperature of the thermal face image varies due to alcohol consumption; in particular, the nasal contrast changes as the subject consumes beers. This effect could be attributable to the thermoregulation process, to the breathing effect produced during the exhalation and inhalation phases, or to a combination of both.
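Equation (1) is a straightforward linear rescaling of the pixel intensities; a NumPy sketch (the function name is ours):

```python
import numpy as np

def normalize_intensity(img, n_min=0.0, n_max=255.0):
    # Linear mapping of pixel intensities into [n_min, n_max], as in (1):
    # the minimum pixel maps to n_min and the maximum pixel to n_max.
    i_min, i_max = img.min(), img.max()
    return (img - i_min) / (i_max - i_min) * (n_max - n_min) + n_min
```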

In terms of the alcohol concentration obtained in each breath test measurement, the range of values for each class can be observed in Table 2. Note that the alcohol concentration values show that the classes overlap, owing to the variability of the subjects captured (different weight, height, age, sex, etc.). However, in our experiments, we want to classify whether the subject is sober or has consumed any beer (classes 1 Beer, 2 Beers, 3 Beers, or 4 Beers), regardless of the individual's alcohol concentration. Detailed temperature and alcohol test measurements are given in a document attached to the database link.

3. Feature Extraction and Classification

The feature extraction process consists of selecting local regions of a thermal face image and then extracting the information using dimensionality reduction methods. As indicated above, the process carried out in this study is related to the generation of a "DrunkSpace," as presented in [10]. In [10], the information from different regions of the face is extracted from a grid of 20 points. The problem with the grid proposed in [10] is that there is no biological justification for the location of the feature points. For this reason, and inspired by [15], a different grid of 22 points is chosen. In [15], the aim is to generate a thermographic map of the human face, for which a number of sensors are located on the face and neck of each subject. These 22 points are selected at positions where capillaries and veins cross the face, as seen in any of the sample faces shown in Figure 2. Once the grid had been defined, the information was extracted from the thermal face images. As the information at a selected grid pixel may be affected by noise, we consider a 3 x 3 pixel neighbourhood around every point of the grid and compute the average intensity for each of the 22 regions of the face. Since the images in the database are aligned, the same grid was used for all subjects as a unique mask.
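The neighbourhood averaging step can be sketched as follows. The 22 grid coordinates are marked on the aligned images and are not reproduced here, so the `points` argument below is hypothetical:

```python
import numpy as np

def extract_grid_features(img, points):
    # For each (row, col) grid point, average the 3x3 neighbourhood
    # around it; the result is one feature per grid point.
    feats = []
    for (r, c) in points:
        patch = img[r - 1:r + 2, c - 1:c + 2]
        feats.append(patch.mean())
    return np.array(feats)
```

Applying this with the 22-point grid to an 81 x 150 aligned image yields the 22-dimensional feature vector described below.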

After having extracted the features of each thermal image, a feature vector of 22 dimensions was generated. Therefore, for a subject, there are 50 feature vectors for each class ("Sober," "1 Beer," "2 Beers," "3 Beers," and "4 Beers"). It is not recommended to use the 22-dimensional vector for classification due to its high dimensionality, the complexity of generating a hyperplane that separates all of these dimensions, and the high computational expense it would entail. Thus, to reduce the dimensionality, the Fisher linear discriminant analysis was used.

3.1. Fisher Linear Discriminant Analysis (FLD). First, assume a classification problem involving two different classes ($w_1$ and $w_2$), with $n_i$ $m$-dimensional samples per class. Thus, there is a set of $n$ samples $x = \{x^1, x^2, x^3, \ldots, x^n\}$, with $n_1$ samples corresponding to class $w_1$ and $n_2$ to $w_2$. The FLD method aims to obtain a transformation from the $x$-space to the $y$-space through the linear projection of all the samples $x$ onto a line, using the weights $w$. However, the line selected must maximise the separability of the projected samples between the different classes. The linear combination that projects the samples from the $x$-space to the $y$-space is

$$y = w^{T} x. \qquad (3)$$

To find the adequate projection line, we must define a separation measure between the projected data and then maximise it. The solution proposed by Fisher [16] is to maximise a function that represents the difference between the mean values of each class, normalised by a measure of the variability inherent to each class. The objective function $J(w)$ to maximise can be represented as a function of two scatter matrices, $S_B$ and $S_W$, as shown in

$$J(w) = \frac{w^{T} S_B\, w}{w^{T} S_W\, w}. \qquad (4)$$

$S_W$ (the within-class scatter matrix) can be represented as a function of the original samples ($x$-space) or of the projected samples, and $S_B$ (the between-class scatter matrix) can likewise be defined in terms of the original or projected samples. For the two-class case, the scatter matrices over the original samples are

$$S_W = \sum_{i=1}^{2} \sum_{x \in w_i} (x - \mu_i)(x - \mu_i)^{T}, \qquad (5)$$

$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T}, \qquad (6)$$

and the corresponding scatters of the projected samples are

$$\tilde{S}_W = w^{T} S_W\, w, \quad \tilde{S}_B = w^{T} S_B\, w, \qquad (7)$$

where $\mu_i$ is the mean of the original samples of the $i$th class and $w$ is the vector of projection weights. Finally, the optimal projection is obtained using the weights $w^*$:

$$w^* = \arg\max_{w} \frac{w^{T} S_B\, w}{w^{T} S_W\, w} = S_W^{-1} (\mu_1 - \mu_2). \qquad (8)$$

This optimal solution $w^*$ corresponds to the eigenvector of $S_W^{-1} S_B$ associated with the largest eigenvalue. Using FLD in our problem reduces the 22-dimensional vectors to two dimensions, while maximising the distance between the class means and minimising the variance within each class. This is illustrated in Figure 3. Figure 3(a) shows a plane onto which two randomly chosen features from a subject are projected; it is clear that not all classes are separable, since many of the features overlap between classes. However, when the FLD method is applied, the resulting DrunkSpace (see Figure 3(b)) is completely separable and it is possible to identify projected clusters for each class.
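For the two-class case, the closed form in (8) can be computed directly; a NumPy sketch (the function name is ours — for the multiclass reduction to two dimensions used in this study, one would instead take the two leading eigenvectors of $S_W^{-1} S_B$):

```python
import numpy as np

def fisher_direction(X1, X2):
    # Two-class Fisher discriminant: w* = S_W^{-1} (mu_1 - mu_2),
    # where S_W is the within-class scatter summed over both classes.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    return np.linalg.solve(S1 + S2, mu1 - mu2)
```

Projecting samples onto the returned direction (`X @ w`) separates the two class means while keeping within-class spread small.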

3.2. Gaussian Mixture Model (GMM). Once the dimensionality reduction (with FLD) of the data had been carried out, a Gaussian mixture model (GMM) was selected to perform the classification. A GMM is a probability distribution whose density function is a linear combination of a finite number of Gaussian distributions. Each of these Gaussian distributions represents a different class. In our case, four distributions were used as training classes (Sober, 1 Beer, 2 Beers, and 3 Beers) to generate the GMM, and the remaining class (4 Beers) was used as the test set. The classification of the test data is performed by analysing the likelihood that the data belong to each distribution of the GMM; the test data are assigned to the distribution (class) to which they most likely belong.

The theory on which this approach is based is briefly described below. Let $Y$ be a $D$-dimensional real-valued random variable with a probability density function (pdf) written as a linear combination of elementary pdfs (see (9)). If the distributions that compose the mixture are Gaussian, the pdf is known as a Gaussian mixture:

$$p(y \mid \theta) = \sum_{i=1}^{I} \alpha_i\, \mathcal{N}(y \mid C = i, \beta_i), \qquad (9)$$

where $I$ is the number of elementary components $C$ of the mixture and $\theta = \{\alpha_1, \ldots, \alpha_I, \beta_1, \ldots, \beta_I\}$ is the set of parameters, with $\beta_i = \{\mu_i, \Sigma_i\}$ the parameters of each component distribution and $\alpha_i$ the weight of each component. The components of the mixture are Gaussian densities with mean $\mu_i$ and covariance $\Sigma_i$:

$$\mathcal{N}(y \mid \mu_i, \Sigma_i) = (2\pi)^{-D/2}\, \lvert \Sigma_i \rvert^{-1/2} \exp\!\left(-\tfrac{1}{2}(y - \mu_i)^{T} \Sigma_i^{-1} (y - \mu_i)\right). \qquad (10)$$

The solution to the classification problem is explained below. First, consider a set of samples $y = \{y_1, \ldots, y_J\}$, where each $y_j \in \mathbb{R}^{D}$ is one of the $J$ independent outcomes of the random variable $Y$; then, for independent and identically distributed observations, the likelihood of $y$ is

$$L(\theta; y) = \prod_{j=1}^{J} p(y_j \mid \theta). \qquad (11)$$

Now, the likelihood of $y$ should be maximised. Using some algebraic transformations (see [17]), the likelihood can be represented as an objective function to be maximised, applied to the Gaussian mixture model:

$$\hat{\theta} = \arg\max_{\theta} \sum_{j=1}^{J} \log \sum_{i=1}^{I} \alpha_i\, \mathcal{N}(y_j \mid \mu_i, \Sigma_i). \qquad (12)$$

This is a hard optimisation problem that is commonly solved using the expectation maximisation (EM) algorithm [18]; more information about Gaussian mixture models is available in [12, 13]. In the present study, the GMM procedure is implemented in Matlab R2015a and the EM algorithm is used to perform the optimisation. Each Gaussian distribution of the mixture is defined by the cluster formed by the projected (FLD) samples of one of the training classes ("Sober," "1 Beer," "2 Beers," and "3 Beers"), and the test data to be classified are the projected samples of the class "4 Beers."

Figure 4 shows an example of a GMM classifier for the same subject used in Figure 3. The figure shows the DrunkSpace classifier generated with the data from the training subsets "Sober," "1 Beer," "2 Beers," and "3 Beers." The regions shown in Figure 4 were constructed by evaluating the Fisher (DrunkSpace) projections of the training subsets in the GMM classifier; thus, one DrunkSpace classifier was generated. Note that the regions obtained represent the probability of belonging to one of the classes "Sober," "1 Beer," "2 Beers," or "3 Beers." The subset "4 Beers" is used to validate the classification. Figure 4 shows the test set ("4 Beers") in magenta, which is classified primarily in the region belonging to the class "3 Beers." The probabilities obtained for this example are 0% for the "Sober" class, 0% for the "1 Beer" class, 16% for the "2 Beers" class, and 84% for the "3 Beers" class.
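Since each mixture component corresponds to one training class, classifying a projected test point reduces to evaluating each class Gaussian at that point and taking the most likely class. A sketch, assuming each Gaussian is summarised by a mean and covariance (the per-class parameters here are hypothetical placeholders, not fitted EM values):

```python
import numpy as np

def gaussian_logpdf(y, mu, cov):
    # Log of the multivariate Gaussian density in (10).
    d = len(mu)
    diff = y - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

def classify(y, class_params):
    # class_params maps class name -> (mean, covariance).
    # Assign y to the class whose Gaussian gives the highest likelihood.
    scores = {name: gaussian_logpdf(y, mu, cov)
              for name, (mu, cov) in class_params.items()}
    return max(scores, key=scores.get)
```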

4. Experiment and Evaluation of the Proposed System

The proposal for this study consists of two stages: face recognition and identifying a drunk person. In Figure 5, the general outline of the proposed system is shown. The first stage determines the identity of individuals for further classification analysis. Once the faces of the subjects have been recognised in the first stage, the second stage performs feature extraction through the FLD method and then classifies drunkenness using the GMM classifier. Each step is explained in detail below. The database used in the study is the PUCV-DTF, described in Section 2.

4.1. Stage 1. Face recognition is a crucial stage for the complete system of identifying drunk people because it determines the identity of the individuals in the database. The face recognition system used in this study was analysed using two descriptors commonly used in the literature: the LBP descriptor [19] and the WLD descriptor [20]. Both methods use the histogram intersection (HI) distance as a dissimilarity measure.

The experiment consisted of using images from the PUCV-DTF database to generate a gallery set and a test set. The gallery set is composed of the face images of the subjects in a sober state, while the test set is composed of images of the subjects after drinking beer ("1 Beer," "2 Beers," "3 Beers," and "4 Beers"). The results obtained by the recognition system are shown in Table 3. As can be seen, both descriptors achieve 100% recognition rates for each test set, which is due to the absence of temporal variability in the images of the database, as they were acquired within a span of three hours. However, the performance of the face recognition system may decline if the images were acquired over a longer time span [21-23].
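The core of the LBP descriptor and the histogram intersection measure can be sketched as follows. Note that the actual descriptors of [19, 20] are richer (block-wise histograms and, for WLD, differential excitation and orientation); this is only the basic 8-neighbour LBP code plus the intersection of two histograms, shown here as a similarity (higher means more alike):

```python
import numpy as np

def lbp_image(img):
    # Basic LBP: for each interior pixel, compare its 8 neighbours against
    # the centre and pack the comparison bits into an 8-bit code.
    h, w = img.shape
    c = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dr, dc) in enumerate(offsets):
        neigh = img[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        codes += (neigh >= c).astype(np.uint8) * (1 << bit)
    return codes

def hist_intersection(h1, h2):
    # Intersection of two L1-normalised histograms: 1.0 for identical ones.
    return np.minimum(h1, h2).sum()
```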

4.2. Stage 2. Once the identity of the subject is recognised, we proceed to the stage of drunk state identification. To perform this task, the second stage is divided into two substages: feature extraction and classification. The feature extraction substage involves selecting the proper information from the face, which is then used to determine if the person is sober or drunk. The classification substage consists of generating a classification subspace (called the DrunkSpace), onto which the extracted features are projected; based on this projection, the state of the subject is classified.

As mentioned in Section 3, the feature extraction is performed using a grid of 22 points. The FLD method is then used to reduce the dimensionality of the data by projecting the feature vectors from 22 dimensions to 2 dimensions. The classification space (DrunkSpace) is then generated for each subject based on the Gaussian mixture model approach (see Section 3). Each DrunkSpace was generated using Matlab R2015a. The training data used was that of the projections of the feature vectors from the images of the classes for the sober subject and the subject after drinking 1, 2, and 3 beers. The images of the subject after drinking 4 beers were used as the test set.

The classification experiment is carried out as follows: the test feature vectors, corresponding to the set "4 Beers," are projected onto the DrunkSpace. This projection is performed using the same eigenvectors used to generate the DrunkSpace (FLD) of each subject (from the training sets "Sober," "1 Beer," "2 Beers," and "3 Beers"). Once the test data have been projected, the classifier returns the likelihood that the data belong to each training class; the highest likelihood indicates how the projected test data ("4 Beers") are classified. For instance, if the classifier returns likelihoods of 0.1, 0.2, 0.3, and 0.4 for the training classes ("Sober," "1 Beer," "2 Beers," and "3 Beers," resp.), then the projected test data are identified as coming from a drunk subject, with at least 3 beers. The average classification rate of this experiment over all subjects is summarised in Table 4.

From the results shown in Table 4, it is important to note that the system achieved an identification rate of 86.96% for drunk people, if we consider a classification correct when the test data were identified as 1, 2, or 3 beers. Breaking down this 86.96% into the percentages corresponding to each "drunk" class, 9.09% were classified in the "1 Beer" class, 29.39% in the "2 Beers" class, and 48.48% in the "3 Beers" class. It is important to highlight the clear trend within this 86.96%: a progressive increase in the classification rates from the class "1 Beer" to the class "3 Beers." This can be attributed to the fact that the test data being classified correspond to subjects who drank 4 cans of beer, leading to the conclusion that the projected features in the DrunkSpace follow a certain dynamic, moving through certain regions of the DrunkSpace as the subject consumes alcohol. It is because of this that almost half of the test data ("4 Beers") were projected in the region of the DrunkSpace corresponding to the highest level of alcohol consumption.

This article presents a computer vision system that identifies people in a drunk state. The system is composed of two main stages, one for face recognition and the other for drunk classification. The face recognition stage provides the identity of an individual previously stored in a database, while the classification stage identifies the state of the individual, indicating if the subject has consumed alcohol. Inspired by [10], the classification stage uses the Fisher linear discriminant (FLD) method to reduce the dimensionality of the feature vectors and generate a subspace called "DrunkSpace." We then use a Bayesian classifier based on Gaussian mixture models (GMM) to identify whether or not the subject is in a drunk state.

The results obtained in this study show that the proposed system achieves a success rate of approximately 87% in identifying drunk people; that is, the system is capable of identifying whether a person drank at least one can of beer. In addition, the proposed system achieves recognition rates of 100% in the face recognition stage using either the LBP method or the WLD method. It is important that the face recognition stage be robust, because obtaining the correct identity of the test subject allows us to select the correct "DrunkSpace" for that subject, facilitating the work of the GMM-based classifier that determines whether the individual is in a drunk state.

It is important to mention that the good results obtained are mainly because the selected feature locations reflect metabolic changes in the faces of the subjects, together with other biological processes such as thermoregulation, which can be observed with a thermal camera.

From the results obtained, we hope to encourage other researchers to study the classification of people in a drunk state, as it would lead to noninvasive systems that can benefit society. As a future study, we hope to generalise the classification to groups of people, that is, to generate a generic classifier which can identify people in a drunk state independent of weight, sex, or height, rather than individually as was done in this research.

The authors declare that they have no conflicts of interest.

This work was supported in part by FONDECYT under Grant 11130466, Grant 1161584, and Grant 11150476 and in part by Pontificia Universidad Catolica de Valparaiso DI Regular Code under Grant 039.420/2017.


Gabriel Hermosilla, (3) Jose Luis Verdugo, (2) Gonzalo Farias, (1) Esteban Vera, (1) Francisco Pizarro, (1) and Margarita Machuca (3)

(1) Escuela de Ingenieria Electrica, Pontificia Universidad Catolica de Valparaiso, Valparaiso, Chile

(2) Pontificia Universidad Catolica de Valparaiso, Valparaiso, Chile

(3) Universidad de Santiago de Chile, Santiago, Chile

Correspondence should be addressed to Gonzalo Farias [email protected]

Received 25 April 2017; Revised 10 July 2017; Accepted 3 December 2017; Published 30 January 2018

Academic Editor: Paolo Mercorelli

Caption: FIGURE 2: Feature extraction regions used for subjects in the database.

Caption: FIGURE 3: (a) Example of a feature space of two features randomly chosen for a subject. (b) DrunkSpace obtained using FLD for the same subject. The DrunkSpace obtained is completely separable for these two features.

Caption: FIGURE 4: Example of a DrunkSpace classifier from subject 36. The circles represent the test subset "4 Beers." The probabilities obtained for this example are 0% for the "Sober" class, 0% for the "1 Beer" class, 16% for the "2 Beers" class, and 84% for the "3 Beers" class, indicating a high probability of identifying the subject as being in a drunk state.

Face Spoof Attack Recognition Using Discriminative Image Patches

Face recognition systems are now being used in many applications such as border crossings, banks, and mobile payments. The wide-scale deployment of facial recognition systems has attracted intensive attention to the reliability of face biometrics against spoof attacks, where a photo, a video, or a 3D mask of a genuine user's face can be used to gain illegitimate access to facilities or services. Though several face antispoofing or liveness detection methods (which determine at the time of capture whether a face is live or spoof) have been proposed, the issue remains unsolved due to the difficulty of finding discriminative and computationally inexpensive features and methods for spoof attacks. In addition, existing techniques use the whole face image or the complete video for liveness detection. However, certain face regions (or video frames) are often redundant or correspond to clutter in the image (or video), generally leading to low performance. Therefore, we propose seven novel methods to find discriminative image patches, which we define as regions that are salient, instrumental, and class-specific. Four well-known classifiers, namely, support vector machine (SVM), Naive-Bayes, Quadratic Discriminant Analysis (QDA), and Ensemble, are then used to distinguish between genuine and spoof faces using a voting-based scheme. Experimental analysis on two publicly available databases (Idiap REPLAY-ATTACK and CASIA-FASD) shows promising results compared to existing works.

1. Introduction

In recent years, face recognition systems have gained interest due to the face's rich features, which offer a strong biometric cue for recognizing individuals across a wide variety of applications in both law enforcement and civilian settings [1]. In fact, facial recognition systems are already in operation worldwide, including USVISIT, a US Customs and Border Protection (CBP) management system; UIDAI, which provides identity to all persons resident in India; and Microsoft Kinect, which uses face recognition to access the dashboard and automatically sign in to an Xbox Live profile. Similarly, face biometrics is nowadays also used ubiquitously as an alternative to passwords on mobile devices, such as the Android KitKat mobile OS, Lenovo VeriFace, Asus SmartLogon, and Toshiba SmartFace.

Despite the great deal of progress in facial recognition systems, vulnerability to face spoof attacks is largely overlooked [2]. A facial spoof attack is a process in which a fraudulent user subverts or attacks a face recognition system by masquerading as a registered user, thereby gaining illegitimate access and advantages [1, 3–5]. Face spoofing is a major issue for companies selling face biometric-based identity management solutions [6]. For instance, in New York City in 2014, nonwhite robbers disguised themselves as white cops using life-like latex masks and were caught robbing a check-cashing store (see Figure 1, which also shows other recent face spoof attacks).

A recent study [1] suggests that the success rate of face spoof attacks can be up to 70%, even when a state-of-the-art Commercial Off-The-Shelf (COTS) face recognition system is used. We can therefore infer that even COTS face recognition systems are generally not devised to effectively distinguish spoof faces from genuine live faces. In fact, this vulnerability of face recognition systems to face spoofing is now listed in the National Vulnerability Database of the National Institute of Standards and Technology (NIST) in the US.

The typical countermeasure to face spoof attacks is a liveness detection method, which aims at disambiguating live human face samples from spoof artifacts [2, 7]. Several face antispoofing or liveness detection techniques exist [7–15]. However, face spoofing attacks remain a problem due to the difficulty of finding discriminative and computationally inexpensive features and techniques for spoof recognition. Moreover, published methods are limited in scope, since they mainly use the whole face image or the complete video for liveness detection. Yet certain face image regions (or video frames) are often redundant or correspond to clutter in the image (or video), generally leading to low performance.

It is thus essential to develop robust, efficient, and compact face antispoofing (or liveness detection) methods which generalize well to discriminative, class-specific information and imaging conditions. To this aim, in this paper we propose a simple and effective solution based on discriminative image patches. In particular, we propose seven novel, fully automated algorithms to highlight regions of interest in face images. We define these regions (or image patches) to be discriminative (i.e., specific to a particular class: live or spoof), consistent (i.e., reliably appearing in different face images or video frames), salient (i.e., conspicuous regions), and repetitive (i.e., frequently appearing in the image set of a specific class). The basic notion is that interesting patches are those specific to a face image (or video frame) that contain features which help discriminate a given live face image from a spoofed one. Based on this definition, two of the seven proposed image patch selection methods (i.e., MAXDIST and DEND-CLUSTER) do not employ any training or prior learning, while the remaining techniques use simple clustering (i.e., CP and CS), image intensity (i.e., IPI), image quality (i.e., IQA), or a diversity filter (i.e., DF) to obtain discriminative patches. For the final classification, we exploit four well-known classifiers, namely, support vector machine (SVM), Naive-Bayes, Quadratic Discriminant Analysis (QDA), and Ensemble, combined using a voting-based scheme. Experimental analysis on two publicly available databases (Idiap Replay-Attack and CASIA-FASD) shows good results compared to existing works.

The added advantages of the proposed framework are that it (i) is cheap, (ii) has very low complexity, (iii) needs only one face image (i.e., the same face sample acquired for face recognition) to detect whether it is genuine or a spoof attack, (iv) is nonintrusive, (v) is user-friendly, and (vi) is easy to embed in already functional face recognition systems, with no requirement for a new piece of hardware.

The remainder of the paper is organized as follows. Existing work on face liveness detection is discussed in Section 2. The proposed approaches to determine the discriminative image patches and the spoof classification schemes are described in Section 3. Experimental datasets, protocols, and results are presented in Section 4. A conclusion is drawn in Section 5.

2. Related Work

Despite a great deal of advancement in face recognition systems, face spoofing still poses a serious threat. Most existing academic and commercial facial recognition systems may be spoofed by (see Figure 2): (i) a photo of a genuine user; (ii) a video of a genuine user; (iii) a 3D face model (mask) of a genuine user; (iv) a face image reverse-engineered from the template of a genuine user; (v) a sketch of a genuine user; (vi) an impostor wearing specific make-up to look like a genuine user; (vii) an impostor who underwent plastic surgery to look like a genuine user. The easiest, cheapest, and most common face spoofing attack is to submit a photograph of a legitimate user to the face recognition system, which is also known as a "photo attack."

Typical countermeasure (i.e., face liveness detection or antispoofing) techniques can be coarsely classified into three categories based on the clues used for spoof attack detection: (i) motion analysis based methods, (ii) texture analysis based methods, and (iii) hardware-based methods. In what follows, we provide a brief literature overview of published face spoof recognition techniques along with their pros and cons.

(i) Motion Analysis Based Methods. These methods broadly try to detect the spontaneous movement clues generated when two-dimensional counterfeits, such as photographs or videos, are presented to the camera of the system. For instance, Pan et al. [7] exploited the fact that a human eye-blink occurs once every 2–4 seconds and proposed eye-blink-based liveness detection for photo spoofing using (spontaneous) eye-blinks. This method uses an undirected conditional random field framework to model eye-blinking, which relaxes the independence assumption of generative modelling and the state-dependence limitations of hidden Markov modelling. It is evident that real human faces (which are 3D objects) will move significantly differently from planar objects, and such deformation patterns can be employed for liveness detection. For example, Tan et al. [8] considered a Lambertian reflectance model with difference-of-Gaussians (DoG) to derive the differences in motion deformation patterns between 2D face photos presented during spoofing attacks and 3D live faces. It does so by extracting features in both scenarios using a variational retinex-based method and a difference-of-Gaussians (DoG) [9] based approach; the features are then used for live-or-spoof classification. Reported experiments showed promising results on a dataset consisting of real accesses and spoofing attacks to 15 clients using photo-quality and laser-quality prints. Kollreider et al. [10] proposed a liveness detection approach based on a short sequence of images using a binary detector, which captures and tracks the subtle movements of different selected facial parts using a simplified optical flow analysis followed by a heuristic classifier. The same authors also presented a method to combine scores from different expert systems, which concurrently observe the 3D face motion introduced in the former work alongside liveness attributes such as eye-blinks or mouth movements. In a similar fashion, Bao et al. [11] also used optical flow to estimate motion for detecting attacks produced with planar media such as prints or screens.

Since the frequency of facial motion is restricted by the human physiological rhythm, motion-based methods take a relatively long time (usually > 3 s) to accumulate stable vitality features for face spoof detection. Moreover, they may be circumvented or confused by other motion, for example, background motion in video attacks.

(ii) Texture Analysis Based Methods. This kind of method examines skin properties, such as skin texture and skin reflectance, under the assumption that the surface properties of real faces and prints (for example, pigments) are different. Examples of detectable texture patterns due to artifacts are printing failures or blurring. Li et al. [13] described a method for print-attack face spoofing that exploits differences in the 2D Fourier spectra of live and spoof images. The method assumes that photographs are normally smaller in size and contain fewer high-frequency components compared to real faces. It works well only for downsampled photos of the attacked identity and likely fails for higher-quality samples. In [14, 16, 17], the authors developed microtexture-analysis-based methods to detect printed-photo attacks; one limitation of these methods is the requirement of a reasonably sharp input image. Recently, Galbally et al. [3] designed a face spoof detection scheme based on 25 different image quality measures: 21 full-reference measures and 4 no-reference measures. However, all 25 image quality measures are required to obtain good results, and no face-specific information was considered in designing informative features for face spoof detection.

Compared to other techniques, texture-analysis-based algorithms are generally faster at classifying a spoof attack. Nevertheless, they can easily overfit to one particular illumination and imaging condition and hence do not generalize well to different spoofing conditions.

(iii) Hardware-Based Methods. A few interesting hardware-based face antispoofing techniques have been proposed based on imaging technology outside the visible spectrum, such as 3D depth [18] and complementary infrared (CIR) or near-infrared (NIR) images [15], which compare the reflectance information of real faces and spoof materials using a specific set-up of LEDs and photodiodes at two different wavelengths. Preliminary efforts on thermal imaging for face liveness detection have also been made, including the acquisition of a large database of thermal face images for real and spoofed access attempts [19]. In addition, a number of researchers have explored multimodality as an antispoofing technique for face spoofing attacks. They have mainly considered the combination of face and voice, utilizing the correlation between lip movement and the speech being produced [20], which requires a microphone and a speech analyzer. Similarly, a challenge-response strategy considering voluntary eye-blinking and mouth movement following a request from the system has been studied in [10]. Though hardware-based methods provide better results and performance, they require an extra piece of hardware, which increases the cost of the system. A summary of the relevant features of the most representative works in face antispoofing is presented in Table 1.

Though several face antispoofing or liveness detection techniques exist, face spoof attacks remain an issue because of the difficulty of finding discriminative and computationally inexpensive features and mechanisms for spoof recognition. Reported methods are limited in scope, since they mainly use the full image or the complete video for liveness detection. In particular, there is a lack of investigation of how specific image patches, rather than the full image, perform in spoof detection. Since image regions (or video frames) are often redundant or correspond to clutter in the image (or video), generally leading to low performance and high computational cost, we propose seven novel methods to find discriminative image patches, which we define as regions that are salient, instrumental, and class-specific. Experimental results show that the proposed methods obtain performances comparable to existing techniques.

3. The Proposed Method for Face Spoof Recognition

Face spoof detection can be seen as a two-class classification problem, where the input face image (or video) has to be flagged as either live or spoof. The key to the process is obtaining a discriminant feature set together with an appropriate classification scheme that gives the probability of the image (or video) being genuine. Practical face spoof detection requires that a decision be made from a single image, or from a limited number of frames in a video-based system. In this work we thus aim to design a simple but effective solution based on discriminative image patches using a single face frame/image. We define these image patches to be discriminative, consistent, salient, and repetitive. The notion is that interesting patches are those specific to a face image (or video frame) that contain features which help discriminate a given live face image from a spoofed one.

Figure 3 shows the schematic diagram of the proposed face spoof recognition algorithm based on discriminative image patches. The proposed framework first randomly selects a single frame from a given face video (in an image-based system, the given single image is used). The face is then detected using Local SMQT Features and the Split-Up Snow Classifier [30]. Subsequently, the detected face is densely divided into a grid of nonoverlapping local patches. These patches are ranked based on their discriminative power, and the top-ranked patches are selected using one of the proposed discriminative image patch selection methods (explained below in detail). For each selected image patch, features are extracted and fed into a particular classifier (i.e., SVM, Naive-Bayes, QDA, or Ensemble). The classification results of the individual patches are combined by a majority-voting scheme to obtain the final binary decision: genuine or spoof face.
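The patch-grid and majority-voting stages of this pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 16-pixel patch size and the tie-breaking rule are assumptions made here for the example.

```python
import numpy as np

def extract_patches(face, patch_size=16):
    """Divide a face image into a grid of nonoverlapping square patches."""
    h, w = face.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(face[y:y + patch_size, x:x + patch_size])
    return patches

def majority_vote(patch_labels):
    """Combine per-patch binary decisions (1 = live, 0 = spoof) by majority
    vote; ties count as live here (an arbitrary choice for illustration)."""
    votes = np.asarray(patch_labels)
    return 1 if votes.sum() * 2 >= len(votes) else 0
```

For a 64 x 64 detected face and 16 x 16 patches, this yields a 4 x 4 grid of 16 patches, each classified independently before voting.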

3.1. Discriminative Image Patches Selection Methods

In what follows, we give the details of the seven proposed methods for discriminative image patch selection. The proposed patch selection methods are grouped into two categories: clustering based methods and nonclustering based methods. For clarity, Figure 4 shows a diagram of the patch selection method classification followed in this section.

3.1.1. Clustering Based Methods

The patch selection methods in this category rely on a clustering algorithm at any specific stage of the procedure. Three clustering based methods proposed in this work are as follows.

(1) DEND-CLUSTERING. In this discriminative image patch selection technique, the patches in the given face image are grouped into clusters, such that patches within a cluster are more similar to each other than patches belonging to different clusters. Then, for each cluster, a prototype (representative) patch that typifies the members of that cluster is chosen, resulting in the discriminative patches. Since this method uses the dendrogram [31] to choose the discriminative patches, we call it DEND-CLUSTERING.

For each normalized face image $I$, a set of dense patches $\{p_{1}, \ldots, p_{N}\}$ is first taken, where $N$ is the total number of dense patches. Computation of the dissimilarity scores between patches is needed to perform clustering, so the dissimilarity between patches is first computed by comparing their respective features. Hierarchical clustering [31] is then exploited, because the patches are represented in the form of an $N \times N$ dissimilarity matrix instead of an $N \times d$ pattern matrix, where $d$ is the dimension of the feature vector. In particular, an agglomerative complete-link clustering algorithm [31] is used in this work. The outcome of this algorithm is a dendrogram (a binary tree), where each terminal node corresponds to a patch and the intermediate nodes indicate the formation of clusters. The discriminative patches are selected as follows: (i) Find the pairwise distance scores between the patches to form the dissimilarity matrix $D$. (ii) Apply the complete-link clustering algorithm to $D$ and generate the dendrogram; the method in [31] automatically determines the threshold distance at which to cut the dendrogram and identify the clusters. (iii) In each cluster identified in step (ii), select the patch whose average distance from the rest of the patches in the cluster is minimum; if a cluster has only 2 patches, choose either of them at random. (iv) Arrange the patches selected in step (iii) in descending order of their ideal selection measure (ISM) value, which is computed from the image gradient at each patch location. (v) Select the top-ranked patches as the discriminative patches.

It is worth mentioning that steps (i)–(iii) of the DEND-CLUSTERING method closely resemble the technique in [31] for fingerprint template selection. Here, we extend that technique with step (iv) for ranking and selecting the discriminative patches.
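Steps (i)–(iii) above can be sketched with SciPy's hierarchical clustering. This sketch cuts the dendrogram directly into a given number of clusters in place of the automatic threshold of [31], and omits the ISM ranking of step (iv), whose exact formula is not reproduced here; the feature vectors and cluster count are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def dend_cluster_prototypes(features, k):
    """Pick one representative (medoid) patch index per cluster of an
    agglomerative complete-link clustering, as in steps (i)-(iii)."""
    D = squareform(pdist(features))            # (i) pairwise dissimilarity matrix
    Z = linkage(features, method='complete')   # (ii) complete-link dendrogram
    labels = fcluster(Z, t=k, criterion='maxclust')  # cut into k clusters
    prototypes = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # (iii) medoid: patch with minimum average distance within its cluster
        avg = D[np.ix_(idx, idx)].mean(axis=1)
        prototypes.append(int(idx[np.argmin(avg)]))
    return prototypes
```

With two well-separated groups of patch features, the function returns one representative index from each group.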

(2) CP (Cluster Pairing). The discriminative power of the patches (features) determines the maximum possible classification accuracy, and thus prior knowledge of how cluttered the features (patches) may be and of their contribution to class separability in the feature space can help in designing a better classification scheme. To this aim, in this method, two independent sets of clusters are first generated using the genuine and spoof attack samples, respectively. Since the overlapping of interclass clusters has a great effect on classification accuracy, pairs of overlapping corresponding clusters from the two independent sets are formed using the minimum distance between them. Finally, patches which do not belong to both clusters of a highly overlapping interclass cluster pair are selected as discriminative patches. In other words, if a patch belongs to both clusters of a given interclass cluster pair, its features cause overlap between the classes in the feature space, which might lead to lower classification accuracy. The steps followed to obtain the top discriminative patches using this method are as follows: (i) Two independent sets of clusters are generated using the live and spoof attack training samples, respectively, so that each class is represented by $K$ clusters scattered in the feature space; the $K$-means clustering technique is used in this work for cluster creation. (ii) All possible pairs of corresponding clusters from the two independent sets are formed using the criterion

$\mathrm{dist}(c(C_{l}), c(C_{s})) \leq \tau,$

where $\tau$ is a threshold, $c(\cdot)$ is the center of a cluster, $C_{l}$ is a given cluster from the live-class cluster set, and $C_{s}$ is a given cluster from the spoof-class cluster set. This criterion is appropriate for detecting spherical and compact cluster pairs, since each cluster is represented only by its center point. (iii) For a given face image, the patches which do not belong to both elements of an interclass cluster pair are chosen as discriminative patches.
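A minimal sketch of the Cluster Pairing steps, with SciPy's k-means standing in for the clustering stage; the feature vectors, k, and the threshold value are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cp_select(live_feats, spoof_feats, test_feats, k=2, tau=1.0):
    """Cluster-Pairing sketch: discard test patches whose nearest live and
    spoof clusters form an overlapping pair (center distance <= tau)."""
    live_centers, _ = kmeans2(live_feats, k, minit='++')    # (i) live clusters
    spoof_centers, _ = kmeans2(spoof_feats, k, minit='++')  # (i) spoof clusters
    # (ii) interclass cluster pairs whose centers lie within the threshold tau
    overlap = {(i, j)
               for i in range(k) for j in range(k)
               if np.linalg.norm(live_centers[i] - spoof_centers[j]) <= tau}
    kept = []
    for t, f in enumerate(test_feats):
        i = int(np.argmin(np.linalg.norm(live_centers - f, axis=1)))
        j = int(np.argmin(np.linalg.norm(spoof_centers - f, axis=1)))
        if (i, j) not in overlap:   # (iii) keep patches outside overlapping pairs
            kept.append(t)
    return kept
```

A patch landing in a region where live and spoof clusters coincide is discarded, while a patch near a cluster unique to one class is kept.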

(3) CS (Cluster Space). In principle, information classes cannot be described effectively by a single well-defined grouping in a spectral space; it is better to represent them by a group of spectral classes (clusters), which is the prime inference of this method. Note that this method is similar to the above-mentioned Cluster Pairing (CP) method. Here, $K$ clusters are generated using both live and fake labeled training patches together; we name the resulting spectral space the cluster space. For each cluster, an individual classifier (IC) is trained, resulting in $K$ individual classifiers. Given an image patch, its discriminative value (DV) is computed as the average of the probabilities given by all ICs. The patches are then sorted by their DV relative to the other patches, and the patches with the largest DV values are selected. This method (cluster space + IC) provides a means of optimizing the variance and correlation present in all classes and samples. The following steps are executed to designate the top discriminative patches: (i) Using the labeled patches of the training dataset, $K$ clusters are generated from the live and fake samples together; the $K$-means clustering algorithm is used in this work for cluster creation. (ii) For each cluster, an individual classifier (IC) (in this work, an SVM) is trained using the ground truth. (iii) The patches of a given face image are arranged in descending order of their respective DV,

$\mathrm{DV}(p) = \frac{1}{K} \sum_{i=1}^{K} P_{i}(p),$

where $P_{i}(p)$ is the probability given by the $i$th classifier trained on the $i$th cluster and $p$ is the candidate patch. (iv) The top-ranked patches are selected as discriminative patches.
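The DV averaging and ranking of steps (iii)–(iv) can be sketched as follows. A toy nearest-mean model stands in for the paper's per-cluster SVM (an assumption for illustration), and the cluster parameters are invented for the example.

```python
import numpy as np

def cs_discriminative_value(clusters, patch):
    """Cluster-Space DV sketch: average, over one classifier per cluster, of
    the probability that the patch is live. Each 'classifier' here is a toy
    nearest-mean model (a stand-in for the paper's per-cluster SVM)."""
    probs = []
    for live_mean, spoof_mean in clusters:          # one IC per cluster
        d_live = np.linalg.norm(patch - live_mean)
        d_spoof = np.linalg.norm(patch - spoof_mean)
        # closer to the live mean -> higher live probability
        probs.append(d_spoof / (d_live + d_spoof + 1e-12))
    return float(np.mean(probs))                    # (iii) DV = mean IC output

def cs_select(clusters, patches, top_k):
    """Rank patches by DV and keep the top_k, as in step (iv)."""
    dv = [cs_discriminative_value(clusters, p) for p in patches]
    return [int(i) for i in np.argsort(dv)[::-1][:top_k]]
```

A patch sitting on a cluster's live mean gets DV close to 1, so patches nearest the live side of each cluster are ranked first.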

3.1.2. Nonclustering Based Methods

Unlike the clustering based methods, techniques in this category do not require any clustering algorithm. The following are the four nonclustering based discriminative patch selection methods.

(1) MAXDIST. This method of discriminative patches selection is based on the assumption that candidate discriminative patches are maximally dissimilar from the other patches in the given face image. Therefore, first the dissimilarity between successive patches is computed. The method then sorts the patches based on their average dissimilarity score with respect to other patches and selects those patches (i.e., discriminative patches) that correspond to the largest average dissimilarity scores. We refer to this method as MAXDIST since discriminative patches are selected using a maximum dissimilarity criterion.

The following steps are used to select the top discriminative patches: (i) An $N \times N$ dissimilarity matrix $D$ is generated, where each element $D_{ij}$ is the distance score between the features of patches $i$ and $j$. (ii) For the $i$th patch, the average dissimilarity score with respect to the remaining $N - 1$ patches is computed by averaging the elements in the $i$th row of $D$. (iii) The average values obtained in step (ii) are sorted in descending order, and the top patches with the largest average dissimilarity scores are selected as discriminative patches, since they are the most "dissimilar" patches in the image and hence they represent typical data measurements.
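The three MAXDIST steps can be sketched directly with SciPy's pairwise-distance utilities; the feature vectors in the example are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def maxdist_select(features, top_k):
    """MAXDIST: rank patches by their average dissimilarity to all other
    patches and keep the top_k most 'dissimilar' ones."""
    D = squareform(pdist(features))                 # (i) N x N dissimilarity matrix
    avg = D.sum(axis=1) / (len(features) - 1)       # (ii) mean distance to the rest
    return [int(i) for i in np.argsort(avg)[::-1][:top_k]]  # (iii) largest first
```

An outlying patch far from the rest of the grid is selected first, since its row of the dissimilarity matrix has the largest mean.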

From the point of view of classification performance, selecting too few patches might not sufficiently capture the inter- and intraclass variability, which may lead to inferior performance; selecting too many, on the other hand, would be computationally demanding. Thus, a rational number of patches has to be specified, taking the above-mentioned factors into account. A similar method was proposed in [32] for peak frame selection in a facial expression video. Unlike [32], in this work we employ the technique to select discriminative patches in an image/frame.

(2) IQA (Image Quality Assessment). This algorithm uses image quality measures to select discriminative patches and is hence named image quality assessment (IQA). The method assumes that discriminative patches will have a different quality from nondiscriminative patches. The expected quality differences between discriminative and nondiscriminative patches may involve local artifacts, color levels, luminance levels, degree of sharpness, entropy, structural distortions, or natural appearance.

This framework exploits four general reference-based image quality measures and thus has a very low degree of complexity. First, four distinct label sets for all patches are obtained using four different image quality measures; the labels are "discriminative" and "nondiscriminative." The method selects only those patches which are flagged as discriminative by all four image quality assessment techniques.

In particular, reference-based IQA methods are employed in this scheme; they rely on the availability of a clean, undistorted reference image to estimate the quality of the test image. However, in typical spoof recognition such a reference image is unknown, because only the input sample is available. To circumvent this limitation, the same technique (filtering the image with a low-pass Gaussian kernel) that has been used successfully for image manipulation detection [33] and for steganalysis [34] is adopted. The following steps are executed to obtain the top discriminative patches: (i) The normalized face image is filtered with a low-pass Gaussian kernel to generate a smoothed version. (ii) Two corresponding sets of dense patches are taken from the original and smoothed images. (iii) Four label matrices are generated using the following four image quality measures (IQM); a patch is flagged as “discriminative” if its IQM value is greater than or equal to the threshold. The image quality measures are as follows: (a) Peak Signal to Noise Ratio (PSNR). It computes the distortion between two corresponding patches on the basis of their pixel-wise differences.
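As an illustration, the pixel-wise PSNR between two patches can be sketched in plain Python; the 8-bit peak value of 255 and the example patch contents are assumptions made for this sketch.

```python
import math

def psnr(ref_patch, dist_patch, peak=255.0):
    """Peak Signal to Noise Ratio between two equally sized patches,
    given as flat lists of pixel intensities."""
    n = len(ref_patch)
    mse = sum((r - d) ** 2 for r, d in zip(ref_patch, dist_patch)) / n
    if mse == 0:
        return float("inf")  # identical patches: no distortion
    return 10.0 * math.log10(peak ** 2 / mse)

# Example: a patch compared against a slightly perturbed copy.
ref = [10, 20, 30, 40]
dist = [11, 19, 31, 39]
print(psnr(ref, dist))  # high PSNR for a mild distortion
```

Identical patches yield infinite PSNR, so in practice the comparison is only meaningful between the original patch and its smoothed counterpart.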

(b) Normalized Cross-Correlation (NCC). The correlation function can also be used to quantify the similarity between two digital image patches [3]. Here, a variant of the correlation-based measure is obtained by considering the statistics of the angles between the pixel vectors of the original and distorted patches.

(c) Total Edge Difference (TED). Edge features are generally considered among the most informative in computer vision algorithms. We therefore include an edge-related quality measure, since the structural distortion of an image is deeply linked with its edge degradation. The TED measure is computed as the normalized pixel-wise difference between the binary edge maps of the two patches.

In this work, we use the Sobel operator to build the binary edge maps. (d) Gradient Magnitude Similarity Deviation (GMSD). GMSD computes the pixel-wise Gradient Magnitude Similarity (GMS) between the reference and distorted patches and uses a pooling strategy based on the standard deviation of the GMS map to accurately predict perceptual image quality. The GMSD is obtained as

GMSD = sqrt( (1/N) Σ_i (GMS(i) − GMSM)² ),

where N is the total number of pixels in the patch, GMS is the Gradient Magnitude Similarity map, and GMSM is the Gradient Magnitude Similarity Mean obtained by applying average pooling to the GMS map. The GMS at pixel i is calculated as

GMS(i) = (2 m_r(i) m_d(i) + c) / (m_r(i)² + m_d(i)² + c),

where c is a positive constant that supplies numerical stability, and m_r and m_d are the gradient magnitude maps of the reference and distorted patches, obtained by convolution with the Prewitt filters along the horizontal and vertical directions, respectively. The GMSM is obtained as

GMSM = (1/N) Σ_i GMS(i).

Clearly, a higher GMSM score means higher image quality. We refer the reader to [35] for further details of the GMSD technique. (iv) The patches flagged as discriminative by all four image quality assessment techniques above are selected.
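The GMSD steps above can be sketched in plain Python on small 2D patches. The Prewitt kernels are standard; the stability constant c and the test patch are illustrative assumptions, not values from the paper.

```python
import math

PREWITT_X = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
PREWITT_Y = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

def conv2_valid(img, kernel):
    """'Valid' 2D cross-correlation of a 2D list with a 3x3 kernel."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    acc += img[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

def gradient_magnitude(img):
    gx, gy = conv2_valid(img, PREWITT_X), conv2_valid(img, PREWITT_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]

def gmsd(ref, dist, c=170.0):  # c: assumed stability constant
    mr, md = gradient_magnitude(ref), gradient_magnitude(dist)
    gms = [(2 * a * b + c) / (a * a + b * b + c)
           for ra, rd in zip(mr, md) for a, b in zip(ra, rd)]
    gmsm = sum(gms) / len(gms)  # average pooling -> GMSM
    # standard-deviation pooling of the GMS map -> GMSD
    return math.sqrt(sum((g - gmsm) ** 2 for g in gms) / len(gms))
```

For identical patches the GMS map is uniformly 1, so the deviation (and thus GMSD) is exactly 0; any local distortion raises it above 0.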

(v) The patches selected in step (iv) are arranged in descending order on the basis of their average scores under the four quality measures. (vi) The top patches are selected as discriminative patches.

(3) DF (Diversity Filter). This method assumes that the discriminative patches are discernibly diverse from the other patches in the given face image. The discriminative patches are obtained using the combination of a trained classifier (which we call a diversity filter) and a selection procedure that selects patches based on their incremental contribution to the discriminative patch detection performance. This method is close to the object detection methods [36, 37], where a similar procedure is carried out to capture visually varied parts of objects at a fixed pose or viewpoint. Unlike the proposed method, the techniques in [36, 37] apply, before the diversity filter, a preprocessing step, namely creating various clusters corresponding to different part appearances in images. A distinct diversity filter is also employed for each such cluster, while in our method a single diversity filter is used for all samples and image patches without any clustering. Moreover, the diversity tradeoff parameter in our method is computed dynamically for each image, whereas in [36, 37] a static value is used for all images.

In particular, we learn the diversity model of patches based on properties that can be computed from the filter itself. The intuition is that, across image categories, good filters exhibit common traits such as low clutter and spatially correlated gradients. Therefore, we train a ranking function with the objective of producing an ordering of the diversity quality of patches. The function uses a weight to control the tradeoff between the estimated rank of a patch and the diversity it adds, which discourages adding patches similar to the ones already selected, even if such a patch is highly ranked. The following steps are required to select discriminative patches: (i) Using the training dataset, a classifier (the diversity filter, here an SVM) is trained. (ii) The patches of a given face image are arranged in descending order on the basis of a diversity-adjusted score.

The score combines the probability given by the diversity filter, the similarity between the candidate patch and the patches already selected, and the diversity tradeoff parameter. It is worth mentioning that during the first iteration all patches are candidates, and in each successive iteration the candidate set is reduced by one, since the patch selected in the preceding iteration is removed. (iii) The top patches are selected as discriminative patches.
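Since the exact ranking equation is not reproduced here, the greedy selection can be illustrated with an assumed diversity-adjusted score: the filter probability of a patch minus a λ-weighted maximum similarity to the patches already chosen. The probabilities, features, and λ below are illustrative.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def select_diverse(probs, feats, k, lam=0.8):
    """Greedy diverse selection: at each step pick the candidate maximizing
    filter probability minus lam * (max similarity to already selected)."""
    selected = []
    candidates = list(range(len(probs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(feats[i], feats[j]) for j in selected),
                             default=0.0)
            return probs[i] - lam * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)  # candidate set shrinks by one each iteration
    return selected

# Example: patches 0 and 1 are near-duplicates; diversity favors patch 2.
probs = [0.9, 0.85, 0.2]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(select_diverse(probs, feats, k=2))  # [0, 2]
```

Note how patch 1, despite its high probability, is penalized for being nearly identical to the already selected patch 0.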

(4) IPI (Intensity-Based Patch of Interest). Local intensity inhomogeneity can be exploited to find regions, shapes, and edges of a similar kind in an image [38]. However, our aim here is to disregard image patches (regions) with similar features in order to avoid redundancy. Therefore, to determine the discriminative patches, this method applies an approach akin to the standard statistical background-subtraction approach (which is most commonly used to address intensity inhomogeneity) [39]. The proposed method does not use any preprocessing step, that is, foreground and background models based on recursive or nonrecursive techniques as in [39]. The following steps are executed to obtain the discriminative patches: (i) A set of dense patches is taken. (ii) A label matrix is generated using a standard statistical background-subtraction approach, where the threshold is estimated using a procedure similar to that explained above for the IQA method. (iii) The patches flagged as discriminative in step (ii) are arranged in descending order of their background-subtraction scores. (iv) The top patches are selected as the final discriminative patches.
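A minimal sketch of this background-subtraction style labeling, under the assumption that a patch is flagged as discriminative when its mean intensity deviates from the image-wide mean by more than the threshold; the patches and threshold below are illustrative.

```python
def label_patches(patches, threshold):
    """patches: list of flat pixel lists. A patch is labeled 1 ('discriminative')
    if its mean intensity deviates from the global mean by more than threshold."""
    means = [sum(p) / len(p) for p in patches]
    global_mean = sum(means) / len(means)  # crude 'background' estimate
    return [1 if abs(m - global_mean) > threshold else 0 for m in means]

# Three dark, mutually similar patches and one bright outlier.
patches = [[10, 10], [12, 12], [90, 90], [11, 11]]
print(label_patches(patches, threshold=25))  # only the bright patch is flagged
```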

3.2. Classification Method

For the final classification of whether the face is genuine or spoofed, we use a majority-voting scheme that exploits four well-known classifiers: support vector machine (SVM), Naive Bayes (NB), Quadratic Discriminant Analysis (QDA), and an Ensemble based on the AdaBoost algorithm.
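The majority-voting fusion can be sketched as follows; the per-patch decisions are assumed to be binary labels (1 = genuine, 0 = spoof), and breaking ties toward "spoof" is an assumption made for this sketch, not a rule stated in the paper.

```python
def majority_vote(decisions):
    """decisions: binary labels (1 = genuine, 0 = spoof) from the individual
    patch classifications. Returns the majority label; ties go to spoof (0)."""
    ones = sum(decisions)
    return 1 if ones > len(decisions) - ones else 0

print(majority_vote([1, 1, 0, 1]))  # 1 (genuine)
print(majority_vote([1, 0, 1, 0]))  # 0 (tie -> conservatively spoof)
```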

4. Experiments

In this section, we evaluate the proposed approach on two different publicly available databases: REPLAY-ATTACK [4] and CASIA-Face Antispoofing Database (FASD) [22].

4.1. Datasets
4.1.1. REPLAY-ATTACK [4]

This dataset is composed of short videos of both real-access and spoofing attempts by 50 different subjects. The videos were collected under controlled (uniform background and artificial lighting) and adverse (natural illumination and nonuniform background) conditions. The face spoof attacks were created by forging genuine verification attempts of the respective subjects via printed photos, photos/videos displayed on a mobile phone screen, and photos/videos displayed on an HD screen.

4.1.2. CASIA-FASD [22]

This database contains video recordings of real and fake faces for 50 different identities. Both real-access and spoof attacks were captured using three camera resolutions: low resolution, normal resolution, and high resolution. Three kinds of attack attempts were considered: warped photo attacks, cut photo attacks, and video attacks. The dataset is divided into two subsets for training and testing: 20 and 30 identities, respectively. Table 2 provides a summary of the above two databases.

4.2. Evaluation Protocols

For the REPLAY-ATTACK dataset, we followed the standard protocols specified in [4] for the experiments. The dataset contains three totally independent subsets in terms of users: the train and development sets are used for training and parameter tuning, respectively, and the final results are computed on the test set. The performance of the proposed liveness detection system was evaluated, as in [3, 4, 17], using the Half Total Error Rate (HTER), computed as HTER = (FAR + FRR)/2, where FRR and FAR stand for False Rejection Rate and False Acceptance Rate, respectively.
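As a sketch, HTER can be computed from labeled scores at a fixed decision threshold; the score values and threshold below are illustrative assumptions.

```python
def hter(genuine_scores, impostor_scores, threshold):
    """HTER = (FAR + FRR) / 2, where FAR is the fraction of impostor (spoof)
    scores accepted and FRR the fraction of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return (far + frr) / 2

genuine = [0.9, 0.8, 0.4]   # one genuine sample falls below the threshold
impostor = [0.1, 0.2, 0.6]  # one impostor sample is wrongly accepted
print(hter(genuine, impostor, threshold=0.5))  # (1/3 + 1/3) / 2 = 1/3
```

In the protocol of [4], the threshold itself is chosen on the development set before HTER is reported on the test set.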

For the CASIA-FASD database, we followed the benchmark protocols specified in [22]. The test protocol consists of seven scenarios. The first three scenarios study the effect of imaging quality: low quality, normal quality, and high quality. The next three scenarios are warped photo attacks, cut photo attacks, and video attacks. The seventh is the overall scenario, in which all data are combined together to give a general and overall evaluation. Classifier training and parameter tuning were performed on the training set, while the results are reported in terms of Equal Error Rate (EER) on the test set.

In a given video frame, first the face is detected. The detected face image is then normalized to 128 × 128 pixels and densely divided into a grid of nonoverlapping local patches. Out of the total number of patches, only the 40% most peculiar patches are selected as discriminative patches. The LBP (Local Binary Patterns) technique is utilized to extract the features both for the final classification and for the discriminative patch selection methods (to obtain (dis)similarity scores and clustering). Figure 5 shows examples from the REPLAY-ATTACK and CASIA-FASD databases of a given face image and the corresponding discriminative patches selected by the seven proposed patch selection methods.
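The patch-grid step described above can be sketched as follows; the 16 × 16 patch size and the mean-intensity scoring function are assumptions for illustration, since the exact values are not reproduced in this text.

```python
def grid_patches(image, patch=16):
    """Split a 2D image (list of rows) into nonoverlapping patch x patch blocks,
    each returned as a flat list of pixels."""
    h, w = len(image), len(image[0])
    blocks = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            blocks.append([image[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return blocks

def top_discriminative(blocks, score, keep=0.4):
    """Keep the top `keep` fraction of patches ranked by a scoring function."""
    ranked = sorted(blocks, key=score, reverse=True)
    return ranked[:int(keep * len(ranked))]

image = [[(i + j) % 256 for j in range(128)] for i in range(128)]
blocks = grid_patches(image)  # 8 x 8 = 64 nonoverlapping 16x16 patches
top = top_discriminative(blocks, score=lambda b: sum(b) / len(b))
print(len(blocks), len(top))  # 64 25
```

In the paper the per-patch score would come from one of the seven selection methods rather than this placeholder mean-intensity score.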

4.3. Experimental Results

The experimental results on the REPLAY-ATTACK and CASIA-FASD databases are reported in Tables 3 and 4, respectively. We compared the performance of the proposed method with the most prominent techniques published in the literature, such as the methodologies in [4] (based on local binary pattern features with an SVM classifier), [22] (grounded on multiple difference-of-Gaussian (DoG) filters to extract high-frequency information), [23] (using Multiscale Local Binary Patterns with SVM), [3] (which makes use of general full-reference and no-reference image quality measures), [16] (exploiting the correlation between head motion and background, estimated using optical flow), [28] (encoding information with a Histogram of Oriented Optical Flow (HOOF)), [23] (utilizing both texture and motion estimation along with preprocessing for motion magnification), and [1] (based on image distortion analysis features, a combination of specular reflection, blurriness, chromatic moment, and color diversity properties).

The results in Tables 3 and 4 show that the proposed method in general achieves better accuracy than existing techniques under specific combinations of discriminative patch selection method and classification scheme. For instance, in the experiment using the REPLAY-ATTACK database, it is easy to see in Table 3 that when the proposed framework is implemented with the DEND-CLUSTERING-Ensemble or MAXDIST-Ensemble combinations, the Half Total Error Rate (HTER) is 5.00%, which is much lower than that of the method in [25] (i.e., PCA + LBP + SVM, 20.50%). Similarly, in Table 4, we can see that the proposed system achieves error rates better than or similar to the state-of-the-art methods under the overall scenario.

The MAXDIST patch selection method achieves the best performance on average with the four classification techniques used in this study. Additionally, the MAXDIST, CS, and DF patch selection algorithms demonstrate good generalization capability not only across disparate datasets but also across spoofing attacks of varying quality and fabrication method. On the contrary, the CP and IQA methods fail to attain proficient generalization. Besides the patch selection algorithm, the choice of feature classification scheme also plays a vital role in achieving good performance. To this end, it is easy to see in Tables 3 and 4 that, among the SVM, Naive Bayes (NB), QDA, and AdaBoost-based Ensemble classifiers, Ensemble performs best under varying features, datasets, attack types, and amounts of training and testing samples, owing to its ability to reduce variance, average out biases, and resist overfitting. The NB and QDA classifiers in this study are quite sensitive to the patch selection approach. Specifically, though NB enjoys simplicity and computational efficiency, it performs substantially worse under diverse attack conditions. This may occur because of its assumption that all attributes are independent (i.e., no correlation between variables), since it was pointed out in [20, 40] that correlation mapping helps attain better accuracy and generalization capability in biometric liveness detection. Moreover, NB assumes that the samples follow a Gaussian distribution. This assumption generally holds only for small biometric datasets, whereas spoofing databases are heterogeneous, containing different spoof attack types and sizes; consequently, NB either overfits or fails to address the problem of concept drift.

By metaknowledge analysis, it was found that spoof attacks of higher resolution/quality are comparatively harder to recognize, as also pointed out in [3, 22]; for instance, high-quality eye-cut attacks, in which the cropped portions are filled by the real eyes of the attackers, yield spoofing attacks that combine real and spoofed face features. Furthermore, between the REPLAY-ATTACK and CASIA-FASD databases, CASIA-FASD is more challenging, as it incorporates more practical conditions, such as variants of spoof attacks (e.g., a cut photo attack simulating eye blinking) and high-quality (high-resolution) samples. All in all, the results also suggest that, for systems or datasets based on low- or normal-quality samples, it is advisable to adopt the CS method with the Ensemble classifier to reach desirable accuracy, while MAXDIST with Ensemble is the better choice for systems/datasets based on high-quality samples, especially videos.

On the whole, it can be stated that using only certain image locations can significantly influence face antispoofing accuracy. Namely, the proposed method uses only selected discriminative patches and attains superior precision, unlike the state-of-the-art methods that exploit the whole face image/frame/video, which generally leads to clutter in the feature representations and to lower performance.

In many face recognition applications, there is no access to a video or image sequence of the user. However, a large number of existing face antispoofing solutions need video or sequences of images (i.e., for motion or temporal information) to attain high accuracy. Accordingly, they have lower usability, since they are not devised to work on a single static face image. Conversely, the proposed method is a single-image algorithm (i.e., it requires just one input image and not a sequence of them) and is therefore useful in a wider range of applications. Further, it is evident from the experimental results that the proposed framework is robust and performs well across diverse types of spoof attacks, materials, and techniques (such as printed image attacks, video-replay attacks, cut photo attacks, and varying image or video resolutions), although only a specific face image area is considered. Consequently, the proposed method is realistic and useful in the real world, where the attack (artifact) types (paper, mobile, or resolution) an attacker might utilize cannot be predicted a priori.

To sum up, the performance of the proposed algorithm confirms that taking into account the inherent differences in discriminative ability among various face image locations is useful for consistently recognizing facial spoof attacks. In other words, we show that it is feasible to use only certain face image patches instead of the whole face image and thereby significantly reduce the error rates.

5. Conclusion

The vulnerability of face recognition systems to spoofing attacks is a widely accepted reality, which has led to great advances in face antispoofing (especially face liveness detection) technologies. Despite these remarkable advances, counteracting face spoof attacks has proven to be a challenging task. Moreover, existing face liveness detection methods use the whole face image or a complete video for liveness detection, although image regions (video frames) are often redundant or correspond to clutter in the image (video), which generally leads to low performance. Therefore, in this paper, we propose using just discriminative image patches for face liveness detection. In particular, we present seven novel methods to obtain discriminative patches in a face image (or a randomly selected lone video frame). The features of the selected discriminative image patches are fed to a specific classifier (i.e., SVM, Naive Bayes, QDA, or Ensemble). The classification results of these patches are combined by a majority-voting scheme for the final classification of genuine and spoof faces. Experimental results on two publicly available databases show competitive performance compared to existing works. Future work includes devising more novel techniques for obtaining discriminative image patches and incorporating temporal information into the proposed method for higher-security applications.

Competing Interests

The authors declare that they have no competing interests.


  1. D. Wen, H. Han, and A. K. Jain, “Face spoof detection with image distortion analysis,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 746–761, 2015.
  2. Z. Akhtar, C. Micheloni, and G. L. Foresti, “Biometric liveness detection: challenges and research opportunities,” IEEE Security & Privacy, vol. 13, no. 5, pp. 63–72, 2015.
  3. J. Galbally, S. Marcel, and J. Fierrez, “Image quality assessment for fake biometric detection: application to iris, fingerprint, and face recognition,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 710–724, 2014.
  4. I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of local binary patterns in face anti-spoofing,” in Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG '12), pp. 1–7, Darmstadt, Germany, September 2012.
  5. J. Määttä, A. Hadid, and M. Pietikäinen, “Face spoofing detection from single images using texture and local shape analysis,” IET Biometrics, vol. 1, no. 1, pp. 3–10, 2012.
  6. Z. Akhtar, C. Micheloni, C. Piciarelli, and G. L. Foresti, “MoBio_LivDet: mobile biometric liveness detection,” in Proceedings of the 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '14), pp. 187–192, Seoul, Republic of Korea, August 2014.
  7. G. Pan, L. Sun, Z. Wu, and S. Lao, “Eyeblink-based anti-spoofing in face recognition from a generic webcamera,” in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, Rio de Janeiro, Brazil, October 2007.
  8. X. Tan, Y. Li, J. Liu, and L. Jiang, “Face liveness detection from a single image with sparse low rank bilinear discriminative model,” in Proceedings of the 11th European Conference on Computer Vision (ECCV '10), K. Daniilidis, P. Maragos, and N. Paragios, Eds., vol. 6316 of Lecture Notes in Computer Science, pp. 504–517, Crete, Greece, September 2010.
  9. Y. Li and X. Tan, “An anti-photo spoof method in face recognition based on the analysis of fourier spectra with sparse logistic regression,” in Proceedings of the Chinese Conference on Pattern Recognition (CCPR '09), Nanjing, China, November 2009.
  10. K. Kollreider, H. Fronthaler, and J. Bigun, “Non-intrusive liveness detection by face images,” Image and Vision Computing, vol. 27, no. 3, pp. 233–244, 2009.
  11. W. Bao, H. Li, N. Li, and W. Jiang, “A liveness detection method for face recognition based on optical flow field,” in Proceedings of the International Conference on Image Analysis and Signal Processing (IASP '09), pp. 233–236, IEEE, Taizhou, China, April 2009.
  12. G. Pan, L. Sun, Z. Wu, and Y. Wang, “Monocular camera-based face liveness detection by combining eyeblink and scene context,” Telecommunication Systems, vol. 47, no. 3, pp. 215–225, 2011.
  13. J. Li, Y. Wang, T. Tan, and A. K. Jain, “Live face detection based on the analysis of Fourier spectra,” in Proceedings of the Biometric Technology for Human Identification, vol. 5404 of Proceedings of SPIE, pp. 296–303, Orlando, Fla, USA, April 2004.
  14. J. Bai, T. Ng, X. Gao, and Y. Shi, “Is physics-based liveness detection truly possible with a single image?” in Proceedings of IEEE International Symposium on Circuits and Systems, pp. 3425–3428, Paris, France, May-June 2010.
  15. Z. Zhang, D. Yi, Z. Lei, and S. Z. Li, “Face liveness detection by learning multispectral reflectance distributions,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG '11), pp. 436–441, Santa Barbara, Calif, USA, March 2011.
  16. J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos, and S. Marcel, “Complementary countermeasures for detecting scenic face spoofing attacks,” in Proceedings of the International Conference on Biometrics (ICB '13), pp. 1–7, Madrid, Spain, June 2013.
  17. A. Anjos and S. Marcel, “Counter-measures to photo attacks in face recognition: a public database and a baseline,” in Proceedings of the International Joint Conference on Biometrics (IJCB '11), pp. 1–7, Washington, DC, USA, October 2011.
  18. T. Wang and S. Z. Li, “Face liveness detection using 3d structure recovered from a single camera,” in Proceedings of the International Conference on Biometrics (ICB '13), pp. 1–6, IEEE, Madrid, Spain, 2013.
  19. T. I. Dhamecha, A. Nigam, R. Singh, and M. Vatsa, “Disguise detection and face recognition in visible and thermal spectrums,” in Proceedings of the 6th IAPR International Conference on Biometrics (ICB '13), pp. 1–6, Madrid, Spain, June 2013.
  20. G. Chetty and M. Wagner, “Liveness detection using cross-modal correlations in face-voice person authentication,” in Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), pp. 2181–2184, Lisbon, Portugal, September 2005.
  21. N. Kose and J.-L. Dugelay, “Reflectance analysis based countermeasure technique to detect face mask attacks,” in Proceedings of the 18th International Conference on Digital Signal Processing (DSP '13), pp. 1–6, IEEE, Fira, Greece, July 2013.
  22. Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing database with diverse attacks,” in Proceedings of the 5th IAPR International Conference on Biometrics (ICB '12), pp. 26–31, IEEE, New Delhi, India, April 2012.
  23. S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh, “Face anti-spoofing via motion magnification and multifeature videolet aggregation,” Tech. Rep. IIITD-TR-2014-002, 2014.
  24. I. Chingovska, J. Yang, Z. Lei, and D. Yi, “The 2nd competition on counter measures to 2D face spoofing attacks,” in Proceedings of the IEEE International Conference on Biometrics (ICB '13), pp. 1–6, Madrid, Spain, June 2013.
  25. S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, “Detection of face spoofing using visual dynamics,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762–777, 2015.
  26. T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel, “Can face anti-spoofing countermeasures work in a real world scenario?” in Proceedings of the 6th IAPR International Conference on Biometrics (ICB '13), pp. 1–8, Madrid, Spain, June 2013.
  27. J. Galbally and S. Marcel, “Face anti-spoofing based on general image quality assessment,” in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 1173–1178, Stockholm, Sweden, August 2014.
  28. R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, “Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1932–1939, Miami, Fla, USA, June 2009.
  29. Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face anti-spoofing based on color texture analysis,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '15), pp. 2636–2640, Quebec City, Canada, September 2015.
  30. M. Nilsson, J. Nordberg, and I. Claesson, “Face detection using local SMQT features and split up snow classifier,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), pp. II-589–II-592, Honolulu, Hawaii, USA, April 2007.
  31. U. Uludag, A. Ross, and A. K. Jain, “Biometric template selection and update: a case study in fingerprints,” Pattern Recognition, vol. 37, no. 7, pp. 1533–1542, 2004.
  32. S. Zhalehpour, Z. Akhtar, and C. Eroglu Erdem, “Multimodal emotion recognition based on peak frame selection from video,” Signal, Image and Video Processing, 2015.
  33. S. Bayram, I. Avcibaş, B. Sankur, and N. Memon, “Image manipulation detection,” Journal of Electronic Imaging, vol. 15, no. 4, Article ID 041102, 17 pages, 2006.
  34. I. Avcibas, N. Memon, and B. Sankur, “Steganalysis using image quality metrics,” IEEE Transactions on Image Processing, vol. 12, no. 2, pp. 221–229, 2003.
  35. W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: a highly efficient perceptual image quality index,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684–695, 2014.
  36. L. Bourdev and J. Malik, “Poselets: body part detectors trained using 3D human pose annotations,” in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 1365–1372, Kyoto, Japan, September 2009.
  37. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  38. U. Vovk, F. Pernus, and B. Likar, “A review of methods for correction of intensity inhomogeneity in MRI,” IEEE Transactions on Medical Imaging, vol. 26, no. 3, pp. 405–421, 2007.
  39. S. Cheung and C. Kamath, “Robust techniques for background subtraction in urban traffic video,” in Proceedings of the IEEE Conference on Visual Communications and Image Processing (VCIP '07), pp. 1–12, 2007.
  40. Z. Akhtar, C. Micheloni, and G. L. Foresti, “Correlation based fingerprint liveness detection,” in Proceedings of the International Conference on Biometrics (ICB '15), pp. 305–310, Phuket City, Thailand, May 2015.


Copyright © 2016 Zahid Akhtar and Gian Luca Foresti. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 Data Set, Data Annotation and Data Preprocessing

2.1 Data Set

The butterfly data sets used in this paper are all photos of butterflies in their natural ecological environment, hereinafter referred to as ecological photos. One part comes from the data set provided in [14]; the other part comes from images found via search engines such as Baidu and from image libraries, covering 94 species and 11 genera of butterflies. Fig. 1 shows some samples of butterfly ecological photos.

Figure 1: Butterfly ecological photographs

A total of 5695 pictures were taken from [14], including two kinds of photographs of butterflies: specimen photographs and ecological photographs. According to [12], because the shooting angle and background environment of specimen photographs differ greatly from those of ecological photographs, a training set of ecological photographs alone clearly outperforms one that mixes specimen and ecological photographs in butterfly detection and classification tasks. Since the purpose of this study is to locate butterflies in the natural environment and determine their species, only the 1048 photos of butterflies in the natural ecological environment were selected for this paper.

Most photos in the data set contain only a single butterfly sample, and no photo contains more than 20. Each butterfly species is represented by at least four samples, and the species counts follow a typical heavy-tailed distribution.

The test set is the standard test set provided in [14], which contains 678 ecological photos; the remaining photos serve as the training set.

2.2 Data Annotations

Because the postures of butterflies in ecological photographs are complex, with many butterflies even overlapping one another, and because the data sets provided in [14] are labeled inconsistently with no uniform standard, we formulated a set of uniform labeling standards and manually labeled the positions and species of all butterfly samples in all photos according to this standard.

In the data set provided in [14], there are two ways to label the area where a butterfly is located: one uses the antennae and legs of the butterfly as the border, as shown in Fig. 2(a); the other uses the trunk and wings of the butterfly as the border, as shown in Fig. 2(b). We unified the data set according to these two annotation methods.

(a) Uses the antennae and legs of butterflies as the border (b) Uses the trunk and wings of butterflies as the border
Figure 2: Two different methods to annotate a single butterfly

Because some butterfly species are social, many butterflies often overlap in photos. The data set provided in [14] labels multiple butterflies in overlapping areas as a single sample, as shown in Fig. 3(a). We instead developed a standard for this situation: each butterfly in the overlapping area is labeled independently and the occluded part is ignored, as shown in Fig. 3(b). This method not only increases the number of training samples but also improves the model's recognition of complex scenes.

(a) Labels multiple butterflies in overlapping areas as a single sample (b) Each butterfly in the overlapping area is independently labeled and the occluded part is ignored
Figure 3: Two different methods to annotate two or more overlapping butterflies

2.3 Data Preprocessing

Target detection algorithms based on deep learning often require a large amount of data for training. In this paper, we expand the training set with nine transformation methods, such as rotation, mirroring, blurring, and contrast increase and decrease, and we combine different preprocessing methods and their parameters (such as rotation angle and exposure) to find the optimal preprocessing scheme. The results are shown in Part 4.
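Two of the augmentation transforms above (mirroring and 90° rotation) can be sketched in plain Python on an image stored as a 2D list; a real pipeline would operate on actual image files with an imaging library, so this is illustrative only.

```python
def mirror(img):
    """Horizontal mirror of a 2D image (list of rows)."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
print(mirror(img))    # [[2, 1], [4, 3]]
print(rotate90(img))  # [[3, 1], [4, 2]]
```

Composing such transforms (e.g., a rotation followed by a mirror) is how the nine augmentation variants multiply the effective training set size.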

Through the above process, automatic butterfly detection and classification in the natural ecological environment has been transformed into a multi-object detection and classification problem. Unlike common target detection problems, it has three difficulties: 1) there are many classes (94); 2) the distribution of samples is not uniform, with some rare butterfly species having significantly fewer samples than others; 3) different subclasses (butterfly species) must be distinguished within the same superclass (butterflies), i.e., fine-grained classification is needed. The automatic detection and classification of butterflies studied in this paper is therefore especially difficult.

Global crop production faces a major sustainability challenge in the context of a rapidly growing world population and the gradual diminishing of natural resources. Remote sensing plays a fundamental role in changing the plant production model through the development of new technologies (robots, UAVs, sensors), making products more profitable, competitive, and sustainable. Among these advances, unmanned aerial vehicles (UAVs) equipped with perception systems have demonstrated suitability for the timely assessment and monitoring of vegetation. They can be operated at low altitudes, providing ultra-high spatial resolution imagery; they offer great flexibility of flight scheduling for data collection at critical moments; and they enable the generation of digital surface models (DSMs) using highly overlapped images and photo-reconstruction techniques or artificial vision. It is therefore essential to advance research on the technical configuration of UAVs, and to improve the processing and analysis of UAV imagery of agricultural and forest scenarios, in order to strengthen knowledge of these ecosystems and thereby improve farmers' decision-making processes.

We encourage all members involved in the development and application of UAVs to share their most recent findings on promising developments in vegetation monitoring. This Special Issue welcomes original and innovative papers demonstrating the use of UAVs for remote sensing applications in agriculture, forestry, and natural resource management. The selection of papers for publication will depend on the quality and rigor of the research and the paper. Specific topics include, but are not limited to:

  • UAV configuration and specifications for forest or agricultural applications
  • Object- or pixel-based image analysis approaches for vegetation monitoring
  • Artificial intelligence-based image-processing approaches
  • Integration of UAV images with ground-based dataset or other remote and proximal measurements
  • Biotic (weeds, disease) and abiotic (water, nutrition deficiencies) stress factors: sensing and modeling
  • Crop yield estimation or prediction
  • High-throughput phenotyping
  • UAV-based prescription map development for site-specific management
  • Precision agriculture applications
  • UAV image pre-processing for radiometric, spectral and spatial calibration, and mosaicking
  • Development, integration, and testing of new and emerging sensors and technologies for UAV-based crop management.

Dr. Ana de Castro Megías
Dr. Yeyin Shi
Dr. José M. Peña
Prof. Dr. Joe Maja
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Remote Sensing is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

What is Image Processing?

Image processing is used to perform operations on a picture, such as enhancing the image or extracting useful information from it. It is a type of signal processing in which the input is an image and the output is a set of features or characteristics associated with that image.


At present, image processing techniques are widely used across different industries and form core research areas in engineering as well as in other disciplines. The basic image processing steps are discussed below.

  • Capture the image using a digital camera
  • Analyse and manipulate the image
  • Produce output, which can be an altered image or a report based on the analysis of the image

Image processing can be done using two methods: analog image processing and digital image processing. Analog image processing is used for hard copies such as photographs and printouts; image analysts apply various fundamentals of interpretation when using these techniques. Digital image processing assists in the analysis of digital images using a computer.

Types of Images

  1. BINARY IMAGE – As its name suggests, a binary image contains only two pixel values, 0 and 1, where 0 refers to black and 1 refers to white. This image is also known as monochrome.
  2. BLACK AND WHITE IMAGE – An image that consists of only black and white color is called a black and white image.
  3. 8-bit COLOR FORMAT – The most common image format. It has 256 different shades of color and is commonly known as a grayscale image. In this format, 0 stands for black, 255 stands for white, and 127 stands for gray.
  4. 16-bit COLOR FORMAT – A color image format with 65,536 different colors, also known as high color format. In this format the distribution of color is not the same as in a grayscale image.

A 16-bit format is actually divided into three further channels: red, green, and blue, the famous RGB format.
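As a sketch of these formats, assuming the common RGB565 packing for the 16-bit layout (the article does not specify the exact bit split, so treat the 5/6/5 arrangement here as one typical choice):

```python
import numpy as np

# Binary image: only 0 (black) and 1 (white)
binary = np.array([[0, 1], [1, 0]], dtype=np.uint8)

# 8-bit grayscale: 256 shades, 0 = black, 255 = white, 127 = mid-gray
gray = np.array([[0, 127], [200, 255]], dtype=np.uint8)

def rgb565(r, g, b):
    """Pack 8-bit R, G, B channels into one 16-bit high-color value
    (5 bits red, 6 bits green, 5 bits blue)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

white = rgb565(255, 255, 255)
print(white)      # 65535: all 16 bits set
print(2 ** 16)    # 65536 distinct colors in the 16-bit format
```

Note how the 16-bit format trades per-channel precision (5 or 6 bits instead of 8) for a compact single-value encoding.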

Objectively defining the image parameters for facial detection, recognition, and identification in digital imagery

Object recognition is a computer vision technique for identifying objects in images or videos. Object recognition is a key output of deep learning and machine learning algorithms. When humans look at a photograph or watch a video, we can readily spot people, objects, scenes, and visual details. The goal is to teach a computer to do what comes naturally to humans: to gain a level of understanding of what an image contains.

Figure 1. Using object recognition to identify different categories of objects.

Object recognition is a key technology behind driverless cars, enabling them to recognize a stop sign or to distinguish a pedestrian from a lamppost. It is also useful in a variety of applications such as disease identification in bioimaging, industrial inspection, and robotic vision.

Object Recognition vs. Object Detection

Object detection and object recognition are similar techniques for identifying objects, but they vary in their execution. Object detection is the process of finding instances of objects in images. In the case of deep learning, object detection is a subset of object recognition, where the object is not only identified but also located in an image. This allows for multiple objects to be identified and located within the same image.

Figure 2. Object recognition (left) and object detection (right).

These are simply ideas I've had while thinking about the problem; I've never tried them, but I like thinking about problems like this!

Before you begin

Consider normalising the pictures: if one is a higher resolution than the other, consider the possibility that one of them is a compressed version of the other, so scaling the resolution down might provide more accurate results.

Consider scanning various prospective areas of the image that could represent zoomed portions, at various positions and rotations. It starts getting tricky if one of the images is a skewed version of the other; these are the sorts of limitations you should identify and compromise on.

Matlab is an excellent tool for testing and evaluating images.

Testing the algorithms

You should test against (at a minimum) a large, human-analysed set of test data where matches are known beforehand. If, for example, your test data contains 1,000 images of which 5% match, you now have a reasonably reliable benchmark. An algorithm that reports 10% positives is not as good as one that reports 4% positives on this test data. However, one algorithm may find all the matches but also have a 20% false positive rate, so there are several ways to rate your algorithms.

The test data should be designed to cover as many of the dynamics you would expect to find in the real world as possible.

It is important to note that, to be useful, each algorithm must perform better than random guessing; otherwise it is useless to us!

You can then apply your software in the real world in a controlled way and start to analyse the results it produces. This is the sort of software project that can go on ad infinitum; there are always tweaks and improvements you can make. Bear that in mind when designing it, as it is easy to fall into the trap of the never-ending project.
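A rough sketch of how such a benchmark could be scored. The numbers are entirely hypothetical, chosen to mirror the 1,000-image, 5%-match example above:

```python
def benchmark(predictions, ground_truth):
    """Score a matching algorithm against human-analysed test data.

    predictions and ground_truth are parallel lists of booleans:
    True means "these two images match".
    """
    tp = sum(p and g for p, g in zip(predictions, ground_truth))       # true positives
    fp = sum(p and not g for p, g in zip(predictions, ground_truth))   # false positives
    fn = sum(g and not p for p, g in zip(predictions, ground_truth))   # missed matches
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of reported matches that are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of true matches found
    return precision, recall

# 1,000 image pairs where 5% (50) are known matches
truth = [True] * 50 + [False] * 950
# a hypothetical algorithm that finds all 50 matches but also reports 200 false positives
preds = [True] * 50 + [True] * 200 + [False] * 750
precision, recall = benchmark(preds, truth)
print(recall)      # 1.0 -> finds every match
print(precision)   # 0.2 -> but most of its reports are wrong
```

As the text notes, a single number is not enough: the algorithm above catches everything yet would drown you in false positives.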

Colour Buckets

With two pictures, scan each pixel and count the colours. For example, you might have these 'buckets':

(Obviously you would have a higher resolution of counters.) Every time you find a 'red' pixel, you increment the red counter. Each bucket can represent a spectrum of colours; the higher the resolution, the more accurate, but you should experiment with an acceptable difference rate.

Once you have your totals, compare them to the totals for a second image. You might find that each image has a fairly unique footprint, enough to identify matches.
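A minimal sketch of the bucket idea. The bucket resolution and the distance measure below are arbitrary choices for illustration, not recommendations:

```python
def colour_buckets(pixels, buckets=4):
    """Count pixels into coarse colour 'buckets' per channel.

    pixels: iterable of (r, g, b) tuples with 8-bit channels.
    Returns a flat tuple usable as the image's colour footprint.
    """
    counts = [0] * (buckets ** 3)
    step = 256 // buckets   # channel values per bucket
    for r, g, b in pixels:
        idx = (r // step) * buckets * buckets + (g // step) * buckets + (b // step)
        counts[idx] += 1
    return tuple(counts)

def footprint_distance(a, b):
    """Sum of absolute bucket differences; small values suggest a match."""
    return sum(abs(x - y) for x, y in zip(a, b))

img1 = [(250, 10, 10), (245, 20, 5), (10, 10, 240)]   # two reds, one blue
img2 = [(240, 5, 15), (250, 25, 10), (5, 20, 250)]    # similar palette
print(footprint_distance(colour_buckets(img1), colour_buckets(img2)))  # 0
```

With only 4 buckets per channel, the two similar palettes land in identical buckets; a finer resolution would separate them, which is the accuracy/tolerance trade-off mentioned above.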

Edge detection

With two similar pictures, edge detection should provide you with a usable and fairly reliable unique footprint.

Take both pictures and apply edge detection. Maybe measure the average thickness of the edges, then calculate the probability that the image has been scaled, and rescale it if necessary. Below is an example of a Gabor filter (a type of edge detection) applied in various rotations.

Compare the pictures pixel for pixel and count the matches and non-matches. If they are within a certain threshold of error, you have a match. Otherwise, you could try reducing the resolution up to a certain point and see if the probability of a match improves.
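A rough sketch of that pixel-for-pixel comparison, using a crude gradient threshold as a stand-in for a real Gabor filter:

```python
import numpy as np

def edge_map(image, threshold=30):
    """Crude edge detection: mark pixels whose horizontal or vertical
    gradient exceeds a threshold. A stand-in for a real edge detector."""
    img = image.astype(int)
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))  # horizontal gradient
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))  # vertical gradient
    return np.maximum(gx, gy) > threshold

def edge_match_ratio(a, b):
    """Compare two edge maps pixel for pixel; 1.0 means identical."""
    return float(np.mean(edge_map(a) == edge_map(b)))

# two copies of the same synthetic image, one slightly brightened
base = np.zeros((8, 8), dtype=np.uint8)
base[2:6, 2:6] = 200                                     # a bright square on black
bright = np.clip(base.astype(int) + 20, 0, 255).astype(np.uint8)
print(edge_match_ratio(base, bright))                    # 1.0: edges survive brightening
```

This also hints at why edges make a robust footprint: a uniform brightness change leaves the gradients, and hence the edge map, untouched.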

Regions of Interest

Some images may have distinctive segments/regions of interest. These regions probably contrast highly with the rest of the image, and are a good item to search for in your other images to find matches. Take this image for example:

The construction worker in blue is a region of interest and can be used as a search object. There are probably several ways you could extract properties/data from this region of interest and use them to search your data set.

If you have more than 2 regions of interest, you can measure the distances between them. Take this simplified example:

We have three clear regions of interest. The distance between regions 1 and 2 may be 200 pixels; between 1 and 3, 400 pixels; and between 2 and 3, 200 pixels.

Search other images for similar regions of interest, normalise the distance values, and see if you have potential matches. This technique could work well for rotated and scaled images. The more regions of interest you have, the higher the probability of a match, as each distance measurement must also match.
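A quick sketch of normalising the pairwise distances so that a zoomed copy still matches. The coordinates below are invented to reproduce the simplified 200/400-pixel example above:

```python
import math

def pairwise_distances(regions):
    """Euclidean distances between every pair of region centres (in pixels)."""
    dists = []
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            (x1, y1), (x2, y2) = regions[i], regions[j]
            dists.append(math.hypot(x2 - x1, y2 - y1))
    return sorted(dists)

def normalised(dists):
    """Divide by the largest distance so scale (zoom) drops out."""
    longest = max(dists)
    return [d / longest for d in dists]

# three regions with distances 200, 400, 200 pixels
original = [(0, 0), (200, 0), (400, 0)]
# the same scene photographed at 2x zoom
zoomed = [(0, 0), (400, 0), (800, 0)]
print(normalised(pairwise_distances(original)))  # [0.5, 0.5, 1.0]
print(normalised(pairwise_distances(zoomed)))    # [0.5, 0.5, 1.0]: a match
```

Sorting the distances before comparing also makes the footprint invariant to the order in which regions are found, which helps with rotated images.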

It is important to think about the context of your data set. If, for example, your data set is modern art, regions of interest would work quite well, as they were probably designed to be a fundamental part of the final image. If, however, you are dealing with images of construction sites, regions of interest may be interpreted by the illegal copier as ugly and may be liberally cropped or edited out. Keep in mind common features of your data set, and attempt to exploit that knowledge.

Morphing two images is the process of turning one image into the other through a set of steps:

Note, this is different to fading one image into another!

There are many software packages that can morph images. It's traditionally used as a transitional effect: two images don't usually morph into something halfway; one extreme morphs into the other as the final result.

Why could this be useful? Depending on the morphing algorithm you use, there may be a relationship between the similarity of the two images and some parameters of the morphing algorithm.

In a grossly oversimplified example, one algorithm might execute faster when there are fewer changes to be made. We then know there is a higher probability that the two images share properties.

This technique could work well for rotated, distorted, skewed, and zoomed images, all types of copies. Again, this is just an idea I've had; as far as I'm aware it's not based on any researched academia (I haven't looked hard, though), so it may be a lot of work for limited or no results.

Ow's answer in this question is excellent; I remember reading about these sorts of techniques when studying AI. It is quite effective at comparing corpus lexicons.

One interesting optimisation when comparing corpora is removing words considered too common, for example 'the', 'a', 'and', etc. These words dilute the result; we want to work out how different the two corpora are, so they can be removed before processing. Perhaps there are similar common signals in images that could be stripped before compression? It might be worth looking into.

Compression ratio is a very quick and reasonably effective way of determining how similar two sets of data are. Reading up on how compression works will give you a good idea of why this can be so effective. For a fast-to-release algorithm this would probably be a good starting point.
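A quick sketch of the idea using zlib. The normalisation formula below is loosely based on the normalised compression distance idea, one choice among several, not a standard API:

```python
import zlib

def compression_similarity(a: bytes, b: bytes) -> float:
    """If compressing a and b together is barely bigger than compressing
    the larger alone, the two share a lot of structure.
    Returns roughly [0, 1]; lower means more similar."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

row = bytes(range(256))
img1 = row * 64                                               # repetitive "image" data
img2 = row * 64                                               # an identical copy
img3 = bytes((i * 37 + 11) % 251 for i in range(len(img1)))   # unrelated data

print(compression_similarity(img1, img2) < compression_similarity(img1, img3))
# True: the identical pair compresses together far better
```

The intuition: a compressor that has already seen `img1` can describe an identical copy almost for free, whereas unrelated data costs nearly its full compressed size again.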


Transparency

Again, I am unsure how transparency data is stored for certain image types (GIF, PNG, etc.), but it should be extractable and would serve as an effective simplified cut-out to compare with the transparency of images in your data set.

Inverting Signals

An image is just a signal. If you play a noise from a speaker, and you play the opposite noise in another speaker in perfect sync at the exact same volume, they cancel each other out.

Invert one of the images and add it to the other. Scale it and loop positions repeatedly until you find a resulting image where enough of the pixels are white (or black? I'll call it a neutral canvas) to give you a positive or partial match.

However, consider two images that are equal, except one has a brightening effect applied:

Inverting one of them and adding it to the other will not result in the neutral canvas we are aiming for. However, when comparing the pixels of both original images, we can definitely see a clear relationship between the two.

I haven't studied colour for some years now, and am unsure if the colour spectrum is on a linear scale, but if you determined the average factor of colour difference between both pictures, you can use this value to normalise the data before processing with this technique.
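A sketch of the inversion idea, together with a simple brightness normalisation that uses the mean pixel offset as the "average factor of colour difference" mentioned above. Both functions are illustrative, not a researched method:

```python
import numpy as np

def residual_after_inversion(a, b):
    """'Invert one image and add it to the other' is equivalent to the mean
    absolute pixel difference. 0 means a perfectly neutral canvas."""
    return float(np.mean(np.abs(a.astype(int) - b.astype(int))))

def brightness_normalised(a, b):
    """Remove the average brightness offset before differencing, so a
    uniformly brightened copy still cancels out."""
    offset = float(a.mean() - b.mean())
    return float(np.mean(np.abs(a.astype(float) - offset - b.astype(float))))

rng = np.random.default_rng(0)
img = rng.integers(0, 200, size=(16, 16))
brightened = img + 40                                # uniform brighten effect

print(residual_after_inversion(img, brightened))     # 40.0: no neutral canvas
print(brightness_normalised(img, brightened))        # 0.0: relationship recovered
```

Subtracting the mean offset is only valid if brightness shifts are roughly linear and uniform; real tone curves (gamma) would need a more careful normalisation.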

Tree Data structures

At first these don't seem to fit for the problem, but I think they could work.

You could think about extracting certain properties of an image (for example, colour bins) and generating a Huffman tree or similar data structure. You might be able to compare two trees for similarity. This wouldn't work well for photographic data with a large colour spectrum, but for cartoons or other reduced-colour-set images it might.

This probably wouldn't work, but it's an idea. The trie data structure is great at storing lexicons, for example a dictionary; it's a prefix tree. Perhaps it's possible to build an image equivalent of a lexicon (again, I can only think of colours) to construct a trie. If you reduced, say, a 300x300 image into 5x5 squares, then decomposed each 5x5 square into a sequence of colours, you could construct a trie from the resulting data. If a 2x2 square contains:

We have a fairly unique trie code that extends 24 levels; increasing/decreasing the levels (i.e. reducing/increasing the size of our sub-square) may yield more accurate results.

Comparing tries should be reasonably easy and could possibly provide effective results.
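A toy sketch of the idea, using nested dicts as the trie and 2x2 squares of named colours as a stand-in for real colour codes:

```python
def build_trie(sequences):
    """Insert each colour sequence into a nested-dict trie (a prefix tree)."""
    root = {}
    for seq in sequences:
        node = root
        for colour in seq:
            node = node.setdefault(colour, {})
    return root

def squares_to_sequences(image, size=2):
    """Decompose a 2-D grid of colour codes into size x size squares,
    each flattened into a sequence (the 'words' of our image lexicon)."""
    seqs = []
    for r in range(0, len(image), size):
        for c in range(0, len(image[0]), size):
            seqs.append(tuple(image[r + dr][c + dc]
                              for dr in range(size) for dc in range(size)))
    return seqs

# a tiny 4x4 cartoon-style image drawn from a reduced colour set
image = [["red", "red", "blue", "blue"],
         ["red", "red", "blue", "blue"],
         ["sky", "sky", "red", "red"],
         ["sky", "sky", "red", "red"]]
trie = build_trie(squares_to_sequences(image))
print(sorted(trie))   # top level: the first colour of each square
```

Two images could then be compared by how many of their square-sequences land on shared trie paths, which fits the reduced-colour-set caveat above.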

I stumbled across an interesting paper brief about the classification of satellite imagery; it outlines:

Texture measures considered are: cooccurrence matrices, gray-level differences, texture-tone analysis, features derived from the Fourier spectrum, and Gabor filters. Some Fourier features and some Gabor filters were found to be good choices, in particular when a single frequency band was used for classification.

It may be worth investigating those measurements in more detail, although some of them may not be relevant to your data set.

Other things to consider

There are probably a lot of papers on this sort of thing, so reading some of them should help, although they can be very technical. It is an extremely difficult area in computing, with many fruitless hours of work spent by many people attempting similar things. Keeping it simple and building upon these ideas is the best way to go. Creating an algorithm with a better-than-random match rate should be a reasonably difficult challenge, and improving on that really does start to get quite hard.

Each method would probably need to be tested and tweaked thoroughly. If you have any information about the type of picture you will be checking, that would be useful. For example, many advertisements have text in them, so text recognition would be an easy and probably very reliable way of finding matches, especially when combined with other solutions. As mentioned earlier, attempt to exploit common properties of your data set.

Combining alternative measurements and techniques, each with a weighted vote (dependent on its effectiveness), is one way to create a system that generates more accurate results.

If employing multiple algorithms, as mentioned at the beginning of this answer, one may find all the positives but have a 20% false positive rate; it would be of interest to study the properties, strengths, and weaknesses of the other algorithms, as one algorithm may be effective at eliminating the false positives returned by another.

Be careful not to fall into the trap of the never-ending project. Good luck!
