Detection of duplicate and non-face images in the eRecruitment applications using machine learning techniques

The objective of this work is to develop methodologies to detect and report images that do not comply with the recruitment requirements of the Indian Space Research Organisation (ISRO). The recruitment software hosted at the U. R. Rao Satellite Centre (URSC) handles the recruitment activities of ISRO. A large number of online applications are received for each advertised post. In many cases, candidates upload either wrong or non-compliant images of the required documents. By non-compliant images, we mean images that do not contain a face or in which the face is not sufficiently clear. In this work, we address two specific problems: 1) determining whether an image uploaded to the recruitment portal contains a human face, which is addressed using a face detection algorithm; and 2) checking whether the images uploaded with two or more applications are the same, which is achieved by using machine learning (ML) algorithms to generate a similarity score between two images and then identifying the duplicates. Screening valid applications is challenging because manual verification of such images is very time consuming and requires considerable human effort. Hence, we propose novel ML techniques to detect duplicate and non-face images in the applications received by the recruitment portal.

INTRODUCTION
The Computers and Information Group (CIG) of the U. R. Rao Satellite Centre (URSC) is involved in the development, customization, and management of the software used for the recruitment activities of the Indian Space Research Organisation (ISRO) [1], [2]. Recruitment is the process of sourcing, screening, and selecting candidates for a vacancy within an organization. Several advertisements are released each year, and a few lakh applications are received per year. Screening and processing such a huge volume of applications manually would not only require large human effort but might also lead to inconsistent results. Automation is the only solution to reduce the burden of such repetitive tasks. Based on the expertise gained over the years, the checks that can be generalized as a set of rules are already automated. In addition to these rule-based automations, in this work we explore certain image processing techniques using machine learning (ML) algorithms to further automate recruitment activities.
Journal homepage: http://ijra.iaescore.com
In this work, we address two specific problems: 1) Determining whether an image uploaded to the recruitment portal contains a human face. We propose to solve this problem using a face detection algorithm based on Haar cascade classifiers. 2) Checking whether the images uploaded with two or more applications are the same. We propose to solve this problem using an image similarity detection algorithm based on certain ML techniques. Face detection algorithms rely on facial features such as the spacing of the eyes, the bridge of the nose, and the contours of the lips, ears, and chin. Face detection has numerous applications in security (authentication and authorization), defense, marketing, healthcare, hospitality, lip reading, and auto-focus.
The rest of the paper is organized as follows. Section 2 provides a brief literature survey. Section 3 describes the development and evaluation of the face detection system for screening e-recruitment applications. Section 4 discusses the development of the similarity detection system. Section 5 presents the summary and future work.

RELATED WORK
Research in face detection and recognition has been actively pursued over the last several decades, and a significant number of works have been reported in this area. Only a few notable works among them are described here. Some of the literature surveys on face detection and recognition are as follows. In 2003, Lewis et al. [3] presented a detailed review of the psychological evidence about the process of face detection in the brain. It is shown that with the use of face recognition systems, it is possible to identify or check the identity of individuals in a matter of a few seconds.
In 2009, Jafri et al. [4] presented an overview of various face recognition techniques. The benefits and limitations of different face recognition algorithms are examined, and the applications and difficulties involved in each of these techniques are described.
In 2010, Degtyarev et al. [5] proposed a set of parameters for evaluating the quality of face detection algorithms, performing objective comparisons, and determining the current state-of-the-art face detection algorithm. They compared seven face detection algorithms and reported the results of the comparison. In 2010, Zhang et al. [6] surveyed the advances in face detection over the previous decade, in the hope of seeing better algorithms developed in the future to solve the problem of face detection. They surveyed various techniques according to the way features are extracted and the type of learning algorithm employed.
In 2013, Roomi et al. [7] presented a survey of various face recognition works reported in the past decade, mainly focusing on the ones that were not covered in other similar surveys. Further, they categorized them into meaningful approaches such as appearance based, feature based, and soft computing based. A comparative study of the merits and demerits of these approaches is also presented.
In 2015, Farfade et al. [8] proposed a deep dense face detector for multi-view face detection. The method does not require pose or landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks with minimal complexity. In 2018, Hua et al. [9] presented a joint optimal solution to the face representation and matching problems in the face verification task using a unified framework. They presented a second-order face representation method for face pairs and a unified face verification framework in which the feature extractors and the subsequent binary classification model can be selected flexibly.
In 2020, Kortli et al. [10] presented a survey of some of the well-known theories and algorithms used in face recognition. A detailed comparison of these techniques in terms of robustness, accuracy, complexity, and discrimination is reported, and an overview of the most commonly used databases for both supervised and unsupervised learning is given. Frischholz has consolidated useful information on face detection and recognition problems in [11], which provides links to software, datasets, algorithms, selected publications, and other resources related to face detection and recognition.
There are a few studies exploring the use of artificial intelligence (AI) techniques for recruitment applications such as screening candidates, establishing relationships, taking unbiased decisions, scheduling, and analysing applicants' social media communications. Some of the works exploring AI techniques for recruitment activities are as follows. In 2018, Upadhyay et al. [12] reviewed the applications of AI tools in the hiring process and their practical implications. They highlighted the strategic shift in the recruitment industry caused by the adoption of AI in the recruitment process. It is found that the application of AI for managing the recruitment process leads to efficiency as well as qualitative gains for both clients and candidates.
ISSN: 2089-4856
In 2019, Albert [13] investigated the use of AI tools such as chatbots, screening software, and task automation in the recruitment and selection of candidates by companies. Along similar lines, Weinert et al. [14] examined the use of AI techniques for the selection and assessment of human resources by companies, and the various challenges involved in it. In 2019, Nawaz [15] explored the application of face detection to the recruitment process. He demonstrated the use of principal component analysis techniques to detect duplicate faces, thereby enabling the detection of duplicate applications.
In 2019, Nawaz [16] examined the effect of AI techniques on the recruitment effectiveness of software companies. The study uses a dataset collected through a structured questionnaire from 100 human resource professionals. In 2019, Esch et al. [17] studied how potential candidates regard the use of AI in the recruitment process and whether it influences their likelihood of applying for a job. They show that the novelty factor of using AI in the recruitment process mediates and further positively influences job application likelihood.

Figure 1 shows the block diagram of the complete face detection system implemented by us. A photo uploaded by an applicant is fetched and fed as input to the face detection algorithm. If a face is detected, the application is accepted. If no face is detected, the photo is added to the list of images that have to be manually inspected. This list is made available on the screening portal, with a provision for the screening personnel to either accept or reject such applications. The screening personnel manually inspect each listed photo and accept the application if the photo is proper, or else reject it.
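The screening flow of Figure 1 can be summarised in a few lines of Python. This is only a minimal sketch: the function name `triage_photos` and the `has_face` callable are hypothetical placeholders for the portal code and the face detector, neither of which is specified in this paper.

```python
from typing import Callable, Iterable, List, Tuple


def triage_photos(
    photo_ids: Iterable[str],
    has_face: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split photos into auto-accepted ones and ones needing manual inspection.

    `has_face` stands in for the face detection algorithm (an assumption;
    any detector returning True/False for a photo can be plugged in).
    """
    accepted: List[str] = []
    manual_review: List[str] = []
    for photo_id in photo_ids:
        if has_face(photo_id):
            accepted.append(photo_id)       # face found: application accepted
        else:
            manual_review.append(photo_id)  # no face found: a human decides
    return accepted, manual_review


# Usage with a stub detector that "fails" on one photo.
accepted, manual_review = triage_photos(
    ["app-001", "app-002", "app-003"],
    has_face=lambda photo_id: photo_id != "app-002",
)
```

Note that the sketch never rejects an application automatically; photos without a detected face are only queued for the screening personnel, as described above.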

Face detection algorithm
Face detection is an image processing technique for identifying human faces in images and videos. In psychology, the term refers to the process by which humans locate and attend to faces in a visual scene [3]. Face detection is a specific case of object detection in which the face is the object to be detected. The task of object detection is to find the locations and sizes of all objects in an image that belong to a given class. In this work, we perform face detection using Haar cascade classifiers. Face detection using Haar feature-based cascade classifiers is a machine learning based approach in which a cascade function is trained using a large number of positive and negative images [18]. The trained cascade function is then used to detect similar objects in other images. Haar features are like convolutional kernels, where each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle [19]. Figure 2 shows the various types of Haar features for a face. The edge features focus on the property that the region of the eyes is often darker than the region of the nose and cheeks, while the line features focus on the property that the eyes are darker than the bridge of the nose. These features respond only when the window is applied on the face region; windows applied on the cheeks or any other part of the image are irrelevant. During training, every feature is applied to all the training images. For each feature, the best threshold that separates the images into positive (face) and negative (non-face) classes is found. The features with the minimum error rate are selected, as they best separate face and non-face images. We have used the pre-trained Haar cascade classifier model provided by the OpenCV library [19].
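To make the definition of a Haar feature concrete, the following sketch computes a two-rectangle edge feature on a small grayscale image given as a list of rows. It is an illustration only and not the OpenCV implementation used in this work: the function names are ours, and the use of an integral image (summed-area table) to obtain rectangle sums is an assumption taken from the standard Viola-Jones formulation.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def edge_feature(ii, top, left, height, width):
    """Two-rectangle edge feature: upper half 'black', lower half 'white'.

    Following the definition in the text, the white sum is subtracted
    from the black sum, so a dark band above a bright band gives a
    strongly negative value.
    """
    half = height // 2
    black = rect_sum(ii, top, left, half, width)
    white = rect_sum(ii, top + half, left, half, width)
    return black - white


# A dark band (eyes) above a bright band (cheeks), and a flat region.
eyes_over_cheeks = [[10] * 4, [10] * 4, [200] * 4, [200] * 4]
flat = [[90] * 4 for _ in range(4)]

contrast_value = edge_feature(integral_image(eyes_over_cheeks), 0, 0, 4, 4)
flat_value = edge_feature(integral_image(flat), 0, 0, 4, 4)
```

The large magnitude of `contrast_value` compared with `flat_value` is what lets a threshold on a single feature act as a weak face/non-face classifier.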

Evaluation of face detection algorithm
We ran the face detection algorithm on some of our selected recruitment advertisements. Table 1 shows the evaluation statistics of the face detection system. From column 4, it can be seen that some valid photos are also detected as invalid. Hence, we cannot blindly use the output of the face detection algorithm as it is. The list of suspected invalid photos has to be inspected manually to determine the actual invalid photos. This makes the face detection system semi-automatic. Although this system cannot replace human intervention completely, it drastically reduces the human effort involved in screening recruitment applications. The sixth column in Table 1 shows the percentage reduction in the manual effort for screening applications. The average reduction is 98.55%, which means that only 1.45% of the manual effort is required when screening is performed using the face detection system. For example, in the case of serial no. 1 (second row), the use of the face detection system has reduced the number of applications to be screened from 4145 to 79. Likewise, the reduction is from 30008 to 304 for serial no. 6 (seventh row). The last column provides the face detection accuracy. The average face detection accuracy is found to be 76.41%, which is reasonably good. This approach not only reduces the costs involved in recruitment activities but also promises more consistent results and requires much less time than manual screening. No application with a valid photo is missed, because any rejection is always made by a human. Figure 3 shows a few suspected invalid photos detected by the face detection algorithm. It is surprising to see the many different kinds of photos uploaded by candidates along with their applications.
Invalid photos range from animations, signatures, marks cards, and snapshots of mobile phones to WhatsApp images, random images taken from the internet, and random photos clicked using mobile phones.
Due to data confidentiality issues, we have shown only generic images in Figure 3. However, several other kinds of images, such as certificates, grade cards, and photographs (which are of a restricted nature and cannot be published), were also classified as invalid by the algorithm. A few such examples include 1) photos in which the face is covered by hair such that only one side of the face is visible, 2) photos taken with the head covered by a cap or a turban such that part of the forehead is not visible, 3) photos taken such that part of the forehead, cheeks, and chin are not visible, and 4) photos with goggles covering the eyes. Hence, in some cases the face detection algorithm failed to detect a human face, for the following reasons: 1) The photo is taken while wearing spectacles. In this case, the algorithm fails to detect facial features such as the spacing of the eyes, and the contrasting line features at the eyebrows and eyelids are lost. 2) A cap or turban covers part of the forehead and eyebrows, so that the complete face is not visible. In this case also, the algorithm fails to extract all the facial features. 3) The face is rotated such that only one side is visible and the other side is partially or completely hidden, so the algorithm cannot capture all the required features. 4) The resolution of the image is so low that the detection window exceeds the photo size.
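The per-advertisement reduction in manual effort quoted in this section can be reproduced with simple arithmetic. The sketch below assumes that the reduction is computed as the share of applications that no longer need manual inspection; the exact formula behind the sixth column of Table 1 is not stated, and the function name is ours. The inputs are the two counts quoted in the text for serial nos. 1 and 6.

```python
def effort_reduction_percent(total_applications, flagged_for_review):
    """Percentage of applications that no longer need manual inspection."""
    return 100.0 * (1.0 - flagged_for_review / total_applications)


serial_1 = effort_reduction_percent(4145, 79)     # about 98.09
serial_6 = effort_reduction_percent(30008, 304)   # about 98.99
```

Both values lie close to the reported average of 98.55%, which is consistent with the assumed formula.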

SIMILARITY DETECTION SYSTEM FOR PHOTOS
Two important techniques for comparing images are 1) comparison of histograms and 2) template matching. A histogram is a graphical representation of the distribution of pixel values in a digital image. The histogram intersection algorithm was proposed by Swain and Ballard in [21]. Histogram intersection does not require accurate separation of the object from its background, and it is robust to occluding objects in the foreground. Histograms are translation invariant, and they change only slowly under different view angles, scales, and in the presence of occlusions [22]. Histogram comparison is one of the simplest and fastest methods for finding similarities between images. The underlying assumption is that a particular type of picture will have a particular color in abundance. For example, a picture of a forest will have a lot of green, and a picture of a banana will have a lot of yellow. So, if two pictures of forests are compared, their histograms will show some similarity, as both contain a lot of green. Further details on the comparison of histograms can be found in [21], [22].
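As an illustration of histogram comparison, the sketch below computes the histogram intersection of two small grayscale images given as flat lists of pixel values in the range 0 to 255. The normalisation by the total of the second (model) histogram follows Swain and Ballard; the bin count, the grayscale simplification, and the function names are assumptions made for brevity and are not taken from our implementation.

```python
def histogram(pixels, bins=8, max_value=256):
    """Count how many pixel values fall into each of `bins` equal ranges."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return counts


def histogram_intersection(image_hist, model_hist):
    """Overlap of two histograms, normalised by the model histogram total.

    Returns 1.0 when every model bin is fully covered and 0.0 when the
    two histograms share no populated bin.
    """
    overlap = sum(min(a, b) for a, b in zip(image_hist, model_hist))
    return overlap / sum(model_hist)


dark = [10] * 16
bright = [250] * 16
mixed = [10] * 8 + [250] * 8
shuffled_mixed = mixed[::-1]  # same pixels, different positions

same = histogram_intersection(histogram(dark), histogram(dark))
disjoint = histogram_intersection(histogram(dark), histogram(bright))
partial = histogram_intersection(histogram(mixed), histogram(dark))
moved = histogram_intersection(histogram(shuffled_mixed), histogram(mixed))
```

The last comparison shows why histograms are translation invariant: rearranging the pixels leaves the histogram unchanged. It also shows the weakness of the method, since two different images with the same value distribution are indistinguishable.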
Template matching is a technique in digital image processing for finding small parts of an image that match a template image. A basic method of template matching uses an image template (mask) tailored to a specific feature of the search image that we want to detect. The cross-correlation output is highest at places where the image structure matches the mask structure, where large image values get multiplied by large mask values. As all possible positions of the template with respect to the search image are considered, the position with the highest score is the best position [23], [24]. Template matching is known to work well for identical images of the same size and orientation, which mostly fits our case. Further details on template matching can be found in [23], [24].
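The sliding-window search described above can be sketched as follows for small grayscale images stored as lists of rows. The score used here is the normalised cross-correlation, which has the same form as OpenCV's `TM_CCORR_NORMED` mode; which matching mode was used in our implementation is not stated in this paper, so this choice, like the function names, is an assumption for illustration.

```python
import math


def normalised_cross_correlation(patch, template):
    """Cross-correlation of two flat pixel lists, scaled to at most 1.0."""
    numerator = sum(p * t for p, t in zip(patch, template))
    denominator = math.sqrt(
        sum(p * p for p in patch) * sum(t * t for t in template)
    )
    return numerator / denominator if denominator else 0.0


def match_template(image, template):
    """Slide the template over the image; return (best score, (row, col))."""
    image_h, image_w = len(image), len(image[0])
    templ_h, templ_w = len(template), len(template[0])
    flat_template = [value for row in template for value in row]
    best_score, best_position = -1.0, None
    for top in range(image_h - templ_h + 1):
        for left in range(image_w - templ_w + 1):
            patch = [
                image[top + dy][left + dx]
                for dy in range(templ_h)
                for dx in range(templ_w)
            ]
            score = normalised_cross_correlation(patch, flat_template)
            if score > best_score:
                best_score, best_position = score, (top, left)
    return best_score, best_position


search_image = [
    [1, 1, 1, 1],
    [1, 1, 9, 2],
    [1, 1, 3, 8],
    [1, 1, 1, 1],
]
template = [[9, 2], [3, 8]]

score, position = match_template(search_image, template)
full_score, full_position = match_template(search_image, search_image)
```

When the two images have the same size, as in the second call, only one position is evaluated and the result reduces to a single similarity score.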
In this study, we have computed the similarity score using a combination of both approaches: comparison of histograms and template matching. Python's OpenCV library is used for the implementation. Since neither of these methods alone produced satisfactory results, we have combined them using a weighted combination. We assigned a lower weight of 0.1 to the histogram comparison method, as it was found to be less accurate than the template matching method, and a higher weight of 0.9 to the template matching method. Two images are compared, and a similarity score indicating how similar they are is returned. For example, a similarity score of 100% indicates that an image is being compared with an identical image, and a similarity score of 0% indicates that the two images are totally different.
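The weighted combination can be written as a single expression. The sketch below assumes that both component scores are already scaled to the range 0 to 1 before weighting; how the individual scores are scaled in our implementation is not specified here, and the function and constant names are ours.

```python
HISTOGRAM_WEIGHT = 0.1   # lower weight: histogram comparison is less accurate
TEMPLATE_WEIGHT = 0.9    # higher weight: template matching


def similarity_percent(histogram_score, template_score):
    """Combine two scores in [0, 1] into a similarity percentage."""
    combined = (
        HISTOGRAM_WEIGHT * histogram_score
        + TEMPLATE_WEIGHT * template_score
    )
    return 100.0 * combined


identical = similarity_percent(1.0, 1.0)
same_layout_different_colors = similarity_percent(0.5, 1.0)
unrelated = similarity_percent(0.0, 0.0)
```

Because the weights sum to 1, the combined score reaches 100% only when both methods report a perfect match.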
Each image in an advertisement is compared with all other images, which results in a time complexity of O(n^2) for n images. After each comparison, the algorithm returns a similarity score ranging from 0% to 100%. In this study, we have considered only the cases with a similarity score of 100%; pairs of images that return a score of 100% are treated as duplicates. This algorithm is computationally very intensive and requires huge computing resources. One comparison of a pair of images took around one minute on a desktop PC (8 GB RAM, Intel i7-6700 CPU @ 3.40 GHz with 8 logical cores, no graphics card).
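The all-pairs comparison and its quadratic cost can be sketched as follows. The `similarity` argument is a hypothetical stand-in for the combined similarity score, and the images are represented by arbitrary Python values; only the pairing logic and the pair count are illustrated.

```python
from itertools import combinations


def find_duplicate_pairs(images, similarity, threshold=100.0):
    """Compare every unordered pair once and keep pairs at the threshold.

    `images` maps an application id to its image data; `similarity`
    returns a percentage for two images (a placeholder for the combined
    score described in the text).
    """
    return [
        (id_a, id_b)
        for (id_a, img_a), (id_b, img_b) in combinations(images.items(), 2)
        if similarity(img_a, img_b) >= threshold
    ]


def pair_count(n):
    """Number of unordered pairs among n images: n(n-1)/2."""
    return n * (n - 1) // 2


# Stub similarity: 100% for identical data, 0% otherwise.
uploads = {"app-1": (1, 2, 3), "app-2": (9, 9, 9), "app-3": (1, 2, 3)}
duplicates = find_duplicate_pairs(
    uploads, lambda a, b: 100.0 if a == b else 0.0
)
pairs_for_4145_images = pair_count(4145)
```

For an advertisement with 4145 applications, this amounts to 8,588,440 pairwise comparisons, which illustrates why the approach is so demanding on computing resources.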
Although the proposed technique works reasonably well and has produced promising results, data confidentiality issues prevent us from publishing any of the images detected by the similarity detection system. We have found a number of instances where the same candidate has applied multiple times to the same advertised post using the same photo. In one such case, a candidate had applied 5 times to the same post using the same photo.

SUMMARY AND FUTURE WORK
In this work, we have explored two ML techniques, face detection and similarity detection, for automating the screening of recruitment applications. It is found that the use of the face detection system has drastically reduced (by 98.5%) the manual effort required for screening recruitment applications. A detailed analysis of when and why face detection fails has been carried out. The similarity detection system was developed to compare two images and determine their similarity score. Although the similarity detection system works reasonably well, it is very resource hungry and requires large computing infrastructure.
In future, state-of-the-art deep learning algorithms such as convolutional neural networks (CNN) for face detection [25], [26] can be explored to detect and eliminate non-face images. Instead of using the pre-trained models provided by OpenCV, face detection models can be trained on custom datasets of face and non-face images and then used for performing face detection. One can also explore the possibility of developing hybrid techniques that combine the outputs of multiple face detection algorithms. Feature mapping techniques can be explored for detecting similarity between face images. Sparse coding based image similarity detection techniques [27] can also be explored for building similarity detection systems.