As companies grow, so does the traffic in their hallways. People come in and out of work, food gets delivered, guests arrive. Everyone solves these problems in their own way: by tasking someone to take care of guests, deliveries and the like, by installing complicated security and authorization systems, and so on. But we’re all just human, so errors can and do occur, as it isn’t feasible for most businesses to have one person whose only job is to watch the door.
Watching an entrance is a boring and tedious job with a narrow set of required skills. In other words, it’s a perfect job for a machine! Or, it would be, but there’s a teeny-tiny problem - machines seem to lack a visual cortex in their brains. And brains in general. Nature is weird, I know.
But thanks to recent advances in the fields of Computer Vision and Machine Learning, and to increasingly accessible number-crunching hardware, machines these days can put up quite a fight* at solving visual tasks.
* Disclaimer: we do not make machines fight for fun in our office. We do, however, for science.
As demand for such a solution rises day by day, we built a system to do just that and tenderly named it Vantage. Let me introduce the two of you.
Goals and usage examples
The problem is simple: we have entrances where some people could appear, but shouldn’t. For example, the system can be installed on all back entrances of our building to watch for unusual activity - in this case, the detection of a person who can’t be attributed to any authorized personnel. It can also be installed at the front entrance, as a welcome system for guests and an alert system for the receptionist.
People can’t be expected to sit at the front entrance all day every day without a break or emergency, so an alert system could be used to notify them if there’s someone at the door that needs their attention.
And I know what you’re thinking about using this system on a back entrance: “Can’t we just install locks with authorization cards and be done with it?”. The answer is, sadly, no. The door might not lock properly, someone might join a large group of employees to enter the restricted area, an employee might bring someone into the office without OK-ing it with the appropriate departments, and so much more. It’s human behaviour. It’s messy.
Those are just some examples of where a system like ours could be used. The full extent of where the system is usable is simple - wherever you need a reliable pair of eyes watching for people who shouldn’t be where they are.
Overview of the system
Let’s get a bit technical now. The system is comprised of three major modules:
the cameras & networking hardware,
the image processing module,
a web interface & data hub,
and one minor module that handles the busywork.
Spoiler alert: This is a diagram of the system
The cameras & networking hardware
To give our machine the ability to see the area we want it to look after, we need cameras. The system currently supports IP cameras and USB cameras, but additional image sources can easily be added.
Most companies opt for IP cameras, as they have already established networking that the cameras can use. As for the camera type & quality, it depends on the environment we’re trying to protect. We need a good view of incoming faces to avoid unnecessary false alarms, so pixel count, lens details and so on have to be determined on site. Most environments work well with standard surveillance cameras, but why be content with “OK”, when a little effort can give us “great”?
Another benefit of IP cameras is that everything we know about network security applies here too. That’s useful because we don’t want to give just about anyone a direct video feed of the at-risk areas. In fact, when this system is set up properly, no person needs direct access to the video feed of those cameras.
The image processing module
The image processing module has two tasks - it needs to handle and maintain the incoming video streams, and to make some sense of them. It requires access to the video feeds, and dedicated hardware with a CUDA-compatible GPU. That often ends up as a dedicated machine connected to the same network as the IP cameras.
One thing that was really important to us is that this component is horizontally scalable. Our goal was to satisfy the needs of big and small companies alike. If you have just one or two entrances, one dedicated machine is more than enough. If you have dozens and dozens of areas you want to surveil, just add more machines and you're good to go!
When a processing machine boots up it asks the web hub for all necessary information required to function. To borrow the vocabulary of Kubernetes, these machines are cattle, not pets.
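As an illustration of that cattle-not-pets bootstrap, here is a hedged Python sketch of how a stateless worker could rebuild its entire runtime state from a single hub response. The payload shape and field names are invented for the example, not the actual Vantage API.

```python
# Hypothetical sketch: a processing unit holds no local state; everything
# it needs arrives in one bootstrap payload from the web hub.

def build_worker_state(hub_response):
    """Turn the hub's bootstrap payload into the worker's runtime config."""
    return {
        # Which video streams this unit is assigned to consume.
        "streams": [cam["rtsp_url"] for cam in hub_response["cameras"]],
        # Only anonymised embeddings and internal IDs ever reach the worker.
        "known_faces": {p["internal_id"]: p["embedding"]
                        for p in hub_response["personnel"]},
    }

sample = {
    "cameras": [{"rtsp_url": "rtsp://10.0.0.5/entrance-a"}],
    "personnel": [{"internal_id": 42, "embedding": [0.1] * 128}],
}
state = build_worker_state(sample)
```

Because the worker derives everything from the hub's answer, replacing a dead machine is just a matter of booting a fresh one.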
The theory behind how faces are identified is meaty, so we’ll talk about it a bit later in its dedicated subsection.
The web interface & hub
The job of this module is twofold - it serves as a contact point for the people that use the system, and as a coordinator for the image processing units.
The human interaction part is straightforward. It’s a web interface that gives a live view of current events that need attention, keeps track of past events, and tracks various statistics that might interest people. It also gives employees with the proper authorization the ability to add and remove people the system can recognize, and gives some of them the ability to coordinate the image processing units.
The other part is its role as a hub. All processing units report to this one hub. All processing units get their assignments from this hub. All processing units get their knowledge about employees, in an anonymised form, from this hub.
This module holds all the necessary information for the system to function. It’s hosted in the proven secure environment of Azure.
There’s one additional module that does the busywork for the system. As we offer the ability to generate and adjust the data by which the people are recognized from the security footage, we offload the actual processing to a scheduled task on Azure. All data that needs to be processed gets stored into Azure Blob Storage, and all results are saved in an Azure Queue for the web hub to integrate.
Face recognition - the theory behind it
What’s happening to images
Our task is to “just” take an image and see who’s in it. We won’t dive deep into the math here, but we’ll get a more technical understanding of what’s happening behind the pretty data tables and animated modals.
To tackle this problem we use state of the art techniques from the fields of Computer Vision and Machine Learning.
Let’s think for a moment about how we, as humans, recognize people. We don’t scan the whole face of a person, looking at every detail to determine if we’re talking to Derek or Brunhilda. We look for certain facial features by which we can come to a conclusion about our conversation partner's identity.
We want to simulate something like that in software. Looking at every millimeter, or in the machine’s case, at every single pixel to determine whether we know a person is infeasible - it’s prohibitively slow and costly.
Derek, is that you?
Instead, we use convolutional neural networks (more info), a type of neural network that is particularly efficient at image processing.
The interesting part is how the network is designed and trained. Without boring you with too many details, the gist of it is that the network architecture and training are designed in such a way that we get a condensed representation of a face - a numerical representation of the facial features the network learned are important for differentiating people. We transform a 2D picture into a 128D vector of facial features that an algorithm can then use to compare the similarity of faces.
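To make that comparison step concrete, here is a minimal Python sketch of measuring the similarity of two 128D vectors with Euclidean distance. The 0.6 threshold is a common convention in open-source face recognition models (e.g. dlib's), not necessarily the one our system uses, and the sample vectors are made up.

```python
import math

def face_distance(a, b):
    """Euclidean distance between two 128-D face embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_same_person(a, b, threshold=0.6):
    """Two embeddings closer than the threshold are treated as one person."""
    return face_distance(a, b) < threshold

# Made-up embeddings for illustration.
derek = [0.11] * 128
derek_again = [0.12] * 128   # same face, slightly different conditions
brunhilda = [0.80] * 128     # a different face, far away in the space
```

The whole identity question thus reduces to nearest-neighbour search in a 128-dimensional space.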
128D facial representations dimensionally reduced to 3D with Principal Component Analysis (PCA) for visualisation purposes
This image is a 3D visualization of all personnel in a company where the system is used. The dimensionality reduction is done with Principal Component Analysis (PCA, more info). As we can see, there's a minimum distance between each and every point - the information contained in the representation is what allows us to differentiate people. Fun fact: the defining trait of the two big clusters we see is perceived gender.
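For the curious, the projection behind such a visualization can be sketched in a few lines. This is a plain-NumPy PCA standing in for whatever library implementation was actually used, fed with random stand-in data rather than real embeddings.

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project rows of `embeddings` onto their top principal components."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                       # centre the data
    cov = np.cov(X, rowvar=False)                # 128x128 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest variance, in descending order.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X @ top

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 128))  # stand-in for real 128-D embeddings
projected = pca_project(points)      # 10 points in 3-D, ready to plot
```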
Each point in the above image is one person. We calculate those points from various image sources companies provide us. We made a sample dataset with people turning their heads in different lighting conditions to determine the minimum amount of image data needed for a faithful representation. Here's a visualisation of all data we got from one person.
That's a lot of data points!
From all those data points, we calculate one or more representations that can reliably match a face. That 128D vector is the distilled essence of the facial features required to distinguish people.
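As a hypothetical sketch of that distillation step, a simple element-wise mean of the sampled embeddings is shown below; the production system may well keep several representatives per person instead of one.

```python
def reference_embedding(samples):
    """Average a list of 128-D embeddings element-wise into one reference."""
    n = len(samples)
    return [sum(vals) / n for vals in zip(*samples)]

# Three made-up per-frame embeddings of the same person.
frames = [[0.1] * 128, [0.3] * 128, [0.2] * 128]
ref = reference_embedding(frames)
```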
Representation stability and failure points
A point of concern is how stable the representation is and where the system flat out fails.
The network had to be tuned until the representation was sufficiently stable. In other words, no matter how the face is positioned in relation to our viewpoint, it should ideally give us the same representation. That’s sadly not possible with the current state of the art, so we tweak our network to the point where representations of the same face in different conditions are nearly the same. As seen in the above image, images of the same person tend to cluster together in the face-representation space, which is exactly what we want.
One of the steps taken to ensure a stable representation is a fast and efficient preprocessing pipeline that tries to bring all images to a more-or-less uniform state of “processability”. Here’s an example image from before and after processing:
Taking away the VEIL OF MYSTERY!
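To give a flavour of what such preprocessing might involve, here is one illustrative normalization step: linearly stretching pixel intensities so that dark and bright frames look alike to the network. The real pipeline (cropping, alignment, and so on) is more involved; this particular transform is an assumption for the example.

```python
import numpy as np

def normalize_contrast(gray):
    """Linearly stretch a grayscale image to the full 0-255 range."""
    gray = gray.astype(float)
    lo, hi = gray.min(), gray.max()
    if hi == lo:                      # flat image, nothing to stretch
        return np.zeros_like(gray)
    return (gray - lo) / (hi - lo) * 255.0

# A dim 2x2 frame: after stretching, the darkest pixel maps to 0
# and the brightest to 255.
dim_frame = np.array([[40.0, 50.0], [60.0, 80.0]])
out = normalize_contrast(dim_frame)
```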
The big problem we have is when a face is mostly obstructed in an image, be it from some occlusion in the environment (which should be promptly removed) or because of the relative position of the face and the camera. In those cases the network can’t make out enough details to form a good representation, and we end up with something that we can’t properly interpret. More on that in the results section.
However, we can’t just accept that and move on! Doing that would cause a potentially large volume of false positives, which in turn creates more unnecessary work that we want to avoid if possible.
To combat this problem, we built a separate, small and efficient image filter. This filter executes much faster than a neural network, doesn’t require a GPU, and cuts out a large percentage of useless processing.
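As a hedged sketch of what such a cheap filter could look like: the variance of a Laplacian response is a classic, GPU-free sharpness measure, and a low score signals too little facial detail to embed reliably. The kernel and threshold here are illustrative, not the ones we actually ship.

```python
import numpy as np

def sharpness(gray):
    """Variance of a 4-neighbour Laplacian over a grayscale image."""
    g = gray.astype(float)
    lap = (np.roll(g, 1, 0) + np.roll(g, -1, 0)
           + np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4 * g)
    return lap.var()

def worth_processing(gray, threshold=10.0):
    """Only hand detailed-enough crops to the (expensive) neural network."""
    return sharpness(gray) >= threshold

flat = np.full((32, 32), 128.0)     # featureless patch: should be rejected
detailed = np.random.default_rng(1).normal(128, 40, (32, 32))
```

Running this in front of the network means heavily obstructed or blurred crops never reach the GPU at all.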
Privacy
Our top priority with this system was to respect the privacy of employees. Extensive steps were taken during the architecture design to ensure that no accidental data leakage can occur. All information is stored on a need-to-know basis. Let’s highlight some decisions:
The data from the cameras goes through the company's private network where access from the outside can be controlled and disabled.
The image processing unit doesn’t hold any images, nor any information about the people it can recognize. Not even their names. The data by which we recognize people cannot be reconstructed back into an image of a face (think of it like a hash of a face), and only an internal ID is attached to them.
The images of people are not saved. All data is stored and passed around as those face hashes, which can’t be reconstructed into facial images. An image can optionally be added to a recognizable person in the web interface, but that’s purely cosmetic.
Access to the web interface is given through Microsoft’s Active Directory.
Everything that can be done in the web interface is logged - every little change is written down in case it’s needed later.
We use Microsoft technologies and platforms like Azure, because of their proven track record of security.
If there are any privacy concerns about our system, do not hesitate to contact us.
Results
Finally, let’s see some formal results. Here’s a look at the dashboard:
An anonymised view of the Dashboard
Our system has an accuracy of 99.1% for face identification on Labeled Faces in the Wild (LFW). Here’s a sample distance matrix - it shows how far apart people’s facial representations are in our face-representation space. There are three images per person in this sample, and lower values indicate greater similarity.
Huh, the matrix really let go of itself
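For reference, a matrix like this can be computed from a list of embeddings in a few lines of NumPy; the sample vectors below are made up for illustration.

```python
import numpy as np

def distance_matrix(embeddings):
    """Pairwise Euclidean distances between rows of `embeddings`."""
    X = np.asarray(embeddings, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # broadcast all pairs at once
    return np.sqrt((diff ** 2).sum(axis=-1))

# Three made-up 128-D embeddings; the first two are near-identical.
emb = [[0.0] * 128, [0.1] * 128, [1.0] * 128]
D = distance_matrix(emb)   # 3x3, zero diagonal, symmetric
```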
As promised, let’s talk a bit about the problem of obstructed faces. We made a sample dataset of people turning their heads in different lighting conditions to test the stability of the representations. Here are the results:
Chart of performance relative to head rotation. (Also known as: the forest picture on Denis's computer)
This chart shows the distance between the stored facial representations of recognizable people and the face currently seen in the video (the Y axis) over the course of the video (the X axis, time). As we can see, as the face turns, the distance increases a bit but stays stable enough to weather it; in cases of almost total obstruction, the model can’t produce a stable representation. That’s why we implemented a filter to remove inconclusive facial images.
To summarize, our system watches critical areas that need an extra pair of eyes. It can be used in various ways, and one of the most popular is as an addition to an office’s security system. We make sure that no data can leak outside, that all actions are accountable, and that the system scales without problems to match the demands of your setup. Privacy and security are our top priorities for this system, which lets you monitor your areas of interest through an easy-to-use and powerful web dashboard, backed by the experience, reliability and expertise that Microsoft has baked into its technologies and platforms in the form of Azure.