FASTEP is an internal startup of Spider Group that helps businesses streamline their work with interactive instructions and video calls using augmented reality (AR). It runs as a mobile app, including a version for iOS.
An important part of FASTEP is the computer-vision module that classifies and identifies appliances. For the user, everything looks very simple: you point the camera at the front panel of the appliance, and the software finds a match in the database and shows the instructions for exactly what is in front of you. There is no need to type in the model number or search for it by appearance.
To make it so fast and convenient, we had to optimize the process.
The usual scheme is that the application uploads camera frames to the server, a neural network on the server performs the calculations, and the result is returned to the device. This approach has several problems:
- Users like to test the application on anything at all and send the server photos of everything except the things we need, so the backend is loaded with useless work.
- Server-side processing introduces delays.
- The solution becomes unnecessarily dependent on the quality of the Internet connection.
We realized we needed to optimize. The first option was to redesign the solution architecture. The second was to convert the PyTorch-based networks to native Core ML, since the native framework should be faster and more power-efficient than a third-party one.
Some of the calculations are done on a smartphone
We divided the model-recognition process into two parts: classifying the object and determining the specific model of equipment. We then moved class determination to the smartphone, so the “primary” analysis of the video stream is handled by the user's device.
The neural network on the smartphone determines the class of equipment (for example, “dishwasher”). If the image is assigned to a class, it is sent to the server; we can also determine that an object does not belong to any known class. Irrelevant frames are filtered out in advance, the load on the server is reduced, and the server now receives data that has already been classified.
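As a rough illustration of the filtering idea (a simplified sketch, not the production code; the class names and the threshold value are made up), the on-device classifier only lets a frame through to the server if it is confident about some known class:

import numpy as np

CLASS_NAMES = ['dishwasher', 'microwave', 'washing_machine']  # illustrative classes
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff

def class_for_frame(probabilities: np.ndarray):
    """Return the detected class, or None if the frame shows nothing we know."""
    best = int(np.argmax(probabilities))
    if probabilities[best] < CONFIDENCE_THRESHOLD:
        return None  # unknown object: do not send the frame to the server
    return CLASS_NAMES[best]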
Back-end architecture
Previously, a single neural network was responsible for determining the specific model of equipment. After a series of experiments, we found that better results were achieved by networks trained to determine the model within a single class: one handled washing machines, another microwaves, and so on. We created such networks and taught the backend to route requests based on the class data sent from the user's device.
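A minimal sketch of the routing idea (the dictionary and its contents are illustrative; nn.Identity() stands in for the real trained per-class networks):

import torch
from torch import nn

# One recognition network per appliance class
PER_CLASS_MODELS = {
    'dishwasher': nn.Identity(),
    'microwave': nn.Identity(),
    'washing_machine': nn.Identity(),
}

def identify_model(class_label: str, image_tensor: torch.Tensor) -> torch.Tensor:
    """Dispatch the request to the network trained for this appliance class."""
    network = PER_CLASS_MODELS.get(class_label)
    if network is None:
        raise ValueError(f'No recognition network for class "{class_label}"')
    return network(image_tensor)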
Another problem turned out to be more important than anticipated. Manufacturers produce equipment in series, where models differ only in some parameters and look identical or nearly identical on the outside. Unfortunately, we had collected the training data without taking this into account. Within such a series, the network would recognize one model and then mistake another model for it, and we would mark that answer as wrong. In effect, we were punishing the network for answers that were visually correct, which was unfair to it.
This problem cannot be solved without redesigning the household appliances themselves, but it can be worked around. We changed our approach to data collection and now group visually identical models together. Fortunately, the instructions for such models are also almost the same.
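Conceptually, the grouping boils down to mapping every model in a visually identical series to a single training label, something like this (the catalogue numbers and series name are invented):

# Map individual catalogue numbers to one label per visually identical series
SERIES_LABELS = {
    'WM-1001': 'WM-1000-series',
    'WM-1002': 'WM-1000-series',
    'WM-1003': 'WM-1000-series',
}

def training_label(model_number: str) -> str:
    # Models outside any known series keep their own label
    return SERIES_LABELS.get(model_number, model_number)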
Why convert PyTorch to Core ML
We launched classification on the mobile device, but the calculations ran on the CPU and GPU without the neural processor (Apple's Neural Engine), and that is not ideal. The versatility of PyTorch creates its own problems: its models run on almost any device, but they cannot boast deep optimization for specific hardware.
Devices overheat, the battery drains quickly, and video-stream processing stays at around six frames per second. At that speed there can be no talk of seamless operation.
The logical solution was to convert the neural network to Core ML.
Core ML optimizes neural-network computation on Apple devices by distributing work across the CPU, GPU, and Neural Engine to minimize memory and power consumption. Running the model strictly on the user's device also eliminates the need for network traffic at this stage.
The results exceeded our expectations. Thanks to the parallelization of tasks, Core ML was between 2 and 2.5 times faster.
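Incidentally, if you want to check how much the distribution of work contributes, newer versions of coremltools (5.0 and later) let you restrict the compute units when loading a converted model in Python. A hedged sketch (the model path is just an example, and prediction from Python requires macOS):

import coremltools as ct

# Load the same converted model with and without GPU/Neural Engine access
model_all = ct.models.MLModel('Models/Resnet50.mlmodel',
                              compute_units=ct.ComputeUnit.ALL)
model_cpu = ct.models.MLModel('Models/Resnet50.mlmodel',
                              compute_units=ct.ComputeUnit.CPU_ONLY)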
From creating a PyTorch model to Core ML
So, the input is a PyTorch model, and the output must be a Core ML model that can be embedded in the application.
To convert the neural networks, we used the Core ML Tools (coremltools) library. PyTorch, NumPy, and PIL were also used to set additional parameters and validate the converted model.
Stages:
- Creating and initializing the PyTorch model
- Changing the PyTorch model to interpret the results
- Converting the PyTorch model to TorchScript
- Converting a TorchScript model to an MLModel
- Quantization
- Checking the model
Creating and initializing the PyTorch model
For the conversion to work correctly, the network must be created without quantization and without applying the optimize_for_mobile optimization:
import torch
import torch.nn as nn
from torchvision import models

# ResNet-50 backbone with the final layer replaced by our 7-class head
resnet50 = models.resnet50(pretrained=False)
fc_in_features = resnet50.fc.in_features
resnet50.fc = nn.Linear(in_features=fc_in_features, out_features=7)

# Load the trained weights on the CPU
resnet50.load_state_dict(torch.load("./Models/resnet50.pt", map_location=torch.device('cpu')))
PyTorch model output transformations for interpreting results
The model output can be transformed either at the PyTorch level or at the application level using Accelerate. Doing it in PyTorch is faster. Adding a sigmoid to the model looks like this:
torch_model = torch.nn.Sequential(
    resnet50,
    nn.Sigmoid())
Converting the PyTorch model to TorchScript
Before converting, the model must be switched to evaluation (inference) mode:
torch_model.eval()
You can convert a model to TorchScript using the trace and script methods. Trace records the operations performed on a sample input tensor, which produces a faster model, but if the model contains conditional statements or loops with a variable range, the traced model may not work correctly. To avoid such problems, trace should be combined with script.
To convert with trace, you need a sample tensor of the required shape, either random or a specific one, for example obtained from a test image:
input_shape = (3, 224, 224)
example_input = torch.rand(1, *input_shape)
traced_model = torch.jit.trace(torch_model, example_input)
The script method requires no arguments other than the model itself:
scripted_model = torch.jit.script(torch_model)
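If part of the model really does contain data-dependent control flow, the two methods can be combined: script the submodule with the branches, then trace the full model around it. A minimal sketch of the pattern (the Postprocess module here is invented purely for illustration):

class Postprocess(nn.Module):
    def forward(self, x):
        # Data-dependent branch: trace alone would bake in only one path
        if x.sum() > 0:
            return torch.sigmoid(x)
        return torch.zeros_like(x)

# Script the branching submodule, then trace the whole pipeline around it
combined = torch.nn.Sequential(resnet50, torch.jit.script(Postprocess()))
traced_combined = torch.jit.trace(combined, example_input)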
Converting a TorchScript model to an MLModel
The convert method from Core ML tools has the following parameters:
- model — a PyTorch model converted to a TorchScript object
- source: str — the name of the source framework; it can be inferred automatically from the model parameter
inputs is an array of TensorType or ImageType input arguments. Either type can be used, but if the network expects an image as input and no special preprocessing is required, you should use ImageType, as this speeds up the conversion (it is also possible to convert the image to a tensor yourself and feed it to the network). ImageType has arguments that let the image be preprocessed using the mean and standard deviation values used when training the network:
import coremltools as ct

# Image input with the training-time normalization baked into the model
input = ct.ImageType(
    color_layout='RGB',
    scale=1.0/255.0/0.226,
    bias=(-0.485/0.229, -0.456/0.224, -0.406/0.225),
    shape=example_input.shape)
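To make sense of these numbers: Core ML applies y = scale * pixel + bias per channel, while the network was trained on inputs normalized as (pixel / 255 - mean) / std with the ImageNet statistics. Since ImageType takes only a single scalar scale, an averaged standard deviation (about 0.226) is used, and the per-channel bias compensates for the mean:

# Where the scale and bias values come from (arithmetic sketch):
# training preprocessing: y = (pixel / 255 - mean) / std
# Core ML ImageType:      y = scale * pixel + bias  (scale is a single scalar)
mean = (0.485, 0.456, 0.406)   # ImageNet channel means
std = (0.229, 0.224, 0.225)    # ImageNet channel stds, average ~0.226

scale = 1.0 / 255.0 / 0.226                      # shared scale for all channels
bias = tuple(-m / s for m, s in zip(mean, std))  # per-channel bias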
classifier_config — this parameter lets you name the output classes of the neural network. It is optional, but using it has a couple of advantages:
- Core ML treats the model as a classifier and automatically sorts the output from the highest value to the lowest;
- In Xcode, a “Preview” tab is created where you can test the network on test images.
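class_labels is simply a list of human-readable names, one per network output, which the conversion call below wraps in a ClassifierConfig. A sketch with invented class names (our real labels come from the equipment database):

# One label per output neuron of the network (7 in our case); values are illustrative
class_labels = [
    'dishwasher', 'microwave', 'oven', 'washing_machine',
    'refrigerator', 'coffee_machine', 'dryer',
]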
The full conversion call and saving of the model:
cml_model = ct.convert(
    traced_model,
    inputs=[input],
    classifier_config=ct.ClassifierConfig(class_labels))

cml_model.save('Models/Resnet50.mlmodel')
Quantization
Quantization reduces the size of the model by shrinking the weights from 32 bits to a smaller width, for example 8 bits, which is specified in the arguments. You can also choose one of three algorithms (linear, linear_symmetric, kmeans_lut).
Core ML tools has a special method for quantization:
from coremltools.models.neural_network import quantization_utils

# Quantize the weights to 8 bits with the linear algorithm
model_8bit = quantization_utils.quantize_weights(
    cml_model,
    nbits=8,
    quantization_mode="linear")
model_8bit.save('Models/Resnet50_8bit.mlmodel')
Checking the model
You can check the conversion quality directly in Python. To do this, load the model from the file and call the predict method, which takes a dictionary whose keys come from the inputs array and whose values are ndarray objects, if a tensor is expected, or PIL.Image objects, if an image is expected.
model = ct.models.MLModel('Models/Resnet50.mlmodel')
prediction = model.predict({'input.1': img})  # the key matches the model's input name ('input.1' here)
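To go a step further, you can compare the Core ML output with the original PyTorch model on the same image. A sketch (the image path is made up, and predict from Python only works on macOS):

import numpy as np
from PIL import Image

# Any 224x224 RGB photo will do; the path here is just an example
img = Image.open('test_images/dishwasher.jpg').resize((224, 224))

# Core ML prediction
cml_out = model.predict({'input.1': img})

# The same image through the PyTorch model, normalized as in training
imagenet_mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
imagenet_std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
with torch.no_grad():
    torch_out = torch_model(((x - imagenet_mean) / imagenet_std).unsqueeze(0))

print(cml_out)       # class probabilities as reported by Core ML
print(torch_out[0])  # sigmoid outputs from the PyTorch model for comparison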
What effect did this have?
The conversion to Core ML raised the processing speed to more than 30 frames per second, which means solving the problem in what is, for a person, almost real time. As a bonus, the smartphone runs cooler and the battery lasts longer, and all of this matters a great deal for how users perceive the application. We can now use this solution not only in our own platform but also for Spider Group customers. Ask us about our own neural network classifiers!
P. S.
What about Android? Next we will focus on optimizing for the Neural Networks API. That will be even more fun, given the wonderful variety of devices and operating system versions.