Deploy a TensorFlow Model to a Mobile Device

If you want to deploy your TensorFlow model to a mobile or embedded device, a large model may take too long to download and may use too much RAM and CPU, all of which will make your app unresponsive, heat up the device, and drain its battery. To avoid this, you need to make a mobile-friendly, lightweight, and efficient model, without sacrificing too much of its accuracy.

Before deploying a TensorFlow model to a mobile device, I suggest you first learn how to deploy a machine learning model to a web application. This will help you understand things better before getting into deploying a TensorFlow model to a mobile or embedded device.

The TFLite library provides several tools to help you deploy your TensorFlow model to mobile and embedded devices, with three main objectives:

  • Reduce the model size to shorten download time and reduce RAM usage.
  • Reduce the number of computations needed for each prediction to minimize latency, battery usage, and heating.
  • Adapt the model to device-specific constraints.

Train and Deploy a TensorFlow Model to a Mobile

When you deploy a machine learning model to a mobile device, you need to reduce the model size. TFLite’s model converter can take a SavedModel and compress it to a much lighter format based on FlatBuffers, an efficient cross-platform serialization library initially created by Google. FlatBuffers can be loaded straight into RAM without any preprocessing: this reduces the loading time and memory footprint.

Once the model is loaded into a mobile or embedded device, the TFLite interpreter will execute it to make predictions. Here is how you can convert a SavedModel to a FlatBuffer and save it to a .tflite file.
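
Below is a minimal sketch of that conversion. It assumes a SavedModel already exists at my_mnist_model/0001 (exactly the directory we create in the next section); the output file name is illustrative:

import numpy as np
import tensorflow as tf

# Convert the SavedModel directory to a TFLite FlatBuffer
converter = tf.lite.TFLiteConverter.from_saved_model("my_mnist_model/0001")
tflite_model = converter.convert()

# Save the FlatBuffer to a .tflite file
with open("converted_model.tflite", "wb") as f:
    f.write(tflite_model)

# Sanity check: run the converted model with the TFLite interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]
interpreter.set_tensor(input_index, np.zeros([1, 28, 28, 1], np.float32))
interpreter.invoke()
print(interpreter.get_tensor(output_index))  # 10 class probabilities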

Save/Load a SavedModel

Let’s start by training a simple model on MNIST and saving it in TensorFlow’s SavedModel format (later we will serve it through TensorFlow Serving’s REST API):

import os
import numpy as np
import tensorflow as tf
from tensorflow import keras

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train_full = X_train_full[..., np.newaxis].astype(np.float32) / 255.
X_test = X_test[..., np.newaxis].astype(np.float32) / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_new = X_test[:3]
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
Train on 55000 samples, validate on 5000 samples
Epoch 1/10
55000/55000 [==============================] - 2s 40us/sample - loss: 0.7018 - accuracy: 0.8223 - val_loss: 0.3722 - val_accuracy: 0.9022
Epoch 2/10
55000/55000 [==============================] - 2s 36us/sample - loss: 0.3528 - accuracy: 0.9021 - val_loss: 0.3000 - val_accuracy: 0.9170
Epoch 3/10
55000/55000 [==============================] - 2s 36us/sample - loss: 0.3032 - accuracy: 0.9150 - val_loss: 0.2659 - val_accuracy: 0.9280
Epoch 4/10
55000/55000 [==============================] - 2s 37us/sample - loss: 0.2730 - accuracy: 0.9233 - val_loss: 0.2442 - val_accuracy: 0.9342
Epoch 5/10
55000/55000 [==============================] - 2s 37us/sample - loss: 0.2504 - accuracy: 0.9305 - val_loss: 0.2272 - val_accuracy: 0.9346
Epoch 6/10
55000/55000 [==============================] - 2s 37us/sample - loss: 0.2319 - accuracy: 0.9353 - val_loss: 0.2104 - val_accuracy: 0.9418
Epoch 7/10
55000/55000 [==============================] - 2s 37us/sample - loss: 0.2156 - accuracy: 0.9395 - val_loss: 0.1987 - val_accuracy: 0.9484
Epoch 8/10
55000/55000 [==============================] - 2s 36us/sample - loss: 0.2019 - accuracy: 0.9434 - val_loss: 0.1893 - val_accuracy: 0.9496
Epoch 9/10
55000/55000 [==============================] - 2s 41us/sample - loss: 0.1898 - accuracy: 0.9471 - val_loss: 0.1765 - val_accuracy: 0.9526
Epoch 10/10
55000/55000 [==============================] - 2s 39us/sample - loss: 0.1791 - accuracy: 0.9495 - val_loss: 0.1691 - val_accuracy: 0.9550
np.round(model.predict(X_new), 2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  ],
       [0.  , 0.  , 0.99, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.96, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.01, 0.  ]],
      dtype=float32)
model_version = "0001"
model_name = "my_mnist_model"
model_path = os.path.join(model_name, model_version)
model_path
'my_mnist_model/0001'
!rm -rf {model_name}
tf.saved_model.save(model, model_path)
for root, dirs, files in os.walk(model_name):
    indent = '    ' * root.count(os.sep)
    print('{}{}/'.format(indent, os.path.basename(root)))
    for filename in files:
        print('{}{}'.format(indent + '    ', filename))
my_mnist_model/
    0001/
        saved_model.pb
        variables/
            variables.data-00000-of-00001
            variables.index
        assets/
!saved_model_cli show --dir {model_path}
The given SavedModel contains the following tag-sets:
serve
!saved_model_cli show --dir {model_path} --tag_set serve
The given SavedModel MetaGraphDef contains SignatureDefs with the following keys:
SignatureDef key: "__saved_model_init_op"
SignatureDef key: "serving_default"
!saved_model_cli show --dir {model_path} --tag_set serve \
                      --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['flatten_2_input'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 28, 28, 1)
      name: serving_default_flatten_2_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['dense_5'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 10)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
!saved_model_cli show --dir {model_path} --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['flatten_2_input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 28, 28, 1)
        name: serving_default_flatten_2_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['dense_5'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 10)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

Let’s write the new instances to an .npy file so we can pass them easily to our model:

np.save("my_mnist_tests.npy", X_new)
input_name = model.input_names[0]
input_name
'flatten_2_input'

And now let’s use saved_model_cli to make predictions for the instances we just saved:

!saved_model_cli run --dir {model_path} --tag_set serve \
                     --signature_def serving_default    \
                     --inputs {input_name}=my_mnist_tests.npy
2019-06-10 10:56:43.396851: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING: Logging before flag parsing goes to stderr.
W0610 10:56:43.397369 140735810999168 deprecation.py:323] From /Users/ageron/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/tools/saved_model_cli.py:339: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
W0610 10:56:43.421489 140735810999168 deprecation.py:323] From /Users/ageron/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Result for output key dense_5:
[[1.17575204e-04 1.13160660e-07 5.96997386e-04 2.08104262e-03
  2.57820852e-06 6.44166794e-05 2.77263990e-08 9.96703804e-01
  3.96052455e-05 3.93810158e-04]
 [1.22226949e-03 2.92685600e-05 9.86054957e-01 9.63000767e-03
  8.81790996e-08 2.88744748e-04 1.58111588e-03 1.12290488e-09
  1.19344448e-03 1.09315742e-07]
 [6.40679718e-05 9.63618696e-01 9.04400647e-03 2.98595289e-03
  5.95759891e-04 3.74212675e-03 2.50709383e-03 1.14931818e-02
  5.52693009e-03 4.22279176e-04]]
np.round([[1.1739199e-04, 1.1239604e-07, 6.0210604e-04, 2.0804715e-03, 2.5779348e-06,
           6.4079795e-05, 2.7411186e-08, 9.9669880e-01, 3.9654213e-05, 3.9471846e-04],
          [1.2294615e-03, 2.9207937e-05, 9.8599273e-01, 9.6755642e-03, 8.8930705e-08,
           2.9156188e-04, 1.5831805e-03, 1.1311053e-09, 1.1980456e-03, 1.1113169e-07],
          [6.4066830e-05, 9.6359509e-01, 9.0598064e-03, 2.9872139e-03, 5.9552520e-04,
           3.7478798e-03, 2.5074568e-03, 1.1462728e-02, 5.5553433e-03, 4.2495009e-04]], 2)
array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  ],
       [0.  , 0.  , 0.99, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.96, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.01, 0.  ]])

TensorFlow Serving

Install Docker if you don’t have it already. Then run:

docker pull tensorflow/serving

export ML_PATH=$HOME/ml # or wherever this project is
docker run -it --rm -p 8500:8500 -p 8501:8501 \
   -v "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
   -e MODEL_NAME=my_mnist_model \
   tensorflow/serving

Once you are finished using it, press Ctrl-C to shut down the server.

Alternatively, if tensorflow_model_server is installed (e.g., if you are running this notebook in Colab), then the following 3 cells will start the server:

os.environ["MODEL_DIR"] = os.path.split(os.path.abspath(model_path))[0]
%%bash --bg
nohup tensorflow_model_server \
     --rest_api_port=8501 \
     --model_name=my_mnist_model \
     --model_base_path="${MODEL_DIR}" >server.log 2>&1
!tail server.log
import json

input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})
repr(input_data_json)[:1500] + "..."
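
To actually get predictions back, you can POST this JSON to TensorFlow Serving’s REST endpoint. A minimal sketch, assuming the server started above is listening on localhost:8501:

import requests

SERVER_URL = "http://localhost:8501/v1/models/my_mnist_model:predict"
response = requests.post(SERVER_URL, data=input_data_json)
response.raise_for_status()  # raise an exception in case of error
y_proba = np.array(response.json()["predictions"])
print(np.round(y_proba, 2))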

How Does Deploying a TensorFlow Model to Mobile Work?

While you deploy a TensorFlow model to a mobile device, the converter also optimizes the model, both to shrink it and to reduce its latency. It prunes all the operations that are not needed to make predictions (such as training operations), and it optimizes computations whenever possible; for example, 3*a + 4*a + 5*a will be converted to (3+4+5)*a, and then to 12*a. It also tries to fuse operations whenever possible.

For example, Batch Normalization layers end up folded into the previous layer’s addition and multiplication operations, whenever possible. To get a good idea of how much TFLite can optimize a model, download one of the pretrained TFLite models, unzip the archive, then open the excellent Netron graph visualization tool and upload the .pb file to view the original model. It’s a big, elaborate graph. Next, open the optimized .tflite model and marvel at its beauty.

Another Way to Reduce the Model Size

Another way you can reduce the model size while you deploy a TensorFlow model to a mobile or embedded device (other than simply using smaller neural network architectures) is by using smaller bit-widths: for example, if you use half-floats (16 bits) rather than regular floats (32 bits), the model size will shrink by a factor of 2, at the cost of a (generally small) accuracy drop. Moreover, training will be faster, and you will use roughly half the amount of GPU RAM. See the sketch below.
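
Here is a minimal sketch of float16 post-training quantization with the TFLite converter (the my_mnist_model/0001 path and output file name are the illustrative ones used earlier):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_mnist_model/0001")
# Apply the default optimizations, but store weights as 16-bit floats
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()

# The resulting file should be roughly half the size of the float32 version
with open("converted_model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)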

TFLite’s converter can go even further than that, by quantizing the model weights down to fixed-point, 8-bit integers! This leads to a fourfold size reduction compared to using 32-bit floats.

The simplest approach is called post-training quantization: it just quantizes the weights after training, using a fairly basic but efficient symmetrical quantization technique. It finds the maximum absolute weight value, m, then it maps the floating-point range –m to +m to the fixed-point (integer) range –127 to +127.
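
To make the mapping concrete, here is a small NumPy sketch of that symmetrical scheme (an illustration of the math with made-up weights, not TFLite’s actual implementation):

import numpy as np

weights = np.array([0.52, -1.38, 0.09, 2.71, -0.64], dtype=np.float32)

# The scale maps the range [-m, +m] onto the integer range [-127, +127]
m = np.abs(weights).max()
scale = m / 127.0

# Quantize: round each weight to its nearest 8-bit step
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize to see the (small) rounding error introduced
recovered = q_weights.astype(np.float32) * scale
print(q_weights)           # [ 24 -65   4 127 -30]
print(recovered - weights) # per-weight quantization error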

I hope you liked this article on how to deploy a TensorFlow model to a mobile or embedded device. Feel free to ask your questions in the comments section below. And don’t forget to subscribe to my daily newsletter below to get email notifications, if you like my work.
