PyTorch build on Jetson Nano

This article explains how to build PyTorch on Jetson Nano. At the time of writing, Jetson Linux only supports Python 3.6, and there is no official Jetson PyTorch package for other Python versions. If you need PyTorch with Python 3.7 or higher, you have to build PyTorch yourself.

This article uses Python 3.8 as the example, but almost the same procedure works for other versions.

install packages

Install the necessary packages for python3.8 and PyTorch.

sudo apt-get update
sudo apt-get install -y \
      python3.8 python3.8-dev \
      ninja-build git cmake clang \
      libopenmpi-dev libomp-dev ccache \
      libopenblas-dev libblas-dev libeigen3-dev \
      python3-pip libjpeg-dev \
      gnupg2 curl

install cmake

The default cmake is too old to build PyTorch, so install a newer cmake from the Kitware APT repository.

# cmake
sudo apt-get install -y software-properties-common lsb-release
sudo wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main"
sudo apt-get update
sudo apt-get install -y cmake
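
Before moving on, it may be worth confirming that the newer cmake is the one found on PATH. The following is a minimal sketch, not part of the original procedure. The helper `version_ge` and the threshold `MIN_CMAKE` are hypothetical names introduced here, and the default of 3.13 is only an illustrative placeholder, not a requirement stated by PyTorch or by this article. The comparison handles plain dotted numeric versions only.

```bash
# Return 0 when dotted numeric version $1 is greater than or equal to $2.
version_ge() {
  local a b x y
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    # Drop the leading field, or empty the string when no field is left.
    if [ "$a" = "$x" ]; then a=""; else a=${a#*.}; fi
    if [ "$b" = "$y" ]; then b=""; else b=${b#*.}; fi
    x=${x:-0}
    y=${y:-0}
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
  done
  return 0
}

# MIN_CMAKE is an illustrative placeholder, not a value taken from PyTorch.
MIN_CMAKE="${MIN_CMAKE:-3.13}"
if command -v cmake >/dev/null 2>&1; then
  # The first line of the output looks like "cmake version X.Y.Z".
  found=$(cmake --version | awk 'NR == 1 {print $3}')
  if version_ge "$found" "$MIN_CMAKE"; then
    echo "cmake $found is at least $MIN_CMAKE"
  else
    echo "cmake $found is older than $MIN_CMAKE"
  fi
fi
```

Set `MIN_CMAKE` in the environment to whatever minimum your PyTorch version actually requires.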

install CUDA and cuDNN

Install CUDA and cuDNN.

# cuda and cudnn
sudo apt-get install -y nvidia-cuda nvidia-cudnn8
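
Because the build runs for many hours, a quick pre-flight check of the installed tools can save a failed run. The sketch below is an optional addition. `missing_tools` is a hypothetical helper, and the command names are assumed from the packages installed above. The nvcc path assumes CUDA is installed under /usr/local/cuda, the location this article puts on PATH for the build.

```bash
# Print every command from the argument list that is not on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Command names assumed from the apt packages installed earlier.
if ! missing_tools python3.8 cmake ninja clang clang++ git; then
  echo "some build tools are missing (listed above)" >&2
fi

# nvcc is checked by path because /usr/local/cuda/bin may not be on PATH yet.
if [ ! -x /usr/local/cuda/bin/nvcc ]; then
  echo "nvcc not found under /usr/local/cuda/bin" >&2
fi
```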

remove package

Before building PyTorch and TorchVision, remove the numpy package installed by apt because it causes conflicts. numpy for python3.8 is installed with pip in the next step.

sudo apt remove -y python3-numpy

build and install PyTorch with python3.8

Install the python3.8 packages, clone PyTorch and apply the patch. For more details about the patch, see this page.

python3.8 -m pip install -U pip
python3.8 -m pip install -U setuptools
python3.8 -m pip install -U wheel mock pillow
python3.8 -m pip install scikit-build
python3.8 -m pip install cython Pillow numpy

## download PyTorch v1.11.0 with all its libraries
git clone -b v1.11.0 --depth 1 --recursive --recurse-submodules --shallow-submodules https://github.com/pytorch/pytorch.git
(
cd pytorch
python3.8 -m pip install -r requirements.txt
wget https://raw.githubusercontent.com/otamajakusi/build_jetson_nano_libraries/main/pytorch/pytorch-1.11-jetson.patch
patch -p1 < pytorch-1.11-jetson.patch
)

The following is pytorch-1.11-jetson.patch. It can be downloaded from the URL used in the wget command above.

diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
index 0327868..e484fba 100644
--- a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
+++ b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
@@ -26,6 +26,7 @@ inline namespace CPU_CAPABILITY {
 //    https://bugs.llvm.org/show_bug.cgi?id=45824
 // Most likely we will do aarch32 support with inline asm.
 #if defined(__aarch64__)
+#if defined(__clang__) || (__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))
 
 #ifdef __BIG_ENDIAN__
 #error "Big endian is not supported."
@@ -715,5 +716,6 @@ Vectorized<float> inline fmadd(const Vectorized<float>& a, const Vectorized<floa
 }
 
 #endif /* defined(aarch64) */
+#endif /* defined(__clang__) */
 
 }}}
diff --git a/aten/src/ATen/cuda/CUDAContext.cpp b/aten/src/ATen/cuda/CUDAContext.cpp
index 1751128..a090e70 100644
--- a/aten/src/ATen/cuda/CUDAContext.cpp
+++ b/aten/src/ATen/cuda/CUDAContext.cpp
@@ -24,6 +24,7 @@ void initCUDAContextVectors() {
 void initDeviceProperty(DeviceIndex device_index) {
   cudaDeviceProp device_prop;
   AT_CUDA_CHECK(cudaGetDeviceProperties(&device_prop, device_index));
+  device_prop.maxThreadsPerBlock = device_prop.maxThreadsPerBlock / 2;
   device_properties[device_index] = device_prop;
 }
 
diff --git a/aten/src/ATen/cuda/detail/KernelUtils.h b/aten/src/ATen/cuda/detail/KernelUtils.h
index b36e78c..dea597f 100644
--- a/aten/src/ATen/cuda/detail/KernelUtils.h
+++ b/aten/src/ATen/cuda/detail/KernelUtils.h
@@ -19,7 +19,7 @@ namespace at { namespace cuda { namespace detail {
 
 
 // Use 1024 threads per block, which requires cuda sm_2x or above
-constexpr int CUDA_NUM_THREADS = 1024;
+constexpr int CUDA_NUM_THREADS = 512;
 
 // CUDA: number of blocks for threads.
 inline int GET_BLOCKS(const int64_t N, const int64_t max_threads_per_block=CUDA_NUM_THREADS) {

Build PyTorch with MAX_JOBS set to 2 because the Jetson Nano has little memory. A higher MAX_JOBS builds faster, but the build is very likely to hang.

export BUILD_CAFFE2_OPS=OFF
export USE_FBGEMM=OFF
export USE_FAKELOWP=OFF
export BUILD_TEST=OFF
export USE_MKLDNN=OFF
export USE_NNPACK=OFF
export USE_XNNPACK=OFF
export USE_QNNPACK=OFF
export USE_PYTORCH_QNNPACK=OFF
export USE_CUDA=ON
export USE_CUDNN=ON
export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
export USE_NCCL=OFF
export USE_SYSTEM_NCCL=OFF
export USE_OPENCV=OFF
export MAX_JOBS=2
# set path to ccache
export PATH=/usr/lib/ccache:/usr/local/cuda/bin:$PATH
# set clang compiler
export CC=clang
export CXX=clang++
# create symlink to cublas
# ln -s /usr/lib/aarch64-linux-gnu/libcublas.so /usr/local/cuda/lib64/libcublas.so
# start the build
(
  cd pytorch
  python3.8 setup.py bdist_wheel
)

This build takes about 13 hours.
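
If you build on a board with a different amount of memory, MAX_JOBS could be derived from the total RAM instead of being hard-coded. The following is only a rough sketch. `suggest_max_jobs` is a hypothetical helper, and the budget of 2000000 kB per compile job is an assumption chosen so that an input of 4000000 kB yields the value 2 used in this article. It is not a figure published by PyTorch.

```bash
# Suggest a MAX_JOBS value from total memory given in kB.
# kb_per_job is a hypothetical budget (about 2 GB per compile job).
suggest_max_jobs() {
  local mem_kb kb_per_job jobs
  mem_kb=$1
  kb_per_job=${2:-2000000}
  jobs=$(( mem_kb / kb_per_job ))
  # Always allow at least one job.
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  echo "$jobs"
}

# Read MemTotal (in kB) from /proc/meminfo when it is available.
if [ -r /proc/meminfo ]; then
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "suggested MAX_JOBS=$(suggest_max_jobs "$mem_kb")"
fi
```

Treat the result as a starting point. If the build still hangs, lower the value by hand.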

Install.

find pytorch/dist -type f|xargs python3.8 -m pip install
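
The command above passes every file in the dist directory to pip. If that directory might hold wheels from earlier builds, a narrower selection can help. The sketch below assumes the wheel filename contains the CPython tag (for example cp38). `find_wheels` is a hypothetical helper introduced here.

```bash
# Print the wheels in a directory whose filename carries the given
# CPython tag (for example cp38), one path per line.
find_wheels() {
  local dir tag
  dir=$1
  tag=$2
  find "$dir" -maxdepth 1 -type f -name "*-${tag}-*.whl" | sort
}

# Example usage (not executed here):
#   find_wheels pytorch/dist cp38 | xargs python3.8 -m pip install
```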

build and install TorchVision with python3.8

Clone and build TorchVision.

# torch vision
git clone --depth=1 https://github.com/pytorch/vision torchvision -b v0.12.0
(
  cd torchvision
  export TORCH_CUDA_ARCH_LIST='5.3;6.2;7.2'
  export FORCE_CUDA=1
  export MAX_JOBS=2
  python3.8 setup.py bdist_wheel
)

This build takes about 16 minutes.
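
TorchVision releases are paired with PyTorch releases, which is why v0.12.0 is used with PyTorch v1.11.0 here. If you adapt this procedure to another PyTorch version, pick the matching TorchVision series. The lookup below is a small sketch covering only a few neighbouring releases. `torchvision_series_for` is a hypothetical helper, and the exact patch-level tag should still be checked against the TorchVision repository.

```bash
# Print the TorchVision release series that pairs with a PyTorch version.
# Only a few releases around 1.11 are listed; returns non-zero otherwise.
torchvision_series_for() {
  case "$1" in
    1.9.*)  echo "0.10" ;;
    1.10.*) echo "0.11" ;;
    1.11.*) echo "0.12" ;;
    1.12.*) echo "0.13" ;;
    *) return 1 ;;
  esac
}

torchvision_series_for 1.11.0
```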

Install.

find torchvision/dist -type f|xargs python3.8 -m pip install

That’s all.
