PyTorch build on Jetson Nano


This article explains how to build PyTorch on Jetson Nano. At this time, Jetson Linux only supports Python 3.6, so if you need to use PyTorch with Python 3.7 or higher, you will need to build PyTorch yourself, because no official Jetson PyTorch package exists for those versions.

This article uses Python 3.8 as the example, but other versions can be built with almost the same steps.

install packages

Install the necessary packages for python3.8 and PyTorch.

sudo apt-get update
sudo apt-get install -y \
  python3.8 python3.8-dev \
  ninja-build git cmake clang \
  libopenmpi-dev libomp-dev ccache \
  libopenblas-dev libblas-dev libeigen3-dev \
  python3-pip libjpeg-dev \
  gnupg2 curl

install cmake

The default cmake is too old to build PyTorch, so install a newer cmake that supports PyTorch builds.

# cmake
sudo apt-get install -y software-properties-common lsb-release
sudo wget -O - 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository "deb $(lsb_release -cs) main"
sudo apt-get update
sudo apt-get install -y cmake
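To confirm that the newly installed cmake is actually newer than the one it replaced, the versions can be compared in the shell. The following is a minimal sketch, assuming GNU `sort` with the `-V` option is available. The function name `version_ge` and the `REQUIRED_CMAKE` value are hypothetical placeholders for illustration, not a documented PyTorch requirement.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on GNU sort -V (version sort); this helper is illustrative only.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder minimum version (an assumption); adjust to what your build needs.
REQUIRED_CMAKE="3.13.0"

if command -v cmake >/dev/null 2>&1; then
  # "cmake --version" prints "cmake version X.Y.Z" on its first line.
  installed="$(cmake --version | head -n 1 | awk '{ print $3 }')"
  if version_ge "$installed" "$REQUIRED_CMAKE"; then
    echo "cmake $installed is new enough"
  else
    echo "cmake $installed is older than $REQUIRED_CMAKE"
  fi
fi
```

If the reported version is still the old one, checking which `cmake` binary comes first in `PATH` is a reasonable next step.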

install CUDA and cuDNN

Install CUDA and cuDNN.

# cuda and cudnn
sudo apt-get install -y nvidia-cuda nvidia-cudnn8

remove package

Before building PyTorch and TorchVision, remove the numpy package installed by apt, because it causes conflicts. numpy for python3.8 is installed with pip in the next step.

sudo apt remove -y python3-numpy

build and install PyTorch with python3.8

Install the required python3.8 packages, clone PyTorch, and apply the patch. For more details about the patch, see this page.

python3.8 -m pip install -U pip
python3.8 -m pip install -U setuptools
python3.8 -m pip install -U wheel mock pillow
python3.8 -m pip install scikit-build
python3.8 -m pip install cython Pillow numpy

## download PyTorch v1.11.0 with all its libraries
git clone -b v1.11.0 --depth 1 --recursive --recurse-submodules --shallow-submodules
(
  cd pytorch
  python3.8 -m pip install -r requirements.txt
  wget
  patch -p1 < pytorch-1.11-jetson.patch
)

The following is the pytorch-1.11-jetson.patch. It can be downloaded from here.

diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
index 0327868..e484fba 100644
--- a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
+++ b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h
@@ -26,6 +26,7 @@ inline namespace CPU_CAPABILITY {
 //
 // Most likely we will do aarch32 support with inline asm.
 #if defined(__aarch64__)
+#if defined(__clang__) || (__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))
 
 #ifdef __BIG_ENDIAN__
 #error "Big endian is not supported."
@@ -715,5 +716,6 @@ Vectorized<float> inline fmadd(const Vectorized<float>& a, const Vectorized<floa
 }
 
 #endif /* defined(aarch64) */
+#endif /* defined(__clang__) */
 
 }}}
diff --git a/aten/src/ATen/cuda/CUDAContext.cpp b/aten/src/ATen/cuda/CUDAContext.cpp
index 1751128..a090e70 100644
--- a/aten/src/ATen/cuda/CUDAContext.cpp
+++ b/aten/src/ATen/cuda/CUDAContext.cpp
@@ -24,6 +24,7 @@ void initCUDAContextVectors() {
 void initDeviceProperty(DeviceIndex device_index) {
   cudaDeviceProp device_prop;
   AT_CUDA_CHECK(cudaGetDeviceProperties(&device_prop, device_index));
+  device_prop.maxThreadsPerBlock = device_prop.maxThreadsPerBlock / 2;
   device_properties[device_index] = device_prop;
 }
 
diff --git a/aten/src/ATen/cuda/detail/KernelUtils.h b/aten/src/ATen/cuda/detail/KernelUtils.h
index b36e78c..dea597f 100644
--- a/aten/src/ATen/cuda/detail/KernelUtils.h
+++ b/aten/src/ATen/cuda/detail/KernelUtils.h
@@ -19,7 +19,7 @@ namespace at { namespace cuda { namespace detail {
 
 
 // Use 1024 threads per block, which requires cuda sm_2x or above
-constexpr int CUDA_NUM_THREADS = 1024;
+constexpr int CUDA_NUM_THREADS = 512;
 
 // CUDA: number of blocks for threads.
 inline int GET_BLOCKS(const int64_t N, const int64_t max_threads_per_block=CUDA_NUM_THREADS) {
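Before the patch is applied inside the pytorch checkout, a dry run can show whether it would apply cleanly. The sketch below assumes GNU `patch`, whose `--dry-run` option reports the result without changing any file. The helper name `patch_applies` is hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash
# patch_applies DIR PATCHFILE: succeed if PATCHFILE would apply cleanly in DIR.
# --dry-run makes GNU patch report the result without modifying any file.
# PATCHFILE should be an absolute path because the check runs inside DIR.
patch_applies() {
  local dir="$1" patchfile="$2"
  ( cd "$dir" && patch -p1 --dry-run < "$patchfile" >/dev/null 2>&1 )
}

# Usage, assuming the patch was downloaded into the pytorch checkout:
#   patch_applies pytorch "$PWD/pytorch/pytorch-1.11-jetson.patch" \
#     && echo "patch applies cleanly"
```

A failing dry run usually means the checkout is not at the tag the patch was written for, or that the patch has already been applied.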

Build PyTorch with MAX_JOBS set to 2, because the Jetson Nano has little memory. A higher MAX_JOBS value builds faster, but the build is then very likely to hang.

export BUILD_CAFFE2_OPS=OFF
export USE_FBGEMM=OFF
export USE_FAKELOWP=OFF
export BUILD_TEST=OFF
export USE_MKLDNN=OFF
export USE_NNPACK=OFF
export USE_XNNPACK=OFF
export USE_QNNPACK=OFF
export USE_PYTORCH_QNNPACK=OFF
export USE_CUDA=ON
export USE_CUDNN=ON
export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
export USE_NCCL=OFF
export USE_SYSTEM_NCCL=OFF
export USE_OPENCV=OFF
export MAX_JOBS=2

# set path to ccache
export PATH=/usr/lib/ccache:/usr/local/cuda/bin:$PATH

# set clang compiler
export CC=clang
export CXX=clang++

# create symlink to cublas
# ln -s /usr/lib/aarch64-linux-gnu/ /usr/local/cuda/lib64/

# start the build
(
  cd pytorch
  python3.8 setup.py bdist_wheel
)
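Since the build above is limited by memory, it may help to check how much RAM and swap the board has before starting it. The following is a minimal sketch that reads `/proc/meminfo`. The function name `total_mem_mib` and the `MIN_MIB` threshold are hypothetical and chosen only for illustration, not measured requirements.

```shell
#!/usr/bin/env bash
# Sum MemTotal and SwapTotal (reported in kB) from a meminfo-style file
# and print the result in MiB. The file path is a parameter so the
# function can be tried on a sample file; it defaults to /proc/meminfo.
total_mem_mib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^(MemTotal|SwapTotal):/ { kb += $2 } END { print int(kb / 1024) }' "$meminfo"
}

# Hypothetical threshold (an assumption): warn below 8 GiB of RAM plus swap.
MIN_MIB=8192

if [ -r /proc/meminfo ]; then
  have_mib="$(total_mem_mib)"
  if [ "$have_mib" -lt "$MIN_MIB" ]; then
    echo "Only ${have_mib} MiB of RAM+swap available; keep MAX_JOBS low." >&2
  fi
fi
```

The check only reports totals; it does not change MAX_JOBS or any other build setting.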

This build takes about 13 hours. When it finishes, install the generated wheel.


find pytorch/dist -type f | xargs python3.8 -m pip install
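The `find` command above passes every file under `pytorch/dist` to pip. As a variation, the install could be limited to wheel files only. This is a sketch under that assumption; the helper names `list_wheels` and `install_wheels` are hypothetical.

```shell
#!/usr/bin/env bash
# list_wheels DIR: print the wheel files (*.whl) directly under DIR, one per line.
list_wheels() {
  find "$1" -maxdepth 1 -type f -name '*.whl' | sort
}

# install_wheels PYTHON DIR: install each wheel found in DIR with PYTHON's pip.
install_wheels() {
  local python="$1" dist="$2"
  list_wheels "$dist" | while IFS= read -r wheel; do
    "$python" -m pip install "$wheel"
  done
}

# Usage with the paths from the steps above:
#   install_wheels python3.8 pytorch/dist
```

The same helper would work for `torchvision/dist` once that build has finished.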

build and install TorchVision with python3.8

Clone and build TorchVision.

# torch vision
git clone --depth=1 torchvision -b v0.12.0
(
  cd torchvision
  export TORCH_CUDA_ARCH_LIST='5.3;6.2;7.2'
  export FORCE_CUDA=1
  export MAX_JOBS=2
  python3.8 setup.py bdist_wheel
)

This build takes about 16 minutes. When it finishes, install the generated wheel.


find torchvision/dist -type f | xargs python3.8 -m pip install
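Once both wheels are installed, a quick import check can confirm that python3.8 sees them. The following is a minimal sketch; the helper name `can_import` is hypothetical, and the check only tests that the modules import, not that every operator works.

```shell
#!/usr/bin/env bash
# can_import PYTHON MODULE: succeed if MODULE can be imported by PYTHON.
can_import() {
  local python="$1" module="$2"
  "$python" -c "import ${module}" >/dev/null 2>&1
}

for module in torch torchvision; do
  if can_import python3.8 "$module"; then
    echo "$module: OK"
  else
    echo "$module: import failed"
  fi
done

# To also see the version and whether CUDA is usable:
#   python3.8 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

Running the check from a directory other than the `pytorch` source tree avoids importing the unbuilt sources by accident.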

That’s all.