[BUG] Segmentation Fault when using tfdlpack.to_dlpack on tf.tensor #12

Open
awthomp opened this issue Jan 17, 2020 · 9 comments

@awthomp

awthomp commented Jan 17, 2020

I've been experimenting with tfdlpack to connect libraries that use __cuda_array_interface__ to TensorFlow, and I hit a segmentation fault when invoking to_dlpack on a TF tensor. See below for a reproduction:

import cupy as cp
import tfdlpack

# CuPy - GPU Array (like NumPy!)
gpu_arr = cp.random.rand(10_000, 10_000)

# Use CuPy's built in `toDlpack` function to move to a DLPack capsule
dlpack_arr = gpu_arr.toDlpack()

# Use `tfdlpack` to migrate to TensorFlow
tf_tensor = tfdlpack.from_dlpack(dlpack_arr)

# Confirm TF tensor is on GPU
print(tf_tensor.device)

# Use `tfdlpack` to migrate back to CuPy; this yields a segmentation fault
dlpack_capsule = tfdlpack.to_dlpack(tf_tensor)

I'm using 1 GP100 isolated with the CUDA_VISIBLE_DEVICES environment variable.
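
When a script dies with a segmentation fault like the one above, Python's standard faulthandler module can at least show which Python line was executing at the time of the crash. The following is a minimal sketch, not part of the original report; the helper name enable_crash_traceback is hypothetical.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump Python tracebacks for all threads on SIGSEGV and similar fatal signals.

    `stream` must be a file object with a real file descriptor;
    it defaults to sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling enable_crash_traceback() at the top of the reproduction script, or running the script with python -X faulthandler, should print the Python-level stack before the process exits. It does not show the native frames inside the extension; a debugger such as gdb would be needed for that.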

@jermainewang
Collaborator

Confirmed this is a bug. I replaced cupy with torch and it also crashes.

import torch
from torch.utils import dlpack as th_dlpack
import tfdlpack

gpu_arr = torch.rand(10_000, 10_000).cuda()
print(gpu_arr)

dlpack_arr = th_dlpack.to_dlpack(gpu_arr)

# Use `tfdlpack` to migrate to TensorFlow
tf_tensor = tfdlpack.from_dlpack(dlpack_arr)

# Confirm TF tensor is on GPU
print(tf_tensor.device)

# Use `tfdlpack` to convert the TF tensor back to a DLPack capsule; this yields a segmentation fault
dlpack_capsule = tfdlpack.to_dlpack(tf_tensor)

@jermainewang
Collaborator

jermainewang commented Jan 18, 2020

What's your tensorflow version? I found the code works with tensorflow v2.1.0 but not v2.0.0.
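
Since the behavior appears to depend on the TensorFlow version, it may help to report the installed versions of every package involved in one go. The sketch below uses only the standard library. The distribution names in PACKAGES are assumptions (CuPy wheels, for example, are often published under CUDA-specific names), and the helper installed_versions is hypothetical.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    # Python 3.7: fall back to the backport, if it is installed
    import importlib_metadata as metadata

# Assumed distribution names; adjust to match what `pip list` shows.
PACKAGES = ("tensorflow", "tensorflow-gpu", "tfdlpack-gpu", "cupy", "torch")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into a bug report makes it easier to tell whether two people are actually running the same stack.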

@VoVAllen
Owner

It works well on my machine.
I'm using TensorFlow 2.1.0.

@awthomp
Author

awthomp commented Jan 18, 2020

> What's your tensorflow version? I found the code works with tensorflow v2.1.0 but not v2.0.0.

Interesting. I was on TF 2.1.0 when I submitted the bug report. I've included an Anaconda environment file below so we're on the same page for software dependencies:

name: tfdlpack
channels:
  - conda-forge
  - nvidia
  - pytorch
  - defaults
  - numba
dependencies:
  - python=3.7
  - numpy
  - cudatoolkit>=9.2,<10.2
  - numba
  - cupy>=6.2.0
  - pytorch
  - pip
  - pip:
      - tfdlpack-gpu

Just save this into a file named tfdlpack_conda.yml. Then run:

conda env create -f tfdlpack_conda.yml
conda activate tfdlpack

My system contains 2 GP100s (Pascal P100) and 1 P2000 to drive graphics. I typically isolate GPU0 (P100) with export CUDA_VISIBLE_DEVICES=0.
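
The same isolation can be done from inside a Python script instead of the shell, as long as the variable is set before any CUDA-backed library initializes the driver. This is a minimal sketch under that assumption; the helper name isolate_gpu is hypothetical.

```python
import os


def isolate_gpu(index=0):
    """Restrict this process to a single GPU.

    Must run before importing CuPy, PyTorch, or TensorFlow, because the
    CUDA runtime reads CUDA_VISIBLE_DEVICES when it is first initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = isolate_gpu(0)
# Import cupy, tensorflow, and tfdlpack only after this point.
```

Inside the process, the selected device is then renumbered as device 0, regardless of its physical index.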

@awthomp
Author

awthomp commented Jan 18, 2020

I'm also receiving the segfault with an NVIDIA T4. Here's a Google Colab notebook that you can run through. Perhaps pip install tfdlpack-gpu isn't pulling in all the expected/necessary dependencies?

https://colab.research.google.com/drive/18Z8bOCJ2Mr-jOD-vIbr6KAO1-KPUy_UM

@VoVAllen
Owner

Thanks for your example. Actually, I'm thinking of reorganizing the whole project based on the new TensorFlow custom-op repo, https://github.com/tensorflow/custom-op, as this is the official guide on how to distribute a custom op. However, I'm not sure whether I should base the project on Bazel instead of CMake. I may need more time on this.

@awthomp
Author

awthomp commented Jan 18, 2020

Thanks, @VoVAllen, and thanks for your hard work enabling DLPack support with TensorFlow. Don't hesitate to let us know what you need help with.

@VoVAllen
Owner

VoVAllen commented Jan 19, 2020

@awthomp I've updated the binary release and it now works in colab. Could you try it in your environment again?

However, there's still a bug in this release. It occurs when you create a capsule from a TensorFlow tensor but do not consume it in another framework. I'm still investigating a fix.

@awthomp
Author

awthomp commented Jan 19, 2020

@VoVAllen. Wahoo! Works for me in both Colab on a T4 and on my local machine with a P100. Thanks for the quick fix!
