
I'm receiving the following error when fitting a model: "Segmentation fault (core dumped)". I'm on Ubuntu 18.04 and have an Nvidia RTX 2070 (for CUDA) and an AMD RX 570 (for my 4K display). I don't think the dual GPUs are the issue, though; I can still run code on the RTX 2070 that worked before I installed the AMD card. I set my system up for deep learning by following the Installing Tensorflow-GPU tutorial. The code I am trying to run, which I got from Install Tensorflow with GPU support, is the following:

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten, MaxPooling2D, Conv2D
from keras.callbacks import TensorBoard

# Load MNIST and reshape to (samples, 28, 28, 1) for the Conv2D input layer
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_test = X_test.reshape(10000, 28, 28, 1).astype('float32')

# Scale pixel values to [0, 1]
X_train /= 255
X_test /= 255

# One-hot encode the digit labels
n_classes = 10
y_train = keras.utils.to_categorical(y_train, n_classes)
y_test = keras.utils.to_categorical(y_test, n_classes)

# Simple LeNet-style convolutional network
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Log training progress for TensorBoard
tensor_board = TensorBoard('./logs/LeNet-MNIST-1')

model.fit(X_train, y_train, batch_size=16, epochs=15, verbose=1, validation_data=(X_test, y_test), callbacks=[tensor_board])

Here is the output from running the above code:

Using TensorFlow backend.
Train on 60000 samples, validate on 10000 samples
2018-12-21 21:28:32.425989: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-12-21 21:28:33.111624: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-21 21:28:33.112435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.65
pciBusID: 0000:09:00.0
totalMemory: 7.77GiB freeMemory: 7.65GiB
2018-12-21 21:28:33.112452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-21 21:28:33.380127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-21 21:28:33.380166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-21 21:28:33.380172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-21 21:28:33.380625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7359 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:09:00.0, compute capability: 7.5)
Epoch 1/15
Segmentation fault (core dumped)

Watching the nvidia-smi window, it shows GPU usage for about one second, then usage drops to zero and the terminal prints the segmentation fault. Running the same code in Jupyter simply kills the kernel. The only thing I can think of is the versions of the programs I have installed. Here are those versions:

GCC:

gcc version 6.5.0 20181026 (Ubuntu 6.5.0-2ubuntu1~18.04) 

CUDA:

CUDA Version 9.0.176
CUDA Patch Version 9.0.176.4

Tensorflow:

1.12.0

CUDNN:

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4

And my nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.23       Driver Version: 415.23       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   46C    P0     1W / 175W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
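
TensorFlow itself does seem to see the card (see the device creation line in the log above), and a quick check along these lines (a minimal sketch using the TF 1.x API) can confirm the build is CUDA-enabled and that the RTX 2070 shows up as a visible device:

import tensorflow as tf
from tensorflow.python.client import device_lib

# Report the installed version and whether this build was compiled with CUDA.
print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())

# List every device TensorFlow can place ops on; the RTX 2070 should
# appear here as /device:GPU:0 if the CUDA/cuDNN stack is loading correctly.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)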

According to the blog mentioned above, this code should be simple to run on an RTX 2070, but it refuses to run. Any advice?

1 Answer


I actually found the answer in a comment at the bottom of the blog with the MNIST example I mentioned above. Here is the comment:

alright, i figured out it was probably the incompatibility of different versions of cudnn. I created a new environment with conda specifying python=3.6 (conda create --name tf-gpu python=3.6), and then installed tensorflow-gpu=1.8.0 (conda install tensorflow-gpu=1.8.0). I'm still wondering why exactly this happened but at least now all codes on this page run smoothly.

I created a new conda environment with those specific installations, and the code now runs smoothly. I'll leave this posted in case someone else comes across the same issue, since my original problem was not with this MNIST code but with some other code.
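
For anyone reproducing the fix: after creating the environment with the two conda commands quoted above (conda create --name tf-gpu python=3.6, then conda install tensorflow-gpu=1.8.0, which pulls its own matching CUDA/cuDNN packages into the environment), running a small convolution directly on the GPU is a quick way to confirm the segfault is gone. This is just a sketch using the TF 1.x graph API, not my original training script:

import tensorflow as tf

with tf.device('/device:GPU:0'):
    # One dummy 28x28 grayscale image and 32 3x3 filters, mirroring the first
    # Conv2D layer of the MNIST model above.
    image = tf.random_normal([1, 28, 28, 1])
    kernels = tf.random_normal([3, 3, 1, 32])
    conv = tf.nn.conv2d(image, kernels, strides=[1, 1, 1, 1], padding='VALID')

# log_device_placement prints which device each op actually ran on.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    out = sess.run(conv)
    print("conv output shape:", out.shape)  # expected: (1, 26, 26, 32)

If this runs cleanly, the full model.fit call above is likely to as well.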
