I have spent all too much time trying to get Caffe running on a CentOS-based cluster that I use. I was hoping this would be a straightforward process. Suffice it to say it has not been. Not even close. None of the problems I encountered were particularly challenging to solve. The complication came from the fact that I ran into one hurdle after another. I should note, however, that installing Caffe on my personal machine, which runs Linux Mint 17.1, went smoothly.
I'm writing this post as a record of the problems I encountered and the solutions I used. Unfortunately, because I wasn't expecting to need to write this, I may be missing details. If you notice something missing, please feel free to leave a comment and I will update this document. Similarly, if you know of better ways to solve any of these problems, please feel free to share. I am unlikely to test the solutions myself unless I need to reinstall for some reason, so your comments would be largely for posterity. One thing to note is that some, if not many, of the problems I encountered may be rather specific to the cluster I'm using. My apologies if the following doesn't address the problem you are experiencing.
The first problem I ran into was with protobuf. The problem was related to the members of a union being declared const in src/google/protobuf/util/internal/datapiece.h, specifically the union containing the members i32_, i64_, u32_, u64_, double_, float_, bool_, and str_. This problem appears to be fairly common according to a quick Google search, and the fix is as simple as removing the const keyword. However, the error itself can be a bit misleading, as it doesn't point specifically to the offending lines.
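The const removal described above can also be scripted. The following is only a sketch: strip_union_const is a hypothetical helper, and it assumes each union member sits on its own line with const as the first token, which may not match the layout of your copy of datapiece.h. It prints the result rather than editing in place so the change can be reviewed first.

```shell
# Hypothetical helper: drop a leading "const" from declarations that sit
# between a line opening a union and the next line containing a closing brace.
# Assumes one member per line; prints the edited file to standard output.
strip_union_const() {
    sed -E '/union[[:space:]]*\{/,/\}/ s/^([[:space:]]*)const[[:space:]]+/\1/' "$1"
}

# Example usage (not run here):
#   f=src/google/protobuf/util/internal/datapiece.h
#   strip_union_const "$f" > "$f.new"
#   diff "$f" "$f.new"        # review the change before replacing the original
```

Only lines inside a union block are touched, so const declarations elsewhere in the header are left alone.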
The next problem I encountered was related to glog. Specifically, it requires a newer version of autotools than was installed on the system I'm using. To solve this I performed a user space install of autoconf and automake. It proved tricky to install autoconf for reasons I still don't understand. Thanks to the cluster admins I was finally able to do so using these instructions. Automake simply required downloading the tarball, configuring for my home directory, and then running make followed by make install. Unfortunately glog was still not happy! The included makefile hardcoded aclocal-1.14 and the upgraded autotools gave me aclocal-1.15. Bah! Executing autoreconf -ivf fixed that issue though. I am not completely certain this is a good solution, however, as I have not used autoreconf before.
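The automake install and the autoreconf step can be sketched roughly as follows. install_to_prefix is a hypothetical helper, and the tarball and directory names in the usage comments are assumptions rather than the exact ones I used.

```shell
# Hypothetical helper: configure, build, and install an unpacked
# autotools-style source tree under a user-writable prefix (default: $HOME).
install_to_prefix() {
    src_dir=$1
    prefix=${2:-$HOME}
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$src_dir" &&
        ./configure --prefix="$prefix" &&
        make &&
        make install
    )
}

# Example usage (not run here; file and directory names are assumptions):
#   tar xf automake-1.15.tar.gz
#   install_to_prefix automake-1.15 "$HOME"
#   export PATH="$HOME/bin:$PATH"    # so the new aclocal and automake are found first
#   (cd glog && autoreconf -ivf)     # regenerate glog's build files
```

Putting $HOME/bin at the front of PATH matters here, since otherwise the older system autotools would still be picked up.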
The next hurdle was with gflags. Evidently it requires cmake, which was not already installed. Downloading and installing cmake resolved this problem. One thing to note when performing a user space install of cmake is that you need to use the --prefix flag when calling the bootstrap script to indicate that you want cmake installed in your home directory. I also found that I had to compile gflags with -fPIC, so the complete cmake command I ended up using was CXXFLAGS="-fPIC" cmake -DCMAKE_INSTALL_PREFIX=~ .. in a build subdirectory of the repository.
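Put together, the two builds look roughly like the sketch below. The function names are hypothetical, and the make and make install steps after the gflags cmake command are the usual follow-up rather than something recorded above.

```shell
# Build and install CMake itself under a user-writable prefix.
# Run from the root of an unpacked CMake source tree.
install_cmake_user() {
    prefix=${1:-$HOME}
    ./bootstrap --prefix="$prefix" &&
    make &&
    make install
}

# Configure gflags in a build subdirectory with position-independent code.
# Run from the root of the gflags repository.
build_gflags_user() {
    prefix=${1:-$HOME}
    mkdir -p build &&
    (
        cd build &&
        CXXFLAGS="-fPIC" cmake -DCMAKE_INSTALL_PREFIX="$prefix" .. &&
        make &&
        make install
    )
}
```

Both functions default to $HOME as the install prefix, matching the user space installs used throughout this post.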
I also had to install OpenCV. I didn't run into any trouble here. But a word of warning for those that have never built it before -- it takes a very long time!
Next up was leveldb. In this case I just cloned the github repository and ran make in it. From there caffe needs to be told where to find the header files and shared objects that were built. I told it as much by appending to the INCLUDE_DIRS and LIBRARY_DIRS lists respectively in caffe/Makefile.config. The headers are in the include subdirectory of the repository while the libraries will be placed at the root of the repository.
From there I found I needed lmdb. This library is developed under OpenLDAP. At the time of this writing they offer a github repository with just the lmdb code so I cloned it and built the library. From there I updated the INCLUDE_DIRS and LIBRARY_DIRS lists in caffe/Makefile.config to point to libraries/liblmdb within the repository.
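The Makefile.config changes for leveldb and lmdb amount to a few appended lines. The sketch below uses a hypothetical helper, add_caffe_dirs, and assumes the appended lines land after the existing INCLUDE_DIRS and LIBRARY_DIRS definitions in the file. The clone locations under $HOME/src are placeholders.

```shell
# Hypothetical helper: append one include directory and one library directory
# to a Caffe Makefile.config using make's += operator.
add_caffe_dirs() {
    config=$1
    include_dir=$2
    library_dir=$3
    printf 'INCLUDE_DIRS += %s\nLIBRARY_DIRS += %s\n' \
        "$include_dir" "$library_dir" >> "$config"
}

# Example usage (not run here; the paths under $HOME/src are assumptions):
#   add_caffe_dirs caffe/Makefile.config \
#       "$HOME/src/leveldb/include" "$HOME/src/leveldb"
#   add_caffe_dirs caffe/Makefile.config \
#       "$HOME/src/lmdb/libraries/liblmdb" "$HOME/src/lmdb/libraries/liblmdb"
```

For lmdb the same directory serves as both the include and the library path, since the headers and the built library both live in libraries/liblmdb.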
Next I had to install the Google snappy library. In this case I had to get the tarball from the Google Code repository, not the GitHub repository. For reasons I don't know or really care about, it seems that build files are missing from the GitHub repository.
The last problem I ran into was related to Atlas. I had previously performed a user space install of it but did not add the resulting lib directory to my LD_LIBRARY_PATH and LIBRARY_PATH environment variables. Doing so finally allowed me to execute make all in the Caffe repository and have it complete without errors. It took a while though, in part because I was using a single thread since I never knew what it would stumble over next. As such, I advise throwing more threads at it by instead using the command make all -jX, where X is the number of threads you want it to use.
One final note on the installation. Don't forget to add the appropriate atlas, leveldb, and liblmdb directories to LD_LIBRARY_PATH in your .bashrc.
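A minimal sketch of what those .bashrc lines could look like follows. add_lib_dir is a hypothetical helper that skips directories already on the path, and the three directories are assumptions about where the libraries were built, so substitute your own.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed, so re-sourcing .bashrc does not add duplicates.
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Assumed locations; adjust to wherever the libraries actually live.
add_lib_dir "$HOME/atlas/lib"
add_lib_dir "$HOME/src/leveldb"
add_lib_dir "$HOME/src/lmdb/libraries/liblmdb"
```

Each new directory goes to the front of the list, so the user space builds take precedence over any system copies.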
At this point I'm really hoping it was worth the effort to install Caffe. Comparatively the Theano and Pylearn2 installations were so much easier on this same system.
Musings on artificial intelligence, machine learning, robotics, research, and just about anything else that comes to mind.
20150725
20150421
Linux Mint 17.1, Nvidia, CUDA, and cuDNN
I recently replaced a Titan X, which was on loan, with a GTX 980. After messing with drivers for nearly a day I was able to get my dual monitor setup running again. Unfortunately, whatever I did freaked out Theano, yielding the error:
dustin@Cortex ~ $ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.
IPython 1.2.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
In [1]: import theano
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu0 \
is not available (error: Unable to get the number of gpus available: \
unknown error)
I tried upgrading the Nvidia driver to 346.59 and CUDA from 6.5 to 7.0 with no luck. So I decided to start fresh since I had been wanting to upgrade from Linux Mint 17 to 17.1 anyway. Following are the steps I used to get my system running Theano again. I have not replicated these results so hopefully I am not overlooking any major steps.
Nvidia Driver
I used the xorg-edgers PPA to install Nvidia drivers 346.59 as described on Noobs Lab. In short, add the new repository:
sudo add-apt-repository ppa:xorg-edgers/ppa
Update to get the list of available packages:
sudo apt-get update
Install 346:
sudo apt-get install nvidia-346 nvidia-settings
I have a dual monitor setup that requires me to run nvidia-settings and enable Xinerama via the X Server Display Configuration page.
CUDA
I found some helpful instructions for installing CUDA. In short, start by installing the GNU Compiler Collection tools with:
sudo apt-get install build-essential
Download the Nvidia CUDA 7.0 DEB. Though I'm running Linux Mint 17.1, I used the Ubuntu 14.04 Network DEB. Install the downloaded package (for the 7.0 network DEB the file is named cuda-repo-ubuntu1404_7.0-28_amd64.deb):
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
Update to get the list of packages:
sudo apt-get update
Install CUDA:
sudo apt-get install cuda
Finally, alter your .bashrc to add CUDA to your PATH and LD_LIBRARY_PATH environment variables. Theano will also want to know where CUDA is located, so now is a good time to set up the CUDA_ROOT environment variable as well.
export PATH=/usr/local/cuda-7.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH
export CUDA_ROOT=/usr/local/cuda-7.0
cuDNN
Instructions for installing cuDNN for use with Theano can be found on the cuDNN page of deeplearning.net. I used the first method currently suggested on that page, which is to copy *.h to $CUDA_ROOT/include and *.so* to $CUDA_ROOT/lib64.
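As a sketch, the copy step could be wrapped as below. install_cudnn is a hypothetical helper, the source directory is wherever you unpacked the cuDNN archive, and the destination is assumed to be a standard CUDA install with include and lib64 subdirectories.

```shell
# Hypothetical helper: copy cuDNN headers and shared libraries from an
# unpacked cuDNN directory into a CUDA install (default: $CUDA_ROOT).
# cp -P copies symbolic links as links instead of following them.
install_cudnn() {
    src_dir=$1
    cuda_root=${2:-$CUDA_ROOT}
    cp -P "$src_dir"/*.h "$cuda_root/include/" &&
    cp -P "$src_dir"/*.so* "$cuda_root/lib64/"
}

# Example usage (not run here; the unpacked directory name is an assumption):
#   install_cudnn ~/Downloads/cudnn "$CUDA_ROOT"
```

Since /usr/local/cuda-7.0 is normally owned by root, the two cp commands will likely need to be run with sudo instead of through the function.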