Friday, 24 July 2015

Setting up a virtual machine for data analysis at the LHC

Analysis? I thought you only posted about electronics?

The LHC is now producing data with 50 ns bunch spacing and is getting ready to bring that down to 25 ns! This means physics analyses are picking up and more people are getting into them. I have been involved in the analysis of double Higgs production decaying to di-photon and di-b-quark final states. This has involved developing an analysis framework and, more recently, the statistical tools that go with it.

Why this post and why now?

I have been working on the analysis code for over 7 months now, so why am I suddenly posting a how-to? Well, my virtual machine packed up for some reason and I have to re-install everything. A nice recipe will help me remember things if it happens again, and it should also help future students and possibly even you.

Let's get started!

I am using VMware Player (VMP) for my virtual machine. It doesn't matter what you use, but I prefer it over VirtualBox because of the nice tools that come with VMP, I am comfortable with it, and it's free. You can download it here: https://www.vmware.com/products/player

First up, you need to download the ISO of your choice to install on your VM. I chose SLC6.6 since it was the latest at the time of writing and it's CERN software, so I know the analysis code will work on it. It's up to you what you use. Here is the link to install SLC6. Try to use the 64-bit version: http://linux.web.cern.ch/linux/scientific6/docs/install.shtml

I have not tried SLC7. I am too scared to move to CentOS :P

So, once you have downloaded the SLC6.6 64-bit ISO we can get started.

Open VMware Player and create a new machine. Use the ISO as the installer disc. I won't go into much detail about the install as it's fairly straightforward and you can follow the link.

First time booting up

After putting in a root password we should move on to creating a user. This is where we will create the main user of the VM. It's a good idea to give the user the same name as your CERN account. This will make life a lot easier when SSHing to lxplus, using the GRID, etc. On the Create User page fill in your CERN NICE account name. Choose any password. Next.

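You should now have booted into SLC6. Next we need to add your user to the wheel group so you can use sudo. I edit the sudoers file with nano here (visudo is the safer option, since it checks the syntax before saving):
su -
nano /etc/sudoers

Uncomment the # %wheel line by removing the "#":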

## Allows people in group wheel to run all commands
%wheel ALL=(ALL)       ALL

Add yourself to the wheel group:
sudo usermod -aG wheel johnsmith
exit
Test sudo works:
sudo ls
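
If sudo complains, it helps to check that the group change actually happened. Here is a minimal sketch of such a check. The helper name in_group is my own and johnsmith is just the placeholder account from above; it only wraps the standard id command.

```shell
# in_group USER GROUP: succeed if USER is a member of GROUP.
# (Hypothetical helper name; it only wraps the standard `id` command.)
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example: replace johnsmith with your own account name.
if in_group johnsmith wheel; then
    echo "johnsmith is in wheel"
else
    echo "johnsmith is NOT in wheel (or the account does not exist)"
fi
```

If the check passes but sudo still refuses, try logging out and back in before digging further.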

Update!

Now that you have sudo rights, let's update the system. This can take a long time if you are using a remote package repository. If you are in South Africa you can find the mirror repo for SLC at www.mirror.ac.za by scrolling to the bottom. Adding a repo is something you can google :)
The update may take a while, so go grab some coffee while it runs.

sudo yum update

Install VMware-tools

Player -> Manage -> Install VMware Tools
An ISO should appear on the Desktop.
It should open automatically. If not, just google "mount iso in linux".
Extract the .tar.gz file to the Desktop.
Open a console and cd into the folder you just extracted.
Run the vmware-install.pl executable with sudo:
sudo ./vmware-install.pl

I leave everything on default. Now you can drag/drop copy/paste from the host to the VM.
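
If you prefer the console to clicking around, the extraction step above can be sketched roughly as follows. This is only an illustration: unpack_tools is a name I made up, /mnt/vmtools is a hypothetical mount point, and I am assuming the tools disc shows up as /dev/cdrom, which may differ on your setup.

```shell
# unpack_tools SRC_DIR DEST_DIR: extract the first *.tar.gz found in SRC_DIR
# into DEST_DIR. (Hypothetical helper; SRC_DIR is wherever the disc is mounted.)
unpack_tools() {
    for archive in "$1"/*.tar.gz; do
        [ -f "$archive" ] || continue
        mkdir -p "$2" && tar -xzf "$archive" -C "$2"
        return $?
    done
    echo "unpack_tools: no .tar.gz found in $1" >&2
    return 1
}

# Example (assumed device and mount point, adjust to your system):
#   sudo mkdir -p /mnt/vmtools
#   sudo mount -o ro /dev/cdrom /mnt/vmtools
#   unpack_tools /mnt/vmtools ~/Desktop
```

After that, cd into the extracted folder and run the installer as described above.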

Installing software collections from CERN

We have a basic working desktop. Now we want to make sure we have an up-to-date compiler. We make use of the CERN software collections: http://linux.web.cern.ch/linux/scl/#scl12

Install the packages:
sudo yum install python27 devtoolset-3
For some reason python33 doesn't work with the rcSetup tool we install later, so stick to 27 for now. We want to enable these packages whenever we start a new bash shell, so add the following to ~/.bashrc:
source /opt/rh/devtoolset-3/enable
source /opt/rh/python27/enable
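
Since I end up redoing this setup more often than I would like, I find it handy to add such lines in a way that does not duplicate them on a second run. A minimal sketch, where add_line_once is a helper name I made up:

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line
# is already there. (Hypothetical helper, not part of any CERN tool.)
add_line_once() {
    touch "$1" || return 1
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example usage for the two lines above:
#   add_line_once ~/.bashrc 'source /opt/rh/devtoolset-3/enable'
#   add_line_once ~/.bashrc 'source /opt/rh/python27/enable'
```

Running it twice leaves only one copy of each line in the file.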
Enable the changes by sourcing .bashrc, then check the compiler version:
source ~/.bashrc
g++ -v
You should see version 4.9. For the new analysis code you NEED 4.8 or above. This is important.

DO NOT COMPILE ANYTHING UNTIL THIS IS DONE!!!

SERIOUSLY... The default on SLC6 is g++ 4.4, which is ancient!
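
To make the compiler requirement checkable rather than eyeballed, something like the sketch below could go into a setup script. The function name version_at_least is my own, and it assumes GNU sort with the -V (version sort) option is available.

```shell
# version_at_least HAVE NEED: succeed if version HAVE is >= version NEED.
# (Hypothetical helper; relies on GNU `sort -V` for version ordering.)
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check if g++ is actually on the PATH.
if command -v g++ >/dev/null 2>&1; then
    if version_at_least "$(g++ -dumpversion)" 4.8; then
        echo "g++ is new enough"
    else
        echo "g++ is too old: enable devtoolset-3 before compiling"
    fi
fi
```

The trick is that the smaller of the two versions sorts first, so if the required version comes out on top the installed one is at least as new.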

Installing ROOT

If you have done any data analysis you probably have heard of the ROOT package. The latest version can be found here: https://root.cern.ch/drupal/content/downloading-root

At the time of writing, Root-v6.04.00- was not compatible with the Analysis Base, so stick to the older Root-v6.02.12.

After you have downloaded the source code we need to install all the pre-requisites: https://root.cern.ch/drupal/content/build-prerequisites
sudo yum install git make gcc-c++ gcc binutils libX11-devel libXpm-devel libXft-devel libXext-devel
Optional packages: 
sudo yum install gcc-gfortran openssl-devel pcre-devel mesa-libGL-devel glew-devel ftgl-devel mysql-devel fftw-devel cfitsio-devel graphviz-devel avahi-compat-libdns_sd-devel openldap-devel python-devel libxml2-devel gsl-static
I normally create a root-src and a root-build directory. This lets me manage which version I am using.
Extract the root source into the root-src directory.

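From inside the root-src directory:
sudo ./configure --all
sudo make -j 4

Then define the install location, which should be the root-build directory, and install. Note that sudo normally resets the environment, so if ROOTSYS does not seem to be picked up, try sudo -E make install to preserve it:
export ROOTSYS=/path/to/install/root-build/
sudo make install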
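
Keeping each ROOT version in its own directory pays off when you want to switch between them. A minimal sketch of a switcher is below. The name use_root is my own, and I am assuming the installation provides the usual bin/thisroot.sh setup script under the install directory.

```shell
# use_root DIR: load the environment of the ROOT installation under DIR.
# (Hypothetical helper; assumes DIR/bin/thisroot.sh exists in the install.)
use_root() {
    if [ -r "$1/bin/thisroot.sh" ]; then
        . "$1/bin/thisroot.sh"
    else
        echo "use_root: $1/bin/thisroot.sh not found" >&2
        return 1
    fi
}

# Example, using the install location defined above:
#   use_root "$ROOTSYS" && root-config --version
```

Put the function in ~/.bashrc and call it with whichever install directory you want active in that shell.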

Installing Boost

Boost is a set of peer-reviewed C++ libraries with general functionality. You can download it here: http://sourceforge.net/projects/boost/files/boost/1.58.0/

To install it, run the bootstrap script first (it creates the b2 build tool), then build and install:
tar -xvjf boost_1_58_0.tar.bz2
cd boost_1_58_0/
sudo ./bootstrap.sh --prefix=/usr/local/boost-builds/v1-58-0/ --with-libraries=system,regex,filesystem,program_options,test
sudo ./b2
sudo ./b2 install
Now we tell the environment where to find Boost:
export BOOSTLIBDIR=/usr/local/boost-builds/v1-58-0/lib
export BOOSTINCDIR=/usr/local/boost-builds/v1-58-0/include
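
A quick sanity check that BOOSTINCDIR points at the headers you just installed is to read the version string out of boost/version.hpp. This is a small sketch; boost_lib_version is a name I made up, and it assumes the header defines BOOST_LIB_VERSION in the usual one-line form.

```shell
# boost_lib_version INCDIR: print the BOOST_LIB_VERSION string found in
# INCDIR/boost/version.hpp. (Hypothetical helper for a sanity check.)
boost_lib_version() {
    sed -n 's/^#define BOOST_LIB_VERSION "\(.*\)".*/\1/p' "$1/boost/version.hpp"
}

# Example:
#   boost_lib_version "$BOOSTINCDIR"
```

For the 1.58.0 install above I would expect this to print something like 1_58. If it prints nothing, the include path is probably wrong.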

Getting things ready for the analysis

So now our machine is almost ready for the data analysis. We have ROOT and Boost libraries installed. Now we just need to do some fine tuning to make our lives easier for updating the analysis code.


rcSetupLocal

rcSetup is the tool which manages the analysis base. The base is a collection of tools which are used in all areas of an analysis: calibrations, selections, etc.

First we need Subversion:
sudo yum install subversion
Let's set it up as follows:
mkdir -p ~/ATLAS/sw/rcSetup
cd ~/ATLAS/sw/rcSetup/
svn co svn+ssh://svn.cern.ch/reps/atlasoff/rcSetup/tags/rcSetup-00-04-09
ln -s rcSetup-00-04-09 latest

In our .bashrc file we will create a command which is not the default rcSetup but rather rcSetupLocal (to distinguish between the two):

rcSetupLocal() {
    # Local release area used by rcSetup
    export rcSetupSite=~/ATLAS/sw/releases
    export PATHRESOLVER_ALLOWHTTPDOWNLOAD=1
    # Pass all arguments through to rcSetup, preserving quoting
    source ~/ATLAS/sw/rcSetup/latest/rcSetup.sh "$@"
}

Setting up Kerberos

We will be checking out packages using SVN, and a LOT of them. Since rcSetup is a wrapper around SVN using SSH, we end up having to type passwords multiple times for a single download. That is fine if there are two or three, but we will be downloading hundreds. So we need to set up Kerberos to fetch a ticket from CERN, which will give us an extended period of passwordless use.

You should have Kerberos installed already. Check whether you have
/etc/krb5.conf

If yes, then replace the contents with
; AD  : This Kerberos configuration is for CERN's Active Directory realm.
;
; /etc/krb5.conf
; On SLC nodes this file is maintained via ncm-krb5clt(1), local changes may be lost.
; If you need to add your realm, look at the "template" file 
; in /usr/lib/ncm/config/krb5clt/etc_krb5.conf.tpl
; or get in touch with project-elfms@cern.ch
;
; Created   1-Apr-2011
; Modified  3-Mar-2014
;

[libdefaults]
 default_realm = CERN.CH
 ticket_lifetime = 25h
 renew_lifetime = 120h
 forwardable = true
 proxiable = true
 default_tkt_enctypes = arcfour-hmac-md5 aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
 allow_weak_crypto = true

[realms]
 CERN.CH = {
  default_domain = cern.ch
  kpasswd_server = cerndc.cern.ch
  admin_server = cerndc.cern.ch
  kdc = cerndc.cern.ch

  v4_name_convert = {
     host = {
         rcmd = host
     }
  }
 }
; the external institutes info is completely static for now and comes
; straight from the NCM template
 FNAL.GOV = {
  default_domain = fnal.gov
  admin_server = krb-fnal-admin.fnal.gov
  kdc = krb-fnal-1.fnal.gov:88 
  kdc = krb-fnal-2.fnal.gov:88 
  kdc = krb-fnal-3.fnal.gov:88 
 }
 KFKI.HU = {
  kdc = kerberos.kfki.hu
  admin_server = kerberos.kfki.hu
 } 
 HEP.MAN.AC.UK = {
  kdc = afs4.hep.man.ac.uk
  kdc = afs1.hep.man.ac.uk
  kdc = afs2.hep.man.ac.uk
  kdc = afs3.hep.man.ac.uk
  admin_server = afs4.hep.man.ac.uk
  kpasswd_server = afs4.hep.man.ac.uk
  default_domain = hep.man.ac.uk
 }
[domain_realm]
 .cern.ch = CERN.CH
 .fnal.gov = FNAL.GOV
 .kfki.hu = KFKI.HU
 .hep.man.ac.uk = HEP.MAN.AC.UK
[appdefaults]
   pkinit_pool =  DIR:/etc/pki/tls/certs/
   pkinit_anchors = DIR:/etc/pki/tls/certs/
; options for Red Hat pam_krb5-2
 pam = {
   external = true
   krb4_convert =  false 
   krb4_convert_524 =  false 
   krb4_use_as_req =  false 
   ticket_lifetime = 25h
 }
If not, install Kerberos first (the krb5-workstation package on SLC6) and then replace the file.

Test the configuration by typing:
kinit
You should be prompted for your CERN password. If your local username is not the same as your CERN one, pass it explicitly (kinit yourname@CERN.CH), or else visit here: http://linux.web.cern.ch/linux/docs/kerberos-access.shtml
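
Tickets expire, so before a long checkout it is worth checking whether you still hold a valid one. Below is a minimal sketch using klist -s, which runs silently and reports through its exit status. The helper name have_ticket and the KLIST override variable are my own additions, the latter only so the function can be tried out without Kerberos installed.

```shell
# have_ticket: succeed if the credential cache holds a valid ticket.
# KLIST is a hypothetical override hook; it defaults to the real klist.
have_ticket() {
    "${KLIST:-klist}" -s 2>/dev/null
}

if have_ticket; then
    echo "Kerberos ticket is valid"
else
    echo "No valid ticket: run kinit (or kinit yourname@CERN.CH)"
fi
```

You could call have_ticket at the top of any script that is about to check out lots of packages.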

Installing the AnalysisBase

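Now we can start the installation of the Analysis Base. This can take some time, so be patient! You may need doxygen, so install it before running the setup. The release directory must match the rcSetupSite path set in rcSetupLocal:

sudo yum install doxygen
mkdir -p ~/ATLAS/sw/releases
cd ~/ATLAS/sw/releases/
rcSetupLocal -b Base,2.3.20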

If at the end some packages do not install you can have a look at the log files and see what went wrong. You can clean and rebuild a single package by typing
rc clean_pkg packageName
rc compile_pkg packageName
Hopefully it's successful! :)
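
When several packages fail, the clean and compile pair above gets tedious, so a small loop helps. This is only a sketch: rebuild_pkgs is a name I made up, and the RC variable is a hypothetical override that lets you do a dry run with echo instead of the real rc command.

```shell
# rebuild_pkgs PKG...: clean and recompile each named package in turn,
# stopping at the first failure. RC is a hypothetical override that
# defaults to the real `rc` command.
rebuild_pkgs() {
    for pkg in "$@"; do
        "${RC:-rc}" clean_pkg "$pkg" && "${RC:-rc}" compile_pkg "$pkg" || {
            echo "rebuild_pkgs: failed on $pkg" >&2
            return 1
        }
    done
}

# Dry run that only prints the commands:
#   RC=echo rebuild_pkgs packageA packageB
```

Stopping at the first failure keeps the relevant error at the bottom of the terminal instead of buried under later output.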

Installing your analysis packages

In this example I am installing the HGammaAnalysisFramework. You can follow their twiki here: https://twiki.cern.ch/twiki/bin/viewauth/AtlasProtected/HggDerivationAnalysisFramework#HGamAnalysisFramework

Go to the directory where you will be doing your analysis and set up the analysis base:
cd ~/my_xAOD_work/
rcSetupLocal Base,2.3.20
Download the packages and then run the HGam installation script to set up the environment:
rc checkout_pkg atlasoff/PhysicsAnalysis/HiggsPhys/Run2/HGamma/xAOD/HGamAnalysisFramework/tags/HGamAnalysisFramework-00-02-17 
./HGamAnalysisFramework/scripts/setupRelease
This is where I download my analysis package HH2yybb:
rc checkout_pkg atlasoff/PhysicsAnalysis/HiggsPhys/Run2/HGamma/xAOD/HH2yybb/tags/HH2yybb-00-02-05
For an example of running an analysis package (such as HH2yybb), you can visit: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/HGam_run2_ggbb#yybb_Analyis_Framework

Installing an IDE

It's a really good idea to use an IDE. This is 2015: stop using vi, nano and emacs. IntelliSense-style code completion lets you see a list of functions and their parameters while you are typing. This will save you LOADS of time, so try to use it!

I am using CodeBlocks in my VM. It's quite decent and has the functionality I need.

In your main directory, such as my_xAOD_work/:
rc checkout_pkg atlas-krasznaa/AODUpgrade/CMakeGenerator/trunk
mkdir IDE
cd IDE/
../CMakeGenerator/generateRootCoreCMakeProject.py
Now that you have created the CMake file, we can generate a CodeBlocks project by typing:
cmake -G "CodeBlocks - Unix Makefiles" .
Now open CodeBlocks and open the project called RootCore.cbp

You should have a working IDE now :)

Done!

Happy analysising :P

2 comments:

  1. This is cool for running locally - on your laptop while on the plane, e.g... but of course, you can avoid all of this stress by submitting analysis jobs to the grid. Besides, best to run where the data is ! SAGrid supports ATLAS at Wits, UJ and UCT. We also have a continuous integration project allowing you to propose and deliver your own applications.

    See http://www.africa-grid.org and sub-pages, or get in touch for more :)

  2. Hi Bruce, thanks for the info!
    I probably should have emphasized that this is more for code development than running over full datasets. Good luck to the person who tries to download the LHC data to their laptop :) I require local development since working remotely on lxplus is prone to interruptions (Meaning I have to setup the environment every time I reconnect) and there is little to no IDE support over console as the X11 forwarding is ridiculously slow. This way I can develop the code easily and like you say submit to the GRID to run over large datasets to do the actual analysis.
