Project leader: Tomas Lindén
HIP supports the computing needs of the LHC experiments ALICE, CMS and TOTEM. The ALICE computing resources in Finland are part of the distributed Nordic Data Grid Facility (NDGF) Tier-1 resource within the Nordic e-Infrastructure Collaboration (NeIC), while the CMS resources form a Tier-2 centre that uses some services from NDGF. These Tier-1 and Tier-2 resources are part of the distributed Worldwide LHC Computing Grid (WLCG). The good performance of WLCG is an essential factor in all the physics analyses and publications.
The HIP computing activities in 2020 will be focused on analysing the CMS data from Run 2 using the Finnish Tier-2 centre grid and cloud resources. The project team consists of T. Lindén (project leader) and F. Kivelä, who will work for HIP during the summer months. HIP is represented in the Nordic LHC Computing Grid (NLCG) Steering Committee by T. Lindén, and he is also a deputy member of the WLCG Grid Deployment Board (GDB).
Data intensive GRID computing
In 2020, the computing activities will consist of analysing the accumulated Run 2 data and producing the corresponding Monte Carlo data. This requires adequate high-performance computing resources at the participating universities and institutes to cope with the data analysis requirements.
The ALICE and CMS computing resources are managed in collaboration with the Finnish IT Center for Science (CSC) and are in production use. ALICE data management is integrated with the distributed NDGF dCache Storage Element for disk. Job submission is handled through a single server, an ALICE VOBOX that interfaces the local batch system with the ALICE grid middleware AliEn. The ALICE CPU resources consist of a virtual cluster running on the CSC cPouta cloud system. The renewal of the CSC supercomputing environment in 2019 and 2020 makes it possible to increase the ALICE CPU resources according to the requirements. The CSC EuroHPC cluster LUMI will further increase the resources available at CSC and the possibilities for ALICE and CMS to make use of larger amounts of CPU.
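Keeping these single service points, the VOBOX and the dCache Storage Element, reachable is operationally important. The following minimal Python sketch illustrates a basic reachability check for such endpoints; all host names and port numbers are hypothetical placeholders, not the actual HIP or NDGF endpoints.

    # Minimal reachability check for the grid service endpoints described above.
    # Host names and ports are hypothetical placeholders.
    import socket

    ENDPOINTS = {
        "ALICE VOBOX (placeholder)": ("alice-vobox.example.fi", 8084),
        "NDGF dCache door (placeholder)": ("dcache-door.example.org", 2811),
    }

    def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, (host, port) in ENDPOINTS.items():
        status = "up" if is_reachable(host, port) else "unreachable"
        print(f"{name}: {host}:{port} is {status}")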
The most important CMS Tier-2 services are end-user data analysis on the grid and Monte Carlo simulation production, which need to run uninterrupted while migrating to new hardware, software and/or new services. CMS uses an approach in which the set of services needed for a working Tier-2 site is distributed over the available resources. These services run partly in Kumpula and partly at CSC in Espoo. CMS runs most jobs in Singularity containers, which allows easier usage of Linux resources other than Scientific Linux 6 or CERN CentOS 7. Using containers gives better I/O performance than fully virtualized services. CentOS 8 was released in September 2019, so alcyone and at some point kale might be upgraded to that distribution, and Singularity will then ease the transition for CMS jobs. The new CSC environment has a container-based service called Rahti, which should be developed and configured to support CMS jobs. This can be done by finalizing and further developing the Ansible FGCI scripts written for kale-cms.
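As an illustration of the container approach, the following Python sketch runs a payload inside a Singularity image with the standard singularity exec command line; the image path under /cvmfs and the payload command are assumptions for illustration only, not the actual CMS production setup.

    # Sketch: run a payload inside a Singularity container with CVMFS bound in.
    # The image path and payload command are illustrative assumptions.
    import subprocess

    IMAGE = "/cvmfs/unpacked.cern.ch/some/cc7/image"   # hypothetical image path
    PAYLOAD = "cat /etc/os-release"                    # placeholder for a CMSSW job

    cmd = [
        "singularity", "exec",
        "--bind", "/cvmfs",      # make CVMFS visible inside the container
        IMAGE,
        "bash", "-lc", PAYLOAD,
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print("container job failed:", result.stderr)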
The Academy of Finland FIRI funding will be used for new disk storage for ALICE and CMS, as well as tape for ALICE, which needs to be procured, installed, configured, tested and taken into production use. When the present disk system is retired, it could be transported to Kumpula either for development and test work or for use as part of the University of Helsinki Lustre HPC filesystem.
The CMS services in Kumpula (Frontier, PhEDEx, BDII, Argus) are running on very old virtual host servers. When PhEDEx is replaced by the central Rucio service, the needs of the other services should be reviewed and the services migrated to newer hardware or to centrally hosted virtual machines.
The University of Helsinki has a total bandwidth of 2 × 10 Gb/s to FUNET, and the CMS disk traffic between Kumpula and Espoo now runs over the shared UH link. The network performance when using both kale in Viikki and alcyone in Kumpula should be studied.
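For a rough sense of scale, the snippet below estimates the time needed to move a dataset between Kumpula and Espoo over the shared link; the dataset size and the assumed usable fraction of the nominal 20 Gb/s are illustrative assumptions only.

    # Back-of-the-envelope transfer time over the shared 2 x 10 Gb/s UH link.
    # Dataset size and achievable utilisation are illustrative assumptions.
    link_gbps = 2 * 10      # nominal aggregate bandwidth, Gb/s
    utilisation = 0.5       # assume half the link is usable for CMS traffic
    dataset_tb = 100        # hypothetical dataset size in terabytes

    effective_gbps = link_gbps * utilisation
    dataset_gbit = dataset_tb * 1000 * 8    # terabytes -> gigabits
    seconds = dataset_gbit / effective_gbps
    print(f"~{seconds / 3600:.1f} hours to move {dataset_tb} TB at {effective_gbps:.0f} Gb/s")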
Memory savings can be achieved by running multi-core jobs instead of only simultaneous single-core jobs. This requires changes in the local resources, the middleware and the scheduling of jobs. Multi-core CMS jobs are becoming increasingly important and should also be supported at HIP.
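The memory argument can be illustrated with a toy calculation: a multi-core job shares read-only data such as geometry and conditions between its threads, while independent single-core jobs each load their own copy. The numbers below are illustrative assumptions, not measured CMSSW figures.

    # Toy illustration of why multi-core jobs reduce the total memory footprint:
    # threads share one copy of read-only data, single-core jobs do not.
    # All numbers are illustrative assumptions, not measured CMSSW values.
    cores = 8
    shared_mb = 1500    # read-only data loaded once per job (assumed)
    per_core_mb = 500   # per-thread / per-process working set (assumed)

    single_core_total = cores * (shared_mb + per_core_mb)
    multi_core_total = shared_mb + cores * per_core_mb

    print(f"{cores} single-core jobs: {single_core_total} MB")
    print(f"one {cores}-core job:    {multi_core_total} MB")
    print(f"saving:                  {single_core_total - multi_core_total} MB")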
Advanced Resource Connector (ARC) middleware usage in CMS, piloted by HIP, has spread to Estonia, the UK, France, Switzerland and Germany. The ARC Compute Element is now based on the web services BES standard, and the GridFTP interface will be phased out at some point. Job submission through glideinWMS depends on the Condor-G GridFTP ARC interface, so collaboration is needed with the Condor and ARC teams to ensure that jobs can be submitted from Condor-G to ARC resources using the BES interface well before the ARC GridFTP interface is taken out of use.
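For reference, the sketch below shows what a Condor-G grid-universe submit description targeting an ARC Compute Element through the current GridFTP-based ("nordugrid") route could look like, built with the htcondor Python bindings; the CE host name and payload script are hypothetical placeholders and the description is illustrative only.

    # Sketch of a Condor-G grid-universe submit description for an ARC CE using
    # the GridFTP-based "nordugrid" route.  The CE host and payload script are
    # hypothetical placeholders; this only builds and prints the description.
    import htcondor

    sub = htcondor.Submit({
        "universe": "grid",
        "grid_resource": "nordugrid arc-ce.example.org",  # hypothetical ARC CE
        "executable": "payload.sh",                       # hypothetical payload
        "output": "job.out",
        "error": "job.err",
        "log": "job.log",
    })

    print(sub)  # actual submission would hand this description to the local schedd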
Usage of the other Finnish Grid and Cloud Infrastructure (FGCI) clusters could be studied, since CMSSW can be made available through CernVM-FS and the data can be read remotely with XRootD. This could be one potential source of additional CPU power to partly meet the challenge of the increasing LHC luminosity. Using the FGI alcyone node with 1 TB of RAM or the FGCI kale node with 1.5 TB of RAM for a large CMS AOD sample could be interesting from the CMSSW performance point of view, and might also be useful for some calibration workflows. Documentation and recovery procedures for different failure scenarios need to be further developed and tested. The HIP ARC clusters have a working fair-share configuration, but it might still be improved in high-load situations.
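As an illustration of the remote-read model, the sketch below opens a file through an XRootD URL with PyROOT instead of copying it locally; the redirector host and file path are hypothetical placeholders, and a valid grid proxy plus a PyROOT installation (for example from CVMFS) are assumed.

    # Sketch: read a remote CMS file over XRootD instead of copying it locally.
    # The redirector and file path are hypothetical placeholders; a valid grid
    # proxy and a PyROOT installation are assumed.
    import ROOT

    url = "root://xrootd-redirector.example.org//store/path/to/file.root"

    f = ROOT.TFile.Open(url)        # TFile.Open understands root:// URLs
    if f and not f.IsZombie():
        print("opened", url)
        f.ls()                      # list the top-level contents
        f.Close()
    else:
        print("could not open", url)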
The long-term CMS computing challenge is to develop the computing model, software, grid middleware and cloud services needed to meet the significant increase in data produced by the luminosity increase after LS3.