
Report: Summary of Session I
René Brun
ACAT05
27 May 2005
Outline
19 presentations
Data Analysis, Data Acquisition and Tools : 6
GRID Deployment : 4
Applications on the GRID : 5
High Speed Computing : 4
Data Analysis, Acquisition, Tools
• Evolution of the BaBar configuration database design
• DAQ software for the SND detector
• Interactive analysis environment of Unified Accelerator Libraries
• DaqProVis, a toolkit for acquisition, analysis and visualisation
• The Graphics Editor in ROOT
• Parallel interactive and batch HEP data analysis with PROOF
Evolution of the Configuration Database
Design
Andrei Salnikov, SLAC
For BaBar Computing Group
ACAT05 – DESY, Zeuthen
BaBar database migration
• BaBar was using the Objectivity/DB ODBMS for many of its databases
• About two years ago BaBar started migrating the event store from Objectivity to ROOT, which was a success and an improvement
• There is no reason to keep pricey Objectivity only because of the “secondary” databases
• A migration effort started in 2004 for the conditions, configuration, prompt reconstruction, and ambient databases
Configuration database API
• Main problem of the old database: the API exposed too much of the implementation technology
  • persistent objects, handles, class names, etc.
• The API has to change, but we don’t want to make the same mistakes again (new mistakes are more interesting)
• Pure transient-level abstract API, independent of any specific implementation technology
  • always make abstract APIs to avoid problems in the future (this may be hard and need a few iterations)
  • client code should be free from any specific database implementation details
  • early prototyping could answer a lot of questions, but five years of experience count too
  • use different implementations for clients with different requirements
• The implementation would benefit from features currently missing in C++: reflection, introspection (or from a completely new language)
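The "pure transient-level abstract API" idea can be sketched as follows. This is not BaBar's actual interface; all class and function names below are invented for illustration: clients see only transient configuration objects, and the storage technology hides behind a factory.

    // Hypothetical illustration of a transient-level abstract API: nothing the
    // client sees refers to Objectivity, ROOT, or any other storage technology.
    #include <memory>
    #include <string>

    class Configuration {                       // pure transient object
    public:
       virtual ~Configuration() = default;
       virtual std::string value(const std::string &key) const = 0;
    };

    class ConfigDatabase {                      // abstract access interface
    public:
       virtual ~ConfigDatabase() = default;
       virtual std::unique_ptr<Configuration>
       get(const std::string &configKey) const = 0;
    };

    // The concrete backend is chosen behind a factory, so client code does not
    // change when the storage technology is migrated (e.g. Objectivity -> ROOT).
    std::unique_ptr<ConfigDatabase> makeConfigDatabase(const std::string &backend);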
DAQ software for SND detector
Budker Institute of Nuclear Physics, Novosibirsk
M. Achasov, A. Bogdanchikov, A. Kim, A. Korol
Main data flow
[Data-flow diagram: readout and event building (1 kHz, 4 KB events), event packing (4 KB -> 1 KB), event filtering (1 kHz -> 100 Hz), storage.]

Expected rates:
• Event fragments: 4 MB/s are read from the IO processors over Ethernet
• Event building: 4 MB/s
• Event packing: 1 MB/s
• Event filtering (90% screening): 100 kB/s
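The quoted throughputs are simply rate times event size; a trivial back-of-the-envelope check (values taken from the slide, the stage ordering in the calculation is an assumption):

    // Back-of-the-envelope check of the SND data-flow figures quoted above.
    #include <cstdio>

    int main()
    {
       const double rate_hz   = 1000.0;  // event rate delivered by the readout
       const double raw_kb    = 4.0;     // raw event size
       const double packed_kb = 1.0;     // event size after packing
       const double accept    = 0.10;    // the filter keeps ~10% of the events

       std::printf("event building: %.0f kB/s\n", rate_hz * raw_kb);             // 4 MB/s
       std::printf("event packing:  %.0f kB/s\n", rate_hz * packed_kb);           // 1 MB/s
       std::printf("to storage:     %.0f kB/s\n", rate_hz * accept * packed_kb);  // 100 kB/s
       return 0;
    }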
DAQ architecture
[DAQ architecture diagram: detector front-end electronics (KLUKVA modules x12, x16; CAMAC crates), readout & event building, TLT computers, buffer and database, filtered events passed to storage, backup and off-line, plus visualization, calibration process and system support.]
Interactive Analysis Environment of
Unified Accelerator Libraries
V. Fine, N. Malitsky, R. Talman
Abstract
Unified Accelerator Libraries (UAL, http://www.ual.bnl.gov) software is an open accelerator simulation environment addressing a broad spectrum of accelerator tasks, ranging from online-oriented efficient models to full-scale realistic beam dynamics studies. The paper introduces a new package integrating UAL simulation algorithms with a Qt-based Graphical User Interface and an open collection of analysis and visualization components. The primary user application is implemented as an interactive and configurable Accelerator Physics Player whose extensibility is provided by a plug-in architecture. Its interface to data analysis and visualization modules is based on the Qt layer (http://root.bnl.gov) developed and supported by the STAR experiment. The present version embodies the ROOT (http://root.cern.ch) data analysis framework and the Coin3D (http://www.coin3d.org) graphics library.
Accelerator Physics Player
An open collection of viewers, an open collection of algorithms. The player is instantiated and run as a standard Qt application:

    UAL::USPAS::BasicPlayer* player = new UAL::USPAS::BasicPlayer();
    player->setShell(&shell);
    qApp.setMainWidget(player);
    player->show();
    qApp.exec();
Examples of the Accelerator-Specific Viewers
• Turn-by-turn BPM data (based on ROOT TH2F or TGraph)
• Twiss plots (based on ROOT TGraph)
• Bunch 3D distributions (based on Coin3D)
• Bunch 2D distributions (based on ROOT TH2F)
Parallel Interactive and Batch
HEP-Data Analysis
with PROOF
Maarten Ballintijn*, Marek Biskup**,
Rene Brun**, Philippe Canal***,
Derek Feichtinger****, Gerardo Ganis**,
Guenter Kickinger**, Andreas Peters**,
Fons Rademakers**
* - MIT
** - CERN
*** - FNAL
**** - PSI
ROOT Analysis Model
The standard model:
 Files analyzed on a local computer
 Remote data accessed via a remote file server (rootd/xrootd)
[Diagram: the client reads a local file directly, or a remote file through a rootd/xrootd server (dCache, Castor, RFIO, Chirp).]
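In this model the analysis code itself does not change between the two cases; only the file URL does. A minimal sketch (file, tree and branch names are placeholders, not from the talk):

    // Open a local file, or a remote one served by rootd/xrootd, with the same code.
    #include "TFile.h"
    #include "TTree.h"

    void analyse(const char *url)
    {
       TFile *file = TFile::Open(url);     // local path or e.g. root://host//path/file.root
       if (!file || file->IsZombie()) return;
       TTree *tree = nullptr;
       file->GetObject("events", tree);    // hypothetical tree name
       if (tree) tree->Draw("pt");         // hypothetical branch name
       delete file;
    }

    // analyse("data.root");                               // local file
    // analyse("root://server.example.org//data.root");    // remote file via xrootd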
PROOF Basic Architecture
Single-cluster mode:
 The master divides the work among the slaves
 After the processing finishes, it merges the results (histograms, scatter plots)
 and returns the result to the client
[Diagram: the client sends commands and scripts to the master; the master steers the slaves, which read the files; histograms and plots flow back to the client.]
PROOF and Selectors
 The selector code is shipped to each slave, where SlaveBegin(), Init(), Process() and SlaveTerminate() are executed
 Each slave is initialized; many trees are processed in parallel
 No user control of the entry loop!
 The same code also works without PROOF.
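A minimal TSelector skeleton of the kind PROOF processes; histogram, tree and branch names here are invented for illustration (in practice such a class is generated with TTree::MakeSelector):

    // Sketch of a TSelector skeleton as used with PROOF.
    #include "TSelector.h"
    #include "TTree.h"
    #include "TH1F.h"

    class MySelector : public TSelector {
    public:
       TTree  *fChain = nullptr;   // tree being processed on this worker
       TH1F   *fHist  = nullptr;   // partial result, merged by the master
       Float_t fPt    = 0;

       Int_t  Version() const override { return 2; }    // use the Process() API
       void   Init(TTree *tree) override {              // called for each new tree
          fChain = tree;
          fChain->SetBranchAddress("pt", &fPt);         // hypothetical branch
       }
       void   SlaveBegin(TTree *) override {            // once per slave
          fHist = new TH1F("hpt", "p_{T}", 100, 0., 100.);
          fOutput->Add(fHist);                          // shipped back and merged
       }
       Bool_t Process(Long64_t entry) override {        // once per entry; the loop
          fChain->GetEntry(entry);                      // itself is driven by PROOF
          fHist->Fill(fPt);
          return kTRUE;
       }
       void   SlaveTerminate() override {}              // once per slave, at the end
       void   Terminate() override {}                   // on the client, after merging
    };

The same selector can be run locally through TChain::Process("MySelector.C+") or handed to a PROOF session, which is what "the same code also works without PROOF" refers to.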
Analysis session snapshot
What we are implementing:

Monday at 10h15, ROOT session on my laptop:
 AQ1: a 1 s query produces a local histogram
 AQ2: a 10 min query submitted to PROOF1
 AQ3–AQ7: short queries
 AQ8: a 10 h query submitted to PROOF2

Monday at 16h25, ROOT session on my laptop:
 BQ1: browse results of AQ2
 BQ2: browse temporary results of AQ8
 BQ3–BQ6: submit 4 ten-minute queries to PROOF1

Wednesday at 8h40, session on any web browser:
 CQ1: browse results of AQ8 and BQ3–BQ6
ROOT Graphics Editor
by Ilka Antcheva
The ROOT graphics editor can be:
• Embedded – connected only to the canvas in the application window
• Global – has its own application window and can be connected to any canvas created in a ROOT session
Focus on Users
• Novices (for a short time)
  • Theoretical understanding, no practical experience with ROOT
  • Impatient with learning concepts; patient with performing tasks
• Advanced beginners (many people remain at this level)
  • Focus on a few tasks and learn more on a need-to-do basis
  • Perform several given tasks well
• Competent performers (fewer than the previous class)
  • Know and perform complex tasks that require coordinated actions
  • Interested in solving problems and tracking down errors
• Experts (identified by others)
  • Able to find solutions within complex functionality
  • Interested in the theories behind the design
  • Interested in interacting with other expert systems
DaqProVis
M. Morhac
• DaqProVis is a toolkit for acquisition, interactive analysis, processing and visualization of multidimensional data
• Basic features
  • DaqProVis is well suited for interactive analysis of multiparameter data from small and medium-sized experiments in nuclear physics.
  • The data acquisition part of the system allows one to acquire multiparameter events either directly from the experiment or from a list file, i.e. the system can work in either on-line or off-line acquisition mode.
  • In on-line acquisition mode, events can be taken directly from CAMAC crates or from a VME system that cooperates with DaqProVis in client-server mode.
  • In off-line acquisition mode the system can analyze event data even from big experiments, e.g. Gammasphere.
  • Event data can also be read from another DaqProVis system. The capability of DaqProVis to work simultaneously as both client and server makes it possible to build remote as well as distributed nuclear data acquisition, processing and visualization systems, and thus to create multilevel configurations.
DaqProVis (Visualisation)
DaqProVis (continued)
• DaqProVis and ROOT teams are already cooperating.
• Agreement during the workshop to extend this cooperation
GRID deployment
• Towards the operation of the Italian Tier-1 for CMS: Lessons
learned from the CMS Data Challenge
• GRID technology in production at DESY
• Grid middleware Configuration at the KIPT CMS Linux Cluster
• Storage resources management and access at Tier1 CNAF
Towards the operations of
the Italian Tier-1 for CMS:
lessons learned from the CMS Data Challenge
D. Bonacorsi
(on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)
ACAT 2005
X Int. Work. on Advanced Computing & Analysis Techniques in Physics Research
May 22nd-27th, 2005 - DESY, Zeuthen, Germany
DC04 outcome (grand-summary + focus on INFN T1)
• Reconstruction/data-transfer/analysis may run at 25 Hz
• Automatic registration and distribution of data; key role of the TMDB (it was the embryonic PhEDEx!)
• Support a (reasonable) variety of different data transfer tools and set-ups
  • Tier-1s showed different performances, related to operational choices
  • SRB, LCG Replica Manager and SRM investigated: see the CHEP04 talk
• Register all data and metadata (POOL) in a world-readable catalogue
  • RLS: good as a global file catalogue, bad as a global metadata catalogue
• Analyze the reconstructed data at the Tier-1s as the data arrive
  • ~15k jobs submitted
  • the time window between reco-data availability and the start of analysis jobs can be reasonably short (i.e. 20 min)
  • LCG components: dedicated BDII+RB; UIs, CEs+WNs at CNAF and PIC
  • INFN T1: good performance of the LCG-2 chain (PIC T1 also)
• Real-time analysis at Tier-2s was demonstrated to be possible
• Reduce the number of files (i.e. increase <#events>/<#files>)
• Make more efficient use of bandwidth; reduce the overhead of commands
• Address the scalability of MSS systems (!)
Learn from DC04 lessons…
• Some general considerations may apply:
  • although a DC is experiment-specific, maybe its conclusions are not
  • an “experiment-specific” problem is better addressed if conceived as a “shared” one in a shared Tier-1
  • an experiment DC just provides hints; real work gives insight
   crucial role of the experiments at the Tier-1
• Find weaknesses of the CASTOR MSS system in particular operating conditions
• Stress-test the new LSF farm with official CMS production jobs
• Test DNS-based load balancing by serving data for production and/or analysis from CMS disk servers
• Test new components, newly installed/upgraded Grid tools, etc.
• Find bottlenecks and scalability problems in DB services
• Give feedback on monitoring and accounting activities
• …
PhEDEx at INFN
• INFN-CNAF is a T1 ‘node’ in PhEDEx
  • CMS DC04 experience was crucial to start up PhEDEx in INFN
  • the CNAF node has been operational since the beginning
• First phase (Q3/4 2004):
  • agent code development + focus on operations: T0 → T1 transfers
  • >1 TB/day T0 → T1 demonstrated feasible
  • … but the aim is not to achieve peaks, but to sustain them in normal operations
• Second phase (Q1 2005): PhEDEx deployment in INFN at Tier-n, n>1:
  • “distributed” topology scenario
  • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
  • already operational at Legnaro, Pisa, Bari, Bologna
  • An example, data flow to T2s in daily operations (here a test with ~2000 files, 90 GB, with no optimization): ~450 Mbps CNAF T1 → LNL-T2, ~205 Mbps CNAF T1 → Pisa-T2
• Third phase (Q>1 2005):
   many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
Storage resources management and access at
TIER1 CNAF
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo
on behalf of the INFN TIER1 staff
[email protected]
ACAT 2005, May 22-27 2005
DESY Zeuthen, Germany
TIER1 INFN CNAF Storage
[Hardware overview diagram. Main elements: HSM (~400 TB): STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 2 (4) STK 9940B drives and CASTOR HSM servers, plus an STK180 library with 100 LTO-1 tapes (10 TB native) and a W2003 server with LEGATO Networker for backup. NAS (~20 TB): PROCOM 3600 FC NAS2 (9000 GB) and NAS3 (4700 GB), NAS1/NAS4 with 3ware IDE (1800+3200 GB), AXUS BROWIE (~2200 GB, 2 FC interfaces). SAN 1 (~200 TB): IBM FastT900 (DS 4500), 3/4 x 50000 GB, 4 FC interfaces; Infortrend A16F-R1A2-M1 (4 x 3200 GB SATA) and A16F-R1211-M2 + JBOD (5 x 6400 GB SATA); 2 Brocade Silkworm 3900 32-port FC switches (H.A.). SAN 2 (~40 TB): STK BladeStore (~25000 GB, 4 FC interfaces); 2 Gadzoox Slingshot 4218 18-port FC switches. Disk servers with Qlogic FC HBA 2340 serve 100-1000 Linux SL 3.0 clients via NFS, RFIO and GridFTP over the WAN or the TIER1 LAN.]
CASTOR HSM
[CASTOR HSM layout: point-to-point 2 Gb/s FC connections; 8 tape servers (Linux RH AS 3.0, Qlogic 2300 HBAs); STK L5500 library with 2000+3500 mixed slots, 6 LTO-2 drives (20-30 MB/s) and 2 9940B drives (25-30 MB/s); 1300 LTO-2 and 650 9940B cartridges (200 GB native each); ACSLS 7.0 running on a Sun Blade v100 (Solaris 9.0, 2 internal IDE disks in software RAID-0); 1 central CASTOR (CERN) services server and 1 ORACLE 9i rel. 2 DB server (RH AS 3.0); 6 stagers with disk servers and 8 or more rfio disk servers (RH AS 3.0), ~15 TB local staging area (min. 20 TB); SAN 1 and SAN 2 attached with full-redundancy 2 Gb/s FC connections (dual-controller hardware and Qlogic SANsurfer Path Failover software).]

Staging area and tape pool per experiment:

  EXPERIMENT          Staging area (TB)   Tape pool (TB native)
  ALICE                      8                    12
  ATLAS                      6                    20
  CMS                        2                    15
  LHCb                      18                    30
  BABAR, AMS + oth.          2                     4
DISK access (2)
We have different protocols in production for accessing the disk storage. On our disk servers and Grid SE front-ends we currently have:
1. NFS on a local filesystem. ADV: easy client implementation, good compatibility, and possibility of failover (RH 3.0). DIS: poor performance scalability for a high number of accesses (1 client: 30 MB/s; 100 clients: 15 MB/s throughput).
2. RFIO on a local filesystem. ADV: good performance, compatibility with Grid tools, and possibility of failover. DIS: no scalability of front-ends for a single filesystem, no possibility of load balancing.
3. Grid SE gridftp/rfio over GPFS (CMS, CDF). ADV: separation between the GPFS servers (accessing the disks) and the SE GPFS clients; load balancing and HA on the GPFS servers, with the possibility to implement the same on the Grid SE services (see next slide). DIS: GPFS layer requirements on OS and certified hardware for support.
4. Xrootd (BABAR). ADV: good performance. DIS: no possibility of load balancing for the single filesystem backends; not Grid-compliant (at present...).
NOTE: IBM GPFS 2.2 is a clustered filesystem, so many front-ends (i.e. gridftp or rfio servers) can access the same filesystem simultaneously. It also allows bigger filesystem sizes (we use 8-12 TB).
Generic Benchmark
(here shown for 1 GB files)
Throughput in MB/s vs. number of simultaneous client processes:

                         WRITE (MB/s)                     READ (MB/s)
  Clients:           1    5   10   50  120           1    5   10   50  120
  GPFS 2.3.0-1
    native         114  160  151  147  147          85  301  301  305  305
    NFS            102  171  171  159  158         114  320  366  322  292
    RFIO            79  171  158  166  166          79  320  301  320  321
  Lustre 1.4.1
    native         102  512  512  488  478          73  366  640  453  403
    RFIO            93  301  320  284  281          68  269  269  314  349

• Numbers are reproducible with small fluctuations
• Lustre tests with NFS export not yet performed
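For context, a generic throughput test of this kind can be as simple as timing sequential writes and reads of a 1 GB file; the sketch below is not the authors' benchmark code (file name and block size are arbitrary), and several instances would be run in parallel to mimic N simultaneous client processes.

    // Minimal sketch of a 1 GB sequential write/read throughput test.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv)
    {
       const char  *path    = (argc > 1) ? argv[1] : "bench.dat";
       const size_t block   = 1 << 20;                  // 1 MiB blocks
       const size_t nblocks = 1024;                     // ~1 GB total
       std::vector<char> buf(block, 'x');

       auto t0 = std::chrono::steady_clock::now();
       std::FILE *f = std::fopen(path, "wb");
       if (!f) return 1;
       for (size_t i = 0; i < nblocks; ++i) std::fwrite(buf.data(), 1, block, f);
       std::fclose(f);                                  // page cache may inflate numbers
       auto t1 = std::chrono::steady_clock::now();

       f = std::fopen(path, "rb");
       if (!f) return 1;
       for (size_t i = 0; i < nblocks; ++i) std::fread(buf.data(), 1, block, f);
       std::fclose(f);
       auto t2 = std::chrono::steady_clock::now();

       const double mb = static_cast<double>(nblocks);  // MiB written/read (~MB)
       std::printf("write: %.1f MB/s\n", mb / std::chrono::duration<double>(t1 - t0).count());
       std::printf("read:  %.1f MB/s\n", mb / std::chrono::duration<double>(t2 - t1).count());
       std::remove(path);
       return 0;
    }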
Grid Technology in Production
at DESY
Andreas Gellrich*
DESY
ACAT 2005
24 May 2005
*http://www.desy.de/~gellrich/
Grid @ DESY
• With the HERA-II luminosity upgrade, the demand for MC production increased rapidly, while the outside collaborators moved their computing resources towards LCG
• The ILC group plans to use Grids for its computing needs
• The LQCD group is developing a Data Grid to exchange data
• DESY is considering participation in LHC experiments
 EGEE and D-GRID
 dCache is a DESY / FNAL development
 An LCG-2 Grid infrastructure has been in operation since spring 2004
Grid Infrastructure @ DESY …
• DESY installed (SL3.04, Quattor, yaim) and operates a complete, independent Grid infrastructure which provides generic (non-experiment-specific) Grid services to all experiments and groups
• The DESY Production Grid is based on LCG-2_4_0 and includes:
   Resource Broker (RB), Information Index (BDII), Proxy (PXY)
   Replica Location Services (RLS)
   In total 24 + 17 WNs (48 + 34 = 82 CPUs)
   dCache-based SE with access to the entire DESY data space
• VO management for the HERA experiments (‘hone’, ‘herab’, ‘hermes’, ‘szeu’), LQCD (‘ildg’), ILC (‘ilc’, ‘calice’), and astro-particle physics (‘baikal’, ‘icecube’)
• Certification services for DESY users in cooperation with GridKa
Grid Middleware Configuration
at the KIPT CMS Linux Cluster
S. Zub, L. Levchuk, P. Sorokin, D. Soroka
Kharkov Institute of Physics & Technology, 61108 Kharkov, Ukraine
http://www.kipt.kharkov.ua/~cms
[email protected]
What is our specificity?
 Small PC farm (KCC)
 Small scientific group of 4 physicists, combining their work with system administration
 Oriented towards CMS tasks
 No commercial software installed
 Security provided by the group itself
 Narrow-bandwidth communication channel
 Limited traffic
Summary
• The enormous data flow expected in the LHC experiments forces the HEP community to resort to Grid technology
• The KCC is a specialized PC farm constructed at the NSC KIPT for computer simulations within the CMS physics program and for preparation for CMS data analysis
• Further development of the KCC is planned, with a considerable increase of its capacity and deeper integration into the LHC Grid (LCG) structures
• Configuration of the LCG middleware can be troublesome (especially at small farms with poor internet connections), since this software is neither universal nor “complete”, and one has to resort to special tips
• Scripts have been developed that facilitate the installation procedure at a small PC farm with narrow internet bandwidth
Applications on the Grid
• The CMS analysis chain in a distributed environment
• Monte Carlo mass production for ZEUS on the Grid
• Metadata services on the Grid
• Performance comparison of the LCG2 and gLite File Catalogues
• Data Grids for Lattice QCD
The CMS analysis chain in a
distributed environment
Nicola De Filippis
on behalf of the
CMS collaboration
ACAT 2005
DESY, Zeuthen, Germany 22nd –
27th May, 2005
The CMS analysis tools
Overview:
• Data management
  • Data transfer service: PhEDEx
  • Data validation: ValidationTools
  • Data publication service: RefDB/PubDB
• Analysis strategy
  • Distributed software installation: XCMSI
  • Analysis job submission tool: CRAB
• Job monitoring
  • System monitoring: BOSS
  • Application job monitoring: JAM
The end-user analysis workflow
 The user provides:
  • the dataset (runs, #events, ...)
  • private code
 CRAB discovers the data and the sites hosting them by querying RefDB/PubDB
 CRAB prepares, splits and submits jobs to the Resource Broker
 The RB sends jobs to sites hosting the data, provided the CMS software was installed
 CRAB automatically retrieves the output files of the job
[Diagram: the user works from a UI running CRAB; CRAB queries the dataset catalogue (PubDB/RefDB) and submits to the Workload Management System / Resource Broker (RB), which dispatches jobs to a Computing Element and its Worker nodes, with the CMS software installed via XCMSI and data on the site's Storage Element.]
Conclusions
 The first CMS working prototype for distributed user analysis is available and used by real users
 PhEDEx, PubDB, ValidationTools, XCMSI, CRAB, BOSS and JAM are under development, deployment and in production at many sites
 CMS is using the Grid infrastructure for physics analyses and Monte Carlo production
 tens of users, ~10 million analysed events, ~10000 jobs submitted
 CMS is designing a new architecture for the analysis workflow
Metadata Services on the GRID
Nuno Santos
ACAT’05
May 25th, 2005
Metadata on the GRID
• Metadata is data about data
• Metadata on the GRID:
  • mainly information about files
  • other information necessary for running jobs
  • usually lives in databases
• A simple interface for metadata access is needed
• Advantages:
  • easier to use by clients – no SQL, only metadata concepts
  • common interface – clients don’t have to reinvent the wheel
• Must be integrated with the File Catalogue
• Also suitable for storing information about other resources
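To illustrate the point about exposing "only metadata concepts, no SQL", here is a hypothetical minimal client interface; the names are invented and this is not the ARDA/gLite API:

    // Hypothetical sketch of a metadata interface exposing only metadata
    // concepts (entries, attributes, queries), with no SQL visible to clients.
    #include <map>
    #include <string>
    #include <vector>

    class MetadataCatalog {
    public:
       virtual ~MetadataCatalog() = default;

       // attach key/value attributes to an entry (e.g. a logical file name)
       virtual void setAttributes(const std::string &entry,
                                  const std::map<std::string, std::string> &attrs) = 0;

       // read back the attributes of a single entry
       virtual std::map<std::string, std::string>
       getAttributes(const std::string &entry) const = 0;

       // find entries matching a simple predicate string, e.g.
       // "run=1234 and quality='good'"; translating this to SQL (or to a
       // filesystem lookup) is the backend's business, not the client's.
       virtual std::vector<std::string> find(const std::string &predicate) const = 0;
    };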
ARDA Implementation
• Backends
  • currently: Oracle, PostgreSQL, SQLite
• Two frontends
  • TCP streaming – chosen for performance
  • SOAP – a formal requirement of EGEE; allows comparing SOAP with TCP streaming
• Also implemented as a standalone Python library
  • data stored on the filesystem
[Diagram: clients reach the metadata server via SOAP or TCP streaming; the server talks to the Oracle, PostgreSQL and SQLite backends; alternatively a Python interpreter uses the Metadata Python API directly on top of the filesystem.]
SOAP Toolkits performance
• Test of the communication performance (1000 pings)
  • no work done on the backend
  • switched 100 Mbit/s LAN
• Language comparison
  • TCP-S shows similar performance in all languages
  • SOAP performance varies strongly with the toolkit
• Protocol comparison
  • keep-alive improves performance significantly
  • in Java and Python, SOAP is several times slower than TCP-S
[Bar chart: execution time in seconds for 1000 pings, for TCP-S and gSOAP with and without keep-alive, in C++ (gSOAP), Java (Axis) and Python (ZSI).]
High speed Computing
• InfiniBand
• Analysis of SCTP and TCP based communication in high-speed clusters
• The apeNEXT project
• Optimisation of Lattice QCD codes for the Opteron processor
Forschungszentrum Karlsruhe
in der Helmholtz-Gemeinschaft
InfiniBand – Experiences at
Forschungszentrum Karlsruhe
A. Heiss, U. Schwickerath
Credits: Inge Bischoff-Gauss, Marc García Martí, Bruno Hoeft, Carsten Urbach
• InfiniBand overview
• Hardware setup at IWR
• HPC applications: MPI performance, lattice QCD, LM
• HTC applications: rfio, xrootd
Lattice QCD benchmark: GE vs. InfiniBand
• Memory- and communication-intensive application
• Benchmark by C. Urbach; see also the CHEP04 talk given by A. Heiss
• Significant speedup by using InfiniBand
• Thanks to Carsten Urbach, FU Berlin and DESY Zeuthen
RFIO/IB point-to-point file transfers (64 bit): PCI-X and PCI-Express throughput
Notes:
• best results with PCI-Express: > 800 MB/s raw transfer speed, > 400 MB/s file transfer speed
• RFIO/IB: see ACAT03, NIM A 534 (2004) 130-134
• [plot legend: solid lines are file transfers (cache -> /dev/null), dashed lines network+protocol only]
• disclaimer on PPC64: not an official IBM product, technology prototype (see also slides 5 and 6)
Xrootd and InfiniBand: first preliminary results
Notes:
• IPoIB: dual Opteron V20z, Mellanox Gold drivers, SM on InfiniCon 9100, same nodes as for GE
• Native IB: proof-of-concept version based on Mellanox VAPI, using IB_SEND with dedicated send/recv buffers, same nodes as above
• 10GE: IBM xSeries 345 nodes, 32-bit Xeon, single CPU, 1 and 2 GB RAM, 2.66 GHz clock speed, Intel PRO/10GbE LR cards, used for long-distance tests
TCP vs. SCTP
in high-speed cluster environment
Miklos Kozlovszky
Budapest University of Technology and Economics (BUTE)
TCP vs. SCTP
Both:
• IPv4 & IPv6 compatible
• Reliable
• Connection oriented
• Offer acknowledged, error-free, non-duplicated transfer
• Almost the same flow and congestion control

  TCP                                   SCTP
  Byte-stream oriented                  Message oriented
  3-way handshake connection init       4-way handshake connection init (cookie)
  Old (more than 20 years)              Quite new (2000-)
  -                                     Multihoming
  -                                     Path-MTU discovery
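As a concrete illustration of "SCTP wants to behave like a next-generation TCP": on a system with SCTP support (e.g. Linux with the lksctp stack), a one-to-one style SCTP socket is created almost exactly like a TCP socket; only the protocol argument changes. A minimal sketch, not from the talk:

    // Minimal sketch: creating a TCP socket vs. a one-to-one style SCTP socket.
    // Requires a kernel/OS with SCTP support (e.g. Linux with lksctp).
    #include <cstdio>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
       // Classic TCP stream socket.
       int tcp_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

       // SCTP in one-to-one style: same socket API, different protocol; the
       // rest of the code (bind/listen/accept or connect) looks just like TCP.
       int sctp_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
       if (sctp_fd < 0)
          std::perror("SCTP not available on this system");

       if (tcp_fd  >= 0) close(tcp_fd);
       if (sctp_fd >= 0) close(sctp_fd);
       return 0;
    }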
Summary
• SCTP inherited all the “good features” of TCP
• SCTP wants to behave like a next-generation TCP
• It is more secure than TCP and has many attractive features (e.g. multihoming)
• Theoretically it can work better than TCP, but TCP is still faster (SCTP implementations are as yet “poor”)
• Well standardized, and can be useful for clusters
My Impressions
Concerns
• Only a small fraction of the Session I talks correspond to the original spirit of the AIHEP/ACAT Session I.
• In particular, many of the GRID talks about deployment and infrastructure should be given at CHEP, not here.
• The large LHC collaborations have their own “ACAT” a few times per year.
• The huge experiment software frameworks do not encourage cross-experiment discussion or tools.
• For the next ACAT, the key people involved in the big experiments should work together to encourage more talks or reviews.
Positive aspects
• ACAT continues to be a good opportunity to meet other cultures. Innovation may come from small groups or from non-HENP fields.
• Contacts (even sporadic) with Session III or the plenary talks are very beneficial, in particular for young people.
The Captain of Köpenick
• Question to the audience:
  Is Friedrich Wilhelm Voigt (the Captain of Köpenick) an ancestor of Voigt, the father of the Voigt function?