CCRC08 post-mortem LHCb activities at PIC G. Merino PIC, 19/06/2008.


1  CCRC08 post-mortem: LHCb activities at PIC (G. Merino, PIC, 19/06/2008)

2  LHCb Computing
– Main user analysis supported at CERN + 6 Tier-1s
– Tier-2s are essentially Monte Carlo production facilities

3  CCRC08: Planned tasks
May activities: maintain the equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency
– Raw data distribution from pit → T0 centre
– Raw data distribution from T0 → T1 centres: use of FTS, T1D0
– Reconstruction of raw data at CERN & T1 centres: RAW (T1D0) → rDST (T1D0)
– Stripping of data at CERN & T1 centres: RAW & rDST (T1D0) → DST (T1D1)
– Distribution of DST data to all other centres: use of FTS, T0D1 (except CERN, T1D1)

4  Activities across the sites
Planned breakdown of processing activities (CPU needs) prior to CCRC08:

  Site          Fraction (%)
  CERN          14
  FZK           11
  IN2P3         25
  CNAF           9
  NIKHEF/SARA   26
  PIC            4
  RAL           11
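As a quick consistency check, the planned per-site fractions should add up to 100%. A minimal sketch, using only the numbers from the table above:

```python
# Planned CCRC08 CPU shares per site (%), as listed on the slide.
planned_share = {
    "CERN": 14, "FZK": 11, "IN2P3": 25, "CNAF": 9,
    "NIKHEF/SARA": 26, "PIC": 4, "RAL": 11,
}

# The planned fractions cover the whole workload, so they must sum to 100.
total = sum(planned_share.values())
print(total)  # 100
```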

5  Tier-0 → Tier-1
FTS from CERN to Tier-1 centres
– Transfer of RAW only occurs once the data has migrated to tape & the checksum is verified
– Rate out of CERN ~35 MB/s averaged over the period
– Peak rate far in excess of the requirement
In smooth running, sites matched the LHCb requirements

6  Tier-0 → Tier-1

7  Tier-0 → Tier-1: incidents
– Issue with UK certificates
– CERN outage
– CERN SRM endpoint problems
– IN2P3 SRM endpoint restart
To first order, all transfers eventually succeeded (the plot shows the efficiency on the 1st attempt)

8  Reconstruction
Used SRM 2.2
– LHCb space tokens: LHCb_RAW (T1D0); LHCb_RDST (T1D0)
Data shares need to be preserved
– Important for resource planning
Input: 1 RAW file; output: 1 rDST file (1.6 GB)
Reduced the number of events per reconstruction job from 50k to 25k (job of ~12 hour duration on a 2.8 kSI2k machine)
– In order to fit within the available queues
– Need queues at all sites that match our processing time
– Alternative: reduce the file size!
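The halving of the job size follows from a simple linear scaling of the figures on the slide (25k events in ~12 hours on a 2.8 kSI2k machine). A minimal sketch, assuming CPU time scales linearly with the event count and inversely with machine power:

```python
# Reference point taken from the slide: 25k events ~ 12 h on a 2.8 kSI2k machine.
REF_EVENTS = 25_000
REF_HOURS = 12.0
REF_POWER_KSI2K = 2.8

# Cost of one event in kSI2k-hours (linear-scaling assumption).
COST_PER_EVENT = REF_HOURS * REF_POWER_KSI2K / REF_EVENTS

def job_hours(n_events: int, power_ksi2k: float) -> float:
    """Estimated wall-clock hours for a reconstruction job of n_events."""
    return n_events * COST_PER_EVENT / power_ksi2k

# The original 50k-event jobs would have needed ~24 h on the same machine,
# which is why they did not fit in the available batch queues.
print(round(job_hours(50_000, 2.8), 1))  # 24.0
```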

9  Reconstruction
After the data transfer the file should still be online, as the job is submitted immediately
– NOTE: in principle only LHCb has this requirement of "online reconstruction" → reco jobs will read the input data from the T1D0 write buffer
Just in case, LHCb pre-stages files (srm_bringonline) & then checks the status of the file (srm_ls) before submitting the pilot job, via GFAL
– Pre-staging should ensure access availability from cache
– Only issue was at NL-T1, with the reporting of the file status
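The pre-stage-then-poll pattern described above can be sketched as follows. This is a stand-alone illustration, not LHCb's code: the two SRM functions are stubs standing in for the srm_bringonline and srm_ls calls that were issued through GFAL, and the stubbed status responses are purely illustrative.

```python
import time

# Stub for srm_bringonline: request that the file be staged from tape to disk.
def srm_bringonline(surl: str) -> None:
    print(f"bring-online requested for {surl}")

_poll_count: dict[str, int] = {}

# Stub for an srm_ls status check: here it reports ONLINE after the second
# poll, purely to exercise the loop below.
def srm_ls_is_online(surl: str) -> bool:
    _poll_count[surl] = _poll_count.get(surl, 0) + 1
    return _poll_count[surl] >= 2

def wait_until_online(surl: str, poll_interval: float = 0.0, max_polls: int = 10) -> bool:
    """Pre-stage a file and poll its status before submitting the pilot job."""
    srm_bringonline(surl)
    for _ in range(max_polls):
        if srm_ls_is_online(surl):
            return True
        time.sleep(poll_interval)
    return False

# Hypothetical SURL, for illustration only.
print(wait_until_online("srm://example-t1/lhcb/raw/run1234.raw"))  # True
```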

10  Reconstruction
41.2k reconstruction jobs submitted; 27.6k jobs proceeded to the Done state (Done/Created ~67%)

  Site     Submitted jobs   Done jobs    Done/Sub
  NIKHEF   10.3k (26%)       2.3k (6%)    23%
  PIC       1.8k (4%)        1.6k (4%)    89%
  RAL       4.7k (11%)       3.5k (8%)    74%
  CERN      6.1k (14%)       5.3k (13%)   86%
  CNAF      3.9k (9%)        2.8k (7%)    72%
  GridKa    4.1k (11%)       3.1k (7%)    76%
  IN2P3    10.3k (25%)       6.1k (14%)   56%

11  Reconstruction
27.6k reconstruction jobs in the Done state
– 21.2k jobs processed 25k events (Done with 25k events / Done ~77%)
– 3.0k jobs failed to upload the rDST to the local SE (only 1 attempt before trying failover); Failover/25k events ~13%

  Site     25k events    Failed upload   Success/Created
  NIKHEF   1.2k (53%)    0.9k (70%)       4%
  PIC      1.6k (99%)    0.0k (0%)       89%
  RAL      3.1k (89%)    0.0k (1%)       68%
  CERN     5.2k (100%)   0.7k (14%)      76%
  CNAF     2.6k (95%)    0.0k (1%)       67%
  GridKa   3.0k (99%)    0.7k (22%)      58%
  IN2P3    5.1k (90%)    0.7k (14%)      43%
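The headline percentages quoted on these two slides follow directly from the job counts. A minimal recomputation, using only numbers from the slides:

```python
# Global job counts from the reconstruction slides (in thousands of jobs).
submitted = 41.2   # reconstruction jobs submitted
done = 27.6        # jobs that reached the Done state
done_25k = 21.2    # Done jobs that processed the full 25k events

print(round(100 * done / submitted))  # 67  (Done/Created ~67%)
print(round(100 * done_25k / done))   # 77  (~77% of Done jobs ran to 25k events)
```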

12  (figure slide)

13  Human error at PIC: a WN with a misconfigured network, 24-27 May, acted as a black hole (ticket-4386)

14  Reconstruction
CPU efficiency: ratio of CPU to wall-clock time of running jobs
– CNAF: more jobs than cores on a WN
– IN2P3 & RAL: problems reading the input data

15  Reconstruction
CPU efficiency: ratio of CPU to wall-clock time of running jobs
– PIC: the most CPU-efficient T1
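The efficiency metric on these two slides is simply CPU time over wall-clock time. A minimal sketch; the job times in the example are illustrative, not taken from the slides:

```python
def cpu_efficiency(cpu_time: float, wall_time: float) -> float:
    """CPU efficiency of a job: CPU time divided by wall-clock time.

    Close to 1.0 means the job was CPU-bound; a low value means it was
    mostly waiting, e.g. stalled reading input data (as seen at IN2P3 and
    RAL) or sharing a core with other jobs (as seen at CNAF).
    """
    if wall_time <= 0:
        raise ValueError("wall-clock time must be positive")
    return cpu_time / wall_time

# Illustrative numbers only: 10.5 h of CPU over a 12 h wall-clock run.
print(round(cpu_efficiency(cpu_time=10.5, wall_time=12.0), 3))  # 0.875
```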

16  dCache Observations
Official LCG recommendation: 1.8.0-15p3
LHCb ran smoothly at half of the T1 dCache sites
– PIC: OK, version 1.8.0-12p6 (dcap)
– GridKa: OK, version 1.8.0-15p2 (dcap)
– IN2P3: problematic, version 1.8.0-12p6 (gsidcap); seg faults, needed to ship a version of GFAL with the jobs to run; could explain the CGSI-gSOAP problem?
– NL-T1: problematic (gsidcap); many versions deployed during CCRC to solve a number of issues: 1.8.0-14 → 1.8.0-15p3 → 1.8.0-15p4
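Version strings like those above compare naturally once the numeric fields are split out. A small parsing sketch (the version strings are the ones listed on the slide; the parsing scheme is an assumption, not dCache's own ordering rules):

```python
import re

def parse_dcache_version(v: str) -> tuple:
    """Split a version string such as '1.8.0-15p3' into a sortable tuple.

    Grabs every run of digits: '1.8.0-15p3' -> (1, 8, 0, 15, 3), so plain
    tuple comparison orders releases and patch levels.
    """
    return tuple(int(x) for x in re.findall(r"\d+", v))

recommended = parse_dcache_version("1.8.0-15p3")
site_versions = {"PIC": "1.8.0-12p6", "GridKa": "1.8.0-15p2", "IN2P3": "1.8.0-12p6"}

for site, version in site_versions.items():
    behind = parse_dcache_version(version) < recommended
    print(site, version, "behind recommendation" if behind else "at/above recommendation")
```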

17  Databases
Conditions DB used at CERN & Tier-1 centres
– No replication tests of the Conditions DB pit ↔ Tier-0 (and beyond)
– Switched to using the Conditions DB for reconstruction on 15th May
LFC
– "Streaming" is used to populate the read-only instances at the T1s from CERN
– A problem with the CERN instance revealed that the local instances were not being used by LHCb! Testing is underway now

18  Stripping
Stripping runs on rDST files: 1 rDST file & its associated RAW file
– Space tokens: LHCb_RAW & LHCb_rDST
DST files & ETC produced during the process are stored locally on T1D1 (additional storage class)
– Space token: LHCb_M-DST
DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN, T1D1)
– Space tokens: LHCb_DST (LHCb_M-DST)

19  Stripping
31.8k stripping jobs were submitted; 9.3k jobs ran to "Done"
Major issues with the LHCb book-keeping: a further 17.0k jobs failed to resolve their datasets

  Site     Submitted   Done
  CERN     2.4k        2.3k
  CNAF     2.3k        2.0k
  GridKa   2.0k
  IN2P3    4.5k        0.2k
  NIKHEF   0.3k        <0.1k
  PIC      1.1k
  RAL      2.2k        1.6k

20  Stripping: T1-T1 transfers
Stripping test limited to 4 T1 centres: CNAF, PIC, GridKa, RAL
– Initial problems uploading to the M-DST token at PIC
– Caught up OK once solved

21  Conclusions
Despite being the smallest LHCb Tier-1, PIC's quality of service was the highest in CCRC08
The following Tier-1 processes were tested:
– Reception of data from CERN
– Reconstruction
– Stripping and distribution of DSTs to the other Tier-1s
The results at PIC were positive:
– Reception of data from CERN (~5 MB/s)
– Reading of data from the WNs (dcap): OK
– Demonstrated replication of DSTs to the other Tier-1s at a higher rate than required (catch-up)
The exercise was also useful for LHCb to detect the weak points of its Grid infrastructure, DIRAC
– Improve the book-keeping system, log files, etc.

