Parallel Session A:
Supporting data-intensive applications
Chair: Tim Chown
Please switch your mobile phones to silent.
No fire alarms are scheduled. In the event of an alarm, please follow the directions of NCC staff.
17:30 - 19:00: Exhibitor showcase and drinks reception
18:00 - 19:00: Birds of a feather sessions
Janet end-to-end
performance initiative update
Tim Chown, Jisc
Why is end-to-end performance important?
» Overall goal: help optimise a site’s use of its Janet connectivity
» Seeing a growth in data-intensive science applications
› Includes established areas like GridPP (moving LHC data around)
› As well as new areas like cryo-electron microscopy
» Seeing an increasing number of remote computation scenarios
› e.g., scientific networked equipment, no local compute
› Might require 10Gbit/s to remote compute to return computation results on a 100GB
data set to a researcher for timely visualisation
» Starting to see more 100Gbit/s connectivity requests
› Likely to have challenging data transfer requirements behind them
» As networking people, how do we help our researchers and their applications?
11/04/2017 Janet end-to-end performance initiative update
Speaking to your researchers
» Are you or your computing service department speaking to your researchers?
› If not, how do you understand their data-intensive requirements?
› If so, is this happening on a regular basis?
› Ideally you’d want to be able to plan ahead, rather than adapt on the fly
» Do you conduct networking “future looks”?
› Any step changes might dwarf your site’s organic growth
› This issue should have some attention at CIO or PVC Research level
» Do you know what your application elephants are?
› What’s the breakdown of your site’s network traffic?
› How are you monitoring network flows?
11/04/2017 Janet end-to-end performance initiative update
Researcher expectations
» How can we help set and manage researcher expectations?
» One aspect is helping them articulate their network requirements
› Volume of data in time X => data rate required
» It’s also about understanding practical limitations
› Can determine theoretical network throughput
› e.g. in principle, you can transfer 100TB over a 10Gbit/s link in 1 day
› But in practice, many factors may prevent this
» We should encourage researchers to speak to their computing service
» Computing services can in turn talk to the Janet Service Desk
› jisc.ac.uk/contact
» And noting here that cloud capabilities are becoming increasingly important
11/04/2017 Janet end-to-end performance initiative update
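As a worked example of the volume-to-rate arithmetic above, a minimal Python sketch (the numbers are upper bounds; real transfers rarely sustain line rate):

```python
# Back-of-envelope conversion between data volume, time window and link rate.
def required_gbit_per_s(volume_tb, hours):
    """Sustained rate needed to move volume_tb terabytes in the given hours."""
    return volume_tb * 8e12 / (hours * 3600 * 1e9)

def transfer_hours(volume_tb, link_gbit):
    """Time to move volume_tb terabytes over a link of link_gbit Gbit/s."""
    return volume_tb * 8e12 / (link_gbit * 1e9 * 3600)

print(required_gbit_per_s(100, 24))  # ~9.3 Gbit/s: 100TB/day nearly fills a 10G link
print(transfer_hours(0.1, 10))       # ~0.022 h (~80 s) for a 100GB data set at 10Gbit/s
```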
Janet end-to-end performance initiative
» This is the context in which Jisc set up the Janet end-to-end performance initiative
» The goals of the initiative include:
› Engaging with existing data-intensive research communities and identifying emerging
communities
› Creating dialogue between Jisc, computing service groups, and research communities
› Holding workshops, facilitating discussion on e-mail lists, etc.
› Helping researchers manage expectations
› Establishing and sharing best practices in identifying and rectifying causes of poor
performance
» More information:
› jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative
11/04/2017 Janet end-to-end performance initiative update
Understanding the factors affecting E2E
» Achieving optimal end-to-end performance is a multi-faceted problem.
» It includes:
› Appropriate provisioning between the end sites
› Properties of the local campus network (at each end), including capacity of the Janet
connectivity, internal LAN design, the performance of firewalls and the configuration of
other devices on the path
› End system configuration and tuning; network stack buffer sizes, disk I/O, memory
management, etc.
› The choice of tools used to transfer data, and the underlying network protocols
» To optimise end-to-end performance, you need to address each aspect
11/04/2017 Janet end-to-end performance initiative update
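To illustrate the end-system tuning point, a minimal sketch of per-socket buffer sizing in Python, assuming a Linux host (the 32MB figure is an illustrative target for a high bandwidth-delay-product path, not a recommendation):

```python
import socket

BUF_BYTES = 32 * 1024 * 1024  # illustrative buffer target for a long fat network

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

# The kernel caps these at net.core.rmem_max / net.core.wmem_max, so the
# sysctl limits usually need raising too (see fasterdata.es.net host tuning).
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))  # Linux reports 2x the request
```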
Janet network engineering
» From Jisc’s perspective, it’s important to ensure there is sufficient capacity in the network
for its connected member sites
» We perform an ongoing review of network utilisation
› Provision the backbone network and regional links
› Provision external connectivity, to other NRENs and networks
› Observe the utilisation, model growth, predict ahead
› Step changes have bigger impact at the local rather than backbone scale
» Janet has no differential queueing for regular IP traffic
› The Netpath service exists for dedicated / overlay links
› In general, Jisc plans regular network upgrades with a view to ensuring that there
is sufficient latent capacity in the network
11/04/2017 Janet end-to-end performance initiative update
Janet backbone, October 2016
11/04/2017 Janet end-to-end performance initiative update
Major external links, October 2016
11/04/2017 Janet end-to-end performance initiative update
E2EPI site visits
» We (Duncan Rand and I) have visited a dozen or so sites
› Met with a variety of networking staff and researchers
› And spoken to many others via email
» Really interesting to hear what sites are doing
› Some good practice evident, especially in local network engineering
› e.g. routing those elephants around main campus firewalls
› Campus firewalls often not designed for single high throughput flows
» Seeing varying use of site links
› e.g. 10G for campus, 10G resilient, 20G for research (GridPP)
› Some sites using their “resilient” link for bulk data transfers
» Some rate limiting of researcher traffic
› To avoid adverse impact on general campus traffic
› We’d encourage sites to talk to the JSD rather than rate limit
11/04/2017 Janet end-to-end performance initiative update
The Science DMZ model
» ESnet published the Science DMZ “design pattern” in 2012/13
› es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
» Three key elements:
› Network architecture; avoiding local bottlenecks
› Network performance measurement
› Data transfer node (DTN) design and configuration
» Also important to apply your security policy without impacting performance
» The NSF’s Cyberinfrastructure (CC*) Program funded this model in over 100 US
universities:
› See nsf.gov/funding/pgm_summ.jsp?pims_id=504748
» No current funding equivalent in the UK; it’s down to individual campuses to
fund changes to network architectures for data-intensive science
› But this can and should be part of your network architecture evolution
11/04/2017 Janet end-to-end performance initiative update
Good news – you’re doing a lot already
» There are several examples of sites in the UK that
have a form of Science DMZ deployment
» In many cases the deployments were made
without knowledge of the Science DMZ model
» But Science DMZ is simply a set of good principles
to follow, so it’s not surprising that some Janet
sites are already doing it
» Examples in the UK:
› Diamond Light Source (more from Alex next)
› JASMIN/CEDA Data Transfer Zone
› Imperial College GridPP; supports up to 40Gbit/s
of IPv4/IPv6
› To realise the benefit, both end sites need to
apply the principles
11/04/2017 Janet end-to-end performance initiative update
Examples of campus network engineering
» In principle you can just use your Janet IP service
» Where specific guarantees are required, the Netpath Plus service is available
› See jisc.ac.uk/netpath
› But then you’ll not be able to exceed that capacity
» Some sites split their campus links, e.g. 10G campus, 10G science
› Again, be careful about using your “resilient” link for bulk data
› It’s better to speak to the JSD about a primary link upgrade
› And ensure appropriate resilience for that
» Some sites rate-limit research traffic
» Some examples of physical / virtual overlays
› The WLCG (for LHC data) deployed OPN (optical) and LHCONE (virtual) networks
› Not clear that one overlay per research community would scale
» At least one site is exploring Cisco ACI (their SDN solution)
11/04/2017 Janet end-to-end performance initiative update
Measuring network characteristics
» It’s important to have telemetry on your network
» The Science DMZ model recommends perfSONAR for this
› More from Alex and Szymon later in this session
› Current version about to be 4.0 (as of April 17th, all being well)
» perfSONAR uses proven measurement tools under the hood
› e.g. iperf and owamp
» Can run between two perfSONAR systems or build a mesh
» Collects telemetry over time; throughput, loss, latency, traffic path
» Helps you assess the impact of changes to your network or systems
› And to understand variance in characteristics over time
» It can highlight poor performance, but doesn’t troubleshoot per se
11/04/2017 Janet end-to-end performance initiative update
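For instance, a one-off loss/latency sample can be taken with owamp's owping client; a hedged sketch (the test host name is hypothetical, and perfSONAR normally schedules and archives such tests for you):

```python
import subprocess

# Send 100 one-way test packets to a far-end owampd and print the summary,
# which includes one-way delay statistics and packets sent/lost.
result = subprocess.run(
    ["owping", "-c", "100", "ps-test.example.ac.uk"],  # hypothetical host
    capture_output=True, text=True, check=True)
print(result.stdout)
```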
Example: UK GridPP perfSONAR mesh
11/04/2017 Janet end-to-end performance initiative update
Janet perfSONAR / DTN test node(s)
» We’ve installed a 10G perfSONAR node at a Janet PoP in London
› Lets you test your site’s throughput to/from Janet backbone
› Might be useful to you if you know you’ll want to run some data-intensive applications
in the near future, but don’t yet have perfSONAR at the far end, or if you just want to
benchmark your site’s connectivity
› Ask us if you’re interested in using it
» We’re planning to add a second perfSONAR test node in our Slough DC
› Also planning to install a 10G reference SSD-based DTN there
› See https://fasterdata.es.net/science-dmz/DTN/
› This will allow disk-to-disk tests, using a variety of transfer tools
» We can also run a perfSONAR mesh for you, using MaDDash on a VM
» We may also deploy an experimental perfSONAR node
› e.g. to evaluate the new Google TCP-BBR implementation
11/04/2017 Janet end-to-end performance initiative update
Small node perfSONAR
» In some cases, just an indicative perfSONAR test is useful
» i.e., run loss/latency tests as normal, but limit throughput tests to 1Gbit/s
» For this scenario, you can build a small node perfSONAR system for under £250
» Jisc took part in the GEANT small node pilot project, using Gigabyte Brix:
› IPv4 and IPv6 test mesh at http://perfsonar-smallnodes.geant.org/maddash-webui/
» We now have a device build that we can offer to communities for testing
› Aim to make them as “plug and play” as possible
› And a stepping stone to a full perfSONAR node
» Further information and TNC2016 meeting slide deck:
› https://lists.geant.org/sympa/d_read/perfsonar-smallnodes/
11/04/2017 Janet end-to-end performance initiative update
perfSONAR small node test mesh
11/04/2017 Janet end-to-end performance initiative update
Aside: Google’s TCP-BBR
» Traditional TCP performs poorly even with a loss rate of just a fraction of 1%
» Google have been developing a new version of TCP
› TCP-BBR was open-sourced last year
› Requires just sender-side deployment
› Seeks high throughput with a small queue
› Good performance at up to 15% loss
› Google using it in production today
» Would be good to explore this further
› Understand impact on other TCP variants
› And when used for parallelised TCP applications like GridFTP
» See the presentation from the March 2017 IETF meeting:
› ietf.org/proceedings/98/slides/slides-98-iccrg-an-update-on-bbr-congestion-control-00.pdf
11/04/2017 Janet end-to-end performance initiative update
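Since BBR needs only sender-side deployment, it can even be selected per socket. A minimal sketch, assuming a Linux kernel with the tcp_bbr module available (system-wide selection would instead use sysctl net.ipv4.tcp_congestion_control=bbr):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the kernel to use BBR for this connection only (Python 3.6+ on Linux).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # b'bbr\x00...'
```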
Building on Science DMZ?
» We should seek to establish good principles and practices at all campuses
› And the research organisations they work with, like Diamond
› There’s already a good foundation at many GridPP sites
» The Janet backbone is heading towards 600G capacity and beyond
» We can seed further communities of good practice on this foundation
› e.g. the DiRAC HPC community, the SES consortium, …
» And grow a Research Data Transfer Zone (RDTZ) within and between campuses
› Build towards a UK RDTZ
› Inspired by the US Pacific Research Platform model of multi-site, multi-discipline
research-driven collaboration built on NSF Science DMZ investment
» Many potential benefits, such as enabling new types of workflow
› e.g. streaming data to CPUs without the need to store locally
11/04/2017 Janet end-to-end performance initiative update
Transfer tools
» Your researchers are likely to find many available data transfer tools:
» There are simpler old friends like ftp and scp
› But these are likely to give a bad initial impression of what your network can do
» There are TCP-based tools designed to mitigate the impact of packet loss
› GridFTP typically uses four parallel TCP streams
» Globus Connect is free for non-profit research and education use
› See globus.org/globus-connect
» There are tools to support management of transfers
› FTS – see http://astro.dur.ac.uk/~dph0elh/documentation/transfer-data-to-ral-v1.4.pdf
» There’s also a commercial UDP-based option, Aspera
› See asperasoft.com/
» It would be good to establish more benchmarking of these tools at Janet campuses
11/04/2017 Janet end-to-end performance initiative update
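As an illustration of the parallel-stream approach, a hedged sketch invoking globus-url-copy with four TCP streams (the endpoint and paths are hypothetical; -p sets the parallelism, -vb prints transfer performance):

```python
import subprocess

subprocess.run([
    "globus-url-copy", "-vb", "-p", "4",          # four parallel TCP streams
    "gsiftp://dtn.example.ac.uk/data/run42.tar",  # hypothetical source DTN
    "file:///scratch/run42.tar",                  # local destination
], check=True)
```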
E2E performance to cloud compute
» We’re seeing growing interest in the use of commercial cloud compute
› e.g. to provide remote CPU for scientific equipment
» Complements compute available at the new EPSRC Tier-2 HPC facilities
› epsrc.ac.uk/research/facilities/hpc/tier2/
» Anecdotal reports of 2-3Gbit/s into AWS
› e.g. by researchers at the Institute of Cancer Research
› See presentations at the RCUK Cloud Workshop - https://cloud.ac.uk/
» Bandwidth for AWS depends on the VM size
› See https://aws.amazon.com/ec2/instance-types/
» We’re keen to explore cloud compute connectivity further
› Includes AWS, MS ExpressRoute, …
› And scaling Janet connectivity to these services as appropriate
11/04/2017 Janet end-to-end performance initiative update
Future plans for E2EPI
» Our future plans include:
› Writing up and disseminating best practice case studies
› Growing a UK RDTZ by promoting such best practices within communities
› Deploying a second 10G Janet perfSONAR test node at our Slough DC
› Deploying a 10G reference DTN at Slough and performing transfer tool benchmarking
› Promoting wider campus perfSONAR deployment and community meshes
› Integrating perfSONAR data with router link utilisation
› Developing our troubleshooting support further
› Experimenting with Google’s TCP-BBR
› Exploring best practices in implementing security models for Science DMZ
› Expanding Science DMZ to include IPv6, SDN and other technologies
› Running a second community E2EPI workshop in October
› Holding a hands-on perfSONAR training event (after 4.0 is out)
11/04/2017 Janet end-to-end performance initiative update
Useful links
» Janet E2EPI project page
› jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative
» E2EPI Jisc community page
› https://community.jisc.ac.uk/groups/janet-end-end-performance-initiative
» JiscMail E2EPI list (approx 100 subscribers)
› jiscmail.ac.uk/cgi-bin/webadmin?A0=E2EPI
» Campus Network Engineering for Data-Intensive Science workshop slides
› jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-
19-oct-2016
» Fasterdata knowledge base
› http://fasterdata.es.net/
» eduPERT knowledge base
› http://kb.pert.geant.net/PERTKB/WebHome
11/04/2017 Janet end-to-end performance initiative update
jisc.ac.uk
contact
Tim Chown
Jisc
tim.chown@jisc.ac.uk
Using perfSONAR and
Science DMZ to resolve
throughput issues
Alex White, Diamond
Diamond Light Source
perfSONAR and Science DMZ at Diamond
Diamond
»The Diamond machine is a type of particle accelerator
»CERN: high energy particles smashed together, analyse
the crash
»Diamond: exploits the light produced by high energy
particles undergoing acceleration
»Use this light to study matter – like a “super microscope”
What is the Diamond Light Source?
perfSONAR and Science DMZ at Diamond
Diamond
»In the Oxfordshire countryside, by the A34 near Didcot
»Diamond is a not-for-profit joint venture between STFC
and the Wellcome Trust
»Cost of use: Free access for scientists through a
competitive scientific application process
»Over 7000 researchers from academia and industry have
used our facility
What is the Diamond Light Source?
perfSONAR and Science DMZ at Diamond
The Machine
»Three particle accelerators:
› Linear accelerator
› Booster Synchrotron
› Storage ring
– 48 straight sections angled
together to make a ring
– 562m long
– could be called a
“tetracontakaioctagon”
perfSONAR and Science DMZ at Diamond
The Machine
perfSONAR and Science DMZ at Diamond
The Science
»Bright beams of light from
each bending magnet or
wiggler are directed into
laboratories known as
“beamlines”.
»We also have several cutting-edge electron microscopes.
»All of these experiments
operate concurrently!
»Spectroscopy
»Crystallography
»Tomography (think CAT Scan)
»Infrared
»X-ray absorption
»X-ray scattering
perfSONAR and Science DMZ at Diamond
X-ray diffraction
perfSONAR and Science DMZ at Diamond
Data Rates
»Central Lustre and GPFS filesystems for science data:
› 420TB, 900TB, 3.3PB (as of 2016)
»Typical x-ray camera: 4MB frame, 100x per second
»An experiment can easily produce 300GB-1TB
»Scientists want to take their data home
Data-intensive research
perfSONAR and Science DMZ at Diamond
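For scale, the detector figures above translate directly into a network-relevant rate; a quick check in Python, assuming decimal units:

```python
frame_mb, frames_per_s = 4, 100
mb_per_s = frame_mb * frames_per_s    # 400 MB/s from a single camera
print(mb_per_s * 8 / 1000, "Gbit/s")  # 3.2 Gbit/s sustained, before any file I/O
```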
Data Rates
»Each dataset might only be downloaded once
»I don’t know where my users are
»Some scientists want to push data back to Diamond
Data-intensive problems
perfSONAR and Science DMZ at Diamond
Sneakernet
The old ways are the best?
perfSONAR and Science DMZ at Diamond
Network Limits
»Science data downloads from Diamond to visiting users’
institutes were inconsistent and slow
› …even though the facility had a “10Gb/s” JANET
connection from STFC.
»The limit on download speeds was delaying post-experiment analysis at users’ home institutes.
The Problem
perfSONAR and Science DMZ at Diamond
Characterising the problem
»Our target: “a stable 50Mb/s
over a 10ms path”
› Using our site’s shared
JANET connection
› In real terms, 10ms was
approximately DLS to
Oxford
perfSONAR and Science DMZ at Diamond
Baseline findings
»Inside our network: 10Gb/s, no packet loss
»Over the STFC/JANET segment between Diamond’s edge
and the Physics Department at Oxford:
› low and unpredictable speeds
› a small amount of packet loss
Baseline findings with iperf
perfSONAR and Science DMZ at Diamond
Packet Loss
TCP Throughput ≤ MSS / (RTT × √p)
where MSS is the packet size, RTT is the round-trip latency, and p is the packet loss probability
TCP Performance is predicted by the Mathis equation
perfSONAR and Science DMZ at Diamond
Packet Loss
“Interesting” effects of packet loss
perfSONAR and Science DMZ at Diamond
Packet Loss
»According to Mathis, to
achieve our initial goal over a
10ms path, the tolerable
packet loss is:
› 0.026% (maximum)
perfSONAR and Science DMZ at Diamond
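A sketch of that calculation in Python. The 0.026% figure on the slide is consistent with an MSS of 1460 bytes and a conservative Mathis constant of about 0.7 (both assumptions here):

```python
def max_loss(rate_bps, rtt_s, mss_bytes=1460, c=0.7):
    """Tolerable loss p from Mathis: rate <= c * MSS / (RTT * sqrt(p))."""
    return (c * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2

p = max_loss(50e6, 0.010)  # 50 Mb/s target over a 10 ms path
print(f"{p:.4%}")          # ~0.0267%, i.e. roughly the 0.026% maximum quoted
```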
Finding the problem in the Last Mile
»We worked with STFC to connect a perfSONAR server
directly to the main Harwell campus border router to look
for loss in the “last mile”
perfSONAR and Science DMZ at Diamond
Science DMZ
The Fix: Science DMZ
perfSONAR and Science DMZ at Diamond
Security
»Data-intensive science traffic interacts poorly with
enterprise firewalls
»Does this mean we can use the Science DMZ idea to just…
ignore security?
› No!
Aside: Security without Firewalls
perfSONAR and Science DMZ at Diamond
Security
»Implementing a Science DMZ means segmenting your
network traffic
› Apply specific controls to data transfer hosts
› Avoid unnecessary controls
Science DMZ as a Security Architecture
perfSONAR and Science DMZ at Diamond
Security
»Only run the services you need
»Protect each host:
› Router ACLs
› On-host firewall – Linux iptables is performant
Techniques to secure Science DMZ hosts
perfSONAR and Science DMZ at Diamond
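A hedged sketch of the default-deny, host-centric approach using iptables driven from Python (the collaborator prefix and GridFTP data-channel range are illustrative placeholders; run as root, and prefer your distribution's firewall tooling in practice):

```python
import subprocess

RULES = [
    ["-P", "INPUT", "DROP"],                               # default deny
    ["-A", "INPUT", "-m", "state", "--state",
     "ESTABLISHED,RELATED", "-j", "ACCEPT"],               # allow return traffic
    ["-A", "INPUT", "-p", "tcp", "--dport", "2811",
     "-s", "192.0.2.0/24", "-j", "ACCEPT"],                # GridFTP control, example prefix
    ["-A", "INPUT", "-p", "tcp", "--dport", "50000:51000",
     "-s", "192.0.2.0/24", "-j", "ACCEPT"],                # assumed data-channel range
]

for rule in RULES:
    subprocess.run(["iptables"] + rule, check=True)
```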
Globus GridFTP
perfSONAR and Science DMZ at Diamond
»We adopted Globus GridFTP as
our recommended transfer
method
› It uses parallel TCP streams
› It retries and resumes
automatically
› It has a simple, web-based
interface
Diamond Science DMZ
The Diamond Science DMZ
perfSONAR and Science DMZ at Diamond
Diamond Science DMZ
»Test data: 2Gb/s+ consistently
between Diamond and the
ESnet test point at
Brookhaven Labs, New York
State, USA
Actual transfers (speed records!):
»Biggest: Electron Microscope data from DLS to Imperial
› 1120GB at 290Mb/s
»Fastest: Crystallography dataset from DLS to Newcastle
› 260GB at 480Mb/s
perfSONAR and Science DMZ at Diamond
Bad bits
»Globus’ logs
› Netflow
»Globus install
»My own use of perfSONAR
perfSONAR and Science DMZ at Diamond
SCP and Aspera
»SCP
»Aspera
Different Transfer Tools?
perfSONAR and Science DMZ at Diamond
Near Future
»10Gb+ performance
› Globus Cluster for multiple 10Gb links
› 40Gb single links
perfSONAR and Science DMZ at Diamond
Thank you!
»Use real-world testing
»Never believe your vendor
»Zero packet loss is crucial
»Enterprise firewalls introduce
packet loss
Alex White
Diamond Light Source
alex.white@diamond.ac.uk
perfSONAR and Science DMZ at Diamond
jisc.ac.uk
contact
Alex White
Diamond
alex.white@diamond.ac.uk
Essentials of the Modern
Performance Monitoring with
perfSONAR
Szymon Trocha, PSNC / GÉANT, szymon.trocha@psnc.pl
11 April 2017
Motivations
»Identify problems when they happen, or (better) earlier
»The tools must be available (at
campus endpoints,
demarcations between
networks, at exchange points,
and near data resources such
as storage and computing
elements, etc)
»Access to testing resources
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
Problem statement
» The global Research & Education network ecosystem is comprised of hundreds of
international, national, regional and local-scale networks
» While these networks all interconnect, each network is owned and operated by separate
organizations (called “domains”) with different policies, customers, funding models,
hardware, bandwidth and configurations
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
» This complex, heterogeneous set of networks must operate seamlessly from “end to end”
to support your science and research collaborations that are distributed globally
Where Are The (multidomain) Problems?
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
[Diagram: path from a source campus (S) to a destination campus (D), showing congested intra-campus links, congested or faulty links between domains, and latency-dependent problems inside domains with small RTT]
Challenges
»Delivering end-to-end performance
› Get the user, service delivery teams, local campus and
metro/backbone network operators working together
effectively
–Have tools in place
–Know your (network) expectations
–Be aware of network troubleshooting
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
What is perfSONAR?
» It’s infeasible to perform at-scale data movement all the time – as we see in other
forms of science, we need to rely on simulations
» perfSONAR is a tool to:
› Set network performance expectations
› Find network problems (“soft failures”)
› Help fix these problems
› All in multi-domain environments
» These problems are all harder when multiple networks are involved
» perfSONAR provides a standard way to publish monitoring data
» This data is interesting to network researchers as well as network operators
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
The Toolkit
» Network performance comes down to a couple of key metrics:
› Throughput (e.g. “how much can I get out of the network”)
› Latency (time it takes to get to/from a destination)
› Packet loss/duplication/ordering (for some sampling of packets, do they all make it to the other side without serious abnormalities
occurring?)
› Network utilization (the opposite of “throughput” for a moment in time)
» We can get many of these from a selection of measurement tools – the perfSONAR Toolkit
» The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR
measurement infrastructure and protocols
» All components are available as RPMs, DEBs, and bundled as a CentOS ISO
» Very easy to install and configure (usually takes less than 30 minutes for default install)
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
Importance of RegularTesting
» We can’t wait for users to report problems and then fix them
» Things just break sometimes
› Bad system or network tuning
› Failing optics
› Somebody messed around in a patch panel and kinked a fiber
› Hardware goes bad
» Problems that get fixed have a way of coming back
› System defaults come back after hardware/software upgrades
› New employees may not know why the previous employee set things up a certain way and back
out fixes
» Important to continually collect, archive, and alert on active test results
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
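The collect/archive/alert loop can be sketched in a few lines of Python around iperf3 (the host name and threshold are assumptions; pscheduler and MaDDash do this properly in a real deployment):

```python
import json, subprocess, time

TARGET = "ps-test.example.ac.uk"  # hypothetical far-end iperf3 server
THRESHOLD_MBPS = 500              # assumed local expectation

def throughput_mbps(host):
    out = subprocess.run(["iperf3", "-c", host, "-t", "10", "-J"],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e6

while True:
    mbps = throughput_mbps(TARGET)
    print(f"{time.ctime()}: {mbps:.0f} Mbit/s")       # a real archive would go here
    if mbps < THRESHOLD_MBPS:
        print("ALERT: throughput below expectation")  # alerting hook
    time.sleep(6 * 3600)                              # e.g. every 6 hours, like a mesh
```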
perfSONAR Deployment Possibilities
»Dedicated server
› A single CPU with multiple cores (2.7 GHz for 10Gbps tests)
› 4GB RAM
› 1Gbps onboard NIC (management + delay)
› 10Gbps PCI-slot NIC (throughput)
»Small node
› Low cost – small PC, various models
› e.g. GIGABYTE BRIX GB-BACE-3150
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
Deployment Styles
»Beacon
»Island
»Mesh
»Ad hoc testing
› Docker
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
Location criteria
»Where can it be integrated into the facility software/hardware management systems?
»Where can it do the most good for the network operators or users?
»Where can it do the most good for the community?
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
[Diagram: placement options: at the network edge, or next to services]
Distributed Deployment In a Few Steps
Measurement hosts:
» Choose your home
› Connect to network
» Install Toolkit software
› By site administrator
» Configure hosts
› Networking, 2 interfaces, visibility
» Point to central host
› To consume central mesh configuration
Central server:
» Install and configure (by mesh administrator)
› Central data storage, dashboard GUI, home for mesh configuration
» Configure mesh
› Who, what and when; every 6 hours (bandwidth); be careful going 10G -> 1G
» Publish mesh configuration
› To be consumed by measurement hosts
» Run dashboard
› Observe thresholds, look for errors
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
New 4.0 release announcement (April 17th)
» Introduction of pScheduler
› New software for measurements, replacing BWCTL
» GUI changes
› More information, better presentation
» OS support upgrade
› Debian 7, Ubuntu 14
› CentOS 7 (still supporting CentOS 6)
» Mesh config GUI
» MaDDash alerting
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
4.0 Data Collection
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
New GUI
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
Thank you!
szymon.trocha@psnc.pl
12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
• http://www.perfsonar.net/
• http://docs.perfsonar.net/
• http://www.perfsonar.net/about/getting-help/
• perfSONAR videos
• https://learn.nsrc.org/perfsonar
Hands-on course in London is coming…
jisc.ac.uk
Szymon Trocha
PSNC/GEANT
szymon.trocha@man.poznan.pl
12/04/2017 Title of presentation (Insert > Header & Footer > Slide > Footer > Apply to all)

More Related Content

PPTX
Parallel session: mobility
PPTX
Parallel session: security
PPTX
The Science DMZ
PPTX
Parallel session: IPv6
PPTX
Network engineering surgery (part one)
PDF
Ppt5 exp lonodn - kevin cope & alex yakimov ( imperial college ) data cent...
PPTX
Introducing Jisc's new managed identity provider service
PPTX
Network engineering surgery (part two)
Parallel session: mobility
Parallel session: security
The Science DMZ
Parallel session: IPv6
Network engineering surgery (part one)
Ppt5 exp lonodn - kevin cope & alex yakimov ( imperial college ) data cent...
Introducing Jisc's new managed identity provider service
Network engineering surgery (part two)

What's hot (20)

PPTX
UK e-Infrastructure for Research - UK/USA HPC Workshop, Oxford, July 2015
PPTX
Opening up data: a UK perspective – Jisc and CNI conference 10 July 2014
PPTX
Archiving data from Durham to RAL using the File Transfer Service (FTS)
PDF
Science DMZ at Imperial
PPTX
Challenges in end-to-end performance
PPTX
Big Data for the Social Sciences - David De Roure - Jisc Digital Festival 2014
PPTX
Janet Network R&D Innovation - HEAnet / Juniper Innovation Day
PPTX
Science DMZ security
PPTX
perfSONAR: getting telemetry on your network
PDF
Imperial RIOXX implementation - Andrew McLean, Imperial College London
PPT
SKA NZ R&D BeSTGRID Infrastructure
PDF
Lean approach to IT development
PDF
DSD-INT 2019 Modelling in DANUBIUS-RI-Bellafiore
PPTX
The future of research: are you ready? - Jeremy Frey - Jisc Digital Festival ...
PDF
The Environmental Futures & Big Data Impact Lab: Plymouth Launch Event Slides
PPTX
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
PPTX
Dev ops, noops or hypeops - Networkshop44
PDF
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
PPTX
Application of Assent in the safe - Networkshop44
PDF
Research data spring: streamlining deposit
UK e-Infrastructure for Research - UK/USA HPC Workshop, Oxford, July 2015
Opening up data: a UK perspective – Jisc and CNI conference 10 July 2014
Archiving data from Durham to RAL using the File Transfer Service (FTS)
Science DMZ at Imperial
Challenges in end-to-end performance
Big Data for the Social Sciences - David De Roure - Jisc Digital Festival 2014
Janet Network R&D Innovation - HEAnet / Juniper Innovation Day
Science DMZ security
perfSONAR: getting telemetry on your network
Imperial RIOXX implementation - Andrew McLean, Imperial College London
SKA NZ R&D BeSTGRID Infrastructure
Lean approach to IT development
DSD-INT 2019 Modelling in DANUBIUS-RI-Bellafiore
The future of research: are you ready? - Jeremy Frey - Jisc Digital Festival ...
The Environmental Futures & Big Data Impact Lab: Plymouth Launch Event Slides
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Dev ops, noops or hypeops - Networkshop44
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
Application of Assent in the safe - Networkshop44
Research data spring: streamlining deposit
Ad

Similar to Parallel session: supporting data-intensive applications (20)

PPTX
Future services on Janet
PPTX
End to end performance networkshop44
PPTX
Future services on Janet
PDF
Tech 2 Tech: Network performance
PPTX
Future services on Janet
PPTX
End to end performance - Networkshop44
PPTX
Tech 2 tech low latency networking on Janet presentation
PPTX
Provisioning Janet
PPTX
HPC Midlands - JANET(UK) Enabling the UK's e-Infrastructure
PPTX
Networkshop45 day one plenary session
PPTX
The Pacific Research Platform
PDF
Common Design Elements for Data Movement Eli Dart
PPTX
Network Engineering for High Speed Data Sharing
PPTX
Research network infrastructure engineers
PPTX
Jisc and janet network updates from network operations, operational services ...
PPTX
Scaling Approaches to the National Research Platform
PPTX
Shared services - the future of HPC and big data facilities for UK research
PPTX
Janet Futures
PPT
The Pacific Research Platform: a Science-Driven Big-Data Freeway System
PPTX
Edupert best practices in supporting end users - Networkshop44
Future services on Janet
End to end performance networkshop44
Future services on Janet
Tech 2 Tech: Network performance
Future services on Janet
End to end performance - Networkshop44
Tech 2 tech low latency networking on Janet presentation
Provisioning Janet
HPC Midlands - JANET(UK) Enabling the UK's e-Infrastructure
Networkshop45 day one plenary session
The Pacific Research Platform
Common Design Elements for Data Movement Eli Dart
Network Engineering for High Speed Data Sharing
Research network infrastructure engineers
Jisc and janet network updates from network operations, operational services ...
Scaling Approaches to the National Research Platform
Shared services - the future of HPC and big data facilities for UK research
Janet Futures
The Pacific Research Platform: a Science-Driven Big-Data Freeway System
Edupert best practices in supporting end users - Networkshop44
Ad

More from Jisc (20)

PPTX
Strengthening open access through collaboration: building connections with OP...
PPTX
Andrew-Brown-JUSP-showcase-20240730.pptx
PPTX
JUSP Showcase - Rebuilding Data presentation
PPTX
Adobe Express Engagement Webinar (Delegate).pptx
PPTX
FE Accessibility training matrix partnership - information session
PPTX
Procuring a research management system: why is it so hard?
PPTX
Adobe Express Engagement Webinar (Delegate).pptx
PPTX
How libraries can support authors with open access requirements for UKRI fund...
PPTX
Supporting (UKRI) OA monographs at Salford.pptx
PPTX
The approach at University of Liverpool.pptx
PPTX
Jisc's value to HE: the University of Sheffield
PPTX
Towards a code of practice for AI in AT.pptx
PPTX
Jamworks pilot and AI at Jisc (20/03/2024)
PPTX
Wellbeing inclusion and digital dystopias.pptx
PPTX
Accessible Digital Futures project (20/03/2024)
PPTX
Procuring digital preservation CAN be quick and painless with our new dynamic...
PPTX
International students’ digital experience: understanding and mitigating the ...
PPTX
Digital Storytelling Community Launch!.pptx
PPTX
Open Access book publishing understanding your options (1).pptx
PPTX
Scottish Universities Press supporting authors with requirements for open acc...
Strengthening open access through collaboration: building connections with OP...
Andrew-Brown-JUSP-showcase-20240730.pptx
JUSP Showcase - Rebuilding Data presentation
Adobe Express Engagement Webinar (Delegate).pptx
FE Accessibility training matrix partnership - information session
Procuring a research management system: why is it so hard?
Adobe Express Engagement Webinar (Delegate).pptx
How libraries can support authors with open access requirements for UKRI fund...
Supporting (UKRI) OA monographs at Salford.pptx
The approach at University of Liverpool.pptx
Jisc's value to HE: the University of Sheffield
Towards a code of practice for AI in AT.pptx
Jamworks pilot and AI at Jisc (20/03/2024)
Wellbeing inclusion and digital dystopias.pptx
Accessible Digital Futures project (20/03/2024)
Procuring digital preservation CAN be quick and painless with our new dynamic...
International students’ digital experience: understanding and mitigating the ...
Digital Storytelling Community Launch!.pptx
Open Access book publishing understanding your options (1).pptx
Scottish Universities Press supporting authors with requirements for open acc...

Recently uploaded (20)

PDF
advance database management system book.pdf
DOC
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
PPTX
Digestion and Absorption of Carbohydrates, Proteina and Fats
PPTX
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
PDF
IGGE1 Understanding the Self1234567891011
PPTX
UNIT III MENTAL HEALTH NURSING ASSESSMENT
PPTX
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
PDF
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
PPTX
Unit 4 Skeletal System.ppt.pptxopresentatiom
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PPTX
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
PDF
Weekly quiz Compilation Jan -July 25.pdf
PDF
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
PDF
RMMM.pdf make it easy to upload and study
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
advance database management system book.pdf
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
Digestion and Absorption of Carbohydrates, Proteina and Fats
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
IGGE1 Understanding the Self1234567891011
UNIT III MENTAL HEALTH NURSING ASSESSMENT
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
Unit 4 Skeletal System.ppt.pptxopresentatiom
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
Weekly quiz Compilation Jan -July 25.pdf
ChatGPT for Dummies - Pam Baker Ccesa007.pdf
RMMM.pdf make it easy to upload and study
Complications of Minimal Access Surgery at WLH
Final Presentation General Medicine 03-08-2024.pptx
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf

Parallel session: supporting data-intensive applications

  • 1. Parallel Session A: Supporting data intensive applications Chair:Tim Chown
  • 2. Please switch your mobile phones to silent 17:30 - 19:00 No fire alarms scheduled. In the event of an alarm, please follow directions of NCC staff Exhibitor showcase and drinks reception 18:00 - 19:00 Birds of a feather sessions
  • 4. Why is end-to-end performance important? » Overall goal: help optimise a site’s use of its Janet connectivity » Seeing a growth in data-intensive science applications › Includes established areas like GridPP (moving LHC data around) › As well as new areas like cryo-electron microscopy » Seeing an increasing number of remote computation scenarios › e.g., scientific networked equipment, no local compute › Might require 10Gbit/s to remote compute to return computation results on a 100GB data set to a researcher for timely visualisation » Starting to see more 100Gbit/s connectivity requests › Likely to have challenging data transfer requirements behind them » As networking people, how do we help our researchers and their applications? 11/04/2017 Janet end-to-end performance initiative update
  • 5. Speaking to your researchers » Are you or your computing service department speaking to your researchers? › If not, how do you understand their data-intensive requirements? › If so, is this happening on a regular basis? › Ideally you’d want to be able to plan ahead, rather than adapt on the fly » Do you conduct networking “future looks”? › Any step changes might dwarf your site’s organic growth › This issue should have some attention at CIO or PVC Research level » Do you know what your application elephants are? › What’s the breakdown of your site’s network traffic? › How are you monitoring network flows? 11/04/2017 Janet end-to-end performance initiative update
  • 6. Researcher expectations » How can we help set and manage researcher expectations? » One aspect is helping them articulate their network requirements › Volume of data in time X => data rate required » It’s also about understanding practical limitations › Can determine theoretical network throughput › e.g. in principle, you can transfer 100TB over a 10Gbit/s link in 1 day › But in practice, many factors may prevent this » We should encourage researchers to speak to their computing service » Computing services can in turn talk to the Janet Service Desk › jisc.ac.uk/contact » And noting here that cloud capabilities are becoming increasingly important 11/04/2017 Janet end-to-end performance initiative update
  • 7. Janet end-to-end performance initiative » This is the context in which Jisc set up the Janet end-to-end performance initiative » The goals of the initiative include: › Engaging with existing data-intensive research communities and identifying emerging communities › Creating dialogue between Jisc, computing service groups, and research communities › Holding workshops, facilitating discussion on e-mail lists, etc. › Helping researchers manage expectations › Establishing and sharing best practices in identifying and rectifying causes of poor performance » More information: › jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative 11/04/2017 Janet end-to-end performance initiative update
  • 8. Understanding the factors affecting E2E » Achieving optimal end-to-end performance is a multi-faceted problem. » It includes: › Appropriate provisioning between the end sites › Properties of the local campus network (at each end), including capacity of the Janet connectivity, internal LAN design, the performance of firewalls and the configuration of other devices on the path › End system configuration and tuning; network stack buffer sizes, disk I/O, memory management, etc. › The choice of tools used to transfer data, and the underlying network protocols » To optimise end-to-end performance, you need to address each aspect 11/04/2017 Janet end-to-end performance initiative update
  • 9. Janet network engineering » From Jisc’s perspective, it’s important to ensure there is sufficient capacity in the network for its connected member sites » We perform an ongoing review of network utilisation › Provision the backbone network and regional links › Provision external connectivity, to other NRENs and networks › Observe the utilisation, model growth, predict ahead › Step changes have bigger impact at the local rather than backbone scale » Janet has no differential queueing for regular IP traffic › The Netpath service exists for dedicated / overlay links › In general, Jisc plans regular network upgrades with a view to ensuring that there is sufficient latent capacity in the network 11/04/2017 Janet end-to-end performance initiative update
  • 10. Janet backbone, October 2016 11/04/2017 Janet end-to-end performance initiative update
  • 11. Major external links, October 2016 11/04/2017 Janet end-to-end performance initiative update
  • 12. E2EPI site visits » We (Duncan Rand and I) have visited a dozen or so sites › Met with a variety of networking staff and researchers › And spoken to many others via email » Really interesting to hear what sites are doing › Some good practice evident, especially in local network engineering › e.g. routing those elephants around main campus firewalls › Campus firewalls often not designed for single high throughput flows » Seeing varying use of site links › e.g. 10G for campus, 10G resilient, 20G for research (GridPP) › Some sites using their “resilient” link for bulk data transfers » Some rate limiting of researcher traffic › To avoid adverse impact on general campus traffic › We’d encourage sites to talk to the JSD rather than rate limit 11/04/2017 Janet end-to-end performance initiative update
  • 13. The Science DMZ model » ESnet published the Science DMZ “design pattern” in 2012/13 › es.net/assets/pubs_presos/sc13sciDMZ-final.pdf » Three key elements: › Network architecture; avoiding local bottlenecks › Network performance measurement › Data transfer node (DTN) design and configuration » Also important to apply your security policy without impacting performance » The NSF’s Cyberinfrastructure (CC*) Program funded this model in over 100 US universities: › See nsf.gov/funding/pgm_summ.jsp?pims_id=504748 » No current funding equivalent in the UK; it’s down to individual campuses to fund changes to network architectures for data-intensive science › But this can and should be part of your network architecture evolution 11/04/2017 Janet end-to-end performance initiative update
  • 14. Good news – you’re doing a lot already » There are several examples of sites in the UK that have a form of Science DMZ deployment » In many cases the deployments were made without knowledge of the Science DMZ model » But Science DMZ is simply a set of good principles to follow, so it’s not surprising that some Janet sites are already doing it » Examples in the UK: › Diamond Light Source (more from Alex next) › JASMIN/CEDA DataTransfer Zone › Imperial College GridPP; supports up to 40Gbit/s of IPv4/IPv6 › To realise the benefit, both end sites need to apply the principles 11/04/2017 Janet end-to-end performance initiative update
  • 15. Examples of campus network engineering » In principle you can just use your Janet IP service » Where specific guarantees are required, the Netpath Plus service is available › See jisc.ac.uk/netpath › But then you’ll not be able to exceed that capacity » Some sites split their campus links, e.g. 10G campus, 10G science › Again, be careful about using your “resilient” link for bulk data › It’s better to speak to the JSD about a primary link upgrade › And ensure appropriate resilience for that » Some sites rate-limit research traffic » Some examples of physical / virtual overlays › The WLCG (for LHC data) deployed OPN (optical) and LHCONE (virtual) networks › Not clear that one overlay per research community would scale » At least one site is exploringCisco ACI (their SDN solution) 11/04/2017 Janet end-to-end performance initiative update
  • 16. Measuring network characteristics » It’s important to have telemetry on your network » The Science DMZ model recommends perfSONAR for this › More from Alex and Szymon later in this session › Current version about to be 4.0 (as of April 17th, all being well ) » perfSONAR uses proven measurement tools under the hood › e.g. iperf and owamp » Can run between two perfSONAR systems or build a mesh » Collects telemetry over time; throughput, loss, latency, traffic path » Helps you assess the impact of changes to your network or systems › And to understand variance in characteristics over time » It can highlight poor performance, but doesn’t troubleshoot per se 11/04/2017 Janet end-to-end performance initiative update
  • 17. Example: UK GridPP perfSONAR mesh 11/04/2017 Janet end-to-end performance initiative update
  • 18. Janet perfSONAR / DTN test node(s) » We’ve installed a 10G perfSONAR node at a Janet PoP in London › Lets you test your site’s throughput to/from Janet backbone › Might be useful to you if you know you’ll want to run some data-intensive applications in the near future, but don’t yet have perfSONAR at the far end, or if you just want to benchmark your site’s connectivity › Ask us if you’re interested in using it » We’re planning to add a second perfSONAR test node in our Slough DC › Also planning to install a 10G reference SSD-based DTN there › See https://fasterdata.es.net/science-dmz/DTN/ › This will allow disk-to-disk tests, using a variety of transfer tools » We can also run a perfSONAR mesh for you, using MaDDash on aVM » We may also deploy an experimental perfSONAR node › e.g. to evaluate the new GoogleTCP-BBR implementation 11/04/2017 Janet end-to-end performance initiative update
  • 19. Small node perfSONAR » In some cases, just an indicative perfSONAR test is useful » i.e., run loss/latency tests as normal, but limit throughout tests to 1Gbit/s » For this scenario, you can build a small node perfSONAR system for under £250 » Jisc took part in the GEANT small node pilot project, using Gigabyte Brix: › IPv4 and IPv6 test mesh at http://perfsonar-smallnodes.geant.org/maddash-webui/ » We now have a device build that we can offer to communities for testing › Aim to make them as “plug and play” as possible › And a stepping stone to a full perfSONAR node » Further information andTNC2016 meeting slide deck: › https://lists.geant.org/sympa/d_read/perfsonar-smallnodes/ 11/04/2017 Janet end-to-end performance initiative update
  • 20. perfSONAR small node test mesh 11/04/2017 Janet end-to-end performance initiative update
  • 21. Aside: Google’sTCP-BBR » TraditionalTCP performs poorly even with just a fraction of 1% loss rate » Google have been developing a new version ofTCP › TCP-BBR was open-sourced last year › Requires just sender-side deployment › Seeks high throughput with a small queue › Good performance at up to 15% loss › Google using it in production today » Would be good to explore this further › Understand impact on otherTCP variants › And when used for parallelisedTCP applications like GridFTP » See the presentation from the March 2017 IETF meeting: › etf.org/proceedings/98/slides/slides-98-iccrg-an-update-on-bbr-congestion-control- 00.pdf 11/04/2017 Janet end-to-end performance initiative update
  • 22. Building on Science DMZ? » We should seek to establish good principles and practices at all campuses › And the research organisations they work with, like Diamond › There’s already a good foundation at many GridPP sites » The Janet backbone is heading towards 600G capacity and beyond » We can seed further communities of good practice on this foundation › e.g. the DiRAC HPC community, the SES consortium, … » And grow a Research DataTransfer Zone (RDTZ) within and between campuses › Build towards a UK RDTZ › Inspired by the US Pacific Research Platform model of multi-site, multi-discipline research-driven collaboration built on NSF Science DMZ investment » Many potential benefits, such as enabling new types of workflow › e.g. streaming data to CPUs without the need to store locally 11/04/2017 Janet end-to-end performance initiative update
  • 23. Transfer tools » Your researchers are likely to find many available data transfer tools: » There’s the simpler old friends like ftp and scp › But these are likely to give a bad initial impression of what your network can do » There’sTCP-based tools designed to mitigate the impact of packet loss › GridFTP typically uses four parallelTCP streams » Globus Connect is free for non-profit research and education use › See globus.org/globus-connect » There’s tools to support management of transfers › FTS – see http://astro.dur.ac.uk/~dph0elh/documentation/transfer-data-to-ral-v1.4.pdf » There’s also a commercial UDP-based option, Aspera › See asperasoft.com/ » It would be good to establish more benchmarking of these tools at Janet campuses 11/04/2017 Janet end-to-end performance initiative update
  • 24. E2E performance to cloud compute » We’re seeing growing interest in the use of commercial cloud compute › e.g. to provide remote CPU for scientific equipment » Complements compute available at the new ESPRCTier-2 HPC facilities › epsrc.ac.uk/research/facilities/hpc/tier2/ » Anecdotal reports of 2-3Gbit/s into AWS › e.g. by researchers at the Institute of Cancer Research › See presentations at RCUK CloudWorkshop - https://cloud.ac.uk/ » Bandwidth for AWS depends on theVM size › See https://aws.amazon.com/ec2/instance-types/ » We’re keen to explore cloud compute connectivity further › Includes AWS, MS ExpressRoute, … › And scaling Janet connectivity to these services as appropriate 11/04/2017 Janet end-to-end performance initiative update
  • 25. Future plans for E2EPI » Our future plans include: › Writing up and disseminating best practice case studies › Growing a UK RDTZ by promoting such best practices within communities › Deploying a second 10G Janet perfSONAR test node at our Slough DC › Deploying a 10G reference DTN at Slough and performing transfer tool benchmarking › Promoting wider campus perfSONAR deployment and community meshes › Integrating perfSONAR data with router link utilisation › Developing our troubleshooting support further › Experimenting with Google’sTCP-BBR › Exploring best practices in implementing security models for Science DMZ › Expanding Science DMZ to include IPv6, SDN and other technologies › Running a second community E2EPI workshop in October › Holding a hands-on perfSONAR training event (after 4.0 is out) 11/04/2017 Janet end-to-end performance initiative update
  • 26. Useful links » Janet E2EPI roject page › jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative » E2EPI Jisc community page › https://community.jisc.ac.uk/groups/janet-end-end-performance-initiative » JiscMail E2EPI list (approx 100 subscribers) › jiscmail.ac.uk/cgi-bin/webadmin?A0=E2EPI » Camus Network Engineering for Data-Intensive Science workshop slides › jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop- 19-oct-2016 » Fasterdata knowledge base › http://fasterdata.es.net/ » eduPERT knowledge base › http://kb.pert.geant.net/PERTKB/WebHome 11/04/2017 Janet end-to-end performance initiative update
  • 28. Using perfSONAR and Science DMZ to resolve throughput issues AlexWhite, Diamond
  • 29. Diamond Light Source perfSONAR and Science DMZ at Diamond
  • 30. Diamond »The Diamond machine is a type of particle accelerator »CERN: high energy particles smashed together, analyse the crash »Diamond: exploits the light produced by high energy particles undergoing acceleration »Use this light to study matter – like a “super microscope” What is the Diamond Light Source? perfSONAR and Science DMZ at Diamond
  • 31. Diamond »In the Oxfordshire countryside, by the A34 near Didcot »Diamond is a not-for-profit joint venture between STFC and the WellcomeTrust »Cost of use: Free access for scientists through a competitive scientific application process »Over 7000 researchers from academia and industry have used our facility What is the Diamond Light Source? perfSONAR and Science DMZ at Diamond
  • 32. The Machine »Three particle accelerators: › Linear accelerator › Booster Synchrotron › Storage ring – 48 straight sections angled together to make a ring – 562m long – could be called a “tetracontakaioctagon” perfSONAR and Science DMZ at Diamond
  • 33. The Machine perfSONAR and Science DMZ at Diamond
  • 34. The Science »Bright beams of light from each bending magnet or wiggler are directed into laboratories known as “beamlines”. »We also have several cutting- edge electron microscopes. »All of these experiments operate concurrently! »Spectroscopy »Crystallography »Tomography (think CAT Scan) »Infrared »X-ray absorption »X-ray scattering perfSONAR and Science DMZ at Diamond
  • 35. x-ray diffraction perfSONAR and Science DMZ at Diamond
  • 36. Data Rates »Central Lustre and GPFS filesystems for science data: › 420TB, 900TB, 3.3PB (as of 2016) »Typical x-ray camera: 4MB frame, 100x per second »An experiment can easily produce 300GB-1TB »Scientists want to take their data home Data-intensive research perfSONAR and Science DMZ at Diamond
  • 37. Data Rates »Each dataset might only be downloaded once »I don’t know where my users are »Some scientists want to push data back to Diamond Data-intensive problems perfSONAR and Science DMZ at Diamond
  • 38. Sneakernet The old ways are the best? perfSONAR and Science DMZ at Diamond
  • 39. Network Limits »Science data downloads from Diamond to visiting users’ institutes were inconsistent and slow › …even though the facility had a “10Gb/s” JANET connection from STFC. »The limit on download speeds was delaying post- experiment analysis at users’ home institutes. The Problem perfSONAR and Science DMZ at Diamond
  • 40. Characterising the problem »Our target: “a stable 50Mb/s over a 10ms path” › Using our site’s shared JANET connection › In real terms, 10ms was approximately DLS to Oxford perfSONAR and Science DMZ at Diamond
  • 41. Baseline findings »Inside our network: 10Gb/s, no packet loss »Over the STFC/JANET segment between Diamond’s edge and the Physics Department at Oxford: › low and unpredictable speeds › a small amount of packet loss Baseline findings with iperf perfSONAR and Science DMZ at Diamond
  • 42. Packet Loss TCPThroughput = »MSS (Packet Size) »Latency (RTT) »Packet Loss probability TCP Performance is predicted by the Mathis equation perfSONAR and Science DMZ at Diamond
  • 43. Packet Loss “Interesting” effects of packet loss perfSONAR and Science DMZ at Diamond
  • 44. Packet Loss »According to Mathis, to achieve our initial goal over a 10ms path, the tolerable packet loss is: › 0.026% (maximum) perfSONAR and Science DMZ at Diamond
  • 45. Finding the problem in the Last Mile »We worked with STFC to connect a perfSONAR server directly to the main Harwell campus border router to look for loss in the “last mile” perfSONAR and Science DMZ at Diamond
  • 46. Science DMZ The Fix: Science DMZ perfSONAR and Science DMZ at Diamond
  • 47. Security »Data intensive science traffic interacts poorly with enterprise firewalls »Does this mean we can use the Science DMZ idea to just… ignore security? › No! Aside: Security without Firewalls perfSONAR and Science DMZ at Diamond
  • 48. Security »Implementing a Science DMZ means segmenting your network traffic › Apply specific controls to data transfer hosts › Avoid unnecessary controls Science DMZ as a Security Architecture perfSONAR and Science DMZ at Diamond
  • 49. Security – techniques to secure Science DMZ hosts »Only run the services you need »Protect each host: › Router ACLs › On-host firewall – Linux iptables is performant (see the sketch below) perfSONAR and Science DMZ at Diamond
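As a hedged illustration of that “only what you need” host policy, the sketch below drives iptables from Python with a default-deny inbound stance and a single permitted service. The port (GridFTP control, 2811) and the peer network (a documentation address range) are assumptions for illustration, not Diamond’s actual rules; it must run as root.

```python
"""Illustrative sketch of a minimal host policy for a data transfer node:
default-deny inbound, allow loopback and return traffic, and expose one
service to an assumed collaborator network. Port and network are
illustrative only, not Diamond's real configuration. Run as root."""
import subprocess

RULES = [
    # Default-deny everything inbound.
    ["iptables", "-P", "INPUT", "DROP"],
    # Always allow loopback traffic.
    ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
    # Let replies to our own outbound connections back in.
    ["iptables", "-A", "INPUT", "-m", "state",
     "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    # Expose only the data-transfer service (GridFTP control port, as an
    # assumed example) to an assumed collaborator network.
    ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", "2811",
     "-s", "192.0.2.0/24", "-j", "ACCEPT"],
]

for rule in RULES:
    subprocess.run(rule, check=True)
```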
  • 50. Globus GridFTP »We adopted Globus GridFTP as our recommended transfer method › It uses parallel TCP streams › It retries and resumes automatically › It has a simple, web-based interface perfSONAR and Science DMZ at Diamond
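The editor’s notes later in this deck explain why parallel streams help: when one stream loses a packet and backs off, the others carry on. Under the Mathis model from the earlier slides, and the simplifying assumption that streams experience loss independently, aggregate throughput scales roughly with the stream count. A small sketch of that first-order estimate (path parameters are assumed examples):

```python
import math

def mathis_ceiling_bps(mss_bytes, rtt_s, loss_prob, c=0.7):
    """Approximate single-stream TCP ceiling from the Mathis equation.
    The constant c is an assumption, as in the earlier sketch."""
    return c * (mss_bytes * 8) / (rtt_s * math.sqrt(loss_prob))

# Assumed example path: 10 ms RTT, 0.05% loss, 1460-byte MSS.
single = mathis_ceiling_bps(1460, 0.010, 0.0005)
for n in (1, 4, 8):  # GridFTP commonly runs several streams in parallel
    # First-order estimate: n independent streams ~ n times one ceiling,
    # since a loss event forces only one stream to back off.
    print(f"{n} stream(s): ~{n * single / 1e6:.0f} Mb/s aggregate")
```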
  • 51. The Diamond Science DMZ perfSONAR and Science DMZ at Diamond
  • 52. Diamond Science DMZ – speed records! »Test data: 2Gb/s+ consistently between Diamond and the ESnet test point at Brookhaven Labs, New York State, USA »Actual transfers: › Biggest: electron microscope data from DLS to Imperial – 1120GB at 290Mb/s › Fastest: crystallography dataset from DLS to Newcastle – 260GB at 480Mb/s perfSONAR and Science DMZ at Diamond
  • 53. Bad bits »Globus’ logs › Netflow »Globus install »My own use of perfSONAR perfSONAR and Science DMZ at Diamond
  • 54. Different transfer tools? »SCP »Aspera perfSONAR and Science DMZ at Diamond
  • 55. Near Future »10Gb+ performance › Globus cluster for multiple 10Gb links › 40Gb single links perfSONAR and Science DMZ at Diamond
  • 56. Thank you! »Use real-world testing »Never believe your vendor »Zero packet loss is crucial »Enterprise firewalls introduce packet loss Alex White, Diamond Light Source, [email protected] perfSONAR and Science DMZ at Diamond
  • 58. Essentials of the Modern Performance Monitoring with perfSONAR Szymon Trocha, PSNC / GÉANT, [email protected] 11 April 2017
  • 59. Motivations »Identify problems when they happen or, better, before they do »The tools must be available (at campus endpoints, at demarcations between networks, at exchange points, and near data resources such as storage and computing elements) »Access to testing resources 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 60. Problem statement » The global Research & Education network ecosystem is comprised of hundreds of international, national, regional and local-scale networks » While these networks all interconnect, each network is owned and operated by a separate organization (a “domain”) with different policies, customers, funding models, hardware, bandwidth and configurations » This complex, heterogeneous set of networks must operate seamlessly from “end to end” to support science and research collaborations that are distributed globally 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 61. Where Are The (multidomain) Problems? [Diagram: a path from source campus S to destination campus D, showing congested or faulty links between domains, congested intra-campus links, and latency-dependent problems inside domains with small RTT] 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 62. Challenges »Delivering end-to-end performance › Get the user, service delivery teams, local campus and metro/backbone network operators working together effectively – Have tools in place – Know your (network) expectations – Be aware of network troubleshooting 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 63. What is perfSONAR? » It’s infeasible to perform at-scale data movement all the time – as we see in other forms of science, we need to rely on simulations » perfSONAR is a tool to: › Set network performance expectations › Find network problems (“soft failures”) › Help fix these problems › All in multi-domain environments » These problems are all harder when multiple networks are involved » perfSONAR provides a standard way to publish monitoring data » This data is interesting to network researchers as well as network operators 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 64. The Toolkit » Network performance comes down to a few key metrics: › Throughput (e.g. “how much can I get out of the network?”) › Latency (the time it takes to get to/from a destination) › Packet loss/duplication/ordering (for some sampling of packets, do they all make it to the other side without serious abnormalities?) › Network utilization (how much of the network is already in use at a moment in time – the counterpart of “throughput”) » We can get many of these from a selection of measurement tools – the perfSONAR Toolkit » The perfSONAR Toolkit is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols » All components are available as RPMs, DEBs, and bundled as a CentOS ISO » Very easy to install and configure (a default install usually takes less than 30 minutes) 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 65. Importance of Regular Testing » We can’t wait for users to report problems and then fix them » Things just break sometimes › Bad system or network tuning › Failing optics › Somebody messed around in a patch panel and kinked a fiber › Hardware goes bad » Problems that get fixed have a way of coming back › System defaults come back after hardware/software upgrades › New employees may not know why the previous employee set things up a certain way, and may back out fixes » It is important to continually collect, archive, and alert on active test results (a minimal sketch follows) 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
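A minimal sketch of that collect-and-alert loop, reusing the iperf3 wrapper idea from the Diamond talk earlier. In a real deployment the perfSONAR Toolkit’s scheduled tests, archive and dashboard do this properly; the hostname and threshold here are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Illustrative collect-and-alert sketch: run a scheduled throughput test
and flag results below an expected floor. A real deployment would use the
perfSONAR Toolkit's regular testing plus a dashboard; the hostname and
threshold here are assumptions."""
import json
import subprocess

REMOTE = "ps-test.example.ac.uk"  # placeholder measurement host
EXPECTED_MBPS = 500.0             # assumed floor; alert below this

def measure_mbps(host: str) -> float:
    out = subprocess.run(["iperf3", "-c", host, "-t", "20", "-J"],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)["end"]["sum_sent"]["bits_per_second"] / 1e6

if __name__ == "__main__":
    # Run from cron (say hourly), archive the result, and alert on dips,
    # so regressions are caught before users report them.
    mbps = measure_mbps(REMOTE)
    print(f"measured {mbps:.0f} Mb/s")
    if mbps < EXPECTED_MBPS:
        print(f"ALERT: below expected {EXPECTED_MBPS:.0f} Mb/s floor")
```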
  • 66. perfSONAR Deployment Possibilities »Dedicated server › A single CPU with multiple cores (2.7 GHz for 10Gbps tests) › 4GB RAM › 1Gbps onboard NIC (management + delay) › 10Gbps PCI-slot NIC (throughput) »Small node – low cost, small PC › e.g. GIGABYTE BRIX GB-BACE-3150, various models »Ad hoc testing › Docker 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 67. Deployment Styles: Beacon, Island, Mesh 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 68. Location criteria »Where can it be integrated into the facility software/hardware management systems? »Where can it do the most good for the network operators or users? »Where can it do the most good for the community? [Diagram labels: EDGE, NEXT TO SERVICES] 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 69. Distributed Deployment In a Few Steps »Hosts: › Choose your home – connect to network › Install Toolkit software – by site administrator › Configure hosts – networking, 2 interfaces, visibility › Point to central host – to consume central mesh configuration »Central server: › Install and configure – by mesh administrator; central data storage; dashboard GUI; home for mesh configuration › Configure mesh – who, what and when; every 6 hours (bandwidth); be careful 10G -> 1G › Publish mesh configuration – to be consumed by measurement hosts › Run dashboard – observe thresholds; look for errors 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 70. New 4.0 release announcement (April 17th) » Introduction of pScheduler › New software for measurements, replacing BWCTL » GUI changes › More information, better presentation » OS support upgrade › Debian 7, Ubuntu 14 › CentOS 7 – still supporting CentOS 6 » Mesh config GUI » MaDDash alerting 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
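For readers new to pScheduler, here is a tiny sketch of driving a one-off test from Python by shelling out to the pscheduler CLI. The destination hostname is a placeholder, and the exact option set may vary between releases, so treat this as illustrative rather than definitive.

```python
"""Illustrative sketch: driving a one-off pScheduler throughput test
from Python on a perfSONAR 4.0 host. The destination is a placeholder;
check the current perfSONAR docs for the full CLI option set."""
import subprocess

DEST = "ps-test.example.ac.uk"  # placeholder far-end measurement host

# Equivalent to typing `pscheduler task throughput --dest <host>` at a
# shell: pScheduler schedules the test, runs it, and prints the result.
subprocess.run(["pscheduler", "task", "throughput", "--dest", DEST],
               check=True)
```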
  • 71. 4.0 Data Collection 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 72. New GUI 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR
  • 73. Thank you! [email protected] 12/04/2017 Essentials of the Modern Performance Monitoring with perfSONAR • http://www.perfsonar.net/ • http://docs.perfsonar.net/ • http://www.perfsonar.net/about/getting-help/ • perfSONAR videos • https://learn.nsrc.org/perfsonar Hands-on course in London is coming…
jisc.ac.uk Szymon Trocha, PSNC/GÉANT, [email protected] 12/04/2017

Editor's Notes

  • #38: Each dataset might only be downloaded once. Users might only want a small section of their data – over the course of an hour of data collection, perhaps only the last five minutes were useful. It does not make sense to push every byte to a CDN. I don’t know where my users are. Some organisations have federated agreements with partner institutions – dark fibre, or VPNs over the Internet. I, however, have users performing ad-hoc access from anywhere. I can’t block segments of the world – we have users from Russia, China etc. Some scientists want to push data back to Diamond: reprocessing of experiments with new techniques; new facilities located outside Diamond using our software analysis stack.
  • #39: When I started at Diamond, this is how users moved large amounts of data offsite. It’s a dedicated machine called a “data dispenser”: it’s actually a member of the Lustre and GPFS parallel filesystems, and it speaks USB3. It’s an embarrassment.
  • #41: We started testing Diamond’s own network using cheap servers with perfSONAR, to run ad-hoc tests. First we checked them back-to-back to confirm they could do 10Gb/s; then we tested deep inside the science network next to the storage servers, and at the edge of the Diamond network where we link to STFC. Then we used perfSONAR’s built-in community list to find a third server just up the road at the University of Oxford to test against.
  • #42: Packet loss? Is that a problem? In my experience on the local area, no – it had never concerned me before. This was also the attitude of my predecessors.
  • #43: There are three major factors that affect TCP performance. All three are interrelated. The “Mathis” equation predicts the maximum transfer speed of a TCP link. TCP has the concept of a “Window Size”, which describes the amount of data that can be in flight in a TCP connection between ACK packets. The sending system will send a TCP window worth of data, and then wait for an acknowledgement from the receiver. At the start of a transfer, TCP begins with a small window size, which it ramps up in size as more and more data is successfully sent. You’ve heard of this before; this is just Bandwidth Delay product calculations for TCP Performance over Long Fat Networks. However, BDP calculations for a Long Fat network assume a lossless path, and in the real world, we don’t have lossless paths. When a TCP stream encounters packet loss, it has to recover. To do this, TCP rapidly decreases the window size for the amount of data in flight. This sounds obvious, but the effect of spurious packet loss on TCP is pretty profound.
  • #44: All else being equal, the time for a TCP connection to recover from loss goes up as the latency goes up.
  • #45: The packet loss between Diamond and Oxford was too small to previously cause concern to network engineers, but was found to be large enough to disrupt high-speed transfers over distances greater than a few miles.
  • #46: So... once we had an idea what the problem was, instead of just complaining, I could take actual fault data to my upstream supplier. Phil at STFC was gracious in working with us to connect up another perfSONAR server. Testing with this server let us run traffic just through the firewall, which showed us that the packet loss problem was with the firewall, and not with the JANET connection between us and Oxford.
  • #47: “Science DMZ” (an idea from ESnet) is otherwise known as “moving your data transfer server to the edge of your network” Your JANET connection has high latency, but hopefully no loss. If your local LAN has an amount of packet loss, you might still see reasonable speed over it because the latency on the LAN is so small. The Science DMZ model puts your application server between the long and short latency parts, bridging them. Essentially, you’re splitting the latency domains.
  • #48: This section is cribbed from an utterly excellent presentation by Kate Petersen Mace from ESnet. The Science DMZ is about reducing degrees of freedom and reducing the number of network devices in the critical path. Removing redundant devices, methods of entry etc. is a positive enhancement to security.
  • #49: The typical corporate LAN is a wild west filled with devices that you didn’t do the engineering on yourself: network printers, web browsers, financial databases, in my case modern oscilloscopes running Windows 95…, that cool touch-screen display kiosk in reception, Windows devices. In comparison to the corporate LAN, the Science DMZ is a calm, relaxed place. You know precisely what services you’re moving to the data transfer hosts. You’re typically only exposing one service to the Internet – FTP for example, or Aspera. You leave the enterprise firewall to do its job – preventing access to myriad unknown ports – and you use a more appropriate tool to secure each side.
  • #50: Linux iptables will happily do port ACLs on the host at 10Gb with no real hit on the CPU.
  • #51: This means that if a packet gets dropped in one stream, the other streams carry on the transfer while the affected one recovers.
  • #52: This is how we implemented the Science DMZ at Diamond. You can see the data transfer node, connected directly to the campus site front door routers. A key question is how you get your data to your data transfer node: move data to it on a hard drive; route your data through your lossy firewall front door – it’s low latency, so Mathis says this should still be quick; or, as we did, go a step further and put in a private, non-routable fibre link to a server in the datacentre. The internal server has an important second function – authentication of access to data. I didn’t want to put an AD server in the Science DMZ if I didn’t need to, so this internal server runs the authentication server for Globus Online from inside our corporate network.
  • #54: I tend to use perfSONAR just for ad-hoc testing. It’s fabulous for this, and I can direct my users to it for self-service testing of their network connections. I know it is very useful for periodic network monitoring, but I don’t take advantage of that yet
  • #55: The common implementation of SCP has a fixed TCP window size; it will never grow big enough to fill a fast link. Aspera is a commercial, UDP-based system. I’ve never used it – I’ve heard people from various corners of science talk about it, but I’ve never been asked for it. I’m sure it would do well on the Science DMZ if we ever need it.
  • #56: - globus online default stream splitting -> per file or chunks?