
October 15, 2015

  • Attendees: Jim Cherry, Greg Warth, Sean Davis, Steve Fellini, Diana Kelly, Omar Rawi, Rahcyne Omatete, Eric Stahlberg, Kelly Lawhead, Carl McCabe
  • Announcements
    • Third Thursday next month is the week before Thanksgiving (Supercomputing conference) – the next HPC Thought Leaders meeting will be held in December instead of November
    • We are on track with the Long Range Planning content – putting it into the template and sharing it around to get input

 

  • HPC Resource Updates
    • There have been some significant investments and commitments made since these meetings started. Anchor points are in place to form a longer-term strategy and decide how best to use them.
    • CBIIT Globus Connect server update – its purpose has evolved to meeting pre-archive, limited-term data storage needs. People who don't know where to put data now have a place for a limited term while a long-term destination is being identified (requesting storage, etc.). It serves as a storage shock absorber for those with large data needs. Operational status: it is in operation and backups are being made; retention is limited and the duration is not necessarily specific, but people understand it is for a limited time until data moves to a permanent location. Globus sharing has been enabled, and server functionality has been tested.
    • Starting to get open requests on what this server can do for us.
    • Comment: It would be useful to go through the storage inventory that CBIIT has, identify large-scale storage users, and come back around to them to suggest changing the network-mounted-drive concept to more of a storage endpoint (CBIIT and Frederick) – a critical strategic piece is removing desktops from large-scale data movement (and making sure that actually happens).

 

  • Cleversafe Update (Greg Warth)
    • Decided to change the backend storage, so there is a delay. Expect to have it in mid-November and will install it in Frederick (around 1.2 terabytes of usable storage). Plan on putting one of the nodes at Shady Grove, and talked about the possibility of getting one at Biowulf (still interested in doing that).
    • Need to talk to someone to get permission for the data center (talked with CIT and the network group to get the network set up – then we will talk with the data center folks to get it done). Identify a contact person for the data center (Greg will identify someone).
    • Eric: who is sharing in that Cleversafe capability? Sean provided the majority of the funding, but Eric and Andrew Quam's group, with the data storage repository, also funded the project.
    • We will look to see whether we need to set money aside to put in front of it (CCR and CBIIT put money into it to get it up and running, along with Andrew Quam's group).
    • CBIIT has a long-term strategic planning process (long range planning), but storage will go online long before that is concluded – we need to understand what our business model for storage is going to be. Discuss data policies and include the CBIIT storage group, Biowulf, potentially DCEG, and ISC to understand where we are headed and what the guiding principles are. Action item: set up a meeting with this group to start talking about those things.

 

  • Cloud Resources Update – Sean Davis and Azure specifically:
    • Azure: CIT signed an extension to the enterprise agreement with Microsoft to give access to services (Office 365 and OneDrive – a Dropbox-like solution per user) as well as storage and other cloud services. Microsoft has set up an accounting system through CIT, and CIT is figuring out what that will look like: how to charge institutes and handle billing for cloud services beyond Office 365 and OneDrive.
    • It will be on an annual basis with a start date, and the contract will continue through the calendar year, during which money can be parked in a CIT account and used to pay for Microsoft services. At the end of the year the account needs to be at or near zero, because money doesn't roll over into the new year.
    • How this will be done is totally up in the air right now but is being discussed – we need to meet with CIT when things are farther along so that at the NCI level we know what central management and group access to services will look like.
    • Trying to find an archive storage service, so we need to see what the pricing model is and how we can charge; it will be advantageous to have that discussion.
    • The storage model is similar to Amazon's – they don't have anything equivalent to Glacier or nearline storage.
    • They offer a piece called StorSimple – similar in concept to a storage gateway (a good way to cross-migrate some users' data).
    • Unlike the Amazon cloud, where you need to migrate things into an HGSS-based system, Microsoft offers direct storage in the cloud (which opens up potential uses).
    • Need to wait for CIT to sort out the details of billing – when they are ready we will meet with them. We do have a Microsoft representative on campus who is technically savvy (talk about this at some point later).
    • AWS (Sean) – what were some of your experiences? Has had access to Amazon for 6-8 months and uses it relatively regularly to try things out: set up servers to try out databases, analytical approach implementations, and a project-tracking/management system.
    • Did not have to download large data – had an identical environment for 20 students; it worked really well in an academic setting.
    • Discussion – NIAID – Lambda services – automated runs on certain triggers
    • Azure and AWS are the two cloud resources we can get to: there is now some precedent we can turn to in terms of possible approaches when we need something. We have something to build from.
    • Focus on Azure for the short term, but AWS is so much more ingrained in research computing that we need to target it. Talked with CIT.

 

  • Biowulf 2 update - Steve Fellini
    • Phase 2 contract awarded a week ago: 30,000 cores, moving to an EDR backbone, and a couple of additional DDN storage systems (4 petabytes, with a requirement of 4 Gb per second bandwidth).
    • Electrical work in the data center will delay installation (hoping for March, but it's up in the air right now).
    • Capacity increase from 28,000 cores (a good number of those are slightly older technology – 1 gigabit per second rather than 10). Processing power is still decent.
    • Will be retiring several thousand of the cores, so not quite doubling capacity.
    • Data Management Service Update – Eric Stahlberg
      • Phase 1 earlier this year – aggregating data and metadata and figuring out ways to deliver data services to users
      • Project extended a couple of months because of developments with Cleversafe – the effort established operational capability to take large files with metadata using the Globus server, and defined the interfaces most useful for users
      • The decision to move ahead with Cleversafe was made – extended by a couple of months so the team can look at Cleversafe and experience instances, mapped as we've defined them, in the cloud
      • Mapping the resource to use cases as we've defined them is being done
      • Looking at a product like iRODS as infrastructure to manage across data sources – the team will look at what it would take to use iRODS as a bulk environment, and map user needs to that
      • Instrumentation introduced between resource and user to understand needs, bottlenecks, etc.
      • Understand what phase 2 would be – take the different pieces from the last year (the interfaces, use cases, and metadata interest level) and pull them into phase 2 to stitch these various resources together

HPC Needs Review and Refresh

  • Support for long term planning – focus for coming year
  • People, storage, data management support, cloud access, more cycles (needs for coming year)

Comments from everyone on priorities and order

  • Plan to manage data, not just storage but the right storage (we have expensive storage holding data that should be archived – the idea that we need life-cycle management of storage is very important)
  • Kelly – outreach and customer engagement so people know about this and what we are doing
  • Potentially putting something on ServiceNow (Nikola) to begin to support HPC needs through that interface
  • Requests for support are pretty diverse now; there is no pattern

FY16 HPC priorities

  • Education, outreach and training
  • NCI-DOE collaboration
  • Data management support – not just storage but making sure we manage the data
  • Cloud explorations – how best to use the cloud for what we need to do. What is it that we should do there and how? Azure? What’s the support model and business model?
  • Leverage new HPC related investments – effectively streamlining things that CIT has put in place – making sure infrastructure begins to emerge
  • HPC support processes – establishing these and making sure that, as services and capabilities come online, we have a good way to handle inquiries and requests and to deliver
  • HPC analyst starting soon – the background check is not completed, so the start date will be determined day by day
  • Make sure no requests fall through the cracks

 

Other items

  • November 10th invite - Carry over items
  • Spring summit in DC area
  • Website for HPC efforts
  • Call out IT strategic plan since priorities came out of that (to maintain alignment)
  • What are the mutual organizational benefits for NCI and DOE of working together? Set up a special time to run through the view graphs that have been developed.
  • Three pilot areas identified as NCI priorities
  • Supporting the RAS mission – developing higher-end, larger-scale computational models with higher fidelity to explain how RAS interacts with the cell membrane
  • Pre-clinical models
  • Integrating broader clinical data

Eric action item – get input from NCI: get the transcript from the last presentation that Warren did (a transcript of the questions, to see if it can be better developed)

December 17, 2015

Pre-Meeting Discussion

  • PM collaboration effort
  • Project management support so we have shared area for various documents amongst team members and add team members to projects as we need them
  • ServiceNow could meet our needs but would have to be expanded beyond its current operational focus

Agenda

  • DOE Collaboration
  • HPC Long Range Plan Feedback

DOE Collaboration

  • The long range plan is not completely guided by the DOE collaboration, but there is some input and feedback
  • 15 people in the room from NCI and DOE (members and leadership), national labs, Livermore, etc.
  • Spoke in the context of the National Strategic Computing Initiative and the Precision Medicine Initiative
  • Proposed collaboration intended to support both initiatives
  • Three pilots
    • Predictive algorithm for cancer therapy
    • Predictive models for pre-clinical screening
    • Multi-scale data modeling using the RAS initiative
    • RAS proteins in membranes
    • Milestones for 3-year pilots

Questions

  • If things move forward how long will this go on with DOE and how will we phase out?
    • This is going to go forward at least in terms of sufficient support for planning to go ahead. Expectation is that planning will meet the bar to proceed
    • In the long term, what will NCI do? This is a pilot that clarifies the role of scientific computing in cancer. It is not the only thing to consider, but it helps us clarify what we can expect to deliver
    • What is NCI's thinking on integrating people already here into some of these other efforts?
      • Start broadening the teams and broadening involvement in the various pilots
      • Learn more about the pilots, be more involved, and transition expertise in a way that moves away from being fully dependent on DOE for these pilots toward being able to carry it forward
      • These people don't have expertise in biology or oncology – mistakes will be made in the exascale work if they are made
        • Get an idea of what NCI-level support will be and support those integrations
        • Co-location – scientists from DOE co-locate with Frederick scientists to collaborate and share ideas. Specifics are not yet in play but need to be in order to go forward
        • This group needs to get engaged to go forward
        • How about Frederick scientists engaging with Frederick scientists? Breaking down silos within NCI not just between DOE and NCI
        • Lots of interactions between various groups, but piecemeal
        • RAS part and EM images – a little interaction and some touch points; we have something to bring to the table if the table were offered
        • Need to have workshops to get people to come together that have shared interests to talk and communicate
        • Need to have logistical support with which to follow through
        • NCI needs to be the one driving the science – has a lot to say as to what is the most effective way to move the science forward
        • Where does the physical system need to be located? Argonne?
        • What would need to be carried forward from the pilots back to NCI, and how? Do we need exascale capacity, and how do we deliver it? Do we need access to it remotely or here at NCI?
        • We should focus on the science – DOE and exascale should be irrelevant. If CORAL is the only place we can do it, then so be it.
        • It is tempting to think that because it's an HPC pilot we need to use the newest technology – it ultimately needs to be sustainable and portable for us to use
        • Have to be careful what the priority is – the priority for NCI has nothing to do with establishing exascale in biology; NCI cares that we get science out of this
        • Our vision needs to be aligned with DOE's
        • Needs to be focused not only on the science but also on the scientists – whatever is developed needs to be able to come back and help the people in the labs and clinics at NCI do their jobs better
        • Need to flesh out NCI goals when we plan this (write it out like a grant). The technology piece is one aim but not a primary aim for the pilot
        • Whether this impacts cancer or not we don't know. But we are posing our challenges to DOE, and what we get out of it / what NCI puts into it is sharing the information in the domain
        • Guidance on what works and what doesn't is important so they don't spend their time on dead ends
        • How many FTEs? Discussion between Warren and Doug
        • What do Warren and Dr. Ishkal want from us, the HPC group?
          • Eric has been driving this – without him it won’t go forward
          • They want to get the science done
          • We help them get the science done in the near term and sustain the ability to do the science in the long term (this group). High-level context: advance the science and precision oncology
          • Structure around data and interfaces? Are they expecting anything from CBIIT or IFOG to set this up?
            • We should be smart observers of what is going on, so that as things develop and are promising we use that to go forward. Be in a position to facilitate the involvement – not just getting data into DOE but getting our scientists plugged into the projects
            • Set up procedures that will work over broad range of things
            • We need to set this up with a group of people driving it, not just one (be broader)
            • As we move into detailed planning phase
            • Machine learning tools, machine learning at large scale, … learning
            • This can be one giant lesson in what doesn't work in all three of those areas

 

  • How do we support Eric?
    • Having discussion now is informative
    • Pull together a group to decide what we need to do to support this in the near and long term
    • Skeptical optimist…
    • We need to have extramural people involved (computational biologists, machine learning, natural language processing people) – pull them from the academic community to be our go-to people, not necessarily doers
    • Is there any new money to bring on people?
      • If we don't develop some expertise internally (building staff, etc.), we will be in the same position in three years
      • To build a team, wherever we do it, we need leadership who has done this before so we know what kinds of people to hire (give us names and approaches that can work); there will be knowledge transfer
      • We need people to counsel us, and we need dedicated people so we have control over it and don't lose it once the projects are done
      • (all agree)
      • Resources that are dedicated to this and interact with other groups at NCI
      • We shouldn't wait until the pilots are over to transfer the successful things once we identify them
      • The workshop at Supercomputing is a first step to build this effort, augment it, and begin with it
      • Need dedicated resources (not IPAs or details from other parts of government), potentially contracting from data science companies; money needs to be set aside for this
      • We have to scale now from our own resources (let's start putting people in place to peer with DOE and move forward) so we can learn
      • Identifying who we think thought leaders are and start pulling them in
      • Identify specific timeframe milestones
      • The intention of the meeting was to present to the Secretary of Energy to see if this was worthwhile to move ahead with – the agreement was yes, let's move ahead. Scope and budget are being discussed right now
      • Determine whether this will have impact on cancer research that is done computationally
      • Feasibility is also discussed in this
        • NCI – have you advanced understanding of science
        • Need to treat this like a grant, not a pilot: 75% should be guaranteed to be done; 25-30% can be high risk
        • We need to define the 75%
        • What can we do near term, what is concrete? Then what is the skeptical stuff that we can do afterwards? Some of it might fail
        • One role of this group is that as these ideas roll forward they get sanity checked by the right eyes from NCI – not just happening in a silo or a back room
        • We need to do something other than an hour-long meeting each month
        • Need a different tack and approach
        • We need to be more concrete. Where oversight or expertise doesn't yet exist, we need to make sure it gets put in place early. Identify high- and low-risk things, have the right mix, and resource them appropriately

Next Steps and Action Items:

  • Getting group together
  • Have another focused meeting to discuss in greater detail what we need to do
  • Distributing what information there is
  • Getting involved in the details
  • From the NCI side of things, who is involved at the table aside from the pilots?
  • Grants management and contracting, admin and infrastructure, and computational – who are the leaders?
    • Warren, Dave Heinberg, Dwight, etc.
    • Supporting Warren and working with Doug
    • Julie Klemm
    • What is extramural aspect of it?
    • Group is very high level – not a lot of science “doers” or reality checkers

 

  • Must have meeting to decide how this project is going and what needs to be done
  • Need steering committee on the NCI side
  • It is dangerous to say DOE will take care of the IT. They need to tell us exactly what they are doing so we can fill gaps and provide resources
  • Steering committee with sub-working groups – IT, NCI admin, computational, computational biology – don't worry about the scientific side of things
  • Need to find these experts somewhere
  • Do it before we get much further along
    • Use org chart to try to figure out what fits

Next HPC Thought Leaders Meeting – January 21st, 2016

January 21, 2016

Agenda

  • General Updates
  • Storage and Data Management
  • Long Range Planning
  • DOE Collaboration
  • FY17 Priorities

 

Quick Updates

  • Cleversafe – Greg Warth: Equipment is here and staff spoke with Cleversafe, who are coming on site to discuss installation issues. Expect installation and the system up and running by the middle of February, and hoping people can start using it in test mode sometime in March. Writing the project plan and putting dates in it right now. Need to know the level of protection we are looking for in the system (how many nodes to put on there) and when to expect limited and full production. Will be in contact to ask more detailed questions. First part of installation – all 4 nodes in Frederick, then nodes in Shady Grove and one at Biowulf.
  • Nathan Cole: No new updates. Wants to get in the loop with regard to the Cloud Pilots – has not heard anything. Has anything been made available? (Sean Davis will give an update on that.)
  • Xinyu Wen: no updates. Everything is working fine. Negotiating for storage – in progress
  • Dianna Kelly: no updates
  • Jim Cherry: not online
  • Steve Fellini: no updates
  • Sean Davis: quick update on Cloud Pilots:
    • All 3 are officially open for business. Seven Bridges and ISB/Google are open (4 and 7 weeks); Broad opened yesterday. Seven Bridges is pretty robust – an instantiation of their cloud platform with bells and whistles and access to DCG data
    • ISB/Google is rough around the edges. One piece working well is SQL access to the database (BigQuery); they have loaded most TCGA data into that system along with metadata (clinical and file level). The workflow management system and data management are still up in the air. The web portal is essentially in alpha testing.
    • Broad – nobody has seen anything but screenshots. Open for business, but the way they are doing things means they are accepting applications and it will take a couple of weeks for people to get access. No additional detail.
    • Carl McCabe: Intramural Retreat – did you come across new people with ideas you were not aware of?
      • Sean Davis: the CCR chunk of Biowulf – many people didn't realize that resource existed or what problems it addresses. It's a problem we'll keep having – we need to know how to communicate to the right people. There is also the question of the human resource level to devote to these projects, and how much more resourcing we should put in to make sure the infrastructure is maximally utilized; we don't do much at this point. Need strategic planning for storage and data management, and it needs to include human resources.

 

  • Representatives from three different groups said they need to move data – how? Same thing with compute – our scientific computing needs (genomics, etc.) are being met, but we have needs that could be addressed using HPC that have not traditionally been addressed that way.
  • GridFTP server support in Frederick? On the storage and data management piece: we need resources to back up what we stand up. Sean is hesitant to publicize the Frederick instance because its life cycle is unknown. There is also the question of communicating, once we set these things up, what implied or explicit policies are in place and what they are meant to support.
  • The issue isn't using GridFTP but making sure we know what happens with the data. Data transfer is a convenient mechanism, but users need to understand where their data needs to go.
  • Moving forward, we need to think of data management and storage as a suite we offer and be careful that Frederick, intramural, etc. are all on the same page. Need to have the communication to support the technology that is already in place.

 

  • Eric Stahlberg
    • Familiarity and education challenge: raise awareness of what we have and how it fits people's needs – what is there, and how far we can extend resources for data management, storage, and compute. Don't overpromise and under-deliver or under-perform.
    • General request mechanism is in place; Eric and George are filling in the back end. This is only the triage part – it needs to extend out. Bringing forward metadata and upgrading.
    • Have submitted a couple of requests. People contact us by email, etc., and we are guiding them to submit through that mechanism so we can build a history and profile of the types of requests coming in.
    • Differentiating informal requests from formal ones that need action – raising visibility there to get resources and to use in future planning. We had talks about different roles and functions; now we have the pieces in place and must move forward with execution.
    • Jim Cherry: no updates

Storage and Data Management

Make sure we start talking through the issues.

  • Have services for moving data, soon will have for archiving data
  • How do we push forward to meet today’s needs and needs of future?
  • Must have what people need and want in terms of data storage and management capabilities
  • Globus GridFTP (Sean Davis) – put on the agenda: we have two working systems, and Biowulf/Helix is a third. We have a nice data fabric including Frederick, Shady Grove, and 12A – we should invest a small amount of human resources, volunteering our time, to talk to people (didactically) and show them what we can do with this
  • A request came from Duke around sequencing – how do we access our data? Sean showed how to do that with Globus. This kind of thing would be helpful in pulling people out of the woodwork. Asking "how can we help you?" doesn't elicit an answer – showing the technology does
  • Carl McCabe: put something on the NCS website – an illustrated walk-through
  • Sean Davis: a hands-on, tutorial approach (it's not complicated technology). People may shy away if they haven't seen it done and think it's complicated. It's not hard to use once you've used it once, but the initial learning curve is daunting even though it shouldn't be – there are a lot of opaque concepts
  • Sean Davis: how do we deploy Cleversafe? The details are slightly above Greg's questions (how to carve up storage for data integrity) and more along the lines of usability. There are specific ideas on how to use the system that may not jibe with what people are thinking. We want to treat object storage as object storage – have people access data using S3, Swift, or the Cleversafe API directly, as opposed to putting light- or heavy-duty front ends on it. There are many third-party tools, software packages, and documentation built around S3 and Swift; if we don't expose these APIs, we will end up having to create infrastructure we shouldn't have to create. It is also about moving NCI forward in how we think about data architecture, engineering, and science – adopt technologies that allow us to leverage our data wherever it sits.
  • We should be able to access data sitting in S3 on Amazon from Biowulf in the same way we access data sitting in the Cleversafe system within NIH from an instance on Amazon. This is important because there will come a time, not too far out, when we need to use resources above and beyond what we supply locally; we should not box ourselves into setting up a whole new infrastructure for our data management. It would be nice to have Amazon access Cleversafe the same way we access S3 data from Biowulf (a "touch it from anywhere" approach to cloud storage).
  • Greg, Jeff, and Dianna are doing long range planning for storage – we should meet to go over the different things we expect storage to do, the interfaces we need to build, the APIs to expose, and how storage will work across all of NCI
  • A meeting to go over Cleversafe is a start (need to capture all of this). Unified storage model – adapt what you're trying to do here; it should be able to build in
  • Need to involve CBIIT, ISP, Frederick, CIT folks to incorporate Cleversafe
  • Sean Davis: hold a meeting at Shady Grove to let the Biowulf folks see the data center, then a meeting at 12A to let the CBIIT and Frederick folks meet and greet and see the CIT data center
  • It will be mid-February before anything is installed and March before anything is operational – we want to do it right rather than quickly
  • Sean Davis: APIs and user interfaces are the same thing, not separate. If you expose the S3 or Swift API, there are clients you can download from the internet that immediately access the S3- or Swift-based storage and can mount it on a local system. Leverage software that already exists rather than writing things to deal with our data management needs (a minimal sketch follows this list)
  • Eric Stahlberg: the whole point is accessing a data set and computing on that data set – all these pieces come into play for performance. Find ways to make that as transparent as possible so there is persistence, supporting innovation and analysis
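To make the API point above concrete (see the note in the list), here is a minimal sketch of how the same client code can talk either to Amazon S3 or to an on-premises S3-compatible object store such as Cleversafe, just by changing the endpoint. This is an illustration only: the endpoint address, bucket name, and credentials are hypothetical placeholders, not real NIH systems.

    import boto3

    # Hypothetical endpoint for an on-premises S3-compatible store (e.g., Cleversafe).
    # Point endpoint_url at Amazon instead (or omit it) and the same code works there.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objectstore.example.nih.gov",   # placeholder address
        aws_access_key_id="EXAMPLE_KEY",                      # placeholder credentials
        aws_secret_access_key="EXAMPLE_SECRET",
    )

    # Ordinary S3 operations, unchanged regardless of which backend sits behind the endpoint.
    s3.upload_file("results.bam", "example-bucket", "project1/results.bam")
    for obj in s3.list_objects_v2(Bucket="example-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])

The design point is that existing S3/Swift tooling keeps working against local object storage, so no bespoke front end has to be written.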

 

  • Greg Warth action item: put together a meeting. Sean suggests an agenda item – give an overview of scientific workflows and data management as they are practiced in the field right now, so we start from the same scientific understanding before getting into the weeds of how to implement things. We all need to be on one page in terms of understanding – a potential set of use cases.

 

  • Eric Stahlberg: the Cloud Pilots provide a set of good examples to look at (data management, object identification, etc.), including the concept of a universal resource identifier for data sets. Across all instances that is a common need, and one of the core elements of the strategy is how we are going to satisfy it. It is an enterprise decision that transcends NCI.

 

  • Where do we find long-term archive storage? This is a critical need because of the amount of data we have sitting on expensive disk. Sean has a to-do to get back together with Allison from Microsoft; when he does, he will do it as globally as possible. Need to find a place to put this data.

 

  • Sean to send a communication to everyone at this meeting, and Greg to organize a meeting. Will try to pull things together about Globus Connect and some illustrative example cases of what to do. Get together with Microsoft about moving data from spinning disk to a more affordable resource. Will discuss metadata and data sharing within the context of that meeting.

HPC Long Range Planning

Eric to send document that was pulled together – this will get informed and updated so people can take a look at what was done.

  • The document speaks to projects aligning with storage and data management (more data management) to support compute and HPC, and to avenues for securing resources on and off site (large-scale centers and efforts to go into the cloud). It all converges and the pieces are coming together; the needs are difficult to meet independently.

 

  • NCI-DOE update:
    • No follow up meeting happened because there was no context for it yet
    • What's happening: NCI has the opportunity and need to pursue innovations in different areas – understanding the biology of cancer, developing clinical trials, and understanding more completely the impact of cancer and treatments on the population overall. These are the areas of the pilot explorations, which overlapped well with the needs of DOE to advance exascale computing in the context of the National Strategic Computing Initiative (a cross-government effort to build computational expertise out of silos, with both government and the private sector involved).
    • NCI and DOE gave an overview of the pilots that would be pursued – the agreement was that the effort should go forward
    • Kick-off meetings are being targeted for Feb. 1-2; the context is to put in place a detailed plan for the focused pilot activities. The overall implications of how they extend and where they go are TBD.
    • Aims and goals have been laid out, but detailed plans have yet to be defined (the context for the kickoff meeting)
    • More communications about this are being developed to go out. Contact Eric Stahlberg or Warren Kibbe with any questions

HPC FY 17 Priorities

  • Not identified yet. Discussions and meetings going forward will help shape what we prioritize for FY17.
  • Focusing on DOE right now for the communications plan (talking with Shea, but taking a long-term perspective to pull things together – web pages updated, talking to people, and getting the word out)
  • Formal presentations at Shady Grove, Frederick, et al. – must do formal presentations as part of the communication campaign (in 37, depending on size). Cannot expect people to understand anything from emails alone or to just collaborate. Need a contact in each building to get a room and start talking to people directly.

Next HPC Thought Leaders Meeting -  February 18th, 2016

February 18th 2016

Agenda

  • Data and Data management needs
  • Real use cases and policy implications
  • Establish policy directions through Warren so technology can align to those
    • We have a flexible sandbox of technology in place – need to align and prioritize
    • Brainstorming on Data Services aspect
    • Cleversafe update (if Greg joins) – production roll-out is supposed to be in August; not everyone is ready to use it yet
    • Demand will come much earlier than that point from the HPC perspective
      • Planning for Cleversafe to be the CCR standard for data storage – approximately 40 labs ready

Updates

  • Sean and Eric met with CIT last week – saw info on the forward-looking plan to roll out Cleversafe
  • CSS (data coordination center) – using Cleversafe for storage (third party)
  • They are getting ready to architect how they would use it – their software development plan is done
  • Talking about presenting a production-level version before Cleversafe is rolled out (their timing is a bit off)
  • If it's not ready to go on their timeline, they don't think they can use the cloud version (depends on how they architect it)
  • Working to get familiarity with use cases and needs with CCR – the limiting step is coordinating with production availability of Cleversafe; if it were available sooner, we'd use it sooner
  • Sean Davis: in the short run we should consider using Amazon, because the API is the same – there is no reason not to architect against Amazon for storage in the short term
  • Let everyone know it's a possibility – we should be architecting for that anyway, because Cleversafe is more expensive than Amazon
  • Sean to provide a write-up on how to do that with Amazon in the short term (sign up and provide a credit card – put a POTS order in). No issues with data security as long as we live up to all security requirements. This is for temporary storage and testing purposes, not a production thing.
  • Eric Stahlberg: what type of data can you put up there in general? (anything, just needs to be secured depending on security requirements for that type of data)
  • Government Cloud – separate partition of Amazon.
  • Action item: get a guide from Sean for the purposes of designing and developing an app meant to run in the cloud (Sean can't do it all). There's a work group here at CBIIT involved with getting Amazon services more formally (CBIIT security and CIT security folks are on that team).
  • For informal testing and development against S3, NCI can certainly do it
  • Speak with Sean first – he’s the only one using Amazon right now
  • Longer term planning and opportunities to realize

Forward-looking direction for data management, compute, and cloud (thoughts and perspectives): what people would like to do and anticipate they will need to do:

  • Nathan Cole: for Amazon, one of the biggest things is large-scale imputation – taking all the GWAS data that CGR has collected over the years, turning it into one master data set, and then imputing across all of it. This is not highly storage intensive but it is computationally intensive. We have been able to do it locally, but it shut down everything else for longer than anyone wanted. We will be adding to this data set moving forward and want to be able to do different things with it in an Amazon computing scenario. Little data goes up or comes back, but there is a lot in the middle.
  • Imputation for the data set (over 500 cores), split between local compute and leveraging Biowulf (trying to run a few hundred cores at any given time), took 4 months from start to finish (roughly 800 cores total). Locally we were constrained to half the available cores due to memory issues.
  • Larger nodes would make recombination easier but would not make it go faster. The amount of memory on a node determines how big a chunk of data you can do.
  • Compute cluster utilization right now is typically 60-70%
  • 4 months of Amazon time will be really expensive compared to what we can do locally
  • Might make sense for NCI to put in some money for Biowulf strategically
  • Biowulf expanding by 120% in next 6 months
  • We should do Amazon but there might be a case for taking that money and amortizing it into some local compute
  • Do a cost analysis of what it would cost to run it on Amazon. (It was not 4 months of straight-through crunching time – lots of other things were going on in there as well.)
  • If local compute runs at 30% or higher utilization, it's cheaper to do it locally (rough cut-off)
  • If it's a one-time thing, it's no big deal. If it happens more than once a year, we can't be shut down for that long operationally. Need to find a way to spill it over to Amazon or other places and figure out what works best. Do testing on Amazon, Biowulf, and others to figure it out.
  • How much data to move? A couple of hundred gigs.
  • IMPUTE2 is the name of the application
  • Lots of options available to balance cost and turnaround time
  • Moving data between us and Biowulf or Helix is not a problem, but we are looking for better ways to do it (a transfer sketch follows this list)
  • A couple of requests have come in to get Globus set up on CCADs – is there a reason to try to set ourselves up as an endpoint? (An endpoint and a client are the same thing)
  • Globus adds authentication and authorization, security key management, and cloud automation
  • You are installing a GridFTP server with the Globus pieces on top of it. Installing the server with root privileges allows it to be multi-user, but the underlying technology is the same
  • It’s a 5-minute installation and simple to do – not a lot to worry about
  • Sean Davis to send installation notes from Steve Fellini (contact person for CBIIT)
  • Need some ports open (if you’re using a strong firewall you’ll need to open some ports)
  • Put it on a system that has as much bandwidth as possible and bypass the firewall
  • Do fairly early on – talk to science DMZ folks
    • Sean Davis will send some information to everyone
    • Jack Collins: talked with Javed's folks – they need a place to back up their data
    • Eric Stahlberg and Sean Davis spoke to Anjan
    • Will have to continue to push them to use the technologies already available. They are reluctant to use Globus but will have to bite the bullet and do it. They will be one of the Cleversafe users; there is a process to onboard them, and it will change the way they work a little bit.
    • Cleversafe is not up and running right now – the projection is August
    • In the meantime, we have storage for them – 300-400 terabytes at CBIIT, and we can get more
    • Issue: they have 400 TB on the new system and they want to back it up. The backup system that CBIIT uses is not appropriate; instead, mirror the data over to Isilon, where they could more appropriately manage backups
    • Sean Davis will install things for them and get them up and running and make sure storage is available for them
    • Keep everyone in the loop and make sure they are ok
    • What timeframe are they anticipating for producing data out of the Clinomics effort? By summer, 50-100 patients in the first year (about 1 TB of data – nothing serious). The issue with them is that they tend to spread out: they had 400 available, and before the project started they had 120 on it. Need to figure this out (can't keep making copies of the data). Per the data management plan, the Clinomics data is currently stored in 4 copies.
    • It's not the amount of data but data governance (discipline)
    • They are bringing on 10 new people – hoping we can task one of them with this
    • If groups want to use Cleversafe as a parking place, we have to have a business model for that. What is the business model to sustain Cleversafe as it moves into production? We have a good list of use cases. It is a place where we put non-changing data and keep a copy (safety factor); it is not built for transactional data. Clinomics data versus other data – how do we figure this out? Not wise to put the data on Isilon – should be able to explode it. We need to change how we take in this data (until we have a longer-term plan) and understand the use cases. Band-aid solutions are not good for any of us long term.
    • Make sure we are not re-backing-up data in Isilon (static data – check with Jeff and Javed's group)
      • Being able to utilize this data in other studies and retrospective analysis requires having appropriate metadata
      • Making the data valuable for researchers is just as important as having a place to store it
      • Any of these solutions should include metadata (Sean Davis)
      • No single meta data solution that will work for all projects
      • Different approaches for different groups
      • Educate groups and provide minimal support services to help them create solutions – do it in a way fairly local to the problem; when local problems have global solutions, we can apply those
      • Need to figure out how to incorporate it for each lab or branch. Thinking in terms of globally controlling NCI data won't get us very far
        • Some use cases may just need place to park data, others for meta data
        • Need to figure out how to triage it
        • Sean Davis: sitting down with labs and asking what their use cases are is not a valuable way to go about things
        • Look at it at the architecture level – here's the way to connect things
        • We have idea of types of data and requirements levels currently
        • Have a service-catalog-like list of use cases and triage against it – not everything will be a perfect fit (some will be shoehorned in)
        • Push the technology in the context of use cases – an 80% API covering most use cases
        • Don't try to define one size fits all (there is evolution and uniqueness to metadata)
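To make the Globus piece above concrete (see the note at the data-movement item), here is a minimal sketch of what a user-side transfer looks like once endpoints exist, using the Globus Python SDK. It is an illustration under stated assumptions: the endpoint UUIDs, paths, and the pre-obtained access token are hypothetical placeholders; in practice the token comes from a Globus OAuth2 login flow.

    import globus_sdk

    # Placeholder access token; normally obtained through a Globus OAuth2 login flow.
    TRANSFER_TOKEN = "EXAMPLE_TOKEN"

    # Hypothetical endpoint UUIDs for a lab server and Biowulf/Helix.
    SOURCE_ENDPOINT = "00000000-0000-0000-0000-000000000001"
    DEST_ENDPOINT = "00000000-0000-0000-0000-000000000002"

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
    )

    # Describe the transfer; checksum verification lets interrupted or partial
    # transfers be detected and retried safely.
    tdata = globus_sdk.TransferData(
        tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
        label="example lab-to-Biowulf copy", sync_level="checksum",
    )
    tdata.add_item("/data/run42/", "/scratch/run42/", recursive=True)

    task = tc.submit_transfer(tdata)
    print("submitted transfer task:", task["task_id"])

Once an endpoint is installed (the 5-minute GridFTP setup described above), this is the kind of scripted use that could go into the illustrated walk-throughs and training materials discussed earlier.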

Keep the base technology as general as possible so we don't put ourselves in a corner that is a heavy lift to transition out of

  • Get people in lab used to following some kind of best practices for discipline
    • Have a steering committee monitoring some of the projects going on (WebEx and the HPC data initiative). Communication between groups is not always visible; the SC can facilitate communication, align projects together, and address the other issues tactically and operationally
    • Put items on the collective group's list to address and focus on
    • Sean Davis: another strategic item: part of the reason we struggle is that we don't have the right bodies to do some of this work. It is worth thinking about what kind of bodies we need. It is relevant to get somebody into NCI who has expertise in these areas, because none of us do (big data management and cloud infrastructure). We can get it by learning, but we are at a place where getting someone in with experience will save us a year's worth of work.
    • CDC contracted with SRA
    • Bioteam working with USDA and a couple of Universities to do this
    • We should find someone potentially to help move us along more quickly
    • Educating people – we have a lot to do there (it will require some investment from CCR and DCEG). Scientists learning some of this is best done within the scientific programs, not within CBIIT, right now
    • Jack Collins: bring in good people and set up a test lab with people who know the hardware and people supporting scientific applications from a system administration point of view at these cutting-edge places
    • It would be nice to have hands-on experience with some of these technologies (even if a vendor comes in and we have 3-4 people, then translate that into our day-to-day practice)
    • Will require some funding for FY17 (Sean Davis working on funding side)
    • Jack Collins to scope what the projected FTEs or budget number would be – around 400-500 direct (ballpark)
    • Need to figure out as a group how to get it to exist within NCI
    • Sean Davis: my idea is similar, but we need to have these kinds of people directly tied to the science (physically sitting next to scientists) – have seen it a lot. Divides currently exist. The way to move this forward is to bring the IT into the scientific programs rather than keeping it something separate that supports the programs.
    • Once we get people primarily focused on science at CBIIT, then the transition can begin
    • There are data engineering problems within the intramural program – we need something agile right now
    • We need both: agile support now, and someone to put together the infrastructure that meets the need in the end. We should keep both programs going, connected at the hip and talking to each other so each knows what's going on.
    • The group building infrastructure should closely monitor what's going on out there
    • Caveat: much of the knowledge about infrastructure is embedded within the scientific programs and not in IT – that will change over time. Cutting-edge technologies are easier to see from the IT side than from the scientific side right now.

 

Next HPC Thought Leaders Meeting – March 17, 2016

March 17, 2016

Attendees: Steve, Nathan, Greg, Xinyu, Dianna, Sean, Carl, Kelly, Eric, George, Omar

Agenda

  • HPC Long Range Plan
  • HPC Needs Analysis
  • Storage and Data Services
  • Other Items

 

  • Carl: several people have started using Slack to communicate, maybe we can try it out to keep dialogue going outside of here (it’s a collaboration tool)

 

  • Eric: HPC long range plan:

 

  • Contribution from many people
  • Finalized version sent to HPC folks. Greg and Dianna are going through the various plans
  • Get input – take the long range plan and make it visible
  • Website is being set up to share information on activities – it makes sense to put the LRP up there as a reference document
  • Dianna: working on customer versions to distribute to SMW – condensed version of LRP for them
  • Making sure LRP is out and not just sitting there – as a reference tool
  • Get input on when to target a refresh on the LRP
  • Carl: want to do it annually to avoid it getting stale
  • Greg: first quarter of the calendar year, to align with the budget process
  • Next January: concerted effort to make updates prior to end of March
  • Collect information and collate and distill
  • Storage and data management needs sit with the main LRP – is this the scope we want to maintain in the future, or take a different approach? Keep this in mind. The compute piece is more ecosystem focused.
  • Stay tuned for the collaboration site
  • HPC Needs Assessment
    • Document sent out to all
    • Assessment of what we need across CBIIT
    • By the end of this month, as it stands – what do we need to change to make sure we have identified needs that aren't captured, and to take things off the list that aren't needs?
    • Feel free to edit accordingly
    • Needs assessment will dovetail into feeding LRP and inform priorities for FY17
    • Needs assessment gives a broader scope
    • Try to collect info about HPC needs across CCR, DCEG, and other places to address (Sean and Dianna)
    • Greg: understanding whether there are compute or storage demands here and in Frederick would help with our planning
    • Sean: it would be helpful to go a step higher and have high-level decision making about what Frederick, CBIIT, and Biowulf are each best used for. If we evaluate each year, this will change drastically from year to year.

Data services and storage environment

  • Greg: Cleversafe update: Cleversafe storage co-investment

We are going through a PM change. Made some good progress; the system is operational, and we are working on setting it up for select users to do testing.

  • Working with Steve Fellini and CBIIT about placing nodes there and at Biowulf – the first set in May and the other set in June (in June, move Cleversafe to Shady Grove – maybe even July)
  • Looking to set up basic training on how to use the system – it's not a standard file system. Those interested can contact Greg, and invites will be sent for the training
  • Action: get some users engaged so that use gets defined around capabilities, and match what is being delivered to a need being fulfilled based on use cases. CCR sequencing is one of these groups. Sean to build on that at a higher level of detail.

Steve and the Biowulf group are in a position to contribute to these conversations. The original implementation plan for Cleversafe in Bethesda is (abeer/aver??) – direct object storage APIs. If that is the case, there is room for teaming up: we are solving the same problems, so we should talk as a group to move forward.

An advantage of an object store system is that it is dispersed geographically, so there is no need to back it up.

Nathan – would like to try it out and thinks this is a direction to jump on for archiving.

Nathan to drop note to Greg and he will set him up with training.

  • Eric: engaging in use cases to focus on what services need to do moving forward
  • Balancing needs – pushing limit of what technology can do vs extreme reliability and stability (needs for both in environment being served)
  • MATCH program and Pediatric MATCH program discussion – they will have data needs that grow substantially (like clinical applications). MATCH has similar requirements in terms of providing long-term data assurance for infrequently used large files (a potential opportunity in use cases)
  • This is a situation where data would likely be held for 5-7 years, but as studies close, data will likely move into the GDC. Make sure we don't create a data resource that is inaccessible to other resources beyond NCI – design with cooperation in mind, without huge technical barriers.
  • Greg: next steering committee meeting? March 31st

Other topics of potential interest

-          Brainstorming – envision what the future is (Bob Coyne) – if interested in such a session, let us know. Think about looking ahead.

-          HPC support efforts (George doing a lot)

  • Globus Connect – fault tolerance for large data transfers
  • Git – different ways it is readily usable, and the challenges with it
  • George has some projects he is working on: a request from Dauood – an RNA-seq comparison taking 5 days per sample, and months to finish one analysis. It was running on Biowulf. The original request was for space (the requirement is 5 terabytes), and he was asking to get space from one of the DOD labs. Must make sure the app is running efficiently.
    • The application was developed at Harvard and in California (Python). Some of the parameters are constant. Fixing and modifying it to make it run efficiently on Biowulf brought it from 5 days down to 1 day per run; scaling would do the same thing. This is just one Biowulf node – there is a lot of parallelism in the application, and one node has 32 threads. The challenge is that going further would require major refactoring of the application (not standard), and if we go down this path we can't support every instance of this application in terms of updates.

Not a scalable way of spending time.

Name of application is MATS.

MATS NIH – also checked this version today. Currently testing it to see if it fulfills Dauoud's requirement to run things faster.

Strategy (Eric): if it's an inefficient application, can we make it more efficient?

George: its an app written by a post doc or grad student – not requiring much work to adjust

The way it runs, it always asks for 16 threads but only makes use of 4 or 8 – even when more resources are requested from Biowulf, they are allocated for no reason (a sketch of sizing workers to the actual allocation follows).
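A minimal sketch of the general fix George is describing: size the worker pool from what the scheduler actually granted instead of hard-coding a thread count. It assumes a Slurm-style batch environment and uses a hypothetical per-sample function as a stand-in; it is an illustration, not the MATS code itself.

    import os
    from multiprocessing import Pool

    def process_sample(sample):
        # Hypothetical stand-in for one unit of the analysis.
        return f"processed {sample}"

    if __name__ == "__main__":
        # Use however many CPUs the scheduler actually granted (Slurm exports
        # SLURM_CPUS_PER_TASK); fall back to the local CPU count otherwise.
        n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))

        samples = [f"sample_{i}" for i in range(32)]   # placeholder inputs
        with Pool(processes=n_workers) as pool:
            for result in pool.imap_unordered(process_sample, samples):
                print(result)

This way the job never asks for more threads than it can use, and the cores Biowulf allocates are the cores that actually get exercised.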

-          Home-grown and other applications are being run (MATLAB, etc.)

-          OpenACC with the PGI compiler (now available on Biowulf). If we have more home-grown applications, we should have a workshop so people can learn how to optimize their applications in an easy way and make good use of our resources.

-          Hackathon at the University of Delaware – George is going. Lots of GPUs and interest – maybe get a GPU hackathon at NIH.

-          Eric: other things going on in HPC – support for education and finding avenues where it can be effective

Helping connect those who might have an application with the fact that HPC can make their research more effective or rapid.

Computing and predictive oncology meeting in July 2016

Initiated out of the DOE collaboration.

Targeted in downtown DC

Working on logistics to confirm venue so we have meeting and define it

It’s a short time frame

One of the reasons to have it in this area is to maximize the number of people from NCI who can participate without travel.

Around 100 invitees

Pull together where computation is, where it's going, and its impact on predictive oncology.

As more details come up, we will get them out.

  • Was hoping Warren would join us by now for an update on the NCI-DOE collaboration
    • It's definitely moving ahead. Project teams are working together and coordinating logistics across the pilots
    • Three pilot areas as mentioned before
    • Molecular biology focus, clinical, population scales
    • The collaboration takes these representative areas of predictive oncology and works with DOE to frame what the problems to be solved in this space are, so as to help design the computers of the future
    • Bilateral, mutual benefit in that context
    • Things will be framed up to include in the upcoming Frederick national lab advisory meeting
    • Details of the pilots – article in GenomeWeb, with an interview from Livermore
    • What are the logistics for each of the pilots and the cross-pilot effort? Being framed and formed at this time (not a lot to say right now). Typical things about starting a project: funding, teams, support, sharing data, etc.
    • One common thing across collaborations – all three are excited to work with each other
    • Eric – listened to testimony in front of the committee overseeing the NIH budget (National Institute on Aging, drug abuse). Covered various things NIH is focusing on and pushing for in terms of investments; single-cell sequencing was one thing pushed for. Dr. Lowy talked about NCI working with DOE and the evident benefit of DOE's large computational capability and expertise complementing the knowledge and data NCI has in cancer work.
    • Questions about the moonshot and how this applies
    • One takeaway from the testimony – there is essentially gender bias in a lot of studies, which are heavily male oriented. Study design needs to take gender into account moving forward.

Sean: Data management and converged IT – potential conference /Summit

An email thread is going out – the idea is that we have a lot of data and storage needs. There are two pieces to the puzzle of making it work: first, the storage infrastructure and the network infrastructure connecting to it, and second, the metadata. Think about this, and if anyone is interested, follow along in the email conversation. There is interest in some kind of conference or summit – invite a few extramural people to tell us how they would have done things and talk about data management strategy at a higher level.

Reach out to Sean with interest.

Creative ideas for funding? CIT might help, Warren's office, NCBI.

Commercial sponsorship? Maybe open to that. It's possible, but they are not allowed to ask for it. If a commercial sponsor came offering to do something, we could do it, but we can't ask for it; it needs to come from someplace else. Maybe Intel is interested in something like that?

Funds are needed to travel extramural folks in and for physical space. Can use local talent, but maybe 4-5 external folks or potentially someone from Bioteam.

Maximum attendance expected could be around 50-75 people; or take a different approach and have a set of speakers in a more open format on the first day, then a more focused group on the second day. Could do it either way. A smaller-scale event could use Shady Grove (Carl) – the downstairs conference room at Shady Grove.

Timing is right for this now.

Good time to get everybody together

Conversations continue on slack

Next Meeting – April 21

April 21, 2016

-          Data call: they are specifically looking for HPC cost in server and mainframe category

-          Capital acquisitions

-          Pull from DCEG in terms of what they are anticipating (Nathan)/ or give a view and we can reconcile it with DCEG

-          Eric to give presentation

  • Data Call: request for info on HPC – clarification from Karen
  • Cost for mainframe and server projections for HPC investments – Sean sharing amount he gets each year to make decisions on
  • We don’t want to double count things. Biowulf and CIT should be taken into consideration
  • Nathan waiting on numbers from DCEG
  • Sense of total expenditure from DCEG for last year – we don’t have that because it’s a new category coming in. No data on HPC

-          We don’t have a real baseline for new categorization

-          Include server and mainframe investment cost to support personnel etc.

-          Don’t know profile of what DCEG is but we’re doing HPC support for them. Don’t know size of their plant beyond a few stations in their lab

-          Asked for Bioinformatics numbers by Melinda Hicks

-          There is not a clear line between doing bioinformatics and the infrastructure necessary to make that happen – we need to draw a line between the two so as not to count dollars that support science

-          It is important to include the cost of infrastructure and the people who maintain it – this becomes an IT cost; but people taking codes, rewriting them, and running them on HPC is scientific computing, and we don't want to include that

-          We want to be careful and consistent on how we report on bioinformatics activities

-          Eric: trying to break it out into categories with Jeff and Tony – the total bioinformatics investment is fine, but it is helpful to define the physical plant and what it takes to operate it

-          In terms of physical costs (power and cooling), we have no idea what those numbers are (not part of the report). But how much is spent on storage probably is something we want to report

-          Server procurements, storage procurements, etc.

-          HPC costs in cloud getting pulled from Tony – cloud HPC resources

-          Having info available and working with Karen to best sum it up

-          The infrastructure that makes bioinformatics possible is a good first line to capture; knowing the total bioinformatics cost is also good, but not from a FITARA stance (we don't want to report that)

-          Presentation

  • HPC support for FY17
  • Exploratory Computing – foreshadowed in LRP
  • Data services efforts

Give relative priority to these efforts and identify stakeholders

What amount of resources might be appropriate and when for these resources

What makes sense for FY17? What might we defer for later?

HPC support for FY17

-          Eric goes through slide

-          Prioritizing activities for next year to know how to allocate resources

-          First three are program development

  • Getting NCI involved within HPC community and looking at frontiers of exascale computing – help shape what happens in that area and have interests represented and not just responding to what happens
  • What kind of science is being supported through this? Give some examples.
  • E.g., the DOE pilots – what at NCI needs the same kind of infrastructure?
  • Storage and the kind of compute nodes are more important than the number of compute nodes for Biowulf – laying out the drivers is important
  • Exploring how we make HPC more accessible to the scientists – lessons learned from NSF
  • More tactical: evaluating HPC in the cloud – resourcing, how to do it within context of NCI and investigators
  • NCIP cloud computing HPC integration – how would we actually extend the backend computing support for things in the NCIP cloud
  • Integrating cloud with GDC (probably not existing cloud pilots that would be here) – GDC is the long term investment. Move more towards that and attach HPC to it.
  • Two more components to flesh out: first, how to build the same kind of infrastructure that integrates imaging, and what uses there are for HPC in that (mining and indexing) – and you may not be able to use a single approach when you move from pathology to MRI to ultrasound, since feature extraction looks different. Second, pulling in more data from health records – what that looks like and what kinds of HPC are relevant for health record data
  • GDC – most is on genomic data
  • Other types of data are moving into the GDC (maybe not directly) but will move into a GDC-like environment. The GDC may not be able to scale like that
  • Eric: a lot of these are going to be intramurally focused
  • Using accelerators to move things through sequencers faster – can you drop a card into a sequencer and improve its QC throughput (so data doesn't have to move to a general computing platform)?
  • How to use the cloud as a deep archive, off premises, for data that is not in active use

-          Cloud storage pilot – what it would take to use the cloud for a deep data archive (a rough sketch of such a workflow follows below)
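
As a rough illustration of the deep-archive idea, the sketch below assumes an AWS-style S3 interface accessed through boto3; the bucket, key, and file names are hypothetical placeholders, not an existing NCI resource.

    # Minimal sketch of a cloud deep-archive workflow (assumed AWS-style S3 via boto3).
    # The bucket, key, and file names below are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    # Push a finished dataset straight into an archival storage class so it is
    # billed as cold storage rather than active storage.
    s3.upload_file(
        "run_2016_03.tar.gz",
        "nci-deep-archive",                      # hypothetical bucket
        "sequencing/run_2016_03.tar.gz",
        ExtraArgs={"StorageClass": "GLACIER"},
    )

    # Recalling archived data is an explicit, slow restore request - the main
    # operational trade-off a pilot would need to evaluate.
    s3.restore_object(
        Bucket="nci-deep-archive",
        Key="sequencing/run_2016_03.tar.gz",
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )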

-          Leveraging investment in data service environment

-          What happens when you have all the metadata and feature extraction – invest in at least the capacity to do high-performance analytics and make systems with that capability accessible to the researchers we are working with (supporting work in that space)

-          Cloud container environment – containers operated internally and externally, and portability between them (see the sketch below)
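
To make the portability point concrete, here is a minimal sketch using the Docker SDK for Python; the image name, command, volume path, and remote host are hypothetical, and running against a remote engine over SSH assumes a recent SDK with SSH support installed.

    # Minimal sketch of container portability with the Docker SDK for Python.
    # The image, command, volume path, and remote host are hypothetical.
    import docker

    IMAGE = "nci-hpc/variant-caller:1.0"
    CMD = "call-variants --input /data/sample.bam"

    # Run against the local daemon (internal environment).
    local = docker.from_env()
    local.containers.run(
        IMAGE, CMD,
        volumes={"/scratch": {"bind": "/data", "mode": "rw"}},
    )

    # Run the identical image on a remote engine (external/cloud environment);
    # only the connection string changes, not the container itself.
    remote = docker.DockerClient(base_url="ssh://user@cloud-host.example.org")
    remote.containers.run(IMAGE, CMD)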

-          Micro grants – consider as a way to bring external resources into the effort on small projects and supporting intramural investigators in that way

-          Prioritizing these: not much time right now / add other ideas / weigh in on what priorities are

-          This is about the maturity of being able to do it (having partners ready to do it) rather than priority setting. Are we ready to do it? We have plans in place to do these things; we are identifying a future timeline of when each would be ready to start.

-          Not a matter of ranking them in order

-          This needs assessment was built off 2015 and various meetings we’ve had

-          Developing involvement within community to help NCI participate in consortia

-          Response to NCI DOE pilots and looking at NCI exascale cancer working group – pull people together more broadly – what are the applications of exascale? Extend beyond that group and get broader input, looking long term and being frank about level of computing investment we need to make. Who are partners for that? What is need? What is demand?

-          Training and outreach – supporting education and development of awareness about what computing and data science can do for cancer research

-          Need to do more HPC training, develop more applications, help those who have large data service needs, and extend support for using the cloud more effectively

-          Need people to help investigators use cloud more effectively (2-3 individuals to cover that space)

-          How do we take the ServiceNow implementation that provides request support for HPC and develop it further so it's a better interaction for the individuals who need that support?

-          Bioinformatics core interface – did users find it reasonably effective? Yes

-          Investigators have an idea and communication was done well, so was coordination

-          The last two look at making sure we have project management support in the HPC space as we get more requests (becoming more project oriented as opposed to task oriented)

-          A technical project manager (TPM) would have more ability to do technical support and would know more about the problem space than a PM. There needs to be depth of awareness there (it is negotiable) – potentially someone with a lot of HPC knowledge who can translate for the technical team

-          For FY17 – take what we've looked at to develop a basic, enterprise-level service API and build it out to have stronger and deeper services. Develop a façade on top of the various object store technologies we might use. The rationale is to provide flexibility so that we don't become vendor locked. As the underlying stores become more capable and standardized, the façade will get narrower and narrower and potentially disappear. A minimal sketch of such a façade follows.
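
The sketch below is one way such a façade could look, assuming Python and boto3; the class names, bucket handling, and endpoint parameter are illustrative assumptions, not an existing CBIIT API.

    # Illustrative vendor-neutral object-store facade. The interface and
    # backends shown here are assumptions, not an existing CBIIT service.
    import abc
    import shutil
    from pathlib import Path
    from typing import Optional

    import boto3


    class ObjectStore(abc.ABC):
        """Narrow facade: callers see put/get, never the vendor SDK."""

        @abc.abstractmethod
        def put(self, local_path: str, key: str) -> None: ...

        @abc.abstractmethod
        def get(self, key: str, local_path: str) -> None: ...


    class S3CompatibleStore(ObjectStore):
        """Backend for any S3-compatible store, on premises or public cloud."""

        def __init__(self, bucket: str, endpoint_url: Optional[str] = None):
            self.bucket = bucket
            self.client = boto3.client("s3", endpoint_url=endpoint_url)

        def put(self, local_path: str, key: str) -> None:
            self.client.upload_file(local_path, self.bucket, key)

        def get(self, key: str, local_path: str) -> None:
            self.client.download_file(self.bucket, key, local_path)


    class PosixStore(ObjectStore):
        """Backend for an ordinary mounted file system."""

        def __init__(self, root: str):
            self.root = Path(root)

        def put(self, local_path: str, key: str) -> None:
            dest = self.root / key
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(local_path, dest)

        def get(self, key: str, local_path: str) -> None:
            shutil.copyfile(self.root / key, local_path)

Swapping vendors then means adding another backend class behind the same put/get interface; as stores converge on standard APIs, the façade layer can shrink or disappear, as noted above.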

-          It is helpful to lay out some of the initial projects and right-size the whole activity so we don't get carried away building things without having an accurate picture

-          Useful to call out number of FTEs required to build things out from a budgetary perspective

-          Supporting data service environment

-          Dedicated system administrative support for it

-          How to look out to extend storage to different types of storage places like cloud, etc.

-          Next steps: determine how best to get input and refine this within the CBIIT budget process to prioritize. Give opinions on what we should do, what we shouldn't do or should defer, and things not on the list to think about.

July 21, 2016

Tentative Agenda

-          New faces and introductions

-          Needs and Updates Around NCI and CIT

-          Frontiers of Predictive Oncology and Computing Meeting Updates

-          Review FY17 Candidate Projects

-          We were on hiatus for a little while

-          Important to have these meetings more regularly and keep each other updated and aware of what is going on

-          We have new faces and important to share updates

-          Logistics updates and coordination support

-          Suggestions on other priorities to pursue

 

Introductions

-          Anastasia

  • Important we define what we mean by HPC and big data and what we are aiming for
  • Lots of important things going on

-          Miles Kimbrough

-          Nathan Cole

-          Carl McCabe

-          George Zaki

-          Warren Kibbe

-          Greg Warth (Phone)

 

Needs and Updates

-          CCR: Sean is out on vacation, so not much insight, but one item is looking for ways to retain files for longer than one year (Xinyu)

  • This is prime for where Helix is going with Cleversafe
  • Email went out looking for beta testers
  • NCI/NIH data archive policy? Doesn’t exist

-          Storage for new instruments

  • Need a life-cycle management system for file retention (a rough policy sketch follows below)
  • Not all data is equally valuable after one year
  • Needs: to have a data management instrument in place; storage needs are secondary to the acquisition of the instrument
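
As a rough sketch of what such a retention/life-cycle policy could look like, assuming an S3-compatible object store managed through boto3 (the bucket name, prefixes, and time windows are hypothetical):

    # Illustrative retention policy for instrument data on an S3-compatible store.
    # Bucket name, prefixes, and time windows are hypothetical assumptions.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="nci-instrument-data",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-raw-after-one-year",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    # After a year, raw instrument output moves to a cheaper
                    # archive tier instead of staying on primary storage.
                    "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
                },
                {
                    "ID": "expire-intermediate-files",
                    "Filter": {"Prefix": "scratch/"},
                    "Status": "Enabled",
                    # Intermediate/scratch output is deleted outright, since
                    # not all data stays equally valuable.
                    "Expiration": {"Days": 90},
                },
            ]
        },
    )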

-          CIT – Nobody on phone to give details

  • Object storage – looking for beta testers (Cleversafe)
    • A place to put data that is assured over the long term but not necessarily recalled frequently
    • For it to work well, it is better suited to static data than to data that changes a lot
    • Cleversafe reduces the need for backup infrastructure
    • Currently have 1.2 petabytes for NCI and around 2 petabytes for CIT
    • The initial plan was to use an Avere front end for Cleversafe, but provisioning couldn't be done on it. There is some speculation over what they will do now
    • Cleversafe is coming out with its own NFS presentation
    • Storage back end for the GDC – we want them to be able to scale considerably. Cleversafe is too expensive for this kind of storage. SEF is not ready and is having trouble going past a certain amount of storage.
    • Get Bob's technical team to give a presentation
    • Data retention policy: at some point, if we store data and they want to access it beyond a certain point, they should pay for it (we need to discuss this and plan it)
  • Biowulf

-          DSITP

  • Cleversafe object storage (Greg)
    • Storage has been brought up here and deployed: one node at CIT, one at Shady Grove, and one at Fort Detrick. It goes live (with a guarantee that data will be available and not destroyed) August 1, pending the results of final testing. It is hooked into the regular part of the network, with a segment set up for us.
    • There is latency but not a large decrease in speed
    • It is on the 40 Gb backbone, with 10 Gb available.
    • 2 petabytes of usable storage
    • If Cleversafe does deliver the NFS mounts to implement in the first quarter of calendar year 2017, we will work with cannonball to do backups and move to synthetic fulls instead of weekly full backups.
    • Usually around a 60% usable-to-raw ratio, allowing 2 failures (the default); see the back-of-the-envelope sketch below
    • Bob Grossman is very conservative (we couldn't guarantee him any kind of backup)
    • We could get the cost down to something more reasonable than having him be that conservative about how Cleversafe is set up. Speak with Bob (Greg and Eric to get a meeting with Bob) – he has 4.7 petabytes and 5.5 of available space – or at least get to his technical team, not necessarily Bob himself. Send a note to Allison Heath (aheath@uchicago.edu)
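
As a back-of-the-envelope check on the 60% figure: with erasure coding (information dispersal), the usable fraction is roughly threshold/width and the tolerated failures are width minus threshold. The width and threshold below are assumed for illustration only; the actual Cleversafe settings were not captured in these notes.

    # Rough usable-capacity check for an erasure-coded object store.
    # The width/threshold values are assumptions chosen to match the ~60%
    # usable ratio and 2-failure tolerance noted above, not confirmed settings.
    width, threshold = 5, 3              # slices written vs. slices needed to read
    usable_fraction = threshold / width  # fraction of raw capacity that is usable
    failures_tolerated = width - threshold

    print(f"usable capacity: {usable_fraction:.0%}")          # 60%
    print(f"slice failures tolerated: {failures_tolerated}")  # 2

    raw_pb = 3.3                         # hypothetical raw capacity in petabytes
    print(f"~{raw_pb * usable_fraction:.1f} PB usable from {raw_pb} PB raw")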

 

  • SC16 Computational Approaches for Cancer Workshop
    • There is a workshop in November – we’ve put a call for white papers for that.

 

-          CBIIT

  • Archive API – works on Cleversafe and other pieces, and is ready to talk to S3 in general, the cloud, etc. (a brief sketch of that kind of call follows)
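
The point of an S3-compatible interface is that only the endpoint changes between an on-premises store and the public cloud. A minimal sketch, assuming boto3; the endpoint URL, bucket, and key are hypothetical placeholders, not the actual Archive API.

    # Sketch of pointing one S3 client library at different backends.
    # The endpoint URL, bucket, and key are hypothetical placeholders.
    import boto3

    # On-premises S3-compatible object store (e.g., a Cleversafe-style endpoint).
    onprem = boto3.client("s3", endpoint_url="https://objectstore.example.nih.gov")

    # Public cloud: the same client library with its default endpoint.
    cloud = boto3.client("s3")

    for client in (onprem, cloud):
        client.upload_file("report.pdf", "archive-bucket", "2016/report.pdf")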

 

-          DCEG- (Nathan)

  • In the middle of installing another petabyte of storage. It is somewhat of a tech refresh of the oldest original 108NL nodes; density on them is drastically poor compared to anything modern, and they go out of support next year. Sticking with Isilon as the vendor. The 2019 ACT move is supposed to happen in this timeframe.
  • We need to make sure one of the considerations is a data line (Carl)
  • If we are that close, as long as we can get direct lines between the new building and Shady Grove, we can co-direct equipment
  • HD400s as the Isilon replacements. We now have a mixture of older and L nodes. X410s and HD400s for pure capacity.

 

-          Other DOCs – no other updates

 

Logistics updates

-          Communications plan being put in place by Miles

-          Ramp up in August and run in September

-          Open collaborate page to all members of NCI

-          The yellow task that Eric is on ends in September – a plan for how to continue this is being worked on, along with a summary report on what we were able to do in the first two years

-          Overall perspective – Braulio's yellow task is where we move programmatic support to

 

Frontiers of predictive oncology meeting

-          Well attended – nearly 100 individuals each day

-          One room, enthusiasm, good networking time

-          Limited range to roam encouraged people to have discussions

-          Good insight shared in breakout sessions

-          Planning a white paper by end of August to pool all input

-          A survey is in development – Intel was asking how the meeting went (it is being iterated on now) – keep the Paperwork Reduction Act in mind and get Intel to do this

-          Planning next meeting – get information out earlier and better

-          Blog post – makes sense to do with DOE

 

FY17 Candidate Efforts – HPC and Exploratory

-          Data Services Environment

  • Archive and metadata services
  • Explore integration with GDC
  • Transparency on storage utilization
  • Expanded storage on transfer/intermediate Globus services
  • (add managing data retention policy and life cycle)

 

-          HPC Support Core

  • Deepen level of support and education for applications of HPC (connecting the HPC with the science and making improvements there)
  • Front-end development for HPC backend
  • Continue support for compute and data intensive applications engineering and optimization
  • Extend level of HPC resources available (cloud, elsewhere)
  • Useful to have future visioning to see how we will look in one year

 

-          Cloud Resources

  • Dev and compute in cloud, data storage, archive, development, etc.
    • Should talk to NCBI (they are making a push to move all their services into a cloud environment)

 

-          Predictive Models Explorations and Assessment

  • What are implications from computational, data and science perspective
  • There is a big misunderstanding inside NCI about what a predictive model is

 

* Efforts are not distinct; they need to be coordinated and aligned overall

* Describe more about what the purpose of HPC is and less about the infrastructure and the “means to an end”. This is part of future visioning, underpinning of “why” we are doing this and what the purpose and impact is.

 
