Operations and Deployment

Latest revision as of 10:04, 1 February 2008

Deployment

  • Bring current deployment up-to-date (estimate complete by mid-Feb)
    • Build new image for 2GB internal USB memory
      • may be done by All Hands
    • Design and implement a depot recovery process
      • This process will be vetted on current deployed hardware
      • Initial process may be done by All Hands or soon thereafter
      • Begin with SFASU for initial vetting
    • Send recovery keys out to sites and update the depots
      • early February
      • need to recruit one person at each site to assist
    • Set Nagios back up
      • need a more stable system before this makes sense
      • turned off during this transition period to avoid flood of diagnostics
      • needs a day or two of time, could be done now depending on priority
  • Prepare additional existing hardware for deployment (19 nodes)
    • Update image on internal USB (will use for testing of the above recovery process)
    • Send 6 depots to SFASU with additional PDU
    • Find new collaborators/sites for remaining 13.
      • Ideas?
  • Develop MOU for current deployment (timescale uncertain?)
    • Longer term project
  • Define a standard set of software tools for depots
    • The following already exist
      • Iperf
      • Nagios
      • mtr
    • other tools to be added?
      • investigating new tools now
  • Gain experience with existing deployment
    • See "Validation Framework" below.
    • make frequent reports in weekly REDDnet meetings
    • review deployment experience end of March
  • Proposed multi-tiered system for sites for discussion
    • Tier 1: Sites that run their own LServer and Chord ring
    • Tier 2: Sites that manage their own REDDnet depots
    • Tier 3: Sites that use their own storage resources as depots
    • Tier ?: Sites that supply rack space and basic infrastructure, with management by ACCRE
    • Develop MOU for each tier - we need to be able to supply reliable storage.
  • Needs
    • What resources
    • What policies
    • What management
  • Investigate new management tools over the next two months
    • rsync or similar (short term)
    • Perceus (long term)
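
As an illustration of the short-term rsync option above, here is a minimal sketch of how a standard tool directory might be pushed to each depot. The directory path, the host name, and the function names are hypothetical and not part of the REDDnet plan; the only rsync options used are the standard `-a`, `-z`, `--delete`, and `--dry-run`. The sketch only builds the commands and defaults to a dry run, so nothing is changed on a depot until someone deliberately runs it.

```python
import shlex


def build_rsync_command(source_dir, depot_host, dest_dir, dry_run=True):
    """Return the rsync argument list for pushing a tool directory to one depot.

    All names here are illustrative assumptions, not REDDnet conventions.
    """
    cmd = ["rsync", "-az", "--delete"]
    if dry_run:
        # Report what would change without modifying the depot.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory's contents,
    # not the directory itself.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(f"{depot_host}:{dest_dir}")
    return cmd


def plan_updates(source_dir, depot_hosts, dest_dir, dry_run=True):
    """Map each depot host to the rsync command that would update it."""
    return {
        host: build_rsync_command(source_dir, host, dest_dir, dry_run)
        for host in depot_hosts
    }


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    plan = plan_updates("/opt/depot-tools", ["depot1.example.org"], "/opt/depot-tools")
    for host, cmd in plan.items():
        print(host, "->", shlex.join(cmd))
```

Each command list could be handed to `subprocess.run` once the dry-run output has been reviewed. A longer-term tool such as Perceus would replace this per-host push entirely.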

Monitoring

  • Use StorCore, Nagios, iperf, and visualization tools from SC07
    • Have a statistics page that gathers results from the tests and presents them cleanly
      • Long term project - 6 months to a year?
  • What is required to provide adequate support for REDDnet
    • Want feedback on this.
    • Setting expectations?
    • Actively talk to users over the next couple of months - 6 months - ongoing
    • Develop an initial plan by mid-Feb, then review every 3 months and tweak
  • Create a REDDnet status site, using Google Maps
    • short term: just get green dots on a map (what is the priority for this?)
    • longer term - expected to evolve, integrate with the vis site
  • Create an RT site to resolve users' issues
    • needs to happen quickly. mid-Feb.
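
The short-term "green dots on a map" goal can be sketched as a small function that turns per-depot check results into map markers. This is only a sketch under stated assumptions: the record layout (`name`, `lat`, `lon`, `checks`), the check names, and the three-colour scheme are hypothetical, and feeding the markers to a map page is left out.

```python
def marker_color(checks):
    """Pick a marker colour from a dict of check name -> passed (bool).

    Green only if every check passed, red if any failed,
    grey if no results have been collected yet.
    """
    if not checks:
        return "grey"
    return "green" if all(checks.values()) else "red"


def status_markers(depots):
    """Convert depot records into marker dicts suitable for a map page.

    Each depot record is assumed to carry 'name', 'lat', 'lon', and an
    optional 'checks' dict; this layout is an illustration only.
    """
    markers = []
    for depot in depots:
        markers.append(
            {
                "name": depot["name"],
                "lat": depot["lat"],
                "lon": depot["lon"],
                "color": marker_color(depot.get("checks", {})),
            }
        )
    return markers


if __name__ == "__main__":
    # Placeholder depots with placeholder coordinates.
    example = [
        {"name": "depot-a", "lat": 0.0, "lon": 0.0, "checks": {"ping": True, "ibp": True}},
        {"name": "depot-b", "lat": 1.0, "lon": 1.0, "checks": {"ping": True, "ibp": False}},
    ]
    for marker in status_markers(example):
        print(marker["name"], marker["color"])
```

The check results themselves could come from whichever monitoring tool is adopted; keeping the colour logic separate from data collection makes it easier to fold into the visualization site later.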

Validation Framework

  • Stress and WAN testing on Production REDDnet
    • Automated testing with Clyde
      • exercise system prior to heavy real world use
      • unfortunately longer term - first of April?
      • this testing will move to test deployment eventually
    • Real world use (happening now, although not heavy)
  • QA testing on Test REDDnet required before moving into production REDDnet
    • A stringent set of tests to exercise the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
    • Allow users to test using this system
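
The requirement that QA testing on Test REDDnet must pass before anything moves into production can be expressed as a simple gate. The sketch below is an assumption about how such a gate might look, not a description of Clyde: the component keys mirror the four layers named above, and the result format (component name mapped to pass/fail) is hypothetical.

```python
# The four layers named in the QA requirement; the key spelling is an assumption.
REQUIRED_COMPONENTS = ("hardware", "os", "ibp", "lstore")


def missing_or_failed(results):
    """Return the required components that failed or have no result at all."""
    return [name for name in REQUIRED_COMPONENTS if not results.get(name, False)]


def ready_for_production(results):
    """A release is promotable only if every required component passed."""
    return not missing_or_failed(results)


if __name__ == "__main__":
    # Illustrative result set: LStore has not been tested yet.
    results = {"hardware": True, "os": True, "ibp": True}
    if ready_for_production(results):
        print("OK to promote to production REDDnet")
    else:
        print("Blocked by:", ", ".join(missing_or_failed(results)))
```

Treating a missing result the same as a failure keeps an untested component from slipping into production by omission.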