Operations and Deployment
Latest revision as of 10:04, 1 February 2008
Deployment
- Bring current deployment up-to-date (estimate complete by mid-Feb)
  - Build new image for 2GB internal USB memory
    - may be done by All-Hands
  - Design and implement a depot recovery process
    - This process will be vetted on current deployed hardware
    - Initial process may be done by All-Hands or soon thereafter
    - Begin with SFASU for initial vetting
  - Send recovery keys out to sites and update the depots
    - early February
    - need to recruit one person at each site to assist
  - Set Nagios back up
    - need a more stable system before this makes sense
    - turned off during this transition period to avoid a flood of diagnostics
    - needs a day or two of time, could be done now depending on priority
- Prepare additional existing hardware for deployment (19 nodes)
  - Update image on internal USB (will use for testing of the above recovery process)
  - Send 6 depots to SFASU with additional PDU
  - Find new collaborators/sites for remaining 13.
    - Ideas?
- Develop MOU for current deployment (timescale uncertain?)
  - Longer term project
- Define a standard set of software tools for depots
  - The following also exist
    - iperf
    - Nagios
    - mtr
  - other tools to be added?
    - investigating new tools now
- Gain experience with existing deployment
  - See "Validation Framework" below.
  - make frequent reports in weekly REDDnet meetings
  - review deployment experience end of March
- Proposed multi-tiered system for sites for discussion
  - Tier 1: Sites that run their own LServer and Chord ring
  - Tier 2: Sites that manage their own REDDnet depots
  - Tier 3: Sites that use their own storage resources as depots
  - Tier ?: Sites that supply rack space and basic infrastructure, but management is by ACCRE
  - Develop MOU for each tier - we need to be able to supply reliable storage.
- Needs
  - What resources
  - What policies
  - What management
- Investigate new management tools over the next two months
  - rsync or similar (short term)
  - Perceus (long term)
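
As a rough illustration of the short-term "rsync or similar" option above, the sketch below builds an rsync command line for pushing a shared configuration tree to each depot. The host names and directory paths are hypothetical placeholders, not actual REDDnet values, and the helper functions are invented for this example. It defaults to a dry run and does not execute anything unless explicitly asked to.

```python
import subprocess


def rsync_command(source_dir, host, dest_dir, dry_run=True):
    """Build the argv list for syncing source_dir to host:dest_dir.

    -a preserves permissions and timestamps, -z compresses in transit,
    --delete removes files on the depot that no longer exist in the source.
    """
    cmd = ["rsync", "-az", "--delete"]
    if dry_run:
        # Report what would change without modifying the depot.
        cmd.append("--dry-run")
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{host}:{dest_dir}"]
    return cmd


def push_to_depots(source_dir, hosts, dest_dir, dry_run=True, execute=False):
    """Return the command per host, or its exit code when execute=True."""
    results = {}
    for host in hosts:
        cmd = rsync_command(source_dir, host, dest_dir, dry_run)
        results[host] = subprocess.run(cmd).returncode if execute else cmd
    return results


if __name__ == "__main__":
    # Hypothetical depot names, used only for illustration.
    plan = push_to_depots("/srv/depot-config", ["depot1.example.org"], "/etc/depot")
    for host, cmd in plan.items():
        print(host, " ".join(cmd))
```

Reviewing the dry-run output before running with `dry_run=False` and `execute=True` keeps a bad source tree from being pushed to every depot at once.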
Monitoring
- Use StorCore, Nagios, iperf, and visualization tools from SC07
  - Have a statistics page that gathers information from tests and presents it cleanly
    - Long term project - 6 months to a year?
- What is required to provide adequate support for REDDnet
  - Want feedback on this.
  - Setting expectations?
  - Actively talk to users over the next couple of months - 6 months - ongoing
  - Develop an initial plan by mid-Feb, then review every 3 months and tweak
- Create a REDDnet status site, using Google Maps
  - short term just get green dots on a map (what is the priority for this?)
  - longer term - expected to evolve, integrate with the vis site
- Create an RT site to resolve users' issues
  - needs to happen quickly. mid-Feb.
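
The short-term "green dots on a map" idea above could start from something as small as the following sketch, which turns per-depot check results into marker records that a map page could load as JSON. The site names, coordinates, and check names are made up for illustration, and the record layout is an assumption rather than an existing REDDnet or Nagios format.

```python
import json

# Hypothetical depot records; names, coordinates, and check results are invented.
DEPOTS = [
    {"site": "site-a", "lat": 10.0, "lon": 20.0, "checks": {"ping": True, "ibp": True}},
    {"site": "site-b", "lat": 11.0, "lon": 21.0, "checks": {"ping": True, "ibp": False}},
]


def marker_color(checks):
    """Green only if at least one check ran and all passed; red otherwise."""
    return "green" if checks and all(checks.values()) else "red"


def build_markers(depots):
    """Reduce each depot record to the fields a map marker needs."""
    return [
        {"site": d["site"], "lat": d["lat"], "lon": d["lon"],
         "color": marker_color(d["checks"])}
        for d in depots
    ]


if __name__ == "__main__":
    print(json.dumps(build_markers(DEPOTS), indent=2))
```

Treating a depot with no check data as red is a deliberate choice here: a site that has stopped reporting should not appear healthy.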
Validation Framework
- Stress and WAN testing on Production REDDnet
  - Automated testing with Clyde
    - exercise system prior to heavy real world use
    - unfortunately longer term - first of April?
    - this testing will move to test deployment eventually
  - Real world use (happening now, although not heavy)
- QA testing on Test REDDnet required before moving into production REDDnet
  - A stringent set of tests to exercise the hardware, OS, IBP, and LStore as thoroughly as possible (primarily Clyde)
  - Allow users to test using this system
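
One way to picture the QA requirement above is as a simple promotion gate: a change moves from Test REDDnet to production only when every required component has a passing result. The sketch below is a hypothetical illustration of that rule. The component keys and the result format are assumptions and do not reflect how Clyde actually reports results.

```python
# Components named in the QA item above; the lowercase keys are an assumed convention.
REQUIRED_COMPONENTS = ("hardware", "os", "ibp", "lstore")


def ready_for_production(results):
    """Return (ok, missing, failed) for a dict mapping component -> pass/fail.

    A component with no result counts as missing, so an incomplete
    test run can never be promoted by accident.
    """
    missing = [c for c in REQUIRED_COMPONENTS if c not in results]
    failed = [c for c in REQUIRED_COMPONENTS if results.get(c) is False]
    ok = not missing and not failed
    return ok, missing, failed


if __name__ == "__main__":
    ok, missing, failed = ready_for_production(
        {"hardware": True, "os": True, "ibp": False}
    )
    print("promote" if ok else f"hold: missing={missing} failed={failed}")
```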