Friday, January 4th 2013 11:25:01 PM PST
The Triton Resource provides easily accessible, affordable, high-performance and data-intensive compute resources to UCSD researchers, faculty, affiliates, government and commercial partners through innovative, locally supported, scalable hardware and software over multiple 10-gigabit networks extending from campus laboratories to the UC network, California, and the US.
Posted 1:30 p.m. Friday, June 14, 2013: First of Many Planned Get-togethers Scheduled for 1-2 pm June 26, 2013
Now that the transfer of users from Triton to the TSCC is well on its way to completion, we are ready to begin holding regular gatherings of TSCC condo owners, hotel users, and potential participants. Jim Hayes will present an update on the current TSCC status and a recap of the way things work on the cluster. This will serve as a starting point for discussions of changes and enhancements that may be useful to you. It will also give both users and maintainers a chance to put faces together with the names on the mailing list. For those who might be interested in a look at the cluster hardware, Jim can also give a quick tour (bring your earplugs!).
We have reserved the SDSC auditorium for Wednesday, June 26th, from 1:00-2:00 p.m. for this meeting. Please plan to attend if you can make it. We'll post a reminder as the date gets closer.
Posted 3:30 p.m. Thursday, May 9, 2013: Migration of Triton Users to TSCC
The new TSCC research cluster is moving into production, which means that the clock has started ticking on the decommissioning of Triton--we plan to shut Triton down at the end of June. In the coming weeks, we'll be working with Triton users who have a balance of purchased cycles in making the transition to the new cluster.
There are a few steps you'll need to take as part of the transition. Because the TSCC is running a later version of the OS than Triton, it's very likely that any self-installed binary applications that you've been running will need to be rebuilt to run on the new cluster. (The standard application stack on the TSCC will be nearly identical to that on Triton, although with updated versions. In contrast to Triton, the TSCC "hotel" has an InfiniBand interconnect, so mvapich2 and openmpi are the supported MPI flavors.)
Home file space on the TSCC will be limited to 100GB/login; if you're one of the 50 or so Triton users with more space than that, you'll need to pare down your usage to fit. If you have large amounts of data that you'd like to carry over to the TSCC, see SDSC Storage & Backup Solutions for options. If you have lab file servers that you would like mounted on the TSCC, please contact Jim Hayes for details.
As part of this transition, we're going to start enforcing the long-deferred 90-day purge policy on the /oasis parallel filesystem. Data more than 90 days old will be subject to purge without further notice beginning at the end of May.
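If you want to see which of your files would be candidates under a 90-day purge, something like the following sketch may help. It assumes GNU find and touch, and it assumes the purge keys on modification time, which this notice does not specify. A throwaway directory stands in for a real directory on /oasis.

```shell
# Sketch: list files whose modification time is more than 90 days old.
# Assumption: the purge keys on mtime; the announcement does not say which
# timestamp is used.
list_purge_candidates() {
    find "$1" -type f -mtime +90 -print
}

# Demonstration on a temporary directory (stand-in for a directory on /oasis).
demo_dir="$(mktemp -d)"
touch "$demo_dir/recent.dat"
touch -d '100 days ago' "$demo_dir/stale.dat"   # GNU touch syntax

purge_list="$(list_purge_candidates "$demo_dir")"
echo "$purge_list"
```

On the cluster, the same function could be pointed at your own directory on the parallel filesystem instead of the demonstration directory.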
If you are currently using a trial login on Triton, your account and data will be deleted at the end of June. You may apply for a trial account on TSCC or purchase time. Note that trial accounts on the TSCC will be limited to 250 SUs and 90 days.
For information on accessing and using the TSCC, see the TSCC user documentation.
Posted 11:30 p.m. Wednesday, May 1, 2013: Production Availability of TSCC
The doors to the new Triton are now open. Accounts are being set up for participants who have hardware running in the cluster, and emails will be sent to users as their accounts become available. Once that is done, system admins will begin migrating user accounts over from Triton.
For a short time, the system will run in "debugging mode," because in-house testing never covers all the combinations of circumstances that real users running real codes generate. Once the bumps are smoothed out, all user accounts will be reset to their full allocations, and the cluster will be in full production.
Posted 2:30 p.m. Monday, March 4, 2013: Progress on Cluster Roll-out; Production Availability Coming Soon
The general computing nodes are up, wired, and running as a cluster. We identified a work-around for the HP PXE-boot bug (thanks, Phil P.), so we can now administer them properly. The unique aspects of our scheduler config (i.e., the implementation of the hotel/condo policies) are holding up under testing, as is most of the application software stack; we're fixing configuration bugs in two of the packages, and the others run smoothly.
The replacement 10GbE switches finally arrived Friday, and we're in the process of configuring, racking, and wiring them. Once that's up and running, we'll have a pathway to external storage.
We selected Advanced HPC to provide GPU nodes for the cluster and have placed the first order for two nodes. Those should arrive in a few weeks.
Our recharge proposal has passed the first (reportedly, the most difficult) hurdle in its review, so we're significantly closer to being able to put the costs in ink. We have four additional researchers in varying stages of signing up to participate in the TSCC.
Our apologies for not getting the cluster up and running as quickly as we had hoped. Pieces are falling into place steadily, and we should be able to open the doors soon.
Please see the TSCC RCI website for more information about the cluster. The biggest change from the prior version is that the GPU hardware and costs are now filled in.
Updated 11:00 a.m. Monday, February 11, 2013: Switch Maintenance Complete
The switch work is completed, and Triton should be fully accessible again.
Posted 10:30 p.m. Monday, February 11, 2013: Short Downtime for Switch Updates Starts Approx. 9:30 a.m.
Required switch maintenance is currently underway on Triton. Users will be unable to connect to Triton during the brief downtime. A follow-up will be posted when the system is accessible again. This will not interfere with running jobs.
Posted 4:30 p.m. Friday, January 18, 2013: Set for Price Center on Thursday Jan. 24, 2013
The roll-out presentation for TSCC is scheduled for Thursday, Jan. 24, from 1:00-2:00 in the Price Center (second floor, in the offices above the food court). All interested persons are encouraged to attend. There will be a discussion of TSCC computing, including the new hardware, as well as storage, networking, colocation, and data curation services offered by RCI. This will also be a good opportunity to ask questions.
Posted 4:00 p.m. Friday, January 18, 2013: TSCC to Support Trials Beginning Feb. 2013
Please be aware that trial accounts on the Triton Resource will no longer be available effective Jan. 31. Interested users can apply instead for trial accounts on the next generation of Triton, known as the Triton Shared Compute Cluster (TSCC). We estimate availability of trial accounts in February 2013. The new trial accounts will be for 250 core-hours with a 90-day duration. For more details on TSCC, see below and visit the RCI website.
Posted 2:30 p.m. Friday, January 4, 2013: Triton Shared Computing Cluster to Debut Soon
As mentioned in a post to the Triton Discuss mailing list on Nov. 16, 2012, we are planning a restructuring and equipment upgrade of the Triton Resource cluster. The transition to the new system, to be known as the Triton Shared Computing Cluster (TSCC), is expected to be complete by February 1, 2013. The existing Triton Resource will be decommissioned concurrently.
TSCC will build on our experience operating Triton Resource and will feature updated and more capable technology. The TSCC will employ a hybrid business model:
For current Triton Resource users with outstanding balances of computing time, we are adopting the following policies:
We are adjusting the pay-as-you-go (hotel) portion of the system to 40 nodes (640 total cores as compared to 2,048 cores on Triton Resource), and therefore expect that overall % utilization of these nodes (and hence, average queue waits) will rise compared to what is currently experienced with the Triton Resource. Therefore, we encourage you to take advantage of the current low queue waits on the Triton Resource to complete some of your computing work prior to transition. (Please note it is not our plan to introduce unacceptable queue waits on the new system but rather to balance wait times with efficient resource utilization.)
We appreciate your support and use of the Triton Resource and we look forward to providing you access to updated technology.
Additional information may be found on the RCI Website. If you have any questions or require further information, please contact:
Updated 3:00 p.m. Monday, October 15, 2012: Maintenance Completed Ahead of Schedule
The work on Oasis has been completed, and Triton is back in service. Queued jobs should begin running soon; let us know if you see any problems.
Updated 11:15 a.m. Monday, October 15, 2012: Login Nodes Will Stay Online
/oasis/triton/scratch has been unmounted across Triton, and the hardware work is getting underway. We were able to unmount from the login nodes without rebooting, so we plan to leave them up for the duration.
Updated 7:45 a.m. Monday, October 15, 2012: Full-day Downtime Expected
Just a reminder that Triton will be offline for hardware work today, Monday, October 15th. Any jobs submitted with a walltime that stretches into that period will be held until the system is back online.
Posted 10:00 a.m. Tuesday, October 2, 2012: Full-day Downtime Expected
Triton will be going offline on Monday, October 15th to replace the RAID controllers on the Lustre filesystem. There are a lot of them, so the current plan is to have the system offline for most of the day. I'll post timing details as the day approaches, but you should figure that Triton will be unavailable from 8:00 a.m. until sometime late in the afternoon.
Posted 10:30 a.m. Thursday, September 27, 2012: Latest Versions Now Installed
Newer versions of the Intel compilers have been installed into /home/beta/intel. To use them with the module command, execute
module load intel/2011.7.256
Let us know if you run into any problems.
Updated 1:00 p.m. Tuesday, August 14, 2012: User Access Re-enabled Today
Although the copy-back still isn't done, the filesystem has been mounted across Triton. Members of the jcsung, stsi-group, and nmalloy-mss groups should be aware that the contents of their directories are still incomplete, and they should delay accessing data until they are directly informed that the copying process has finished. All other users, please feel free to begin using the filesystem.
There's still a bit of clean-up to do—a process is being run to make sure that all files belong to group scratch and that the setgid bit is set for all directories. That will take a while to complete, but, with any luck, won't cause difficulties in using the filesystem while it runs.
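For the curious, a clean-up pass of this kind can be sketched roughly as follows. This is not the script being run on Triton; it is a minimal illustration on a throwaway directory tree, and it uses the caller's primary group as a stand-in for the scratch group.

```shell
# Sketch: give every file under a tree one group and set the setgid bit on
# every directory, so that new files inherit the directory's group.
fix_group_and_setgid() {
    root="$1"; group="$2"
    chgrp -R "$group" "$root"
    find "$root" -type d -exec chmod g+s {} +
}

# Demonstration in a temporary tree. The caller's primary group stands in
# for the "scratch" group named in the announcement.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/a/b"
touch "$demo_root/a/b/file"
fix_group_and_setgid "$demo_root" "$(id -gn)"
ls -ld "$demo_root/a/b"
```

The shell test `[ -g dir ]` reports whether the setgid bit is set on a given directory.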
The new mount point is /oasis/triton/scratch. Remember that new policies now take effect for the parallel filesystem, including a 90-day purge policy. Please report any problems you encounter, and thank you for your patience.
Updated 2:45 p.m. Monday, August 13, 2012: Final Data Copy Almost Complete
We had a handful of dropped connections over the weekend, which have slowed the data copy, but we're down to the last two directory trees. We will post as soon as the filesystem is available.
Updated 5:30 p.m. Thursday, August 9, 2012: Final Data Copy Underway
Copy of data back to the rebuilt parallel filesystem has begun. That process will likely take about a day and a half. We'll let you know as soon as the filesystem is remounted and available for use.
Updated 4:00 p.m. Wednesday, August 8, 2012: Final Performance Tests Underway
We're doing a final test to check concurrent bandwidth to the parallel filesystem. Assuming that checks out, we will begin copying back data later this afternoon—a process that should complete on Friday. At that time, the new PFS should be accessible from Triton for running jobs.
Updated 2:00 p.m. Thursday, August 2, 2012: Maintenance Extended Through Weekend
After reconfiguring and rebuilding the system, testing showed that the data transfer from two of the data servers is extremely slow. The vendor has tried replacing drives and controllers, but the problem persists. Diagnosis is continuing.
Once the problem is fixed and we've verified the performance of the file system, it will take about 36 hours to transfer the contents of /phase1 onto the rebuilt system. So, the parallel file system is likely to remain offline at least until Monday, August 6. Apologies for the continued delay; we will post as soon as we have more information.
Updated 2:15 p.m. Thursday, July 19, 2012: Week-long Phase 1 Maintenance Starts Tomorrow
This is a final reminder that the Phase1 PFS will be going offline tomorrow morning. Unmounting will commence at approximately 8:00 a.m. One or both login machines may require reboots to clear hung mounts; if so, a warning will be sent to the triton-discuss list before starting the reboot.
Posted 2:00 p.m. Monday, July 16, 2012: Triton will continue to run without access to /phase1
A reminder that Phase1 will be down for a week beginning this Friday. Any purging you can do of data from there that you no longer need would be appreciated.
Posted 11:00 a.m. Thursday, May 27, 2012: Entire cluster affected, restart in progress
Yesterday's power outage, which severely affected the entire UCSD campus, caused a loss of power to the SDSC machine room and took Triton offline. Power was restored by 2 a.m. and we are in the process of bringing systems back online. The outage was caused by a transformer fire in the main substation that serves the UCSD campus.
Please note that there are several services (network, disk resources for filesystems, etc.) that need to be restored before the compute cluster can be brought online. We will post to the discussion list and this web site when the system is fully functional again.
Posted 9:15 a.m. Thursday, May 10, 2012: More memory added to 20 TCC nodes
We have doubled the memory capacity on 20 of the batch nodes, from 24GB to 48GB. If you have an application that could benefit from the extra memory, you can direct your job toward these nodes by submitting to the batch queue with a request for memory greater than 24GB. For example:
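A request along these lines might look like the sketch below. It assumes TORQUE/PBS-style directives and the `mem` resource; the exact resource name Triton expects, the core count, the walltime, and the program name are all illustrative assumptions rather than confirmed settings.

```shell
# Sketch: write a PBS job script that asks for more than 24 GB of memory.
# The core count, walltime, and program name are illustrative placeholders.
job_script="$(mktemp)"
cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=8
#PBS -l mem=32gb
#PBS -l walltime=01:00:00
cd "$PBS_O_WORKDIR"
./my_large_memory_app    # hypothetical executable
EOF

cat "$job_script"
# On the cluster, the script would then be submitted with: qsub "$job_script"
```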
There is no additional charge for running jobs on these higher capacity nodes. Please consume responsibly.
Updated 4:15 p.m. Monday, May 7, 2012: Phase 1 reconfigured and in service
Triton is back up. We'll be monitoring /phase1 operation and performance closely; please post if you encounter any problems, and thanks for your patience.
Updated 3:00 p.m. Monday, May 7, 2012: Phase 1 still down, no estimate of completion time
Work continues on bringing /phase1 back up fully--no updated estimate at this time as to when Triton will be back online. Our apologies for the delay. We've had a card failure on one of the /phase1 servers, which lengthened the downtime.
Posted 3:00 p.m. Friday, May 4, 2012: Phase 1 reboot requires scheduler suspension
We're going to take Triton down for some brief maintenance on Monday, May 7. We'll be rebooting the /phase1 fileservers and doing some rewiring on the Myrinet switch. Estimated downtime is one hour; we'll likely reboot the login nodes at the same time to clear any mounts that go over the Myrinet.
We have a reservation in place to prevent jobs from starting during this maintenance period. Please post to the discussion list if you have any questions.
Updated 10:30 a.m. Monday, Mar. 12, 2012: Users Can Re-login to Update Tools Path
We've made the jump to Maui and will be watching closely during the day. Scheduler-related tools such as showq are now found in /opt/maui/bin instead of /opt/moab/bin. Logging out of Triton and back in will be sufficient to pick up the replacement versions, or you can modify your PATH variable directly. Please post if you notice any problems—most likely a submitted job that stays queued for no apparent reason.
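If you would rather edit PATH in place than log out, the swap can be sketched as below. The starting PATH shown is made up for the demonstration; only /opt/moab/bin and /opt/maui/bin come from this notice.

```shell
# Sketch: replace one directory in a colon-separated PATH with another.
swap_path_entry() {
    path_in="$1"; old="$2"; new="$3"
    result=""
    old_ifs="$IFS"; IFS=':'
    for entry in $path_in; do
        if [ "$entry" = "$old" ]; then
            entry="$new"
        fi
        result="${result:+$result:}$entry"
    done
    IFS="$old_ifs"
    printf '%s\n' "$result"
}

example_path="/usr/local/bin:/opt/moab/bin:/usr/bin"   # made-up starting PATH
new_path="$(swap_path_entry "$example_path" /opt/moab/bin /opt/maui/bin)"
echo "$new_path"
# To apply it to the current shell:
#   PATH="$(swap_path_entry "$PATH" /opt/moab/bin /opt/maui/bin)"
```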
Updated 2:30 p.m. Thursday, Mar. 8, 2012: Scheduler Upgrade Postponed to Monday, 3/12/2012
We're postponing the final switch from Moab to Maui until Monday, 3/12. At that point we'll remove Moab from the login nodes, so users will need to reset their paths to pick up the tools in /opt/maui/bin.
Updated 5:35 p.m. Wednesday, Mar. 7, 2012: Scheduler Software Upgrade Delayed
We've been unable to get Maui running this afternoon, and we're going to pick up trying in the morning. In the meantime, we've restarted Moab so that jobs can proceed. We'll follow up with any status changes.
Posted 1:05 p.m. Tuesday, Mar. 6, 2012: Scheduler Software Being Replaced
We're planning a brief downtime for Triton Wednesday, 3/7/2012, at 2:00 p.m. for some scheduler work--our Moab license is expiring, and we're opting to switch to the Maui scheduler instead of renewing. The change-over shouldn't affect any running jobs, but there will be an interruption in your ability to submit new ones. After the change is made, you'll need to reset your path on the login nodes to pick up Maui's versions of the scheduler tools (e.g. showq), rather than Moab's. An easy way to do this is to log out and back in.
We'll post follow-up messages as the work proceeds.
Updated 9:05 a.m. Monday, Feb. 13, 2012: SDSC Switch Reboot Completed Successfully
The switch reboot is completed, and jobs are running on Triton again. We'll be doing testing this morning to see if the reboot fixed the problem with writes to /phase1.
Updated 1:00 p.m. Friday, Feb. 10, 2012: SDSC Switch Reboot Scheduled for Feb. 13 8-9 a.m.
In order to fix a problem some people are having with large writes to /phase1, we'll need to reboot our network switch. We have a one-hour downtime scheduled for Monday, 2/13, at 8:00 a.m. for this purpose. Newly-submitted jobs that run into that time period will be queued until after the downtime.
For questions, please email the Triton Discussion List, firstname.lastname@example.org.
Updated 6:00 a.m. Tuesday, Feb. 7, 2012: SDSC Switch Maintenance Scheduled for Feb. 6 9 a.m. - 9 p.m.
Triton is available again. If you run into any issues, please email the Triton Discussion List.
Posted 1:00 p.m. Friday, Feb. 3, 2012: SDSC Switch Maintenance Scheduled for Feb. 6 9 a.m. - 9 p.m.
Triton will be unavailable next Monday for work on the /phase1 interface. Any jobs that extend into Monday will stay queued until the work is completed. We've set aside a full day, but we hope to have it done by 5 p.m.
For questions, please email the Triton Discussion List.
Posted 3:00 p.m. Thursday, Sept. 22, 2011: Data Oasis Now Referred to as Phase One
SDSC has announced a new 800 TB Lustre filesystem, the first phase of a large high-performance filesystem that will be available to Triton users early next year. Phase One is available from Triton now, and currently has 16 storage servers, each with 4 object storage targets (64 OSTs total) and a peak measured bandwidth of 12.5 GB/s.
All current Triton users have an assigned directory in /phase1, which is accessible now. As of today, Sept. 22, 2011, the old filesystem (/oasis) is read-only. All users are requested to move their data from /oasis to /phase1 by Oct. 10, 2011, at which point /oasis will be retired from service. Users are requested to use compute nodes or the alternate Triton login node (triton-38) to move their data between filesystems to avoid overloading a single client machine. Please contact the Triton Discussion List if you need assistance.
The /phase1 filesystem is intended to be high performance scratch storage and is subject to a purge policy. The filesystem is not backed up and should not be used for long-term storage. Users are reminded that any important data must be moved to their own local storage resources.
Updated 8:00 a.m. Wednesday, Sept. 21, 2011: SDSC Switch Maintenance Completed Successfully
The switch upgrade was completed at approximately 8:30 p.m. Tuesday, Sept. 20. Triton queues should again be accessible from external locations.
Updated 10:30 a.m. Tuesday, Sept. 20, 2011: Maintenance to Continue This Evening
After a longer-than-expected delay yesterday, the two Lustre filesystems are finally remounted across Triton, and the queues are pushing through jobs again. We will make every effort to keep this evening's outage time to a minimum.
From 6:00-9:00 tonight the network folks will be upgrading the switches that connect SDSC to the outside world. During that time, if you're located somewhere outside SDSC, you're likely to be unable to connect to Triton. The good news is that this work shouldn't affect connectivity within SDSC, so jobs in the queues at 6:00 p.m. should be able to run without difficulty--you just won't be able to see the results until outside connections are restored. If you have jobs that happen to need outside connections (e.g. they rely on an externally-mounted disk), it would be a good idea to put a hold on them until after the work is complete. See /status.externsdsc.org/ for status updates.
Posted 9:30 a.m. Monday, Sept. 19, 2011: Due to Reopen Around Noon
The Triton batch queues are unavailable today, Monday, September 19th, until around noon for some switch work. As part of this work, both /phase1 and /oasis will be unmounted for the duration.
We will post as soon as /phase1 and /oasis are remounted.
Posted 12:30 p.m. Friday, Sept. 16, 2011: Reservations will prevent some jobs from being scheduled until after completion of downtime
There is a system-wide reservation in place for Data Oasis maintenance on Monday, Sept. 19 from 9 a.m. to noon. Jobs which specify end times after the start of this maintenance will not be scheduled until after it ends. There is also a reservation for Tuesday, Sept. 20 from 6 p.m. to 10 p.m. with the same effect on scheduling.
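To judge whether a job can start before a reservation, compare its projected end time with the start of the maintenance window. The sketch below does that arithmetic with made-up epoch times; it only illustrates the rule described above and is not the scheduler's own logic.

```shell
# Sketch: would a job starting at start_epoch with the given walltime
# (HH:MM:SS) end after the maintenance reservation begins?
walltime_to_seconds() {
    # Split on colons; strip one leading zero so "08" is not read as octal.
    old_ifs="$IFS"; IFS=':'; set -- $1; IFS="$old_ifs"
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

runs_into_maintenance() {
    start_epoch="$1"; walltime="$2"; maint_epoch="$3"
    end_epoch=$(( start_epoch + $(walltime_to_seconds "$walltime") ))
    [ "$end_epoch" -gt "$maint_epoch" ]
}

# Example with made-up times: maintenance starts 3 hours after "now".
now=1000000
maintenance=$(( now + 3 * 3600 ))
if runs_into_maintenance "$now" "04:00:00" "$maintenance"; then
    echo "a 4-hour job would wait until after maintenance"
fi
if ! runs_into_maintenance "$now" "02:00:00" "$maintenance"; then
    echo "a 2-hour job can be scheduled now"
fi
```

Requesting a shorter walltime, where your job allows it, is the usual way to fit in ahead of a reservation.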
Updated 6:30 p.m. Tuesday, Sept. 13, 2011: Rebooting Both Login Nodes Around 8:00 p.m. PT
We need to reboot/reinstall the login nodes once more to clear some hung processes and fix a problem with the MPI installation. We will start by shutting off new logins to the alternate login node (triton-38). That will probably cycle by 8:00 p.m. After it comes back up, we'll do the same for the primary login. Apologies for the instability; with luck, this will be the last service interruption we'll see in the aftermath of Thursday's outage.
Updated 11:45 p.m. Sunday, Sept. 11, 2011: Primary Login Node Available Again
triton-login is back up; triton-38 was also down. Details will be forthcoming as we learn more about the cause.
Updated 12:00 p.m. Sunday, Sept. 11, 2011: Alternate Login Node in Use Until Further Notice
Our main login machine, triton-login, seems to be down. Please use our alternate login, triton-38.sdsc.edu, until further notice.
Updated 12:00 p.m. Saturday, Sept. 10, 2011: Triton Back Online
Triton has mostly recovered from the faceplant it took during the power outage. As mentioned, we have temporarily lost a file server (but not the data stored on it), and we had to replace the cluster management server, which has some lingering repercussions. Here's what we know is presently *not* working:
We'll be working on these problems, plus anything else we discover, over the next week. Please use Triton with some caution during this shake-out, and post any problems that you encounter.
Updated 3:10 p.m. Friday, Sept. 9, 2011: Mid-afternoon update
The Triton management node (new hardware) is up and running and is busily building the rest of the nodes. The major issue we see is that one of the Project NFS servers refuses to boot. That particular issue will not be solved until Monday or Tuesday.
The specific projects that are affected are mounted under /projects:
cgl-group frazer-group geogrid-aist liai-group nrnb-group ren-lab biogem-lab camera camera-lab crbs-group gleeson-lab lca-group mmiller-group sarkar-lab zhang-lab
If you have data in these directories, we have backups of your data as of approximately 4:00 a.m. on Sept. 8. If you need access to your replicated data, please contact Phil Papadopoulos or Jim Hayes. We can give you some options (read only, read/write, etc.), but we would want to talk to you individually. Our expectation is that the data on the primary Projects server is intact; we simply can't see it until the hardware is repaired.
Home area data (e.g. /home/user) is stored on different servers and is unaffected.
All data partitions on /oasis and /phase1 have been checked and look to be in good shape, too.
We hope to have Triton otherwise restored by the end of the day (but it may go into the weekend). So far, the rebuild is going smoothly.
Posted 1:30 p.m. Friday, Sept. 9, 2011: Damaged hardware will be replaced today
The power outage wreaked havoc with Triton's management node. We're replacing the hardware and rebuilding. We'll let you know when things are back in operation.
User data (held on different systems) are all intact.
Updated 11:00 a.m. Wednesday, June 29, 2011: Maintenance completed at 4:00 p.m.
Triton is available again after a longer-than-expected downtime. We've been struggling mostly with getting the updated batch scheduler software to work as desired; it seems like most of the kinks are worked out now.
Feel free to log in and resume working. Please keep a somewhat closer-than-usual eye on your jobs for the first few days, and post to Triton Discuss if you encounter anything that doesn't look right.
We appreciate your patience.
Updated 8:00 a.m. Tuesday, June 28, 2011: Short delay in upgrade procedure
We've run into some difficulties getting Triton to stand up properly after the software upgrade. We're continuing to work on it, and will post a follow-up when the system becomes available again. At this point, we do not expect to have it available until sometime Tuesday. We apologize for the delay.
Posted 2:00 p.m. Thursday, June 23, 2011: OS upgrade and new software will be available
Triton will be offline for a software upgrade next Monday, June 27th. We'll be moving from Rocks v5.3/CentOS v5.4 to Rocks v5.4/CentOS v5.6; additional new and updated applications are listed below. We've blocked out 8:00-5:00 Monday for the upgrade, but the actual downtime should be much shorter. Check this website, the mailing list, or the Triton Twitter feed for a follow-up message when the system is available again.
Please note that, unlike other recent downtimes, jobs in the queue when the system goes down will need to be resubmitted once it comes back up. Please contact the Triton Discuss mailing list if you have any questions.
New applications in /opt:
| fftw | v2.1.5 (in addition to v3.2.1) |
| python | v2.7 + v3.2 |
Updated applications in /opt:
| Application | Current Version | New Version |
Apologies for the extreme delay. Triton is back up with /oasis mounted, the login nodes are reopened, and the queues have started running again. Please post a note to the Discussion List if you run into any problems.
Updated 9:30 a.m. Tuesday, May 24: Data Oasis Communications Pending
Update on the 5/23 physical relocation: the move was completed in good time yesterday; however, Data Oasis continues to exhibit communications failures. Triton will remain offline until the problem is resolved. We will post follow-ups as we find out more.
Updated 10:30 a.m. Tuesday, May 10: All-Day Downtime Set for May 23
To make room for some new hardware, the Data Oasis servers and associated switch are going to be moved to SDSC's other machine room on Monday, May 23rd. Between the move, re-racking, and re-cabling, this will be a more involved process than the switch work we did a few weeks ago; we're figuring that we'll likely have close to a full day's downtime. The Triton submission queues will be down during this period, and access to the Triton login nodes may be shut off as well. We'll post updates as more information becomes available.
Updated 12:30 p.m. Monday, April 25: All Nodes Access Data Oasis
Switch maintenance is complete, and access between Triton and Data Oasis has been restored. The hold on jobs has been removed, so submissions should begin moving through the queues again. Please report any problems to the Triton Discussion List (email@example.com).
Updated 3:15 p.m. Thursday, April 21: 8 a.m. Start for Switch Maintenance
Triton will be down for maintenance from 8 a.m. to 1 p.m. PT on Monday, April 25, 2011. Work will be performed on one of the switches that provides access to Data Oasis. All jobs scheduled to be running during this window will remain queued until after the connection to /oasis is restored.
Currently, there is no plan to shut off access to the Triton login nodes. However, they will probably be rebooted on short notice once the switch work is complete. Access to user home filesystems should not be affected; however, most /projects filesystems will be offline during the outage.
Updated 3:00 p.m. Thursday, February 3, 2011: Brief Period of Unavailability Via Login Node
The upgrade to file servers supporting the home filesystem was completed at approximately 2:30 p.m. today.
Posted 12:05 p.m. Thursday, February 3: Brief Period of Unavailability Via Login Node
We are performing maintenance on the home filesystem, which has caused Triton's login node to be inaccessible for a short time. We will post a notice on this page and the Discussion List when system access is back online.
Posted 3:30 p.m. Wednesday, January 19, 2011: Students Learning to Supercompute via Triton
Triton has become a teaching resource as well as a research one in its first year on campus. Read more at the SDSC News Center.
Posted 10:30 a.m. Saturday, November 27: Lack of Cooperation Dooms Open Policy
Because of recurring issues with users consuming disk space without regard to the space available, we have had to modify our user-friendly space allocation policy, which allowed users to expand into the space they needed and then contract after use. That open policy has failed.
Hard quotas now exist on all home areas. The standard allocation is 100 GB per user. Currently, users who are consuming more than 100 GB have had their quotas set to accommodate active space as of 11/19/2010 so that users can effect cleanup of their home areas.
If you are consuming more than 100 gigabytes of space, you must either
At the time the policy change was instituted, there were about 60 users consuming over 100 GB of home area space. If you have not been consuming more than 100 GB of space on a long-term basis, or have been attentively clearing out your overage, we thank you!
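To check where you stand against the quota, a rough comparison can be done with du, as in the sketch below. It treats 1 GB as 1024 x 1024 KB, which is an assumption about how the quota is counted, and it runs on a temporary directory rather than a real home area.

```shell
# Sketch: compare a directory's disk usage with a quota given in GB.
# Assumption: 1 GB is counted as 1024 * 1024 KB.
QUOTA_GB=100   # standard home allocation stated in the announcement

usage_kb() {
    du -sk "$1" | awk '{print $1}'
}

over_quota() {
    dir="$1"; quota_gb="$2"
    [ "$(usage_kb "$dir")" -gt $(( quota_gb * 1024 * 1024 )) ]
}

# Demonstration on a temporary directory; on the cluster, pass "$HOME".
demo_home="$(mktemp -d)"
echo "some data" > "$demo_home/notes.txt"
if over_quota "$demo_home" "$QUOTA_GB"; then
    echo "directory exceeds ${QUOTA_GB} GB"
else
    echo "directory is within ${QUOTA_GB} GB"
fi
```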
Posted 8:30 a.m. Friday, October 29: Last Chance to Get Your Data
Last weekend's changes seem to have finally settled down Mirage. We hope you've had a chance to copy any data you'd like to keep to a new home on Data Oasis.
We plan to permanently disconnect Mirage from Triton this coming Monday, November 1st. Remember that the disks from Mirage will be erased and reused, so anything that hasn't been copied will be unrecoverable.
Posted 9:30 p.m. Tuesday, October 26, 2010: All User Data Must Be Copied to Oasis or Lost
Within the next couple of days, all Mirage disks will be taken offline from Triton. Any user data stored there will no longer be accessible. If you have yet to migrate all your essential data to Oasis, please inform the discussion list immediately so arrangements can be made to preserve your data.
We recommend using an interactive node to migrate your data to Oasis rather than a session on the login node. Here are some suggestions for how to perform that procedure:
% qsub -I
There are many ways to do this, perhaps the slowest being cp -R. One simple and preferable method is:
% cd /mirage/<username>
% tar cf - . | ( cd /oasis/<username> && tar xvfBp - )
You might follow the above tar with an rsync, for example:
% rsync -a /mirage/<username>/ /oasis/<username>/
Note that rsync syntax is rather sensitive. The trailing slashes on the command above are important.
Using tar will cause reads to be well buffered, putting Lustre into more of its comfort zone for "big" files. If you have many small files, there is not an efficient way to move data, mostly because Lustre is not very efficient on small files.
Many other efficient methods are available for large files.
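The tar pipeline above can be wrapped in a small function that also checks the result. This is only a sketch: copy_tree is a made-up name, and it archives "." rather than "*" so that dotfiles are included, which differs slightly from the command shown earlier. The final diff -r is a simple stand-in for a follow-up comparison pass.

```shell
#!/bin/sh
# copy_tree SRC DST: stream SRC into DST through a tar pipe, then compare.
# Archiving "." instead of "*" also picks up dotfiles, which "*" skips.
copy_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # Pack in a subshell on the source side, unpack on the destination side;
    # -p preserves permissions on extraction.
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -) || return 1
    # diff -r exits non-zero if any file is missing or differs.
    diff -r "$src" "$dst" >/dev/null
}

# Example (paths are placeholders, as in the commands above):
# copy_tree /mirage/<username> /oasis/<username>
```

A non-zero exit status means the copy or the comparison failed, so the source data should be left in place until a rerun succeeds.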
Updated 4:30 p.m. Tuesday, October 12: Transition of Data Is In Progress
As of today, Data Oasis is available for use. Users should begin moving their data from Mirage and updating job scripts to write to /oasis immediately. On about Oct. 29, Mirage will be completely removed from Triton and user data there will no longer be available.
We'll start the process of flipping /mirage to read-only on Friday afternoon, Oct. 15, finishing sometime Monday. Unfortunately, this will require yet another round of rebooting, but we'll try to keep disruptions to a minimum.
The scratch directory for jobs can be found at /oasis/scratch/<login>/$PBS_JOBID. We still need to add the code to set an environment variable to this path and to clean out the directories after three days.
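Until the system sets a variable for this path, a job script can build it itself. The sketch below assumes the script runs under the batch system, which sets PBS_JOBID. JOB_SCRATCH and job_scratch_dir are hypothetical names chosen for illustration, not names the system defines.

```shell
#!/bin/sh
# job_scratch_dir BASE LOGIN JOBID: print the per-job scratch path.
job_scratch_dir() {
    printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Inside a job, PBS_JOBID is set by the batch system.
# JOB_SCRATCH is a hypothetical variable name, not one the system provides.
if [ -n "${PBS_JOBID:-}" ]; then
    JOB_SCRATCH=$(job_scratch_dir /oasis/scratch "${USER:-$(id -un)}" "$PBS_JOBID")
    export JOB_SCRATCH
fi
```

Outside a job, PBS_JOBID is unset and the block leaves the environment untouched.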
Updated 1 p.m. Monday, October 11: Transition To Data Oasis Delayed
Due to complications with Mirage, the transition will start tomorrow, Tuesday, October 12. Sorry for any inconvenience this may cause. Please follow the schedule below, with each date offset by 24 hours.
Posted 2 p.m. Friday, October 8: Transition To Begin Monday, October 11
Over the past couple of weeks the Triton team has been testing the first incarnation of Data Oasis (DO), a new parallel filesystem intended to replace Mirage. We're ready now to go live with the system. Given the fun we've experienced with Mirage over the past 12 hours, it looks like the timing is good.
This version of DO will have approximately 250 terabytes usable capacity, more than doubling the amount of disk space available on Mirage. DO will run the latest version of the Lustre filesystem; word on the street and our own experience during testing both indicate that we should see significant improvements in stability by making the upgrade.
Next Monday (10/11) we'll begin a two-week transition period to move from Mirage to DO. Both filesystems will be mounted across Triton, Mirage at /mirage as usual, and DO at /oasis. As with /mirage, you will find a directory on /oasis named after your login id. Please modify any references you have in scripts, etc. to refer to the new filesystem, and start copying any data you want to keep from Mirage to DO. After three days (Thursday 10/14, long enough to let any jobs running Sunday complete), we'll remount Mirage read-only so that jobs won't be able to write any new data to it.
Unlike Mirage, we plan to reserve 50 terabytes on DO for job scratch space. Each job will have the associated scratch directory /oasis/scratch/<login>/<job#> that can be used to place data temporarily. (We'll set a job-specific environment variable to reference this directory.) These directories and their contents will be purged automatically three days after the job completes — this really is scratch space.
Also unlike Mirage, we'll be placing per-user quotas on DO usage, with the goal of avoiding the performance degradation we saw on Mirage when usage got into the high 90-percent range. Details will follow, but each user will have at least enough space to hold their current Mirage usage.
After the two-week transition period, Mirage will leave Triton and its disks will be recycled to other uses, so any data that hasn't been copied over really will be gone. As with Mirage, making backups of data on DO will be the responsibility of the users.
Please post any questions you have about this transition to the Triton Discussion List (firstname.lastname@example.org), and we'll do our best to clarify and help in the move. There should be a /oasis set up sometime late Sunday afternoon; feel free to get an early start on transferring your data.
Posted 12:00 p.m. Friday, October 8: New Results from Storage Upgrade
These numbers are for up to 4 nodes using 32 cores. Tests were also run on up to 128 nodes and 1024 cores.
|Nodes|Cores|Max Write|Max Read|
|1|8|378.12 MiB/sec|578.18 MiB/sec|
|2|8|601.66 MiB/sec|849.80 MiB/sec|
|4|8|744.38 MiB/sec|981.06 MiB/sec|
|4|16|740.49 MiB/sec|1066.38 MiB/sec|
|4|32|565.95 MiB/sec|1070.39 MiB/sec|
The peak performance for this set was:
|Type||Max speed||Nodes||Cores||File size|
Please visit the Data Oasis page for more information.
Posted 4:00 p.m. Thursday, Sept. 30, 2010: All Triton Rolls Can Be Obtained from the Rocks Git Server
In an effort to streamline accessibility to Triton source used to build our system software, we want to inform interested users about the availability of the Rocks Git server. From this location, daily CVS code updates can be downloaded and users can keep abreast of the most recent changes being checked in by Triton developers. In addition to Triton source code, this repository contains the latest Rocks internal and third-party code as well.
Posted 5:00 p.m. Tuesday, Sept. 28: Performance Expectations Exceeded During Phase 0 Certification
Performance testing on the first phase (Phase 0) of Data Oasis over the last several days achieved approximately 3.5 gigabytes per second (GB/s) on writes and about 7.6GB/s on reads using a 2 terabyte file and 512 clients.
In networking terms, this was 64 gigabits per second, or an incredible 80% of the theoretical channel-bonded link. Another way to look at this data is:
8000 / (8 * 1135) (our best 1 OSS number) = 8000 / 9080 = 88% scaling efficiency
Our goal was 7 billion bytes per second, so we beat that by 15%.
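The percentages quoted above can be reproduced with a few lines of shell. This is a sketch under some assumptions: the 8000 and 1135 figures are taken to be in the same units (presumably MB/s), the 80 Gb/s link capacity is the value implied by "64 gigabits per second" being 80%, and percent is a helper name made up for this example.

```shell
#!/bin/sh
# percent NUM DEN: print NUM/DEN as a percentage with one decimal place.
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Scaling efficiency: aggregate rate over 8 times the best single-OSS rate.
percent 8000 $((8 * 1135))       # prints 88.1

# Link utilization: 64 Gb/s against an assumed 80 Gb/s bonded link.
percent 64 80                    # prints 80.0

# Margin over the 7000 (7 billion bytes per second) goal.
percent $((8000 - 7000)) 7000    # prints 14.3
```

The last figure rounds to the "15%" quoted in the text.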
In related news, the Phase 1 Data Oasis RFP has been published. Oasis should be expanding in about two months. Our goal for Phase 1 will be approximately five times the sustained speeds achieved in Phase 0.
We hope to have Phase 0 available to all Triton users within a few days.
Posted 3:30 p.m. Wednesday, Sept. 22: Campus Users Will Lose Access to Triton while Router is Upgraded
On Tuesday, October 5th, 2010 from 5 p.m. until approximately 7:30 p.m., SDSC and UCSD networking teams will update routes to utilize the new MX960 router at SDSC, and retire the older T320 router (known as dolphin). Network routes between SDSC and UCSD will be affected, and there may be connectivity issues to some hosts.
During the maintenance, UCSD Triton users will be unable to access Triton resources. Users connecting to UCSD or SDSC from external hosts should continue to have uninterrupted access to Triton and campus networks.
Posted 9:30 a.m. Tuesday, Sept. 21: Users Asked to Remove Files
We've hit a critical point on /home file usage, where there isn't enough free space for the system overhead involved in deleting files. We have recently freed approximately 1.3 terabytes, which provided headroom for cleaning up the filesystem.
We have about 530 users sharing around 44 terabytes of home space. That works out to about 83 gigabytes per user. Because not all users need that much space (we have a number of idle accounts that contain little or no data), we've so far avoided putting quotas on the system. However, if you're using considerably more than that — say, 830 gigabytes or more — then you're consuming more than a fair share of the resource. Please shift your collected data off of Triton so that we have enough room for everyone to operate.
We continue to seek additional ways to relieve the space crunch. We've shifted some users to another server, freeing up about 30 terabytes from the primary system. However, there will always be a hard limit, no matter how much disk we throw at it, as seen in the 98% usage of the 100 terabytes on /mirage. Users offloading their data onto other resources is the only long-term solution.
Update (9:30 a.m. Monday, Sept. 13): Non-UCSD personnel being temporarily denied use of software
Due to restrictions in our licensing agreement with The MathWorks, we are currently forced to restrict access to both client and server MATLAB licenses on Triton strictly to UCSD users.
We regret any inconvenience this causes. We are working with the company to relax the restriction so that MATLAB may again be available to all users. We will announce policy changes when we reach a new agreement.
Update (9:30 a.m. Thursday, Sept. 2): Filesystem Capacity at 98%
/mirage has hit 98% of capacity. At that high a level of usage, performance bogs down and reliability can get a bit shaky. Please make a pass through your data on the filesystem and remove files you no longer need.
Update (1:30 a.m. Saturday, July 10): Triton_RC3 Rolls Now on Download Page
Many of the source rolls from the Rocks 5.3 upgrade completed on May 18 are now posted for download on the Triton Download Page. You can also find information on how to build a cluster like Triton elsewhere on our site.
Update (3:30 p.m. Tuesday, June 23): Benefits and effects of new filesystem
Triton will soon support a faster, more reliable, higher capacity parallel filesystem. Some of the features and benefits are listed below. We will provide more specific data at the conclusion of testing.
Expected User Benefits
Anticipated User Impacts
The new hardware and software versions are currently being tested on reserved nodes of Triton. Availability to the production nodes is expected within a few weeks.
Update (5:30 p.m. Wednesday, June 9): TRITON_RC3 Upgrade Details
Following are some particulars regarding the affected packages and systems.
Updated system software:
Rocks v5.1 --> v5.3
CentOS v5.2 --> v5.4
PGI compiler v8.0 --> v10.5
Lustre client v1.6.6 --> v1.8.3
Myrinet driver v1.2.8 --> v1.2.12
Moab v5.3.5 --> v5.3.7 (TORQUE roll update)
New applications (some of these have been in /beta; all will now have a permanent home in /opt):
BEAST 1.5.2
APBS 1.2.1
LAMMPS 28Nov09
NAMD 2.7b1
NWChem 5.1.1
Open Motif 2.3.2
PDT 3.15
TAU 2.19
SciPy 0.7.1rc3
FFTW v2.1.5 (in addition to v3.2.1)
In addition, performance and administrative gains include:
Update (1:00 p.m. Monday, May 3): New Support Feature! Get approval for your job to run longer than 72 hours!
For jobs requiring more than Triton's 72-hour wallclock limit, users may now request an exception to allow those jobs to be scheduled and run. Please make your request through the discussion mailman list (email@example.com) and system administrators will make the provisions necessary to support your request.
Update (9:30 a.m. Monday, Sept. 13): System should remain available
We received word this morning of a Linux security hole that requires patching our systems. We will start by reinstalling our two login nodes. Figure that triton-38 will go down at 2:30 this afternoon. Once it's back up, we'll take triton-login down. We'll also arrange a rolling reinstall of the compute nodes to avoid disrupting running jobs. With luck, the only impact to users will be a need to bounce between login nodes for an hour or so.
Update (8:30 p.m. Thursday, August 12): Switch Firmware Upgrade Complete
The upgrade has been completed. There do not appear to be any significant issues associated with the change. Please let us know if you experienced problems with a job during this interval (approximately 3 p.m. to 9 p.m. PT on Thursday, August 12, 2010).
Notice (3:30 p.m. Thursday, August 12): Running Jobs May Be Affected by Switch Firmware Upgrade
We need to upgrade the switch firmware on Triton's Myrinet switch so that it can talk at greater than 1 x 10GbE to our other machine room. As part of that process, the 10GbE connections to home area and Lustre will go up and down. In other words, there will be outages lasting 1 - 5 minutes in which access to home servers is unavailable.
We are 99.99% certain that NFS (home area mounts) will restart without significant issue. However, we are less certain about how Lustre will react to a 1 - 5 minute network outage. We'll monitor running jobs, and if Lustre falls over, folks will need to resubmit. (We of course will do the right thing with respect to charging if the outage has significant effect on running jobs).
There really isn't any -good- time to perform such an upgrade, so now is by definition the "best" time (don't ask for real logic on that statement ;-)).
The outage should be quite short (and if you are not actively working on the login node, you probably won't notice). We will let you know when the upgrade has been completed.
Update (6:00 a.m. PT Wednesday, August 4): New Switch Installed for PDAF/M Nodes
The switch failure affecting approximately 20 PDAF/M nodes that occurred on Sunday, August 1, has been resolved with the replacement of the switch controlling access to those nodes. The entire Triton cluster has been fully functional since Monday morning, August 2. If you experienced a job failure or lost time due to this outage, please request a refund through the Triton Discussion List.
Update (8:30 a.m. Sunday, August 1): Switch Failure Affects PDAF/M Nodes
An apparent switch failure on racks six and seven has temporarily rendered most of the large memory nodes on Triton inaccessible. The failure occurred at approximately 3:40 a.m. today. Administrators are working on the problem, and we hope to have a replacement switch installed soon. So far, this outage has not affected any TCC nodes, and those remain fully available. Likewise, the PDAF nodes in racks eight and nine remain available. More information will be posted as it becomes available. Check the Triton Status Page for the latest updates. This report is updated every two minutes. Even more detailed information is available on the Triton Ganglia page, which gets updated every one minute.
Update (9:00 p.m. Tuesday, May 18): Upgrade is complete as of 7:45 p.m.
User access to Triton is enabled. Please login and submit your jobs. Promptly report any unusual behavior to the discussion list. Thanks for being patient.
Update (6:00 p.m. Tuesday, May 18): Upgrade is progressing slowly, should be completed by about 7 p.m.
We are a little behind schedule, but progress is steady and the job should finish about two hours later than anticipated. We'll post here when Triton is back up.
Update (10:00 a.m. Tuesday, May 18): Upgrade has begun...
The maintenance period has begun. All users and jobs have been removed from the system. The new software stack is being installed at this time. Check back here for availability.
On Tuesday, May 18, system administrators will apply a major upgrade to the Triton Resource software stack. Triton will be unavailable to users beginning at 8 a.m. and should return to service by 5 p.m. (sooner if possible).
All running jobs will be terminated prior to starting the maintenance, and all temporary data will be discarded. Data on /home and /mirage filesystems will be preserved. To avoid lost work and the need to ask for refunds, do not submit jobs that will run during the maintenance period. All existing jobs will be cleared from the queues, so users must resubmit them after the maintenance is complete.
Details of the planned changes will be announced soon. Please contact the staff via the discussion list if you have questions. Thank you for your patience as we improve Triton's capability to serve you.
Update (12:00 p.m. Monday, Apr. 25): Advanced Programming Techniques with MATLAB at UCSD
The MathWorks will present two complimentary programming seminars to the UCSD community from 10:30 a.m. to 2:30 p.m. in the Student Center's Dolores Huerta — Philip Vera Cruz Room. The morning session, which runs from 10:30 until noon, is titled "Data Acquisition, Analysis and Visualization in MATLAB". A brief Q & A and refreshment break will be followed by the second seminar, titled "Speeding Up Applications: Parallel Computing with MATLAB", which runs from 12:30 until 2:30.
Those interested may sign up at The MathWorks seminar registration web site. For more details and contact info, you can also download the MathWorks announcement (Word doc). The Student Center is located in Muir College, next to Mandeville Center. Map details, parking and driving information are available by searching UCSD MapLink for "Student Center".
Update (11:30 a.m. Monday, Apr. 25): Triton PFS returned to service
SDSC Facilities upgraded the machine room floor April 24-25 for enhanced stability during earthquakes. The work required Triton's parallel storage hardware to be relocated, causing the /mirage filesystem to be offline over the weekend.
The maintenance was completed successfully, and the filesystem came back online without problems. Thank you for your patience during this downtime to improve Triton's future reliability.
Update (11:30 a.m. Thursday, Apr. 22): Triton /mirage downtime
SDSC Facilities has scheduled maintenance to the SDSC machine room floor this weekend. This work requires Triton's /mirage storage hardware to be relocated, so the filesystem will be offline during the relocation.
Please plan your work accordingly and defer jobs requiring large scratch space until after the maintenance is complete. Jobs with small scratch space needs may use the /home filesystem as a temporary alternative. Do not redirect large scratch space jobs to the /home filesystem, as this has been a denial of service problem recently.
We apologize for the disruption of service and thank you for bearing with us as the Triton facility is upgraded.
Update (10:30 a.m. Tuesday, Feb. 9): New program offers compute time
SDSC has announced the formation of the Triton Research Opportunities (TRO) program — a program to provide campus researchers a mechanism to tap into the expertise of SDSC staff in high performance computing, data-intensive science, and cyberinfrastructure software development; and to stimulate new research collaborations. Successful applicants will partner with an SDSC staff researcher to exploit the capabilities of the Triton Resource for their research endeavors.
The TRO program consists of a campus-wide, peer-reviewed proposal competition. Awards will provide Triton cycles and seed funds that enable SDSC researchers to collaborate with campus partners to jointly seek extramural funding. TRO proposals will be solicited semi-annually.
Update (12:00 p.m. Wednesday, Feb. 3): Intended to prevent abuse
As of today, users will be limited in the amount of memory and time that a job can use on the Triton login nodes. This is in response to a small number of users whose jobs have caused bottlenecks due to running inappropriately. Jobs that heavily use or monopolize the limited availability of the login node, which all users depend on for primary access to Triton, prevent others from gaining access to all nodes. Such jobs should be run on the compute nodes to avoid denial of service to other users.
The new limits are as follows:
Update (3:00 p.m. Friday, Jan. 8): Second round of TAPP applications
A new TAPP application period is in effect for UCSD through Jan. 31. For details, visit the Academic Affairs Web site.
Update (1:00 p.m. Monday, Jan. 4): Triton support again at full strength
The Triton Resource support staff is back at full strength after the holiday break. Compute time is readily available and job queues are temporarily short. Take advantage before the system ramps up again during the winter academic quarter.
Update (2:30 p.m. Friday, Dec. 11): Triton support impacted by furlough
The upcoming campus closure begins Saturday, December 19, 2009, and continues through Sunday, January 3, 2010. Of those days, six are furlough days and the rest are weekends and required holidays. Triton staff are paid from state funds and are required to be furloughed.
Triton itself will be online, but there will be no guaranteed system availability during the holiday break. Staff will attempt to respond to issues posted to the Discussion List, but no assurances are made that any problem can be resolved before the campus re-opens on January 4.
SDSC Operations will monitor the system, but no Triton principals (those capable of fixing specific Triton-related issues) will be available. Even on normal work days, Triton is a best-effort support system with guaranteed problem solving only during regular business hours. Staff will try to check on the system at least daily during the break, but response times could be measured in days rather than the hours or minutes typical of routine support.
In the worst case, if something catastrophic happens, the component(s) will be disconnected from the network and remedied during the first workday (or two) of the New Year.
Please understand that Triton staff are on vacation or would be working unpaid during furlough days. Do not expect Triton issues to be resolved quickly during the Campus closure. Staff will make reasonable efforts, as their personal time allows, to fix issues that arise. Thank you for your patience and understanding during this exceptional time. We wish you the best of the holiday season.
Update (2:00 p.m. Friday, Dec. 18): New hardware in the New Year
The new Sun Front End nodes have been delivered, and will replace the existing Appro nodes in the Triton Front End configuration. Due to time constraints with the year-end closing of UCSD, this maintenance will be deferred until after the campus reopens on Jan. 4.
Update (2:00 p.m. Friday, Dec. 11): Front End equipment to be installed
New login nodes and servers used for Triton administration have been delivered and will be installed in the next few days. This should provide better remote management in the event of login node unavailability or other maintenance needs on the cluster. In addition, a second login node will be added to the system. Each login node will have a unique name, so users can manually direct login connections to either one in the event of an outage.
Exact information on the time of the outage will be posted as soon as available. Impact to users should be brief if noticeable at all. The new login node name will be posted at that time also.
Update (11:30 a.m. Friday, Dec. 11): New Downloads Posted
The Triton Resource engineering staff have completed the first phase of Rocks roll packaging for software used in the building of the TCC, PDAF and PDAFM nodes.
Update (3:30 p.m. Monday, Dec. 7): ZFS Automation Back On
As far as we can tell, both the primary and replica NFS servers are functioning normally. Automated replication was turned back on this morning. With luck, we will see several months of uninterrupted service. Please report any irregularities to the Discussion List.
Update (10:30 a.m. Monday, Nov. 30): ZFS Scrub Complete
The upgrade is complete, and the data scrub finished early this morning.
Update (6:00 p.m. Sunday, Nov. 29): NFS Filesystem Upgrade
Triton's primary NFS server firmware and software were upgraded today starting at 8 a.m. The upgrade was complete at 9:30 a.m. Most of that time involved flashing and rebooting the nodes. Any affected jobs that were active at the start of the upgrade will be credited.
Another maintenance task will be performed in the background today: the storage pool itself will undergo a "zpool scrub" to validate all stored data. User data will be available during the scrub, but performance will be somewhat diminished. The scrub is the best way to verify integrity. When that completes, replication will be enabled for the first time since November 6. The scrub should be completed before the end of today.
Update (3:00 p.m. Friday, Nov. 20): Login Node Replacement
The ordered replacement node, plus an additional new node, are scheduled to ship Monday, Nov. 23. We hope to have them installed by late next week. This upgrade will double our capacity on the login service, provide better front end server hardware, and improve support response times for dealing with outages by increasing remote accessibility to admins. In the meantime, the temporary node will continue to serve, and we'll keep a close watch to ensure the greatest possible availability for users.
Update (10:00 a.m. Friday, Nov. 13): Login Node Failure
There appears to be a hardware issue with the login node. The node will not boot from the network. A temporary replacement node was installed and activated prior to 10 a.m. today.
Update (2:30 p.m. Thursday, Nov. 12): Latest from Sun on Crash Dump Analysis
Sun confirmed a bug for when a storage pool sees multiple simultaneous errors. It basically suspends the storage pool, and then all subsequent operations hang, instead of timing out. There is no current fix for the bug, other than ensuring physical integrity of the disks.
We suspect the design rationale for suspending is to avoid corrupting the file system beyond repair. It's likely that when our systems were built, a serial number range of disk drives was slightly out of spec. We identified self-monitored prefailure warnings on some drives.
A paper from Google Labs, Failure Trends in a Large Disk Drive Population, (PDF) discusses the Annualized Failure Rate (AFR) of disk drives in a very large disk farm. The first three months of disk life in the study farm (about where Triton is in terms of actual usage) show approximately three times higher failure rates on high utilization drives over medium and low utilization ones. Triton is likely within this three-month usage range, so our failure rate is not unexpected.
We are still not backing up user home areas, though all data is protected against double disk failure. To get to the point where we believe that snapshot/replication will not cause hangs, we must root out the marginal drives in the storage arrays. This will take some time.
In the interim, user home area storage should be reliable, but there is the possibility that the home area server will hang. We'll keep watching and try to react quickly if it does. Thanks for your patience, and please continue to help us keep on top of reliability issues by posting to the discussion list whenever you have problems.
Update (1 p.m. Thursday, Nov. 12): On our backup server, we were able to duplicate the problem and force a core dump. All available data have been uploaded to Sun, who are doing a post-mortem to isolate the root cause.
Sun confirmed an issue with Solaris U6 (currently running on the primary ZFS) involving snapshots and incremental ZFS sends. While the support folks were happy that we upgraded to U8 on the backup server, the fact that we are still locking up the file system is puzzling.
After firmware updates and U8 installation on the backup server, two drives (of the 48 in our ZFS configuration) remained non-functional. Those will eventually be repaired, but could be related to the root cause. Our ZFS data are still intact, due to mirroring that allows any two drives to fail. For most users, the backup contains home area data up through Nov 6. Data deposited after this are not being backed up, and users are advised to make alternate plans for safekeeping of such until the problem is resolved in production.
We are hopeful of a more definitive answer from Sun as they pore over the crash dump. Until the backup system issue is resolved, production will remain in its current configuration. Due to the extra attention on this problem, production is being very closely monitored and should be extremely reliable despite the flaw, since administrative support is likely to respond very quickly during the investigation.
Update (9 a.m. Tuesday, Nov. 10): We still do not have a root cause for the ZFS failure that is causing temporary, intermittent login node unavailability. No updates will be made to production servers until the actual cause is known. Currently, no backups are being performed on user home areas, so users may want to take extra precautions with data there until the issue is resolved. Testing to determine the root cause is continuing on our backup server, and production ZFS, login, and all compute nodes are fully functional (except for the ZFS backups).
Update (3 p.m. Monday, Nov. 9): The latest Solaris update did not resolve the ZFS problem — a failure occurred during the pool scrub on the backup server, resulting in a frozen ZFS subsystem. The root cause of the filesystem failures is still unknown at this time. The latest SAS controller patches are being installed on the backup server, and a new pool scrub test will be performed. A new downtime will be scheduled, possibly still today.
Issues with the Triton ZFS server will be addressed by a brief outage at a time yet to be determined. During this outage, we expect that most running jobs should complete, but a few may experience early terminations. Running jobs will be inventoried prior to the upgrade and refunds will be available for affected jobs.
We've upgraded the backup server and are currently running tests to locate the root cause of the failure. We apologize for the ongoing inconveniences this problem has caused — if the primary server becomes inaccessible before the scheduled upgrade, this maintenance will be combined with our response to that to complete the service with a single outage.
Production Phase Announcement: The full production phase of the Triton Resource began on Monday, October 5, 2009. The Early Adopter phase ended at that time.
What this means for users
Triton's migration to the charged-for usage model was completed on Monday, October 5 with the implementation of the usage accounting service. Early Adopter accounts are no longer being created or renewed, and TAPP or project allocations are now required to run jobs. This marks the beginning of the full production phase of Triton.
If an allocation runs out of SUs, TAPP procedures should be followed to extend or renew the account. Triton system administrators will not be authorized to replenish accounts the way they did during the Early Adopter phase.
Refunds for certain failed jobs and system errors will be considered on a case-by-case basis. Please direct requests to the discussion list.
Users can discover what their calculations will cost and view their usage statements by running the mybalance and gstatement -u $USER commands to see the status of their accounts.
Details on the latest changes and policy decisions can be found on the following FAQ pages:
Both TCC and PDAF were upgraded to Release Candidate 2 in preparation for full production usage and accounting. The system will remain in Early Adopter mode for about two weeks. The upgrade maintenance went smoothly and required about six hours of downtime to update the login node and all compute nodes.
A security patch was applied to Triton on August 14 between 2 and 3 PM PDT. This patch was necessary to close a local privilege security vulnerability first reported on August 11. The RHEL patch became available on August 14 and was installed on the Triton login node almost immediately. The login node update began at approximately 14:20 PDT and was completed by about 15:10 PDT.
Completion of this security patch accomplished the following:
A full reinstallation of Triton was performed on July 23, and completed within the expected 2-3 hour window, after which Triton was again running normally.
The cluster's public IP addresses were changed during this maintenance. The IP address of the login node was changed to 18.104.22.168.
User home data areas were restored intact.
During a planned outage on July 20, the Mirage Lustre servers were physically moved to a new rack and new power. Two dead LUNs were also recovered so that all 100 storage targets are currently available. Mirage is now mounted on the login node and all compute nodes. All 100 TB are now available on /mirage.
The Triton Resource is in full production. TRITON_RC2 (Release Candidate 2) is installed, and full job accounting is in effect.
This site will be kept up-to-date as node statuses change, or when the system has a scheduled maintenance. Currently, all of the nodes are in service and available via the scheduler. When nodes undergo unplanned maintenance, this site will be updated and messages will be posted on the discussion list and Triton's Twitter feed.
Triton's exceptional data-intensive computing power is available to the University of California HPC research community.
If you have an account and are ready to access the Triton Resource, please read the New User page for information on first-time logins. To request an account, please use TAPP.
To read about the current hardware status and get details of the system building process, read the Triton Resource blog.
Triton's compute components moved to production on October 5, 2009. Early Adopters helped to identify software needs and support requirements starting in July. Users and potential users are encouraged to continue sending feedback and suggestions to the Triton support team.
The 28 large-memory nodes of the PDAF provide some of the most extensive data analysis power available commercially or at any research institution in the country. The cluster includes four special nodes dedicated to database server interaction.
The 256-node TCC is a Rocks cluster with 24 gigabytes of memory and eight processing cores on each node.
Triton was upgraded to Rocks 5.3 on May 18, 2010. This upgrade included many software package updates as well.
In late Summer 2011, Triton users gained access to a new Lustre PFS with over 800 terabytes of work and scratch storage. The new filesystem, mounted as /phase1, replaces Data Oasis.
In early Fall, 2010, Triton users received a disk capacity increase to 250 terabytes, coincident with the replacement of /mirage by /oasis.
The latest version of Triton's Parallel File System, containing 800 TB of Lustre-based storage, is now available to users. The upgraded Data Oasis has more efficient and reliable mass storage for use while executing jobs on Triton.
For general and long-term access to Triton Resource, users are asked to request an allocation through the Triton Affiliates and Partners program, or TAPP. This is the primary way for users to gain access to Triton for running jobs and conducting research.