engaging1 cluster README
Jan 15, 2014
( 1) It's called engaging1 (eo for short) because one of the goals is to develop
more interactive and dynamic approaches to computational sciences. This
is a concept some of us refer to as "engaging supercomputing" and/or
computational science 2.0.
( 2) engaging1 consists of several head nodes, 234 compute nodes and a 350TB
central Lustre storage system. A total of 198 compute nodes are
funded through a shared NSF grant. A further 36 nodes were funded
separately. There are 16 nodes with Xeon Phi co-processors and
90 nodes with NVidia K20m coprocessor cards. All compute nodes
     have two 8-core, 2GHz Intel Xeon E5-2650 processors, 64GB of
     memory and 3.5TB of local disk.
( 3) the computers and storage components are connected by an Infiniband
network.
( 4) basic system administration is provided by a contract company, funded
     from base project funds, that works on a time-and-materials basis.
( 5) the funds for system administration are limited, but can be
     supplemented with other per-project funds to meet custom needs.
( 6) the email address [email protected] can be used to
make requests for support and systems work.
( 7) the project PIs are Chris Hill (MIT), Claudio Rebbi (BU),
     Gene Cooperman (Northeastern), Prashant Shenoy (UMass).
( 8) the system is still in an early user phase so documentation
is sketchy, configuration details are evolving, there are
     no backup procedures in place and many convenience features
     are missing (web site, easy visibility into what's happening, etc.).
Thursdays have been set aside as "street cleaning" days
for the time being. This means that disruptive systems
admin work may be scheduled on Thursdays as needed.
( 9) access is currently through ssh using public keys. There
     are three login nodes: eofe4.mit.edu, eofe5.mit.edu and
     eofe6.mit.edu. At present eofe4 shows some erratic network
     behavior, eofe5 is healthy, and eofe6 is sometimes used for
     experimenting with upgrades or alternate configurations.
     Ultimately all three will provide load-balanced access
     integrated into university environments, but for
     now eofe5.mit.edu is the standard login node.
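     For example, logging in with a key pair looks something like the
     following (the username is a placeholder; use whatever account name
     you were given):

         ssh -i ~/.ssh/id_rsa username@eofe5.mit.edu   # standard login node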
(10) to get an account, send an email to
     [email protected] with a public key and a short
     explanation of how you are connected to the project. Talk to one of the
     PIs if you are not sure about your connection.
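     If you do not already have a key pair, one (illustrative) way to create
     one and print the public half to include in your email is:

         ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa   # creates id_rsa and id_rsa.pub
         cat ~/.ssh/id_rsa.pub                        # send this .pub text, never the private key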
(11) the system is running RHEL Linux, uses SLURM as a scheduler
to allocate resources and has some useful software
available through the environment modules system.
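     As a quick sanity check after logging in, something like the following
     should run a trivial job through the scheduler:

         srun -n 1 hostname        # run a one-task job on a compute node
         squeue -u $USER           # list your jobs currently in the queue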
(12) several important things are TBD on the system
o there are no quotas for compute job duration or
        file space use; we are relying on people being
socially responsible at this stage.
o there are no time limits on jobs the scheduler
        launches. This is nice, but impractical, and
        we will have to change it. In the meantime we are again
relying on users to act reasonably and not
overly monopolize resources.
o we plan to upgrade SLURM from 2.5 to 2.6 fairly
soon.
o we plan to upgrade the Lustre file system from
1.8 to 2.4 fairly soon.
o we plan to switch the RHEL compute nodes to
CentOS.
(13) there is some software on the system. Check out the
contents of the /cm/shared directory and the
output from the "module avail" command.
The current software is fairly minimal. More
software can be requested by emailing
[email protected]. Software
     from this list http://rc.fas.harvard.edu/module_list/
     is relatively easy to add; we mirror it from our
     Harvard University MGHPCC colleagues.
Requests for software not on this list take more
effort and will eat into the project system
administration budget.
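     As for using what is already installed, typical module usage looks like
     the following sketch (the module name "gcc" is only an example; check
     "module avail" for what actually exists):

         ls /cm/shared             # browse the shared software area
         module avail              # list modules that can be loaded
         module load gcc           # load one (example name)
         module list               # confirm what is loaded
         module unload gcc         # remove it again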
(14) you can learn about SLURM at various web sites
including
http://www.ceci-hpc.be/slurm_tutorial.html,
http://www.umbc.edu/hpcf/resources-tara/how-to-run.html
and http://slurm.schedmd.com/tutorials.html
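     A minimal batch script, purely as a sketch (resource numbers and file
     names here are arbitrary), might look like:

         #!/bin/bash
         #SBATCH -J hello              # job name
         #SBATCH -N 1                  # one node
         #SBATCH -n 16                 # 16 tasks, one per core on these nodes
         #SBATCH -o hello_%j.out       # stdout file, %j expands to the job id
         srun hostname                 # run one copy of hostname per task

     Submit it with "sbatch hello.sh", watch it with "squeue -u $USER" and
     cancel it with "scancel <jobid>".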
(15) user accounts are associated with SLURM partitions,
     which provide a way to steer jobs to certain sets
     of machines. Currently partitions are defined for
     different PI groups, but all the partitions contain
     the same sets of computers. We are exploring the
     use of partitions and SLURM reservations as
     tools for automating the dynamic management of varying
     needs from different project participants.
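     To see which partitions exist and to aim a job at one of them (the
     partition name "sched_mygroup" below is just a placeholder for your
     group's partition):

         sinfo -o "%P %D %N"                    # partitions, node counts, node names
         sbatch -p sched_mygroup hello.sh       # submit a batch job to that partition
         srun -p sched_mygroup -n 1 hostname    # or run interactively in it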
(16) the GNU compilers are installed on the system.
We are waiting for some paperwork to be processed
for purchasing two floating licenses for the Intel
Compiler and Cluster Studio (XE) software.
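     Until then, plain GNU builds work in the usual way, for example:

         gcc -O2 -o hello hello.c           # C
         gfortran -O2 -o model model.f90    # Fortran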
(17) the /nobackup1/ directory is a Lustre filesystem.
     It should be used for I/O from parallel jobs, large
     I/O workloads, etc.
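     A common pattern, assuming per-user directories sit directly under
     /nobackup1/ (check what exists for your own account), is to stage runs
     there rather than in /home/:

         mkdir -p /nobackup1/$USER/run1     # working directory on Lustre
         cd /nobackup1/$USER/run1           # do the heavy I/O here
         cp ~/src/params.in .               # keep sources and inputs in /home/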
(18) the /home/ directory is meant for storing things like
program and script source files. It is currently
     shared over an Ethernet network in a sub-optimal
     arrangement. We are experimenting with moving this
     space to an Infiniband-based network.
(19) some early projects have been given dedicated access
to compute nodes through January. There are about 55
nodes allocated this way. The projects involved are
doing work on technologies and tools that could
     be helpful to the rest of the system. One project
     is developing advanced database technologies that have
     direct connections to big-data/bioinformatics
     interests. A second project is exploring ways to
     deploy OpenStack software on subsets of nodes on
     request. This project is one route by which
     the system will allow a more flexible application
     mix than traditional HPC facilities.
(20) we are experimenting with so-called reservations
     in SLURM to allow specific sets of users to be given
     exclusive access to some nodes for a particular
     time window. As an example, a Northeastern group
     recently reserved 128 nodes for a 12-hour period
to do some intensive computer science benchmarking
for the DMTCP checkpointing tool.
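     If your group has been granted a reservation you can target it by name
     (the name "ne_bench" below is only an illustration):

         scontrol show reservation                 # list currently defined reservations
         sbatch --reservation=ne_bench hello.sh    # submit a job into your reservation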
(21) we have recently configured one node to support
a virtual machine tool called Vagrant. This
provides a way to launch a virtual machine with
     a custom software stack on a single node. The
     launch of a VM can be integrated with SLURM, and
     the VM can be configured to access a user's
     Lustre-based /nobackup1/ directory.
We plan to put this capability on all nodes.
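     At its simplest the workflow on that node looks roughly like the
     following (the box name "centos-6" is only a placeholder for whatever
     image is actually available):

         vagrant init centos-6    # write a Vagrantfile for the named box
         vagrant up               # boot the VM with its custom software stack
         vagrant ssh              # log in to the VM
         vagrant destroy          # shut it down and remove it when finished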
(22) If users want to share example job scripts or
other useful "getting started" material please
feel free to email to [email protected].
(23) Anyone should feel free to send comments and suggestions
     to Chris ([email protected]). He will try to keep track
of them and address them where possible.