3. Custom Containerized Environment
Currently, we host the following services (available after configuring the hosts):
Determined AI - Distributed Deep Learning and Hyperparameter Tuning Platform
Nextcloud - File storage and sharing
Harbor - Container registry for GPU cluster
Grafana - Statistics and visualization
Wandb local - Self-hosted Weights & Biases
Shared Folders:
https://pan.cvgl.lab/s/6P8EyrewEz4G3sm
pip config set global.index-url https://mirrors.westlake.edu.cn/pypi/simple/
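For reference, the command above stores the mirror as an INI entry in pip's user-level configuration file. The following is a minimal sketch, using only the Python standard library, of the equivalent file content. The helper name `render_pip_conf` is hypothetical, and the file location depends on the platform (for example `~/.config/pip/pip.conf` on Linux); neither is specified in this document.

```python
import configparser
import io

# Mirror URL taken from the pip command above.
MIRROR_URL = "https://mirrors.westlake.edu.cn/pypi/simple/"


def render_pip_conf(index_url: str) -> str:
    """Return INI text equivalent to `pip config set global.index-url <url>`.

    Hypothetical helper for illustration; pip itself writes this file.
    """
    parser = configparser.ConfigParser()
    parser["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints a [global] section containing the index-url entry.
    print(render_pip_conf(MIRROR_URL))
```

Running `pip config list` afterwards should show the `global.index-url` value if the setting was applied.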
Create a .condarc file in your home folder with:
channels:
  - defaults
show_channel_urls: true
default_channels:
  - http://mirrors.westlake.edu.cn/ANACONDA/pkgs/main
  - http://mirrors.westlake.edu.cn/ANACONDA/pkgs/r
  - http://mirrors.westlake.edu.cn/ANACONDA/pkgs/msys2
custom_channels:
  bioconda: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  caffee2: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  conda-forge: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  deepmodeling: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  intel: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  menpo: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  msys2: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  numba: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  nvidia: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  Paddle: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  pytorch: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  pytorch-lts: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  pytorch-test: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  pytorch3d: http://mirrors.westlake.edu.cn/ANACONDA/cloud
  qiime2: http://mirrors.westlake.edu.cn/ANACONDA/cloud
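To illustrate what the `custom_channels` entries do, the sketch below approximates how a channel name such as `pytorch` maps to a mirror URL. It assumes conda appends the channel name, then the platform subdirectory, to the location given in `custom_channels`; the helper `resolve_channel` and the default platform `linux-64` are illustrative assumptions, not part of the cluster configuration.

```python
# Values copied from the .condarc above (only a subset of channels is shown).
MIRROR_CLOUD = "http://mirrors.westlake.edu.cn/ANACONDA/cloud"
CUSTOM_CHANNELS = {
    "conda-forge": MIRROR_CLOUD,
    "nvidia": MIRROR_CLOUD,
    "pytorch": MIRROR_CLOUD,
}


def resolve_channel(name: str, platform: str = "linux-64") -> str:
    """Approximate the URL conda would query for a named channel.

    Assumption: the channel name is appended to the location listed in
    custom_channels, followed by the platform subdirectory.
    """
    base = CUSTOM_CHANNELS[name].rstrip("/")
    return f"{base}/{name}/{platform}"


if __name__ == "__main__":
    for channel in CUSTOM_CHANNELS:
        print(channel, "->", resolve_channel(channel))
```

With this mapping in place, a command like `conda install -c pytorch <package>` should fetch from the campus mirror rather than the upstream Anaconda servers.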
Our cluster is located in the core server room, E6-106, and currently has 7 GPU nodes, 1 storage server, and 1 management server active.
We have been assigned the IP address range 10.0.1.66-94/27.
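As a quick check of how the assigned range sits inside its /27 subnet, here is a small sketch using Python's standard `ipaddress` module. The start and end addresses come from the range above; the `summarize` helper is hypothetical, and the roles of the remaining addresses in the subnet are not stated in this document.

```python
import ipaddress

# Range from the text: 10.0.1.66-94 with a /27 prefix.
FIRST = ipaddress.ip_address("10.0.1.66")
LAST = ipaddress.ip_address("10.0.1.94")

# strict=False derives the enclosing network from a host address.
network = ipaddress.ip_network(f"{FIRST}/27", strict=False)

# Every address in the assigned range, inclusive of both ends.
assigned = [ipaddress.ip_address(i) for i in range(int(FIRST), int(LAST) + 1)]


def summarize() -> dict:
    """Collect basic facts about the subnet and the assigned range."""
    return {
        "network": str(network),
        "netmask": str(network.netmask),
        "broadcast": str(network.broadcast_address),
        "assigned_count": len(assigned),
        "all_in_network": all(addr in network for addr in assigned),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

The range covers 29 addresses, all of which fall inside 10.0.1.64/27 (netmask 255.255.255.224).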
System Topology:
┌───────────────────────────────────┐ ┌───────────────────────────────────┐
│            Login Node             │ │        NGINX Reverse Proxy        │
└─────────────┬─────────────────────┘ └────────┬────────┬─────────────────┘
              │                                │        │
           Access       ┌────────Access────────┘     Access
              │         │                               │
┌─────────────▼─────────▼───────────┐ ┌─────────────────▼─────────────────┐
│     Determined AI GPU Cluster     │ │      Supplementary Services       │
├───────────────────────────────────┤ ├───────────────────────────────────┤
│                                   │ │                                   │
│ ┌──────┐ ┌────┐ ┌────┐ ┌────┐     │ │  ┌──────┐ ┌───────┐ ┌───────┐     │
│ │Master│ │GPU │ │GPU │ │GPU │     │ │  │      │ │       │ │       │     │
│ │      │ │    │ │    │ │    │ ... │ │  │Harbor│ │Grafana│ │ Other │ ... │
│ │ Node │ │Node│ │Node│ │Node│     │ │  │      │ │       │ │       │     │
│ └──────┘ └────┘ └────┘ └────┘     │ │  └──────┘ └───────┘ └───────┘     │
│                                   │ │                                   │
└───────────────────┬───────────────┘ └──────────┬────────────────────────┘
                    │                            │
                 Access                       Access
                    │                            │
┌───────────────────▼────────────────────────────▼────────────────────────┐
│                              TrueNAS - NFS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                             Storage Server                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
The specifications of the cluster nodes are as follows:
GPU Node 1:
Name | Spec |
---|---|
Model | Powerleader PR4908R (Supermicro 4124GS-TNR) |
CPU | AMD EPYC 7302 * 2 (32C/64T, 3.0-3.3GHz) |
RAM | Samsung M393A2K43DB2-CVF DDR4 256G (16G*16) 2933MT/s ECC REG |
GPU | MSI (0x1462) RTX 3090 Turbo * 8 |
SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 |
NIC | Intel I350-T2 1GbE Dual Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
RAID | LSI MegaRAID SAS-3 3108 |
GPU Node 2:
Name | Spec |
---|---|
Model | Powerleader PR4908R (Supermicro 4124GS-TNR) |
CPU | AMD EPYC 7402 * 2 (48C/96T, 2.8-3.35GHz) |
RAM | SK Hynix HMA84GR7DJR4N-XN DDR4 512G (32G*16) 3200MT/s ECC REG |
GPU | MANLI (NVIDIA/0x10DE) RTX 4090 * 8 |
SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 |
SSD | Kioxia CD6 7.68TB (U.2 PCIe 4.0) * 1 |
NIC | Intel I350-T2 1GbE Dual Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
GPU Node 3, 4:
Name | Spec |
---|---|
Model | Powerleader PR4908R (Supermicro 4124GS-TNR) |
CPU | AMD EPYC 7402 * 2 (48C/96T, 2.8-3.35GHz) |
RAM | Samsung M393A4K40DB3-CWE DDR4 512G (32G*16) 3200MT/s ECC REG |
GPU | MSI (0x1462) RTX 3090 * 8 |
SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 |
NIC | Intel I350-T2 1GbE Dual Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
GPU Node 5:
Name | Spec |
---|---|
Model | ASUS ESC8000A-E11 |
CPU | AMD EPYC 7543 * 2 (64C/128T, 2.8-3.7GHz) |
RAM | Samsung M393A4K40EB3-CWE DDR4 512G (32G*16) 3200MT/s ECC REG |
GPU | MANLI (NVIDIA/0x10DE) RTX 4090 * 8 |
SSD | Intel S4610 (SSDSC2KG96) 960G (SATA) (RAID 1) * 2 |
NIC | Intel I350-T4 1GbE Quad Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 |
GPU Node 6, 7:
Name | Spec |
---|---|
Model | ASUS ESC8000A-E12 |
CPU | AMD EPYC 9554 * 2 (128C/256T, 3.1-3.75GHz) |
RAM | Samsung M321R8GA0BB0-CQKZJ / Micron MTC40F2046S1RC48BA1 DDR5 1536G (64G*24) 4800MT/s ECC REG |
GPU | MSI (NVIDIA/0x10DE) RTX 4090 * 8 |
SSD | Samsung PM9A3 1.92TB (U.2 PCIe 4.0) * 1 |
NIC | Intel I350-AM2 1GbE Dual Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
GPU Node 8:
Name | Spec |
---|---|
Model | ASUS ESC8000A-E12 |
CPU | AMD EPYC 9554 * 2 (128C/256T, 3.1-3.75GHz) |
RAM | SK Hynix HMCG94AEBRA109N DDR5 1536G (64G*24) 4800MT/s ECC REG |
GPU | NVIDIA (0x10DE) RTX 6000 Ada Generation 48G * 8 |
SSD | Samsung PM9A3 1.92TB 2.5" NVMe U.2 drive * 2 |
NIC | Mellanox ConnectX-6 VPI HDR100 QSFP56 MCX653106A-ECAT 100Gb ETH/IB Dual Port |
NIC | Intel I350-T2 1GbE Dual Port |
Storage Server
Name | Spec |
---|---|
Model | Powerleader PR4224AK (Supermicro H11SSL) |
CPU | AMD EPYC 7302 (16C/32T, 3.0-3.3GHz) |
RAM | Samsung M393A4K40DB2-CWE DDR4 256G (32G*8) 2933MT/s ECC REG |
SSD | Intel 760p (SSDPEKKW256G8) 256G (M.2 PCIe 3.0) * 1 |
SSD | Intel S4510 1.92TB (SATA) * 2 |
SSD | WD Ultrastar DC SN640 (WUS4BB076D7P3E3) 7.68TB (U.2 PCIe 3.0) * 4 |
HDD | Seagate Exos X18 18TB * 14 |
NIC | Intel I210 1GbE * 2 |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 |
Management Server
Name | Spec |
---|---|
Model | ASUS RS520-E9-RS8 V2 |
CPU | Intel Xeon Silver 4210R * 2 (20C/40T, 2.4-3.2GHz) |
RAM | Samsung M393A4K40EB3-CWE DDR4 64G (32G*2) 3200MT/s @ 2400MT/s ECC REG |
SSD | Intel S4610 (SSDSC2KG96) 960G * 2 (SATA) (RAID 1) |
NIC | Intel I350-AM2 1GbE Dual Port |
NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port |
RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 |
Switch
Brand | Model & Spec |
---|---|
NVIDIA Mellanox | Spectrum SN2700 100GbE 1U Open Ethernet Switch with NVIDIA Onyx, 32 QSFP28 ports, 2 PSU, x86 CPU, Standard depth |