# Homelab Recovery Plan - DVS Migration Issues

## Current Situation

- **Date**: 2025-07-21
- **Issue**: Two Intel NUCs lost network connectivity during orphaned DVS removal attempts

## Infrastructure Status

| Host | IP | Status | Issue | Recovery Method |
|------|----|--------|-------|-----------------|
| esxi-nuc-01.markalston.net | 192.168.10.8 | ❌ Offline | Management network migration failed | Console access required |
| esxi-nuc-02.markalston.net | 192.168.10.9 | ❌ Offline | DVS uplink manipulation failed | Console access required |
| esxi-nuc-03.markalston.net | 192.168.10.10 | ✅ Online | Has orphaned DVS | vCenter GUI removal |
| macpro.markalston.net | 192.168.10.7 | ✅ Online | Hosting vCenter | No action needed |
| vcsa.markalston.net | 192.168.10.11 | ✅ Online | vCenter Server | No action needed |

## Recovery Steps

### Phase 1: Console Recovery (esxi-nuc-01 & esxi-nuc-02)

**Required**: Physical console access or IPMI/BMC remote console

**For each affected host:**

1. **Access Console**:
   - Physical console (monitor + keyboard)
   - IPMI/BMC web console
   - ESXi host remote console (if available)

2. **Log in as root** and run:

```bash
# Remove any broken VMkernel interfaces
esxcli network ip interface remove -i vmk0 2>/dev/null || true
esxcli network ip interface remove -i vmk1 2>/dev/null || true
esxcli network ip interface remove -i vmk2 2>/dev/null || true

# Ensure a standard vSwitch exists
esxcli network vswitch standard add -v vSwitch0 2>/dev/null || true

# Add the physical NIC to the standard switch
esxcli network vswitch standard uplink add -v vSwitch0 -u vmnic0 2>/dev/null || true

# Create the Management Network port group
esxcli network vswitch standard portgroup add -v vSwitch0 -p "Management Network" 2>/dev/null || true

# Recreate the management interface with the correct IP
# For esxi-nuc-01: 192.168.10.8
# For esxi-nuc-02: 192.168.10.9
esxcli network ip interface add -i vmk0 -p "Management Network"
esxcli network ip interface ipv4 set -i vmk0 -I 192.168.10.X -N 255.255.255.0 -t static
esxcli network ip interface tag add -i vmk0 -t Management

# Set the default gateway
esxcli network ip route ipv4 add -g 192.168.10.1 -n default

# Remove the orphaned DVS (if still present)
esxcfg-vswitch --delete --dvswitch vc01-dvs 2>/dev/null || true

# Restart the management agents so vCenter can reconnect
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```
3. **Test Connectivity**:

```bash
# From the console
ping 192.168.10.1     # Gateway
ping 192.168.10.11    # vCenter

# From an external machine
ping 192.168.10.X     # Host IP
ssh root@192.168.10.X
```
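The per-host commands above can be collected into a small wrapper; this is only a sketch, and the script structure, function names, and the hostname-to-IP lookup are assumptions layered on the commands already listed (run it from the ESXi console, never over SSH):

```shell
#!/bin/sh
# Hypothetical recovery wrapper for Phase 1 (sketch, not part of the plan's scripts).

# Map a short host name to its management IP, per the status table above.
mgmt_ip_for() {
  case "$1" in
    esxi-nuc-01) echo 192.168.10.8 ;;
    esxi-nuc-02) echo 192.168.10.9 ;;
    esxi-nuc-03) echo 192.168.10.10 ;;
    *) return 1 ;;
  esac
}

# Rebuild vSwitch0 and vmk0 on the local host (ESXi console only).
recover_host() {
  ip="$(mgmt_ip_for "$1")" || { echo "unknown host: $1" >&2; return 1; }
  esxcli network vswitch standard add -v vSwitch0 2>/dev/null || true
  esxcli network vswitch standard uplink add -v vSwitch0 -u vmnic0 2>/dev/null || true
  esxcli network vswitch standard portgroup add -v vSwitch0 -p "Management Network" 2>/dev/null || true
  esxcli network ip interface add -i vmk0 -p "Management Network"
  esxcli network ip interface ipv4 set -i vmk0 -I "$ip" -N 255.255.255.0 -t static
  esxcli network ip interface tag add -i vmk0 -t Management
  esxcli network ip route ipv4 add -g 192.168.10.1 -n default
}
```

Keeping the esxcli sequence inside a function means sourcing the file changes nothing; only an explicit `recover_host esxi-nuc-01` on the host itself makes changes.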


### Phase 2: Safe DVS Removal (esxi-nuc-03)

**Method**: Use vCenter GUI (safest approach)

1. **Access vCenter**: https://vcsa.markalston.net
   - Username: administrator@vsphere.local
   - Password: Cl0udFoundry!

2. **Add Host to vCenter** (if not already added):
   - Right-click Homelab-DC → Add Host
   - Host: esxi-nuc-03.markalston.net

3. **Remove from Distributed Switch**:
   - Select esxi-nuc-03 in inventory
   - Go to Configure → Networking → Virtual switches
   - Find orphaned DVS (vc01-dvs) with warning icon
   - Right-click → "Remove from distributed switch"
   - Confirm removal

4. **Verify Standard Switch Configuration**:
   - Ensure vSwitch0 exists with Management Network
   - Verify vmnic0 is assigned to vSwitch0
   - Confirm vmk0 is on Management Network portgroup

### Phase 3: Verification and Cleanup

1. **Test All Hosts**:

```bash
# Test connectivity
ping 192.168.10.8  # esxi-nuc-01
ping 192.168.10.9  # esxi-nuc-02
ping 192.168.10.10 # esxi-nuc-03

# Test SSH access
ssh root@esxi-nuc-01.markalston.net "hostname"
ssh root@esxi-nuc-02.markalston.net "hostname"
ssh root@esxi-nuc-03.markalston.net "hostname"
```
2. **Verify Network Configuration** (on each host):

```bash
esxcfg-vswitch -l                       # Should show only standard switches
esxcli network ip interface list        # Should show vmk0 on Management Network
esxcli network vswitch dvs vmware list  # Should return empty
```

3. **Add Hosts to vCenter**:
   - Add all three Intel NUCs to vCenter if not already present
   - Verify no warning icons on the network configuration
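The connectivity checks above can be made repeatable with a small loop run from a workstation; `ping_once`, its flags, and the up/DOWN report format are assumptions (Linux ping syntax, not the ESXi shell):

```shell
#!/bin/sh
# Sketch: loop the Phase 3 reachability checks over the three NUCs.

# One probe attempt with a short timeout (Linux ping flags assumed).
ping_once() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }

# Print one status line per host.
check_hosts() {
  for ip in "$@"; do
    if ping_once "$ip"; then echo "$ip up"; else echo "$ip DOWN"; fi
  done
}

# Usage: check_hosts 192.168.10.8 192.168.10.9 192.168.10.10
```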

### Phase 4: Continue with Original Plan

1. **Create Traditional Cluster**:
   - Use traditional baseline management (not vLCM)
   - Avoid single-image management due to community VIBs
2. **Configure Networking**:
   - Create new distributed switches if needed
   - Set up proper VLANs and port groups
   - Configure vMotion and storage networks
3. **Complete Infrastructure Setup**:
   - Configure datastores and storage policies
   - Set up HA/DRS for the cluster
   - Deploy test VMs
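As one way to approach step 2, a vMotion VMkernel port could be set up on the standard switch with esxcli; the VLAN ID, subnet, and `vmk1` name here are placeholders, not values from the plan:

```shell
#!/bin/sh
# Hypothetical vMotion VMkernel setup on vSwitch0.
# VLAN 20, 192.168.20.0/24, and vmk1 are placeholder values.
setup_vmotion() {
  ip="$1"  # e.g. a per-host vMotion address like 192.168.20.8
  esxcli network vswitch standard portgroup add -v vSwitch0 -p vMotion
  esxcli network vswitch standard portgroup set -p vMotion --vlan-id 20
  esxcli network ip interface add -i vmk1 -p vMotion
  esxcli network ip interface ipv4 set -i vmk1 -I "$ip" -N 255.255.255.0 -t static
  esxcli network ip interface tag add -i vmk1 -t VMotion
}
```

Tagging the interface with `VMotion` is what actually enables vMotion traffic on it; the port-group and address choices are site-specific.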

## Prevention Measures

### For Future DVS Operations

1. **Always use the vCenter GUI** for DVS operations when possible
2. **Have console access ready** before any management-network changes
3. **Test on one host first** before applying changes to all hosts
4. **Use dual NICs**, if available, for safer migrations
5. **Back up the network configuration** before major changes
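Item 5 can be done with ESXi's built-in configuration backup; these are standard ESXi shell commands, but treat the exact bundle path as illustrative:

```shell
# On the ESXi host, before any network change:
/sbin/auto-backup.sh                    # flush pending config state to the bootbank
vim-cmd hostsvc/firmware/backup_config  # prints a one-time download URL
# Fetch the configBundle-*.tgz from that URL on a workstation and keep it
# until the change is verified.
```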

### Commands to Avoid on Single-NIC Hosts

- `esxcfg-vswitch -Q/-U` (DVPort uplink manipulation)
- `esxcli network ip interface remove -i vmk0` (without an immediate replacement)
- Direct DVS manipulation while the management interface lives on the DVS

### Safe Commands for Future Reference

- `esxcfg-vswitch --delete --dvswitch <name>` (DVS removal)
- `esxcfg-vswitch -l` (list all switches)
- `esxcli network ip interface list` (list VMkernel interfaces)

## Lessons Learned

1. **Single NIC + DVS management = high risk**: any CLI manipulation risks immediate connectivity loss
2. **vCenter GUI is safest**: use the GUI for complex network changes when possible
3. **Console access is critical**: always have console access before network changes
4. **Test first**: always test procedures on one host before applying them to all
5. **Correct commands matter**: use `esxcfg-vswitch` for DVS removal, not `esxcli`

## Files Modified/Created

- `docs/troubleshooting/dvs-migration-recovery.md` - Detailed recovery procedures
- `docs/recovery-plan.md` - This comprehensive recovery plan
- `scripts/migrate-management-network.sh` - Original migration script (flawed)
- `scripts/migrate-single-host.sh` - Single-host script (caused the first failure)
- `scripts/safe-dvs-removal.sh` - Safer approach (incomplete)
- `scripts/direct-dvs-removal.sh` - Direct approach (caused the second failure)

## Next Steps After Recovery

1. Complete console recovery for esxi-nuc-01 and esxi-nuc-02
2. Use vCenter to safely remove the DVS from esxi-nuc-03
3. Proceed with the traditional cluster setup
4. Document the final working configuration for future reference

This project is for educational and home lab purposes.