The BES-53248 first-timer’s guide

With the CN-1610 starting to get long in the tooth and with more platforms supporting and/or requiring a cluster interconnect network faster than 10Gbit, the need to introduce a non-Cisco option came to be. This option is the BES-53248, a “Broadcom Supported” switch produced by Quanta, makers of all things hyperscale, who sell it as the QuantaMesh T4048-IX8. At some point Broadcom’s EFOS is installed on the T4048-IX8 via the Open Network Install Environment (ONIE) and it becomes the product we know as the BES-53248. While it is definitely a superior switch, supporting 10/25/40/100Gbit, its deployment is not as streamlined, hence this post.

I struggled a bit with how to approach this topic and settled on the following: I will provide a numbered list of steps as a guide and index, then have sections below that expand upon those steps. There could very well be times where you want to perform these steps in a different order, but if this is your first time working on this switch and it’s factory-fresh, the steps below are how I would advise proceeding.

  1. Equipment Ordering, including licences
  2. Broadcom Support Account, Firmware Download
  3. Reference Configuration Files (RCF)
  4. Supporting Infrastructure
  5. Initial Configuration

Equipment Ordering, including licences

The BES-53248 has 48 × 10/25Gbit ports and 8 × 40/100Gbit ports. By default the first 16 × 10/25Gbit ports are available for cluster interconnect connections and the last 2 × 40/100Gbit ports are reserved for Inter-Switch Links (ISL), which is already an improvement over the CN-1610’s 12 × ClusterNet ports. If the environment requires more ports than this, the 10/25Gbit ports can be licensed in blocks of 8 (Part # SW-BES-53248A2-8P-1025G) all the way up to 48, and there is one license (Part # SW-BES-53248A2-6P-40-100G) to activate the remaining 6 × 40/100Gbit ports. Be sure your order also has all the requisite transceivers and cables; consult Hardware Universe (HWU) for specific compatibilities. Lastly, the BES-53248 doesn’t ship with rails by default, so make sure your quote shows them if you need them.
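As a rough sketch of the licensing arithmetic described above, the following Python helper works out how many 8-port license blocks an order would need for a given number of 10/25Gbit ports. The function and constant names are made up for illustration; the numbers (16 base ports, blocks of 8, 48 ports in total) come from the paragraph above.

```python
import math

BASE_PORTS = 16   # 10/25Gbit ports usable without any extra license
BLOCK_SIZE = 8    # ports unlocked per SW-BES-53248A2-8P-1025G license
MAX_PORTS = 48    # total 10/25Gbit ports on the switch


def license_blocks_needed(ports_required: int) -> int:
    """Return the number of 8-port license blocks needed (hypothetical helper)."""
    if not 0 <= ports_required <= MAX_PORTS:
        raise ValueError(f"the switch has only {MAX_PORTS} 10/25Gbit ports")
    extra_ports = max(0, ports_required - BASE_PORTS)
    # Licenses come in whole blocks, so round up.
    return math.ceil(extra_ports / BLOCK_SIZE)


if __name__ == "__main__":
    for wanted in (12, 16, 20, 32, 48):
        print(wanted, "ports ->", license_blocks_needed(wanted), "license block(s)")
```

The 40/100Gbit ports are not modelled here since, per the above, they are activated by a single license rather than in blocks.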

When your switches arrive, they will include a manila envelope with licensing information if licenses above the base configuration were ordered. Do not recycle this envelope, as it contains the very important Transaction Key which you will use to generate your license file at this site:

https://efos-licensing.broadcom.com/License/RedeemTransactionKey

Before visiting that link, along with your license keys you’ll need the switch serial numbers which are located on the switches themselves like so:

The license file generation procedure is instant, so not having this ahead of time isn’t that big of a deal provided you have internet access while at the installation site.

Broadcom Support Account, Firmware Download

What isn’t instantaneous, however, is the creation of a TechData-provided Broadcom Support Account (BSA), and you need this account to download firmware for the switches. To set up a BSA, which hopefully you did a couple of days in advance of requiring the firmware, you need to send an email to support@techdata.com with the following information:

Indicate if OEM (Netapp/Lenovo), Partner/Installer or Customer:
Name of Company device is registered to (if partner/installer):
Requester Name:
Requester Email Address:
Requester Phone Number:
Address where device is located:
Device Model Number: BES-53248
Device Serial Number:

I’ve found the folks that respond to this email address are pretty easy to deal with, though I’m not sure you’ll be able to get your account if you don’t already have the serial number; comment below if you know. My account creation took roughly 24 hours and then I had access to the firmware downloads. Download the appropriate firmware for your environment. The switches I received in August of 2021 shipped with EFOS 3.4.4.6, which was supported in the environment I was deploying into, but so was 3.7.0.4, so that’s where I wanted to land.

Reference Configuration Files (RCF)

Download the appropriate RCF for the environment and edit accordingly. If you visit HWU and drill down into the switch category, you can download the RCF from there:

I was converting an AFF8080 from two-node switchless to switched and adding an A400 at 100Gbit. I grabbed RCF 1.7 from Hardware Universe (not where I’d expect to find it but nice and easy) and uncommented ports 0/49-0/54 by removing the initial exclamation point on the lines in question, since the additional 40-100 license activates all of these ports. I also deleted the lines setting the speed to 40G full-duplex. I hope that in version 1.8 of the RCF this configuration will be applied as a range, since that’s the only license option available for purchase on these ports.
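If you would rather script that edit than do it by hand, here is a minimal Python sketch of the idea: strip the leading exclamation point from a range of lines and drop any line in that range that contains a given string. The function name, the line-range approach and the sample text are all assumptions for illustration; the sample is not an excerpt from the real RCF, so check the line numbers and the exact wording of the speed lines in your own copy first.

```python
def enable_commented_lines(rcf_text: str, first_line: int, last_line: int,
                           drop_if_contains: str) -> str:
    """Uncomment lines first_line..last_line (1-based, inclusive).

    Lines in that range containing drop_if_contains are removed entirely.
    Lines outside the range are passed through untouched.
    """
    kept = []
    for number, line in enumerate(rcf_text.splitlines(), start=1):
        if first_line <= number <= last_line:
            if drop_if_contains in line:
                continue  # e.g. a speed line you no longer want
            if line.startswith("!"):
                line = line[1:]  # remove only the initial exclamation point
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Invented sample text, NOT the real RCF layout.
    sample = (
        "interface 0/16\n"
        "exit\n"
        "!interface 0/49\n"
        "!speed 40G full-duplex\n"
        "!exit\n"
    )
    print(enable_commented_lines(sample, 3, 5, "speed 40G"), end="")
```

Working from explicit line numbers keeps the sketch from guessing at the RCF's structure; diff the result against the original before copying it to the web server.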

Supporting Infrastructure

In your site requirements checklist, ensure the availability of an http (or ftp, tftp, sftp, scp) server on the management network. Once the equipment is racked and the management interface cabled, you will need this server to host your EFOS firmware, license files and RCF.
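If no web server is available on the management network, a laptop can stand in for one. The following is a minimal sketch using Python's standard library, not a hardened server. The directory name and port are assumptions; note that the URLs later in this post carry no port number, which implies port 80, and binding to that port usually needs elevated privileges.

```python
import functools
import http.server


def make_file_server(directory: str, host: str = "0.0.0.0",
                     port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Build an HTTP server that serves the files in `directory` (read-only)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Example usage (left commented out because serve_forever() blocks):
#
#   server = make_file_server("./switch-files", port=80)
#   server.serve_forever()   # stop with Ctrl+C
```

Put the EFOS image, license files and RCF in the served directory and confirm from another machine that you can fetch them before heading to site.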

Initial Configuration

The first time you connect to the device, most likely via serial, and assuming the unit is factory-fresh like mine was, the username should be admin and the password should be blank. You will be immediately forced to change the password. I noticed that copying and pasting the new password didn’t work for me, but typing the same password did; this may have had something to do with the special characters chosen or the app I was using (serial.app on macOS). Another thing to be aware of: if you’re applying RCF 1.7, you will have to be on EFOS 3.7.0.4 first. The switches I based this post on shipped with 3.4.4.6, and there are some commands in the RCF that aren’t compatible with that release, so you’ll want to upgrade EFOS before applying RCF 1.7. Also, applying an RCF means wiping any existing configuration first, so you might as well get this out of the way while you are on site.
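The version check above can be sketched in a few lines of Python. This treats 3.7.0.4 as a minimum for RCF 1.7, which is my reading of the requirement rather than something the RCF documentation is quoted as saying here; the function names are hypothetical.

```python
def parse_efos_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '3.4.4.6' into (3, 4, 4, 6)."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(running: str, required: str) -> bool:
    """True if the running EFOS version is older than the required one."""
    # Tuples compare element by element, so 3.4.4.6 < 3.7.0.4.
    return parse_efos_version(running) < parse_efos_version(required)


if __name__ == "__main__":
    print(needs_upgrade("3.4.4.6", "3.7.0.4"))
```

Comparing integer tuples avoids the trap of comparing version strings, where "3.10" would sort before "3.7".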

Once you’ve changed the password, it’s time to configure the management IP address so you can retrieve the license files, EFOS image and RCF from the http server mentioned previously. You’ll need to be logged in, and have elevated your privilege level to enable:

User:admin
Password:************
(CLswitch-01) >enable

(CLswitch-01) #serviceport ip 10.0.0.209 255.255.255.0 10.0.0.1

(CLswitch-01) #show serviceport

Interface Status............................... Up
IP Address..................................... 10.0.0.209
Subnet Mask.................................... 255.255.255.0
Default Gateway................................ 10.0.0.1
IPv6 Administrative Mode....................... Enabled
IPv6 Prefix is ................................ fe80::c218:50ff:fe0b:24c5/64
Configured IPv4 Protocol....................... None
Configured IPv6 Protocol....................... None
IPv6 AutoConfig Mode........................... Disabled
Burned In MAC Address.......................... B4:A9:FC:34:8F:CE

(CLswitch-01) #ping 10.0.0.1
 Pinging 10.0.0.1 with 0 bytes of data:

Reply From 10.0.0.1: icmp_seq = 0. time= 2 msec.
Reply From 10.0.0.1: icmp_seq = 1. time <1 msec.
Reply From 10.0.0.1: icmp_seq = 2. time= 26 msec.
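Before typing the serviceport command, it can be worth sanity-checking that the gateway actually sits inside the subnet you are about to configure. A small sketch with Python's ipaddress module, using the example addresses from the transcript above (the function name is made up):

```python
import ipaddress


def gateway_in_subnet(ip: str, netmask: str, gateway: str) -> bool:
    """Check that `gateway` belongs to the network defined by ip/netmask."""
    network = ipaddress.ip_interface(f"{ip}/{netmask}").network
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    print(gateway_in_subnet("10.0.0.209", "255.255.255.0", "10.0.0.1"))
```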

Now that you are on the network, the first thing we should do is add any additional licenses. Here are the commands with an explanation of what they do:

  1. show license
     See how many licenses are currently applied, if any.
  2. show port all | exclude Detach
     Display the currently licensed ports.
  3. copy http://10.0.0.80/switch1_license.data nvram:license-key 1
     Copy the file from the http server and place it in index 1.
  4. reload
     Reboot the switch.
  5. show license
     Run this after you’ve logged back in; it should show you something different from the last time you ran it.
  6. show port all | exclude Detach
     This should show more ports than before the license was added.

Once you have added your license file(s), it’s time to upgrade EFOS. Here are the commands with an explanation of what they do:

  1. show bootvar
     Shows the images: active, backup, current-active and next-active.
  2. copy active backup
     Copies the active image to the backup slot, just in case.
  3. show bootvar
     Verify that the above worked.
  4. show version
     Shows the version actually running.
  5. copy http://10.0.0.80/FastPath-EFOS-esw-qcp_td3-qcp_td3_x86_64-LX415R-CNTRF-BD6IOQHr3v7m0b4.stk active
     Copies the image on the web server to the active slot.
  6. show bootvar
     Verify the last command.
  7. reload
     Reboot the switch.
  8. show version
     Verify the upgrade worked.

Now that we have upgraded our EFOS image, it’s time to apply the RCF. There really is no point in doing any additional configuration until we’ve done this since we have to destroy our configuration before applying the RCF anyway. Be sure that you’re only applying the default RCF if you haven’t added any licenses. If you have added licences, you need to uncomment the lines that configure the additionally licensed ports. Here are the commands with an explanation:

  1. erase startup-config
     Clears the startup configuration; overlaying an RCF on top of an existing configuration can have negative consequences.
  2. copy http://10.0.0.80/BES-53248_RCF_v1.7-Cluster-HA.txt nvram:script BES-53248_RCF_v1.7-Cluster-HA.scr
     Copies the txt file from the web server to NVRAM as a script and renames it in the process.
  3. script list
     Gives you a directory listing of available scripts to confirm the above transfer worked.
  4. script apply BES-53248_RCF_v1.7-Cluster-HA.scr
     Applies the contents of the RCF to the configuration.
  5. show running-config
     Displays the new running configuration to verify successful application of the RCF.
  6. write memory
     Commits the new configuration to non-volatile memory.
  7. reload
     Reboots the switch so this new configuration can take effect.
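Since the RCF is renamed from .txt to .scr as it is copied, it is easy to mistype one of the two names. The following Python sketch builds the command sequence above from a server address and an RCF filename so the names stay consistent. The function name is hypothetical, and the output is meant as a checklist to type from at the console, not something to paste in blindly.

```python
from pathlib import PurePosixPath


def rcf_apply_commands(server: str, rcf_filename: str) -> list[str]:
    """Return the RCF command sequence from this post for a given file."""
    # Same base name, with the .txt extension swapped for .scr.
    script_name = PurePosixPath(rcf_filename).with_suffix(".scr").name
    return [
        "erase startup-config",
        f"copy http://{server}/{rcf_filename} nvram:script {script_name}",
        "script list",
        f"script apply {script_name}",
        "show running-config",
        "write memory",
        "reload",
    ]


if __name__ == "__main__":
    for command in rcf_apply_commands(
        "10.0.0.80", "BES-53248_RCF_v1.7-Cluster-HA.txt"
    ):
        print(command)
```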

There, you’re all done, now you can proceed with the official guide on (re)configuring the management IP address, ssh and so on. Good luck, and if you have an experience that strays from the above, please let me know so I can update the post.

What’s going on with Intel’s X710 Ethernet controller?

I’ve previously written about this Ethernet controller back when 40GbE Ethernet was relatively new to NetApp’s FAS and AFF controllers. Since that article, I’ve started to come across various oddities with this Ethernet controller.

Last fall, I had a customer who was experiencing problems with LACP during an ONTAP upgrade (9.1 → 9.3 → 9.5P6) on their AFF A700s using the X1144A, a dual-port 40GbE card which uses the Intel X710 Ethernet controller. We had the first 40GbE port broken out into 4x10GbE links, two each to either half of a pair of Cisco Nexus N9K-C9396PX switches in the same vPC domain. During a controller reboot, we noticed that most or all of the ports in the interface group using multimode_lacp wouldn’t come up, and on the Cisco side the port(s) would become disabled due to too many link up/down events. Our first instinct was to look at potential cable problems, but we quickly dismissed that idea. After some digging, it looked as though NetApp was referencing Cisco Bug ID CSCuv87644 as potentially related. This led me down a long path of investigating the changes made to the networking stack in ONTAP over the past couple of years, and I’ve still got a post I’m working on around that. The workaround was to increase the debounce timer value on the Cisco 9k to 525ms; the default value is 100ms.

The port debounce time is the amount of time that an interface waits to notify the supervisor of a link going down. During this time, the interface waits to see if the link comes back up. The wait period is a time when traffic is stopped.

Source: Cisco
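To make the effect of that timer concrete, here is a deliberately simplified Python model: a link outage is only reported if it outlasts the debounce time. The outage durations are invented for illustration, and the real switch behaviour is more involved than a single comparison.

```python
def reported_outages(outages_ms: list[int], debounce_ms: int) -> list[int]:
    """Return the outages that last longer than the debounce time.

    Shorter outages are assumed to recover before the interface would
    notify the supervisor, so they are never reported.
    """
    return [outage for outage in outages_ms if outage > debounce_ms]


if __name__ == "__main__":
    flaps = [40, 150, 300, 900]  # invented link-down durations in ms
    print("default 100ms:", reported_outages(flaps, 100))
    print("raised to 525ms:", reported_outages(flaps, 525))
```

With the longer timer, fewer of the short flaps during a controller reboot count as link-down events, which fits with the ports no longer being disabled for excessive flapping.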

Recently, a different customer of mine was trying to buy a Nimble HF20 and they wanted to include the Q8C17B, a four port, 10GbE NIC, also based on the Intel X710 Ethernet controller. The vendor came back to me and said they needed to know if the customer was going to be using VLAN tagging on the Q8C17B, because if they needed VLAN tagging, they’d have to choose a two port NIC instead. This confused me, but after some emails back and forth, HPE Nimble Storage Alert # EXT-0061 was referenced as the reason for this. At some point Nimble will release a patch that updates the firmware on this NIC, hopefully bringing back VLAN functionality. A bit of looking around, and the same VLAN issue has been identified by VMware in KB2149781.

Lastly, I also came across a NIST vulnerability from 2017 regarding the same Ethernet controller; it seems that has since been addressed in a firmware update.

While the above doesn’t necessarily imply a huge problem with the X710, I simply found these issues interesting and thought I’d include them all in one post.

There’s a new NVMe AFF in town!

Yesterday, NetApp announced a new addition to the midrange tier of their All-Flash FAS line, the AFF A320. With this announcement, end-to-end NVMe is now available in the midrange, from the host all the way to the NVMe SSD. This new platform is a svelte 2RU that supports up to two of the new NS224 NVMe SSD shelves, which are also 2RU. NetApp has set performance expectations to be in the ~100µs range.

Up to two PCIe cards per controller can be added, options are:

  • 4-port 32Gb FC SFP+ fibre
  • 2-port 100GbE RoCEv2* QSFP28 fibre (40GbE supported)
  • 2-port 25GbE RoCEv2* SFP28 fibre
  • 4-port 10GbE SFP+ Cu and fibre
    *RoCE host-side NVMeoF support not yet available

A couple of important points to also note:

  • 200-240VAC required
  • DS, SAS-attached SSD shelves are NOT supported

An end-to-end NVMe solution obviously needs storage of some sort, so also announced was the NS224 NVMe SSD Storage Shelf:

  • NVMe-based storage expansion shelf
  • 2RU, 24 storage SSDs
  • 400Gb/s capable, 200Gb/s per shelf module
  • Uplinked to controller via RoCEv2
  • Drive sizes available: 1.9TB, 3.8TB and 7.6TB. 15.3TB with restrictions.

Each controller in the A320 has eight 100GbE ports on-board, but not all of them are available for client-side connectivity. They are allocated as follows:

  • e0a → ClusterNet/HA
  • e0b → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
  • e0c → First NS224 connectivity
  • e0d → ClusterNet/HA
  • e0e → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
  • e0f → First NS224 connectivity
  • e0g → Client network, 100GbE or 40GbE
  • e0h → Client network, 100GbE or 40GbE
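The allocation above can be captured as a small lookup table, which makes it easy to see how many on-board ports are left for clients depending on whether a second NS224 is attached. This is only a sketch of the list above; the role labels and function name are made up.

```python
# On-board port roles per controller, as listed above (labels are invented).
A320_ONBOARD_PORTS = {
    "e0a": "cluster_ha",
    "e0b": "second_shelf_or_client",
    "e0c": "first_shelf",
    "e0d": "cluster_ha",
    "e0e": "second_shelf_or_client",
    "e0f": "first_shelf",
    "e0g": "client",
    "e0h": "client",
}


def client_ports(second_shelf_attached: bool) -> list[str]:
    """Return the on-board ports usable for client traffic."""
    usable_roles = {"client"}
    if not second_shelf_attached:
        # Without a second NS224, e0b and e0e can be reconfigured for clients.
        usable_roles.add("second_shelf_or_client")
    return [port for port, role in A320_ONBOARD_PORTS.items()
            if role in usable_roles]


if __name__ == "__main__":
    print("one shelf: ", client_ports(second_shelf_attached=False))
    print("two shelves:", client_ports(second_shelf_attached=True))
```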

If you don’t get enough client connectivity with the on-board ports, then as listed previously, there are myriad PCIe options available to populate the two available slots. In addition to all that on-board connectivity, there’s also MicroUSB and RJ-45 for serial console access as well as the RJ-45 Wrench port to host e0M and out-of-band management via BMC. As with most port-pairs, the 100GbE ports are hosted by a single ASIC which is capable of a total effective bandwidth of ~100Gb.

Food for thought…
One interesting design change in this HA pair is that there is no backplane HA interconnect, as has been the case historically; instead, the HA interconnect function is placed on the same connections as ClusterNet, e0a and e0d. This enables some interesting future design possibilities, like HA pairs in differing chassis. Also of interest is the shelf connectivity being NVMe/RoCEv2; while the shelves are currently connected directly to the controllers, what’s stopping NetApp from putting them on a switched fabric? Once they do that, drop the HA pair concept above, and instead have N+1 controllers on a ClusterNet fabric. Scaling, failovers and upgrades just got a lot more interesting.