Up until today, if you were looking for a physical ONTAP array for your environment, your choices were the hybrid flash, FAS array offering around 5-10ms of latency or the sub-ms AFF A-series. Sure there was one anomaly in there, the QLC-based FAS 500f, but that AFF in FAS clothing was just that, an anomaly. While I have no evidence to point to here, but my theory is that the 500f was NetApp’s way of dipping their toe in the water of QLC-based arrays. Upon launch, the 500f was pricey and the configurations limited and restricted, both of which were addressed at some point after launch. As an employee at a partner that sells a lot of NetApp, I looked at the 500f when it first launched and then basically never looked at it again because of those two points.
Today, NetApp is announcing the all new C-Series of QLC-based arrays, the “C” being for “Capacity Enterprise Flash”. While the controllers themselves aren’t new, the fact that they only support QLC media is what is different. While I won’t go into the details of what QLC, or Quad Layer Flash is in this post, the fact of the matter is that it is more affordable than Triple Layer Flash (TLC) and almost as performant. What this means for those purchasing NetApp arrays is that they can get near the performance of an AFF system at a fraction of the cost. Most of us in the storage world know that 10k and 15k RPM SAS drives are slowly going to be phased out in favour of high-capacity SATA drives and high-performance NAND storage, leaving a void. QLC-based arrays will fill that void, and at a higher performance level. If you start to research QLC vs TLC, you’ll find lots of concerns around durability which are not completely unfounded, but you would have also found these concerns when the industry went from Multi-cell (MLC) to TLC and that seems to have gone well enough. Technology of the storage devices themselves improve over time and software-based mitigation strategies such as write avoidance also improve. I’m not knowledgeable enough on this latter point to go into details, but ONTAP is a beast and has all sorts of tricks up its sleeve.
So without further ado, I present NetApp’s Enterprise Capacity Flash line, the AFF C800, AFF C400 and AFF C250:
Max drive count (15.3TB NVMe QLC)
Max effective capacity (5:1 efficiencies)
Max Usable capacity (1:1)
12 × 15.3
8 × 15.3
8 × 15.3
100GbE ports per HA pair
25GbE ports per HA pair
12 onboard / 16 HBA
4 onboard / 16 HBA
32Gb FC ports
By the numbers
Now some of you may have thought, “I thought there was already a C-series with the C190?”, and you’d be right. NetApp is repurposing the C-series branding as well as introducing a successor to the C190, the AFF A150. While the new A150 will still have some restrictions, it won’t be nearly as restrictive as the C190. The physical form-factor remains the same as the C190, but the A150 will allow for up to two expansion shelves for a total of 72 SAS SSDs including the internal ones in capacities of 960GB, 3.8TB and 7.6TB, coming to a max usable capacity of ~402TiB, or 2.2PB at an efficiency level of 1:5.
Back to the new C-Series conversation, they bring with them a new default licensing model, ONTAP One. ONTAP One is something I have personally been asking for many years at this point, and it includes all of the licenses; Core, Data Protection, Hybrid Cloud and Security & Compliance. Personally I’m looking forward to not having to worry about what features are available with a certain license offering, instead, C-Series with ONTAP One as the default licensing model will ensure you or your customers will never be left wondering if their array has a given feature.
The C-Series should be available to quote as of March 27, 2023 and should start shipping by the end of April. This statement as well as all of the information above is based on pre-release information I received and may be subject to change at press time. I will endeavour to add corrections below should any of the above change at launch.
With the CN-1610 starting to get long in the tooth and with more platforms supporting and/or requiring a cluster interconnect network greater than 10Gbit, the need to introduce a non-Cisco option came to be. This option is the BES-53248, which is a “Broadcom Supported” switch produced by Quanta, makers of all things hyper scale who sells it as the QuantaMesh T4048-IX8. At some point Broadcom’s EFOS is installed on the T4048-IX8 via the Open Network Install Environment (ONIE) and it becomes the product we know as the BES-53248. While definitely a superior switch, supporting 10/25/40/100Gbit, the deployment thereof is not as streamlined, hence this post.
I struggled a bit with how to approach this topic and settled up the following: I will provide a numbered list of steps as a guide and index but then have sections below that expand upon those steps. There could very well be times where you want to perform these steps in a different order but if this is your first time working on this switch and it’s factory-fresh, the steps below are how I would advise proceeding.
The BES-53248 has 48 × 10/25Gbit ports and 8 × 40/100Gbit ports; by default the first 16 × 10/25Gbit ports are available for cluster interconnect connections and the last 2 × 40/100Gbit are reserved for Inter-Switch Links (ISL); which is already an improvement over the CN1610’s 12 × ClusterNet ports. If the environment requires more ports than this, the 10/25Gbit ports can be licensed in blocks of 8 (Part # SW-BES-53248A2-8P-1025G) all the way up to 48, and there is one license (Part # SW-BES-53248A2-6P-40-100G) to activate the remaining 6 × 40/100Gbit ports. Be sure your order also has all the requisite transceivers and cables, consult HWU for specific compatibilities. Lastly, the BES-53248 doesn’t ship with rails by default, so make sure your quote shows them if you need them.
When your switches arrive they will include a manilla envelope with licensing information if licenses above the base configuration were ordered, do not recycle this envelope as it contains the very important Transaction Key which you will use to generate your license file at this site:
Before visiting that link, along with your license keys you’ll need the switch serial numbers which are located on the switches themselves like so:
The license file generation procedure is instant, so not having this ahead of time isn’t that big of a deal provided you have internet access while at the installation site.
Broadcom Support Account, Firmware Download
What isn’t instantaneous however is the creation of a TechData-provided, Broadcom Support Account (BSA), and you need this account do download firmware for the switches. In order to setup a BSA, which hopefully you did a couple of days in advance of requiring the firmware, you need to send an email to: firstname.lastname@example.org with the following information:
Indicate if OEM (Netapp/Lenovo), Partner/Installer or Customer: Name of Company device is registered to (if partner/installer): Requester Name: Requester Email Address: Requester Phone Number: Address where device is located: Device Model Number: BES-53248 Device Serial Number:
I’ve found the folks that respond to this email address are pretty easy to deal with, though I’m not sure you’ll be able to get your account if you don’t already have the serial number, comment below if you know. My account creation took roughly 24 hours and then I had access to the firmware downloads. Download the appropriate firmware for your environment. The switches I received in August of 2021 shipped with EFOS 126.96.36.199 which was supported in the environment I was deploying into, but so was 188.8.131.52 so that’s where I wanted to land.
Reference Configuration Files (RCF)
Download the appropriate RCF for the environment and edit accordingly. If you visit HWU and drill down into the switch category, you can download the RCF from there:
I was converting an AFF8080 from two-node switchless to switched and adding an A400 at 100Gbit. I grabbed RCF 1.7 from Hardware Universe (not where I’d expect to find it but nice and easy) and uncommented ports 0/49-0/54 by removing the initial exclamation point on the lines in question since the additional 40-100 license activates all of these ports, I deleted the lines setting the speed to 40G full-duplex. I hope in version 1.8 of the RCF, this configuration will also be applied as a range since that’s the only license option available for purchase on these ports.
In your site requirements checklist, ensure the availability of an http (or ftp, tftp, sftp, scp) server on the management network. Once the equipment is racked and the management interface cabled, you will need this server to host your EFOS firmware, license files and RCF.
The first time you connect to the device, most likely via serial, assuming the unit was factory-fresh like mine, the username should be admin and the password should be blank. You will be immediately forced to change the password. I noticed that when I was going through this, copying, and pasting the new password didn’t work for me but typing the same password did; this may have had something to do with the special characters chosen or the app I was using (serial.app on macOS). Another thing to be aware of, if you’re applying RCF 1.7 you will have to be on EFOS 184.108.40.206 first. The switches I based this post on shipped with 220.127.116.11 and there are some commands in the RCF that aren’t compatible, so you’ll want to upgrade EFOS before applying RCF 1.7. Also, applying an RCF means wiping any existing configuration first, so you might as well get this out of the way while you are on site.
Once you’ve changed the password, it’s time to configure the management IP address so you can retrieve the license files, EFOS image and RCF from the http server mentioned previously. You’ll need to be logged in, and have elevated your privilege level to enable:
(CLswitch-01) #serviceport ip 10.0.0.209 255.255.255.0 10.0.0.1
(CLswitch-01) #show serviceport
Interface Status............................... Up
IP Address..................................... 10.0.0.209
Subnet Mask.................................... 255.255.255.0
Default Gateway................................ 10.0.0.1
IPv6 Administrative Mode....................... Enabled
IPv6 Prefix is ................................ fe80::c218:50ff:fe0b:24c5/64
Configured IPv4 Protocol....................... None
Configured IPv6 Protocol....................... None
IPv6 AutoConfig Mode........................... Disabled
Burned In MAC Address.......................... B4:A9:FC:34:8F:CE
(CLswitch-01) #ping 10.0.0.1
Pinging 10.0.0.1 with 0 bytes of data:
Reply From 10.0.0.1: icmp_seq = 0. time= 2 msec.
Reply From 10.0.0.1: icmp_seq = 1. time <1 msec.
Reply From 10.0.0.1: icmp_seq = 2. time= 26 msec.
Now that you are on the network, the first thing we should do is add any additional licenses. Here are the commands with an explanation of what they do:
show port all | exclude Detach
copy http://10.0.0.80/switch1_license.data nvram:license-key 1
show port all | exclude Detach
See how many licenses are currently applied, if any.
Display currently licensed ports.
Copies the file from the http server and places it in index 1
reboots the switch
This is after you’ve re-logged in, it should show you something different than the last time you ran this.
This should show additional ports than from before adding the license.
Once you have added your license file(s), it’s time to upgrade EFOS, here are the commands with an explanation of what they do:
copy active backup
copy http://10.0.0.80/FastPath-EFOS-esw-qcp_td3-qcp_td3_x86_64-LX415R-CNTRF-BD6IOQHr3v7m0b4.stk active
Shows the images: active, backup, current-active and next-active.
Copies the active image to the backup slot, just in case.
Verify that the above worked.
Shows the version actually running.
Copies the image on the web server to the active slot.
Verify the last command.
Reboot the switch.
Verify the upgrade worked.
Now that we have upgraded our EFOS image, it’s time to apply the RCF. There really is no point in doing any additional configuration until we’ve done this since we have to destroy our configuration before applying the RCF anyway. Be sure that you’re only applying the default RCF if you haven’t added any licenses. If you have added licences, you need to uncomment the lines that configure the additionally licensed ports. Here are the commands with an explanation:
copy http://10.0.0.80/BES-53248_RCF_v1.7-Cluster-HA.txt nvram:script BES-53248_RCF_v1.7-Cluster-HA.scr
script apply BES-53248_RCF_v1.7-Cluster-HA.scr
This clears the startup configuration, overlaying an RCF-sourced configuration can have negative consequences.
This copies the txt file from the web server to NVRAM as a script and renames it in the process.
gives you a directory listing of available scripts to confirm the above transfer worked
applies the contents of the RCF to the configuration
displays the new running configuration to verify successful application of RCF
commit new configuration to non-volatile memory
reboots the switch so this new configuration can take affect
There, you’re all done, now you can proceed with the official guide on (re)configuring the management IP address, ssh and so on. Good luck, and if you have an experience that strays from the above, please let me know so I can update the post.
I’ve previously written about this Ethernet controller back when 40GbE Ethernet was relatively new to NetApp’s FAS and AFF controllers. Since that article, I’ve started to come across various oddities with this Ethernet controller.
Last Fall, I had a customer who was experiencing problems with LACP during an ONTAP upgrade (9.1 → 9.3 → 9.5P6) on their AFF A700s using the X1144A, dual port 40GbE card, which uses the Intel X710 Ethernet controller. We had the first 40GbE port broken out into 4x10GbE links, 2-each to either half of a pair of Cisco Nexus N9K-C9396PX in the same vPC Domain. During a controller reboot, we noticed that the interface group using multimode_lacp, most or all of the ports wouldn’t come up and on the Cisco-side, the port(s) would become disabled due to too many link up/down events. Immediately we wanted to look at potential cable problems but quickly dismissed that idea as well. After some digging, it looked as though NetApp was referencing Cisco Bug ID CSCuv87644 as potentially related. This led me down a long path of investigating the changes made to the networking stack in ONTAP over the past couple of years, and I’ve still got a post I’m working on around that. The workaround was to increase the debounce timer value on the Cisco 9k to 525ms, the default value is 100ms.
The port debounce time is the amount of time that an interface waits to notify the supervisor of a link going down. During this time, the interface waits to see if the link comes back up. The wait period is a time when traffic is stopped.
Recently, a different customer of mine was trying to buy a Nimble HF20 and they wanted to include the Q8C17B, a four port, 10GbE NIC, also based on the Intel X710 Ethernet controller. The vendor came back to me and said they needed to know if the customer was going to be using VLAN tagging on the Q8C17B, because if they needed VLAN tagging, they’d have to choose a two port NIC instead. This confused me, but after some emails back and forth, HPE Nimble Storage Alert # EXT-0061 was referenced as the reason for this. At some point Nimble will release a patch that updates the firmware on this NIC, hopefully bringing back VLAN functionality. A bit of looking around, and the same VLAN issue has been identified by VMware in KB2149781.
Lastly, I also came across a NIST vulnerability from 2017 regarding the same Ethernet controller, it seems that has since been addressed in a firmware update.
While the above doesn’t necessarily imply a huge problem with the X710, I simply found it interesting and thought I’d include them all in one post.
Yesterday, NetApp announced a new addition to the midrange tier of their All-Flash FAS line, the AFF A320. With this announcement, end-to-end NVMe is now available in the midrange, from the host all the way to the NVMe SSD. This new platform is a svelte 2RU that supports up to two of the new NS224 NVMe SSD shelves, which are also 2RU. NetApp has set performance expectations to be in the ~100µs range.
Up to two PCIe cards per controller can be added, options are:
4-port 10GbE SFP+ Cu and fibre
*RoCE host-side NVMeoF support not yet available
A couple of important points to also note:
DS, SAS-attached SSD shelves are NOT supported
An end-to-end NVMe solution obviously needs storage of some sort, so also announced today was the NS224 NVMe SSD Storage Shelf:
NVMe-based storage expansion shelf
2RU, 24 storage SSDs
400GB/s capable, 200Gb/sec per shelf module
Uplinked to controller via RoCEv2
Drive sizes available: 1.9TB, 3.8TB and 7.6TB. 15.3TB with restrictions.
Either controller in the A320 has eight 100GbE ports on-board, but not all of them are available for client-side connectivity. They are allocated as follows:
e0a → ClusterNet/HA
e0b → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
e0c → First NS224 connectivity
e0d → ClusterNet/HA
e0e → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
e0f → First NS224 connectivity
e0g → Client network, 100GbE or 40Gbe
e0h → Client network, 100GbE or 40Gbe
If you don’t get enough client connectivity with the on-board ports, then as listed previously, there are myriad PCIe options available to populate the two available slots. In addition to all that on-board connectivity, there’s also MicroUSB and RJ45 for serial console access as well as the RJ-45 Wrench port to host e0M and out-of-band management via BMC. As with most port-pairs, the 100GbE ports are hosted by a single ASIC which is capable of a total effective bandwidth of ~100Gb.
Food for thought…
One interesting design change in this HA pair, is that there is no backplane HA interconnect as has been the case historically; instead, the HA interconnect function is placed on the same connections as ClusterNet, e0a and e0d. This enables some interesting future design possibilities, like HA pairs in differing chassis. Also, of interest is the shelf connectivity being NVMe/RoCEv2; while currently connected directly to the controllers, what’s stopping NetApp from putting these on a switched fabric? Once they do that, drop the HA pair concept above, and instead have N+1 controllers on a ClusterNet fabric. Scaling, failovers and upgrades just got a lot more interesting.