Tag Archives: hardware

NetApp updates their Mid and High-end A-Series

NetApp’s flagship unified storage platform is the All Flash FAS or AFF, the A-Series is different from the C-Series in that it addresses TLC flash vs the C-Series’ QLC.

Today, Tuesday May 14, 2024 NetApp has announced three new AFF A-Series models, the A70, A90 and A1K which will likely supplant the A400, A800 and A900 respectively, are here they are:

AFF A70
NetApp AFF A70
AFF A90
NetApp AFF A90
AFF A1K
NetApp AFF A1K

What’s interesting about these latest models is that they are all basically the same but with different RAM and CPU complements, the A1K does differentiate itself in that it is a return to a single-node chassis design whereas the A70 and A90 reuse the chassis from the A800, with a slight upgrade I’ll get to at the end of this article. Upgrading from an A800 to an A90 is a PCM-swap and not an entire chassis swap.

The new platform is a serious upgrade in the IO department, with nine usable PCIe 5.0 slots available per controller, up from five PCIe 3.0 on the A400 and A800 but down slightly from the A900’s ten PCIe 4.0 IO modules. PCIe 4.0 doubled the available bandwidth of PCIe 3.0 per PCIe lane and PCIe 5 doubled it again, bringing available lane bandwidth up to ~4GB/s. Up until this release, all existing controllers used a switched PCIe architecture but now, direct access to the bus is available to all slots. These improvements pave the way for faster PCIe cards such as quad port, 64G FC cards and dual port, 200GbE and 400GbE Ethernet adapters but be sure to pay attention to the slot priority assignment if you’re putting the card in yourself. Slots 2,3,9 and 10 have eight lanes vs sixteen on 1,4-8 and 11; all eleven slots on the A1K have sixteen lanes.

Powering these new controllers will be Intel’s 4th generation Xeon Scalable processors with QuickAssist Technology (QAT) allowing for integrated compression offload. Current platforms perform inline compression in 8K chunks and a scanner would perform compression on cold data in 32k chunks, QAT allows for inline compression at 32k chunks all without any performance impact. Any data moved to a QAT-enabled controller, or on controllers upgraded to QAT-enabled models will realize this efficiency improvement. However, if data is moved from a QAT-enabled controller to an older, non-QAT model, that efficiency goes away. Speaking of older models, all controller upgrades are non-disruptive and the A800 is an in-chassis upgrade. I have to wonder if NetApp has any plans to use the other accelerators available in this CPU, specifically DLB and IAA.

Have a look at the raw numbers, I’m not sure how scientific my aggregated GHz column is, but I find it helpful when comparing horsepower. I’ve included the existing A400, A800 and A900 to see where the new models land.

ModelMAX DrivesInternal DrivesCPU SpeedCPU/NodeCores/CPUAggr GHzNVRAMRAM
A400480N/A2.2GHz21044GHz16GB128GB
A70240482.0GHz21664GHz32GB128GB
A800240482.1GHz224100.8GHz32GB640GB
A90240482.0GHz232128GHz64GB1024GB
A900480N/A2.2GHz232140.8GHz64GB1024GB
A1K480N/A1.7GHz252176.8GHz64GB1024GB
By the Numbers: Node Specifications

As for performance improvement expectations, NetApp claims the following:

  • A70 provides 2x the performance of the A400
  • A90 provides 1.6x the performance of the A800
  • A1K provides 1.4x the performance of the A900

While this article covers the new A-Series, up until today the C-Series have been the exact same controllers, differing only in their ability to address TLC media vs the C-Series’ QLC media. Having said that, I’ve become a big fan of the C800 for performance and form factor. Since the A70 and A90 reuse the x800 chassis, that means the ability to stick 48 internal drives in it. With the current max drive sizes for the A800 and C800 being 15.3TB and 30.7TB respectively, they can provide up to 570TiB and 1.11PiB of usable space, all in 4RU of rack space; this is before taking into account ONTAP’s aggressive data efficiency capabilities. All three of the new models are described as having a max raw capacity of 3.7PB which means 240×15.3TB so no 30.7TB TLC media at this time.

Today’s announcement has the A-series properly differentiating itself from it’s C-series sibling; however, I can’t imagine the C-series wouldn’t get some sort of QAT-related update soon, but I guess we’re going to have to wait on that for now.

Last but not least, if you’re a fan of the new bezels we’ve been seeing, you’ll be happy to hear that NetApp has finally acquiesced to our demands that the NetApp logo light up. That’s right, the three new models all get a fancy light up bezel that can be controlled via the GUI, CLI or even API, sadly they are one-colour LEDs and not RGBW…at least not yet. If you have an existing A800 and you perform an in-chassis upgrade, you will not benefit from this one as the pre-existing chassis lack the requisite 8-pin connector to the bezel.

These new models have NetApp seriously improving their offering, first by beefing up the basic CPU/RAM components but also around the physical design and architecture, the modular nature of which is very interesting.

C-Series lineup

NetApp Announces A Whole New Line

Up until today, if you were looking for a physical ONTAP array for your environment, your choices were the hybrid flash, FAS array offering around 5-10ms of latency or the sub-ms AFF A-series. Sure there was one anomaly in there, the QLC-based FAS 500f, but that AFF in FAS clothing was just that, an anomaly. While I have no evidence to point to here, but my theory is that the 500f was NetApp’s way of dipping their toe in the water of QLC-based arrays. Upon launch, the 500f was pricey and the configurations limited and restricted, both of which were addressed at some point after launch. As an employee at a partner that sells a lot of NetApp, I looked at the 500f when it first launched and then basically never looked at it again because of those two points.

Today, NetApp is announcing the all new C-Series of QLC-based arrays, the “C” being for “Capacity Enterprise Flash”. While the controllers themselves aren’t new, the fact that they only support QLC media is what is different. While I won’t go into the details of what QLC, or Quad Layer Flash is in this post, the fact of the matter is that it is more affordable than Triple Layer Flash (TLC) and almost as performant. What this means for those purchasing NetApp arrays is that they can get near the performance of an AFF system at a fraction of the cost. Most of us in the storage world know that 10k and 15k RPM SAS drives are slowly going to be phased out in favour of high-capacity SATA drives and high-performance NAND storage, leaving a void. QLC-based arrays will fill that void, and at a higher performance level. If you start to research QLC vs TLC, you’ll find lots of concerns around durability which are not completely unfounded, but you would have also found these concerns when the industry went from Multi-cell (MLC) to TLC and that seems to have gone well enough. Technology of the storage devices themselves improve over time and software-based mitigation strategies such as write avoidance also improve. I’m not knowledgeable enough on this latter point to go into details, but ONTAP is a beast and has all sorts of tricks up its sleeve.

So without further ado, I present NetApp’s Enterprise Capacity Flash line, the AFF C800, AFF C400 and AFF C250:

AFF C800
AFF C400
AFF C250

Quick Specs:

AFF C800AFF C400AFF C250
Max drive count (15.3TB NVMe QLC)1449648
Max effective capacity (5:1 efficiencies)8.8 PB5.9 PB2.9 PB
Max Usable capacity (1:1)1.6 PiB1.06 PiB540.37 TiB
Minimum configurations12 × 15.38 × 15.38 × 15.3
100GbE ports per HA pair20164
25GbE ports per HA pair1612 onboard / 16 HBA4 onboard / 16 HBA
32Gb FC ports323216
By the numbers

Now some of you may have thought, “I thought there was already a C-series with the C190?”, and you’d be right. NetApp is repurposing the C-series branding as well as introducing a successor to the C190, the AFF A150. While the new A150 will still have some restrictions, it won’t be nearly as restrictive as the C190. The physical form-factor remains the same as the C190, but the A150 will allow for up to two expansion shelves for a total of 72 SAS SSDs including the internal ones in capacities of 960GB, 3.8TB and 7.6TB, coming to a max usable capacity of ~402TiB, or 2.2PB at an efficiency level of 1:5.

Back to the new C-Series conversation, they bring with them a new default licensing model, ONTAP One. ONTAP One is something I have personally been asking for many years at this point, and it includes all of the licenses; Core, Data Protection, Hybrid Cloud and Security & Compliance. Personally I’m looking forward to not having to worry about what features are available with a certain license offering, instead, C-Series with ONTAP One as the default licensing model will ensure you or your customers will never be left wondering if their array has a given feature.

The C-Series should be available to quote as of March 27, 2023 and should start shipping by the end of April. This statement as well as all of the information above is based on pre-release information I received and may be subject to change at press time. I will endeavour to add corrections below should any of the above change at launch.

The BES-53248 first-timer’s guide

With the CN-1610 starting to get long in the tooth and with more platforms supporting and/or requiring a cluster interconnect network greater than 10Gbit, the need to introduce a non-Cisco option came to be. This option is the BES-53248, which is a “Broadcom Supported” switch produced by Quanta, makers of all things hyper scale who sells it as the QuantaMesh T4048-IX8. At some point Broadcom’s EFOS is installed on the T4048-IX8 via the Open Network Install Environment (ONIE) and it becomes the product we know as the BES-53248. While definitely a superior switch, supporting 10/25/40/100Gbit, the deployment thereof is not as streamlined, hence this post.

I struggled a bit with how to approach this topic and settled up the following: I will provide a numbered list of steps as a guide and index but then have sections below that expand upon those steps. There could very well be times where you want to perform these steps in a different order but if this is your first time working on this switch and it’s factory-fresh, the steps below are how I would advise proceeding.

  1. Equipment Ordering, including licences
  2. Broadcom Support Account, Firmware Download
  3. Reference Configuration Files (RCF)
  4. Supporting Infrastructure
  5. Initial Configuration

Equipment Ordering, including licences

The BES-53248 has 48 × 10/25Gbit ports and 8 × 40/100Gbit ports; by default the first 16 × 10/25Gbit ports are available for cluster interconnect connections and the last 2 × 40/100Gbit are reserved for Inter-Switch Links (ISL); which is already an improvement over the CN1610’s 12 × ClusterNet ports. If the environment requires more ports than this, the 10/25Gbit ports can be licensed in blocks of 8 (Part # SW-BES-53248A2-8P-1025G) all the way up to 48, and there is one license (Part # SW-BES-53248A2-6P-40-100G) to activate the remaining 6 × 40/100Gbit ports. Be sure your order also has all the requisite transceivers and cables, consult HWU for specific compatibilities. Lastly, the BES-53248 doesn’t ship with rails by default, so make sure your quote shows them if you need them.

When your switches arrive they will include a manilla envelope with licensing information if licenses above the base configuration were ordered, do not recycle this envelope as it contains the very important Transaction Key which you will use to generate your license file at this site:

https://efos-licensing.broadcom.com/License/RedeemTransactionKey

Before visiting that link, along with your license keys you’ll need the switch serial numbers which are located on the switches themselves like so:

The license file generation procedure is instant, so not having this ahead of time isn’t that big of a deal provided you have internet access while at the installation site.

Broadcom Support Account, Firmware Download

What isn’t instantaneous however is the creation of a TechData-provided, Broadcom Support Account (BSA), and you need this account do download firmware for the switches. In order to setup a BSA, which hopefully you did a couple of days in advance of requiring the firmware, you need to send an email to: support@techdata.com with the following information:

Indicate if OEM (Netapp/Lenovo), Partner/Installer or Customer:
Name of Company device is registered to (if partner/installer):
Requester Name:
Requester Email Address:
Requester Phone Number:
Address where device is located:
Device Model Number: BES-53248
Device Serial Number:

I’ve found the folks that respond to this email address are pretty easy to deal with, though I’m not sure you’ll be able to get your account if you don’t already have the serial number, comment below if you know. My account creation took roughly 24 hours and then I had access to the firmware downloads. Download the appropriate firmware for your environment. The switches I received in August of 2021 shipped with EFOS 3.4.4.6 which was supported in the environment I was deploying into, but so was 3.7.0.4 so that’s where I wanted to land.

Reference Configuration Files (RCF)

Download the appropriate RCF for the environment and edit accordingly. If you visit HWU and drill down into the switch category, you can download the RCF from there:

I was converting an AFF8080 from two-node switchless to switched and adding an A400 at 100Gbit. I grabbed RCF 1.7 from Hardware Universe (not where I’d expect to find it but nice and easy) and uncommented ports 0/49-0/54 by removing the initial exclamation point on the lines in question since the additional 40-100 license activates all of these ports, I deleted the lines setting the speed to 40G full-duplex. I hope in version 1.8 of the RCF, this configuration will also be applied as a range since that’s the only license option available for purchase on these ports.

Supporting Infrastructure

In your site requirements checklist, ensure the availability of an http (or ftp, tftp, sftp, scp) server on the management network. Once the equipment is racked and the management interface cabled, you will need this server to host your EFOS firmware, license files and RCF.

Initial Configuration

The first time you connect to the device, most likely via serial, assuming the unit was factory-fresh like mine, the username should be admin and the password should be blank. You will be immediately forced to change the password. I noticed that when I was going through this, copying, and pasting the new password didn’t work for me but typing the same password did; this may have had something to do with the special characters chosen or the app I was using (serial.app on macOS). Another thing to be aware of, if you’re applying RCF 1.7 you will have to be on EFOS 3.7.0.4 first. The switches I based this post on shipped with 3.4.4.6 and there are some commands in the RCF that aren’t compatible, so you’ll want to upgrade EFOS before applying RCF 1.7. Also, applying an RCF means wiping any existing configuration first, so you might as well get this out of the way while you are on site.

Once you’ve changed the password, it’s time to configure the management IP address so you can retrieve the license files, EFOS image and RCF from the http server mentioned previously. You’ll need to be logged in, and have elevated your privilege level to enable:

User:admin
Password:************
(CLswitch-01) >enable

(CLswitch-01) #serviceport ip 10.0.0.209 255.255.255.0 10.0.0.1

(CLswitch-01) #show serviceport

Interface Status............................... Up
IP Address..................................... 10.0.0.209
Subnet Mask.................................... 255.255.255.0
Default Gateway................................ 10.0.0.1
IPv6 Administrative Mode....................... Enabled
IPv6 Prefix is ................................ fe80::c218:50ff:fe0b:24c5/64
Configured IPv4 Protocol....................... None
Configured IPv6 Protocol....................... None
IPv6 AutoConfig Mode........................... Disabled
Burned In MAC Address.......................... B4:A9:FC:34:8F:CE

(CLswitch-01) #ping 10.0.0.1
 Pinging 10.0.0.1 with 0 bytes of data:

Reply From 10.0.0.1: icmp_seq = 0. time= 2 msec.
Reply From 10.0.0.1: icmp_seq = 1. time <1 msec.
Reply From 10.0.0.1: icmp_seq = 2. time= 26 msec.

Now that you are on the network, the first thing we should do is add any additional licenses. Here are the commands with an explanation of what they do:

show license


show port all | exclude Detach 


copy http://10.0.0.80/switch1_license.data nvram:license-key 1 

reload

show license




show port all | exclude Detach 
See how many licenses are currently applied, if any.

Display currently licensed ports.

Copies the file from the http server and places it in index 1

reboots the switch

This is after you’ve re-logged in, it should show you something different than the last time you ran this.

This should show additional ports than from before adding the license.

Once you have added your license file(s), it’s time to upgrade EFOS, here are the commands with an explanation of what they do:

show bootvar



copy active backup


show bootvar

show version


copy http://10.0.0.80/FastPath-EFOS-esw-qcp_td3-qcp_td3_x86_64-LX415R-CNTRF-BD6IOQHr3v7m0b4.stk active 

show bootvar

reload

show version
Shows the images: active, backup, current-active and next-active.

Copies the active image to the backup slot, just in case.

Verify that the above worked.

Shows the version actually running.

Copies the image on the web server to the active slot.




Verify the last command.

Reboot the switch.

Verify the upgrade worked.

Now that we have upgraded our EFOS image, it’s time to apply the RCF. There really is no point in doing any additional configuration until we’ve done this since we have to destroy our configuration before applying the RCF anyway. Be sure that you’re only applying the default RCF if you haven’t added any licenses. If you have added licences, you need to uncomment the lines that configure the additionally licensed ports. Here are the commands with an explanation:

erase startup-config




copy http://10.0.0.80/BES-53248_RCF_v1.7-Cluster-HA.txt nvram:script BES-53248_RCF_v1.7-Cluster-HA.scr 

script list




script apply BES-53248_RCF_v1.7-Cluster-HA.scr 

show running-config



write memory


reload

This clears the startup configuration, overlaying an RCF-sourced configuration can have negative consequences.

This copies the txt file from the web server to NVRAM as a script and renames it in the process.

gives you a directory listing of available scripts to confirm the above transfer worked

applies the contents of the RCF to the configuration

displays the new running configuration to verify successful application of RCF

commit new configuration to non-volatile memory

reboots the switch so this new configuration can take affect

There, you’re all done, now you can proceed with the official guide on (re)configuring the management IP address, ssh and so on. Good luck, and if you have an experience that strays from the above, please let me know so I can update the post.

What’s going on with Intel’s X710 Ethernet controller?

I’ve previously written about this Ethernet controller back when 40GbE Ethernet was relatively new to NetApp’s FAS and AFF controllers. Since that article, I’ve started to come across various oddities with this Ethernet controller.

Last Fall, I had a customer who was experiencing problems with LACP during an ONTAP upgrade (9.1 → 9.3 → 9.5P6) on their AFF A700s using the X1144A, dual port 40GbE card, which uses the Intel X710 Ethernet controller. We had the first 40GbE port broken out into 4x10GbE links, 2-each to either half of a pair of Cisco Nexus N9K-C9396PX in the same vPC Domain. During a controller reboot, we noticed that the interface group using multimode_lacp, most or all of the ports wouldn’t come up and on the Cisco-side, the port(s) would become disabled due to too many link up/down events. Immediately we wanted to look at potential cable problems but quickly dismissed that idea as well. After some digging, it looked as though NetApp was referencing Cisco Bug ID CSCuv87644 as potentially related. This led me down a long path of investigating the changes made to the networking stack in ONTAP over the past couple of years, and I’ve still got a post I’m working on around that. The workaround was to increase the debounce timer value on the Cisco 9k to 525ms, the default value is 100ms.

The port debounce time is the amount of time that an interface waits to notify the supervisor of a link going down. During this time, the interface waits to see if the link comes back up. The wait period is a time when traffic is stopped.

Source: Cisco

Recently, a different customer of mine was trying to buy a Nimble HF20 and they wanted to include the Q8C17B, a four port, 10GbE NIC, also based on the Intel X710 Ethernet controller. The vendor came back to me and said they needed to know if the customer was going to be using VLAN tagging on the Q8C17B, because if they needed VLAN tagging, they’d have to choose a two port NIC instead. This confused me, but after some emails back and forth, HPE Nimble Storage Alert # EXT-0061 was referenced as the reason for this. At some point Nimble will release a patch that updates the firmware on this NIC, hopefully bringing back VLAN functionality. A bit of looking around, and the same VLAN issue has been identified by VMware in KB2149781.

Lastly, I also came across a NIST vulnerability from 2017 regarding the same Ethernet controller, it seems that has since been addressed in a firmware update.

While the above doesn’t necessarily imply a huge problem with the X710, I simply found it interesting and thought I’d include them all in one post.

There’s a new NVMe AFF in town!

Yesterday, NetApp announced a new addition to the midrange tier of their All-Flash FAS line, the AFF A320. With this announcement, end-to-end NVMe is now available in the midrange, from the host all the way to the NVMe SSD. This new platform is a svelte 2RU that supports up to two of the new NS224 NVMe SSD shelves, which are also 2RU. NetApp has set performance expectations to be in the ~100µs range.

Up to two PCIe cards per controller can be added, options are:

  • 4-port 32GB FC SFP+ fibre
  • 2-port 100GbE RoCEv2* QSFP28 fibre (40GbE supported)
  • 2-port 25GbE RoCEv2* SPF28 fibre
  • 4-port 10GbE SFP+ Cu and fibre
    *RoCE host-side NVMeoF support not yet available

A couple of important points to also note:

  • 200-240VAC required
  • DS, SAS-attached SSD shelves are NOT supported

An end-to-end NVMe solution obviously needs storage of some sort, so also announced today was the NS224 NVMe SSD Storage Shelf:

  • NVMe-based storage expansion shelf
  • 2RU, 24 storage SSDs
  • 400GB/s capable, 200Gb/sec per shelf module
  • Uplinked to controller via RoCEv2
  • Drive sizes available: 1.9TB, 3.8TB and 7.6TB. 15.3TB with restrictions.

Either controller in the A320 has eight 100GbE ports on-board, but not all of them are available for client-side connectivity. They are allocated as follows:

  • e0a → ClusterNet/HA
  • e0b → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
  • e0c → First NS224 connectivity
  • e0d → ClusterNet/HA
  • e0e → Second NS224 connectivity by default, or can be configured for client access, 100GbE or 40GbE
  • e0f → First NS224 connectivity
  • e0g → Client network, 100GbE or 40Gbe
  • e0h → Client network, 100GbE or 40Gbe

If you don’t get enough client connectivity with the on-board ports, then as listed previously, there are myriad PCIe options available to populate the two available slots. In addition to all that on-board connectivity, there’s also MicroUSB and RJ45 for serial console access as well as the RJ-45 Wrench port to host e0M and out-of-band management via BMC. As with most port-pairs, the 100GbE ports are hosted by a single ASIC which is capable of a total effective bandwidth of ~100Gb.

Food for thought…
One interesting design change in this HA pair, is that there is no backplane HA interconnect as has been the case historically; instead, the HA interconnect function is placed on the same connections as ClusterNet, e0a and e0d. This enables some interesting future design possibilities, like HA pairs in differing chassis. Also, of interest is the shelf connectivity being NVMe/RoCEv2; while currently connected directly to the controllers, what’s stopping NetApp from putting these on a switched fabric? Once they do that, drop the HA pair concept above, and instead have N+1 controllers on a ClusterNet fabric. Scaling, failovers and upgrades just got a lot more interesting.