Building the AI-Powered Enterprise: Why the Infrastructure Matters

AI is changing how we design and operate enterprise IT. Most of the attention goes to GPUs and models, but AI is still a data problem at its core. Without the right infrastructure, performance gains disappear.

NetApp’s new data storage architecture, AFX, is designed to fix that gap. It brings AI performance and enterprise-grade reliability together on a platform that organizations already know: ONTAP.

AFX: Purpose-built for enterprise AI

At NetApp INSIGHT 2024, the company outlined a goal to make AI infrastructure simpler and more scalable. A year later, AFX delivers on that vision with a disaggregated storage architecture that lets you scale performance and capacity independently.

It runs on ONTAP, which means you keep the same management, automation, and security features that are already proven in production environments. Instead of reinventing storage for AI, AFX enhances what enterprises already trust.

One platform for business and AI workloads

Traditional AI deployments often use separate infrastructure from core business systems. That separation creates complexity, inefficiency, and extra data movement. AFX allows high-performance AI workloads to run on the same ONTAP infrastructure used for enterprise data.

This reduces silos and provides:

Consistent security and governance
A single management interface
No proprietary client software
Support for high-speed networking such as 400 GbE

The result is better performance without a complete rebuild of your storage environment.

Scaling for real AI workloads

AI data patterns are unpredictable. They spike, change shape, and can overwhelm rigid systems. AFX handles these shifts with automated load balancing, seamless scaling, and strong multitenancy controls.

Because it’s ONTAP-based, it works with existing automation and integration tools, keeping operational changes minimal while still supporting heavy I/O and demanding pipelines.

Security and data protection

Security must be part of the foundation, not an afterthought. AFX builds on ONTAP’s security framework and includes autonomous ransomware protection that detects threats with high accuracy. This protects not only datasets and models, but also the processes that feed them.

Hybrid by design

Modern enterprises operate across data centers, clouds, and edge locations. AFX integrates with AWS, Azure, and Google Cloud so that data can move freely between environments. The NetApp AI Data Engine adds features like metadata indexing, privacy management, and vectorization, helping teams prepare data for AI use without manual work.

The future of enterprise AI

AFX has the potential to become a key building block for AI-driven enterprises. It builds on familiar technology while addressing new performance and scalability needs.

A few things to watch for as this technology starts getting into customer environments:

Real-world scalability beyond lab results
Operational simplicity for disaggregated deployments
Integration depth with cloud and AI ecosystems
Cost efficiency compared to traditional silos

The future of enterprise AI is bright. Look to NetApp to lead the charge with solutions like AFX.

NetApp updates their Mid and High-end A-Series

NetApp’s flagship unified storage platform is the All Flash FAS, or AFF. The A-Series differs from the C-Series in that it addresses TLC flash, whereas the C-Series addresses QLC.

Today, Tuesday May 14, 2024, NetApp announced three new AFF A-Series models: the A70, A90 and A1K, which will likely supplant the A400, A800 and A900 respectively. Here they are:

What’s interesting about these latest models is that they are all basically the same but with different RAM and CPU complements. The A1K does differentiate itself in that it is a return to a single-node chassis design, whereas the A70 and A90 reuse the chassis from the A800, with a slight upgrade I’ll get to at the end of this article. Upgrading from an A800 to an A90 is a PCM swap, not an entire chassis swap.

The new platform is a serious upgrade in the IO department, with nine usable PCIe 5.0 slots available per controller, up from five PCIe 3.0 slots on the A400 and A800 but down slightly from the A900’s ten PCIe 4.0 IO modules. PCIe 4.0 doubled the per-lane bandwidth of PCIe 3.0 and PCIe 5.0 doubled it again, bringing available lane bandwidth up to ~4GB/s. Up until this release, all existing controllers used a switched PCIe architecture, but now direct access to the bus is available to all slots. These improvements pave the way for faster PCIe cards such as quad-port 64G FC cards and dual-port 200GbE and 400GbE Ethernet adapters, but be sure to pay attention to the slot priority assignment if you’re putting the card in yourself. Slots 2, 3, 9 and 10 have eight lanes versus sixteen on slots 1, 4-8 and 11; all eleven slots on the A1K have sixteen lanes.
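To put rough numbers on the slot bandwidth, here is a minimal Python sketch. It assumes the published PCIe transfer rates (8, 16 and 32 GT/s per lane for generations 3.0, 4.0 and 5.0) and 128b/130b line encoding; the function names are my own, and real-world throughput will be lower once protocol overhead is counted.

```python
# Approximate one-direction PCIe bandwidth per lane and per slot.
# Assumption: raw transfer rates from the PCIe specs, 128b/130b encoding,
# and no allowance for packet/protocol overhead.
GT_PER_SEC = {"3.0": 8, "4.0": 16, "5.0": 32}


def lane_gb_per_sec(gen: str) -> float:
    """Usable GB/s for a single lane of the given PCIe generation."""
    return GT_PER_SEC[gen] * (128 / 130) / 8  # gigabits -> gigabytes


def slot_gb_per_sec(gen: str, lanes: int) -> float:
    """Usable GB/s for a slot with the given number of lanes."""
    return lane_gb_per_sec(gen) * lanes


for gen in GT_PER_SEC:
    print(f"PCIe {gen}: {lane_gb_per_sec(gen):.2f} GB/s per lane")
print(f"x8 Gen5 slot:  {slot_gb_per_sec('5.0', 8):.1f} GB/s")
print(f"x16 Gen5 slot: {slot_gb_per_sec('5.0', 16):.1f} GB/s")
```

By this estimate a single 400GbE port (50 GB/s at line rate) needs more than an eight-lane Gen5 slot can supply, which is one reason the slot priority assignment matters.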

Powering these new controllers will be Intel’s 4th generation Xeon Scalable processors with QuickAssist Technology (QAT), allowing for integrated compression offload. Current platforms perform inline compression in 8K chunks, with a scanner later compressing cold data in 32K chunks; QAT allows for inline compression in 32K chunks, all without any performance impact. Any data moved to a QAT-enabled controller, or residing on controllers upgraded to QAT-enabled models, will realize this efficiency improvement. However, if data is moved from a QAT-enabled controller to an older, non-QAT model, that efficiency goes away. Speaking of older models, all controller upgrades are non-disruptive and the A800 is an in-chassis upgrade. I have to wonder if NetApp has any plans to use the other accelerators available in this CPU, specifically DLB and IAA.
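To illustrate why the chunk size matters for efficiency, here is a small Python sketch that compresses the same data in independent 8K and 32K chunks. It uses zlib purely as a stand-in (this is not ONTAP’s or QAT’s compression), and the sample data is synthetic, so treat the numbers as directional only.

```python
import random
import zlib


def compressed_size(data: bytes, chunk_size: int) -> int:
    """Total bytes after compressing each fixed-size chunk independently."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


# Synthetic, text-like sample data (hypothetical; real workloads will differ).
rng = random.Random(0)
words = [
    "".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=rng.randint(3, 10)))
    for _ in range(500)
]
data = " ".join(rng.choices(words, k=60_000)).encode()

size_8k = compressed_size(data, 8 * 1024)
size_32k = compressed_size(data, 32 * 1024)
print(f"original:   {len(data)} bytes")
print(f"8K chunks:  {size_8k} bytes")
print(f"32K chunks: {size_32k} bytes")
```

Larger chunks give the compressor more context to find repeats in and fewer per-chunk headers to pay for, which is the general effect behind the move from 8K to 32K inline compression.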

Have a look at the raw numbers. I’m not sure how scientific my aggregated GHz column is, but I find it helpful when comparing horsepower. I’ve included the existing A400, A800 and A900 to see where the new models land.

Model	MAX Drives	Internal Drives	CPU Speed	CPU/Node	Cores/CPU	Aggr GHz	NVRAM	RAM
A400	480	N/A	2.2GHz	2	10	44GHz	16GB	128GB
A70	240	48	2.0GHz	2	16	64GHz	32GB	128GB
A800	240	48	2.1GHz	2	24	100.8GHz	32GB	640GB
A90	240	48	2.0GHz	2	32	128GHz	64GB	1024GB
A900	480	N/A	2.2GHz	2	32	140.8GHz	64GB	1024GB
A1K	480	N/A	1.7GHz	2	52	176.8GHz	64GB	1024GB

By the Numbers: Node Specifications
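For anyone wanting to reproduce the Aggr GHz column, it is simply clock speed × CPUs per node × cores per CPU. A minimal Python sketch, using the values from the table above (the dictionary and function names are just for illustration):

```python
# (clock in GHz, CPUs per node, cores per CPU), copied from the table above
NODES = {
    "A400": (2.2, 2, 10),
    "A70": (2.0, 2, 16),
    "A800": (2.1, 2, 24),
    "A90": (2.0, 2, 32),
    "A900": (2.2, 2, 32),
    "A1K": (1.7, 2, 52),
}


def aggregate_ghz(clock_ghz: float, cpus: int, cores_per_cpu: int) -> float:
    """Naive 'horsepower' figure: every core counted at base clock."""
    return round(clock_ghz * cpus * cores_per_cpu, 1)


for model, spec in NODES.items():
    print(f"{model}: {aggregate_ghz(*spec)}GHz")
```

It ignores generational IPC gains, turbo clocks and offload engines like QAT, so it understates the newer CPUs if anything.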

As for performance improvement expectations, NetApp claims the following:

A70 provides 2x the performance of the A400
A90 provides 1.6x the performance of the A800
A1K provides 1.4x the performance of the A900

While this article covers the new A-Series, up until today the A-Series and C-Series have used the exact same controllers, differing only in whether they address TLC or QLC media. Having said that, I’ve become a big fan of the C800 for performance and form factor. Since the A70 and A90 reuse the x800 chassis, they keep the ability to hold 48 internal drives. With the current max drive sizes for the A800 and C800 being 15.3TB and 30.7TB respectively, they can provide up to 570TiB and 1.11PiB of usable space, all in 4RU of rack space; this is before taking into account ONTAP’s aggressive data efficiency capabilities. All three of the new models are described as having a max raw capacity of 3.7PB, which means 240×15.3TB, so no 30.7TB TLC media at this time.
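The raw capacity arithmetic is easy to check. Here is a short Python sketch, assuming drive sizes are quoted in decimal terabytes; the helper name is mine.

```python
TB = 10**12   # decimal terabyte, as used on drive labels
TIB = 2**40   # binary tebibyte, as used for usable capacity
PB = 10**15


def raw_capacity_bytes(drives: int, drive_tb: float) -> float:
    """Raw capacity in bytes, before RAID, spares or efficiencies."""
    return drives * drive_tb * TB


max_raw = raw_capacity_bytes(240, 15.3)
internal = raw_capacity_bytes(48, 15.3)
print(f"240 x 15.3TB = {max_raw / PB:.2f} PB raw")
print(f"48 x 15.3TB  = {internal / TB:.1f} TB = {internal / TIB:.1f} TiB raw")
```

The 240-drive figure rounds to the quoted 3.7PB. The gap between the raw TiB figure for 48 drives and the 570TiB usable number is presumably RAID parity, spares and system overhead.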

Today’s announcement has the A-series properly differentiating itself from its C-series sibling; however, I can’t imagine the C-series wouldn’t get some sort of QAT-related update soon, but I guess we’re going to have to wait on that for now.

Last but not least, if you’re a fan of the new bezels we’ve been seeing, you’ll be happy to hear that NetApp has finally acquiesced to our demands that the NetApp logo light up. That’s right, the three new models all get a fancy light-up bezel that can be controlled via the GUI, CLI or even API. Sadly they are one-colour LEDs and not RGBW…at least not yet. If you have an existing A800 and you perform an in-chassis upgrade, you will not benefit from this one, as the pre-existing chassis lacks the requisite 8-pin connector to the bezel.

These new models have NetApp seriously improving their offering, first by beefing up the basic CPU/RAM components but also around the physical design and architecture, the modular nature of which is very interesting.

ONTAP One for all and all for One

Back in February, NetApp announced their new QLC-based AFF systems, the NetApp C-Series, C being for capacity flash. That new product line alone was celebration-worthy, but what was really exciting, and involved a touch of burying the lede, was the inclusion of a licensing model called ONTAP One, which is the all-you-can-eat equivalent of NetApp licensing. When a C-series is licensed with ONTAP One, you get to use all of the features of ONTAP. At the time of launch, my only complaint was that it was only on the new platform, but behind the scenes I was told to watch that space. Well, as of today, all new and existing FAS, AFF and ASA systems licensed with anything more than the bare minimum can now get licenses for everything ONTAP.

NetApp has simplified their licensing to only two options, ONTAP Base and ONTAP One. If your existing system had either Flash, Core+DP, or Premium, you are now entitled to ONTAP One licensing. What exactly does that look like? Here’s a picture:

How do you acquire your new licenses, you may ask? Customers with a valid support contract can log in to the NetApp support portal, download a new license file and install it. Some features may require you to upgrade to 9.10.1, but you should really be on at least that recent a release by now.

As with all great things, there are some caveats and restrictions but not enough to warrant covering them here. The majority of my readers will be able to proceed as above, and edge cases around the IPA license model versus LICKEYs or SnapMirror Cloud/S3 SnapMirror licenses can be found in the documentation.

ONTAP One is something I’ve wanted NetApp to introduce for years. It will not only eliminate any post-sales problems due to improper configurations but also remove a FUD point for their competition.

NetApp Announces A Whole New Line

Up until today, if you were looking for a physical ONTAP array for your environment, your choices were the hybrid-flash FAS array offering around 5-10ms of latency or the sub-ms AFF A-series. Sure, there was one anomaly in there, the QLC-based FAS 500f, but that AFF in FAS clothing was just that, an anomaly. While I have no evidence to point to here, my theory is that the 500f was NetApp’s way of dipping their toe in the water of QLC-based arrays. Upon launch, the 500f was pricey and the configurations limited and restricted, both of which were addressed at some point after launch. As an employee at a partner that sells a lot of NetApp, I looked at the 500f when it first launched and then basically never looked at it again because of those two points.

Today, NetApp is announcing the all-new C-Series of QLC-based arrays, the “C” being for “Capacity Enterprise Flash”. While the controllers themselves aren’t new, the fact that they only support QLC media is what is different. While I won’t go into the details of what QLC, or Quad-Level Cell flash, is in this post, the fact of the matter is that it is more affordable than Triple-Level Cell (TLC) flash and almost as performant. What this means for those purchasing NetApp arrays is that they can get near the performance of an AFF system at a fraction of the cost. Most of us in the storage world know that 10k and 15k RPM SAS drives are slowly going to be phased out in favour of high-capacity SATA drives and high-performance NAND storage, leaving a void. QLC-based arrays will fill that void, and at a higher performance level. If you start to research QLC vs TLC, you’ll find lots of concerns around durability which are not completely unfounded, but you would have also found these concerns when the industry went from Multi-Level Cell (MLC) to TLC, and that seems to have gone well enough. The technology of the storage devices themselves improves over time, and software-based mitigation strategies such as write avoidance also improve. I’m not knowledgeable enough on this latter point to go into details, but ONTAP is a beast and has all sorts of tricks up its sleeve.

So without further ado, I present NetApp’s Enterprise Capacity Flash line, the AFF C800, AFF C400 and AFF C250:

Quick Specs:

	AFF C800	AFF C400	AFF C250
Max drive count (15.3TB NVMe QLC)	144	96	48
Max effective capacity (5:1 efficiencies)	8.8 PB	5.9 PB	2.9 PB
Max Usable capacity (1:1)	1.6 PiB	1.06 PiB	540.37 TiB
Minimum configurations	12 × 15.3	8 × 15.3	8 × 15.3
100GbE ports per HA pair	20	16	4
25GbE ports per HA pair	16	12 onboard / 16 HBA	4 onboard / 16 HBA
32Gb FC ports	32	32	16

By the numbers

Now some of you may have thought, “I thought there was already a C-series with the C190?”, and you’d be right. NetApp is repurposing the C-series branding as well as introducing a successor to the C190, the AFF A150. While the new A150 will still have some restrictions, it won’t be nearly as restrictive as the C190. The physical form-factor remains the same as the C190, but the A150 will allow for up to two expansion shelves for a total of 72 SAS SSDs, including the internal ones, in capacities of 960GB, 3.8TB and 7.6TB, coming to a max usable capacity of ~402TiB, or 2.2PB at an efficiency level of 5:1.
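The jump from ~402TiB usable to 2.2PB effective mixes binary and decimal units, so here is a quick Python sketch of the conversion, assuming a five-fold efficiency ratio is applied to the usable figure (the function name is mine):

```python
TIB = 2**40   # binary tebibyte
PB = 10**15   # decimal petabyte


def effective_pb(usable_tib: float, efficiency: float) -> float:
    """Effective capacity in decimal PB for a usable TiB figure and ratio."""
    return usable_tib * TIB * efficiency / PB


print(f"{effective_pb(402, 5):.2f} PB effective")
```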

Back to the new C-Series conversation: they bring with them a new default licensing model, ONTAP One. ONTAP One is something I have personally been asking for for many years at this point, and it includes all of the licenses: Core, Data Protection, Hybrid Cloud and Security & Compliance. Personally I’m looking forward to not having to worry about what features are available with a certain license offering. Instead, C-Series with ONTAP One as the default licensing model will ensure you or your customers will never be left wondering if their array has a given feature.

The C-Series should be available to quote as of March 27, 2023 and should start shipping by the end of April. This statement as well as all of the information above is based on pre-release information I received and may be subject to change at press time. I will endeavour to add corrections below should any of the above change at launch.

Migrating from the CN1610 to the BES-53248 for cluster interconnect

In my continuing effort to make the adoption of the BES-53248 more streamlined, I figured I would also write a migration guide as I personally had to read the documentation more than once to understand it completely. If you haven’t already checked it out, it might be helpful to first consult my first timers’ guide as the following guide starts with the assumption that your new switches are racked and Inter-Switch Links (ISL) connected and initial configuration has been performed.

Another quick caveat, this is by no means a replacement for the official documentation and the methods below may or may not be supported by NetApp. If you want the official procedure, that is documented here.

Now that we’ve got the above out of the way, I’ll get down to brass tacks. To keep things simple, we’re going to start with a simple two-node switched cluster which should look like this:

You should also have your new BES switches set up like so:

Next step, let’s make sure we don’t get a bunch of cases created by kicking off a pre-emptive AutoSupport:

system node autosupport invoke -node * -type all -message MAINT=2h

Elevate your privilege level and confirm all cluster LIFs are set to auto-revert:

set advanced
network interface show -vserver Cluster -fields auto-revert

If everything above looks good, it’s time to log in to your second BES, which NetApp wants you to name cs2, and configure a temporary ISL back to your first CN1610. Personally I feel the temporary ISL is optional, but it can provide a bit of added insurance to your change:

(cs2) # configure
(cs2) (Config)# port-channel name 1/2 temp-isl
(cs2) (Config)# interface 0/13-0/16
(cs2) (Interface 0/13-0/16)# no spanning-tree edgeport
(cs2) (Interface 0/13-0/16)# addport 1/2
(cs2) (Interface 0/13-0/16)# exit
(cs2) (Config)# interface lag 2
(cs2) (Interface lag 2)# mtu 9216
(cs2) (Interface lag 2)# port-channel load-balance 7
(cs2) (Interface lag 2)# exit
(cs2) (Config)# exit

(cs2) # show port-channel 1/2
Local Interface................................ 1/2
Channel Name................................... temp-isl
Link State..................................... Down
Admin Mode..................................... Enabled
Type........................................... Static
Port-channel Min-links......................... 1
Load Balance Option............................ 7
(Enhanced hashing mode)

Mbr     Device/        Port      Port
Ports   Timeout        Speed     Active
------- -------------- --------- -------
0/13    actor/long     10G Full  False        
        partner/long
0/14    actor/long     10G Full  False 
        partner/long
0/15    actor/long     10G Full  False
        partner/long
0/16    actor/long     10G Full  False
        partner/long
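If you would rather check this output in a script than by eye, here is a minimal Python sketch that parses the text shown above. The sample is copied from this post; the function names are my own, and the assumption that a healthy ISL reports a Link State of "Up" with every member Active "True" is mine rather than something taken from the EFOS documentation.

```python
import re

SAMPLE = """\
Local Interface................................ 1/2
Channel Name................................... temp-isl
Link State..................................... Down
Admin Mode..................................... Enabled
Type........................................... Static
Port-channel Min-links......................... 1
Load Balance Option............................ 7
(Enhanced hashing mode)

Mbr     Device/        Port      Port
Ports   Timeout        Speed     Active
------- -------------- --------- -------
0/13    actor/long     10G Full  False
        partner/long
0/14    actor/long     10G Full  False
        partner/long
0/15    actor/long     10G Full  False
        partner/long
0/16    actor/long     10G Full  False
        partner/long
"""


def parse_port_channel(text: str) -> dict:
    """Extract the dotted 'Key.... value' fields and per-port Active flags."""
    fields, members = {}, {}
    for line in text.splitlines():
        field = re.match(r"^(\S.*?)\.{2,}\s*(.+)$", line)
        if field:
            fields[field.group(1).strip()] = field.group(2).strip()
            continue
        member = re.match(r"^(\d+/\d+)\s+.*\b(True|False)\s*$", line)
        if member:
            members[member.group(1)] = member.group(2) == "True"
    return {"fields": fields, "members": members}


def isl_is_up(parsed: dict) -> bool:
    # Assumption: healthy means Link State "Up" and all member ports active.
    members = parsed["members"]
    return (bool(members)
            and parsed["fields"].get("Link State") == "Up"
            and all(members.values()))


parsed = parse_port_channel(SAMPLE)
print(parsed["fields"]["Link State"], parsed["members"])
print("ISL up:", isl_is_up(parsed))
```

At this stage the port-channel is expected to be down, since the cables for the temporary ISL have not been connected yet.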

At this point, we’re going to disconnect any of the connections to the second CN1610 and run these to the second BES-53248. You may need different cables to ensure they are supported; check Hardware Universe. When you’re done with this recabling step, it should look like this:

Note: It’s this step here that made me realize the temporary ISL is optional since we now have our two sets of LIFs isolated from each other.

Next, let’s put the (optional) temporary ISL into play. At your first CN1610, disconnect the cables connected to ports 13-16 and once they’re all disconnected, assuming these cables are supported by both switches, plug them into ports 13-16 on your second BES, so it looks like this:

Now on the second BES-53248, verify the ISL is up:

show port-channel

Assuming the port-channel is up and running, let’s check the health of our cluster LIFs by issuing the following commands at the cluster command line:

network interface show -vserver Cluster -is-home false
network port show -ipspace Cluster

The first command shouldn’t produce any output, though you may need to give the LIFs time to revert. For the second command, you want to make sure all ports are up and healthy. Once all the LIFs have reverted home, you can move all the links from the first CN1610 to the first BES-53248 and remove the temporary ISL, so you end up with this:

Run the same two commands as before:

network interface show -vserver Cluster -is-home false
network port show -ipspace Cluster

Provided everything looks good, you’re free to remove the CN1610s from the rack as they are no longer in use. The final step is to clean up the configuration on your second BES-53248 by tearing down the temporary ISL configuration, done like this:

(cs2) # configure
(cs2) (Config)# deleteport 1/2 all
(cs2) (Config)# exit
(cs2) # write memory

This guide is by no means a replacement for the official documentation but rather a companion to it. You should always consult the official documentation. I purposely cut out some of the steps I felt gave the docs a bit of a TL;DR feel, but that doesn’t mean I wouldn’t personally run those steps if I were doing the work. This document is only my attempt to clarify the official docs; hopefully it does so for you.

The BES-53248 first-timer’s guide

With the CN1610 starting to get long in the tooth and with more platforms supporting and/or requiring a cluster interconnect network faster than 10Gbit, the need to introduce a non-Cisco option came to be. This option is the BES-53248, which is a “Broadcom Supported” switch produced by Quanta, maker of all things hyperscale, which sells it as the QuantaMesh T4048-IX8. At some point Broadcom’s EFOS is installed on the T4048-IX8 via the Open Network Install Environment (ONIE) and it becomes the product we know as the BES-53248. While definitely a superior switch, supporting 10/25/40/100Gbit, the deployment thereof is not as streamlined, hence this post.

I struggled a bit with how to approach this topic and settled on the following: I will provide a numbered list of steps as a guide and index but then have sections below that expand upon those steps. There could very well be times where you want to perform these steps in a different order, but if this is your first time working on this switch and it’s factory-fresh, the steps below are how I would advise proceeding.

1. Equipment Ordering, including licences
2. Broadcom Support Account, Firmware Download
3. Reference Configuration Files (RCF)
4. Supporting Infrastructure
5. Initial Configuration

Equipment Ordering, including licences

The BES-53248 has 48 × 10/25Gbit ports and 8 × 40/100Gbit ports; by default the first 16 × 10/25Gbit ports are available for cluster interconnect connections and the last 2 × 40/100Gbit are reserved for Inter-Switch Links (ISL); which is already an improvement over the CN1610’s 12 × ClusterNet ports. If the environment requires more ports than this, the 10/25Gbit ports can be licensed in blocks of 8 (Part # SW-BES-53248A2-8P-1025G) all the way up to 48, and there is one license (Part # SW-BES-53248A2-6P-40-100G) to activate the remaining 6 × 40/100Gbit ports. Be sure your order also has all the requisite transceivers and cables, consult HWU for specific compatibilities. Lastly, the BES-53248 doesn’t ship with rails by default, so make sure your quote shows them if you need them.
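When putting the order together, the number of 8-port license blocks follows directly from the port count. A minimal Python sketch of that arithmetic, using only the figures quoted above (the constant and function names are mine):

```python
import math

BASE_PORTS = 16   # 10/25Gbit ports enabled by default
MAX_PORTS = 48    # total 10/25Gbit ports on the switch
BLOCK_SIZE = 8    # ports added per SW-BES-53248A2-8P-1025G license


def license_blocks_needed(ports_required: int) -> int:
    """Number of 8-port licenses to order per switch for 10/25Gbit ports."""
    if not 0 <= ports_required <= MAX_PORTS:
        raise ValueError(f"switch has {MAX_PORTS} 10/25Gbit ports")
    extra_ports = max(0, ports_required - BASE_PORTS)
    return math.ceil(extra_ports / BLOCK_SIZE)


for ports in (12, 16, 20, 32, 48):
    print(f"{ports} ports -> {license_blocks_needed(ports)} license block(s)")
```

Remember the count is per switch, and the six additional 40/100Gbit ports are covered by their own single license.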

When your switches arrive they will include a manila envelope with licensing information if licenses above the base configuration were ordered. Do not recycle this envelope, as it contains the very important Transaction Key which you will use to generate your license file at this site:

https://efos-licensing.broadcom.com/License/RedeemTransactionKey

Before visiting that link, along with your license keys you’ll need the switch serial numbers which are located on the switches themselves like so:

The license file generation procedure is instant, so not having this ahead of time isn’t that big of a deal provided you have internet access while at the installation site.

Broadcom Support Account, Firmware Download

What isn’t instantaneous, however, is the creation of a TechData-provided Broadcom Support Account (BSA), and you need this account to download firmware for the switches. In order to set up a BSA, which hopefully you did a couple of days in advance of requiring the firmware, you need to send an email to support@techdata.com with the following information:

Indicate if OEM (Netapp/Lenovo), Partner/Installer or Customer:
Name of Company device is registered to (if partner/installer):
Requester Name:
Requester Email Address:
Requester Phone Number:
Address where device is located:
Device Model Number: BES-53248
Device Serial Number:

I’ve found the folks that respond to this email address are pretty easy to deal with, though I’m not sure you’ll be able to get your account if you don’t already have the serial number, comment below if you know. My account creation took roughly 24 hours and then I had access to the firmware downloads. Download the appropriate firmware for your environment. The switches I received in August of 2021 shipped with EFOS 3.4.4.6 which was supported in the environment I was deploying into, but so was 3.7.0.4 so that’s where I wanted to land.

Reference Configuration Files (RCF)

Download the appropriate RCF for the environment and edit accordingly. If you visit HWU and drill down into the switch category, you can download the RCF from there:

I was converting an AFF8080 from two-node switchless to switched and adding an A400 at 100Gbit. I grabbed RCF 1.7 from Hardware Universe (not where I’d expect to find it but nice and easy) and uncommented ports 0/49-0/54 by removing the initial exclamation point on the lines in question, since the additional 40-100 license activates all of these ports. I also deleted the lines setting the speed to 40G full-duplex. I hope in version 1.8 of the RCF, this configuration will also be applied as a range since that’s the only license option available for purchase on these ports.

Supporting Infrastructure

In your site requirements checklist, ensure the availability of an http (or ftp, tftp, sftp, scp) server on the management network. Once the equipment is racked and the management interface cabled, you will need this server to host your EFOS firmware, license files and RCF.
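If nothing suitable exists on the management network, a laptop with Python can stand in as a temporary http server. Below is a minimal sketch using only the standard library; the function name, directory path and default port are my own choices, not anything the switch requires. The URLs later in this post have no port number, which implies port 80, and binding to that usually needs elevated privileges.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_file_server(directory: str, host: str = "0.0.0.0",
                     port: int = 8080) -> ThreadingHTTPServer:
    """Build an HTTP server that serves the files found in `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Example with a hypothetical staging directory holding the EFOS image,
# license files and RCF; this call blocks until interrupted with Ctrl+C:
#   make_file_server("/srv/bes53248", port=80).serve_forever()
```

This serves everything in the directory to anyone who can reach it, so only run it for the duration of the install.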

Initial Configuration

The first time you connect to the device, most likely via serial, assuming the unit was factory-fresh like mine, the username should be admin and the password should be blank. You will be immediately forced to change the password. I noticed that when I was going through this, copying and pasting the new password didn’t work for me but typing the same password did; this may have had something to do with the special characters chosen or the app I was using (serial.app on macOS). Another thing to be aware of: if you’re applying RCF 1.7 you will have to be on EFOS 3.7.0.4 first. The switches I based this post on shipped with 3.4.4.6 and there are some commands in the RCF that aren’t compatible, so you’ll want to upgrade EFOS before applying RCF 1.7. Also, applying an RCF means wiping any existing configuration first, so you might as well get this out of the way while you are on site.
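The version check itself is simple but easy to get wrong if you compare the strings directly. Here is a small Python sketch that compares dotted EFOS versions numerically; the function names are mine and the version numbers are the ones from this post.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '3.7.0.4' into (3, 7, 0, 4)."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(running: str, required: str) -> bool:
    """True when the running EFOS release is older than the required one."""
    return parse_version(running) < parse_version(required)


print(needs_upgrade("3.4.4.6", "3.7.0.4"))
```

Comparing as tuples of integers avoids the trap where a plain string comparison would sort a hypothetical 3.10.x release before 3.7.x.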

Once you’ve changed the password, it’s time to configure the management IP address so you can retrieve the license files, EFOS image and RCF from the http server mentioned previously. You’ll need to be logged in, and have elevated your privilege level to enable:

User:admin
Password:************
(CLswitch-01) >enable

(CLswitch-01) #serviceport ip 10.0.0.209 255.255.255.0 10.0.0.1

(CLswitch-01) #show serviceport

Interface Status............................... Up
IP Address..................................... 10.0.0.209
Subnet Mask.................................... 255.255.255.0
Default Gateway................................ 10.0.0.1
IPv6 Administrative Mode....................... Enabled
IPv6 Prefix is ................................ fe80::c218:50ff:fe0b:24c5/64
Configured IPv4 Protocol....................... None
Configured IPv6 Protocol....................... None
IPv6 AutoConfig Mode........................... Disabled
Burned In MAC Address.......................... B4:A9:FC:34:8F:CE

(CLswitch-01) #ping 10.0.0.1
 Pinging 10.0.0.1 with 0 bytes of data:

Reply From 10.0.0.1: icmp_seq = 0. time= 2 msec.
Reply From 10.0.0.1: icmp_seq = 1. time <1 msec.
Reply From 10.0.0.1: icmp_seq = 2. time= 26 msec.
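If the ping fails, one quick thing to rule out is a typo that puts the gateway outside the service port’s subnet. A minimal Python sketch of that check with the standard ipaddress module, using the addresses from the example above (the function name is mine):

```python
import ipaddress


def gateway_on_link(ip: str, netmask: str, gateway: str) -> bool:
    """True when the gateway sits in the same subnet as the service port."""
    interface = ipaddress.ip_interface(f"{ip}/{netmask}")
    return ipaddress.ip_address(gateway) in interface.network


print(gateway_on_link("10.0.0.209", "255.255.255.0", "10.0.0.1"))
```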

Now that you are on the network, the first thing we should do is add any additional licenses. Here are the commands with an explanation of what they do:

Command	Explanation
show license	See how many licenses are currently applied, if any.
show port all | exclude Detach	Display the currently licensed ports.
copy http://10.0.0.80/switch1_license.data nvram:license-key 1	Copies the file from the http server and places it in index 1.
reload	Reboots the switch.
show license	Run this after you’ve logged back in; it should show something different from the last time you ran it.
show port all | exclude Detach	This should show more ports than before the license was added.

Once you have added your license file(s), it’s time to upgrade EFOS, here are the commands with an explanation of what they do:

Command	Explanation
show bootvar	Shows the images: active, backup, current-active and next-active.
copy active backup	Copies the active image to the backup slot, just in case.
show bootvar	Verify that the above worked.
show version	Shows the version actually running.
copy http://10.0.0.80/FastPath-EFOS-esw-qcp_td3-qcp_td3_x86_64-LX415R-CNTRF-BD6IOQHr3v7m0b4.stk active	Copies the image on the web server to the active slot.
show bootvar	Verify the last command.
reload	Reboot the switch.
show version	Verify the upgrade worked.

Now that we have upgraded our EFOS image, it’s time to apply the RCF. There really is no point in doing any additional configuration until we’ve done this since we have to destroy our configuration before applying the RCF anyway. Be sure that you’re only applying the default RCF if you haven’t added any licenses. If you have added licences, you need to uncomment the lines that configure the additionally licensed ports. Here are the commands with an explanation:

Command	Explanation
erase startup-config	Clears the startup configuration; overlaying an RCF on an existing configuration can have negative consequences.
copy http://10.0.0.80/BES-53248_RCF_v1.7-Cluster-HA.txt nvram:script BES-53248_RCF_v1.7-Cluster-HA.scr	Copies the txt file from the web server to NVRAM as a script and renames it in the process.
script list	Gives you a directory listing of available scripts to confirm the above transfer worked.
script apply BES-53248_RCF_v1.7-Cluster-HA.scr	Applies the contents of the RCF to the configuration.
show running-config	Displays the new running configuration to verify successful application of the RCF.
write memory	Commits the new configuration to non-volatile memory.
reload	Reboots the switch so the new configuration can take effect.

There, you’re all done, now you can proceed with the official guide on (re)configuring the management IP address, ssh and so on. Good luck, and if you have an experience that strays from the above, please let me know so I can update the post.

How to run nested ESXi 7 on QNAP’s Virtualization Station

**Important update at the end that should be read prior to wasting your time.

This weekend I found myself in need of an additional ESXi host so instead of acquiring new hardware I figured I might as well run it nested on my beefy QNAP TVS-h1288X with its Xeon CPU and 72GB of RAM. I already use the QEMU-based Virtualization Station (VS) for hosting my primary domain controller and it’s my go-to host for spinning up my ONTAP simulators so I figured nesting an ESXi VM shouldn’t be that difficult. What I hadn’t taken into account however is the fact that VMware has deprecated the VMKlinux Driver Stack, removing support for all of the NICs VS makes available to you in the GUI while provisioning new virtual machines. At first I researched injecting drivers or rolling my own installation ISO but these seemed overly complicated and somewhat outdated in their documentation. Instead I decided to get inside of VS and see if I could do something from that angle, it was after all simply their own version of QEMU.

I started the installation process, but it wasn’t long before I received this error message:

ESXi 7 No Network Adapters error message

I shut down the VM, and changed the NIC type over and over eventually exhausting the five possibilities presented in the VS GUI:

Not even the trusty old e1000 NIC, listed as Intel Gigabit Ethernet above, worked… Over to the CLI I went. Some Googling around on the subject led me to believe there was a command that would produce a list of supported virtualized devices, but the commands I was finding were for native KVM/QEMU installs and not intended for VS, so I poked around and came across the qemu-system-x86_64 command. When I ran it with the parameters -device help, it produced the following abbreviated list:

./qemu-system-x86_64 -device help
[VL] This is a NROMAL VM
Controller/Bridge/Hub devices:
name "i82801b11-bridge", bus PCI
................<SNIP>
Network devices:
name "e1000", bus PCI, alias "e1000-82540em", desc "Intel Gigabit Ethernet"
name "e1000-82544gc", bus PCI, desc "Intel Gigabit Ethernet"
................<SNIP>
name "vmxnet3", bus PCI, desc "VMWare Paravirtualized Ethernet v3"
................<SNIP>

That last line is exactly what I was looking for. This led me to believe that QEMU should be able to support the VMXNET3 network device, so I cd’d over to the .qpkg/QKVM/usr/etc/libvirt/qemu directory, opened up the XML file associated with my ESXi VM and changed the following section from:

<interface type='bridge'>
      <mac address='00:50:56:af:30:fe'/>
      <source bridge='qvs0'/>
      <model type='e1000'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>

to:

<interface type='bridge'>
      <mac address='00:50:56:af:30:fe'/>
      <source bridge='qvs0'/>
      <model type='vmxnet3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
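
For anyone who prefers to see the change expressed as code, here is a minimal Python sketch of the same substitution. It works on the XML text only; the function name set_nic_model is my own invention and not part of VS or libvirt. Note that ElementTree may rewrite namespace prefixes when it serializes a full domain file, so treat this as an illustration of the edit rather than a tool to point at the live config.

```python
import xml.etree.ElementTree as ET


def set_nic_model(xml_text, new_model="vmxnet3"):
    """Set the <model type=...> of every <interface> element to new_model.

    Returns the rewritten XML as a string plus the number of interfaces changed.
    (Hypothetical helper for illustration only.)
    """
    root = ET.fromstring(xml_text)
    changed = 0
    # iter() also visits the root, so a bare <interface> fragment works too.
    for iface in root.iter("interface"):
        model = iface.find("model")
        if model is not None and model.get("type") != new_model:
            model.set("type", new_model)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# The interface definition from this post, before the change.
SAMPLE = """<interface type='bridge'>
  <mac address='00:50:56:af:30:fe'/>
  <source bridge='qvs0'/>
  <model type='e1000'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>"""

new_xml, changed = set_nic_model(SAMPLE)
print(changed)
print(new_xml)
```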

I saved the file and, for good measure, I also restarted VS. I booted the VM, and I received the same error message as above. This time I cd’d over to .qpkg/QKVM/var/run/libvirt/qemu and had a look at the XML file that represented the running config of the VM, and the NIC was still set to e1000. It took me a bit of hacking around to determine that in order to make this change persistent, I needed to edit the XML file using:

virsh edit 122d6cbc-b47c-4c18-b783-697397be149b

That last string of text being the UUID of the VM in question. If you’re unsure of what the UUID is of a given VM, simply grep “qvs:name” from all the XML files in the .qpkg/QKVM/usr/etc/libvirt/qemu directory. I made the same change as I had previously, exited the editor and booted the VM once again…This time, success! My ESXi 7.0u2 host booted fine and didn’t complain about the network. I went through the configuration and it is now up and running fine. The GUI still lists the NIC as Intel Gigabit Ethernet.
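
The grep for the VM name can also be sketched in Python. This is only a rough equivalent: it assumes each XML file in that directory is named after the VM’s UUID, and the sample file content below (including the shape of the qvs:name element) is made up for the demo, not copied from a real VS config.

```python
import tempfile
from pathlib import Path


def find_vm_names(config_dir, marker="qvs:name"):
    """Map each XML file's stem to the lines in it that contain marker."""
    hits = {}
    for xml_file in sorted(Path(config_dir).glob("*.xml")):
        lines = [
            line.strip()
            for line in xml_file.read_text(errors="replace").splitlines()
            if marker in line
        ]
        if lines:
            hits[xml_file.stem] = lines
    return hits


# Demo against a throwaway directory with one hypothetical config file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "122d6cbc-b47c-4c18-b783-697397be149b.xml"
    sample.write_text(
        "<domain>\n  <metadata>\n    <qvs:name>esxi7</qvs:name>\n  </metadata>\n</domain>\n"
    )
    demo_hits = find_vm_names(tmp)

print(demo_hits)
```

On the NAS itself you would pass the .qpkg/QKVM/usr/etc/libvirt/qemu directory instead of the temporary one.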

I’m reluctant to make any changes to the VM using the GUI at this time for fear of the NIC information changing, but I’m okay not using the GUI if it means being able to nest ESXi 7 on Virtualization Station for testing purposes.

**Update: While the ESXi 7.0u2 VM would boot fine, I have been unable to actually add it to my vCenter server. I tried running the VM on my physical ESXi host and was able to add it to vCenter, then I powered down the ESXi VM and imported it into VS. The import worked, but then it showed as disconnected from vCenter. Next I tried importing vCenter into the virtualized ESXi host, but that won’t boot as VS isn’t presenting the VT-x flag even though I have CPU passthrough enabled. I’m still going to try and get this going, but won’t have time to devote to troubleshooting VS for a couple of days.

macOS How-To Guide: Installing vCenter Server Appliance from ISO directly

Ever since macOS started enforcing code signing there’s been the occasional hoop to jump through to get non-App Store software to run. Typically it’s as easy as right-clicking on the binary and choosing Open, which is all well and good until that application needs to launch a subsequent one. Recently I downloaded the ISO for vCenter Server Appliance and double-clicked on it to mount it. I then navigated to:

/Volumes/VMware VCSA/vcsa-ui-installer/mac

Once here, I double-clicked Installer[.app] and of course got the following:

I dutifully two-finger-clicked (i.e., right-clicked), chose Open and then Open again, and proceeded with the initial vCenter configuration. Not too far into the process, Installer wanted to call ovftool, but since this was a direct launch, I received a message similar to the previous one:

At this point, I couldn’t find a way to work around this security control directly, so I decided to do some research. As with most things VMware-related, I came across a post by William Lam, specifically one entitled How to exclude VCSA UI/CLI Installer from MacOS Catalina Security Gatekeeper?. This post led me to this command:

sudo xattr -r -d com.apple.quarantine <directory of ISO contents>

but that assumes you’ve copied the contents of the ISO to another drive rather than running the installer directly from the mounted image, which behaves as a read-only file system much like the DVD media it represents. I copied all ~8GB of the ISO to my local SSD, issued the command above and, sure enough, it was going to work. I wasn’t happy with this, however, and was determined to run this installer from the ISO as intended. The root of the problem is that when you mount the ISO, either by double-clicking on it in Finder or issuing an hdiutil mount image.iso, it mounts the file system with the quarantine option:

I did some quick research on how to mount an ISO without this option using any of mount, hdiutil, diskutil or Disk Utility[.app] to no avail. I did notice however that after I unmounted the ISO using diskutil unmount /dev/disk3 that the image remained in the sidebar albeit greyed out:

I decided to right-click on it here and choose Mount:

A quick check over in Terminal[.app] and voilà, no quarantine!
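
As a rough illustration of that check, the Python sketch below pulls the option list out of a single line of mount output and looks for quarantine. The two sample lines are my own approximation of what macOS prints for a mounted ISO (the original screenshots aren’t reproduced here), so treat the exact option names other than quarantine as assumptions.

```python
import re


def mount_options(mount_line):
    """Return the options from the trailing '(...)' of one line of mount output."""
    match = re.search(r"\(([^()]*)\)\s*$", mount_line)
    if not match:
        return []
    return [opt.strip() for opt in match.group(1).split(",")]


def is_quarantined(mount_line):
    """True when the mount line lists the quarantine option."""
    return "quarantine" in mount_options(mount_line)


# Hypothetical sample lines, modelled on the format of macOS mount output.
BEFORE = "/dev/disk3 on /Volumes/VMware VCSA (cd9660, local, nodev, nosuid, read-only, noowners, quarantine)"
AFTER = "/dev/disk3 on /Volumes/VMware VCSA (cd9660, local, nodev, nosuid, read-only, noowners)"

print(is_quarantined(BEFORE))
print(is_quarantined(AFTER))
```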

At this point I navigated to the Installer once again and was able to run through to completion without any security notifications. I’m not sure if this is a well known workaround but I didn’t find any reference to it online so hopefully someone will benefit from it.

TL;DR: Instructions for running installers directly from read-only mounted file systems on macOS:

Double-click the image so that it mounts normally with DiskImageMounter.app
Issue the following CLI command to unmount it: diskutil unmount /dev/diskX
Launch Disk Utility.app
Right-click the image name in the side bar and choose Mount
Your installer can be run directly from the r/o filesystem

Installing the ONTAP 9.7 simulator in Fusion 12.1

At the time of writing, 9.8 is available, but I’m specifically writing this for someone who is trying to install 9.7 and having problems. Before I get into the actual simulator installation, we need to cover some things around VMware Fusion first.

With regard to networking, VMware Fusion can provide three different interface types:

Bridged – this type puts the interface directly on the same LAN as your Mac. This is great if you want the VM to appear as though it’s on the network that your Mac is using.
Host-only – this is a completely isolated network. The only hosts that can access it are those on your Mac configured with this type of interface. There is no external access with this type.
NAT – this is similar to host-only, but allows the host with this type to reach out of the Mac, such as for Internet access.

If you want more details on this please go read this KB.

By default, the simulator has four network interfaces. The first two, e0a/e0b, are for the ClusterNet network, the back-end network used by cluster nodes to communicate with each other, and should be of type host-only. The second two, e0c/e0d, are for client access and management access; these are of type NAT but can also be set to bridged. If you use NAT, then VMware will assign IP addresses via DHCP based on the configuration of the VMNET8 interface settings; to view this, cat the file located here:

/Library/Preferences/VMware\ Fusion/vmnet8/dhcpd.conf

Mine looks like this:

allow unknown-clients;
default-lease-time 1800;                # default is 30 minutes
max-lease-time 7200;                    # default is 2 hours

subnet 172.16.133.0 netmask 255.255.255.0 {
	range 172.16.133.128 172.16.133.254;
	option broadcast-address 172.16.133.255;
	option domain-name-servers 172.16.133.2;
	option domain-name localdomain;
	default-lease-time 1800;                # default is 30 minutes
	max-lease-time 7200;                    # default is 2 hours
	option netbios-name-servers 172.16.133.2;
	option routers 172.16.133.2;
}
host vmnet8 {
	hardware ethernet 00:50:56:C0:00:08;
	fixed-address 172.16.133.1;
	option domain-name-servers 0.0.0.0;
	option domain-name "";
	option routers 0.0.0.0;
}

What this means is that any interface set to NAT in my instance of Fusion will receive DHCP addresses in the subnet 172.16.133.0/24, but the DHCP pool itself is only 172.16.133.[128-254]. The subnet mask will still be 255.255.255.0 (ie: /24) and the default gateway is 172.16.133.2 as that is the internal interface of the virtual router created to do the NAT; .1 is held by the “external” interface which you can view by issuing an ifconfig vmnet8 at the command prompt. Note, this interface is created when Fusion is launched and torn down when you quit. If you set the interface type to bridged, those interfaces will get DHCP addresses from the same LAN that the Mac is connected to.
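
If you’d rather let code do the reading, here is a small Python sketch that pulls the same three facts (subnet, DHCP pool and gateway) out of the config text using the standard ipaddress module. The parsing is deliberately naive: it assumes a single subnet block and takes the first range and routers statements it finds. The function name is my own.

```python
import ipaddress
import re

# Trimmed copy of the subnet block from my dhcpd.conf above.
DHCPD_CONF = """
subnet 172.16.133.0 netmask 255.255.255.0 {
	range 172.16.133.128 172.16.133.254;
	option routers 172.16.133.2;
}
"""


def parse_vmnet_dhcp(conf_text):
    """Extract network, DHCP pool bounds and gateway from dhcpd.conf text."""
    subnet = re.search(r"subnet\s+(\S+)\s+netmask\s+(\S+)\s*\{", conf_text)
    pool = re.search(r"range\s+(\S+)\s+(\S+);", conf_text)
    router = re.search(r"option\s+routers\s+(\S+);", conf_text)
    if not (subnet and pool and router):
        raise ValueError("subnet, range or routers statement not found")
    return {
        "network": ipaddress.ip_network(f"{subnet.group(1)}/{subnet.group(2)}"),
        "pool_start": ipaddress.ip_address(pool.group(1)),
        "pool_end": ipaddress.ip_address(pool.group(2)),
        "gateway": ipaddress.ip_address(router.group(1)),
    }


info = parse_vmnet_dhcp(DHCPD_CONF)
# Number of addresses the DHCP server can hand out, inclusive of both ends.
pool_size = int(info["pool_end"]) - int(info["pool_start"]) + 1
print(info["network"], info["gateway"], pool_size)
```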

On to the actual installation…

First thing you need to do is download the OVA from NetApp:

Go to https://support.netapp.com/
Log in (yes, required)
At the top click Downloads → Product Evaluation
Click “Data ONTAP™ Simulator“
Agree to the terms
Download the OVA and license keys for the version you’re looking for.

Now that you have the OVA, you’re ready to import it into Fusion. Launch Fusion, then click the + sign and choose Import:

Browse for and open the downloaded OVA:

Now click continue:

Give the folder you’re going to store it in a name and click save, I like to name it after the node:

Fusion will import the OVA and present you with the settings. You can modify them if you want, but for now I’m going to leave them as default. Click Finish:

You’ll likely be asked if you’d like to upgrade the VM version, don’t bother:

At this point the vSIM will boot for the first time. I believe the official instructions tell you to hit CTRL-C, halt the boot, call for the maintenance menu and then issue an option 4, but if this is the first node you do not have to do that. The root aggregate is automatically created:

First boot with aggr0 creation

Now you can open a browser and point it at the IP address listed on your screen; in my case it will be https://172.16.133.132/, but it may be different for you. You will get a certificate error, but bypass that to access the GUI to finish the configuration. If you do not get the following screen or get no site at all, there’s something else wrong. Also, hover your mouse over the node in the Health card; if the serial number doesn’t appear, refresh the web page, otherwise configuration will fail:

It should look like this:

Now enter all the required information, since the IP addresses are being statically assigned, I’m choosing ones outside of the DHCP range, as should you:
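
A quick way to sanity-check a static address before typing it in: it should be inside the vmnet8 subnet, outside the DHCP pool, and not one of the addresses Fusion already uses (.1 and .2 in my case). The Python sketch below is a hypothetical helper of my own, and 172.16.133.50 is just an example address, not the one from my setup.

```python
import ipaddress


def is_safe_static(address, network, pool_start, pool_end, reserved=()):
    """True if address is usable as a static IP alongside the DHCP pool."""
    ip = ipaddress.ip_address(address)
    net = ipaddress.ip_network(network)
    first = ipaddress.ip_address(pool_start)
    last = ipaddress.ip_address(pool_end)
    if ip not in net:
        return False  # wrong subnet entirely
    if ip in (net.network_address, net.broadcast_address):
        return False  # not assignable to a host
    if first <= ip <= last:
        return False  # could be handed out by DHCP
    return ip not in {ipaddress.ip_address(r) for r in reserved}


VMNET8 = ("172.16.133.0/24", "172.16.133.128", "172.16.133.254")
FUSION_ADDRESSES = ("172.16.133.1", "172.16.133.2")

print(is_safe_static("172.16.133.50", *VMNET8, reserved=FUSION_ADDRESSES))
print(is_safe_static("172.16.133.132", *VMNET8, reserved=FUSION_ADDRESSES))
```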

I don’t check the “single-node” box, it will still work as a single node if you don’t but if you do, it removes the ClusterNet interfaces completely. I like having those interfaces for experimentation and teaching purposes; also it keeps the door open to adding a second node, which I will cover in a follow up post if there is anyone interested. Now click Submit:

At this point I like to start pinging either the cluster IP I specified or the node IP so I can see when the cluster gets configured since the browser doesn’t always refresh to the new IP address:
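
If you want to script the waiting, here is a loose Python equivalent. To be clear, it is not a ping: sending ICMP from Python generally needs elevated privileges, so this sketch tries a TCP connection to port 443 instead, on the assumption that the GUI answering on HTTPS is what you’re really waiting for. The function names, retry counts and the example address in the usage note are all my own choices.

```python
import socket
import time


def https_port_open(host, port=443, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(host, probe=https_port_open, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe(host) until it returns True.

    Returns the attempt number that succeeded, or None if every attempt failed.
    probe and sleep are injectable so the loop can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        if probe(host):
            return attempt
        sleep(delay)
    return None
```

Usage would be something like wait_until_up("172.16.133.50"), substituting whichever cluster or node IP you entered in the previous step.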

Once ping starts responding, go ahead and visit the new IP address via your browser:

Now the person I wrote this blog entry for isn’t getting the GUI above, but instead the GUI for the out-of-band interface for a UCS server, so the IP space their vmnet8 is using collides with production IP space. This can be verified at this point by disconnecting any Ethernet connections and turning off WiFi, once that is done, reload the browser and the IP conflict should be resolved until you’re connected once again. To resolve it permanently, that person will need to edit the dhcpd.conf file for vmnet8 mentioned above, using a subnet known to not conflict. Here’s an example, alternative dhcpd.conf:

allow unknown-clients;
default-lease-time 1800;                # default is 30 minutes
max-lease-time 7200;                    # default is 2 hours

subnet 10.0.0.0 netmask 255.255.255.0 {
	range 10.0.0.128 10.0.0.254;
	option broadcast-address 10.0.0.255;
	option domain-name-servers 10.0.0.2;
	option domain-name localdomain;
	default-lease-time 1800;                # default is 30 minutes
	max-lease-time 7200;                    # default is 2 hours
	option netbios-name-servers 10.0.0.2;
	option routers 10.0.0.2;
}
host vmnet8 {
	hardware ethernet 00:50:56:C0:00:08;
	fixed-address 10.0.0.1;
	option domain-name-servers 0.0.0.0;
	option domain-name "";
	option routers 0.0.0.0;
}

This changes the subnet in use to 10.0.0.0/24 with the DHCP range being 10.0.0.[128-254] and the default gateway of VMs using it to 10.0.0.2.
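
Since 10.0.0.0/24 is itself a very common range, it’s worth checking whatever subnet you pick against the networks you actually need to reach. A minimal Python sketch of that check follows; the list of in-use networks is entirely hypothetical and the function name is mine.

```python
import ipaddress


def conflicting_networks(candidate, in_use):
    """Return the entries of in_use that overlap the candidate subnet."""
    cand = ipaddress.ip_network(candidate)
    return [net for net in in_use if cand.overlaps(ipaddress.ip_network(net))]


# Hypothetical list of networks reachable from the Mac (office, VPN, home LAN).
IN_USE = ["10.0.0.0/16", "192.168.1.0/24"]

print(conflicting_networks("10.0.0.0/24", IN_USE))
print(conflicting_networks("172.31.250.0/24", IN_USE))
```

An empty list means the candidate subnet doesn’t collide with anything you listed; anything else means pick a different one.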

This is where I’m going to end this post for now as the simulator is now accessible via HTTPS and SSH and ONTAP is ready to be configured. You will still need to assign disks, create a local storage tier (aggregate) as well as an SVM with volume(s) for data among other things. The intent of this post was to get this far, not to teach ONTAP. If you’d like to see a post around either adding a second node to the cluster or configuring ONTAP on the first one, please leave a comment and I’ll try and get around to it.

ONTAP 9.8 has been announced

Timed perfectly with NetApp INSIGHT 2020 is the annual ONTAP payload announcement. Once again, there’s a lot in this payload, so I will simply deliver a list of bulleted sections, addressing as many of the changes as I’m able. I’ll provide additional detail on the ones I feel are the most interesting. For a full rundown, please consult the release notes or start a conversation with me on Twitter.

FlexGroup Volume Enhancements

Async Delete
- Delete large datasets rapidly from the CLI.
  - This is great for those high file count deployments.
Backup enhancements
- 1,023 snapshots supported
- NDMP enhancements
FlexVol to FlexGroup in-place conversion enhancements
VMware datastore support
Proactive resizing of constituent volumes

FlexCache Volumes, a true global namespace

SMB support added with distributed locking
10x origin to cache fan-out ratio, now 1:100
Caching of SnapMirror secondary volumes
Cache pre-population

Data Visibility

File system analytics, viewable in System Manager
- Enabled on a per-volume basis
- Can also be queried via API access
QoS for Qtrees
- IOPS and throughput policies available per qtree object
- Managed via REST API or CLI
- Qtree-level statistics
- NFS only in this release, no adaptive QoS

All-SAN Array (ASA) enhancements

Persistent FC Ports
- Symmetric active/active host-to-LUN access
- Each node on the ASA will maintain a “shadow FC LIF”, reducing SAN failover times even further.
Larger Capacities
- Max LUN = 128TB
- Max FlexVol = 300TB
  - These limit increases are on the ASA only
MCC-IP support
Priced ~20% less than unified platforms

ONTAP S3

Preview-only in 9.7, GA in 9.8
System Manager integration
Bucket access policies
Multiple buckets per volume
TLS 1.2 support
Multi-part upload
ONTAP S3 is not a replacement for a dedicated, global object store

Storage Efficiency Enhancements

FabricPool
- Tiering from HDD aggregates
- Object tagging (For information life cycle policies)
- Increased cooling period (max 183 days)
- Cloud retrieval
Storage efficiencies
- Differentiation of hot and cold data for application of different compression methods: 8k compression group for hot, 32k for cold
- Deduplication prior to compression

Simplification

Upgrade directly to two versions newer without passing via intermediary version
Headswaps using nodes running the latest version of ONTAP can be used on nodes running versions of ONTAP up to two versions behind
REST API enhancements
- ZAPI to REST mapping documentation
- ONTAP version information in API documentation
System Manager Improvements
- Single-click firmware upgrades
- File system analytics
  - Granular details about your NAS file systems
- Hardware and Network visualization
- Data Protection Enhancements
  - Reverse resync
Simpler Compliance
- Volume move support, no second copy required
- WORM as the default

Security and Data Protection Enhancements

Secure purge
- crypto shred individual files
IPSec
- encrypted network traffic, regardless of protocols
  - Simplifies secure NFS, no need for Kerberos
  - iSCSI traffic on the wire can now be encrypted
Node root volume encryption
MetroCluster
- Unmirrored aggregate support
SnapMirror
- SnapMirror Business Continuity (SM-BC) provides automated failover of synchronous SnapMirror relationships for application-level, granular protection
  - These are non-disruptive
  - SM-BC is preview-only in 9.8 and SAN-only.
- SnapMirror to Object Store
  - Google Cloud, Azure, or AWS
  - Metadata included so Object Store data is a complete archive
  - Efficiencies maintained

Virtualization Enhancements

FlexGroup volumes as VMware datastores
SnapCenter backup support
64TB SAN datastore on the ASA
SRA support for SnapMirror Synchronous
Support for Tanzu storage

That sums up the majority of the improvements, looking forward to this release coming out. See you at NetApp INSIGHT 2020!

IOPS.ca

Storage and Virtualization (and whatever else I feel like) by Chris Maki
