Transfert de données déchargées du stockage Windows (ODX) et le cas de la prélecture manquante

Qu'est-ce qu'ODX Windows Storage Offloaded Data Transfer

Comment un client a connu des performances plus lentes que prévu en comparant READ/WRITE standard à ODX - et comment l'équipe d'ingénierie a tiré parti des métriques de Qumulo Core, du cycle de publication rapide et de l'expérience de mise à niveau transparente pour découvrir et résoudre le mystère !

Qu'est-ce qu'ODX (Windows Storage Offloaded Data Transfer) ?

Windows Storage Offloaded Data Transfer (ODX) fait partie du protocole SMB (Server Message Block) où vous pouvez dire au système de fichiers Qumulo de copier un morceau d'un fichier d'un fichier à un autre sans envoyer les données via la machine cliente. ODX améliore les performances de copie un peu comme la copie côté serveur (SSC). Il élimine la latence du réseau entre le client et le serveur et permet à Qumulo Core d'optimiser la taille et la mise en mémoire tampon des E/S. En comparant ODX contre SSC, nos résultats de tests internes montrent qu'ODX est 7 fois plus rapide et SSC 4 fois plus rapide que la séquence READ/WRITE typique qui se produit lors de la copie d'un fichier vers un autre emplacement sur le même serveur.

Lire le mystère de la latence

En 2021, nous avons fourni le support ODX pour Qumulo Core version 3.0.0 ou supérieure. Peu de temps après l'expédition, un client a signalé que les performances d'ODX étaient plus lentes que prévu.

Sachant qu'ODX lit à partir d'un fichier et écrit dans un autre, nous avons d'abord soupçonné que les performances d'écriture étaient le problème. En règle générale, les opérations de lecture prennent au maximum quelques millisecondes (merci Mise en cache et prélecture SSD!). Pourtant, lorsque nous avons regardé le latence de lecture, nous avons vu qu'il s'agissait d'une plus grande partie de la latence ODX que ce à quoi nous nous attendions - de l'ordre de 25 ms (environ), avec des pics plus élevés à 40 ou 50 ms. Qu'est-ce qui se passe avec ça? Voici une visualisation de la durée de vie de la phase de lecture d'une opération ODX :

visualisation de la durée de vie de la phase de lecture d'une opération ODX

Figure 1: Durée de vie d'une requête ODX – Lire

Nous avons examiné nos données de performances de séries temporelles à faible granularité que nous collectons via notre pipeline de surveillance cloud. En moyenne, notre taux de réussite du cache RAM et nos mesures de latence semblaient bonnes. Mais alors pourquoi les lectures spécifiques à ODX étaient-elles généralement lentes ?

Le premier indice

Au fur et à mesure que nous nous rapprochions de la découverte de la cause première, nous avions besoin de mesures de granularité plus élevées spécifiques à une seule charge de travail ODX sur ce cluster, sans le bruit des autres opérations. Heureusement, nous avons un outil, appelé trigger, which we built internally, which collects this kind of distributed performance data when needed. We worked with our customer to profile some specific ODX sequences, and used trigger to collect data during those sequences. We learned a couple of things. (Refer to Figure 1, above.)

In each read operation, most blocks were in RAM cache, but each read was missing cache for a few blocks. When this happens, the file system transaction has to read the data from the storage devices themselves. In this case, the metrics told us the data was on spinning disks.

Note: In my mind I was thinking “of course, it had to be the spinning disks…it’s ALWAYS the spinning disks.” I should have known that seeks are the only thing that could have pushed the latency up to 50ms. Spinning disks and I don’t have the best relationship. They have an infuriating but useful habit of painfully exposing unintended behavior in performance sensitive code paths.

Spinning disks can stream data efficiently when requests are queued in a contiguous order; starting at the beginning of a file and reading until the end. However, in this case, we were only missing a few blocks from cache per ODX read operation – which shows up to spinning disk as a lot of discontiguous reads. That’s bad news for spinning disk latency.

Normally, file system read operations don’t experience this kind of spinning disk latency, because we have an effective prefetcher. The prefetcher guesses what’s going to be read next, and loads it into cache ahead of time. The prefetcher has an IO pattern optimized to read from spinning disks.

So, why wasn’t the prefetcher working?

Our prefetcher has metrics that describe how successful it is on average. These metrics told us that everything being prefetched was being read, but some data the file system read wasn’t getting prefetched…that’s really weird! Maybe the ODX operations are reading out of order or jumping around randomly? The prefetcher doesn’t do well with random IO.

We checked the operation logs to see the IO pattern. Operation logs are a text log file tool that tells us information about each operation, like offset, size, and cache hit or miss. However, we hadn’t anticipated we’d need the ODX read sizes and offsets in our operation logs, and as a result that code hadn’t been hooked up! So we had to write some more code, and wait for an upgrade…

How Prefetch Works (Example Illustrations)
How Prefetch Works

Figure 2: How Prefetch Should Work

What prefetch state looks like during a contiguous read after second read

Figure 3: How Prefetch Should Work (after the second read)

How prefetch works after third read

Figure 4: How Prefetch Should Work (after third read)

Prefetch state during contiguous read proceeds until end of file

Figure 5: How Prefetch Should Work

Finding the Culprit

Within a couple of weeks, we had shipped new code, and using our super awesome upgrades, the customer was able to upgrade their Qumulo storage system to the latest version soon after. They re-ran the test workload for us. Using the improved operation logging, we found another puzzle! By reading operation logs (I felt like I was in the matrix reading the green numbers go by) we found that our ODX protocol code was sending the file system contiguous read requests for 8650752 bytes each. When we did the math, that comes out to 8 mebibytes + 256KiB. But! The operation log showed that each ODX request was actually only getting 8388608 bytes from cache (exactly 8 mebibytes).

I remember at first my eyes glazed over…was I misreading the numbers? How is each request missing cache for only 256KiB of the 8MiB it was asking for? How is the prefetcher only fetching a subset of each request? That’s impossible!

Note: The first stage of every bug is denial - believing that the data is wrong. Luckily, this wasn’t my first rodeo. I quickly stifled my denial, and started asking myself “how could this be possible”?

I started explaining how the prefetcher works to my coworker who was working with me on the investigation. The prefetcher should be getting ahead of the read workload, such that the IO size of the read workload doesn’t matter. The prefetcher always fetches data in large chunks and in such a way that the queueing on the spinning disks is well-ordered to avoid seeks. And then it hit me…large chunks…how big are the prefetch chunks? OH…they are 8 MiB…wait that’s spooky. And how does it get ahead of the read workload? OH, we kick off one prefetch per read request…wait…putting those two things together…if we only kick off a single prefetch per read request, and the prefetch chunk size is at most 8 MiB, and the reader is reading chunks of 8MiB + 256KiB…then yeah, obviously, the prefetcher will never get ahead…it will always prefetch the next 8MiB, leaving a tail of 256KiB uncached. WOW. (Fist pump). Ok. We found the smoking gun.

Example Illustrations of the Prefetch Problem
Contiguous read but with read IO size greater than prefetch IO size

Illustration 1: Prefetch Problem (first read of a file)

Contiguous read but with read IO size greater than prefetch IO size

Illustration 2: Prefetch Problem (after the second read)

Contiguous read but with read IO size greater than prefetch IO size

Illustration 3: Prefetch Problem (after the third read)

Prefetch problem after the third read of a file

Illustration 4: Prefetch Read IO is greater than prefetch IO size

Contiguous read with read IO size greater than prefetch IO size after the fourth read

Illustration 5: Prefetch Problem (after the fourth read)

Mystery Explained

Figuring out the bug is just the beginning. There were immediately three questions that came to mind:

  1. Why doesn’t this same problem happen in our standard file system read tests?
  2. Why is ODX using an IO size of 8 MiB + 256 KiB, larger than the prefetcher’s max IO size, and
  3. Why didn’t our ODX specific performance tests catch this?
1. Why doesn’t this problem happen in our standard file system read performance tests?

Our file system protocol scheduler processes all protocol requests. Typical protocol request size is 1MiB. When we receive protocol read requests we process them and potentially combine requests to optimize read performance. We naturally have a maximum combined IO size for these file system requests. If you let the IO size grow too large then the request can be bottlenecked on some large single threaded thing it was doing (for example CPU, memory allocation). So our file system scheduler makes the maximum combined read request 8MiB, the same as our prefetcher’s max IO size. This was by design. So in these workloads, the prefetcher is prefetching everything, although it isn’t ever getting more than 8MiB ahead if the combined IO size is also 8 MiB…(scratches headthat also wasn’t intended behavior…).

2. Why is ODX using an IO size of 8 MiB + 256 KiB?

The ODX operation itself goes through the scheduler, but doesn’t put its internal read request through the scheduler. ODX was then setting the maximum IO read size it would request, independent of the scheduler.

We found that the scheduler’s max IO size and ODX max IO size were using different functions:

fs_target_transaction_size()
fs_max_transaction_size()

The scheduler was using fs_target_transaction_size, which is a minimum of fs_max_transaction_size and 8MiB! Whereas the ODX code was intuitively using fs_max_transaction_size - which is computed, and turned out to be 8MiB + 256KiB.

So there was our root cause. We fixed the ODX protocol code to use fs_target_transaction_size, and voilà, no more uncached tails to the read requests.

3. Why didn’t our performance tests of ODX catch this?

Our ODX performance tests did, in fact, exhibit the behavior of uncached tails. But, the performance wasn’t that bad; because, in this performance test, the data was small enough to fit in the SSD cache!

Looking back at Figure 1 (above), you can see that the latency of reads from SSD is much lower than spinning disk. So it wasn’t as obvious from the top line performance test results that the prefetcher was doing the wrong thing. The test wasn’t explicitly checking the prefetcher’s metrics.

Once we changed the performance test to do a second step of evicting the data from SSD, and then reading it back from spinning disk, the topline performance was significantly worse than it should have been -- revealing the bug. We fixed the bug, observed this automated performance test improve, and shipped the improvement in 2 weeks. Once again leveraging Qumulo’s easy upgrades, our customers were able to upgrade soon after the fix and get the improved ODX performance.

Learn more
The New Enterprise Data Storage Playbook

The New Enterprise Data Storage Playbook

Unstructured data is everywhere, and it’s growing at uncontrollable rates. CIOs and IT leaders are turning to scalable storage solutions to manage this data and remain competitive. Block storage, object storage, and file storage each have unique capabilities and limitations, meaning enterprise-level storage systems are not “one size fits all” solutions.

Learn why unstructured data matters — and which storage solution is right for you.

Download the Playbook

Share this post