Tracking Down Cookie Crumbs: Investigating a Storage Performance Anomaly


This article explains how we identified and fixed a sporadic storage performance anomaly that we observed in one of our benchmarks.

At Qumulo, we build a high-performance file data platform and continuously ship updates every two weeks. Shipping enterprise software this often requires a comprehensive test suite to guarantee that we have built a high-quality product. Our performance test suite runs continuously across all of our platform offerings and includes file performance tests run against industry-standard benchmarks.

Enter the storage performance anomaly

Over a period of a few months, we observed variability in our multi-stream read and write benchmarks. These performance tests use IOzone to generate concurrent reads and writes across the cluster and measure the aggregate throughput across all connected clients. In particular, we observed a bimodal distribution: most runs hit a consistently stable performance target, while a second, smaller set of results was sporadically slower by about 200-300 MB/s, which is roughly 10% worse. Here is a chart that shows the performance results.

[Chart: benchmark performance results]

Characterizing the problem

When analyzing a storage performance anomaly, the first step is to remove as many variables as possible. The sporadic results had initially been seen across hundreds of software builds over a period of several months. To simplify things, we kicked off a series of benchmark runs, all on the same hardware and a single software version. This series of runs showed the same bimodal distribution, which meant the variability could not be explained by hardware differences or build-specific software regressions.

After reproducing the bimodal performance on a single build, we compared detailed performance data collected from a fast run and a slow run. The first thing that jumped out was that the inter-node RPC latencies were much higher for the bad runs than for the good runs. That could have happened for any number of reasons, but it hinted at a network-related root cause.

Exploring TCP socket performance

With that in mind, we wanted more detailed data about our TCP socket performance during these tests, so we enabled our performance test profiler to continuously gather data from ss. Each time ss runs, it outputs detailed statistics for every socket on the system:

> ss -tio6
State      Recv-Q     Send-Q     Local Address:Port                          Peer Address:Port
ESTAB      0          0          fe80::f652:14ff:fe3b:8f30%bond0:56252       fe80::f652:14ff:fe3b:8f60:42687
     sack cubic wscale:7,7 rto:204 rtt:0.046/0.01 ato:40 mss:8940 cwnd:10 ssthresh:87 bytes_acked:21136738172861 bytes_received:13315563865457 segs_out:3021503845 segs_in:2507786423 send 15547.8Mbps lastsnd:348 lastrcv:1140 lastack:348 pacing_rate 30844.2Mbps rcv_space:1540003
…

Each socket on the system corresponds to an entry in the output.

As you can see in the sample output, ss dumps its data in a way that is not very friendly for analysis. We took the data and graphed the various components to give a visual view of TCP socket performance across the cluster for a given performance test. With this graph, we could easily compare fast runs and slow runs and look for anomalies.
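
Our profiler itself isn't shown here, but a minimal sketch of the idea in C, assuming all you want is the cwnd value for each socket, could look something like this (only the ss command and the cwnd: field name come from the output above; the rest is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch: run ss and print the congestion window reported for each
 * socket so the values can be collected and graphed over time.
 * This is not Qumulo's actual profiler. */
int main(void)
{
    FILE *ss = popen("ss -tio6", "r");
    if (ss == NULL) {
        perror("popen");
        return 1;
    }

    char line[4096];
    while (fgets(line, sizeof(line), ss)) {
        const char *cwnd = strstr(line, "cwnd:");
        if (cwnd != NULL)
            printf("cwnd=%d\n", atoi(cwnd + strlen("cwnd:")));
    }

    pclose(ss);
    return 0;
}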

The most interesting of these graphs was the size of the congestion window (in segments) during the test. The congestion window (denoted by cwnd: in the output above) is crucially important to TCP performance, as it controls the amount of data outstanding in-flight over the connection at any given time. The higher the value, the more data TCP can send on a connection in parallel. When we looked at the congestion windows from a node during a low-performance run, we saw two connections with reasonably high congestion windows and one with a very small window.

[Chart: congestion window sizes for each connection during a low-performance run]

Looking back at the inter-node RPC latencies, the high latencies directly correlated with the socket with the tiny congestion window. This brought up the question - why would one socket maintain a very small congestion window compared to the other sockets in the system?

Having identified that one RPC connection was experiencing significantly worse TCP performance than the others, we went back and looked at the raw output of ss. We noticed that this ‘slow’ connection had different TCP options than the rest of the sockets. In particular, it had the default TCP options. In the output below, note that the two connections have vastly different congestion windows and that the connection with the smaller congestion window is missing sack and wscale:7,7.

ESTAB      0      0      ::ffff:10.120.246.159:8000                  ::ffff:10.120.246.27:52312                 
sack cubic wscale:7,7 rto:204 rtt:0.183/0.179 ato:40 mss:1460 cwnd:293 ssthresh:291 bytes_acked:140908972 bytes_received:27065 segs_out:100921 segs_in:6489 send 18700.8Mbps lastsnd:37280 lastrcv:37576 lastack:37280 pacing_rate 22410.3Mbps rcv_space:29200
ESTAB      0      0      fe80::e61d:2dff:febb:c960%bond0:33610                fe80::f652:14ff:fe54:d600:48673
     cubic rto:204 rtt:0.541/1.002 ato:40 mss:1440 cwnd:10 ssthresh:21 bytes_acked:6918189 bytes_received:7769628 segs_out:10435 segs_in:10909 send 212.9Mbps lastsnd:1228 lastrcv:1232 lastack:1228 pacing_rate 255.5Mbps rcv_rtt:4288 rcv_space:1131488
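
As a rough sanity check (not part of the original investigation), a connection's throughput is approximately bounded by cwnd * MSS / RTT, which is essentially the send rate ss reports. Plugging in the slow socket's numbers shows why it tops out around 213 Mbps:

#include <stdio.h>

/* Back-of-the-envelope estimate of cwnd-limited throughput, using the values
 * from the slow socket above. Illustrative only. */
int main(void)
{
    double cwnd_segments = 10;        /* cwnd:10      */
    double mss_bytes     = 1440;      /* mss:1440     */
    double rtt_seconds   = 0.541e-3;  /* rtt:0.541 ms */

    double bps = cwnd_segments * mss_bytes * 8 / rtt_seconds;
    printf("~%.1f Mbps\n", bps / 1e6);  /* prints ~212.9 Mbps */
    return 0;
}

The healthy socket (cwnd:293, mss:1460, rtt:0.183 ms) works out to roughly 18.7 Gbps by the same arithmetic, in line with its reported send rate.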

This was interesting, but looking at just one socket datapoint didn’t give us much confidence that having default TCP options was highly correlated with our tiny congestion window issue. To get a better sense of what was going on, we gathered the ss data from our series of benchmark runs and observed that 100% of the sockets without the SACK (selective acknowledgement) option maintained a max congestion window 90-99.5% smaller than every socket with non-default TCP options. There was clearly a correlation between sockets missing the SACK option and the performance of those TCP sockets, which makes sense, as SACK and other options are intended to increase performance.

[Chart: maximum congestion window sizes for sockets with and without the SACK option]

How TCP options are set

TCP options on a connection are set by passing options values along with messages containing SYN flags.  This is part of the TCP connection handshake (SYN, SYN+ACK, ACK) required to create a connection. Below is an example of an interaction where MSS (maximum segment size), SACK, and WS (window scaling) options are set.

[Diagram: TCP handshake negotiating the MSS, SACK, and window scaling options]

So where have our TCP options gone?

Although we had associated the missing SACK and window scaling options with smaller congestion windows and low-throughput connections, we still had no idea why these options were turned off for some of our connections.  After all, every connection was created using the same code!

We decided to focus on the SACK option because it’s a simple flag, hoping that would be easier to debug. In Linux, SACK is controlled globally by a sysctl and can’t be controlled on a per-connection basis. And we had SACK enabled on our machines:

>sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1

We were at a loss as to how our program could have missed setting these options on some connections. We started by capturing the TCP handshake during connection setup. We found that the initial SYN message had the expected options set, but the SYN+ACK removed SACK and window scaling.
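
As an aside, a packet capture isn't the only way to see what a connection ended up with: on Linux, the TCP_INFO socket option reports which options were negotiated on an established socket. Here is a small hypothetical helper (not part of Qumulo's code) showing the idea:

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: given a connected TCP socket, report whether the
 * handshake negotiated SACK and window scaling, plus the current cwnd. */
void print_negotiated_options(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == -1) {
        perror("getsockopt(TCP_INFO)");
        return;
    }

    printf("sack:   %s\n", (info.tcpi_options & TCPI_OPT_SACK) ? "yes" : "no");
    printf("wscale: %s\n", (info.tcpi_options & TCPI_OPT_WSCALE) ? "yes" : "no");
    printf("cwnd:   %u segments\n", info.tcpi_snd_cwnd);
}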

We cracked open the Linux kernel’s TCP stack and started searching for how the SYN+ACK options are crafted.  We found tcp_make_synack, which calls tcp_synack_options:

static unsigned int tcp_synack_options(const struct sock *sk,
                       struct request_sock *req,
                       unsigned int mss, struct sk_buff *skb,
                       struct tcp_out_options *opts,
                       const struct tcp_md5sig_key *md5,
                       struct tcp_fastopen_cookie *foc)
{
    ...
    if (likely(ireq->sack_ok)) {
        opts->options |= OPTION_SACK_ADVERTISE;
        if (unlikely(!ireq->tstamp_ok))
            remaining -= TCPOLEN_SACKPERM_ALIGNED;
    }
    ...
    return MAX_TCP_OPTION_SPACE - remaining;
}

We saw that the SACK option is simply set based on whether the incoming request has the SACK option set, which was not very helpful. We knew that SACK was getting stripped from this connection between the SYN and SYN+ACK, and we still had to find where it was happening.

We took a look at the incoming request parsing in tcp_parse_options:

void tcp_parse_options(const struct net *net,
               const struct sk_buff *skb,
               struct tcp_options_received *opt_rx, int estab,
               struct tcp_fastopen_cookie *foc)
{
    ...
            case TCPOPT_SACK_PERM:
                if (opsize == TCPOLEN_SACK_PERM && th->syn &&
                    !estab && net->ipv4.sysctl_tcp_sack) {
                    opt_rx->sack_ok = TCP_SACK_SEEN;
                    tcp_sack_reset(opt_rx);
                }
                break;
       ...
}

We saw that, in order to positively parse a SACK option on an incoming request, the request must have the SYN flag (it did), the connection must not be established (it wasn’t), and the net.ipv4.tcp_sack sysctl must be enabled (it was).  No luck here.

As part of our browsing we happened to notice that when handling connection requests in tcp_conn_request, it sometimes clears the options:

int tcp_conn_request(struct request_sock_ops *rsk_ops,
             const struct tcp_request_sock_ops *af_ops,
             struct sock *sk, struct sk_buff *skb)
{
    ...
    tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, want_cookie ? NULL : &foc);

    if (want_cookie && !tmp_opt.saw_tstamp)
        tcp_clear_options(&tmp_opt);
    ...
    return 0;
}

We quickly found out that the want_cookie variable indicates that Linux wants to use the TCP SYN cookies feature, but we didn’t have any idea what that meant.

What are TCP SYN cookies?

TCP SYN cookies are easiest to explain by starting with the problem they were designed to solve: SYN flooding.

SYN flooding

TCP servers typically have a limited amount of space in the SYN queue for connections that aren’t yet established. When this queue is full, the server cannot accept more connections and must drop incoming SYN requests.

This behavior leads to a denial-of-service attack called SYN flooding. The attacker sends many SYN requests to a server, but when the server responds with SYN+ACK, the attacker ignores the response and never sends an ACK to complete connection setup. This causes the server to try resending SYN+ACK messages with escalating backoff timers. If the attacker never responds and continues to send SYN requests, it can keep the server's SYN queue full at all times, preventing legitimate clients from establishing connections with the server.

Resisting the SYN flood

TCP SYN cookies solve this problem by allowing the server to respond with SYN+ACK and set up a connection even when the SYN queue is full. SYN cookies work by encoding the options that would normally be stored in the SYN queue entry (along with a cryptographic hash of the approximate time and the source/destination IPs and ports) into the initial sequence number value of the SYN+ACK. The server can then throw away the SYN queue entry and not waste any memory on this connection. When the (legitimate) client eventually responds with an ACK message, it will contain the same initial sequence number. The server can then verify the hash and the time and, if they are valid, decode the options and complete connection setup without having used any SYN queue space.
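
To make that concrete, here is a highly simplified sketch of the idea in C. This is not the kernel's actual algorithm (Linux uses a keyed cryptographic hash and a more careful bit layout); every name and constant below is made up for illustration:

#include <stdint.h>

/* Toy stand-in for a keyed cryptographic hash over the connection addresses,
 * ports, and a coarse time counter. A real implementation uses something
 * like SipHash with a secret key. */
static uint32_t toy_hash(uint32_t addrs, uint32_t ports, uint32_t count,
                         uint32_t secret)
{
    uint32_t h = secret ^ 0x9e3779b9u;
    h ^= addrs; h *= 0x85ebca6bu;
    h ^= ports; h *= 0xc2b2ae35u;
    h ^= count; h ^= h >> 16;
    return h;
}

/* Only a tiny MSS index fits in the cookie, which is why the other options
 * (SACK, window scaling, ...) get dropped. */
static const uint16_t mss_table[] = { 536, 1300, 1440, 1460 };

/* Encode: the top 24 bits carry the hash, bits 2-7 a coarse time counter,
 * and bits 0-1 the MSS table index. This value becomes the initial sequence
 * number of the SYN+ACK. */
uint32_t make_cookie(uint32_t addrs, uint32_t ports, uint32_t minutes,
                     unsigned mss_idx, uint32_t secret)
{
    uint32_t h = toy_hash(addrs, ports, minutes, secret);
    return (h & 0xffffff00u) | ((minutes & 0x3fu) << 2) | (mss_idx & 0x3u);
}

/* Decode: when the final ACK arrives, recompute the hash for recent time
 * counters. If one matches, recover the MSS and finish the handshake without
 * ever having stored a SYN queue entry. */
int check_cookie(uint32_t cookie, uint32_t addrs, uint32_t ports,
                 uint32_t now_minutes, uint32_t secret, uint16_t *mss_out)
{
    for (unsigned back = 0; back < 2; back++) {
        uint32_t m = now_minutes - back;
        uint32_t h = toy_hash(addrs, ports, m, secret);
        if ((cookie & 0xffffff00u) == (h & 0xffffff00u) &&
            ((cookie >> 2) & 0x3fu) == (m & 0x3fu)) {
            *mss_out = mss_table[cookie & 0x3u];
            return 1;  /* valid cookie */
        }
    }
    return 0;  /* stale or forged */
}

Notice that the only connection parameter that survives this round trip is the MSS index, which leads directly to the drawback described next.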

Drawbacks of SYN cookies

Using SYN cookies to establish a connection has one drawback: there isn’t enough space in the initial sequence number to encode all the options. The Linux TCP stack only encodes the maximum segment size (a required option) and sends a SYN+ACK that rejects all other options, including the SACK and window scaling options. This isn’t usually a problem, because SYN cookies are only used when the server's SYN queue is full, which is unlikely unless it’s under a SYN flood attack.

Below is an example interaction that shows how a connection would be created with SYN cookies when a server’s SYN queue is full.

[Diagram: connection setup using SYN cookies when the server's SYN queue is full]

The Storage Performance Anomaly: Qumulo’s TCP problem

After studying TCP SYN cookies, we recognized that they were likely responsible for our connections periodically missing TCP options. Surely, we thought, our test machines weren’t under a SYN flood attack, so their SYN queues should not have been full.

We went back to reading the Linux kernel and discovered that the maximum SYN queue size was set in inet_csk_listen_start:

int inet_csk_listen_start(struct sock *sk, int backlog)
{
       ...
    sk->sk_max_ack_backlog = backlog;
    sk->sk_ack_backlog = 0;
       ...
}

From there, we traced through callers to find that the backlog value was set directly in the listen syscall.  We pulled up Qumulo’s socket code and quickly saw that when listening for connections, we always used a backlog of size 5.

if (listen(fd, 5) == -1)
    return error_new(system_error, errno, "listen");

During cluster initialization we were creating a connected mesh network between all of the machines, so of course we had more than 5 connections created at once for any cluster of sufficient size.  We were SYN flooding our own cluster from the inside!

We quickly made a change to increase the backlog size that Qumulo used, and all of the bad performance results disappeared. Case closed!
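
For illustration, a minimal sketch of that kind of fix, assuming a much larger system-wide cap is acceptable, is simply to pass a deeper backlog to listen (the kernel clamps the value to the net.core.somaxconn sysctl):

/* Hypothetical version of the fix: request a deep listen backlog instead of 5.
 * SOMAXCONN comes from <sys/socket.h>; the kernel caps the effective value
 * at the net.core.somaxconn sysctl. */
if (listen(fd, SOMAXCONN) == -1)
    return error_new(system_error, errno, "listen");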

Editor's note: This post was published in December 2020.
