How we fine-tuned HAProxy to achieve 2,000,000 concurrent SSL connections

If you look closely at the screenshot above, you will find two important pieces of information:

  1. This machine has 2.38 million TCP connections established, and
  2. The amount of RAM being used is around 48 gigabytes.

Pretty awesome, right? What would be even more awesome is if someone provided the setup components and the tunings required to achieve this kind of scale on a single HAProxy machine. Well, I'll do just that in this post ;)

This is the final part of the multi-part series on load testing HAProxy. If you have time, I recommend that you go and read the first two parts of the series first. They will help you get the hang of the kernel-level tunings required on all the machines in this setup.

Load Testing HAProxy (Part 1)

Load Testing? HAProxy? If all this seems Greek to you, don't worry. I will provide inline links to read up on what… (medium.com)

Load Testing HAProxy (Part 2)

This is the second part in the 3-part series on performance testing of the famous TCP load balancer and reverse proxy… (medium.com)

There are a lot of small components that helped us bring the entire setup together and achieve these numbers.

Before I tell you the final HAProxy configuration we used (if you are super impatient, you can scroll to the bottom), I want to build up to it by walking you through our thought process.

What we wanted to test

The component we wanted to test was HAProxy version 1.6. We use it in production right now on 4-core, 30 Gig machines. However, not all of that connectivity is SSL based.

We wanted to test two things out of this exercise:

  1. The percentage increase in CPU when we shift the entire load from non-SSL connections to SSL connections. CPU usage should definitely increase, owing to the longer handshake (a TLS handshake on top of the TCP 3-way handshake) and the subsequent packet encryption.
  2. Secondly, we wanted to test the limits of our current production setup in terms of the number of requests and the maximum number of concurrent connections that can be supported before performance starts to degrade.

We needed the first part because of a major feature rollout that is in full swing and requires communication over SSL. We needed the second part so that we could reduce the amount of hardware dedicated to HAProxy machines in production.

Components involved

  • Multiple client machines to stress HAProxy.
  • A single HAProxy machine, version 1.6, in various setups

    * 4 cores, 30 Gig

    * 16 cores, 30 Gig

    * 16 cores, 64 Gig

  • Backend servers that help support all these concurrent connections.

HTTP and MQTT

If you have gone through the first article in this series, you will know that our entire infrastructure is supported over two protocols:

  • HTTP and
  • MQTT.

In our stack, we don't use HTTP 2.0 and hence don't have the functionality of persistent connections on HTTP. So in production, the maximum number of TCP connections we see is somewhere around (2 * 150k) on a single HAProxy machine (inbound + outbound). Although the number of concurrent connections is rather low, the number of requests per second is quite high.

MQTT, on the other hand, is a different mode of communication. It offers good quality-of-service parameters and persistent connections as well, so bidirectional continuous communication can happen over an MQTT channel. As for HAProxy supporting MQTT (underlying TCP) connections, we see somewhere around 600–700k TCP connections at peak time on a single machine.

We wanted to do a load test that would give us precise results for both HTTP- and MQTT-based connections.

There are a lot of tools out there that help us load test an HTTP server easily, and many of them provide advanced functionality such as summarized results, conversion of text-based results to graphs, etc. However, we could not find any stress testing tool for MQTT. We do have a tool that we developed ourselves, but it was not stable enough to support this kind of load in the timeframe we had.

So we decided to go with load testing clients for HTTP and simulate the MQTT setup using the same ;) Interesting, right?

Read on.

The initial setup

This is going to be a long post, as I will be providing a lot of details that I think would be really helpful to someone doing similar load testing or fine-tuning.

  • We took a 16-core, 30 Gig machine for setting up HAProxy initially. We did not go with our current production setup because we thought the CPU hit from SSL termination happening at the HAProxy end would be tremendous.
  • For the server end, we went with a simple NodeJs server that replies with pong on receiving a ping request.
  • As for the client, we initially ended up using Apache Bench. The reason we went with ab was that it is a very well-known and stable tool for load testing HTTP endpoints, and also because it provides beautiful summarized results that would help us a lot.

The ab tool provides a lot of interesting parameters, which we used for our load test, such as:

  • -c, concurrency: Specifies the number of concurrent requests that would hit the server.
  • -n, no. of requests: As the name suggests, specifies the total number of requests for the current load run.
  • -p, POST file: Contains the body of the POST request (if that is what you want to test).

If you look at these parameters closely, you will find that a lot of permutations are possible by tweaking all three. A sample ab request would look something like this:

ab -S -p post_smaller.txt -T application/json -q -n 100000 -c 3000 http://test.haproxy.in:80/ping

A sample result of such a request looks something like this:

The numbers we were interested in were:

  • 99th percentile latency.
  • Time per request.
  • Number of failed requests.
  • Requests per second.

The biggest problem with ab is that it does not provide a parameter to control the number of requests per second. We had to tweak the concurrency level to get our desired requests per second, and this led to a lot of trial and error.
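Since ab only exposes concurrency, one rough way to pick a starting value for -c is Little's law: concurrency is approximately the target requests per second multiplied by the mean time per request. The sketch below is only a starting estimate and assumes latency stays roughly constant as load grows (which stops being true near the tipping point). The helper name and the sample numbers are illustrative and are not taken from our test runs.

```javascript
// Rough starting point for ab's -c flag, using Little's law:
//   concurrency = target requests/second * mean seconds per request.
// concurrencyForRate is a hypothetical helper, not part of ab.
function concurrencyForRate(targetRps, meanLatencyMs) {
  return Math.ceil((targetRps * meanLatencyMs) / 1000);
}

// Illustrative numbers only: 20,000 req/s at a mean of 150 ms per request.
console.log(concurrencyForRate(20000, 150)); // 3000
```

In practice you would still run ab, read the measured requests per second, and adjust -c from there.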

The Almighty Graph

We could not randomly go about doing multiple load runs and keep collecting results, because that would not give us any meaningful information. We had to perform these tests in a specific way to get meaningful results out of them. So we followed this graph:

This graph states that as we keep increasing the number of requests, the latency will remain almost the same up to a certain point. Beyond a certain tipping point, however, the latency will start to increase exponentially. It is this tipping point, for a machine or for a setup, that we intended to measure.
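To make the idea of a tipping point concrete, here is a minimal sketch that picks it out of a series of load runs. It assumes a simple rule that is my own choice for illustration: the tipping point is the highest load level whose latency is still within a fixed factor of the latency seen at the lowest load. The function name, the factor of 2 and the sample data are all hypothetical.

```javascript
// Return the highest load level that still kept latency within
// `factor` times the low-load baseline, or null if latency never blew up.
// All names and sample data here are illustrative.
function findTippingPoint(runs, factor = 2) {
  const sorted = [...runs].sort((a, b) => a.rps - b.rps);
  const baseline = sorted[0].latencyMs;
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].latencyMs > baseline * factor) {
      // The previous run was the last healthy one.
      return sorted[i - 1].rps;
    }
  }
  return null;
}

const sampleRuns = [
  { rps: 5000, latencyMs: 12 },
  { rps: 10000, latencyMs: 13 },
  { rps: 20000, latencyMs: 15 },
  { rps: 30000, latencyMs: 18 },
  { rps: 35000, latencyMs: 95 },
];
console.log(findTippingPoint(sampleRuns)); // 30000
```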

Ganglia

Before providing some test results, I would like to mention Ganglia.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.

Take a look at the following screenshot of one of our machines to get an idea of what Ganglia is and what sort of information it provides about the underlying machine.

Pretty interesting, eh?

Moving on, we constantly watched Ganglia on our HAProxy machine to keep track of a few important things.

  1. TCP established: the total number of TCP connections established on the system. NOTE: this is the sum of inbound as well as outbound connections.
  2. Packets sent and received: the total number of TCP packets being sent and received by our HAProxy machine.
  3. Bytes sent and received: the total amount of data sent and received by the machine.
  4. Memory: the amount of RAM being used over time.
  5. Network: the network bandwidth consumed by the packets being sent over the wire.

Following are the known limits, found via previous tests and numbers, that we wanted to achieve via our load test.

700k TCP established connections,

50k packets sent, 60k packets received,

10–15 MB bytes sent as well as received,

14–15 Gig memory at peak,

7 MB network.

All the rate values above (packets, bytes and network) are on a per-second basis.

HAProxy Nbproc

Initially, when we began load testing HAProxy, we found that with SSL the CPU was being hit pretty early on in the process, but the requests per second were very low. On investigating with the top command, we found that HAProxy was using only 1 core, while we had 15 more cores to spare.

Googling for about 10 minutes led us to an interesting setting in HAProxy that lets it use multiple cores.

It's called nbproc, and to get a better understanding of what it is and how to set it, check out this article:

//blog.onefellow.com/post/82478335338/haproxy-mapping-process-to-cpu-core-for-maximum

Tuning this setting was the foundation of our load testing strategy from then on, because letting HAProxy use multiple cores gave us the power to form multiple combinations for our load testing suite.

Load Testing with AB

When we started out on our load testing journey, we were not clear on what we should be measuring and what we needed to achieve.

Initially we had only one goal in mind: to find the tipping point purely by varying all the parameters mentioned below.

I maintained a table of all the results for the various load tests that we ran. All in all, I did over 500 test runs to get to the final result. As you can clearly see, there are a lot of moving parts to each and every test.

Single Client issues

We started seeing that the client was becoming the bottleneck as we kept increasing our requests per second. Apache Bench uses a single core, and from the documentation it is evident that it does not provide any feature for using multiple cores.

To run multiple clients efficiently, we found an interesting Linux utility called Parallel. As the name suggests, it helps you run multiple commands in parallel and utilises multiple cores. Exactly what we wanted.

Have a look at a sample command that runs multiple clients using parallel.

cat hosts.txt | parallel 'ab -S -p post_smaller.txt -T application/json -n 100000 -c 3000 {}'

$ cat hosts.txt
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping

The above command would run 3 ab clients hitting the same URL. This helped us remove the client side bottleneck.

The Sleep and Times parameter

We talked about some parameters in Ganglia that we wanted to track. Let's discuss them one by one.

  1. packets sent and received: This can be simulated by sending some data as part of the POST request. It also helps us generate the network and bytes sent and received figures in Ganglia.
  2. tcp_established: This took us a long, long time to actually simulate in our scenario. Imagine that a single ping request takes about a second; we would then need about 700k requests per second to reach our tcp_established milestone.

    Now this number might seem easier to achieve in production, but it was impossible to generate in our scenario.

What did we do, you might ask? We introduced a sleep parameter in our POST call that specifies the number of milliseconds the server needs to sleep before sending out a response. This simulates a long-running request in production. So now, say we have a sleep of about 20 minutes (yep); we would then need only around 583 requests per second to reach the 700k mark.

Additionally, we introduced another parameter in our POST calls to HAProxy: the times parameter. It specifies the number of times the server should write a response on the TCP connection before terminating it. This helped us simulate even more data transferred over the wire.
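The arithmetic behind the sleep parameter is Little's law: the number of connections held open equals the arrival rate multiplied by how long each connection stays open. A minimal sketch of that calculation follows; the function name is just for illustration, and it ignores network and processing time on top of the sleep.

```javascript
// Little's law: connections held open = arrival rate * time each stays open,
// so the rate needed for a target is target / hold time.
function requiredRate(targetConnections, holdSeconds) {
  return targetConnections / holdSeconds;
}

// Requests that take about 1 second each.
console.log(requiredRate(700000, 1)); // 700000

// Requests that sleep for about 20 minutes each.
console.log(Math.round(requiredRate(700000, 20 * 60))); // 583
```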

Issues with apache bench

Although we got a lot of results out of Apache Bench, we also faced a lot of issues along the way. I won't mention all of them here, as they are not important for this post and I'll be introducing another client shortly.

We were pretty content with the numbers we were getting out of Apache Bench, but at one point generating the required TCP connections simply became impossible. Somehow Apache Bench was not handling the sleep parameter we had introduced properly, and it was not scaling for us.

Running multiple ab clients on a single machine was sorted out by the parallel utility, but running this setup across multiple client machines was still a pain for us. I had not heard of the pdsh utility by then and was practically stuck.

Also, we were not paying attention to any timeouts. There are default timeouts on HAProxy, on the ab client and on the server, and we had completely ignored them. We figured out a lot of things along the way and became much more organized about how to go about testing.

We used to talk about the tipping point graph, but we deviated a lot from it as time went on. Meaningful results, however, could only be found by focusing on it.

With Apache Bench, a point came where the number of TCP connections was not increasing. We had around 40–45 clients running on 5–6 different client boxes but were not able to achieve the scale we wanted. Theoretically, the number of TCP connections should have jumped as we kept increasing the sleep time, but it wasn't working for us.

Enter Vegeta

I was searching for other load testing tools that might be more scalable and more functional than Apache Bench when I came across Vegeta.

In my personal experience, Vegeta is extremely scalable and provides much better functionality than Apache Bench. A single Vegeta client was able to produce throughput equivalent to 15 Apache Bench clients in our load test.

From here on, all the load test results I present were obtained using Vegeta.

Load Testing with Vegeta

First, have a look at the command that we used to run a single Vegeta client. Interestingly, the command to put load on the backend servers is called attack :p

echo "POST https://test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -header="sleep:30000" -body=post_smaller.txt -rate=2000 -workers=500 | tee reports.bin | vegeta report

Just love the parameters provided by Vegeta. Let’s have a look at some of these below.

  1. -cpus=32: Specifies the number of cores to be used by this client. We had to expand our client machines to 32 cores and 64 Gig because of the amount of load to be generated. If you look closely above, the rate isn't much, but it becomes difficult to sustain such a load when a lot of connections are held in a sleep state by the server.
  2. -duration=10m: I guess this is self-explanatory. If you don't specify a duration, the test will run forever.
  3. -rate=2000: The number of requests per second.

So as you can see above, we reached a hefty 32k requests per second on a mere 4-core machine. If you remember the tipping point graph, you will be able to spot it clearly enough above. The tipping point in this case is 31.5k non-SSL requests per second.

Have a look at some more results from the load test.

16k SSL connections is not bad at all either. Please note that at this point in our load testing journey we had to start from scratch, because we had adopted a new client and it was giving us far better results than ab. So we had to redo a lot of work.

An increase in the number of cores led to an increase in the number of requests per second that the machine can take before the CPU limit is hit.

We found that there wasn't a substantial increase in the number of requests per second when we increased the number of cores from 8 to 16. Also, if we finally decided to go with an 8-core machine in production, we would never allocate all of the cores to HAProxy, or to any other single process for that matter. So we decided to perform some tests with 6 cores as well, to see whether the numbers were acceptable.

Not bad.

Introducing the sleep

We were pretty satisfied with our load test results so far. However, they did not simulate the real production scenario. That changed when we introduced a sleep time, which had been absent from our tests until now.

echo "POST https://test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -header="sleep:1000" -body=post_smaller.txt -rate=2000 -workers=500 | tee reports.bin | vegeta report

A sleep time of 1000 milliseconds makes the server sleep for x milliseconds, where 0 < x < 1000 and x is selected randomly. So on average the above load test will give a latency of ≥ 500 ms.
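As a quick check on that average: the backend code shown later in this post picks its delay with randomIntInc(0, sleep + 1), i.e. a uniform integer between 0 and sleep + 1 inclusive. The sketch below computes the expected delay for that draw. The helper names are illustrative, and network and processing time come on top of this figure.

```javascript
// Mean of a uniform integer draw on [low, high] (both inclusive).
function uniformMean(low, high) {
  return (low + high) / 2;
}

// The backend picks its delay with randomIntInc(0, sleep + 1).
function expectedSleepMs(sleepHeaderMs) {
  return uniformMean(0, sleepHeaderMs + 1);
}

console.log(expectedSleepMs(1000)); // 500.5
```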

The numbers in the last cell represent TCP established, packets received and packets sent, respectively. As you can clearly see, the maximum requests per second that the 6-core machine can support has decreased from 20k to 8k. Clearly the sleep has its impact, and that impact is the increase in the number of TCP connections established. This is, however, nowhere near the 700k mark that we set out to achieve.

Milestone #1

How do we increase the number of TCP connections? Simple: we keep increasing the sleep time and they should rise. We kept playing around with the sleep time and stopped at 60 seconds. That would mean an average latency of around 30 seconds.

There is an interesting result parameter that Vegeta provides: the percentage of successful requests. We saw that with the above sleep time, only 50% of the calls were succeeding. See the results below.

We achieved a whopping 400k TCP established connections with 8k requests per second and a 60000 ms sleep time. The R in 60000R means random.

The first real discovery we made was that Vegeta has a default call timeout of 30 seconds, which explained why 50% of our calls were failing. So we increased it to about 70 seconds for our further tests and kept varying it as and when the need arose.
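The 50% figure falls straight out of the numbers. If the server delay is spread uniformly between 0 and 60 seconds and the client gives up after 30 seconds, roughly half the requests finish in time. Here is a minimal sketch of that estimate, assuming the delay dominates everything else; the function name is illustrative.

```javascript
// Share of requests that finish before the client gives up, assuming the
// server delay is uniform on [0, maxSleepMs] and all other time is negligible.
function expectedSuccessRatio(maxSleepMs, clientTimeoutMs) {
  return Math.min(1, clientTimeoutMs / maxSleepMs);
}

console.log(expectedSuccessRatio(60000, 30000)); // 0.5 with the 30 s default
console.log(expectedSuccessRatio(60000, 70000)); // 1 with a 70 s timeout
```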

We hit the 700k mark easily after tweaking the timeout value on the client end. The only problem was that these numbers were not consistent; they were just peaks. The system hit a peak of 600k or 700k but did not stay there for very long.

However, we wanted something that looked like this:

This shows a steady state in which 780k connections are maintained. If you look closely at the stats above, the number of requests per second is very high. In production, however, we have a much lower number of requests (somewhere around 300) on a single HAProxy machine.

We were sure that if we drastically reduced the number of HAProxies we have in production (somewhere around 30, which means 30 * 300 ~ 9k connections per second), we would hit the machine's limits on the number of TCP connections first, and not the CPU.

So we decided to aim for 900 requests per second, 30 MB/s network and 2.1 million TCP established connections. We agreed upon these numbers since they would be 3 times our production load on a single HAProxy.

Plus, until now we had settled on 6 cores being used by HAProxy. We wanted to test with only 3 cores, because this is what would be easiest for us to roll out on our production machines. (Our production machines, as mentioned before, are 4-core, 30 Gig, so rolling out changes with nbproc = 3 would be easiest for us.)

REMEMBER: the machine we had at this point in time was a 16-core, 30 Gig machine with 3 cores allocated to HAProxy.

Milestone #2

Now that we had max limits on requests per second that different variations in machine configuration could support, we only had one task left as mentioned above.

Achieve 3X the production load which is

  • 900 requests per second
  • 2.1 million TCP established and
  • 30 MB/s network.

We got stuck yet again, as TCP established hit a hard limit at 220k. No matter how many client machines we used or what the sleep time was, the number of TCP connections seemed to be stuck there.

Let's look at some calculations. 220k TCP established connections at 900 requests per second gives 110,000 / 900 ~= 120 seconds. I took 110k because the 220k connections include both incoming and outgoing, so each request is counted twice.
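This is the same Little's law relationship run in reverse: from a plateau in established connections and a known request rate, you can back out how long each connection is living. A small sketch, with an illustrative function name:

```javascript
// Back out the average connection lifetime from a connection plateau.
function impliedLifetimeSeconds(tcpEstablished, requestsPerSecond) {
  // Each proxied request shows up twice: one inbound and one outbound connection.
  const clientSide = tcpEstablished / 2;
  return clientSide / requestsPerSecond;
}

console.log(impliedLifetimeSeconds(220000, 900).toFixed(1)); // "122.2"
```

A lifetime that lands this close to a round number like 120 seconds is a strong hint that some timeout in the chain is cutting connections off.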

Our suspicion that 2 minutes was a limit somewhere in the system was confirmed when we enabled logs on the HAProxy side. We could see about 120000 ms as the total time for a lot of connections in the logs.

Mar 23 13:24:24 localhost haproxy[53750]: 172.168.0.232:48380 [23/Mar/2017:13:22:22.686] api~ api-backend/http31 39/0/2062/-1/122101 -1 0 - - SD-- 1714/1714/1678/35/0 0/0 {0,"",""} "POST /ping HTTP/1.1"
122101 is the total time for the connection in milliseconds (the last of the five slash-separated timer values), which is just over 2 minutes. See the HAProxy documentation for the meaning of all these values.
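If you want to pull those timers out of many log lines, a small parser helps. The sketch below assumes the HAProxy 1.6 HTTP log format, where the five slash-separated timers are Tq/Tw/Tc/Tr/Tt (request, queue, connect, response and total time, in milliseconds, with -1 meaning the phase never completed). The function name and the regular expression are my own and are only tested against the sample line above.

```javascript
// Extract the Tq/Tw/Tc/Tr/Tt timers from an HAProxy HTTP log line.
// Assumes the first whitespace-delimited group of five slash-separated
// integers is the timer field, as in the sample line from this post.
function parseTimers(logLine) {
  const m = logLine.match(/\s(-?\d+)\/(-?\d+)\/(-?\d+)\/(-?\d+)\/\+?(-?\d+)\s/);
  if (!m) return null;
  const [Tq, Tw, Tc, Tr, Tt] = m.slice(1).map(Number);
  return { Tq, Tw, Tc, Tr, Tt };
}

const sampleLine =
  'Mar 23 13:24:24 localhost haproxy[53750]: 172.168.0.232:48380 ' +
  '[23/Mar/2017:13:22:22.686] api~ api-backend/http31 39/0/2062/-1/122101 ' +
  '-1 0 - - SD-- 1714/1714/1678/35/0 0/0 {0,"",""} "POST /ping HTTP/1.1"';

const timers = parseTimers(sampleLine);
console.log(timers.Tt); // 122101
console.log(timers.Tr); // -1
```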

On investigating further we found out that NodeJs has a default request timeout of 2 minutes. Voila !

How to modify the nodejs request default timeout time?

I was using nodejs request, the default timeout of nodejs http is 120000 ms, but it is not enough for me, while my… (stackoverflow.com)

HTTP | Node.js v7.8.0 Documentation

The HTTP interfaces in Node.js are designed to support many features of the protocol which have been traditionally… (nodejs.org)
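For reference, raising that limit is a one-line change on the server object. The sketch below is a stripped-down stand-in for a backend, not our actual server, and it never calls listen so it can be run safely anywhere. Newer Node.js releases may ship a different default than the 2 minutes we hit, so check the documentation for your version.

```javascript
const http = require('http');

// Minimal stand-in server; the handler is only here for completeness.
const server = http.createServer((req, res) => {
  res.end('pong');
});

// Allow sockets to stay idle for up to one hour before Node.js closes them.
server.timeout = 60 * 60 * 1000;

console.log(server.timeout); // 3600000
```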

Once we raised that timeout on the backend servers, the connection count started climbing again. But our happiness was apparently short-lived. At 1.3 million, the HAProxy connections suddenly dropped to 0 and started increasing again. We quickly checked the output of the dmesg command, which gave us some useful kernel-level information about our HAProxy process.

Basically, the HAProxy process had run out of memory. So we decided to increase the machine's RAM and shifted to a 16-core, 64 Gig machine with nbproc = 3, and with this change we were able to reach 2.4 million connections.
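A back-of-the-envelope estimate shows why 30 Gig was not enough. Using the figures from the top of this post (about 48 gigabytes of RAM at 2.38 million connections), and treating all of that memory as if it scaled with the connection count, which is a simplification since it includes everything else running on the machine, we get roughly 21 KiB per connection:

```javascript
const GIB = 1024 ** 3;

// Very rough: total RAM in use divided by established connections.
function bytesPerConnection(usedGiB, connections) {
  return (usedGiB * GIB) / connections;
}

const perConn = bytesPerConnection(48, 2.38e6);
console.log(Math.round(perConn / 1024)); // 21 (KiB per connection)

// Projected memory use at the point where the 30 Gig machine fell over.
const at13M = (perConn * 1.3e6) / GIB;
console.log(at13M.toFixed(1)); // "26.2" (GiB)
```

About 26 GiB for connections alone leaves very little headroom on a 30 Gig machine, which is consistent with the out-of-memory kill we saw at 1.3 million.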

Backend Code

Following is the backend server code that was being used. We also used statsd in the server code to get consolidated data on the requests per second being received from the clients.

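var http = require('http');
var createStatsd = require('uber-statsd-client');

// statsd client used to count the requests reaching this backend
var sdc = createStatsd({host: '172.168.0.134', port: 8125});

// port to listen on, passed as the first command-line argument
var port = process.argv[2];

// random integer between low and high (both inclusive)
function randomIntInc(low, high) {
  return Math.floor(Math.random() * (high - low + 1) + low);
}

// write 'pong', then either end the response or sleep and write again
function sendResponse(res, times, old_sleep) {
  res.write('pong');
  if (times == 0) {
    res.end();
  } else {
    var sleep = randomIntInc(0, old_sleep + 1);
    setTimeout(sendResponse, sleep, res, times - 1, old_sleep);
  }
}

var server = http.createServer(function (req, res) {
  var headers = req.headers;
  // "sleep" header: upper bound in ms for the random delay (0 if absent)
  var old_sleep = parseInt(headers["sleep"], 10) || 0;
  // "times" header: number of extra writes before closing (0 if absent)
  var times = parseInt(headers["times"], 10) || 0;
  // The two steps below are a sketch that follows the description in this
  // post; 'requests' is a placeholder metric name.
  sdc.increment('requests');
  setTimeout(sendResponse, randomIntInc(0, old_sleep + 1), res, times, old_sleep);
});

// raise the default 2 minute socket timeout to 1 hour
server.timeout = 3600000;
server.listen(port);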

We also had a small script to run multiple backend servers. We had 8 machines with 10 backend servers EACH (yeah!). We took the idea of clients and backend servers being infinite for the load test quite literally.

counter=0
while [ $counter -le 9 ]
do
  port=$((8282+$counter))
  nodejs /opt/local/share/test-tools/HikeCLI/nodeclient/httpserver.js $port &
  echo "Server created on port " $port
  ((counter++))
done
echo "Created all servers"

Client Code

As for the client, there was a limitation of 63k TCP connections per IP. If you are not sure about this concept, please refer to my previous article in this series.

So in order to achieve 2.4 million connections (two-sided, which is 1.2 million from the client machines), we needed somewhere around 20 machines. It's really a pain to run the Vegeta command on all 20 machines one by one, and even if you found a way to do that using something like csshx, you would still need something to combine the results from all the Vegeta clients.
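The machine count is a simple ceiling division. The sketch below assumes one source IP per client machine and uses the 63k per-IP figure from above; the function name is illustrative.

```javascript
// Number of client machines needed when each source IP can hold
// at most perIpLimit connections to the same destination.
function clientMachinesNeeded(clientConnections, perIpLimit) {
  return Math.ceil(clientConnections / perIpLimit);
}

console.log(clientMachinesNeeded(1.2e6, 63000)); // 20
```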

Check out the script below.

result_file=$1

declare -a machines=("172.168.0.138" "172.168.0.141" "172.168.0.142" "172.168.0.18" "172.168.0.5" "172.168.0.122" "172.168.0.123" "172.168.0.124" "172.168.0.232" "172.168.0.244" "172.168.0.170" "172.168.0.179" "172.168.0.59" "172.168.0.68" "172.168.0.137" "172.168.0.155" "172.168.0.154" "172.168.0.45" "172.168.0.136" "172.168.0.143")

bins=""
commas=""

# build comma-separated lists of result files and of hosts
for i in "${machines[@]}"; do bins=$bins","$i".bin"; commas=$commas","$i; done;

# strip the leading comma from both lists
bins=${bins:1}
commas=${commas:1}

# run the same attack on every client machine at once
pdsh -b -w "$commas" 'echo "POST http://test.haproxy.in:80/ping" | /home/sachinm/.linuxbrew/bin/vegeta -cpus=32 attack -connections=1000000 -header="sleep:20" -header="times:2" -body=post_smaller.txt -timeout=2h -rate=3000 -workers=500 > ' $result_file

# copy each machine's result file back (user name inferred from the home directory path)
for i in "${machines[@]}"; do scp sachinm@$i:/home/sachinm/$result_file $i.bin ; done;

# merge all the results into a single report
vegeta report -inputs="$bins"

Apparently, the Vegeta documentation points to a utility called pdsh that lets you run a command concurrently on multiple remote machines. Additionally, Vegeta allows us to combine multiple results into one, and that's really all we wanted.

HAProxy Configuration

This is probably what you came here looking for: below is the HAProxy config that we used in our load test runs. The most important parts are the nbproc setting and the maxconn setting. The maxconn setting lets us specify the maximum number of TCP connections that HAProxy can support overall (one way).

Changing the maxconn setting leads to an increase in the HAProxy process's ulimit. Take a look below.

The max open files limit has increased to 4 million because the max connections for HAProxy is set at 2 million. Neat!
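The factor of two makes sense when you remember that every proxied connection needs one socket towards the client and one towards the backend. The sketch below captures only that rule of thumb; HAProxy also needs a handful of extra descriptors for listeners, health checks and so on, which are not modelled here, and the function name is illustrative.

```javascript
// Rule of thumb: one file descriptor for the client side and one for the
// backend side of every proxied connection. Extra descriptors for
// listeners, checks, etc. are deliberately ignored in this sketch.
function estimatedFileDescriptors(maxconn) {
  return 2 * maxconn;
}

console.log(estimatedFileDescriptors(2000000)); // 4000000
```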

Check out the article below for a whole bunch of HAProxy optimizations that you can and should make to achieve the kind of stats we achieved.

Use HAProxy to load balance 300k concurrent tcp socket connections: Port Exhaustion, Keep-alive and…

I'm trying to build up a push system recently. To increase the scalability of the system, the best practice is to make… (www.linangran.com)

The http30 entries go on up to http83 :p

That's all for now, folks. If you've made it this far, I'm truly amazed :)

A special shout-out to Dheeraj Kumar Sidana, who helped us all the way through this and without whose help we would not have been able to reach any meaningful results. :)

Do let me know how this blog post helped you. Also, please recommend (❤) and spread the love for this post as much as possible if you think it might be useful to someone.