Optimizing web servers for high throughput and low latency


Sorry, I could not resist and translated this ))) Source: the original post. The translation is somewhat loose, but still.

A nice picture describing the structure of load on a server

This is an expanded version of my talk at NginxConf 2017 on September 6, 2017. As an SRE on the Dropbox Traffic Team, I am responsible for our Edge network: its reliability, performance, and efficiency. The Dropbox edge network is an nginx-based proxy tier designed to handle both latency-sensitive metadata transactions and high-throughput data transfers. In a system that handles tens of gigabits per second while simultaneously processing tens of thousands of latency-sensitive transactions, efficiency/performance optimizations span the whole proxy stack: from drivers and interrupts, through TCP/IP and the kernel, to library- and application-level tunings.

Disclaimer

In this post we will discuss a lot of ways to tune web servers and proxies. Please do not cargo-cult them. For the sake of the scientific method, apply them one at a time, measure their effect, and decide whether they are actually useful in your environment.

This is not a post about Linux performance, and although I will make many references to bcc tools, eBPF, and perf, it is by no means a comprehensive guide to using performance profiling tools. If you want to learn more about them, you may want to read Brendan Gregg's blog.

This is not a post about browsers either. I will touch on client-side performance when covering latency-related optimizations, but only briefly. If you want to know more, you should read "High Performance Browser Networking" by Ilya Grigorik.

Nor is this a compilation of TLS best practices. Although I will mention TLS libraries and their settings a number of times, you and your security team should evaluate the performance and security implications of each of them. You can use the Qualys SSL Test to check your endpoint against the current set of best practices, and if you want to learn more about TLS in general, consider subscribing to the Feisty Duck Bulletproof TLS Newsletter.

Structure of the post

We are going to discuss efficiency/performance optimizations across different layers of the system, starting with the lowest ones such as hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we will move on to the Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we will discuss library- and application-level tunings, which are mostly applicable to web servers in general and to nginx specifically.

For each potential area of optimization I will try to give some background on latency/throughput tradeoffs (if any) and monitoring guidelines, and, finally, suggest tunings for different workloads.

Hardware

CPU

For good asymmetric RSA/EC performance you are looking for processors with AVX2 support (avx2 in /proc/cpuinfo) and, preferably, with large-integer arithmetic capable hardware (bmi and adx). For the symmetric cases you should look for AES-NI for AES ciphers and AVX512 for ChaCha+Poly. Intel has a comparison of OpenSSL 1.0.2 performance across different hardware generations that illustrates the effect of these hardware offloads.
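As a quick sanity check (a sketch only: the flag names are the ones the kernel exposes, and the exact cipher names accepted by openssl speed depend on your OpenSSL version), you can grep /proc/cpuinfo and benchmark the relevant primitives directly:

$ grep -o -w -E 'aes|avx2|avx512f|bmi2|adx' /proc/cpuinfo | sort | uniq -c
$ openssl speed rsa2048 ecdsap256          # asymmetric
$ openssl speed -evp aes-128-gcm           # symmetric, AES-NI path
$ openssl speed -evp chacha20-poly1305     # symmetric, ChaCha+Poly path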

For latency-sensitive use cases, like routing, fewer NUMA nodes and disabled HT win. Throughput-heavy tasks do better with more cores and will benefit from Hyper-Threading (unless they are cache-bound), and generally will not care about NUMA too much.

Specifically, if you go the Intel path, you are looking for at least Haswell/Broadwell and ideally Skylake CPUs. If you go with AMD, EPYC has quite impressive performance.

NIC

Here you are looking for at least 10G, preferably even 25G. If you want to push more than that through a single server over TLS, the tuning described here will not be sufficient, and you may need to push TLS framing down to the kernel level (e.g. FreeBSD, Linux).

On the software side, you should look for open source drivers with active mailing lists and user communities. This will be very important if (but most likely, when) you’ll be debugging driver-related problems.
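A quick way to check which driver and firmware a NIC is currently running, and where the module comes from (eth0 and ixgbe are just placeholders for your interface and driver names):

$ ethtool -i eth0        # reports driver, driver version, firmware-version, and bus-info
$ modinfo ixgbe          # module source, license, maintainer, and supported parameters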

Memory

The rule of thumb here is that latency-sensitive tasks need faster memory, while throughput-sensitive tasks need more memory.

Hard Drive

It depends on your buffering/caching requirements, but if you are going to buffer or cache a lot you should go for flash-based storage. Some go as far as using a specialized flash-friendly filesystem (usually log-structured), but they do not always perform better than plain ext4/xfs.

Anyway, just be careful not to burn through your flash because you forgot to enable TRIM or to update the firmware.
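For example, on systemd-based distributions with util-linux installed, a hedged sketch of checking for and enabling periodic TRIM looks like this:

# non-zero DISC-GRAN/DISC-MAX columns mean the device advertises discard support
$ lsblk --discard
# one-off trim of all mounted filesystems that support it
$ sudo fstrim -av
# periodic, batched TRIM instead of the continuous 'discard' mount option
$ sudo systemctl enable --now fstrim.timer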


Operating systems: Low level

Firmware

You should keep your firmware up-to-date to avoid painful and lengthy troubleshooting sessions. Try to stay current with CPU microcode and with motherboard, NIC, and SSD firmware. That does not mean you should always run the bleeding edge: the rule of thumb is to run the second-to-latest firmware version, unless the latest one fixes critical bugs, but not to lag too far behind.
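A few ways to check the versions you are currently running (interface and device names are placeholders; smartctl is part of smartmontools):

$ grep -m1 microcode /proc/cpuinfo                  # CPU microcode revision
$ sudo dmidecode -s bios-version                    # motherboard/BIOS version
$ ethtool -i eth0 | grep firmware-version           # NIC firmware
$ sudo smartctl -i /dev/nvme0 | grep -i firmware    # SSD firmware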

Drivers

The update rules here are pretty much the same as for firmware: try staying close to current. One caveat is to try to decouple kernel upgrades from driver updates if possible. For example, you can package your drivers with DKMS, or pre-compile drivers for all the kernel versions you use. That way, when you update the kernel and something does not work as expected, there is one less thing to troubleshoot.
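As an illustration, assuming the driver source ships a dkms.conf and is unpacked under /usr/src/<module>-<version> (the module name and version below are hypothetical), the DKMS flow looks roughly like this:

$ sudo dkms add -m mynic -v 1.2.3        # register the source tree with DKMS
$ sudo dkms build -m mynic -v 1.2.3      # build against the currently running kernel
$ sudo dkms install -m mynic -v 1.2.3    # install the module
$ dkms status                            # see what is built/installed for which kernels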

CPU

Your best friend here is the kernel repo and the tools that come with it. In Ubuntu/Debian you can install the linux-tools package with a handful of utilities, but for now we will only use cpupower, turbostat, and x86_energy_perf_policy. To verify CPU-related optimizations you can stress-test your software with your favorite load-generating tool (for example, Yandex uses Yandex.Tank). Here is a presentation from the last NginxConf by the developers about nginx load-testing best practices: “NGINX Performance testing.”

cpupower

Using this tool is way easier than crawling /proc/. To see info about your processor and its frequency governor you should run:

$ cpupower frequency-info

...
  driver: intel_pstate
  ...
  available cpufreq governors: performance powersave
  ...            
  The governor "performance" may decide which speed to use
  ...
  boost state support:
    Supported: yes
    Active: yes

Check that Turbo Boost is enabled, and for Intel CPUs make sure that you are running with intel_pstate, not acpi-cpufreq or even pcc-cpufreq. If you are still using acpi-cpufreq, you should upgrade the kernel, or, if that is not possible, make sure you are using the performance governor. When running with intel_pstate, even the powersave governor should perform well, but you need to verify that yourself.
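If you do need to force the governor (and, on supported CPUs, bias the energy/performance policy), a minimal sketch using the tools shipped with linux-tools:

$ sudo cpupower frequency-set -g performance      # set the performance governor on all CPUs
$ sudo x86_energy_perf_policy performance         # bias the hardware energy/perf policy towards performance
$ cpupower frequency-info | grep 'The governor'   # verify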

And speaking about idling, to see what is really happening with your CPU, you can use turbostat to directly look into processor’s MSRs and fetch Power, Frequency, and Idle State information:

$ turbostat --debug -P

... Avg_MHz Busy% ... CPU%c1 CPU%c3 CPU%c6 ... Pkg%pc2 Pkg%pc3 Pkg%pc6 ...

Here you can see the actual CPU frequency (yes, /proc/cpuinfo is lying to you), and core/package idle states.

If even with the intel_pstate driver the CPU spends more time in idle than you think it should, you can:

- Set the governor to performance.
- Set x86_energy_perf_policy to performance.

Or, only for very latency-critical tasks, you can (see the sketch below):

- Use the /dev/cpu_dma_latency interface.
- For UDP traffic, use busy-polling.

You can learn more about processor power management in general and P-states specifically in the Intel Open Source Technology Center presentation “Balancing Power and Performance in the Linux Kernel” from LinuxCon Europe 2015.
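A hedged sketch of the latency-critical knobs just mentioned (the sysctl values are only a starting point, and the /dev/cpu_dma_latency trick is quick-and-dirty; measure before and after):

# busy-polling for UDP-heavy, latency-critical workloads (values are in microseconds)
$ sudo sysctl -w net.core.busy_poll=50
$ sudo sysctl -w net.core.busy_read=50

# /dev/cpu_dma_latency expects a binary 32-bit value (target latency in microseconds) and only
# applies while the file descriptor stays open; here we hold it at 0 for the duration of a test:
$ sudo sh -c 'exec 3>/dev/cpu_dma_latency; printf "\0\0\0\0" >&3; sleep 3600'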

CPU Affinity

You can additionally reduce latency by applying CPU affinity on each thread/process; e.g. nginx has a worker_cpu_affinity directive that can automatically bind each web server process to its own core. This should eliminate CPU migrations, reduce cache misses and page faults, and slightly increase instructions per cycle. All of this is verifiable through perf stat.
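A minimal sketch of the relevant nginx.conf lines (worker_cpu_affinity auto requires nginx 1.9.10+; on older versions you spell out one CPU bitmask per worker):

worker_processes     4;
worker_cpu_affinity  auto;
# or, explicitly, one bitmask per worker process:
# worker_cpu_affinity 0001 0010 0100 1000;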

Sadly, enabling affinity can also negatively affect performance by increasing the amount of time a process spends waiting for a free CPU. This can be monitored by running runqlat on one of your nginx worker’s PIDs:

usecs  : count distribution

   0 -> 1          : 819      |                                        |
   2 -> 3          : 58888    |******************************          |
   4 -> 7          : 77984    |****************************************|
   8 -> 15         : 10529    |*****                                   |
  16 -> 31         : 4853     |**                                      |
  ...
4096 -> 8191       : 34       |                                        |
8192 -> 16383      : 39       |                                        |

16384 -> 32767     : 17       |                                        |

If you see multi-millisecond tail latencies there, then there is probably too much stuff going on on your servers besides nginx itself, and affinity will increase latency, instead of decreasing it.

Memory

All mm/ tunings are usually very workload specific; there are only a handful of things to recommend:

- Set THP to madvise and enable them only when you are sure they are beneficial, otherwise you may get an order-of-magnitude slowdown while aiming for a 20% latency improvement (a sketch follows below).
- Unless you are utilizing only a single NUMA node, you should set vm.zone_reclaim_mode to 0.

NUMA

Modern CPUs are actually multiple separate CPU dies connected by a very fast interconnect and sharing various resources, starting from L1 cache on the HT cores, through L3 cache within the package, to memory and PCIe links within sockets. This is basically what NUMA is: multiple execution and storage units with a fast interconnect.
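Before diving deeper into NUMA, here is a minimal sketch of the two mm/ knobs listed above (paths assume a reasonably recent kernel; persist the sysctl via /etc/sysctl.d/ if it proves useful):

$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ sudo sysctl -w vm.zone_reclaim_mode=0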

For a comprehensive overview of NUMA and its implications, you can consult the “NUMA Deep Dive Series” by Frank Denneman.

But, long story short, you have a choice of:

- Ignoring it, by disabling it in BIOS or running your software under numactl --interleave=all; you can get mediocre but somewhat consistent performance.
- Denying it, by using single-node servers, just like Facebook does with the OCP Yosemite platform.
- Embracing it, by optimizing CPU/memory placement in both user- and kernel-space.

Let's talk about the third option, since there is not much optimization needed for the first two.

To utilize NUMA properly you need to treat each NUMA node as a separate server; for that you should first inspect the topology, which can be done with numactl --hardware:

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 16 17 18 19
node 0 size: 32149 MB
node 1 cpus: 4 5 6 7 20 21 22 23
node 1 size: 32213 MB
node 2 cpus: 8 9 10 11 24 25 26 27
node 2 size: 0 MB
node 3 cpus: 12 13 14 15 28 29 30 31
node 3 size: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

Things to look out for:

- number of nodes;
- memory sizes for each node;
- number of CPUs for each node;
- distances between nodes.

This is a particularly bad example since it has 4 nodes as well as nodes without memory attached. It is impossible to treat each node here as a separate server without sacrificing half of the cores on the system.

We can verify that by using numastat:

$ numastat -n -c

                 Node 0   Node 1 Node 2 Node 3    Total
               -------- -------- ------ ------ --------
Numa_Hit       26833500 11885723      0      0 38719223
Numa_Miss         18672  8561876      0      0  8580548
Numa_Foreign    8561876    18672      0      0  8580548
Interleave_Hit   392066   553771      0      0   945836
Local_Node      8222745 11507968      0      0 19730712
Other_Node     18629427  8939632      0      0 27569060

You can also ask numastat to output per-node memory usage statistics in the /proc/meminfo format:

$ numastat -m -c

                Node 0 Node 1 Node 2 Node 3 Total
                ------ ------ ------ ------ -----
MemTotal         32150  32214      0      0 64363
MemFree            462   5793      0      0  6255
MemUsed          31688  26421      0      0 58109
Active           16021   8588      0      0 24608
Inactive         13436  16121      0      0 29557
Active(anon)      1193    970      0      0  2163
Inactive(anon)     121    108      0      0   229
Active(file)     14828   7618      0      0 22446
Inactive(file)   13315  16013      0      0 29327
...
FilePages        28498  23957      0      0 52454
Mapped             131    130      0      0   261
AnonPages          962    757      0      0  1718
Shmem              355    323      0      0   678
KernelStack         10      5      0      0    16

Now let's look at an example of a simpler topology.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 46967 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 48355 MB

Since the nodes are mostly symmetrical, we can bind an instance of our application to each NUMA node with numactl --cpunodebind=X --membind=X and then expose it on a different port. That way you can get better throughput by utilizing both nodes and better latency by preserving memory locality.
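For example (a sketch only: the config paths and port assignment are hypothetical, and you still need something in front of the two instances to balance between them):

# instance bound to node 0, listening on its own port per its config
$ sudo numactl --cpunodebind=0 --membind=0 nginx -c /etc/nginx/nginx-node0.conf
# instance bound to node 1
$ sudo numactl --cpunodebind=1 --membind=1 nginx -c /etc/nginx/nginx-node1.conf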

You can verify the efficiency of your NUMA placement by looking at the latency of your memory operations, e.g. by using bcc's funclatency to measure the latency of a memory-heavy operation such as memmove.
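For example, with bcc installed, a sketch of measuring libc memmove latency inside a single nginx worker (the PID is a placeholder; -u reports latency in microseconds):

$ sudo funclatency -u -p 21234 c:memmove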

On the kernel side, you can observe efficiency by using perf stat and looking for corresponding memory and scheduler events:

$ perf stat -e sched:sched_stick_numa,sched:sched_move_numa,sched:sched_swap_numa,migrate:mm_migrate_pages,minor-faults -p PID

...

                1      sched:sched_stick_numa
                3      sched:sched_move_numa
               41      sched:sched_swap_numa
            5,239      migrate:mm_migrate_pages
           50,161      minor-faults

The last bit of NUMA-related optimization for network-heavy workloads comes from the fact that a network card is a PCIe device, and each device is bound to its own NUMA node; therefore some CPUs will have lower latency when talking to the network. We'll discuss optimizations that can be applied there when we discuss NIC→CPU affinity, but for now let's switch gears to PCI-Express…
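You can check which NUMA node a NIC is attached to, and which CPUs are local to it, via sysfs (eth0 is a placeholder for your interface name):

$ cat /sys/class/net/eth0/device/numa_node
$ cat /sys/class/net/eth0/device/local_cpulist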

PCIe

Normally you do not need to go too deep into PCIe troubleshooting unless you have some kind of hardware malfunction. Therefore it is usually worth spending minimal effort there by just creating “link width”, “link speed”, and possibly RxErr/BadTLP alerts for your PCIe devices. This should save you hours of troubleshooting broken hardware or failed PCIe negotiation. You can use lspci for that:

$ lspci -s 0a:00.0 -vvv

...
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
...
Capabilities: [100 v2] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- ...
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- ...
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+

PCIe may become a bottleneck, though, if you have multiple high-speed devices competing for the bandwidth (e.g. when you combine fast network with fast storage), therefore you may need to physically shard your PCIe devices across CPUs to get maximum throughput.