Failed to Capture Events: max Number of Clients Reached

In this post, we cover best practices for interacting with Amazon ElastiCache for Redis resources with commonly used open-source Redis client libraries. ElastiCache is compatible with open-source Redis. However, you may still have questions about how to optimize your applications and associated Redis client library configurations to interact with ElastiCache. These issues typically arise when operating ElastiCache clusters at large scale, or when gracefully handling cluster resize events. Learn best practices for common scenarios and follow along with code examples of some of the most popular open-source Redis client libraries (redis-py, PHPRedis, and Lettuce).

Large number of connections

Individual ElastiCache for Redis nodes support up to 65,000 concurrent client connections. However, to optimize for performance, we advise that client applications do not constantly operate at that level of connection. Redis is a single-threaded process based on an event loop where incoming client requests are handled sequentially. That means the response time of a given client becomes longer as the number of connected clients increases.

You can take the following set of actions to avoid hitting a connection bottleneck on the Redis server:

  • Perform read operations from read replicas. This can be done by using the ElastiCache reader endpoints in cluster mode disabled or by using replicas for reads in cluster mode enabled.
  • Distribute write traffic across multiple primary nodes. You can do this in two ways. You can use a multi-sharded Redis cluster with a Redis cluster mode capable client. You could also write to multiple primary nodes in cluster mode disabled with client-side sharding.
  • Use a connection pool when available in your client library.

In general, creating a TCP connection is a computationally expensive operation compared to typical Redis commands. For example, handling a SET/GET request is an order of magnitude faster when reusing an existing connection. Using a client connection pool with a finite size reduces the overhead of connection management. It also bounds the number of concurrent incoming connections from the client application.

The following code example of PHPRedis shows that a new connection is created for each new user request:
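A minimal PHPRedis sketch of this anti-pattern (the endpoint name is a placeholder, not a real cluster):

```php
<?php
// Anti-pattern: connect() opens a fresh TCP connection on every request.
$redis = new Redis();
// 'example-cluster.amazonaws.com' is a placeholder endpoint.
$redis->connect('example-cluster.amazonaws.com', 6379);
$redis->set('key', 'value');
$redis->close();
?>
```

Because the connection is created and torn down per request, every call pays the full TCP handshake cost.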

We benchmarked this code in a loop on an Amazon Elastic Compute Cloud (Amazon EC2) instance connected to a Graviton2 (m6g.2xlarge) ElastiCache for Redis node. We placed both the client and server in the same Availability Zone. The average latency of the entire operation was 2.82 milliseconds.

When we updated the code to use persistent connections and a connection pool, the average latency of the entire operation was 0.21 milliseconds:
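A sketch of the pooled variant using pconnect (the endpoint name is a placeholder):

```php
<?php
// pconnect() reuses a pooled, persistent connection across requests
// (requires redis.pconnect.pooling_enabled=1 in redis.ini).
$redis = new Redis();
// 'example-cluster.amazonaws.com' is a placeholder endpoint.
$redis->pconnect('example-cluster.amazonaws.com', 6379);
$redis->set('key', 'value');
// No close(): the connection is returned to the pool for reuse.
?>
```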

Required redis.ini configurations:

1. redis.pconnect.pooling_enabled=1

2. redis.pconnect.connection_limit=10

The following code is an example of a redis-py connection pool:

The following code is an example of a Lettuce connection pool:
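A sketch using Lettuce's ConnectionPoolSupport together with Apache Commons Pool 2 (both dependencies assumed on the classpath; the endpoint name is a placeholder):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.support.ConnectionPoolSupport;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

public class PoolExample {
    public static void main(String[] args) throws Exception {
        // The endpoint below is a placeholder.
        RedisClient client = RedisClient.create("redis://example-cluster.amazonaws.com:6379");

        GenericObjectPoolConfig<StatefulRedisConnection<String, String>> config =
                new GenericObjectPoolConfig<>();
        config.setMaxTotal(10);  // finite pool size bounds concurrent connections

        GenericObjectPool<StatefulRedisConnection<String, String>> pool =
                ConnectionPoolSupport.createGenericObjectPool(client::connect, config);

        try (StatefulRedisConnection<String, String> connection = pool.borrowObject()) {
            connection.sync().set("key", "value");
        }  // closing a pooled connection returns it to the pool

        pool.close();
        client.shutdown();
    }
}
```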

Redis cluster client discovery and exponential backoff

When connecting to an ElastiCache for Redis cluster in cluster mode enabled, the corresponding Redis client library must be cluster aware. The clients must obtain a map of hash slots to the corresponding nodes in the cluster in order to send requests to the right nodes and avoid the performance overhead of handling cluster redirections. As a result, the client must discover a complete list of slots and the mapped nodes in two different situations:

  • The client is initialized and must populate the initial slots configuration
  • A MOVED redirection is received from the server, such as in the situation of a failover when all slots served by the former primary node are taken over by the replica, or re-sharding when slots are being moved from the source primary to the target primary node

Client discovery is usually done by issuing a CLUSTER SLOTS or CLUSTER NODES command to the Redis server. We recommend the CLUSTER SLOTS method because it returns the set of slot ranges and the associated primary and replica nodes back to the client. This doesn't require additional parsing from the client and is more efficient.

Depending on the cluster topology, the size of the response for the CLUSTER SLOTS command can vary based on the cluster size. Larger clusters with more nodes produce a larger response. As a result, it's important to ensure that the number of clients doing the cluster topology discovery doesn't grow unbounded. For example, when the client application starts up or loses its connection to the server and must perform cluster discovery, one common mistake is that the client application fires several reconnection and discovery requests without adding exponential backoff upon retry. This can render the Redis server unresponsive for a prolonged period of time, with the CPU utilization at 100%. The outage is prolonged if each CLUSTER SLOTS command must process a large number of nodes in the cluster bus. We have observed multiple client outages in the past due to this behavior across a number of different languages including Python (redis-py-cluster) and Java (Lettuce and Redisson).

To mitigate the impact caused by a sudden influx of connection and discovery requests, we recommend the following:

  • Implement a client connection pool with a finite size to bound the number of concurrent incoming connections from the client application.
  • When the client disconnects from the server due to timeout, retry with exponential backoff with jitter. This helps to avoid multiple clients overwhelming the server at the same time.
  • Use the ElastiCache configuration endpoint to perform cluster discovery. In doing so, you spread the discovery load across all nodes in the cluster (up to 90) instead of hitting a few hardcoded seed nodes in the cluster.

The following are some code examples for exponential backoff retry logic in redis-py, PHPRedis, and Lettuce.

Backoff logic sample 1: redis-py

redis-py has a built-in retry mechanism that retries one time immediately after a failure. This mechanism can be enabled through the retry_on_timeout argument supplied when creating a Redis object. Here we demonstrate a custom retry mechanism with exponential backoff and jitter. We've submitted a pull request to natively implement exponential backoff in redis-py (#1494). In the future it may not be necessary to implement it manually.

You can then use the following code to set a value:
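A sketch of such a full-jitter retry helper, plus a call that sets a value (the helper name and the endpoint are placeholders of our own, not redis-py APIs):

```python
import random
import time

def run_with_backoff(func, retries=5, base=0.5, cap=10.0,
                     retryable=(ConnectionError, TimeoutError)):
    """Call func, retrying failures with full-jitter exponential backoff."""
    for attempt in range(retries):
        try:
            return func()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            # Full jitter: sleep a random duration up to the capped exponential bound.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# With redis-py, you would wrap a command like:
#   import redis
#   client = redis.Redis(host="example-cluster.amazonaws.com", port=6379)  # placeholder
#   run_with_backoff(lambda: client.set("key", "value"),
#                    retryable=(redis.exceptions.ConnectionError,
#                               redis.exceptions.TimeoutError))
```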

Depending on your workload, you might want to change the base backoff value from 1 second to a few tens or hundreds of milliseconds for latency-sensitive workloads.

Backoff logic sample 2: PHPRedis

PHPRedis has a built-in retry mechanism that retries a (non-configurable) maximum of 10 times. There is a configurable delay between tries (with a jitter from the second retry onwards). For more information, see the following sample code. We've submitted a pull request to natively implement exponential backoff in PHPRedis (#1986) that has since been merged and documented. For those on the latest release of PHPRedis, it won't be necessary to implement this manually, but we've included the reference here for those on previous versions. For now, the following is a code example that configures the delay of the retry mechanism:
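A minimal sketch using the retry_interval argument of connect (the endpoint name is a placeholder):

```php
<?php
$redis = new Redis();
// connect(host, port, timeout, persistent_id, retry_interval):
// retry_interval (in milliseconds) sets the delay between connection retries.
// The endpoint below is a placeholder.
$redis->connect('example-cluster.amazonaws.com', 6379, 0.5, NULL, 100);
?>
```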

Backoff logic sample 3: Lettuce

Lettuce has built-in retry mechanisms based on the exponential backoff strategies described in the post Exponential Backoff and Jitter. The following is a code excerpt showing the full jitter approach:
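A sketch using Lettuce's Delay.fullJitter on the client resources (the endpoint name is a placeholder):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.Delay;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class BackoffExample {
    public static void main(String[] args) {
        // Reconnect with full-jitter exponential backoff: each attempt sleeps a
        // random delay bounded by an exponentially growing cap, between 100 ms and 10 s.
        ClientResources resources = ClientResources.builder()
                .reconnectDelay(Delay.fullJitter(
                        Duration.ofMillis(100),          // minimum delay
                        Duration.ofSeconds(10),          // maximum delay
                        100, TimeUnit.MILLISECONDS))     // backoff base
                .build();

        // The endpoint below is a placeholder.
        RedisClient client = RedisClient.create(resources,
                "redis://example-cluster.amazonaws.com:6379");
        // ... use the client ...
        client.shutdown();
        resources.shutdown();
    }
}
```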

Configure a client-side timeout

Configure the client-side timeout appropriately to allow the server sufficient time to process the request and generate the response. This also allows the client to fail fast if the connection to the server can't be established. Certain Redis commands can be more computationally expensive than others, for example, Lua scripts or MULTI/EXEC transactions that contain multiple commands that must be run atomically. In general, a higher client-side timeout is recommended to avoid timing out the client before the response is received from the server, including in the following cases:

  • Running commands across multiple keys
  • Running MULTI/EXEC transactions or Lua scripts that consist of multiple individual Redis commands
  • Reading large values
  • Performing blocking operations such as BLPOP

In the case of a blocking operation such as BLPOP, the best practice is to set the command timeout to a number lower than the socket timeout.

The following are code examples for implementing a client-side timeout in redis-py, PHPRedis, and Lettuce.

Timeout configuration sample 1: redis-py

The following is a code example with redis-py:

Timeout config sample 2: PHPRedis

The following is a code example with PHPRedis:
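A minimal PHPRedis sketch (the endpoint name is a placeholder):

```php
<?php
$redis = new Redis();
// Third argument: connection timeout in seconds. The endpoint is a placeholder.
$redis->connect('example-cluster.amazonaws.com', 6379, 0.5);
// Read timeout in seconds for individual commands.
$redis->setOption(Redis::OPT_READ_TIMEOUT, 1.0);
?>
```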

Timeout config sample 3: Lettuce

The following is a code example with Lettuce:
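A sketch combining a connection timeout on the RedisURI with per-command timeouts via TimeoutOptions (the endpoint name is a placeholder):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.TimeoutOptions;

import java.time.Duration;

public class TimeoutExample {
    public static void main(String[] args) {
        // The endpoint below is a placeholder.
        RedisURI uri = RedisURI.Builder
                .redis("example-cluster.amazonaws.com")
                .withPort(6379)
                .withTimeout(Duration.ofSeconds(10))  // connection timeout
                .build();

        RedisClient client = RedisClient.create(uri);
        // Enable per-command timeouts on the client.
        client.setOptions(ClientOptions.builder()
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(1)))
                .build());
        // ... use the client ...
        client.shutdown();
    }
}
```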

Configure a server-side idle timeout

We have observed cases where a customer's application has a high number of idle clients connected, but isn't actively sending commands. In such scenarios, you can exhaust all 65,000 connections with a high number of idle clients. To avoid such scenarios, configure the timeout setting appropriately on the server via ElastiCache Redis parameter groups. This ensures that the server actively disconnects idle clients to avoid an increase in the number of connections.
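For example, the timeout parameter can be set with the AWS CLI (the parameter group name below is a placeholder):

```shell
# Set the Redis "timeout" parameter so the server disconnects clients
# idle for more than 300 seconds. "my-parameter-group" is a placeholder.
aws elasticache modify-cache-parameter-group \
    --cache-parameter-group-name my-parameter-group \
    --parameter-name-values "ParameterName=timeout,ParameterValue=300"
```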

Redis Lua scripts

Redis supports more than 200 commands, including those to run Lua scripts. However, when it comes to Lua scripts, there are several pitfalls that can impact memory and availability of Redis.

Unparameterized Lua scripts

Each Lua script is cached on the Redis server before it runs. Unparameterized Lua scripts are unique, which can lead to the Redis server storing a large number of Lua scripts and consuming more memory. To mitigate this, ensure that all Lua scripts are parameterized, and regularly perform SCRIPT FLUSH to clean up cached Lua scripts if needed.

The following example shows how to use parameterized scripts. First, we have an example of an unparameterized approach that results in three different cached Lua scripts and is not recommended:
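A sketch of the anti-pattern (Redis caches scripts by the SHA1 of their body, which we compute here to show that three distinct scripts result; the key and value names are placeholders):

```python
import hashlib

# Anti-pattern: keys and values are baked into the script body, so every
# call produces a distinct script that Redis caches separately.
scripts = [
    f"return redis.call('SET', 'key:{i}', 'value:{i}')" for i in range(3)
]
# Redis caches scripts by the SHA1 of their body: three bodies, three cache entries.
shas = [hashlib.sha1(s.encode()).hexdigest() for s in scripts]
# With redis-py, each would be executed as: r.eval(script, 0)
```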

Instead, use the following pattern to create a single script that can accept passed parameters:
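A sketch of the parameterized form (the key and value names are placeholders):

```python
import hashlib

# One script body with KEYS/ARGV placeholders; the arguments vary, the script doesn't.
script = "return redis.call('SET', KEYS[1], ARGV[1])"
sha = hashlib.sha1(script.encode()).hexdigest()
# With redis-py: r.eval(script, 1, f"key:{i}", f"value:{i}") for each i --
# the body is identical every time, so Redis caches a single script.
```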

Long-running Lua scripts

Lua scripts can run multiple commands atomically, so a script can take longer to complete than a regular Redis command. If the Lua script runs only read operations, you can stop it in the middle. However, as soon as the Lua script performs a write operation, it becomes unkillable and must run to completion. A long-running Lua script that mutates data can cause the Redis server to be unresponsive for a long time. To mitigate this issue, avoid long-running Lua scripts and test scripts in a pre-production environment.

Lua script with stealth writes

There are a few ways a Lua script can continue to write new data into Redis even when Redis is over maxmemory:

  • The script starts when the Redis server is below maxmemory, and contains multiple write operations inside
  • The script's first write command isn't consuming memory (such as DEL), followed by more write operations that consume memory

You can mitigate this problem by configuring a proper eviction policy on the Redis server other than noeviction. This allows Redis to evict items and free up memory in between Lua scripts.
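For example, the eviction policy can be changed with the AWS CLI (the parameter group name and chosen policy below are placeholders; pick the policy that fits your workload):

```shell
# Switch the eviction policy from noeviction to volatile-lru via the
# parameter group. "my-parameter-group" is a placeholder name.
aws elasticache modify-cache-parameter-group \
    --cache-parameter-group-name my-parameter-group \
    --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=volatile-lru"
```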

Storing large composite items

We have observed cases where an application stores large composite items in Redis (such as a multi-GB hash dataset). This is not a recommended practice because it often leads to performance problems in Redis. For example, the client can issue an HGETALL command to retrieve the entire multi-GB hash collection. This can generate significant memory pressure on the Redis server buffering the large item in the client output buffer. Also, for slot migration in cluster mode, ElastiCache doesn't migrate slots that contain items with a serialized size larger than 256 MB.

To solve the large item issues, we have the following recommendations:

  • Break up the large composite item into multiple smaller items. For example, break up a large hash collection into individual key-value fields with a key name scheme that appropriately reflects the collection, such as using a common prefix in the key name to identify the collection of items. If you must access multiple fields in the same collection atomically, you can use the MGET command to retrieve multiple key-values in the same command.
  • If you have evaluated all options and still can't break up the large collection dataset, try to use commands that operate on a subset of the data in the collection instead of the entire collection. Avoid having a use case that requires you to atomically retrieve the entire multi-GB collection in the same command. One example is using HGET or HMGET commands instead of HGETALL on hash collections.
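The prefix-based naming scheme from the first recommendation can be sketched as follows (the helper function and key names are hypothetical, introduced here for illustration):

```python
# A hypothetical key-naming helper: fields of one logical collection share a
# common prefix, so related values can be fetched together with MGET.
def collection_keys(prefix, fields):
    return [f"{prefix}:{field}" for field in fields]

keys = collection_keys("user:1234", ["name", "email", "plan"])
# With redis-py (assuming a connected client r):
#   r.mset(dict(zip(keys, ["Ana", "ana@example.com", "pro"])))
#   values = r.mget(keys)  # one round trip instead of one multi-GB HGETALL
```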

Conclusion

In this post, we reviewed Redis client library best practices when using ElastiCache, and ways to avoid common pitfalls. By adhering to best practices, you can increase the performance, reliability, and operational excellence of your ElastiCache environments. If you have any questions or feedback, reach out on the Amazon ElastiCache discussion forum or in the comments.


About the Authors

Qu Chen is a senior software development engineer at Amazon ElastiCache – the team responsible for building, operating, and maintaining the highly scalable and performant Redis managed service at AWS. In addition, he is an active contributor to the open-source Redis project. In his spare time, he enjoys sports, outdoor activities, and playing piano music.

Jim Gallagher is an Amazon ElastiCache Specialist Solutions Architect based in Austin, TX. He helps AWS customers across the world best leverage the power, simplicity, and beauty of Redis. Outside of work he enjoys exploring the Texas Hill Country with his wife and son.

Nathaniel Braun is a Senior Software Development Engineer at Amazon Web Services, based in Tel Aviv, Israel. He designs and operates large-scale distributed systems and likes to tackle difficult issues with his team. Outside of work he enjoys hiking, sailing, and drinking coffee.

Asaf Porat Stoler is a Software Development Manager at Amazon ElastiCache, based in Tel Aviv, Israel. He has vast and diverse experience in storage systems, data reduction, and in-memory databases, and likes performance and resource optimizations. Outside of work he enjoys sport, hiking, and spending time with his family.


Source: https://aws.amazon.com/blogs/database/best-practices-redis-clients-and-amazon-elasticache-for-redis/
