OpenVidu load testing: a systematic study of OpenVidu platform performance
OpenVidu has evolved a lot since its inception nearly two years ago. The team has been really focused on building the core features for our real-time WebRTC based multimedia platform, putting a lot of effort into providing a developer-friendly environment. We have invested time in supporting as many languages and frameworks as possible, as well as building many guided tutorials to make the first contact with OpenVidu as smooth as possible.
We are really proud of the results so far, and the growth and usage of OpenVidu, higher every day, proves that the effort has been worthwhile.
That said, the next logical step in our roadmap is working towards a more robust, trustworthy setup for production environments. OpenVidu is right now a great platform for many use cases, but it lacks an undoubtedly important feature: automated scalability. This may put off developers who want to implement self-sustaining and self-scalable massive videoconference systems. So, the first step at this exciting moment for us is to answer the following question:
How many media connections can an OpenVidu instance handle right now?
This may seem like an obvious question that we should have answered a long time ago, but answering it (and answering it well) is not an easy task. Of course we have tested pretty big video sessions involving many users many times, but we don’t have enough people and devices to really push OpenVidu to its limits. And naturally, performing such a test manually is neither practical, nor fast, nor elegant. So, the following statement became an absolute priority:
We must have an automated load test system that allows us to run massive video conference scenarios, replicate them and analyse their behavior.
The spark that ignited OpenVidu load testing
Some months ago CoSMo Software published a very interesting study comparing the performance of different open source WebRTC SFU solutions (here you have the paper and some slides presenting it). OpenVidu was included… and it didn’t perform great. In fact, it performed really badly compared with the other alternatives. These are the reasons we thought might have led to this kind of result at that moment, and which we could confirm and fix later:
1) Issue with file descriptors management in OpenVidu Server
CoSMo Software had some problems running OpenVidu on large instances (≥ 16 cores), and they impacted the final performance in their experiment. We discovered later that there was a bug related to the number of file descriptors that OpenVidu could open at the same time on machines with this number of cores. That bug is now resolved.
2) Fixed upper limit for connection’s bandwidth
CoSMo Software’s experiment took the bitrate sent and received by the clients during the load test as one of its quality metrics. By default, OpenVidu was limiting the maximum bandwidth both sent and received by clients to 600 Kbps (we agreed that this was too restrictive). That limit is now set to 1 Mbps by default, and in our tests there is no limit at all.
3) Some libnice bugs
libnice is an open source library implementing the ICE protocol, a crucial part of any WebRTC communication process, and OpenVidu includes it for that purpose. libnice has recently been updated with some important patches that bring a remarkable performance boost. This is further explained in the Kurento 6.9.0 release notes.
Besides these points, CoSMo Software ran the experiment using a Docker deployment of OpenVidu, which is not currently officially supported and in the end is another layer of complexity to take into account when performing this kind of test. We will use a native Ubuntu deployment of OpenVidu (exactly the same as stated in the OpenVidu Documentation).
Designing a load testing environment for OpenVidu
Taking into account the situation described in the previous section, we decided to design our load test to be as similar as possible to CoSMo Software’s experiment. It can be summarized as follows:
1) Test involves three different types of Amazon EC2 instances
Each execution of our load test requires launching many servers in AWS. Every server is always launched in the same AWS region (this drastically reduces costs, as Amazon doesn’t charge extra fees for internal network usage inside the same Availability Zone). Each server is one of the following types:
- OpenVidu Server Instance: a single Ubuntu 16.04 server hosting one instance of OpenVidu Server, deployed as explained in OpenVidu Docs.
- Client Instances: Ubuntu 16.04 servers running a dockerized Chrome browser, which will join an OpenVidu session. One Chrome per instance.
- Test Orchestrator Instance: a single Ubuntu 16.04 server where the test will run. It is the coordinator of the test: it launches every Client Instance when required, sends commands to the Client Instances to perform actions in each Chrome browser, and collects and processes statistics and results from both the Client Instances and the OpenVidu Server Instance.
2) Test increases OpenVidu Server load until no more requests can be processed
Our Test Orchestrator Instance will start initializing an OpenVidu videoconference with 7 users, each one of them transmitting 1 audio-video stream and receiving 6 remote audio-video streams (in total that is 7 Client-to-OpenVidu streams and 42 OpenVidu-to-Client streams for this first videoconference).
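The stream counts above follow directly from the N-to-N layout, and the arithmetic can be sketched in a few lines. This is only an illustration: the function name `session_streams` is made up for this example and is not part of the load test code.

```python
from typing import Tuple


def session_streams(users: int) -> Tuple[int, int]:
    """Return (client-to-server, server-to-client) stream counts for an N-to-N session."""
    upstream = users                   # each user publishes one audio-video stream
    downstream = users * (users - 1)   # each user receives every other user's stream
    return upstream, downstream


up, down = session_streams(7)
print(up, down, up + down)  # 7 42 49

# Four such sessions running at once (hypothetical count chosen for illustration).
sessions = 4
print(sessions * 7, sessions * (up + down))  # 28 196
```

Four 7-to-7 sessions therefore add up to 28 users and 196 streams, which matches the c5.large figures reported in the results section.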
For this first session the orchestrator launches 7 Client Instances and waits for them to be up and running. Once all of them are ready to join the same video conference, our Test Orchestrator Instance will tell each Chrome to join the session and will poll the status of each browser until all of them are transmitting and receiving the videos fine. Only at this point do we consider the videoconference to be stable, and we proceed to start a new video conference session in exactly the same way: 7 new Client Instances join a new 7-to-7 videoconference, and we wait for it to be stable.
This process is repeated until a session fails to reach the expected stable status, and only then does the load test come to an end. This happens when CPU load on the OpenVidu Server Instance reaches 100% and no more requests can be served.
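The ramp-up logic just described can be summarised as a small loop. The following is a minimal sketch, not the actual orchestrator code: `launch_clients` and `session_is_stable` are hypothetical callables standing in for launching EC2 Client Instances and polling the Chrome browsers, and the fake server at the bottom exists only to make the example runnable.

```python
from typing import Callable, List

USERS_PER_SESSION = 7  # from the test design: every session is 7-to-7


def run_ramp_up(
    launch_clients: Callable[[int, int], List[str]],
    session_is_stable: Callable[[List[str]], bool],
    max_sessions: int = 1000,
) -> int:
    """Add sessions one by one until one fails to stabilise.

    Returns the number of sessions that reached a stable state.
    Both callables are hypothetical stand-ins for the real orchestration.
    """
    stable_sessions = 0
    for session_index in range(max_sessions):
        # Launch one Client Instance (one Chrome) per user of the new session.
        clients = launch_clients(session_index, USERS_PER_SESSION)
        # Check whether every browser is sending and receiving its streams.
        if not session_is_stable(clients):
            break  # the server is saturated: the load test ends here
        stable_sessions += 1
    return stable_sessions


# Simulated run against a fake server that sustains at most 28 users.
launched: List[str] = []


def fake_launch(session_index: int, users: int) -> List[str]:
    clients = [f"session{session_index}-user{u}" for u in range(users)]
    launched.extend(clients)
    return clients


def fake_is_stable(clients: List[str]) -> bool:
    return len(launched) <= 28


stable = run_ramp_up(fake_launch, fake_is_stable)
print(stable, stable * USERS_PER_SESSION)  # 4 28
```

In this simulation the fifth session never stabilises, so the loop stops with four stable sessions. In the real test the stop condition is the browsers failing to reach a stable state, not a hard-coded user limit.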
This whole process is outlined in the diagram below:
3) Test gathers statistics and information from OpenVidu Server and clients
Finally, it is worth mentioning the information and statistics gathering process performed by the Test Orchestrator Instance. It periodically stores:
- CPU, memory and network usage of OpenVidu Server Instance.
- Statistics of each WebRTC stream of each Client Instance. This includes metrics such as sent and received bandwidth, delay, jitter, packet loss and round-trip time.
- Packet dump for every Client Instance. For a few seconds once the video session is stable, the Test Orchestrator Instance makes each instance store its whole packet dump so that it can be retrieved later. This gives us the lowest level of information we can gather from the Chrome browsers.
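To give an idea of how periodic per-stream samples like these can be turned into data points for a graph, here is a small sketch. The record layout (`StreamSample` and its fields) is an assumption made for this example; the real test may store different fields in a different format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class StreamSample:
    # Hypothetical record layout, used only for this illustration.
    timestamp_s: int
    client_id: str
    received_kbps: float
    rtt_ms: float
    packets_lost: int


def summarise(samples: List[StreamSample]) -> Dict[int, Dict[str, float]]:
    """Group samples by timestamp and aggregate each metric across streams."""
    by_time: Dict[int, List[StreamSample]] = {}
    for sample in samples:
        by_time.setdefault(sample.timestamp_s, []).append(sample)
    return {
        ts: {
            "avg_received_kbps": mean(s.received_kbps for s in group),
            "avg_rtt_ms": mean(s.rtt_ms for s in group),
            "total_packets_lost": sum(s.packets_lost for s in group),
        }
        for ts, group in sorted(by_time.items())
    }


samples = [
    StreamSample(0, "client-1", 500.0, 20.0, 0),
    StreamSample(0, "client-2", 300.0, 40.0, 2),
    StreamSample(5, "client-1", 450.0, 30.0, 1),
]
summary = summarise(samples)
print(summary[0])
```

Each entry of `summary` is one point in time, which is the shape a plotting library typically expects for a time series.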
With this information we can generate useful graphs at the end of the test. For example:
Regarding the video sent by every Client Instance, it is always the same: a 540x360, 30 fps video (it is the same file used by CoSMo Software in their experiment). We also made it possible to record any Chrome browser inside the Client Instances thanks to FFmpeg. In this way we have real proof of the final quality of the videoconferences, no matter what the statistics say on paper. A frame of this recording is shown below.
And finally: OpenVidu load test results
Still reading this post? Naturally this is the part you are most interested in as an OpenVidu user, but all that has been explained up to this point is of the utmost importance in order to fully understand the nature of the following results. Here we go:
First of all, RAM is never a problem, as expected: WebRTC is basically a CPU-intensive process. Network performance isn’t a problem either, thanks to the generous bandwidth provided by AWS (in the test with the largest machine, a total of 250 GB of data was received/transmitted by the OpenVidu Server Instance in just 27 minutes). So the resource that finally triggers the termination of the test is always the CPU, when it reaches 100% usage and no more requests can be handled.
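To put the 250 GB figure in perspective, it can be converted into an average throughput. This is a back-of-the-envelope sketch that assumes decimal gigabytes (10^9 bytes) and counts received and transmitted data together, as the figure above does.

```python
def average_throughput_gbps(total_gigabytes: float, minutes: float) -> float:
    """Average throughput in Gbit/s, assuming 1 GB = 10**9 bytes."""
    total_bits = total_gigabytes * 1e9 * 8
    seconds = minutes * 60
    return total_bits / seconds / 1e9


gbps = average_throughput_gbps(250, 27)
print(f"{gbps:.2f} Gbit/s")  # 1.23 Gbit/s
```

That is an average of roughly 1.2 Gbit/s over the whole run. Since load ramps up during the test, the rate at the end of the test would have been higher than this average.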
With this in mind, we can see that even a very small server such as a c5.large instance can handle 28 users at the same time, for a total of 196 WebRTC audio-video streams. This number grows with the size of the server, but not linearly: bigger machines, although they can of course handle many more streams, do not provide a fully proportional improvement.
This must be taken into account depending on the nature of the applications implemented with OpenVidu: if one session is going to host more than 100 users, then a big machine is mandatory. But if sessions are expected to be less crowded and it is possible to use smaller machines, overall efficiency will be a little bit higher. This result also means that implementing an autoscaling system that can take advantage of multiple smaller machines will have an even higher priority in the OpenVidu roadmap.
Next steps regarding load testing
Now that we can easily replicate this load test, we will extend it to try different scenarios that might interest some developers:
- Test sessions with other topologies, such as 1:N
- Test different types of video resolutions (HD)
- Test audio-only sessions
- Use Firefox browser as client instead of Chrome
And many more scenarios or combinations of them. If you are interested, all of the source code needed to run this load test is open source and available on GitHub. Of course we invite you to take a look at it!