Performance of the OpenProvider system

Sharing our experience of working on the OpenProvider system is a useful way to pass on new skills and knowledge, but also to point out potential problems. In this post, we therefore bring you more details about the situations we encountered along the way.

HOW TO SET UP THE TEST

In the previous blog post, we explained the problem with the Cassandra cluster that we faced during performance tests and the design changes that helped us overcome it. Now it is time to present the results!

Our final goal is to measure the number of TPS for 3 DCs (data centers), with each DC in active mode. Because of the importance of the OpenProvider system, every client demands at least a 2 DC design in active-active mode. The three-DC scenario is one step further in terms of redundancy, better customer experience, and future expansion by introducing new services. Besides the 3 DC scenario, we also have to measure results for the 2 DC case (one DC failure) and the 1 DC case (two DCs out of service). The number of TPS reached in the 3 DC scenario has to be at the same level as in the 2 DC scenario; in other words, an outage of one DC should not affect the performance of the system. When two DCs are down, the target is half of the maximum number of transactions.

Hardware footprint for each test:

  • Policing Front node with resources:
      • 32 vCPU
      • 16 GB RAM
  • Cassandra DB node with resources:
      • 24 vCPU
      • 64 GB RAM
  • Traffic generator.

For each test, we have 3 DB nodes per DC. As stated in the previous blog post, a policing node communicates only with its local Cassandra DB nodes, i.e. there is no cross-data-center communication between policing nodes and Cassandra DB nodes. The replication factor of the keyspace is 3 for each DC, which means that each node contains all the data.
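
As a rough illustration of this layout, the keyspace and the DC-local routing could be defined as follows (a minimal sketch using the Python cassandra-driver; the contact points, keyspace name and DC names are placeholders, and the default consistency level shown here is an assumption rather than the exact setting used in the tests):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy

    # Route queries only to the local DC, mirroring the design in which a policing
    # node never talks to Cassandra nodes in another data center.
    profile = ExecutionProfile(
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="DC1"),
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)  # assumption

    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # placeholder contact points
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect()

    # Replication factor 3 in every DC: each node ends up holding all the data.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS openprovider
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'DC1': 3, 'DC2': 3, 'DC3': 3}
    """)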

We use Cassandra version 3.11.8 (the latest stable Cassandra release at the time). The JVM is configured to use 31 GB of RAM with the G1GC garbage collector. The number of virtual nodes is 16 per node.
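
For reference, those settings correspond roughly to the following fragments of cassandra.yaml and jvm.options (only the lines relevant here, not our full configuration):

    # cassandra.yaml
    num_tokens: 16

    # jvm.options
    -Xms31G
    -Xmx31G
    -XX:+UseG1GC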

For each test scenario we defined the following flow of messages:

  • Authorization request
  • Accounting start, if the response to the authorization request was an Access-Accept message
  • 5 interim messages
  • Stop message

Why is this the most difficult scenario for the system? Each of these messages can be sent to a different DC. The system has to synchronize data across all DCs, which are more than 150 km apart, and be ready to process each message no matter where the request arrives. A few milliseconds after the system processes the accounting start message and inserts a session into the DB, an interim message arrives. OpenProvider has to verify that the session for the subscriber exists, fetch the session data, and perform the configured logic. This flow is possible in real traffic, but with low probability. We know that it is the most difficult case for the system, so we can take it as the lower limit of system performance, i.e. we can say that the system can process at least this number of TPS.

Usually, an interim message arrives at least 15 minutes after the start message; for some clients, the interim interval is 2 hours. In this test case, after the system has processed 5 interim messages, the traffic generator sends a stop message. As we will see further in this blog post, the complete flow finishes within 50 milliseconds in the 3 DC scenario.
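
To make the flow concrete, here is a minimal sketch of what the traffic generator does for a single session (the send_auth and send_acct helpers are hypothetical placeholders for the real RADIUS client used in the tests):

    ACCESS_ACCEPT = "Access-Accept"  # hypothetical reply code returned by send_auth

    def run_session(session_id, send_auth, send_acct):
        # 1. Authorization request
        if send_auth(session_id) != ACCESS_ACCEPT:
            return  # no accounting flow if access was not accepted

        # 2. Accounting start inserts the session into the DB
        send_acct(session_id, status="Start")

        # 3. Five interim updates; in the test they follow within milliseconds,
        #    while in production they would arrive every 15 minutes or more
        for _ in range(5):
            send_acct(session_id, status="Interim-Update")

        # 4. Accounting stop closes the session
        send_acct(session_id, status="Stop")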

Each test lasts 15 minutes. The traffic generator reaches the maximum number of transactions within a few seconds, so we can consider the system to be under maximum load for the whole test period. We will use 5 values as indicators of system performance:

  • Incoming TPSs – the peak number of requests that the traffic generator sends to OpenProvider, reflecting the bursty nature of incoming traffic. All requests are RADIUS requests (authorization + accounting requests)
  • Outgoing TPSs – the peak number of requests that OpenProvider processes. This includes RADIUS requests (authorization + accounting + accounting proxy requests) and LDAP requests
  • Worker processing – the number of RADIUS threads that are processing RADIUS or LDAP requests
  • Avg auth request processing time (ms) – the average time, in milliseconds, that OpenProvider needs to process an authorization request (QoS parameter)
  • Avg acct request processing time (ms) – the average time, in milliseconds, that OpenProvider needs to process an accounting request (QoS parameter)

WHAT ARE THE RESULTS OF THE TEST SCENARIOS

Test scenario 1 (TS 1): One Policing Node in One DC

This is the worst-case scenario. It is assumed that only one policing node is up and running. The cluster has 3 nodes, all in the same DC. With 32 vCPU, a policing node can be configured to work with 64 RADIUS threads (workers); these 64 RADIUS threads are common to all test scenarios. The system is not integrated with any external component (for example, LDAP or any external DB system). It is assumed that the AVPs that OpenProvider receives in the request are enough to perform all the logic.

For this test case we have the following results:

Table 1: Test without accounting proxy
  • Incoming TPSs: ~7000
  • Outgoing TPSs: ~7000
  • Worker processing: 32
  • Avg auth request processing time (ms): 2.5
  • Avg acct request processing time (ms): 4

Table 2: Test with accounting proxy
  • Incoming TPSs: ~7000
  • Outgoing TPSs: ~14000
  • Worker processing: 32
  • Avg auth request processing time (ms): 2.5
  • Avg acct request processing time (ms): 4

So the accounting proxy requests do not affect the performance of the system. Above 7000 TPS, the average authorization and accounting processing times start to grow very fast and the system begins to drop incoming requests. The number of workers on the policing node is at its maximum value of 64, and Cassandra's increased read and mutation latency is the problem. So, in this test case, the Cassandra cluster is the bottleneck. We can conclude that one policing node in one DC can process about 7000 incoming TPS.

 

Test scenario 2 (TS 2): Two Policing Nodes in One DC

This is the scenario in which one DC is fully operational and the other two DCs are out of service. It is assumed that both policing nodes are up and running. The cluster is the same as in the previous test case, i.e. it has 3 nodes, all in the same DC.

The system is not integrated with any external component (LDAP or any external DB system). It is assumed that the AVPs that OpenProvider receives in the request are enough to perform all the logic. Also, it is assumed that the traffic generator distributes incoming requests equally across both policing nodes.

For this test case we have the following results:

Table 3: Test without accounting proxy
  • Incoming TPSs: ~7200
  • Outgoing TPSs: ~7200
  • Worker processing: 28
  • Avg auth request processing time (ms): 3.5
  • Avg acct request processing time (ms): 4.5

Table 4: Test with accounting proxy
  • Incoming TPSs: ~7200
  • Outgoing TPSs: ~14400
  • Worker processing: 28
  • Avg auth request processing time (ms): 3.5
  • Avg acct request processing time (ms): 4.5

Accounting proxy requests do not affect the performance of the system. Above 7200 TPS, the system starts to drop requests and the average authorization and accounting processing times start to grow very fast. The number of workers on the policing nodes is at its maximum of 64, and the Cassandra cluster is the bottleneck again. Compared to the previous test scenario, the total number of TPS is only slightly higher than in the one-policing-node scenario, because Cassandra DB is the bottleneck in both cases. The worker processing value is smaller in this test case, which is also expected because the total number of TPS is distributed across two policing nodes.

 

Test scenario 3 (TS 3): Four Policing Nodes in Two DCs

This is the scenario in which only one DC is out of service; the number of TPS should be the same as when all three DCs are up and running. It is assumed that four policing nodes are up and running. The cluster has 6 nodes in total, 3 nodes per DC. The system is not integrated with any external component (LDAP or any external DB system). It is assumed that the AVPs that OpenProvider receives in the request are enough to perform all the logic. Also, it is assumed that the traffic generator distributes incoming requests equally across all four policing nodes.

For this test case we have the following results:

Table 5: Test without accounting proxy
  • Incoming TPSs: ~10000
  • Outgoing TPSs: ~10000
  • Worker processing: 20
  • Avg auth request processing time (ms): 4
  • Avg acct request processing time (ms): 6

Table 6: Test with accounting proxy
  • Incoming TPSs: ~10000
  • Outgoing TPSs: ~20000
  • Worker processing: 20
  • Avg auth request processing time (ms): 4
  • Avg acct request processing time (ms): 6

Accounting proxy requests do not affect the performance of the system. Above 10000 TPS, the system starts to drop requests. The number of workers on the policing nodes is at its maximum of 64, and the Cassandra cluster is the bottleneck again. The cluster is distributed over two DCs that are 150 km apart. As stated at the beginning, the replication factor is 3 and each node contains all the data, so we have intensive inter-DC communication. Also, the traffic generator sends requests in a round-robin manner, which means that we have to guarantee data consistency between the two DCs and use a consistency level that demands cross-DC writes.
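
To illustrate that trade-off, the sketch below (again with the Python cassandra-driver; the table and column names are placeholders, and the exact consistency levels used in our tests are an assumption) contrasts a DC-local write with a write that must be acknowledged in every DC before an interim message arriving at another DC can safely find the session:

    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    INSERT_SESSION = ("INSERT INTO openprovider.sessions (session_id, subscriber, state) "
                      "VALUES (%s, %s, %s)")  # placeholder table and columns

    # DC-local write: fast, but a read in another DC may not see the session yet.
    local_write = SimpleStatement(INSERT_SESSION,
                                  consistency_level=ConsistencyLevel.LOCAL_QUORUM)

    # Cross-DC write: every DC has to acknowledge the mutation, which is what makes
    # round-robin traffic across distant DCs expensive.
    global_write = SimpleStatement(INSERT_SESSION,
                                   consistency_level=ConsistencyLevel.EACH_QUORUM)

    # session.execute(global_write, (session_id, subscriber, "STARTED"))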

Compared to the previous test scenario, the total number of TPS is about 39% higher than in the two-policing-nodes scenario in one DC (~10000 vs. ~7200 TPS). The worker processing value is smaller in this test case, which is also expected because the total number of TPS is distributed across four policing nodes. Also, the average processing time for authorization and accounting requests is slightly higher than in the previous test cases, which shows that Cassandra still responds very quickly as long as it manages to process all requests.

 

Test scenario 4 (TS 4): Six Policing Nodes in Three DCs

This is the scenario in which all three DCs are up and running. It is assumed that six policing nodes are up and running and that the system works at full capacity. The cluster has 9 nodes in total, 3 nodes in each DC. It is assumed that the AVPs that OpenProvider receives in the request are enough to perform all the logic. Also, it is assumed that the traffic generator distributes incoming requests equally across all six policing nodes.

This is the realistic production scenario, so here we have two test types:

  • The system is not integrated with any external component (LDAP or any external DB system)
  • The system is integrated with LDAP and has to send an LDAP request for each incoming authorization request. In line with the 3GPP standard and recommendations, the profile retrieved from the LDAP server is cached and reused for accounting request processing (a sketch of this lookup-and-cache pattern is shown below the list).
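
The lookup-and-cache pattern behind the second test type could look roughly like this (a sketch using the Python ldap3 library; the host, base DN, attribute name and cache policy are placeholders, not the actual OpenProvider implementation):

    from ldap3 import Server, Connection

    profile_cache = {}  # subscriber id -> profile fetched during authorization

    def get_profile(subscriber_id):
        # Accounting requests are served from the cache filled during authorization.
        if subscriber_id in profile_cache:
            return profile_cache[subscriber_id]

        # One LDAP query per incoming authorization request.
        conn = Connection(Server("ldap://ldap.example.net"), auto_bind=True)  # placeholder host
        conn.search("ou=subscribers,dc=example,dc=net",   # placeholder base DN
                    f"(uid={subscriber_id})",
                    attributes=["serviceProfile"])        # placeholder attribute
        profile = conn.entries[0] if conn.entries else None
        conn.unbind()

        profile_cache[subscriber_id] = profile
        return profile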

We will first check the results when the system is not integrated with LDAP.

Table 7: Test without accounting proxy
  • Incoming TPSs: ~10500
  • Outgoing TPSs: ~10500
  • Worker processing: 15
  • Avg auth request processing time (ms): 4
  • Avg acct request processing time (ms): 6

Table 8: Test with accounting proxy
  • Incoming TPSs: ~10500
  • Outgoing TPSs: 21000
  • Worker processing: 15
  • Avg auth request processing time (ms): 4
  • Avg acct request processing time (ms): 6

Accounting proxy requests do not affect the performance of the system. Above 10500 TPS, the system starts to drop requests. The number of workers on the policing nodes is at its maximum of 64, and the Cassandra cluster is again the bottleneck. The cluster is distributed over three DCs with a 150 km distance between each of them. As stated at the beginning, the replication factor is 3 and each node contains all the data, so we have intensive inter-DC communication. Also, the traffic generator sends requests in a round-robin manner, which means that we have to guarantee data consistency across all three DCs and use a consistency level that demands cross-DC writes.

Compared to the previous test scenario, the result is only 500 TPS higher than in the four-policing-nodes, two-DC scenario. The worker processing value is smaller in this test case, which is also expected because the total number of TPS is distributed across six policing nodes. The average processing time for authorization and accounting requests is the same as in the previous test case.

Below are the results when the system is integrated with LDAP:

Table 9: Test with accounting proxy and LDAP
  • Incoming TPSs: ~10500
  • Outgoing TPSs: ~21500
  • Worker processing: 15
  • Avg auth request processing time (ms): 13
  • Avg acct request processing time (ms): 7

The outgoing value is 500 TPS higher than in the no-LDAP test scenario, while the average authorization processing time rises from 4 ms to 13 ms because of the network latency and the LDAP request processing time.

Comparison of the different test scenarios:

  • TS 1 (one policing node, one DC): ~7000 incoming TPS
  • TS 2 (two policing nodes, one DC): ~7200 incoming TPS
  • TS 3 (four policing nodes, two DCs): ~10000 incoming TPS
  • TS 4 (six policing nodes, three DCs): ~10500 incoming TPS (~21500 outgoing TPS with accounting proxy and LDAP)

Conclusion

So we can conclude that the maximum number of TPS the system can process for this test scenario and hardware footprint is around 21.5k TPS. We saw that the policing nodes, as stateless components, run at about 50% of their maximum capacity even for a very high number of TPS, as long as Cassandra is fast enough. In every test scenario, Cassandra was the bottleneck. Simply by adding new Cassandra nodes in each DC, the number of TPS increases. According to the Datastax documentation, “Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.” So, with a proper definition of partition keys, adding new Cassandra nodes while keeping the replication factor of 3 means the data will be distributed among more nodes and no single node will carry 100% of the load. More nodes will handle read requests, mutation requests will not require the participation of every DB node, and we will gain more DB “power”.
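
As a back-of-the-envelope illustration of that scaling argument (the near-linear scaling factor comes from the Datastax quote above and is an assumption, not something we measured beyond 3 nodes per DC):

    MEASURED_TPS = 10500        # incoming TPS measured with 3 Cassandra nodes per DC
    MEASURED_NODES_PER_DC = 3

    def projected_tps(nodes_per_dc, scaling_efficiency=0.9):
        """Hypothetical projection; scaling_efficiency < 1.0 hedges the 'linear' claim."""
        growth = nodes_per_dc / MEASURED_NODES_PER_DC
        return MEASURED_TPS * (1 + (growth - 1) * scaling_efficiency)

    for n in (3, 4, 5, 6):
        print(f"{n} Cassandra nodes per DC -> roughly {projected_tps(n):.0f} incoming TPS")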

In a production environment, we expect the average authorization and accounting request processing times to be two to three times higher due to the larger data volume and a much larger number of Cassandra SSTables.

Hopefully, this blog post has been useful, so stay tuned as we share more of our experience and knowledge.
