Although you think you have verified that everything works after a switchover, you may have a hidden problem with the applications. You need to know how the applications will react if one of the network interfaces stops responding.
Dataguard: Switchover vs Failover
I have been using Data Guard for many years but have never had to do an actual failover in production; it has always been a switchover for planned maintenance. The other day, we did a controlled power-off of the standby side some days after a switchover. We discovered that many applications started to hang after this power-off, although they worked fine when both the standby and primary sides were online.
That situation was the same as it would have been after a disaster. If you lose one side, there might not be a network interface that can answer the TCP connection requests from the application. (Obviously, you can lose the entire database on primary without losing the server, but I am concerned about the situation when the network to the database is gone.)
You can simulate this if you have an Oracle database running in Docker and shut down the network interface that Docker is using.
I am using the SQL Developer Command Line (SQLcl) here. First, I time a stable connection to my database in Docker, running on a rusty server.
oisene@rio:~$ time sql -L oisene/somepassword@rio/orclpdb1 @test.sql
SQLcl: Release 22.1 Production on Thu Feb 16 20:06:30 2023
Copyright (c) 1982, 2023, Oracle. All rights reserved.
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
SYSDATE
____________
16-FEB-23
Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
real 0m4,285s
user 0m11,667s
sys 0m0,459s
Over four seconds isn’t really fast, but this is in my lab. The script simply does a select sysdate from dual;
and the exits.
The ifconfig
command lists the interfaces and one is obviously the interface that Docker uses. I shut it down with ifconfig docker0 down
and repeat the test:
oisene@rio:~$ time sql -L oisene/somepassword@rio/orclpdb1 @test.sql
SQLcl: Release 22.1 Production on Thu Feb 16 20:08:06 2023
Copyright (c) 1982, 2023, Oracle. All rights reserved.
USER = oisene
URL = jdbc:oracle:thin:@rio/orclpdb1
Error Message = IO Error: The Network Adapter could not establish the connection (CONNECTION_ID=ackQG31RQL+lsACfkdIFWQ==)
USER = oisene
URL = jdbc:oracle:thin:@rio:1521/orclpdb1
Error Message = IO Error: The Network Adapter could not establish the connection (CONNECTION_ID=fZYH+FHXTMuhwV9+2gvhTA==)
USER = oisene
URL = jdbc:oracle:thin:@rio/orclpdb1:1521/rio/orclpdb1
Error Message = IO Error: Invalid connection string format, a valid format is: "//host[:port][/service_name]" (CONNECTION_ID=ALGju1nHQjSV3VAVDkmNVQ==)
real 4m23,605s
user 0m7,001s
sys 0m0,390s
Four minutes and 23 seconds is way too much. I have done tests on other servers where it failed after 20 seconds, but that is also too long.
The solution is to add one parameter, but I also need to rewrite the connection string:
oisene@rio:~$ time sql -L oisene/somepassword@(description=(TRANSPORT_CONNECT_TIMEOUT=5sec)(address_list=(address=(protocol=tcp)(host=rio)(port=1521)))(connect_data=(service_name=orclpdb1)))
SQLcl: Release 22.1 Production on Thu Feb 16 20:20:54 2023
Copyright (c) 1982, 2023, Oracle. All rights reserved.
USER = oisene
URL = jdbc:oracle:thin:@(description=(TRANSPORT_CONNECT_TIMEOUT=5sec)(address_list=(address=(protocol=tcp)(host=rio)(port=1521)))(connect_data=(service_name=orclpdb1)))
Error Message = IO Error: The Network Adapter could not establish the connection (CONNECTION_ID=6dhk/ZWxQ8GkuCsKed6C6g==)
real 0m7,480s
user 0m7,030s
sys 0m0,333s
The point here is that the situation where the Oracle Listener is up on both sides is different from the situation when nobody is listening on the server that used to be primary. This specific parameter limits how long the client/application will wait to establish the TCP connection with the Oracle Listener before it fails and proceeds to the following address.
I did this test in my lab with Docker. You can, of course, simulate this by shutting down the interface on your standby server with a connection string that has failover configured, and with the address of the standby server first. Make sure that redo transport and apply start working again after you complete the test. And don’t wait too long, especially if it is production.