Catastrophically slow backup/restore to Oracle Cloud Infrastructure Object Storage and its fix

I used the Oracle Zero Downtime Migration (ZDM) tool to move some databases from an on-premises Exadata to a freshly provisioned OCI Exadata Cloud Service (a Quarter Rack, to be precise). In the process I found that backup and restore performance against OCI Object Storage was worse than catastrophic. Instead of the advertised ~2.5 hours for a 5 TB database over 8 channels, the restore took more than 20 hours!

The backup was not as fast because of the connection speed limit, but I assumed that at least within OCI the restore would run as advertised in "Oracle Cloud Infrastructure Exadata Backup & Restore Best Practices using Cloud Object Storage". I was really astonished when it did not.

I opened an SR, of course, but after more than 20 days of ping-pong with Oracle I eventually found the reason myself.

You see, ZDM ships with a 19.x build of libopc.so (the version string below reads 19.0.0.0.0). This is exactly the same version that is included in the lib subdirectory of every ORACLE_HOME. You can easily read the version string with the following command:

[zdm@zdm lib]$ strings libopc.so.original | grep -Po 'DNZ_REL_VER=".*?"' | head -1
DNZ_REL_VER="19.0.0.0.0-Production"

On the ExaCS I found that, even for a 19.7 database home, the libopc.so used for automatic backups and bkup_api is actually an older version, 12.2.0.1, located under /var/opt/oracle/dbaas_acfs//opc/libopc.so:

[oracle@exa1-node1 ~]$ strings /var/opt/oracle/dbaas_acfs//opc/libopc.so | grep -Po 'DNZ_REL_VER=".*?"' | head -1
DNZ_REL_VER="12.2.0.1.0-Production"
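To compare several copies of the library in one go, the same check can be wrapped in a small function. This is only a sketch: the function name libopc_version is made up for illustration, and it uses grep -a on the binary instead of strings so that it does not depend on binutils being installed.

```shell
# Print the DNZ_REL_VER string embedded in each libopc.so given as an argument.
# libopc_version is a hypothetical helper name, not part of any Oracle tooling.
libopc_version() {
    local lib ver
    for lib in "$@"; do
        if [ ! -r "$lib" ]; then
            printf '%s: not readable\n' "$lib"
            continue
        fi
        # Treat the binary as text (-a), keep the first DNZ_REL_VER="..." match
        # and cut out the value between the double quotes.
        ver=$(LC_ALL=C grep -a -o 'DNZ_REL_VER="[^"]*"' "$lib" | head -1 | cut -d'"' -f2)
        printf '%s: %s\n' "$lib" "${ver:-unknown}"
    done
}

# Example call (paths depend on your environment):
#   libopc_version "$ORACLE_HOME/lib/libopc.so" /path/to/other/libopc.so
```

Each line of output has the form "path: version", which makes it easy to spot a host where the two libraries differ.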

I tested both by running a backup with RMAN. To watch the progress I used a slightly modified SQL script (thanks to Mariami Kupatadze):

select recid
     , output_device_type
     , dbsize_mbytes
     , input_bytes/1024/1024 input_mbytes
     , output_bytes/1024/1024 output_mbytes
     , (output_bytes/nullif(input_bytes,0)*100) compression
     , (mbytes_processed/dbsize_mbytes*100) complete
     , to_char(start_time,'DD-MON-YYYY HH24:MI:SS') started
     -- linear extrapolation: start + elapsed / fraction done (nullif avoids a divide-by-zero right after start)
     , to_char(start_time + (sysdate-start_time)/nullif(mbytes_processed/dbsize_mbytes,0),'DD-MON-YYYY HH24:MI:SS') est_complete
  from v$rman_status rs
     , (select sum(bytes)/1024/1024 dbsize_mbytes from v$datafile)
 where status like 'RUNNING%'
   and output_device_type is not null;
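The est_complete expression in the query is a plain linear extrapolation: elapsed time divided by the fraction already processed. The same arithmetic can be sketched in the shell. The function name est_total_hours and the sample numbers in the comment are hypothetical and not taken from the measurements in this post.

```shell
# est_total_hours ELAPSED_SECONDS PCT_COMPLETE
# Projects the total run time in hours, assuming throughput stays constant.
# Prints "n/a" when nothing has been processed yet.
est_total_hours() {
    awk -v elapsed="$1" -v pct="$2" 'BEGIN {
        if (pct <= 0) { print "n/a"; exit }
        # total = elapsed / fraction done, converted from seconds to hours
        printf "%.2f\n", elapsed / (pct / 100) / 3600
    }'
}

# Hypothetical example: 6 minutes elapsed at 10% complete projects to one hour.
#   est_total_hours 360 10
```

Because the projection assumes constant throughput, an estimate taken at a fraction of a percent is rough. It is good enough to tell 2.5 hours from 21 hours, though.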

The results for the 5 TB database were astonishing. First, the 12.2.0.1 library version:

     RECID OUTPUT_DEVICE_TYP DBSIZE_MBYTES INPUT_MBYTES OUTPUT_MBYTES COMPRESSION   COMPLETE STARTED                       EST_COMPLETE
---------- ----------------- ------------- ------------ ------------- ----------- ---------- ----------------------------- -----------------------------
      8197 SBT_TAPE             4911385.78   162145.547      60951.75  37.5907641 3.30142152 28-AUG-2020 10:08:18          28-AUG-2020 12:37:14

(Scroll right and look at STARTED and EST_COMPLETE.) I aborted the backup after a few moments, as it was already at 3%.

Then I started the same full database backup again, this time with the 19.x version of libopc.so. I had to wait several minutes for it to reach 0.7% to get a reasonably representative estimate. And there it is:

     RECID OUTPUT_DEVICE_TYP DBSIZE_MBYTES INPUT_MBYTES OUTPUT_MBYTES COMPRESSION   COMPLETE STARTED                       EST_COMPLETE
---------- ----------------- ------------- ------------ ------------- ----------- ---------- ----------------------------- -----------------------------
      8201 SBT_TAPE             4911385.78   35671.0469       13269.5  37.1996371 .726292913 28-AUG-2020 10:17:34          29-AUG-2020 07:17:23

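So, there you have it: about 2.5 hours versus nearly 21 hours. Both over 8 channels, both within OCI, same bucket, same database, same Exadata. The fix, then, is to run backup and restore with the 12.2.0.1 library rather than the 19.x one that ZDM ships with.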
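For reference, here is a minimal sketch of how RMAN channels can be pointed at a specific libopc.so. SBT_LIBRARY and OPC_PFILE are the usual channel parameters of the Oracle Database Cloud Backup Module. The function name gen_rman_backup, the example paths and the plain "backup database" command are assumptions for illustration, not the commands used in the test above.

```shell
# gen_rman_backup LIB_PATH OPC_PFILE CHANNELS
# Prints an RMAN run block that allocates CHANNELS sbt channels, all loading
# the given libopc.so. Hypothetical helper; adjust the backup command as needed.
gen_rman_backup() {
    local lib pfile channels i
    lib=$1
    pfile=$2
    channels=$3
    echo "run {"
    i=1
    while [ "$i" -le "$channels" ]; do
        echo "  allocate channel c${i} device type sbt parms 'SBT_LIBRARY=${lib}, ENV=(OPC_PFILE=${pfile})';"
        i=$((i + 1))
    done
    echo "  backup database;"
    echo "}"
}

# Example (hypothetical paths): write the script to a file and review it
# before feeding it to rman.
#   gen_rman_backup /path/to/libopc.so /path/to/opc.ora 8 > backup.rman
```

Generating the script for each library path and running the two variants one after the other gives a like-for-like comparison, provided bucket, database and channel count stay the same.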

PS: Both libraries seem to consume a lot of CPU, so keep this in mind when choosing the number of channels. The 19.x version uses 100% CPU for each channel thread; the 12.2.0.1 version showed around 70-90% CPU per channel. Make sure you have enough free CPUs on your Exadata node to run the chosen number of channels.
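A rough pre-flight check along these lines could look like the sketch below. It assumes the worst case of one fully busy core per channel and takes the number of currently busy cores as an input. The function name channels_fit is hypothetical, and using the 1-minute load average as a stand-in for busy cores is only an approximation.

```shell
# channels_fit TOTAL_CPUS BUSY_CPUS CHANNELS
# Succeeds (exit 0) if the idle cores cover one full core per channel.
channels_fit() {
    awk -v total="$1" -v busy="$2" -v ch="$3" \
        'BEGIN { exit !((total - busy) >= ch) }'
}

# Example on Linux, using the 1-minute load average as the "busy" estimate:
#   if channels_fit "$(nproc)" "$(cut -d' ' -f1 /proc/loadavg)" 8; then
#       echo "enough headroom for 8 channels"
#   else
#       echo "reduce the number of channels"
#   fi
```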
