Perform Incremental Data Loads in Oracle Analytics Cloud
Oracle Analytics Cloud offers the ability to create data flows and perform incremental loads on a target table. Data flows can operate only on the incremental data which becomes available in the source in between the current run and the previous run.
In this blog, let’s see how to perform an incremental load on a database table. The video below walks you through the configuration process and how you can define a method of continuing data processing from the last processed row. The rest of the blog will help you with the step-by-step instructions.
Prerequisites for Performing Incremental Loads
Consider the following before attempting this sequence:
Incremental loads can be performed only on those data sets where the source and target tables are database-based.
If there are multiple data sets in a data flow, then only one of the sources can be set for an incremental update.
Defining a Source and New Data Identifier for the Source
The first step is to have a source table and identify a column using new data that can be identified in the table. In this example, I have a revenue fact table with month key as a new data identifier
The next step is to create the Oracle Analytics data source pointing to this table. In the process of creating the data source, make sure you mark the new identifier column by clicking on the third node in the process flow. This is an important step as this column defines how the system will be able to identify new rows in the dataset.
Define a Data Flow
Now that our data source is created, let’s define a data flow by importing the revenue tact table from the source connection. The key here is to check the “Add new data only” box to ensure that the source table is marked for incremental load.
To make my data flow a bit more functionally representative, I will add a business example, such as converting currency values. Let’s bring in a spreadsheet which has exchange rates for every month to convert, and let’s join it based on the month key column. Let’s add a new calculation to convert revenue.
Finally, let’s select step “Save Data” and specify a name for the resulting data set. Make sure you choose the target connection as a database and specify the table name where the result set needs to be saved. There are two options available to select in the “When Run” drop-down menu.
“Replace existing data” makes the data flow truncate the target table and reload all the records.
“Add new data to existing data” keeps the existing records intact and loads only the new records in the source table. New records are identified by the column we defined in the source dataset above.
Let’s set the “When Run” option to “Add new data to existing data” and save the data flow.
Now, let’s run the data flow for the first time. As it completes, we can see in our underlying database that the target table has been created. Since this was the first run of the data flow, all the records in the source table have been inserted into the target table.
Now, for the example in this blog, let’s go and delete a month of data (201812) from our newly created target table. After doing this, our source table still has its 12 months of data (Jan to Dec) but our target table now only has 11 months; it is missing December. Notice that we did not change data in our source table, so there are no new records there since our last run of the data flow.
So, as we run the data flow for the second time, the target table does not get incremented at all. The data flow was set to only bring across new data, but there is no new data in the source, so nothing is changed in the target table. We can check that the target table is still not loaded with the deleted month’s data. If the data flow had been set to full load, all the data would be in the target.
Now, to complete the test, let’s manually load our source table with three more months of data. This will represent some incremental data. Then let’s rerun the data flow once again. We will see that the target table has been incremented with the three new months of data coming from the source table. But notice that data in the target table is still missing for the month where the records were deleted:
Remember what we said at the beginning of this blog: If we go back to the data flow definition and set the Run option to “Replace Existing Data” in the target table, then when we run the data flow, all the data gets loaded, including the deleted month’s data.
Graduate of UCLA and Wharton School of Business and Media Personality. World renowned global entrepreneur, venture capitalist, financial technology professional, tax specialist, marketing mogul, and more!