
ETL Automation at DAS


Automation at the forefront of ETL at DAS

Data is a core component of the DAS Products. Our priority in the ETL team is to get new data into the hands, and onto the screens, of our users so they have access to the most up-to-date spatial data for decision making.

DAS’ parcels and property data layers, also known as titles and rating property in our New Zealand product, are the building blocks that allow customers to spatially define a Property. Each of these farms has layers of data associated with it and can be aggregated into larger Properties in the DAS Rural App, or combined to form a property portfolio in the DAS Portfolio product. The latter allows our finance customers to evaluate their property portfolios, assess risk and query spatially for decision making.

In addition to these ‘building blocks’, DAS integrates climate, land use, risk and sales data at the property level, allowing users to gain valuable insights for property valuations, farmland insurance or about the current cropping season.

With data sourced from our Earth Observation team plus a variety of commercial vendors, it’s imperative that the most recent data is available and delivered in a consistent and timely manner. DAS’ ETL team has been working in the background and has recently made changes to our data pipelines. We have been enhancing our processing capabilities and streamlining our pipelines by adding to our suite of automated pre-processing tools. Let’s delve into some of those changes and look at some of the challenges we often face.

Unique formats, delivery and update frequencies

Every dataset is unique in the information it provides, and every dataset arrives at DAS in its own way from a different vendor. Not only are big datasets a challenge, but formats and representations of data can differ across vendors and across regions.

DAS receives a variety of datasets from vendors in both spatial and non-spatial data formats. Some suppliers have automated extraction processes and provide every release on the same day of the month, every month, ensuring consistency, while others rely on manual processing.

With non-automated supply there is risk that a data source can be formatted or encoded differently or have missing records. DAS has built out a more robust validation and quality control process for reviewing these files before processing starts.

Apart from different data formats, DAS acquires data via a range of delivery mechanisms. We receive data via email, cloud-based vendor portals, Government Agency websites, Spatial Data Catalogues, AWS buckets and API-enabled resources with both anonymous and authorised access. Some of these have quite complex authorisation requirements, so automating alerts and data downloads can be challenging.
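To make this concrete, here is a minimal sketch of what an authorised, automated pull from a vendor API might look like, assuming a token-based API and using Python with requests and boto3. The endpoints, bucket, key and credential names below are illustrative, not DAS’ actual setup.

```python
import os

import boto3
import requests

# Illustrative endpoints and bucket; real vendor APIs and storage layouts differ.
TOKEN_URL = "https://vendor.example.com/oauth/token"
DATA_URL = "https://vendor.example.com/api/v1/releases/latest"
BUCKET = "das-etl-raw"


def fetch_latest_release() -> None:
    # Authenticate against the vendor portal (client-credentials style).
    token_resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": os.environ["VENDOR_CLIENT_ID"],
            "client_secret": os.environ["VENDOR_CLIENT_SECRET"],
        },
        timeout=30,
    )
    token_resp.raise_for_status()
    token = token_resp.json()["access_token"]

    # Download the release and push it straight into S3 for the pipeline to pick up.
    data_resp = requests.get(
        DATA_URL, headers={"Authorization": f"Bearer {token}"}, timeout=300
    )
    data_resp.raise_for_status()

    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key="vendor-x/latest/release.csv",
        Body=data_resp.content,
    )
```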

In addition, each dataset has its own update frequency which can vary from 16 day intervals (crop type and yield estimates) through to annually (average annual rainfall) and everything in-between (sales, rating property, titles). Data can be released on a set date and time or their release can be ad-hoc. Some vendors provide a pre-release report advising the date and time of the release and for some datasets, DAS is not notified.

Therefore, regardless of where the data comes from, or the format in which it is provided, DAS has invested in improving our automation capabilities across all our data pipelines to ensure consistent datasets and timely releases.

ETL at DAS

DAS has worked hard over the last 18 months to make our pre-processing and data population processes more robust. Let’s take a look at how one of our monthly data updates used to be done and some of the changes we have implemented to the process.

Below is an example of the previous process of updating the monthly NZ Rainfall chart data.

  1. An email was sent from the vendor with a csv file of new data.
  2. The data was downloaded from the email and then uploaded to Amazon S3 Cloud Storage.
  3. An ETL Engineer would be notified of the folder location of the new data version through a Jira ticket.
  4. The ETL Engineer would run an AWS Step Function to load the data to the ETL database and work on another task while the process was running.
  5. After a period of time, they would return to check the status of the Step Function to see if the process had completed successfully.
  6. A manual review of the charts was done.
  7. They would manually trigger the next two Step Functions, entering parameters to populate the weather station data onto the NZ Rating Properties data, and another process for the NZ Titles dataset.
  8. Once the data was approved, another Step Function was run to publish the data into the ETL database.

The main issues with the previous process were:

  • Manual data download and upload
  • Single point of failure with reliance on one person
  • Manual input for ETL process parameters with the risk of incorrect values being entered
  • Multiple non-automated steps in sequence, where missing a step may require re-running the whole process from the beginning

To address these issues, the ETL team has made the following changes.

Automation phases at DAS

1. Pre-processing using Step Functions

DAS has for some time used AWS Step Functions to run the custom processes for each data update. These functions extract the data, transform it in some way (spatial processing, aggregating, clipping, filtering, attribution), then perform Quality Assurance (QA) before a validation report is emailed and human approval is requested. This has been our main focus of automation from the start, as it allows us to use these steps as building blocks for a single pipeline and to automate other manual processes.
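As a rough illustration, kicking off one of these pre-processing runs with boto3 might look like the sketch below. The state machine name, region, account ID and input parameters are hypothetical.

```python
import json
from datetime import datetime, timezone

import boto3

# Hypothetical state machine for a monthly rainfall update.
STATE_MACHINE_ARN = (
    "arn:aws:states:ap-southeast-2:123456789012:stateMachine:nz-rainfall-preprocess"
)


def start_preprocessing(s3_key: str) -> str:
    """Start the extract/transform/QA state machine for a new raw file."""
    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        # Unique, timestamped execution name so runs are easy to trace.
        name=f"nz-rainfall-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}",
        input=json.dumps({"raw_key": s3_key, "dataset": "nz_rainfall"}),
    )
    return response["executionArn"]
```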

2. Modular pipelines

In the last 12 months, DAS has worked to combine our Step Functions into modular pipelines with reusable components. Instead of separately running multiple Step Functions, which sometimes have very similar codebases, DAS has linked these Functions and standardised their use cases. Once a pipeline has been triggered, it runs through each stage, notifying the team on successful completion, creating intermediary checks and validation files, and flagging any issues. After the pipeline for a dataset has been started, it will run until it reaches the human approval stage, which requires manual input.
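One common way to implement that human approval stage is Step Functions’ task-token callback pattern, where the pipeline pauses on a `waitForTaskToken` task until someone responds. The sketch below assumes that pattern and a hypothetical approval handler; it is not necessarily how DAS has wired it.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")


def handle_approval(task_token: str, approved: bool, reviewer: str) -> None:
    """Resume a paused pipeline once a reviewer has checked the validation report."""
    if approved:
        # Releases the waitForTaskToken task so the pipeline continues to publication.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved_by": reviewer}),
        )
    else:
        # Fails the task so the pipeline stops and the team is alerted.
        sfn.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=f"Rejected by {reviewer}",
        )
```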

3. Vendor alerts

The third stage of our automation has focused on centralising vendor alerts so that any team member can run our pipelines. For some datasets we receive an email notification from the vendor when a new version is released. For others we don’t, so we are working to ensure all datasets have automated notifications. This will reduce the time it takes for a new version to flow through to our Products.
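For illustration, the notification step could be as simple as publishing to a shared SNS topic (or posting to a chat webhook) whenever a release lands. The topic ARN, bucket and message fields below are placeholders.

```python
import boto3

# Illustrative topic; the centralised channel could equally be a chat webhook.
TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:etl-vendor-releases"


def notify_new_release(dataset: str, release_date: str, s3_key: str) -> None:
    """Tell the whole ETL team that a new vendor release has landed."""
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Subject=f"New release: {dataset}",
        Message=(
            f"A new {dataset} release dated {release_date} has been "
            f"uploaded to s3://das-etl-raw/{s3_key} and is ready to process."
        ),
    )
```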

4. Automated data extractions

Downloading large datasets from websites or data portals, saving them locally and then uploading them to our Cloud storage is cumbersome and takes time. Some of our larger datasets are State-based (Properties and Cadastres) and each update required 16 files to be downloaded and then uploaded to the Cloud. The NSW Cadastre file alone is 1.3 GB, and although we aren’t dealing with terabytes, the time spent doing this adds up. DAS now accesses 85% of its regularly updated data automatically, via APIs or other processes that retrieve data from data portals and upload it to the Cloud.
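One way to avoid the download-then-upload round trip is to stream the portal response straight into S3 without a local copy. The sketch below assumes a plain HTTP export endpoint; the URL, bucket and key are placeholders.

```python
import boto3
import requests

# Placeholder portal URL and bucket; real extractions are configured per dataset.
PORTAL_URL = "https://portal.example.gov.au/exports/nsw-cadastre.zip"
BUCKET = "das-etl-raw"


def stream_portal_file_to_s3(key: str) -> None:
    """Stream a large export from a data portal into S3 without saving it locally."""
    with requests.get(PORTAL_URL, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True  # transparently handle compressed transfer encoding
        # upload_fileobj reads the stream in chunks and uses multipart upload for large files.
        boto3.client("s3").upload_fileobj(resp.raw, BUCKET, key)
```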

5. Automated loading to platform environments

DAS ETL is now building the ability to automate the loading processes to one of our platform environments. This is another phase of our process that will allow us to speed up deployments, reduce manual handling and make our pipelines more efficient.

The combination of these changes has simplified the process and reduced manual touch points by automating data downloads, alerts and the running of Step Functions.

Looking at the example provided above, the process for updating monthly NZ Rainfall chart data now looks like this.

  1. The vendor sends the email with the csv file attached, and a message is sent to a centralised notification system. The data is automatically loaded into the correct S3 Storage folder.
  2. The pipeline is triggered by the upload of the raw data, an alert is sent to the notification channel, and manual approval is requested before it commences (a sketch of this trigger follows the list). Inputs are validated; if validation fails, a failure notification is sent and an error is raised to the team. If the inputs pass, the next step is taken.
  3. The Step Function ‘Load To ETL’ is started automatically, the data is populated onto the Property layer as well as the Titles layer, the layer is published in the ETL database and a success notification is sent.
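The trigger in step 2 could be implemented roughly as below: an S3 event notification invokes a small Lambda function that starts the combined pipeline, which then pauses at its approval stage. The bucket wiring and state machine name are illustrative.

```python
import json
from urllib.parse import unquote_plus

import boto3

# Illustrative ARN for the combined rainfall pipeline.
PIPELINE_ARN = (
    "arn:aws:states:ap-southeast-2:123456789012:stateMachine:nz-rainfall-pipeline"
)

sfn = boto3.client("stepfunctions")


def lambda_handler(event, context):
    """Start the pipeline whenever a new raw file lands in the watched S3 folder."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        sfn.start_execution(
            stateMachineArn=PIPELINE_ARN,
            input=json.dumps({"bucket": bucket, "raw_key": key}),
        )
    return {"statusCode": 200}
```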

A major benefit of this automation is that it now takes less than one hour for the raw data to reach a publication-ready state. Even though the overall processing time of the tasks is small, the old workflow, which included downloading data from an email and then waiting for an ETL Engineer to run functions, could blow out to hours or days if the next step in the process wasn’t actioned.

Benefits

The benefits of this type of automation are numerous.

  1. Our pipelines aren’t blocked by one person, as all team members are notified when a new vendor release is available.
  2. Faster access to the latest data, as our alert system often beats the vendor email notification, meaning we aren’t waiting for an email to trigger the start of our pipeline.
  3. Less manual intervention which reduces the risk of human error.
  4. Faster pipelines that commence when a new release is provided which reduces the overall time it takes for DAS to get a new update live in our Production environment.
  5. Vendor data updates are saved to a consistent place with consistent file naming conventions on Cloud servers.
  6. Ability to create alert notifications for all datasets.
  7. Increased efficiency in the ETL team to free up time for other pipeline enhancements.

Future automation

There are more steps in the data pipelines that I haven’t mentioned. Some of these include loading data to the UAT environment and QA. Once a dataset version is approved for release, there are further steps to load and deploy it to the Production environment, where further QA is undertaken as well. Automating these steps is a work in progress, and they can then be consolidated into the pipelines we already have in place.