Executing a Data Quality Policy

Workflow Procedure

Section 1: Getting the system ready for data cataloging

  1. Create an Oracle connection.
  2. Create a Hive connection.
  3. Create an Oracle data source.
  4. Create a Hive data source.
  5. Crawl the Oracle data source.
  6. Crawl the Hive data source.

Section 2: Executing the data quality policy

  1. Create a data quality policy.
  2. Check whether the execution passed or failed.

Creating a Data Catalog

To understand how to check the quality of data in your data catalog, read the following example.

As a data steward, you need to reconcile an Oracle database with a Hive database. To do this, you need a data catalog that contains both Oracle and Hive metadata. To create this data catalog, do the following:

  1. Create a connection to the Oracle database.

    1. Click the Connections tab from the Data Source page. The Connections page is displayed.

    2. Click the Create Connection button. The Create Connections wizard is displayed.

    3. Click Oracle from the displayed connection types.

    4. Specify the following details required to create the Oracle connection:

      Connection Name
        Description: Specify a name for the connection.
        Example: Oracle-Connection

      Connection Description
        Description: Specify a description for the connection.

      JDBC URL
        Description: Specify the Java Database Connectivity (JDBC) URL that is used to locate the database schema. The URL has the format jdbc:oracle:<drivertype>:@<database>. An example of a driver type is thin.
        Example: ("jdbc:oracle:thin:@myhost:1521:orcl", "jack", "tiger")

      JDBC Username
        Description: Specify the username to connect to the Oracle database.
        Example: In the JDBC URL example, the username is "jack".

      JDBC Password
        Description: Specify the password to connect to the Oracle database.
        Example: In the JDBC URL example, the password is "tiger".

    5. Select an analytics service to crawl information and also check the quality of data in the source system.

  2. Create a connection to the Hive database.

    1. Click the Connections tab from the Data Source page. The Connections page is displayed.

    2. Click the Create Connection button. The Create Connections wizard is displayed.

    3. Click Hive from the displayed connection types.

    4. Specify the following details required to create the Hive connection:

      Connection Name
        Description: Specify a name for the connection.
        Example: Hive-Connection

      Connection Description
        Description: Specify a description for the connection.

      JDBC URL
        Description: Specify the Java Database Connectivity (JDBC) URL that is used to locate the database schema.
        Example: jdbc:hive2://<hostname>:<port>/<database name>

      JDBC Username
        Description: Specify the username to connect to the Hive database.

      JDBC Password
        Description: Specify the password to connect to the Hive database.

    5. Select an analytics service to crawl information and also check the quality of data in the source system.

  3. Create an Oracle data source.

    1. Click the Data Sources tab. The Data Source page is displayed.
    2. Click Create Data Source. The Create Data Source wizard is displayed.
    3. From the Select Connection drop-down list, select the Oracle connection you just created (that is, Oracle-Connection). Based on your selection, the data source type Oracle is automatically selected.
    4. Specify a name for your data source.
    5. Specify a description with the purpose of the data source.
    6. Click the Define Schedule for Crawler checkbox to specify the time at which you want the crawlers to crawl information from the Oracle data source. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
  4. Create a Hive data source.

    1. Click the Data Sources tab. The Data Source page is displayed.
    2. Click Create Data Source. The Create Data Source wizard is displayed.
    3. From the Select Connection drop-down list, select the Hive connection you just created (that is, Hive-Connection). Based on your selection, the data source type Hive is automatically selected.
    4. Specify a name for your data source.
    5. Specify a description for the data source.
    6. Click the Define Schedule for Crawler checkbox to specify the time constraints for the crawlers to crawl information from the Hive data source. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
  5. Crawl Oracle metadata from the Oracle data store into the Acceldata Data Catalog.

  6. Crawl Hive metadata from the Hive data store into the Acceldata Data Catalog.
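
The JDBC URL formats used in the Oracle and Hive connection steps above can be sketched as two small helpers. This is only an illustration for composing and checking the strings before you paste them into the wizard; it is not part of the product. The function names are hypothetical, the Oracle helper assumes the host:port:SID form shown in the Oracle example, and the Hive host name, port 10000 (the usual HiveServer2 default), and database name "default" are placeholder values.

```python
def oracle_thin_url(host: str, port: int, sid: str) -> str:
    """Fill in jdbc:oracle:<drivertype>:@<database> using the thin driver.

    Assumes <database> is written as host:port:SID, as in the
    documented example jdbc:oracle:thin:@myhost:1521:orcl.
    """
    return f"jdbc:oracle:thin:@{host}:{port}:{sid}"


def hive_url(hostname: str, port: int, database: str) -> str:
    """Fill in jdbc:hive2://<hostname>:<port>/<database name>."""
    return f"jdbc:hive2://{hostname}:{port}/{database}"


if __name__ == "__main__":
    # Values from the Oracle example in this procedure.
    print(oracle_thin_url("myhost", 1521, "orcl"))
    # Placeholder Hive values (assumptions, not from the procedure).
    print(hive_url("hive-host", 10000, "default"))
```

The username and password are entered in their own fields in the wizard, so they are deliberately left out of the URL strings here.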

Configuring and executing a data quality policy

  1. Click the Create Data Quality Policy button. The Create DQ Rule page is displayed.
  2. Select either the Asset or Custom Query button.
  3. Select a table from a database.
  4. To check the conditions incrementally, enable the toggle button and specify the required details. Otherwise, leave the toggle button disabled.
  5. Click Show Sample Data to view the columns in the table. You can then select the columns to which you want to apply the data quality policy.
  6. From the Rule Definition property, select a measurement type and specify the data required for that type.
  7. Click Next. The Data Quality Policy Definition page is displayed.
  8. Specify a name and description for the policy.
  9. Click the Define Scheduler checkbox and schedule a time. Enable Start Schedule Runs. The data quality policy then runs each time the scheduled time is reached. For example, Every Day : 2nd hour : 30 minute.
  10. Select one or both of the following notification platforms to receive alerts: Jira and Email.
  11. Click Save to save the policy. The policy can be viewed in the policy panel list.
  12. To manually execute the policy, click Execute from the Actions column.
  13. Click Executions in the left navigation bar to check whether the execution passed or failed. Click the name of the execution to view its details.
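
To make the scheduling step above more concrete, the following sketch computes when a daily schedule would next fire. It assumes that "Every Day : 2nd hour : 30 minute" means 02:30 each day, which is an interpretation rather than something this procedure states. The function name next_daily_run is hypothetical and does not reflect how the product's scheduler is implemented.

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, hour: int, minute: int) -> datetime:
    """Return the next time a daily hour:minute schedule is reached after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # Assumed reading of "Every Day : 2nd hour : 30 minute" as 02:30 daily.
    now = datetime(2024, 1, 15, 9, 0)
    print(next_daily_run(now, 2, 30))  # 2024-01-16 02:30:00
```

If the policy is saved at 09:00, the first scheduled run under this reading would be at 02:30 the following day; use Execute from the Actions column if you need a result sooner.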