Asset Profiling
Workflow Procedure
Creating a Hive connection
Click the Connections tab from the Data Source page. The Connections page is displayed.
Click the Create Connection button. The Create Connection wizard is displayed.
Select Hive from the displayed connection types.
Specify the following details required to create the Hive connection:
Connection Property | Description | Example |
---|---|---|
Connection Name | Specify the name for the connection. | Host2_Hive |
Description | Describe the purpose of the connection. | Hive connection to test workflow |
JDBC URL | Specify the Java Database Connectivity (JDBC) URL which is used to locate the database schema. The URL uses the following format: jdbc:hive2://<hostname>:<port>/<database name> | jdbc:mysql://host2:3600/hive?useUnicode=true&useUnicode=false |
JDBC Username | Specify the username to connect to the Hive database. | Hive |
JDBC Password | Specify the password to connect to the Hive database. | Hive |

Click Next to attach the analytics pipeline for executing the Spark jobs associated with the connection.
Select the Analytics Service.
Click Save.
The connection is now available for use on the Torch Data Source page.
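Before pasting a value into the JDBC URL field, it can help to assemble and check it programmatically. The following is a minimal sketch, not part of Torch: the function names `build_hive_jdbc_url` and `parse_hive_jdbc_url` are hypothetical, and the default port of 10000 is an assumption based on the usual HiveServer2 default. It only covers the basic `jdbc:hive2://<hostname>:<port>/<database name>` format described above.

```python
import re

# Assumption: 10000 is the usual HiveServer2 port; your cluster may differ.
DEFAULT_HIVE_PORT = 10000

# Matches only the basic format: jdbc:hive2://<hostname>:<port>/<database name>
_HIVE_URL = re.compile(
    r"^jdbc:hive2://(?P<host>[^:/]+):(?P<port>\d+)/(?P<database>[^;?#/]+)$"
)


def build_hive_jdbc_url(hostname, port=DEFAULT_HIVE_PORT, database="default"):
    """Assemble a Hive JDBC URL from its three parts."""
    if not hostname:
        raise ValueError("hostname must not be empty")
    if not 0 < int(port) < 65536:
        raise ValueError(f"port out of range: {port}")
    if not database:
        raise ValueError("database must not be empty")
    return f"jdbc:hive2://{hostname}:{int(port)}/{database}"


def parse_hive_jdbc_url(url):
    """Split a basic Hive JDBC URL into (hostname, port, database)."""
    match = _HIVE_URL.match(url)
    if match is None:
        raise ValueError(f"not a basic jdbc:hive2 URL: {url!r}")
    return match.group("host"), int(match.group("port")), match.group("database")


if __name__ == "__main__":
    url = build_hive_jdbc_url("host2", database="default")
    print(url)
    print(parse_hive_jdbc_url(url))
```

The parser deliberately rejects anything that does not follow the basic format, so it will also reject valid URLs that carry extra session or configuration parameters.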
Creating a data source
- Click the Data Source tab from the Connections page. The Data Source page is displayed.
- Click the Create Data Source button. The Create Data Source wizard is displayed.
- Select the connection that you just created, that is, the Host2_Hive connection.
- Specify a name for the data source.
- Specify a description for the data source.
- To run the crawler at a regular interval, select the "Define Schedule for Crawler" checkbox and configure the time for it. Leave the checkbox cleared if you want to run the crawler manually.
- Click the Create Data Source button.
The newly created data source appears on the Data Source page. At this point, the crawler state is inactive and the data source does not yet contain any assets such as databases, tables, or columns.
To start the crawler, do the following:
- Click the overflow menu icon in the selected data source to start the crawlers.
- Click Start Crawler.
Once the crawler starts, its status in the data source turns green, and the number of databases, tables, and columns appears in the data source tile.
Once the crawler has finished crawling all the assets, its status changes back to inactive.
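The crawler lifecycle described above (inactive, then running while asset counts fill in, then inactive again) can be pictured as a small state model. This is an illustrative sketch only: `CrawlerState`, `DataSourceTile`, and their methods are hypothetical names and do not correspond to any Torch API.

```python
from dataclasses import dataclass
from enum import Enum


class CrawlerState(Enum):
    INACTIVE = "inactive"  # shown before the first crawl and after a crawl ends
    RUNNING = "running"    # shown in green while the crawl is in progress


@dataclass
class DataSourceTile:
    """Hypothetical model of what the data source tile displays."""
    name: str
    state: CrawlerState = CrawlerState.INACTIVE
    databases: int = 0
    tables: int = 0
    columns: int = 0

    def start_crawler(self):
        if self.state is CrawlerState.RUNNING:
            raise RuntimeError("crawler is already running")
        self.state = CrawlerState.RUNNING

    def record_assets(self, databases, tables, columns):
        # Counts only change while a crawl is in progress.
        if self.state is not CrawlerState.RUNNING:
            raise RuntimeError("crawler is not running")
        self.databases += databases
        self.tables += tables
        self.columns += columns

    def finish_crawl(self):
        self.state = CrawlerState.INACTIVE


if __name__ == "__main__":
    tile = DataSourceTile("hive_source")
    tile.start_crawler()
    tile.record_assets(databases=1, tables=4, columns=32)
    tile.finish_crawl()
    print(tile)
```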
Discovering and profiling an asset in a data source
Click the Data Source tile to explore its assets in detail. The Discover page is displayed.
You can filter the assets to view only the databases, only the tables, or only the columns.
To view the assets in detail, click on an asset. The asset details page for the selected asset is displayed.
- The Details tab provides you with the metadata that was captured for the selected asset.
- The Child Assets tab lists all the child assets of the selected asset. For example, if the selected asset is a database, the Child Assets tab lists all the tables in it.
Click a child asset to view more details about it. The Asset Details page for the child asset is displayed with the tabs described in the following table.
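The parent and child relationship between databases, tables, and columns, and the filter by asset type on the Discover page, can be sketched as a simple tree. The `Asset` class and `filter_assets` function below are hypothetical illustrations, not Torch's data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Asset:
    """Hypothetical asset node: a database, a table, or a column."""
    name: str
    kind: str  # "database", "table", or "column"
    children: List["Asset"] = field(default_factory=list)

    def walk(self) -> Iterator["Asset"]:
        """Yield this asset, then every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def filter_assets(roots: List[Asset], kind: str) -> List[Asset]:
    """Return every asset of the given kind found under the root assets."""
    return [asset for root in roots for asset in root.walk() if asset.kind == kind]


if __name__ == "__main__":
    sales = Asset("sales", "database", [
        Asset("orders", "table", [Asset("order_id", "column"), Asset("amount", "column")]),
        Asset("customers", "table", [Asset("customer_id", "column")]),
    ])
    # Child assets of a database are its tables.
    print([child.name for child in sales.children])
    # Filtering by type mirrors viewing only the columns.
    print([asset.name for asset in filter_assets([sales], "column")])
```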
Tab | Description |
---|---|
Profile | If the asset has not been profiled yet, a prompt states that the asset has not been profiled. Click the Profile Asset button to profile it. |
Sample Data | Displays a table with the first 100 rows of the selected asset as sample data. |
Quality | The properties associated with the selected asset. |
Details | Displays the metadata captured by the system for the selected asset. |
Child Assets | Displays the child assets of the selected asset. Click a child asset to view its details. |
To profile the child asset, do the following:
- From the Profile tab, click the Profile Asset button.
- Click Profiling from the left navigation menu bar. The Profiling page is displayed where you can view the status of the profiling job.
You can also run the profile job at a scheduled interval by doing the following:
- Click the Schedular Configuration button. The Profile Schedular wizard is displayed.
- Select the type of profile you want to run and provide a time interval.
- Turn on the enable toggle to keep the scheduler active.
- Click the Save button.
To view the profiled information, do the following:
- Navigate to the Asset Details page of the asset you just profiled.
- Click the Profile tab. The information retrieved by the profile job is displayed as charts, such as bar charts or pie charts.
Types of profiling
You can profile an asset as many times as you want. Two types of profiles can be run on an asset:
- Sample Profile: Runs the profile job on the first 1000 rows of the asset.
- Full Profile: Runs the profile job on the entire asset.
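The difference between the two profile types comes down to how many rows are read. The sketch below illustrates this under stated assumptions: it is not how Torch implements profiling (which runs as Spark jobs), the function `profile_column` is hypothetical, and the three statistics it computes are chosen only for illustration.

```python
from collections import Counter
from itertools import islice

SAMPLE_ROW_LIMIT = 1000  # a sample profile reads only the first 1000 rows


def profile_column(rows, column, full=False):
    """Compute simple statistics for one column.

    rows is an iterable of dicts. With full=False only the first
    SAMPLE_ROW_LIMIT rows are read; with full=True every row is read.
    """
    selected = rows if full else islice(rows, SAMPLE_ROW_LIMIT)
    value_counts = Counter()
    rows_profiled = 0
    null_count = 0
    for row in selected:
        rows_profiled += 1
        value = row.get(column)
        if value is None:
            null_count += 1
        else:
            value_counts[value] += 1
    return {
        "rows_profiled": rows_profiled,
        "null_count": null_count,
        "distinct_count": len(value_counts),
    }


if __name__ == "__main__":
    rows = [{"id": i % 1500} for i in range(2500)]
    print(profile_column(rows, "id"))             # sample profile
    print(profile_column(rows, "id", full=True))  # full profile
```

A sample profile is cheaper but can miss values that appear only later in the data, which is why the distinct count in the two runs above can differ.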