Data Quality Metrics
DQ Stats Table¶
In order to collect the stats/metrics for each data quality job run, the spark-expectations job will automatically create the stats table if it does not exist.
Warning
The below SQL statement can be used to create the table if you want to create it manually, but it is not recommended.
create table if not exists `catalog`.`schema`.`dq_stats` (
product_id STRING, -- (1)!
table_name STRING, -- (2)!
input_count LONG, -- (3)!
error_count LONG, -- (4)!
output_count LONG, -- (5)!
output_percentage FLOAT, -- (6)!
success_percentage FLOAT, -- (7)!
error_percentage FLOAT, -- (8)!
source_agg_dq_results array<map<string, string>>, -- (9)!
final_agg_dq_results array<map<string, string>>, -- (10)!
source_query_dq_results array<map<string, string>>, -- (11)!
final_query_dq_results array<map<string, string>>, -- (12)!
row_dq_res_summary array<map<string, string>>, -- (13)!
row_dq_error_threshold array<map<string, string>>, -- (14)!
dq_status map<string, string>, -- (15)!
dq_run_time map<string, float>, -- (16)!
dq_rules map<string, map<string,int>>, -- (17)!
meta_dq_run_id STRING, -- (18)!
meta_dq_run_date DATE, -- (19)!
meta_dq_run_datetime TIMESTAMP, -- (20)!
dq_env STRING, -- (21)!
);
product_idA unique name at the level of dq rules executiontable_nameThe table for which the rule is being defined forinput_counttotal input row count of given dataframeerror_counttotal error count for all row_dq rulesoutput_counttotal count of records that passed the row_dq rules or configured to be ignored when they failoutput_percentagepercentage of total count of records that passed the row_dq rules or configured to be ignored when they failsuccess_percentagepercentage of total count of records that passed the row_dq ruleserror_percentagepercentage of total count of records that failed the row_dq rulessource_agg_dq_resultsresults for agg dq rules are stored after row_dq rules executed in an array with each entry as a map\<string, string> containing the following keys:description,tag,column_name,rule,rule_type,status,action_if_failedfinal_agg_dq_resultsresults for agg dq rules are stored after row_dq rules executed in an array with each entry as a map\<string, string> containing the following keys:description,tag,column_name,rule,rule_type,status,action_if_failedsource_query_dq_resultsresults for query dq rules are stored after row_dq rules executed in an array with each entry as a map\<string, string> containing the following keys:description,tag,column_name,rule,rule_type,status,action_if_failedfinal_query_dq_resultsresults for query dq rules are stored after row_dq rules executed in an array with each entry as a map\<string, string> containing the following keys:description,tag,column_name,rule,rule_type,status,action_if_failedrow_dq_res_summarysummary of row dq results are stored in an array with each entry as a map\<string, string> containing the following keys:description,tag,column_name,rule,rule_type,failed_row_count,action_if_failedrow_dq_error_thresholderror threshold results for rules defined in the rules table for row_dq rules are stored in an array with each entry as a map\<string, string> containing the following keys:description,column_name,error_drop_threshold,rule_type,action_if_failed,rule_name,error_drop_percentagedq_statusstores the status of the rule execution.dq_run_timetime taken by the rulesdq_ruleshow many dq rules are executed in this runmeta_dq_run_idunique id generated for this runmeta_dq_run_datedate on which rule is executedmeta_dq_run_datetimedate and time on which rule is executeddq_envenvironment value passed from the user_config.se_dq_rules_params
DQ Detailed Stats Table¶
Library is responsible for auto generating two stats tables that provide per expectation/rule executaion status view.
Tables in question are
- <stats_table_name>_detailed
- <stats_table_name>_querydq_output
This table provides detailed stats of all the expectations along with the status provided in the stats table in a relational format. This table need not be created. It gets auto created with "_detailed " to the dq stats table name.
Warning
Detailed Stats Tables are optional. It is auto created and named as stats table with suffix _detailed.
Default Behaviour: Detailed Stats table is disabled. To enable it pass
Schema¶
create table if not exists `catalog`.`schema`.`<stats_table_name>_detailed` (
run_id string, -- (1)!
product_id string, -- (2)!
table_name string, -- (3)!
rule_type string, -- (4)!
rule string, -- (5)!
source_expectations string, -- (6)!
tag string, -- (7)!
description string, -- (8)!
source_dq_status string, -- (9)!
source_dq_actual_outcome string, -- (10)!
source_dq_expected_outcome string, -- (11)!
source_dq_actual_row_count string, -- (12)!
source_dq_error_row_count string, -- (13)!
source_dq_row_count string, -- (14)!
source_dq_start_time string, -- (15)!
source_dq_end_time string, -- (16)!
target_expectations string, -- (17)!
target_dq_status string, -- (18)!
target_dq_actual_outcome string, -- (19)!
target_dq_expected_outcome string, -- (20)!
target_dq_actual_row_count string, -- (21)!
target_dq_error_row_count string, -- (22)!
target_dq_row_count string, -- (23)!
target_dq_start_time string, -- (24)!
target_dq_end_time string, -- (25)!
dq_date date, -- (26)!
dq_time string, -- (27)!
dq_job_metadata_info string, -- (28)!
);
run_idRun Id for a specific runproduct_idUnique product identifiertable_nameThe target table where the final data gets insertedrule_typeEither row/query/agg dqruleRule namesource_expectationsActual Rule to be executed on the source dqtagcompleteness,uniqueness,validity,accuracy,consistency,descriptionDescription of the Rulesource_dq_statusStatus of the rule execution in the Source dqsource_dq_actual_outcomeActual outcome of the Source dq checksource_dq_expected_outcomeExpected outcome of the Source dq checksource_dq_actual_row_countNumber of rows of the source dqsource_dq_error_row_countNumber of rows failed in the source dqsource_dq_row_countNumber of rows of the source dqsource_dq_start_timesource dq start timestampsource_dq_end_timesource dq end timestamptarget_expectationsActual Rule to be executed on the target dqtarget_dq_statusStatus of the rule execution in the Target dqtarget_dq_actual_outcomeActual outcome of the Target dq checktarget_dq_expected_outcomeExpected outcome of the Target dq checktarget_dq_actual_row_countNumber of rows of the target dqtarget_dq_error_row_countNumber of rows failed in the target dqtarget_dq_row_countNumber of rows of the target dqtarget_dq_start_timetarget dq start timestamptarget_dq_end_timetarget dq end timestampdq_dateDq executed datedq_timeDq executed timestampdq_job_metadata_infodq job metadata
DQ Query Output Table¶
Warning
DQ Query Output Table is optional. It is auto created and named as stats table with suffix _querydq_output.
Name can be overriden by passing querydq_output_custom_table_name
Default Behaviour: Detailed Stats table is disabled.
Schema¶
create table if not exists `<catalog>`.`<schema>`.`<stats_table_name>_querydq_output` (
run_id string, -- (1)!
product_id string, -- (2)!
table_name string, -- (3)!
rule string, -- (4)!
column_name string, -- (5)!
alias string, -- (6)!
dq_type string, -- (7)!
source_output string, -- (8)!
target_output string, -- (9)!
dq_time string, -- (10)!
);
run_idRun Id for a specific runproduct_idUnique product identifiertable_name--ruleRule namecolumn_namecolumn namealias--dq_type--source_output--target_output--dq_timeDq executed timestamp