Context-aware quality assessment of structured and unstructured data

Mylavarapu, Sesha Sai Goutam Sarma
Data analysis is a crucial process in data science that extracts useful information from any form of data. Its ease of access and maintenance makes structured data the most popular choice among many organizations even today. At the same time, with the rapid growth of technology, more and more unstructured data, such as text and images, is being produced in large amounts. Apart from the techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality deteriorates because of poor maintenance and mediocre data generation strategies employed by amateur users. This problem escalates with the advent of big data. Data cleaning is one possible solution to this problem. However, it requires a great deal of domain knowledge and expert inference to verify and repair the data. Data Quality Assessment (DQA) is an effective alternative that differentiates between good and bad quality data. Although DQA also requires domain knowledge, it does not repair or change the underlying data, so it is more viable to automate. In this dissertation, we propose two quality assessment models: one for structured data and one for the textual form of unstructured data. The context of data plays an important role in determining its quality. Therefore, we automate the process of context extraction in structured data using machine learning techniques. For textual data, we use natural language processing to identify data errors and assess quality. However, an accurate source of information is necessary to identify data errors. Therefore, we propose an automated mechanism that uses deep neural networks to identify the closest dataset with minimal user intervention. In addition, we examine multiple dimensions of data quality, such as completeness, accuracy, and consistency, to create a comprehensive quality assessment model. Our experimental results show the importance of data context and of multiple dimensions in quality assessment.