biX Consulting examined the quality and distribution of master data for a provider of fuel and toll invoices. In the case described here, 30,000 German number plates that were directly available in SAP BW had to be checked for quality because they had been entered manually in the source system. Since checking each plate by hand would have been too time-consuming, quality was assessed by examining how similar the number plates were to one another.
For this purpose, the number plates were abstracted in a feature engineering step so that only the sequence of letters, digits and special characters remained. For example, "ME AB 123" became "AA AA 111". This made it possible to combine the number plates into groups with identical character sequences. The size of these groups and their similarity to each other were then visualised using a machine learning algorithm. In the visualisation, similar groups of number plates appear close to each other.
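The abstraction and grouping described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the project: the function names `abstract_plate` and `group_sizes` and the sample plates are hypothetical, and the rule that every character other than a letter or digit is kept unchanged is an assumption (the source only shows that spaces survive the abstraction).

```python
from collections import Counter


def abstract_plate(plate: str) -> str:
    """Replace letters with 'A' and digits with '1'.

    Assumption: all other characters (spaces, hyphens, special
    characters) are kept unchanged.
    """
    out = []
    for ch in plate:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("1")
        else:
            out.append(ch)
    return "".join(out)


def group_sizes(plates):
    """Count how many plates share each abstracted character sequence."""
    return Counter(abstract_plate(p) for p in plates)


# Hypothetical sample data, not taken from the project.
plates = ["ME AB 123", "K XY 4567", "ME CD 456", "B-MW 1", "M!!X-123A"]
sizes = group_sizes(plates)

# Print the groups, largest first.
for pattern, count in sizes.most_common():
    print(f"{pattern!r}: {count}")
```

Sorting the groups by size puts the most common formats first, while rare patterns at the end of the list are natural candidates for closer inspection.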
Distribution of all number plate groups and their size (visualisation in Tableau Desktop)
The visualisation showed that there were several hundred number plate groups with widely varying characteristics, and that many of the character sequences, such as A!!A-111A or A-AAAAA-AAA, were invalid.
Selected number plate groups highlighted with their character sequences (display in Tableau Desktop)
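Once the plates are reduced to character sequences, clearly invalid groups can also be flagged automatically. The following sketch is an assumption and not part of the analysis described here: it checks an abstracted sequence against a simplified rule for standard German plates (1 to 3 letters for the district, 1 to 2 letters, 1 to 4 digits, and an optional suffix letter such as E or H). The names `PLAUSIBLE_PATTERN` and `is_plausible` are hypothetical, and special plate types are deliberately ignored.

```python
import re

# Assumed, simplified rule for standard German plates after abstraction:
# 1-3 letters, a space or hyphen, 1-2 letters, an optional space or
# hyphen, 1-4 digits, and an optional suffix letter (E or H plates).
PLAUSIBLE_PATTERN = re.compile(r"A{1,3}[ -]A{1,2}[ -]?1{1,4}A?")


def is_plausible(pattern: str) -> bool:
    """Return True if an abstracted sequence fits the simplified rule."""
    return PLAUSIBLE_PATTERN.fullmatch(pattern) is not None


# The first entry is the abstraction of "ME AB 123"; the other two are
# the invalid examples named in the text.
for pattern in ["AA AA 111", "A!!A-111A", "A-AAAAA-AAA"]:
    print(pattern, "->", "plausible" if is_plausible(pattern) else "invalid")
```

A rule like this only filters by shape. A group that passes can still contain plates with non-existent district codes, so it complements rather than replaces a check of the individual values.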
Even this quick analysis was therefore enough to rule out using the master data directly for the time being, because its quality was insufficient.
However, the grouping and visualisation provide a good basis for follow-up use cases. For example, when preparing training data for machine learning, the number plates can be labelled as correct or incorrect group by group, which is much faster than labelling them one at a time.