Data drives most modern businesses, which means data integration and data migration need to be well-established, seamless processes, whether data is moving from one repository to another or to the cloud.
Migrating data as part of an enterprise systems migration project can become tricky; without careful planning it can easily get out of control, resulting in missed deadlines, excessive labor hours, data errors, and ultimately a project that ends up over budget and behind schedule. To minimize these risks, good planning and the use of best practices are critical. They make the process not just repeatable, and therefore incrementally more efficient, but also evolutionary, so that the quality of the data improves with every iteration.
In these kinds of projects, resolving data quality issues can become complex very quickly, especially when the planning phase is skipped and the team rushes to start populating load templates because it feels like real progress. The risk with this approach is that bad data still has to be cleaned up before it can be loaded, and that cleanup usually happens in the populated template itself, the place where it takes the most effort and where the risk of introducing errors is greatest.
To avoid these situations, we present 5 best practices to improve the quality of data during a migration project.
1. Keep versions of data extracts
A typical data migration project relies on database “staging tables”: a temporary place to hold snapshots of data so it can be analyzed, manipulated, and validated before being loaded into the target system. Every time a new set of data is loaded into the staging tables, an identifier for that dataset should be assigned to its records so that changes to the data can be tracked as the migration project progresses. Keeping versions of the data also makes it possible to track changes in the source system as records are added, modified, or deleted during day-to-day business operations, and it helps identify datasets with labels such as original, validated, or extracted from target.
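As a minimal sketch of what versioned staging loads can look like, the snippet below tags every record with a dataset identifier and a label at load time. It assumes a simple SQLite staging database and a hypothetical customers extract; the table and column names are illustrative, not from any particular system.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("staging.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stg_customers (
        extract_id    TEXT,   -- identifier of the dataset version
        extract_label TEXT,   -- e.g. 'original', 'validated', 'extract-from-target'
        loaded_at     TEXT,
        customer_id   TEXT,
        customer_name TEXT
    )
""")

def load_snapshot(rows, label):
    """Load a snapshot into staging, tagging every record with a version id."""
    extract_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO stg_customers VALUES (?, ?, ?, ?, ?)",
        [(extract_id, label, loaded_at, r["customer_id"], r["customer_name"]) for r in rows],
    )
    conn.commit()
    return extract_id

# Example: tag this extract as the 'original' snapshot from the source system.
extract_id = load_snapshot(
    [{"customer_id": "C001", "customer_name": "Acme Corp"}],
    label="original",
)
print("Loaded extract", extract_id)
```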
2. Use code mapping routinely
Maintaining a table that maps codes between the source and destination systems keeps these values in a single location, which makes them easier to maintain when they change and minimizes the risk of errors. Code mapping does not have to be limited to source and destination identifiers for entities; we recommend it for any code that needs a translation, no matter how few records it covers. As with staging tables, code maps should be versioned so that changes are tracked and the history of past data extracts is preserved in case it is needed.
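Below is a minimal sketch of a versioned code map kept in one place. The code sets (payment terms, country codes) and the values shown are purely illustrative assumptions; in practice the map would usually live in its own database table alongside the staging tables.

```python
CODE_MAP_VERSION = "2024-06-v2"  # bump whenever a mapping changes

CODE_MAP = {
    # (code_set, source_code) -> target_code
    ("payment_terms", "N30"): "NET30",
    ("payment_terms", "N60"): "NET60",
    ("country", "USA"): "US",
    ("country", "MEX"): "MX",
}

def translate(code_set, source_code):
    """Translate a source code to its target equivalent, failing loudly on gaps."""
    try:
        return CODE_MAP[(code_set, source_code)]
    except KeyError:
        raise ValueError(
            f"Unmapped {code_set!r} code {source_code!r} "
            f"(code map version {CODE_MAP_VERSION})"
        )

print(translate("payment_terms", "N30"))  # -> NET30
```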
3. Use an incremental approach to improve data quality and reduce processing time
A successful data migration process should aim to improve the quality of the data with every extract, transform, and load iteration. This means that rather than starting every extract with a fresh dataset, it should start from the last successfully validated one, compare it to the current snapshot from the source, and work on the differences, also called “deltas”. The differences between datasets should include records added after the last extract, records that were changed, records that were deleted, and records that no longer meet the extraction criteria. With this approach, records that have not changed do not need to be validated again since they have already been successfully loaded; records that were deleted or that no longer meet the include/exclude criteria are removed; and the transformations and validations focus only on the new and updated records. This is particularly important for large datasets because it reduces transformation and validation processing time.
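The sketch below illustrates one way to compute these deltas, assuming both the last validated extract and the current source snapshot are held as dictionaries keyed by the record's business key; the customer records are hypothetical.

```python
def compute_deltas(previous, current):
    """Return (added, changed, deleted) record sets between two extracts."""
    prev_keys, curr_keys = set(previous), set(current)
    added   = {k: current[k] for k in curr_keys - prev_keys}
    deleted = {k: previous[k] for k in prev_keys - curr_keys}  # removed or no longer in scope
    changed = {k: current[k] for k in prev_keys & curr_keys if previous[k] != current[k]}
    return added, changed, deleted

previous = {"C001": {"name": "Acme Corp", "terms": "NET30"},
            "C002": {"name": "Globex",    "terms": "NET60"}}
current  = {"C001": {"name": "Acme Corporation", "terms": "NET30"},  # changed
            "C003": {"name": "Initech",  "terms": "NET30"}}          # added; C002 dropped

added, changed, deleted = compute_deltas(previous, current)
# Only `added` and `changed` need to go through transformation and validation again.
print(len(added), len(changed), len(deleted))  # 1 1 1
```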
4. Apply include/exclude rules when extracting and validate on staging tables
The extraction of data from the source system should apply only the criteria that determine which records belong in the snapshot. Examples include active versus inactive customers and vendors, inventory items without activity in the last year, or transaction history after a certain date. The main purpose is to reduce the size of the dataset as much as possible, making it easier to manipulate in later steps. Validations should take place once a snapshot has been loaded into a staging table, and records should be marked valid only if they meet all the criteria. Examples of validations include required fields, valid lengths, values mapped from source to target, and dependencies on other entities.
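As a minimal sketch of row-level validation on a staging record, the snippet below checks required fields, lengths, mapped values, and a dependency on another entity. The field names, the customer entity, and the reference sets are illustrative assumptions; real projects would load these rules and lookups from their own tables.

```python
VALID_TERMS = {"NET30", "NET60"}        # values already translated via the code map
EXISTING_VENDOR_IDS = {"V100", "V200"}  # dependency loaded in an earlier iteration

RULES = [
    lambda r: "customer_id is required" if not r.get("customer_id") else None,
    lambda r: "customer_name exceeds 50 chars" if len(r.get("customer_name", "")) > 50 else None,
    lambda r: f"unmapped payment terms {r.get('terms')!r}" if r.get("terms") not in VALID_TERMS else None,
    lambda r: f"unknown parent vendor {r.get('vendor_id')!r}"
              if r.get("vendor_id") and r["vendor_id"] not in EXISTING_VENDOR_IDS else None,
]

def validate(record):
    """Mark a staging record valid only if every rule passes; keep the error list."""
    errors = [msg for rule in RULES if (msg := rule(record))]
    return {**record, "is_valid": not errors, "errors": errors}

print(validate({"customer_id": "C001", "customer_name": "Acme Corp",
                "terms": "NET30", "vendor_id": "V100"}))
```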
5. Source control of extract and validation scripts
Data migration project requirements evolve and change with every iteration. For this reason, keeping the scripts used to extract, transform, and load data under version control is an important practice. It not only preserves the history of changes made to an extraction or transformation rule, but also makes it easier to understand what changed between each extract and load iteration.
In this post we have listed 5 recommendations that data migration projects should consider in order to be successful. The nature of data migration projects makes them complex and resource-intensive efforts. Adopting best practices from the beginning improves data quality by reducing errors and shortens the time each extract and load iteration takes, with the ultimate goal of a smooth process when the final loads are made before going live.