When designing a Data Migration (DM), respectively any software solution, it’s important to take inventory of project’s requirements, evaluate, document, communicate and monitor them accordingly. Each of them can have an important impact on the solution, as a solution’s success will be validated and judged upon them. Therefore, the identified requirements must be considered as baseline for conceptualization, design, implementation and sign-off, and should go through same procedures and rigor as other projects requirements. The existence of a standardized Requirements Management process can facilitate their management through project’s lifecycle.
The requirements are usually driven by the source and target systems (e.g. data import/export features, data models and their constraints), the environments they are hosted on (e.g. cloud vs. on-premise), respectively the layers in between (e.g. network, firewalls), project and business aspects that need to be considered (e.g. freeze window for the Go-Live, data availability dates, data quality, external dependencies, etc.). They resume to the solution itself as well to the data and processes involved, and are reflected but not limited to the following important aspects, that can be considered upon case also as quality acceptance criteria:
Accessibility is the degree to which the data are available for a solution so it can be processed when needed, in the form, by resources, or means intended for processing. It’s critical for a DM solution to access or have available the master, transaction, parameter and further data when needed. The team must make sure that the data become easily accessible.
Unavailability of data can impact the DM and can easily lead to delays in the project. This also means that the various project activities (parametrization, cleansing, enrichment, development) need to be synchronized with the migration activities.
Upon case, accessibility can involve the solution itself expressed as the degree to which it’s available to the resources supposed to use it. Certain architectural decisions can have impact on the carried activities. As the solution is usually deployed on a server, it can happen that only a limited number of people is able to access it concurrently. Moreover, a DM’s complexity makes the involvement of multiple developers challenging.
Accountability is the degree to which accountability is enforced for the various resources involved in DM processes and related activities. As multiple resources are involved for parametrization, cleaning, processing, validation, software development, each resource needs to be aware about the extent they are accountable for. Without accountability made explicit, there’s the danger that the activities are neglected, with all the implications deriving from it — quality deviations, delays, data unavailability, etc.
Adaptability is the degree to which a solution can be adapted to environment or requirement changes. Even if typically, the environments don’t change, it doesn’t mean that this will not happen as the IT infrastructure goes through continuous changes that can affect directly or indirectly a migration. Same can be said about requirements, which however have higher probability to change even late in the process as new knowledge is acquired and needs to be integrated in the solution.
Atomicity is the degree to which data entities can be processes at the required level of abstraction in an atomic manner. Even if transformations occur during the various stages, the data belonging to an entity need to be kept and processed together (e.g. Customers and their Addresses). This can involve processing attributes in advance even if the data might be required later. There can be situations in which the data belonging to the same entity need to be processed on different paths, though in the end it’s important to keep the data together, when the processing logic allows it.
Auditability is the degree to which the solution allows checking the data for their accuracy, or for their quality in general, respectively the degree to which the DM solution and processes allow to be audited regarding compliance, security and other types of requirements. All these aspects are important in case an external sign-off from an auditor is mandatory.
Automation is the degree to which the activities within a DM can be automated. Ideally all the processes or activities should be automated, though other requirements might be impacted negatively. Ideally, one needs to find the right balance between the various requirements.
Cohesion is the degree to which the tasks performed by the solution, respectively during the migration, are related to each other. Given the dependencies existing between data, their processing and further project-related activities, DM imply a high degree of cohesion that need to be addressed by design.
Complexity is the degree to which a solution is difficult to understand given the various processing layers and dependencies existing within the data. The complexity of DM revolve mainly around the data structures and the transformations needed to translate the data between the various data models.
Compliance is the degree to which a solution is compliant with internal or external regulations that apply. There should be differentiated between mandatory requirements, respectively recommendations and other requirements.
Consistency is the degree to which data conform to an equivalent set of data, in this case the entities considered for the DM need to be consistent to each other. A record referenced in any entity of the migration need to be considered, respectively made available in the target system(s) either by parametrization or migration.
During each iteration, the data need to remain consistent, so it can facilitate the troubleshooting. The data are usually reimported between iterations or during same iteration, typically to reflect the changes occurred in the source systems or other purposes.
Data coupling is the degree to which different processing areas within a DM share the same data, typically a reflection of the dependencies existing between the data. Ideally, the areas should be decoupled as much as possible.
Extensibility is the degree to which the solution or parts of the logic can be extended to accommodate further requirements. Typically, this involves changes that deviate from the standard functionality. Extensibility impacts positively the flexibility.
Flexibility is the degree to which a solution can handle new requirements or ad-hoc changes to the logic. No matter how good everything was planned there’s always something forgotten or new information is identified. Having the flexibility to change code or data on the fly can make an important difference.
Integrity is the degree to which a solution prevents the changes to data besides the ones considered by design. Users and processes should not be able modifying the data besides the agreed procedures. This means that the data need to be processed in the sequence agreed. All aspects related to data integrity need to be documented accordingly.
Interoperability is the degree to which a solution’s components can exchange data and use the respective data. The various layers of a DM’s solutions must be able to process the data and this should be possible by design.
Maintainability is the degree to which a solution can be modified to or add minor features, change existing code, corrects issues, refactor code, improve performance or address changes in environment. The data required and the transformation rules are seldom known in advance. The data requirements are definitized during the various iterations, the changes needing to be implemented as the iterations progress. Thus, maintainability is a critical requirement.
Repeatability is the degree with which a DM can be repeated and obtain consistent results between repetitions. Even if a DM is supposed to be a one-time activity for a project, to guarantee a certain level of quality it’s important to consider several iterations in which the data requirements are refined and made sure that the data can be imported as needed into the target system(s). Considered as a process, as long the data and the rules haven’t changed, the results should be the same or have the expected level of deviation from expectations.
This requirement is important especially for the data migrated during UAT and Go-Live, time during which the input data and rules need to remain frozen (even if small changes in the data can still occur). In fact, that’s the role of UAT — to assure that the data have the expected quality and when compared to the previous dry-run, that it attains the expected level of consistency.
Reusability is the degree to which the whole solution, parts of the logic or data can be reused for multiple purposes. Master data and the logic associated with them have high reusability potential as they tend to be referenced by multiple entities.
Modularity is the degree to which a solution is composed of discrete components such that a change to one component has minimal impact on other components. It applies to the solution itself but also to the degree to which the logic for the various entities is partitioned so to assure a minimal impact.
Partitionability is the degree to which data or logic can be partitioned to address the various requirements. Despite the assurance that the data will be migrated only once, in practice this assumption can be easily invalidated. It’s enough to increase the system freeze by a few days and/or to have transaction data that suddenly requires master data not considered. Even if the deltas can be migrated in system manually, it’s probably recommended to migrate them using the same logic. Moreover, the performing of incremental loads can be a project requirement.
Data might need to be partitioned into batches to improve processing’s performance. Partitioning the logic based on certain parameters (e.g. business unit, categorical values) allows more flexibility in handling other requirements (e.g. reversibility, performance, testability, reusability).
Performance refers to the degree a piece of software can process data into an amount of time considered as acceptable for the business. It can vary with the architecture and methods used, respectively data volume, veracity, variance, variability, or quality.
Performance is a critical requirement for a DM, especially when considering the amount of time spent on executing the logic during development, tests and troubleshooting, as well for other activities. Performance is important during dry-runs but more important during Go-Live, as it equates with a period during which the system(s) are not available for the users. Upon case, a few hours of delays can have an important impact on the business. In extremis, the delays can sum up to days.
Predictability is the degree to which the results and behavior of a solution, respectively the processes involve are predictable based on the design, implementation or other factors considered (e.g. best practices, methodology used, experience, procedures and processes). Highly predictable solutions are desirable, though reaching the required level of performance and quality can be challenging.
The results from the dry-runs can offer an indication on whether the data migrated during UAT and Go-Live provide a certain level of assurance that the DM will be a success. Otherwise, an additional dry-run should be planned during UAT, if the schedule allows it.
Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when migration’s reliability was confirmed by the users as part of the sign-off process. The dry-runs as well the final iteration for the UAT have the objective of confirming solution’s reliability.
Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve having a copy of the data stored respectively deleting and reloading the data when necessary.
Considering that the sequence in which the various activities is fix, in theory it’s possible to address reversibility by design, e.g. by allowing to repeat individual steps or by creating rollback points. Rollback points are especially important when loading the data into the target system.
Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect data’s processing or other requirements (e,g. performance). If the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solutions robustness. One can accommodate erroneous input by relaxing schema’s rules and adding further quality checks.
Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data to the defined level of authorization as data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.
Scalability is the degree to which the solution is able to respond to an increased workload. Given that the number of data considered during the various iterations vary in volume, a solution’s scalability needs to be considered in respect to the volume of data to be migrated.
Standardization is the degree to which technical standards were implemented for a solution to guarantee certain level of performance or other aspects considered as import. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.
Testability is the degree to which a solution can be tested in the respect to the set of functional and data-related requirements. Even if for the success of a migration are important the data in their final form, to achieve that is needed to validate the logic and test thoroughly the transformations performed on the data. As the data go trough the data pipelines, they need to be tested in the critical points — points where the data suffer important transformations. Moreover, one can consider record counters for the records processed in each such critical point, to assure that no record was lost in the process.
Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems as record, respectively at entity level. In theory, it’s enough to document the changes at attribute level, though upon case it might needed to document also the changes performed on individual values.
Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurrent within values.
Efficiency is the degree to which a solution uses the hardware (storage, network) and other organizational resources to fulfill a given task. Data characterized by high volume, velocity, variety and veracity can be challenging to process, requiring upon case more processing power. Therefore, the DM solutions need to consider these aspects as well. However, efficiency refers on whether the available resources are used efficiently — the waste is terms of resource utilization is minimal.
On the other side the waste of resources can be acceptable when there are other benefits or requirements that need to be considered, respectively when the ratio between resources utilization and effort to built more efficient processes is acceptable.
A DM solution involves iterative and exploratory processes in which knowledge and feedback is integrated in each iteration, therefore it might look like resources are not used efficiently. However, this is a way to handle complexity and uncertainty by breaking the effort in manageable chunks.
Learnability is the degree to which a person can become familiar with a solution’s use, the data and the processes associated with it. A DM can be challenging for many technical and non-technical resources as it requires a certain level of skillset and understanding of the requirements, needs and deliverables. The complexity of the data and requirements can be overwhelming, however with appropriate communication and awareness established, the challenges can be overcome.
Stability is the degree to which a solution is sensitive to environment changes (e.g. overuse of resource, hardware or software failures, updates), respectively on whether it performs with no performance defects or it does not crash under defined levels of stress. Stability can be monitored during the various phases and countermeasures need to be considered in case the solution is not stable enough (e.g. redesigning the solution, breaking the data in smaller chunks).
Suitability is the degree to which a solutions provides functions that meet the stated and implied needs. No matter how performant and technologically advanced a solution is, it brings less value as long it doesn’t perform what it was intended to do.
Transparency is the degree to which a solution’s stakeholders have access to the requirements, processes, data, documentation, or other information required by them. In a DM transparency is important especially important in respect to the data, logic and rules used in data processing, respectively the number of records processed.
Trustability is the degree to which a solution can be trusted to provide the expected results. Even if the technical team assures that the solution can deliver what was indented, the success of a DM is a matter of perception from stakeholders’ perspective. Providing transparency into the data, rules and processes can improve the level of trust however, special attention need to be given to the issues raised by stakeholders during and after Go-Live, as differences need to be mitigated.
Understandability is the degree to which the requirements of a solution were understood by the resources involved in terms of what needs to be performed. For the average project resource it might be challenging to understand the implications of a DM, and this can apply to technical as well non-technical resources. Making people aware of the implications is probably one of the most important criteria for success, as the success of a migration is often a matter of perception.
Usability is the degree to which a solution can be used by the targeted users within the agreed context of usage. Ideally DM solutions need to be easy to use, though there are always trade-offs. In the end, a DM must fit the purpose it was built for.
Originally published at http://sql-troubles.blogspot.com.