Into Your System – Multi-Sources, Mass Data, Multi Format
As systems become more complex, and we get data from a variety of sources, what are the best practices for loading that data in? We’ll ignore manual data entry in this post and assume that we already know how to do that – as effectively and efficiently as possible. At times, this might be the only way, but we can always plan for more automation.
The First Step – Plan, Develop Metrics and Monitoring Tools
Prioritize a list of the most common data elements that need to be uploaded, entered and managed. A good book on analytics and data mining can help you identify which data elements matter most to your organization. As a first step, you can run some queries to begin to analyze where your database may be “weak.”
Next, you should develop a set of reports that show how the data elements you need to focus on are being monitored and managed. For example: What’s the contactable rate of living individuals in your list? Contactable rate might mean those records with a good home address, a good email address, a good phone number, or some combination of similar metrics. If you take a periodic snapshot of these statistics, you can see the trends over time – and will know where you need to focus your attention. If you group by organizational sub-unit, these metrics can help you educate the units on what they should be focusing on. For example, if the percentage of good phone numbers drops for an organizational unit, you can focus more of your attention there.
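A metric like contactable rate is simple to compute once you decide what “good” means. Here is a minimal sketch in Python, assuming hypothetical record dictionaries with `home_address`, `email` and `phone` fields – adapt the field names and the definition of “good” to your own schema and business rules.

```python
def contactable_rate(records):
    """Return the fraction of records with at least one good contact point."""
    if not records:
        return 0.0
    contactable = sum(
        1 for r in records
        if r.get("home_address") or r.get("email") or r.get("phone")
    )
    return contactable / len(records)

# Hypothetical sample data for illustration.
donors = [
    {"name": "A", "email": "a@example.org", "phone": None, "home_address": None},
    {"name": "B", "email": None, "phone": None, "home_address": None},
    {"name": "C", "email": None, "phone": "555-0100", "home_address": "1 Main St"},
]
print(f"Contactable rate: {contactable_rate(donors):.0%}")
```

Run the same calculation grouped by organizational sub-unit and stored as a periodic snapshot, and you have the trend report described above.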
Collecting and managing a large amount of information is always challenging, particularly in fundraising databases where the fundraisers need a very complete picture of a donor. Quite frankly, without a plan, priorities and the right metrics to determine where the holes in the data are, it’s hard to know where to begin. The above approach will help provide better focus over the long haul and is a good starting point for deciding what your strategies for actually loading the data need to be.
And don’t forget to scale up your staff accordingly. Ideally you want to have uploads done under the watchful eye of your “data czar” and their compatriots rather than by the general users. You’ll need to ensure that the queue can be managed as the amount of data to be loaded increases – and it always does. You also need to ensure that it’s correct and has the level of integrity that you need.
Loading – Use the Vendor’s Supplied Import Utilities (Least Flexible, Least Risk)
Most fundraising software comes with a set of data loading utilities that vary in their degree of flexibility and robustness. Very often they’re good for basics such as duplicate checking on a data load and can accept a variety of formats. However, they don’t always have the “extra” logic you need for data cleansing, or for data conditions you may need to enforce because of your unique business rules or processes.
These utilities are usually run by the designated database administrator and are not usually farmed out across the organization because of the degree of risk associated with changing source data. In fact, this would be true of any of the approaches. (Manual data entry may have more variety on these themes.)
Loading – Purchase a 3rd Party Add-On (More Flexible, Low Risk)
Some software companies have an ecosystem of software developers that produce add-ons focusing on some particular gap in the vendor’s software. An example with Blackbaud is a company that provides a tool called ImportOmatic.
These types of add-ons provide more robust data importing capabilities at a reasonable price point. Just make sure that they’re likely to be upgraded and enhanced over time along with the base product. Make sure you’re getting value for your money – these add-ons are not a panacea, nor will they compensate for practices that are not best in class.
Loading – Write Your Own Utilities to Upload Data (Most Flexible, Most Risk)
You’ve got to be a strong programmer and really focus on the risks associated with changing source data. Are you sure you know all of the constraints and relationships across all the tables? Will you be able to modify your scripts whenever there’s a fundraising software upgrade?
On the plus side, this approach can be infinitely flexible and can often leverage data from other areas in the organization in ways that a vendor’s supplied import utilities can’t handle.
Writing your own utilities, or using an ETL (Extract, Transform and Load) tool, also lets you take advantage of enterprise-grade scheduling, error checking, logging and other features of your database software.
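The error-checking and logging pattern can be sketched briefly. This example assumes a CSV staging file and a hypothetical `load_record()` function standing in for the real database write; row-level failures are logged and skipped rather than aborting the whole run, and the totals are logged at the end for the verification step.

```python
import csv
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("loader")

REQUIRED = ("donor_id", "gift_amount")  # hypothetical required columns

def load_record(row):
    """Placeholder for the real database write."""
    pass

def run_load(path):
    loaded, failed = 0, 0
    with open(path, newline="") as f:
        # Header is line 1, so data rows start at line 2.
        for lineno, row in enumerate(csv.DictReader(f), start=2):
            try:
                missing = [c for c in REQUIRED if not row.get(c)]
                if missing:
                    raise ValueError(f"missing {missing}")
                float(row["gift_amount"])  # basic type check
                load_record(row)
                loaded += 1
            except Exception as exc:
                failed += 1
                log.error("line %d rejected: %s", lineno, exc)
    log.info("loaded=%d failed=%d", loaded, failed)
    return loaded, failed
```

In a real deployment the same script would be wrapped in the database’s job scheduler, with the log shipped somewhere the data czar actually reads it.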
Make sure you have hyper checking in place once the load is done. Hyper checking means that you need to spend lots of time verifying that the data was loaded correctly – and this can often be a very complex exercise depending on what’s being loaded. (This is true for all of the approaches – but when you write your own utilities, I feel that the risk is even greater and checking deserves the most due diligence.)
One approach I like is to develop a set of reports that describe the data before it’s loaded. These reports could have counts sorted in various ways, lists of probable errors and ideally a before-and-after picture of how your data looks. You should also have reports that you can run that show your database state after the load. Sometimes data changes may not be obvious when you run the program. I’m sure all of us at some point have regretted doing a global update.
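The before-and-after idea can be as simple as snapshotting a few counts, re-running them after the load, and diffing the two. This sketch uses hypothetical record fields; in practice the snapshot would come from queries against the real database.

```python
from collections import Counter

def snapshot(records):
    """Summarize the database state as a few counts."""
    return Counter(
        {
            "total": len(records),
            "with_email": sum(1 for r in records if r.get("email")),
            "with_phone": sum(1 for r in records if r.get("phone")),
        }
    )

def diff_report(before, after):
    """Show what changed between the two snapshots."""
    return {k: after[k] - before[k] for k in set(before) | set(after)}

before = snapshot([{"email": "a@x.org"}, {"phone": None}])
# ... run the load ...
after = snapshot([{"email": "a@x.org"}, {"phone": "555-0100"}, {"email": "b@x.org"}])
print(diff_report(before, after))
```

If the diff shows anything other than the counts you expected from the input file, you know to investigate before anyone touches the loaded records.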
Additionally – know all about when your backups happen and how to restore from them if a job fails.
Loading – A Hybrid Approach (Most Flexible, Mitigated Risk)
Use either the vendor’s supplied utilities or a third party developed application – and couple this with your data warehouse. For example, you can create a staging area, where you upload data in pre-determined formats such as CSV or Excel, and have scripts in your data warehouse that format the data, do the cleansing and produce output for vendor/third party importing tools.
This gives you the ability to apply complex business rules for cleansing and modifying data without having to touch the live data directly – the vendor tools will handle the rest. If you try to upload data and the referential integrity isn’t correct, the upload will fail. Your live data hasn’t been touched. It’s easy to go back to the drawing board and modify your scripts.
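A staging script along these lines can be quite small: read a raw CSV dropped into the staging folder, apply cleansing rules, and write a vendor-ready file for the supplied import utility to pick up. The file names, columns and cleansing rules below are hypothetical stand-ins for your own business rules.

```python
import csv

def cleanse(row):
    """Apply simple cleansing rules to one record (illustrative only)."""
    row = {k: (v or "").strip() for k, v in row.items()}
    row["email"] = row["email"].lower()
    row["postal_code"] = row["postal_code"].replace(" ", "").upper()
    return row

def stage_to_vendor(src_path, dst_path):
    """Read the raw staging file and write a cleansed, vendor-ready CSV."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(cleanse(row))
```

Because the script only ever writes a new file, a bad run costs you nothing but a re-run – the live database is never touched at this stage.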
You can enhance this process for units outside the central organization.
In one place I worked, they provided a number of blank spreadsheet templates in specific formats to the various decentralized units. The units would populate the spreadsheets with the data to be uploaded, these would be sent back to central, saved in pre-set file folders and then a system job would do the cleansing and analysis of the data before loading. The various templates, what they should be used for and how they were to be populated were available through the Intranet so all of the departments had access to them.
This approach had the additional advantage of “training” the units in terms of their understanding of data quality and data requirements. It ended up saving the central staff many hours of time since data came in a more structured and final format and could be easily uploaded.
I probably prefer a hybrid approach as it gives you the most flexibility in the pre-staging of your data, particularly with cleansing as it relates to your own unique business rules.
Last But Not Least
Don’t forget to have the appropriate policies and procedures in place regarding the uploading of data. There may need to be formal agreements with other parts of the organization, as there can often be radically different points of view, or even different legalities, as to how centrally integrated data is expected to be managed and used. Better to sort this out at the outset.
Make sure you understand the data you’re receiving, what it means and what some of the human behaviors associated with it are. Data may look the same over time, but the purpose for which it is being collected and how it is collected can change. Policies and procedures, along with agreed-upon definitions, go a long way to help enforce data integrity and quality.
Data loading – and particularly automated loading – can really make an organization effective if you take a systematic and organized approach. Over time, it all becomes part of the culture and will be smooth and seamless.