LOCATION: Washington, DC, US
YEAR: 2007
STATUS: Laureate
CATEGORY: Government
NOMINATING COMPANY: Sybase
ORGANIZATION: Internal Revenue Service
PROJECT NAME: Compliance Data Warehouse
Short Summary
Research analysts in the Internal Revenue Service are at the leading edge of providing decision analytics to support the agency’s mission of delivering effective service to our nation’s taxpayers. IRS researchers have a long and successful track record of innovation in a variety of areas, and continue to develop new and effective solutions to meet the tax administration challenges of the 21st Century. Data are the lifeblood of most IRS research studies, and it is imperative that researchers have access to data in an environment that is specifically designed for analysis. The Compliance Data Warehouse (CDW) has been developed by the Research, Analysis, and Statistics organization to meet this requirement, integrating over 100 terabytes (TB) of data from multiple disparate sources throughout the IRS legacy environment into a single, high-performance database. CDW represents one of the most complete and modernized database systems in the IRS today, and has recently inspired the CIO’s office to use it as a blueprint for modernizing other legacy systems that will support such operational activities as tax return classification, inventory management, and workload delivery. CDW has also been an important part of a renewed effort by the IRS to develop a modernized data architecture to support transaction, operational, and warehousing needs at an enterprise level. All of this has been accomplished in a very short period of time with a limited budget, resulting in a remarkably high return on investment to the IRS and taxpayers alike.
Introductory Overview
The Internal Revenue Service faces increasing demands to enforce the tax law, provide a high level of customer service, and modernize its infrastructure. The IRS Research community supports these strategic goals by providing analyses of taxpayer behavior; program or treatment effectiveness; estimates of the “tax gap”; predictive models to improve risk metrics and identify noncompliance; and other activities. With few exceptions, all research studies require some kind of data.
The Compliance Data Warehouse (CDW) was developed to provide a single, integrated database environment to support research and analysis, and is the largest database system of its kind in the IRS, managing over 100 terabytes (TB) of data for hundreds of researchers. CDW provides a low-cost, high-performance environment specifically designed for ad-hoc queries, predictive modeling, data mining, visualization, simulation, optimization, and other types of analyses that are needed to support the IRS mission. It also provides the most comprehensive online catalogue of metadata anywhere in the IRS, giving researchers the ability to quickly search for and understand the meaning of data being used in their analyses.
CDW’s data integration effort is no small task, as source data reside in multiple legacy environments, each with its own physical platform, formatting standards, naming conventions, retrieval mechanism, and other disparate features. This challenge is intensified by the massive amounts of data being captured, moved, and processed, as well as the fact that IRS legacy systems that provide source data to CDW are in the process of being modernized, requiring constant adaptation to such changes. But the benefits from CDW are significant by almost any measure.
The value from integrating data on a massive scale from multiple disparate source environments means that IRS research analysts spend less time searching for, compiling, linking, validating, and re-formatting data, and more time doing the actual data analysis itself. This in turn has translated into a more efficient and productive Research organization for the IRS.
Benefits
Has your project helped those it was designed to help?
Yes
What new advantage or opportunity does your project provide to people?
IRS researchers routinely face a variety of ad-hoc and unpredictable questions from the Department of Treasury, Congress, federal agencies, and even foreign governments. The Compliance Data Warehouse significantly reduces the time it takes them to search for, combine, standardize, and validate data, leaving more time to focus on decision analytics. The volume of data used by researchers in the IRS is massive, with typical analyses involving hundreds of millions, even billions, of rows of data. Without a centralized, high-performance database environment like CDW, geographically dispersed research offices would face prohibitive costs associated with acquiring, processing, moving, and analyzing data volumes of this magnitude.
To facilitate the process of quickly searching for and understanding the meaning of data used in research studies, CDW also provides a comprehensive, online catalogue of metadata through its Web site that includes data definitions, lookup tables, summary statistics, and links to other sites. CDW metadata are often as important as the underlying data themselves, and delivering these metadata through a robust, interactive Web site gives users an easy-to-use tool for interpreting the data they are using. The combination of an integrated repository of over 100 terabytes of data along with an online metadata library has delivered proven benefits to the IRS Research community.
Has your project fundamentally changed how tasks are performed?
Yes
How do you see your project's innovation benefiting other applications, organizations, or global communities?
CDW was recently selected by the CIO’s office as a blueprint for modernizing other IRS legacy systems that support such operational activities as tax return classification, inventory management, and workload delivery. In particular, CDW will be largely replicated in 2007 to support a new operational system providing streamlined data for up to eight independent projects, a consolidation effort that will likely result in millions of dollars of savings to the IRS.
Because CDW integrates and organizes data in a way that is inherently conducive to analysis, it has increasingly become a platform of choice within the IRS to quickly answer questions from such external customers as the Department of Treasury, Government Accountability Office, U.S. Congress, Federal Reserve Board, and a variety of other public and private institutions. The Treasury Department’s Office of Tax Analysis is so impressed with CDW that in 2007 it is scheduled to become the first organization in the Department of Treasury to have direct network access to it, eliminating the need to submit requests for data, analyses, or reports that would otherwise be done by IRS staff. Besides supporting external customers, CDW is also routinely used to analyze data for the IRS Commissioner’s office, CFO, National Taxpayer Advocate, and other business units.
The Importance of Technology
How did the technology you used contribute to this project and why was it important?
There is nothing trivial about managing an analytical environment for researchers with over 100 terabytes (TB) of data, as is the case with the IRS Compliance Data Warehouse. In such a massively large data complex, determining what technologies to invest in is also not trivial. Careful consideration must be given to a wide variety of established technologies, as well as next generation solutions that can lead to process efficiencies, expanded flexibility, and lower operating costs.
User requirements have played a critical role in driving the technology decisions for CDW. From the perspective of an IRS research analyst, criteria such as performance, data refresh rates, and flexible storage solutions are of paramount importance. To meet the need for fast database query times, CDW relies on Sybase IQ, a high-performance database engine specifically designed for analytics in massively large environments. SunFire E2900 servers from Sun Microsystems provide the computing horsepower needed to support the wide range of numerical analysis used by IRS researchers. A storage area network (SAN) from Hewlett Packard is used to connect both Fibre Channel and Serial ATA (SATA) disk arrays to manage CDW’s large data footprint.
One of the biggest challenges facing CDW has been the time and cost associated with routinely moving terabytes of source data from physically distant systems to the CDW environment. Until recently, this had largely been done using IBM tapes from IRS mainframe centers. The generation of tapes being used by the IRS stores less than two gigabytes (GB) of data per tape, forcing CDW staff to copy, ship, and unload hundreds of tapes on a regular basis. In addition, these tapes offer no means of encryption, putting the IRS at risk for unauthorized disclosure of personally identifiable information. In 2006, CDW worked with IRS enterprise operations staff at mainframe sites to install Network Attached Storage (NAS) devices, which are roughly the size of four IBM tapes but can hold up to two terabytes (TB) of data. They also provide 256-bit encryption for security. The NAS devices that are now used to ship data to the CDW environment have the equivalent capacity of roughly 1,500 IBM tapes, saving approximately two weeks of staff time. Over a five-year period, this innovation will result in millions of dollars of direct savings to the IRS.
Even though CDW performs a good deal of extraction, transformation, and loading (ETL) of data, it does not rely on any single tool or technology to do so. This is because IRS source data are in different locations and on different hardware platforms, each with its own format. In some cases, CDW must rely on assembly language to access certain legacy data in the IRS mainframe environment. Other data are in flat files and VSAM structures that might require COBOL or SAS to extract. Once data are moved to Unix servers in the CDW environment, however, processing is primarily done in C/C++, SAS, and database scripts. CDW has recently invested in DataFlux to automate and simplify the process of profiling, standardizing, and linking data, all part of a longer-term strategy to improve overall data quality.
For the IRS researchers who are responsible for analyzing data in CDW, a variety of third-party tools are available, including SAS, SPSS, SQL clients, and Hyperion Intelligence. Both server-side and client-only processing models are used depending on the volume of data and computational complexity for a given research study. One of the key benefits of CDW is that research analysts have the flexibility to choose tools that are right for a particular analysis, which often results in higher productivity.
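The tape and NAS capacities quoted in this section can be checked with rough arithmetic. The sketch below is illustrative only and not part of the original nomination: it assumes decimal units (1 TB = 1,000 GB), and the function names are hypothetical. The text gives only an upper bound for tape capacity ("less than two gigabytes"), so the code derives a lower bound on the tape count and the per-tape capacity implied by the "roughly 1,500 tapes" figure.

```python
import math

GB_PER_TB = 1_000  # assumption: decimal units; the source does not say which convention applies


def tapes_needed(data_gb: float, tape_capacity_gb: float) -> int:
    """Whole number of tapes required to hold data_gb of data."""
    if tape_capacity_gb <= 0:
        raise ValueError("tape capacity must be positive")
    return math.ceil(data_gb / tape_capacity_gb)


def implied_tape_capacity_gb(data_gb: float, tape_count: int) -> float:
    """Average per-tape capacity implied by a quoted tape count."""
    if tape_count <= 0:
        raise ValueError("tape count must be positive")
    return data_gb / tape_count


nas_capacity_gb = 2 * GB_PER_TB  # "up to two terabytes" per NAS device

# Tapes hold "less than two gigabytes", so 2 GB per tape gives a lower bound on the count.
min_tapes = tapes_needed(nas_capacity_gb, 2.0)

# Per-tape capacity implied by the "roughly 1,500 tapes" equivalence in the text.
implied_gb = implied_tape_capacity_gb(nas_capacity_gb, 1500)

print(f"At 2 GB per tape, one full NAS device replaces at least {min_tapes} tapes")
print(f"1,500 tapes per device implies about {implied_gb:.2f} GB per tape")
```

Under these assumptions, one full device replaces at least 1,000 tapes, and the 1,500-tape figure corresponds to an average of about 1.33 GB per tape, which is consistent with the stated "less than two gigabytes".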
Originality
What are the exceptional aspects of your project?
There are many areas in which the Compliance Data Warehouse has been able to deliver significant value to the IRS and other customers of tax administration data.
First, CDW is the only complete repository of data in the IRS for research and analysis, integrating very large volumes of data with extremely limited resources. Because of this, it tends to have a higher return on investment relative to other database environments of similar size and scope.
Second, CDW delivers the largest online catalogue of metadata anywhere in the IRS, doing so through its intranet Web site. These metadata, which include data definitions, lookup tables, and summary statistics for literally thousands of data elements, give users the capability to quickly search for and understand the meaning of data used for research and analysis. The catalogue has also attracted the attention of other enterprise IT projects, which are using it as a model for future modernization efforts throughout the IRS.
Third, CDW has recently saved the IRS millions of dollars by eliminating its use of IBM tapes in favor of Network Attached Storage (NAS) devices. These devices, which are no bigger than a tissue box, store the equivalent of over 1,500 tapes, providing a fast, secure means of transporting data between mainframe centers and the CDW environment. By using NAS devices, CDW staff are now able to move, stage, process, and load data in less than one tenth the time it took when using tapes.
Finally, CDW has been an important part of a renewed effort by the IRS to develop a modernized data architecture to support enterprise transaction, operational, and warehousing needs. CDW is managed by a small organization with a limited budget but is having a big impact on IRS enterprise decision making.
How is it original?
The Compliance Data Warehouse is the first analytical data repository of its kind in the IRS in terms of size, completeness of data, and customers. No other database management system in the IRS uses Network Attached Storage (NAS) devices to transport data quickly and efficiently between host sites, a technology that CDW staff introduced to update and release data to its research users on a more timely basis without any increase in staff. CDW developed the most comprehensive online catalogue of metadata found anywhere in the IRS, giving users the capability to quickly find and understand the meaning of data used in research and analysis. These metadata, which include data definitions, lookup tables, and summary statistics for literally thousands of data elements, have also attracted the attention of other enterprise IT projects, which are using the catalogue as a model for future modernization efforts throughout the IRS.
Is it the first, the only, the best or the most effective application of its kind?
All of the above
Success
Has your project achieved or exceeded its goals?
Achieved
Is it fully operational?
Yes
How many people benefit from it?
400
If possible, include an example of how the project has benefited a specific individual, enterprise or organization. Please include personal quotes from individuals who have directly benefited from your work.
The Compliance Data Warehouse directly supports multiple operating divisions and functional areas in the IRS on a regular basis. Researchers repeatedly express their appreciation for being able to answer simple questions in seconds, even when those questions require a statistical analysis of hundreds of millions, and even billions, of rows of data. This capability, in turn, makes other customers happy as well, and the entire customer-relationship loop provides CDW support staff critical feedback on how to continually improve the system.
The IT organization within the IRS has recently been so impressed with CDW that it is using it as a blueprint for modernizing other IRS legacy systems that will support such operational activities as tax return classification, inventory management, and workload delivery. That compliment has inspired new relationships between IT and business in the IRS, and has fed through to discussions involving future strategy, governance, and technology investments. To someone from the business side of an organization (like Research), this might seem too good to be true. Indeed, it is usually the IT organization that drives mandates throughout the enterprise. But the IRS has faced many years of missed deadlines, cost overruns, and outright project failures in its attempt to modernize legacy systems that have been around since the 1970s. In that regard, using a high-performing system like CDW as a blueprint makes sense.
How quickly has your targeted audience of users embraced your innovation? Or, how rapidly do you predict they will?
Hundreds of IRS researchers rely on the Compliance Data Warehouse on a routine basis to support their analytical needs. There is no other analytical system in the IRS today that provides the same breadth of data, tools, training, and support as CDW. In the past two years, the number of new CDW users has grown more than eight-fold, to over 300, a number that is expected to continue growing. Although these research analysts are geographically and organizationally dispersed, they can easily access and analyze data through CDW’s flexible, client-server architecture. Since there is no other system in the IRS today that provides the capabilities of CDW, it has not been difficult for research analysts to embrace. This goes for prospective new users as well, including the Treasury Department’s Office of Tax Analysis, which will be added to the list of users in the immediate future so that its staff can perform tax policy analyses for the first time on population-level data in CDW.
In the short run, CDW will be supporting IRS modernization projects as they transition to newer systems to support their operational needs. Some of these projects have readily embraced CDW and are already using it to support their needs. In the long run, CDW’s strategy includes developing and deploying more summarized and multidimensional data through its intranet Web site in the form of tables, charts, and maps, which will make use of the terabytes of atomic-level data residing in its database. By reaching both information consumers and research analysts alike, CDW expects the total number of users to continue to grow for the next several years.
Difficulty
What were the most important obstacles that had to be overcome in order for your work to be successful? Technical problems? Resources? Expertise? Organizational problems?
CDW has faced two important obstacles. First, like other federal agencies, the IRS is undergoing significant changes resulting from the Clinger-Cohen Act, the Federal Information Security Management Act (FISMA), and Federal Enterprise Architecture standards, all of which increase requirements associated with planning, reporting, security, and financial controls for IT projects. While these new mandates are intended to improve governance, oversight, and accountability, they are largely unfunded. This has resulted in a reallocation of resources away from innovative activities needed to expand the productive capacity of existing infrastructure, and is occurring at a time when the federal workforce is beginning to witness greater attrition due to retiring baby boomers. As a consequence, many projects like CDW have been forced to “rebuild the plane while flying it”, presenting strains on morale and risks to support for its continuing operations.
Second, CDW is not a system owned or managed by the CIO’s office, but rather by the Research, Analysis, and Statistics organization, which has generated subtle challenges between the business and IT. For example, CDW requirements are driven by IRS researchers whose needs for tools and technology are somewhat unique within the IRS, and may not always conform to Enterprise Architecture standards. At the same time, IRS research analysts would not be able to do their jobs without such specialization. This apparent paradox often creates conflict between the IT and Research organizations. Only through awareness and a mutual appreciation for the benefits from research activities has this challenge been addressed and mitigated.
Often the most innovative projects encounter the greatest resistance when they are originally proposed. If you had to fight for approval or funding, please provide a summary of the objections you faced and how you overcame them.
Compared to the production of widgets on an assembly line, research and analysis activities can be difficult to measure. However, the benefits from integrating disparate legacy data are not that difficult to measure, or even monetize, and CDW has had little or no difficulty justifying the materially lower costs and improved productivity that it generates. Consider the alternative to a single, integrated repository of over 100 terabytes of data for research and analysis: users would have to search for, standardize, link, compile, validate, and format data from disparate sources for each and every ad-hoc question. Because the typical volume of data for research studies is massive, this process might take weeks or even months before the actual research effort could even begin. Measuring the time savings from having an integrated data store like CDW is straightforward, and for the IRS has resulted in a significant return on investment for its Research community.
The Computerworld Honors Program is governed by the Computerworld Information Technology Awards Foundation
© 2010 Computerworld Honors Program