Many researchers at TAMU require secure and compliant computing facilities to conduct research projects involving analysis of sensitive or proprietary data, yet TAMU currently has no good options. Privacy protection is a complex issue requiring a holistic approach (e.g., technology, statistics, policy) and a shift toward a culture of information accountability through transparency. Traditional approaches to protection via informed consent and de-identification are no longer effective in an era where large amounts of data and computational methods are readily available. Here, we propose a new paradigm that regards person-level data as valuable but hazardous research material. Taking the Privacy by Design approach, we focus on building a safe environment consisting of (1) a well-designed secure computer system, with (2) the secure software and data required to conduct research safely, along with (3) a policy framework for human protection. Using this approach, Dr. Kum has designed four different data access systems (restricted access, controlled access, monitored access, and open access) based on common activities in the research pipeline. Together, the four access levels can minimize risk and optimize the utility of data for the social, behavioral, economic, and health (SBEH) sciences (Kum et al., 2013). She has been running a pilot system in the Population Informatics Lab.
We propose to take this system to scale and develop the Texas Virtual Data Library (ViDaL). ViDaL will (1) provide a secure cloud computing data infrastructure that supports data-intensive research using sensitive person-level data (e.g., health data) or proprietary licensed data while meeting the myriad legal requirements for handling such data (e.g., HIPAA, Texas HB 300, NDAs), and (2) accumulate and host data sources of general interest (e.g., HCUP, CMS, SEER), which often must be purchased or processed before they are suitable for research, and make them available to researchers with appropriate approvals and permissions. The Texas ViDaL project will extend the capacity of the current High Performance Research Computing (HPRC), a shared computing infrastructure used by many A&M investigators and students, to support an even wider user base that includes those who need secure, compliant computing as well as general-interest data sources. All units with interested faculty would benefit directly from the proposed virtual computing facility because remote access from their own personal computers would be the main mode of access. It will also complement the existing TX-federal census RDC infrastructure, a facility supporting research using restricted-access federal data, by extending secure access to a wider array of data sources, including Texas state data. The Texas ViDaL research infrastructure will enable researchers across many disciplines at TAMU (e.g., public health, public policy, sociology, remote health, transportation, computer science, statistics) to develop new research agendas and collaborative team science, and to become thought leaders for data-intensive research in their respective fields. The infrastructure is critical for researchers to be competitive for external funding opportunities for leading-edge research that necessarily involves sensitive person-level data.
For example, this infrastructure will be essential to the recently funded NSF ERC, PATHS-UP, for securing the sensitive data collected through remote sensing devices. Many institutions are developing secure computing facilities for handling sensitive data, and such facilities are becoming a necessity in many fields of science that involve person-level data. We anticipate that the Texas ViDaL will lead to more publications, more external funding, and expanded data science education at Texas A&M.