Texas Virtual Data Library (Texas ViDaL)

A Secure and Legally Compliant Data Infrastructure

Announcements

Thank you for coming to the kick-off meeting!

We received the $1.4 million TAMU RDF funding for Spring 2018 to implement the Texas ViDaL (Virtual Data Library)! We plan to start pilot projects on ViDaL in Spring 2019. Please let us know if you have a pilot project you would like to start.

For more information, please contact Dr. Hye-Chung Kum (kum at tamu dot edu) or check out the "People" page to find your local contact.

Welcome!

Many researchers at TAMU require secure and compliant computing facilities to conduct research projects involving the analysis of sensitive or proprietary data, yet TAMU currently has no good options. Privacy protection is a complex issue that requires a holistic approach (e.g., technology, statistics, policy) and a shift toward a culture of information accountability through transparency. Traditional approaches to protection via informed consent and de-identification are no longer effective in an era where large amounts of data and computational methods are readily available. Here, we propose a new paradigm that regards person-level data as valuable but hazardous research material. Taking the privacy-by-design approach, we focus on building a safe environment consisting of 1) a well-designed secure computer system, with 2) the secure software and data required to conduct research safely, along with 3) a policy framework for human protection. Using this approach, Dr. Kum has designed four different data access systems (restricted access, controlled access, monitored access, and open access) based on common activities in the research pipeline. Together, the four access layers can minimize risk and optimize the utility of data for the social, behavioral, economic, and health (SBEH) sciences (Kum et al., 2013). She has been running a pilot system in the Population Informatics Lab.

We propose to take this system to scale and develop the Texas Virtual Data Library (ViDaL). ViDaL (1) is a secure cloud computing data infrastructure that supports data-intensive research using sensitive person-level data (e.g., health data) or proprietary licensed data, and meets the myriad legal requirements for handling such data (e.g., HIPAA, Texas HB 300, NDA), and (2) will accumulate and host good data sources of general interest (e.g., HCUP, CMS, SEER), which often need to be purchased or processed to be made suitable for research, and make them available to researchers with appropriate approvals and permissions. The Texas ViDaL project will extend the capacity of the current High Performance Research Computing (HPRC), a shared computing infrastructure used by many A&M investigators and students, to support an even wider user base that includes those who need secure, compliant computing as well as general-interest data sources. All units with interested faculty would benefit directly from the proposed virtual computing facility because remote access from their own personal computers would be the main mode of access. It will also complement the existing TX-federal census RDC infrastructure, a facility supporting research using restricted-access federal data, by covering a wider array of data sources, including Texas state data.

The Texas ViDaL research infrastructure will enable researchers across many disciplines at TAMU (e.g., public health, public policy, sociology, remote health, transportation, computer science, statistics, etc.) to develop new research agendas and collaborative team science, becoming thought leaders for data-intensive research in their respective fields. The infrastructure is critical for researchers to be competitive for external funding opportunities for leading-edge research that necessarily involves sensitive person-level data. For example, for the recently funded NSF ERC, PATHS-UP, this infrastructure will be essential for securing the sensitive data collected through remote sensing devices. Many institutions are working on developing secure computing facilities for handling sensitive data, and such facilities are becoming a necessity for many fields of science that involve person-level data. We anticipate that the Texas ViDaL will lead to more publications, external funding, and data science education at Texas A&M.

Types of Compliant Computing Available

  • Large RAM nodes: 4 nodes with 1.5 TB RAM each
  • Regular RAM nodes: 16 nodes with 192 GB RAM each
  • GPU nodes: 4 GPU nodes, each with 192 GB RAM and two NVIDIA V100 GPUs

General Timeline

  • Sep 2018: Hardware ordered; delivery expected by Nov
  • Sep 24, 2018: Management Team Meeting
  • Oct 22, 2018: Kick-off & User Committee Meeting
  • Nov & Dec 2018: Recruitment of pilot projects
  • Jan 15, 2019: First batch of pilot projects
  • Spring semester 2019: Phase in all the pilot projects

Foundational Publications