BDEEP
Big Data in Environmental Economics and Policy
Research Group

Our team is focused on problems in public economics and public policy, with particular focus on cities and the environment. Current projects include: climate change policy, urban growth and expansion, transportation policy, urban drinking water, housing discrimination, and environmental amenities in cities. We utilize a combination of applied microeconomics and data science methods and have developed a software stack that enables the use of high performance computing in observational research and experimental trials. We regularly partner with data and technology companies to bring new technological platforms and data sources into economic research. Please take a look at our current projects and our GitHub for examples of our research. Our work is made possible through generous funding support from the National Science Foundation, the US Environmental Protection Agency (EPA), the Sloan Foundation, the Russell Sage Foundation, and Uber Technologies.

Infrastructure Stack

We employ a set of customized tools to enable the acquisition and analysis of large datasets in observational research and experimental trials. Our infrastructure is designed to support a continuously integrated pipeline that includes the following primary components:
  • Acquire large datasets.
  • Store large datasets in a fully secure system.
  • Pre-process and analyze data in virtual computing environments.
  • Host replicable and continuously updating datasets and applications.
  • Compile and continuously update documents for publication.

Here is a simple diagram of the BDEEP Infrastructure Pipeline.

Acquisition

Traffic delays, housing transactions, social media posts, pollution concentrations, and satellite observations are examples of data we regularly query. We build tools to facilitate the acquisition of these datasets. Different data sources are queried at different rates and using different protocols. For example, data are downloaded from monitoring websites at discrete intervals and sometimes in real time. Scripts are built to acquire data through downloading, scraping, crawling, or directly engaging with users of applications.

Scripts run in Docker containers and are scheduled to execute in accordance with the research design of each project. Docker containers allow developers and system administrators to isolate applications so that they run in a consistent environment regardless of the machine they are running on.
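As an illustration of querying a source at discrete intervals, here is a minimal Python sketch. It is not BDEEP's actual acquisition code: the function names (`poll`, `acquire_to_jsonl`), the `fake_pollution_monitor` stand-in, and the JSON Lines output format are all assumptions made for the example. In practice the fetch function would issue a request to a monitoring website, and the script would be launched inside a container by a scheduler.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable, Iterator


def poll(fetch: Callable[[], dict], interval_seconds: float, n_polls: int,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Call `fetch` n_polls times, waiting interval_seconds between calls."""
    for i in range(n_polls):
        record = fetch()
        # Stamp each observation with its acquisition time in UTC.
        yield {"acquired_at": datetime.now(timezone.utc).isoformat(),
               "data": record}
        # No need to wait after the final poll.
        if i < n_polls - 1:
            sleep(interval_seconds)


def fake_pollution_monitor() -> dict:
    """Hypothetical stand-in for a request to a monitoring website."""
    return {"station": "demo-001", "pm25": 12.5}


def acquire_to_jsonl(path: str, fetch: Callable[[], dict],
                     interval_seconds: float, n_polls: int,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Append one JSON line per poll; return the number of records written."""
    count = 0
    with open(path, "a", encoding="utf-8") as fh:
        for record in poll(fetch, interval_seconds, n_polls, sleep):
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Passing the `sleep` function as a parameter keeps the polling logic easy to test without waiting, and the polling rate can be set per data source to match a project's research design.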

Store

In some cases, we store individual files (JSON, CSV, RDS, SHP, TIFF, etc.), but more often computational efficiencies or other empirical protocols require database storage (e.g. MongoDB, SQL). Our infrastructure allows us to store large datasets and allows group members to access and/or query these datasets. BDEEP uses Samba to host a shared network as well as an Active Directory server. The shared network allows members of our team to collaborate through a common network-mounted directory.

The Active Directory server runs on Samba and serves as our credential server, which allows us to add, delete, or modify user credentials in one place. Currently, it is only used to access the shared network, but it also has a range of other applications.

We recently added a Postgres database to our infrastructure. Because R holds data in memory, operations on many of our larger datasets run into “out-of-RAM” errors. Our Postgres database can handle the partitioning, subsetting, and merging operations for larger datasets. This allows us to work more efficiently by targeting an analysis using database queries rather than loading entire datasets into R. We are developing a warehouse for the datasets that are stored in our Postgres database.
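The idea of pushing subsetting and merging into the database can be sketched as follows. This is an illustration under stated assumptions, not our production setup: it uses Python's built-in `sqlite3` module as a stand-in for Postgres so the example runs anywhere, and the table names, columns, and values are invented for the demonstration.

```python
import sqlite3

# In-memory SQLite database used here as a stand-in for a Postgres server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY, tract TEXT, year INTEGER, price REAL);
CREATE TABLE pollution (tract TEXT, year INTEGER, pm25 REAL);
INSERT INTO transactions (tract, year, price) VALUES
    ('A', 2015, 200000), ('A', 2016, 210000),
    ('B', 2016, 150000), ('B', 2016, 170000);
INSERT INTO pollution VALUES
    ('A', 2015, 9.0), ('A', 2016, 8.5), ('B', 2016, 12.0);
""")


def mean_price_by_tract(conn: sqlite3.Connection, year: int) -> list:
    """Subset to one year, merge with pollution, and aggregate in SQL.

    Only the small aggregated result is returned to the client, so the
    full tables never have to fit in the analyst's memory.
    """
    query = """
        SELECT t.tract, AVG(t.price) AS mean_price, p.pm25
        FROM transactions AS t
        JOIN pollution AS p
          ON p.tract = t.tract AND p.year = t.year
        WHERE t.year = ?
        GROUP BY t.tract, p.pm25
        ORDER BY t.tract
    """
    return conn.execute(query, (year,)).fetchall()


rows = mean_price_by_tract(conn, 2016)
for tract, mean_price, pm25 in rows:
    print(tract, mean_price, pm25)
```

The same pattern applies from R: the query runs in the database and only the result set is loaded into the session.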

Finally, we use AWS for system backups.

Analysis

All BDEEP team members are able to access BDEEP project files on our shared network. BDEEP members perform data analysis using an RStudio Server instance, which they can access through a web browser window. R is an increasingly widespread programming language in economics and data science. For information on how we set up RStudio Server on our cloud, see: Installing RStudioServer on Ubuntu 15.04.

Communication

The results of our research are directly compiled and updated in for-publication documents and presentations using a system that is based on LaTeX. Key graphs, tables, maps, and other figures can also be hosted directly on our website in more interactive formats (e.g. d3.js, Shiny).
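One way such a system can keep documents current is to regenerate table fragments from analysis output and include them in the LaTeX source. The sketch below is a hypothetical illustration in Python, not a description of our actual tooling: the function name `to_latex_table`, the column layout, and the example values are all assumptions.

```python
def to_latex_table(header, rows, float_fmt="{:.3f}"):
    """Render a header and rows as a LaTeX tabular environment."""

    def cell(value):
        # Format floats consistently; everything else is converted as-is.
        text = float_fmt.format(value) if isinstance(value, float) else str(value)
        # Escape a few characters that are special in LaTeX text.
        for ch in ("&", "%", "_", "#"):
            text = text.replace(ch, "\\" + ch)
        return text

    # First column left-aligned, remaining columns right-aligned.
    spec = "l" + "r" * (len(header) - 1)
    lines = ["\\begin{tabular}{" + spec + "}", "\\hline"]
    lines.append(" & ".join(cell(h) for h in header) + " \\\\")
    lines.append("\\hline")
    for row in rows:
        lines.append(" & ".join(cell(v) for v in row) + " \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines) + "\n"


# Illustrative values only; these are not research results.
table = to_latex_table(["Variable", "Estimate"],
                       [["pm25", -0.0421], ["n_obs", 1200]])
print(table)
```

Writing the returned string to a `.tex` file that the main document pulls in with `\input` means that rerunning the analysis updates the table the next time the document is compiled.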

In keeping with standards for reproducible research, all BDEEP team members are expected to maintain their code in BDEEP’s GitHub repositories.

Other Services/Requirements

We use a combination of cloud-based and cluster computing infrastructure to manage ongoing projects and push the computational frontier of empirical research. This includes cloud platforms (OpenStack, a cloud orchestration platform, and AWS) and industry-standard computing clusters (e.g. NCSA iForge/aForge, UIUC Campus Cluster, ACES Cluster).

Here is an overview of all the platforms used in BDEEP infrastructure: Platforms