BDEEP
Big Data in Environmental Economics and Policy
Research Group

Our team is focused on problems in public economics and public policy, with particular focus on cities and the environment. Current projects include: climate change policy, urban growth and expansion, transportation policy, urban drinking water, housing discrimination, and environmental amenities in cities. We utilize a combination of applied microeconomics and data science methods and have developed a software stack that enables the use of high performance computing in observational research and experimental trials. We regularly partner with data and technology companies to bring new technological platforms and data sources into economic research. Please take a look at our current projects and our GitHub for examples of our research.

You can find us on the 3rd floor of the National Center for Supercomputing Applications, at 1205 W Clark St, Urbana, IL 61801. We invite interested candidates to drop in on our weekly meetings to find out more about what we are working on.

Infrastructure Stack

We employ a set of customized tools to enable the acquisition and analysis of large datasets in observational research and experimental trials. Our infrastructure is designed to support a continuously integrated pipeline with the following primary stages:
  • Acquire large datasets.
  • Store large datasets in a fully secure system.
  • Pre-process and analyze data in virtual computing environments.
  • Host replicable and continuously updating datasets and applications.
  • Compile and continuously update documents for publication.

Here is the BDEEP Infrastructure Pipeline.

Acquisition

Traffic delays, housing transactions, social media posts, pollution concentrations, and satellite observations are examples of data we regularly query. We build tools to facilitate the acquisition of these datasets. Different data sources are queried at different rates and using different protocols. For example, data are downloaded from monitoring websites at discrete intervals and, in some cases, in real time. Scripts are built to acquire data by downloading, scraping, crawling, or directly engaging with users of applications.

Scripts run in Docker containers and are scheduled to execute in accordance with each project's research design. Docker containers let developers and system administrators isolate applications so that they run in a consistent environment regardless of the host machine. BDEEP uses Docker for the vast majority of its applications.
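As an illustration of querying a source at a fixed rate, here is a minimal Python sketch of a polling loop that timestamps each reading and appends it to a JSON Lines file. It is not BDEEP's actual acquisition code: the function names `poll`, `append_jsonl`, and `fake_air_quality_reading` are hypothetical, and the fetch function is a stand-in for a real request to a monitoring website.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator


def poll(fetch: Callable[[], dict], interval_seconds: float, iterations: int,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Call `fetch` a fixed number of times, pausing between calls."""
    for i in range(iterations):
        yield {
            # Record when the observation was retrieved, in UTC.
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            "payload": fetch(),
        }
        # No pause is needed after the final reading.
        if i < iterations - 1:
            sleep(interval_seconds)


def append_jsonl(path: str, records: Iterable[dict]) -> int:
    """Append records to a JSON Lines file and return how many were written."""
    count = 0
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def fake_air_quality_reading() -> dict:
    # Hypothetical stand-in for an HTTP request to a monitoring site.
    return {"station": "demo", "pm25": 12.5}


if __name__ == "__main__":
    # Two readings with no pause, printed instead of stored.
    for record in poll(fake_air_quality_reading, 0.0, 2):
        print(record["payload"])
```

In a containerized setup, a script like this would typically be the container's entry point, with the interval and output path supplied as configuration. Passing `sleep` as a parameter keeps the loop easy to test without waiting.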

Store

In some cases, we store individual files (JSON, CSV, RDS, SHP, TIFF, etc.), but more often computational efficiency or other empirical protocols require database storage (e.g. MongoDB, SQL). Our infrastructure allows us to store large datasets and allows group members to access and query them. BDEEP uses Samba to host a shared network as well as an Active Directory server. The shared network allows members of our team to collaborate through a common network-mounted directory.

The Active Directory server also runs on Samba and acts as our credential server, so we can add, delete, or modify user credentials in one place. Currently, it is only used to control access to the shared network, but it could support a range of other applications.

We recently added a Postgres database to our infrastructure. Because R holds the objects it operates on in memory, operations on many of our larger datasets fail with “out-of-RAM” errors. Our Postgres database handles the partitioning, subsetting, and merging operations for those datasets. This lets us work more efficiently, because we can query only the rows or columns needed for an analysis instead of loading entire datasets into R. We plan to develop a data warehouse for the datasets stored in our Postgres database.
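The idea of pulling only the needed rows and columns out of a database, rather than loading a whole dataset into memory, can be sketched as follows. This is an illustrative Python example, not BDEEP's code: it uses the standard-library `sqlite3` module with an in-memory database purely so the sketch runs without a server, and the `transactions` table, its columns, and the helper names are hypothetical. The same SQL pattern applies to Postgres, although a Postgres driver such as psycopg2 uses `%s` placeholders rather than `?`.

```python
import sqlite3
from typing import Iterator, List, Tuple


def build_demo_table(conn: sqlite3.Connection) -> None:
    # Hypothetical table of housing transactions, for illustration only.
    conn.execute(
        "CREATE TABLE transactions "
        "(id INTEGER PRIMARY KEY, city TEXT, year INTEGER, price REAL, notes TEXT)"
    )
    rows = [
        (1, "Urbana", 2015, 150000.0, "a"),
        (2, "Urbana", 2016, 162000.0, "b"),
        (3, "Chicago", 2016, 310000.0, "c"),
        (4, "Chicago", 2017, 325000.0, "d"),
    ]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows)


def query_in_chunks(conn: sqlite3.Connection, sql: str, params: Tuple,
                    chunk_size: int) -> Iterator[List[Tuple]]:
    """Yield result rows in lists of at most `chunk_size` rows."""
    cursor = conn.execute(sql, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield chunk


conn = sqlite3.connect(":memory:")
build_demo_table(conn)

# Select two columns and one city's rows; the database does the subsetting.
sql = "SELECT year, price FROM transactions WHERE city = ? ORDER BY year"
chunks = list(query_in_chunks(conn, sql, ("Chicago",), chunk_size=1))
print(chunks)
```

Fetching in chunks bounds how much of the result is held in memory at once, which is the property that matters when the full table would not fit in RAM.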

Finally, we use AWS for backing up our files on a weekly basis.

Analyze

All BDEEP team members can access BDEEP project files on our shared network. Members perform data analysis on an RStudio Server instance, which they access through a web browser. R is an increasingly widespread programming language in economics and data science. For information on how we set up RStudio Server on our cloud, see: Installing RStudioServer on Ubuntu 15.04.

Communicate

The results of our research are compiled directly into for-publication documents and presentations, and continuously updated, using a system based on LaTeX. Key graphs, tables, maps, and other figures are simultaneously hosted on our website in standard and more interactive formats (e.g. d3.js, Shiny). This increases public engagement with our research and allows us to be transparent about our findings.
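One common way to keep a LaTeX document in step with changing results is to regenerate its tables from the analysis output on every run. The Python sketch below shows that idea under stated assumptions: the helper names `escape_latex` and `to_tabular`, the column names, and the sample values are all hypothetical, and this is not a description of BDEEP's own system.

```python
from typing import Dict, List, Sequence

# Characters that need a backslash escape in ordinary LaTeX text.
_LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text: str) -> str:
    """Escape a small set of LaTeX special characters."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def to_tabular(rows: List[Dict[str, object]], columns: Sequence[str],
               digits: int = 3) -> str:
    """Render rows as a LaTeX tabular environment with a header row."""
    lines = [r"\begin{tabular}{" + "l" * len(columns) + "}", r"\hline"]
    lines.append(" & ".join(escape_latex(c) for c in columns) + r" \\")
    lines.append(r"\hline")
    for row in rows:
        cells = []
        for column in columns:
            value = row[column]
            if isinstance(value, float):
                # Fixed number of decimals for numeric results.
                cells.append(f"{value:.{digits}f}")
            else:
                cells.append(escape_latex(str(value)))
        lines.append(" & ".join(cells) + r" \\")
    lines.append(r"\hline")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Made-up values, used only to show the output format.
demo_rows = [
    {"variable": "pm25_level", "estimate": 0.12345},
    {"variable": "traffic_delay", "estimate": -1.5},
]
demo_table = to_tabular(demo_rows, ["variable", "estimate"])
print(demo_table)
```

Writing the returned string to a `.tex` file that the main document pulls in with `\input` means a recompile picks up the latest numbers without manual copying.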

In keeping with standards for reproducible research, all BDEEP team members are expected to maintain their code in BDEEP’s GitHub repositories, so that fellow researchers and interested members of the public have access to our code and research methodology.

Other Services/Requirements

We use a combination of cloud-based and cluster computing infrastructure to manage ongoing projects and push the computational frontier of empirical research. On the cloud side, we use OpenStack (a cloud orchestration platform) and AWS. On the cluster side, we use the iForge/aForge supercomputer cluster at the National Center for Supercomputing Applications.

Here is an overview of all the platforms used in BDEEP infrastructure: Platforms