Big Data in Environmental Economics and Policy
You can find us on the 3rd floor of the National Center for Supercomputing Applications, at 1205 W Clark St, Urbana, IL 61801. We invite interested candidates to drop in on our weekly meetings to find out more about what we are working on.
- Acquire large datasets.
- Store large datasets in a secure system.
- Pre-process and analyze data in virtual computing environments.
- Host replicable and continuously updating datasets and applications.
- Compile and continuously update documents for publication.
The stages of the BDEEP infrastructure pipeline are described below.
Traffic delays, housing transactions, social media posts, pollution concentrations, and satellite observations are examples of the data we regularly query. We build tools to facilitate the acquisition of these datasets. Different sources are queried at different rates and over different protocols: data are downloaded from monitoring websites at discrete intervals, and sometimes in real time. We build scripts to acquire data by downloading, scraping, crawling, or engaging directly with users of our applications.
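As a sketch of what such an acquisition script might look like, the following Python fetches a JSON payload of monitoring records and flattens it into a CSV file. The URL, field names, and the injectable `opener` hook are illustrative assumptions, not part of BDEEP's actual tooling.

```python
import csv
import json
from urllib.request import urlopen

def fetch_records(url, opener=urlopen):
    """Download and parse a JSON list of records.

    `opener` defaults to urllib's urlopen; it is injectable so the
    function can be exercised without network access.
    """
    with opener(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def save_as_csv(records, path):
    """Write a list of dicts to CSV with a sorted header row."""
    if not records:
        return
    fieldnames = sorted(records[0])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```

A scheduler (e.g., cron) would then invoke such a script at whatever interval the project's research design requires.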
Scripts run in Docker containers and are scheduled to execute in accordance with each project's research design. Docker lets developers and system administrators isolate applications so that they run in a consistent environment regardless of the host machine. BDEEP uses Docker for the vast majority of its applications.
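A minimal Dockerfile for packaging one of these acquisition scripts might look like the following; the base image, script name, and dependency are illustrative assumptions, not BDEEP's actual configuration.

```dockerfile
# Illustrative sketch only: image tag, paths, and script name are assumptions.
FROM python:3.11-slim
WORKDIR /app
COPY acquire.py .
RUN pip install --no-cache-dir requests
CMD ["python", "acquire.py"]
```

The container can then be launched on any host on a schedule (for example, from a cron entry that runs `docker run`), which is what makes the execution environment consistent across machines.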
In some cases we store individual files (JSON, CSV, RDS, SHP, TIFF, etc.), but more often computational efficiency or other empirical protocols require database storage (e.g., MongoDB, SQL). Our infrastructure lets us store large datasets and lets group members access and/or query them. BDEEP uses Samba to host a shared network as well as an Active Directory server. The shared network allows members of our team to collaborate through a common network-mounted directory.
The Active Directory server, also built on Samba, lets us maintain a credential server where user credentials can be added, deleted, or modified in one place. Currently it is only used to access the shared network, but it has a range of other potential applications.
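For illustration, a Samba share definition backing a network-mounted directory of this kind could look like the fragment below; the share name, path, and group are hypothetical, not BDEEP's actual configuration.

```
[bdeep-share]
   path = /srv/bdeep
   valid users = @bdeep
   read only = no
   browseable = yes
```

Restricting `valid users` to a group is what ties the share back to the central credential server: access is granted or revoked by editing group membership in one place.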
We recently added a Postgres database to our infrastructure. Since R holds data in memory, operations on larger data objects often fail with "out-of-RAM" errors. Our Postgres database handles the partitioning, subsetting, and merging of those larger datasets, which lets us work more efficiently: we query only the rows or columns needed for analysis instead of loading entire datasets into R. We plan to develop a warehouse for the datasets stored in our Postgres database.
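To make the idea concrete, the sketch below pushes the subsetting into SQL so that only the needed rows and columns ever enter memory. It uses Python's built-in sqlite3 as a self-contained stand-in for Postgres (the table, columns, and values are invented for illustration); against Postgres the same SQL would run through a client library such as psycopg2, or from R via DBI.

```python
import sqlite3

# sqlite3 stands in for Postgres so the example is self-contained;
# the SQL itself is what matters.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (site TEXT, day TEXT, pm25 REAL, ozone REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [("A", "2019-01-01", 12.1, 0.031),
     ("B", "2019-01-01", 44.0, 0.052),
     ("A", "2019-01-02", 9.8, 0.029)],
)

# Pull only the rows and columns the analysis needs,
# rather than loading the full table into memory.
rows = conn.execute(
    "SELECT day, pm25 FROM readings WHERE site = ? ORDER BY day", ("A",)
).fetchall()
```

Here `rows` contains just site A's daily PM2.5 readings; the filtering and projection happen inside the database, not in the analysis environment.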
Finally, we back up our files to AWS on a weekly basis.
The results of our research are compiled directly into, and continuously updated in, for-publication documents and presentations using a LaTeX-based system. Key graphs, tables, maps, and other figures are simultaneously hosted on our website in standard and more interactive formats (e.g., d3.js, Shiny). This increases public engagement with our research and allows us to be transparent about our findings.
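A figure included in such a continuously updated LaTeX document might look like the snippet below; the file path and caption are hypothetical. Because the pipeline regenerates the figure file itself, recompiling the document always pulls in the latest results without any edits to the source.

```latex
% Illustrative only: the path and caption are assumptions.
\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/pm25_trend.pdf}
  \caption{Daily PM2.5 concentrations, regenerated automatically
           by the data pipeline.}
\end{figure}
```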
In keeping with standards for reproducible research, all BDEEP team members are expected to maintain their code in BDEEP's GitHub repositories, so that fellow researchers and interested members of the public have access to our code and research methodology.
Here is an overview of all the platforms used in the BDEEP infrastructure.