BDEEP
Big Data in Environmental Economics and Policy
Research Group
Infrastructure Stack
- Acquire large datasets.
- Store large datasets in a fully secure system.
- Pre-process and analyze data in virtual computing environments.
- Host replicable and continuously updating datasets and applications.
- Compile and continuously update documents for publication.
Here is a simple diagram of the BDEEP Infrastructure Pipeline.
Acquisition
Traffic delays, housing transactions, social media posts, pollution concentrations, and satellite observations are examples of data we regularly query. We build tools to facilitate the acquisition of these datasets. Different data sources are queried at different rates and with different protocols. For example, data are downloaded from monitoring websites at discrete intervals and sometimes in real time. Scripts acquire data by downloading, scraping, crawling, or directly engaging with users of applications.
Scripts run in Docker containers and are scheduled to execute in accordance with the research design of a project. Docker containers isolate applications so that they run in a consistent environment regardless of which machine hosts them.
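As an illustration of the kind of scheduled acquisition script described above, here is a minimal Python sketch that polls a source at a fixed interval and appends each timestamped result to a JSON Lines file. It is not BDEEP's actual code: the function names `acquire` and `example_fetch`, the record layout, and the output format are all assumptions made for this example, and a real script would replace `example_fetch` with a download, scrape, or API call.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def acquire(fetch: Callable[[], dict], out_path: Path, interval_s: float,
            max_polls: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `fetch` repeatedly and append each result to a JSON Lines file.

    Returns the number of records written. A failed poll is reported and
    skipped so that one bad response does not stop the whole run.
    """
    written = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            record = {
                # Store the retrieval time in UTC alongside the raw payload.
                "retrieved_at": datetime.now(timezone.utc).isoformat(),
                "payload": fetch(),
            }
            with out_path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            written += 1
        except Exception as exc:
            print(f"poll {polls} failed: {exc}")
        # Wait between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return written


def example_fetch() -> dict:
    # Hypothetical stand-in for a real data source.
    return {"station": "demo", "pm25": 12.3}
```

Calling `acquire(example_fetch, Path("records.jsonl"), interval_s=60, max_polls=10)` would write ten records one minute apart. Inside a container, the polling rate could be passed in through an environment variable so that the same image serves sources queried at different rates.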
Store
In some cases, we store individual files (JSON, CSV, RDS, SHP, TIFF, etc.), but more often computational efficiency or other empirical protocols require database storage (e.g. MongoDB, SQL). Our infrastructure allows us to store large datasets and allows group members to access and query them. BDEEP uses Samba to host a shared network as well as an Active Directory server. The shared network allows members of our team to collaborate through a common network-mounted directory.
The Active Directory server also runs on Samba. It acts as a credential server, which allows us to add, delete, or modify user credentials in one place. Currently, it is used only to access the shared network, but it has a range of other applications.
We recently added a Postgres database to our infrastructure. Because R holds data objects in memory, operations on our larger datasets often fail with “out-of-RAM” errors. Our Postgres database can handle the partitioning, subsetting, and merging operations for larger datasets. This allows us to work more efficiently by targeting an analysis with database queries rather than loading entire datasets into R. We are developing a warehouse for the datasets stored in our Postgres database.
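The idea of targeting an analysis with a query, rather than loading a whole table into memory, can be sketched as follows. To keep the example runnable anywhere, it uses Python's built-in `sqlite3` module as a stand-in for Postgres; the table `readings`, its columns, and the sample rows are hypothetical. Against a real Postgres server the same SQL would be sent through a database driver, and the parameter placeholder syntax may differ from the `?` used here.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up table of readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, year INTEGER, pm25 REAL)")
    rows = [
        ("A", 2019, 10.0),
        ("A", 2020, 14.0),
        ("A", 2020, 16.0),
        ("B", 2020, 8.0),
        ("B", 2020, 12.0),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_pm25_by_station(conn: sqlite3.Connection, year: int) -> list:
    """Subset and aggregate inside the database; only the summary rows
    are returned to the client, never the full table."""
    cur = conn.execute(
        "SELECT station, AVG(pm25) FROM readings "
        "WHERE year = ? GROUP BY station ORDER BY station",
        (year,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(mean_pm25_by_station(conn, 2020))  # [('A', 15.0), ('B', 10.0)]
```

The client receives one row per station regardless of how many readings the table holds, which is what keeps memory use small when the underlying dataset is large.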
Finally, we use AWS for system backups.
Analysis
Communication
The results of our research are compiled and updated directly in for-publication documents and presentations using a system based on LaTeX. Key graphs, tables, maps, and other figures can also be hosted directly on our website in more interactive formats (e.g. d3.js, Shiny).
In keeping with standards for reproducible research, all BDEEP team members are expected to maintain their code in BDEEP’s GitHub repositories.
Other Services/Requirements
Here is an overview of all the platforms used in BDEEP infrastructure: Platforms