At this weekend’s #OccupyDataNYC hack I worked on the Data Anywhere project presented by Gloria W of NYC Python Meetup. This project seems to have a lot of promise for solving the data management issues plaguing #OccupySandy and other relief organizations.
The Current Problem
Much of the data that is being collected is from canvassing.
Relief organizations canvass residents to better understand what a community needs. This data, normally collected on paper, then needs to be digitized. Once digitized, it then needs to be stored in a secure way. It also must remain available for review by the organization that “owns” the data, as well as any organizations working in coordination.
- Organizations collecting data need a secure place to keep it.
- Non-private data needs to be available for research and advocacy.
- Private data needs to be available, in a secure way, to people who can act on it, even if they are not part of the organization that originally collected the data.
The existing solutions tend to be based on software that holds data while also doing a whole host of other great things: case management, mass emailing, and so on. These solutions, however, are normally limited to members of the organization; advocacy groups will have a hard time gaining access to the system to run reports that further common causes. If two groups want to share data, they have to agree on one piece of software, which can be very difficult when each group has invested time and money in its particular solution.
Once data is put into a locked system it is safe, but it can be almost impossible for even the most effective spontaneous grassroots organizations (like Occupy Sandy) to gain access to it and make it actionable.
Real-Life Use Case
The Staten Island Community and Interfaith Long Term Recovery Group (LTRG), which came together after Sandy, turned out volunteers to canvass over 1,000 homes this January. The paper forms used in the canvass then had to be digitized; with the help of Occupy Sandy volunteers, we were able to enter all of this data a few weeks later.
Taking the inconsistent form data, trying to decipher handwriting, and accurately entering that into a computer is no easy task.
This data was digitized via a Google Form that mirrored the questions on the canvass sheet. The data now lives in a Google Drive folder, where access is all or nothing.
In February, a heavy snowstorm loomed, and an organizer with the LTRG asked me to create a report that could be shared with other relief organizations to check up on people who had indicated that they were living without heat (a question on the canvass form). I hacked together a solution by creating a pivot chart in the Google Spreadsheet and exporting a PDF with the names, addresses, and phone numbers of people living without heat. Volunteers were dispatched to deliver warming information and heaters to residents on the list.
It took weeks to even begin to make this data actionable, and if new data had been added or entries had needed updating, there would have been no clear way to do it. If you don't happen to know the people who have access to the data, you can't even hope to use it for good. Non-personal data gathered from the canvass is equally inaccessible, which keeps data nerds across the world from contributing to advocacy.
- Data is hard, if not impossible, to access even for those who “own” the data.
- Data is often stored on third party servers and access can be revoked without notice (do YOU know what Google’s Terms of Service are?).
- Data is often not secure and access control is all or nothing.
- Producing reports requires a good bit of work.
What Would a Good Solution Look Like?
I want to be able to ask a database “who in this region doesn’t have heat?” and be presented with a list of people who indicated to a canvasser that they didn’t have heat. Furthermore, I might want to cross-reference all homeowners who have flood damage and don’t have insurance. I want to share queries with collaborators as well as make non-private data available to the public. I want to pull data into my current workflow without having to learn new software!
- Data would have a persistent home.
- Data would be machine readable.
- Data stewards could manage access to the data, making some public and keeping some private.
- Reports could be generated on the fly.
- Third party applications could query the data.
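To make that wish-list concrete, here's a rough sketch of what those queries could look like from Python. This is purely illustrative: the server URL, endpoint, and field names are my own inventions, not an existing Data Anywhere API.

```python
import requests

# Hypothetical Data Anywhere server -- this URL and these field names
# are invented for illustration only.
BASE = "https://data.example.org/api"

# "Who in this region doesn't have heat?"
no_heat = requests.get(BASE + "/residents",
                       params={"region": "staten-island",
                               "have-heat": "false"}).json()

# Cross-reference: homeowners with flood damage and no insurance
at_risk = requests.get(BASE + "/residents",
                       params={"is-homeowner": "true",
                               "have-flood-damage": "true",
                               "have-insurance": "false"}).json()
```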
Currently, data sits on random computers, in the cloud, or in some person's possession (if it's in a digital format at all). Having a persistent home where the data and subsequent updates can be housed makes finding the most up-to-date data easier.
Behold the above diagram. At the top is freshly digitized data. It now has to be shared, manually, with collaborators who can then further share the data. This isn’t ideal because either the data ends up in the wrong hands (the magenta jerk) or the data becomes unreliable (bottom right).
Think about working on a shared Excel spreadsheet that gets e-mailed around: it quickly becomes hard to know which version is the most current or accurate.
Data that fits a spec and is standardized so that a machine can read it means powerful scripts can be run against it to perform all kinds of magic, from plotting points on a map to cross-referencing common data elements across data sets.
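For instance, once two canvasses share a machine-readable format, a few lines of Python can cross-reference them. The file names and the address/have-heat fields below are assumptions standing in for whatever standard emerges.

```python
import json

# Two hypothetical standardized canvass exports; file and field names
# are assumptions, not part of any published spec.
with open("canvass_rockaways.json") as f:
    rockaways = json.load(f)
with open("canvass_staten_island.json") as f:
    staten_island = json.load(f)

# Cross-reference a common data element (address) across both data sets
no_heat_addresses = {r["address"] for r in rockaways + staten_island
                     if r.get("have-heat") is False}
print(f"{len(no_heat_addresses)} addresses reported no heat across both canvasses")
```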
It’s important to keep people’s private information secure while also allowing non-personal data to be available to the community. The data stewards need granular access management: some people need to see everything while others need only see some things. A good solution would allow for multiple levels of access to private data while allowing most data to be open to the public, if the stewards so desire.
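As a toy illustration of what granular access might mean in practice (the levels and field lists here are my own assumptions, not part of Data Anywhere):

```python
# Field-level access control sketch; levels and field names are invented.
PUBLIC_FIELDS = {"city", "have-heat", "need-medical"}
TRUSTED_FIELDS = PUBLIC_FIELDS | {"address", "phone", "contact-insurer"}

def view_of(record, level):
    """Return only the fields a given access level is allowed to see."""
    allowed = TRUSTED_FIELDS if level == "trusted" else PUBLIC_FIELDS
    return {key: value for key, value in record.items() if key in allowed}

record = {"city": "Staten Island", "phone": "555-0100", "have-heat": False}
print(view_of(record, "public"))   # phone number is withheld
print(view_of(record, "trusted"))  # full record
```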
The data isn’t of much use if you can’t query it and produce reports. Furthermore, reports shouldn’t have to be re-created every time the data is updated or changed slightly.
Within the Sandy response, many web and mobile applications sprang up to manage information and data. Occupy Sandy set up an instance of the open-source disaster relief software Sahana, while another team set up Disaster Dispatch to manage relief efforts in Staten Island. Both of these web applications should be able to access our canvassing data. While Sahana is an exceptional tool, we can't force other groups to use our software by locking our data up in the system. So our solution must be agnostic to any software platform.
The bottom line is that people don’t want to learn another program or create another user account on some web app, especially in disaster response or recovery.
While thinking about this, I figured the answer was to build an Application Programming Interface (API) around the data, so that authorized users could query the data as well as post new data or updates. An API is simply an agreed-upon way to interface with a system: the screen you are staring at, the buttons on an ATM, and your computer mouse are all interfaces between you and a machine, and an API serves the same role between programs.
There is a particular kind of API that I was envisioning, called a RESTful API. I’ll save the technical details for another blog post, but the basic idea is this: the data steward keeps data on a web server and provides a way for authenticated people to query the raw data from the server.
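In Python, interacting with such an API might look something like the sketch below. The URL, credentials, and endpoints are invented; the point is just that an authenticated user can both query and update data over plain HTTP.

```python
import requests
from requests.auth import HTTPBasicAuth

# Invented server and credentials, for illustration only
auth = HTTPBasicAuth("steward", "not-a-real-password")
base = "https://data.example.org/api"

# Query raw data from the steward's server
rows = requests.get(base + "/residents",
                    params={"have-heat": "false"}, auth=auth).json()

# Post an update when a resident's situation changes
requests.post(base + "/residents/42/updates",
              json={"have-heat": True}, auth=auth)
```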
It turns out that this is exactly what Data Anywhere aims to do.
Data Anywhere is not an application, but a strategy for setting up an off-the-shelf VPS (Virtual Private Server, i.e., a cheap web server) to host data and a custom API for querying that data. Here's how it works:
- Data is collected and digitized.
- The digitized data is then mapped to a data structure and uploaded to a database on a configured server (see the sketch after this list).
- An API is developed to query the data.
- Users with proper authentication can then access raw data simply by visiting a website.
- Third party developers can build apps that run on this data.
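Concretely, the mapping-and-upload step might look like this sketch, assuming the Python and MongoDB stack described under the hood later in this post. The CSV column names, the standardized field names, and the canvass collection are all assumptions for illustration.

```python
import csv
from pymongo import MongoClient

# Assumed local MongoDB instance and database name
db = MongoClient("mongodb://localhost:27017")["dataanywhere"]

with open("canvass.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Map the raw form questions (assumed column names) onto a
        # standardized record
        record = {
            "address": row["Address"].strip(),
            "have-heat": row["Does your house have heat?"].strip().lower() == "yes",
        }
        db.canvass.insert_one(record)
```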
I’ll get into the more technical stuff later in this post. For now let’s think of it this way:
Your organization uses a proprietary system for managing canvassing data. You really want to share this data, but can’t have CSV files floating around and don’t have the time to create custom reports every time someone needs one.
If you had a Data Anywhere server you could drop your data set in there, then allow people to access specific areas of your data. If you wanted to share data with someone you would simply have to grant them access to the Data Anywhere server and they could begin asking the server, not you, for reports on your data.
Here’s a rough flow chart of the process:
- Data is collected with a form based on community needs and collective data standards.
- Data is entered into a computer (digitized). If information is collected digitally, this step is simplified.
- Data is mapped to a standardized format for storage. Structuring input data to match standards will simplify this process.
- Data is stored on an off-the-shelf server.
- A data API is developed to manipulate stored data. Data stewards control how data is made available. By following collective standards, uniform data can be distributed to a network of data servers.
- App authors can access shared data from the server network to run services.
- Apps can interface with other systems, both open and closed.
Allow me to get a little more technical. My non-techie readers can skip this section.
Under the Hood
We currently have an example set up using my fork of the Data Anywhere code on GitHub. The server is an off-the-shelf VPS running Fedora Linux. The stack consists of:
- Python with Flask
- MongoDB

You can see the bash history of the full server setup here.
We use Python to parse raw data from a .CSV file. The data is then stored in MongoDB. A RESTful API written with Python and Flask calls data from the database. Data can now be queried from the server! Private methods can be created and put behind HTTP authentication.
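To give a feel for it, here is a minimal sketch of the kind of Flask endpoint described above. The route, the canvass collection, and the have-heat field are assumptions based on this post's examples, not the actual Data Anywhere code.

```python
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["dataanywhere"]

@app.route("/residents")
def residents():
    # Translate query-string parameters into a MongoDB filter
    query = {}
    if "have-heat" in request.args:
        query["have-heat"] = request.args["have-heat"] == "true"
    # Drop Mongo's internal _id so the response is plain JSON
    results = [{k: v for k, v in doc.items() if k != "_id"}
               for doc in db.canvass.find(query)]
    return jsonify({"residents": results})

if __name__ == "__main__":
    app.run()
```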
Why is this a good solution?
Data Anywhere allows organizations to keep data in a locally-controlled database, which can then be shared without any work on the part of the organization (aside from the initial set up).
It also allows for very granular access. You can set up many levels of access and control just how much data any of those levels has access to.
The solution is agnostic to existing systems. Because the API produces machine-readable output, existing software can interface with the data directly rather than through yet another software-based solution, so the relief community doesn't have to settle on a single piece of software. Groups that aren't heavy users of software could still access data in formats suited for download and printing.
Anyone in the world can develop applications that use the data. I’ve provided a link to one I made below.
What are the shortfalls?
There is a lot of work that has to be done up front. Data must be mapped and merged, and then APIs must be written from scratch. These shortcomings can be overcome as more groups use the system: libraries, naming standards, and unified APIs can be developed.
As the system becomes more automated with more developers adding to the project, the challenge shifts to one of creating standards—which, in my humble opinion, is a much better challenge to have: it is non-technical and so more people can work on it.
Let's see it in action!
Just to recap, data (in this case canvassing data of resident needs) is uploaded into a server. The server can then be queried. Here’s an example:
This is a URL that prints data
That URL asks the server for all residents who indicated that they do not have heat and need medical attention. The server responds with data in a format called JSON, which looks like this:
"note-insurance": "got $11,000 from insurance. and was told by FEMA that it was adequate?? electricity repair cost was $5,500. needs help with dealing with mortgage company.",
"note-other": "can't move back in without floors. has plywood to replace subfloor. send grants list.",
"note-info": "currently on medical leave from Board of Education, needs food deliveries. rapid repairs screwed up heating so it's getting fixed on 2/18/13. rapid repairs flooded attic. evacuated when water reached knees. has spots of mold on subfloors.",
"note-fema-sba": "might be eligible for unemployment because of stroke. denied FEMA because didn't have inspection. denied by SBA."
"have-payment-fema-rental": "received original $2900 but put into repairs",
"note-need-medical": "both asthmatic, difficult breathing because of mold. husband goes to house to feed animals. she needs medical attention bus is taking care of health.",
"contact-insurer": "State Farm",
"city": "Staten Island",
Yes, I know, that looks very, very scary. I wanted to show the raw data first to highlight what it can become. Last night, over the course of about an hour, I was able to produce this test page:
That web page is accessing publicly available data from the Data Anywhere server. Now imagine that there are 10 different data servers owned by 10 different groups doing canvassing in NYC. My app could very well query all of them!
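Querying a network of servers is just a loop. In this hedged sketch, the server URLs and endpoint are invented; each group would run its own instance.

```python
import requests

# Hypothetical Data Anywhere servers run by different groups
SERVERS = [
    "https://rockaways.example.org/api",
    "https://statenisland.example.org/api",
]

needs = []
for base in SERVERS:
    resp = requests.get(base + "/residents",
                        params={"have-heat": "false"}, timeout=10)
    resp.raise_for_status()
    needs.extend(resp.json())

print(f"{len(needs)} households without heat across {len(SERVERS)} servers")
```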
I want to drive this point home. Let's say Restore the Rock collects a thousand canvass forms from the Rockaways. They used a standardized form (with some customization for the unique qualities of the neighborhood) and have digitized the responses. We now set up a Data Anywhere instance on a server that belongs to Restore the Rock. The API strategy is developed, and now they are able to serve the data however they want. To ease sharing, the non-private portions of the data are made public. Anyone in the world can now access this hyper-local data. In the above example I, a developer, took that data and made it into a web app. My web app can now be used by volunteers back in the Rockaways to see what supplies are most needed in the neighborhood.
Co-create this solution
I would like to preface this ask for help with the simple fact that I am no expert in any of this. I've never designed an API, deployed servers, or written scripts to parse data. So any advice you can give me is very much appreciated. I've identified three main areas that need work.
- Structuring raw data. We need to take raw data in almost any configuration and map it: turn "Does your house have heat? NO" into have-heat = False. What's the best way to do that? (One naive sketch follows this list.)
- Writing clear, extendable, and flexible APIs. Each set of data is going to be different, but there must be existing standards for most of this stuff. How do we not already have a standard canvassing form?
- Automation for easy replication. One click install, though for Phase 1 we can settle for less.
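For the first item, one naive starting point might be a normalization function like the sketch below. The accepted phrases and the have-heat field are guesses; a real solution would need community-agreed standards (and a human-review path for answers it can't classify).

```python
# Naive answer normalization; phrase lists are assumptions, not a standard.
YES = {"yes", "y", "yeah", "si", "true"}
NO = {"no", "n", "none", "false"}

def normalize_bool(raw):
    """Map a messy free-text answer to True/False, or None for human review."""
    token = raw.strip().lower().strip("?!. ")
    if token in YES:
        return True
    if token in NO:
        return False
    return None

# Turn 'Does your house have heat? NO' into have-heat = False
assert normalize_bool("NO") is False
```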
If you are interested in this project, can lend me any advice, or know of events where I could pitch this to interested talented people, please e-mail me at email@example.com
Check out the code on GitHub: github.com/dhornbein/DataAnywhere
Or the website here.
Leave questions or comments below!