I received email inviting me to participate in Crowdsignals. Specifically, they invite me to join “in this very exciting community initiative”, where “the community” apparently means “the research community”. I’m guessing they extracted my name from old publications at Ubicomp or Pervasive or whatnot, since I have no recent research in this area. I almost deleted the email as junk. But I ended up looking at it.


Where do I start on this one?

Crowdsignals is a very-definitely-for-profit organization that aims to amass a large dataset of mobile users, to be used in “research”. They correctly point out that “[t]he most valuable data are collected longitudinally from many subjects in naturalistic settings” ([1], p. 874).

This is certainly a difficult research problem. Even though most people have mobile devices with astonishing sensor capabilities, getting data from them is difficult and expensive, especially if you want to be ethical about it.

Their solution is to amass a large sample of volunteer subjects who provide data traces from their ordinary life to Crowdsignals, which then curates and brokers the data to researchers. If this works as intended, this will form a pool of data much larger and with more uniform quality than any single research group could achieve, at a reasonable cost (they believe). Note that the volunteers will be paid for the data, will give informed consent, and the data will be anonymized to protect privacy.

This isn’t a terrible idea, though their argument for doing a company rather than a non-profit is weak, and the crowdfunding approach is, well, not really a sustainable way to do research. There is no way around the fact that the US National Science Foundation or someone like that should fund this activity as a public resource. But the US is now ruled by Know Nothings, so there is no hope of that.

Looking at the Crowdsignals paper and website raises many troubling points.

They say the overall project “will generate Terabytes of open mobile and sensor data over the duration of the data collection.” Actually, the data is owned by the company and available only to people who pay for it. This is not what most people mean by “open data”.

They are building their consortium via crowdfunding. That’s right, you can buy into this dataset simply by contributing to their Kickstarter. This is an open process (if anything, too open), but not exactly open to all researchers, just rich ones. Also, the Kickstarter process is rather iffy. They do not know many specifics, such as cost, schedule, or availability. Sigh.

At this point, they propose to sell you the data at about $1-2 per subject (I’d guess that something like 50% of that would go to the subjects providing the data). If you need a bespoke sample, say, from a specific location or population profile, you’ll have to pay more.

I admit that this is cheaper than collecting the data yourself, but it’s still kind of dear for those of us who don’t have backing from a big corporation or DARPA. This is certainly beyond the means of a student or a high school teacher, unless they get a grant. Again, this is not what most people would think of as “open data”.

For an academic researcher, this data source poses some dilemmas besides the cost. While Crowdsignals “allows” you to publish your findings, with attribution, you cannot publish the data itself, or even share it with colleagues. (Again, not what most people would mean by “open data”.) It sounds to me as if any studies based on this data will be impossible to replicate or even independently validate.

Not so good.

This “no publishing” policy will surely run afoul of some institutions’ policies. For example, I have to wonder if a dissertation committee will allow a student to use this closed, unpublishable data in his or her study. Some committees will, some won’t. And I imagine that it would be difficult to use the data for a class project, even assuming funding was available. The access restrictions would be painful.

I also wonder how well this data will meet the needs of public agencies who really do want “open data”.  I could imagine a large city wishing to use this kind of data for a lot of purposes.  But can they use privately held data that cannot be shared with the public?  Maybe, maybe not.

At this point, we don’t know what kind of samples Crowdsignals can actually assemble. The company intends to recruit and “on-board” the volunteers using various means, including digital advertising, for example. (Is anybody besides me turned off by the use of this creepy HR term in this context?) I’m not sure how, or how well it will work, or how representative the resulting samples might be. Researchers will need to be careful, for sure.

They currently plan to support only Android devices. (Perhaps Apple has their own fiendish plan for something like this.) Right there, you have a huge, huge hole in the sampling: more than half the people in the US are not in this population.

They also expect “terabytes of data”, coming in over months or years. They say they will be using the Open Data Kit. Even so, having been there and done that, I have a healthy respect for the challenges of data curation. $1-2 per subject seems to me to be a low estimate for the costs of managing this data, however automated the process. (Of course, they intend to “sell” each subject more than once.)

As far as I can tell, the data products are not real-time data streams; they are anonymized blocks of longitudinal data. This limits the usefulness of this data quite substantially. There are plenty of opportunities to be sure, but no possibility of real-time experiments, for instance. (At least not with the standard product.)

I think I view this as an interesting experiment in mobile computing. We’ll see how well this mashed-up system really works. I’m not optimistic, if only because I think their cost estimates are off (possibly by an order of magnitude), which would mean that they will have a small pool of well-heeled clients, which will not really help the “research community” very much.

They do seem to understand the ethics of this sort of data collection, and have an approach that respects the subjects. Whether this actually flies or not, whether a lot of people will sign up for this particular package, I don’t know. I certainly wouldn’t, but I’m pretty much a refuse-nik all around. I hope Crowdsignals publishes their experience in that area.

One thing is for sure, though: they should not call this “open data”. I would be a lot happier if they took that claim off their materials, or else made provisions for really opening their data.


