Crowdsourcing Data Analysis: The complexities of free data labor in a data-hungry market

Companies don’t know where to look to find the data analysts they need. A February 2017 article reported that 40% of major companies are struggling to find reliable data analysts to hire. According to TechTarget, “a lack of skills remains one of the biggest data science challenges,” and many tech magazines have reported something similar. This has led companies to sponsor campaigns encouraging people to learn to code, and universities to create comprehensive data analysis training programs.

But it has also led to the widespread use of crowdsourced data analysis. Crowdsourcing, while not a new tool in data science, has recently become extremely popular as a way for companies to meet their data analysis needs, from gritty data cleaning to full-blown model creation. Last month DataCrunch reported on Kaggle, a website that allows companies to host competitions around a dataset they need analyzed. Another example is DrivenData, which does activism work of its own but runs its projects through a similar competition format. Under the competition model, the participant or team whose model the company judges best receives a cash prize. These competitions, however, attract enough submissions that any one participant’s chance of winning is quite low.
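To make those odds concrete, here is a rough back-of-the-envelope sketch in Python of what a single prize split across many competing teams implies for expected earnings. The prize amount, team count, and hours of effort are hypothetical placeholders, not figures reported by Kaggle or DrivenData.

```python
# Back-of-the-envelope sketch of the "jackpot" economics of data competitions.
# All numbers below are hypothetical assumptions for illustration only.

def expected_hourly_rate(prize_pool: float, num_teams: int, hours_per_entry: float) -> float:
    """Expected payout per hour for one team, assuming every team is equally
    likely to win the single prize."""
    expected_payout = prize_pool / num_teams
    return expected_payout / hours_per_entry

if __name__ == "__main__":
    # Hypothetical competition: a $25,000 prize, 1,000 competing teams,
    # and 40 hours of work per entry.
    rate = expected_hourly_rate(prize_pool=25_000, num_teams=1_000, hours_per_entry=40)
    print(f"Expected earnings: ${rate:.2f} per hour")  # well under a dollar an hour
```

Under these (assumed) numbers, the expected return on a serious entry is a fraction of minimum wage, which is the backdrop for the questions that follow.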

The popularity of these competitions seems to clash with companies’ complaints that it is impossible to hire and retain the analytics experts they need. A technical consultant told Forbes that “People call me with their lists of requirements, so numerous and so varied that they simply don’t exist in any one person. They don’t find what they want, and they declare it impossible to get good analytics talent.” It is concerning that companies set such high standards for hiring while remaining content to use the labor of competition participants without compensation. Granted, not every submission will be strong, but presumably most participants have at least baseline skills and a genuine passion for data analysis, which means they could quickly improve on the job. Another roadblock could be that most participants are students, as is the case with Kaggle, and therefore not yet eligible to be hired. But they are still putting in the work with no guarantee of payment, which has led some to wonder whether these competitions are exploitative.

These dilemmas intersect in a conversation about how we define “work” itself. There is a gap between our understanding of work as a form of productivity and work as an activity worthy of compensation. That gap has become increasingly difficult to navigate with the growth of the internet and big data: plenty of productive tasks, in other words work, are now performed online without the pay those workers would presumably receive for doing comparable tasks in the physical world.

One could ask: if the competition setup is unfair, then where is the resistance? If these people are willing to put in the effort, is there any reason to complain? It is not exploitation in the direct sense, since contributors consent to the process. Still, it is a jackpot system, with one person cashing out big and everyone else walking away with nothing, which is not how any other competitive work environment operates. And while these competition models could be seen as a great learning opportunity and an outlet for passion in isolation, it is against the backdrop of a supposed shortage of data analysis talent that they start to seem suspicious. One obvious response would be to compensate every participant for the work they submit.

However, this payment-for-all suggestion is also complicated. If we are asking to be paid for all the work we do with data, what about the data we produce? Every moment we are online, whether using Facebook, watching YouTube videos, or reading the New York Times, we are creating data that the company then uses to market to us more effectively, improve its interface, or make some other change that will likely earn it more money. We are creating information for them without being compensated. The situation would be read very differently if we were participating in a focus group for an advertising company or aiding a psychology lab with its research. Those activities come with compensation; data creation does not.

It would be far too simplistic to come to a conclusion here about whether crowdsourced data analysis is good or bad. Rather, as society’s definitions of property and ownership shift, our concept of work needs to shift as well. If companies are going to keep reporting on the gap in data analysis skills, they also need to acknowledge the profit they gain from the hundreds of people who do these tasks for them. And once that labor is acknowledged, we need to seriously consider the question: When does working with data become work?