Big Social Data
The Long Tail of Science Data
The 20th century was seminal for the natural sciences, with discoveries such as penicillin (Fleming, 1945), the polio vaccine (Salk, 1952), the double helix structure of DNA (Watson & Crick, 1953), and the first complete DNA sequence of an organism (Sanger et al., 1977), all of which advanced human understanding and human welfare. The advent of the OpenWeb and a seemingly endless amount of new science data have the potential to do for the computational social sciences in the 21st century what other measurement tools did for the natural sciences in the 20th: further advance human understanding and human welfare. Big Data will be central to that pursuit.
“Data is the new oil” is a phrase coined by Clive Humby and embraced by the World Economic Forum in 2011, when it treated data as an economic asset like oil. According to IBM, we create 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone. These data come from everywhere: hard sensors used to gather information, the social web, transaction records, and cell phone GPS signals, to name a few. The amount of such data is big and by every account growing exponentially. These facts, however, give no hint as to where the largest growth is or, comparatively speaking, where the greatest rewards lie for researchers. It is difficult to believe that data are increasing uniformly.
Despite the remarkable growth of data, the description of Big Data still seems rather empty. Describing how that 90% growth is distributed would more accurately define what Big Data is and, more importantly, how it differs from traditional science data. If, for example, the data explosion is normally distributed, then perhaps Big Data will have less impact than expected for the social sciences, because the tails of a normal distribution lie closer to the average than the tails of a power law distribution do. If this is the case, Big Data is not as special as suspected, and is less “oil” than it is an average economic asset.
The Long Tail of Science Data
A distribution is said to have a long tail if a larger share of population or use rests within its tail than it would under a normal distribution. A signature quality of a power law is the long tail: a large number of occurrences far from the head or central part of the distribution.
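The contrast between a normal tail and a power law tail can be made concrete with a small sketch. It compares the probability of landing far above the mean under a normal distribution and under a Pareto (power law) distribution with the same mean and standard deviation. The parameter values (`X_MIN`, `ALPHA`) and the helper names are illustrative assumptions, not figures from any dataset discussed here.

```python
import math

def normal_tail(x, mean, sd):
    """P(X > x) for a normal distribution with the given mean and sd."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def pareto_tail(x, x_min, alpha):
    """P(X > x) for a Pareto (power law) distribution with scale x_min."""
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha

def pareto_mean_sd(x_min, alpha):
    """Mean and standard deviation of a Pareto distribution (needs alpha > 2)."""
    if alpha <= 2:
        raise ValueError("the variance is finite only when alpha > 2")
    mean = alpha * x_min / (alpha - 1.0)
    var = x_min ** 2 * alpha / ((alpha - 1.0) ** 2 * (alpha - 2.0))
    return mean, math.sqrt(var)

# Hypothetical parameters chosen only for illustration.
X_MIN, ALPHA = 1.0, 2.5
MEAN, SD = pareto_mean_sd(X_MIN, ALPHA)

def tail_shares(k):
    """Tail probabilities beyond mean + k*sd for (normal, Pareto)."""
    threshold = MEAN + k * SD
    return normal_tail(threshold, MEAN, SD), pareto_tail(threshold, X_MIN, ALPHA)

for k in (2, 3, 5):
    normal_p, pareto_p = tail_shares(k)
    print(f"beyond mean + {k} sd: normal={normal_p:.2e}, power law={pareto_p:.2e}")
```

Under these assumed parameters the power law keeps a far larger share of its mass several standard deviations out than the normal distribution does, which is the sense in which its tail is "long."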
The long tail has gained popularity in describing a retailing strategy of selling a large number of unique items in relatively small quantities. The Long Tail (Anderson, 2004) encourages the entertainment industry to “forget squeezing millions from a few megahits at the top of the charts.”
Anderson quite successfully foretells that “the future of entertainment is in the millions of niche markets at the shallow end of the bit stream.” This principle explains why Netflix beat out Blockbuster, and how Amazon has been so successful, with stock prices increasing from $40 in late 2004 to nearly $240 in late 2012.
The long tail of science data follows the power law distribution. The National Science Foundation has shown that the dollar amounts of its grants follow a power law, which offers empirical support for the long tail of science data. The tail has many heterogeneous datasets; these data are small, often individually curated, and unmaintained. As a result, the data are discontiguous from other research efforts and discontinuous over space and time. See Figures 1-2.
FIGURE 1. The Long Tail of data
FIGURE 2. Power Law Distribution example, similar to The Long Tail
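As a hedged illustration of how one might check whether a set of sizes (grant amounts, dataset sizes, and so on) is consistent with a power law, the sketch below estimates the tail exponent by maximum likelihood. The data are synthetic draws from Python's standard library, not NSF figures, and the function name `fit_power_law_alpha` and the choice of `x_min` are assumptions made for the example.

```python
import math
import random

def fit_power_law_alpha(values, x_min):
    """Maximum-likelihood estimate of the Pareto tail exponent alpha.

    Only observations >= x_min are used. Alpha is the exponent of the
    survival function P(X > x) = (x_min / x) ** alpha.
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min")
    return len(tail) / log_sum

# Synthetic "sizes" drawn from a Pareto distribution (assumed alpha = 2.5).
rng = random.Random(42)
TRUE_ALPHA = 2.5
synthetic_sizes = [rng.paretovariate(TRUE_ALPHA) for _ in range(20000)]

alpha_hat = fit_power_law_alpha(synthetic_sizes, x_min=1.0)
print(f"true alpha = {TRUE_ALPHA}, estimated alpha = {alpha_hat:.3f}")
```

With real data the choice of `x_min` matters a great deal, and a fitted exponent alone does not establish that a power law is the best description; the sketch only shows the basic estimation step.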
Intelligence data too are often collected in a manner that is limited in scope and almost never preserves any semblance of external validity. These data are often collected for small projects and are often forgotten and not maintained.
The poor curation of these data leads to their inevitable misplacement – dark data, or data that are suspected to exist or ought to exist, but are difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The utter lack of central management of data in the tail invariably leads these data to be forgotten. The long tail is an intractably large management problem and an analytical one as well. The central curation of data in the head ensures maintenance, unlike data in the tail.
Professor and mathematician Richard Hamming’s prominent words during a speech titled “You and Your Research” encourage researchers to ask, “What are the important problems in my field?”
Understanding the long tail of science data may suggest where computational social scientists will have the greatest impact and contribute the greatest social good.
The head of the distribution is where large, homogeneous datasets reside and perhaps where the greatest impact is on human understanding and human welfare. The high volume of data produces datasets that coincide in time and space, unintentionally producing binding research across social science disciplines and even between the natural and social sciences. Their coincident nature makes them ideal for cross-correlation and multivariate analysis.
Open Innovation initiatives hold certain promise for sharing risk and rewards. “Open Innovation is a paradigm that assumes that firms can and should use external ideas as well as internal ideas.” (Chesbrough, 2003)
Designing binding research across social science disciplines and between the natural and social sciences will require shared laboratories provided by the head of the long tail, and shared interest in socially critical problems.
Big Social Data
There is no argument that Big Data has spurred technological innovation, which has lowered the cost of processing data and consequently has had a social impact. Businesses especially are using Big Data to answer questions that five years ago were not answerable. A recent study of 179 large publicly traded firms (Brynjolfsson, 2011) found that those that adopted data-driven decision making have output and productivity five to six percent higher than would be expected given their other investments and information technology usage.
Big Data advancements have included better targeted Web ads by the likes of Google and Facebook, as well as better recommendation systems by Netflix and Amazon. The use of data on the part of these and similarly savvy data-driven companies has seemingly had a positive, unilateral impact on operations, offering supporting evidence for a data-driven approach.
The social turn in Big Data, however, promises the marked improvement of policy and governance decisions affecting the lives of everyone. Big Social Data makes great effort to dispel a number of enduring misconceptions about Big Data.
The contributions on the part of several socially conscious data scientists are ultimately highlighting the differences between the demands on data and analysis on the part of private industry and the demands on data and analysis on the part of security, governance, and policy – foreign or domestic.
Lieutenant General Michael Flynn (Director of the Defense Intelligence Agency, 2012) recently noted that a focus on the precursors of war is needed, which he calls “left of boom.” When speaking of national security, he notes that one thing needed more frequently is the engagement of populations before starting a counterinsurgency campaign. He notes that there have been 32 major insurgencies since 1960, and in 22 of these cases the insurgent forces were the victors.
Irregular warfare is the template for future battlefields; U.S. interests are best served by preventive measures that ensure these wars are never fought.
Ethnic tension is frequently a part of the developing world, a world often in the greatest need of aid or humanitarian assistance, and the frontlines of current or perhaps future engagements. These areas have also seen some significant, if not dramatic, increases in data. Flynn cautiously accounts for potential social ills facing the industrialized world as well. Big Social Data is a potential that is largely untapped and will allow decision makers to track progress, better understand and improve social conditions of local populations, and understand where existing policies require adjustment.
Big Data, Small Theory
The typical state-centric analysis that seeks to determine how states can or do maintain stability must also develop a sensory capability to better detect the precursors to political change, a social radar of sorts with a level of granularity that enables policy leaders to make informed decisions that maximize national influence left of boom. (Flynn, 2012)
The Failed States Index (Foreign Policy, 2012) claims that 108 of the 178 countries evaluated are in high warning of failing. If the remaining 18 unmeasured countries are assumed to be “stable,” then Foreign Policy’s figures put 55% of the world’s countries (108 of 196) in serious danger of failing. Many of these countries are important pivots in foreign policy, currently receiving aid and/or in close proximity to the United States, either geographically or economically (see Figure 3).
FIGURE 3. Failed States Index, Foreign Policy magazine, 2012
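The 55% estimate above follows from simple arithmetic on the counts quoted in the text. A short sketch, using only those counts and the text's assumption that unmeasured countries are "stable" (the variable and function names are illustrative):

```python
# Counts quoted in the text for the 2012 Failed States Index.
HIGH_WARNING = 108  # countries rated in high warning of failing
EVALUATED = 178     # countries the index evaluated
UNMEASURED = 18     # countries not evaluated, assumed "stable" in the text

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Share among evaluated countries only.
share_of_evaluated = percent(HIGH_WARNING, EVALUATED)

# Share among all countries, counting unmeasured ones as stable.
share_of_all = percent(HIGH_WARNING, EVALUATED + UNMEASURED)

print(f"share of evaluated countries: {share_of_evaluated:.1f}%")
print(f"share of all countries:       {share_of_all:.1f}%")
```

The difference between the two shares shows how much the headline figure depends on the assumption made about the unmeasured countries.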
The Failed States Index provides annual snapshots of a state at a national level. It turns out that states fail for a variety of reasons, shaped in innumerable ways, many of which do not involve violence.
Violence, as it is, may result from the failure of the state, what Thomas Hobbes called the “state of nature.” Flynn’s idea of left of boom is an acknowledgement of the “state of nature” and a plea for better data analysis. State failures are not without disharmonies of local communities or without heterogeneous castings of ethnicities, economies, and other human characteristics of the landscape.
Geographers already know stationarity is a tenuous claim due to heteroscedasticity in the response and spatial heterogeneity in the control. Analytical pursuits seeking nomothetic laws quickly sacrifice internal validity. At times, the reaction on the part of human geographers is to treat every observation as a special case. This is extreme heterogeneity with multiple equilibria: one process for every observation over space. Analyses of this nature prove to be overly descriptive and quite idiographic, ultimately lacking any external validity, which is the hope of making generalizations to other places or times of interest.
By clutching complexity, the computational social sciences can blend these two contrasting outlooks.
Peter Norvig, Director of Research at Google Inc., and his coauthors (2009) write in “The Unreasonable Effectiveness of Data” about seizing complexity. They draw a direct comparison to Eugene Wigner’s article, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960), and acknowledge that sciences that involve humans rather than elementary particles have proven resistant to elegant mathematics. They continue, “(We) should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.”
Italian scientist Galileo’s experiment of dropping two balls of different masses from the Leaning Tower of Pisa demonstrated that their time of descent was independent of their mass. Wigner explains how this result is true everywhere, was always true, and will always be true. It is valid no matter whether it rains or not, whether the experiment is carried out in the Far East, Near East, or Northeast D.C., no matter whether the person is a man or a woman, rich or poor, Muslim or Catholic.
This invariance property of physics is well recognized, and without invariance principles, physics would not be possible.
Social phenomena are not invariant, however. As Gelfand and others (Velupillai, Poli, Norvig) noted, there is only one thing more unreasonable than the unreasonable effectiveness of mathematics in physics, and that is the unreasonable ineffectiveness of mathematics in other sciences, including the humanities. While serial and spatial autocorrelation exist, so do temporal and spatial heterogeneity and, ultimately, uncontrolled variance.
Exploiting the complexity of data in the head of the distribution holds more promise than building models that attempt to capture all of human behavior decisively.
Big Social Data’s goal is to learn about social systems at a speed commensurate with decision making and at a spatial support commensurate with policy development and assessment. Small theory is one framework to operate within. These small theories are not small in significance but locally calibrated to the populations they measure. They are based on empirical observation and are expected to be generally true and sufficient to be useful norms.
The social turn in big data mimics similar social turns in the Department of Defense and the U.S. Intelligence Community.
One ought to wonder what U.S. Government engagements would look like if Big Social Data could improve decision making or intelligence analysis by a mere five to six percent, as industry has done for output and productivity. Big Social Data has untapped potential, as Anyon (1982) envisioned, for social sciences to study socially critical problems.
- Anderson, C. (2004). Wired 12.10: The Long Tail. wired.com. Retrieved December 30, 2012, from http://www.wired.com/wired/archive/12.10/tail.html
- Anyon, Jean. (1982). “Adequate Social Science, Curriculum Investigation, and Theory.” Theory Into Practice. Vol. 21, Number 1, 1982: 34-38.
- Brynjolfsson, E., Hitt, L. M., & Kim, H. H. (2011). Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? SSRN Electronic Journal. doi:10.2139/ssrn.1819486
- Chesbrough, H. W. (2006). Open Innovation: The New Imperative for Creating And Profiting from Technology. Harvard Business School Press.
- Failed States Index. (n.d.). Retrieved from http://www.foreignpolicy.com/failed_states_index_2012_interactive
- Halevy, A., Norvig, P., & Pereira, F. (2009). The Unreasonable Effectiveness of Data. Intelligent Systems, IEEE, 24(2), 8-12.
- NSI, A., OSD, M. Flynn., H., Casebeer, B., Chapman, M., & Rice, C. et al (2012). National Security Challenges: Insights from Social, Neurobiological, and Complexity Sciences. DTIC Document, (July).
- Long Tail. (n.d.). In Wikipedia. Retrieved December 30, 2012, from http://en.wikipedia.org/wiki/Long_tail