On the Ethics of Mapping: When does data collection cross the line?
“I saw the chance to really work on an Internet scale, command hundred thousands of devices with a click of my mouse, portscan and map the whole Internet in a way nobody had done before, basically have fun with computers and the Internet in a way very few people ever will.”
Many of you probably read the New York Times feature last year describing how Target used its data collection team to identify shoppers who were pregnant, and therefore likely to become big spenders on diapers, lotions, baby furniture, and, by extension, everything else, if they could be targeted (no pun intended) early enough in the pregnancy. Citing the poignant example of an enraged father berating Target for suggesting that his teenage daughter might be pregnant, only to discover that she actually WAS pregnant, the article raised the question: when does data collection cross the line between smart business practice and invasion of privacy?
A recent mapping project, the Internet Census 2012, blurs the line even further. Hot on the heels of SOPA, PIPA, and Aaron Swartz’s untimely death, and the conversation about open access to information these events provoked, this project by an anonymous researcher capitalized on a glaring security loophole in Linux systems to find temporary hosts that would allow a good chunk of the IPv4 internet (note 1) to be mapped.
The technical details are beyond the scope of this blog and are described in further detail in the report, but essentially, the researcher hacked into 420,000 Linux systems that had either weak or no system passwords, infecting each device with a background service that scanned for IP addresses. Luckily, the researcher harbored no malicious intent: the system was not affected, incoming and outgoing traffic were ignored, and each device was returned to its original state after a reboot.
The maps and visualizations created as a result of this endeavor are stunning for their revelations about the internet as much as for the chutzpah involved in embarking on such an ethically-ambiguous endeavor. Is it worth the potential violations of privacy of hundreds of thousands of individuals (and corporations) to document the proliferation of our virtual communications networks? How far is it prudent to go in the pursuit of knowledge acquisition, and where do we draw the line between information we NEED and information we WANT? After all, the creator of this work was cited as saying, “I did not want to ask myself for the rest of my life how much fun it could have been or if the infrastructure I imagined in my head would have worked as expected.”
True, the hacker involved had the ethical bearings to restrain this study, and did not exploit the private contents of the devices and local networks, but what about next time? In all, the project charted 4.6 million IP addresses over the course of October to December 2012, affecting hundreds of thousands of devices and in the process breaking enough laws around the world to “make them liable for many thousands of years behind bars” if current sentencing policy prevails. A big if, considering recent forays into information access, whether legal (Freedom of Information Act (FOIA)) or not (WikiLeaks).
This post is by no means critical of the results, either of Target or the Internet Census, as obviously data mapping is the prime motivator of this blog. Indeed, the maps and diagrams themselves give a clear picture of the concentration of the internet at large in a way that more abstract maps by CAIDA, Opte, and others (note 2) are unable to do. Given the optimistic, save-the-world attitude often embraced by mappers, data visualizers, and urban explorers such as myself, it is merely worth asking: how much data do we need? And how far are we willing to go to get it?
To get a geographic overview we determined the geolocation of all IP addresses that respond to ICMP ping requests or have open ports. We used MaxMind's freely available GeoLite database [maxmind.com] for geolocation mapping. Different versions of this image are available for download at http://internetcensus2012.bitbucket.org/images.html
To test if we could see a day-night rhythm in the utilization of IP spaces we used all ICMP records to generate a series of images that show the difference from daily average utilization per half an hour. We composed these images into a GIF animation that clearly shows a day-night rhythm. The difference between day and night is lower for the US and Central Europe because of the higher number of “always on” Internet connections.
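For readers curious what "difference from daily average utilization per half an hour" might look like in practice, here is a minimal Python sketch. It is not the researcher's code: the function name and the sample counts are made up for illustration, and it assumes one day of data has already been binned into 48 half-hour slots for a single region.

```python
from statistics import mean

def deviation_from_daily_average(counts):
    """Return each half-hour slot's difference from the day's mean count.

    `counts` is assumed to hold 48 values, one per half-hour slot,
    e.g. the number of hosts in a region that answered ICMP pings.
    """
    if len(counts) != 48:
        raise ValueError("expected 48 half-hour slots")
    daily_avg = mean(counts)
    return [c - daily_avg for c in counts]

# Hypothetical data: fewer responders at night (first 16 slots),
# more during the day (remaining 32 slots).
sample = [100] * 16 + [160] * 32
diffs = deviation_from_daily_average(sample)
```

Negative values mark slots below the daily average (night, in this made-up sample) and positive values mark slots above it. A region full of "always on" connections would show values close to zero all day.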
To get a visual overview of ICMP records we converted the one-dimensional, 32-bit IP addresses into two dimensions using a Hilbert Curve, inspired by xkcd. This curve keeps nearby addresses physically near each other and it is fractal, so we can zoom in or out to control detail. Figure 2 shows 420 Million IP addresses that responded to ICMP ping requests at least two times between June and October 2012. Address blocks are labeled based on IANA’s list of IPv4 allocations. Each pixel in the original 4096 x 4096 image represents a single /24 network containing up to 256 hosts. The pixel color shows the utilization of each /24 based on the number of probe responses. Black areas represent addresses that did not respond to the probes. Blue represents low utilization (at least one response), and red represents 100% utilization. This image was generated to be comparable to Figure 3, created in 2006 by CAIDA in an Internet census project [isi.edu].
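To make the Hilbert curve idea concrete, here is a rough Python sketch that places a /24 network on a 4096 x 4096 grid. It uses the standard textbook distance-to-coordinate conversion for a Hilbert curve; the function names are my own, and the orientation of the curve is an assumption, so the output will not necessarily line up pixel for pixel with the census image or the xkcd map.

```python
import ipaddress

def hilbert_d2xy(side, d):
    """Convert distance d along a Hilbert curve to (x, y).

    `side` is the grid width and must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so the sub-curves join up.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def slash24_to_pixel(ip, side=4096):
    """Map an IPv4 address to the pixel of its /24 network.

    Dropping the last 8 bits leaves a 24-bit index, and
    2**24 == 4096 * 4096, so every /24 gets exactly one pixel.
    """
    index = int(ipaddress.IPv4Address(ip)) >> 8
    return hilbert_d2xy(side, index)

# Two hosts in the same /24 share a pixel.
pixel_a = slash24_to_pixel("192.168.1.42")
pixel_b = slash24_to_pixel("192.168.1.200")
```

The property the caption relies on is that consecutive positions on the curve are always neighbouring pixels, so adjacent /24 networks stay adjacent in the image.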
(note 1) IPv4 is the system of establishing a unique number to identify devices on a network. If you’ve ever seen a number like 192.168.1.42, that’s an IPv4 address. IPv4 is a 32-bit system, meaning the address format can provide about 4.29 billion (2^32) different configurations, and when the internet was in its infancy, this seemed like a LOT of devices. However, the last blocks of IPv4 addresses were officially allocated in February 2011. Luckily, a switch had been in the works for several years to IPv6, a 128-bit system that allows for 2^128 Internet addresses, or roughly 3.4 x 10^38 (340,282,366,920,938,463,463,374,607,431,768,211,456 of them, to be exact). For more information on this switch, Mashable has a good plain-English explanation.
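If you want to check the arithmetic in note 1 yourself, a few lines of Python using the standard library's ipaddress module will do it. This is just an illustration; the variable names are mine.

```python
import ipaddress

# A dotted IPv4 address is one 32-bit number written as four bytes.
addr = ipaddress.IPv4Address("192.168.1.42")
as_int = int(addr)  # 192*2**24 + 168*2**16 + 1*2**8 + 42

# Size of each address space.
ipv4_total = 2 ** 32
ipv6_total = 2 ** 128

print(as_int)
print(f"{ipv4_total:,}")
print(f"{ipv6_total:,}")
```

Python integers have no fixed size, so the full 39-digit IPv6 count prints without any rounding.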
(note 2) For more maps of the internet and other virtual systems, see http://visualizingsystems.com/category/connective-systems/virtual-infrastructure/