A Web archive preserves the evidence of the downing of Malaysia Airlines Flight MH17
Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T. At 2:50 P.M. G.M.T., Igor Girkin, a Ukrainian separatist leader also known as Strelkov, or someone acting on his behalf, posted a message on VKontakte, a Russian social-media site: “We just downed a plane, an AN-26.” (An Antonov 26 is a Soviet-built military cargo plane.) The post includes links to video of the wreckage of a plane; it appears to be a Boeing 777.
Two weeks before the crash, Anatol Shmelev, the curator of the Russia and Eurasia collection at the Hoover Institution, at Stanford, had submitted to the Internet Archive, a nonprofit library in California, a list of Ukrainian and Russian Web sites and blogs that ought to be recorded as part of the archive’s Ukraine Conflict collection. Shmelev is one of about a thousand librarians and archivists around the world who identify possible acquisitions for the Internet Archive’s subject collections, which are stored in its Wayback Machine, in San Francisco. Strelkov’s VKontakte page was on Shmelev’s list. “Strelkov is the field commander in Slaviansk and one of the most important figures in the conflict,” Shmelev had written in an e-mail to the Internet Archive on July 1st, and his page “deserves to be recorded twice a day.”
On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane. Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message “Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.” By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.
The average life of a Web page is about a hundred days. Strelkov’s “We just downed a plane” post lasted barely two hours. It might seem, and it often feels, as though stuff on the Web lasts forever, for better and frequently for worse: the embarrassing photograph, the regretted blog (more usually regrettable not in the way the slaughter of civilians is regrettable but in the way that bad hair is regrettable). No one believes any longer, if anyone ever did, that “if it’s on the Web it must be true,” but a lot of people do believe that if it’s on the Web it will stay on the Web. Chances are, though, that it actually won’t. In 2006, David Cameron gave a speech in which he said that Google was democratizing the world, because “making more information available to more people” was providing “the power for anyone to hold to account those who in the past might have had a monopoly of power.” Seven years later, Britain’s Conservative Party scrubbed from its Web site ten years’ worth of Tory speeches, including that one. Last year, BuzzFeed deleted more than four thousand of its staff writers’ early posts, apparently because, as time passed, they looked stupider and stupider. Social media, public records, junk: in the end, everything goes.
Web pages don’t have to be deliberately deleted to disappear. Sites hosted by corporations tend to die with their hosts. When Myspace, GeoCities, and Friendster were reconfigured or sold, millions of accounts vanished. (Some of those companies may have notified users, but Jason Scott, who started an outfit called Archive Team—its motto is “We are going to rescue your shit”—says that such notification is usually purely notional: “They were sending e-mail to dead e-mail addresses, saying, ‘Hello, Arthur Dent, your house is going to be crushed.’ ”) Facebook has been around for only a decade; it won’t be around forever. Twitter is a rare case: it has arranged to archive all of its tweets at the Library of Congress. In 2010, after the announcement, Andy Borowitz tweeted, “Library of Congress to acquire entire Twitter archive—will rename itself Museum of Crap.” Not long after that, Borowitz abandoned that Twitter account. You might, one day, be able to find his old tweets at the Library of Congress, but not anytime soon: the Twitter archive is not yet open for research. Meanwhile, on the Web, if you click on a link to Borowitz’s tweet about the Museum of Crap, you get this message: “Sorry, that page doesn’t exist!”
The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: “Page Not Found.” This is known as “link rot,” and it’s a drag, but it’s better than the alternative. More often, you see an updated Web page; most likely the original has been overwritten. (To overwrite, in computing, means to destroy old data by storing new data in their place; overwriting is an artifact of an era when computer storage was very expensive.) Or maybe the page has been moved and something else is where it used to be. This is known as “content drift,” and it’s more pernicious than an error message, because it’s impossible to tell that what you’re seeing isn’t what you went to look for: the overwriting, erasure, or moving of the original is invisible. For the law and for the courts, link rot and content drift, which are collectively known as “reference rot,” have been disastrous. In providing evidence, legal scholars, lawyers, and judges often cite Web pages in their footnotes; they expect that evidence to remain where they found it as their proof, the way that evidence on paper—in court records and books and law journals—remains where they found it, in libraries and courthouses. But a 2013 survey of law- and policy-related publications found that, at the end of six years, nearly fifty per cent of the URLs cited in those publications no longer worked. According to a 2014 study conducted at Harvard Law School, “more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.” The overwriting, drifting, and rotting of the Web is no less catastrophic for engineers, scientists, and doctors.
Last month, a team of digital-library researchers based at Los Alamos National Laboratory reported the results of an exacting study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot. It’s like trying to stand on quicksand.
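The audits described above boil down to fetching each cited URL and sorting the result into the article’s categories. Here is a minimal sketch of that check, using only Python’s standard library; the function names and category labels are my own illustration, not the Los Alamos team’s code, and note that no status code can detect content drift—a page that answers 200 may still have been silently rewritten.

```python
# A hypothetical link-rot checker: fetch a cited URL and label the outcome
# in the article's terms ("link rot" vs. possible "content drift").
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def classify(status):
    """Map an HTTP status code (or None for a dead site) to a rot label."""
    if status is None:
        return "link rot (site gone)"
    if status == 404:
        return "link rot (page not found)"
    if 300 <= status < 400:
        return "possible content drift (redirected)"
    if status == 200:
        return "reachable (content drift still possible)"
    return f"error ({status})"

def check(url, timeout=10):
    """HEAD-request a URL and report its rot status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return classify(resp.status)
    except HTTPError as e:        # server answered with an error code
        return classify(e.code)
    except URLError:              # DNS failure, refused connection, etc.
        return classify(None)
```

Running `check` over every footnoted URL in a set of articles, and tallying the labels, is essentially the shape of the surveys the article cites.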
The footnote, a landmark in the history of civilization, took centuries to invent and to spread. It has taken mere years nearly to destroy. A footnote used to say, “Here is how I know this and where I found it.” A footnote that’s a link says, “Here is what I used to know and where I once found it, but chances are it’s not there anymore.” It doesn’t matter whether footnotes are your stock-in-trade. Everybody’s in a pinch. Citing a Web page as the source for something you know—using a URL as evidence—is ubiquitous. Many people find themselves doing it three or four times before breakfast and five times more before lunch. What happens when your evidence vanishes by dinnertime?
The day after Strelkov’s “We just downed a plane” post was deposited into the Wayback Machine, Samantha Power, the U.S. Ambassador to the United Nations, told the U.N. Security Council, in New York, that Ukrainian separatist leaders had “boasted on social media about shooting down a plane, but later deleted these messages.” In San Francisco, the people who run the Wayback Machine posted on the Internet Archive’s Facebook page, “Here’s why we exist.”
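Recovering a deleted post like Strelkov’s is a lookup the Wayback Machine exposes programmatically: its public availability API at archive.org returns, for a URL and a timestamp, the closest archived snapshot. A short sketch, assuming that real endpoint and its documented JSON shape (the helper names are my own):

```python
# Query the Wayback Machine's availability API for the snapshot of a page
# closest to a given moment (timestamp format: YYYYMMDDhhmmss).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build the API request URL for a page, optionally near a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)

def closest_snapshot(url, timestamp=None):
    """Return the archived snapshot's URL closest to `timestamp`, or None."""
    with urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

A caller chasing the 3:22 P.M. G.M.T. capture would pass a July 17, 2014, timestamp and get back a `web.archive.org` URL for the saved page, if one exists.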
The address of the Internet Archive is archive.org, but another way to visit is to take a plane to San Francisco and ride in a cab to the Presidio, past cypresses that look as though someone had drawn them there with a smudgy crayon. At 300 Funston Avenue, climb a set of stone steps and knock on the brass door of a Greek Revival temple. You can’t miss it: it’s painted wedding-cake white and it’s got, out front, eight Corinthian columns and six marble urns.
“We bought it because it matched our logo,” Brewster Kahle told me when I met him there, and he wasn’t kidding. Kahle is the founder of the Internet Archive and the inventor of the Wayback Machine. The logo of the Internet Archive is a white, pedimented Greek temple. When Kahle started the Internet Archive, in 1996, in his attic, he gave everyone working with him a book called “The Vanished Library,” about the burning of the Library of Alexandria. “The idea is to build the Library of Alexandria Two,” he told me. (The Hellenism goes further: there’s a partial backup of the Internet Archive in Alexandria, Egypt.) Kahle’s plan is to one-up the Greeks. The motto of the Internet Archive is “Universal Access to All Knowledge.” The Library of Alexandria was open only to the learned; the Internet Archive is open to everyone. In 2009, when the Fourth Church of Christ, Scientist, decided to sell its building, Kahle went to Funston Avenue to see it, and said, “That’s our logo!” He loves that the church’s cornerstone was laid in 1923: everything published in the United States before that date lies in the public domain. A temple built in copyright’s year zero seemed fated. Kahle hops, just slightly, in his shoes when he gets excited. He says, showing me the church, “It’s Greek!”
Kahle is long-armed and pink-cheeked and public-spirited; his hair is gray and frizzled. He wears round wire-rimmed eyeglasses, linen pants, and patterned button-down shirts. He looks like Mr. Micawber, if Mr. Micawber had left Dickens’s London in a time machine and landed in the Pacific, circa 1955, disguised as an American tourist. Instead, Kahle was born in New Jersey in 1960. When he was a kid, he watched “The Rocky and Bullwinkle Show”; it has a segment called “Peabody’s Improbable History,” which is where the Wayback Machine got its name. Mr. Peabody, a beagle who is also a Harvard graduate and a Nobel laureate, builds a WABAC machine—it’s meant to sound like a UNIVAC, one of the first commercial computers—and he uses it to take a boy named Sherman on adventures in time. “We just set it, turn it on, open the door, and there we are—or were, really,” Peabody says.
When Kahle was growing up, some of the very same people who were building what would one day become the Internet were thinking about libraries. In 1961, in Cambridge, J. C. R. Licklider, a scientist at the technology firm Bolt, Beranek and Newman, began a two-year study on the future of the library, funded by the Ford Foundation and aided by a team of researchers that included Marvin Minsky, at M.I.T. As Licklider saw it, books were good at displaying information but bad at storing, organizing, and retrieving it. “We should be prepared to reject the schema of the physical book itself,” he argued, and to reject “the printed page as a long-term storage device.” The goal of the project was to imagine what libraries would be like in the year 2000. Licklider envisioned a library in which computers would replace books and form a “network in which every element of the fund of knowledge is connected to every other element.”
In 1963, Licklider became a director at the Department of Defense’s Advanced Research Projects Agency (now called DARPA). During his first year, he wrote a seven-page memo in which he addressed his colleagues as “Members and Affiliates of the Intergalactic Computer Network,” and proposed the networking of ARPA machines. This sparked the imagination of an electrical engineer named Lawrence Roberts, who later went to ARPA from M.I.T.’s Lincoln Laboratory. (Licklider had helped found both B.B.N. and Lincoln.) Licklider’s two-hundred-page Ford Foundation report, “Libraries of the Future,” was published in 1965. By then, the network he imagined was already being built, and the word “hyper-text” was being used. By 1969, relying on a data-transmission technology called “packet-switching,” which had been developed by a Welsh scientist named Donald Davies, ARPA had built a computer network called ARPANET. By the mid-nineteen-seventies, researchers across the country had developed a network of networks: an internetwork, or, later, an “internet.”
Kahle enrolled at M.I.T. in 1978. He studied computer science and engineering with Minsky. After graduating, in 1982, he worked for and started companies that were later sold for a great deal of money. In the late eighties, while working at Thinking Machines, he developed Wide Area Information Servers, or WAIS, a protocol for searching, navigating, and publishing on the Internet. One feature of WAIS was a time axis; it provided for archiving through version control. (Wikipedia has version control; from any page, you can click on a tab that says “View history” to see all earlier versions of that page.) WAIS came before the Web, and was then overtaken by it. In 1989, at CERN, the European particle-physics laboratory, in Geneva, Tim Berners-Lee, an English computer scientist, proposed a hypertext transfer protocol (HTTP) to link pages on what he called the World Wide Web. Berners-Lee toyed with the idea of a time axis for his protocol, too. One reason it was never developed was the preference for the most up-to-date information: a bias against obsolescence. But the chief reason was the premium placed on ease of use. “We were so young then, and the Web was so young,” Berners-Lee told me. “I was trying to get it to go. Preservation was not a priority. But we’re getting older now.” Other scientists involved in building the infrastructure of the Internet are getting older and more concerned, too. Vint Cerf, who worked on ARPANET in the seventies, and now holds the title of Chief Internet Evangelist at Google, has started talking about what he sees as a need for “digital vellum”: long-term storage. “I worry that the twenty-first century will become an informational black hole,” Cerf e-mailed me. But Kahle has been worried about this problem all along.
“I’m completely in praise of what Tim Berners-Lee did,” Kahle told me, “but he kept it very, very simple.” The first Web page in the United States was created at SLAC, Stanford’s linear-accelerator center, at the end of 1991. Berners-Lee’s protocol—which is not only usable but also elegant—spread fast, initially across universities and then into the public. “Emphasized text like this is a hypertext link,” a 1994 version of SLAC’s Web page explained. In 1991, a ban on commercial traffic on the Internet was lifted. Then came Web browsers and e-commerce: both Netscape and Amazon were founded in 1994. The Internet as most people now know it—Web-based and commercial—began in the mid-nineties. Just as soon as it began, it started disappearing.