A massive digitization project is underway to scan the roughly six million images stored in the morgue at the New York Times, which will be used in an archival storytelling project called Past Tense. The first example was published online and in print, and the possibilities are pretty neat. But there’s also a lot to consider about what this means for the archives at a variety of organizations, and about digitization and digital storage in general.
Let’s start by looking at the digitization process for this project. A post at Google describes how the Times is using Google’s technology:
“Once an image is ingested into Cloud Storage, the New York Times uses Cloud Pub/Sub to kick off the processing pipeline to accomplish several tasks. Images are resized through services running on Google Kubernetes Engine (GKE) and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.”
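To make that pipeline description concrete, here is a minimal, self-contained sketch of the kind of handler that might sit behind the Pub/Sub trigger. Everything here is hypothetical (the message fields, the function name, the 1024-pixel thumbnail target); it only illustrates the two jobs the quote describes, computing resize targets and producing a metadata row for the database:

```python
from datetime import datetime, timezone

def handle_ingest_message(message: dict) -> dict:
    """Simulate one pass through the pipeline: a Pub/Sub-style message
    about a newly uploaded scan is turned into a thumbnail size and a
    metadata row destined for the PostgreSQL database."""
    name = message["name"]  # object path in the storage bucket
    width, height = message["width"], message["height"]

    # Resize step: compute thumbnail dimensions, preserving aspect ratio.
    # (In the real pipeline, resizing runs as services on GKE.)
    max_side = 1024
    scale = min(1.0, max_side / max(width, height))
    thumb = (round(width * scale), round(height * scale))

    # Metadata row for the Cloud SQL database.
    return {
        "object_path": name,
        "original_size": (width, height),
        "thumbnail_size": thumb,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

# A message shaped like a storage-upload notification:
msg = {"name": "morgue/box-042/photo-0001.tif", "width": 4800, "height": 3600}
row = handle_ingest_message(msg)
```

The point is less the arithmetic than the architecture: each upload event fans out into independent tasks, so millions of images can be processed without anyone hand-feeding a queue.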
That’s a lot of Google product doing a lot of automated tasks. One of the more interesting parts is that though humans are still required to scan the prints, the metadata is largely handled by artificial intelligence. Cloud Vision API detects the text that’s on the back of the images — info about the image like when, if ever, it ran in the Times — which is an incredible time-saver. In describing the process, however, Google itself notes it’s not perfect. In the example they show, the AI picks up a hand-drawn circle around a date and includes it as an open parenthesis, and in trying to parse the words in a sentence that had been crossed out, it creates quite garbled results. Still, it does capture a lot, and far quicker than a human could do it. Additionally, the Cloud Natural Language API can be used to analyze the text that’s been identified and add even more useful search terms.
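The hand-drawn-circle example suggests why some human (or at least rule-based) cleanup still matters after the OCR pass. Here is a hypothetical sketch of that kind of post-processing; the function name, the set of stray characters, and the date formats are all my assumptions, not anything the Times or Google has described:

```python
import re
from datetime import datetime

def extract_ran_date(ocr_text: str):
    """Hypothetical post-OCR cleanup: find a publication date in the
    noisy text from the back of a print. Strips stray characters that
    OCR can pick up from pen marks (e.g. a hand-drawn circle read as
    an open parenthesis) before matching."""
    # Drop characters that commonly come from marks, not type.
    cleaned = re.sub(r"[(){}\[\]|]", " ", ocr_text)
    # Match dates like "Apr 21, 1978" or "April 21 1978".
    m = re.search(
        r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+"
        r"(\d{1,2}),?\s+(\d{4})\b",
        cleaned,
    )
    if not m:
        return None
    stamp = f"{m.group(1)} {m.group(2)} {m.group(3)}"
    return datetime.strptime(stamp, "%b %d %Y").date()

# A garbled back-of-print reading, with a stray "(" from a circled date:
noisy = "PUB (Apr 21, 1978  photog: unknown"
ran = extract_ran_date(noisy)
```

A regex like this is brittle in exactly the way the AI-skeptic would predict, which is the broader point: the machine does the bulk transcription, and narrow human judgment still decides what counts as signal.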
That’s all very exciting, especially when the task at the Times is to get through several million images in about one to two years (the speed is understandable — in 2015, a busted pipe caused a flood that nearly ruined the archive, so they’re reasonably concerned about preserving it). Naturally, however, the use of AI might give an information professional pause — wait, is that a machine taking my job? In a way, yes, since a person will not be individually entering all that back-print data into a computer. But it’s helpful to revisit The Seven Deadly Sins of AI Predictions here to get some perspective, particularly on “performance versus competence,” which reminds us: “Today’s robots and AI systems are incredibly narrow in what they can do. Human-style generalizations do not apply.” Google’s AI is sophisticated, but there’s still a need for humans for certain aspects of categorizing and cataloging.
When I’m wearing my journalist hat, this project seems super cool. I’ve been waiting to see how this would evolve since I saw job postings for the archival storytelling team earlier this year, and while browsing the Times this weekend, it was exciting to see the first piece to come out of it, an eerily timed collection of photos showing a glimpse of California over time. Imagine all the possible stories they could build using these millions of images, all the tales that have been waiting in a basement all this time.
But when I put on my information hat, things get a little murky. I respect that the Times understands it’s not an archiving operation. “This is not first and foremost about preservation — it’s about storytelling,” said Nick Rockwell, Times CTO, in a speech at Google describing the project. But while they clearly recognize the importance of preservation, it’s unclear just how much consideration they gave it.
Starting from the public-facing part (the archive itself will be available only to the Times’ journalists), there’s a long-term preservation concern around the dynamic digital packages they’ll be creating out of these images. The Times might be working on its own digital preservation plans, but preserving dynamic content is a problem the field is still grappling with.
Of course, the digitization also raises plenty of questions. The Times describes this project as “part of a technology and advertising partnership with Google,” but doesn’t go into any more detail about what that entails. Are they keeping copies on their own servers, or only in the Cloud? What else is Google getting out of this, beyond the ample coverage as the Times discusses the project? There’s also a question every organization embarking on a digitization project like this must consider: what to do with the physical materials. Is it viable for the Times to keep up both the digital archive and the 600,000-pound physical archive in an expensive piece of Manhattan real estate?
And what does this mean for the wider landscape? The Times is an influential institution, so when they heavily promote the use of Google products for such an ambitious project, other organizations might see that as a recommendation for their own archives. Google, of course, is ready for them. They’re pushing the use of Google Cloud by news organizations of all sizes, offering funding to initiate its use, as well as platform credits. So this is something that other organizations can do — but should they?
“Google cares deeply about journalism.” The quote, attributed to Google CEO Sundar Pichai, is in large, bold lettering on the page that describes what the Google News Initiative is about. It continues: “We believe in spreading knowledge to make life better for everyone. It’s at the heart of Google’s mission. It’s the mission of publishers and journalists. Put simply, our futures are tied together.”
The conclusion of that statement almost comes across as a threat. Is Google really journalism’s future? Say a newspaper uses Google’s services to archive its photo collection. Is the next step to store all its archivable assets there? What does that mean for access, and what does that mean for libraries? Is there an alternative that doesn’t tie an organization’s archives to a corporation, especially a monolith like Google?
One last thing to consider about a newspaper’s digitized archives is what happens to serendipity. Is it lost when you’re searching a digital database for very specific, targeted terms? Or does digitization open up new possibilities for serendipitous discovery, like the work Jer Thorp is doing at the Library of Congress? A piece about the Times morgue published in 2017 suggested serendipity would be lost in digitization.
“There’s an inescapable sense of serendipity to wandering through the morgue, a sensation that in many ways would be impossible to replicate with a modern-day, digitized archive,” Stephen Hiltner wrote. However, maybe that prediction is as good as this one, which followed immediately in that piece: “Not that a digital version of The Times’s morgue is likely, or even possible.”
Originally written for INFO-654: Information Technologies, Pratt School of Information, Fall 2018.