“A wild, wild West”: How NYT makes sense of COVID-19 reporting systems

Archie Tse is the director of graphics at The New York Times.

The Trump administration this week ordered hospitals to send their coronavirus data to the Department of Health and Human Services, bypassing the Centers for Disease Control and Prevention. Thursday morning, the CDC website that had been displaying the data was blank, prompting an outcry from journalists and members of Congress.

Later Thursday, the CDC restored two-day-old data. But it remains unclear whether the information will be regularly updated and visible to the public.

The New York Times and The Atlantic’s COVID Tracking Project, however, were able to keep updating their databases because they run their own labor-intensive systems that collect data from states and, in the Times’s case, from counties as well.

“The Times’s data collection for this page is based on reports from state and local health agencies, a process that is unchanged by the Trump administration’s new requirement that hospitals bypass the Centers for Disease Control and Prevention and send all patient information to a central database in Washington,” the Times wrote Friday at the top of its COVID-19 mapping page.

We talked to Archie Tse, the Times director of graphics, to learn more about how massive amounts of data collected by Times staffers become charts and maps that viewers can easily understand.

What are your main sources of information for the data you display in your coronavirus graphics?

Tse: We’re getting all of our cases and death data directly from the states and also in many cases directly from counties. States are often slow and so we actually go directly to the counties and get the data, which is usually a few days ahead of what the states report.

Because the data task is pretty monumental, we kind of pick and choose the counties that we go to directly based on how severe the outbreak is in those places, and what kind of outbreak is happening there. So we’re kind of making choices about when we rely on a state for the data and when we go to the county to get the data.

So you don’t even use the CDC data for backup?

Tse: We’re not using any of the case and death data from the CDC site at all. We have been trying to plug into their hospital data, but we’ve not found it to be that useful yet, and we don’t know when it will become useful.

We don’t know exactly what it will mean if the HHS now becomes the clearinghouse for this. We do worry a little bit about another set of competing numbers that may not be as timely, that will be confusing to the public. We’re kind of concerned about that. But I think there’s not a lot we can do at this point.

There’s not been a central clearinghouse for the data. It’s kind of a wild, wild West.

What’s the time and staff commitment for the data collection?

Tse: We’re doing it throughout the day. We do close the books a little bit after midnight for each day. And then we start a new day in the morning.

It’s pretty intensive. It’s a huge team that is doing this. We have automated some of the data collection from the states and counties using scraping. But there is still a large amount of manual data collection. Every county and every state is using some different dashboard that we had to write a scraper for, or that needs to be checked to make sure they didn’t change their HTML or something overnight.

So on any given day there’s a team of at least a dozen people who are actively working on data collection and data entry. And then another five people a day that are working on the scraping. The scrapers have to be monitored, because they go down pretty frequently because of changes in the code, and you have to have people to fix the scrapers right away.
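The kind of check Tse describes, catching a dashboard whose HTML changed overnight, might look something like the sketch below. It is an illustration only and not the Times's actual tooling: the element ids ("cases", "deaths") and the function name missing_ids are hypothetical, and it assumes a scraper that locates its numbers by element id.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_ids(html, expected_ids):
    """Return the expected element ids that no longer appear in the page.

    A non-empty result suggests the dashboard layout changed and the
    scraper needs a human to look at it.
    """
    collector = IdCollector()
    collector.feed(html)
    return sorted(set(expected_ids) - collector.ids)


# Hypothetical dashboard markup before and after an overnight redesign.
yesterday = '<div id="cases">1200</div><div id="deaths">30</div>'
today = '<div id="total-cases">1250</div><div id="deaths">31</div>'

print(missing_ids(yesterday, ["cases", "deaths"]))
print(missing_ids(today, ["cases", "deaths"]))
```

A check like this only flags that something moved. Someone still has to update the scraper, which fits the staffing Tse describes.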

It’s a lot bigger than we imagined when we started.

Once you have the data collected, are your graphics systems pretty automated by now?

Tse: The data collection first started in a Google spreadsheet when one of our national reporters started collecting data on every case. We were using our graphics publishing rig to generate the pages and it was how we would normally set something up on deadline.

But we got to about 40,000 rows in the Google spreadsheet and things began to slow down to a crawl. So we went over to the interactive news department, which does a lot of the back-end work for the newsroom. And they helped build the sort of database scraping effort to make this a proper data collection effort.

Then we actually spun up a publishing team, using the model that we use for election results, to build the presentation side of all the data. So that’s a pretty robust publishing system. It’s constantly being edited and adjusted because as the nature of the outbreak changes, the things that you want to show on those pages also change.

There was a period where we were really focused on showing percentage change or the rate of change on the pages because things were shooting up in June. But now it’s more of a per capita kind of period where we want to show places where things are most severe.
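The two measures Tse contrasts can pull in different directions, as a small sketch shows. The county figures below are invented for illustration, and the function names percent_change and per_100k are hypothetical, not taken from the Times's code.

```python
def percent_change(previous, current):
    """Percentage change from one period to the next.

    Returns None when the earlier value is zero, since the change
    is undefined in that case.
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100


def per_100k(cases, population):
    """Cases per 100,000 residents."""
    return cases * 100_000 / population


# Invented numbers: (cases last week, cases this week, population).
counties = {
    "Fast-growing county": (50, 100, 1_000_000),
    "Hard-hit county": (5_000, 5_500, 100_000),
}

for name, (last_week, this_week, population) in counties.items():
    change = percent_change(last_week, this_week)
    rate = per_100k(this_week, population)
    print(f"{name}: {change:.0f}% change, {rate:.0f} cases per 100,000")
```

In this made-up example the first county doubles its case count but has few cases relative to its population, while the second grows slowly but is far more severely affected per resident. Which county stands out depends on the measure a page leads with.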

You’re constantly adjusting which graphs and charts and maps you’re showing. They need to be tailored to the moment.