How Bloomberg Law uses AI-driven data analysis to tackle big stories

Last fall, Bloomberg Law set out to track the impact of the Supreme Court’s 2021 ruling in Brnovich v. Democratic National Committee.

Following a months-long investigation, a team of more than a dozen Bloomberg Law journalists revealed that the ruling significantly weakened Section 2 of the Voting Rights Act, making it harder to claim a voting practice is discriminatory. To produce the deep dive, the newsroom relied on AI analysis of large sets of data, as outlined in the “data methodology” section at the end of the story.

The Institute asked Bloomberg Industry Group Data Editor Andrew Wallender and Investigations Editor Gary Harki to shed light on this process and explain how other newsrooms can replicate this type of data journalism. (Wallender recently spoke to journalists and communicators at the Institute’s “Intro to AI for Journalists” workshop at the National Press Club on May 16.)

This interview has been lightly edited for length and clarity.

Bloomberg Law’s story on voting rights claims came out in February of this year. How long did this project take from idea to publishing date?

Harki: The idea came out of a conversation with our editor-in-chief, Cesca Antonelli, in September or October. We published in February. The timeline had to do with our work using large language models (LLMs) to help create the dataset, which started out as an experiment.

We used the LLM primarily to answer a series of yes-or-no questions to create a dataset that produced the data points in the story. It was a big endeavor, and we learned a lot about using LLMs.

The type of analysis we did here was very difficult because the data needed to be so precise. We weren’t just counting things, looking to get an “at least” number of how many times certain cases were cited or other things happened. We wanted precise numbers and percentages so we could look at change over time. That, combined with using an LLM to create the dataset, was a lot of work.

The story mentions that “Bloomberg Law looked at 579 federal voting rights complaints.” What AI tools were used to complete that analysis?

Wallender: We decided to use OpenAI’s GPT-4o model for this project after testing several other large language models. Our initial testing found 4o to be the best all-around LLM to categorize this kind of document and extract information at the time we did the analysis. That ended up being the main AI tool.

Another critical tool was a feature built into GPT-4o called structured outputs. This allowed us to define the exact format of the responses from the LLM. Knowing how the answers would be formatted meant that we could reliably turn the LLM’s responses into structured data that we could analyze. Before using structured outputs, answers would be formatted in different ways despite attempts to prevent that in the prompt. Those formatting irregularities would break the processing pipeline.
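To illustrate why a fixed response format matters, here is a minimal sketch of checking a response against a schema before it enters a dataset. It does not call any model API. The question keys (`is_redistricting`, `cites_section_2`) and the function name `parse_answer` are hypothetical; the interview does not give the newsroom's actual questions or code.

```python
import json

# Hypothetical yes/no questions, written as a JSON Schema-style dict.
# The real questions used in the project are not stated in the interview.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "is_redistricting": {"type": "boolean"},
        "cites_section_2": {"type": "boolean"},
    },
    "required": ["is_redistricting", "cites_section_2"],
    "additionalProperties": False,
}


def parse_answer(raw: str) -> dict:
    """Parse one model response and reject anything that is off-schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    expected = set(ANSWER_SCHEMA["required"])
    if set(data) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ expected)}")
    for key, value in data.items():
        # Only true/false is accepted; strings such as "yes" are rejected.
        if not isinstance(value, bool):
            raise ValueError(f"{key} must be true or false, got {value!r}")
    return data
```

A response such as `{"is_redistricting": true, "cites_section_2": false}` parses into a row ready for analysis, while a free-text answer like `"yes"` raises an error instead of silently corrupting the data.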

We queried the LLM in bulk using scripts written in Python. SQLite databases were another very handy tool to keep track of LLM responses and store metadata about each LLM query. For the actual analysis of the responses, we used R.
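The workflow Wallender describes, a Python script that queries in bulk and logs each response with metadata in SQLite, might look roughly like the sketch below. The table layout, the column names, and the `fake_llm` stand-in are all assumptions made for illustration; a real script would replace `fake_llm` with an actual model call.

```python
import json
import sqlite3
from datetime import datetime, timezone


def fake_llm(complaint_text: str) -> str:
    """Stand-in for a real model call (hypothetical); returns a JSON string."""
    return json.dumps({"is_redistricting": "district" in complaint_text.lower()})


def init_db(conn: sqlite3.Connection) -> None:
    """Create a table holding one response per case and prompt version."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS llm_responses (
               case_id TEXT NOT NULL,
               prompt_version TEXT NOT NULL,
               model TEXT NOT NULL,
               response_json TEXT NOT NULL,
               queried_at TEXT NOT NULL,
               PRIMARY KEY (case_id, prompt_version)
           )"""
    )


def run_batch(conn, complaints, prompt_version, model="stub-model", ask=fake_llm):
    """Query each complaint once per prompt version, skipping stored ones."""
    new_rows = 0
    for case_id, text in complaints.items():
        seen = conn.execute(
            "SELECT 1 FROM llm_responses WHERE case_id = ? AND prompt_version = ?",
            (case_id, prompt_version),
        ).fetchone()
        if seen:
            continue  # already answered under this prompt version
        conn.execute(
            "INSERT INTO llm_responses VALUES (?, ?, ?, ?, ?)",
            (case_id, prompt_version, model, ask(text),
             datetime.now(timezone.utc).isoformat()),
        )
        new_rows += 1
    conn.commit()
    return new_rows
```

Keying rows on the case and the prompt version means a rerun after a crash skips finished cases, and responses from different prompt drafts can be compared side by side.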

People might assume AI integration in journalism means fewer people are involved in the work, but we counted 19 reporters, editors, and artists involved in this story in some capacity. That’s an impressive number! How did you divide up the work, and why were that many individuals necessary to complete this project?

Harki: It took a team of journalists to do the story because we used AI. The story had six bylines and every one was very much deserved. Alex Ebert and Kimberly Robinson, who cover federal courts and the Supreme Court, served as our in-house experts on the law. Diana Dombrowski, an investigative reporter, was lead writer and did much of the work forming the questions we asked the LLM to create the dataset. About 20 people from around the newsroom helped us validate the data, making sure the LLM’s answers were correct. In the grand tradition of newsrooms everywhere, we fed the staff pizza in thanks for the work.

Umar Farooq, Sophie Will, and Jon Meltzer were involved in both LLM and data work, and Meltzer created the graphics and presentation with our art director, David Evans. We had two outside journalists advising us as well. Derek Willis, who has worked at ProPublica and the New York Times, among other places, and is now a professor at the University of Maryland teaching data and computational journalism, was our advisor on all things LLM. Seamus Hughes is a well-known expert on the federal court system and was a big help anytime we had to puzzle out something dealing with federal court filings.

So that’s all the work on just creating the dataset using the LLM. And then, of course, all the normal reporting things had to happen once we had the results. Several engineering teams outside of the newsroom that work on Bloomberg Law’s dockets database were also a big help. One team pulled the data from PACER and made sure we had all the cases; another helped us get the initial complaints scanned so we could use an LLM to read them. The story had us working across teams. Bloomberg is a big place, and it’s amazing we have access to all of these resources and talented people.

This deep dive seems like a major endeavor and a somewhat high-stakes use of AI technology. Did your team test out LLM analysis on smaller projects before getting to this one?

Wallender: This was the first project we’ve done like this in our newsroom. So we put a lot of work into testing and validating our methods. We also leaned heavily on experts and academics to vet our approach.

One of the most time-consuming parts of this project was the prompt engineering. We went through several rounds of prompts that we iterated on during several weeks of testing. A major focus was whether to use open-ended responses or binary yes/no responses. In the end, the binary responses were a better approach for this story but meant we had to be really precise with our language. For instance, it wasn’t enough to ask the LLM to identify “redistricting” cases. We had to define what those looked like and specify that our definition also included challenges to at-large voting districts. Little changes in language could have big effects on the output.
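One way to keep definitions explicit is to store each yes/no question alongside its definition and build the prompt from both. The sketch below is an assumption about how that could be organized, not the newsroom's actual prompt. The `Question` class, the `build_prompt` function, and the wording of the definition are hypothetical; only the inclusion of at-large district challenges comes from the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str         # column name in the resulting dataset
    text: str        # the yes/no question itself
    definition: str  # what counts as "yes", spelled out


# Illustrative wording only; the real definition is not quoted in the interview.
REDISTRICTING = Question(
    key="is_redistricting",
    text="Is this a redistricting case?",
    definition=(
        "Count a case as redistricting if it challenges how district lines "
        "are drawn, including challenges to at-large voting districts."
    ),
)


def build_prompt(question: Question, complaint_text: str) -> str:
    """Combine the question, its definition, and the complaint text."""
    return (
        "Answer with only 'yes' or 'no'.\n"
        f"Question: {question.text}\n"
        f"Definition: {question.definition}\n"
        "Complaint:\n"
        f"{complaint_text}"
    )
```

Keeping the definition in one place makes each revision a single, trackable change, which helps when small wording shifts move the results.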

We conducted several validation sessions where we took a random sample of the LLM responses and determined how accurate they were. This testing was an incredibly important part of the process and gave us confidence in our final results.
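A validation session of this kind can be sketched in a few lines: draw a reproducible random sample of cases, then compare the LLM's answers with human answers for that sample. The function names and the seed value are assumptions; the interview does not describe sample sizes or the code used.

```python
import random


def draw_sample(case_ids, k, seed=2024):
    """Pick k cases at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    # Sorting first means the result does not depend on input order.
    return rng.sample(sorted(case_ids), k)


def accuracy(llm_answers, human_answers, sample):
    """Share of sampled cases where the LLM matches the human answer."""
    if not sample:
        raise ValueError("sample is empty")
    matches = sum(llm_answers[c] == human_answers[c] for c in sample)
    return matches / len(sample)
```

Fixing the seed lets a second person redraw the same sample and check the accuracy figure independently.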

Although this was our newsroom’s first major AI-assisted investigation, we’re hopeful we’ll have more projects like this in the near future.

What ethical considerations were made when using a large language model (LLM), GPT-4o, to do this reporting? Does Bloomberg Law have an AI ethics code?

Wallender: We approached this project with a healthy dose of caution. We know AI can hallucinate or misinterpret questions, so we had to be completely confident in the LLM’s output.

Part of the reason the story took so long was the amount of time we spent repeatedly testing and re-writing the prompts to ensure accuracy. It was a good exercise in being really deliberate with the language we use to define certain attributes. Checking the LLM for accuracy forced us to see our story in a new light by making us break down each definition we used and contemplate how we wanted to categorize cases that didn’t exactly match the initial definitions we established.

We set predefined thresholds for accuracy so that we could be confident in the figures cited in the story. We then had a small army of reporters offer up their time to read through a random sample of cases and answer the same questions we asked the LLM, so we could assess accuracy. At least two reporters provided responses for each case. When people disagreed on answers, we scrutinized the responses and rationale to see whether there was a mistake or whether we needed to discuss the correct answer as a group.
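The double-review step can be sketched as follows: separate the cases where two reviewers agree from those that need discussion, then score the LLM against the agreed answers and compare the result with a preset threshold. The 0.95 threshold and all names here are hypothetical; the interview does not state the actual thresholds.

```python
ACCURACY_THRESHOLD = 0.95  # hypothetical value; the real threshold is not given


def split_by_agreement(reviewer_a, reviewer_b):
    """Return (agreed answers, case IDs the two reviewers dispute)."""
    agreed, disputed = {}, []
    for case_id in sorted(reviewer_a.keys() & reviewer_b.keys()):
        if reviewer_a[case_id] == reviewer_b[case_id]:
            agreed[case_id] = reviewer_a[case_id]
        else:
            disputed.append(case_id)  # flag for group discussion
    return agreed, disputed


def meets_threshold(llm_answers, consensus, threshold=ACCURACY_THRESHOLD):
    """Score the LLM against consensus answers and test the threshold."""
    if not consensus:
        raise ValueError("no consensus answers to score against")
    correct = sum(llm_answers[c] == answer for c, answer in consensus.items())
    score = correct / len(consensus)
    return score, score >= threshold
```

In this sketch, disputed cases stay out of the score until the group settles them, so a reviewer mistake cannot be counted as an LLM error.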

Thankfully, our newsroom and company as a whole are no strangers to AI. There are a number of people at Bloomberg Industry Group and Bloomberg LP who have been working with AI for years who offered us advice and best practices. Our company also has an extensive generative AI policy and requires that any use of the technology, including for this story, be approved by a committee of reviewers.

What are some of your main takeaways from experimenting with using AI for data analysis in this story?

Harki: We learned a lot of lessons doing this story, the first of which is that using an LLM to create data from even a small subset of court documents is really challenging. You have to consider the consistency with which documents are filed, how they show up in the court’s filing system and the vagaries of how lawyers draft the documents themselves. It’s doable, but it’s a lot of work, and you really have to think through every step.

The LLM was very useful, but it’s not going to tell you how a certain court files a certain type of case. You need a reporter for that. It will also answer the question you ask consistently only after a lot of trial and error. It’s a tool. It can help with the analysis, but ultimately it can’t write analytical stories like this.
