Saturday, March 7, 2026
Beyond the Crime Scene

How We Reported on Rhetoric About Immigrants in the 2024 Election

November 20, 2024


For even a casual observer of Donald Trump’s electoral career, it is clear that the former, and future, president has centered his political project on an opposition to immigration. He has, over and over again, made inflammatory statements about immigrants — many of which collapse under scrutiny.

In an election year in which immigration was a critical issue for voters and popular anti-immigrant sentiment grew, our question became: How can we understand Trump’s immigration rhetoric in its full scope and importance, and how might we similarly interrogate Vice President Kamala Harris’ language?

The Marshall Project set out to tackle this question ahead of the 2024 election. Focusing on immigration — an area of public discourse rife with falsehoods, and an explicit cornerstone of Trump’s campaign — we decided to take a bird’s-eye view of each candidate’s comments on immigration over decades of public life available in Factba.se, a public database of presidential candidate statements.

Our goal was to process hundreds of thousands of lines of transcript text to pull out 1) how many of these statements were about immigration, 2) how many of those immigration statements were repetitions of the same idea and 3) how many of those repeated ideas were false.

Processing large amounts of information is an important challenge in journalism. Until recently, a processing task of this scale would have been abandoned as impossible: No reporter can realistically read and categorize 10 million words — we roughly estimated it would take the average reader around 700 hours. Enter natural language processing.
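The 700-hour estimate can be sanity-checked with simple arithmetic. The sketch below assumes an average reading speed of about 240 words per minute; that speed is an illustrative assumption, not a figure from our reporting.

```python
def reading_hours(words: int, words_per_minute: float = 240.0) -> float:
    """Hours needed to read `words` at a steady pace (speed is an assumption)."""
    return words / words_per_minute / 60.0


hours = reading_hours(10_000_000)
print(f"{hours:.0f} hours")  # prints "694 hours"
```

At that pace, reading alone would take roughly 17 full-time work weeks, before any categorizing or fact-checking.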

NLP is the use of computers to understand, process and generate text. Techniques like topic modeling, classification and clustering are long-established in computer science, and have recently become more accessible in less technical fields through increasing computational resources and improved interfaces. These methods can vastly increase reporters’ capacity to find and process the information they are looking for.

We used NLP techniques to evaluate the scale and nature of Trump’s and Harris’ immigration rhetoric, which were starkly different due to their approaches to immigration and the differing lengths of their candidacies. After scraping over 350,000 lines of text from almost 4,000 Factba.se transcripts dating from 1976 to the end of September 2024, we filtered and grouped the statements into those made by each of the candidates, and used a binary classifier to identify over 12,000 of them that were about immigration.

From there, we used a clustering algorithm to create groups of similar claims. Reporters manually reviewed the results, combining some clusters and splitting up others, refining them into a final set of major claims about immigration. We tailored a binary classifier for each one and ran it on the entire corpus, which gave us, for each claim, a set of statements likely to be making that claim. Finally, reporters manually reviewed each set of statements, confirming which ones were examples of that particular claim.

The result was a set of 13 claims carefully checked by reporters, and a lower bound for the number of times Trump has made some variation of each one. That allowed us to show he has repeated some of the claims at least 500 times.

For example, Trump has referred to unauthorized immigrants as criminals at least 575 times, as snakes that bite at least 35 times, as coming from prisons, jails and mental institutions at least 560 times and as causing crime in sanctuary cities at least 185 times. He has described the construction of a wall on America’s southern border as essential to public safety at least 675 times, and has argued at least 50 times that mass deportations are acceptable because President Dwight Eisenhower did it. We found all of these claims to be either entirely false or, at the very least, highly misleading.

In this project, we used basic, trusted NLP methods to pull meaningful findings out of a mountain of text. And you can do it, too. By arming yourselves with NLP techniques to simplify large datasets into subsets that are more manageable for human review and using them to set lower bounds (such as “at least 50 times”), reporters like you can become more efficient without sacrificing accuracy.

We collaborated with Robert Flagg, a data scientist and father of Anna Flagg, one of the project’s reporters. He designed and developed code for the analysis with The Marshall Project, and provided expert guidance on NLP to reporters.

Here are some more details about how we did it:

Scraping

We needed the raw data, so our first step was to scrape speech transcripts for the candidates from Factba.se. We obtained permission from Factba.se before scraping.

Using Python and the Beautiful Soup and Selenium libraries, we pulled down a list of both candidates’ speeches, interviews and other available transcripts from the Factba.se search results page, including the URLs of individual transcripts, which we also then scraped. Factba.se provides the transcripts conveniently broken up into small segments of text, usually one or two sentences, labeled by speaker. We counted each of those snippets of speech as one statement.
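As a rough illustration of turning a transcript page into speaker-labeled statements, here is a minimal sketch. It is not the scraper we ran: it uses Python's built-in html.parser instead of Beautiful Soup and Selenium so that it runs without extra dependencies, and the markup (the "segment", "speaker" and "text" class names) is hypothetical rather than the actual Factba.se page structure.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collects (speaker, text) pairs from a hypothetical transcript layout."""

    def __init__(self):
        super().__init__()
        self.statements = []  # finished (speaker, text) pairs
        self._field = None    # "speaker" or "text" while inside such a tag
        self._speaker = ""
        self._text = ""

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("speaker", "text"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "speaker":
            self._speaker += data
        elif self._field == "text":
            self._text += data

    def handle_endtag(self, tag):
        if self._field == "text":
            # Closing a text block completes one segment: emit one statement.
            self.statements.append((self._speaker.strip(), self._text.strip()))
            self._speaker, self._text = "", ""
        self._field = None


# Made-up sample markup; the real pages are structured differently.
SAMPLE = """
<div class="segment"><span class="speaker">Candidate A</span>
  <p class="text">We will talk about the border.</p></div>
<div class="segment"><span class="speaker">Interviewer</span>
  <p class="text">What about the economy?</p></div>
"""

parser = SegmentParser()
parser.feed(SAMPLE)
for speaker, text in parser.statements:
    print(speaker, "|", text)
```

Each emitted pair corresponds to one statement in the sense used above: a short, speaker-labeled snippet of a transcript.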

After scraping, the result was a dataset of public statements of the candidates, interviewers and other participants in the speaking events, as well as the date, location and other pieces of metadata about the event.

Next, we needed to pull out all the statements related to immigration. We decided to use a binary classification model, a method of categorizing data into one of two groups, because the nature of our problem was to label each statement as one of two things: about immigration, or not about immigration.

Such a classifier works by learning patterns from an initial “training set” of labeled data, which it can then apply to labeling new data. So we needed some labeled data — and a significant amount of it. But we didn’t want to spend weeks having humans label sample material. Instead, we ran a subset of the data through the large language models (LLMs) GPT-4o mini and Claude 3.5 Haiku, which we prompted to label each statement as either about immigration or not.

To improve the accuracy of the LLM responses, we used Clue and Reasoning Prompting, a method that requires the LLM to reason step-by-step by first generating a list of helpful hints, and then articulating a diagnostic reasoning process before making a judgment about whether the passage is or is not about immigration.
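A prompt in this style might look something like the sketch below. The wording of the template, the label strings and the helper names are all illustrative assumptions, not the prompts we actually sent, and the call to the LLM itself is left out.

```python
import re

# Hypothetical template: clues first, then reasoning, then a single label line.
CARP_TEMPLATE = """You are labeling statements from political transcripts.
Statement: "{statement}"

Step 1. CLUES: list keywords, phrases or context that hint at the topic.
Step 2. REASONING: using those clues, explain step by step whether the
statement is about immigration.
Step 3. Finish with exactly one line, either
LABEL: immigration
or
LABEL: not_immigration
"""


def build_prompt(statement: str) -> str:
    """Fill the template with one statement."""
    return CARP_TEMPLATE.format(statement=statement)


def parse_label(response: str):
    """Return True/False from the last label line, or None if none is found."""
    matches = re.findall(
        r"^LABEL:\s*(immigration|not_immigration)\s*$", response, re.MULTILINE
    )
    if not matches:
        return None
    return matches[-1] == "immigration"


reply = "CLUES: wall, border\nREASONING: mentions border policy.\nLABEL: immigration"
print(parse_label(reply))  # True
```

Returning None for malformed replies makes it easy to route those statements to a retry or to a human rather than silently guessing a label.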

Using the resulting labeled data as an initial training set, we fine-tuned a RoBERTa binary classifier, a state-of-the-art classification model. We ran the model on the overall unlabeled data. When the model expressed low confidence in its answer, reporters manually reviewed and provided labels, added the resulting labeled data to the training set and trained the model again. We repeated this cycle several times to improve the model’s performance, a technique known as active learning.
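The review-and-retrain cycle can be sketched as follows. This is a simplified outline under stated assumptions: the 0.8 confidence threshold, the function names and the callables train and ask_human are hypothetical stand-ins, and nothing here fine-tunes an actual RoBERTa model.

```python
def split_by_confidence(probabilities, threshold=0.8):
    """Split items into confident predictions and ones needing human review.

    `probabilities` maps an item to the model's probability that it is about
    immigration. Confidence is taken as max(p, 1 - p).
    """
    confident, needs_review = {}, []
    for item, p in probabilities.items():
        if max(p, 1.0 - p) >= threshold:
            confident[item] = p >= 0.5
        else:
            needs_review.append(item)
    # Most uncertain items (closest to 0.5) come first.
    needs_review.sort(key=lambda i: abs(probabilities[i] - 0.5))
    return confident, needs_review


def active_learning(train, labeled, unlabeled, ask_human, rounds=3, threshold=0.8):
    """train(labeled) returns a predict function; predict(text) returns a probability."""
    labeled = dict(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        predict = train(labeled)
        probs = {text: predict(text) for text in unlabeled}
        _, needs_review = split_by_confidence(probs, threshold)
        if not needs_review:
            break
        for text in needs_review:
            labeled[text] = ask_human(text)  # a reporter supplies the label
        unlabeled = [t for t in unlabeled if t not in labeled]
    return train(labeled), labeled


probs = {"s1": 0.97, "s2": 0.55, "s3": 0.08, "s4": 0.35}
confident, review = split_by_confidence(probs)
print(confident)  # {'s1': True, 's3': False}
print(review)     # ['s2', 's4']
```

Sending only the uncertain statements to reporters concentrates human effort on the cases where a new label is most likely to change the model.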

Clustering to identify major themes in immigration rhetoric

We hypothesized that many of the statements were repetitions of the same idea. So we needed a way to group together statements that were similar in meaning.

We turned to a common deep-learning tool known as a transformer, which works by representing input data as high-dimensional vectors. Transformers were introduced in “Attention Is All You Need,” a seminal paper by Google researchers that became a key building block in the field.

In our case, our input data was the statements. We used a sentence transformer to embed the statements in high dimensions, and the UMAP dimension reduction technique to create a simplified representation of each statement. We then clustered those into groups of related statements using DBSCAN.
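To give a feel for the clustering step, here is a small, dependency-free version of the DBSCAN algorithm. It is a teaching sketch, not the library implementation we used: the embedding and UMAP steps are skipped, the 2-D points are made up to stand in for reduced statement vectors, and the eps and min_samples values are arbitrary.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # one dense group
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),  # a second dense group
          (9.0, 0.0)]                          # an isolated point
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

A useful property for this kind of work is that DBSCAN does not need the number of clusters in advance and leaves isolated points unassigned instead of forcing them into a group.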

Human review

The goal of this analysis was to explore the universe of candidate statements about immigration, and report out the major themes we saw and how often they were repeated. Our findings needed to be 100% reported by humans. All our language processing was to get to the stage where reporters could step in with their expertise.

Reporters read statements from each cluster that had been highlighted. To aid this review, we again used an LLM, prompting it for a summary of each cluster based on its 10 most relevant statements as defined by the model’s reported level of confidence. We paired this information with WizMap, a tool used to visualize high-dimensional embeddings, which reporters used to see and explore the immigration statements.
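Picking the statements to summarize for each cluster could look roughly like this. The sketch assumes each statement already carries a cluster id and a confidence score; the row layout, the function names and the prompt wording are hypothetical.

```python
from collections import defaultdict


def top_statements(rows, k=10):
    """rows: iterable of (cluster_id, confidence, text).

    Returns {cluster_id: [text, ...]} with up to k texts per cluster,
    highest confidence first.
    """
    by_cluster = defaultdict(list)
    for cluster_id, confidence, text in rows:
        by_cluster[cluster_id].append((confidence, text))
    return {
        cid: [text for _, text in sorted(items, key=lambda x: x[0], reverse=True)[:k]]
        for cid, items in by_cluster.items()
    }


def summary_prompt(statements):
    """Build a summarization request for one cluster (illustrative wording)."""
    numbered = "\n".join(f"{n}. {s}" for n, s in enumerate(statements, 1))
    return "Summarize the common claim in these statements in one sentence:\n" + numbered


rows = [(0, 0.90, "a"), (0, 0.50, "b"), (0, 0.99, "c"), (1, 0.20, "d")]
print(top_statements(rows, k=2))  # {0: ['c', 'a'], 1: ['d']}
```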

Screenshot of WizMap, a tool to visualize high-dimensional vectors, showing the immigration statements data.

Reporters combined some clusters and split apart others. The computer-aided work made this process much more efficient, quickly surfacing themes and patterns from an otherwise overwhelming amount of text.

Counting statements for each claim

With our final set of immigration claims in hand, we again trained binary classifiers, one for each claim, to find statements that matched it. The process was similar to the one we used earlier: we fine-tuned each classifier on a set of statements labeled by an LLM and improved by human review.

Reporters then manually reviewed the statements returned by the classifier, sometimes amounting to hundreds of statements or more for a single claim. Any statement deemed not to strictly match the claim was thrown out. These false positives were more frequent for some claims than others, sometimes numbering in the hundreds.
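The bookkeeping behind an "at least" figure is simple, and a sketch makes the logic explicit: only statements a reporter confirmed are counted, so anything the classifier missed or a reporter rejected can only make the true total higher than the reported one. The function and field names below are hypothetical.

```python
def review_summary(flagged, decisions):
    """Tally reporter decisions for one claim.

    flagged: statement ids the classifier returned for the claim.
    decisions: id -> True (matches the claim) or False (false positive).
    """
    confirmed = [s for s in flagged if decisions.get(s) is True]
    rejected = [s for s in flagged if decisions.get(s) is False]
    unreviewed = [s for s in flagged if s not in decisions]
    return {
        "at_least": len(confirmed),       # the reportable lower bound
        "false_positives": len(rejected),
        "unreviewed": len(unreviewed),    # never counted toward the bound
    }


flagged = ["t1", "t2", "t3", "t4"]
decisions = {"t1": True, "t2": False, "t3": True}
print(review_summary(flagged, decisions))
# {'at_least': 2, 'false_positives': 1, 'unreviewed': 1}
```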

The result was a comprehensive list of major repeated claims about immigration driven by the candidates’ catalogs of immigration-related statements. For each of the Trump claims that we fact-checked, we had a set of up to hundreds of instances, all confirmed by human reporters.

For example, the model surfaced a pattern in Trump’s speeches of citing a group of isolated, tragic cases to allege that undocumented immigrants are killing Americans en masse. Reporters read all the statements classified in that category, throwing out any false positives, and found that Trump had made this claim more than 235 times.

Reporting with natural language processing

In this project we used classifiers, LLMs and clustering to narrow a large dataset of text, with human reporters guiding the process at strategic points, and ended with an entirely human-reported set of results.

We hope this work can be a useful reference for how reporting projects can use computers for something they are good at — processing lots of text — and humans for something they are good at — providing nuanced editorial judgment.

Right now, reporters have a great opportunity to use trusted NLP methods as a powerful tool to both expand and speed up their work. By mixing computer-aided techniques with traditional journalism, we are in a better position than ever before to tackle reporting problems that involve vast amounts of information, without sacrificing accuracy.


