The Quantifind approach
Quantifind’s unique expertise lies in leveraging artificial intelligence to derive insights from unstructured text found across vast public datasets. By combining source-based and text-based classification, we significantly broaden the range of useful sources and data while reducing the impact of fake news in our adverse media screening.
The first phase consists of source-level and domain-level screening. Our internal research identified over 1,500 domain sources of fake news, misinformation, and heavily biased content, drawing on professionally curated lists of unreliable media sources as well as our own analyst reviews. Between 1% and 3% of our news archive content is attributable to unreliable sources. By combining data-driven analysis with manual analyst review, Quantifind identifies the most harmful sources of fake news and filters them from our risk alerting. Our targeted review process identified 52 domains that are the worst propagators of fake and unreliable content and removed them entirely from our screening results.
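At its core, source-level filtering amounts to checking each result’s domain against a blocklist. The sketch below illustrates the idea only; the domain names and function names are hypothetical and not Quantifind’s actual list or implementation.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of unreliable domains (illustrative entries only).
UNRELIABLE_DOMAINS = {"example-fake-news.com", "spoofed-outlet.net"}

def is_blocked(article_url: str) -> bool:
    """Return True if the article's source domain is on the unreliable list."""
    domain = urlparse(article_url).netloc.lower()
    # Strip a leading "www." so www.example.com matches example.com.
    if domain.startswith("www."):
        domain = domain[4:]
    return domain in UNRELIABLE_DOMAINS

def filter_results(urls):
    """Drop screening results whose source domain is blocked."""
    return [u for u in urls if not is_blocked(u)]
```

In practice the blocklist would be maintained from curated lists and analyst review, as described above, rather than hard-coded.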
Document-Level Screening
The next phase is document-level screening using a text-based methodology, which complements the source-based approach in a few ways. Not all sources fit a binary label of reliable or unreliable; at the article level there is a spectrum, and a single source may mix thoroughly fact-checked news with less reliable content. We don’t want to remove all results from a given source when only a small percentage of its content is unreliable. In these cases, Quantifind can identify the specific problem articles instead of labeling only at the source level.
Another motivation for document-level analysis is the risk of domain name changes or URL spoofing. Sites propagating fake news are known to change their domain once a fact-checking outlet has identified them, and URL hijacking is another known tactic for making disinformation appear to be legitimate news from a credible source. The advantage of document-level detection is that it evaluates the content itself, not just the source.
Quantifind has developed AI-driven models and algorithms to classify documents based on textual features. The first model focuses on identifying editorial and opinion pieces, which are less relevant to adverse media screening. Using a rules-based approach to parsing article URLs, we characterized over 4,000 documents as opinion pieces with greater than 90% precision. This strategy can be used to label documents as opinion so that an investigator has that context when reviewing an article.
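A rules-based URL parser of this kind can be as simple as matching common path segments that news sites use for editorial content. The patterns below are illustrative assumptions, not Quantifind’s actual rules.

```python
import re

# Hypothetical URL-path segments that commonly mark editorial/opinion content.
OPINION_PATTERNS = re.compile(
    r"/(opinion|op-ed|editorial|commentary|column)s?/", re.IGNORECASE
)

def label_opinion(url: str) -> bool:
    """Return True if the article URL looks like an opinion/editorial piece."""
    return bool(OPINION_PATTERNS.search(url))
```

Because publishers organize opinion sections under predictable paths, even a small rule set like this can achieve high precision, at the cost of missing opinion pieces hosted under other paths.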
The second model identifies patterns in the text of articles that distinguish reliable from unreliable content. It uses syntactic features such as words per sentence and readability, lexical features such as the types of pronouns used, and psycholinguistic features such as the degree of positive or negative emotional language. The power of a machine learning algorithm is that, by combining many features that are each only slightly predictive, it can discover accurate patterns that would otherwise go unnoticed.
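To make the three feature families concrete, the sketch below extracts one toy feature of each kind and combines them in a linear score. Everything here is a stand-in: the word lists, feature definitions, and weights are invented for illustration and are not Quantifind’s model.

```python
import re

def extract_features(text: str) -> dict:
    """Toy versions of the syntactic, lexical, and psycholinguistic features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    first_person = {"i", "we", "me", "us", "my", "our"}          # lexical: pronoun use
    emotive = {"shocking", "outrageous", "amazing", "terrible",
               "unbelievable"}                                    # psycholinguistic
    n_words = max(len(words), 1)
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),          # syntactic
        "first_person_rate": sum(w.lower() in first_person for w in words) / n_words,
        "emotive_rate": sum(w.lower() in emotive for w in words) / n_words,
    }

# A linear model combines many weakly predictive features; these weights are
# made up for the sketch (a real model would learn them from labeled data).
WEIGHTS = {"words_per_sentence": -0.02, "first_person_rate": 1.5, "emotive_rate": 4.0}

def unreliability_score(text: str) -> float:
    """Higher scores suggest less reliable, more emotive writing."""
    feats = extract_features(text)
    return sum(WEIGHTS[k] * v for k, v in feats.items())
```

A production system would use many more features and a trained classifier, but the principle is the same: no single feature is decisive, yet their combination separates the two classes.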
AI-Powered Fake News Detection
Quantifind’s models correctly characterize fake news on a representative set of labeled training data. These results illustrate how detectable patterns distinguish reliable from unreliable content. In this example, the “fake” category includes content that is fake, heavily biased, unreliable, or clickbait.
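Characterizing performance on labeled data typically comes down to precision and recall over the “fake” label. A minimal computation, assuming binary 0/1 labels (1 = fake/unreliable), might look like this; it is a generic metric sketch, not Quantifind’s evaluation code.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for a binary 'fake' label (1 = fake/unreliable)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision matters most in screening contexts like this one, since a false “fake” label would suppress a legitimate adverse media result.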