Training a Deep Learning Algorithm with Text: Generic v. Company-Specific Training Data

Why wait until the end? Here’s the one-sentence “takeaway” from this blog:

“It’s a far, far better thing we do, when thinking of a prospective customer, to extract the factual allegations from 400 cases (if there are that many) filed against them.”

...

In a recent hypothetical exchange (i.e., I’m having this conversation with myself), I spoke with the CTO of a leading company in the tech sector. He had become interested in AI, specifically “neural networks.”

He explained a chicken-and-egg problem in a way I could understand. Previously, venture capital people had explained it differently. To them, the egg was money (theirs), and the chicken was customers.

To the CTO, the chicken was (again) the customer. But this time the egg was the “training data.”

How can a risk analysis system be created to help a customer avoid a risk, e.g., the risk of litigation, unless the customer is supplying the eggs?

But, understandably, customers don’t want to supply their risky data to a startup unless they’re first persuaded that the system will work. And without the training data from the customer, how can the startup start?

Tough question, isn’t it?

“Let me explain,” the CTO said. “I’m playing the role of your buyer. I’d like to see you provide proof that the way you use text for risk analysis actually works. One kind of proof is performance against a standard dataset like MNIST, the set of digits as images, but nothing like that exists for words. Another is cheers from your customers, and you don’t have any yet. But you still have to be able to tell me, as a prospective investor or business partner, about the verifications you’ve received from other customers. How will you do that?”

 

MNIST, Images & CNNs: Oh My!

As a preliminary matter, it is clear that Deep Learning neural networks (the street name for multi-layer neural networks) have accomplished amazing feats in only the last few years.  

For example, Deep Learning neural networks taught themselves to play Atari games from raw screen images, reaching or surpassing human-level performance on many of the games DeepMind tested in 2014 and 2015. After Google acquired DeepMind, Google’s AlphaGo prevailed over human champions in the Asian game of Go (2016 and 2017), and Carnegie Mellon created a system that defeated two different teams of human champions in the computer version of Heads-Up, No-Limit Texas Hold’em poker (both matches in 2017).

And driverless cars are functional because the input consists of images taken in through cameras and Lidar, which are connected to the gas, brake, and steering mechanisms. But the brain between the cameras and those mechanisms is a Deep Learning brain.

These feats have attracted a great deal of attention. But the neural networks behind them were Convolutional Neural Networks (CNNs).

In a computer, images are built from pixels, and pixels are just numbers. In this context, MNIST is a standard benchmark for assessing how well those images have been processed. “The MNIST database of handwritten digits … has a training set of 60,000 examples, and a test set of 10,000 examples.” (See http://yann.lecun.com/exdb/mnist/).

And CNNs are typically used for processing images (see https://www.tensorflow.org/get_started/mnist/beginners and https://www.tensorflow.org/get_started/mnist/pros).
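To make this concrete, here is a minimal sketch, in the spirit of those tutorials, of a small CNN trained on MNIST using the Keras API. The layer sizes and the single training epoch are illustrative choices of mine, not Intraspexion’s code.

```python
# Minimal sketch of an MNIST CNN; layer sizes and epochs are illustrative.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # 28x28 grayscale pixels -> floats in [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))
```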

Words are different. Generally speaking, words are best processed with models that “remember” what came before in a sequence, such as Recurrent Neural Networks (RNNs).
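As a rough illustration, here is what a small RNN-based text classifier might look like in the same Keras API. The vocabulary size, layer sizes, and the binary “risky vs. not risky” label are my assumptions, not a description of Intraspexion’s model.

```python
# Sketch of an RNN (LSTM) classifier that reads a sequence of word IDs and
# predicts a risk label. Vocabulary size and layer sizes are illustrative.
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed size of the word index

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),      # word IDs -> word vectors
    tf.keras.layers.LSTM(64),                        # "remembers" earlier words
    tf.keras.layers.Dense(1, activation="sigmoid"),  # risky vs. not risky
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```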

 

The Distribution Problem

And, as the imaginary CTO had said, there is no standard like MNIST for words. (Common Crawl, at http://commoncrawl.org/, is an open repository of web crawl data, including text, and it is a massive bucket of words. However, it is not a standard.)

“Look at it this way. Compare the distribution of training data with the distribution of actual data. They’re often different, right? How can you minimize that difference, either to start with, or over time? And then you can tune your algorithm or find even better training data.”
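The CTO didn’t prescribe a method, but one way to make his question measurable is to compare the word-frequency distribution of the training data with that of the live data, for example with KL divergence. A minimal sketch, with deliberately naive tokenization and made-up documents:

```python
# Sketch: compare the word-frequency distribution of training documents with
# that of live documents using KL divergence. Tokenization is deliberately
# naive, and the document lists are made up.
from collections import Counter
import math

def word_distribution(docs, vocab):
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts[w] for w in vocab)
    # Add-one smoothing so no word has zero probability.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl_divergence(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

training_docs = ["plaintiff alleges discrimination based on age"]
live_emails = ["please review the quarterly budget before the meeting"]

vocab = {w for d in training_docs + live_emails for w in d.lower().split()}
p = word_distribution(training_docs, vocab)
q = word_distribution(live_emails, vocab)
print("KL(training || live) =", kl_divergence(p, q))  # larger = bigger mismatch
```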

 

Generic Training

This problem of fitting a distribution of training data to a distribution of actual data was not unfamiliar to me. In U.S. Pat. No. 9,552,548 for “Using Classified Text and Deep Learning Algorithms to Identify Risk and Provide Early Warning,” issued on January 24, 2017, I explained that part of the process would involve converting words into numbers. The procedure involves using a word-embedding model such as word2vec or GloVe. These models convert words into strings of numbers or, as is more technically correct, “word vectors.” (For the point of this article, I need not discuss this conversion process further.)

My thought was that the word vectors of some training data could train a Deep Learning algorithm, and that the algorithm would then pattern-match against the word vectors extracted from internal enterprise communications, e.g., emails (of course). This pattern-matching is an A-to-B “mapping,” via the algorithm, from the training data to the data being tested.
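As a minimal sketch of that A-to-B mapping, here the algorithm is a simple logistic-regression classifier, and a tiny made-up embedding table stands in for word2vec or GloVe; none of the names or numbers come from our system.

```python
# Sketch of the A-to-B mapping: average word vectors for each document,
# fit a classifier on labeled training documents (A), then score emails (B).
# The embedding table and documents are toy stand-ins for word2vec/GloVe data.
import numpy as np
from sklearn.linear_model import LogisticRegression

EMBEDDINGS = {                      # in practice: loaded from word2vec or GloVe
    "discrimination": np.array([0.9, 0.1]),
    "terminated":     np.array([0.8, 0.2]),
    "budget":         np.array([0.1, 0.9]),
    "meeting":        np.array([0.2, 0.8]),
}

def doc_vector(text):
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

train_docs = ["discrimination terminated", "budget meeting"]
train_labels = [1, 0]               # 1 = risky, 0 = not risky

clf = LogisticRegression().fit([doc_vector(d) for d in train_docs], train_labels)

emails = ["meeting about the budget", "terminated after a discrimination complaint"]
print(clf.predict_proba([doc_vector(e) for e in emails])[:, 1])  # risk scores
```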

The CTO’s remark, which focused on “the deepest issue we run into,” made me think I had missed something.

Yes, as I will explain, and show, I had.

For risk, where did we get the risky words for training the RNNs we used? There were two sources: words from PACER (the acronym for the federal judicial database, Public Access to Court Electronic Records) and words from customers.

However, for starters, we used only the PACER data, and we used it in a generic sense. As the CTO knew, PACER data comes in categories, and it’s probably no surprise that many lawsuits are filed in each category. Many categories are not business-relevant, but in the business-relevant categories there’s no shortage of litigation.

Here are the categories and number of lawsuits for 2016:

 

[Figure: PACER Nature of Suit categories and the number of lawsuits filed in each category during 2016]

For Intraspexion’s initial model of a system to provide early warning of a litigation risk, we had chosen Civil Rights-Jobs (red). There are no sub-classifications, so we didn’t focus on whether the discrimination was based on age, race, sex, national origin, or any of the other types of discrimination. The “classification” was simply “discrimination.”

We wanted to see how much data we would need to find the risky needles in the haystack. So we decided to train with 50, then 100, then 150, then 200, and, finally, 400 “examples” of those types of lawsuits.
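That experiment is essentially a learning curve: train on progressively larger slices of labeled examples and watch a held-out score. A sketch of the loop, where the model-building callable and the example arrays are hypothetical placeholders rather than our pipeline:

```python
# Sketch of the "how many examples do we need?" experiment. The build_model()
# callable and the example/label arrays are hypothetical placeholders.
def learning_curve(examples, labels, build_model, x_val, y_val):
    scores = {}
    for n in (50, 100, 150, 200, 400):
        model = build_model()                # fresh model for each slice
        model.fit(examples[:n], labels[:n])  # train on the first n examples
        scores[n] = model.score(x_val, y_val)
    return scores
```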

Notice, I did not focus on the identity of the defendant. I had not connected “defendant” to “prospective customer,” but that connection was now dawning on me.

We moved forward without that insight. We trained a Deep Learning model on those examples and had it look at internal enterprise emails and “pattern-match” them against what it had learned. Because we had some initial success right off the bat (with just 50 examples), I built the Intraspexion team, and they built a functional system.

 

Specific Training: Part I

However, I got one aspect of the “defendant/prospective customer” connection right. To get the customer words we didn’t yet have, I saw the need for a feedback loop once the system was deployed. And so a feedback loop is part of my core patent and our system.

Here’s the loop. When a user sees an email that’s flagged as risky, it can be “accepted” as a true positive or “rejected” as a false positive. Both sets of tags will be valuable once they’re aggregated and fed back to us by the customer, and we fold them into the training set.

Then, as the size of the training set grows and the customer inputs begin to dominate, the training set will more accurately reflect the company’s own culture, and the “distribution of the actual data,” as the CTO had put it.
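A minimal sketch of that feedback loop, with class and field names of my own invention rather than the patented implementation:

```python
# Sketch of the feedback loop: user verdicts on flagged emails are collected
# and folded back into the training set. Names and structure are illustrative.
from dataclasses import dataclass, field

@dataclass
class TrainingSet:
    texts: list = field(default_factory=list)
    labels: list = field(default_factory=list)   # 1 = risky, 0 = not risky

    def add_feedback(self, email_text: str, accepted: bool) -> None:
        # "Accepted" = true positive (risky); "rejected" = false positive.
        self.texts.append(email_text)
        self.labels.append(1 if accepted else 0)

training_set = TrainingSet()
training_set.add_feedback("we should settle before she files a complaint", accepted=True)
training_set.add_feedback("reminder: the quarterly budget review is Friday", accepted=False)
# Periodically, the model is retrained on training_set.texts / training_set.labels.
```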

 

NEW – Specific Training: Part II

Let’s go back. All of the foregoing is throat-clearing.

Ahem.

For the realization sparked by the CTO’s challenge, we go back to PACER.

In PACER, there is a functionality that goes unused and unappreciated by the vast majority of PACER users. I have used it to write about how a corporate legal department could see its own “litigation risk profile.”

For Xerox in 2014, I used the PACER filter (in 2016) to create this profile:

[Figure: PACER-derived litigation risk profile for Xerox, by Nature of Suit code]

What I didn’t realize before is that, instead of training generically on Nature of Suit (NOS) code 442 (Civil Rights-Jobs), Intraspexion could train on the discrimination lawsuits previously filed against a prospective customer. Note that, in the spreadsheet above, from 2009 through 2013, there were 31 cases of the Nature of Suit code 442 (discrimination) variety.

Let’s pretend that Xerox is a prospective customer.  We’ll use PACER, expand the date range to 10 years, from January 1, 2007 to December 31, 2016, and find these specific cases.

The result (shown below) is a list of 64 cases where, in each instance, Xerox was a defendant.

Before I took the screen shot you’ll see below, I clicked on Date Filed. When I did that, PACER put these 64 cases in date order, with the oldest case first. (With another click, PACER would show the most recent cases first.) PACER is nice that way.
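If that case list is exported from PACER and saved locally, the same filter and sort can be reproduced in a few lines. A sketch, where the file name and column names are assumptions about the export format rather than PACER’s actual field names:

```python
# Sketch: reproduce the PACER filter locally on an exported case list.
# The CSV file name and column names are assumed, not PACER's actual fields.
import pandas as pd

cases = pd.read_csv("xerox_cases.csv", parse_dates=["date_filed"])

nos_442 = cases[
    (cases["nos_code"] == 442)                 # Civil Rights-Jobs
    & (cases["party_role"] == "dft")           # Xerox as defendant
    & (cases["date_filed"] >= "2007-01-01")
    & (cases["date_filed"] <= "2016-12-31")
].sort_values("date_filed")                    # oldest case first

print(len(nos_442), "discrimination cases")
```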

But note also that each case number is permanent and is an active link. By clicking on a case number, one can get to the case itself and, most importantly, to the complaint.

Typically, PACER logs complaints as Document 1. The Civil Cover Sheet, where a plaintiff is required to select the “one and only one” category that best describes the case, regardless of the number of claims, is not given its own document number. To PACER, the Civil Cover Sheet is just for statistics, and so it is tracked as being related to the complaint and usually appears as 1-1.

Once the complaint is accessed, the factual allegations can be extracted in a programmatic way. And now I’m getting ahead of myself. I’ll get there, but first let’s see that PACER screen shot of 64 Xerox-specific factual examples, most of which will be attorney-vetted. Here’s the first page of the PACER screen shot:

[Figure: first page of the PACER results listing the 64 cases in which Xerox was a defendant]
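As a preview of that programmatic extraction, here is a rough sketch that pulls a factual-allegations section out of a complaint’s text. The heading patterns and the sample text are mine; real complaints vary widely in formatting, so a production extractor needs far more than this.

```python
# Rough sketch: pull the factual-allegations section out of a complaint's
# extracted text. Section headings vary by complaint, so a real extractor
# needs more heading patterns (and better PDF-to-text) than this.
import re

STOP_HEADINGS = r"(?:\w+\s+)?CAUSES?\s+OF\s+ACTION|CLAIMS?\s+FOR\s+RELIEF|PRAYER\s+FOR\s+RELIEF"

def factual_allegations(complaint_text: str) -> str:
    match = re.search(
        rf"FACTUAL ALLEGATIONS(.*?)(?:{STOP_HEADINGS})",
        complaint_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(1).strip() if match else ""

sample = """
FACTUAL ALLEGATIONS
1. Plaintiff was employed by Defendant from 2010 to 2014.
2. Plaintiff was terminated shortly after filing an internal complaint.
FIRST CAUSE OF ACTION
"""
print(factual_allegations(sample))
```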

Now, whether a company experiences more or fewer discrimination lawsuits (in federal court) depends in part on the type of company it is.

Xerox is one type of company. Wal-Mart, the largest private employer in the country, is another.

From January 1, 2012 through October 17, 2016, Wal-Mart experienced a wide variety of lawsuits. And, as a different screen shot using a different filter shows, the number of NOS 442 cases was much higher: 826.

In the screen shot below, look about halfway down for “442 (826).”

[Figure: PACER filter results for Wal-Mart showing each Nature of Suit code and the number of lawsuits filed, January 1, 2012 through October 17, 2016]

Note also that this particular filter shows each NOS code and the number of lawsuits filed. That’s the beginning of a risk profile for Wal-Mart. Of course, the profile is for federal court lawsuits only, and doesn’t include lawsuits filed in state courts, but it displays the classifications and how frequently they occur.

For example, NOS 445 is the NOS code for an Americans with Disabilities Act case. There were 187 of them. 
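Tallying those NOS codes into the beginnings of a profile is straightforward once the case list is in hand. A sketch with made-up case records:

```python
# Sketch: build the beginning of a risk profile by counting Nature of Suit
# codes across a company's federal cases. The case records here are made up.
from collections import Counter

cases = [
    {"nos": 442, "caption": "Doe v. Wal-Mart"},
    {"nos": 442, "caption": "Roe v. Wal-Mart"},
    {"nos": 445, "caption": "Smith v. Wal-Mart"},
]

profile = Counter(case["nos"] for case in cases)
for nos_code, count in profile.most_common():
    print(f"NOS {nos_code}: {count} lawsuits")
```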

 

The Complaint

Let’s go back to the Xerox screen where the date range is 01/01/2014 to 12/31/2014. 

[Figure: PACER results for Xerox, date range 01/01/2014 to 12/31/2014]

Number 4 is noticeable because “XEROX BUSINESS SERVICES (dft)” is in all-caps format.

When I click on the number under “Case,” i.e., case number 3:2014-cv-00089, PACER takes us here:

[Figure: PACER page for case number 3:2014-cv-00089]