eDiscovery tools and techniques - Knowledge Discovery in lawyer's clothing

Posted by K Krasnow Waterman on Sun, Feb 24, 2008 @ 12:02 PM

eDiscovery was the hands-down favorite at LegalTech, a huge legal industry expo and conference held in New York a few weeks ago.  Although some lawyers have been asking for opposing parties' electronic records during litigation for years, the Supreme Court only implemented a rule on the subject in December 2006.  The rule applies to lawsuits in federal court and uniformly places the burden on parties to proactively seek out and turn over relevant digital records from their repositories early in the case.  Well over half the vendors at LegalTech had an eDiscovery spin to their offerings.

As someone who has been both a trial attorney and the manager of large data systems, I was somewhat bemused by the marketing efforts.  Nearly every vendor's representative told me his or her offering was "unique."  When questioned, almost none could say what made it unique or what the software really does.  Some had systems engineers on hand from whom I could glean more specific information.

Based upon my conversations with the sales personnel and their systems engineers, eDiscovery is nothing more than Knowledge Discovery (a pre-existing and still rapidly growing field of information technology) in lawyer's clothing.  There's nothing fundamentally wrong with that.  Lawyers shouldn't have to learn a whole new technology lexicon to do their work; it's appropriate for the vendors to speak in terms that are relevant to the customer.  However, since most of the sales reps had a black-box, it's-magic sort of presentation, I think a little explanation from the perspective of someone who has been working with these issues since long before last year might be useful.  Lawyers need to know that not every product offers the same function or the same quality.

1) Preservation

Once a company knows that it is being sued or is filing suit, all relevant records have to be preserved.  This means making sure that no one changes or deletes data that is potential evidence or that could lead to evidence.

One of the hard questions is deciding where relevant data might reside. From the hardware perspective, company servers are usually an obvious place to start, but it may also be necessary to reach out to desktops, laptops, phones, PDAs, and other employee devices. Also, if the company uses externally hosted web services, it may be necessary to work quickly to preserve that data as well.

I was surprised to find that most vendors seem to focus exclusively on major corporate databases and a small number of personal file types: word processing documents and email-related files. Most did not mention finding instant messages, photo files, internet session logs, or a number of other popular application-created files.  Also, few mentioned capturing physical access records (e.g., swipe-card lock logs), telephone call logs, or voicemail.

Since businesses usually continue to operate during litigation, and that often means there are legitimate reasons to change data, the eDiscovery process often involves making and preserving a copy of the data.  People have differing opinions about whether it is more effective to narrow what's collected (decide what's potentially relevant first) or to collect everything and narrow later.  In theory, the former is cheaper because you're storing a smaller copy.  On the other hand, storage is cheap, while failing to preserve the correct data could cost the ultimate price.
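
For readers who want to see what "making and preserving a copy" can look like in practice, here is a minimal sketch in Python. It assumes a cryptographic hash (SHA-256) is recorded at copy time so that the preserved copy can later be shown to be unchanged; that is a common approach, not something any vendor described to me. The function names and the "vault" folder are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def preserve(source, vault_dir):
    """Copy `source` into `vault_dir` (hypothetical folder); return (copy_path, digest)."""
    vault_dir = Path(vault_dir)
    vault_dir.mkdir(parents=True, exist_ok=True)
    copy_path = vault_dir / Path(source).name
    shutil.copy2(source, copy_path)  # copy2 also attempts to keep timestamps
    digest = sha256_of(copy_path)
    if digest != sha256_of(source):
        raise IOError("preserved copy does not match the source")
    return copy_path, digest

# Demonstration with a throwaway file: the business keeps editing the
# live document, while the preserved copy still matches its recorded hash.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "contract.txt"
    live.write_text("original terms")
    copy_path, digest = preserve(live, Path(tmp) / "vault")
    live.write_text("revised terms")
    copy_unchanged = sha256_of(copy_path) == digest
    live_changed = sha256_of(live) != digest

print(copy_unchanged, live_changed)
```

The recorded digest is what lets someone check, months later, that the preserved copy is still the one that was collected.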

2) Cleansing

Data cleansing is a broad term for anything done to data in preparation for searching.  The lawyer might not think much about this step in the process, but one study in other businesses found that it typically accounted for 60% of the effort.  Some of the most significant challenges are:

Integration: If you're not technical, imagine the instructions you would have to give to file clerks to get them to re-order a million paper files from a chronological system to a topic-based one.   Integration is the process of taking data with a structure created by one piece of software and making it understandable to a system with a different structure.
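
If you are technical, integration at its simplest looks something like the sketch below. The field names on both sides and the date format are assumptions made up for illustration; real systems involve many more fields and many more exceptions.

```python
from datetime import datetime

# Hypothetical mapping: source-system column -> target-system column.
FIELD_MAP = {"From": "sender", "Sent": "sent_at", "Subj": "subject"}

def integrate(record):
    """Translate one source record into the target system's layout."""
    # Keep only the fields the target system knows about, under new names.
    out = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}
    # Convert the source's US-style date text to ISO 8601 (assumed target format).
    out["sent_at"] = datetime.strptime(out["sent_at"], "%m/%d/%Y").date().isoformat()
    return out

source_row = {"From": "jsmith@example.com", "Sent": "02/24/2008",
              "Subj": "Q4 forecast", "Flag": "x"}
target_row = integrate(source_row)
print(target_row)
```

Note that the "Flag" field is silently dropped because the target has no place for it. Whether that kind of loss matters is a question worth asking a vendor.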

Deduplication: Digital files replicate faster than rabbits. People copy them like mad and, occasionally, electronic hiccups just create them. Deduplication is the process of reducing the multiples to one.  In an eDiscovery context this can be a double-edged sword. It can radically improve the speed of answering "is there a document that says...?" questions.  But it may remove the ability to know how many people had copies or where they saved them.
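
One way to blunt the second edge of that sword is to deduplicate for review while keeping a record of every location that held a copy. The sketch below assumes exact duplicates are detected by hashing file contents with SHA-256; it does not handle near-duplicates, and the file paths are invented.

```python
import hashlib
from collections import defaultdict

def dedupe(files):
    """Group (path, content_bytes) pairs by content hash.

    Returns {digest: [paths...]}: one entry per unique document, with
    the list of every place a copy was found.
    """
    groups = defaultdict(list)
    for path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)
    return dict(groups)

# Hypothetical collected files.
files = [
    ("alice/inbox/memo.doc", b"Q4 numbers attached"),
    ("bob/desktop/memo copy.doc", b"Q4 numbers attached"),
    ("carol/sent/reply.doc", b"Thanks, got it"),
]
groups = dedupe(files)
unique_count = len(groups)

for digest, paths in groups.items():
    print(digest[:8], len(paths), paths)
```

Reviewers read each unique document once, and the path lists still answer "who had this, and where?"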

Disambiguation/Fuzzy Matching: Whether by typo or intent, there are often similar but not identical representations of information (think "Robert", "Bob", and "Bobert").  There are a variety of techniques to attempt to figure out which refer to the same information and which are really distinct (e.g., two employees, both named "Joe Johnson").  Some products try to perform this before the Search process and others have tools to handle it during the search.
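
As a rough illustration, the sketch below combines two simple ideas: a nickname lookup table and a string-similarity score from Python's standard `difflib` module. The nickname table and the 0.85 threshold are assumptions chosen for the example, not values from any product.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}

def normalize(name):
    """Lowercase the name and swap a known nickname for its formal form."""
    parts = name.lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)

def likely_same(a, b, threshold=0.85):
    """True when the normalized names are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(likely_same("Bob Smith", "Robert Smith"))     # handled by the nickname table
print(likely_same("Bobert Smith", "Robert Smith"))  # handled by similarity scoring
print(likely_same("Robert Smith", "Alice Jones"))   # clearly different names
```

Notice what this cannot do: it will always say two different employees named "Joe Johnson" are the same person. Telling them apart takes other attributes, such as department or email address.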

Entity Extraction: From a computer's perspective, it's easier to find information in a database (already sorted into a neat table with descriptive column headers) than in running text (called "unstructured" and including things like emails, letters, and written reports).  So, there's now an array of software that will attempt to pull everything out of unstructured text and put it in a database.
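
A toy version of that idea, using simple text patterns and an in-memory SQLite database, is sketched below. The patterns, table layout, and sample sentence are assumptions for illustration; commercial extraction tools use far more sophisticated methods and still make mistakes.

```python
import re
import sqlite3

# Hypothetical patterns for three kinds of entity.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}

def extract_entities(doc_id, text):
    """Return (doc_id, kind, value) rows for every pattern match in text."""
    rows = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append((doc_id, kind, match.group()))
    return rows

text = "Per jsmith@example.com, the $1,250,000.00 payment is due 3/15/2008."
rows = extract_entities("doc-001", text)

# Load the rows into a table so they can be queried like any database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (doc_id TEXT, kind TEXT, value TEXT)")
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", rows)
emails = conn.execute(
    "SELECT value FROM entities WHERE kind = 'email'"
).fetchall()
print(rows)
print(emails)
```

Once the running text has been reduced to rows, a question like "which documents mention this email address?" becomes an ordinary database query.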

3) Search

It is critically important for lawyers to understand what search technologies are actually doing, yet this was the area where sales reps had the least understanding.  At the simplest level, search technologies can look for what you know or what you don't know.

In the "what you know" category, the most common is keyword searching, looking for a specific word.  This might be a fine method for searching for official documents on a project or deal that will always be mentioned by name in the document.  This can be enhanced by Boolean search, the technique that lets you add "and", "or", and "but not" connectors between words, so that you can narrow the number of results.  But these are only a partial solution for searching less formal communications, like email, instant messages, and voicemail, where the subject is often not mentioned.
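
Under the hood, keyword and Boolean search often come down to an index from each word to the documents that contain it, combined with set operations. The sketch below is a minimal version of that idea; the documents and the project name "Falcon" are invented.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index

# Hypothetical documents.
docs = {
    1: "Project Falcon merger agreement, final draft",
    2: "Falcon budget review",
    3: "Lunch on Friday? Let's talk about the bird thing",
}
index = build_index(docs)

falcon_and_merger = index["falcon"] & index["merger"]  # "and"
falcon_or_budget = index["falcon"] | index["budget"]   # "or"
falcon_not_budget = index["falcon"] - index["budget"]  # "but not"

print(falcon_and_merger, falcon_or_budget, falcon_not_budget)
```

Document 3 may well be about the same project, but because it never uses the word "Falcon", no combination of these connectors will find it.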

The next level of "what you know" searching involves data structure.  For example, when you see it, you can typically tell whether something is a Social Security number, a phone number, a street address, or a person's name.  It is possible to teach a computer to do the same thing.
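
Here is a small sketch of searching by shape instead of by value: it finds documents that contain anything formatted like a Social Security number or a US phone number, without knowing the number in advance. The two patterns are simplified assumptions and would miss many real-world formats.

```python
import re

# Hypothetical, simplified shapes.
SHAPES = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.])\d{3}[-.]\d{4}\b"),
}

def find_docs_with(shape, docs):
    """Return ids of documents containing text that matches the named shape."""
    pattern = SHAPES[shape]
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

# Hypothetical documents.
docs = {
    "a": "Her SSN is 123-45-6789.",
    "b": "Call me at (212) 555-0100",
    "c": "See you at 5",
}
ssn_docs = find_docs_with("ssn", docs)
phone_docs = find_docs_with("phone", docs)
print(ssn_docs, phone_docs)
```

Street addresses and personal names are much harder to describe as a fixed shape, which is one reason products differ in how well they recognize them.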

The big jump in technology is moving to inference-based searching, when you want a system to find things that are like other things even if the same words are not used.  This can sometimes find communications about a person or project only referenced by a nickname or not named at all.  In reality, a computer still can only do what it's told, and people have come up with a variety of computations to emulate what a human is doing when making inferences.  The four I heard from eDiscovery systems engineers were: Bayes, Shannon, linguistic indexing, and semantic indexing.  Describing what they do is the subject for another blog, but suffice it to say that they will not likely produce identical results.
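
To make one of these a little less of a black box, here is a minimal naive Bayes sketch: it learns word frequencies from a few documents a human has already labeled, then scores a new message by which label its words make more probable. This illustrates the general Bayesian idea only, not what any vendor's product does. The class name, labels, and training messages are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen in training

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def score(self, label, text):
        """Log-probability of the label given the text, up to a constant."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            logp += math.log((counts[word] + 1) / (total + len(self.vocab)))
        return logp

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.score(label, text))

nb = NaiveBayes()
nb.train("relevant", "the falcon deal closes friday keep it quiet")
nb.train("relevant", "big bird closes friday wire the funds quietly")
nb.train("not_relevant", "lunch friday at noon")
nb.train("not_relevant", "softball game moved to saturday")

guess = nb.classify("wire funds for the bird before it closes")
print(guess)
```

The test message never says "Falcon", yet it is scored as relevant because it shares vocabulary with the labeled examples. A different computation, or different training examples, could easily reach a different answer.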

4) Visualization

Even in paper files, complex litigation discovery has often involved millions of records.  Computers, though, can quickly and radically improve your ability to understand what you have.  My favorite example is that a five-year graph of the S&P 500 represents about 126,000 data points.  In my day at LegalTech, I was surprised by how little was being said about output formats.  I'm not sure whether that represents a lack of availability or a perception that lawyers only want traditional text presentation.


Conclusion:  Human document review was never perfect; critical documents have always been missed through concentration fatigue and occasional laziness or dishonesty.  Spend a week in a dusty warehouse full of documents, and you'll understand just how easy it is for those to occur.  But lawyers must understand that every one of these eDiscovery techniques can be done well or poorly; that even the best techniques will likely miss something; and that some of these techniques (such as entity extraction and inferential searching) are young or imperfect.  Performing multiple techniques means compounding the number of misses or errors.  These may be the only realistic options for handling millions, billions, or trillions of records, so it's important for lawyers to know enough about the technologies being offered to ensure they ask the questions that matter to them, understand what they're buying, and consider the risks involved.
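
To put rough numbers on the compounding point, the sketch below multiplies through three steps. The percentages are invented for illustration, and it assumes a document must survive every step and that the steps miss documents independently of one another.

```python
# Hypothetical fraction of relevant documents each step handles correctly.
step_recall = {
    "deduplication": 0.99,
    "entity extraction": 0.90,
    "inferential search": 0.85,
}

combined = 1.0
for fraction in step_recall.values():
    combined *= fraction  # a document must survive every step

print(f"{combined:.1%} of relevant documents survive all three steps")
```

Under those assumptions, three steps that each sound respectable on their own leave roughly a quarter of the relevant documents unfound.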

Topics: technology for lawyers, knowledge discovery for litigation, eDiscovery