Text messaging and the train wreck

Posted by K Krasnow Waterman on Tue, Sep 16, 2008 @ 11:09 AM

Tags: technology for lawyers, knowledge discovery for litigation, technology for business managers, law about technology, public policy, eDiscovery

Train wreck caused by text messaging?  Multiple news reports have raised the possibility that the conductor of a Los Angeles train was sending text messages just before the train crashed and many were killed.  The questions under investigation are whether this is true and whether the conductor was distracted by it when he should have seen red light signals indicating the hazard ahead. 

This is the saddest outcome of an issue I, and others, have been raising for years.  The use of technology for non-work activities has pervaded the work environment to the extent that it is impacting work performance.  The obvious problem is lost revenue and reduced profits to the employer, but sometimes it correlates to increased liability.  If true in this case, it means lost lives. 

If the shop clerk with an mp3 player or cellphone in the ear is too distracted to answer questions accurately or make correct change, what makes me think my car mechanic, stock broker, or doctor's lab technician isn't?  In 2006, eDiscovery companies were estimating that one quarter to one third of all emails flowing through a corporation were personal email. At the time, I wrote about the thousands of football and fantasy football gambling emails that had passed through Enron.  I also wrote about the dirty jokes, hook ups, and other sex emails there.

It's getting technically easier to discover that people aren't really working when they claim to be. This summer, before lecturing at a state bar convention, I stood in the back of the large hall and observed what people were doing.  I explained the ways I could prove that they had been using their laptops, BlackBerries, and iPhones to shop on the web, play video poker, and text friends and family.  I explained how, in the not-too-distant future, these activities will probably void the professional certification credit they thought they were earning by being present but not paying attention.

This week's train wreck brings more attention to the debate about just how much people's attention is diverted and what the consequences can be.  At a New York panel discussion last fall, a group of senior financial industry compliance managers uniformly said they weren't concerned about personal web, email, and phone use at work.  Perhaps they ought to be.

 


Technology & Legal Ethics

Posted by K Krasnow Waterman on Fri, Jun 20, 2008 @ 13:06 PM
"If I'd wanted to be a CIO, I wouldn't have gone to law school" is the subtitle of the talk I'm giving today at the Arizona State Bar Convention about the ethical trouble lawyers can get into using current technology.  Today's presentation is posted on the papers and presentations page or you can click here to see the presentation.

Judge Kozinski - Closing the barn door...

Posted by K Krasnow Waterman on Fri, Jun 13, 2008 @ 03:06 AM

Tags: technology for lawyers

(WARNING: Adult content) 

On Tuesday, Alex Kozinski, Chief Judge of the federal Ninth Circuit, was caught by the LA Times with a website full of sexually explicit material accessible to the public.  Pardon the pun, but perhaps the old expression about "closing the barn door after the animals are gone" has never been more appropriate. The LA Times says the site included photos of "naked women on all fours painted to look like cows and a video of a half-dressed man cavorting with a sexually aroused farm animal."  There is so much wrong with this picture that it's hard to decide where to start.

Next week, I'll be giving a talk at the Arizona State Bar Convention about legal ethics and technology. One of the most important points is that lawyers need to understand how big a data footprint they and their clients are leaving behind. 

Kozinski is reported to have said that he thought the site was for his private storage and that he was not aware the images could be seen by the public.  That's a problem for many lawyers, who are unaware how easy it is to find things they or their clients have posted on the web.  In the Judge's case, that's doubtful if he's really the author of the letter to "Article III Groupie" posted on undertheirrobes.com.  There, in a plea to be included as a contender for "judicial hottie," were multiple links to http://alex.kozinski.com.  The links included the reportedly offensive subdirectory /stuff (see the properties for "bungee jump").  If he didn't think people could get to the subdirectory, why did he include a link to it?

Kozinski is reported to have said he didn't know if any of the material on the site is obscene. The site is now offline and apparently unavailable through some of the easiest means of access.  But, Cryptome has posted a list of all of the files and subdirectories in the judge's  /stuff subdirectory and it contains a subdirectory called "/fucking" which has been around since November 2006.  The LA Times described part of the Kozinski site as containing "images of masturbation, public sex and contortionist sex."  In researching this story, I accidentally came across the women-as-cows photo (be very careful which Google hits you choose if you search this story); the women's posteriors are facing the camera and their genitalia are in full view.

In the first LA Times story, the Judge said that he had uploaded sexually explicit content to the site.  The next day, the Judge is reported to have suggested that some of the items were posted by his adult son and that he was unaware of them.  If this becomes a question of sufficient concern, there are technical methods to determine whether this is likely true or false.  The website appears to have been registered by the Judge's son, hosted on a joke server and registered using an obviously false address (including both homage to hackers with references to FOO and to lawyers with the fictitious town "Barsville").  Even so, with pc logs, server logs, emails, and web postings, it won't be that hard to figure out most of who did what.

The story broke because Judge Kozinski was hearing a trial level case, a criminal prosecution for the distribution of pornographic materials (containing bestiality).  In response to the news stories about his own website, Judge Kozinski suspended trial at least until Monday.  Besides the immediate question of possible conflict of interest, it is likely that someone will look more closely at how the case came to be assigned to Judge Kozinski.  It is not impermissible for an appeals court judge to hear a trial case, but it is not common.

It won't be long before people are reassessing everything the Judge has said or done.  And, quite a lot of that history is readily available in digital form. For example, people are already reassessing Judge Kozinski's 2001 battle with the Court's administrators over pornography filters on the government's computers.  I've yet to see any discussion of his opinion (in US v Poehlman) finding that the government entrapped a man it accused of crossing state lines to have sex with minors.

The LA Times reports that the Judge  "defended some of the adult content as "funny"" and "he had shared some material on the site with friends."  Considering that the site contains the aforementioned photos of naked women as cows, and is reported to have included at least one photo of women exposing their pubic hair, we will now wait to see whether former female employees or colleagues come forward to say that they were the recipients of such "sharing" and found it offensive or harassing.  And, it's only  a matter of time before someone takes a new look at his writing on sexual harassment (Foreword in Sexual Harassment in Employment Law (Barbara Lindemann & David D. Kadue, BNA 1992), reprinted as Locking Women Workers in a Gilded Cage in Legal Times of Washington, May 25, 1992, at 26.)

Also discovered on Judge Kozinski's website were "more than a dozen" copyrighted songs and it has been asserted that they were readily copy-able by the public.  While that's a pretty small number relative to the civil copyright infringement actions typically reported, it could still be a copyright violation if others did copy the files.  Perhaps more interesting, someone may want to reread the Judge's participation in the July 28, 2000 decision to stay an injunction against Napster.   

All in all, it looks like it's going to be a tough week for Judge Kozinski, until now considered one of America's brightest and most influential conservative judges.

URI - Organizing your world's information

Posted by K Krasnow Waterman on Sat, May 17, 2008 @ 09:05 AM

Tags: technology innovation, technology for business managers

Google's mantra is "organizing the world's information." If you're organizing information in your corporation or organization, that might not be a viable option. URIs present the opportunity for everyone in a web environment to make a step in that direction.

One of the major challenges for large organizations is that different people, departments, etc. use the same words to mean different things. Every business and subset of business has "terms of art", often common words or phrases that mean something special to that group.

To a programmer, the word "beta" means the test of software before it's released for general use. To a stock broker, "beta" is a number that shows whether a stock is more or less volatile than the market. They're in different industries so, talking face-to-face, it's pretty easy to tell that they're talking about different things.

There are plenty of examples, though, where the same word in the same industry means different things. In the financial industry, "wealth" is used to define the threshold for accepting clients for certain services. Every institution picks its own number and they can be the same or different (e.g., over $1 million in net worth; over $1 million in liquid funds invested; over $1 million in assets other than personally-used real estate). When those institutions merge, the inconsistent definitions become an impediment to merging their data.

In computer systems, there historically weren't good ways to know which meaning someone had in mind when they put a particular word in a file or database. The problem was the same for the names of fields or columns. Now, we have metadata...data that lets us provide information about data. So, we can stick tags on data in a file that tell us things like where it came from, what day it was collected, or what size it's supposed to be.

A URI (uniform resource identifier) can identify the definition you have in mind. So Citi/define/wealth can have a different meaning from UBS/define/wealth. And, your system can point to the appropriate one whenever "wealth" appears in your data. This makes it possible to merge data and retain different meanings or to compute across disparate meanings.
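
Here's a minimal sketch of that idea in Python. Everything specific in it is an assumption for illustration: the example.org addresses, the field names, and the dollar thresholds are hypothetical and are not actual Citi or UBS definitions.

```python
# Hypothetical definition URIs; neither institution publishes these.
CITI_WEALTH = "http://example.org/Citi/define/wealth"
UBS_WEALTH = "http://example.org/UBS/define/wealth"

# Each URI identifies one precise meaning of the shared word "wealth".
DEFINITIONS = {
    CITI_WEALTH: lambda client: client["net_worth"] > 1_000_000,
    UBS_WEALTH: lambda client: client["liquid_invested"] > 1_000_000,
}


def is_wealthy(client, definition_uri):
    """Apply whichever definition the URI identifies."""
    return DEFINITIONS[definition_uri](client)


# Merged records keep a pointer to the definition used at the source.
merged = [
    {"name": "A", "net_worth": 2_500_000, "liquid_invested": 400_000,
     "wealth_def": CITI_WEALTH},
    {"name": "B", "net_worth": 900_000, "liquid_invested": 1_200_000,
     "wealth_def": UBS_WEALTH},
]


def wealthy_under_own_definition(records):
    """Retain each record's original meaning of 'wealth'."""
    return [r["name"] for r in records if is_wealthy(r, r["wealth_def"])]


def wealthy_under(records, definition_uri):
    """Compute across all records using one chosen meaning."""
    return [r["name"] for r in records if is_wealthy(r, definition_uri)]
```

Because every record carries the URI of its definition, the merged data can be read either way: each client judged by the rule of the institution it came from, or all clients judged by a single rule.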


"Know Your Customer" - Host a data workshop

Posted by K Krasnow Waterman on Sat, Apr 26, 2008 @ 16:04 PM

Tags: technology for business managers, technology b2b customer service

Recently, I was invited to facilitate a workshop to learn about customer data uses, flows, and needs. It was an interesting idea, so I agreed.

"Know your customer" has become a hackneyed phrase in fairly short order. One of the post-9/11 bundle of laws, intended to gain anti-terrorism assistance from the public, was a "know your customer" mandate requiring financial institutions to better understand who their customers are and where their money comes from. Like many things we do in this automated life, it seems to have quickly lost its meaning in favor of a single massive data collection effort...like when my bank of many years -- which has seen my entire transition from debt to net worth through both my business accounts and the deposit of every paycheck -- asks me for ID.

The workshop was intended to provide an opportunity for a fairly large group of data architects to hear a group of customers talk about their business day and tasks; how they interact with each other; and what they want. It was my job to draw them out over the course of two days, to find slices of life to talk about and elicit tremendous detail. It was expected that we would have an accelerated opportunity to gather needed data elements and identify system access requirements.

With facilitation, the customers opened up about their work lives. They described a tremendous amount of human interaction to obtain information. They described phoning folks in other parts of the organization to find out information they wanted. We, the folks with strong information technology orientation, thought we were making a break-through, identifying systems to which these customers could or should get access.

What happened next was unexpected. When we sought to validate these system access requirements, the customers repeatedly and politely told us we misunderstood. They repeatedly explained that they liked to get information in this unautomated fashion. They liked the opportunity conversation gave them to get context -- group meaning of terms, background for the way information is gathered, information that's inappropriate for permanent records, and other related information.

Since then, I've been thinking about what it really means to know your customer. As the provider of services, it's not enough to learn your customer's business. And, it's not enough to spend time in their space and observe them at work. You need to do those things but, in the end, if you really want to give them what they want, sometimes you just need to ask.

 




Microformats! Adding tags to webpages to improve search results

Posted by K Krasnow Waterman on Thu, Apr 10, 2008 @ 07:04 AM

Tags: technology innovation, technology b2c

I'm definitely a fan of the concept of semantic web, the ability to reach individual pieces of data you want from the internet rather than having to get whole pages and then find the information. A little while back, I wrote about FOAF (Friend-of-a-Friend), a semantic web tool to make social connections more readily accessible from the internet.

Now, I'm starting to get enthusiastic about Microformats, little bits of code you can add to a website or page that humans don't see, but make some particular type of information accessible to the web. If you want customers or potential customers to be able to get to contact information, calendar information (events, possibly business hours), or product information without having to read whole pages, there's a little set of tags you can stick into your page code that will make it possible. Yahoo! search recently announced support for some of these formats (seems to be calendars and reviews but not yet product listings) so people performing search will get the information in response to their Yahoo! search. These are particularly great because you need very little technical skill. The examples springing up around the web are things you can cut and paste into your web code even if you only know a little bit about html. And, there are programs, like hCalendarCreator, that will create the code for you. You can use this code even if you're on a pre-fabricated site that only lets you enter text and html in a module on a page.
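
To give a feel for how little code is involved, here's a small sketch in Python that builds an event snippet the way a generator tool might. The class names (vevent, summary, dtstart, location) come from the hCalendar microformat; the function name and the event details are made up for illustration.

```python
from html import escape


def hcalendar_event(summary, start_iso, start_text, location):
    """Return an HTML fragment marked up with hCalendar class names."""
    return (
        '<div class="vevent">'
        f'<span class="summary">{escape(summary)}</span>, '
        f'<abbr class="dtstart" title="{escape(start_iso)}">'
        f'{escape(start_text)}</abbr>, '
        f'<span class="location">{escape(location)}</span>'
        '</div>'
    )


# Hypothetical event; paste the resulting string into a page's HTML.
snippet = hcalendar_event("Wine tasting", "2008-05-13", "May 13",
                          "Main Street Shop")
```

A visitor sees only the ordinary text (event name, date, place), while software reading the page can pick out the machine-readable date from the title attribute.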

The only downsides I can see are: 1) each microformat has a limited purpose and a limited number of things you can express and 2) some have questioned how long they'll be supported. The answer to both of these challenges appears to be, ultimately, RDF, which is the more robust semantic web standard sanctioned by the World Wide Web Consortium. RDF will pretty much let you express anything about anything (solving problem 1). RDF requires a much higher level of technical skill and access to the header portion of your web pages. But, there's a next generation of tools (things like GRDDL) coming that will translate microformats to RDF, so even if Yahoo! does decide to pull support for every microformat you'll have a way to still get your tags read (solving problem 2).

I'll be trying this technology out on a new e-commerce site and will report back about the experience.

Accountability Appliances: What Lawyers Expect to See - Part III (User Interface)

Posted by K Krasnow Waterman on Tue, Apr 01, 2008 @ 14:04 PM

Tags: technology implementing law

In cleaning up blog tags today, I realize that I never posted part III of this discussion.  It's been posted at my MIT blog but for completeness I'm posting it here.  It's decidedly more technical than most of what's here but may be interesting for those following the thread of building technology to implement law.

--------------------------------------

I've written in the last two blogs about how lawyers operate in a very structured environment. This will have a tremendous impact on what they'll consider acceptable in a user interface. They might accept something which seems a bit like an outline or a form, but years of experience tell me that they will rail at anything code-like.

For example, we see

:MList a rdf:List

and automatically read

"MList" is the name of a list written in rdf

Or,

air:pattern {
    :MEMBER air:in :MEMBERLIST.
}


and know that we are asking our system to look for a pattern in the data in which a particular "member" is in a particular list of members. Perhaps because studying law already means learning to read, speak, and think in another language, most lawyers look at lines like those above and see no meaning.

Our current work-in-progress produces output that includes:

bjb reject bs non compliant with S9Policy 1

Because

phone record 2892 category HealthInformation

Justify

bs request instruction bs request content

type Request

bs request content intended beneficiary customer351

type Benefit Action Instruction

customer351 location MA

xphone record 2892 about customer351

Nearly every output item is a hotlink to something which provides definition, explanation, or derivation. Much of it is in "Tabulator", the cool tool that aggregates just the bits of data we want to know.

From a user-interface-for-lawyers perspective, this version of output is an improvement over our earlier ones because it removes a lot of things programmers do to solve computation challenges. It removes colons and semi-colons from places they're not commonly used in English (i.e., as the beginning of a term) and mostly uses words that are known in the general population. It also parses "humpbacks" - the programmers' traditional concatenation of a string of words - back into separate words. And, it replaces hyphens and underlines - also used for concatenation - with blank spaces.
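
As a rough illustration of that clean-up (a sketch only, not the code our system actually uses; the function name is hypothetical), the three steps could look like this in Python:

```python
import re


def humanize(term):
    """Turn a code-style term into separate plain words."""
    term = term.lstrip(":")                     # drop a leading colon, as in ":MList"
    term = re.sub(r"[-_]+", " ", term)          # hyphens and underlines become spaces
    # Split a "humpback" wherever a lowercase letter or digit meets a capital.
    term = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", term)
    return " ".join(term.split())               # collapse any repeated spaces
```

With this rule, "HealthInformation" becomes "Health Information" and "BenefitActionInstruction" becomes "Benefit Action Instruction", while an all-capitals term like "MEMBERLIST" is left alone because it contains no hump to split on.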

At last week's meeting, we talked about the possibility of generating output which simulates short English sentences. These might be stilted but would be most easily read by lawyers. Here's my first attempt at the top-level template:

 

Issue: Whether the transactions in [TransactionLogFilePopularName] {about [VariableName] [VariableValue]} comply with [MasterPolicyPopularName]?

Rule: To be compliant, [SubPolicyPopularName] of [MasterPolicyPopularName] requires [PatternVariableName] of an event to be [PatternValue1].

Fact: In transaction [TransactionNumber] [PatternVariableName] of the event was [PatternValue2].

Analysis: [PatternValue2] is not [PatternValue1].

Conclusion: The transactions appear to be non-compliant with [SubPolicyPopularName] of [MasterPolicyPopularName].


This seems to me approximately correct in the context of requests for the appliance to reason over millions of transactions with many sub-rules. A person seeking an answer from the system would create the Issue question. The Issue question is almost always going to ask whether some series of transactions violated a super-rule and often will have a scope limiter (e.g., in regards to a particular person or within a date scope or by one entity), denoted here by {}.

From the lawyer perspective, the interesting part of the result is the finding of non-compliance or possible non-compliance. So, the remainder of the output would be generated to describe only the failure(s) in a pattern-matching for one or more sub-rules. If there's more than one violation, the interface would display the Issue once and then the Rule to Conclusion steps for each non-compliant result.

I tried this out on a lawyer I know. He insisted it was unintelligible when the []'s were left in but said it was manageable when he saw the same text without them.


For our Scenario 9, Transaction 15, an idealized top level display would say:


Issue: Whether the transactions in Xphone's Customer Service Log about Person Bob Same comply with MA Disability Discrimination Law?

Rule: To be compliant, Denial of Service Rule of MA Disability Discrimination Law requires reason of an event to be other than disability.

Fact: In transaction Xphone Record 2892 reason of the event was Infectious Disease.

Analysis: Infectious disease is not other than disability.

Conclusion: The transactions appear to be non-compliant with Denial of Service Rule of MA Disability Discrimination Law.


Each one of the bound values should have a hotlink to a Tabulator display that provides background or details.
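
The template and the Scenario 9 display above can be sketched as a small rendering routine. This is only an illustration in Python, assuming the bound values arrive as simple name/value pairs; the function and field names are hypothetical, and hotlinks are left out.

```python
ISSUE = "Issue: Whether the transactions in {log}{scope} comply with {master}?"
STEPS = (
    "Rule: To be compliant, {sub} of {master} requires {var} of an event "
    "to be {required}.\n"
    "Fact: In transaction {txn} {var} of the event was {actual}.\n"
    "Analysis: {actual} is not {required}.\n"
    "Conclusion: The transactions appear to be non-compliant with "
    "{sub} of {master}."
)


def render(log, master, violations, scope=""):
    """Show the Issue once, then Rule-to-Conclusion for each violation."""
    scope_text = f" about {scope}" if scope else ""   # optional scope limiter
    parts = [ISSUE.format(log=log, scope=scope_text, master=master)]
    for violation in violations:
        parts.append(STEPS.format(master=master, **violation))
    return "\n\n".join(parts)


report = render(
    log="Xphone's Customer Service Log",
    master="MA Disability Discrimination Law",
    scope="Person Bob Same",
    violations=[{
        "sub": "Denial of Service Rule",
        "var": "reason",
        "required": "other than disability",
        "txn": "Xphone Record 2892",
        "actual": "Infectious Disease",
    }],
)
```

Keeping the Issue separate from the repeated steps matches the display rule described earlier: one question at the top, followed by one Rule-to-Conclusion group per non-compliant result.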

 

Right now, we might be able to produce:


Issue: Whether the transactions in Xphone's Customer Service Log about Betty JB reject Bob Same comply with MA Disability Discrimination Law?

Rule: To be non-compliant, Denial of Service Rule of MA Disability Discrimination Law requires REASON of an event to be category Health Information.

Fact: In transaction Xphone Record 2892 REASON of the event was category Health Information.

Analysis: category Health Information is category Health Information.

Conclusion: The transactions appear to be non-compliant with Denial of Service Rule of MA Disability Discrimination Law.

 

This example highlights a few challenges.

1) It's possible that only failures of policies containing comparative matches (e.g., :v1 sameAs :v2; :v9 greaterThan :v3; :v12 withinDateRange :v4) are legally relevant. This needs more thought.

2) We'd need to name every sub-policy or have a default called UnnamedSubPolicy.

3) We'd need to be able to translate statute numbers to popular names and have a default instruction to include the statute number when no popular name exists.

4) We'd need some taxonomies (e.g., infectious disease is a sub-class of disability).

5) In a perfect world, we'd have some way to trigger a couple of alternative displays. For example, it would be nice to be able to trigger one of two rule structures: either one that says a rule requires a match or one that says a rule requires a non-match. The reason for this is that if we always have to use the same structure, about half of the outputs will be very stilted and cause the lawyers to struggle to understand.

6) We need some way to deal with something the system can't reason about. If the law requires the reason to be disability and the system doesn't know whether health information is the same as or different from disability, then it ought to be able to produce an analysis that says something along the lines of "The relationship between Health Information and disability is unknown" and produce a conclusion that says "Whether the transaction is compliant is unknown." If we're reasoning over millions of transactions there are likely to be quite a few of these and they ought to be presented after the non-compliant ones.
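
Challenges 4 and 6 can be pictured together with a small sketch. This is Python rather than our rule language, and the taxonomy, the category names, and the assumption that anything outside the known set is "unknown" are all hypothetical choices made for illustration.

```python
# Hypothetical taxonomy: child category -> parent category.
SUBCLASS_OF = {"Infectious Disease": "Disability"}
# Categories the system has been told about; anything else is unknown.
KNOWN = {"Disability", "Infectious Disease", "Late Payment"}


def is_a(category, ancestor):
    """True or False when the taxonomy can decide, None when it cannot."""
    if category not in KNOWN:
        return None
    while category is not None:
        if category == ancestor:
            return True
        category = SUBCLASS_OF.get(category)   # walk up to the parent, if any
    return False


def assess(reason, prohibited="Disability"):
    """Classify one transaction's reason against a prohibited category."""
    verdict = is_a(reason, prohibited)
    if verdict is None:
        return "unknown"
    return "non-compliant" if verdict else "compliant"


def order_results(results):
    """Non-compliant first, then unknown; compliant results are omitted."""
    rank = {"non-compliant": 0, "unknown": 1}
    flagged = [r for r in results if r[1] in rank]
    return sorted(flagged, key=lambda r: rank[r[1]])
```

The point of the third outcome is that the system never guesses: "Health Information" is not in the taxonomy, so it is reported as unknown rather than silently treated as compliant or non-compliant.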


 


"Danger Will Robinson!" - announcing first teleconference of the ABA AI & Robotics committee

Posted by K Krasnow Waterman on Thu, Mar 20, 2008 @ 17:03 PM

Tags: law about technology

I'm very pleased to share that the new American Bar Association committee I'm chairing has just been selected to host a teleconference for the Science & Technology Section on May 13th.  The full title of the event is "Danger, Will Robinson! Issues for Emerging Clients in Bots, Robots, and other Artificial Intelligence Enterprises."  We'll hear from Edwin Olson, MIT team lead for the DARPA Challenge -- building and deploying an unmanned vehicle in an urban environment -- about the legal issues he considered in the course of this work.  Then, we'll talk about the state of the law today for liability, ownership, agency, and other issues relating to human emulating technologies.  And, we'll talk about changes in the law that might be appropriate and how to begin the process of driving those changes.  Please contact the American Bar Association or me to find out more.


FOAF and the coming wave of semantic web social networking

Posted by K Krasnow Waterman on Wed, Mar 12, 2008 @ 19:03 PM

Tags: FOAF, social graph, technology innovation, semantic web, technology

"FOAF" -- short for Friend-Of-A-Friend -- offers a vocabulary for putting machine readable code into a webpage, making it possible to link from one site, person, company, etc. to related ones.  It's been evolving since 2000, but seems on the edge of a major break-out.  Now's a good time to learn about it and ride the wave of its growth.

FOAF is part of the semantic web movement, a philosophy and method of using technology to be able to reach, combine, and react to discrete information coming from multiple sources.  Put simply, it lets you (or your system) get to very specific bits of information from many web pages or systems, even if they're not yours.  Of course, as this technology develops, you'll only be able to reach the discrete bits of data that you're allowed to (more about that some other time).

If you think about it, the concept of machine-readable Friend-of-A-Friend data would seem to eliminate the need for LinkedIn, FaceBook, MySpace, and all the rest.   In theory, you wouldn't need these services because you could pull together your networks by first, second, and nth degree of separation automatically.  You could see networks of friends, business associates, or whatever directly from everyone's pages.  But, in that surprising way the web works, that's not how it's going to happen exactly.

A few weeks ago, I had the distinct pleasure of meeting Dan Brickley, creator of FOAF.  From him, I learned that software is going to start linking FOAF data with other machine-readable social network information.  The way I understand it, smart folks at places like Google and Yahoo are providing code (APIs) that will let you bridge data from FOAF with data from places like Flickr and Twitter.  So they'll be able to make the jump from one social network to the next.  If FOAF knows Amy is my friend; Twitter knows I share what I'm doing with Bobby; and Flickr knows I share my pictures with Cindy, these APIs will pull that information together and know that I know Amy, Bobby, and Cindy.  These relationships can be displayed in "social graphs", visualizations that look like linked Tinker Toys where every bubble is a person.  Flickr and Twitter are out front on this one because they, too, put data in code that's directly accessible to the web.
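
The Amy, Bobby, and Cindy example can be sketched in a few lines of Python. This is my own illustration, not anyone's actual API: the data is typed in by hand where a real bridge would fetch it from each service, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical extracts; real services would supply these through their own APIs.
SOURCES = {
    "FOAF": [("me", "Amy")],
    "Twitter": [("me", "Bobby")],
    "Flickr": [("me", "Cindy")],
}


def merge_graphs(sources):
    """Combine per-service links into one graph, noting where each came from."""
    graph = defaultdict(dict)
    for service, edges in sources.items():
        for person, contact in edges:
            graph[person].setdefault(contact, set()).add(service)
    return graph


def suggestions(graph, person, service):
    """Contacts known elsewhere but not yet connected on the given service."""
    return sorted(c for c, seen in graph[person].items() if service not in seen)


graph = merge_graphs(SOURCES)
```

Once the links are merged, the combined graph knows all three people, and asking which contacts are missing from Twitter returns Amy and Cindy, which is exactly the kind of prompt a network could offer instead of making you type.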

The big ideas here are:

1) People won't have to keep entering the same information to get the same people into new social network websites.   If Twitter knows that I know Cindy, but she's not in my Twitter group, it can ask me if I want to include her.  This is HUGE.  There are millions of people joining networks all the time and one that offers this no-typing option will have a big competitive advantage.

2) People won't have to have their data stored with a particular website. They could have their data stored anywhere and just use websites that offer network topics or services, bringing together the right data only for the moment they need it.  So, you wouldn't have to permanently set up accounts of all your friends on a wine tasting network because you wanted them to share an event there one time.  

3) Applications will grow up around this, offering ways to segment relationships into different levels of access. Just because the web can see that your mom and your girlfriend are both connected to you doesn't mean you want to share the same things with them.  


eDiscovery tools and techniques - Knowledge Discovery in lawyer's clothing

Posted by K Krasnow Waterman on Sun, Feb 24, 2008 @ 12:02 PM

Tags: technology for lawyers, knowledge discovery for litigation, eDiscovery

eDiscovery was the hands down favorite at LEGALTECH, a huge legal industry expo and conference held in New York a few weeks ago.  Although some lawyers have been asking for opposing parties' electronic records during litigation for years, the Supreme Court only implemented a rule on the subject in December 2006.  The rule applies to lawsuits in federal court and uniformly places the burden on parties to proactively seek out and turn over relevant digital records from their repositories early in the case. Well more than half the vendors at LegalTech had an eDiscovery spin to their offerings.

As someone who has been both a trial attorney and the manager of large data systems I was somewhat bemused by the marketing efforts.  Nearly every vendor's representative told me his or her offering was "unique."  When questioned, almost none could say what made them unique or what the software really does.  Some had systems engineers on hand from whom I could glean more specific information. 

Based upon my conversations with the sales personnel and their systems engineers, eDiscovery is nothing more than Knowledge Discovery (a pre-existing and still rapidly growing field of  information technology) in lawyer's clothing.  There's nothing fundamentally wrong with that.  Lawyers shouldn't have to learn a whole new technology lexicon to do their work; it's appropriate for the vendors to speak in terms that are relevant to the customer.  However, since most of the sales reps had a black-box, it's-magic, sort of presentation, I think a little explanation from the perspective of someone who has been working with these issues since long before last year might be useful.  Lawyers need to know that not every product is offering the same function or the same quality.

1) Preservation

Once a company knows that it is being sued or is filing suit, all relevant records have to be preserved.  This means making sure that no one changes or deletes relevant data that is potential evidence or which could lead to evidence.

One of the hard questions is deciding where relevant data might reside. From the hardware perspective, company servers are usually an obvious place to start, but it may also be necessary to reach out to desktops, laptops, phones, and other employee PDAs. Also, if the company uses outside hosted web services, it may be necessary to work quickly to preserve that data as well.  

I was surprised to find that most vendors seem to focus exclusively on major corporate databases and a small number of personal filetypes: word processing documents and email-related files. Most did not mention finding instant messages, photo files, internet session logs, or a number of other popular application-created files.  Also, few mentioned capturing physical access (e.g., swipe card lock records), telephone call logs or voicemail.

Since businesses usually continue to operate during litigation, and that often means legitimate reasons for changing data, the eDiscovery process often involves making and preserving a copy of the data.  People have differing opinions about whether it is more effective to narrow what's collected (decide what's potentially relevant first) or to collect everything and narrow later.  In theory, the former is cheaper because you're storing a smaller copy.  On the other hand, storage is cheap, but failing to preserve the correct data could cost the ultimate price.

2) Cleansing

Data cleansing is a broadly used term to describe anything done to data in preparation for searching activities.  The lawyer might not think much about this step in the process, but one study in other businesses found this typically accounted for 60% of the effort.  Some of the most significant challenges are:

Integration: If you're not technical, imagine the instructions you would have to give to file clerks to get them to re-order a million paper files from a chronological system to a topic-based one.   Integration is the process of taking data with a structure created by one piece of software and making it understandable to a system with a different structure.

Deduplication: Digital files replicate faster than rabbits.  People copy them like mad and, occasionally, electronic hiccups just create them.  Deduplication is the process of reducing the multiples to one.  In an eDiscovery context this can be a double-edged sword.  It can radically improve the speed of answering "is there a document that says...?" questions.  But it may remove the ability to know how many people had copies or where they saved them.
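As a rough illustration of the deduplication trade-off, the Python sketch below groups files by a fingerprint of their content but keeps the list of every location where a copy was found.  The file paths and contents are hypothetical, and hashing whole files with SHA-256 is only one assumed approach; it catches exact copies and nothing else.

```python
import hashlib
from collections import defaultdict


def deduplicate(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file paths by the SHA-256 fingerprint of their content."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return dict(groups)


# Hypothetical collection: path -> file content.
files = {
    "alice/inbox/forecast.doc": b"Q3 forecast",
    "bob/desktop/forecast copy.doc": b"Q3 forecast",
    "carol/drafts/notes.txt": b"Meeting notes",
}

groups = deduplicate(files)
for digest, paths in groups.items():
    # One document to review, but every location is still on record.
    print(digest[:8], len(paths), paths)
```

A tool that keeps only one path per fingerprint answers "is there a document that says...?" just as quickly, but it can no longer say that both Alice and Bob had the forecast.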

Disambiguation/Fuzzy Matching: Whether by typo or intent, there are often similar but not identical representations of information (think "Robert", "Bob", and "Bobert").  There are a variety of techniques to attempt to figure out which refer to the same information and which are really distinct (e.g., two employees, both named "Joe Johnson").  Some products try to perform this before the search process, and others have tools to handle it during the search.
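One simple way to approximate fuzzy matching is sketched below in Python, combining a nickname lookup with a spelling-similarity score from the standard library's difflib module.  The nickname table, the 0.8 threshold, and the function names are assumptions chosen for the example; real products use larger reference lists and more sophisticated scoring.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "bobby": "robert", "joe": "joseph"}


def canonical(name: str) -> str:
    """Lower-case a first name and map known nicknames to a formal form."""
    cleaned = name.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def similarity(a: str, b: str) -> float:
    """Score two names from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()


def probably_same(a: str, b: str, threshold: float = 0.8) -> bool:
    """Guess whether two first names refer to the same name."""
    return similarity(a, b) >= threshold


print(probably_same("Robert", "Bob"))     # matched through the nickname table
print(probably_same("Robert", "Bobert"))  # matched through similar spelling
print(probably_same("Robert", "Joe"))     # different names
```

Note what this does not solve: two different employees named "Joe Johnson" score as identical.  Telling them apart takes other attributes, such as department or email address.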

Entity Extraction: From a computer perspective, it's easier to find information in a database (already sorted into a neat table with descriptive column headers) than in running text (called "unstructured" and including things like emails, letters, and written reports).  So, there's now an array of software that will attempt to pull everything out of unstructured text and put it in a database.  
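To show the idea of moving information from running text into a table, here is a deliberately small Python sketch.  It assumes only two kinds of "entity" (email addresses and dollar amounts), finds them with simple patterns, and stores them in an in-memory SQLite table.  The patterns, the table layout, and the sample message are all invented for illustration; commercial entity extraction typically goes well beyond pattern matching.

```python
import re
import sqlite3

# Simplified patterns, assumed for this example only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AMOUNT = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")


def extract_entities(doc_id: str, text: str) -> list[tuple[str, str, str]]:
    """Return (document id, kind, value) rows found in unstructured text."""
    rows = [(doc_id, "email", match) for match in EMAIL.findall(text)]
    rows += [(doc_id, "amount", match) for match in AMOUNT.findall(text)]
    return rows


# Hypothetical message.
text = "Please wire $12,500.00 to the vendor and copy pat@example.com on the receipt."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (doc_id TEXT, kind TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)", extract_entities("msg-001", text)
)

# Once the values sit in a table, they can be queried like any database.
found = conn.execute("SELECT kind, value FROM entities ORDER BY kind").fetchall()
print(found)
```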

3) Search

It is critically important for lawyers to understand what search technologies are actually doing, yet this was the area where sales reps had the least understanding.  At the simplest level, search technologies can look for what you know or what you don't know.

In the "what you know" category, the most common is keyword searching, looking for a specific word.  This might be a fine method for searching for official documents on a project or deal that will always be mentioned by name in the document.  This can be enhanced by Boolean search, the technique that lets you add "and", "or", and "but not" connectors between words, so that you can narrow the number of results.  But these are only a partial solution for searching less formal communications, like email, instant messages, and voicemail, where the subject is often not mentioned by name.
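A bare-bones version of keyword and Boolean search can be sketched in a few lines of Python.  The documents, the project name "Falcon", and the parameter names (all_of, any_of, none_of, standing in for "and", "or", and "but not") are hypothetical; real search engines add indexing, stemming, phrase handling, and much more.

```python
import re


def words(text: str) -> set[str]:
    """Split text into a set of lower-case words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return ids of documents that satisfy the and / or / but-not terms."""
    hits = []
    for doc_id, text in docs.items():
        tokens = words(text)
        if not all(term in tokens for term in all_of):
            continue  # an "and" term is missing
        if any_of and not any(term in tokens for term in any_of):
            continue  # none of the "or" terms appear
        if any(term in tokens for term in none_of):
            continue  # a "but not" term appears
        hits.append(doc_id)
    return hits


# Hypothetical documents.
docs = {
    "memo-1": "Project Falcon budget approved by the board.",
    "memo-2": "Project Falcon lunch menu for Friday.",
    "email-7": "Did the bird thing get signed off? Need the numbers.",
}

hits = boolean_search(
    docs, all_of=["falcon"], any_of=["budget", "board"], none_of=["lunch"]
)
print(hits)
```

The search returns only memo-1.  The informal email, which may well be about the same project, never uses the word "Falcon" and so is never found.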

The next level of "what you know" searching involves data structure.  For example, when you see one, you can typically tell whether it is a Social Security number, a phone number, a street address, or a person's name.  It is possible to teach a computer to do the same thing.
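As a small illustration of recognizing data by its shape, the Python sketch below labels a string by the format it follows.  The two patterns cover only common U.S. formats and are assumptions made for the example; they check shape, not validity, and the sample values are made up.

```python
import re

# Simplified U.S. formats, assumed for this example only.
PATTERNS = {
    "social security number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "phone number": re.compile(r"(?:\(\d{3}\) ?|\d{3}[-. ])\d{3}[-.]\d{4}"),
}


def classify(value: str) -> str:
    """Label a string by the first format it fully matches."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(value.strip()):
            return label
    return "unknown"


# Made-up sample values.
for sample in ["000-12-3456", "617-555-0142", "(617) 555-0142", "Joe Johnson"]:
    print(sample, "->", classify(sample))
```

Street addresses and personal names have far less regular shapes, which is one reason recognizing them is harder than recognizing numbers.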

The big jump in technology is moving to inference-based searching, where you want a system to find things that are like other things even if the same words are not used.  This can sometimes find communications about a person or project only referenced by a nickname or not named at all.  In reality, a computer can still only do what it's told, and people have come up with a variety of computations to emulate what a human is doing when making inferences.  The four I heard from eDiscovery systems engineers were: Bayes, Shannon, linguistic indexing, and semantic indexing.  Describing what they do is the subject for another blog, but suffice it to say that they are unlikely to produce identical results.

4) Visualization

Even in paper files, complex litigation discovery has often involved millions of records.  Computers, though, can quickly and radically improve your ability to understand what you have.  My favorite example is that a five-year graph of the S&P 500 represents about 126,000 data points.  In my day at LegalTech, I was surprised by how little was being said about output formats.  I'm not sure whether that represents a lack of availability or a perception that lawyers only want traditional text presentation.

------------------

Conclusion:  Human document review was never perfect; critical documents have always been missed through concentration fatigue and occasional laziness or dishonesty.  Spend a week in a dusty warehouse full of documents, and you'll understand just how easy it is for those to occur.  But lawyers must understand that every one of these electronic eDiscovery techniques can be done well or poorly; that even the best techniques will likely miss something; and that some of these techniques (such as entity extraction and inferential searching) are young or imperfect.  Performing multiple techniques means compounding the number of misses or errors.  These may be the only realistic options for handling millions, billions, or trillions of records, so it's important for lawyers to know enough about the technologies being offered to ensure they ask the questions that matter to them, understand what they're buying, and consider the risks involved.
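To put a number on how misses compound, here is a short Python sketch.  The percentages are purely hypothetical, not measurements of any product, and the calculation assumes the steps run one after another and miss documents independently of each other.

```python
from math import prod


def combined_recall(step_recalls) -> float:
    """Fraction of relevant documents that survives every step in a chain."""
    return prod(step_recalls)


# Hypothetical share of relevant documents each step handles correctly.
steps = {"cleansing": 0.99, "entity extraction": 0.90, "inferential search": 0.85}

overall = combined_recall(steps.values())
print(f"{overall:.1%} of relevant documents survive all three steps")
```

Under these assumed figures, three steps that each look respectable on their own leave roughly a quarter of the relevant documents unfound.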
