5 Reasons Why You Can’t Find that Hot Document
You’re in the midst of an intense document review for litigation. You need to find a critical document in your production or theirs. But look though you may, you’re not going to find it.
Why? Because in e-discovery and litigation document review, finding what you need in an electronic data population isn’t just a matter of plugging some keywords into a search engine. Before you can find what you need, the proper steps must be taken to ensure the information is available in the index to begin with so that the search engine can find it.
Here are five reasons why the contents of that document you’re trying to find may never have found their way into the index, or why they’re there but difficult to retrieve.
1. The document was not successfully processed because it was password-protected.
Password-protected files are just that: protected. They could be protected for a number of reasons, including privacy (think HR) or chicanery (think Enron). In order for their contents to be available to a search engine, the password has to be cracked so that the processing engine can get to the text. If the IT department or service provider isn’t instructed to crack passwords, or doesn’t have the appropriate tools to do so, password-protected files never get processed; they just show up on a log as error files. Always review the logs from the processing step to see which files failed and why.
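As a rough illustration of that log review, here is a minimal Python sketch that tallies failures by reason. The log layout (a CSV with `path`, `status`, and `reason` columns) and the sample rows are hypothetical; real processing tools each have their own log format, so treat the column names as assumptions to adapt.

```python
import csv
import io
from collections import Counter

def summarize_errors(log_file):
    """Return (failure counts by reason, list of failed paths).

    Assumes a hypothetical CSV log with 'path', 'status', 'reason' columns.
    """
    reasons = Counter()
    failed = []
    for row in csv.DictReader(log_file):
        if row["status"].strip().upper() == "ERROR":
            reasons[row["reason"].strip()] += 1
            failed.append(row["path"].strip())
    return reasons, failed

# Made-up sample log, for illustration only.
SAMPLE_LOG = """path,status,reason
hr/salaries.xlsx,ERROR,password protected
memos/q3.docx,OK,
deals/raptor.pdf,ERROR,password protected
scans/fax001.tif,ERROR,unsupported format
"""

reasons, failed = summarize_errors(io.StringIO(SAMPLE_LOG))
for reason, count in reasons.most_common():
    print(f"{count} file(s): {reason}")
```

A summary like this makes it easier to ask the right follow-up question, for example whether the password-protected files were ever sent for cracking.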
2. You’re searching for patent information and numbers or numeric ranges that weren’t indexed as expected.
To make indexes smaller and to speed up indexing, IT or your service provider may have exercised the option to exclude numbers from the index, which can reduce a data population by about 20%. If so, you won’t find that patent number or possibly that date you’re looking for. If possible, know what indexing options were used to create the document population you’re looking at. If numbers are important, be sure that the numbers have been indexed both as text and as numeric values, which is necessary to search by numeric range. For example, to find any document containing “stock” within 5 words of a number between 120 and 170, the numeric values option must be turned on. Also, if numbers are important in your search, be sure you understand how decimal points, commas, and minus signs have been indexed; the interpretation of these characters may vary from one indexing tool to another.
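To see why these options matter, consider the toy Python sketch below. It is not how any particular e-discovery index works; the tokenizer, the `index_numbers` flag, and the function names are all made up for illustration. It shows the "stock within 5 words of a number between 120 and 170" search succeeding when numbers are indexed and failing when they are dropped.

```python
import re

# Toy tokenizer: words, or digits with an optional decimal part.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?")

def tokenize(text, index_numbers=True):
    """Split text into lowercase tokens; optionally drop numeric tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not index_numbers:
        tokens = [t for t in tokens if not t[0].isdigit()]
    return tokens

def near_number_in_range(tokens, word, low, high, distance=5):
    """True if `word` occurs within `distance` tokens of a number in [low, high]."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    for i, t in enumerate(tokens):
        if t[0].isdigit() and low <= float(t) <= high:
            if any(abs(i - p) <= distance for p in positions):
                return True
    return False

doc = "The stock closed at 145.50 on Friday."
with_numbers = tokenize(doc)
without_numbers = tokenize(doc, index_numbers=False)

print(near_number_in_range(with_numbers, "stock", 120, 170))     # True
print(near_number_in_range(without_numbers, "stock", 120, 170))  # False
```

The same toy tokenizer also shows the punctuation problem: it splits "1,450" into the two tokens "1" and "450", so a range search for values above 1,000 would miss it. A real indexer may treat commas, decimal points, and minus signs differently, which is exactly why it is worth asking.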
3. The text wasn’t captured because the document was a PDF or scanned image and wasn’t OCR’d—or the OCR was “dirty.”
Most processing systems include a step for OCR (optical character recognition), which is the conversion of scanned images of handwritten, typewritten, or printed text into machine-encoded text, which can then be indexed. No OCR, no search results for an item like this. Dirty OCR, although less of a problem than it used to be, is so named because the OCR process may be hampered by imperfect images. An OCR application uses the black pixels in a document to try to interpret the correct letter or number. Many things can stand in the way of a good interpretation, including “noise” in the original document (think about a document that’s been photocopied numerous times, degrading the image), various font issues, and unknown characters. If the interpretation is imperfect, the information in the index will be, too.
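A small Python sketch can make the consequence concrete. The OCR output below is invented: the letter "o" in "contract" has been misread as a zero. An exact lookup misses the term, while an approximate ("fuzzy") lookup finds it. Fuzzy matching is only one possible workaround, shown here with the standard library's `difflib`; whether your review platform offers something similar, and how it is tuned, is an assumption to verify.

```python
import difflib

# Hypothetical terms pulled from a dirty OCR pass ("contract" misread as "c0ntract").
ocr_terms = ["c0ntract", "agreement", "signed"]

def exact_lookup(term, terms):
    """Exact match only, as a plain keyword search would do."""
    return [t for t in terms if t == term]

def fuzzy_lookup(term, terms, cutoff=0.8):
    """Return indexed terms whose similarity to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term, terms, n=3, cutoff=cutoff)

print(exact_lookup("contract", ocr_terms))  # []
print(fuzzy_lookup("contract", ocr_terms))  # ['c0ntract']
```

The cutoff is a trade-off: lowering it catches more OCR errors but also pulls in more unrelated terms.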
4. It contained text in French (Italian, Chinese, Japanese, Hebrew, Swahili…)
In order for foreign languages in documents to be indexed, the index and search engine must be Unicode compliant and the indexing technology needs to support the language. The Unicode Standard is the universal character-encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. It’s important to understand how best to construct search queries when you’re looking for documents that might contain a foreign language, especially one that is not written in the Roman alphabet.
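Even within Unicode, the same visible word can be stored in more than one way, which is one reason query construction matters. The Python sketch below uses the standard `unicodedata` module to show two encodings of "résumé" that look identical but compare as different until both are normalized. The helper name `normalize_term` is made up, and whether a given index normalizes text this way is an assumption to check with your provider.

```python
import unicodedata

def normalize_term(term):
    """Normalize to NFC form and fold case so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", term).casefold()

# Same word, two encodings: precomposed "é" vs. "e" plus a combining accent.
composed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"

print(composed == decomposed)                                  # False
print(normalize_term(composed) == normalize_term(decomposed))  # True
```

If the index stores one form and the query is typed in the other, an exact search can come back empty even though the word is on the page.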
5. It wasn’t a hot doc, it was a hot audio or video file.
Here’s a reminder: not all digital information is text. Digital audio exists, it’s becoming a common method of communication, and it’s discoverable (think digital voice mail). Before audio information can be indexed, it must be converted into text. If you think that hot evidence you’re looking for may be in an audio file, be sure that the collection process to be used will capture it (many collection tools simply ignore audio files), and ensure that your service provider knows how these files should be handled for your matter once they’ve been collected.
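One quick sanity check is to scan a file listing from the collection for audio and video files. The Python sketch below guesses each file's type from its extension using the standard `mimetypes` module; the file names are invented, and extension-based guessing is only a first pass, since it will miss files with unusual or missing extensions.

```python
import mimetypes

def find_media(paths):
    """Return the paths whose extension suggests an audio or video file."""
    media = []
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        if mime and mime.split("/")[0] in ("audio", "video"):
            media.append(path)
    return media

# Hypothetical listing of collected files.
collected = ["vm/msg001.wav", "memo.docx", "call.mp3", "depo.mp4"]
print(find_media(collected))
```

If a custodian is known to use voice mail and the listing shows no audio at all, that is a sign the collection tool may have skipped those files.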
Remember, before you can find what you need, the proper steps must be taken to ensure the information is indexed and searchable. Only then does it truly become “findable”.