Identifying Dark Data in non-OCR’d Images

By Michael Sampson on May 17, 2018

The new European data protection legislation goes into effect next Friday. It’s going to be incredibly interesting to see how militantly its enforcers play the game.

Complying with GDPR involves many things, but one core tenet is knowing where you are storing personal data. A challenging data type in this respect is an image that contains personal data but has not been converted to searchable text. One law firm in Sweden is doing something about its storage of such content:

Delphi, one of Sweden’s top commercial law firms, has chosen DocsCorp’s contentCrawler as part of its General Data Protection Regulation (GDPR) compliance strategy. The firm selected the contentCrawler OCR module to help address the “dark data” issue that was discovered after an audit of their file systems.

The audit found that 30% of the documents in the firm’s iManage Document Management System (DMS) were non-searchable. Nearly 70% of these were image-based PDF files, undermining the firm’s ability to manage clients’ personal data and to adequately respond to a Data Subject Access Request (DSAR).

For an organization to comply fully with DSARs or data return, erasure or portability requests, it needs to be able to search its DMS for all relevant documents. In the case of Delphi, it scanned driver licences and passports for identification purposes without OCR’ing the resulting image documents. The firm ended up storing large amounts of personal data that was effectively invisible to search technology, putting the firm at risk of non-compliance.

Identification of personal data is important. But so is knowing the legal basis under which it was held in the first place. That has tremendous implications for what organisations must do with discrete elements of personal data. Welcome to the new world, now just 8 days away.

Read more: Law firm chooses contentCrawler for GDPR compliance

Categories: Data Protection