PlanetPress Connect: What is a DataMapper?

It’s been a while since I contributed anything to this blog. Reason is, I’ve been busy working on OL’s next generation of products. If you haven’t heard yet, it’s called Connect, and it’s just plain awesome, if I do say so myself. Not that I wrote any of it, mind you. All I do is manage a team of expert coders who somehow convert my plain-text wishes into 0’s and 1’s that eventually become an application. Yup, that’s how we do it in the 21st century: we just write a series of 0’s and 1’s and we’ve got an application to sell.

Okay, so that’s not exactly how it’s done. It’s a bit more complex – so the coders tell me – but since I’m not overly technical, I usually fall into a deep coma as soon as I hear any of them start a sentence with “lemme explain how…”. Good thing about being a manager: I don’t have to learn all that mumbo-jumbo. I just ask for something and it magically appears the next day or the next week in the latest Beta. Cool, huh? So I spend all day having fun with the new gadgets they implement in the application. Yup, playing around with data can really be fun, but more on that later.

Carburetors and how things work

Why am I telling you this? Well, most people, just like me, don’t really like learning about how technology works. What we do love, however, is using it. I mean, who actually knows how a flat TV works these days, right? Or have you looked under the hood of any fairly recent car? Whaddaya mean, there’s no carburetor?!? Yet we all use TVs and cars, though doing so simultaneously is generally frowned upon by the authorities. It’s the fun part of technology: using it. I don’t care how it works, I just care that it does work. And most importantly, I want to be able to use any similar technology without having to learn the specifics for each version. Case in point: when my next-door neighbour invites me over to watch football on his TV, the darn thing better work just like mine does so I can easily zap during commercial breaks and watch reruns of Gilligan’s Island. And the bottle(s) of beer he serves during the game better work the same as the ones I have at home (why aren’t they all twist-caps anyway??? But I digress…).

The DataMapper

The point is, technology is fun when it works and when it’s simple enough to use that you don’t have to know much about its inner workings. And that’s exactly the goal we set out to achieve with the different modules that make up our newest generation of products. Since our team in Montreal is more specifically in charge of providing one of the line’s modules named DataMapper, I thought I’d talk a little bit about that. The module is used for… well… mapping data. Kind of an appropriate name, don’t you think? Okay, so what is “mapping data”? (Don’t worry, I won’t get into technical explanations. I couldn’t if I wanted to, it’s all magic to me. Perhaps it has something to do with the fact that the last two guys I hired for the department are named Merlin and Gandalf. Then again, maybe it’s just a coincidence). Uhm… where was I?

What is data mapping?

Oh yes: mapping data. Sorry, short attention span. To better illustrate what it is, think of a map, any map: what is it exactly? It’s simply a representation of something else. Or to put it another way: when you’re looking at Italy on the world map, you’re not actually looking at the real Italy. What you’re looking at is a representation of Italy, which by the way happens to look like a boot. And funny thing, if you ever go to Italy, you’ll notice something extremely perplexing: you never ever get the feeling you’re walking around on a huge boot. Why not? Well, because it’s not a boot at all: it just looks like it on a map. It’s all about representation vs. reality. And notice something else on the map: whether you’re looking at Italy, Russia or Papua New Guinea, all countries are delineated the same way, regardless of how tortured their coastline is or of how many islands and lakes they comprise. And to boot, the map always only shows the important information about each country: names of big cities, rivers and mountains, administrative regions. So each country is completely different yet the map allows you to instantly distinguish between them and find common information for all of them.

The DataMapper allows you to do the exact same thing with data. What kind of data? Any kind. How complex? As complex as you want. And what does the DataMapper do with this unbelievably-complex-and-absolutely-impossible-to-understand-data-file-spewed-out-by-some-prehistoric-mainframe? It generates a simple map. Just like the map of Italy. It allows you to mark the important items in the data (like the city of Rome) and discard the gibberish (like the number of goat herders living in the Abruzzo mountains). Just so we’re clear: the DataMapper won’t generate a visual map that makes your data look like a country (that would be cool, but kinda useless). Instead, it creates what’s called a UDM: a unified data model. The UDM turns your ‘unintelligible’ data into something that’s actually human-readable… and more importantly, human-understandable.

And that’s where the fun begins. Because as I already stated, if you’ve told the DataMapper it should generate a specific UDM, that’s what it will do, regardless of the complexity or diversity of your original data. Going back to my analogy with geographic maps, it’s as if you were telling the DataMapper to always extract the name of all capital cities and all major rivers in any country, regardless of where each of these items is physically located within each country. In essence, you’re mixing and matching countries of varying shapes and sizes, yet always extracting the same information for each of them.

Allow me to use an actual example with actual data

Say you have two PDF files. One is an invoice, the other is a simple promo letter. Let’s say all you want to do is scrape the postal address from each of the PDF files so you can use those postal addresses on other documents that you’ll be designing. The simple act of expressing your intent (scraping the postal address) equates to defining your UDM: for both files, you’ll want the DataMapper to extract name, first address line, second address line, city, state, country and zip code. Note that you might prefer to extract the entire postal address into a single field; that all depends on what you intend to do with it later on. But let’s say you want the aforementioned fields: unless you’re really lucky (and we all know that where data is involved, no one is ever lucky), those items won’t be located in the same physical location on the invoice and on the letter. Yet the DataMapper will extract both addresses and store them in the exact same format. Which means that when you get around to designing your new document on which you want to use those postal addresses, you don’t have to know anything about where they came from or how they were extracted. You just drag them onto your new document template as if they all originally came from the same source, using the same format.
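
For the curious (and for the Merlins and Gandalfs among you), here is a rough sketch of that idea in Python. To be clear, this is not how the DataMapper works internally, and none of these names come from the product: the PostalAddress class, the two lists of text lines standing in for the PDF pages, and the per-document "maps" of line positions are all made up for illustration. The only point is that two different layouts end up as the same kind of record.

```python
from dataclasses import dataclass


@dataclass
class PostalAddress:
    """Hypothetical unified record: the same fields for every source."""
    name: str
    address1: str
    address2: str
    city: str
    state: str
    country: str
    zip_code: str


# Made-up stand-ins for the text found on each PDF page.
INVOICE_LINES = [
    "INVOICE #1001",
    "Bill to:",
    "Jane Doe",
    "12 Main Street",
    "Suite 4",
    "Springfield, IL 62701",
    "USA",
]
LETTER_LINES = [
    "John Smith",
    "99 Elm Avenue",
    "",
    "Portland, OR 97201",
    "USA",
    "",
    "Dear John,",
]

# One "map" per document: which line each item sits on (assumed layout).
INVOICE_MAP = {"name": 2, "address1": 3, "address2": 4, "city_line": 5, "country": 6}
LETTER_MAP = {"name": 0, "address1": 1, "address2": 2, "city_line": 3, "country": 4}


def extract_address(lines, field_map):
    """Pull the address out of `lines` using the positions in `field_map`."""
    # The city line is assumed to look like "City, ST 12345".
    city, rest = lines[field_map["city_line"]].split(", ")
    state, zip_code = rest.split(" ")
    return PostalAddress(
        name=lines[field_map["name"]],
        address1=lines[field_map["address1"]],
        address2=lines[field_map["address2"]],
        city=city,
        state=state,
        country=lines[field_map["country"]],
        zip_code=zip_code,
    )


invoice_address = extract_address(INVOICE_LINES, INVOICE_MAP)
letter_address = extract_address(LETTER_LINES, LETTER_MAP)
print(invoice_address)
print(letter_address)
```

Whatever consumes invoice_address and letter_address afterwards never needs to know which document each one came from, which is the whole idea.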

And that’s just for two different source documents. This could be applied to hundreds of them because let’s face it, postal addresses are pretty similar all around (yes, I know there are regional differences in how the addresses are formatted but you get the idea). And the really cool thing is that those original source files don’t have to all be PDF files: some of them might be CSV or XML files or something else. Yet, in the end, you only deal with a single, unified, easy format on your new document template. See what I mean when I say data can actually be fun?
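
Here is what that could look like with a CSV file and an XML file, again as a back-of-the-napkin Python sketch rather than anything resembling the DataMapper’s actual machinery. The sample data, the column and tag names, and the from_csv and from_xml helpers are all assumptions invented for this example; only the csv and xml.etree.ElementTree modules are real (they ship with Python).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample sources; note that the field names differ between them.
CSV_SOURCE = (
    "full_name,street,town,postcode\n"
    "Jane Doe,12 Main Street,Springfield,62701\n"
)
XML_SOURCE = (
    "<customers><customer>"
    "<name>John Smith</name><addr>99 Elm Avenue</addr>"
    "<city>Portland</city><zip>97201</zip>"
    "</customer></customers>"
)


def from_csv(text):
    """Map each CSV row onto the unified field names."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {
            "name": row["full_name"],
            "address1": row["street"],
            "city": row["town"],
            "zip_code": row["postcode"],
        }
        for row in rows
    ]


def from_xml(text):
    """Map each <customer> element onto the same unified field names."""
    root = ET.fromstring(text)
    return [
        {
            "name": customer.findtext("name"),
            "address1": customer.findtext("addr"),
            "city": customer.findtext("city"),
            "zip_code": customer.findtext("zip"),
        }
        for customer in root.iter("customer")
    ]


# Both sources now look identical to whatever uses them next.
records = from_csv(CSV_SOURCE) + from_xml(XML_SOURCE)
for record in records:
    print(record["name"], "|", record["address1"], "|", record["city"], record["zip_code"])
```

Each source gets its own little mapping function, but everything downstream only ever sees one set of field names.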

A more complex example

But wait, there’s more! The example I used is a simplistic one: it only deals with postal addresses. But the same logic applies to more complex data from more complex documents. Take invoices, for instance. Over the last 15 years, I have seen literally thousands of different invoice models from all over the world. Yes, I have an exciting life, but that’s beside the point. They all had huge differences in design and layout; they had varying sizes and colors; sometimes folded, duplexed or filled with transpromo material; stuffed in envelopes or presented online. Some had a single line item, others had hundreds of them. Yet, when you stop for a second and examine their content, you realize that it’s almost identical from one invoice to the next: seller and customer information; item descriptions and quantities; totals and subtotals. Even terms and conditions. What, you ask? Can the DataMapper actually identify all of these contents, extract them and present them in a uniform map??? You bet it can. And it does so through an elegant (yes, yes, elegant!) combination of mouse selections, drag and drop operations and navigation buttons.
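
If you want to picture what a uniform map of an invoice might look like, here is one more hedged Python sketch. It has nothing to do with the mouse selections and drag and drop operations you would actually use in the DataMapper, and every name in it is hypothetical: the Invoice and LineItem classes, the two raw layouts and the map_a and map_b helpers are invented purely to show one invoice with a single line item and another with several ending up in the same shape. Prices are kept in cents to avoid rounding surprises.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price_cents: int


@dataclass
class Invoice:
    """Hypothetical unified invoice: same shape for every source layout."""
    seller: str
    customer: str
    items: list = field(default_factory=list)

    @property
    def total_cents(self):
        # Sum of quantity * unit price over all line items.
        return sum(item.quantity * item.unit_price_cents for item in self.items)


# Two made-up source layouts with different names and structures.
RAW_A = {"vendor": "Acme Corp", "client": "Jane Doe", "rows": [("Widget", 2, 500)]}
RAW_B = {
    "from": "Globex",
    "to": "John Smith",
    "lines": [
        {"desc": "Gadget", "qty": 1, "price": 1200},
        {"desc": "Cable", "qty": 3, "price": 150},
    ],
}


def map_a(raw):
    """Layout A stores line items as (description, quantity, price) tuples."""
    items = [LineItem(desc, qty, price) for desc, qty, price in raw["rows"]]
    return Invoice(seller=raw["vendor"], customer=raw["client"], items=items)


def map_b(raw):
    """Layout B stores line items as dictionaries with short key names."""
    items = [LineItem(line["desc"], line["qty"], line["price"]) for line in raw["lines"]]
    return Invoice(seller=raw["from"], customer=raw["to"], items=items)


invoices = [map_a(RAW_A), map_b(RAW_B)]
for invoice in invoices:
    print(invoice.seller, "->", invoice.customer, len(invoice.items), "item(s)", invoice.total_cents)
```

One line item or hundreds, the result is always a seller, a customer, a list of items and a total.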

So what do you think? Ready to have fun too? Learn more about the DataMapper here