Thursday, January 28, 2016

Removing Newlines In Quoted Strings

Sometimes double-quoted strings in comma-delimited files (CSV) contain newline characters.

Newline characters embedded within strings really mess up any downstream processing that depends on newlines, well, denoting newlines! So you have to remove them or substitute some other value, such as a space.
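
For example, this single logical record (made-up data) occupies two physical lines, so a line-oriented tool sees two malformed rows instead of one good one:

    id,precinct,notes
    42,"Precinct 7","Polling place moved
    to the school gym"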

I recently encountered this problem while working with a nearly 300,000-line file of election data. Manually editing the file was obviously not feasible.

This turns out to be an extraordinarily difficult problem to solve, for a variety of reasons that I learned, one at a time, the hard way:
  • Unix command-line tools work on lines, but the quoted strings span lines! -- You're trying to stitch back together into one line what the command-line tools consider two lines. There are sed and awk solutions out there, but they're quite complex (link), and I could never get them to fully work in my scenario. (A sketch of the idea behind them follows this list; more on why they failed below.)
  • Different *nix operating systems have different versions of command-line tools -- Another problem is that the tips you find by searching the Web don't work on all Unix variants (e.g., Linux vs. Mac/BSD). So, when things don't work, you're constantly wondering whether you've done something wrong or whether your flavor of Unix is just incompatible with the version some poster was using and assuming. I ended up installing the GNU command-line tools to try to get around this problem (link). Unfortunately, I still couldn't get the command-line suggestions to fully work in my scenario.
  • Excel isn't available as a tool -- Sometimes you can import data into Excel and do some simple transformations there, e.g., using Find & Replace. I couldn't in this scenario, because of some quirks in Excel. On the one hand, if you implicitly import a .csv file into Excel by simply double-clicking on it, Excel does some magic and splices the split rows back together for you! At first that seems like a wonderful bonus, but it turns out that Excel also applies "intelligent" formatting to the columns based on the data in them. For example, if it sees a value like "01-30", it assumes it's a date. That sounds great, unless the data actually isn't a date, because Excel actually changes the underlying data to be a date, and there's no way (that I know of) to unformat it, to convert it back to the original value. I had some precinct identifiers that looked like dates, so I couldn't leverage Excel's auto import. On the other hand, if you explicitly import a .csv file into Excel--by opening a new worksheet and choosing File / Import ..., where you can identify the data type of each column to get around the auto-formatting problem--Excel doesn't then do the magic stitching of the split lines. It makes no sense, but there seems to be no setting to enable this, no way to make explicit import behave like auto import in that regard. Sigh.
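
For what it's worth, the idea behind those sed/awk suggestions is quote counting: if the number of double quotes seen so far is odd at the end of a line, you're still inside a quoted field, so join the next line on. An illustrative sketch, not the exact commands from those posts:

    awk '{ n += gsub(/"/, "\""); printf "%s%s", $0, (n % 2 ? " " : "\n"); n %= 2 }' input.csv > output.csv

Note that this replaces the embedded newline with a space and assumes lone LF newlines; with CRLF files, stray CR characters remain in the output--which, as you'll see below, was exactly my problem.
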
Once you've worked your way through all of that, the obvious solution is to write a little script to do the job yourself. How hard could this be? Read a character and write a character, with a little state machine in the middle to keep track of whether you're inside a double-quoted string or not.

It turns out that a couple more problems make that seemingly simple script not so simple:
  • Newlines are invisible or unprintable characters -- The first problem is that newline characters simply aren't visible in some popular text editors, like Sublime Text. In others, like TextMate, you can View / Show Invisibles, but then the invisibles only show up as little placeholder UI widgets. For example, there's a little grey NOT sign (¬) anywhere there's a newline--but all invisibles show up with the same representation. The much bigger problem, though, is that the carriage return (CR) then line feed (LF) sequence of invisible newline characters shows up as one unprintable character in TextMate! IOW, it can look like there's one newline character when there are, in fact, two, which, of course, can totally confound your otherwise seemingly simple code. What should work simply doesn't ... or doesn't always, in seemingly mysterious ways. Which leads to the last major complication:
  • Different applications, editors, and operating systems represent newlines differently -- A variety of combinations of CR and LF are used (link); see the summary below. Normally, you don't have to pay attention to this. However, when you're working with files coming from different systems (Windows vs. Mac vs. Unix), with files produced/consumed by applications you don't normally use (I was using an open source GIS application called QGIS), or with new files you create in text editors (I created some small sample files by copying and pasting lines from the large file), the newline representations can get all mixed up, again confounding an otherwise simple script.
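
For reference, the common conventions:

    LF   (\n)    Unix, Linux, OS X
    CR   (\r)    classic Mac OS (pre-OS X)
    CRLF (\r\n)  Windows (and what the CSV spec, RFC 4180, prescribes)
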
The Unix command-line dump tool, e.g., od -c <file>, seems like it would be helpful here, but in light of the above, in my scenario, not so much. Dumping works great on small sample/test files, and I did that a bunch. But, as I eventually discovered, my sample data represented newlines differently than the big file did. od would have shown me that there were, in fact, two characters where I thought there was one in the real data, but the real data was gigantic and the offending bytes were buried and intermixed within it, so dumping the whole file was neither practical nor useful.
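
To see what I mean, here's od -c on a tiny made-up record with an embedded CRLF; the CR LF pair is unmistakable in a way the editors' placeholder widgets are not:

    $ printf 'a,"b\r\nc",d\r\n' | od -c
    0000000   a   ,   "   b  \r  \n   c   "   ,   d  \r  \n
    0000014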

After beating my head against the wall for quite some time, trying for the life of me to figure out why the simple Perl script I'd written worked on my sample but not on the real data--even when the sample was copied and pasted from the original--I finally realized that the sample had single LF newlines while the full file used CRLF newline sequences. The text editor I had used to create the sample file, TextMate, had quietly used a different newline representation for the new file, even though I'd opened the real data file in the same text editor. So the script was doing exactly what I wanted it to do, but the data wasn't what I thought it was!
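
In hindsight, a one-line check would have exposed the mismatch immediately. Something like this (a quick Perl diagnostic sketch; any test for a CR before the LF at the end of each line would do):

    perl -ne '$n++ if /\r\n\z/; END { printf "%d of %d lines end in CRLF\n", $n // 0, $. }' data.csv

Run against my sample, it would have reported zero CRLF lines; run against the real file, it would have reported them everywhere.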

(A side note here: If you go back to the command-line suggestions linked above, I believe they also assume single LF newlines, which probably explains why they seemed to work sometimes--in retrospect, on my test data with just LF newlines--but not on my big file, with CRLF newlines.)

So, in the spirit of giving back, I've put the script I wrote in a Gist on GitHub (eat_CRLF_in_quoted_strs.pl) and included the code below, for your convenience. For my application, I chose to simply eat the newlines embedded in quoted strings, rather than replacing them with spaces.
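
The heart of it is just the little state machine described above. (The listing here is a minimal sketch of that approach; the Gist has the definitive version.)

    #!/usr/bin/perl
    # eat_CRLF_in_quoted_strs.pl (sketch) -- remove CR and LF characters
    # that occur inside double-quoted strings; pass everything else
    # through untouched. Reads STDIN, writes STDOUT.
    use strict;
    use warnings;

    binmode STDIN;    # don't let the I/O layer translate newlines for us
    binmode STDOUT;

    my $in_quotes = 0;    # the entire state machine: inside quotes or not
    my $ch;

    while (read(STDIN, $ch, 1)) {
        if ($ch eq '"') {
            # Toggle on every quote. (A doubled quote inside a field--CSV's
            # escape for a literal quote--just toggles twice, a no-op here.)
            $in_quotes = !$in_quotes;
            print $ch;
        }
        elsif ($in_quotes and ($ch eq "\r" or $ch eq "\n")) {
            # An embedded newline character: eat it. To substitute a
            # space instead, print ' ' here.
        }
        else {
            print $ch;
        }
    }

Run it as a filter: perl eat_CRLF_in_quoted_strs.pl < messy.csv > clean.csv. Because it tests for CR and LF separately, it handles embedded newlines in any of the representations above.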

Enjoy!