Showing posts with label Development. Show all posts
Showing posts with label Development. Show all posts

Thursday, January 28, 2016

Removing Newlines In Quoted Strings

Sometimes double-quoted strings in comma-delimited files (CSV) contain newline characters.

Newline characters embedded within strings really mess up any downstream processing that depends on newlines, well, denoting newlines! So you have to remove them or substitute some other value, such as a space.

I recently encountered this problem, working with a nearly 300,000 line file of election data. Manually editing the file was obviously not feasible.

This turns out to be an extraordinarily difficult problem to solve, for a variety of reasons that I learned serially the hard way:
  • Unix command-line tools work on lines, but the quoted strings span lines! -- You're trying to stitch back together into one line what command-line tools consider two lines. There seem to be some sed and awk solutions, but they seem very complex (link), and I could never get them to fully work in my scenario. (More on that below.)
  • Different *nix operating systems have different versions of command-line tools -- Another problem is that the tips you can find by searching the Web don't work on all Unix variants (e.g., Linux vs. Mac/BSD). So, when things don't work, you're constantly wondering whether you've done something wrong or whether your flavor of Unix is just incompatible with you the version some poster was using and assuming. I ended up installing GNU command-line tools, to try to get around this problem (link). Unfortunately, I still couldn't get the command-line suggestions to fully work in my scenario.
  • Excel isn't available as a tool -- Sometimes you can import data into Excel and do some simple transformations on it there, e.g., using Find & Replace. Again, I couldn't in this scenario, because of some quirks in Excel. On the one hand, if you implicitly import a .csv file into Excel by simply double-clicking on it, Excel does some magic and splices the split rows back together for you! At first that seems like a wonderful bonus, but it turns out that Excel also applies "intelligent" formatting to the columns based on data in them. For example, if it sees data like "01-30" is assumes it's a date. That sounds great, unless the data actually isn't a date, because Excel actually changes the underlying data to be a date, and there's no way (that I know of) to unformat it, to convert it back to the original data! I had some precinct identifiers that looked like dates, so I couldn't leverage Excel's auto import. On the other hand, if you explicitly import a .csv file into Excel--by opening a new worksheet in Excel and choosing File / Import ... , and explicitly identify the data types for each column to get around that auto-formatting problem--Excel doesn't then do the magic stitching of the split lines. It makes no sense, but there seems to be no setting to enable this, no way to get explicit import to work like auto import in Excel in that regard. Sigh. 
Once you've worked your way through all of that, the obvious solution is to write a little script to do the job yourself. How hard could this be? Read a character and write a character with a little state machine in the middle to keep track of whether you're inside a double quote or not. 

It turns out that a couple more problems make that seemingly simple script not so simple:
  • Newlines are invisible or unprintable characters -- The first problem is that newline characters are not visible in some popular text editors like Sublime Text. In others, like TextMate, you can View / Show Invisibles but then the invisibles only show up with little placeholder UI widgets. For example, there's a little grey NOT sign (¬) anywhere there's a newline. But all invisibles show up with the same representation. The much bigger problem, though, is that the carriage return (CR) then line feed (LF) sequence of invisible newline characters shows up as one unprintable character in TextMate! IOW, it can look like there's one newline character when there are, in fact, two, which, of course, can totally confound your otherwise seemingly simple code. What should work simply doesn't ... or doesn't always work in seemingly mysterious ways. Which leads to the last major complication:
  • Different applications, editors, and operating systems represent newlines differently -- A variety of combinations of CR and LF are used (link). Normally, you don't have to pay attention to this. However, when you're working with files coming from different systems (Windows vs. Mac vs. Unix), files produced/consumed by different applications that you don't normally use (I was using an open source GIS application called QGIS), or you're creating new files with text editors (I created some small sample files, by copying and pasted lines from the large file), the representation of newlines can get all mixed up, again confounding an otherwise simple script.  
The Unix command-line dump tool, e.g., od -c <file>, would seem like it would be helpful, but in light of the above in my scenario, not so much. Dumping works great on small sample/test files, and I did this a bunch. But, as I discovered eventually, my sample data represented newlines differently than the big file. od would have showed me that there were, in fact, two characters where I thought there was one in the real data, but the real data was gigantic and the offending data was buried and intermixed within it, so dumping the whole file was not practical or useful.

After beating my head against the wall for quite sometime, trying for the life of me to figure out why the simple Perl script I'd written worked on my sample but not on the real data--even when the sample was a copied and pasted from the original--I finally realized that the sample had single LF newlines while the full file used CRLF newline sequences. The text editor I had used to create the sample file, TextMate, had quietly used a different newline representation for the new file, even though I'd opened the real data file in same text editor. So the script was doing exactly what I wanted it to do, but the data wasn't what I thought it was!

(A side note here: If you go back to the command-line suggestions linked above, I believe they also assume single LF newlines which probably explains why the seemed to work sometimes--in retrospect, on my test data with just LF newlines--but not on my big file--with CRLF newlines.)

So, in the spirit of giving back, I've put the script I wrote in a Gist on GitHub (eat_CRLF_in_quoted_strs.pl) and included the code below, for your convenience. For my application, I chose to simply eat the newlines embedded in quoted strings, rather than replacing them with spaces.

Enjoy!

Friday, April 24, 2015

Trying Craft Client or Pro on Mac OS X Before Buying

Craft allows you to try Craft Client and Craft Pro before buying them, by running your site from the host name craft.dev or a subdomain such as site1.craft.dev (link). However, it's not immediately obvious what "running your site from craft.dev" means.

This post explains how to run your site from craft.dev on Mac OS X using MAMP and Sequel Pro.

1 — Install Craft on MAMP

  • The first step is to install Craft Personal. These instructions explain how to do that on Mac OS X using MAMP and Sequel Pro.

2 — Create a directory for the craft.dev site

  • Go to /Applications/MAMP/htdocs
  • Create a new folder trybeforebuy/public/ (or whatever you want to call it)  
  • Copy contents of the htdocs folder there, including the hidden .htaccess file
  • Edit $craftpath in index.php to be '../../../craft'

3 — Setup craft.dev as an Apache virtual host 

  • Follow these video instructions 
  • Use 'craft.dev' (no quotes) as the ServerName
  • Use "/Applications/MAMP/htdocs/trybeforebuy/public/" (with quotes) as the DocumentRoot and Directory 

4 — Change the Web port to 80 from 8888

  • Open MAMP
  • Switch to the Ports tab
  • Click on the “Set the Web & MySQL ports to 80 & 3306” button

5 — Access the new Dashboard 

  • Go to http://craft.dev/admin/login 

6 — Upgrade your Craft installation

  • Click on the "Show me" link in the "Upgrade Craft to take your site to the next level" box at the bottom of your Dashboard
  • Click on the "Test" link by Craft Client or Client Pro, whichever you want to explore
That's it!

This post describes how to switch between the different versions--Craft Personal, Craft Client, and Craft Pro--if you want to do that.

Saturday, April 4, 2015

Installing Craft CMS on Mac OS X Using MAMP & Sequel Pro

I just wanted to set up Craft CMS on my Macbook to experiment with. It took me forever. This explains what worked (and what was missing from other sets of instructions).

Craft's install documentation assumes but doesn't explain several prerequisites (like setting up my MySQL) and The Absolute Beginners Guide to Setting Up Craft on Mac (sounded perfect) was directionally helpful but had some errors and, at the end of the day, simply didn't work for me. For the longest time, I could not get Craft to connect to MySQL, even though it was installed correctly and running.

These are the perquisite steps, assuming you aren't already running Apache and MySQL:
  • Step P1: Download and install MAMP -- This sets up a Macintosh, Apache, MySQL, and PHP solution stack and gives you a simple management interface to start and stop the servers.
  • Step P2: Download and install Sequel Pro -- This gives you a simple tool to manage your MySQL databases.
  • Step P3: Open MAMP (not MAMP Pro), and click "Start Servers" -- This starts the Apache Web server and the MySQL database engine.
  • Step P4: In MAMP, click "Open WebStart page," if MAMP didn't already open one in your browser automatically.
  • Step P5: Open Sequel Pro, and use the parameters on MAMP's WebStart page to create a connection -- I could not create a Standard connection using '127.0.0.1'. I had to create a Socket connection, using 'localhost'. This is the scenario that the Craft installation instructions don't anticipate.
  • Step P6: Click the control in the upper lefthand of Sequel Pro, and choose "Add Database ..." to create a database, e.g.. 'crafted'. Choose UTF-8 for “Database Encoding."
These steps track those in Craft's installation instructions but with some additional explanation. Read them together:
  • Step C1: Upload the files -- You're uploading the files, because you're pushing them to your Apache Web server on your Macbook.
    • Download Craft -- It will end up in your Downloads folder. There are two folders craft/ and public/ in the downloaded 'Craft-2' folder.
    • Copy the craft/ folder to /Applications/MAMP, i.e., above the webfoot (htdocs/). 
    • Copy the files htaccess, index.php, and robots.txt from public/ to the htdocs/ folder.
    • Rename htaccess to .htaccess -- Open Terminal and run the following commands:
      • "cd /Applications/MAMP/htdocs" -- go to where htaccess lives 
      • "defaults write com.apple.finder AppleShowAllFiles -bool true" -- show hidden files
      • [Relaunch the Finder] -- hold down control-Alt while you click on Finder in the dock and then choose Relaunch
      • "mv htaccess .htaccess" -- rename htaccess
      • "defaults write com.apple.finder AppleShowAllFiles -bool false" -- hide hidden files
      • [Relaunch the Finder again]
  • Step C2: Set the permissions
    • Open Terminal and run the following commands:
      • "cd /Applications/MAMP/craft" -- go to the craft/ directory
      • "chmod -R 744 app"
      • "chmod -R 744 config"
      • "chmod -R 744 storage"
  • Step C3: Create your database 
    • Nothing to do here -- You already did this in Step P6 above.
  • Step C4: Tell Craft how to connect to your database
    • Copy the default settings in "craft/app/etc/config/defaults/db.php" into "craft/config/db.php"
    • Update the parameters
    • Set 'unixSocket' to the Socket value on your MAMP WebStart page (e.g., "/Applications/MAMP/tmp/mysql/mysql.sock"). 
  • Step C5: Run the installer! 
    • The URL is "localhost:8888/admin"
    • Follow the installation screens
I'm guessing that if you are setting up Craft in a production environment on a production Mac you might want to do somethings a little different, but this will get you a sandbox to play in. Enjoy!