Today I started working on the next feature of lambdaisland/uri, URI normalization. I worked test-first; you’ll get to see how that went in the next Lambda Island episode.
One of the design goals for this library is to have 100% parity between Clojure and ClojureScript. Learn once, use anywhere. The code is all written in .cljc files, so it can be treated as either Clojure or ClojureScript. I only use a small number of reader conditionals, and only where necessary.
#?(:clj
   (defmethod print-method URI [this writer]
     (.write writer "#")
     (.write writer (str edn-tag))
     (.write writer " ")
     (.write writer (prn-str (.toString this))))

   :cljs
   (extend-type URI
     IPrintWithWriter
     (-pr-writer [this writer _opts]
       (write-all writer "#" (str edn-tag) " " (prn-str (.toString this))))))
Example of a reader conditional
For this feature, however, I’m digging quite deeply into the innards of strings, in order to do percent-encoding and decoding. Once you get into hairy stuff like text encodings the platform differences become quite apparent. Instead of trying to smooth over the differences with reader conditionals, I decided to create two files, platform.clj and platform.cljs. They define the exact same functions, but one does it for Clojure, the other for ClojureScript. Now from my main namespace I require lambdaisland.uri.platform, and it will pull in the right one depending on the target that is being compiled for.
(ns lambdaisland.uri.normalize
  (:require [clojure.string :as str]
            ;; this loads either platform.clj, or platform.cljs
            [lambdaisland.uri.platform :refer [string->byte-seq
                                               byte-seq->string
                                               hex->byte byte->hex
                                               char-code-at]]))
The first challenge I ran into was that I needed to turn a string into a UTF-8 byte array, so that those bytes can be percent encoded. In Clojure that’s relatively easy. In ClojureScript the Google Closure library came to the rescue.
;; Clojure
(defn string->byte-seq [s]
  (.getBytes s "UTF8"))

(defn byte-seq->string [arr]
  (String. (byte-array arr) "UTF8"))

;; ClojureScript
(require '[goog.crypt :as c])

(defn string->byte-seq [s]
  (c/stringToUtf8ByteArray s))

(defn byte-seq->string [arr]
  (c/utf8ByteArrayToString (apply array arr)))
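To show why the UTF-8 bytes matter, here is a minimal sketch (JVM Clojure only) of percent-encoding built on the same `.getBytes` call. The name `percent-encode-all` is made up for this illustration and is not part of lambdaisland/uri. It encodes every byte, whereas a real normalizer would only encode the characters that need it.

```clojure
(defn percent-encode-all
  "Percent-encodes every UTF-8 byte of s. Illustrative only: a real
  normalizer would leave unreserved characters such as letters untouched."
  [^String s]
  (apply str
         ;; JVM bytes are signed, so mask to 0-255 before formatting as hex
         (map #(format "%%%02X" (bit-and % 0xFF))
              (.getBytes s "UTF-8"))))

(percent-encode-all "é")
;;=> "%C3%A9"
```

A single character can turn into several escaped bytes, which is why the string has to be converted to bytes first rather than encoded character by character.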
To detect which characters need to be percent-encoded I’m using some regular expressions. Things seemed to be going well, but when re-running my tests on ClojureScript I started getting some weird results.
;; Clojure
(re-seq #"." "🥀")
;;=> ("🥀")

;; ClojureScript
(re-seq #"." "🥀")
;;=> ("�" "�")
This, gentle folks, is the wonder of surrogate pairs. So how does this happen?
Sadly I don’t have time to give you a complete primer on Unicode and its historical mistakes, but to give you the short version…
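JavaScript strings date from a time when every Unicode character was expected to fit in 16 bits, so a string is a sequence of 16-bit code units. Unicode has grown a lot since then, and now also has a lot of codepoints with numbers greater than 65535. These include many old scripts, less common CJK characters (aka Hanzi or Kanji), many special symbols, and last but not least, emoji! Each of these is stored as two 16-bit units, known as a surrogate pair.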
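To make the arithmetic concrete, here is a minimal sketch of how UTF-16 splits a codepoint above 65535 into the two 16-bit halves of a surrogate pair. The function name `surrogate-pair` is made up for this illustration, and 0x1F940 is the codepoint of the 🥀 used in the examples.

```clojure
(defn surrogate-pair
  "Splits a codepoint >= 0x10000 into its [high low] UTF-16 surrogates."
  [code-point]
  (let [offset (- code-point 0x10000)]            ; 20 bits remain
    [(+ 0xD800 (bit-shift-right offset 10))       ; top 10 bits
     (+ 0xDC00 (bit-and offset 0x3FF))]))         ; bottom 10 bits

(map #(format "%04X" %) (surrogate-pair 0x1F940))
;;=> ("D83E" "DD40")
```

Those two values are what a string function sees as two separate "characters".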
In JavaScript, string methods like .slice still happily assume it’s 1995, and so they’ll cut surrogate pairs in half without blinking.
ClojureScript builds on those semantics, so you are exposed to all the same mess.
(seq "🚩")
;;=> ("�" "�")
I managed to work around this by first implementing char-seq, a way of looping over the actual characters of a string.
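As a rough idea of what such a function can look like, here is a minimal sketch and not necessarily the library's actual code. It assumes `char-code-at` returns the UTF-16 code unit at an index; the definition below is a hypothetical JVM stand-in for the platform helper of that name, and on ClojureScript its body would be `(.charCodeAt s i)` instead.

```clojure
;; Hypothetical JVM stand-in for the platform-specific helper.
(defn char-code-at [^String s i]
  (int (.charAt s (int i))))

(defn high-surrogate?
  "True when the code unit is the first half of a surrogate pair."
  [code]
  (<= 0xD800 code 0xDBFF))

(defn char-seq
  "Lazy seq of one-character strings, keeping surrogate pairs together."
  ([s] (char-seq s 0))
  ([s offset]
   (lazy-seq
     (when (< offset (count s))
       (let [width (if (high-surrogate? (char-code-at s offset)) 2 1)
             ;; guard against a dangling high surrogate at the end
             end   (min (count s) (+ offset width))]
         (cons (subs s offset end)
               (char-seq s end)))))))

(char-seq "a🥀b")
;;=> ("a" "🥀" "b")
```

Only `char-code-at` touches the host platform; `char-seq` itself relies on nothing but `count`, `subs` and arithmetic.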
I imagine this snippet might come in handy for some. Notice how it’s basically identical for Clojure and ClojureScript. This is because Java suffers from the same problem. The only difference is that on the JVM some parts of the language got the memo. So for instance regular expressions correctly work on characters, but things like substring or .charAt are essentially broken.
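The difference is easy to see at a JVM Clojure REPL. This small sketch builds the 🥀 string from its codepoint (0x1F940) and compares what the various string operations report; the var name `rose` is just for illustration.

```clojure
(def ^String rose (String. (Character/toChars 0x1F940)))

(count rose)                              ; counts UTF-16 code units
;;=> 2
(.codePointCount rose 0 (count rose))     ; counts actual codepoints
;;=> 1
(count (re-seq #"." rose))                ; regexes match whole codepoints
;;=> 1
(Character/isSurrogate (.charAt rose 0))  ; charAt returns half a pair
;;=> true
```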
Hopefully ClojureScript will eventually fix some of this mess.
In the meantime, what we can do is document the things you need to watch out for, and write cross-platform libraries like lambdaisland/uri that smooth over the differences. 👍
About the author
Arne divides his time between making Clojure tutorial videos for Lambda Island, and working on open-source projects like Chestnut. He is also available for Clojure and ClojureScript training and mentoring. You can support Arne through his Patreon page.