rubypants.rb
82d0251d
 #
 # = RubyPants - SmartyPants ported to Ruby
 #
 # Ported by Christian Neukirchen <mailto:chneukirchen@gmail.com>
 #   Copyright (C) 2004 Christian Neukirchen
 #
 # Incorporates ideas, comments and documentation by Chad Miller
 #   Copyright (C) 2004 Chad Miller
 #
 # Original SmartyPants by John Gruber
 #   Copyright (C) 2003 John Gruber
 #
 
 #
 # = RubyPants - SmartyPants ported to Ruby
 #
 # == Synopsis
 #
 # RubyPants is a Ruby port of the smart-quotes library SmartyPants.
 #
 # The original "SmartyPants" is a free web publishing plug-in for
 # Movable Type, Blosxom, and BBEdit that easily translates plain ASCII
 # punctuation characters into "smart" typographic punctuation HTML
 # entities.
 #
 #
 # == Description
 # 
 # RubyPants can perform the following transformations:
 # 
 # * Straight quotes (<tt>"</tt> and <tt>'</tt>) into "curly" quote
 #   HTML entities
 # * Backtick-style quotes (<tt>``like this''</tt>) into "curly" quote
 #   HTML entities
 # * Dashes (<tt>--</tt> and <tt>---</tt>) into en- and em-dash
 #   entities
 # * Three consecutive dots (<tt>...</tt> or <tt>. . .</tt>) into an
 #   ellipsis entity
 # 
 # This means you can write, edit, and save your posts using plain old
 # ASCII straight quotes, plain dashes, and plain dots, but your
 # published posts (and final HTML output) will appear with smart
 # quotes, em-dashes, and proper ellipses.
 # 
 # RubyPants does not modify characters within <tt><pre></tt>,
 # <tt><code></tt>, <tt><kbd></tt>, <tt><math></tt> or
 # <tt><script></tt> tag blocks. Typically, these tags are used to
 # display text where smart quotes and other "smart punctuation" would
 # not be appropriate, such as source code or example markup.
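 #
 # A minimal usage sketch (the sample strings are illustrative, not
 # taken from the original documentation; it assumes this file is on
 # the load path and that the default option <tt>2</tt>, i.e. old
 # school dashes, is in effect):
 #
 #   require 'rubypants'
 #
 #   RubyPants.new('"Hello" -- world...').to_html
 #   # => "&#8220;Hello&#8221; &#8211; world&#8230;"
 #
 #   # Text inside <code> is left alone:
 #   RubyPants.new("<p>It's <code>x--y</code></p>").to_html
 #   # => "<p>It&#8217;s <code>x--y</code></p>"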
 #
 #
 # == Backslash Escapes
 # 
 # If you need to use literal straight quotes (or plain hyphens and
 # periods), RubyPants accepts the following backslash escape sequences
 # to force non-smart punctuation. It does so by transforming the
 # escape sequence into a decimal-encoded HTML entity:
 # 
 #   \\    \"    \'    \.    \-    \`
 #
 # This is useful, for example, when you want to use straight quotes as
 # foot and inch marks: 6'2" tall; a 17" iMac.  (Use <tt>6\'2\"</tt>
 # and <tt>17\"</tt>, respectively.)
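 #
 # For instance (an illustrative sketch; <tt>%q{}</tt> is used here so
 # that Ruby itself leaves the backslashes in the string untouched):
 #
 #   RubyPants.new(%q{6\'2\" tall}).to_html
 #   # => "6&#39;2&#34; tall"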
 #
 # 
 # == Algorithmic Shortcomings
 # 
 # One situation in which quotes will get curled the wrong way is when
 # apostrophes are used at the start of leading contractions. For
 # example:
 # 
 #   'Twas the night before Christmas.
 # 
 # In the case above, RubyPants will turn the apostrophe into an
 # opening single-quote, when in fact it should be a closing one. I
 # don't think this problem can be solved in the general case--every
 # word processor I've tried gets this wrong as well. In such cases,
 # it's best to use the proper HTML entity for closing single-quotes
 # (<tt>&#8217;</tt>) by hand.
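 #
 # An illustrative sketch of the problem and the workaround (default
 # options assumed; the second call simply shows that a hand-written
 # entity passes through unchanged):
 #
 #   RubyPants.new("'Twas the night before Christmas.").to_html
 #   # => "&#8216;Twas the night before Christmas."
 #
 #   RubyPants.new("&#8217;Twas the night before Christmas.").to_html
 #   # => "&#8217;Twas the night before Christmas."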
 # 
 # 
 # == Bugs
 #
 # To file bug reports or feature requests (other than the shortcoming
 # described above), please send email to: mailto:chneukirchen@gmail.com
 #
 # If the bug involves quotes being curled the wrong way, please send
 # example text to illustrate.
 #
 #
 # == Authors
 # 
 # John Gruber did all of the hard work of writing this software in
 # Perl for Movable Type and almost all of this useful documentation.
 # Chad Miller ported it to Python to use with Pyblosxom.
 #
 # Christian Neukirchen provided the Ruby port, as a general-purpose
 # library that follows the *Cloth API.
 # 
 #
 # == Copyright and License
 # 
 # === SmartyPants license:
 # 
 # Copyright (c) 2003 John Gruber
 # (http://daringfireball.net)
 # All rights reserved.
 # 
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
 # are met:
 # 
 # * Redistributions of source code must retain the above copyright
 #   notice, this list of conditions and the following disclaimer.
 # 
 # * Redistributions in binary form must reproduce the above copyright
 #   notice, this list of conditions and the following disclaimer in
 #   the documentation and/or other materials provided with the
 #   distribution.
 # 
 # * Neither the name "SmartyPants" nor the names of its contributors
 #   may be used to endorse or promote products derived from this
 #   software without specific prior written permission.
 # 
 # This software is provided by the copyright holders and contributors
 # "as is" and any express or implied warranties, including, but not
 # limited to, the implied warranties of merchantability and fitness
 # for a particular purpose are disclaimed. In no event shall the
 # copyright owner or contributors be liable for any direct, indirect,
 # incidental, special, exemplary, or consequential damages (including,
 # but not limited to, procurement of substitute goods or services;
 # loss of use, data, or profits; or business interruption) however
 # caused and on any theory of liability, whether in contract, strict
 # liability, or tort (including negligence or otherwise) arising in
 # any way out of the use of this software, even if advised of the
 # possibility of such damage.
 # 
 # === RubyPants license
 # 
 # RubyPants is a derivative work of SmartyPants and smartypants.py.
 # 
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
 # are met:
 # 
 # * Redistributions of source code must retain the above copyright
 #   notice, this list of conditions and the following disclaimer.
 # 
 # * Redistributions in binary form must reproduce the above copyright
 #   notice, this list of conditions and the following disclaimer in
 #   the documentation and/or other materials provided with the
 #   distribution.
 # 
 # This software is provided by the copyright holders and contributors
 # "as is" and any express or implied warranties, including, but not
 # limited to, the implied warranties of merchantability and fitness
 # for a particular purpose are disclaimed. In no event shall the
 # copyright owner or contributors be liable for any direct, indirect,
 # incidental, special, exemplary, or consequential damages (including,
 # but not limited to, procurement of substitute goods or services;
 # loss of use, data, or profits; or business interruption) however
 # caused and on any theory of liability, whether in contract, strict
 # liability, or tort (including negligence or otherwise) arising in
 # any way out of the use of this software, even if advised of the
 # possibility of such damage.
 # 
 #
 # == Links
 #
 # John Gruber:: http://daringfireball.net
 # SmartyPants:: http://daringfireball.net/projects/smartypants
 #
 # Chad Miller:: http://web.chad.org
 #
 # Christian Neukirchen:: http://kronavita.de/chris
 #
 
 
 class RubyPants < String
   VERSION = "0.1"
 
   # Allowed elements in the options array:
   # 
   # 0 :: do nothing
   # 1 :: set all
   # 2 :: set all, using old school en- and em-dash shortcuts
   # 3 :: set all, using inverted old school en- and em-dash shortcuts
   # -1 :: stupefy (translate HTML entities to their ASCII counterparts)
   #
   # <tt>:quotes</tt>        :: quotes
   # <tt>:backticks</tt>     :: backtick quotes (``double'' only)
   # <tt>:allbackticks</tt>  :: backtick quotes (``double'' and `single')
   # <tt>:dashes</tt>        :: dashes
   # <tt>:oldschool</tt>     :: old school dashes
   # <tt>:inverted</tt>      :: inverted old school dashes
   # <tt>:ellipses</tt>      :: ellipses
   # <tt>:convertquotes</tt> :: convert <tt>&quot;</tt> entities to
   #                            <tt>"</tt> for Dreamweaver users
   # <tt>:stupefy</tt>       :: translate SmartyPants HTML entities
   #                            to their ASCII counterparts.
   #
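   # A few illustrative calls (a sketch; the sample strings are made up
   # for this example and are not part of the original documentation):
   #
   #   RubyPants.new("a -- b", [:dashes]).to_html   # => "a &#8212; b"
   #   RubyPants.new("a --- b", 3).to_html          # => "a &#8211; b"
   #   RubyPants.new("&#8220;Hi&#8221; &#8212; bye&#8230;", -1).to_html
   #   # => "\"Hi\" -- bye..."
   #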
   def initialize(string, options=[2])
     super string
     @options = [*options]
   end
 
   # Apply SmartyPants transformations.
   def to_html
     do_quotes = do_backticks = do_dashes = do_ellipses = do_stupefy = nil
     convert_quotes = false
 
     if @options.include? 0
       # Do nothing.
       return self
     elsif @options.include? 1
       # Do everything, turn all options on.
       do_quotes = do_backticks = do_ellipses = true
       do_dashes = :normal
     elsif @options.include? 2
       # Do everything, turn all options on, use old school dash shorthand.
       do_quotes = do_backticks = do_ellipses = true
       do_dashes = :oldschool
     elsif @options.include? 3
       # Do everything, turn all options on, use inverted old school
       # dash shorthand.
       do_quotes = do_backticks = do_ellipses = true
       do_dashes = :inverted
     elsif @options.include?(-1)
       do_stupefy = true
     else
       do_quotes =                @options.include? :quotes
       do_backticks =             @options.include? :backticks
       do_backticks = :both    if @options.include? :allbackticks
       do_dashes = :normal     if @options.include? :dashes
       do_dashes = :oldschool  if @options.include? :oldschool
       do_dashes = :inverted   if @options.include? :inverted
       do_ellipses =              @options.include? :ellipses
       convert_quotes =           @options.include? :convertquotes
       do_stupefy =               @options.include? :stupefy
     end
 
     # Parse the HTML
     tokens = tokenize
     
     # Keep track of when we're inside a tag whose content must stay
     # untouched (<pre>, <code>, <kbd>, <script> or <math>).
     in_pre = false
 
     # The result is accumulated here.
     result = ""
 
     # This is a cheat, used to get some context for one-character
     # tokens that consist of just a quote char. What we do is remember
     # the last character of the previous text token, to use as context
     # to curl single-character quote tokens correctly.
     prev_token_last_char = ""
 
     tokens.each { |token|
       if token.first == :tag
         result << token[1]
         if token[1] =~ %r!<(/?)(?:pre|code|kbd|script|math)[\s>]!
           in_pre = ($1 != "/")  # Opening or closing tag?
         end
       else
         t = token[1]
 
         # Remember last char of this token before processing.
         last_char = t[-1]
 
         unless in_pre
           t = process_escapes t
           
           t.gsub!(/&quot;/, '"')  if convert_quotes
 
           if do_dashes
             t = educate_dashes t            if do_dashes == :normal
             t = educate_dashes_oldschool t  if do_dashes == :oldschool
             t = educate_dashes_inverted t   if do_dashes == :inverted
           end
 
           t = educate_ellipses t  if do_ellipses
 
           # Note: backticks need to be processed before quotes.
           if do_backticks
             t = educate_backticks t
             t = educate_single_backticks t  if do_backticks == :both
           end
 
           if do_quotes
             if t == "'"
               # Special case: single-character ' token
               if prev_token_last_char =~ /\S/
                 t = "&#8217;"
               else
                 t = "&#8216;"
               end
             elsif t == '"'
               # Special case: single-character " token
               if prev_token_last_char =~ /\S/
                 t = "&#8221;"
               else
                 t = "&#8220;"
               end
             else
               # Normal case:                  
               t = educate_quotes t
             end
           end
 
           t = stupefy_entities t  if do_stupefy
         end
 
         prev_token_last_char = last_char
         result << t
       end
     }
 
     # Done
     result
   end
 
   protected
 
   # Return the string after processing the following backslash
   # escape sequences. This is useful if you want to force a "dumb" quote
   # or other character to appear.
   #
   # Escaped are:
   #      \\    \"    \'    \.    \-    \`
   #
   def process_escapes(str)
     str.gsub(/\\\\/, '&#92;').
       gsub(/\\"/, '&#34;').
       gsub(/\\'/, '&#39;').
       gsub(/\\\./, '&#46;').
       gsub(/\\-/, '&#45;').
       gsub(/\\`/, '&#96;')
   end
 
   # Return the string, with each instance of "<tt>--</tt>" translated to an
   # em-dash HTML entity.
   #
   def educate_dashes(str)
     str.gsub(/--/, '&#8212;')
   end
 
   # Return the string, with each instance of "<tt>--</tt>" translated to an
   # en-dash HTML entity, and each "<tt>---</tt>" translated to an
   # em-dash HTML entity.
   #
   def educate_dashes_oldschool(str)
     # Replace "---" first; otherwise the "--" pattern would consume
     # part of every triple hyphen.
     str.gsub(/---/, '&#8212;').gsub(/--/, '&#8211;')
   end
 
   # Return the string, with each instance of "<tt>--</tt>" translated
   # to an em-dash HTML entity, and each "<tt>---</tt>" translated to
   # an en-dash HTML entity. Two reasons why: First, unlike the en- and
   # em-dash syntax supported by +educate_dashes_oldschool+, it's
   # compatible with existing entries written before SmartyPants 1.1,
   # back when "<tt>--</tt>" was only used for em-dashes.  Second,
   # em-dashes are more common than en-dashes, and so it sort of makes
   # sense that the shortcut should be shorter to type. (Thanks to
   # Aaron Swartz for the idea.)
   #
   def educate_dashes_inverted(str)
     str.gsub(/---/, '&#8211;').gsub(/--/, '&#8212;')
   end
 
   # Return the string, with each instance of "<tt>...</tt>" translated
   # to an ellipsis HTML entity. Also converts the case where there are
   # spaces between the dots.
   #
   def educate_ellipses(str)
     str.gsub('...', '&#8230;').gsub('. . .', '&#8230;')
   end
 
   # Return the string, with <tt>``backticks''</tt>-style double quotes
   # translated into HTML curly quote entities.
   #
   def educate_backticks(str)
     str.gsub("``", '&#8220;').gsub("''", '&#8221;')
   end
 
   # Return the string, with <tt>`backticks'</tt>-style single quotes
   # translated into HTML curly quote entities.
   #
   def educate_single_backticks(str)
     str.gsub("`", '&#8216;').gsub("'", '&#8217;')
   end
 
   # Return the string, with "educated" curly quote HTML entities.
   #
   def educate_quotes(str)
     punct_class = '[!"#\$\%\'()*+,\-.\/:;<=>?\@\[\\\\\]\^_`{|}~]'
 
     str = str.dup
       
     # Special case if the very first character is a quote followed by
     # punctuation at a non-word-break. Close the quotes by brute
     # force:
     str.gsub!(/^'(?=#{punct_class}\B)/, '&#8217;')
     str.gsub!(/^"(?=#{punct_class}\B)/, '&#8221;')
 
     # Special case for double sets of quotes, e.g.:
     #   <p>He said, "'Quoted' words in a larger quote."</p>
     str.gsub!(/"'(?=\w)/, '&#8220;&#8216;')
     str.gsub!(/'"(?=\w)/, '&#8216;&#8220;')
 
     # Special case for decade abbreviations (the '80s):
     str.gsub!(/'(?=\d\ds)/, '&#8217;')
 
     close_class = %![^\ \t\r\n\\[\{\(\-]!
     dec_dashes = '&#8211;|&#8212;'
     
     # Get most opening single quotes:
     str.gsub!(/(\s|&nbsp;|--|&[mn]dash;|#{dec_dashes}|&#x201[34];)'(?=\w)/,
              '\1&#8216;')
     # Single closing quotes:
     str.gsub!(/(#{close_class})'/, '\1&#8217;')
     str.gsub!(/'(\s|s\b|$)/, '&#8217;\1')
     # Any remaining single quotes should be opening ones:
     str.gsub!(/'/, '&#8216;')
 
     # Get most opening double quotes:
     str.gsub!(/(\s|&nbsp;|--|&[mn]dash;|#{dec_dashes}|&#x201[34];)"(?=\w)/,
              '\1&#8220;')
     # Double closing quotes:
     str.gsub!(/(#{close_class})"/, '\1&#8221;')
     str.gsub!(/"(\s|s\b|$)/, '&#8221;\1')
     # Any remaining quotes should be opening ones:
     str.gsub!(/"/, '&#8220;')
 
     str
   end
 
   # Return the string, with each SmartyPants HTML entity translated to
   # its ASCII counterpart.
   #
   def stupefy_entities(str)
     str.
       gsub(/&#8211;/, '-').      # en-dash
       gsub(/&#8212;/, '--').     # em-dash
       
       gsub(/&#8216;/, "'").      # open single quote
       gsub(/&#8217;/, "'").      # close single quote
       
       gsub(/&#8220;/, '"').      # open double quote
       gsub(/&#8221;/, '"').      # close double quote
       
       gsub(/&#8230;/, '...')     # ellipsis
   end
 
   # Return an array of the tokens comprising the string. Each token is
   # either a tag (possibly with nested tags contained therein, such
   # as <tt><a href="<MTFoo>"></tt>) or a run of text between
   # tags. Each element of the array is a two-element array; the first
   # is either :tag or :text; the second is the actual value.
   #
   # Based on the <tt>_tokenize()</tt> subroutine from Brad Choate's
   # MTRegex plugin.  <http://www.bradchoate.com/past/mtregex.php>
   #
   # This is actually the easier variant using tag_soup, as used by
   # Chad Miller in the Python port of SmartyPants.
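   #
   # For illustration (a sketch with a made-up input), the string
   # <tt><p>Hi "x"</p></tt> would be split into:
   #
   #   [[:tag, "<p>"], [:text, "Hi \"x\""], [:tag, "</p>"]]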
   #
   def tokenize
     tag_soup = /([^<]*)(<[^>]*>)/
 
     tokens = []
 
     prev_end = 0
     scan(tag_soup) {
       tokens << [:text, $1]  if $1 != ""
       tokens << [:tag, $2]
       
       prev_end = $~.end(0)
     }
 
     if prev_end < size
       tokens << [:text, self[prev_end..-1]]
     end
 
     tokens
   end
 end