
A Ruby gem for reading and extracting MHTML files

A Ruby gem for parsing MHTML.

Uses the Node.js C HTTP parser under the hood (thanks to @cotag for the gem).


Add this line to your application's Gemfile:

gem 'mhtml'

And then execute:

$ bundle

Or install it yourself as:

$ gem install mhtml


Two interfaces are provided: all at once, or chunked.

All at once

For when you have all of the data in memory.

source = File.read('/file/path.mht') # File.read closes the file handle for you
doc = Mhtml::RootDocument.new(source)

doc.headers.each { |h| puts h }

# body is decoded from quoted-printable, and encoded according to the charset header
puts doc.body

doc.sub_docs.each { |s| puts s }
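
To make the body decoding step concrete, here is a minimal sketch of quoted-printable decoding followed by charset conversion, using only Ruby's standard String methods. The helper name decode_quoted_printable is hypothetical and is not part of this gem's API; the gem's internals may work differently.

```ruby
# Hypothetical helper for illustration only; not part of the mhtml gem.
def decode_quoted_printable(raw, charset)
  # The 'M' directive of String#unpack1 decodes quoted-printable,
  # including soft line breaks ("=" at the end of a line).
  bytes = raw.unpack1('M')
  # Label the raw bytes with the declared charset, then transcode to UTF-8.
  bytes.force_encoding(charset).encode(Encoding::UTF_8)
end

raw = "Caf=E9 =\nau lait"
puts decode_quoted_printable(raw, 'windows-1252')
```

The charset argument stands in for the value of the charset parameter in a Content-Type header.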


Chunked

For when the source data is being streamed, or when memory usage is a concern.

doc = Mhtml::RootDocument.new

doc.on_header { |h| handle_header(h) } # yields each header

# yields body, possibly in parts
doc.on_body do |b|
  encoding = doc.encoding
  handle_body(b, encoding) # placeholder handler, like handle_header above
end

doc.on_subdoc_begin { handle_subdoc_begin } # yields nil on each subdoc begin
doc.on_subdoc_header { |h| handle_subdoc_header(h) } # yields each subdoc header
doc.on_subdoc_body { |b| handle_subdoc_body(b) } # yields each subdoc's body, possibly in parts
doc.on_subdoc_complete { handle_subdoc_complete } # yields nil on each subdoc complete

# /m lets . match newlines, and {1,128} keeps the final partial chunk
File.read('/file/path.mht').scan(/.{1,128}/m).each do |chunk|
  doc << chunk
end
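
When the source is an IO rather than a string already in memory, fixed-size reads avoid loading the whole file. The sketch below assumes a hypothetical each_chunk helper (not part of this gem) and uses a StringIO as a stand-in for a file or socket; each yielded chunk could be passed to doc << chunk.

```ruby
require 'stringio'

# Hypothetical helper for illustration only; not part of the mhtml gem.
# Yields successive chunks of at most `size` bytes read from `io`.
def each_chunk(io, size = 128)
  return enum_for(:each_chunk, io, size) unless block_given?

  # IO#read(n) returns up to n bytes, and nil at end of input,
  # so newlines and the final short chunk are both preserved.
  while (chunk = io.read(size))
    yield chunk
  end
end

source = "line one\r\nline two\r\n" * 20
received = +''
each_chunk(StringIO.new(source)) { |chunk| received << chunk }
puts received == source
```

Reading by byte count means a chunk boundary can fall in the middle of a line or a multi-byte character, which is why the consumer has to accept the body "possibly in parts".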


The header class looks like this (portrayed as a hash):

# Content-Type: multipart/related; charset="windows-1252"; boundary="----=_NextPart_01C74319.B7EA56A0"
{
  key: 'Content-Type',
  values: [
    { key: nil, value: 'multipart/related' },
    { key: 'charset', value: 'windows-1252' },
    { key: 'boundary', value: '----=_NextPart_01C74319.B7EA56A0' }
  ]
}
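
As a rough illustration of how a raw header line maps onto that shape, the following sketch splits a header into its key and its semicolon-separated values. The parse_header_line function is hypothetical and simplified (it does not handle semicolons inside quoted values or folded headers); the gem's actual parsing may differ.

```ruby
# Hypothetical parser for illustration only; not the mhtml gem's implementation.
def parse_header_line(line)
  # Split on the first colon only, since values may contain colons.
  key, raw_value = line.split(':', 2).map(&:strip)

  values = raw_value.split(';').map(&:strip).map do |part|
    # Split on the first "=" only, since boundary strings may contain "=".
    name, value = part.split('=', 2)
    if value.nil?
      { key: nil, value: name } # bare value such as multipart/related
    else
      { key: name, value: value.delete_prefix('"').delete_suffix('"') }
    end
  end

  { key: key, values: values }
end

line = 'Content-Type: multipart/related; charset="windows-1252"; ' \
       'boundary="----=_NextPart_01C74319.B7EA56A0"'
p parse_header_line(line)
```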


  • Revisit spec fixtures - either use existing solution or break out to separate gem
  • Build up body of fixtures using MHTML from various sources


After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

