etr/libhttpserver

[BUG] Unable to access Content-Type of file when part of multipart form

jmslocum opened this issue · 12 comments

Prerequisites

Description

When uploading a file as part of a multipart form, the Content-Type and filename are not accessible from the http_request object. I have tried get_args(), get_headers(), and get_footers(). However, if I dump the body with get_content(), I can see the fields in the body, but they are not hoisted into higher-level objects.

Example Code

#include <iostream>
#include <algorithm>

#include <httpserver.hpp>

using namespace httpserver;

class file_resource : public http_resource {
  public :
    const std::shared_ptr<http_response> render_POST(const http_request& req) {

      //Loop through the headers map and output them
      const auto& headers = req.get_headers();
      std::for_each(headers.begin(), headers.end(), [](auto& p){
        std::cout << p.first << " -> " << p.second << std::endl;
      });

      //Loop through the args map and output them
      const auto& args = req.get_args(); 
      std::for_each(args.begin(), args.end(), [](auto& p) {
        std::cout << p.first << " -> " << p.second << std::endl;
      });

      std::cout << "Body: " << req.get_content() << std::endl;
      return std::shared_ptr<http_response>(new string_response("OK"));
    }
};

void custom_access_log(const std::string& url) {
  std::cout << "Request to: " <<  url << std::endl;
}

void custom_error_log(const std::string& error) {
  std::cout << "Error: " << error << std::endl;
}

int main() {
  webserver ws = create_webserver(8080)
    .log_access(custom_access_log)
    .log_error(custom_error_log);

  file_resource fr;

  ws.register_resource("/file", &fr);

  ws.start(true);

  return 0;
}

Steps to Reproduce

  1. Compile example code above
  2. curl -X POST -F upload=@<some_file>.jpg localhost:8080/file
  3. the "upload" argument will have the file data, but the Content-Type and original filename are not accessible
  4. Looking at the output, you can see the get_content() string contains the Content-Type and filename
Method: POST, Body: --------------------------ea9b4d3e74b2950b
Content-Disposition: form-data; name="upload"; filename="suite_and_tie.jpg"
Content-Type: image/jpeg

Expected behavior:
Some way to access the filename and Content-Type of the file

Actual behavior:
No way to access it easily

Reproduces how often: 100%

I was able to work around the issue by parsing the file info from the body itself.

    #include <regex>
    #include <string>
    #include <utility>

    #include <spdlog/spdlog.h>

    const std::pair<std::string, std::string> getFileInfo(const std::string& body, const std::string& key) {
      std::string regex = "name=\"" + key + "\"; filename=\"(.*\\.[a-zA-Z]+)\"[\r\n]+Content-Type: (.*)";
      std::regex fileInfo(regex);
      std::smatch matches;

      if (std::regex_search(body, matches, fileInfo)) {
        //Found the two matches: the filename and the Content-Type
        spdlog::debug("Matches: {}, {}", matches[1].str(), matches[2].str());
        return std::make_pair(matches[1].str(), matches[2].str());
      }

      return std::make_pair("", "");
    }
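A quick, self-contained check of that workaround (the function is repeated here, minus the spdlog logging, so the snippet compiles on its own; the sample body mirrors the curl output shown above):

```cpp
#include <regex>
#include <string>
#include <utility>

// Same regex-based extraction as the workaround above, without the logging.
std::pair<std::string, std::string> getFileInfo(const std::string& body, const std::string& key) {
    std::regex fileInfo("name=\"" + key + "\"; filename=\"(.*\\.[a-zA-Z]+)\"[\r\n]+Content-Type: (.*)");
    std::smatch matches;
    if (std::regex_search(body, matches, fileInfo)) {
        return {matches[1].str(), matches[2].str()};
    }
    return {"", ""};
}

// Example: for a body containing
//   Content-Disposition: form-data; name="upload"; filename="suite_and_tie.jpg"\r\n
//   Content-Type: image/jpeg\r\n
// getFileInfo(body, "upload") yields {"suite_and_tie.jpg", "image/jpeg"}.
```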

Actually, I guess this is more of a feature request.

I had a look into libhttpserver's code for this, basically because we need a way to upload files and store them (temporarily) on the server. libmicrohttpd assists here with these:

* [7 Improved processing of POST data](https://www.gnu.org/software/libmicrohttpd/tutorial.html#Improved-processing-of-POST-data)

* [11 Adding a POST processor](https://www.gnu.org/software/libmicrohttpd/manual/html_node/microhttpd_002dpost.html)

* [11.1 Programming interface for the POST processor](https://www.gnu.org/software/libmicrohttpd/manual/html_node/microhttpd_002dpost-api.html)

I think libhttpserver does the right thing by hooking into it, but it uses the same post processor for mime-types "application/x-www-form-urlencoded" and "multipart/form-data". I guess it works for the first case (a usual web form without file upload) and does not for the second (a web form with file upload) because a completely different parser would be required.

I was wrong. The key (call this misleading, "not covered" by the libmicrohttpd docs, or my misunderstanding) is this function prototype:

/**
 * Iterator over key-value pairs where the value
 * maybe made available in increments and/or may
 * not be zero-terminated.  Used for processing
 * POST data.
 *
 * @param cls user-specified closure
 * @param kind type of the value, always #MHD_POSTDATA_KIND when called from MHD
 * @param key 0-terminated key for the value
 * @param filename name of the uploaded file, NULL if not known
 * @param content_type mime-type of the data, NULL if not known
 * @param transfer_encoding encoding of the data, NULL if not known
 * @param data pointer to @a size bytes of data at the
 *              specified offset
 * @param off offset of data in the overall value
 * @param size number of bytes in @a data available
 * @return #MHD_YES to continue iterating,
 *         #MHD_NO to abort the iteration
 */
typedef enum MHD_Result
(*MHD_PostDataIterator)(void *cls,
                        enum MHD_ValueKind kind,
                        const char *key,
                        const char *filename,
                        const char *content_type,
                        const char *transfer_encoding,
                        const char *data,
                        uint64_t off,
                        size_t size);

In src/webserver.cpp libhttpserver ignores kind, filename, content_type, transfer_encoding, and off.
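For illustration, here is a std-only sketch of a post iterator that keeps those fields per part instead of dropping them. It uses stand-in types rather than the real declarations from microhttpd.h, and nothing here is existing libhttpserver API:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// What a multipart-aware iterator could accumulate per form-data part.
struct part_info {
    std::string filename;      // from the `filename` callback argument
    std::string content_type;  // from the `content_type` callback argument
    std::string data;          // the value, assembled from incremental chunks
};

// Mirrors the shape of MHD_PostDataIterator: invoked repeatedly per part,
// with `off` giving the offset of `data` within the part's overall value.
// Returns true to mimic MHD_YES ("continue iterating").
bool post_iterator(std::map<std::string, part_info>& parts,
                   const char* key,
                   const char* filename,
                   const char* content_type,
                   const char* data,
                   std::uint64_t off,
                   std::size_t size) {
    part_info& p = parts[key];
    if (filename != nullptr) p.filename = filename;             // no longer ignored
    if (content_type != nullptr) p.content_type = content_type; // no longer ignored
    if (p.data.size() < off + size) p.data.resize(off + size);
    p.data.replace(off, size, data, size);                      // honor the offset
    return true;
}
```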

This could be fixed for libhttpserver by providing different post iterator methods per Content-Type. In a first step this would allow for extracting args from the post data, given someone comes up with a parser for multipart mime, leaving the question what to do with the file content aside. A proposal by @etr for that file handling would be nice then.

This is not needed, libmicrohttpd has a state machine based parser for "multipart/form-data", libhttpserver just does not make use of the additional fields set by libmicrohttpd, if the part is a file. I will look deeper again. Sorry for the noise.

etr commented

You are correct though; the issue is that libhttpserver currently ignores those fields and pulls the whole content in a single blurb. The post iterator method needs to be modified to use those parameters appropriately and store the content of each file separately.

This is not needed, libmicrohttpd has a state machine based parser for "multipart/form-data", libhttpserver just does not make use of the additional fields set by libmicrohttpd, if the part is a file. I will look deeper again. Sorry for the noise.

After reading and trying 7 Improved processing of POST data from the libmicrohttpd website and reviewing src/webserver.cpp again, I have a better picture now. My motivation is still to support uploading (potentially big) files through multipart/form-data and I spotted the following flaws in libhttpserver:

  • If I want to accept file uploads (in my case files of ~15 MiB size), I must not fiddle with webserver::content_size_limit, otherwise my POST data might be truncated early
  • http_request::grow_content() is always called and the whole body of the message is put into std::string http_request::content, although some parts of the body may contain binary data. That means the whole body is put into the application's memory, which will probably make embedded targets with low memory (32 MiB here) crash (did not try this yet)
  • Even worse: if you set webserver::post_process_enabled (default set to true), the file data is put into memory a second time with http_request::set_arg() in webserver::post_iterator(). However, disabling the post processor is not smart, because you then drop libmicrohttpd's whole multipart/form-data parser.

Proposal to solve this:

  • do not call http_request::grow_content() at all if a post processor is set
  • let the post processor set args as before (you'll need those anyway), but only for parts which are detected as text only form fields (the callback parameters kind, filename, and content_type should give good hints)
  • if a part of the multipart/form-data is of type file, don't put that to memory by default
  • find a clever way to process the actual file data (write to file or whatever)

The last point is the way to solve this issue I guess? 😉

etr commented

A few comments:

  • let the post processor set args as before (you'll need those anyway), but only for parts which are detected as text only form fields (the callback parameters kind, filename, and content_type should give good hints)

Makes sense to me. I think that using kind and content_type should basically tell us if the part needs to be processed by the post processor.

  • if a part of the multipart/form-data is of type file, don't put that to memory by default
  • do not call http_request::grow_content() at all if a post processor is set

Bundling both of these (as I think they are basically the same). I don't think this is necessarily correct. We had use cases where we wanted to both pull the content and also set the content into memory. I think it is really a matter of leaving the choice to whoever uses the API. In this case, I would provide a rational default (or set of defaults) for the behavior, and then allow the consumers to customize it. Ideally, we would have a "content_sink" object that, for example, by default puts the content to memory (and/or another that pushes to file), and provide an interface for the consumers to create their own content_sink to manage the data differently. I think it would be reasonable to have three pre-baked implementations of content_sink: one that discards the data, one that puts the data into memory (some sort of map), and another that puts it into files.
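To make the idea concrete, a minimal sketch of such a content_sink interface (all names here are hypothetical, not existing libhttpserver API):

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical interface: the server hands each chunk of an uploaded part
// to a sink chosen by the library user.
class content_sink {
public:
    virtual ~content_sink() = default;
    virtual void on_data(const std::string& key,
                         const char* data, std::size_t size) = 0;
};

// Pre-baked implementation 1: discard the data entirely.
class discard_sink : public content_sink {
public:
    void on_data(const std::string&, const char*, std::size_t) override {}
};

// Pre-baked implementation 2: accumulate each part in memory, keyed by name.
// (A third, file-backed sink would stream chunks to disk instead.)
class memory_sink : public content_sink {
public:
    std::map<std::string, std::string> parts;
    void on_data(const std::string& key, const char* data, std::size_t size) override {
        parts[key].append(data, size);
    }
};
```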

  • find a clever way to process the actual file data (write to file or whatever)

Whatever mechanism we use above, the grow_content will have to take into account the upload of distinct parts and store them separately for easy consumption.

  • if a part of the multipart/form-data is of type file, don't put that to memory by default

  • do not call http_request::grow_content() at all if a post processor is set

Bundling both of these (as I think they are basically the same). I don't think this is necessarily correct. We had use cases where we wanted to both pull the content and also set the content into memory.

Maybe let me rephrase to make my thoughts on the second point clearer: what would be the use case for first storing the whole content unprocessed and additionally post-processing it in libhttpserver, storing the parts a second time? Why store the unprocessed content if the content is post-processed and consumed anyway? 🤔

etr commented

The practical use-case that this supported, at the time of writing the library, was a raw log (of the request). The processed content was still used by the application itself. To be fair, the "get_content" method existed with a semantic that meant "it returns the whole body of the request" before the get_args method started returning processed content from POST requests. I think that's beside the point though, as the most important point to me is one of clarity/expectations.

From the client's point of view, the simplest expectation for the "get_content" method is that it returns the body of the request. If we decide that it doesn't contain processed content, the expectations around the "get_content" method become quite weird, as in: "it will return the content of the request body except for the parts that have been processed." I understand the efficiency argument, but I'd rather go for clarity over efficiency.

The argument would make more sense to me if we were proposing to eliminate the "get_content" method altogether, but we cannot really do that as it has clear use cases of its own.

As we are currently facing the very same problem (file upload via libhttpserver is quite slow and consumes a lot of RAM), I would like to ask about the status of this issue.

I looked into the proposal for a solution and gave it a test. So far it works well, but it is obviously not finished yet.

As we need a better way to upload files quite quickly (within the next 4 weeks), I would like to contribute and help with an implementation, or improve/finalize the proposed patch.
I don't expect a solution to be approved and integrated into the official libhttpserver release within this time. But I want to avoid using my own patch (which I have to make nevertheless for the first moment) forever. So I would like to develop a patch which is likely to be integrated into the official libhttpserver eventually.

To avoid going in the wrong direction, it would be great to have some comments on the following before I submit a possible patch:

  • In the patch there is a new option to specify an upload directory. Obviously this should be the place where uploaded files will be stored, but it is never used. So it is quite obvious that it should be used in webserver::post_iterator to create the actual file path.
  • Additionally there should be a check whether the file would really end up in exactly this location. Otherwise it might be possible for an attacker to overwrite existing files by using, for example, "../" in the filename.
    This check could be done via the realpath() function, which resolves the pathname to its real location. The real location could then be checked to see whether it is still inside the specified upload directory. If not, the upload should be rejected.
  • I would not store the file with its original (provided) filename. The original name may contain spaces and other special characters, which might cause file system problems. I would suggest that for each uploaded file a unique name with a random part (for example via mkstemp()) is generated. This would prevent issues if the same file is uploaded twice in different threads. It would also be safer for users who put such filenames directly into command line calls (system()), which could in the worst case lead to command injections, SQL injections, etc. Yes, the security of the actual application is not libhttpserver's responsibility, but it would be (in my opinion) a nice-to-have.
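A sketch of the realpath()-based containment check (POSIX; the helper name is made up). Since the uploaded file does not exist yet, it canonicalizes the directory part of the candidate path and compares it to the resolved upload directory:

```cpp
#include <cstring>
#include <stdlib.h>  // realpath(), free() -- realpath is POSIX, not ISO C++
#include <string>

// Hypothetical helper: verify that `upload_dir + "/" + filename` cannot
// escape the upload directory, even if `filename` contains "../".
bool stays_in_upload_dir(const std::string& upload_dir, const std::string& filename) {
    char* dir_real = realpath(upload_dir.c_str(), nullptr);
    if (dir_real == nullptr) return false;  // upload directory must exist

    // The file itself may not exist yet, so resolve only the directory part
    // of the candidate path and require it to equal the upload directory.
    std::string candidate = std::string(dir_real) + "/" + filename;
    std::string parent = candidate.substr(0, candidate.find_last_of('/'));

    char* parent_real = realpath(parent.c_str(), nullptr);
    bool ok = (parent_real != nullptr) && std::strcmp(parent_real, dir_real) == 0;

    free(parent_real);
    free(dir_real);
    return ok;
}
```

Note this conservatively rejects subdirectories inside the upload directory as well; a real implementation might instead accept any resolved parent that has the upload directory as a prefix.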

Regarding the get_content vs post_processor discussion:
As far as I understand libmicrohttpd, the post processor can only be used for content of type multipart/form-data and application/x-www-form-urlencoded. For other content, libmicrohttpd won't create a post processor. So to get the body of other content (i.e. application/json) you would still need the get_content method. Removing get_content completely might not be possible; at the least, another solution would be needed to retrieve the body of such calls.

So I would propose the following

  • Implement a new option omit_processed_data_in_content. If active, this would do what is already in the proposal: if the content allows it, process it via the post processor; if not, put it into the content. If inactive, libhttpserver would act as it does now: stuff everything into the content and additionally post-process the data (according to the content type). The default setting would be disabled, leaving the behaviour unchanged for current users of the lib.
  • Make use of the proposed parameter post_upload_dir. If the upload directory is not set, behave as now: put the content of uploaded files into the map. If the upload directory is given, create a unique filename in the given path, put the content directly into this file (with the additions mentioned above), and only put the filename into the map.
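The unique-name part could be sketched with mkstemp() (POSIX; the "upload_" prefix and helper name are made up for illustration):

```cpp
#include <string>
#include <vector>
#include <stdlib.h>  // mkstemp (POSIX)
#include <unistd.h>  // close

// Hypothetical helper: create a unique, safely named file inside `upload_dir`
// and return its path ("" on failure). mkstemp() fills in the XXXXXX template
// atomically, so two threads uploading the same file get distinct names.
std::string make_unique_upload_path(const std::string& upload_dir) {
    std::string tmpl = upload_dir + "/upload_XXXXXX";
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');
    int fd = mkstemp(buf.data());
    if (fd < 0) return "";
    close(fd);  // a real implementation would keep fd open and stream into it
    return std::string(buf.data());
}
```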

The only thing I am not sure about:
If the filename is a unique (random) name in the upload directory, two entries per file would be needed in the map:

  • the unique filename in the file system
  • the original name of the uploaded file (e.g. to report back to an end user in a web interface which file was uploaded)

So the original key should be the unique name, but there would have to be a second one.
Whatever the name of this key might be, having some hardcoded key injected into the map could cause issues if the same key were used by an application.
So maybe this information about uploaded files should be put into a new map, which holds the unique filename and the original filename (and maybe additional information like the file size).
Something like:

typedef std::string uniqueFilename;

struct fileInfos {
    size_t size;
    std::string originalName;
};

std::map<uniqueFilename, fileInfos> theMap;

@ctomahogh I know what I proposed in #246 is somewhat incomplete, and I did not invest time to come up with a solution going in the direction outlined by @etr. Sorry about that, but I don't think I have time for that in the near future. I think I could have a look at other ideas and maybe test those however.

Let me add one more thing: you are right, the filename is potentially malicious user input. What I did in my app though was sanitizing the filename instead of generating a random one, and I followed the suggestions from https://wiki.sei.cmu.edu/confluence/display/c/STR02-C.+Sanitize+data+passed+to+complex+subsystems

The function I came up with for that, looks like this:

#include <cerrno>
#include <cstring>

static char *sanitize_filename(char *s) {
    /*
     *  reject if:
     *
     *  - pointer is invalid
     *  - name is implicit directory "." or ".."
     *  - length is zero or longer than 255 byte
     *
 *  TODO    There is this list of reserved words on Microsoft Windows …
     */
    if ( (s == nullptr)
            || (s[0] == '.' && ( s[1] == '\0' || (s[1] == '.' && s[2] == '\0') ))
            || (s[0] == '\0') || (strlen( s ) > 255) )
    {
        errno = EINVAL;
        return nullptr;
    }

    //  @see    https://wiki.sei.cmu.edu/confluence/display/c/STR02-C.+Sanitize+data+passed+to+complex+subsystems

    const char * const a = "abcdefghijklmnopqrstuvwxyz"
                           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                           "1234567890_-.";
    const char *end = s + strlen( s );
    char *cp = s;

    for (cp += strspn(cp, a); cp != end; cp += strspn(cp, a))
    {
        *cp = '_';
    }

    return s;
}

This might produce collisions on the sanitized filenames, however. Hope that helps anyway. 🙃
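For a concrete before/after (the sanitizer is repeated, slightly condensed, so the snippet stands alone):

```cpp
#include <cerrno>
#include <cstring>

// Condensed copy of sanitize_filename from above: reject invalid input,
// then replace every byte outside the allow-list with an underscore.
static char *sanitize_filename(char *s) {
    if ((s == nullptr)
            || (s[0] == '.' && (s[1] == '\0' || (s[1] == '.' && s[2] == '\0')))
            || (s[0] == '\0') || (strlen(s) > 255)) {
        errno = EINVAL;
        return nullptr;
    }
    const char * const a = "abcdefghijklmnopqrstuvwxyz"
                           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                           "1234567890_-.";
    const char *end = s + strlen(s);
    for (char *cp = s + strspn(s, a); cp != end; cp += strspn(cp, a)) {
        *cp = '_';  // disallowed byte becomes an underscore
    }
    return s;
}

// Example: a buffer holding "suite & tie.jpg" is rewritten in place to
// "suite___tie.jpg" (the space, '&', and space are each replaced by '_').
```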

etr commented

Thanks a lot for all the thinking you have put into this. I understand you have some urgency on this as well, so I'd be happy to help you iterate fast on reviews.

I don't expect a solution to be approved and integrated into the official libhttpserver release within this time. But I want to avoid using my own patch (which I have to make nevertheless for the first moment) forever. So I would like to develop a patch which is likely to be integrated into the official libhttpserver eventually.

I would welcome changes from you if you are working already on a solution.

In the patch there is a new option to specify an upload directory. Obviously this should be the place where uploaded files will be stored, but it is never used. So it is quite obvious that it should be used in webserver::post_iterator to create the actual file path.

You are correct. I think that storing to disk shouldn't be the default behavior though. I am happy for the library to provide that as an option (as you suggest later in your post). For example, by having the upload path specified on server creation.

Additionally there should be a check whether the file would really end up in exactly this location. Otherwise it might be possible for an attacker to overwrite existing files by using, for example, "../" in the filename.
This check could be done via the realpath() function, which resolves the pathname to its real location. The real location could then be checked to see whether it is still inside the specified upload directory. If not, the upload should be rejected.

I think that a better solution is what you suggest later in your post. Generating a random filename that the library controls sounds like a better solution to me and should make this check redundant. I think this behavior should be optional (i.e. by adding a "generate_random_filename_on_upload" option) - if a library user wants to maintain the original filename they should be able to do so.

I would not store the file with its original (provided) filename. The original name may contain spaces and other special characters, which might cause file system problems. I would suggest that for each uploaded file a unique name with a random part (for example via mkstemp()) is generated. This would prevent issues if the same file is uploaded twice in different threads. It would also be safer for users who put such filenames directly into command line calls (system()), which could in the worst case lead to command injections, SQL injections, etc. Yes, the security of the actual application is not libhttpserver's responsibility, but it would be (in my opinion) a nice-to-have.

I think this is a good approach in general - see my answer to the previous point.

Implement a new option omit_processed_data_in_content. If active, this would do what is already in the proposal: if the content allows it, process it via the post processor; if not, put it into the content. If inactive, libhttpserver would act as it does now: stuff everything into the content and additionally post-process the data (according to the content type). The default setting would be disabled, leaving the behaviour unchanged for current users of the lib.

I am good with this.

Make use of the proposed parameter post_upload_dir. If the upload directory is not set, behave as now: put the content of uploaded files into the map. If the upload directory is given, create a unique filename in the given path, put the content directly into this file (with the additions mentioned above), and only put the filename into the map.

Ideally I'd separate the option to specify the location from the option that activates the behavior. The option to activate the behavior should have 3 states: memory_only (current behavior), disk_only (stores on disk only), and disk_and_memory (fills the content both in memory and on disk). If disk_only or disk_and_memory is selected, the location should be required.
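The three-state option and its validation rule could be sketched like this (all names are illustrative, not actual libhttpserver API):

```cpp
#include <stdexcept>
#include <string>

// The three proposed states for where uploaded file content goes.
enum class upload_target { memory_only, disk_only, disk_and_memory };

// If anything goes to disk, a location must be provided.
void validate_upload_options(upload_target target, const std::string& upload_dir) {
    if (target != upload_target::memory_only && upload_dir.empty()) {
        throw std::invalid_argument(
            "disk_only/disk_and_memory require an upload directory");
    }
}
```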

So the original key should be the unique name, but there would have to be a second one.
Whatever the name of this key might be, having some hardcoded key injected into the map could cause issues if the same key were used by an application.
So maybe this information about uploaded files should be put into a new map, which holds the unique filename and the original filename (and maybe additional information like the file size).

Might need to see how it comes up in code, but the overall approach of the map seems sound to me. To simplify things, whatever the value of the option "generate_random_filename_on_upload", we should just maintain the same map structure (if it is false, key and value would simply be the same).

etr commented

Closing this issue given it should be addressed by the latest change on head. You should expect a released version soon containing this change.