singerla/pptx-automizer

Output files are bigger than input

Closed this issue · 10 comments

Hello! Really appreciate your work on this. One small thing: I'm finding that the output files are often quite large, larger than the initial file even when only retaining one slide. For example, with inputdemo.pptx as input to the following code, I'm getting a 1.5 MB output file, which is ~50% bigger than the input file despite having removed 3 slides.

import Automizer from './index';

const automizer = new Automizer({
  templateDir: `./`,
  outputDir: `./`,
  removeExistingSlides: true,
});

const run = async () => {
  const pres = automizer
    .loadRoot(`inputdemo.pptx`)
    .load(`inputdemo.pptx`, 'ContentSlides')

  const result = await pres
    .addSlide('ContentSlides', 2, (slide) => {
      // Could modify here
    })
    .write(`outputdemo.pptx`);
};

run().catch((error) => {
  console.error(error);
});

Manually unzipping the output file, it looks to have retained all three of the images from the four slides of the template file, then re-added the image used in the slide that was added back as content to the newly cleaned template.

Does this make sense? Is there anything I can do to work around this? Thanks again!

Hi! I'm glad to hear you like pptx-automizer, but I regret that you have discovered its darkest secret: there is no post-production cleanup at all. I have noticed that PowerPoint does not complain about extra content in the archive, as long as there is no relation to a non-existent file. File size did not matter to me, so I kept it simple.

You can open the file in PowerPoint and save it without any modification. PowerPoint will remove all unused content and shrink the file size to the expected value.

There is a second weakness related to this issue: opening a file may trigger a repair message due to an unhandled relation. The most elegant solution would be to catch both unneeded content and unhandled relations. You might want to take a look at normalizePresentation(). This is where it started some time ago.

As you have described, pptx-automizer only modifies copied content and leaves the original parts untouched. This causes the file size to increase. You can reduce this unwanted effect by using an empty root presentation and adding all required content to it. In the case you describe, it seems you have used a single presentation as both root and template.

To create a root presentation from your existing template inputdemo.pptx, you need to delete every slide and save it as a new file inputdemoRoot.pptx. Your new root presentation should contain no content slides, but all master slides and all images used by master slides.

Your loader goes like this:

  const pres = automizer
    .loadRoot('inputdemoRoot.pptx')
    .load('inputdemo.pptx', 'ContentSlides')

This should significantly reduce your output file size, because all added content will be copied into the root presentation without duplicates.

Thank you!

To create a root presentation from your existing template inputdemo.pptx, you need to delete every slide and save it as a new file inputdemoRoot.pptx. Your new root presentation should contain no content slides, but all master slides and all images used by master slides.

Yes this makes sense, unfortunately the input files won't be mine so I can't do the pre-trimming.

I guess the crudest way forward might be to iterate through each media file and look for references to it in the XML and if it's not found, unlink/delete it - and put this either in truncate() or normalizePresentation()? Given your extensive knowledge on this topic are there any watchouts to an approach like that?

If no additional files are inserted, but only existing ones rearranged, the crude way is exactly as you have figured out. In more detail:

  • Iterate through the p:sldId items in /ppt/presentation.xml to get the slides that are finally visible.
  • Their r:id attributes lead to the target slide XML via ppt/_rels/presentation.xml.rels.
  • Collect all rIds used in /ppt/slides/slide#.xml.
  • Iterate through all Relationship items in /ppt/slides/_rels/slide#.xml.rels and find the unused relations. The Target attribute leads to the unneeded media file.

This can be done inside normalizePresentation().
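As a rough illustration of the steps above, here is a minimal sketch that works on a plain map of archive paths to XML text instead of a real zip. All names in it (ArchiveFiles, resolveTarget, parseRels, findUnusedMedia) are hypothetical and not part of pptx-automizer, and it only looks at slides: layouts, masters and charts would need the same treatment before anything is actually deleted.

```typescript
// Hypothetical helper types and functions, not pptx-automizer API.
// Keys are archive paths; values are XML text (media entries may be empty strings).
type ArchiveFiles = Record<string, string>;

// Resolve a relationship Target such as "../media/image1.jpg" against the
// directory of the part that owns the .rels file.
function resolveTarget(baseDir: string, target: string): string {
  if (target.charAt(0) === '/') return target.slice(1);
  const parts = baseDir.split('/');
  for (const seg of target.split('/')) {
    if (seg === '..') parts.pop();
    else if (seg !== '.' && seg !== '') parts.push(seg);
  }
  return parts.join('/');
}

// Collect the first capture group of every match of a global regex.
function allMatches(re: RegExp, text: string): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) out.push(m[1]);
  return out;
}

// Map relationship Id -> Target for one .rels file (regex-based, sketch only).
function parseRels(xml: string): Record<string, string> {
  const rels: Record<string, string> = {};
  const tag = /<Relationship\b[^>]*>/g;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(xml)) !== null) {
    const id = /\bId="([^"]+)"/.exec(m[0]);
    const target = /\bTarget="([^"]+)"/.exec(m[0]);
    if (id && target) rels[id[1]] = target[1];
  }
  return rels;
}

// Return all ppt/media/* paths that no visible slide references.
function findUnusedMedia(files: ArchiveFiles): string[] {
  const presRels = parseRels(files['ppt/_rels/presentation.xml.rels'] || '');
  // Step 1: r:id of every p:sldId in presentation.xml.
  const slideIds = allMatches(
    /<p:sldId\b[^>]*\br:id="([^"]+)"/g,
    files['ppt/presentation.xml'] || '',
  );
  const used = new Set<string>();
  for (const rId of slideIds) {
    // Step 2: follow the relationship to the slide part.
    const target = presRels[rId];
    if (!target) continue;
    const slidePath = resolveTarget('ppt', target);
    const slideXml = files[slidePath];
    if (slideXml === undefined) continue;
    const slash = slidePath.lastIndexOf('/');
    const dir = slidePath.slice(0, slash);
    const name = slidePath.slice(slash + 1);
    const rels = parseRels(files[`${dir}/_rels/${name}.rels`] || '');
    // Steps 3 and 4: only relations whose rId appears in the slide XML count as used.
    for (const ref of allMatches(/\br:[A-Za-z]+="([^"]+)"/g, slideXml)) {
      if (rels[ref]) used.add(resolveTarget(dir, rels[ref]));
    }
  }
  return Object.keys(files)
    .filter((p) => p.indexOf('ppt/media/') === 0 && !used.has(p))
    .sort();
}
```

Collecting the used targets across all visible slides first, and only then comparing against the media folder, avoids flagging an image that one slide no longer uses but another still does.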

I would prefer to do some refactoring and create an index object during the building process. All information about used and unused relations is already available, so we could avoid redundant iterations. I need to dig deeper into this next week.

By the way, I discovered another very obvious reason for the larger files: JSZip defaults to a lower compression level than PowerPoint.

@adamstamper You can check out the feature-content-tracker branch and try again with the compression param:

const automizer = new Automizer({
  templateDir: `./`,
  outputDir: `./`,
  removeExistingSlides: true,
  compression: 9  // choose 9 for highest compression level, 0 to skip deflation
});

This issue also overlaps with my plans to improve performance in general, so I need to keep digging. Stay tuned! 😄

Thanks! I've played with the branch and I can see that more compression helps with embedded content like chart data or other text-based parts. However, with JPEG images (e.g. in the inputdemo.pptx file I shared previously), going from compression level 1 to 9 only reduced the size by 600 bytes. I guess JPEGs are already fairly optimised, so they're not very compressible?

That's right! You will see more compression only if identical media files are copied twice or more. Your problem still exists: unneeded media files will remain in the archive.
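The effect can be reproduced outside of any presentation with Node's built-in zlib. This is only a sketch under assumptions: random bytes stand in for JPEG payloads (real JPEGs also carry headers and metadata), repetitive markup stands in for XML parts, and deflateRatio is a made-up helper name.

```typescript
import { deflateSync } from 'node:zlib';
import { randomBytes } from 'node:crypto';

// Ratio of deflated size to original size at a given zlib level (0 to 9).
function deflateRatio(data: Buffer, level: number): number {
  return deflateSync(data, { level }).length / data.length;
}

// Stand-in for XML parts: highly repetitive text compresses very well.
const xmlLike = Buffer.from('<a:t>Quarterly revenue</a:t>'.repeat(4000));

// Stand-in for JPEG data (assumption): random bytes behave roughly like
// already-compressed content and barely shrink at any level.
const jpegLike = randomBytes(xmlLike.length);

for (const level of [1, 9]) {
  console.log(
    `level ${level}:`,
    `xml-like ${deflateRatio(xmlLike, level).toFixed(3)},`,
    `jpeg-like ${deflateRatio(jpegLike, level).toFixed(3)}`,
  );
}
```

A ratio near 1 for the JPEG-like buffer at both levels matches the observation above: raising the compression level helps text-based parts, but it cannot make up for unneeded images left in the archive.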

My latest commit will remove xml and xlsx files that became unnecessary due to removeExistingSlides, but that's cosmetic. So far, it will not remove unneeded images from the media folder. We are getting closer, but there is still some work to do.

The next step will be to check all slides/_rels:

  • is existing content actually used on any slide?
  • is required content really available in the archive?
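Both checks boil down to comparing two sets of paths. A minimal sketch, assuming the relationship targets have already been resolved to full archive paths; RelationCheck and checkRelations are hypothetical names, not taken from pptx-automizer:

```typescript
// Hypothetical result shape for the two checks.
interface RelationCheck {
  unused: string[]; // present in the archive, but no relationship points at it
  missing: string[]; // a relationship points at it, but it is not in the archive
}

// Compare archive entries against resolved relationship targets.
function checkRelations(
  archivePaths: string[],
  relationTargets: string[],
): RelationCheck {
  const inArchive = new Set(archivePaths);
  const referenced = new Set(relationTargets);
  return {
    unused: archivePaths.filter((p) => !referenced.has(p)),
    // Array.from on the set drops duplicate targets before filtering.
    missing: Array.from(referenced).filter((t) => !inArchive.has(t)),
  };
}
```

Entries in unused are candidates for cleanup, while entries in missing are the kind of dangling relation that can lead to a repair message when the file is opened.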

You can now check out my latest commit on the feature-content-tracker branch. This should do the job.

Please note: ModifyPresentationHelper.removedUnusedImages will look for all images placed on at least one slide and delete all the others. This will also work for images on slideMasters, but I did not test this extensively.

It will not work for chart point markers filled by an image at the moment.

@adamstamper could you play with it again, please? Thanks in advance for your participation! 😄

You can now use the cleanup: true param to remove unused images and xml files.

const automizer = new Automizer({
  templateDir: `./`,
  outputDir: `./`,
  removeExistingSlides: true,
  cleanup: true, // activate archive cleanup
  compression: 9  // choose 9 for highest compression level, 0 to skip deflation
});

Fantastic! This is working really well @singerla - thank you for your work on this!