pandoc/lua-filters

Change this repository into a collection of links?

jgm opened this issue · 60 comments

jgm commented

One drawback of the current structure is that people submit code here but then don't monitor the repository, and issues are neglected. Perhaps it would be better to make this simply a collection of links to lua filters that are maintained in independent repositories?

cagix commented

I kind of like this idea. Maybe this repo could serve as a kind of collection of "official" scripts from the Pandoc creators and all other filters could be linked in the README (sorted by topic)? That would reduce the maintenance to checking the links every year. In addition, the "official" code provided could serve as a live demo / live documentation of the Lua API.

I'm very much in favor of that; it would save me a lot of time. It takes a significant amount of effort, on almost each new pandoc release, to adjust the tests and filters to the changes. It's tiresome, and pinging all authors and waiting for them to change the code would take just as long. I'd be glad to get out of that obligation.

We could still do occasional automatic "releases", which pack the filters into a single archive. This shouldn't be too hard if the individual repos use a common structure.

cagix commented

We could still do occasional automatic "releases", which pack the filters into a single archive.

That sounds interesting, but might not be quite easy with regard to the then presumably different licences in different repos?

This shouldn't be too hard if the individual repos use a common structure.

Would maintaining a template repo help with this?

Pinging everyone who contributed a filter so far: what do you think of this idea? What would you need to make this as painless as possible for you?

@jdutant @tolot27 @blake-riley @not-my-profile @svenevs @b3 @jkr @cole-miller @sokotim @korintje @gtuckerkellogg @stroobandt @frederik-elwert @odkr

cagix commented

Since each filter belongs to a subfolder, it should be easy to split your repository into several individual repositories using git filter-branch and retain the individual history :)
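
For reference, here is a minimal sketch of such a split, run on a throwaway repository rather than on this one. The folder names (collection, filter-a, filter-b) and the demo identity are made up for illustration. FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer git versions print for filter-branch.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work in a clean environment (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# filter-branch prints a deprecation notice and pauses; silence it here.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"

# Toy stand-in for the collection: two filters, each in its own subfolder.
git init -q collection
cd collection
mkdir filter-a filter-b
echo 'return {}' > filter-a/filter-a.lua
echo 'return {}' > filter-b/filter-b.lua
git add .
git commit -q -m 'Add filter-a and filter-b'
echo '-- tweak' >> filter-a/filter-a.lua
git commit -q -am 'Tweak filter-a'
cd ..

# Work on a fresh clone so the original repository stays untouched.
git clone -q collection filter-a-only
cd filter-a-only

# Keep only the history of filter-a/ and make that folder the new root.
git filter-branch --subdirectory-filter filter-a HEAD >/dev/null 2>&1

split_files=$(git ls-files)
split_commits=$(git rev-list --count HEAD)
echo "files: $split_files"
echo "commits: $split_commits"
```

The clone ends up containing only the files of the one filter, with the commits that touched it. The git-filter-branch man page itself recommends git-filter-repo for new work, so treat this as an illustration of the idea rather than a recommended tool.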

I am not sure about this. There are advantages to having a common repository. Having them all here guarantees that other people can contribute improvements even when the original author has ceased maintenance.

It takes a significant amount of effort, on almost each new pandoc release, to adjust the tests and filters to the changes. It's tiresome, and pinging all authors and waiting for them to change the code would take just as long. I'd be glad to get out of that obligation.

I think an easy fix for that would be to have a latest_pandoc_supported variable for each filter. When a new pandoc version is released that variable could be automatically bumped for each filter which tests pass with the new version. And the script could automatically update a table in the README of this repository that lists all scripts known to work with the latest version. Even the pinging of authors when their filter is no longer compatible with the latest version of pandoc could be easily automated.
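
A rough sketch of what that automation could look like, under stated assumptions: one subfolder per filter, a test.sh script in each subfolder that exits non-zero on failure, and a plain-text marker file named latest_pandoc_supported. None of these conventions exist in this repository; they are hypothetical and only serve to show the idea.

```shell
#!/bin/sh
# Hypothetical layout: <root>/<filter>/test.sh and <root>/<filter>/latest_pandoc_supported.

# Run one filter's test script; succeeds only if the tests pass.
run_filter_tests() {
    ( cd "$1" && sh ./test.sh ) >/dev/null 2>&1
}

# update_compat ROOT VERSION
# Bumps the marker of every filter whose tests pass with VERSION and
# prints a Markdown table that could be pasted into the README.
update_compat() {
    root=$1
    version=$2
    printf '| Filter | Latest pandoc known to work |\n'
    printf '|---|---|\n'
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        marker="${dir}latest_pandoc_supported"
        if run_filter_tests "$dir"; then
            # Tests pass: record the new version for this filter.
            printf '%s\n' "$version" > "$marker"
        fi
        # Failing filters keep their previous marker, if they have one.
        supported=$(cat "$marker" 2>/dev/null || echo unknown)
        printf '| %s | %s |\n' "$name" "$supported"
    done
}
```

A CI job could call something like `update_compat . 3.1` after installing the new pandoc release (the version number is just an example). Filters whose marker did not move are the ones whose authors would need to be pinged.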

Especially if you still occasionally want to release filter bundles having all filters in a single repository should make that easier. Otherwise you just have new potential problems to deal with (e.g. some repository went offline, some repository suddenly has an unexpected file structure, etc.).

bpj commented

I agree with this idea. The roles of the filters are highly independent, so we cannot expect large synergy effects from collecting them in one repository. Instead, I think it is better to focus on keeping each filter's documentation accessible, readable, and consistent. Listing the filters on the official Pandoc web page or on GitHub Pages would be nice. As far as I know, one ideal example is crates.io, the package registry for the Rust language, which is known for well-formatted and easily accessible documentation.

jgm commented

There are advantages to having a common repository. Having them all here guarantees that other people can contribute improvements even when the original author has ceased maintenance.

In that case you could always fork the original repository, make your changes, and submit a PR here for an updated link to the fork.

I am in favor, but what about having this repository contain
submodules/subtrees/subrepos linking to contributors' repositories so that
people can still pull this repository and get all filters?

I'm not sure it's all that valuable to be able to get all the filters in one repository. Generally you only need some of them; why not just clone those separately?

I like the idea of submodules and switching to them should be easy because we have subdirectories, already. submodules can be checked out from their origin and individually.
Having this repository as the main repository has the advantage that checks (i. e. in case of a new pandoc version) can be maintained at a central place.

I don't have any preference either way, if it makes things easier for maintainers then I'm all for it ❤️ I'm pretty sure my filter is feature complete, but I'm sorry if I've missed any issues related to it.

I'm really not convinced this would be an advantageous move. Having some complex filters that see a lot of development in their own repos is perhaps a good thing (and we have a history of suggesting that). But small one-off filters tend to be submitted, used by their authors a few times, and then the submitter moves on; I think a large chunk of them would fall below the minimum threshold that would make them viable FOSS projects on their own. Having a team of maintainers at least reviewing submissions here adds some amount of normalizing and consistency that makes filters in this repo much more attractive than random ones out of people's Gists. And for maintenance, not having the bottleneck of one maintainer who got it working for themselves and is then never motivated to tweak it to be more generally useful seems like a benefit to most simpler filters.

jgm commented

@alerque I think the filters could still be reviewed -- at any rate, we wouldn't need to include links to filters that didn't look good. The aim would be to change where bug reports and enhancement or support requests go. They should go to the author of the filter, not to the pandoc maintainers.

As a first-time, one-off contributor, I have to admit that the current process together with the suggestions and help provided by @tarleb rendered my contribution more worthwhile and universally applicable.
That would not have happened without the "editorial work" of @tarleb.
The current process could be considered as a very valuable peer-review, where the value eventually goes to the end user.

Another admission of mine is that my extended family and I usually employ my filter only with the version of pandoc that comes packed with the latest Ubuntu LTS release and upgrades. The reason for this being the fact that too many of my users on too many machines require a stable system environment for work/study.

This certainly does not mean that I would not maintain my filter. However, if the user community at large fails to prod me, I would typically notice a version compatibility problem with my filter only when a new version of pandoc eventually lands in the Ubuntu LTS repositories.

I hope this straightforwardness helps with reaching a consensus about how to proceed with this great, curated collection of filters.

b3 commented

I do not have any smart, definitive answer to this good question. I added a thumbs-up to the comments giving ideas that I like.

It is a fact that I didn't follow issues here for my small filters (thanks to @jgm I will now try to check them).

It is also a fact that, as @stroobandt states, @tarleb's work rendered my small contributions more worthwhile and universally applicable.

IMHO a common framework (at least for tests and descriptions, for instance) nevertheless needs to be kept.

Being able to fetch all the code at once is also a nice facility (it helps me get inspired), but it can still be offered if this repo is changed to a simple list of links.

Sorry for not being able to help more concretely.

The aim would be to change where bug reports and enhancement or support requests go. They should go to the author of the filter, not to the pandoc maintainers.

To some extent, we can get the best of both worlds. If the code stays here and we add contributors to a GitHub team with limited access to this repo, we can use .gitattributes to specify GitHub users as code owners for the filters they contribute. That way they would not only get asked to be involved in code review if somebody touched their code, but they could be assigned to related issues and such.

My experience is that people are even more likely to stay involved and take some ownership over their code if it has the publicity of being in an official repository rather than being in their own ad-hoc repos. Anybody that is going to keep on top of issue reports on their own repo is also likely to stay involved with it if they have some ownership in a bigger project.

I really like this repository and it is a great source of inspiration when writing filters. It is very useful to have all these filters in one place. Of course, I agree that plugin creators should maintain their plugins (if they have the time to). Some plugins could be placed in an unmaintained folder or repo if they can't. A table in the README indicating each filter's name, description, and formats processed would also be useful.

jgm commented

we can use .gitattributes to specify GitHub users as code owners for the filters they contribute.

Can you elaborate? What would the syntax be? That would certainly be an improvement, as now there's no way to figure out who contributed which filter other than looking at git history.

Here is an example CODEOWNERS file that uses .gitattributes syntax to assign code in a repository to different people. The @... names can be individual accounts or teams (or mix and match) that can have multiple members. This will automatically request that they review any PRs touching those code paths, and it opens the door to other GitHub tooling, like rules that tie PR approval to code-owner review.

What it doesn't do is triage bug reports and assign them to those owners. That would still need to be done manually, only PRs are automatically assigned.
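
The linked example is not preserved in this thread, so here is a minimal sketch of what such a file could look like. All handles (@alice, @bob, @example-org/...) and folder names are hypothetical, and the temporary directory merely stands in for a checkout of the repository. Note that GitHub's documentation describes the pattern rules as following most of the gitignore conventions.

```shell
#!/bin/sh
set -eu
# Stand-in for the root of a checkout; in a real repository, run from its top level.
repo_root=$(mktemp -d)
mkdir -p "$repo_root/.github"

# Handles and folder names below are made up for illustration.
cat > "$repo_root/.github/CODEOWNERS" <<'EOF'
# Fallback owners for anything not matched further down.
# When several patterns match a file, the last one wins.
*              @example-org/maintainers

# One top-level folder per filter, each with its own owner(s).
/filter-a/     @alice
/filter-b/     @bob @example-org/filter-b-team
EOF

cat "$repo_root/.github/CODEOWNERS"
```

With a file like this, a PR touching anything under filter-a/ would request a review from @alice, while changes elsewhere fall back to the maintainers team.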

jgm commented

I've added .github/CODEOWNERS, but I don't know the github handles of the contributors. Maybe people can update this themselves with PRs?

ickc commented

As a side note, I think this is really about having a package index and a package manager.

People like this and pandocfilters (the Python one) because they act like both. It is a centralized location where, once cloned, someone else is maintaining things for you, which should guarantee that they work with the latest-ish pandoc.

The problem with this repo is that it isn't going to scale well (to many filters) and the work of maintenance is transferred to the maintainer.

Years ago some of us proposed to have a package manager, and there was a prototype. But there were a few problems. First, we mixed the two related concepts in one solution, and second, it wasn't official.

In short, I think the right direction would be to have an official package index (like CTAN). This is similar to the "link" concept above, but more formal. Maybe a YAML file with a certain spec. The official pandoc community would advertise this as the pandoc package index that people should submit to as authors and discover through as users.

Then we can let third parties build an ecosystem around it, e.g. a package manager (similar to a third-party filter framework), or a website (like the third-party Mac App Store-like website for Homebrew).

Allow me to think out loud for a moment; this gets a bit fundamental and includes some of the good points others already made above.

What I like about this repo:

  • It has become a beautiful collection of useful filters. Users can browse it, use the filters, and hopefully learn from the code.
  • Filters share a similar structure and allow customization in a similar manner.
  • I get to work with contributors, which is a chance for me to bring in the experience from writing many filters and large parts of the Lua subsystem.

What I dislike:

  • Filter authors don't really get the recognition they deserve. I'm not at all a fan of "GitHub is my resumé", but that's often the way it is. A filter in a personal repository is easier to show off as achievement than a contribution in a repo like this one. It seems fair to encourage people to highlight their work.
  • We are currently excluding people who prefer other platforms like GitLab.
  • My code standards are opinionated and often high, quite possibly too high. This effectively prevents useful filters from becoming available as they are stuck in code review for far too long. (But I appreciate the kind words noting the positive side of this.)
  • The previous point is made worse by my time constraints. It felt in the past like I was often the bottleneck for changes and new filters. Not a good situation, neither for contributors, nor for me.
  • Tests are often difficult to write, and most tests depend on the specific output created by pandoc. That makes them, and the repo, high maintenance.

In conclusion, I'd still rather turn this repository into a collection of links. My proposal:

  1. Create a template repository for Lua filters. This way we can still encourage a certain standard layout, but filter authors have the freedom to do whatever they feel is right.
  2. Add an issue template for adding new links: this should include a checkbox to select if the author wishes for a detailed code review of their filter. We could go as far as to encourage community review by sending an automated mail to pandoc-discuss whenever such an issue is opened.
  3. Slowly move filters to separate repos, but explore ways to create collections of all filters listed here and adhering to certain conventions.

Edit: Forgot to make my main point: it seems unreasonable to expect people to maintain code that they no longer control; the sense of ownership is much stronger if authors can retain full control over their code.

jgm commented

I think this is a good plan!

I started work on a template repository. It's not quite done yet, but feedback is welcome, especially if it takes the form of a PR ;)

The template contains code to create a documentation page, but I'm not happy with having the HTML and CSS in the main branch. If anyone has some ideas on how this could be avoided, then please let me know.

bpj commented

I'd think that'd be perfectly fine, esp. if the transpiled Lua script can be downloaded somewhere.

Off topic: my long-term goal is to write filters with teal.

cagix commented

The template contains code to create a documentation page, but I'm not happy with having the HTML and CSS in the main branch. If anyone has some ideas on how this could be avoided, then please let me know.

Hmmm, if the goal is to provide nicely rendered documentation, you could write the documentation in Markdown and use a simple workflow that uses Jekyll or Hugo, deploying the result as Github pages.

Alternatively, you could use a workflow where Jupyter notebooks are generated from the Markdown with Pandoc and made available as Github pages (.ipynb files will be rendered as preview directly by Github).

bpj commented

Sorry I haven't had time to jump in and help with this yet. A template repo is a great idea. Including some CI to test pandoc interactions would be good to include there too.

Before this gets too far though I just wanted to throw in the idea that if filters are going to be independent, it might actually be useful to package them as Lua Rocks. The luarocks infrastructure can actually be used for this (plugins for some Lua app as opposed to stand alone packages) and has a concept of manifests to organize them. This would bring in free tooling for versioning, distribution/packaging, dependency management (including both on other Pandoc filters or other LuaRocks), etc.

odkr commented

Off topic: my long-term goal is to write filters with teal.

Okay, this is off-topic; but Teal seems intriguing. Why a "long-term" goal? Is there a reason not to use it just yet?

I've opened an issue for teal support on the hslua repo, let's move our OT discussion there. ;)

Leveraging luarocks has crossed my mind too; it seems orthogonal to using a template repo. In fact, if you want to add a sample rock definition there, the PR would be welcome.

The issue tracker of the new repo is probably a good place for additional suggestions.

Leveraging luarocks has crossed my mind too; it seems orthogonal to using a template repo.

Yes it is.

By the way, I've talked about this with LuaRocks folks in the past (at various times, in reference to SILE packages and vim plugins) and they are universally supportive of the idea and willing to make any upstream adaptations that are necessary. To date, though, it doesn't seem that any really are: the use pattern is already supported.

Thanks @tarleb for pointing me to this conversation on my second contribution to this repository.
As a new contributor, I can say that :

  1. I got inspiration from the collection of filters I found here. Would have been less easy with separate repositories.
  2. The review on the first filter I submitted (column-div which is currently in draft mode due to a little bug I still have to chase) helped me to get it to a code quality and a functional level I wasn't aiming for (even if I wasn't happy with the review at first ;-) )
  3. I am not very comfortable with the CI test I tried to reproduce from what I found in this repository. I missed an official template. I see this template exists now. Good thing even if you finally decide to keep everything in one common repository.
  4. Regarding this common repository thing: I am a contributor to another, similar project. It's lua-scripts, a repository of Lua plugins for darktable, the photo processing tool. I contributed to the localization (French) of multiple scripts. I even worked on some scripts I don't really use myself, only because they belong to the same repository.
  5. In this repository, we have two main categories : official (scripts which are officially supported and reviewed by the team) and contrib (scripts which are supported by external contributors)

That said, I am totally OK to take my filters back to my own repository if you decide so. I would add that I am totally with @alerque on the need to package the filters so they would not be bunches of code floating around in github that users must catch one by one.

@tarleb I would like to contribute a plugin. After reading this discussion I am confused about the way forward. What should I do?

That's great to hear @nandac. The best way is probably to create a new personal repo based on the template. See there for more instructions. Please let us know in case you run into issues with the template -- it is still experimental.

Once you have set up the filter repo, please open an issue and ask for it to be included. We still need a place to add links, so it may take us a little longer.

Thanks, @tarleb I have set up a repo in my personal space using the template.

bpj commented

@bpj Template repositories are meant to be used as a base to clone from (and GH has a function for doing this), and you want things named what the final name is going to be, not something that will need to be shuffled around.

Converting existing repos is a bit trickier. Merging as you describe is technically possible with some next level Git ninja commands to join histories with no common root, but it also brings with it a pile of issues that most people would struggle to deal with later (e.g. git blame needing special handling).

I suggest just using a tree diff on existing repositories and manually massage them to be as alike or different as you feel like without doing any merge foo. A how-to on this could be useful to add to the template, but I would focus usage on getting new projects going.


On a different topic, I'll be looking into some subtree splits to help people with filters here already get them split out with full history for use in stand alone repositories. Once the dust settles a little bit on what we are recommending for stand alone repos we can look at migrating current ones to that model.

Joining this discussion quite late: I think there is one huge argument in favor of a central repository for the most common Lua filters for Pandoc, and that is security. Since Pandoc is typically running with full user privileges, a Lua script can do really nasty stuff (steal information, load malicious content, ...). While this central repository does not make it impossible to inject malicious code into the most popular filters, it at least provides

  • some kind of community review and
  • a bit of centralization for a better flow of information (vulnerabilities can be reported in here, people are likely to take notice, more eyes increase the likelihood of spotting a malicious filter, ...).

A mere collection of links will cause more fragmentation and hence make it less likely and slower to spot and mitigate security risks.

Also, there are lots of commonalities among filters; with a central repository, it will be easier to modularize and reuse code.

I have a really bad feeling watching a growing community of non-developers running arbitrary code from some private GitHub repositories found by googling for some Pandoc/LaTeX problem.

Search for "supply chain attacks" to get a glimpse. This is even more of an issue given that Lua is a bit of a niche language, which makes it harder for many people to understand what a piece of Lua code is doing on their machine.

jgm commented

From the pandoc manual:

A note on security

If you use pandoc to convert user-contributed content in a web
application, here are some things to keep in mind:

  1. Although pandoc itself will not create or modify any files other
    than those you explicitly ask it create (with the exception
    of temporary files used in producing PDFs), a filter or custom
    writer could in principle do anything on your file system. Please
    audit filters and custom writers very carefully before using them.

You are right, of course, that people can get into big trouble by running filters they download. And having a central repository would help with that. The problem is that it takes a lot of human-power to review the code, integrate pull requests, etc. We just don't have enough of that.

ickc commented

What is described is related to the web of trust: basically, because you trust pandoc, you also trust other things maintained by the same or related developers.

Another related concept here is a package manager. It does not solve the trust issue by itself, but basically you are now trusting the maintainer of a package index rather than the developer. (Of course trusting both, i.e. selecting only packages from authors you trust, is better.) Also, just to mention: a package index can typically be dangerous when there is no "maintainer" whose approval you need, as in the example of PyPI.

Put this way, the problem above is that a "monolithic package index" like this one puts too much burden on the maintainers, who are doing both jobs. A "proper package index" splits the burden between individual maintainers managing their own packages and package index maintainer(s) who maintain the quality.

ickc commented

Just to elaborate a bit more, there’s also more incentive for the developer to maintain their script as they typically are the biggest user of that script. The problem then is to have a package index that people will have incentive to use, including official blessing and simplicity (and adequate level of trust.)

But to name the cons of having a centralized package index like this, it makes releasing a breaking release AST slightly easier. (But then the blame should be put to end users that upgrade without considering “pinning” their version. Again, a problem arises when not thinking this in terms of packaging.)

By the way, I'm not complaining, as there's no perfect solution. Look at LaTeX, for example: while they have a package index, packaging is a mess because versions cannot easily be controlled, so packages can break mysteriously. Authors are then conditioned to release backward-compatible changes only, which leads to a worse experience (bad behavior should be discontinued).

I'm largely agnostic; I contributed a filter which my PhD students have used, but I'll own up to neglecting it recently if issues have come up. I'll maintain my agnosticism, but I'll be happy to rectify my neglect of the issue no matter how it's decided.

@alerque

On a different topic, I'll be looking into some subtree splits to help people with filters here already get them split out with full history for use in stand alone repositories. Once the dust settles a little bit on what we are recommending for stand alone repos we can look at migrating current ones to that model.

I am not a git expert, but I can share what I figured out to move two Lua filters, which I had proposed as my contribution to lua-filters, to their own repository.

  1. Create the repository on GitHub from @tarleb's lua-filter-template. NB: the default branch is named main (it comes from the template).
  2. Make sure everything is clean in the forked lua-filters repository I am working on, and make a fresh clone (named after my new GitHub repo):

         git clone lua-filters hk-pandoc-filters

  3. Install git-filter-repo, since the git-filter-branch man page advises switching to it.
  4. Remove the remote link, remove everything but my two filters, and add the newly created GitHub repository as a remote:

         cd hk-pandoc-filters
         git remote remove origin
         git filter-repo --path column-div/ --path tables-vrules/
         git remote add -f origin git@github.com:chrisaga/hk-pandoc-filters.git

  5. At this point I have my two Lua filters in the master branch and the new files from @tarleb's template in the main branch. The two branches are not related (no common ancestor), but I can still merge them with the appropriate option:

         git checkout main
         git merge --allow-unrelated-histories master

  6. Do some cleaning with git mv and git rm, then push everything to GitHub:

         git commit -a
         git push

  7. Check that everything is OK on GitHub and remove the now useless branch:

         git branch --delete master
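
The merge of two branches without a common ancestor is the least familiar part of this recipe, so here is a small self-contained sketch of just that step on a throwaway repository. The file contents and the demo identity are invented for illustration; only the branch names main and master and the column-div folder mirror the description above.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work in a clean environment (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
# Name the first branch "main" regardless of the local default.
git symbolic-ref HEAD refs/heads/main

# Branch "main": stands in for the files that came from the template.
echo '# my filters' > README.md
git add README.md
git commit -q -m 'Template files'

# Branch "master": an unrelated root commit, like the filtered filter history.
git checkout -q --orphan master
git rm -q -rf .
mkdir column-div
echo 'return {}' > column-div/column-div.lua
git add column-div
git commit -q -m 'Add column-div filter'

# Join the two histories on main; a plain `git merge master` would refuse here.
git checkout -q main
git merge -q --allow-unrelated-histories -m 'Merge filter history' master
git branch -q --delete master

merged_files=$(git ls-files)
merged_commits=$(git rev-list --count HEAD)
echo "$merged_files"
```

After the merge, main contains both the template files and the filter folder, and the history of each side is still reachable from the merge commit.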

I have created a new organization pandoc-ext and have started to migrate filters there. Each filter will be placed in a separate repository, as this makes it easier to use the filters with RStudio's quarto. I will only transfer those filters that I intend to maintain.

The main impediment right now is my template repository, which still needs more work.

cagix commented

Step 4 could also be done more easily through git subtree: In hk-pandoc-filters you can perform a git subtree push --prefix=<yourfolder> <cloneofluafiltertemplate> <branch>, where <branch> should be different from main or master.
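
A minimal sketch of that subtree variant, again on throwaway repositories. The bare repository target.git stands in for the clone of the template, and the branch name column-div-import is made up. This assumes the git subtree command is installed, which is not the case for every git package.

```shell
#!/bin/sh
set -eu
# Throwaway identity so commits work in a clean environment (assumption for the demo).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Bare repository standing in for the clone of the template repository.
git init -q --bare target.git

# Toy source repository with the filter in a subfolder next to other files.
git init -q source
cd source
mkdir column-div
echo 'return {}' > column-div/column-div.lua
echo 'unrelated' > other.txt
git add .
git commit -q -m 'Add column-div and another file'
echo '-- fix' >> column-div/column-div.lua
git commit -q -am 'Fix column-div'

# Push only the history of column-div/ to a new branch in the target.
git subtree push --prefix=column-div ../target.git column-div-import >/dev/null 2>&1

pushed_files=$(git --git-dir=../target.git ls-tree -r --name-only column-div-import)
pushed_commits=$(git --git-dir=../target.git rev-list --count column-div-import)
echo "files: $pushed_files"
echo "commits: $pushed_commits"
```

The pushed branch has the subfolder's contents at its root and no ancestor in common with the template's branches, so bringing it into main would still need a merge with --allow-unrelated-histories.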

ickc commented

@tarleb, if you want to, I can invite you as maintainer of https://github.com/pandoc-extras which is intended for any "pandoc extras" kind of stuffs.

tarleb commented

Thank you @ickc, and sorry for the late reply, I had forgotten about this. I went with the pandoc-ext name to mirror the quarto-ext org. Now we have two such orgs, but I think that's ok.

I've sent you an invite to become a maintainer at the new org.

cagix commented

so, now we have both, https://github.com/pandoc-extras and https://github.com/pandoc-ext? does that mean some kind of split in the "pandoc extras"? and also there is https://github.com/pandoc/lua-filters ...

tarleb commented

I don't see it as a split, it's two separate orgs with slightly different goals.

As for this repo, it should probably be archived at some point.

bpj commented
ickc commented

Thank you @ickc, and sorry for the late reply, I had forgotten about this. I went with the pandoc-ext name to mirror the quarto-ext org. Now we have two such orgs, but I think that's ok.

I've sent you an invite to become a maintainer at the new org.

The invitation just expired. Could you send one again? Thanks.

and issues are neglected.

Is that worse or better than neglecting things in their own repos? At least here the issue is seen by many more users and perhaps fixed.

Just because something has problems does not mean that the problem is resolved by changing the physical location of the code. The problem just moved. 99% of the code here will be written by PhD students and as soon as they are out of academia they will forget all about their pandoc hacks. I believe that moving things out will make it less probable for code to be 'inherited' by new students, and rather they will reinvent the same solutions, ad infinitum.

As a legacy user of pandoc (I have written this kind of filter myself, when I was a PhD student) I find the pandoc vs pandoc-ext split confusing, as I find the reasons for this change confusing. With 30 open issues, the rationale appears to me more theoretical than practical. Consider a repository like Ansible, which has a myriad of custom domain plugins. If they went down this route, I could sympathize. They have >30K issues, and often have thousands of open issues and hundreds of open PRs. This repo has 30 open issues. I am a skeptic but I can be convinced by reason: what problem is being solved and what problem is invented?

Before I invest time in an answer, I'd like to learn more about your motivation for this question. Do you plan to contribute, maybe by writing explanatory paragraphs to place in some of the readme files? Or is this just curiosity and/or frustration?

I think the users of pandoc deserve the best possible ecosystem. My time is as important as yours, and I have already invested some because I see that this is important. I stated that I already wanted to contribute the d2 filter. I have used pandoc for some time: at least 10 years, but likely 15. The little code that I made available is largely unused by the community. Those are wasted efforts, and I also have no interest in maintaining small Lua hacks in my private GitHub. Put together, these "hacks" become valuable.

You can expand on the rationale and we can strawman on this as a case study:

https://github.com/JensErat/pandoc-scrlttr2

It seems to be a very nicely set up repository. Last changed 8 years ago. (I think we can keep up with the maintenance work?) Not everything would need to go into this repo either. Let the users decide what they want to put here. Perhaps there is no need to have pandoc-scrlttr2 (weird name, only makes sense in academia) in this repo because it is well maintained and discoverable in other places. I want to make it easy for people to hack up some code and share it with others, to collaborate and build together, as not everyone wants to set up a complete OSS project and put it on their GitHub.

jgm commented

That repository is a single project by a single person who maintains it. That's exactly what we want to move to for Lua filters.

We want to get away from the model where you contribute your script to a giant repository, then expect others to field bug reports for it, deal with questions about it, and so on.

We want to get away from the model where you contribute your script to a giant repository, then expect others to field bug reports for it, deal with questions about it, and so on.

I see nothing wrong with that expectation. This is exactly what happens with all successful open source. People learn about the code and knowledge is shared, and software is improved.

jgm commented

You're saying that your expectation is that others will maintain your code, answer questions about it, and so on? Where others = @tarleb and me? No, thanks. I'm already overburdened with my open source maintaining. I'd rather have you maintain your code in your own repository. Happy to link to it.

bpj commented

I took the liberty to add a page on the repository wiki where authors can add links to their filters themselves with zero hassle for the owners of this repository.

Thoughts:

  • Perhaps that page can be (prominently) linked from and/or periodically copied to the repository README.
  • Maybe it should better go into the pandoc-ext wiki as/if that is activated. On the other hand this repository will probably come up if people search for "Pandoc (Lua) filters".
  • I will try to find time/remember to do some housekeeping on that page, but the more people help with that and with adding entries the better of course!

odkr commented

There's already https://github.com/jgm/pandoc/wiki/Pandoc-Filters. But perhaps it'd be a good idea to place a prominent link to that Wiki in the README.