Adding chatbot capabilities for Python Docs website

I think adding an LLM chatbot for the Python docs would be really helpful for navigating the docs faster.

2 Likes

Who would be responsible for this, pay for it, or maintain the infrastructure?

You’re welcome to use a language model and have it query the documentation (or the static HTML files: Download — Python 3.13.3 documentation), but I doubt we will do this ourselves in the near future.
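As a minimal sketch of that do-it-yourself route: download the offline HTML docs, strip the markup, and run a crude keyword ranking over the pages to pick context to hand to a model. (The scoring and filenames below are illustrative assumptions, not any official tooling.)

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(html: str) -> str:
    """Crudely remove markup, leaving only the visible text."""
    return TAG_RE.sub(" ", html)

def score(text: str, query: str) -> int:
    """Count how often each query term occurs in the page text."""
    words = Counter(re.findall(r"\w+", text.lower()))
    return sum(words[term] for term in query.lower().split())

def search(pages: dict[str, str], query: str, k: int = 3) -> list[str]:
    """Rank pages (filename -> raw HTML) by naive term frequency."""
    ranked = sorted(
        pages,
        key=lambda name: score(strip_html(pages[name]), query),
        reverse=True,
    )
    return ranked[:k]
```

The `pages` dict could be filled from the downloaded archive, e.g. by iterating `Path("python-3.13-docs-html").rglob("*.html")`; the top-ranked pages then become the "context" pasted into a chatbot prompt.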


9 Likes

A number of chatbots have a provision for transfer learning by adding “local docs”.

It would be really nice if there was an official corpus download for this purpose. It could consist of the main docs, the source code, all PEPs, the GitHub history including discussions, and the discussion histories on discuss.python.org and python-dev (but not python-ideas).

That would be a huge win. A bot could then answer questions like “what problem was this feature designed to solve”, “why wasn’t this approach approved”, “what do the docs mean when they say …”, and “generate more detailed documentation (including corner cases) from the implementation”.

It would be fantastic if a user could ask, “why does python need lambda when it already has def” and get an answer based on Guido van Rossum’s published commentary on why lambda was kept in Python 3.

It would also be cool to leverage code translation abilities, “Translate the C code for collections.defaultdict into a pure python equivalent that passes the tests and matches the documentation.”

This does sound like it could be useful, but I don’t see why it needs to be “official”. All of that information is publicly available. Anyone can take on the task of collecting and curating it.

If there’s a reason in the future for the PSF or the core team to take ownership of it, then we can discuss doing that. In the meantime, any energetic volunteer can do this and show the benefits.

1 Like

Perhaps “official” was the wrong term. Substitute “maintained”, “curated”, “standard”, “easily available”, “automated”, etc.

While these sources are publicly available, they aren’t that easy to get to. At one point, I wrote a bot to download the history of the python-dev list. It was a pain and took a long time to run. I presume accessing the full Github history of issues and discussions would be even worse.

It sounds like you would enjoy using such a service, but you want somebody else (the core dev team? the PSF?) to do the work.

I think the pushback you are getting is because we don’t have a budget for such a project, and it’s not a priority (just nice to have). Also it looks like a slog, honestly.

I think if someone would actually do the work and put it up so it’s easy for a new user to get started, it might get traction. Then again, just asking questions of the new models might be all the help you need…

I would be willing to do some of the work but would need some support from the infrastructure team.

A Mailman admin would be able to get the entire python-dev mailing list history with a simple file copy or select-star query. Without admin rights, I have to run a website scraper, and it would indeed be a slog.
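For what it’s worth, the public Pipermail index does expose monthly text dumps at predictable URLs, so a scraper can at least enumerate download links instead of crawling every thread page. A sketch, assuming the usual `YYYY-MonthName.txt.gz` layout of the public archive (an assumption worth verifying against the index page):

```python
import calendar

# Assumed Pipermail archive root for the python-dev list.
ARCHIVE_ROOT = "https://mail.python.org/pipermail/python-dev"

def archive_urls(start_year: int, end_year: int) -> list[str]:
    """Build the monthly gzipped archive URLs for a range of years."""
    return [
        f"{ARCHIVE_ROOT}/{year}-{calendar.month_name[month]}.txt.gz"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]
```

Each URL could then be fetched politely (with delays) and gunzipped into plain mbox-style text.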

Likewise, a GitHub project admin can easily pull the entire project history including issue and PR discussions. Brett Cannon did something like this when moving the Mercurial project history into GitHub. However, the best a normal user can do is run a loop with the GitHub REST API and be subject to rate limits, which might render the exercise impossible.

For someone “on the inside”, the data collection task might be very easy. Likely, it could also be automatable so that it stays up to date.

For someone “on the outside”, it would be like court records being public but only for those who go to a courthouse with rolls of quarters for the copying machine.

One other thought: The infrastructure team might already be doing this work. Presumably, they are making backups of as much of the project as they can.

Anyway, this wasn’t my idea. I was just agreeing with the OP that this might be a transformational capability for Python users.

This is an entirely different request from what was said in the OP, and honestly, I don’t think this is a good idea.

In fact, this is a terrible idea:

  • Purely from a legal perspective, the official docs have a clear license transfer happening - this is far less certain for mailing lists and GitHub issues. I know it’s fashionable in the AI business to ignore these kinds of concerns, but the PSF shouldn’t contribute to it. (Just because something is publicly available doesn’t mean you can do with it whatever you want. In fact, even your plan to scrape it might literally be illegal - I haven’t read the ToS.)
  • Emails especially have a good chance of containing personal information that people might not want to be easily searchable. There is a mountain of difference between having it archived somewhere and having it fed into an AI system.
  • These discussions are riddled with new ideas, half-implemented things, and historic, no-longer-relevant discussions. Training an AI on this would probably be worthless unless your goal is to dig through the history instead of learning about Python as it is right now. (Note: AIs are not good enough to cleanly distinguish this kind of thing because of their finite context window - and it’s unlikely they will be good enough anytime soon.)

The actual docs are already part of a single GitHub repo and can easily be fed into an AI bot.

(Also, just in case: I explicitly do not want my contributions to this site to be part of any AI chatbot that doesn’t 100% clearly, provably follow CC-BY-NC, as it is legally required to. Good luck finding an AI chatbot that guarantees this…)

9 Likes

We apparently have an AI bot plugged into discourse (where does that come from?). I just asked it to summarise this thread. It did a reasonable job in this short thread:

The discussion is about adding chatbot capabilities to the Python Docs website. user1 suggests that a chatbot would be helpful for navigating the docs faster. However, Adam Turner questions who would be responsible for maintaining the infrastructure and paying for it.

Zeke W proposes creating an official corpus download for training the chatbot, including the main docs, source code, PEPs, GitHub history, and discussion histories. This would enable the chatbot to answer questions like “what problem was this feature designed to solve” and “why wasn’t this approach approved”.

Ned Batchelder suggests that collecting and curating the information doesn’t need to be “official” and can be done by any volunteer. Zeke W clarifies that he means “maintained” or “curated” and notes that accessing the full GitHub history and mailing list archives can be difficult.

Guido van Rossum thinks that the project is not a priority and would require significant work, but if someone were to do it, it might gain traction. Zeke W offers to do some of the work but would need support from the infrastructure team.

However, Cornelius Krupp strongly disagrees with the idea, citing legal concerns, potential privacy issues, and the difficulty of training an AI on historical discussions. He suggests that the official docs are easily accessible and can be fed into an AI bot, but warns against using mailing list archives and GitHub issues without proper licensing and consideration for personal information.

It appears to be a built-in Discourse feature, though site administrators can disable it. Unclear if it is opt-in or opt-out, though.


The last several posts seem to have gone off-topic about the Discourse AI bot instead of the original topic of adding a chatbot to the Python docs.

I think the solutions have been proposed. People can set up their own bot/service using the export of the Python docs, which is open source, and see if it gets traction. At that point, we can reconsider and re-evaluate the situation.

Asking for more data exports, e.g. from Mailman and GitHub, seems out of scope for the original idea.

3 Likes