Switching to distributed egg repositories (historical revision 22931)

You are looking at historical revision 22931 of this page. It may differ significantly from its current revision.

Here is a proposal on how to switch from our single-site centralised subversion-based egg repository to a distributed one.

But first, let's discuss the pros and cons:

Pros

The reasons we would want to make this move would include the following:

Proper usage of version control

Several people are currently not really using the subversion repository. They're simply importing their stuff into subversion from their own private git, mercurial or fossil repositories whenever they want to make a release.

This means you lose the real commit history and all goodness that comes from that, as well as the ability to easily follow along with developments on "trunk".

By not supporting whatever VCS people prefer, the default has become not to share your repository with others. A user must take extra steps to publish the "main" repo. Making the main repo directly usable would encourage people to push changes much earlier and would make contributing easier.

Distribution of the permission bottleneck

Currently, contributors need to request a user account on our svn server, and every single time they create a new egg they need to request creation of an egg dir. This is needless overhead.

Also, whenever someone new starts contributing to an existing egg, this means the contributor also needs to be added. The administrators of the svn repo should check that the person doing the requesting is indeed the egg's owner. Right now that's doable but as we get larger this might become more of a hassle.

With DVCSes and code hosting sites, people can manage their own way of collaborating and hand out commit rights any way they want.

Lower barrier of entry

The manual account management also results in a pretty high barrier of entry. Often, people don't bother to contact the community to create an account for them but prefer to host the code on their own and point to it in some unstructured way (e.g. their website or blog). In the cases of the epoll and the slime eggs, members of the community had to actively approach the authors to make them commit to the central egg repo. A decentralized system would allow arbitrary people to publish eggs which can be consumed conveniently by all users.

Reducing checkout and repository size

We can change the repository layout for the svn repo to fix the "huge checkout" issue, but this will be a pretty big operation. If we do that, we might as well take the opportunity to look at the possibilities of changing it completely.

The repository itself is not that huge right now; only 752MB. Even if we start getting really popular, we can probably cope with the growth.

Cons

The reasons we would want to stick with the current scheme:

Broken links & dependencies

As Felix has pointed out, if anyone can create an egg anywhere, this will eventually result in eggs being taken down:

Code hosting sites can go away, either permanently or through temporary outages.
People can lose interest or move on, forgetting that they were sharing an egg repository from their personal server, which then gets deleted on the next update or reinstall.

This is bad when it's a popular egg, but absolutely horrible when it's a low-level dependency used by many eggs.

Documentation

People need an svn account anyway when they want to make authenticated commits to the wiki. This is a bit less of a problem because you only need to request an account once, not for every egg or wiki page you edit...

The alternative, distributing the documentation along with the egg, would make things harder to find; you can't easily run a full text search against distributed docs, and chicken-doc would have a harder time fetching its data too.

Additional complexity

The distributed nature of the egg repository would require more tooling support, and some things become impossible.

For example, checking out all eggs would require someone to write a script that calls the "clone" command for each egg, using the relevant VCS. Anyone who wants to do this would have to install all of subversion, mercurial, git, fossil and whichever DVCS is popular this week that someone decided to use for their one-off throwaway project. This is not fun.
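
To make that cost concrete, here is a minimal sketch of such a script in Python. The egg list, its URLs and the `clone_command` helper are hypothetical; the only point is that every supported VCS needs its own table entry and its own installed client.

```python
# Hypothetical egg list: (egg name, VCS, repository URL). The URLs are placeholders.
EGGS = [
    ("spiffy", "hg", "https://example.org/hg/spiffy"),
    ("epoll", "git", "https://example.org/git/epoll.git"),
    ("slime", "svn", "https://example.org/svn/slime/trunk"),
]

# How each VCS spells "fetch a full copy". Each function returns an argv list.
CLONE_COMMANDS = {
    "git": lambda url, name: ["git", "clone", url, name],
    "hg": lambda url, name: ["hg", "clone", url, name],
    "svn": lambda url, name: ["svn", "checkout", url, name],
    # fossil clones into a single repository file rather than a working directory
    "fossil": lambda url, name: ["fossil", "clone", url, name + ".fossil"],
}


def clone_command(name, vcs, url):
    """Return the command line (as an argv list) that would fetch one egg."""
    if vcs not in CLONE_COMMANDS:
        raise ValueError(f"no clone command known for VCS {vcs!r} (egg {name!r})")
    return CLONE_COMMANDS[vcs](url, name)


def all_clone_commands(eggs):
    """Build the clone command for every egg, failing on the first unknown VCS."""
    return [clone_command(name, vcs, url) for name, vcs, url in eggs]
```

Passing each result to `subprocess.run` would perform the actual checkout. An egg that uses a VCS missing from the table makes the whole run fail, which is exactly the maintenance burden described above.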

It would be harder to have mirrors too, because a mirror would need to clone all these different types of repositories. We already have working mirrors right now. We don't want to lose that.

Henrietta would need to be modified or rewritten, as would the script that generates the egg index on the wiki.

This probably has consequences for Salmonella too, as well as vandusen (IRC notification of commits), and it will make writing a summary for the gazette of what's happened recently harder.

Design goals

Backward compatibility

Any system we use should only require modification of Henrietta, the server-side CGI program that is contacted by chicken-install. This means this system could be deployed transparently and gradually. Older Chicken releases would keep working and be able to receive new eggs and updates for existing eggs.

Existing eggs should initially stay in subversion, and people who aren't interested in this hubbub should be allowed to stick to their current workflow with svn. When someone prefers to move their egg to some other repo hoster, an upgrade path should be provided.

TODO: This is currently missing. A way to do it could be to convert the svn repo to the new VCS. As long as tags are converted along with the history and specific older release versions are still available this should be fine.

Ease of use, no extra hassle whatsoever

The new system should not put any additional demands on the user. Developing eggs and making new releases should be as easy as it already is today; you can do everything from the VCS using only simple operations like "copy", "tag" and "branch".

Independent of any VCS

If we do this, it is mostly because people prefer to choose their own tools. If we make a choice for one particular system we will end up with a flamewar because different people like different VCSes. There is also the possibility that we would have to go through all this again when the next shiny new thing comes along in the VCS world.

The basics of the system described below are such that it is actually completely oblivious of any version control systems; it is based on HTTP, plaintext files and tarballs only. One could even decide not to use a VCS at all, and simply upload tarballs to a server.

Description of the system

Release-info file

An egg has a file with an s-expression in it. This could be the existing .meta with extra options or a new "release-info file". This file is the way to locate the egg's releases.

The release-info file (also called the repo-file below) could be kept under version control, but it doesn't have to be. Most code hosting sites provide a way to obtain a "raw" plaintext dump of a file under version control. As long as there's a way to get the one that always points to "trunk", this will work. Example:

https://bitbucket.org/sjamaan/spiffy/raw/tip/release-info.meta

Github has a similar way to do this, as do gitweb and hg-web. Subversion over http also provides this "for free". We could add a wiki howto page on determining this URL for popular code hosting sites. Where that doesn't work, people could simply upload the file to their favorite place. If people can't find anywhere to host it, the old svn repo would be fine too. A per-user directory for their egg repos would suffice. The user would only need to request it once.

Releases

The release-info file contains a list of versions that are released. Each version can be downloaded individually, either as a tarball or one file at a time.

If the latter option is chosen, the meta-file (the original one already used, not the repo-file) should list the files so that henrietta only needs to request the files in a loop, like it currently does with "svn cat".
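
As a rough illustration of that loop, the Python sketch below (not Henrietta's actual code) resolves each listed file against the location of the meta-file. It assumes that the files live next to the meta-file, which this proposal does not state explicitly; the `file_urls` helper and the example URL are made up for illustration.

```python
from urllib.parse import urljoin


def file_urls(meta_url, files):
    """Resolve each file name from the meta-file relative to the meta-file's URL.

    Names that would point outside the meta-file's directory (absolute URLs,
    absolute paths, "../" tricks) are rejected instead of being fetched.
    """
    base_dir = urljoin(meta_url, ".")  # directory that contains the meta-file
    urls = []
    for name in files:
        url = urljoin(base_dir, name)
        if not url.startswith(base_dir):
            raise ValueError(f"file {name!r} escapes {base_dir}")
        urls.append(url)
    return urls


# Placeholder location of a tagged release and a made-up file list.
META_URL = "http://code.example.org/eggs/spiffy/tags/0.4/spiffy.meta"
FILES = ["spiffy.scm", "spiffy.setup"]
```

The server would then request every URL returned by `file_urls(META_URL, FILES)` in turn, much like the current "svn cat" loop.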

Note: this implies that the meta-file can't be used to list the revisions. File lists may differ between versions, so the release-info file must point to a meta-file for each version. This doesn't necessarily mean the meta-file can't contain the release info for tarball-based eggs, but allowing that makes things more complicated.

I'm not yet sure if the tarball is a good idea since it introduces extra complexity, but it does make it easier for people to say "all files" and reduces mistakes (like when you rename, add or delete a file and forget to update the meta file).

The location of the release files is computed from a simple alist, which makes it easy to provide a template. The template has a few predefined variables that are substituted. Since code hosting sites like github and bitbucket allow you to download a full tarball based on a revision and/or a named tag, it is easy to make a release; just tag and push, just like before. There's exactly one extra operation that you need to do, and that's adding the release to the "release-info file".

TODO: I don't like the s-expression syntax yet. It's way too complex.

;; Tarball:
(releases
  ("0.4" "https://bitbucket.org/sjamaan/spiffy/get/spiffy-0.4.tar.gz")
  ("0.3" "https://bitbucket.org/sjamaan/spiffy/get/spiffy-0.3.tar.gz"))

;; Same, with pattern replacement
(releases
  (with-base-uri "https://bitbucket.org/sjamaan/{egg-name}/get/{egg-name}-{version}.tar.gz"
    "0.4"
    ;; The full alist form.  A bare string is just a shorthand for that, but the
    ;; alist allows you to override other template replacements in the URI.
    ((version "0.3")))
  ;; Previous maintainer's repo
  (with-base-uri "https://github.com/bunny351/{egg-name}/tarball/{version}.tar.gz"
    ((version "0.2")) ((version "0.1"))))

;; Files:
(releases
  (with-base-uri "http://anonymous:@code.call-cc.org/svn/chicken-eggs/release/4/spiffy/tags/{version}/spiffy.meta"
    ((meta-file . #t) (version . "0.4")) ((meta-file . #t) (version . "0.3"))))

The URI template syntax is a simplified subset of http://tools.ietf.org/html/draft-gregorio-uritemplate-04
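
The substitution itself is simple. The following Python sketch shows one possible reading of it, assuming that the only template construct is `{name}` and that values are percent-encoded before insertion; the `expand` function is hypothetical and not part of any existing tool.

```python
import re
from urllib.parse import quote

# A template variable is anything between a pair of curly braces.
_VARIABLE = re.compile(r"\{([^{}]+)\}")


def expand(template, variables):
    """Replace every {name} in the template with its percent-encoded value."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"template variable {name!r} has no value")
        # safe="" also encodes "/", so a value cannot add path segments
        return quote(variables[name], safe="")

    return _VARIABLE.sub(substitute, template)


TEMPLATE = "https://bitbucket.org/sjamaan/{egg-name}/get/{egg-name}-{version}.tar.gz"
```

With this reading, `expand(TEMPLATE, {"egg-name": "spiffy", "version": "0.4"})` yields the same URI as the first entry of the plain tarball example. The alist form of a release would simply contribute extra entries to the `variables` mapping.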

For eggs under SVN whose tagged meta-files do not currently have a (files) entry, we could modify the tagged code to add one.

Central list of eggs

This system depends on one master centralised list of egg names to resolve dependencies. We could specify egg dependencies as fully qualified repository-file URIs, but that would cause more trouble than it's worth. For example, what happens if you install two repositories called "spiffy", possibly installing the same files and modules with the same names?

Also, since the wiki will still be the canonical source of documentation, and the wiki's namespace is pretty flat, this is the easiest way to do it.

This list could be maintained in svn. Optionally, a simple interface could be made to ease egg creation and to make sure that the list is append-only and that the names on it are unique.

Henrietta would consult this list, which is simply an egg-name-to-URI mapping that lets you discover the repo-file given an egg name.
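
A minimal sketch of such a list and its lookup, in Python for illustration only. The one-entry-per-line text format, the `parse_egg_list` name and the epoll URI are assumptions; the proposal only requires a mapping from unique egg names to URIs.

```python
def parse_egg_list(text):
    """Parse lines of the form "<egg-name> <release-info URI>" into a dict.

    Blank lines and lines starting with "#" are ignored. A repeated egg name
    is an error, since names on the central list must be unique.
    """
    eggs = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected '<name> <uri>'")
        name, uri = parts
        if name in eggs:
            raise ValueError(f"line {number}: egg name {name!r} is already taken")
        eggs[name] = uri
    return eggs


# Example list; the epoll URI is a placeholder.
EGG_LIST = """\
# egg-name  release-info URI
spiffy  https://bitbucket.org/sjamaan/spiffy/raw/tip/release-info.meta
epoll   https://example.org/epoll/release-info.meta
"""
```

Resolving a dependency then amounts to `parse_egg_list(EGG_LIST)["spiffy"]`. The uniqueness rule is checked while parsing; the append-only rule would have to be enforced wherever the list is edited, for instance by comparing a new version of the list against the previous one.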