Switching to distributed egg repositories

Here is a proposal on how to switch from our single-site centralised subversion-based egg repository to a distributed one.

Update: This system has now been implemented. See releasing-your-egg for information on how to release an egg with this system, and this e-mail to chicken-users on how to use eggs that are distributed this way.

But first, let's discuss the pros and cons:

Pros

The reasons we would want to make this move would include the following:

Proper usage of version control

Several people are currently not really using the subversion repository. They're simply importing their stuff into subversion from their own private git, mercurial or fossil repositories whenever they want to make a release.

This means you lose the real commit history and all the goodness that comes with it, as well as the ability to easily follow along with developments on "trunk".

By not supporting whatever VCS people prefer, the default has become not to share your repository with others. A user must take extra steps to publish the "main" repo. By making the main repo directly usable, this would encourage people to push changes much earlier and make contributing easier.

Of course, the main reason people aren't happy with Subversion anymore is that the newer distributed version control systems are more sophisticated: they allow off-line commits, easier cloning of full history and improved merging, among other great features. We should make it easier for people to get the most out of these features.

Distribution of the permission bottleneck

Currently, contributors need to request a user account on our svn server, and every single time they create a new egg they need to request creation of an egg dir. This is needless overhead.

Also, whenever someone new starts contributing to an existing egg, this means the contributor also needs to be added. The administrators of the svn repo should check that the person doing the requesting is indeed the egg's owner. Right now that's doable but as we get larger this might become more of a hassle.

With DVCSes and code hosting sites, people can manage their own way of collaborating and hand out commit rights any way they want.

Lower barrier of entry

The manual account management also results in a pretty high barrier of entry. Often, people don't bother to contact the community to create an account for them but prefer to host the code on their own and point to it in some unstructured way (e.g. their website or blog). In the cases of the epoll and the slime eggs, members of the community had to actively approach the authors to get them to commit to the central egg repo. A decentralized system would allow arbitrary people to publish eggs which can be consumed conveniently by all users.

Reducing checkout and repository size

We can change the repository layout for the svn repo to fix the "huge checkout" issue, but this will be a pretty big operation. If we do that, we might as well take the opportunity to look at the possibilities of changing it completely.

The repository itself is not that huge right now; only 752MB. Even if we start getting really popular, we can probably cope with the growth.

Cons

The reasons we would want to stick with the current scheme:

Broken links & dependencies

As Felix has pointed out, if anyone can create an egg anywhere, this will eventually result in eggs being taken down:

Code hosting sites can shut down permanently or suffer temporary outages.
People can lose interest, move on, and forget that they were sharing an egg repository from their personal server, so it gets deleted on the next update or reinstall.

This is bad when it's a popular egg, but absolutely horrible when it's a low-level dependency used by many eggs.

Documentation

People need an svn account anyway when they want to make authenticated commits to the wiki. This is a bit less of a problem because you only need to request an account once, not for every egg or wiki page you edit...

The alternative, distributing the documentation along with the egg, would make it harder to find things; you can't easily run a full text search against distributed docs, and chicken-doc would have a harder time fetching its data too.

Additional complexity

The distributed nature of the egg repository would require more tooling support, and some things become impossible.

For example, checking out all eggs would require someone to write a script that calls the "clone" command for each egg, using the relevant VCS. Anyone who wanted to do this would have to install subversion, mercurial, git, fossil, and whichever DVCS someone picked this week for their one-off throwaway project. This is not fun.
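To make the cost concrete, here is a rough Python sketch of what such a script has to know. The egg list and the `clone_command` helper are hypothetical, invented for illustration. The sketch only builds and prints the commands, so none of the tools need to be installed to try it.

```python
def clone_command(vcs, url, egg_name):
    """Return the argument list that fetches `url` with the given VCS."""
    if vcs == "git":
        return ["git", "clone", url, egg_name]
    if vcs == "hg":
        return ["hg", "clone", url, egg_name]
    if vcs == "svn":
        return ["svn", "checkout", url, egg_name]
    if vcs == "fossil":
        # Fossil clones into a single repository file rather than a directory.
        return ["fossil", "clone", url, egg_name + ".fossil"]
    raise ValueError(f"unsupported VCS: {vcs}")


# Hypothetical list of (egg name, VCS, repository URL) entries.
EGGS = [
    ("spiffy", "hg", "https://bitbucket.org/sjamaan/spiffy"),
    ("spiffy-old", "git", "https://github.com/bunny351/spiffy"),
]

for name, vcs, url in EGGS:
    print(" ".join(clone_command(vcs, url, name)))
```

Every VCS that egg authors adopt needs its own branch in `clone_command`, and its own client on the machine running the script.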

It would be harder to have mirrors too, because that would need to clone all these different types of repositories. We already have working mirrors right now. We don't want to lose that.

Henrietta needs to be modified/rewritten as well as the script that generates the egg index on the wiki.

This probably has consequences for Salmonella too, as well as vandusen (IRC notification of commits), and it will make writing a summary for the gazette of what's happened recently harder.

Design goals

Backward compatibility

Any system we use should only require modification of Henrietta, the server-side CGI program that is contacted by chicken-install. This means this system could be deployed transparently and gradually. Older CHICKEN releases would keep working and be able to receive new eggs and updates for existing eggs.

Existing eggs should initially stay in subversion and people who aren't interested in this hubbub should be allowed to stick to their current workflow with svn. When someone prefers to move their egg to some other repo hoster, an upgrade path should be provided.

TODO: This is currently missing. A way to do it could be to convert the svn repo to the new VCS. As long as tags are converted along with the history and specific older release versions are still available this should be fine.

Ease of use, no extra hassle whatsoever

The new system should not put any additional demands on the user. Developing eggs and making new releases should be as easy as it already is today; you can do everything from the VCS using only simple operations like "copy", "tag" and "branch".

Independent of any VCS

If we do this, it is mostly because people prefer to choose their own tools. If we make a choice for one particular system we will end up with a flamewar because different people like different VCSes. There is also the possibility that we would have to go through all this again when the next shiny new thing comes along in the VCS world.

The basics of the system described below are such that it is completely oblivious to any version control system; it is based on HTTP, plaintext files and tarballs only. One could even decide not to use a VCS at all, and simply upload tarballs to a server.

Description of the system

Release-info file

An egg has a "release-info file" with an s-expression in it. This file provides enough information to locate the egg's releases and optionally its repository.

This release-info file could be kept under version control, but it doesn't necessarily have to be. Most code hosting sites provide a way to obtain a "raw" plaintext dump of a file under version control, so the user doesn't have to do anything special; this file can just be added to the egg's tree.

As long as there's a way to get the one that always points to "trunk", this will work. Example:

https://bitbucket.org/sjamaan/spiffy/raw/tip/spiffy.release-info

GitHub has a similar way to do this, as do gitweb and hg-web. Subversion over HTTP also provides this "for free". We could add a wiki howto page on determining this URL for popular code hosting sites. Where it doesn't work, people could simply upload the file to their favorite place. If people can't find anywhere to host it, the old svn repo would be fine too. A per-user directory for their egg repos would suffice. The user would only need to request it once instead of for every new egg.

Releases

The release-info file contains a list of versions that are released. Each version can be downloaded individually, either as a tarball or one file at a time.

If the latter option is chosen, the meta-file (the original one already used, not the release-info file) should list the files so that henrietta only needs to request the files in a loop, like it currently does with "svn cat".

Note: This implies that the meta-file can't be used to list the revisions; file lists may differ between versions, so we must point to a meta-file from the release-info file. This doesn't necessarily mean the meta-file can't contain the release info for tarball-based eggs, but allowing that would make things more complicated.

I'm not yet sure if the tarball is a good idea since it introduces extra complexity. However, it does make it easier for people to say "all files" and reduces mistakes (like when you rename, add or delete a file and forget to update the meta file).

The location of the release files is preprocessed with a simple pattern substitution (based on an alist). Since code hosting sites like github and bitbucket allow you to download a full tarball based on a revision and/or a named tag, it is easy to make a release; just tag and push, exactly like before with the old subversion repo. There's exactly one extra operation that you need to do and that's adding the release to the "release-info file".

Let's start with a simple example. This is the full release-info file:

(uri targz "https://bitbucket.org/sjamaan/{egg-name}/get/{egg-name}-{egg-release}.tar.gz")
(release "0.4")
(release "0.3")

The egg-name and egg-release are patterns that are replaced automatically with the current egg's name, and the release version.
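The substitution step itself is small. The following Python sketch shows one way it could work, using a template of the kind used in these examples. The `expand_uri` function and its dictionary of bindings (standing in for the alist) are assumptions made for illustration, not the actual henrietta code.

```python
import re


def expand_uri(template, bindings):
    """Replace every {name} pattern in `template` with bindings[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no value for pattern {{{name}}}")
        return bindings[name]

    return re.sub(r"\{([a-z-]+)\}", substitute, template)


template = ("https://bitbucket.org/sjamaan/{egg-name}"
            "/get/{egg-name}-{egg-release}.tar.bz2")
print(expand_uri(template, {"egg-name": "spiffy", "egg-release": "0.4"}))
```

Raising an error on an unknown pattern, rather than leaving it in place, means a typo in a release-info file shows up early instead of producing a broken download URL.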

A more complex example, using bzip2 compression and linking to multiple repositories/release URI structures:

;; default uri
(uri tarbz2 "https://bitbucket.org/sjamaan/{egg-name}/get/{egg-name}-{egg-release}.tar.bz2")
(release "0.4")

;; A source repo (type hg) for the default uri:
(repo hg "https://bitbucket.org/sjamaan/{egg-name}")

;; Previous maintainer's repo, specified as a named URI.
;; If no uri is specified in the release entry, the nameless default uri is used
(uri targz "https://github.com/bunny351/{egg-name}/tarball/{egg-release}.tar.gz" bunny)
(release "0.2" (uri bunny))

;; Later, additional properties could be added at the end, alist-style.
;; here's a hypothetical example that says this version is for CHICKEN 3 only
;; we most likely want to keep separate egg lists and hence repositories for
;; different major CHICKEN versions though
(release "0.1" (uri bunny) (chicken-version "3"))

;; This repo is linked to the "bunny" URI
(repo git "https://github.com/bunny351/{egg-name}" (uri bunny))
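To show how a release could be resolved against named and default URIs, here is a minimal Python sketch. It reads a shortened copy of the file above and maps each release to a download type and URL. The reader is deliberately naive: it treats strings and symbols alike and ignores repo entries. It also assumes a single nameless default URI per file. The function names are invented for this example.

```python
import re

TOKEN = re.compile(r'''
    ;[^\n]*              # comment up to the end of the line
  | [()]                 # opening or closing parenthesis
  | "(?:[^"\\]|\\.)*"    # double-quoted string
  | [^\s()";]+           # symbol
''', re.VERBOSE)


def read_forms(text):
    """Parse s-expressions into nested lists; strings and symbols become str."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok.startswith(";"):
            continue
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        elif tok.startswith('"'):
            stack[-1].append(tok[1:-1])
        else:
            stack[-1].append(tok)
    return stack[0]


def release_urls(text, egg_name):
    """Map each release version to a (type, expanded URL) pair."""
    uris = {}      # URI name (None for the nameless default) -> (type, template)
    releases = []  # (version, URI name) in file order
    for form in read_forms(text):
        if form[0] == "uri":
            name = form[3] if len(form) > 3 else None
            uris[name] = (form[1], form[2])
        elif form[0] == "release":
            props = {p[0]: p[1:] for p in form[2:] if isinstance(p, list)}
            name = props["uri"][0] if "uri" in props else None
            releases.append((form[1], name))
    result = {}
    for version, name in releases:
        kind, template = uris[name]
        url = (template.replace("{egg-name}", egg_name)
                       .replace("{egg-release}", version))
        result[version] = (kind, url)
    return result


SAMPLE = '''
;; default uri
(uri tarbz2 "https://bitbucket.org/sjamaan/{egg-name}/get/{egg-name}-{egg-release}.tar.bz2")
(release "0.4")
;; Previous maintainer's repo, specified as a named URI.
(uri targz "https://github.com/bunny351/{egg-name}/tarball/{egg-release}.tar.gz" bunny)
(release "0.2" (uri bunny))
(release "0.1" (uri bunny) (chicken-version "3"))
'''

for version, (kind, url) in release_urls(SAMPLE, "spiffy").items():
    print(version, kind, url)
```

Collecting all URIs before resolving the releases makes the sketch independent of the order of entries in the file; whether order should matter is a design decision this proposal leaves open.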

Using plain files that must be separately downloaded instead of a tarball:

(uri meta-file "http://anonymous:@code.call-cc.org/svn/chicken-eggs/release/4/spiffy/tags/{egg-release}/spiffy.meta")
(release "0.4")
(release "0.3")

;; Assuming trunk/tags/branches?  This should probably be the base dir
(repo svn "http://anonymous:@code.call-cc.org/svn/chicken-eggs/release/4/spiffy")
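For the file-at-a-time variant, the per-file URLs can be derived from the meta-file's location. The Python sketch below assumes that the files listed in the meta-file are paths relative to the directory containing it; that is an assumption, not something this proposal fixes. The file names are hypothetical too.

```python
from urllib.parse import urljoin


def file_urls(meta_url, files):
    """Resolve each listed file relative to the meta-file's directory."""
    return [urljoin(meta_url, name) for name in files]


meta_url = ("http://anonymous:@code.call-cc.org/svn/chicken-eggs"
            "/release/4/spiffy/tags/0.4/spiffy.meta")

# Hypothetical contents of the (files ...) entry in spiffy.meta.
for url in file_urls(meta_url, ["spiffy.scm", "spiffy.setup"]):
    print(url)
```

Henrietta would then request each of these URLs in turn, much like the current "svn cat" loop.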

TODO: What if something changed and there are now two base URIs where releases can be found which correspond to a repo which is still in the same location?

The URI template syntax is a simplified subset of http://tools.ietf.org/html/draft-gregorio-uritemplate-04

For those eggs under SVN that do not currently have a (files) entry in svn under a tag directory, we could just change their code.

Central list of eggs

This system depends on one master centralised list of egg names to resolve dependencies. We could specify egg dependencies as fully qualified release-info file URIs, but that would cause more trouble than it's worth. For example, what happens if you install two repositories called "spiffy", possibly installing the same files and modules with the same names?

Also, since the wiki will still be the canonical source of documentation, and the wiki's namespace is pretty flat this is the easiest way to do it.

This list could be maintained in svn. Optionally, a simple interface could be made to ease egg creation and to make sure that the list is append-only and that the names on it are unique.

Henrietta would consult this list, which is simply an egg-name-to-URI mapping that allows you to discover the release-info file given an egg name.
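The two properties asked of the list, append-only and unique names, are easy to check mechanically. Below is a small Python sketch of such a check, together with the lookup henrietta would perform. The list representation (ordered name/URI pairs) and both function names are assumptions for illustration, and the second egg and its URL are made up.

```python
def validate_update(old, new):
    """Return a list of problems found when `new` replaces `old`.

    Both arguments are lists of (egg name, release-info URI) pairs.
    """
    problems = []
    old_names = [name for name, _uri in old]
    new_names = [name for name, _uri in new]
    # Append-only: the old names must survive as an unchanged prefix.
    if new_names[:len(old_names)] != old_names:
        problems.append("existing entries were removed, renamed or reordered")
    seen = set()
    for name in new_names:
        if name in seen:
            problems.append(f"duplicate egg name: {name}")
        seen.add(name)
    return problems


def release_info_uri(egg_list, egg_name):
    """Look up the release-info file URI for an egg, as henrietta would."""
    return dict(egg_list)[egg_name]


current = [
    ("spiffy", "https://bitbucket.org/sjamaan/spiffy/raw/tip/spiffy.release-info"),
]
proposed = current + [
    ("example-egg", "https://example.org/example-egg.release-info"),
]

print(validate_update(current, proposed))
print(release_info_uri(proposed, "spiffy"))
```

The check compares names only, so an egg's URI may still change when its author moves to a different host.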