Moving to Github Pages, Python 3, general code and dataset review #448

Open · ali1234 opened this issue Sep 2, 2022 · 3 comments

ali1234 commented Sep 2, 2022

Abstract:

  • It would be very easy but somewhat messy to move directly to Github Pages.
  • The biggest problem is the way the site uses subdomains.
  • The dataset itself also has some problems.

On moving to Pages:

  • Putting the English HTML site on Pages was very easy. It took me less than an hour, having never looked at this codebase before. Steps:
    1. Configure Pages in the repository settings.
    2. Create a CNAME record in your DNS pointing to the Github Pages server (this is the default URL shown in the Pages settings).
    3. Add the URL as a custom domain. This might produce an error on "DNS check", but it works anyway.
    4. Add the "Static HTML" workflow, and tweak it to install the Python requirements, run make, and upload the output subdirectory instead of the repository root.
  • The API is harder to deal with. It is currently served from a subdomain of the HTML site, but Pages only allows one URL per repository. This could be handled with a bare repository that has its own Pages config and contains nothing but a workflow file to pull from the main repo. Alternatively, you could stop using subdomains entirely, since the whole thing is static anyway, and just put the API under a subdirectory of the HTML site.
  • It isn't clear to me how translations are accessed, but if they are on subdomains then the same applies.
  • If you need a lot of subdomains, it would probably make sense to create an org for all the repos.

Code review and Python 3 changes:

  • In some places map() is passed directly to the JSON serializer. This no longer works in Python 3, because map() now returns a lazy iterator rather than a list. The easiest fix is to wrap the calls in list(); see the sketch after this list.
  • There is a large amount of duplicated code throughout the repository. Almost every file has slugify() and cssify(), even the ones that don't use them, and load_md() appears a few times but is never used at all.
  • There are scripts to generate HTML, JSON, indexes, etc. Each one has to parse all of the board files, and that parsing accounts for nearly all of the total runtime, so merging these scripts would reduce code duplication with no real drawback.
  • pillow is missing from requirements.txt. It only seems to be needed for generating the API.
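
For illustration, a minimal before-and-after of the map() problem (the sample data is invented, not taken from the repository):

```python
import json

pins = [1, 2, "bcm3"]

# Under Python 2, json.dumps(map(str, pins)) worked because map() returned
# a list. Under Python 3 the same call raises:
#   TypeError: Object of type map is not JSON serializable
# because map() now returns a lazy iterator.

# Minimal fix: materialize the iterator.
print(json.dumps(list(map(str, pins))))        # ["1", "2", "bcm3"]

# More idiomatic Python 3: a list comprehension.
print(json.dumps([str(pin) for pin in pins]))
```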

On the dataset itself:

  • In some places the JSON serializer fails to sort keys because the pin numbers are a mixture of strings and integers. This should be normalized in the data files. Since some pins have bcm in their number, they should all be converted to strings (see the normalization sketch at the end of this comment).
  • The translation system is problematic because there are no clear guidelines on which fields should be translated, so in some translations the category fields have been translated while in others they have not. In some places the literal values True and False are used (not strings), but occasionally these have been translated into native-language strings, e.g. 'nein'.
  • Additionally, each translation duplicates the entire board metadata, so the copies can drift out of sync. For example, if someone notices that a pin number is incorrect, it has to be changed in every translation, including the ones where the file hasn't been translated yet. Things get forgotten, and it creates extra maintenance work.
  • I think it would be better to split the data into a single dataset for all non-translated metadata, plus some kind of system for applying translations on top. This requires first identifying which fields should and should not be translated, and then a way to merge the two datasets (a sketch of one possible merge appears at the end of this comment). The implementation details will depend on, and may influence, whether subdomains are kept.
  • Once the dataset has been normalized, a pull-request workflow can make sure it stays that way. It could additionally produce a list of untranslated strings, which would remove the need for the duplicated untranslated board descriptions.
  • I wrote a linter to check for the issues above, and it found over 500 instances across the entire dataset of 1300 files.
  • I also wrote a tool to extract a rough schema from the data (a sketch of the approach follows the listing below); this is what it produced:
['buy'] str
['class'] str
['collected'] str
['description'] str
['docs'] str
['eeprom'] str, bool
['flash'][string_int]['active'] str
['flash'][string_int]['mode'] str
['formfactor'] str
['github'] NoneType, str
['ground'][string_int] NoneType
['i2c']['0x60-0x6F']['device'] str
['i2c']['dynamic']['device'] str
['i2c']['dynamic']['name'] str
['i2c'][int]['device'] str
['i2c'][string_int]['alternate'][<list>] str
['i2c'][string_int]['device'] str
['i2c'][string_int]['name'] str
['image'] str
['install']['apt'][<list>] str
['install']['devices'][<list>] str
['install']['python'][<list>] str
['install']['python3'][<list>] str
['manufacturer'] str
['name'] str
['page_url'] str
['pin']['mode'] str
['pin'][bcm_pin]['direction'] str
['pin'][bcm_pin]['mode'] str
['pin'][bcm_pin]['name'] str
['pin'][int]['active'] str
['pin'][int]['mode'] str
['pin'][int]['name'] str
['pin'][string_int] NoneType
['pin'][string_int]['active'] str
['pin'][string_int]['description'] str
['pin'][string_int]['direction'] str
['pin'][string_int]['external_pull'] str
['pin'][string_int]['mode'] str
['pin'][string_int]['name'] NoneType, str
['pin'][string_int]['pull'] str
['pincount'] int
['power'] NoneType
['power'][string_int] str, NoneType
['schematic'] NoneType, str
['title'] NoneType, str
['type'] str
['url'] str

Some errors can be seen in the above. For example, the bare ['pin']['mode'] path means that somewhere there is a pin which has a mode but no pin number. Note that not every path appears in every file, which makes parsing a bit harder. The list of paths that are always present is surprisingly short:

['type']
['name']
['title']
['class']
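
The schema tool itself can be quite small. Here is a hypothetical sketch of the approach; the key-classification rules are inferred from the listing above, not copied from the real tool:

```python
from collections import defaultdict

def classify(key):
    # Collapse concrete keys into the placeholder classes used in the listing.
    if isinstance(key, int):
        return "[int]"
    if isinstance(key, str) and key.startswith("bcm"):
        return "[bcm_pin]"  # assumed spelling of BCM-style pin keys
    if isinstance(key, str) and key.isdigit():
        return "[string_int]"
    return f"['{key}']"

def walk(node, path, schema):
    # Record the Python type of every leaf value under its generalized path.
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, path + classify(key), schema)
    elif isinstance(node, list):
        for item in node:
            walk(item, path + "[<list>]", schema)
    else:
        schema[path].add(type(node).__name__)

def extract_schema(boards):
    # boards: an iterable of already-parsed board dicts.
    schema = defaultdict(set)
    for board in boards:
        walk(board, "", schema)
    return schema

# Printing each sorted path with its joined type names reproduces a listing
# in the same shape as the one above.
```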

So that's the state of things now. Fixing all these problems is a design challenge, but normalizing the data for a new design should be very easy to automate (see the sketches below). Another consideration is the micro:bit and Pico sites: if we are going to completely redesign one of them, we should probably make it general enough to drive all of them, plus anything new that comes out in the future.
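
To give a feel for how mechanical that automation could be, here is a minimal, hypothetical sketch of the pin-key normalization described earlier (it assumes each board file has already been parsed into a Python dict; nothing here is taken from the actual codebase):

```python
def normalize_keys(node):
    # Recursively rewrite every integer mapping key as a string, so that
    # json.dumps(..., sort_keys=True) no longer fails on mixed key types.
    if isinstance(node, dict):
        return {str(key): normalize_keys(value) for key, value in node.items()}
    if isinstance(node, list):
        return [normalize_keys(item) for item in node]
    return node

board = {"pin": {1: {"name": "3v3"}, "bcm2": {"name": "SDA"}}}
assert normalize_keys(board) == {"pin": {"1": {"name": "3v3"}, "bcm2": {"name": "SDA"}}}
```

A pull-request workflow could run the same function over every file and fail if the output differs from the input, which is the "make sure it stays that way" check described above.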
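
And here is an equally hypothetical sketch of the translation overlay idea: a base dataset holds all non-translated metadata, a per-language file holds only the translatable fields, and the two are merged at build time. The field names are invented for illustration:

```python
def apply_translation(base, overlay):
    # Recursively overlay translated fields onto the untranslated base.
    # Keys missing from the overlay fall back to the base (i.e. remain
    # untranslated), so a translation file only contains strings that differ.
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = apply_translation(base[key], value)
        else:
            merged[key] = value
    return merged

base = {"name": "Button SHIM", "description": "Five buttons", "pincount": 40}
de = {"description": "Fünf Tasten"}
merged = apply_translation(base, de)
assert merged["description"] == "Fünf Tasten"
assert merged["pincount"] == 40
```

This keeps pin numbers and other structural metadata in exactly one place, so a correction never has to be copied into every translation.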

ali1234 (Author) commented Sep 2, 2022

Python 3 fixes and a Pages workflow are on the default branch here: https://github.com/ali1234/Pinout.xyz

That repository auto-syncs to https://pinout.zerostem.io on every push (currently English HTML only).

Gadgetoid (Collaborator) commented

I need to get around to looking at this. I'm really keen to move over to GitHub Pages to make keeping the site up to date easier, and potentially to help me delegate more responsibility to community members who've helped over the years. Should they want it!

An org actually sounds like a great idea. I had forgotten this issue exists (it landed just about the time we were at critical with baby preparations) and was just wondering how to handle multiple subdomains for languages. An org, with repos that do nothing but contain the generated HTML files, seems like a reasonable approach.

Gadgetoid (Collaborator) commented

Oh, and pico.pinout.xyz at least is just a hand-crafted static HTML site. I moved it over to GitHub Pages this morning. I don't currently have any plans to make it into a board database, though I won't rule out the idea altogether.

I think a lot of the friction with the Pi database is, reading between the lines here, just how awful this code and dataset are. It grew very organically, and I tend to get caught up optimising the actual end result rather than the process that produces it. Maybe a better system would make Pico and micro:bit board databases, and alternate pin layouts (for third-party boards, in the RP2040's case), more attainable.

I've had an org set up for a while; I must have forgotten about it: https://github.com/pinout-xyz
