Writing a custom dash docset for Powershell docs
Powershell has become the default shell since the Windows 10 Creators Update and it’s starting to become more than just a framework for malware deployment (not my words). Apart from the language itself, which feels alien to me (is it a shell ? a scripting language ? a programming language ? a duck ?), my biggest gripe with Powershell is the lack of documentation accessible from an offline network (or simply without direct access to the Internet). For a shell that was created for sysadmins, you would imagine MS would have thought of shipping Powershell with “batteries included”. I cannot count the number of times I needed to do a Powershell-related search on my smartphone while operating on a detached network.
Secondly, until recently there was no easy way to query help for a specific API (for example Get-ChildItem). There is evidently the Get-Help cmdlet, which shows the “manpage” associated with a specific cmdlet, but that too grabs the documentation from the Internet ! At least since Powershell 3.0 there is also the Save-Help cmdlet, which can do a bulk download of the manpages of every posh module installed on a system, and Update-Help to update it on a separate machine.
However I’m not a haxx0r elite programmer and for the life of me I can’t spend my time in a text-based console world. I grew up with click-based interfaces and browsers (not necessarily web browsers), therefore I’m way more at ease searching for information in an environment where a “mistype” cannot do serious damage to the system. I also like to click-click on colored boxes and purple links.
Maybe to alleviate my silent issue (I’m surely not the only one bothered by this), the people from docs.microsoft.com recently launched a Powershell modules browser in which you can do full-text search for Powershell cmdlets.
This is a great improvement but unfortunately it’s still online only. However, the docs are simple and well structured enough to be packaged in a dash docset for offline viewing. What follows in this blog post is how I proceeded to build the docset, as well as some “tips” for more advanced/obscure topics such as package navigation links and theme support.
TLDR : the generation script is here https://github.com/lucasg/powershell-docset but is subject to regular changes and breakages. It’s better to download only the generated docsets : https://github.com/lucasg/powershell-docset/releases
The document on how to build a dash-compatible docset is quite clear and straightforward : https://kapeli.com/docsets#dashDocset.
Any dash compatible docset must follow a certain folder hierarchy, like any packaging format. Nothing crazy here, we have a layer of metadata on top of arbitrary data :
.
└ $docset_name.docset/
    ├ icon.png
    ├ icon@2x.png            # optional for retina displays
    └ Contents/
        ├ Info.plist
        └ Resources/
            ├ LICENSE
            ├ docSet.dsidx
            └ Documents/
                └ *          # your documentation lives here
Documents can contain whatever you want, but usually you want to reproduce the URL paths if your docs come from a website. In my case :
.
└ Documents\
    └ docs.microsoft.com\
        └ en-us\
            ├ index.html     # docset start page
            └ powershell\
                └ module\
                    ├ CimCmdlets\
                    ├ Microsoft.PowerShell.Archive\
                    ├ Microsoft.PowerShell.Core\
                    ├ Microsoft.PowerShell.Diagnostics\
                    ├ Microsoft.PowerShell.Host\
                    ├ Microsoft.PowerShell.LocalAccounts\
                    ├ Microsoft.PowerShell.Management\
                    ├ Microsoft.PowerShell.Security\
                    ├ Microsoft.PowerShell.Utility\
                    ├ Microsoft.WSMan.Management\
                    ├ PackageManagement\
                    ├ Pester\
                    ├ PowerShellGet\
                    ├ PSDesiredStateConfiguration\
                    ├ PSDiagnostics\
                    └ PSReadLine\
                        ├ Get-PSReadlineKeyHandler.html
                        ├ Get-PSReadlineOption.html
                        ├ index.html     # PSReadLine module index page
                        ├ PSConsoleHostReadline.html
                        ├ Remove-PSReadlineKeyHandler.html
                        ├ Set-PSReadlineKeyHandler.html
                        └ Set-PSReadlineOption.html
Icons are straightforward :
Info.plist is an XML file describing the archive :
The only tricky item in Info.plist is dashIndexFilePath : its value must respect the relative path within the Documents folder. (NB: Duplicate entries in Info.plist will make Velocity crash when importing a docset, so double check there are none when you create your archive.)
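For illustration, here is a minimal sketch of how such an Info.plist could be generated with Python's standard plistlib module. The key names come from the Dash docset guide linked above; the identifier, name and index path values are placeholders, not necessarily the ones used in the real docset. A side benefit of building the plist from a dict is that duplicate keys cannot occur.

```python
import plistlib


def build_info_plist(identifier: str, name: str, index_path: str) -> bytes:
    """Serialize the minimal metadata of a docset as an XML plist."""
    info = {
        "CFBundleIdentifier": identifier,
        "CFBundleName": name,
        "DocSetPlatformFamily": identifier,
        "isDashDocset": True,
        # Path of the start page, relative to the Documents folder.
        "dashIndexFilePath": index_path,
    }
    # plistlib writes the XML flavour by default.
    return plistlib.dumps(info)


if __name__ == "__main__":
    # Placeholder values, chosen to match the folder layout shown earlier.
    xml = build_info_plist(
        "posh", "Powershell", "docs.microsoft.com/en-us/index.html"
    )
    print(xml.decode("utf-8"))
```

The resulting bytes would be written to Contents/Info.plist inside the docset folder.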
Crawling website endpoint
There are several ways to create a dash docset from existing documentation :
- Generating it from a compatible code comments system (Doxygen, GoDocs, etc) : Docset Generation Guide
- Using a docset generator such as dashing, which can automagically create a docset from existing HTML documentation using CSS selectors. (Fun fact : dashing was created by a Microsoft Azure guy and is written in Go. Go figure.)
- Custom creating one with lots of scraping, copious amounts of sweat and a bit of luck/skill.
In my case, the Powershell Modules documentation is public and resides here : https://github.com/PowerShell/PowerShell-Docs. However, the documentation is written in a markup language suited for DocFx, the .NET documentation generator. DocFx is not dash-compatible yet, so the first option is out.
Theoretically, I could have tried to generate the HTML documentation using DocFx and convert it into a docset using dashing, but that would have meant using two tools I have no experience with, and the resulting docset could be difficult to debug (as you will see further below) if anything went wrong somewhere. If the Microsoft Azure people want to do it this way, they are more than welcome. So out with the second option.
Third option it is, then.
Fortunately, as I said previously, the Powershell modules doc website is well structured and provides a json file describing the table of contents : https://docs.microsoft.com/en-us/powershell/module/psdocs/toc.json?view=powershell-6 (you can change the powershell version number for previous major versions). The toc.json is basically the sitemap and lists all the modules and cmdlets for a given Powershell version.
requests are really life savers in those situations :
With only 40 lines of Python, I was able to save the full website to disk.
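As a rough idea of what such a crawler can look like, here is a minimal sketch, not the original script. It assumes the toc.json is a nested tree of nodes carrying "toc_title", "href" and "children" keys under a top-level "items" list (an assumption about the file's layout), and it uses the standard urllib module instead of requests so that it has no external dependency. The helper names are made up for the example.

```python
import json
from pathlib import Path
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

TOC_URL = "https://docs.microsoft.com/en-us/powershell/module/psdocs/toc.json"


def iter_toc_entries(nodes):
    """Yield (title, href) for every linked node of the nested table of contents."""
    for node in nodes:
        href = node.get("href")
        if href:
            yield node.get("toc_title", ""), href
        yield from iter_toc_entries(node.get("children", []))


def local_path(documents_dir, url):
    """Map a page URL to a file under Documents, mirroring the URL path."""
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"      # folder-like URL: module index page
    elif not rel.endswith(".html"):
        rel += ".html"           # cmdlet page
    return Path(documents_dir) / parts.netloc / rel


def crawl(documents_dir, version="powershell-6"):
    """Download every page listed in toc.json (performs network requests)."""
    with urlopen(f"{TOC_URL}?view={version}") as resp:
        toc = json.load(resp)
    for _title, href in iter_toc_entries(toc.get("items", [])):
        # Resolve the href against the TOC location and drop any query string.
        page_url = urljoin(TOC_URL, href).split("?", 1)[0]
        target = local_path(documents_dir, page_url)
        target.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(f"{page_url}?view={version}") as resp:
            target.write_bytes(resp.read())
```

Calling crawl("Documents") would then populate the folder hierarchy described earlier, one HTML file per table of contents entry.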
Not a database expert, so I’ve copied from
It works for my purposes.
As said in the official documentation, there is a limited number of entry “types” : https://kapeli.com/docsets#supportedentrytypes. Even though Microsoft keeps on using Cmdlet to designate Powershell commands, I’ve listed them under the “Command” entry type.
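For reference, the index is a SQLite file named docSet.dsidx containing a single searchIndex table, as described in the Dash docset guide. The following is a minimal sketch of how it could be filled with Python's sqlite3 module; the build_index helper and the sample entries are made up for the example.

```python
import sqlite3


def build_index(db_path, entries):
    """Create the Dash search index and insert (name, type, path) rows."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS searchIndex("
                "id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT)"
            )
            # The unique index lets INSERT OR IGNORE skip duplicate entries.
            conn.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS anchor "
                "ON searchIndex (name, type, path)"
            )
            conn.executemany(
                "INSERT OR IGNORE INTO searchIndex(name, type, path) "
                "VALUES (?, ?, ?)",
                entries,
            )
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical entries: paths are relative to the Documents folder.
    build_index("docSet.dsidx", [
        ("Get-ChildItem", "Command",
         "docs.microsoft.com/en-us/powershell/module/"
         "Microsoft.PowerShell.Management/Get-ChildItem.html"),
    ])
```

Each cmdlet gets one row of type “Command”, and the path column points at the HTML file relative to the Documents folder.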
Packaging the docset
Once you’ve got all the HTML source files and you’ve populated the database, just tar-gz the folder while preserving the structure within the archive (relative folder paths). This Python snippet does the trick :
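A minimal sketch of that packaging step with the standard tarfile module (not the original script) could look like this. The package_docset name and the file names in the usage example are placeholders.

```python
import tarfile
from pathlib import Path


def package_docset(docset_dir, archive_path):
    """Compress a .docset folder into a .tgz, keeping paths relative."""
    docset_dir = Path(docset_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname makes the archive start at "<name>.docset/" instead of
        # embedding the absolute path of the build machine.
        tar.add(str(docset_dir), arcname=docset_dir.name)
    return Path(archive_path)


if __name__ == "__main__":
    # Placeholder names for the example.
    package_docset("Powershell.docset", "Powershell.tgz")
```

The arcname argument is what preserves the relative structure: every member of the archive is stored under the docset folder name.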
At the end of it, you -hopefully- get a well formed docset archive that can be imported into
Notice anything ? It’s really ugly, unless you’re a brutalist aficionado (disclaimer : I’m not). Furthermore, you can’t tell from an image, but navigation links are broken : you can’t click-click and turn those items purple, which makes me sad.
Fixing paths implies parsing and rewriting the downloaded HTML files, while supporting CSS “themes” usually ends up requiring a headless web browser to do the scraping. These are cumbersome and brittle operations, which is probably why most user-contributed docsets don’t bother doing it.
In order to have “clickable” links, you need to rewrite <a> anchors and convert “dynamic” hrefs to “static” ones. For example transform :
<a href="/powershell/module/Microsoft.PowerShell.Archive/?view=powershell-6" data-linktype="relative-path">
into :
<a href="powershell/module/Microsoft.PowerShell.Archive/index.html" data-linktype="relative-path">
There is also a fuckton of nav bars, dropdown menus and sidebar DOM elements that need to be removed. Fortunately for me, the docs HTML pages are well-formatted and there is little to no variability in how the DOM is structured. I have basically three types of pages : the index.html, each module’s own index.html and each cmdlet’s page.
BeautifulSoup is the go-to python library for parsing HTML contents and it does the job admirably.
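The href conversion itself can be sketched as a small pure function, here written with the standard urllib.parse module only. It is a simplified illustration that mirrors the example above : the to_static_href name is made up, and the rule that cmdlet pages get a “.html” suffix is an assumption based on the folder layout shown earlier.

```python
from urllib.parse import urlsplit


def to_static_href(href: str) -> str:
    """Turn a site-relative "dynamic" link into a path to a saved HTML file."""
    parts = urlsplit(href)
    # Leave external links and pure in-page anchors untouched.
    if parts.scheme or parts.netloc or not parts.path:
        return href
    path = parts.path.lstrip("/")
    if path.endswith("/"):
        path += "index.html"     # module index page
    elif not path.endswith(".html"):
        path += ".html"          # cmdlet page
    # The "?view=..." query is dropped; a "#fragment" is kept.
    if parts.fragment:
        path += "#" + parts.fragment
    return path


if __name__ == "__main__":
    print(to_static_href(
        "/powershell/module/Microsoft.PowerShell.Archive/?view=powershell-6"
    ))
```

With an HTML parser such as BeautifulSoup, each <a> element found in a page can have its href attribute passed through a function like this one before the file is written back to disk.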
With working URI paths your docset is more “dynamic” and “discoverable”, but it’s still fugly. I don’t know about Dash, but Velocity and Zeal are basically web browsers (Velocity relies on the Chromium Embedded Framework) so it’s possible to “theme” your docset via CSS.
Problem though, the CSS theme location is not present in the toc.json file so you need to find it on your own. That’s where Chrome/Firefox/Edge’s devtools are your friend :
You will need to reconstruct the exact same directory tree in your docset package and also fix URLs. Fortunately, I didn’t have to replicate the cross-domain “cdn” requests. “Dynamic” pages like powershell
Promise. In order to locate and retrieve all the necessary resources I ended up using a
phantomjs) which is a headless web browser you can automate. This is way slower than using simple HTTP requests, but the result is more accurate.
Below is the result with the CSS theme correctly applied. That’s much nicer to see !
Overengineering the whole process
I’ve set up a .travis.yml script that allows me to generate docsets on a new commit, as well as push them to my GitHub repository’s releases.
Travis is also set to regularly fire a build (“cron job”) in order to keep an up-to-date build.
I even went as far as considering using backstroke.us to monitor/sync https://github.com/Powershell/PowerShell-Docs and trigger a Travis scraping job on a new commit in order to always have up-to-date documentation. But on second thought I think that’s a bit overboard for a simple weekend side project.