Wikispecies:Project Cleanup/Using AWB to convert hyphens to en-dashes

(Moved here from Village pump Dan Koehl (talk) 00:55, 16 February 2017 (UTC))

Participants

  1. Mariusm (talk) 16:24, 14 February 2017 (UTC)
  2. Scott Thomson (Faendalimas) talk 16:39, 14 February 2017 (UTC)
  3. Franz Xaver (talk) 16:52, 14 February 2017 (UTC)
  4. Justin (koavf)TCM 18:53, 14 February 2017 (UTC)
  5. Dan Koehl (talk) 19:39, 14 February 2017 (UTC)
  6. Andyboorman (talk) 15:09, 15 February 2017 (UTC)

Discussion

Category:Reference templates

Can someone with AWB knowledge convert all xxxx-xxxx to xxxx–xxxx (from hyphen to en-dash), where xxxx may be a number of any length? Mariusm (talk) 16:24, 14 February 2017 (UTC)

Possibly do a search and replace of - (i.e. a hyphen) with {{spaced ndash}}, assuming that template exists here. As for the numbers, does AWB support wildcard numbers? I'm not sure, but I would search for x-x only and ignore the rest; how to write that as a wildcard for the replacement I cannot say. I would have to look up the wildcards, as I have not had to do this before. Cheers Scott Thomson (Faendalimas) talk 16:39, 14 February 2017 (UTC)
I suppose the hyphen should not be replaced when it occurs in ISBN or ISSN codes? --Franz Xaver (talk) 16:52, 14 February 2017 (UTC)
I have a userscript at en.wp (which I cannot seem to import here for some reason) which is very smart about this--it gets normal strings of text, date ranges, page ranges, etc. but not IS(B/S)Ns. w:en:User:GregU/dashes.js. Tried it again now and it's still not working... —Justin (koavf)TCM 18:53, 14 February 2017 (UTC)
I can give it a try in about an hour. Dan Koehl (talk) 19:39, 14 February 2017 (UTC)
In which instances is there a need to change, @Mariusm:? Can you give an example, please? Dan Koehl (talk) 07:46, 15 February 2017 (UTC)
@Dan Koehl: The change is needed mainly for references (journal or book publications) for the page ranges. See for example Template:Chou & Wang, 1996. The change is required for 141-144 to be converted to 141–144. Mariusm (talk) 08:12, 15 February 2017 (UTC)

───────────────────────── This looks pretty easy to change. Does it mean most of the pages in need of this change are located within Category:Reference templates? That would be a good way to filter them out and avoid changes on pages where no changes should be made. Dan Koehl (talk) 08:44, 15 February 2017 (UTC)

@Mariusm:, I made a first attempt with 5 pages, with at least 1 bad result: Cystofilobasidiaceae (OK?), but here on Cystofilobasidiales is not OK?, Filobasidiaceae (OK?), again here at Filobasidiales not OK, and here it changed the ISSN number hyphen at ISSN 0374-1036/32. I'm now trying to change the configuration so it filters out only numbers. Dan Koehl (talk) 09:18, 15 February 2017 (UTC)

Also see my contributions. After checking the pages, please revert the version with the errors. Dan Koehl (talk) 09:24, 15 February 2017 (UTC)

@Dan Koehl:

Try to make the following filters: (1) the hyphen must be squeezed between two numbers (2) no "ISSN " or "ISBN " before the first number.

Have a start with the Category:Reference templates - it may be easier. Mariusm (talk) 10:23, 15 February 2017 (UTC)
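The two filter rules proposed above (a hyphen squeezed between two numbers, with no "ISSN " or "ISBN " before the first number) could be sketched roughly as follows. This is only an illustration in Python, not the AWB configuration used in this thread; the names PAGE_RANGE and hyphen_to_en_dash are hypothetical, and the extra guards against neighbouring digits and hyphens are an assumption added so that multi-part ISBNs are not matched halfway through.

```python
import re

EN_DASH = "\u2013"

# Hypothetical pattern, not the rule actually run in AWB:
#   (?<![\d-])     do not start in the middle of a number or hyphenated code
#   (?<!ISSN )     rule 2: no "ISSN " directly before the first number
#   (?<!ISBN )     rule 2: no "ISBN " directly before the first number
#   (\d+)-(\d+)    rule 1: a hyphen squeezed between two numbers
#   (?![\d-])      do not stop in the middle of a longer hyphenated code
PAGE_RANGE = re.compile(r"(?<![\d-])(?<!ISSN )(?<!ISBN )(\d+)-(\d+)(?![\d-])")


def hyphen_to_en_dash(text: str) -> str:
    """Replace the hyphen in number ranges with an en-dash."""
    return PAGE_RANGE.sub(rf"\1{EN_DASH}\2", text)
```

Note that this sketch only protects identifiers written as "ISSN " or "ISBN " followed by a space; an ISSN written with a colon ("ISSN: 1313-2989") would still be changed.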

OK, let me try @Mariusm:. Do you have IRC? I'm at #wikispecies (IRC channel). Dan Koehl (talk) 10:26, 15 February 2017 (UTC)
Leaving aside the how for one second: if these are ISSN or ISBN numbers, is it a good idea to change hyphens to ndash? If people copy and paste ISBN numbers into Google to look up the book, will the search find it, given that the number sets in these are expected to be separated by hyphens? I do not know, just asking. Cheers Scott Thomson (Faendalimas) talk 10:31, 15 February 2017 (UTC)
We use ISSN numbers as links to WS ISSN pages, so we mustn't change the ISSN hyphens!! @Dan Koehl: try another filter rule: ": " must be before the first number. Mariusm (talk) 10:38, 15 February 2017 (UTC)
@Mariusm:, I think it's OK now, please see my last edits. Dan Koehl (talk) 10:51, 15 February 2017 (UTC)
@Dan Koehl: Perfect! The ISSN aren't changed and the page numbers get their en-dash. You can go ahead and start rolling. Mariusm (talk) 12:35, 15 February 2017 (UTC)
Yes @Mariusm:, it's working well now. I'll give it a couple more edits and take a look; if all looks good, I will call the obedient KoehlBot and let him take care of it. Please let me know which other categories would be good to try it on, ideally with a good variation in content, so I can reach the point where a larger part of WS can be run through by the bot. Dan Koehl (talk) 12:41, 15 February 2017 (UTC)
@Dan Koehl: many reference-templates aren't marked with Category:Reference templates and you can spot them by searching for "Special:WhatLinksHere". Many many many references are also on regular pages without templates, so you'll need to scan all WS to get hold of every page range there is. Mariusm (talk) 12:52, 15 February 2017 (UTC)
@Dan Koehl and Mariusm: There is another instance, where hyphens should be replaced, which probably will not be found in this round. I mean the life data in author pages as e.g. Lujo Adamović or Adam Afzelius. Moreover also here one more replacement operation seems to be necessary to convert "(19-20)" into "(19–20)". Probably a different filter configuration will be needed here, covering these cases? --Franz Xaver (talk) 13:17, 15 February 2017 (UTC)

───────────────────────── I have in mind many such projects to standardize WS (for example to standardize the way Type locality: is displayed). But let's finish this project first. Mariusm (talk) 13:00, 15 February 2017 (UTC)

OK, very good @Mariusm and Franz Xaver:, maybe we should even mark this as a mini project under the cleanup project, for the sake of documentation?
The templates in Category:Reference templates alone number almost 25 000, so I will leave this task to User:KoehlBot now, on a second computer. The present selection of files looks very much the same, and not much can happen beyond what we have seen on the ones I went through manually, but please help me by checking now and then that the bot is operating correctly. This means I can soon take a look at the other files you both have in mind, and start doing the first tries manually with AWB. Do you two have IRC, so we can discuss the operation at #wikispecies? Help me look after the bot and make sure it does its job well. Dan Koehl (talk) 13:49, 15 February 2017 (UTC)
@Dan Koehl: I have had a look at some of my ref templates that your bot has acted on and they look fine with no unexpected oddities. This is appreciated as I have not really had time to bother with the minor differences between hyphen and en-dashes. Regards Andyboorman (talk) 15:09, 15 February 2017 (UTC)
@Dan Koehl: I don't have IRC and I'm not regularly at my computer, but I'll check periodically on the progress. So far you've done a good job. Mariusm (talk) 15:31, 15 February 2017 (UTC)

@Dan Koehl: Can you give me the AWB rule(s) that you are using to make this change? I'm assuming it's some regex replacement... —Justin (koavf)TCM 20:35, 15 February 2017 (UTC)

@Koavf:, I'm sure it can be made more precise and complicated; I used a simplified expression. Find: (: [0-9]+)-([0-9]+), replace with: $1–$2, in the find and replace/normal section. I guess an HTML code for the dash could be used instead, but I chose to type the dash as any user would, with the keyboard. It seems to have worked well. I set a limit of 100 turns; since all is working well, I'll let the bot run through all the 24 000 files. KoehlBot (talk) 21:56, 15 February 2017 (UTC)
@Dan Koehl: That is beautiful. I've been meaning to do that for awhile myself. And yes, inserting the actual character is better than "&ndash;" or a Unicode escape. —Justin (koavf)TCM 22:41, 15 February 2017 (UTC)
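For readers who want to try the rule quoted above outside AWB, here is a rough Python equivalent. The find pattern is the one given in the comment; the function name apply_rule is made up for this sketch, and Python writes the backreferences as \1 and \2 where AWB uses $1 and $2.

```python
import re

EN_DASH = "\u2013"

# Find pattern as quoted in the discussion: ": " plus digits, a hyphen, digits.
RULE = re.compile(r"(: [0-9]+)-([0-9]+)")


def apply_rule(text: str) -> str:
    """Apply the single find-and-replace rule to a piece of wikitext."""
    return RULE.sub(rf"\1{EN_DASH}\2", text)
```

Because the pattern requires ": " before the first number, "ISSN 0374-1036" is left alone and "Zootaxa 4: 141-144" gets its en-dash. By the same logic, a range that is not preceded by ": " (such as a bare "1905-2001") is skipped, and an ISSN written as "ISSN: 1313-2989" is changed.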

@Mariusm, Faendalimas, Franz Xaver, Justin, Koavf, and Andyboorman:, I moved the discussion from Village pump here to a subproject page.

After 100 edits had been made by user:KoehlBot, apparently looking OK with no obvious errors, I released the beast on the rest, going through approximately 24 000 pages. It took some hours, finished this afternoon, and made changes to thousands of pages. Dan Koehl (talk) 18:46, 16 February 2017 (UTC)

Category:Taxonomists

Now, following Franz Xaver's suggestion, I have just finished manually trying 100 edits on Category:Taxonomists (recursive). I kindly ask you to inspect those 100 pages before I let the bot take over and take care of the remaining approximately 25 000 pages belonging to Category:Taxonomists. Dan Koehl (talk) 00:55, 16 February 2017 (UTC)

@Dan Koehl: Still requires human intervention: see this edit putting a dash into an ISSN. —Justin (koavf)TCM 05:06, 16 February 2017 (UTC)
@Dan Koehl: It also missed a lot here. —Justin (koavf)TCM 05:22, 16 February 2017 (UTC)
@Koavf: I don't think he missed any; the ranges he "missed" were already written with en-dashes rather than with hyphens. Mariusm (talk) 05:36, 16 February 2017 (UTC)
@Mariusm: Sorry, I think you are mistaken. For instance, his edit missed "1905-2001" and I changed it to "1905–2001" by hand. Of course, missing an edit is better than making an incorrect one. Either way, it's not the end of the world. —Justin (koavf)TCM 06:05, 16 February 2017 (UTC)
@Koavf: OK, you're speaking about year-ranges. I referred only to page ranges which were OK. Mariusm (talk) 10:07, 16 February 2017 (UTC)
@Mariusm: No, actually, look at the diff I linked above--there are many page ranges that did not get fixed. —Justin (koavf)TCM 17:37, 16 February 2017 (UTC)
@Dan Koehl: Yes there's a problem because apparently the ISSN is followed by a ":" rather than the usual space, but this isn't such a big problem since they aren't made here as links such as for example ISSN 1313-2989. Regarding the Category:Taxonomists pages, this is not what Franz meant. You changed the page hyphens which is OK, but Franz meant to change also the year ranges (birth-death). For example for Yasuhiko Asahina to change (1881-1975) to (1881–1975). Mariusm (talk) 05:31, 16 February 2017 (UTC)
OK, I'll try more, and better. Dan Koehl (talk) 08:11, 16 February 2017 (UTC)
@Dan Koehl: It's still a good start. And some of them are tricky because they are malformed (like p.12- 19). Did you see the script that I have on en.wp? Does that help any? —Justin (koavf)TCM 17:37, 16 February 2017 (UTC)
@Koavf:, Sorry, no, where can I find that script? I have now modified the configuration so that a second rule looks for ISSN numbers with a dash (find: ISSN: ([0-9]+)–([0-9,A-z]+), replace with: ISSN: $1-$2) and converts them back to a hyphen, AFTER page numbers and such have been converted from hyphen to dash. A little bit stupid, but it may work. So far I have tried it on Masaru Baba and Dewanand Makhan (see this diff), where the page numbers get converted to a dash while the ISSN keeps its hyphen, and it looks OK now, I think. I'll make a couple of tests, fewer than 100, to try it out, but I am interested to see your code. As for the files the code has missed, I haven't looked into that; can you explain once more, so I may try to cover that as well? Dan Koehl (talk) 18:02, 16 February 2017 (UTC)
@Dan Koehl:. I'm not sure if I understand your question but here is what I think you asked: you made ~100 edits automatically and asked us to review them. I saw some edits that were totally fine, some where an ndash was inserted that should not have been (false positives in ISSNs), and some where ndashes were not inserted that should have been (false negatives). Overall, better than before. It just can't be done automatically as it stands now but someone can do it semi-automatically if he keeps an eye on it (which is tiresome). —Justin (koavf)TCM 18:09, 16 February 2017 (UTC)
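The two-step approach described above (convert everything first, then turn the dash inside ISSNs back into a hyphen) might look like this in Python. It is a sketch only: the first pattern is the rule quoted earlier in the thread, while the second is a tightened variant of the quoted ISSN rule that accepts digits and an X check character instead of the broader [0-9,A-z] class. That narrowing, and the names used here, are assumptions of this example.

```python
import re

EN_DASH = "\u2013"

# Pass 1: the page-range rule quoted earlier in the discussion.
PAGE_RULE = re.compile(r"(: [0-9]+)-([0-9]+)")

# Pass 2: undo the dash inside "ISSN: nnnn–nnnn".
# Assumption: the second half is digits plus an optional X check character.
ISSN_RULE = re.compile(rf"ISSN: ([0-9]+){EN_DASH}([0-9Xx]+)")


def convert(text: str) -> str:
    """Run both passes in order: dashes everywhere, then restore ISSN hyphens."""
    text = PAGE_RULE.sub(rf"\1{EN_DASH}\2", text)
    return ISSN_RULE.sub(r"ISSN: \1-\2", text)
```

The order matters: the second rule only has something to repair after the first one has run.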
@Mariusm, Faendalimas, Franz Xaver, Koavf, and Andyboorman::
  1. I think I have it now: the year ranges (birth-death) now get a dash instead of a hyphen, for example (1881-1975) becomes (1881–1975) for Yasuhiko Asahina. If you look at Lujo Adamović, Adam Afzelius, Template:Popovici & Buhl, 2010, and other files, as far as I can see the files now get corrected. Next I will try to find out why AWB does NOT correct some files, as @Koavf: remarked. I think it is slowly heading in a direction where it may be done automatically by the bot, but for now it still needs inspection for ~100 edits or so? Dan Koehl (talk) 18:46, 16 February 2017 (UTC)
  2. When going through all taxonomists later, it would be easy to add those taxonomist articles that don't have {{DEFAULTSORT|XXX, XXX}} to a category (category:taxonomist without defaultsort or something), so they can be taken care of later, IF you want? Dan Koehl (talk) 18:52, 16 February 2017 (UTC)
  3. By the way, what do you prefer, ISSN(space)XXXX-XXXX, or ISSN: XXXX-XXXX, or any other (3rd) alternative? Dan Koehl (talk) 19:13, 16 February 2017 (UTC)
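The rule used for the year ranges in item 1 is not quoted in this discussion. As a guess at what such a rule could look like, the following Python sketch matches two numbers of two to four digits inside parentheses, which covers both "(1881-1975)" and the "(19-20)" case mentioned earlier. The pattern and the names are hypothetical.

```python
import re

EN_DASH = "\u2013"

# Hypothetical rule: "(nnnn-nnnn)" or "(nn-nn)" directly inside parentheses.
YEAR_RANGE = re.compile(r"\((\d{2,4})-(\d{2,4})\)")


def fix_year_ranges(text: str) -> str:
    """Replace the hyphen in parenthesised year ranges with an en-dash."""
    return YEAR_RANGE.sub(rf"(\1{EN_DASH}\2)", text)
```

Requiring the parentheses keeps the rule away from ISSNs and postal codes, at the cost of missing year ranges that are written without them.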

─────────────────────────ISSN/ISBN should be written ISSN: XXXX-XXXX. This is customary, and many people copy and paste these into a library engine to find local copies of the article. To make that easy, we should have them as plain text with no special characters and separated by spaces either side, noting that "ISBN:", for example, is an identifier and not part of the code. Cheers Scott Thomson (Faendalimas) talk 19:25, 16 February 2017 (UTC)

I can't see that the configuration causes any harmful errors; it seems to work OK, so I'm trying 100 test edits, which need to be evaluated before I let the bot take care of the approximately 55 000 files in category Taxon authorities (with subcategories).

Question: What is recommended for CEP: CEP XXXX with hyphen XXXX, or CEP XXXX dash XXXX?

@Mariusm, Faendalimas, Franz Xaver, Koavf, and Andyboorman:: All the edits look fine. I would like to transfer those edits to a bot; any objections? Dan Koehl (talk) 11:34, 17 February 2017 (UTC)
@Dan Koehl: It looks fine to me. Question: Why do you concentrate only on author pages and do not run the bot on the entire WS? In a usual species page there are references too with page ranges. Your question on CEP - it is the Postal Addressing Code of Brazil, so better leave it with a hyphen. Mariusm (talk) 14:29, 17 February 2017 (UTC)
@Mariusm:, since this was just the test period for this configuration, I preferred to edit a limited number of pages, in one category. To be honest, I cannot guarantee how the script will behave in new situations, so when trying other categories a little bit of inspection will be needed, until the entire WS can be included in a bot scan. For now, after you confirmed that it looks OK, I am letting KoehlBot take care of the remaining 45 000 pages belonging to the present edit section. It may take 1–2 days.
The configuration will now change dash to hyphen in all CEPs. Dan Koehl (talk) 14:38, 17 February 2017 (UTC)
@Dan Koehl: Nice. I see KoehlBot is churning out pages nicely. Mariusm (talk) 16:32, 17 February 2017 (UTC)
Yes, let's hope so. I believe it's doing just fine, but it will take some time with 45 000+ pages. With that amount, I'm sure we will hear from someone if something goes wrong, since a user who has any of the 45k pages on their watchlist will probably take a look? So when the 45k test is passed, I think I can release the beast KoehlBot on all files... I'm happy you are happy again! :) Dan Koehl (talk) 16:41, 17 February 2017 (UTC)
@Mariusm: I think it should be run across the main content namespaces such as Main:, Template:, and Category: plus documentation at Wikispecies: and Help: but not User: or the Wikispecies: pages which are discussions rather than policy. A policy page should be well-formatted and clear but the Pump is just talk and doesn't require perfect formatting (e.g. this page itself). —Justin (koavf)TCM 03:22, 18 February 2017 (UTC)
In response to the question: the Código de Endereçamento Postal (CEP) in Brasil should be written like mine, e.g. CEP 04263-000. It should consist of 8 digits, with the last 3 separated by a hyphen. The positions of the digits actually have a meaning and identify different levels of subzones. In an address you do not have to write CEP, e.g. my address at the museum is: Museu de Zoologia da Universidade de São Paulo, Avenida Nazaré, 481, Ipiranga, 04263-000, São Paulo, SP, Brasil. Cheers, Scott Thomson (Faendalimas) talk 03:38, 18 February 2017 (UTC)
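The CEP format described above (five digits, a hyphen, three digits) suggests a repair rule along the following lines for codes that have picked up an en-dash. The actual AWB rule for CEPs is not quoted in this discussion, so the pattern below, including its reliance on a leading "CEP" label, is an assumption made for illustration.

```python
import re

EN_DASH = "\u2013"

# Hypothetical rule: "CEP" (optionally with a colon), five digits, an en-dash,
# then exactly three digits. Only labelled codes are touched.
CEP_RULE = re.compile(r"(CEP:? ?\d{5})" + EN_DASH + r"(\d{3})\b")


def restore_cep_hyphen(text: str) -> str:
    """Turn the en-dash inside a labelled CEP back into a plain hyphen."""
    return CEP_RULE.sub(r"\1-\2", text)
```

A CEP written without the label, as in the address example, would not be caught by this sketch and would need a separate rule or a manual check.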

For that matter, it would probably be wise for someone (me?) to make a list of Category:ISSNs using ndashes and redirect them to the proper hyphens--it would only help. —Justin (koavf)TCM 17:58, 18 February 2017 (UTC)

All files

@Mariusm, Faendalimas, Franz Xaver, Koavf, and Andyboorman:: After the ca. 45 000 files were analyzed, and over 15 000 corrected, I have now started to go manually through files with the filter/search criterion ALL files. Since AWB only lists 25 000 at a time, that is the selection to start with. I will once again kindly ask you to select a couple of the changes and inspect them; if you think it looks OK, I can let the bot go through them later. I will start with the 100 files now. Dan Koehl (talk) 17:08, 18 February 2017 (UTC)

@Dan Koehl: Have you a quick link? Sorry to be a pain. Andyboorman (talk) 17:15, 18 February 2017 (UTC)
Yes, @Andyboorman:, the latest edits I did are here. I will now go through 100 of the latest edited files, and those edits will turn up on the same link. Dan Koehl (talk) 17:50, 18 February 2017 (UTC)
I have had a look through a sample of ISSN, Author and Ref Templates. All seems OK except for non-consensual ref formats, of course! Great improvements, well done. Andyboorman (talk) 19:38, 18 February 2017 (UTC)
Agree. I'm doing a last test with new pages to check what comes out there, and if I don't see any obvious errors, I'll let User:KoehlBot go through all pages. I have no idea how long it will take... Dan Koehl (talk) 20:07, 18 February 2017 (UTC)
The test pages on my new Brassicaceae pages have worked fine. Hours or days? Go for it. Andyboorman (talk) 20:56, 18 February 2017 (UTC)
Days I believe. OK, I will let the Bot take over now. It will go through 25 000 pages on each run. Dan Koehl (talk) 21:05, 18 February 2017 (UTC)
Bot restarted, it had stopped after 5000 edits. Dan Koehl (talk) 10:58, 19 February 2017 (UTC)