I discovered that the master job is a red herring. The "error" is just a friendly notice that the first pipeline doesn't have the desired job. It successfully finds the job in the next pipeline it checks. Then, the runner ran out of disk space and the job died. Apparently I do not yet catch that particular spurious failure.
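For reference, catching that kind of infrastructure failure could look roughly like the sketch below: fetch the failed job's trace through GitLab's job-trace endpoint and scan it for known spurious patterns. This is not the actual fetch script; the helper name, the pattern list, and the use of `requests` are all mine.

```python
import requests

GITLAB = "https://gitlab.haskell.org/api/v4"
PROJECT = "ghc%2Fghc"  # URL-encoded "ghc/ghc"

# Hypothetical list of log patterns treated as spurious (infrastructure)
# failures rather than real build breakage.
SPURIOUS_PATTERNS = [
    "No space left on device",  # the disk-space failure seen here
]

def is_spurious_failure(job_id: int, token: str) -> bool:
    """Fetch a failed job's trace and scan it for known spurious patterns."""
    r = requests.get(
        f"{GITLAB}/projects/{PROJECT}/jobs/{job_id}/trace",
        headers={"PRIVATE-TOKEN": token},
    )
    r.raise_for_status()
    return any(pat in r.text for pat in SPURIOUS_PATTERNS)
```

A job flagged this way could then be retried instead of being reported as a real failure.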
IIRC, the last time the same problem happened with a build-$release job, the solution was to twiddle the job name, wait for the problem to happen again, then twiddle it back. A release cycle ritual.
I think Matt's suggestion was to just check two names, x86_64-linux-fedora33-release and x86_64-linux-fedora33.
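If we went that route, the lookup would simply accept either name while walking the pipelines on a ref. A minimal sketch against the GitLab REST API; the function name is mine, and real code would want pagination and error handling:

```python
import requests

GITLAB = "https://gitlab.haskell.org/api/v4"
PROJECT = "ghc%2Fghc"  # URL-encoded "ghc/ghc"

# Accept the Validate spelling as well as the Release one.
JOB_NAMES = ["x86_64-linux-fedora33", "x86_64-linux-fedora33-release"]

def find_job(ref: str, token: str):
    """Walk pipelines on `ref`, newest first, until one contains a job
    matching either accepted name."""
    headers = {"PRIVATE-TOKEN": token}
    pipelines = requests.get(
        f"{GITLAB}/projects/{PROJECT}/pipelines",
        params={"ref": ref, "per_page": 20},
        headers=headers,
    ).json()
    for pipeline in pipelines:
        jobs = requests.get(
            f"{GITLAB}/projects/{PROJECT}/pipelines/{pipeline['id']}/jobs",
            headers=headers,
        ).json()
        for job in jobs:
            if job["name"] in JOB_NAMES:
                return job  # first pipeline with either name wins
    return None
```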
My preferred solution would be to make the same job have the same name regardless of which pipeline it's in, so maybe I'll open a ticket.
GitLab keeps the artifacts for the last pipeline on a ref indefinitely.
The last pipeline on release-9.6 was a Release pipeline, which doesn't have the right job.
The second-to-last pipeline is a Validate pipeline, which does have the right job.
Since that one is only the second-to-last pipeline, it isn't covered by that retention rule, and its artifacts had expired after some number of weeks.
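To make the failure mode concrete: a job's artifact archive is fetched by job id, and the request simply 404s once the archive has been deleted. A minimal sketch, assuming the GitLab REST API via `requests` (the helper name is mine); the `artifacts_expire_at` field on the job JSON distinguishes the two cases above:

```python
import requests

GITLAB = "https://gitlab.haskell.org/api/v4"
PROJECT = "ghc%2Fghc"  # URL-encoded "ghc/ghc"

def download_artifacts(job: dict, token: str, dest: str) -> None:
    """Download a job's artifact archive.

    GitLab locks the artifacts of the most recent pipeline on each ref,
    so those stay downloadable; older jobs carry an
    `artifacts_expire_at` timestamp, after which this request 404s --
    which is what happened with the second-to-last Validate pipeline."""
    print("artifacts_expire_at:", job.get("artifacts_expire_at"))
    r = requests.get(
        f"{GITLAB}/projects/{PROJECT}/jobs/{job['id']}/artifacts",
        headers={"PRIVATE-TOKEN": token},
    )
    r.raise_for_status()  # 404 here once the archive has expired
    with open(dest, "wb") as f:
        f.write(r.content)
```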
I see Ben has answered at the same time as me, but I'll still post this since I expand on why it matters that the most recent pipeline isn't a Validate pipeline.
The ghc-9.6 failures are happening because the most recent pipeline is a Release pipeline, which does not have the needed Validate artifact. I have started a new pipeline, https://gitlab.haskell.org/ghc/ghc/-/pipelines/65578, which should resolve this.
@mpickering suggests that we may want to introduce a weekly validate job on each of the stable branches to ensure that this doesn't happen again in the future (or at least to place an upper bound on the amount of time for which this state can persist).
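For the record, that could be set up through GitLab's pipeline-schedules API rather than by clicking through the UI. A sketch under the assumption that a scheduled pipeline on a stable branch actually runs the Validate jobs (whether it does depends on the CI rules); the function name and cron spec are mine:

```python
import requests

GITLAB = "https://gitlab.haskell.org/api/v4"
PROJECT = "ghc%2Fghc"  # URL-encoded "ghc/ghc"

def schedule_weekly_validate(ref: str, token: str) -> None:
    """Create a weekly pipeline schedule on `ref`, so the branch tip
    always has a reasonably fresh pipeline whose artifacts GitLab keeps."""
    r = requests.post(
        f"{GITLAB}/projects/{PROJECT}/pipeline_schedules",
        headers={"PRIVATE-TOKEN": token},
        data={
            "description": f"weekly validate of {ref}",
            "ref": ref,
            "cron": "0 4 * * 1",  # Mondays, 04:00 (UTC by default)
        },
    )
    r.raise_for_status()

# e.g. schedule_weekly_validate("release-9.6", token)
```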
I guess this message slipped through the cracks. I am curious to know if this has been completely fixed, or if the 9.8 branch will somehow cause this problem to resurface.
Hard to say. I would be fine with closing this issue under the optimistic assumption that it has been fixed, allowing for the possibility of reopening it should the issue resurface.