Research Article
Volume 7 Issue 2 - 2025
Quantifying Structural Similarity and Informational Diversity in Informed Consent Forms: A Statistical Analysis of Terminated versus Completed Clinical Trials
Institution: Bayezian Limited
*Corresponding Author: Author (Escientific Publishers)*, Institution: Bayezian Limited.
Received: May 08, 2025; Published: May 19, 2025
Abstract
Informed consent forms (ICFs) are critical to ethical clinical research, safeguarding participant autonomy and trust. Nevertheless, concerns remain that excessive standardisation and poor communication within ICFs may impair participant understanding and contribute to trial failure. This study analysed 85 interventional clinical trials registered on ClinicalTrials.gov, comprising 34 terminated and 51 completed trials, to explore the relationship between ICF characteristics and trial outcomes. Publicly available ICFs were processed using natural language processing techniques, including Sentence-BERT embeddings to quantify linguistic similarity, alongside manual evaluation of ethical clause richness and structural consistency measured via edit distance. Statistical analysis employed Mann-Whitney U tests and logistic regression. The results revealed that completed trials exhibited significantly higher internal semantic similarity and greater ethical clause overlap than terminated trials, whereas structural differences measured by edit distance were not predictive of trial status. Logistic regression confirmed that higher semantic similarity and ethical clause consistency were associated with greater odds of trial completion, while higher clause counts were associated with lower odds. These findings suggest that over-standardisation and informational sparsity in ICFs may negatively affect participant engagement and trial viability. Ethics committees and trial sponsors are urged to foster participant-centred, context-specific consent documentation to support both ethical obligations and operational success.
Keywords: Informed Consent Forms (ICFs); Clinical Trial Termination; Natural Language Processing (NLP); Document Similarity Metrics; Logistic Regression Analysis; Inferential Statistics
Introduction
Informed consent forms (ICFs) play a crucial ethical role in safeguarding participant autonomy and ensuring voluntary enrolment in clinical trials (Ssali et al., 2017). Despite the centrality of informed consent, persistent issues undermine its effectiveness, particularly in the context of complex biomedical research where linguistic complexity, over-standardisation, and poor communication are pervasive (Grant, 2021; Kadam, 2017). Multiple studies have noted that ICFs often exceed recommended readability levels, employ technical jargon, and prioritise legal protection over participant comprehension, ultimately challenging the core ethical mandate of informed consent (Licari and Manti, 2018; Rebers et al., 2016).
In developing and developed research environments alike, participants may formally consent without fully understanding the study procedures, risks, and benefits, thereby compromising the integrity of the process (Ssali et al., 2017; Bhupathi and Ravi, 2017). Notably, the focus on procedural adherence rather than communicative clarity has resulted in consent documents that may be signed without genuine informed engagement. Furthermore, while ethical guidelines such as the Declaration of Helsinki and the Belmont Report emphasise participant understanding as a cornerstone of research ethics, practical application remains inconsistent and often unverified (Rebers et al., 2016).
While existing literature has addressed the qualitative shortcomings of ICFs, there remains a critical gap: no comprehensive quantitative analysis has linked the structural and linguistic characteristics of ICFs to tangible clinical trial outcomes, such as trial termination. Previous efforts have described challenges related to participant literacy, cultural differences, and ethical oversight (Ssali et al., 2017; Bhupathi and Ravi, 2017; Fons-Martínez et al., 2022), but none has systematically assessed whether language structure in consent forms predicts the operational fate of clinical trials.
This study addresses this gap by applying advanced natural language processing (NLP) techniques to a curated set of ICFs from clinical trials that have either been completed or terminated. The research objectives are threefold: first, to assess textual similarity across trial outcomes using semantic embedding measures; second, to quantify ethical clause richness as an indicator of depth and breadth of ethical disclosures; and third, to test the predictive association between linguistic features of ICFs and trial termination through logistic regression modelling.
Our investigation is structured around three primary hypotheses:
- H1: Completed trials will exhibit higher semantic consistency across their ICFs, reflecting more uniform and comprehensible documentation practices.
- H2: Completed trials will demonstrate greater ethical clause consistency, suggesting better articulation of participants’ rights and obligations.
- H3: Structural variance, measured via edit distance, will be significantly associated with trial outcome, independently of document length and readability metrics.
By systematically interrogating the linguistic fabric of consent documentation, this study advances the ethical discourse on informed consent from descriptive critique to quantitative evaluation, providing actionable insights for improving both ethical standards and operational trial outcomes.
Methods
Study Design and Data Source
A retrospective cross-sectional study design was employed. Data were extracted from ClinicalTrials.gov, a publicly accessible registry maintained by the United States National Library of Medicine. Eligible trials were identified based on predefined criteria, and full-text informed consent forms (ICFs) were sourced either directly from trial records or through linked external repositories. All documents were screened to ensure completeness and relevance to study objectives.
Trial Selection Criteria
Trials were included if they met the following criteria: (i) industry sponsorship, (ii) Phase II, III, or IV designation, (iii) recruitment status listed as "Terminated" or "Completed", and (iv) availability of a full-text ICF. Phase I studies, withdrawn or suspended trials, and trials lacking a complete ICF were excluded. The focus on later-phase, industry-sponsored trials ensured uniform regulatory context and maximised the relevance of informed consent content to mature clinical research practices.
Text Processing
Informed consent forms (ICFs) were extracted from PDF documents, followed by cleaning and text normalisation. Sentences were embedded using the Sentence-BERT model (all-MiniLM-L6-v2). Textual similarity between documents was computed using three metrics: cosine similarity (semantic closeness), Jaccard similarity (binary overlap of clause presence), and edit distance (token-level structural variation).
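As an illustrative sketch, and not the authors' pipeline, the three metrics named above can be computed in plain Python as follows. The embedding vectors are assumed to come from a Sentence-BERT model such as all-MiniLM-L6-v2 and are represented here by short hypothetical lists. The function names, the example inputs, and the choice to normalise edit distance by the longer token sequence are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def jaccard_similarity(clauses_a, clauses_b):
    """Overlap of two sets of detected clause labels."""
    a, b = set(clauses_a), set(clauses_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def token_edit_distance(tokens_a, tokens_b):
    """Levenshtein distance computed over tokens rather than characters."""
    prev = list(range(len(tokens_b) + 1))
    for i, tok_a in enumerate(tokens_a, 1):
        curr = [i]
        for j, tok_b in enumerate(tokens_b, 1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalised_edit_distance(tokens_a, tokens_b):
    """Edit distance scaled to [0, 1]; the scaling is an assumption."""
    longest = max(len(tokens_a), len(tokens_b))
    return token_edit_distance(tokens_a, tokens_b) / longest if longest else 0.0

# Hypothetical inputs for illustration only.
emb_a, emb_b = [0.2, 0.7, 0.1], [0.25, 0.6, 0.2]
sim = cosine_similarity(emb_a, emb_b)
overlap = jaccard_similarity({"risks", "purpose"}, {"risks", "confidentiality"})
dist = token_edit_distance("you may withdraw at any time".split(),
                           "you can withdraw any time".split())
```

Group-level means, such as those reported in the Results, would then be averages of these pairwise values within each trial-status group.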
Ethical Clause Scoring
Ethical clause presence was assessed using predefined regular expression patterns (Appendix B), covering domains such as purpose, risks, confidentiality, and voluntary participation. Each clause was checked for presence either manually or using automated pattern matching to ensure detection of key ethical disclosures.
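The pattern-matching step can be sketched as below. The patterns shown are hypothetical stand-ins written for this example; the study's actual expressions are those listed in its Appendix B, which is not reproduced here. The dictionary name, function name, and sample text are likewise assumptions.

```python
import re

# Hypothetical patterns for illustration only (not the Appendix B patterns).
CLAUSE_PATTERNS = {
    "purpose": re.compile(
        r"\bpurpose of (?:this|the) (?:study|research|trial)\b", re.IGNORECASE),
    "risks": re.compile(
        r"\b(?:risks?|side effects?|discomforts?)\b", re.IGNORECASE),
    "confidentiality": re.compile(
        r"\b(?:confidential(?:ity)?|privacy)\b", re.IGNORECASE),
    "voluntary_participation": re.compile(
        r"\b(?:voluntary|withdraw at any time)\b", re.IGNORECASE),
}

def detect_clauses(text):
    """Return the set of clause domains whose pattern occurs in the text."""
    return {name for name, pattern in CLAUSE_PATTERNS.items()
            if pattern.search(text)}

sample = ("The purpose of this study is to test a new drug. "
          "Your participation is voluntary.")
found = detect_clauses(sample)
clause_count = len(found)
```

The resulting label sets can feed a Jaccard comparison between documents, and their sizes give a per-document clause count.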
Statistical Analysis
Descriptive statistics were calculated for word counts, similarity scores, and ethical clause counts. Mann-Whitney U tests compared terminated and completed trials across linguistic features. Logistic regression models were constructed to predict trial termination, with model evaluation based on Akaike Information Criterion (AIC), McFadden’s R², and the Hosmer-Lemeshow goodness-of-fit test.
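The study ran its tests in R; the pure-Python version below is only a sketch of the calculation behind a two-sided Mann-Whitney U test. It assumes the normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those of statistical packages. The function name and the sample values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        raise ValueError("all observations are identical; test is undefined")
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return u1, u2, z, p_value

# Hypothetical similarity scores, for illustration only.
terminated = [0.51, 0.54, 0.53, 0.56, 0.52]
completed = [0.60, 0.62, 0.59, 0.61, 0.58]
u1, u2, z, p_value = mann_whitney_u(terminated, completed)
```

With samples as small as the hypothetical ones above, an exact test would normally be preferred; the approximation is more appropriate at the group sizes reported in this study (34 and 51).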
Software and Tools
Data extraction, text processing, and document similarity analyses were performed using Python (version 3.12.1). Statistical analyses, including Mann-Whitney U tests and logistic regression modelling, were conducted using R (version 4.3.2). Full scripts, software environment specifications, and reproducibility materials are available in the project's GitHub repository.
Results
Sample Characteristics
A total of 85 clinical trials were included in the final analysis, comprising 34 terminated trials and 51 completed trials. Trials were drawn from four major therapeutic areas: oncology, infectious diseases, cardiology, and other therapeutics. Oncology was the most represented area, comprising approximately 51% of the analysed sample. Trials were excluded if the informed consent forms (ICFs) were defective or scanned images unsuitable for text extraction. Table 1 summarises the distribution of the analysed trials by status and therapeutic area.
Trial Status | Oncology | Infectious Diseases | Cardiology | Other Therapeutics | Total |
Terminated | 16 | 3 | 2 | 13 | 34 |
Completed | 27 | 9 | 8 | 7 | 51 |
Table 1: Distribution of Clinical Trials by Trial Status and Therapeutic Area.
Linguistic Characteristics and Similarity Analysis
The linguistic characteristics of informed consent forms (ICFs) were compared between terminated and completed trials using mean cosine similarity scores, mean ethical clause counts, and mean readability scores.
Semantic Similarity (Cosine Similarity)
The mean cosine similarity for completed trials was higher (Mean = 0.607) than for terminated trials (Mean = 0.539), suggesting greater internal linguistic coherence among ICFs from successfully completed studies. Boxplots illustrated an upward shift in similarity distribution for completed trials relative to terminated trials. A Mann-Whitney U test confirmed that the difference was statistically significant (p = 0.0000173).
Ethical Clause Coverage
Completed trials exhibited a higher mean number of ethical clauses identified within their ICFs compared to terminated trials. Visualisations indicated greater clause inclusion frequency among completed trials, supporting the hypothesis that richer ethical disclosures may enhance trial success. This difference was statistically significant (p = 0.0000000598).
Readability Analysis
Preliminary readability analysis based on Flesch-Kincaid scores revealed no substantial difference between groups. Both terminated and completed trials exhibited readability levels above the recommended eighth-grade threshold, suggesting a general need for simplified language across all ICFs.

Figure 1: Group comparisons of linguistic characteristics in informed consent forms (ICFs) by trial outcome.
Hypothesis | Variable | p-value | Result | Interpretation |
H1 | group_mean_cosine | 0.0000173 | Significant (****) | Completed trials have significantly higher semantic similarity |
H2 | group_mean_jaccard | 0.0000000598 | Significant (****) | Completed trials have significantly higher ethical clause overlap |
H3 | group_mean_edit_dist | 0.165 | Not significant (ns) | No significant difference in structural edit distance |
Table 2: Statistical Test Results for Group Differences in Linguistic Characteristics.
Note: Statistical significance thresholds were defined as p < 0.05 (*), p < 0.01 (**), p < 0.001 (***), and p < 0.0001 (****).
Regression Results
A logistic regression analysis was conducted to examine the association between document linguistic features and clinical trial outcomes (completed versus terminated), specifically to test Hypotheses 1–3. The predictors included group mean cosine similarity (H1: linguistic consistency), group mean Jaccard similarity (H2: ethical clause consistency), group mean edit distance (H3: structural consistency), and clause count (document richness).
The model fitted the data well, with a McFadden’s pseudo-R² of 0.484. This statistic measures the proportional improvement in log-likelihood over an intercept-only model and should not be read as a share of variance explained. The Akaike Information Criterion (AIC) was 68.55; AIC is chiefly informative when compared against alternative model specifications. The Hosmer-Lemeshow goodness-of-fit test was non-significant (χ²(8) = 5.16, p = 0.741), indicating no evidence of model miscalibration.
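The relationship between these fit statistics can be illustrated with a back-of-envelope sketch. It assumes five estimated parameters (an intercept plus the four predictors) and uses the reported 51 completed trials out of 85 to form the intercept-only log-likelihood; the fitted log-likelihood is not reported, so it is back-derived here from the published AIC. All function names are hypothetical.

```python
import math

def null_log_likelihood(n_events, n_total):
    """Log-likelihood of an intercept-only logistic model."""
    p = n_events / n_total
    return n_events * math.log(p) + (n_total - n_events) * math.log(1 - p)

def log_likelihood_from_aic(aic, n_params):
    """Invert AIC = 2k - 2*logLik to recover logLik."""
    return n_params - aic / 2

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R-squared: 1 - logLik(model) / logLik(null)."""
    return 1 - ll_model / ll_null

ll_null = null_log_likelihood(51, 85)          # 51 completed of 85 trials
ll_model = log_likelihood_from_aic(68.55, 5)   # assumes k = 5 parameters
r2 = mcfadden_r2(ll_model, ll_null)
```

Under these assumptions the result comes out close to, though not identical with, the reported 0.484; rounding of the published AIC or a different parameter count could account for the small gap.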
As shown in Table 3, group mean cosine similarity was a significant positive predictor of trial completion, supporting Hypothesis 1 (H1). Trials with higher internal linguistic coherence were substantially more likely to be completed. Similarly, group mean Jaccard similarity emerged as a significant positive predictor (H2), indicating that greater consistency in ethical clause coverage increased the likelihood of trial success.
In contrast, group mean edit distance did not significantly predict trial outcome (p = 0.113), providing no support for Hypothesis 3 (H3). Finally, clause count was significantly negatively associated with trial completion. Higher document complexity, reflected in more clauses, was linked to a reduced probability of successful trial completion.
Predictor | Estimate | Standard Error (SE) | z-value | p-value | Odds Ratio | 95% CI Lower | 95% CI Upper |
Group Mean Cosine Similarity (H1) | 25.30 | 8.15 | 3.10 | 0.0019 | 9.67 × 10¹⁰ | 6.21 × 10⁴ | 9.67 × 10¹⁸ |
Group Mean Jaccard Similarity (H2) | 34.68 | 8.40 | 4.13 | <0.0001 | 1.16 × 10¹⁵ | 7.37 × 10⁸ | 2.61 × 10²³ |
Group Mean Edit Distance (H3) | 51.06 | 32.24 | 1.58 | 0.113 | 1.50 × 10²² | 6.34 × 10⁻⁷ | 2.47 × 10⁵⁰ |
Clause Count | -1.75 | 0.45 | -3.85 | <0.0001 | 0.17 | 0.06 | 0.38 |
Table 3: Logistic Regression Results Predicting Trial Completion from Linguistic Characteristics.
Predicted Probability of Trial Completion Across Key Document Similarity Metrics
Logistic regression model predictions were plotted to examine the relationship between key document similarity measures and the probability of clinical trial completion, corresponding to Hypotheses 1–3.
H1: Group Mean Cosine Similarity (Linguistic Consistency)
The predicted probability of trial completion increased sharply with rising group mean cosine similarity, following a sigmoidal logistic pattern with an inflection around 0.55–0.60.
Higher internal linguistic coherence across informed consent forms (ICFs) was associated with substantially greater trial completion probabilities, supporting Hypothesis 1.
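The sigmoidal shape described for H1 can be sketched with a one-predictor logistic curve. This is a simplification under stated assumptions: the slope is the Table 3 estimate for cosine similarity, while the midpoint of 0.575 is a hypothetical value chosen from the middle of the reported 0.55–0.60 inflection range and stands in for the combined effect of the intercept and the other predictors held fixed. The function names are illustrative.

```python
import math

def logistic(eta):
    """Standard logistic function mapping a linear predictor to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def completion_probability(cosine, slope=25.30, midpoint=0.575):
    """Predicted completion probability as cosine similarity varies.

    slope    -- Table 3 estimate for group mean cosine similarity
    midpoint -- hypothetical inflection point (probability = 0.5 here)
    """
    return logistic(slope * (cosine - midpoint))

# Evaluate the curve at the two reported group means.
p_terminated_mean = completion_probability(0.539)
p_completed_mean = completion_probability(0.607)
```

Because the slope is steep relative to the narrow range of observed similarity values, small shifts in similarity near the midpoint translate into large changes in predicted probability.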
H2: Group Mean Jaccard Similarity (Ethical Clause Consistency)
Similarly, the probability of trial completion rose markedly with increasing group mean Jaccard similarity, with the steepest increase observed between values of 0.70 and 0.80.
This pattern visually supports Hypothesis 2, indicating that greater ethical clause consistency is associated with improved trial success.
H3: Group Mean Edit Distance (Structural Consistency)
In contrast, group mean edit distance demonstrated only a modest, near-linear relationship with trial outcome, with minimal impact on completion probability.
This finding corroborates the earlier statistical result, providing no strong support for Hypothesis 3.
Odds Ratios and Confidence Intervals for Predicting Trial Completion
Odds ratios (ORs) and corresponding 95% confidence intervals (CIs) were computed from the logistic regression model to assess the strength and direction of association between document similarity metrics, clause count, and trial outcome (completed versus terminated).
Predictor | Odds Ratio | 95% CI Lower | 95% CI Upper |
(Intercept) | 1.07 × 10⁻³¹ | 1.25 × 10⁻⁶² | 3.34 × 10⁻² |
Group Mean Cosine Similarity (H1) | 9.67 × 10¹⁰ | 6.21 × 10⁴ | 9.67 × 10¹⁸ |
Group Mean Jaccard Similarity (H2) | 1.16 × 10¹⁵ | 7.37 × 10⁸ | 2.61 × 10²³ |
Group Mean Edit Distance (H3) | 1.50 × 10²² | 6.34 × 10⁻⁷ | 2.47 × 10⁵⁰ |
Clause Count | 0.17 | 0.06 | 0.38 |
Table 4: Odds Ratios and 95% Confidence Intervals for Predictors of Clinical Trial Completion.
Group mean cosine similarity demonstrated an extremely large odds ratio (OR ≈ 9.67 × 10¹⁰), with a confidence interval entirely above 1, confirming a strong and statistically significant association with trial completion (H1). Similarly, group mean Jaccard similarity showed a very large odds ratio (OR ≈ 1.16 × 10¹⁵), also supporting a significant positive effect (H2). In contrast, group mean edit distance, despite yielding a large odds ratio (OR ≈ 1.50 × 10²²), exhibited a wide and unstable confidence interval crossing 1, indicating no statistically reliable association with trial outcome (H3).
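Because the similarity predictors vary over a narrow range (the group mean cosine values reported here lie between roughly 0.5 and 0.6), an odds ratio for a full one-unit increase describes a change far larger than any seen in the data, which is why the tabulated values are so extreme. A minimal sketch, assuming the Table 3 estimates and standard errors, rescales the effect to a 0.01 increase. The function names and the 0.01 step are illustrative choices, and the Wald interval computed here may differ from the tabulated intervals if those were obtained by another method (for example, profile likelihood).

```python
import math

def odds_ratio(estimate, delta=1.0):
    """Odds ratio for an increase of `delta` units in the predictor."""
    return math.exp(estimate * delta)

def wald_ci(estimate, se, delta=1.0, z=1.96):
    """Approximate 95% Wald interval for the odds ratio (delta > 0)."""
    lower = math.exp((estimate - z * se) * delta)
    upper = math.exp((estimate + z * se) * delta)
    return lower, upper

# Table 3 values: (estimate, standard error).
predictors = {
    "cosine": (25.30, 8.15),
    "jaccard": (34.68, 8.40),
    "edit_distance": (51.06, 32.24),
}

# Odds ratios and Wald intervals per 0.01 increase in each predictor.
per_step = {
    name: (odds_ratio(est, 0.01), wald_ci(est, se, 0.01))
    for name, (est, se) in predictors.items()
}
```

On this scale the cosine similarity effect corresponds to an odds ratio of about 1.29 per 0.01 increase, which is easier to interpret than the one-unit value while carrying the same statistical information.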
Discussion
Principal Findings
This study demonstrated that internal linguistic consistency and ethical clause richness within informed consent forms (ICFs) are significantly associated with clinical trial success. Trials with higher semantic similarity and greater ethical disclosure were more likely to reach completion, while structural formatting consistency alone showed no predictive value. Additionally, greater document complexity, reflected in increased clause counts, was associated with reduced likelihood of trial completion.
Interpretation
These findings suggest that excessive standardisation and structural uniformity may undermine the communicative purpose of informed consent, potentially disengaging participants. High clause counts and rigid formatting do not guarantee better understanding and may instead contribute to cognitive overload. By contrast, meaningful ethical communication, emphasising clarity, semantic coherence, and accessible ethical disclosures, appears critical for participant engagement and trial retention. Optimising ICFs for linguistic consistency and ethical clarity, rather than mere procedural standardisation, could thus enhance both ethical standards and operational outcomes in clinical research.
Comparison with Prior Work
Prior research has extensively examined issues surrounding informed consent, including readability, ethical transparency, and participant engagement. Table 5 compares major previous findings with the present study, highlighting key differences and methodological advancements.
Study | Focus | Key Findings | Limitations Identified | How Our Study Advances |
Rebers et al., 2016 | Exceptions and ethics in consent | Under-reporting of informed consent exceptions | No text structure analysis | Linked linguistic consistency to trial success |
Grant, 2021 | Readability burden in ICFs | Forms too complex for average patients | No outcome modelling | Quantitative similarity tied to trial completion |
Licari and Manti, 2018 | Consent for vulnerable groups | Need for simplified language and respect for autonomy | No predictive linkage to trial success | Measured ethical clause richness statistically |
Trung et al., 2021 | Underreporting of ethics and incentives | Major gaps in ethics reporting in trials | No consent text feature analysis | Applied structural and semantic measures to success prediction |
Kadam, 2017 | Informed consent comprehension issues | Highlighted barriers in low-literacy populations | No structural document analysis | Introduced edit distance and clause presence metrics |
Schwarz, 2021 | Patient-centric consent models | Emphasised patient understanding and engagement | Lacked systematic NLP evaluation | Used Sentence-BERT to model document-level coherence |
Lee et al., 2021 | Public perspectives on trial consent | Importance of trust, clarity, transparency | No formal language structure evaluation | Connected text metrics to operational success |
Koonrungsesomboon et al., 2017 | Empowerment through better consent | Focus on autonomy and informed decision making | Descriptive focus, no predictive modelling | Quantified semantic consistency as empowerment proxy |
Fons-Martínez et al., 2022 | Participatory design in consent forms | User-tailored information delivery | Lacked measurable outcome validation | Linked participant comprehension improvements to trial endpoints |
Ssali et al., 2017 | Cultural factors in informed consent | Highlighted cultural misunderstandings affecting trials | No linguistic quantitative evaluation | Modelled document consistency across trial cultures |
Table 5: Comparison of Prior Studies and Current Study Focus.
Strengths and Limitations
Strengths
To our knowledge, this study is among the first systematic applications of natural language processing (NLP) techniques to informed consent forms (ICFs) that quantitatively link linguistic characteristics to clinical trial outcomes. By combining semantic embedding models (Sentence-BERT) with manual and pattern-based ethical clause detection, it integrates machine learning with bioethical analysis, contributing to both the clinical trial operations and research ethics literature. The use of a real-world, publicly available dataset enhances transparency and reproducibility. Furthermore, rigorous statistical methods, including Mann-Whitney U tests, logistic regression modelling, and model calibration checks, strengthen the internal validity of the findings.
Limitations
However, several limitations must be acknowledged. First, reliance on publicly available ICFs introduces potential selection bias, as not all trials disclose full consent documentation, and available forms may differ systematically from the broader trial population.
Second, the focus exclusively on English-language documents may limit the generalisability of findings to non-English-speaking contexts; cross-cultural variation in consent practices remains unexplored.
Third, site-specific modifications and verbal consent processes were not captured, which may affect linguistic consistency in practice.
Fourth, the sample size (85 trials) is modest relative to the complexity of the multivariate analysis, limiting the exploration of interaction effects across therapeutic areas or sponsor types.
Finally, the observational nature of the study precludes firm causal inferences; associations between ICF characteristics and trial outcomes must be interpreted cautiously.
Future Directions
Future research should expand datasets to include non-public and non-English ICFs, enabling broader cross-cultural validation. Prospective trials embedding semantic evaluation tools into the consent design process are recommended to enhance document quality systematically. Further, participant-centred metrics such as comprehension scores and satisfaction surveys should be incorporated to investigate mediating pathways between document characteristics and trial outcomes. Stratified analyses by therapeutic area are warranted, as differing disease contexts may moderate the relationship between consent quality and trial success. Lastly, examining the impact of culturally tailored ethical clauses on participant engagement and retention would provide valuable insights for global clinical research practices.
Implications for Practice
The findings highlight the need for sponsors and investigators to prioritise semantic clarity and ethical transparency in ICF development, beyond mere compliance with regulatory templates. Ethics committees should evaluate not only content completeness but also linguistic coherence. Regulators could consider guidance revisions to emphasise participant-centred readability, recognising its potential operational impact on trial success.
Conclusion
- Higher internal linguistic similarity within informed consent forms (ICFs) is associated with greater odds of trial termination, suggesting that textual uniformity carries an operational risk.
- Greater ethical clause richness significantly predicts trial success, underscoring the critical role of comprehensive participant information.
- Over-standardisation and excessive document complexity may undermine participant engagement, indicating the need for improved tailoring of ICFs to diverse populations.
- Ethics committees and regulators should adopt more nuanced review standards, focusing on linguistic clarity and ethical transparency rather than procedural compliance alone.
This study demonstrates that data-driven, language-based evaluation of trial documentation is both feasible and necessary to enhance clinical research quality and participant protection. Future improvements in informed consent practices will be critical for rebuilding participant trust, ensuring ethical engagement, and advancing the transparency and integrity of global clinical research.
Data and Code Availability
The datasets analysed during the current study and the full natural language processing (NLP) codebase, including text pre-processing scripts, Sentence-BERT model specifications, and similarity computation methods, are publicly available here.
Access is unrestricted and provided to promote transparency, reproducibility, and further research development.
References
- Adrian A. M., Norwood S. H., Mask P. L. (2005). Producers’ perceptions and attitudes toward precision agriculture technologies. Computers and Electronics in Agriculture. 48(3): 256–271.
- Ssali, A., Poland, F., & Seeley, J. (2017). Exploring informed consent in HIV clinical trials: A case study in Uganda. Heliyon, 3(2): e00196.
- Rebers, S., Aaronson, N. K., van Leeuwen, F. E., & Schmidt, M. K. (2016). Exceptions to the rule of informed consent for research with an intervention. BMC Medical Ethics, 17(9).
- Association of Clinical Research Professionals (ACRP). (2013). The Process of Informed Consent. ACRP White Paper.
- Manti, S., & Licari, A. (2018). How to obtain informed consent for research. Breathe, 14(2): 145–152.
- Grant, S. C. (2021). Informed Consent—We Can and Should Do Better. JAMA Network Open, 4(4), e2110848.
- Fons-Martinez, J., Ferrer-Albero, C., & Diez-Domingo, J. (2022). Keys to improving the informed consent process in research: Highlights of the i-CONSENT project. Health Expectations, 25(3): 1183–1185.
- International Conference on Harmonisation (ICH). (n.d.). Informed Consent of Trial Subjects (ICH-GCP). Retrieved from https://ichgcp.net/publications
- Bhupathi, P. A., & Ravi, G. R. (2017). Comprehensive Format of Informed Consent in Research and Practice: A Tool to Uphold the Ethical and Moral Standards. International Journal of Clinical Pediatric Dentistry, 10(1): 73–81.
- Trung L. Q., Morra M. E., Truong N. D., Turk T., Elshafie A., Foly A., Tam D. N. H., Iraqi A., Van T. T. H., Elgebaly A., Ngoc T. N., Vu T. L. H., Chu N. T., Hirayama K., Karbwang J., Huy N. T. (2017). A systematic review finds underreporting of ethics approval, informed consent, and incentives in clinical trials. Journal of Clinical Epidemiology, 92, 1–7.
- Koonrungsesomboon N., Laothavorn J., Karbwang J. (2015). Understanding of essential elements required in informed consent form among researchers and institutional review board members. Tropical Medicine and Health, 43(2): 117–122.
- Coleman E., O’Sullivan L., Crowley R., Hanbidge M., Driver S., Kroll T., Kelly A., Nichol A., McCarthy O., Sukumar P., Doran P. (2021). Preparing accessible and understandable clinical research participant information leaflets and consent forms: a set of guidelines from an expert consensus conference. Research Involvement and Engagement. 7, 31.
- Health Research Authority (HRA). (2018). Applying a proportionate approach to the process of seeking consent: HRA guidance. Version 1.02.
- Schwarz G. (2025). Informed Consent Considerations. ACT-EU Workshop, European Medicines Agency.
- Geetter J. S., Siegfried S. (2023). FDA Issues Final Guidance on Informed Consent in Clinical Investigations. McDermott Will & Emery.
- Enpr-EMA Working Group on Ethics. (2021). Assent / Informed Consent Guidance for Paediatric Clinical Trials with Medicinal Products in Europe. European Medicines Agency (EMA). EMA/671028/2019.
- Kadam, R. A. (2017). Informed consent process: A step further towards making it meaningful. Perspectives in Clinical Research, 8(3): 107–112.
Appendix A: Full Trial Dataset
The full list of clinical trials included in the analysis, including NCT ID, Phase, status, and sponsor information, is available via the study's public GitHub repository: [here]
Appendix B: Ethical Clause Definitions and Detection Patterns
Ethical clauses were defined based on standard requirements for informed consent disclosures. Each clause was operationalised through targeted regular expressions (regex) to automate presence detection within the consent documents. Table B1 lists the clause categories and their corresponding search patterns.
| Clause | Regex Pattern (Key Terms) |
| --- | --- |
| Purpose | `\b(purpose` … |
| Voluntary Participation | `\b(voluntary` … |
| Withdrawal Rights | `\b(withdraw` … |
| Study Procedures | `\b(study procedures` … |
| Risks | `\b(risk` … |
| Benefits | `\b(benefit` … |
| Alternatives | `\b(alternative` … |
| Confidentiality | `\b(confidential` … |
| Compensation | `\b(compensation` … |
| Expenses/Payments | `\b(expense` … |
| Contact Information | `\b(contact` … |
| Consent Statement | `\b(understand` … |
| Data Sharing | `\b(data sharing` … |
| Future Use | `\b(store` … |
| Ethics Approval | `\b(irb` … |
Each ICF was scanned for the presence of these clauses using case-insensitive regex matching. Detection was considered binary (presence/absence) for each clause per document.
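The binary, case-insensitive detection described above can be illustrated with a minimal Python sketch. It is not the study's code. The patterns in `CLAUSE_PATTERNS` are hypothetical stand-ins covering five of the fifteen clauses, because the full alternations in Table B1 are not reproduced here. The function names `detect_clauses` and `clause_count` are likewise assumptions made for illustration.

```python
import re
from typing import Dict

# Hypothetical key-term patterns for a subset of clauses. These are
# illustrative assumptions, not the patterns used in the study.
CLAUSE_PATTERNS: Dict[str, str] = {
    "Purpose": r"\b(purpose)\b",
    "Voluntary Participation": r"\b(voluntary)\b",
    "Withdrawal Rights": r"\b(withdraw\w*)\b",
    "Risks": r"\b(risks?)\b",
    "Confidentiality": r"\b(confidential\w*)\b",
}

# Compile once with IGNORECASE so matching is case-insensitive.
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in CLAUSE_PATTERNS.items()
}


def detect_clauses(text: str) -> Dict[str, int]:
    """Return 1 for each clause whose pattern occurs in the text, else 0."""
    return {
        name: int(regex.search(text) is not None)
        for name, regex in COMPILED.items()
    }


def clause_count(text: str) -> int:
    """Number of distinct clauses detected (an assumed summary score)."""
    return sum(detect_clauses(text).values())


sample = (
    "Your participation is VOLUNTARY. You may withdraw at any time. "
    "Possible risks include nausea."
)
flags = detect_clauses(sample)
print(flags)
print(clause_count(sample))
```

Summing the binary flags gives one possible per-document richness score. Whether the study aggregated clause presence in exactly this way is an assumption of the sketch.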
Citation: Francis Osei. (2025). Quantifying Structural Similarity and Informational Diversity in Informed Consent Forms: A Statistical Analysis of Terminated versus Completed Clinical Trials. Journal of Pharmacy and Drug Development 7(2).
Copyright: © 2025 Francis Osei. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.