
Data Science/AI Intern - Literature Mining & Graph Modeling
AstraZeneca, Waltham, MA
Duration
- 10-week internship (June 1, 2026 to August 7, 2026).
The Position
AstraZeneca is seeking Master’s and PhD students studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline for a 10-week internship at our site in Waltham, MA, from June 1, 2026 to August 7, 2026. This internship sits at the intersection of data engineering, biomedical NLP, and translational science, and aims to help R&D teams generate insights faster.
Responsibilities
- Build an end-to-end pipeline turning literature (papers, abstracts, patents) into a standardized knowledge graph with contextualized evidence.
- Handle source selection, inclusion/exclusion criteria, updates, and data snapshots.
- Develop NLP for entity recognition, relation extraction, assertion detection, and context tagging (drug, indication, resistance, biomarker, outcome).
- Encode domain relations (e.g., Drug–mechanism→Gene/Pathway; Biomarker–modulates→Outcome; ADC–targets→Antigen).
- Map entities to controlled vocabularies; manage synonyms, disambiguation, and canonical IDs.
- Implement edge-level confidence scoring (source quality, claim type, co-occurrence, citations, model certainty) with full evidence provenance.
- Build graph storage (property graph or RDF) and queryable APIs.
- Deliver interactive visualization (UI or notebook) with filters, context toggles, and evidence drill-down.
- Define metrics, run error analyses, and validate with scientific stakeholders.
- Ensure reproducibility and documentation: version models/data; record architecture, assumptions, benchmarks; provide user guides.
- Present outcomes to data science, oncology, and translational medicine teams.
Requirements
- Current Master’s or PhD student studying Biology, Computer Science, Chemistry, Physics, Engineering, Biomedical Science, Pharmacology, Data Science, Bioinformatics, or a related discipline.
- Candidates must have an expected graduation date after August 2026.
- US work authorization is required at the time of application.
- This role does not provide OPT support.
- NLP and ML: NER, relation extraction, transformers; Python-based workflows.
- Graph/data modeling: experience with Neo4j, NetworkX, or RDF/SPARQL.
- Domain knowledge: genes, pathways, biomarkers, therapeutic modalities (incl. ADCs) preferred.
- Reproducibility: version control, environment management, documentation.
- Soft skills: problem-solving, communication, collaboration.
- Tech stack: Python (spaCy, Hugging Face), scikit-learn; PyTorch or TensorFlow.
- Data & viz: pandas; PySpark or Dask; Plotly/Dash, D3.js, Neo4j Bloom.
- Dev practices: Git, Conda/Poetry, Docker, experiment tracking.
- Ability to work onsite at the Waltham, MA site 3 to 5 days per week.
Company
AstraZeneca is a global, science-led, patient-focused biopharmaceutical company. We focus on discovering, developing and commercialising prescription medicines for some of the world’s most serious diseases. But we are more than one of the world’s leading pharmaceutical companies.
