Summary of medical transcripts using Langchain and GPT-4 | by Sikorana | Jan, 2024

Photo by Scott Graham on Unsplash


Imagine running a medical practice where a large part of every 15-minute patient visit goes to writing up a summary of the patient’s ailments. This routine task eats up about a third of the consultation, time that could be better spent discussing the patient’s health in more depth. Is there a more efficient solution? Automatically summarizing the transcripts of those conversations might come in handy.

This is where Langchain’s robust summarization abilities come into play.

In this tutorial, we will walk through using Langchain with the GPT-4 model (or any other OpenAI model) to summarize medical transcripts. We will use a dataset of sample medical transcriptions from Kaggle (Medical Transcriptions) to show how summaries can be produced quickly. The same approach may be useful to anyone who works with transcriptions, not just medical professionals.

Let’s start!

Before we dive into the main content, let’s prepare our workspace. Here’s what you’ll need to get started:

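  1. Python 3.8.1 or higher (Python 3.6 is too old for Langchain)
  2. Install the following libraries (pandas reads the dataset, openpyxl is needed to write the Excel file):
pip install langchain openai python-dotenv pandas openpyxl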

3. Create an OpenAI API key. If you don’t have an OpenAI account yet, create one first. Once you are logged in, you can generate your API key on the API keys page of the OpenAI platform.

Then, inside your project directory, create a .env file and store your API key in it:

OPENAI_API_KEY=your-api-key-here
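Langchain’s OpenAI wrappers read the key from the OPENAI_API_KEY environment variable, so a missing or misnamed entry in .env usually only surfaces when the first request is sent. A small fail-fast check can catch that earlier. This is a minimal sketch, assuming the key has already been loaded into the environment (for example with load_dotenv()); the helper name get_openai_key is hypothetical and not part of any library.

```python
import os


def get_openai_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing.

    `get_openai_key` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env and call load_dotenv() first"
        )
    return key
```

Calling get_openai_key() once at the top of the script, right after load_dotenv(), is enough; the returned value does not need to be passed anywhere by hand.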


Now we are ready to write our script!

Firstly we need to download the dataset from Kaggle (Medical Transcriptions) and save it in our project directory.

Next, import the necessary libraries and read the dataset into a Pandas DataFrame:

from langchain.text_splitter import CharacterTextSplitter
from langchain.schema.document import Document
from langchain.prompts import PromptTemplate
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
import textwrap
import pandas as pd
import dotenv

# Path is relative to the project directory; adjust it if you saved the file elsewhere
transcriptions = pd.read_csv('mtsamples.csv')

Since OpenAI charges for each token sent to the LLM, we will limit our data to the transcriptions in the ‘ Cardiovascular / Pulmonary’ category (the leading space is part of the value in the dataset). You can pass the whole dataset if you want, but this would cost considerably more.

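# .copy() gives an independent DataFrame, so adding a summary column later does not trigger SettingWithCopyWarning
cardio = transcriptions.loc[transcriptions['medical_specialty']==' Cardiovascular / Pulmonary', :].copy()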
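To get a feel for the cost before sending anything, a rough estimate can be made from the text length alone. The sketch below relies on the common rule of thumb of about 4 characters per token for English text, which is only an approximation; the function names are hypothetical, and the price is deliberately a parameter, to be taken from the current OpenAI pricing page. It covers input tokens only, and the generated summaries are billed as well.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: about 4 characters per token for English text."""
    return int(len(text) / chars_per_token) + 1


def estimate_cost(texts, price_per_1k_tokens):
    """Estimate the input cost of sending `texts`, given a price per 1,000 tokens.

    Returns a (total_tokens, cost) tuple. Both numbers are approximations.
    """
    total_tokens = sum(estimate_tokens(t) for t in texts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens
```

With the DataFrame from above, the texts could be passed as cardio["transcription"].dropna(), together with whatever input price currently applies to the chosen model.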

We’ve filtered the dataset down to one category, so now we can move on to the Langchain part. First we load the API key saved in the .env file using load_dotenv(). After that, we define our LLM and write a prompt that tells the model how it should create the summary:


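from langchain.chat_models import ChatOpenAI

dotenv.load_dotenv()  # reads OPENAI_API_KEY from the .env file

# GPT-4 is served through the chat API, so use the chat wrapper and pass the model name explicitly
model_name = "gpt-4-turbo"
llm = ChatOpenAI(temperature=0, model_name=model_name)

prompt_template = """Write a concise summary of the following: {text}"""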
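The one-line prompt above works, but for clinical text a more specific instruction may give more usable summaries. The sketch below is an assumption about what such a prompt could look like, not the author’s prompt. It keeps the same {text} placeholder, so it can be passed to PromptTemplate in place of the original, and it uses plain str.format to preview the final prompt without calling the API. The sample sentence is made up.

```python
# Hypothetical alternative prompt; the wording and the requested contents are illustrative assumptions.
detailed_prompt_template = """You are summarizing a medical transcription for a clinician.
Write a concise summary (at most 5 sentences) of the transcription below.
Mention the presenting complaint, key findings, and the plan, if they are stated.
Do not add information that is not in the text.

Transcription:
{text}

Summary:"""


def preview_prompt(template, text):
    """Fill the {text} placeholder so the final prompt can be inspected offline."""
    return template.format(text=text)


# Made-up sample input, used only to show the filled-in prompt
example = preview_prompt(detailed_prompt_template, "Patient reports chest pain on exertion.")
print(example)
```

Previewing the prompt this way is a cheap check that the placeholder name matches and that the instructions read sensibly before any tokens are spent.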

Then, for each value in the column containing the transcriptions, we do the following steps:

  1. We split the transcription into chunks and wrap each chunk in a Document.
  2. We construct a prompt from the template; the documents are inserted into its {text} placeholder.
  3. We create a chain to connect our documents with the language model. This connection allows the documents to be fed into the prompt and sent to the model for response generation.
  4. We run the chain and print the summary.
  5. We add the generated summary to the DataFrame.

Once the loop has finished, we save the DataFrame to an Excel file.

for idx, val in cardio["transcription"].items():
    # Skip rows with a missing transcription (NaN is not a string)
    if not isinstance(val, str):
        continue

    # Split document
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = [Document(page_content=x) for x in text_splitter.split_text(val)]

    # Create a prompt; "text" must match the {text} placeholder in prompt_template
    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    # Create a chain
    chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=True)

    # Run chain with the document
    summary = chain.run(docs)
    print(f"Summary: {textwrap.fill(summary, width=100)}")

    # Add summary to the dataframe, using the row's own index label
    cardio.loc[idx, "transcription_summary"] = summary

cardio.to_excel(r'cardio_summaries.xlsx', index=False)
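To make chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch in plain Python. It is not how CharacterTextSplitter is implemented (the library splits on a separator first and then merges the pieces up to the size limit); it only illustrates how consecutive chunks share chunk_overlap characters. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij" * 120)  # 1,200 characters of filler text
print([len(c) for c in chunks])
```

Because the "stuff" chain type places all chunks into a single prompt, the overlapping parts are sent to the model more than once, which is worth keeping in mind when estimating token usage.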

And voila! You have now successfully learned how to condense lengthy transcriptions into short summaries. Time to put this knowledge into action and discover hidden insights in your data.

Let me share an example summary with you. Now, keep in mind I am not a medical professional, but from my perspective, it looks good. So let’s have a look and enjoy! :).


Now you know how to generate summaries for your transcriptions and save them in a DataFrame. You can do the same with any other kind of text and modify the prompt to suit it. I hope you found this tutorial informative and useful. Happy summarizing! 🙂
