
Introduction

Literature Review

A small number of companies dominate the ETF (exchange-traded fund) market. The top five issuers—BlackRock (iShares), Vanguard, State Street (SPDR), Invesco, and Charles Schwab—control the vast majority of ETF assets, with each managing over $100 billion and the three largest (BlackRock, Vanguard, State Street) holding trillions in assets under management [4]. In fact, all 50 of the largest ETFs are managed by these five firms, and the SEC has expressed concerns that this concentration could stifle competition and limit opportunities for new entrants in the ETF space [5].

The Sharpe Ratio is a widely used measure of risk-adjusted return [1]. Given two stocks with identical annual returns, the one with lower volatility has the higher Sharpe Ratio, indicating better risk-adjusted performance. Clustering algorithms such as K-means group stocks by historical price movements, but investors often prefer sector-based diversification [2]. We aim to bridge this gap by using hierarchical clustering to group stocks by industry and then weighting each group to maximize its Sharpe Ratio.
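To make the comparison concrete, here is a minimal sketch of an annualized Sharpe Ratio computed from daily returns: mean excess return divided by the standard deviation of returns, scaled by the square root of the number of periods per year. The 252-day trading year, the zero risk-free rate, and the two return series are illustrative assumptions, not values from this project.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year


def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from a sequence of daily simple returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = statistics.fmean(excess)
    stdev = statistics.stdev(excess)  # sample standard deviation
    if stdev == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean / stdev * math.sqrt(TRADING_DAYS)


# Two hypothetical stocks with the same average daily return (0.1%)
# but very different volatility.
calm = [0.002, 0.000, 0.002, 0.000, 0.002, 0.000]
wild = [0.021, -0.019, 0.021, -0.019, 0.021, -0.019]

print("calm:", sharpe_ratio(calm))
print("wild:", sharpe_ratio(wild))
```

Both series average the same return, so the lower-volatility series ends up with the higher ratio.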

Datasets Description

  1. Stock Market Data (used to calculate Sharpe ratio) → Yahoo Finance API (yfinance), Kaggle (Stock Market Dataset)
    1. Features: OHLC (Open, High, Low, Close) prices; we only use daily close prices
  2. Summaries Data for NLP → Yahoo Finance API
    1. Features: Company descriptions based on tickers
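Since only daily close prices are used, a natural first step is converting them into daily returns, which is what a Sharpe calculation consumes. The sketch below assumes simple (not log) returns and uses made-up prices; the project's actual preprocessing is not described here and may differ.

```python
def daily_returns(closes):
    """Simple daily returns from a chronological list of close prices."""
    return [today / yesterday - 1.0 for yesterday, today in zip(closes, closes[1:])]


# Hypothetical close prices for three consecutive trading days.
closes = [100.0, 110.0, 99.0]
print(daily_returns(closes))
```

A series of N close prices yields N - 1 returns, so the first trading day has no return of its own.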

Problem Definition

Most ETFs are managed by a handful of firms (such as Vanguard, BlackRock, and Invesco) that charge fees and may not optimize stock allocations, and new ETFs are not developed fast enough to keep up with ever-changing consumer preferences and markets. Investors want customized ETFs that reflect evolving industries while maximizing returns and minimizing risk. Our goal is to let investors generate and choose ETFs that match their varied and specific interests.

Methods

We built a model that clusters companies by similarity, lets users choose a group at their desired level of specificity, and determines the optimal portfolio weighting for that group.

The steps are presented below in sequential order.

Preprocessing

Vector Embeddings + PCA

PCA Elbow Method Graph

Hierarchical Clustering

Hierarchical Clustering Example Dendrogram Visualization
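As a rough illustration of the idea, the following sketch implements naive agglomerative (bottom-up) clustering with a selectable linkage. It is a teaching example under stated assumptions: the 2-D points stand in for PCA-reduced description embeddings, Euclidean distance is assumed, and the function name `agglomerative` is hypothetical. A real pipeline would more likely rely on a library implementation.

```python
import math


def agglomerative(points, n_clusters, linkage="average"):
    """Naive bottom-up clustering; returns clusters as lists of point indices."""

    def cluster_distance(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        if linkage == "average":
            return sum(dists) / len(dists)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


# Made-up 2-D points standing in for reduced company embeddings.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
print(agglomerative(points, n_clusters=3))  # finer grouping
print(agglomerative(points, n_clusters=2))  # coarser grouping
```

Stopping the merges at a larger or smaller `n_clusters` corresponds to cutting the dendrogram lower or higher, which is one way to expose a "desired specificity" choice to the user.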

Title Generation

Optimization

Simulated Price Paths for Different Sharpe Ratios

Results and Discussion

We use three metrics to evaluate the quality of our hierarchical clustering under three linkage methods (complete, average, and single):

  1. Beta-cv
    1. Measures cluster compactness versus separation: the ratio of the mean intra-cluster distance to the mean inter-cluster distance
    2. A lower value means clusters are tight internally relative to how far apart they are
  2. Normalized-cut
    1. Measures the cost of cutting a graph into clusters, considering both inter-cluster connections and cluster volume
    2. A lower score means better separation, with fewer cross-cluster connections
  3. Silhouette score
    1. Compares how close each point is to its own assigned cluster with how close it is to the nearest other cluster
    2. A value close to 1 means that, on average, points sit well inside their assigned cluster
    3. A value close to 0 means that clusters overlap heavily
    4. A negative value means that, on average, points are closer to a neighboring cluster than to their assigned one

| Metric | Complete | Average | Single |
| --- | --- | --- | --- |
| Beta-cv | 0.53 | 0.57 | 0.82 |
| Normalized Cut | 8.13e-11 | 2.80e-12 | 7.60e-40 |
| Silhouette Score | 0.14 | 0.21 | 0.04 |
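As a rough illustration of how two of these metrics can be computed, the sketch below implements the silhouette score and BetaCV on toy 2-D points using only the Python standard library. It assumes Euclidean distance and the common textbook definition of BetaCV (mean intra-cluster distance divided by mean inter-cluster distance). This report does not state which exact formulas or distance measure were used, so treat both as assumptions. The coordinates and label lists are made up.

```python
import math
from itertools import combinations


def beta_cv(points, labels):
    """Mean intra-cluster distance divided by mean inter-cluster distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two obvious groups; one sensible labeling and one scrambled labeling.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]

for name, labels in (("good", good), ("bad", bad)):
    print(name, silhouette_score(points, labels), beta_cv(points, labels))
```

On this toy data the sensible labeling gets a silhouette score close to 1 and the scrambled labeling gets a negative one. Under the ratio defined in the code, BetaCV is smaller for the sensible labeling.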

After clustering, we optimized the portfolio weights within each cluster and evaluated how the resulting ETF would perform. Showing every result is infeasible, so we present one representative cluster. Given the companies in this cluster, our Gemini prompt generated the ETF name Emerging BioHealth Innovators. The pipeline also outputs the ideal weight for each company over our time frame and the Sharpe ratio of the resulting portfolio. For context, the S&P 500 usually has a Sharpe ratio of around 0.8 to 0.9, and anything above 2.0 is considered an excellent investment.

Companies in this cluster:

ACADIA Pharmaceuticals Inc.
Ayala Pharmaceuticals, Inc.
Aldeyra Therapeutics, Inc.
Annovis Bio, Inc.
CorMedix Inc.
Citius Pharmaceuticals, Inc.
Entera Bio Ltd.
FibroGen, Inc.
Helius Medical Technologies, Inc.
Seres Therapeutics, Inc.
Mereo BioPharma Group plc
Pulmatrix, Inc.
scPharmaceuticals Inc.
Salarius Pharmaceuticals, Inc.
Titan Pharmaceuticals, Inc.
Verona Pharma plc
Vertex Pharmaceuticals Incorporated
Xenetic Biosciences, Inc.

Generated ETF Title:

Emerging BioHealth Innovators ETF

Optimized Portfolio Weights (Sharpe: 2.9872)

ACAD: 0.00%
ADXS: 0.93%
ALDX: 2.94%
ANVS: 0.00%
CRMD: 12.70%
CTXR: 0.00%
ENTX: 15.09%
FGEN: 3.70%
HSDT: 0.00%
MCRB: 0.00%
MREO: 10.69%
PULM: 16.20%
SCPH: 0.00%
SLRX: 0.08%
TTNP: 0.03%
VRNA: 20.85%
VRTX: 8.86%
XBIO: 7.92%
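Weights of this kind (non-negative and summing to 100%) can be found by searching for the weight vector with the highest Sharpe ratio. The sketch below uses plain random search over long-only weights as a stand-in, because the optimizer actually used in this project is not described here. The return series are synthetic, the risk-free rate is assumed to be 0, and the function names are hypothetical. A gradient-based constrained solver would typically be more efficient in practice.

```python
import math
import random
import statistics


def portfolio_sharpe(weights, returns_by_asset, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed 0) of a weighted portfolio."""
    n_days = len(returns_by_asset[0])
    daily = [
        sum(w * asset[t] for w, asset in zip(weights, returns_by_asset))
        for t in range(n_days)
    ]
    return statistics.fmean(daily) / statistics.stdev(daily) * math.sqrt(periods_per_year)


def random_search_weights(returns_by_asset, n_trials=1000, seed=0):
    """Sample long-only weight vectors that sum to 1 and keep the best Sharpe."""
    rng = random.Random(seed)
    n = len(returns_by_asset)
    best_w = [1.0 / n] * n  # start from the equal-weight portfolio
    best_s = portfolio_sharpe(best_w, returns_by_asset)
    for _ in range(n_trials):
        # Normalized exponential draws are uniformly distributed on the simplex.
        raw = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(raw)
        w = [x / total for x in raw]
        s = portfolio_sharpe(w, returns_by_asset)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s


# Synthetic daily returns for three hypothetical assets.
data_rng = random.Random(42)
returns = [
    [data_rng.gauss(0.0010, 0.010) for _ in range(250)],
    [data_rng.gauss(0.0005, 0.020) for _ in range(250)],
    [data_rng.gauss(0.0008, 0.015) for _ in range(250)],
]
weights, sharpe = random_search_weights(returns)
print([f"{w:.2%}" for w in weights], round(sharpe, 4))
```

Because the search starts from the equal-weight portfolio and only accepts improvements, the result is never worse than equal weighting on the data it was fitted to. That says nothing about performance on unseen data.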

Lastly, we include visualizations of the ETF's weight breakdown and of how it would have performed against the S&P 500 over the 2024 calendar year.

Portfolio vs. S&P 500

Growth of $1 Investment - Portfolio vs S&P 500
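A growth-of-$1 curve like this one is obtained by compounding daily returns from a starting value of 1. The following minimal sketch shows the calculation on made-up returns; it is not the plotting code used for the figure.

```python
from itertools import accumulate


def growth_of_one_dollar(daily_returns):
    """Value over time of $1 invested at the start, compounding daily returns."""
    return list(accumulate(daily_returns, lambda value, r: value * (1 + r), initial=1.0))


# Hypothetical daily returns for a portfolio and a benchmark.
portfolio = growth_of_one_dollar([0.10, -0.50, 0.20])
benchmark = growth_of_one_dollar([0.01, 0.01, 0.01])
print(portfolio[-1], benchmark[-1])
```

Plotting both lists against the same dates gives the side-by-side comparison.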

Weight Breakdown

Optimized Portfolio Weights Pie Chart

Next Steps

To improve the experience for users looking for stock groupings, we will add the ability to store previously generated clusters and query them by title. We are also working on letting users switch between clustering methods to suit their needs. To improve the clustering itself, we plan to test more thorough company descriptions with an embedding model that supports a longer context. More thorough descriptions can come from SEC filings, and the Gemini embedding model is one candidate to test.

From an optimization perspective, it is fairly clear that if we fit weights to the stocks' performance over a specific year and then test over that same year, we will beat the market, because the evaluation is in-sample. In the future, we would like to implement an optimizer that predicts future stock values from past values and builds a sound investment instrument from those predictions. Ideally, we want a regression model that is trained only on data through 2023 yet maintains a Sharpe ratio above 1 for 2024.
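The train/test separation described above can be sketched as a date-based split: anything fitted (weights or a predictive model) sees only rows before the test year, and the Sharpe ratio is reported on the test year alone. The helper name `split_by_year` and the sample data are hypothetical.

```python
import datetime as dt


def split_by_year(dates, returns, test_year):
    """Return (train, test): rows dated before test_year, and rows within it."""
    train = [r for d, r in zip(dates, returns) if d.year < test_year]
    test = [r for d, r in zip(dates, returns) if d.year == test_year]
    return train, test


# Made-up daily returns straddling the 2023/2024 boundary.
dates = [
    dt.date(2023, 12, 28),
    dt.date(2023, 12, 29),
    dt.date(2024, 1, 2),
    dt.date(2024, 1, 3),
    dt.date(2024, 1, 4),
]
returns = [0.004, -0.002, 0.003, 0.001, -0.001]

train, test = split_by_year(dates, returns, test_year=2024)
print(len(train), "training rows;", len(test), "test rows")
```

Keeping the split strictly chronological avoids leaking 2024 information into whatever is fitted on the earlier data.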

References

  1. J. Cvitanić, A. Lazrak, and T. Wang, "Implications of the Sharpe ratio as a performance measure in multi-period settings," Journal of Economic Dynamics and Control, vol. 32, no. 5, pp. 1622–1649, May 2008, doi: https://doi.org/10.1016/j.jedc.2007.06.009.
  2. S. Bin, "K-Means Stock Clustering Analysis Based on Historical Price Movements and Financial Ratios," CMC Senior Theses, Jan. 2020, Available: https://scholarship.claremont.edu/cmc_theses/2435/?trk=public_profile_project-title.
  3. N. Jaroonchokanan, T. Termsaithong, and S. Suwanna, "Dynamics of hierarchical clustering in stocks market during financial crises," Physica A: Statistical Mechanics and its Applications, vol. 607, p. 128183, Dec. 2022, doi: https://doi.org/10.1016/j.physa.2022.128183.
  4. M. Johnson, "The 5 Biggest ETF Companies," Investopedia. https://www.investopedia.com/articles/investing/080415/5-biggest-etf-companies.asp
  5. M. Kolakowski, "Who Are the ETF Giants?" Investopedia. https://www.investopedia.com/who-are-the-etf-giants-4691723

We would also like to be considered for the "Outstanding Project" award.

Our GitHub repository: https://github.gatech.edu/asingh3014/GenETF