Introduction
Literature Review
A small number of companies dominate the ETF (exchange-traded fund) market. The top five issuers—BlackRock (iShares), Vanguard, State Street (SPDR), Invesco, and Charles Schwab—control the vast majority of ETF assets, with each managing over $100 billion and the three largest (BlackRock, Vanguard, State Street) holding trillions in assets under management [4]. In fact, all 50 of the largest ETFs are managed by these five firms, and the SEC has expressed concerns that this concentration could stifle competition and limit opportunities for new entrants in the ETF space [5].
The Sharpe Ratio is a widely used metric for measuring risk-adjusted returns [1]. Given two stocks with identical annual returns, the one with lower volatility has a higher Sharpe Ratio, indicating better risk-adjusted performance. Clustering algorithms like K-means group stocks based on historical price movements, but investors often prefer sector-based diversification [2]. We aim to bridge this gap by using hierarchical clustering to group stocks by industry while maximizing the Sharpe Ratio.
Datasets Description
- Stock Market Data (used to calculate the Sharpe ratio) → Yahoo Finance API (yfinance), Kaggle (Stock Market Dataset - Link); see the retrieval sketch after this list
- Features: OHLC (Open, High, Low, Close) prices; we only use daily close prices
- Summaries Data for NLP → Yahoo Finance API
- Features: Company descriptions based on tickers
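As a rough illustration, this data can be pulled with a few yfinance calls; the tickers and date range below are placeholders rather than our full stock universe:

```python
# Minimal retrieval sketch using yfinance; tickers and dates are illustrative
# placeholders, not our full stock universe.
import yfinance as yf

tickers = ["ACAD", "VRTX", "VRNA"]

# Daily close prices, used later for the Sharpe ratio calculation.
prices = yf.download(tickers, start="2024-01-01", end="2024-12-31")["Close"]

# Company descriptions, used as input to the NLP pipeline.
descriptions = {t: yf.Ticker(t).info.get("longBusinessSummary", "") for t in tickers}
```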
Problem Definition
ETFs are managed by a few firms (Vanguard, BlackRock, Invesco) that charge fees, may not optimize stock allocations, and are not created quickly enough to reflect ever-changing consumer preferences and markets. Investors want customized ETFs that reflect evolving industries while maximizing returns and minimizing risk. Our goal is to enable investors to generate and choose ETFs based on their varied and specific interests.
Methods
We built a pipeline that clusters companies by similarity, lets consumers choose a group at their desired level of specificity, and determines the optimal portfolio weighting for that grouping. The steps below are presented in sequential order.
Preprocessing
- For data preprocessing, we used natural language processing (NLP) techniques including text normalization, lowercasing, punctuation removal, stopword removal, tokenization, lemmatization, stemming, and data deduplication to clean and standardize company descriptions.
- These preprocessing steps remove noisy and missing data, ensuring high-quality inputs are passed on to the embedding and clustering models described below (a minimal cleaning sketch follows this list).
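A minimal sketch of this cleaning step, using NLTK's Porter stemmer and scikit-learn's built-in stopword list so no corpus downloads are needed; our full pipeline also applies the tokenization and lemmatization steps listed above:

```python
# Minimal preprocessing sketch; a stand-in for the full pipeline described above.
import string

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(description: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and stem each token."""
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [stemmer.stem(tok) for tok in text.split() if tok not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

# Deduplicate after cleaning so identical descriptions are embedded only once.
raw = ["Vertex Pharmaceuticals develops small-molecule medicines.",
       "Vertex Pharmaceuticals develops small-molecule medicines."]
cleaned = sorted(set(preprocess(d) for d in raw if d))
```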
Vector Embeddings + PCA
- We use the ModernBERT-large embedding model to convert descriptions into vectors. These vectors capture the semantic meaning of the descriptions, so similar descriptions are close together in the vector space. We chose ModernBERT because it is a recent, high-performing embedding model; it outputs vectors with 1024 features.
- 1024 features is a lot: many of them are noisy or redundant, which slows down our models and lowers the quality of the clusters. We used Principal Component Analysis (PCA) to reduce the number of features.
- Using the elbow method, plotting the cumulative variance captured against the number of PCA output features, we determined that 349 output features retain 95% of the original variance, a good balance between feature reduction and information preserved (see the sketch after this list).
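A sketch of the embedding and PCA steps, assuming the Hugging Face checkpoint answerdotai/ModernBERT-large with mean pooling over the last hidden state; the pooling strategy and example texts are assumptions, not details fixed by the report:

```python
# Embedding + PCA sketch; pooling strategy and example texts are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-large")

@torch.no_grad()
def embed(texts):
    """Mean-pool the last hidden state into one 1024-dimensional vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (batch, tokens, 1024)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# In practice this runs over every cleaned company description.
embeddings = embed(["clinical-stage biopharmaceutical company ...",
                    "commercial-stage medical device maker ..."])

# Keep enough components to explain 95% of the variance (349 on our dataset).
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(embeddings)
```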
Hierarchical Clustering
- We apply hierarchical clustering to the vectorized descriptions to build a dendrogram, so companies in similar markets and industries end up in the same cluster. Hierarchical clustering lets one company appear in many different groups, giving the user the flexibility to pick as generic or as specific a cluster as they want. The "optimal" cut height is up to the user – they decide the granularity of the clusters. We use "cut_tree()" to slice the dendrogram at a point that yields a set number of clusters. If the user asks for 100 clusters, each cluster will be larger and will probabilistically include more stocks, so the resulting ETF will be more generic and span a wider range of companies. If the user instead selects 1000 clusters, each cluster will be small and include only a few companies, so the ETF generated from those stocks will be more industry-specific.
- We generate the hierarchical clustering with 3 different linkage methods: single, complete, and average. Single linkage uses the minimum distance between points in two clusters, which makes it sensitive to outliers and prone to chaining. Complete linkage is more robust to noise because it uses the distance between the two farthest points in the clusters. Average linkage strikes a balance between the two by computing the average pairwise distance between all points in the two clusters (a clustering sketch follows this list).
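A minimal sketch of this step with SciPy, using random stand-in data in place of the PCA-reduced embeddings:

```python
# Hierarchical clustering sketch with SciPy; `reduced` is random stand-in data
# in place of the PCA-reduced description embeddings.
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

reduced = np.random.default_rng(0).normal(size=(2000, 349))

# One dendrogram per linkage method we evaluate.
dendrograms = {m: linkage(reduced, method=m) for m in ("single", "complete", "average")}

# The user picks the granularity: fewer clusters -> larger, more generic groups;
# more clusters -> smaller, more industry-specific groups.
labels_generic = cut_tree(dendrograms["average"], n_clusters=100).ravel()
labels_specific = cut_tree(dendrograms["average"], n_clusters=1000).ravel()
```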
Title Generation
- After a cluster of companies is chosen, we want an automatic way to label the ETF.
- We prompt the Gemini 2.0 Flash API with the list of companies and a thorough prompt to generate a descriptive and concise title.
- The Gemini 2.0 Flash API was chosen for two reasons (a call sketch follows this list):
- Speed - under high query volumes, speed is important to prevent bottlenecks. Flash trades some quality for speed, but because the titling prompt is simple, its quality is more than sufficient.
- Cost - this API is particularly low cost compared to competitors and satisfies our needs.
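A sketch of the title-generation call, assuming the google-generativeai Python client; the prompt wording and API-key handling are illustrative rather than our exact production setup:

```python
# Title-generation sketch using the google-generativeai client; the prompt text
# and API key handling are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

def generate_etf_title(companies):
    """Ask Gemini for one concise, descriptive ETF title for a cluster of companies."""
    prompt = ("Given the following companies, return a single concise, descriptive "
              "ETF title and nothing else:\n" + "\n".join(companies))
    return model.generate_content(prompt).text.strip()

print(generate_etf_title(["ACADIA Pharmaceuticals Inc.",
                          "Vertex Pharmaceuticals Incorporated"]))
```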
Optimization
- Our goal is to maximize the Sharpe Ratio while applying L2 regularization, using SLSQP (Sequential Least Squares Programming).
- Constraints
- The sum of our weights (the percentage makeup of each stock) must equal 1.0.
- The weight of each stock should be between 0.0 and 1.0 (no short selling).
- Sharpe Ratio
- Defined as the return in excess of the risk-free rate divided by the standard deviation of the asset's returns: Sharpe = (R_p − R_f) / σ_p.
- In simple terms, this makes us focus not only on percent returns but also on reliability as an investment. Investors in our ETFs should enjoy strong returns, but not at the expense of the asset swinging wildly throughout trading hours.
- L2 Regularization
- If we include, say, Nvidia (which had a great year in 2024), then other stocks won't get any weightage.
- The problem with this is that the goal of an ETF is to diversify the assets you own – there's no point in owning an ETF that's 100% Nvidia.
- Therefore, we add an L2 penalty (the sum of the squared weights, subtracted from the Sharpe objective we maximize), which penalizes concentrating weight in a few stocks and pushes the optimizer toward a more diversified portfolio.
- SLSQP (Sequential Least Squares Programming)
- Goal: minimize our custom objective, the negative Sharpe ratio plus the L2 penalty
- Define constraints: nonnegative weights that sum to 1
- We iteratively approximate the problem with quadratic programs, updating the weights to find a better portfolio each time
- We allow up to 1000 iterations, which is enough for the objective function to converge (a minimal optimization sketch follows this list)
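A minimal sketch of this optimization with SciPy's SLSQP solver; the risk-free rate, penalty strength, and stand-in return data are illustrative values rather than our tuned settings:

```python
# Portfolio optimization sketch: minimize negative Sharpe ratio plus an L2
# penalty with SLSQP, weights in [0, 1] summing to 1. The risk-free rate and
# penalty strength `lam` are illustrative values.
import numpy as np
from scipy.optimize import minimize

def optimize_weights(daily_returns, risk_free_rate=0.04, lam=0.1):
    """daily_returns: (days, stocks) array of daily percentage returns."""
    n = daily_returns.shape[1]
    mean = daily_returns.mean(axis=0) * 252          # annualized mean return
    cov = np.cov(daily_returns, rowvar=False) * 252  # annualized covariance

    def objective(w):
        sharpe = (w @ mean - risk_free_rate) / np.sqrt(w @ cov @ w)
        return -sharpe + lam * np.sum(w ** 2)        # L2 term discourages concentration

    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]  # weights sum to 1
    bounds = [(0.0, 1.0)] * n                                         # no short selling
    result = minimize(objective, np.full(n, 1.0 / n), method="SLSQP",
                      bounds=bounds, constraints=constraints,
                      options={"maxiter": 1000})
    return result.x

# Example with random stand-in returns for 5 stocks over one trading year.
rng = np.random.default_rng(0)
weights = optimize_weights(rng.normal(0.0005, 0.02, size=(252, 5)))
print(weights.round(4))
```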
Results and Discussion
We use 3 distinct metrics to evaluate the quality of our hierarchical clustering (a computation sketch follows the results table):
- Beta-CV
- Measures cluster compactness relative to separation: the ratio of the mean intra-cluster distance to the mean inter-cluster distance
- A lower value means clusters are internally tight relative to how far apart they are from one another
- Normalized-cut
- Measures the cost of cutting a graph into clusters, considering both inter-cluster connections and cluster volume
- A lower score means there's better separation — fewer cross-cluster connections
- Silhouette score
- Compares how similar a point is to its own assigned cluster versus the nearest other cluster
- A value close to 1 means that, on average, points are well matched to their assigned cluster and far from neighboring clusters
- A value close to 0 means that clusters overlap heavily
- A negative value means that, on average, points would fit better in a neighboring cluster than in their assigned one
| Metric | Complete | Average | Single |
|---|---|---|---|
| Beta-CV | 0.53 | 0.57 | 0.82 |
| Normalized Cut | 8.13e-11 | 2.80e-12 | 7.60e-40 |
| Silhouette Score | 0.14 | 0.21 | 0.04 |
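As a rough illustration of how these metrics can be computed: scikit-learn provides the silhouette score directly, the Beta-CV function below is our hand-rolled implementation of the standard formula, and the normalized cut is omitted here because it additionally requires building a similarity graph.

```python
# Clustering-metric sketch on random stand-in data; the Beta-CV implementation
# is our own and the normalized cut is omitted (it needs a similarity graph).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

def beta_cv(X, labels):
    """Mean intra-cluster distance divided by mean inter-cluster distance (lower is better)."""
    D = squareform(pdist(X))
    iu = np.triu_indices_from(D, k=1)                  # each pair counted once
    same = (labels[:, None] == labels[None, :])[iu]
    return D[iu][same].mean() / D[iu][~same].mean()

X = np.random.default_rng(0).normal(size=(200, 10))    # stand-in for reduced embeddings
labels = np.random.default_rng(1).integers(0, 5, 200)  # stand-in cluster assignments

print("Beta-CV:", beta_cv(X, labels))
print("Silhouette:", silhouette_score(X, labels))
```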
After clustering, we optimized how well each ETF would perform. It would be infeasible to show all of our results, but we can show how one particular cluster would fare as an ETF. After finding that the companies below belonged to the same cluster, our Gemini prompt generated a name for the ETF: Emerging BioHealth Innovators. The optimizer also outputs the ideal weight for each company over our time frame and the Sharpe ratio of the resulting instrument. For context, the S&P 500 usually has a Sharpe ratio of around 0.8 to 0.9, and anything above 2.0 is considered an excellent investment.
Companies in this cluster:
ACADIA Pharmaceuticals Inc.
Ayala Pharmaceuticals, Inc.
Aldeyra Therapeutics, Inc.
Annovis Bio, Inc.
CorMedix Inc.
Citius Pharmaceuticals, Inc.
Entera Bio Ltd.
FibroGen, Inc.
Helius Medical Technologies, Inc.
Seres Therapeutics, Inc.
Mereo BioPharma Group plc
Pulmatrix, Inc.
scPharmaceuticals Inc.
Salarius Pharmaceuticals, Inc.
Titan Pharmaceuticals, Inc.
Verona Pharma plc
Vertex Pharmaceuticals Incorporated
Xenetic Biosciences, Inc.
Generated ETF Title:
Emerging BioHealth Innovators ETF
Optimized Portfolio Weights (Sharpe: 2.9872)
ACAD: 0.00%
ADXS: 0.93%
ALDX: 2.94%
ANVS: 0.00%
CRMD: 12.70%
CTXR: 0.00%
ENTX: 15.09%
FGEN: 3.70%
HSDT: 0.00%
MCRB: 0.00%
MREO: 10.69%
PULM: 16.20%
SCPH: 0.00%
SLRX: 0.08%
TTNP: 0.03%
VRNA: 20.85%
VRTX: 8.86%
XBIO: 7.92%
Lastly, we've included some visualizations of the breakdown of our ETF and how it would have performed compared to the S&P 500 over the 2024 calendar year.
Portfolio vs. S&P 500
Weight Breakdown
Next Steps
To improve the user experience for consumers looking for stock groupings, we will add the ability to store previously generated clusters and query them by title. We are also working on the ability to switch between clustering methods to suit user preferences. One way to improve the clustering itself is to use more thorough company descriptions, such as those drawn from SEC filings, together with an embedding model that supports more context, such as the Gemini embedding model.
From an optimization perspective, it is somewhat unsurprising that if we train on a stock's performance over a specific year and then test over that same year, we will beat the market. In the future, we would like to implement an optimizer that predicts future stock values and builds a good investment instrument from previous values alone. Ideally, we want to build a regression model that can maintain a Sharpe ratio above 1 for 2024 while being trained only on data through 2023.
References
[1] J. Cvitanić, A. Lazrak, and T. Wang, "Implications of the Sharpe ratio as a performance measure in multi-period settings," Journal of Economic Dynamics and Control, vol. 32, no. 5, pp. 1622–1649, May 2008, doi: 10.1016/j.jedc.2007.06.009.
[2] S. Bin, "K-Means Stock Clustering Analysis Based on Historical Price Movements and Financial Ratios," CMC Senior Theses, Jan. 2020. Available: https://scholarship.claremont.edu/cmc_theses/2435/
[3] N. Jaroonchokanan, T. Termsaithong, and S. Suwanna, "Dynamics of hierarchical clustering in stocks market during financial crises," Physica A: Statistical Mechanics and its Applications, vol. 607, p. 128183, Dec. 2022, doi: 10.1016/j.physa.2022.128183.
[4] M. Johnson, "The 5 Biggest ETF Companies," Investopedia. Available: https://www.investopedia.com/articles/investing/080415/5-biggest-etf-companies.asp
[5] M. Kolakowski, "Who Are the ETF Giants?" Investopedia. Available: https://www.investopedia.com/who-are-the-etf-giants-4691723
We would also like to be considered for the "Outstanding Project" award.
Our GitHub repo: https://github.gatech.edu/asingh3014/GenETF