To process a large CSV file in .NET safely, the application should stream records progressively instead of loading the complete file into memory.
A 10 GB, 15 GB, or 20 GB file can create serious memory pressure when an application uses methods such as ReadAllLinesAsync, ReadToEnd, or converts every parsed record into a collection.
The central principle is simple: file size should not determine application memory usage. Memory should instead be controlled by the buffer size, batch size, record size, channel capacity, and number of workers.
Quick recommendation
Use a sequential FileStream, parse records with CsvHelper, validate one record at a time, and write controlled batches with SqlBulkCopy.
How to Process a Large CSV File in .NET Without Loading It into Memory
The recommended architecture reads only a small portion of the file at a time. Each record is parsed, validated, transformed, and persisted before the application continues through the source file.
Why Large CSV Files Cause Memory Problems
The most common problem is not the size of the file itself. The problem is how the application reads and represents it.
Consider the following implementation:
string[] lines = await File.ReadAllLinesAsync( filePath, cancellationToken);
Although the method is asynchronous, it still returns an array containing every line in the file. Asynchronous I/O prevents a thread from being blocked while waiting, but it does not automatically make an operation memory-efficient.
The situation can become worse after the data is:
- decoded from bytes into .NET strings;
- split into fields;
- converted into objects;
- stored inside arrays, lists, or a
DataTable; - duplicated during transformation;
- retained until a database operation completes.
A file that occupies 20 GB on disk can therefore require substantially more memory if the entire dataset is materialized.
Large-File Processing Approaches Compared
| Approach | Memory behavior | CSV correctness | Best use |
|---|---|---|---|
ReadAllLinesAsync | Loads all lines into memory | Does not parse CSV fields | Small, controlled files |
StreamReader.ReadLineAsync | Keeps approximately one line in application code | Insufficient for multiline quoted fields | Logs, NDJSON, and simple delimited text |
| Streaming CsvHelper records | Processes records incrementally | Handles CSV quoting and escaping rules | Real production CSV files |
| Streaming plus bounded batches | Memory is limited by batch size | Depends on the parser | Database imports and ETL |
| Bounded Channel pipeline | Memory is limited by channel capacity and worker count | Depends on the parser | CPU-heavy validation and transformation |
The Recommended .NET Architecture
A production-ready file ingestion pipeline can be organized into the following stages:
Large CSV file or cloud-storage stream ↓ Sequential FileStream or network stream ↓ CSV parser ↓ Record validation and transformation ↓ Optional bounded Channel ↓ Controlled batch ↓ Staging table or destination system ↓ Validation, merge, checkpoint and completion status Each stage should have a clear memory boundary. The reader should not be allowed to produce unlimited records when the database, API, or downstream service cannot consume them at the same speed.
Anti-Patterns to Avoid
The following operations are convenient for small files but risky for very large imports:
// Loads the complete file into one string. string content = await File.ReadAllTextAsync(filePath); // Loads every line into an array. string[] lines = await File.ReadAllLinesAsync(filePath); // Materializes all parsed records. List<CustomerImportRow> records = csv.GetRecords<CustomerImportRow>().ToList(); // Builds a DataTable containing the complete file. DataTable completeImport = BuildDataTableForEntireFile(filePath); // Creates unrestricted parallel work. Task[] tasks = records .Select(record => Task.Run(() => Process(record))) .ToArray(); Other common mistakes include using an unbounded queue, inserting one database row at a time, keeping every rejected record in memory, or assuming that increasing parallelism will always improve disk or database throughput.
Step 1: Read the File Sequentially
For line-oriented text, FileStream and StreamReader provide a straightforward starting point.
public static async Task ProcessTextFileAsync( string filePath, CancellationToken cancellationToken = default) { var options = new FileStreamOptions { Mode = FileMode.Open, Access = FileAccess.Read, Share = FileShare.Read, Options = FileOptions.Asynchronous | FileOptions.SequentialScan, BufferSize = 64 * 1024 }; await using var stream = new FileStream(filePath, options); using var reader = new StreamReader(stream); long lineNumber = 0; while (true) { string? line = await reader.ReadLineAsync(cancellationToken); if (line is null) { break; } lineNumber++; await ProcessLineAsync( line, lineNumber, cancellationToken); } } private static Task ProcessLineAsync( string line, long lineNumber, CancellationToken cancellationToken) { // Validate, transform or write the current line. return Task.CompletedTask; } FileOptions.SequentialScan communicates that the file will primarily be read from beginning to end. It is appropriate for large imports that do not require random access.
This implementation keeps memory usage bounded, but it should not be treated as a complete CSV parser.
Why string.Split Is Not a Reliable CSV Parser
A CSV file is not always equivalent to one record per physical line with fields separated by commas.
For example:
1024,"Valluri, Nirmala","Cary, NC","Customer said ""please call""." A CSV file can contain:
- commas inside quoted fields;
- escaped quotation marks;
- line breaks inside quoted fields;
- different delimiters;
- empty or missing columns;
- different date and numeric formats;
- headers that do not exactly match property names.
The following code can therefore produce incorrect results:
string[] columns = line.Split(','); For real CSV files, use a parser that understands CSV quoting and escaping rules.
Step 2: Stream Records with CsvHelper
Install the required packages:
dotnet add package CsvHelper dotnet add package Microsoft.Data.SqlClient Create a model representing one record:
public sealed class CustomerImportRow { public long CustomerId { get; set; } public string Name { get; set; } = string.Empty; public string Email { get; set; } = string.Empty; public decimal Balance { get; set; } } The following importer reads records progressively, validates them, and retains only a bounded batch.
using System.Data; using System.Globalization; using CsvHelper; using CsvHelper.Configuration; using Microsoft.Data.SqlClient; public sealed record ImportResult( long RowsRead, long RowsImported, long RowsRejected); public sealed class LargeCsvImporter { private const int BatchSize = 5_000; private readonly string _connectionString; public LargeCsvImporter(string connectionString) { _connectionString = connectionString; } public async Task<ImportResult> ImportAsync( string filePath, CancellationToken cancellationToken = default) { long rowsRead = 0; long rowsImported = 0; long rowsRejected = 0; var batch = new List<CustomerImportRow>(BatchSize); var fileOptions = new FileStreamOptions { Mode = FileMode.Open, Access = FileAccess.Read, Share = FileShare.Read, Options = FileOptions.Asynchronous | FileOptions.SequentialScan, BufferSize = 128 * 1024 }; var csvConfiguration = new CsvConfiguration(CultureInfo.InvariantCulture) { HasHeaderRecord = true, TrimOptions = TrimOptions.Trim, DetectColumnCountChanges = true }; await using var fileStream = new FileStream(filePath, fileOptions); using var textReader = new StreamReader(fileStream); using var csv = new CsvReader(textReader, csvConfiguration); await using var connection = new SqlConnection(_connectionString); await connection.OpenAsync(cancellationToken); await foreach ( CustomerImportRow row in csv.GetRecordsAsync<CustomerImportRow>() .WithCancellation(cancellationToken)) { rowsRead++; if (!TryValidate(row, out string validationError)) { rowsRejected++; // In production, write the row number, record identifier // and validationError to a rejected-record destination. continue; } batch.Add(row); if (batch.Count < BatchSize) { continue; } await WriteBatchAsync( connection, batch, cancellationToken); rowsImported += batch.Count; batch.Clear(); } if (batch.Count > 0) { await WriteBatchAsync( connection, batch, cancellationToken); rowsImported += batch.Count; batch.Clear(); } return new ImportResult( RowsRead: rowsRead, RowsImported: rowsImported, RowsRejected: rowsRejected); } private static bool TryValidate( CustomerImportRow row, out string validationError) { if (row.CustomerId <= 0) { validationError = "CustomerId must be greater than zero."; return false; } if (string.IsNullOrWhiteSpace(row.Name)) { validationError = "Name is required."; return false; } if (string.IsNullOrWhiteSpace(row.Email) || !row.Email.Contains('@')) { validationError = "Email is invalid."; return false; } validationError = string.Empty; return true; } private static async Task WriteBatchAsync( SqlConnection connection, IReadOnlyCollection<CustomerImportRow> batch, CancellationToken cancellationToken) { using DataTable table = BuildDataTable(batch); using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.CustomerImportStaging", BatchSize = batch.Count, BulkCopyTimeout = 300 }; bulkCopy.ColumnMappings.Add( "CustomerId", "CustomerId"); bulkCopy.ColumnMappings.Add( "Name", "Name"); bulkCopy.ColumnMappings.Add( "Email", "Email"); bulkCopy.ColumnMappings.Add( "Balance", "Balance"); await bulkCopy.WriteToServerAsync( table, cancellationToken); } private static DataTable BuildDataTable( IEnumerable<CustomerImportRow> rows) { var table = new DataTable(); table.Columns.Add( "CustomerId", typeof(long)); table.Columns.Add( "Name", typeof(string)); table.Columns.Add( "Email", typeof(string)); table.Columns.Add( "Balance", typeof(decimal)); foreach (CustomerImportRow row in rows) { table.Rows.Add( row.CustomerId, row.Name, row.Email, row.Balance); } return table; } } This example deliberately creates a DataTable for only one batch—not for the entire file. After the bulk write completes, the batch and its temporary table can be released.
For extremely high-throughput systems, a custom IDataReader can stream records directly into SqlBulkCopy and avoid the intermediate DataTable. However, bounded batch tables are often simpler to implement and sufficiently efficient when their size is measured and controlled.
How to Choose a Batch Size
There is no universally correct batch size. A useful starting point might be 1,000 to 10,000 records, but the correct value depends on:
- the average size of each record;
- the number and type of columns;
- validation and transformation complexity;
- available application memory;
- database transaction-log behavior;
- network latency;
- indexes and constraints on the destination table;
- how quickly a failed batch must be retried.
Benchmark several batch sizes using representative production data. A larger batch can improve throughput, but it also increases memory usage, transaction size, retry cost, and the amount of work lost when a batch fails.
Step 3: Add Backpressure with a Bounded Channel
A streaming reader can still create memory pressure if it reads faster than downstream processing can complete.
Consider this situation:
Fast file reader ↓ Unlimited in-memory queue ↓ Slow API or database writer ↓ Memory continues growing A bounded Channel<T> provides backpressure. When the channel reaches its configured capacity, the producer waits for consumers to create space.
using System.Threading.Channels; Channel<CustomerImportRow> channel = Channel.CreateBounded<CustomerImportRow>( new BoundedChannelOptions(capacity: 2_000) { FullMode = BoundedChannelFullMode.Wait, SingleWriter = true, SingleReader = false }); A more advanced pipeline can use:
CSV reader ↓ Bounded input channel ↓ Validation and transformation workers ↓ Bounded output channel ↓ Single batch writer ↓ Database The database writer is usually kept controlled because unrestricted concurrent bulk writes can create contention, lock pressure, connection pressure, and unpredictable transaction-log growth.
Should the CSV File Be Read in Parallel?
Reading different portions of a CSV file in parallel is more complicated than splitting a byte range into equal pieces.
A partition boundary could occur:
- in the middle of a UTF-8 character;
- in the middle of a quoted field;
- inside a record containing embedded line breaks;
- between an opening and closing quotation mark.
For most systems, a safer design is:
- one sequential parser per file;
- bounded parallelism for independent record transformations;
- controlled batching for the destination;
- multiple files processed concurrently with a global concurrency limit.
Partitioned reading is reasonable only when the format guarantees safe boundaries or when a preprocessing step creates those boundaries.
Error Handling and Rejected Records
A production import should distinguish between several kinds of failure:
| Failure type | Example | Recommended action |
|---|---|---|
| Business validation | Missing customer identifier | Reject the record and continue |
| Type conversion | Invalid decimal or date | Capture row details and continue when safe |
| CSV structure | Broken quotation or inconsistent columns | Quarantine the file or apply an approved recovery rule |
| Transient destination failure | Temporary database or network failure | Retry the current batch with limits |
| Permanent system failure | Schema mismatch or missing destination | Stop the import and mark it failed |
Rejected records should be written to a separate file, table, or object-storage location as they are encountered. Avoid storing millions of error objects in a list until the import finishes.
A useful rejected-record entry can contain:
- import identifier;
- record or row number;
- source record identifier;
- validation or conversion message;
- sanitized source data;
- timestamp.
Make the Import Restartable
If a process fails after reading 18 GB of a 20 GB file, restarting from the beginning may be expensive and operationally frustrating.
Store import metadata such as:
ImportId SourceFileName SourceFileHash SourceFileSize StartedAt LastUpdatedAt RowsRead RowsImported RowsRejected LastCommittedBatch CurrentStatus FailureReason Checkpoint only after a batch has been committed successfully. This prevents the checkpoint from advancing beyond durable data.
Byte-offset recovery is possible for some formats, but it must be designed carefully. A byte offset can point into the middle of a quoted CSV record. Batch numbers, source record keys, staging tables, or precomputed safe boundaries may be easier to reason about.
Use Idempotency to Prevent Duplicate Imports
A restartable process must also be safe to execute more than once.
Common approaches include:
- assigning every import a unique identifier;
- loading data into a staging table first;
- creating unique constraints on business keys;
- using a file hash to identify repeated uploads;
- merging staged records into the final tables;
- tracking which batches were committed;
- making downstream operations idempotent.
A staging-table design is often safer than inserting directly into business tables because it separates ingestion, validation, reconciliation, and final merge operations.
Do Not Process a 20 GB Upload Inside One HTTP Request
An ASP.NET Core endpoint should generally accept or coordinate the upload, but it should not keep a request open while a multi-gigabyte import is parsed and written to a database.
A more resilient cloud architecture is:
Client upload ↓ Object storage ↓ Import metadata record ↓ Queue or job request ↓ Background .NET worker ↓ Streaming parser and batch writer ↓ Progress and completion status ↓ Client polls or receives notification This separates upload availability from processing time and makes retries, cancellation, progress reporting, and horizontal scaling easier to manage.
Design the Processor Around Stream Instead of File Paths
A reusable ingestion service should ideally accept a Stream rather than requiring a local file path.
public Task<ImportResult> ImportAsync( Stream source, CancellationToken cancellationToken = default) { // Build the parser around the supplied stream. } The caller can then supply:
- a local
FileStream; - an Azure Blob Storage stream;
- an Amazon S3 response stream;
- a decompression stream;
- a secure network stream;
- a test stream created from known sample data.
This design avoids unnecessary temporary copies and makes the processing logic easier to test.
Applying the Same Pattern to Other File Types
| Format | Memory-efficient .NET approach | Important consideration |
|---|---|---|
| CSV or TSV | CsvHelper with a stream | Quoted fields can span lines |
| NDJSON | Read one line and deserialize one object | Each line must be an independent JSON value |
| Large JSON array | DeserializeAsyncEnumerable<T> or Utf8JsonReader | Avoid loading the complete JSON document model |
| XML | XmlReader | Use forward-only processing instead of loading a complete document |
| Fixed-width text | StreamReader with span-based field parsing | Validate line length before slicing |
| Application logs | Line-by-line streaming | Multiline exceptions may need grouping logic |
| Binary files | ReadAsync with a reusable byte buffer | Chunks may not align with logical records |
| GZip or ZIP | Wrap the source in a decompression stream | Do not extract the entire archive when streaming is possible |
| PDF or Office documents | Use format-specific extraction APIs | Some libraries build a complete in-memory document model |
Streaming a Large JSON Array
await foreach ( EventRecord? item in JsonSerializer.DeserializeAsyncEnumerable<EventRecord>( sourceStream, cancellationToken: cancellationToken)) { if (item is null) { continue; } await ProcessEventAsync(item, cancellationToken); } Streaming XML with XmlReader
var settings = new XmlReaderSettings { Async = true, DtdProcessing = DtdProcessing.Prohibit }; using XmlReader reader = XmlReader.Create(sourceStream, settings); while (await reader.ReadAsync()) { cancellationToken.ThrowIfCancellationRequested(); if (reader.NodeType == XmlNodeType.Element && reader.Name == "customer") { // Read and process the current customer element. } } .NET Use Cases for Large-File Streaming
This architecture is useful for:
- customer and account migrations;
- financial transaction imports;
- insurance or healthcare data exchanges;
- product catalog and inventory imports;
- application-log analysis;
- telemetry ingestion;
- data cleaning and normalization;
- legacy-system modernization;
- ETL and reporting pipelines;
- document preparation for AI applications;
- RAG ingestion and embedding workflows;
- scheduled cloud data-processing jobs.
How Large-File Processing Connects to AI and RAG
Many AI systems begin with a data-ingestion problem rather than a model problem.
Large source files ↓ Streaming ingestion ↓ Validation and data cleaning ↓ Text or field extraction ↓ Chunking ↓ Embedding generation ↓ Vector index ↓ RAG application Loading the complete dataset before generating embeddings can create unnecessary memory pressure and reduce recoverability. A bounded pipeline can read, clean, chunk, and submit a controlled number of items while respecting downstream API limits.
Costs and Scaling Considerations
Streaming reduces memory requirements, but the overall design still affects cloud and infrastructure costs.
| Cost area | What drives it | How to control it |
|---|---|---|
| Memory | Batch size, channel capacity, record size and worker count | Keep every boundary bounded and measure peak working set |
| Compute | Parsing, validation, decompression and transformation | Scale workers only when CPU is the actual bottleneck |
| Storage | Source files, staging data, rejected records and retained copies | Define retention and lifecycle policies |
| Network | Moving large files between regions or services | Process close to the storage and destination systems |
| Database | Indexes, logging, locking, concurrency and batch size | Use staging tables, bulk operations and controlled concurrency |
| Failure recovery | Repeatedly processing an entire file | Use checkpoints, idempotency and bounded retries |
Scaling should be based on measurements. Adding workers will not help when the database, storage service, network, or third-party API is already saturated.
Observability for Large Imports
At minimum, capture:
- bytes processed;
- records read;
- records imported;
- records rejected;
- records processed per second;
- bytes processed per second;
- current batch duration;
- database-write duration;
- channel backlog;
- retry count;
- process working set;
- garbage-collection activity;
- estimated completion percentage;
- total elapsed time.
Progress should be reported periodically rather than after every record. Excessive logging can become its own performance bottleneck.
Decision Framework
| Situation | Recommended design |
|---|---|
| Small internal file with controlled format | Simple streaming reader and validation |
| Real CSV with quoting and escaped values | CsvHelper or another standards-aware CSV parser |
| Large CSV imported into SQL Server | Streaming parser, bounded batches, staging table and SqlBulkCopy |
| CPU-intensive transformation | Bounded Channel with a measured number of workers |
| Several large files | One sequential parser per file with a global concurrency limit |
| Large cloud upload | Object storage, queue or job request, and a background worker |
| Import must resume after failure | Committed batch checkpoints, staging and idempotency |
| JSON or XML instead of CSV | Forward-only format-specific streaming reader |
Final Recommendations
A 20 GB file does not require a 20 GB in-memory collection. A reliable .NET solution should:
- open the source as a stream;
- read sequentially;
- use a parser appropriate for the format;
- avoid
ToList,ToArray, and full-file document models; - validate and transform records incrementally;
- keep queues, batches, and worker counts bounded;
- use bulk destination operations;
- write rejected records progressively;
- support cancellation and restart;
- make repeated execution idempotent;
- measure memory, throughput, and downstream latency.
The most important architectural rule is not a particular buffer size or library. It is maintaining a clear upper boundary on how much data every stage is allowed to retain.
Frequently Asked Questions
Can .NET process a 20 GB CSV file with limited memory?
Yes. The application can process a 20 GB CSV file with limited memory when it reads the source as a stream and retains only the current record, a controlled batch, and any bounded pipeline buffers. Memory usage depends on those boundaries rather than the complete file size.
Does async file reading reduce memory usage?
Not automatically. Asynchronous I/O reduces thread blocking while waiting for I/O. A method such as ReadAllLinesAsync is asynchronous but still loads every line into memory. Streaming and asynchrony solve different problems.
What buffer size should I use for a large file?
Start with a moderate buffer such as 64 KB or 128 KB and benchmark with representative files and storage. Very large buffers do not automatically produce better throughput and increase memory usage when many files are processed concurrently.
What is the best database batch size?
There is no universal best size. Values between 1,000 and 10,000 records are reasonable starting points, but record width, database indexes, transaction logging, latency, retry requirements, and available memory should determine the final value.
Can I parse a CSV file with string.Split?
Only when the format is strictly controlled and fields can never contain delimiters, quotation marks, or line breaks. For general CSV files, use a parser that supports CSV quoting and escaping rules.
Does CsvHelper load the complete file into memory?
CsvHelper can enumerate records progressively when its record sequence is consumed directly. Calling ToList, ToArray, or another materializing operation will load all resulting records into a collection.
Should I read one large CSV file using multiple threads?
Usually not. CSV partition boundaries can fall inside quoted or multiline fields. A safer design uses one sequential parser and parallelizes independent downstream transformations with bounded concurrency.
How should a large CSV upload be handled in ASP.NET Core?
Store the uploaded file in durable storage, create an import job, and process it with a background worker. The client can query job status or receive a completion notification instead of keeping one HTTP request open during the entire import.
How can large JSON and XML files be processed with low memory?
For JSON, use an asynchronous enumerable or a forward-only UTF-8 reader. For XML, use XmlReader. Avoid loading a complete JSON or XML document model when only progressive processing is required.
What happens if one record is extremely large?
Streaming limits memory relative to the amount of data currently being processed, but one exceptionally large record can still create a memory spike. Apply maximum record and field sizes when the business format allows them.
Related AinexArch Guides
- Build a Simple RAG API in ASP.NET Core
- .NET Aspire Explained for Cloud-Native Teams
- Azure Functions vs Container Apps: Which Should You Choose?
- Build AI Applications in .NET with Microsoft.Extensions.AI
